Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev

[jira] [Commented] (SOLR-3017) Allow edismax stopword filter factory implementation to be specified

 

 

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded


jira at apache

Jan 29, 2012, 10:03 AM

Post #1 of 10 (42 views)
Permalink
[jira] [Commented] (SOLR-3017) Allow edismax stopword filter factory implementation to be specified

[ https://issues.apache.org/jira/browse/SOLR-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195810#comment-13195810 ]

Erick Erickson commented on SOLR-3017:
--------------------------------------

A couple of questions:
1> I notice that guava is in here. The only other place I see imports for google.common is in the carrot code. Does anyone object to guava getting used in core? I only ask because it's used in so few places, do we prefer apache commons StringUtils for this kind of stuff or do we just not care?

2> In ExtendedDismaxQParserPlugin, around lines 1140 (in noStopwordFilterAnalyzer) there are a couple of tests like:
if (stopwordFilterFactoryClass.isInstance(tf)) {

Scanning the code, it seems like stopwordFilterFactoryClass could be null, an NPE here seems questionable.

Otherwise, this seems fine to me from a tactical perspective, anyone want to weigh in on whether this is a good thing overall?

> Allow edismax stopword filter factory implementation to be specified
> --------------------------------------------------------------------
>
> Key: SOLR-3017
> URL: https://issues.apache.org/jira/browse/SOLR-3017
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 4.0
> Reporter: Michael Dodsworth
> Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3017.patch, edismax_stop_filter_factory.patch
>
>
> Currently, the edismax query parser assumes that stopword filtering is being done by StopFilter: the removal of the stop filter is performed by looking for an instance of 'StopFilterFactory' (hard-coded) within the associated field's analysis chain.
> We'd like to be able to use our own stop filters whilst keeping the edismax stopword removal goodness. The supplied patch allows the stopword filter factory class to be supplied as a param, "stopwordFilterClassName". If no value is given, the default (StopFilterFactory) is used.
> Another option I looked into was to extend StopFilterFactory to create our own filter. Unfortunately, StopFilterFactory's 'create' method returns StopFilter, not TokenStream. StopFilter is also final.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Jan 31, 2012, 7:02 PM

Post #2 of 10 (35 views)
Permalink
[jira] [Commented] (SOLR-3017) Allow edismax stopword filter factory implementation to be specified [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197536#comment-13197536 ]

Michael Dodsworth commented on SOLR-3017:
-----------------------------------------

Thanks for the review, Erick. Much appreciated.

1 - I'll create an alternative patch with the guava stuff switched out for StringUtils. I'm personally a big fan of the guava lib but I'm not using it for anything too useful here.

2 - I've actually provided ExtendedSolrQueryParser with a default value (around ExtendedDismaxQParserPlugin:851) for 'stopwordFilterFactoryClass'. It's possible that someone could call 'setStopwordFilterFactoryClass' with null, in which case we would have a NPE. I've no problem adding a defensive null check before the 'isInstance' call. The other option would be to add a @NonNull annotation to that argument...but I'm not sure if findbugs or similar is run as part of the solr build process.



> Allow edismax stopword filter factory implementation to be specified
> --------------------------------------------------------------------
>
> Key: SOLR-3017
> URL: https://issues.apache.org/jira/browse/SOLR-3017
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 4.0
> Reporter: Michael Dodsworth
> Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3017.patch, edismax_stop_filter_factory.patch
>
>
> Currently, the edismax query parser assumes that stopword filtering is being done by StopFilter: the removal of the stop filter is performed by looking for an instance of 'StopFilterFactory' (hard-coded) within the associated field's analysis chain.
> We'd like to be able to use our own stop filters whilst keeping the edismax stopword removal goodness. The supplied patch allows the stopword filter factory class to be supplied as a param, "stopwordFilterClassName". If no value is given, the default (StopFilterFactory) is used.
> Another option I looked into was to extend StopFilterFactory to create our own filter. Unfortunately, StopFilterFactory's 'create' method returns StopFilter, not TokenStream. StopFilter is also final.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Feb 3, 2012, 12:19 PM

Post #3 of 10 (34 views)
Permalink
[jira] [Commented] (SOLR-3017) Allow edismax stopword filter factory implementation to be specified [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200018#comment-13200018 ]

Michael Dodsworth commented on SOLR-3017:
-----------------------------------------

Cheers Erick.

Regarding the null check, would it be better to throw an IllegalArgumentException if null is passed through? Either way, it might be a good idea to javadoc the behaviour when null is passed.

> Allow edismax stopword filter factory implementation to be specified
> --------------------------------------------------------------------
>
> Key: SOLR-3017
> URL: https://issues.apache.org/jira/browse/SOLR-3017
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 4.0
> Reporter: Michael Dodsworth
> Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3017-without-guava-alternative.patch, SOLR-3017.patch, SOLR-3017.patch, edismax_stop_filter_factory.patch
>
>
> Currently, the edismax query parser assumes that stopword filtering is being done by StopFilter: the removal of the stop filter is performed by looking for an instance of 'StopFilterFactory' (hard-coded) within the associated field's analysis chain.
> We'd like to be able to use our own stop filters whilst keeping the edismax stopword removal goodness. The supplied patch allows the stopword filter factory class to be supplied as a param, "stopwordFilterClassName". If no value is given, the default (StopFilterFactory) is used.
> Another option I looked into was to extend StopFilterFactory to create our own filter. Unfortunately, StopFilterFactory's 'create' method returns StopFilter, not TokenStream. StopFilter is also final.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Feb 3, 2012, 12:43 PM

Post #4 of 10 (34 views)
Permalink
[jira] [Commented] (SOLR-3017) Allow edismax stopword filter factory implementation to be specified [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200029#comment-13200029 ]

Erick Erickson commented on SOLR-3017:
--------------------------------------

Yeah, it's unclear to me what the "right thing" here is. I'm really a bit fuzzy about whether ever setting this value to null is permissible or whether preventing that is too draconian. I suppose that since we're going from hard checking for StopFilterFactory, we can at least argue that either way it's not likely to break existing code.

Please feel free to update the patch as you see fit, then we'll wait a couple of days for anyone who looks to object and I'll check it in.

> Allow edismax stopword filter factory implementation to be specified
> --------------------------------------------------------------------
>
> Key: SOLR-3017
> URL: https://issues.apache.org/jira/browse/SOLR-3017
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 4.0
> Reporter: Michael Dodsworth
> Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3017-without-guava-alternative.patch, SOLR-3017.patch, SOLR-3017.patch, edismax_stop_filter_factory.patch
>
>
> Currently, the edismax query parser assumes that stopword filtering is being done by StopFilter: the removal of the stop filter is performed by looking for an instance of 'StopFilterFactory' (hard-coded) within the associated field's analysis chain.
> We'd like to be able to use our own stop filters whilst keeping the edismax stopword removal goodness. The supplied patch allows the stopword filter factory class to be supplied as a param, "stopwordFilterClassName". If no value is given, the default (StopFilterFactory) is used.
> Another option I looked into was to extend StopFilterFactory to create our own filter. Unfortunately, StopFilterFactory's 'create' method returns StopFilter, not TokenStream. StopFilter is also final.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Feb 3, 2012, 1:08 PM

Post #5 of 10 (34 views)
Permalink
[jira] [Commented] (SOLR-3017) Allow edismax stopword filter factory implementation to be specified [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200042#comment-13200042 ]

Yonik Seeley commented on SOLR-3017:
------------------------------------

Checking for the stop filter always felt a little hacky to me (I know... I'm the one who did it)... but specifying a classname as a query parameter feels like it goes down the wrong path (esp since seems like a very rare issue). "stopwordFilterClassName" isn't even accurate if it's supposed to be specifying the name of the factory.

Going back to the original issue:
bq. Unfortunately, StopFilterFactory's 'create' method returns StopFilter, not TokenStream. StopFilter is also final.

The simplest solution here would seem to be to change StopFilterFactory.create() to return a TokenStream?

> Allow edismax stopword filter factory implementation to be specified
> --------------------------------------------------------------------
>
> Key: SOLR-3017
> URL: https://issues.apache.org/jira/browse/SOLR-3017
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 4.0
> Reporter: Michael Dodsworth
> Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3017-without-guava-alternative.patch, SOLR-3017.patch, SOLR-3017.patch, edismax_stop_filter_factory.patch
>
>
> Currently, the edismax query parser assumes that stopword filtering is being done by StopFilter: the removal of the stop filter is performed by looking for an instance of 'StopFilterFactory' (hard-coded) within the associated field's analysis chain.
> We'd like to be able to use our own stop filters whilst keeping the edismax stopword removal goodness. The supplied patch allows the stopword filter factory class to be supplied as a param, "stopwordFilterClassName". If no value is given, the default (StopFilterFactory) is used.
> Another option I looked into was to extend StopFilterFactory to create our own filter. Unfortunately, StopFilterFactory's 'create' method returns StopFilter, not TokenStream. StopFilter is also final.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Feb 3, 2012, 2:00 PM

Post #6 of 10 (34 views)
Permalink
[jira] [Commented] (SOLR-3017) Allow edismax stopword filter factory implementation to be specified [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200085#comment-13200085 ]

Michael Dodsworth commented on SOLR-3017:
-----------------------------------------

That would certainly allow us to hack around the problem in a way that doesn't require a change to the query parser (i.e., extending StopFilterFactory and overriding its create method to return our own filter).

Are we concerned about breaking code that may be calling StopFilterFactory.create() and is expecting a StopFilter (I wonder if there's a reason TokenStream wasn't used originally)?

Agreed on the inaccurate param name. I'll fix that up in the next patch.

Specifying the factory class name as a param *is* optional and, as you say, should be a rare case.
If there's a better, more general fix for this, I'm happy to take that on.



> Allow edismax stopword filter factory implementation to be specified
> --------------------------------------------------------------------
>
> Key: SOLR-3017
> URL: https://issues.apache.org/jira/browse/SOLR-3017
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 4.0
> Reporter: Michael Dodsworth
> Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3017-without-guava-alternative.patch, SOLR-3017.patch, SOLR-3017.patch, edismax_stop_filter_factory.patch
>
>
> Currently, the edismax query parser assumes that stopword filtering is being done by StopFilter: the removal of the stop filter is performed by looking for an instance of 'StopFilterFactory' (hard-coded) within the associated field's analysis chain.
> We'd like to be able to use our own stop filters whilst keeping the edismax stopword removal goodness. The supplied patch allows the stopword filter factory class to be supplied as a param, "stopwordFilterClassName". If no value is given, the default (StopFilterFactory) is used.
> Another option I looked into was to extend StopFilterFactory to create our own filter. Unfortunately, StopFilterFactory's 'create' method returns StopFilter, not TokenStream. StopFilter is also final.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Feb 3, 2012, 2:10 PM

Post #7 of 10 (34 views)
Permalink
[jira] [Commented] (SOLR-3017) Allow edismax stopword filter factory implementation to be specified [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200092#comment-13200092 ]

Yonik Seeley commented on SOLR-3017:
------------------------------------

bq. Are we concerned about breaking code that may be calling StopFilterFactory.create()

Nope. I've just committed this change in trunk. There wasn't a good reason to use a more specific type (and it was not used anywhere).


> Allow edismax stopword filter factory implementation to be specified
> --------------------------------------------------------------------
>
> Key: SOLR-3017
> URL: https://issues.apache.org/jira/browse/SOLR-3017
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 4.0
> Reporter: Michael Dodsworth
> Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3017-without-guava-alternative.patch, SOLR-3017.patch, SOLR-3017.patch, edismax_stop_filter_factory.patch
>
>
> Currently, the edismax query parser assumes that stopword filtering is being done by StopFilter: the removal of the stop filter is performed by looking for an instance of 'StopFilterFactory' (hard-coded) within the associated field's analysis chain.
> We'd like to be able to use our own stop filters whilst keeping the edismax stopword removal goodness. The supplied patch allows the stopword filter factory class to be supplied as a param, "stopwordFilterClassName". If no value is given, the default (StopFilterFactory) is used.
> Another option I looked into was to extend StopFilterFactory to create our own filter. Unfortunately, StopFilterFactory's 'create' method returns StopFilter, not TokenStream. StopFilter is also final.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Feb 3, 2012, 5:33 PM

Post #8 of 10 (34 views)
Permalink
[jira] [Commented] (SOLR-3017) Allow edismax stopword filter factory implementation to be specified [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200252#comment-13200252 ]

Erick Erickson commented on SOLR-3017:
--------------------------------------

One word fixes, I love it. Anyway, is there any good reason not to put this in 3.6? I'm running tests after that change now and I'll put it in if this seems reasonable.

> Allow edismax stopword filter factory implementation to be specified
> --------------------------------------------------------------------
>
> Key: SOLR-3017
> URL: https://issues.apache.org/jira/browse/SOLR-3017
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 4.0
> Reporter: Michael Dodsworth
> Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3017-without-guava-alternative.patch, SOLR-3017.patch, SOLR-3017.patch, edismax_stop_filter_factory.patch
>
>
> Currently, the edismax query parser assumes that stopword filtering is being done by StopFilter: the removal of the stop filter is performed by looking for an instance of 'StopFilterFactory' (hard-coded) within the associated field's analysis chain.
> We'd like to be able to use our own stop filters whilst keeping the edismax stopword removal goodness. The supplied patch allows the stopword filter factory class to be supplied as a param, "stopwordFilterClassName". If no value is given, the default (StopFilterFactory) is used.
> Another option I looked into was to extend StopFilterFactory to create our own filter. Unfortunately, StopFilterFactory's 'create' method returns StopFilter, not TokenStream. StopFilter is also final.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Feb 3, 2012, 5:39 PM

Post #9 of 10 (36 views)
Permalink
[jira] [Commented] (SOLR-3017) Allow edismax stopword filter factory implementation to be specified [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200257#comment-13200257 ]

Michael Dodsworth commented on SOLR-3017:
-----------------------------------------

Great, thank you both.

> Allow edismax stopword filter factory implementation to be specified
> --------------------------------------------------------------------
>
> Key: SOLR-3017
> URL: https://issues.apache.org/jira/browse/SOLR-3017
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 4.0
> Reporter: Michael Dodsworth
> Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3017-without-guava-alternative.patch, SOLR-3017.patch, SOLR-3017.patch, edismax_stop_filter_factory.patch
>
>
> Currently, the edismax query parser assumes that stopword filtering is being done by StopFilter: the removal of the stop filter is performed by looking for an instance of 'StopFilterFactory' (hard-coded) within the associated field's analysis chain.
> We'd like to be able to use our own stop filters whilst keeping the edismax stopword removal goodness. The supplied patch allows the stopword filter factory class to be supplied as a param, "stopwordFilterClassName". If no value is given, the default (StopFilterFactory) is used.
> Another option I looked into was to extend StopFilterFactory to create our own filter. Unfortunately, StopFilterFactory's 'create' method returns StopFilter, not TokenStream. StopFilter is also final.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Feb 3, 2012, 5:53 PM

Post #10 of 10 (34 views)
Permalink
[jira] [Commented] (SOLR-3017) Allow edismax stopword filter factory implementation to be specified [In reply to]

[ https://issues.apache.org/jira/browse/SOLR-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200264#comment-13200264 ]

Hoss Man commented on SOLR-3017:
--------------------------------

bq. Nope. I've just committed this change in trunk. There wasn't a good reason to use a more specific type (and it was not used anywhere).

FWIW: I'm pretty sure the only reason any of these factories are declared to return specific types (instead of just TokenStream) was SOLR-396 -- which isn't really that important now that lucene & solr development is in a single repo and people can easily commit factories at the same time that new impls are added.

> Allow edismax stopword filter factory implementation to be specified
> --------------------------------------------------------------------
>
> Key: SOLR-3017
> URL: https://issues.apache.org/jira/browse/SOLR-3017
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 4.0
> Reporter: Michael Dodsworth
> Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-3017-without-guava-alternative.patch, SOLR-3017.patch, SOLR-3017.patch, edismax_stop_filter_factory.patch
>
>
> Currently, the edismax query parser assumes that stopword filtering is being done by StopFilter: the removal of the stop filter is performed by looking for an instance of 'StopFilterFactory' (hard-coded) within the associated field's analysis chain.
> We'd like to be able to use our own stop filters whilst keeping the edismax stopword removal goodness. The supplied patch allows the stopword filter factory class to be supplied as a param, "stopwordFilterClassName". If no value is given, the default (StopFilterFactory) is used.
> Another option I looked into was to extend StopFilterFactory to create our own filter. Unfortunately, StopFilterFactory's 'create' method returns StopFilter, not TokenStream. StopFilter is also final.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.