Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

 

 



jira at apache

Nov 4, 2009, 4:15 PM

Post #1 of 37 (1145 views)
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773723#action_12773723 ]

Simon Willnauer commented on LUCENE-2034:
-----------------------------------------

By the way, if somebody comes up with a better name for the analyzer, speak up!
@robert: no super, fast or smart please :)

simon

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE-2034.patch
>
>
> Due to the various tokenStream APIs we have had in Lucene, analyzer subclasses need to implement at least one of the methods returning a tokenStream. When you look at the code, it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defines the same inner class (SavedStreams), which is unnecessary.
> In contrib, almost every analyzer uses stopwords, and each of them creates its own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc. Those ctors should be deprecated and eventually removed.
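
To make the stopword part of the description concrete, here is a minimal sketch of what a single Set-based constructor could look like. It is not taken from the attached patch: the names StopwordSets, StopwordAnalyzerSketch and ExampleAnalyzerSketch are hypothetical, and the file format (one word per line, '#' for comments) is an assumption made for illustration only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: every way of obtaining stopwords ends in one immutable Set. */
final class StopwordSets {
  private StopwordSets() {}

  /** Builds an unmodifiable set from an array of words. */
  static Set<String> of(String... words) {
    return Collections.unmodifiableSet(new LinkedHashSet<>(Arrays.asList(words)));
  }

  /** Reads one stopword per line; blank lines and lines starting with '#' are skipped (assumed format). */
  static Set<String> load(Reader reader) throws IOException {
    Set<String> result = new LinkedHashSet<>();
    try (BufferedReader in = new BufferedReader(reader)) {
      String line;
      while ((line = in.readLine()) != null) {
        String word = line.trim();
        if (!word.isEmpty() && !word.startsWith("#")) {
          result.add(word);
        }
      }
    }
    return Collections.unmodifiableSet(result);
  }
}

/** Hypothetical base class: subclasses only need a single Set-based constructor. */
abstract class StopwordAnalyzerSketch {
  private final Set<String> stopwords;

  protected StopwordAnalyzerSketch(Set<String> stopwords) {
    // Defensive copy, so later changes to the caller's set cannot leak in.
    this.stopwords = Collections.unmodifiableSet(new LinkedHashSet<>(stopwords));
  }

  public final Set<String> getStopwords() {
    return stopwords;
  }
}

/** Example subclass with exactly one constructor. */
final class ExampleAnalyzerSketch extends StopwordAnalyzerSketch {
  ExampleAnalyzerSketch(Set<String> stopwords) {
    super(stopwords);
  }
}
```

With this shape, loading from a file, an array or an existing set happens outside the analyzer, so the per-analyzer File/String[]/Reader constructors would no longer be needed.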

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 5, 2009, 10:03 AM

Post #2 of 37 (1099 views)

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774006#action_12774006 ]

Uwe Schindler commented on LUCENE-2034:
---------------------------------------

And then we also have an AbstractCoreAnalyzer? Weird...

I want to bring this to core, too.

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch


jira at apache

Nov 5, 2009, 10:17 AM

Post #3 of 37 (1100 views)

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774016#action_12774016 ]

Simon Willnauer commented on LUCENE-2034:
-----------------------------------------

bq. And then we also have an AbstractCoreAnalyzer? weird... I want to bring this to core, too.

I understand! Let's get this into contrib under either name. Once we move this up into core it will be called Analyzer anyway, so we can refactor it in contrib easily. The name AbstractContribAnalyzer would then be OK again, as it would only contain the stopword convenience.



jira at apache

Nov 5, 2009, 10:52 AM

Post #4 of 37 (1100 views)

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774029#action_12774029 ]

Uwe Schindler commented on LUCENE-2034:
---------------------------------------

Even in core it will be a separate class, because it makes tokenStream() and reusableTokenStream() final, so users who want to create an old-style Analyzer cannot do so with it. So we need a good name even for core.
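
A rough sketch of the shape being discussed, under stated assumptions: the two public methods are final and subclasses implement a single factory method instead. None of these names come from the patch; StreamChain, TemplateAnalyzerSketch and createChain are hypothetical stand-ins, and the sketch ignores per-field caching for brevity.

```java
import java.io.Reader;

/** Stand-in for a tokenizer-plus-filters chain; NOT Lucene's TokenStream. */
final class StreamChain {
  private Reader input;

  StreamChain(Reader input) {
    this.input = input;
  }

  /** Points the existing chain at a new document. */
  void reset(Reader newInput) {
    this.input = newInput;
  }

  Reader input() {
    return input;
  }
}

/** Hypothetical base class with final stream methods and one extension point. */
abstract class TemplateAnalyzerSketch {
  // One cached chain per thread, mirroring the reuse idea described in the thread.
  private final ThreadLocal<StreamChain> cached = new ThreadLocal<>();

  /** The only method a subclass implements. */
  protected abstract StreamChain createChain(String fieldName, Reader reader);

  /** Final: always builds a fresh chain. */
  public final StreamChain tokenStream(String fieldName, Reader reader) {
    return createChain(fieldName, reader);
  }

  /** Final: reuses the calling thread's chain and only swaps the input. */
  public final StreamChain reusableTokenStream(String fieldName, Reader reader) {
    StreamChain chain = cached.get();
    if (chain == null) {
      chain = createChain(fieldName, reader);
      cached.set(chain);
    } else {
      chain.reset(reader);
    }
    return chain;
  }
}

/** Example subclass; counts how often a chain is actually built. */
final class WhitespaceLikeAnalyzerSketch extends TemplateAnalyzerSketch {
  int created = 0;

  @Override
  protected StreamChain createChain(String fieldName, Reader reader) {
    created++;
    return new StreamChain(reader);
  }
}
```

Because both public methods are final, a subclass cannot override tokenStream() in the old style, which is why such a class would have to live next to the existing Analyzer rather than replace it.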



jira at apache

Nov 5, 2009, 10:58 AM

Post #5 of 37 (1099 views)

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774032#action_12774032 ]

Simon Willnauer commented on LUCENE-2034:
-----------------------------------------

I would be happy to get a better name for it. Any suggestions? I'm having a hard time finding one.

It's your turn, Uwe :)



jira at apache

Nov 8, 2009, 12:15 PM

Post #6 of 37 (1079 views)

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774816#action_12774816 ]

Robert Muir commented on LUCENE-2034:
-------------------------------------

Simon, I started looking at this. The testStemExclusionTable() test for BrazilianAnalyzer is actually not related to stopwords and should not be changed.

BrazilianAnalyzer has a .setStemExclusionTable() method that allows you to supply a set of words that should not be stemmed.

This test ensures that if you change the stem exclusion table with this method, reusableTokenStream() forces the creation of a new BrazilianStemFilter with the modified exclusion table, so that the change takes effect immediately, the way it did with .tokenStream() before this analyzer supported reusableTokenStream().


> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch


jira at apache

Nov 9, 2009, 3:18 AM

Post #7 of 37 (1066 views)

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774940#action_12774940 ]

Simon Willnauer commented on LUCENE-2034:
-----------------------------------------

bq. the testStemExclusionTable() for BrazilianAnalyzer is actually not related to stopwords and should not be changed.
I agree. I failed to extend the test case and instead changed it to test the constructor only. I will extend it instead.
This test case is actually a duplicate of testExclusionTableReuse(); it should test tokenStream() instead of reusableTokenStream(). I will fix this too.

bq. This test is to ensure that if you change the stem exclusion table with this method, that reusableTokenStream will force the creation of a new BrazilianStemFilter with this modified exclusion table so that it will take effect immediately, the way it did with .tokenStream() before this analyzer supported reusableTokenStream()

That is actually what testExclusionTableReuse() does.

bq. also, i think this setStemExclusionTable stuff is really unrelated to your patch, but a reuse challenge in at least this analyzer. one way to solve it would be to...

I agree with your first point that this is kind of unrelated. I guess we should do that in a different issue, although I think it is not that big a deal, as it does not change any functionality.
I disagree about the reuse challenge. In my opinion analyzers should be immutable; that's why I deprecated those methods and added the set to the constructor. The problem with those setters is that you have to be in the same thread to change your set, as this will only invalidate the cached version of a token stream held in a ThreadLocal. The implementation is ambiguous and should go away. The analyzer itself can be shared, but the behaviour is kind of unpredictable if you reset the set. If there is an instance of this analyzer around and you call the setter, you would expect the analyzer to use the new set from the very moment you call the setter, which is not always true.
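
The cross-thread problem can be reproduced in a few lines. This is a simplified sketch, not the BrazilianAnalyzer code: CachedStemChain, MutableAnalyzerSketch and StaleSetterDemo are hypothetical names, and the sketch assumes the setter clears only the calling thread's ThreadLocal entry, as described above.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Stand-in for a cached filter chain that captured an exclusion set when it was built. */
final class CachedStemChain {
  final Set<String> exclusions;

  CachedStemChain(Set<String> exclusions) {
    this.exclusions = exclusions;
  }
}

/** Hypothetical mutable analyzer mimicking the setter-plus-ThreadLocal pattern. */
final class MutableAnalyzerSketch {
  private volatile Set<String> exclusions = Collections.emptySet();
  private final ThreadLocal<CachedStemChain> cached = new ThreadLocal<>();

  void setStemExclusionTable(Set<String> newExclusions) {
    exclusions = new HashSet<>(newExclusions);
    cached.remove(); // clears the cache of the calling thread only
  }

  CachedStemChain reusableChain() {
    CachedStemChain chain = cached.get();
    if (chain == null) {
      chain = new CachedStemChain(exclusions);
      cached.set(chain);
    }
    return chain;
  }
}

final class StaleSetterDemo {
  /** Returns true if the worker thread still sees the old set after the setter ran on another thread. */
  static boolean workerSeesStaleSet() {
    MutableAnalyzerSketch analyzer = new MutableAnalyzerSketch();
    // A single-thread executor, so both tasks run on the same worker thread.
    ExecutorService worker = Executors.newSingleThreadExecutor();
    try {
      // 1. The worker builds and caches a chain with the (empty) initial set.
      worker.submit(() -> { analyzer.reusableChain(); }).get();
      // 2. The calling thread changes the set; only its own cache entry is cleared.
      analyzer.setStemExclusionTable(Collections.singleton("quintessencia"));
      // 3. The worker asks again and gets its old cached chain back.
      return worker.submit(
          () -> !analyzer.reusableChain().exclusions.contains("quintessencia")).get();
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      worker.shutdown();
    }
  }
}
```

Passing the set to the constructor and keeping it in a final field removes this case entirely, since there is nothing left to invalidate.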








jira at apache

Nov 9, 2009, 8:13 AM

Post #8 of 37 (1063 views)

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775011#action_12775011 ]

Robert Muir commented on LUCENE-2034:
-------------------------------------

Simon, good solution. I agree we should deprecate these analyzer 'setter' methods, which just make things complicated for no good reason.

I wonder if we should consider a different name for this AbstractAnalyzer, since it exists to support/encourage tokenstream reuse. I think when Shai Erera brought the idea up before, he proposed ReusableAnalyzer or something like that?


> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt


jira at apache

Nov 9, 2009, 8:17 AM

Post #9 of 37 (1055 views)

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775014#action_12775014 ]

Simon Willnauer commented on LUCENE-2034:
-----------------------------------------

bq. i wonder if we should consider a different name for this AbstractAnalyzer, since it exists to support/encourage tokenstream reuse. I think when Shai Erera brought the idea up before he proposed ReusableAnalyzer or something like that?

I agree that this is not a very good name, as we all discussed during ApacheCon. With ReusableAnalyzer, I would guess people would expect this analyzer to be reusable, which isn't the case, or rather is not what this analyzer is doing. What if we call it ComponentAnalyzer or NewStyleAnalyzer or SmartAnalyzer (ok, just kidding)?

simon



jira at apache

Nov 16, 2009, 6:49 AM

Post #10 of 37 (968 views)

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778363#action_12778363 ]

Robert Muir commented on LUCENE-2034:
-------------------------------------

Simon, here
{quote}
source.reset(reader);
if (sink != source)
  sink.reset(); // only reset if the sink reference is different from source
{quote}

We had a discussion on the mailing list about this: http://www.lucidimagination.com/search/document/cd8a94ebc8a4ea99/bug_in_standardanalyzer_stopanalyzer

I think we should consider removing the if and calling sink.reset() unconditionally.
A bad consumer might not follow the rules, although the TokenStream javadoc says that consumers should call reset().
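
To illustrate the difference, here is a small sketch of a holder that resets both ends of the chain unconditionally. The classes are hypothetical stand-ins rather than Lucene types, and the sketch assumes that swapping the Reader on the source does not by itself clear per-document state.

```java
import java.io.Reader;

/** Stand-in for a stream with per-document state; NOT Lucene's TokenStream. */
class StageSketch {
  int position = 0; // state that must be cleared between documents

  void advance() {
    position++;
  }

  void reset() {
    position = 0;
  }
}

/** Stand-in for a tokenizer: a stage that also owns the input Reader. */
final class SourceStageSketch extends StageSketch {
  Reader input;

  /** Swaps the input only; other state is left untouched (assumption of this sketch). */
  void reset(Reader reader) {
    input = reader;
  }
}

/** Hypothetical holder for the head (source) and tail (sink) of a reusable chain. */
final class ChainHolderSketch {
  final SourceStageSketch source;
  final StageSketch sink;

  ChainHolderSketch(SourceStageSketch source, StageSketch sink) {
    this.source = source;
    this.sink = sink;
  }

  /** Prepares the chain for a new document. */
  void reset(Reader reader) {
    source.reset(reader);
    sink.reset(); // unconditional: also runs when sink == source
  }
}
```

With the original guard, a chain without filters (sink == source) would keep its old position unless the consumer remembered to call reset() itself; the unconditional call covers that case at the cost of one possibly redundant reset.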


> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt


jira at apache

Nov 24, 2009, 3:20 AM

Post #11 of 37 (944 views)

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781870#action_12781870 ]

Uwe Schindler commented on LUCENE-2034:
---------------------------------------

bq. set svn EOF property to native - missed that in the last patch
You can configure your SVN client to do it automatically and also add the $ID$ props.

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt


jira at apache

Nov 24, 2009, 3:46 AM

Post #12 of 37 (940 views)

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781877#action_12781877 ]

Robert Muir commented on LUCENE-2034:
-------------------------------------

Simon, in my opinion it is OK to make tokenStream()/reusableTokenStream() final for those non-final contrib analyzers.

I think you should make those non-final analyzers final, too.

Then we can get rid of the complexity for sure.


> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt


jira at apache

Nov 24, 2009, 3:54 AM

Post #13 of 37 (945 views)

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781884#action_12781884 ]

Simon Willnauer commented on LUCENE-2034:
-----------------------------------------

bq. i think you should make those non-final analyzers final, too.
+1

I think the analyzers should always be final. Maybe there are special cases, but for most of them nobody should subclass.
It is the same amount of work to make your own anyway.

simon



jira at apache

Nov 25, 2009, 8:01 AM

Post #14 of 37 (934 views)

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782472#action_12782472 ]

Simon Willnauer commented on LUCENE-2034:
-----------------------------------------

bq. i think you should make those non-final analyzers final, too.
I would prefer to open a separate issue for making those final, removing deprecated methods, making public String[] private etc.

Once this is in, we can refactor all other analyzers and fix them case by case.

simon



jira at apache

Nov 30, 2009, 8:49 AM

Post #15 of 37 (875 views)
Permalink
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783737#action_12783737 ]

DM Smith commented on LUCENE-2034:
----------------------------------

I was trying to lurk, but I'm not able to apply the latest patch against trunk. I'm not sure if it's me (using Eclipse) or the patch.



jira at apache

Nov 30, 2009, 9:01 AM

Post #16 of 37 (882 views)
Permalink
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783741#action_12783741 ]

Simon Willnauer commented on LUCENE-2034:
-----------------------------------------

bq. I was trying to lurk, but I'm not able to apply the latest patch against trunk. I'm not sure if it's me (using Eclipse) or the patch.
It's most likely the patch. There is so much going on around the analyzers right now. We are trying to get LUCENE-2094 in first and will get this one ready once it is in. I will update this patch soon.



jira at apache

Dec 1, 2009, 9:40 AM

Post #17 of 37 (863 views)
Permalink
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784303#action_12784303 ]

DM Smith commented on LUCENE-2034:
----------------------------------

Patch looks good. I like how this simplifies the classes.

Some comments based on my use case, which allows a user creating an index to decide whether to use Lucene's default stop words or no stop words at all. No stop words is the default. (I'm also allowing stemming to be optional, but on by default.) These two options require me to duplicate each of the contrib analyzers while reusing their parts. (If you're interested, each Lucene index is a whole book, where each paragraph is a document. Every word is potentially meaningful, so stop words are not used by default.)

Regarding stop words:
* Some of the analyzers allow null to be specified for the stop word list. Others require an empty set/file/reader. Those deriving from StopawareAnalyzer allow null. I'd like to see the ability to use null carried through the rest of the analyzers.
* Some of the analyzers are cluttered with stopword list processing. Maybe WordListLoader could be extended to handle the other ways that contrib/analyzers store their lists? Specifically, how about moving StopawareAnalyzer.loadStopwordSet(...) there? It seems to be a better place.
* How about splitting out the stop words to their own class? (I'm digging the word lists out of the analyzers and the lack of uniformity is a pain. Having them standalone would be useful.)
* If not, how about adding public static Set<?> getDefaultStopSet() to StopawareAnalyzer?
* Shouldn't StopawareAnalyzer be in core and used in StopAnalyzer? Could it be merged into StopAnalyzer? Other than loadStopwordSet, it really only adds a method to get the current stopword list.

Regarding 3.1:
There are some TODOs in the code to make this or that private or final. If this is going to wait for 3.1 shouldn't they change?

On a separate note:
In WordListLoader the return types are not Set or Map, but HashSet and HashMap. What's up with that? Should anyone care what the particular implementation is?
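
The point about return types can be illustrated with a minimal sketch. The names below (WordLists, loadWords) are hypothetical and are not the WordListLoader API; the sketch only shows that a caller written against Set is unaffected if the concrete class behind it changes later.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not the real WordListLoader.
final class WordLists {
  private WordLists() {}

  // The declared return type is the interface. The HashSet used internally
  // is an implementation detail that callers cannot depend on.
  static Set<String> loadWords(String... words) {
    Set<String> result = new HashSet<String>(Arrays.asList(words));
    return Collections.unmodifiableSet(result);
  }
}
```

A caller would write Set<String> stop = WordLists.loadWords("a", "the"); and never name HashSet, so the implementation could later switch to another Set without breaking that caller.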




jira at apache

Dec 1, 2009, 9:52 AM

Post #18 of 37 (860 views)
Permalink
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784305#action_12784305 ]

Robert Muir commented on LUCENE-2034:
-------------------------------------

Hi DM, in response to your comments, I would prefer all stoplists to actually be in the resources/* folder as text files.

The reasoning is to encourage use of the different parts of the analyzer. For example, a Solr user can specify a Russian stopword list embedded in a Russian analyzer without using the analyzer itself (maybe they want to use the stemmer, but after WordDelimiterFilter and things like that).

Somewhat related: I also want to add the stoplists that the Snowball project creates to the snowball package in contrib; see LUCENE-2055.
This would allow us to remove duplicated functionality, namely analyzers we have coded in Java in Lucene that are essentially the same as what Snowball does already.
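
A minimal sketch of reading a stoplist from a text file, independent of any analyzer. Everything here is an assumption made for illustration: the class name StopwordFiles, the one-word-per-line format, and the use of '#' for comment lines are not taken from the patch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical loader; file format (one word per line, '#' comments) is assumed.
final class StopwordFiles {
  private StopwordFiles() {}

  // Reads all non-empty, non-comment lines into a set and closes the reader.
  static Set<String> load(Reader in) throws IOException {
    Set<String> words = new LinkedHashSet<String>();
    BufferedReader reader = new BufferedReader(in);
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        String word = line.trim();
        if (word.length() == 0 || word.startsWith("#")) {
          continue; // skip blank lines and comments
        }
        words.add(word);
      }
    } finally {
      reader.close();
    }
    return words;
  }
}
```

For a list shipped as a classpath resource, the Reader could be built with new InputStreamReader(SomeClass.class.getResourceAsStream("stopwords.txt"), "UTF-8"), where SomeClass and the file name are placeholders.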




jira at apache

Dec 1, 2009, 10:36 AM

Post #19 of 37 (870 views)
Permalink
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784328#action_12784328 ]

Uwe Schindler commented on LUCENE-2034:
---------------------------------------

{quote}
On a separate note:
In WordListLoader the return types are not Set or Map, but HashSet and HashMap. What's up with that? Should anyone care what the particular implementation is?
{quote}

That's historical. For 2.9 it was not possible to provide a covariant method with a different return type, for backwards-compatibility reasons, so the old ones could not be deprecated. With 3.0 they stayed alive, and so they are still there.

With Java 1.5, it should be possible to provide a covariant overload and deprecate the specializations. I will try this out in a separate issue!
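
As background, a minimal sketch of the language feature involved: since Java 5 an overriding method may declare a narrower return type than the method it overrides. The classes below are hypothetical and only demonstrate the rule on instance methods; how this would be applied to WordListLoader is left to the separate issue mentioned above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical types for illustration only.
class WordSource {
  // General contract: callers only see a Set.
  Set<String> words() {
    return new HashSet<String>();
  }
}

class HashWordSource extends WordSource {
  // Covariant return type: HashSet<String> is a subtype of Set<String>,
  // so this is a legal override since Java 5.
  @Override
  HashSet<String> words() {
    HashSet<String> set = new HashSet<String>();
    set.add("the");
    return set;
  }
}
```

Code that holds a WordSource still receives a Set, while code that holds a HashWordSource gets the more specific type without a cast.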



jira at apache

Dec 1, 2009, 10:36 AM

Post #20 of 37 (869 views)
Permalink
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784327#action_12784327 ]

DM Smith commented on LUCENE-2034:
----------------------------------

Robert, I'd like them to be in files as well. But when it really gets down to it, a uniform interface to the default stop word list is what really matters to me.

Like your use case, I don't see the provided analyzers as much more than a suggestion and default implementation. Currently and in this patch, I have to use them to get to the stop words.

I'm trying to figure out a way to specify a tokenizer/filter chain. (I've been trying to figure it out for a while, but not with much effort or success). Something like:
{code}
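TokenStream construct(Version v, String fieldName, Reader r, StreamSpec... specs) {
  // The first spec creates the source (the tokenizer) from the reader.
  TokenStream result = specs[0].create(v, fieldName, r);
  // Each remaining spec wraps the stream built so far (a filter).
  for (int i = 1; i < specs.length; i++) {
    result = specs[i].create(v, fieldName, result);
  }
  return result;
}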
{code}

The purpose of the StreamSpec is to allow a late binding of tokenizers/filters into a chain.
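
The late-binding idea can be sketched in a self-contained way, under loudly stated assumptions: a plain List<String> stands in for Lucene's TokenStream, whitespace splitting stands in for the tokenizer, and StreamSpec, LowerCaseSpec, StopSpec and Chains are hypothetical names, not part of Lucene or of the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical: each spec turns the tokens produced so far into new tokens.
interface StreamSpec {
  List<String> create(List<String> input);
}

// Stand-in for a lowercasing filter.
final class LowerCaseSpec implements StreamSpec {
  public List<String> create(List<String> input) {
    List<String> out = new ArrayList<String>();
    for (String token : input) {
      out.add(token.toLowerCase(Locale.ROOT));
    }
    return out;
  }
}

// Stand-in for a stop filter.
final class StopSpec implements StreamSpec {
  private final List<String> stopwords;

  StopSpec(String... stopwords) {
    this.stopwords = Arrays.asList(stopwords);
  }

  public List<String> create(List<String> input) {
    List<String> out = new ArrayList<String>();
    for (String token : input) {
      if (!stopwords.contains(token)) {
        out.add(token);
      }
    }
    return out;
  }
}

final class Chains {
  private Chains() {}

  // The chain is decided by the caller at call time, not baked into an analyzer class.
  static List<String> construct(String text, StreamSpec... specs) {
    List<String> result = new ArrayList<String>();
    for (String token : text.trim().split("\\s+")) {
      if (token.length() > 0) {
        result.add(token); // whitespace splitting plays the tokenizer role
      }
    }
    for (StreamSpec spec : specs) {
      result = spec.create(result);
    }
    return result;
  }
}
```

With these stand-ins, Chains.construct("The quick Fox", new LowerCaseSpec(), new StopSpec("the")) lowercases first and then drops the stop word; swapping the order of the two specs would change the result, which is the kind of decision the spec list makes explicit.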

The other part would be to generate a Manifest with version info for Lucene, Java and each component that could be stored in (or with) the index. That way one could compare the manifest to see if the index needs to be rebuilt. This manifest could also be used to reconstruct the TokenStream.




jira at apache

Dec 1, 2009, 10:46 AM

Post #21 of 37 (869 views)
Permalink
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784333#action_12784333 ]

Robert Muir commented on LUCENE-2034:
-------------------------------------

bq. Robert, I'd like them to be in files as well. But when it really gets down to it, a uniform interface to the default stop word list is what really matters to me.

DM, I think we can have both? A method to get the default stopword list, but then they also happen to be in text files too?




jira at apache

Dec 1, 2009, 11:03 AM

Post #22 of 37 (865 views)
Permalink
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784338#action_12784338 ]

DM Smith commented on LUCENE-2034:
----------------------------------

Robert:
bq. DM, I think we can have both? A method to get the default stopword list, but then they also happen to be in text files too?

Yes.

Uwe:
bq. Ideally the new methods should return Set<?> but implement this by a CharArraySet (which would be possible then). At the moment the sets are always copied to CharArraySet in each Analyzer.

I agree. That could also simplify some of what Simon is doing. However, the one distinctive feature of CharArraySet is that it can take input that is not lowercase and ignore the casing. This is what Simon's StopawareAnalyzer.loadStopwordSet(...) allows.
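
To make the case-insensitivity point concrete without depending on Lucene, here is a minimal sketch that uses a JDK TreeSet with String.CASE_INSENSITIVE_ORDER as a stand-in. It is not CharArraySet and says nothing about its performance; the class name CaseInsensitiveWords is hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in that mimics only the "ignore case" lookup behaviour.
final class CaseInsensitiveWords {
  private CaseInsensitiveWords() {}

  // Membership tests on the returned set ignore the casing of the input words.
  static Set<String> of(String... words) {
    Set<String> set = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
    set.addAll(Arrays.asList(words));
    return set;
  }
}
```

A list loaded as "The", "AND" then matches the lowercased tokens "the" and "and", so the stop list itself does not have to be normalized first.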

BTW, in some of the analyzers sometimes it is a CharArraySet and other times it is not (when it is via this class). This would make the treatment uniform.



jira at apache

Dec 1, 2009, 1:13 PM

Post #23 of 37 (862 views)
Permalink
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784432#action_12784432 ]

Simon Willnauer commented on LUCENE-2034:
-----------------------------------------

bq. Some of the analyzers allow for null to be specified for the stop word list. Others require an empty set/file/reader. Those deriving from StopawareAnalyzer allow null.
That is true - StopawareAnalyzer uses an empty set if you pass null.
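
A minimal sketch of that null handling, assuming a hypothetical base class named StopwordHolder. This is not the StopawareAnalyzer from the patch; it only shows the "null means no stop words" convention applied in one place.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical base class; not the class from the patch.
abstract class StopwordHolder {
  protected final Set<String> stopwords;

  // null is treated as "no stop words"; otherwise a defensive, read-only copy is kept.
  protected StopwordHolder(Set<String> stopwords) {
    this.stopwords = stopwords == null
        ? Collections.<String>emptySet()
        : Collections.unmodifiableSet(new HashSet<String>(stopwords));
  }

  Set<String> getStopwordSet() {
    return stopwords;
  }
}
```

Subclasses then never need their own null checks, which is the uniformity asked for in the quoted comment.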

bq. I'd like to see the ability to use null to follow through the rest of the analyzers.
bq. * Some of the analyzers are cluttered with stopword list processing.
The analyzers in this patch are a PoC rather than a complete list. Eventually all analyzers with stopwords will extend StopawareAnalyzer; that is also the reason why we have this class. This and some other issues aim to eventually provide a consistent way of processing everything related to stopwords. We will also remove all the setters and have Set<?>-only ctors for consistency.

bq. If not how about adding public static Set<?> getDefaultStopSet() to StopawareAnalyzer?
The problem is that it is static, and it should be static. That's why we define it in each analyzer that uses stopwords. I would like to have it generalized, but this seems to be the ideal solution. We could have something like getDefaultStopSet(Class<? extends StopawareAnalyzer>), but I like the expressiveness of getDefaultStopSet() much better.
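
A minimal sketch of a per-analyzer static accessor, assuming a hypothetical class ExampleAnalyzer and a placeholder word list. It uses the initialization-on-demand holder idiom so the set is built once, on first access; whether the patch uses this idiom is not stated in the thread.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical analyzer-like class; the words are placeholders, not a real default list.
final class ExampleAnalyzer {
  private ExampleAnalyzer() {}

  // The nested class is only initialized when getDefaultStopSet() is first called.
  private static final class DefaultSetHolder {
    static final Set<String> DEFAULT_STOP_SET =
        Collections.unmodifiableSet(new HashSet<String>(Arrays.asList("a", "an", "the")));
  }

  // Static, so callers can get the list without constructing the analyzer.
  static Set<String> getDefaultStopSet() {
    return DefaultSetHolder.DEFAULT_STOP_SET;
  }
}
```

Because static methods are not inherited polymorphically, each analyzer would repeat this small accessor, which is the trade-off described above.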

bq. How about splitting out the stop words to their own class?
What do you mean by that? Can you elaborate?

bq. There are some TODOs in the code to make this or that private or final. If this is going to wait for 3.1 shouldn't they change?
They should actually go away, but I kept them in there because they are somewhat unrelated to this particular issue. Once this is in we will work on removing the deprecated stuff and making analyzers final (at least in contrib).

bq. In WordListLoader the return types are not Set or Map, but HashSet and HashMap. What's up with that? Should anyone care what the particular implementation is?
That is one thing I hate about WordListLoader. +1 towards Uwe working on them!

bq. I'm trying to figure out a way to specify a tokenizer/filter chain. (I've been trying to figure it out for a while, but not with much effort or success).
This has been discussed already, without much success. I cannot remember the issue (Robert, can you remember the factory issue?) but it was basically based on a factory pattern. This would also be my approach to it. That way we could get rid of almost every analyzer. I use such a pattern myself, and it works quite well.

bq. DM, I think we can have both? A method to get the default stopword list, but then they also happen to be in text files too?
+1 for having those words in files. Nevertheless, we will still have a default stopword list.



jira at apache

Dec 1, 2009, 1:17 PM

Post #24 of 37 (860 views)
Permalink
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784434#action_12784434 ]

Robert Muir commented on LUCENE-2034:
-------------------------------------

{quote}
This has been discussed already, without much success. I cannot remember the issue (Robert, can you remember the factory issue?) but it was basically based on a factory pattern. This would also be my approach to it. That way we could get rid of almost every analyzer. I use such a pattern myself, and it works quite well.
{quote}

I think the only issue is that if I were to design such a thing, it would look just like how the analysis factories work in Solr... (already a solved problem)... maybe I am missing something?



jira at apache

Dec 1, 2009, 1:27 PM

Post #25 of 37 (863 views)
Permalink
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784440#action_12784440 ]

Simon Willnauer commented on LUCENE-2034:
-----------------------------------------

bq. I think the only issue is that if I were to design such a thing, it would look just like how the analysis factories work in Solr... (already a solved problem)... maybe I am missing something?

No, you are not missing anything. I just did that when there was no Solr around. It works pretty much the same way, though.

