Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev

[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

 

 

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded


jira at apache

Nov 4, 2009, 4:11 PM

Post #1 of 6 (129 views)
Permalink
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2034:
------------------------------------

Attachment: LUCENE-2034.patch

Attached a patch with a base class that solves the code duplication issue and simplifies stopword loading for contrib analyzers.
The patch contains 3 analyzers which use this new base class

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE-2034.patch
>
>
> Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary.
> In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org


jira at apache

Nov 4, 2009, 11:30 PM

Post #2 of 6 (107 views)
Permalink
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2034:
------------------------------------

Attachment: LUCENE-2034.patch

I have removed a sys.out I had accidentally in there.
The new patch contains more converted analyzers but I guess we should first get this in before we start converting all analyzers in contrib.
Pretty neat though stuff like ChineseAnalyzer only has one method in it after inheriting BaseAnalyzer

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE-2034.patch, LUCENE-2034.patch
>
>
> Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary.
> In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org


jira at apache

Nov 5, 2009, 9:49 AM

Post #3 of 6 (100 views)
Permalink
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2034:
------------------------------------

Attachment: LUCENE-2034.patch

I changed the name from BaseAnalyzer to AbstractContribAnalyzer that might do a better job than BaseAnalyzer.



> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch
>
>
> Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary.
> In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org


jira at apache

Nov 5, 2009, 5:10 PM

Post #4 of 6 (98 views)
Permalink
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2034:
------------------------------------

Attachment: LUCENE-2034.patch

I externalized the stopword related code into another base analyzer and named the analyzer AbstractAnalyzer and StopawareAnalyzer.



> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch
>
>
> Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary.
> In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org


jira at apache

Nov 9, 2009, 3:20 AM

Post #5 of 6 (71 views)
Permalink
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2034:
------------------------------------

Attachment: LUCENE-2034.txt

Updated patch to current trunk (the massive patch with @Override broke the patch)
- Moved AbstractAnalyzer to core
- updated final Analyzers in core to use the AbstractAnalyzer
- fixed TestBrazilianStemmer.java


> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt
>
>
> Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary.
> In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org


jira at apache

Nov 9, 2009, 8:35 AM

Post #6 of 6 (69 views)
Permalink
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2034:
------------------------------------

Fix Version/s: (was: 3.0)
3.1

Pushed this to 3.1.
We might want to change existing analyzers in core to use the new analyzer too and make then final. So we are better off with pushing it to 3.1

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt
>
>
> Due to the variouse tokenStream APIs we had in lucene analyzer subclasses need to implement at least one of the methodes returning a tokenStream. When you look at the code it appears to be almost identical if both are implemented in the same analyzer. Each analyzer defnes the same inner class (SavedStreams) which is unnecessary.
> In contrib almost every analyzer uses stopwords and each of them creates his own way of loading them or defines a large number of ctors to load stopwords from a file, set, arrays etc.. those ctors should be removed / deprecated and eventually removed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact lists@gossamer-threads.com
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.