
Mailing List Archive: Lucene: Java-Dev

[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

 

 



jira at apache

Nov 24, 2009, 3:10 AM

Post #1 of 7 (456 views)
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2034:
------------------------------------

Attachment: LUCENE-2034,patch

Updated the patch to the current trunk.
I have not removed all the deprecated methods in contrib/analyzers yet; we should open another issue for that, IMO.
This patch still breaks backwards compatibility: some of the non-final contrib analyzers now extend StopawareAnalyzer, which makes the old tokenStream / reusableTokenStream methods final. IMO this should not block this issue, for the following reasons:
1. It's in contrib, which is a different story from core.
2. It is super easy to port them.
3. It makes the API cleaner and needs less code.
4. Those analyzers might have to change anyway because of the deprecated methods.
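
To illustrate why final tokenStream / reusableTokenStream methods break subclasses yet keep porting easy, here is a minimal sketch of the template-method shape described above. Everything in it is a simplified stand-in: Tokens, BaseAnalyzer, createComponents as written here, and LowercaseWhitespaceAnalyzer are hypothetical names for illustration, not the classes or signatures in the patch, and the real methods also take a Reader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Simplified stand-ins only; none of these types are the real Lucene classes. */
final class FinalHookSketch {

    /** Hypothetical stand-in for a token stream: turns text into a list of terms. */
    interface Tokens {
        List<String> terms(String text);
    }

    /** Base class that owns stream creation and reuse, so subclasses cannot diverge. */
    abstract static class BaseAnalyzer {
        // Cached chain. Assumption: one cache per analyzer, ignoring field name
        // and threads, to keep the sketch short.
        private Tokens cached;

        /** The only method a subclass implements. */
        protected abstract Tokens createComponents(String fieldName);

        /** Final: always builds a fresh chain. */
        public final Tokens tokenStream(String fieldName) {
            return createComponents(fieldName);
        }

        /** Final: builds the chain once, then hands back the same instance. */
        public final Tokens reusableTokenStream(String fieldName) {
            if (cached == null) {
                cached = createComponents(fieldName);
            }
            return cached;
        }
    }

    /** Splits on whitespace and lowercases each term. */
    static final class LowercaseTokens implements Tokens {
        @Override
        public List<String> terms(String text) {
            List<String> out = new ArrayList<>();
            for (String t : text.trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    out.add(t.toLowerCase(Locale.ROOT));
                }
            }
            return out;
        }
    }

    /** A ported analyzer: one hook instead of two near-identical methods. */
    static final class LowercaseWhitespaceAnalyzer extends BaseAnalyzer {
        @Override
        protected Tokens createComponents(String fieldName) {
            return new LowercaseTokens();
        }
    }

    public static void main(String[] args) {
        BaseAnalyzer a = new LowercaseWhitespaceAnalyzer();
        System.out.println(a.tokenStream("body").terms("Hello Lucene World"));
        System.out.println(a.reusableTokenStream("body") == a.reusableTokenStream("body"));
    }
}
```

A subclass that used to override tokenStream or reusableTokenStream no longer compiles against a base class like this, which is the compatibility break; porting it means moving the chain construction into the single hook.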


simon

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2034,patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we have had in Lucene, analyzer subclasses need to implement at least one of the methods returning a TokenStream. When both are implemented in the same analyzer, the code is almost identical. Each analyzer also defines the same inner class (SavedStreams), which is unnecessary.
> In contrib, almost every analyzer uses stopwords, and each of them either has its own way of loading them or defines a large number of ctors to load stopwords from a file, set, array, etc. Those ctors should be deprecated and eventually removed.
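
For readers who have not seen the pattern, the duplication described in the issue looks roughly like the sketch below. It is a stand-in under stated assumptions: WhitespaceSource and OldStyleAnalyzer are hypothetical types that take a String instead of a Reader, and only the SavedStreams name comes from the issue text.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in types only; this is not the Lucene 2.9 API. */
final class DuplicatedAnalyzerSketch {

    /** Hypothetical stand-in for a tokenizer that can be re-pointed at new input. */
    static final class WhitespaceSource {
        private String text;

        WhitespaceSource(String text) {
            this.text = text;
        }

        void reset(String text) {
            this.text = text;
        }

        List<String> terms() {
            List<String> out = new ArrayList<>();
            for (String t : text.trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    out.add(t);
                }
            }
            return out;
        }
    }

    /** The pre-patch shape: each analyzer carries its own copy of all of this. */
    static final class OldStyleAnalyzer {

        /** Holder class that each analyzer redefined for itself. */
        private static final class SavedStreams {
            WhitespaceSource source;
        }

        private SavedStreams saved;

        /** Builds the chain from scratch (chain construction, copy 1). */
        WhitespaceSource tokenStream(String text) {
            return new WhitespaceSource(text);
        }

        /** Builds the same chain again (copy 2), plus hand-written caching. */
        WhitespaceSource reusableTokenStream(String text) {
            if (saved == null) {
                saved = new SavedStreams();
                saved.source = new WhitespaceSource(text);
            } else {
                saved.source.reset(text);
            }
            return saved.source;
        }
    }

    public static void main(String[] args) {
        OldStyleAnalyzer a = new OldStyleAnalyzer();
        WhitespaceSource first = a.reusableTokenStream("one two");
        WhitespaceSource second = a.reusableTokenStream("three");
        System.out.println(first == second); // same cached instance, reset to new input
        System.out.println(second.terms());
    }
}
```

With a longer filter chain, the two construction sites have to be kept in sync by hand in every analyzer, which is the duplication the issue title refers to.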

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 3:16 AM

Post #2 of 7 (440 views)
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2034:
------------------------------------

Attachment: LUCENE-2034,patch

Set the svn:eol-style property to native; I missed that in the last patch.




jira at apache

Dec 1, 2009, 5:28 AM

Post #3 of 7 (403 views)
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2034:
------------------------------------

Attachment: LUCENE-2034.patch

I updated this patch to the latest trunk. The patch doesn't remove any deprecated methods from contrib/analysis, nor does it mark the other Analyzers final. I think we should do all that in a different issue. I haven't added a note to contrib/CHANGES.TXT yet, even though the patch already breaks backwards compatibility for all non-final analyzers subclassing AbstractAnalyzer / StopawareAnalyzer.
Once we have consensus on this patch I will add it.

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt



jira at apache

Dec 11, 2009, 9:33 AM

Post #4 of 7 (373 views)
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2034:
------------------------------------

Attachment: LUCENE-2034.patch

Updated the patch to the latest trunk.




jira at apache

Dec 15, 2009, 5:04 PM

Post #5 of 7 (349 views)
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2034:
------------------------------------

Attachment: LUCENE-2034.patch

I renamed AbstractAnalyzer to ReusableAnalyzerBase, which reflects pretty much what it does. The Base suffix is fairly common throughout the JDK, much like the Abstract prefix.
I added a little more Javadoc to clarify when to subclass ReusableAnalyzerBase instead of Analyzer.

I think this is ready to go in.




jira at apache

Dec 16, 2009, 5:07 AM

Post #6 of 7 (343 views)
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2034:
--------------------------------

Attachment: LUCENE-2034.patch

Simon, the patch looks good to me.

I added a few things:
* removed the duplication in analyzers/bg
* fixed some javadoc buglets
* added CHANGES

If no one objects, I will commit in a few days.





jira at apache

Dec 17, 2009, 4:20 AM

Post #7 of 7 (326 views)
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2034:
------------------------------------

Attachment: LUCENE-2034.patch

Robert, I updated your patch: StopawareAnalyzer is now StopwordAnalyzerBase and has moved into core.
I also updated the CHANGES.TXT. This will enable us to use it in smartcn too.
That seems much more consistent.
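
As a rough picture of what a shared stopword base class buys, the sketch below keeps the stopword set and a single loader in one place, so concrete analyzers need only one constructor. It is an assumption-laden illustration: StopwordBase, loadStopwords, DemoAnalyzer, and the one-word-per-line file format with '#' comments are hypothetical choices for this example, not the contents of the patch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Stand-in types only; not the class added by the patch. */
final class StopwordBaseSketch {

    /** Hypothetical base class: owns the stopword set and the loading logic. */
    abstract static class StopwordBase {
        protected final Set<String> stopwords;

        /** Single constructor: callers pass a ready-made set (null means none). */
        protected StopwordBase(Set<String> stopwords) {
            this.stopwords = stopwords == null
                    ? Collections.<String>emptySet()
                    : Collections.unmodifiableSet(new HashSet<String>(stopwords));
        }

        public Set<String> getStopwordSet() {
            return stopwords;
        }

        /**
         * One shared loader. Assumed format: one word per line,
         * blank lines and lines starting with '#' are skipped.
         */
        public static Set<String> loadStopwords(Reader reader) {
            Set<String> words = new HashSet<>();
            try (BufferedReader in = new BufferedReader(reader)) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (!line.isEmpty() && !line.startsWith("#")) {
                        words.add(line);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return words;
        }
    }

    /** A concrete analyzer needs one ctor, not one per stopword source. */
    static final class DemoAnalyzer extends StopwordBase {
        DemoAnalyzer(Set<String> stopwords) {
            super(stopwords);
        }

        /** True if the term survives stopword filtering. */
        boolean keeps(String term) {
            return !stopwords.contains(term);
        }
    }

    public static void main(String[] args) {
        Set<String> words = StopwordBase.loadStopwords(new StringReader("# demo\nthe\nand\n"));
        DemoAnalyzer analyzer = new DemoAnalyzer(words);
        System.out.println(analyzer.keeps("fox"));
        System.out.println(analyzer.keeps("the"));
    }
}
```

Files, arrays, and other sources are converted to a Set before construction, which is what lets the per-analyzer constructor overloads be deprecated.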


