
Mailing List Archive: Lucene: Java-Dev

[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

 

 



jira at apache

Oct 30, 2009, 10:01 AM

Post #1 of 10 (547 views)
Permalink
[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

[ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

Attachment: LUCENE-2023.patch

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
> Key: LUCENE-2023
> URL: https://issues.apache.org/jira/browse/LUCENE-2023
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Reporter: Robert Muir
> Assignee: Robert Muir
> Fix For: 3.0
>
> Attachments: LUCENE-2023.patch
>
>
> I've noticed that SmartChineseAnalyzer is a bit slow compared to, say, CJKAnalyzer on Chinese text.
> This patch improves the internal hhmm implementation.
> With it, the time to index my Chinese corpus is 75% of the previous time.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Oct 30, 2009, 10:33 AM

Post #2 of 10 (506 views)
Permalink
[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

Priority: Minor (was: Major)




jira at apache

Oct 30, 2009, 12:45 PM

Post #3 of 10 (514 views)
Permalink
[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

Attachment: LUCENE-2023.patch

Changed the redundant bounds checks to assertions, as DM observed.
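
A minimal sketch of what such a change can look like. The class and method names here are hypothetical and are not taken from the patch. Java already bounds-checks every array access, so an explicit range test in an internal method whose callers guarantee a valid index mostly repeats that work; an assertion documents the invariant and is skipped unless the JVM runs with -ea.

```java
// Hypothetical helper for illustration only; not code from LUCENE-2023.patch.
final class CharTable {
    private final int[] frequencies;

    CharTable(int[] frequencies) {
        this.frequencies = frequencies.clone();
    }

    // Before: an explicit range check that runs on every call.
    int frequencyChecked(int index) {
        if (index < 0 || index >= frequencies.length) {
            return 0;
        }
        return frequencies[index];
    }

    // After: the invariant is stated as an assertion, which is only
    // evaluated when assertions are enabled (java -ea).
    int frequencyAsserted(int index) {
        assert index >= 0 && index < frequencies.length : "index out of range: " + index;
        return frequencies[index];
    }
}
```

The two variants only agree for valid indexes: for an invalid index the asserted version throws instead of returning 0, so the swap is safe only where callers never pass one.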




jira at apache

Nov 1, 2009, 6:50 AM

Post #4 of 10 (476 views)
Permalink
[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

Attachment: LUCENE-2023.patch

Updated patch; shaves off another 5%.

My average indexing throughput:
* cjkanalyzer: 3447k/s
* orig smartcn: 1357k/s
* patched smartcn: 1965k/s

There are serious memory consumption problems in the n^2 part of the algorithm (BiSegGraph); I will see about improving it further.





jira at apache

Nov 1, 2009, 8:45 AM

Post #5 of 10 (483 views)
Permalink
[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

Attachment: LUCENE-2023.patch

Create a generic, reusable graph that is used by both SegGraph and BiSegGraph.
This cleans up the code a lot and prevents billions of ArrayLists from being created in n^2 fashion.
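
One way to picture a reusable graph is sketched below. This is an assumption about the general technique, not the design in the patch: the class name, the flat-array layout, and the method names are all hypothetical. Edges live in parallel arrays that are kept across sentences, and reset() forgets the edges without releasing the storage, so no per-node lists are allocated.

```java
import java.util.Arrays;

// Hypothetical sketch; names and layout are illustrative, not the patch's.
final class ReusableGraph {
    private int[] from = new int[16];
    private int[] to = new int[16];
    private double[] weight = new double[16];
    private int edgeCount;

    // Forget all edges but keep the backing arrays for the next input.
    void reset() {
        edgeCount = 0;
    }

    void addEdge(int fromNode, int toNode, double w) {
        if (edgeCount == from.length) {
            // Grow all three arrays together; this happens rarely once warmed up.
            int newSize = edgeCount * 2;
            from = Arrays.copyOf(from, newSize);
            to = Arrays.copyOf(to, newSize);
            weight = Arrays.copyOf(weight, newSize);
        }
        from[edgeCount] = fromNode;
        to[edgeCount] = toNode;
        weight[edgeCount] = w;
        edgeCount++;
    }

    int edgeCount() {
        return edgeCount;
    }

    int capacity() {
        return from.length;
    }

    int edgeTarget(int edge) {
        return to[edge];
    }

    // Sum of the weights of edges leaving a node; a linear scan keeps the sketch short.
    double outWeight(int node) {
        double sum = 0;
        for (int i = 0; i < edgeCount; i++) {
            if (from[i] == node) {
                sum += weight[i];
            }
        }
        return sum;
    }
}
```

A caller would create one instance, then call reset() before building the graph for each new sentence.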





jira at apache

Nov 1, 2009, 10:39 AM

Post #6 of 10 (478 views)
Permalink
[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

Attachment: LUCENE-2023.patch

In BiSegGraph, a char[] was being created in n^2 fashion for each edge (SegTokenPair), even though it is only used for weight calculation.
Instead, add a method to BigramDictionary that returns the frequency of a bigram directly, getFrequency(char left[], char right[]), without this silliness.
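
A sketch of how a two-array lookup can avoid the temporary char[]. Everything here is an assumption made for illustration: the class name, the '@' delimiter, the FNV-1a hash, and the HashMap storage are not taken from BigramDictionary. The point is only that hashing left, a delimiter, and right in sequence gives the same key as hashing the concatenated array, so the concatenation never has to be built.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the BigramDictionary implementation.
final class BigramFrequencySketch {
    private static final char DELIMITER = '@'; // assumed separator between the two words
    private static final long FNV_OFFSET = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    // A HashMap boxes its keys; a real implementation would likely use a
    // primitive table. Hash collisions are ignored in this sketch.
    private final Map<Long, Integer> frequencies = new HashMap<Long, Integer>();

    // FNV-1a step over the two bytes of one char.
    private static long mix(long hash, char c) {
        hash ^= (c >>> 8);
        hash *= FNV_PRIME;
        hash ^= (c & 0xFF);
        hash *= FNV_PRIME;
        return hash;
    }

    // Hashes left + DELIMITER + right without allocating a joined array.
    static long hash(char[] left, char[] right) {
        long h = FNV_OFFSET;
        for (char c : left) {
            h = mix(h, c);
        }
        h = mix(h, DELIMITER);
        for (char c : right) {
            h = mix(h, c);
        }
        return h;
    }

    // The allocating variant, kept only to show both produce the same key.
    static long hashConcatenated(char[] left, char[] right) {
        char[] joined = new char[left.length + 1 + right.length];
        System.arraycopy(left, 0, joined, 0, left.length);
        joined[left.length] = DELIMITER;
        System.arraycopy(right, 0, joined, left.length + 1, right.length);
        long h = FNV_OFFSET;
        for (char c : joined) {
            h = mix(h, c);
        }
        return h;
    }

    void put(char[] left, char[] right, int frequency) {
        frequencies.put(hash(left, right), frequency);
    }

    // Returns 0 for a bigram that was never stored.
    int getFrequency(char[] left, char[] right) {
        Integer f = frequencies.get(hash(left, right));
        return f == null ? 0 : f;
    }
}
```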

New figures:
* cjkanalyzer: 3447k/s
* orig smartcn: 1357k/s
* patched smartcn: 2125k/s
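
To put these figures in relative terms, the small calculation below uses only the three numbers above; the class and method names are made up for the example. It works out to roughly 1.57 times the original smartcn throughput, about 64% of the original indexing time, and about 62% of the CJKAnalyzer throughput.

```java
// Illustrative arithmetic on the throughput figures quoted in this post.
final class ThroughputRatios {
    // How many times faster the patched run is than the baseline.
    static double speedup(double baseline, double patched) {
        return patched / baseline;
    }

    // Fraction of the baseline's indexing time that the patched run needs,
    // assuming the same corpus size for both runs.
    static double timeFraction(double baseline, double patched) {
        return baseline / patched;
    }

    public static void main(String[] args) {
        double cjk = 3447;     // cjkanalyzer
        double orig = 1357;    // orig smartcn
        double patched = 2125; // patched smartcn

        System.out.printf("speedup over orig smartcn: %.2fx%n", speedup(orig, patched));
        System.out.printf("fraction of original time: %.0f%%%n", 100 * timeFraction(orig, patched));
        System.out.printf("fraction of cjkanalyzer throughput: %.0f%%%n", 100 * patched / cjk);
    }
}
```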





jira at apache

Nov 2, 2009, 6:38 AM

Post #7 of 10 (472 views)
Permalink
[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

Attachment: LUCENE-2023.patch

Latest iteration: gets rid of SegTokenPair/PathNode.
BiSegGraph still isn't as simple or efficient as it should be,
but my indexing speed is up to 2400k/s :)




jira at apache

Nov 2, 2009, 9:44 AM

Post #8 of 10 (475 views)
Permalink
[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

Attachment: LUCENE-2023.patch

Refactor a lot of this analyzer:
* move hhmm-specific code (like WordType, CharType, Utility) into the hhmm package
* move/remove tokenfilter-specific code (like lowercasing and full-width conversion) out of the hhmm package (uses LowerCaseFilter, adds FullWidthFilter)
* remove the stopwords list; it was full of various punctuation, all of which got converted by "SegTokenFilter" into a comma anyway. Instead, just don't emit punctuation.

To me, this refactoring makes the analyzer easier to debug. It also happens to improve performance (up to 2500k/s now).
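
For readers unfamiliar with full-width conversion, here is a minimal standalone sketch of the character mapping itself. It is not the FullWidthFilter from the patch, and the class and method names are hypothetical. It relies on the Unicode layout in which the full-width ASCII variants U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from U+0021 to U+007E, and it maps the ideographic space U+3000 to an ordinary space.

```java
// Hypothetical standalone sketch of full-width to half-width folding.
final class FullWidthFolding {
    // Folds the first `length` chars of the buffer in place.
    static void fold(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            char c = buffer[i];
            if (c >= '\uFF01' && c <= '\uFF5E') {
                // Full-width ASCII variant: shift down to the ASCII range.
                buffer[i] = (char) (c - 0xFEE0);
            } else if (c == '\u3000') {
                // Ideographic space becomes a plain space.
                buffer[i] = ' ';
            }
        }
    }

    // Convenience wrapper for strings.
    static String fold(String text) {
        char[] buffer = text.toCharArray();
        fold(buffer, buffer.length);
        return new String(buffer);
    }
}
```

Working in place on a char[] matches how a token filter would typically touch a term buffer, and keeping this step separate from segmentation is what lets it be tested on its own.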





jira at apache

Nov 2, 2009, 12:12 PM

Post #9 of 10 (465 views)
Permalink
[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

Attachment: LUCENE-2023.patch

Fix WordTokenFilter to use Version: since it no longer emits delimiters for the StopFilter to remove, it has to adjust posInc itself (depending on version).
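
The position-increment bookkeeping can be sketched in plain Java as below. This is not WordTokenFilter: the class, the Token type, and the preserveGaps flag are hypothetical, with the flag standing in for whatever the Version check decides. When punctuation is dropped and gaps are preserved, the next emitted word carries an increment larger than 1, so phrase queries cannot match across the removed delimiter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of skipping punctuation while tracking position increments.
final class PunctuationSkipper {
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    // Treats a word as punctuation when it contains no letter or digit.
    static boolean isPunctuation(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (Character.isLetterOrDigit(word.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // preserveGaps is an assumed stand-in for the version-dependent setting.
    static List<Token> dropPunctuation(List<String> words, boolean preserveGaps) {
        List<Token> out = new ArrayList<Token>();
        int skipped = 0;
        for (String word : words) {
            if (isPunctuation(word)) {
                skipped++; // remember the hole left by the dropped delimiter
                continue;
            }
            int increment = preserveGaps ? 1 + skipped : 1;
            out.add(new Token(word, increment));
            skipped = 0;
        }
        return out;
    }
}
```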





jira at apache

Nov 8, 2009, 1:49 AM

Post #10 of 10 (378 views)
Permalink
[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

Fix Version/s: 3.1 (was: 3.0)

If no one objects, I'd rather work this into 3.1, along with some refactoring of this code.

DM Smith, if this causes you a problem, I would rather upload a Java 1.4 patch for you to improve your performance than slip it into 3.0. There was already a bug in this analyzer in 2.9, so I don't want to introduce a new one without having a lot of time to play with this code.



