
jira at apache
Nov 2, 2009, 9:44 AM
[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer
[ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

    Attachment: LUCENE-2023.patch

Refactor a lot of this analyzer:
* Move hhmm-specific stuff (like WordType, CharType, Utility) into the hhmm package.
* Move/remove tokenfilter-specific stuff (like lowercasing and full-width conversion) out of the hhmm package (uses LowerCaseFilter, adds FullWidthFilter).
* Remove the stopwords list. It was full of various punctuation, all of which got converted by "SegTokenFilter" into a comma anyway. Instead, just don't emit punctuation.

To me, this refactoring makes the analyzer easier to debug. It also happens to improve performance (up to 2500k/s now).

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow compared to, say, CJKAnalyzer on Chinese text.
> This patch improves the internal hhmm implementation.
> Time to index my Chinese corpus is 75% of the previous time.
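
For readers who want a concrete picture of the token-level steps named in the update (full-width conversion, lowercasing, and not emitting punctuation), here is a minimal sketch in plain Java. It is not the code from LUCENE-2023.patch and does not use the Lucene TokenFilter API: the class FullWidthSketch and its methods are hypothetical names chosen for illustration. It assumes only that full-width ASCII variants occupy U+FF01..U+FF5E (offset 0xFEE0 from ASCII) and that a token counts as punctuation when every code point falls in a Unicode punctuation category.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not the FullWidthFilter added by the patch.
final class FullWidthSketch {
    private FullWidthSketch() {}

    /** Maps full-width ASCII variants (U+FF01..U+FF5E) and the ideographic space (U+3000) to ASCII. */
    static String toHalfWidth(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= 0xFF01 && c <= 0xFF5E) {
                out.append((char) (c - 0xFEE0)); // fixed offset between the two ranges
            } else if (c == 0x3000) {
                out.append(' ');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Returns true if the token is non-empty and consists only of punctuation code points. */
    static boolean isPunctuation(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            switch (Character.getType(cp)) {
                case Character.CONNECTOR_PUNCTUATION:
                case Character.DASH_PUNCTUATION:
                case Character.START_PUNCTUATION:
                case Character.END_PUNCTUATION:
                case Character.INITIAL_QUOTE_PUNCTUATION:
                case Character.FINAL_QUOTE_PUNCTUATION:
                case Character.OTHER_PUNCTUATION:
                    break; // punctuation, keep scanning
                default:
                    return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    /** Drops punctuation tokens, then converts the rest to half-width and lowercases them. */
    static List<String> emit(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (isPunctuation(token)) {
                continue; // skip punctuation instead of matching it against a stopword list
            }
            out.add(toHalfWidth(token).toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        // Full-width "Lucene", a full-width comma, a Chinese word, an ideographic full stop.
        List<String> tokens = Arrays.asList(
                "\uFF2C\uFF55\uFF43\uFF45\uFF4E\uFF45", "\uFF0C", "\u4E2D\u6587", "\u3002");
        System.out.println(emit(tokens));
    }
}
```

Skipping punctuation at emit time is what makes a punctuation-only stopword list redundant in this sketch, which is in the spirit of the third bullet above; how the actual patch does it may differ.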