
Mailing List Archive: Lucene: Java-Dev

[jira] [Updated] (LUCENE-4286) Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams

 

 



jira at apache

Aug 3, 2012, 3:58 PM

Post #1 of 3 (67 views)
[jira] [Updated] (LUCENE-4286) Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams

[ https://issues.apache.org/jira/browse/LUCENE-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom Burton-West updated LUCENE-4286:
------------------------------------

Summary: Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams (was: Add flag to CJKBigramFilter to allow indexing unigrams as well is bigrams)

> Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams
> -------------------------------------------------------------------------
>
> Key: LUCENE-4286
> URL: https://issues.apache.org/jira/browse/LUCENE-4286
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 4.0-ALPHA, 3.6.1
> Reporter: Tom Burton-West
> Priority: Minor
>
> Add an optional flag to the CJKBigramFilter to tell it to also output unigrams. This would allow indexing of both bigrams and unigrams. At query time, the analyzer would still analyze queries as bigrams, and would emit a unigram only when the query consists of a single Han character.
> As an example, here is a configuration for a Solr fieldType whose index analyzer sets the "indexUnigrams" flag and whose query analyzer omits it.
> <fieldType name="CJK" autoGeneratePhraseQueries="false">
>   <analyzer type="index">
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" indexUnigrams="true" han="true"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" han="true"/>
>   </analyzer>
> </fieldType>
> Use case: About 10% of our queries that contain Han characters are single-character queries. The CJKBigramFilter only outputs a single character when the input has no adjacent bigrammable character. This means we have to create a separate field to index Han unigrams in order to handle single-character queries, and then write application code to search that separate field whenever we detect a single-character Han query. This is rather kludgy. With the optional flag, we could configure Solr as above.
> This is somewhat analogous to the flags in LUCENE-1370 for the ShingleFilter, which allow single-word queries (although that filter uses word n-grams rather than character n-grams).
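
The behavior requested above can be sketched in plain Java, without Lucene. This is only an illustration under stated assumptions, not the code in LUCENE-4286.patch: the class name CjkGramSketch and the method grams are hypothetical, only Han characters are treated as bigrammable, and token positions and offsets are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed behavior; not Lucene's CJKBigramFilter.
public class CjkGramSketch {

    /** True if the code point belongs to the Han script. */
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    /**
     * Emits overlapping Han bigrams. A unigram is emitted for a Han character
     * when withUnigrams is true, or when the character has no Han neighbor
     * (the existing fallback described in the issue).
     */
    static List<String> grams(String text, boolean withUnigrams) {
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < cps.length; i++) {
            if (!isHan(cps[i])) {
                continue; // non-Han input is skipped in this sketch
            }
            boolean prevHan = i > 0 && isHan(cps[i - 1]);
            boolean nextHan = i + 1 < cps.length && isHan(cps[i + 1]);
            if (withUnigrams || (!prevHan && !nextHan)) {
                out.add(new String(cps, i, 1));
            }
            if (nextHan) {
                out.add(new String(cps, i, 2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(grams("中华人民", true));  // index side, flag set
        System.out.println(grams("中华人民", false)); // index side, flag unset
        System.out.println(grams("中", false));       // single-character query
    }
}
```

With the flag unset, the indexed terms for the four-character input are bigrams only, so the single-character query term has nothing to match. With the flag set, the same query term is present in the index, which is what removes the need for a separate unigram field.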

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Aug 3, 2012, 5:23 PM

Post #2 of 3 (67 views)
[jira] [Updated] (LUCENE-4286) Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4286:
--------------------------------

Attachment: LUCENE-4286.patch

First stab at a patch. I think it's OK, but it needs more tests just to be sure.




jira at apache

Aug 3, 2012, 6:55 PM

Post #3 of 3 (67 views)
[jira] [Updated] (LUCENE-4286) Add flag to CJKBigramFilter to allow indexing unigrams as well as bigrams [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4286:
--------------------------------

Attachment: LUCENE-4286.patch

Updated patch with additional docs and tests.

This is ready to commit.


