
jira at apache
Nov 6, 2009, 1:39 AM
[jira] Updated: (LUCENE-1370) Patch to make ShingleFilter output a unigram if no ngrams can be generated
[ https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1370:
---------------------------------------

    Fix Version/s: 3.0

> Patch to make ShingleFilter output a unigram if no ngrams can be generated
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-1370
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1370
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Chris Harris
>            Assignee: Karl Wettin
>             Fix For: 3.0
>
>         Attachments: LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, ShingleFilter.patch
>
>
> Currently, if ShingleFilter.outputUnigrams==false and the underlying token stream is only one token long, then ShingleFilter.next() won't return any tokens. This patch provides a new option, outputUnigramIfNoNgrams; if this option is set and the underlying stream is only one token long, then ShingleFilter will return that token, regardless of the setting of outputUnigrams.
>
> My use case here is speeding up phrase queries. The technique is as follows:
>
> First, do index-time analysis using ShingleFilter (using outputUnigrams==true), thereby expanding things as follows:
>
> "please divide this sentence into shingles" ->
>   "please", "please divide"
>   "divide", "divide this"
>   "this", "this sentence"
>   "sentence", "sentence into"
>   "into", "into shingles"
>   "shingles"
>
> Second, do query-time analysis using ShingleFilter (using outputUnigrams==false and outputUnigramIfNoNgrams==true). If the user enters a phrase query, it will get tokenized in the following manner:
>
> "please divide this sentence into shingles" ->
>   "please divide"
>   "divide this"
>   "this sentence"
>   "sentence into"
>   "into shingles"
>
> By doing phrase queries with bigrams like this, I can gain a very considerable speedup.
>
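The index-time and query-time expansions quoted above, together with the single-token fallback the patch proposes, can be sketched in plain Java without any Lucene dependency. This is only an illustration under stated assumptions: the class ShingleSketch and its method shingles are hypothetical names, not the ShingleFilter API; the input is assumed to be already split into words; and only bigrams are produced, matching the examples in the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the behaviour described in LUCENE-1370.
// It is NOT Lucene's ShingleFilter and does not use any Lucene classes.
class ShingleSketch {

    // Builds word bigrams from an already-tokenized input.
    // outputUnigrams: also emit each single token before its bigram.
    // outputUnigramIfNoNgrams: if the stream has exactly one token and
    // nothing else was emitted, emit that token anyway.
    static List<String> shingles(List<String> tokens,
                                 boolean outputUnigrams,
                                 boolean outputUnigramIfNoNgrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        // Fallback from the issue description: a one-token stream cannot
        // form a bigram, so return the lone token when the option is set.
        if (out.isEmpty() && outputUnigramIfNoNgrams && tokens.size() == 1) {
            out.add(tokens.get(0));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList(
                "please", "divide", "this", "sentence", "into", "shingles");

        // Index time: unigrams and bigrams.
        System.out.println(shingles(sentence, true, false));

        // Query time: bigrams only, with the single-token fallback enabled.
        System.out.println(shingles(sentence, false, true));
        System.out.println(shingles(Arrays.asList("please"), false, true));
    }
}
```

In this sketch the fallback only fires when nothing else was emitted, so a multi-word query still yields bigrams alone, while a one-word query yields that word instead of an empty list.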
> Without the outputUnigramIfNoNgrams option, a single-word query would tokenize like this:
>
> "please" ->
>   [no tokens]
>
> But thanks to outputUnigramIfNoNgrams, single words will now tokenize like this:
>
> "please" ->
>   "please"
>
> ****
>
> The patch also adds a little to the pre-outputUnigramIfNoNgrams option tests.
>
> ****
>
> I'm not sure if the patch in this state is useful to anyone else, but I thought I should throw it up here and try to find out.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene