
Mailing List Archive: Lucene: Java-Dev

[jira] Updated: (LUCENE-1370) Patch to make ShingleFilter output a unigram if no ngrams can be generated

 

 



jira at apache

Nov 6, 2009, 1:39 AM

Post #1 of 2 (102 views)
[jira] Updated: (LUCENE-1370) Patch to make ShingleFilter output a unigram if no ngrams can be generated

[ https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1370:
---------------------------------------

Fix Version/s: 3.0

> Patch to make ShingleFilter output a unigram if no ngrams can be generated
> --------------------------------------------------------------------------
>
> Key: LUCENE-1370
> URL: https://issues.apache.org/jira/browse/LUCENE-1370
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Reporter: Chris Harris
> Assignee: Karl Wettin
> Fix For: 3.0
>
> Attachments: LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, ShingleFilter.patch
>
>
> Currently if ShingleFilter.outputUnigrams==false and the underlying token stream is only one token long, then ShingleFilter.next() won't return any tokens. This patch provides a new option, outputUnigramIfNoNgrams; if this option is set and the underlying stream is only one token long, then ShingleFilter will return that token, regardless of the setting of outputUnigrams.
> My use case here is speeding up phrase queries. The technique is as follows:
> First, do index-time analysis using ShingleFilter (using outputUnigrams==true), which expands the text as follows:
> "please divide this sentence into shingles" ->
> "please", "please divide"
> "divide", "divide this"
> "this", "this sentence"
> "sentence", "sentence into"
> "into", "into shingles"
> "shingles"
> Second, do query-time analysis using ShingleFilter (using outputUnigrams==false and outputUnigramIfNoNgrams==true). If the user enters a phrase query, it will get tokenized in the following manner:
> "please divide this sentence into shingles" ->
> "please divide"
> "divide this"
> "this sentence"
> "sentence into"
> "into shingles"
> By doing phrase queries with bigrams like this, I can gain a very considerable speedup. Without the outputUnigramIfNoNgrams option, a single-word query would tokenize like this:
> "please" ->
> [no tokens]
> But thanks to outputUnigramIfNoNgrams, single words will now tokenize like this:
> "please" ->
> "please"
> ****
> The patch also slightly extends the tests for the options that predate outputUnigramIfNoNgrams.
> ****
> I'm not sure if the patch in this state is useful to anyone else, but I thought I should throw it up here and try to find out.
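
The tokenization behavior described in the issue can be sketched in plain Java. This is only an illustration under stated assumptions: `ShingleSketch` and its `shingle` method are hypothetical names, the sketch handles bigrams only, and it does not use or reproduce Lucene's ShingleFilter code or the attached patch. The two boolean parameters mirror the outputUnigrams and outputUnigramIfNoNgrams options named above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the bigram behavior described in the issue.
// Not Lucene code: it works on a plain list of strings, not a TokenStream.
class ShingleSketch {

    // Returns the tokens a bigram shingler would emit for the input tokens.
    static List<String> shingle(List<String> tokens,
                                boolean outputUnigrams,
                                boolean outputUnigramIfNoNgrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));              // the single word itself
            }
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1)); // the bigram
            }
        }
        // Fallback from the issue: a one-token stream yields no bigram, so
        // emit the lone token even though unigram output is switched off.
        if (out.isEmpty() && outputUnigramIfNoNgrams && tokens.size() == 1) {
            out.add(tokens.get(0));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentence =
            List.of("please", "divide", "this", "sentence", "into", "shingles");
        System.out.println(shingle(sentence, true, false));          // index time
        System.out.println(shingle(sentence, false, true));          // query time
        System.out.println(shingle(List.of("please"), false, false)); // no fallback
        System.out.println(shingle(List.of("please"), false, true));  // with fallback
    }
}
```

With the fallback disabled, the single-word input produces an empty list, which is the case the patch is meant to address; with it enabled, the word itself comes back.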

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 16, 2009, 12:40 AM

Post #2 of 2 (63 views)
[jira] Updated: (LUCENE-1370) Patch to make ShingleFilter output a unigram if no ngrams can be generated [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1370:
----------------------------------

Fix Version/s: 3.1 (was: 3.0)

I am moving this to 3.1 because it is a new feature.

> Patch to make ShingleFilter output a unigram if no ngrams can be generated
> --------------------------------------------------------------------------
>
> Key: LUCENE-1370
> URL: https://issues.apache.org/jira/browse/LUCENE-1370
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Reporter: Chris Harris
> Assignee: Karl Wettin
> Fix For: 3.1
>
> Attachments: LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, ShingleFilter.patch
>
>

