
Mailing List Archive: Lucene: Java-Dev

[jira] [Updated] (LUCENE-3842) Analyzing Suggester



jira at apache

May 10, 2012, 7:58 AM

Post #1 of 5 (238 views)
[jira] [Updated] (LUCENE-3842) Analyzing Suggester

[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3842:
--------------------------------

Attachment: LUCENE-3842.patch

Merged the patch up to trunk, but it still trips the assert in tokenStreamToAutomaton because the test begins with a stopword.

> Analyzing Suggester
> -------------------
>
> Key: LUCENE-3842
> URL: https://issues.apache.org/jira/browse/LUCENE-3842
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/spellchecker
> Affects Versions: 3.6, 4.0
> Reporter: Robert Muir
> Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch
>
>
> Since we added shortest-path wFSA search in LUCENE-3714, and generified the comparator in LUCENE-3801,
> I think we should look at implementing suggesters that have more capabilities than just basic prefix matching.
> In particular I think the most flexible approach is to integrate with Analyzer at both build and query time,
> such that we build a wFST with:
> input: analyzed text such as ghost0christmas0past <-- byte 0 here is an optional token separator
> output: surface form such as "the ghost of christmas past"
> weight: the weight of the suggestion
> we make an FST with PairOutputs<weight,output>, but only do the shortest path operation on the weight side (like
> the test in LUCENE-3801), at the same time accumulating the output (surface form), which will be the actual suggestion.
> This allows a lot of flexibility:
> * Even using StandardAnalyzer means you can offer suggestions that ignore stopwords, e.g. if you type in "ghost of chr...",
> it will suggest "the ghost of christmas past"
> * we can add support for synonyms/wdf/etc at both index and query time (there are tradeoffs here, and this is not implemented!)
> * this is a basis for more complicated suggesters such as Japanese suggesters, where the analyzed form is in fact the reading,
> so we would add a TokenFilter that copies ReadingAttribute into term text to support that...
> * other general things like offering suggestions that are more "fuzzy" like using a plural stemmer or ignoring accents or whatever.
> According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~ 100,000 QPS), and the FST size does not
> explode (it's short of twice that of a regular wFST, but this is still far smaller than TST or JaSpell, etc).
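
The idea in the description above (analyze at both build and query time, key on the analyzed form, return the surface form ranked by weight) can be illustrated with a toy sketch. This is not the patch: it uses a sorted map in place of a wFST, and the class name, the stopword list, the whitespace "analyzer", and the rule that a higher weight ranks first are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy stand-in for an analyzing suggester; all names here are hypothetical.
class AnalyzingSuggesterSketch {
    // Assumed stopword list; a real Analyzer would bring its own.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "of", "a"));
    // Token separator, playing the role of "byte 0" in the description.
    static final char SEP = '\u0000';

    // Pairs a surface form with its weight (the role of PairOutputs<weight,output>).
    static final class Suggestion {
        final String surface;
        final long weight;
        Suggestion(String surface, long weight) {
            this.surface = surface;
            this.weight = weight;
        }
    }

    // Sorted map from analyzed form to suggestions; a stand-in for the wFST.
    private final TreeMap<String, List<Suggestion>> index = new TreeMap<>();

    // Lowercase, split on whitespace, drop stopwords, join tokens with SEP.
    static String analyze(String text) {
        StringBuilder sb = new StringBuilder();
        for (String tok : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (tok.isEmpty() || STOPWORDS.contains(tok)) continue;
            if (sb.length() > 0) sb.append(SEP);
            sb.append(tok);
        }
        return sb.toString();
    }

    // Build time: key on the analyzed form, keep the surface form as output.
    void add(String surface, long weight) {
        index.computeIfAbsent(analyze(surface), k -> new ArrayList<>())
             .add(new Suggestion(surface, weight));
    }

    // Query time: analyze the typed text the same way, then prefix-match.
    List<String> lookup(String typed, int topN) {
        String prefix = analyze(typed);
        List<Suggestion> hits = new ArrayList<>();
        for (Map.Entry<String, List<Suggestion>> e : index.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break; // left the prefix range
            hits.addAll(e.getValue());
        }
        hits.sort((x, y) -> Long.compare(y.weight, x.weight)); // assumed: higher weight first
        List<String> out = new ArrayList<>();
        for (Suggestion s : hits.subList(0, Math.min(topN, hits.size()))) out.add(s.surface);
        return out;
    }

    public static void main(String[] args) {
        AnalyzingSuggesterSketch s = new AnalyzingSuggesterSketch();
        s.add("the ghost of christmas past", 50);
        System.out.println(s.lookup("ghost of chr", 5));
    }
}
```

Because both sides go through the same analyze step, the typed text "ghost of chr" and the stored entry agree on the key prefix even though the stopwords differ in the raw text.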

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

May 11, 2012, 5:39 AM

Post #2 of 5 (225 views)
[jira] [Updated] (LUCENE-3842) Analyzing Suggester [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3842:
--------------------------------

Attachment: LUCENE-3842.patch

Updated patch: I fixed the bug in tokenStreamToAutomaton (just use lastEndPos instead).



jira at apache

May 11, 2012, 9:26 AM

Post #3 of 5 (230 views)
[jira] [Updated] (LUCENE-3842) Analyzing Suggester [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3842:
---------------------------------------

Attachment: LUCENE-3842.patch

Patch, fixing TS2A (tokenStreamToAutomaton) to insert holes. This causes AnalyzingCompletionTest.testStandard to fail; we have to fix the query-time side to insert holes too.
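
A rough sketch of why both sides need to agree on holes, under stated assumptions: the stopword list, the separator and hole characters, and the HoleSketch class are hypothetical and do not reflect how the patch encodes holes. A removed stopword leaves a gap in token positions; if the build side marks that gap and the query side does not, the analyzed query is no longer a prefix of the analyzed entry.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of position "holes" left by removed stopwords.
class HoleSketch {
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "of"));
    static final char SEP = '\u0000';  // token separator (illustrative value)
    static final char HOLE = '\u0001'; // marks a removed position (illustrative value)

    // With insertHoles, a removed stopword is replaced by HOLE; otherwise it is dropped.
    static String analyze(String text, boolean insertHoles) {
        StringBuilder sb = new StringBuilder();
        for (String tok : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            boolean stop = STOPWORDS.contains(tok);
            if (stop && !insertHoles) continue;
            if (sb.length() > 0) sb.append(SEP);
            sb.append(stop ? String.valueOf(HOLE) : tok);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String built = analyze("ghost of christmas past", true);
        // Query side without holes: no longer a prefix of the built form.
        System.out.println(built.startsWith(analyze("ghost of chr", false)));
        // Query side with holes: prefix relationship is restored.
        System.out.println(built.startsWith(analyze("ghost of chr", true)));
    }
}
```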

> Analyzing Suggester
> -------------------
>
> Key: LUCENE-3842
> URL: https://issues.apache.org/jira/browse/LUCENE-3842
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/spellchecker
> Affects Versions: 3.6, 4.0
> Reporter: Robert Muir
> Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch
>
>
> Since we added shortest-path wFSA search in LUCENE-3714, and generified the comparator in LUCENE-3801,
> I think we should look at implementing suggesters that have more capabilities than just basic prefix matching.
> In particular I think the most flexible approach is to integrate with Analyzer at both build and query time,
> such that we build a wFST with:
> input: analyzed text such as ghost0christmas0past <-- byte 0 here is an optional token separator
> output: surface form such as "the ghost of christmas past"
> weight: the weight of the suggestion
> we make an FST with PairOutputs<weight,output>, but only do the shortest path operation on the weight side (like
> the test in LUCENE-3801), at the same time accumulating the output (surface form), which will be the actual suggestion.
> This allows a lot of flexibility:
> * Using even standardanalyzer means you can offer suggestions that ignore stopwords, e.g. if you type in "ghost of chr...",
> it will suggest "the ghost of christmas past"
> * we can add support for synonyms/wdf/etc at both index and query time (there are tradeoffs here, and this is not implemented!)
> * this is a basis for more complicated suggesters such as Japanese suggesters, where the analyzed form is in fact the reading,
> so we would add a TokenFilter that copies ReadingAttribute into term text to support that...
> * other general things like offering suggestions that are more "fuzzy" like using a plural stemmer or ignoring accents or whatever.
> According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~ 100,000 QPS), and the FST size does not
> explode (its short of twice that of a regular wFST, but this is still far smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

May 12, 2012, 1:18 PM

Post #4 of 5 (218 views)
[jira] [Updated] (LUCENE-3842) Analyzing Suggester [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3842:
---------------------------------------

Attachment: LUCENE-3842.patch

New patch, getting us closer ... I opened up Util.shortestPaths so that you can make a TopNSearcher and seed multiple start nodes into its queue... and I created an intersectPaths method to intersect an automaton with an FST, gathering the end nodes and the accumulated outputs. Then I fixed lookup to use these two to enumerate and complete the paths.

The first test in testStandard now passes, but not the second one (I haven't tried disabling position increments in the StopFilter yet).
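
The two-step lookup described above (intersect the query with the FST to collect end nodes, then run a top-N search seeded from all of them) might look roughly like this on a toy trie. This is a sketch, not the patch: the TrieTopNSketch class and its methods are hypothetical, a list of strings stands in for the query automaton, outputs are stored at the final node rather than accumulated along the path, and a lower cost is assumed to rank first, as in a shortest-path search.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.TreeMap;

// Toy trie standing in for the FST; all names are hypothetical.
class TrieTopNSketch {
    static final class Node {
        final Map<Character, Node> next = new TreeMap<>();
        long best = Long.MAX_VALUE; // lowest cost of any entry at or below this node
        long cost = -1;             // cost of the entry ending here, or -1 if none
        String surface;             // suggestion text for the entry ending here
    }

    // Queue item: either a node still to expand, or a finished entry.
    static final class Item {
        final Node node;
        final boolean isEntry;
        final long cost;
        Item(Node node, boolean isEntry, long cost) {
            this.node = node;
            this.isEntry = isEntry;
            this.cost = cost;
        }
    }

    final Node root = new Node();

    // Adds an entry and records the best (lowest) cost on every node along its path.
    void add(String analyzed, String surface, long cost) {
        Node n = root;
        n.best = Math.min(n.best, cost);
        for (char c : analyzed.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
            n.best = Math.min(n.best, cost);
        }
        n.cost = cost;
        n.surface = surface;
    }

    // Each string is one path accepted by the query; collect the node where it ends.
    List<Node> intersect(List<String> queryPaths) {
        Set<Node> ends = new LinkedHashSet<>();
        for (String path : queryPaths) {
            Node n = root;
            for (int i = 0; i < path.length() && n != null; i++) {
                n = n.next.get(path.charAt(i));
            }
            if (n != null) ends.add(n);
        }
        return new ArrayList<>(ends);
    }

    // Best-first search seeded with several start nodes; lowest cost comes out first.
    List<String> topN(List<Node> starts, int n) {
        PriorityQueue<Item> queue =
            new PriorityQueue<>(Comparator.comparingLong((Item it) -> it.cost));
        for (Node s : starts) queue.add(new Item(s, false, s.best));
        Set<String> results = new LinkedHashSet<>(); // a set, so overlapping seeds cannot repeat a surface
        while (!queue.isEmpty() && results.size() < n) {
            Item it = queue.poll();
            if (it.isEntry) {
                results.add(it.node.surface);
                continue;
            }
            if (it.node.cost >= 0) queue.add(new Item(it.node, true, it.node.cost));
            for (Node child : it.node.next.values()) {
                queue.add(new Item(child, false, child.best));
            }
        }
        return new ArrayList<>(results);
    }
}
```

Storing the best cost below each node lets the queue order partial paths by the cheapest completion they can still reach, so entries come out in ascending cost order no matter which seed they descend from.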



jira at apache

May 15, 2012, 8:43 AM

Post #5 of 5 (217 views)
[jira] [Updated] (LUCENE-3842) Analyzing Suggester [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3842:
---------------------------------------

Attachment: LUCENE-3842.patch

New patch, fixing some nocommits, and a separate bug where WFSTSuggester can return a duplicate suggestion. Still more to do...

