
Mailing List Archive: Lucene: Java-Dev

[jira] [Commented] (LUCENE-4024) FuzzyQuery should never do edit distance > 2

 

 



jira at apache

May 2, 2012, 12:10 PM

Post #1 of 12 (96 views)

[ https://issues.apache.org/jira/browse/LUCENE-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266829#comment-13266829 ]

Michael McCandless commented on LUCENE-4024:
--------------------------------------------

+1

Nice to see LinearFuzzyTE moved out of core!

> FuzzyQuery should never do edit distance > 2
> --------------------------------------------
>
> Key: LUCENE-4024
> URL: https://issues.apache.org/jira/browse/LUCENE-4024
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-4024.patch
>
>
> Edit distance 1 and 2 are now very, very fast compared to 3.x (100X-200X faster) ... but edit distance 3 will fall back to the super-slow scan of all terms in 3.x, which is not graceful degradation.
> Not sure how to fix it ... maybe we have a SlowFuzzyQuery? And FuzzyQuery throws an exception if you try to ask it to be slow? Or, we add a boolean (off by default) that you must turn on to allow the slow one?
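
One of the options floated in the description, having the query refuse any distance it cannot serve quickly, could look roughly like the sketch below. This is only an illustration: `BoundedFuzzySpec` and `MAX_FAST_EDITS` are hypothetical names invented here, not Lucene classes, and the limit of 2 is taken from the issue title.

```java
// Hypothetical sketch of "FuzzyQuery throws if you ask it to be slow".
// BoundedFuzzySpec is NOT a Lucene class; it only illustrates the guard.
final class BoundedFuzzySpec {
    // Largest edit distance the issue describes as fast (assumption from the title).
    static final int MAX_FAST_EDITS = 2;

    final String term;
    final int maxEdits;

    BoundedFuzzySpec(String term, int maxEdits) {
        // Reject the request up front instead of silently falling back to a slow scan.
        if (maxEdits < 0 || maxEdits > MAX_FAST_EDITS) {
            throw new IllegalArgumentException(
                "maxEdits must be between 0 and " + MAX_FAST_EDITS + ", got " + maxEdits);
        }
        this.term = term;
        this.maxEdits = maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(new BoundedFuzzySpec("lucene", 2).maxEdits);
        try {
            new BoundedFuzzySpec("lucene", 3);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing fast at construction time makes the cost visible to the caller, who can then opt in to a separate slow implementation if one exists.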

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

May 7, 2012, 10:54 AM

Post #2 of 12 (85 views)

[ https://issues.apache.org/jira/browse/LUCENE-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269819#comment-13269819 ]

Walter Underwood commented on LUCENE-4024:
------------------------------------------

I'm not sure that the floating point spec should be deprecated. I'd like to have the option of distance 1 for short terms and distance 2 for longer ones. Distance two is necessary to handle transpositions, but gets a very broad match from short terms. Doing that through the float spec might be clumsy, but it would work.
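
The policy described here (distance 1 for short terms, distance 2 for longer ones) can also be expressed directly as an integer choice rather than through the float spec. The following is a minimal sketch under stated assumptions: `LengthBasedEdits`, `maxEditsFor`, and the cutoff of 6 characters are all hypothetical and do not come from Lucene or from this thread.

```java
// Hypothetical helper: choose an edit distance from the term length.
final class LengthBasedEdits {
    // Assumed cutoff: terms shorter than this many code points get distance 1.
    static final int LONG_TERM_MIN_LENGTH = 6;

    static int maxEditsFor(String term) {
        // Count code points so that supplementary characters count once.
        int length = term.codePointCount(0, term.length());
        return length < LONG_TERM_MIN_LENGTH ? 1 : 2;
    }

    public static void main(String[] args) {
        for (String term : new String[] {"cat", "index", "lucene", "transposition"}) {
            System.out.println(term + " -> maxEdits " + maxEditsFor(term));
        }
    }
}
```

The returned value would then be passed as the maximum edit distance when building the fuzzy query for that term.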




jira at apache

May 7, 2012, 11:14 AM

Post #3 of 12 (87 views)

[ https://issues.apache.org/jira/browse/LUCENE-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269833#comment-13269833 ]

Robert Muir commented on LUCENE-4024:
-------------------------------------

{quote}
Distance two is necessary to handle transpositions
{quote}

That's not true. The Levenshtein distance has changed to include transposition as a primitive edit.
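
To make the difference concrete, here is a small sketch of an edit distance with an optional transposition step (the optimal string alignment variant). It is a textbook dynamic-programming version written for illustration, not Lucene's implementation; the class and method names are hypothetical, and it compares UTF-16 `char` values for simplicity.

```java
// Illustrative edit distance with an optional transposition primitive.
// Not Lucene code; compares UTF-16 chars rather than code points for brevity.
final class EditDistance {
    static int distance(String a, String b, boolean transpositions) {
        int n = a.length();
        int m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i; // delete all of a's prefix
        for (int j = 0; j <= m; j++) d[0][j] = j; // insert all of b's prefix
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), // deletion, insertion
                    d[i - 1][j - 1] + cost);                    // match or substitution
                // Swapping two adjacent characters counts as a single edit.
                if (transpositions && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("lucene", "lucnee", true));  // prints 1
        System.out.println(distance("lucene", "lucnee", false)); // prints 2
    }
}
```

With transpositions enabled, the swapped pair in "lucnee" costs one edit, so it matches at distance 1; with the classic Levenshtein rules the same pair costs two substitutions.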



jira at apache

May 7, 2012, 12:20 PM

Post #4 of 12 (85 views)

[ https://issues.apache.org/jira/browse/LUCENE-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269913#comment-13269913 ]

Jack Krupansky commented on LUCENE-4024:
----------------------------------------

bq. The levenshtein distance has changed to include transposition as a primitive edit

Is there any user-visible doc about that change? I don't see any mention in CHANGES.txt or the Javadoc for FuzzyQuery.

According to Wikipedia, at least, a distance that adds transposition as a primitive is called the "Damerau–Levenshtein distance".
http://en.wikipedia.org/wiki/Levenshtein_distance
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

At least the Javadoc for FuzzyQuery should have a link to whatever the technically correct specification is.

A few examples would be nice as well.





jira at apache

May 7, 2012, 12:26 PM

Post #5 of 12 (84 views)

[ https://issues.apache.org/jira/browse/LUCENE-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269922#comment-13269922 ]

Robert Muir commented on LUCENE-4024:
-------------------------------------

You have to look at the commit, not the patch (which was missing javadocs). See subversion commits tab.



jira at apache

May 7, 2012, 12:46 PM

Post #6 of 12 (85 views)

[ https://issues.apache.org/jira/browse/LUCENE-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269936#comment-13269936 ]

Walter Underwood commented on LUCENE-4024:
------------------------------------------

Looking at the commit, the Javadoc does not give the default for transpositions. Reading the code, it defaults to true, which is a behavior change. That's fine, but it should be documented.

Like Jack, I think it would be a good idea to specifically say Damerau-Levenshtein.



jira at apache

May 7, 2012, 12:52 PM

Post #7 of 12 (88 views)

[ https://issues.apache.org/jira/browse/LUCENE-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269941#comment-13269941 ]

Robert Muir commented on LUCENE-4024:
-------------------------------------

Did you look at, say, the changes to FuzzyQuery.java? Or package.html?
{noformat}
/** Implements the fuzzy search query. The similarity measurement
- * is based on the Levenshtein (edit distance) algorithm.
+ * is based on the Damerau-Levenshtein (optimal string alignment) algorithm.
*
{noformat}

{noformat}
+ * @param transpositions true if transpositions should be treated as a primitive
+ * edit operation. If this is false, comparisons will implement the classic
+ * Levenshtein algorithm.
{noformat}

{noformat}
-<p>Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm.
+<p>Lucene supports fuzzy searches based on Damerau-Levenshtein Distance.
{noformat}





jira at apache

May 7, 2012, 1:04 PM

Post #8 of 12 (88 views)

[ https://issues.apache.org/jira/browse/LUCENE-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269951#comment-13269951 ]

Jack Krupansky commented on LUCENE-4024:
----------------------------------------

I updated svn and see the Javadoc now. A note in CHANGES.txt would be nice too, since this is a user-visible change. Should there be a separate issue to update the docs for the query parser(s)?



jira at apache

May 7, 2012, 1:06 PM

Post #9 of 12 (86 views)

[ https://issues.apache.org/jira/browse/LUCENE-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269953#comment-13269953 ]

Walter Underwood commented on LUCENE-4024:
------------------------------------------

Yes, that is exactly where I looked, and I missed it, sorry. Up late with a barfing child and a barfing Solr server in prod.





jira at apache

May 7, 2012, 1:08 PM

Post #10 of 12 (85 views)

[ https://issues.apache.org/jira/browse/LUCENE-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269958#comment-13269958 ]

Steven Rowe commented on LUCENE-4024:
-------------------------------------

bq. I updated svn and see the Javadoc now.

Jack, do you know about the commit notifications mailing list? If not, see http://lucene.apache.org/core/discussion.html for details.



jira at apache

May 7, 2012, 1:42 PM

Post #11 of 12 (85 views)

[ https://issues.apache.org/jira/browse/LUCENE-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269981#comment-13269981 ]

Robert Muir commented on LUCENE-4024:
-------------------------------------

{quote}
Should there be a separate issue to update doc for the query parser(s) beyond Lucene (I see that Lucene Query Parser is updated)?
{quote}

All queryparsers were updated. But ClassicQueryParser (the Lucene one) is the only one that really documents its syntax, so that's where the doc update occurred.

The rest of the QPs mostly re-use ClassicQueryParser's syntax and don't document any different syntax.

Seriously (unrelated): if you have javadocs for that queryparser module in general (especially for classes with no javadocs at all!), just throw them up wherever: email list, this issue, some new issue, I don't care. I'll commit them.




jira at apache

May 7, 2012, 7:03 PM

Post #12 of 12 (85 views)

[ https://issues.apache.org/jira/browse/LUCENE-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270133#comment-13270133 ]

Chris Male commented on LUCENE-4024:
------------------------------------

I've opened LUCENE-4040 to try to improve the documentation for QPs.
