
Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker

 

 



jira at apache

May 30, 2008, 8:31 PM

Post #1 of 10 (1078 views)
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker

[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601335#action_12601335 ]

Otis Gospodnetic commented on LUCENE-1297:
------------------------------------------

You read my mind, Thomas.
Would it be appropriate to add and try the Jaccard index and the Dice coefficient too, then?


> Allow other string distance measures in spellchecker
> ----------------------------------------------------
>
> Key: LUCENE-1297
> URL: https://issues.apache.org/jira/browse/LUCENE-1297
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/spellchecker
> Affects Versions: 2.4
> Environment: n/a
> Reporter: Thomas Morton
> Assignee: Otis Gospodnetic
> Priority: Minor
> Fix For: 2.4
>
> Attachments: string_distance.patch
>
>
> Updated spelling code to allow for other string distance measures to be used.
> Created StringDistance interface.
> Modified existing Levenshtein distance measure to implement interface (and renamed class).
> Verified that change to Levenshtein distance didn't impact runtime performance.
> Implemented Jaro/Winkler distance metric
> Modified SpellChecker to take the distance measure in its constructor or via a set method, and to use the interface when calling it.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

May 31, 2008, 8:45 AM

Post #2 of 10 (1037 views)
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601397#action_12601397 ]

Thomas Morton commented on LUCENE-1297:
---------------------------------------

I think the Dice coefficient would be nice to have. I'm not sure the Jaccard index makes sense in the context of spelling correction, since order isn't captured. I implemented JaroWinkler because I'm suggesting proper names and it does a good job with those.
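
A rough illustration of the Dice coefficient mentioned above, computed over character bigrams so that some ordering information is retained. This is only a sketch: the class name `BigramDice` and the choice of bigrams as the unit of comparison are assumptions for illustration, not something taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: Dice coefficient over character bigrams (multiset version). */
final class BigramDice {

    /** Counts each adjacent character pair in s. */
    static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            counts.merge(s.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns 2 * |A and B| / (|A| + |B|), a value in [0, 1]; 1 means identical. */
    static float similarity(String a, String b) {
        if (a.equals(b)) return 1.0f;
        Map<String, Integer> ca = bigrams(a);
        Map<String, Integer> cb = bigrams(b);
        // Total number of bigrams in both strings.
        int total = Math.max(a.length() - 1, 0) + Math.max(b.length() - 1, 0);
        if (total == 0) return 0.0f;
        int shared = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            shared += Math.min(e.getValue(), cb.getOrDefault(e.getKey(), 0));
        }
        return 2.0f * shared / total;
    }
}
```

For example, "night" and "nacht" share one bigram ("ht") out of eight in total, giving 0.25, while "ab" and "ba" share none and score 0 even though they contain the same characters.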

With the StringDistance interface defined, anyone can implement the distance measure however they want. What I think would be very useful is a weighted version of edit distance, with the weights tuned to your target language/domain. Also, with support in Solr for specifying this parameter in the SpellCheckRequestHandler, changing this becomes just a config change.
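
To make the idea of a pluggable, weighted edit distance concrete, here is a minimal sketch. The patch itself is not shown in this thread, so the shape of `StringDistance` below (a single `getDistance` method returning a score between 0 and 1) is an assumption, and `WeightedEditDistance` and `SubstitutionCost` are hypothetical names introduced only for illustration.

```java
/** Assumed shape of the interface; the patch is not shown in the thread. */
interface StringDistance {
    /** Returns a score in [0, 1], where 1 means the strings are identical. */
    float getDistance(String s1, String s2);
}

/** Hypothetical weighted edit distance: the caller supplies substitution costs. */
final class WeightedEditDistance implements StringDistance {

    /** Cost of replacing one character with another, expected to lie in [0, 1]. */
    interface SubstitutionCost {
        float cost(char from, char to);
    }

    private final SubstitutionCost substitution;

    WeightedEditDistance(SubstitutionCost substitution) {
        this.substitution = substitution;
    }

    @Override
    public float getDistance(String s1, String s2) {
        int n = s1.length(), m = s2.length();
        if (n == 0 && m == 0) return 1.0f;
        float[] prev = new float[m + 1];
        float[] curr = new float[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j;   // insertions cost 1 each
        for (int i = 1; i <= n; i++) {
            curr[0] = i;                            // deletions cost 1 each
            for (int j = 1; j <= m; j++) {
                char a = s1.charAt(i - 1), b = s2.charAt(j - 1);
                float sub = (a == b) ? 0f : substitution.cost(a, b);
                curr[j] = Math.min(prev[j - 1] + sub,
                          Math.min(prev[j] + 1f, curr[j - 1] + 1f));
            }
            float[] tmp = prev; prev = curr; curr = tmp;
        }
        // Normalize by the longer length so the score lands in [0, 1].
        return 1.0f - prev[m] / Math.max(n, m);
    }
}
```

With a constant cost of 1, as in `new WeightedEditDistance((a, b) -> 1f)`, this reduces to plain Levenshtein distance. A domain-tuned variant might instead charge less for substitutions that are common typos in the target language.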






jira at apache

Jun 3, 2008, 8:12 PM

Post #3 of 10 (1001 views)
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602151#action_12602151 ]

Otis Gospodnetic commented on LUCENE-1297:
------------------------------------------

Thomas - any chance you can write a simple unit test that exercises JaroWinkler?




jira at apache

Jun 10, 2008, 6:37 PM

Post #4 of 10 (928 views)
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604102#action_12604102 ]

Grant Ingersoll commented on LUCENE-1297:
-----------------------------------------

Hi Thomas,

This patch doesn't apply for me from the contrib/spellchecker directory.





jira at apache

Jun 10, 2008, 10:38 PM

Post #5 of 10 (918 views)
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604121#action_12604121 ]

Otis Gospodnetic commented on LUCENE-1297:
------------------------------------------

Tom, note the bit about naming patches and reusing patch names on the HowToContribute wiki page.

I see JaroWinklerDistance.java doesn't have the ASL header on top.

Oh, there is something funky about this patch. You created a new class (LevenshteinDistance), but your patch shows it as an edit of TRStringDistance. It should show it as a brand new file. Could you please provide a clean patch? This is why the patch fails to apply.

Thanks.


> Key: LUCENE-1297
> Attachments: string_distance.patch2


jira at apache

Jun 11, 2008, 7:26 AM

Post #6 of 10 (924 views)
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604231#action_12604231 ]

Grant Ingersoll commented on LUCENE-1297:
-----------------------------------------

{quote}
I didn't see anything about re-using patch names on the wiki. Please advise.
{quote}

I think Otis is just referring to naming the patch something like LUCENE-1297.patch and always keeping that name. JIRA then takes care of the versioning, and it is always clear which patch is the latest. As for the wiki, I think this is on the Solr wiki, but it should be added to the Lucene one, too.

> Key: LUCENE-1297
> Attachments: string_distance3.patch


jira at apache

Jun 12, 2008, 6:26 AM

Post #7 of 10 (910 views)
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604515#action_12604515 ]

Grant Ingersoll commented on LUCENE-1297:
-----------------------------------------

Patch applies cleanly and the tests pass.

Ideally, there would be standalone tests for each of the distance measures that test them outside the context of spell checking.

I think the Jaro-Winkler threshold should be configurable via a setter/constructor. A getter would make sense too, so that one can see what the threshold is.
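
A sketch of what a configurable threshold could look like. It assumes the threshold in question is the Jaro score at or above which the Winkler common-prefix boost is applied; the class name `ConfigurableJaroWinkler`, the 0.7 default, and the 0.1 prefix scaling factor are illustrative assumptions, not values confirmed by the thread.

```java
/** Sketch of a Jaro-Winkler measure whose boost threshold is configurable. */
final class ConfigurableJaroWinkler {

    private float threshold;

    ConfigurableJaroWinkler() { this(0.7f); } // assumed default

    ConfigurableJaroWinkler(float threshold) { this.threshold = threshold; }

    float getThreshold() { return threshold; }

    void setThreshold(float threshold) { this.threshold = threshold; }

    /** Returns a similarity in [0, 1]; 1 means identical. */
    float getDistance(String s1, String s2) {
        int n = s1.length(), m = s2.length();
        if (n == 0 || m == 0) return (n == m) ? 1.0f : 0.0f;
        int window = Math.max(Math.max(n, m) / 2 - 1, 0);
        boolean[] used1 = new boolean[n];
        boolean[] used2 = new boolean[m];
        int matches = 0;
        for (int i = 0; i < n; i++) {
            int lo = Math.max(0, i - window), hi = Math.min(m - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!used2[j] && s1.charAt(i) == s2.charAt(j)) {
                    used1[i] = used2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0f;
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        for (int i = 0, j = 0; i < n; i++) {
            if (!used1[i]) continue;
            while (!used2[j]) j++;
            if (s1.charAt(i) != s2.charAt(j)) outOfOrder++;
            j++;
        }
        float t = outOfOrder / 2.0f; // number of transpositions
        float jaro = (matches / (float) n + matches / (float) m
                + (matches - t) / matches) / 3.0f;
        if (jaro < threshold) return jaro; // boost only sufficiently similar pairs
        int prefix = 0;
        int maxPrefix = Math.min(4, Math.min(n, m));
        while (prefix < maxPrefix && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return jaro + prefix * 0.1f * (1.0f - jaro);
    }
}
```

Raising the threshold with `setThreshold` makes the prefix boost apply to fewer pairs, so more results fall back to the plain Jaro score.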

Also, TRStringDistance explicitly states that it is not thread-safe, and I believe it is now being used in a non-thread-safe manner. FWIW, I see no reason why it can't be made thread-safe: all of those member variables are allocated in the getDistance method, so there is no reason not to just make them local variables, I think.
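
The change suggested here can be sketched as follows. This is not the TRStringDistance source, which is not shown in the thread; `LocalStateLevenshtein` is a hypothetical stand-in that illustrates keeping all scratch arrays local to the method, so that no mutable state is shared between concurrent callers.

```java
/** Sketch: Levenshtein distance with all scratch state held in local variables. */
final class LocalStateLevenshtein {

    /** Returns the number of single-character edits needed to turn s into t. */
    static int editDistance(String s, String t) {
        int n = s.length(), m = t.length();
        if (n == 0) return m;
        if (m == 0) return n;
        // Allocated per call, so concurrent callers never share these rows.
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j;
        for (int i = 1; i <= n; i++) {
            curr[0] = i;
            for (int j = 1; j <= m; j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[m];
    }
}
```

Because the method touches no fields, a single instance (or the static method) can be called from many threads without synchronization. The trade-off is two small array allocations per call.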



jira at apache

Jun 12, 2008, 8:28 AM

Post #8 of 10 (897 views)
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604538#action_12604538 ]

Otis Gospodnetic commented on LUCENE-1297:
------------------------------------------

Tom, I agree with Grant and I'll assume you'll update the patch.

As for that TRStringDistance -> LevensteinDistance, I'll just commit it as is once the patch is fully ready, and then I'll rename classes in a separate commit.




jira at apache

Jun 15, 2008, 5:46 AM

Post #9 of 10 (848 views)
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605134#action_12605134 ]

Thomas Morton commented on LUCENE-1297:
---------------------------------------

Hi,
This code used to be in the SpellChecker itself. It just converts the int from the Levenshtein computation into a value between 0 and 1, with 1 being identical and 0 being maximally different. This return value is part of the StringDistance interface, and different methods compute it differently, so it has to be computed on a per-distance-measure basis.
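
The conversion described above might look roughly like this. The thread does not say which length the edit count is divided by, so dividing by the longer of the two string lengths is an assumption here, and `EditDistanceScore` is a hypothetical name.

```java
/** Sketch: turning a raw edit count into a similarity score. */
final class EditDistanceScore {

    /** Maps an edit distance onto [0, 1]; 1 means identical, 0 maximally different. */
    static float normalize(int editDistance, int length1, int length2) {
        int longest = Math.max(length1, length2); // assumed divisor
        if (longest == 0) return 1.0f;            // two empty strings are identical
        return 1.0f - (float) editDistance / longest;
    }
}
```

For "kitten" and "sitting" (edit distance 3, lengths 6 and 7) this gives 1 - 3/7, roughly 0.57.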

Thanks...Tom

> Key: LUCENE-1297
> Attachments: LUCENE-1297.patch, LUCENE-1297.patch


jira at apache

Jun 16, 2008, 6:57 AM

Post #10 of 10 (831 views)
[jira] Commented: (LUCENE-1297) Allow other string distance measures in spellchecker [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605290#action_12605290 ]

Grant Ingersoll commented on LUCENE-1297:
-----------------------------------------

+1 on committing this. I downloaded the latest and applied, ran the tests, etc. and it looks good.
