Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: kinosearch: discuss

Typos in queries

 

 

kinosearch discuss RSS feed   Index | Next | Previous | View Threaded


eriam at mediavirtuel

May 21, 2008, 11:12 PM

Post #1 of 4 (1092 views)
Permalink
Typos in queries

Hello,

I was wondering if there is already a method to return results when the
query contains typos.

For exemple the user searches for "nokopol" and we have a document that
has "nikopol" in the title and in the body, I'd like to be able to return
it.

I'd like to match for typos only on titles and eventually a few other
fields (for performance reason I don't think matching typos in body text
is appropriate).

Google has such a feature and I think they are doing it with their
statistical data (user usually correct typos and the recognize this).

However as I don't have statistical data and I'd like to provide this ootb
I was thinking I might go with the Levenshtein distance
(http://en.wikipedia.org/wiki/Levenshtein_distance) and fortunatly there
is a perl module that does this (thanks cpan).

So my question is, is there already something I missed included in
KinoSearch (stems are quite different as far as I can tell) that does that
?

And if not, does including that Levenshtein distance calculation to return
"corrected" result set make sense at first sight ?

Thanks

Eriam








_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


nate at verse

May 21, 2008, 11:44 PM

Post #2 of 4 (1032 views)
Permalink
Re: Typos in queries [In reply to]

On Thu, May 22, 2008 at 12:12 AM, Eriam Schaffter
<eriam [at] mediavirtuel> wrote:
> I was wondering if there is already a method to return results when the
> query contains typos.

So far as I know, there is not yet such a method.

Because inverted indexes are keyed by the word you are searching for,
using the Levenshtein distance directly is difficult. You would have
to calculate the distance from each word in the query to each word in
your index. This would only work if your index has a limited
vocabulary, you have a lot of extra processing power, or you come up
with an elegant means of caching your results.

I would recommend an easier solution: use an existing spell checker
like Text::Aspell to 'suggest' spelling corrections, and then
construct a query with these suggestions as part of a weighted Boolean
query. If I recall correctly, you have to hack libaspell if you want
it to return the distance, but otherwise this should work well.

Good luck!

Nathan Kurz
nate [at] verse

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


nate at verse

May 21, 2008, 11:54 PM

Post #3 of 4 (1024 views)
Permalink
Re: Typos in queries [In reply to]

On Thu, May 22, 2008 at 12:44 AM, Nathan Kurz <nate [at] verse> wrote:
> If I recall correctly, you have to hack libaspell if you want
> it to return the distance, but otherwise this should work well.

I found my hacked up version of libaspell. The file you have to
change is: aspell-0.60.5/modules/speller/default/suggest.cpp

Here are the changes I made, although I'd suggest that you make your
own patch as I recall doing this just as an undebugged proof of
concept:

--- suggest.cpp.orig 2007-06-05 18:12:58.000000000 -0600
+++ suggest.cpp 2007-06-05 22:50:49.000000000 -0600
@@ -1219,6 +1227,9 @@
COUT << i->word << '\t' << i->score
<< '\t' << i->soundslike << "\n";
# endif
+ char score_text[16];
+ sprintf(score_text, ":%d", i->score);
+
if (i->repl_list != 0) {
String::size_type pos;
do {
@@ -1228,13 +1239,13 @@
? (bool)sp->check(*dup_pair.first)
: (sp->check((String)dup_pair.first->substr(0,pos))
&& sp->check((String)dup_pair.first->substr(pos+1))) ))
- near_misses_final->push_back(*dup_pair.first);
+ near_misses_final->push_back((String)dup_pair.first->append(score_te
xt));
} while (i->repl_list->adv());
} else {
fix_case(i->word);
dup_pair = duplicates_check.insert(i->word);
if (dup_pair.second )
- near_misses_final->push_back(*dup_pair.first);
+ near_misses_final->push_back((String)dup_pair.first->append(score_text
));
}
}
}

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


peter at peknet

May 22, 2008, 6:16 AM

Post #4 of 4 (1030 views)
Permalink
Re: Typos in queries [In reply to]

On 05/22/2008 01:44 AM, Nathan Kurz wrote:

> I would recommend an easier solution: use an existing spell checker
> like Text::Aspell to 'suggest' spelling corrections, and then
> construct a query with these suggestions as part of a weighted Boolean
> query. If I recall correctly, you have to hack libaspell if you want
> it to return the distance, but otherwise this should work well.

Look too at Search::Tools::Spellcheck which wraps Text::Aspell especially for search apps.
IIRC, libaspell allows you too also build custom dictionaries if your corpus requires it.

--
Peter Karman . peter [at] peknet . http://peknet.com/


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch

kinosearch discuss RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.