Mailing List Archive: Lucene: Java-User

SpellChecker performance and usage


smokeystu at gmail

Dec 2, 2007, 6:23 PM

Post #1 of 3 (672 views)
Permalink
SpellChecker performance and usage

My question is for anyone who has experience with Lucene's SpellChecker,
especially around its performance characteristics/ramifications.

1. Given that SpellChecker expands a query by adding all the
permutations of a potentially misspelled word, how does it perform in general?

2. How are others handling the case where SpellChecker would NOT perform
well if you expand the query by adding all the permutations? In other words,
what techniques are people using to avoid or reduce the performance hit,
if there is one?

Any sharing of information or pointers would be appreciated.


DORONC at il

Dec 3, 2007, 9:59 PM

Post #3 of 3 (614 views)
Permalink
Re: SpellChecker performance and usage [In reply to]

I didn't have performance issues when using the spell checker.
Can you describe what you tried and how long it took, so that
people can relate to it?

AFAIK the spell checker in o.a.l.search.spell does not "expand
a query by adding all the permutations of potentially misspelled
word". It is based on building an auxiliary index whose *documents*
are *words* of the main index, going through n-gram tokenization.
A checked word is tokenized that way too, and used as a query on
the auxiliary index.

There's more wisdom in the actual query tokenization,
but a simplified example can help to see how it works:
- a misspelled word 'helo' is tokenized as 'he el lo',
- the auxiliary index contains a document for the correct
word 'hello' that was tokenized as 'he el ll lo',
- the score of the document 'hello' would be high when searching
the auxiliary index for 'he el lo'.
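
To make the example above concrete, here is a rough sketch in plain Java.
It is only an illustration of the bigram idea, not the o.a.l.search.spell
code: the class name NGramSketch, the methods bigrams, sharedGrams and
bestMatch, and the "count the shared grams" score are all assumptions made
up for this sketch. The real spell checker queries a Lucene index and
scores candidates differently.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a plain-Java stand-in for the n-gram idea,
// not the code of Lucene's spell checker.
class NGramSketch {

    // Split a word into overlapping 2-character grams: "helo" -> [he, el, lo].
    static List<String> bigrams(String word) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            grams.add(word.substring(i, i + 2));
        }
        return grams;
    }

    // Hypothetical score: how many grams of the query also occur in the candidate.
    static int sharedGrams(String query, String candidate) {
        List<String> candidateGrams = bigrams(candidate);
        int shared = 0;
        for (String gram : bigrams(query)) {
            if (candidateGrams.contains(gram)) {
                shared++;
            }
        }
        return shared;
    }

    // Return the lexicon word sharing the most grams with the query
    // (the first such word wins on ties, null for an empty lexicon).
    static String bestMatch(String query, List<String> lexicon) {
        String best = null;
        int bestScore = -1;
        for (String word : lexicon) {
            int score = sharedGrams(query, word);
            if (score > bestScore) {
                best = word;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> lexicon = List.of("hello", "help", "world");
        System.out.println(bigrams("helo"));
        System.out.println(bestMatch("helo", lexicon));
    }
}
```

Note that the cost here depends on the size of the lexicon and on the
length of the checked word, not on generating permutations of the word.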

The only performance hit is when refreshing/rebuilding the
auxiliary index after the lexicon of the actual index
has changed a lot. But this can be done in the background, at a time
that suits the application using Lucene and the spell checker.
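
One possible way to do such a background rebuild, sketched in plain Java
under stated assumptions: the "auxiliary index" is reduced here to an
in-memory set of words, and the names BackgroundRebuildSketch, contains
and rebuildAsync are hypothetical. None of this is Lucene API; it only
shows the pattern of building a fresh copy off the request thread and
publishing it in one step, so lookups keep using the old copy until the
new one is ready.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Illustration only: an in-memory word set stands in for the auxiliary index.
class BackgroundRebuildSketch {

    // The currently published snapshot; readers never wait for a rebuild.
    private final AtomicReference<Set<String>> current =
            new AtomicReference<>(Collections.emptySet());

    // Lookup against whatever snapshot is published right now.
    boolean contains(String word) {
        return current.get().contains(word);
    }

    // Build a fresh snapshot on a background thread, then swap it in atomically.
    CompletableFuture<Void> rebuildAsync(Iterable<String> mainIndexTerms) {
        return CompletableFuture.runAsync(() -> {
            Set<String> fresh = new HashSet<>();
            for (String term : mainIndexTerms) {
                fresh.add(term);
            }
            current.set(Collections.unmodifiableSet(fresh));
        });
    }

    public static void main(String[] args) {
        BackgroundRebuildSketch sketch = new BackgroundRebuildSketch();
        System.out.println(sketch.contains("hello")); // nothing published yet
        sketch.rebuildAsync(List.of("hello", "world")).join();
        System.out.println(sketch.contains("hello")); // new snapshot is visible
    }
}
```

How often to trigger the rebuild (on a timer, or after some number of
changes to the main index) is left to the application.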

Doron

smokey <smokeystu [at] gmail> wrote on 03/12/2007 17:23:21:

> My question is for anyone who has experience with Lucene's SpellChecker,
> especially around its performance characteristics/ramifications.
>
> 1. Given the fact that SpellChecker expands a query by adding all the
> permutations of potentially misspelled word, how does it
> perform in general?
>
> 2. How are others handling the case where SpellChecker would NOT perform
> well if you expand the query adding all the permutations? In other words,
> what kind of techniques are people using to get around or alleviate the
> performance hit if any?
>
> Any sharing of information or pointers would be appreciated.
