Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-User

Building FST-like automaton queries

 

 

Lucene java-user RSS feed   Index | Next | Previous | View Threaded


alan.woodward at romseysoftware

Feb 28, 2012, 4:33 AM

Post #1 of 11 (451 views)
Permalink
Building FST-like automaton queries

Hello,

I'm trying to create a Lucene Query that will take a term and expand it to include common OCR errors (for example, 'cl' is often misread as 'd', so a search for 'clog' should also hit 'dog'). My plan is to do this by generating all the possible variants of a term, using an existing list of errors, and then somehow mapping this into an AutomatonQuery. I've been looking around the o.a.l.util.automaton and o.a.l.util.fst packages on trunk, and I *think* that this is possible, but I'm so far failing to work out how to put the various bits together.

I'm thinking it should work like this:
1) expand query term to sorted list of possible matches
2) create an FST over those matches
3) plug this FST into an AutomatonQuery subclass.

1) is easy. It's 2) and 3) I'm having trouble with.

All help gratefully received!

Thanks,

Alan Woodward
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


lucene at mikemccandless

Feb 28, 2012, 5:31 AM

Post #2 of 11 (436 views)
Permalink
Re: Building FST-like automaton queries [In reply to]

Neat :) It's like a FuzzyQuery w/ a custom (binary?) cost matrix for
the insert/delete/transposition changes...

Is the number of edits smallish? Ie you're not concerned about
combinatoric explosion of step 1?

For steps 2 and 3 you shouldn't use FST at all. Instead, for 2) use
BasicAutomata.makeString(String) on each of your expanded terms, then
BasicOperations.union on all of those automata to make a single
automaton accepting all your expanded terms, then likely call
.determinize() on the resulting automaton (maybe also .minimize() but
I think that may not help). Then pass that automaton to AQ.

We don't yet have a way to drive a query from an FST, but that would
be an interesting addition. EG you could then support weights as
well, to decide how the terms are scored (if certain OCR errors are
more likely than others).

Mike McCandless

http://blog.mikemccandless.com

On Tue, Feb 28, 2012 at 7:33 AM, Alan Woodward
<alan.woodward [at] romseysoftware> wrote:
> Hello,
>
> I'm trying to create a Lucene Query that will take a term and expand it to include common OCR errors (for example, 'cl' is often misread as 'd', so a search for 'clog' should also hit 'dog'). My plan is to do this by generating all the possible variants of a term, using an existing list of errors, and then somehow mapping this into an AutomatonQuery. I've been looking around the o.a.l.util.automaton and o.a.l.util.fst packages on trunk, and I *think* that this is possible, but I'm so far failing to work out how to put the various bits together.
>
> I'm thinking it should work like this:
> 1) expand query term to sorted list of possible matches
> 2) create an FST over those matches
> 3) plug this FST into an AutomatonQuery subclass.
>
> 1) is easy. It's 2) and 3) I'm having trouble with.
>
> All help gratefully received!
>
> Thanks,
>
> Alan Woodward
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
> For additional commands, e-mail: java-user-help [at] lucene
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


alan.woodward at romseysoftware

Feb 28, 2012, 5:42 AM

Post #3 of 11 (435 views)
Permalink
Re: Building FST-like automaton queries [In reply to]

On 28 Feb 2012, at 13:31, Michael McCandless wrote:

> Neat :) It's like a FuzzyQuery w/ a custom (binary?) cost matrix for
> the insert/delete/transposition changes...
>
> Is the number of edits smallish? Ie you're not concerned about
> combinatoric explosion of step 1?

We're only allowing expansions within an edit distance of 1, which should keep the numbers of terms down.

>
> For steps 2 and 3 you shouldn't use FST at all. Instead, for 2) use
> BasicAutomata.makeString(String) on each of your expanded terms, then
> BasicOperations.union on all of those automata to make a single
> automaton accepting all your expanded terms, then likely call
> .determinize() on the resulting automaton (maybe also .minimize() but
> I think that may not help). Then pass that automaton to AQ.

Excellent, thanks for your help. I'll give that a go.

>
> We don't yet have a way to drive a query from an FST, but that would
> be an interesting addition. EG you could then support weights as
> well, to decide how the terms are scored (if certain OCR errors are
> more likely than others).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, Feb 28, 2012 at 7:33 AM, Alan Woodward
> <alan.woodward [at] romseysoftware> wrote:
>> Hello,
>>
>> I'm trying to create a Lucene Query that will take a term and expand it to include common OCR errors (for example, 'cl' is often misread as 'd', so a search for 'clog' should also hit 'dog'). My plan is to do this by generating all the possible variants of a term, using an existing list of errors, and then somehow mapping this into an AutomatonQuery. I've been looking around the o.a.l.util.automaton and o.a.l.util.fst packages on trunk, and I *think* that this is possible, but I'm so far failing to work out how to put the various bits together.
>>
>> I'm thinking it should work like this:
>> 1) expand query term to sorted list of possible matches
>> 2) create an FST over those matches
>> 3) plug this FST into an AutomatonQuery subclass.
>>
>> 1) is easy. It's 2) and 3) I'm having trouble with.
>>
>> All help gratefully received!
>>
>> Thanks,
>>
>> Alan Woodward
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
>> For additional commands, e-mail: java-user-help [at] lucene
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
> For additional commands, e-mail: java-user-help [at] lucene
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


lucene at mikemccandless

Feb 28, 2012, 6:35 AM

Post #4 of 11 (434 views)
Permalink
Re: Building FST-like automaton queries [In reply to]

On Tue, Feb 28, 2012 at 8:42 AM, Alan Woodward
<alan.woodward [at] romseysoftware> wrote:
>
> On 28 Feb 2012, at 13:31, Michael McCandless wrote:
>
>> Neat :) It's like a FuzzyQuery w/ a custom (binary?) cost matrix for
>> the insert/delete/transposition changes...
>>
>> Is the number of edits smallish? Ie you're not concerned about
>> combinatoric explosion of step 1?
>
> We're only allowing expansions within an edit distance of 1, which should keep the numbers of terms down.

Ahh, ok. So even if the term has two occurrences of cl, only one of
them is allowed to substitute d?

>> For steps 2 and 3 you shouldn't use FST at all. Instead, for 2) use
>> BasicAutomata.makeString(String) on each of your expanded terms, then
>> BasicOperations.union on all of those automata to make a single
>> automaton accepting all your expanded terms, then likely call
>> .determinize() on the resulting automaton (maybe also .minimize() but
>> I think that may not help). Then pass that automaton to AQ.
>
> Excellent, thanks for your help. I'll give that a go.

You might also try the DaciukMihovAutomatonBuilder class (it's in
lucene/test-framework): it builds a minimal deterministic automaton
from sorted terms, very quickly... you'd just have to pre-sort your
terms.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


alan.woodward at romseysoftware

Feb 28, 2012, 7:06 AM

Post #5 of 11 (439 views)
Permalink
Re: Building FST-like automaton queries [In reply to]

>>
>> We're only allowing expansions within an edit distance of 1, which should keep the numbers of terms down.
>
> Ahh, ok. So even if the term has two occurrences of cl, only one of
> them is allowed to substitute d?

Yes, exactly - "cloocl" will be expanded to "doocl" and "clood" only. It's a pretty inaccurate way of searching anyway, and more expansions start leading to too many false positives without improving recall pretty quickly.

>
>>> For steps 2 and 3 you shouldn't use FST at all. Instead, for 2) use
>>> BasicAutomata.makeString(String) on each of your expanded terms, then
>>> BasicOperations.union on all of those automata to make a single
>>> automaton accepting all your expanded terms, then likely call
>>> .determinize() on the resulting automaton (maybe also .minimize() but
>>> I think that may not help). Then pass that automaton to AQ.
>>
>> Excellent, thanks for your help. I'll give that a go.
>
> You might also try the DaciukMihovAutomatonBuilder class (it's in
> lucene/test-framework): it builds a minimal deterministic automaton
> from sorted terms, very quickly... you'd just have to pre-sort your
> terms.

Thanks, will have a look there too.

>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
> For additional commands, e-mail: java-user-help [at] lucene
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


dawid.weiss at gmail

Feb 28, 2012, 7:40 AM

Post #6 of 11 (435 views)
Permalink
Re: Building FST-like automaton queries [In reply to]

> For steps 2 and 3 you shouldn't use FST at all.  Instead, for 2) use
> BasicAutomata.makeString(String) on each of your expanded terms, then
> BasicOperations.union on all of those automata to make a single

How many input strings do you have? The API Mike mentioned in from a
port of the Brics library -- making separate automatons and then an
union will result in an attempt to minimize the result and this (when
the set of input strings is large) is a no-no in terms of memory (my
own experience).

I've added a method that creates an optimized automaton from a union
of Strings in one step, but I see this hasn't been ported to Lucene
yet.

http://www.brics.dk/automaton/doc/dk/brics/automaton/BasicAutomata.html#makeStringUnion(java.lang.CharSequence...)

If you could provide a patch that would port that code to Lucene it'd
be great (I guess it's trivial) and would speed up your step (1)
greatly.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


dawid.weiss at gmail

Feb 28, 2012, 7:48 AM

Post #7 of 11 (435 views)
Permalink
Re: Building FST-like automaton queries [In reply to]

I filed an issue for that.
https://issues.apache.org/jira/browse/LUCENE-3832

I'll try to port it myself actually. It shouldn't be a big problem.

Dawid

On Tue, Feb 28, 2012 at 2:31 PM, Michael McCandless
<lucene [at] mikemccandless> wrote:
> Neat :)  It's like a FuzzyQuery w/ a custom (binary?) cost matrix for
> the insert/delete/transposition changes...
>
> Is the number of edits smallish?  Ie you're not concerned about
> combinatoric explosion of step 1?
>
> For steps 2 and 3 you shouldn't use FST at all.  Instead, for 2) use
> BasicAutomata.makeString(String) on each of your expanded terms, then
> BasicOperations.union on all of those automata to make a single
> automaton accepting all your expanded terms, then likely call
> .determinize() on the resulting automaton (maybe also .minimize() but
> I think that may not help).  Then pass that automaton to AQ.
>
> We don't yet have a way to drive a query from an FST, but that would
> be an interesting addition.  EG you could then support weights as
> well, to decide how the terms are scored (if certain OCR errors are
> more likely than others).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, Feb 28, 2012 at 7:33 AM, Alan Woodward
> <alan.woodward [at] romseysoftware> wrote:
>> Hello,
>>
>> I'm trying to create a Lucene Query that will take a term and expand it to include common OCR errors (for example, 'cl' is often misread as 'd', so a search for 'clog' should also hit 'dog').  My plan is to do this by generating all the possible variants of a term, using an existing list of errors, and then somehow mapping this into an AutomatonQuery.  I've been looking around the o.a.l.util.automaton and o.a.l.util.fst packages on trunk, and I *think* that this is possible, but I'm so far failing to work out how to put the various bits together.
>>
>> I'm thinking it should work like this:
>> 1) expand query term to sorted list of possible matches
>> 2) create an FST over those matches
>> 3) plug this FST into an AutomatonQuery subclass.
>>
>> 1) is easy.  It's 2) and 3) I'm having trouble with.
>>
>> All help gratefully received!
>>
>> Thanks,
>>
>> Alan Woodward
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
>> For additional commands, e-mail: java-user-help [at] lucene
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
> For additional commands, e-mail: java-user-help [at] lucene
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


dawid.weiss at gmail

Feb 28, 2012, 8:01 AM

Post #8 of 11 (429 views)
Permalink
Re: Building FST-like automaton queries [In reply to]

The issue has a patch -- feel free to try it out.

Dawid

On Tue, Feb 28, 2012 at 4:48 PM, Dawid Weiss <dawid.weiss [at] gmail> wrote:
> I filed an issue for that.
> https://issues.apache.org/jira/browse/LUCENE-3832
>
> I'll try to port it myself actually. It shouldn't be a big problem.
>
> Dawid
>
> On Tue, Feb 28, 2012 at 2:31 PM, Michael McCandless
> <lucene [at] mikemccandless> wrote:
>> Neat :)  It's like a FuzzyQuery w/ a custom (binary?) cost matrix for
>> the insert/delete/transposition changes...
>>
>> Is the number of edits smallish?  Ie you're not concerned about
>> combinatoric explosion of step 1?
>>
>> For steps 2 and 3 you shouldn't use FST at all.  Instead, for 2) use
>> BasicAutomata.makeString(String) on each of your expanded terms, then
>> BasicOperations.union on all of those automata to make a single
>> automaton accepting all your expanded terms, then likely call
>> .determinize() on the resulting automaton (maybe also .minimize() but
>> I think that may not help).  Then pass that automaton to AQ.
>>
>> We don't yet have a way to drive a query from an FST, but that would
>> be an interesting addition.  EG you could then support weights as
>> well, to decide how the terms are scored (if certain OCR errors are
>> more likely than others).
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Tue, Feb 28, 2012 at 7:33 AM, Alan Woodward
>> <alan.woodward [at] romseysoftware> wrote:
>>> Hello,
>>>
>>> I'm trying to create a Lucene Query that will take a term and expand it to include common OCR errors (for example, 'cl' is often misread as 'd', so a search for 'clog' should also hit 'dog').  My plan is to do this by generating all the possible variants of a term, using an existing list of errors, and then somehow mapping this into an AutomatonQuery.  I've been looking around the o.a.l.util.automaton and o.a.l.util.fst packages on trunk, and I *think* that this is possible, but I'm so far failing to work out how to put the various bits together.
>>>
>>> I'm thinking it should work like this:
>>> 1) expand query term to sorted list of possible matches
>>> 2) create an FST over those matches
>>> 3) plug this FST into an AutomatonQuery subclass.
>>>
>>> 1) is easy.  It's 2) and 3) I'm having trouble with.
>>>
>>> All help gratefully received!
>>>
>>> Thanks,
>>>
>>> Alan Woodward
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
>>> For additional commands, e-mail: java-user-help [at] lucene
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
>> For additional commands, e-mail: java-user-help [at] lucene
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


alan.woodward at romseysoftware

Feb 28, 2012, 8:30 AM

Post #9 of 11 (429 views)
Permalink
Re: Building FST-like automaton queries [In reply to]

Wow, that was quick! Thanks!

I don't think we'll have too many terms per query term - as I said earlier, we're restricting the expansions to those with an edit distance of 1. But this looks cool anyway.

On 28 Feb 2012, at 16:01, Dawid Weiss wrote:

> The issue has a patch -- feel free to try it out.
>
> Dawid
>
> On Tue, Feb 28, 2012 at 4:48 PM, Dawid Weiss <dawid.weiss [at] gmail> wrote:
>> I filed an issue for that.
>> https://issues.apache.org/jira/browse/LUCENE-3832
>>
>> I'll try to port it myself actually. It shouldn't be a big problem.
>>
>> Dawid
>>
>> On Tue, Feb 28, 2012 at 2:31 PM, Michael McCandless
>> <lucene [at] mikemccandless> wrote:
>>> Neat :) It's like a FuzzyQuery w/ a custom (binary?) cost matrix for
>>> the insert/delete/transposition changes...
>>>
>>> Is the number of edits smallish? Ie you're not concerned about
>>> combinatoric explosion of step 1?
>>>
>>> For steps 2 and 3 you shouldn't use FST at all. Instead, for 2) use
>>> BasicAutomata.makeString(String) on each of your expanded terms, then
>>> BasicOperations.union on all of those automata to make a single
>>> automaton accepting all your expanded terms, then likely call
>>> .determinize() on the resulting automaton (maybe also .minimize() but
>>> I think that may not help). Then pass that automaton to AQ.
>>>
>>> We don't yet have a way to drive a query from an FST, but that would
>>> be an interesting addition. EG you could then support weights as
>>> well, to decide how the terms are scored (if certain OCR errors are
>>> more likely than others).
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Tue, Feb 28, 2012 at 7:33 AM, Alan Woodward
>>> <alan.woodward [at] romseysoftware> wrote:
>>>> Hello,
>>>>
>>>> I'm trying to create a Lucene Query that will take a term and expand it to include common OCR errors (for example, 'cl' is often misread as 'd', so a search for 'clog' should also hit 'dog'). My plan is to do this by generating all the possible variants of a term, using an existing list of errors, and then somehow mapping this into an AutomatonQuery. I've been looking around the o.a.l.util.automaton and o.a.l.util.fst packages on trunk, and I *think* that this is possible, but I'm so far failing to work out how to put the various bits together.
>>>>
>>>> I'm thinking it should work like this:
>>>> 1) expand query term to sorted list of possible matches
>>>> 2) create an FST over those matches
>>>> 3) plug this FST into an AutomatonQuery subclass.
>>>>
>>>> 1) is easy. It's 2) and 3) I'm having trouble with.
>>>>
>>>> All help gratefully received!
>>>>
>>>> Thanks,
>>>>
>>>> Alan Woodward
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
>>>> For additional commands, e-mail: java-user-help [at] lucene
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
>>> For additional commands, e-mail: java-user-help [at] lucene
>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
> For additional commands, e-mail: java-user-help [at] lucene
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


dawid.weiss at gmail

Feb 28, 2012, 8:34 AM

Post #10 of 11 (440 views)
Permalink
Re: Building FST-like automaton queries [In reply to]

> Wow, that was quick!  Thanks!

The power of open source and coffee break, combined...

> I don't think we'll have too many terms per query term - as I said earlier, we're restricting the expansions to those with an edit distance of 1.  But this looks cool anyway.

Shouldn't make much of a difference if the number of strings is small,
but it was a missing feature that is going to be useful sooner or
later.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


paul at metajure

Mar 2, 2012, 1:25 PM

Post #11 of 11 (418 views)
Permalink
RE: Building FST-like automaton queries [In reply to]

> > Wow, that was quick!  Thanks!
>
> The power of open source and coffee break, combined...

12 minutes! Wow, that is fast turnaround or a lot of coffee.

-Paul

Lucene java-user RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.