
Mailing List Archive: Lucene: Java-User

Edit distance and wildcard searching with PhraseQuery

 

 



jplater at healthmarketscience

Nov 11, 2009, 2:48 PM

Post #1 of 5 (818 views)

Hi,

I am trying to figure out a way to query a Lucene index for a
phrase with some fuzziness (edit distance and/or wildcard) applied
to the individual terms. An example should help explain what I am
trying to do:

Index contains:

Philadelphia PA

Search is done on:

Philadelphid PA

I want it to result in a hit - basically something like
"Philadelphid~0.75 PA" (that syntax is not valid, but it explains what I am
looking for). Similarly, I would like to be able to do something like
"Phil* PA" and get a hit as well.

Does anyone know how I can accomplish this? Right now I have to
hit a lookup table to translate the city before searching against the
main index, and I am not a fan of this option.

Thanks.

-Jeff Plater


iorixxx at yahoo

Nov 11, 2009, 2:55 PM

Post #2 of 5 (800 views)

What you are looking for is ComplexPhraseQueryParser [1], which is implemented in Lucene 2.9.0. It uses the SpanQuery family.
It supports "Phil* PA"~10 as well as "Philadelphid~0.75 PA": ranges, OR, fuzzy, and wildcard terms are all allowed inside proximity queries (phrases).


[1] http://lucene.apache.org/java/2_9_0/api/contrib-misc/org/apache/lucene/queryParser/complexPhrase/package-summary.html

[2] https://issues.apache.org/jira/browse/LUCENE-1486
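
To make the "~0.75" threshold concrete, here is a minimal sketch of an edit-distance similarity for the example terms. The normalization used (1 minus the distance divided by the length of the shorter term) is an assumption about how fuzzy matching scores terms in Lucene 2.9 when no prefix is required; the class and method names are hypothetical, and this is plain Java rather than Lucene code.

```java
public class FuzzySimilaritySketch {

    // Classic dynamic-programming Levenshtein distance
    // (insertions, deletions, and substitutions each cost 1).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1 - distance / length of the shorter term.
    static double similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) editDistance(a, b) / shorter;
    }

    public static void main(String[] args) {
        String indexed = "Philadelphia";
        String queried = "Philadelphid";
        System.out.println("distance   = " + editDistance(queried, indexed));
        System.out.println("similarity = " + similarity(queried, indexed));
        System.out.println("passes 0.75: " + (similarity(queried, indexed) >= 0.75));
    }
}
```

Under this assumption, a one-character difference between two 12-character terms gives a similarity of roughly 0.92, comfortably above the 0.75 threshold in the example query.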


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


jplater at healthmarketscience

Nov 11, 2009, 3:41 PM

Post #3 of 5 (798 views)

Thanks - I tried it out and it seems to work for "Philadelphid~0.75 PA", but I can't get it working for "Phil* PA" yet. Perhaps it is an issue with my Analyzer (I am using WhitespaceAnalyzer)? Have you used it with wildcards before?

-Jeff

-----Original Message-----
From: AHMET ARSLAN [mailto:iorixxx [at] yahoo]
Sent: Wednesday, November 11, 2009 5:55 PM
To: java-user [at] lucene
Subject: Re: Edit distance and wildcard searching with PhraseQuery



erickerickson at gmail

Nov 11, 2009, 3:51 PM

Post #4 of 5 (795 views)

I'd at least use something that lowercases the input rather than just
WhitespaceAnalyzer. Remember to use it at both index time and query time. Between
your queries and typing things in e-mails, case is often a gotcha.

At least carefully check that your casing is identical.

Best
Erick



jplater at healthmarketscience

Nov 11, 2009, 4:48 PM

Post #5 of 5 (795 views)

Thanks for the suggestion - I double checked the case and it was OK.
Turned out I needed to use the StandardAnalyzer instead of the
WhitespaceAnalyzer.

-Jeff
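
One possible explanation for why the analyzer mattered, offered as a hedged sketch rather than a confirmed diagnosis: as far as I know, the classic QueryParser lowercases wildcard and prefix terms by default (its lowercaseExpandedTerms setting) without passing them through the analyzer. If that applies here, "Phil*" would be searched as "phil*", which cannot match a case-preserved "Philadelphia" token written by WhitespaceAnalyzer but does match the lowercased token written by StandardAnalyzer. The plain-Java sketch below only simulates that behaviour; the class and method names are hypothetical and no Lucene classes are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WildcardCaseSketch {

    // Whitespace-style analysis: split on whitespace, keep case as-is.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Lowercasing analysis: same split, then lowercase every token.
    static List<String> lowercasedTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : whitespaceTokens(text)) {
            tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // True if any indexed token starts with the prefix (a trailing-* wildcard).
    static boolean prefixMatches(List<String> indexedTokens, String prefix) {
        for (String token : indexedTokens) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String document = "Philadelphia PA";
        // Assumption: the parser lowercases the wildcard term, so "Phil*" becomes "phil*".
        String prefix = "Phil".toLowerCase(Locale.ROOT);

        // Case-preserving index: "phil" is not a prefix of "Philadelphia".
        System.out.println(prefixMatches(whitespaceTokens(document), prefix)); // false
        // Lowercased index: "phil" is a prefix of "philadelphia".
        System.out.println(prefixMatches(lowercasedTokens(document), prefix)); // true
    }
}
```

If this is indeed the cause, it would also be consistent with the fuzzy query working earlier: a lowercased fuzzy term differs from the indexed token only by a small number of extra edits, which a similarity threshold can absorb, whereas a prefix match has no such tolerance.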

-----Original Message-----
From: Erick Erickson [mailto:erickerickson [at] gmail]
Sent: Wednesday, November 11, 2009 6:52 PM
To: java-user [at] lucene
Subject: Re: Edit distance and wildcard searching with PhraseQuery

