
Mailing List Archive: Lucene: Java-User

Phrase Queries vs. SpanTermQueries exact phrases vs. stop words

 

 



paul at metajure

Jan 31, 2012, 12:48 PM

Post #1 of 5 (442 views)
Phrase Queries vs. SpanTermQueries exact phrases vs. stop words

In Lucene 3.4, I recently implemented "Translating PhraseQuery to SpanNearQuery" (see Lucene in Action, page 220) because I wanted _order_ to matter.

Here is my exact code, called from getFieldsQuery once I know I'm looking at a PhraseQuery; I believe it matches the book's version.

static Query buildSpanNearQuery(PhraseQuery phraseQ, int slop) {
    Term[] terms = phraseQ.getTerms();
    SpanTermQuery[] clauses = new SpanTermQuery[terms.length];
    for (int i = 0; i < terms.length; i++) {
        clauses[i] = new SpanTermQuery(terms[i]);
    }
    SpanNearQuery query = new SpanNearQuery(clauses, slop, PHRASE_ORDER_MATTERS);
    return query;
}

I put in my own QueryParser and things looked good until I tried a phrase with stop words.
With the old PhraseQuery I got results for a phrase containing stop words without extending the slop, but with SpanNearQuery nothing is found unless the query includes some slop.
This conflicts with the typical use case of a user taking a phrase, pasting it into the search bar in quotes, and expecting to find his document.
I can't just add a fixed amount of slop, because the amount needed depends on how many stop words occur in each sequence within the phrase.

Any suggestions on how to combine the idea of SpanNear (so that words appearing in phrase order are preferred) with text that has had stop words removed, so that I can support the simple use of quotes for exact quoted-text matching?

Any Ideas?

-Paul


cdoronc at gmail

Jan 31, 2012, 11:30 PM

Post #2 of 5 (430 views)
Re: Phrase Queries vs. SpanTermQueries exact phrases vs. stop words [In reply to]

Hi,

Code here ignores PhraseQuery (PQ)'s positions:

int[] pp = PQ.getPositions();

These positions have extra gaps when stop words are removed.

To account for this, the overall extra gap can be added to the slop:
int gap = (pp[pp.length] - pp[0]) - (pp.length - 1); // (+/- boundary cases)
slop += gap;

I think this is less accurate than PQ:
It does not specify the exact position of the stop word.

For example, assume original text:
A B S D
and S is a stop word.

PQ:
A B S D would match
A S B D would not

Span Near query: both would match.

Perhaps there's a way around this too that I am not aware of.
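
The A B S D example can be checked with a toy model. The following is a minimal sketch in plain Java, not Lucene code: the class and method names are hypothetical, each term is assumed to occur once per document, and the ordered-near check assumes that slop counts the total number of positions skipped between consecutive terms.

```java
/** Toy position-matching model for the A B S D example; not Lucene code. */
final class PhraseVsSpanDemo {

    // Exact phrase match: every term must sit at the same offset from the
    // first term as it does in the query (holes left by stop words included).
    static boolean phraseMatches(int[] queryPos, int[] docPos) {
        for (int i = 1; i < queryPos.length; i++) {
            if (docPos[i] - docPos[0] != queryPos[i] - queryPos[0]) {
                return false;
            }
        }
        return true;
    }

    // Ordered near match: terms must appear at strictly increasing positions,
    // and the total number of skipped positions must not exceed the slop.
    static boolean orderedNearMatches(int[] docPos, int slop) {
        int skipped = 0;
        for (int i = 1; i < docPos.length; i++) {
            int step = docPos[i] - docPos[i - 1];
            if (step < 1) {
                return false; // out of order, or two terms at the same position
            }
            skipped += step - 1;
        }
        return skipped <= slop;
    }

    public static void main(String[] args) {
        int[] query = {0, 1, 3};   // query "A B S D" with S removed: A=0, B=1, D=3
        int[] docABSD = {0, 1, 3}; // positions of A, B, D in the text "A B S D"
        int[] docASBD = {0, 2, 3}; // positions of A, B, D in the text "A S B D"

        System.out.println(phraseMatches(query, docABSD));  // true
        System.out.println(phraseMatches(query, docASBD));  // false
        System.out.println(orderedNearMatches(docABSD, 1)); // true
        System.out.println(orderedNearMatches(docASBD, 1)); // true
    }
}
```

With a slop of 0 the ordered-near check rejects both texts in this model, which mirrors the original symptom of finding nothing until some slop is added.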

Also, the gap calculation above is too simple for the case where the
analyzer in effect may emit more than one term at the same position, for
example when expanding the query with synonyms, or when keeping both
original and stemmed forms. In that case just comparing pp[0] and
pp[pp.length-1] is insufficient, and the positions should be examined
while looping over the phrase terms, something like this:

int dpos = pp[i+1] - pp[i]; // for each i from 0 to pp.length - 2
if (dpos > 1)
    slop += (dpos - 1);

Haven't tested this - just to give you an idea what to try next.

Doron
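
The two gap calculations described above can be compared side by side. This is a sketch in plain Java under stated assumptions: `SlopGap` and its method names are hypothetical, `pp` stands for the array returned by `PhraseQuery.getPositions()`, and the overall formula uses the last valid index, `pp.length - 1`.

```java
/** Hypothetical helpers for widening slop from phrase positions; not Lucene code. */
final class SlopGap {

    // Overall gap: span from first to last position, minus the span the terms
    // would cover if they were strictly consecutive.
    static int overallGap(int[] pp) {
        if (pp.length < 2) {
            return 0; // nothing to measure for empty or single-term phrases
        }
        return (pp[pp.length - 1] - pp[0]) - (pp.length - 1);
    }

    // Pairwise gap: add up only the holes between neighbouring positions,
    // so terms stacked at the same position do not cancel out a real hole.
    static int pairwiseGap(int[] pp) {
        int gap = 0;
        for (int i = 0; i + 1 < pp.length; i++) {
            int dpos = pp[i + 1] - pp[i];
            if (dpos > 1) {
                gap += dpos - 1;
            }
        }
        return gap;
    }

    public static void main(String[] args) {
        int[] plain = {0, 1, 3};      // one stop word removed before the last term
        int[] stacked = {0, 0, 1, 3}; // same, plus a synonym stacked at position 0

        System.out.println(overallGap(plain));    // 1
        System.out.println(pairwiseGap(plain));   // 1
        System.out.println(overallGap(stacked));  // 0
        System.out.println(pairwiseGap(stacked)); // 1
    }
}
```

With positions {0, 0, 1, 3} the overall formula gives 0 while the pairwise loop gives 1, which is the situation where comparing only the first and last positions falls short.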



paul at metajure

Feb 1, 2012, 11:04 AM

Post #3 of 5 (427 views)
RE: Phrase Queries vs. SpanTermQueries exact phrases vs. stop words [In reply to]

Thanks for the discussion, I really appreciate you pointing out that the

> Code here ignores PhraseQuery (PQ)'s positions:

And by "here" you mean my original code, not your suggestion.

> To account for this, the overall extra gap can be added to the slop:
> int gap = (pp[pp.length] - pp[0]) - (pp.length - 1); // (+/- boundary cases)
> slop += gap;

At first I was thinking my refinement of this would be to consider the original slop provided by the user and only extend it when necessary.
For example:
"The Importance of Being Earnest"~2
already has enough slop to account for the stop words 'the' and 'of', so there would be no need to add more to the slop.
But a slop of 2 really means the user would accept
[The Importance of Really Truly Being Earnest], and I see that this requires a slop of 3 to skip [of] [Really] [Truly].
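
The arithmetic in this example can be worked through in a few lines. This is only a sketch in plain Java: `EarnestSlopDemo` and `holes` are hypothetical names, and it assumes the analyzer removes 'the' and 'of' while preserving position increments, so that the remaining query terms sit at positions 1, 3 and 4.

```java
/** Worked arithmetic for the "Importance of Being Earnest" example; not Lucene code. */
final class EarnestSlopDemo {

    // Counts the positions skipped between consecutive terms.
    static int holes(int[] positions) {
        int total = 0;
        for (int i = 0; i + 1 < positions.length; i++) {
            int step = positions[i + 1] - positions[i];
            if (step > 1) {
                total += step - 1;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int userSlop = 2;            // from "..."~2
        int[] queryPos = {1, 3, 4};  // importance, being, earnest in the query
        int[] longerDoc = {1, 5, 6}; // same terms in "The Importance of Really Truly Being Earnest"

        int widenedSlop = userSlop + holes(queryPos);

        System.out.println("stop-word holes in query: " + holes(queryPos));    // 1
        System.out.println("widened slop: " + widenedSlop);                    // 3
        System.out.println("slop needed by longer text: " + holes(longerDoc)); // 3
    }
}
```

Under these assumptions, adding the query's own stop-word hole to the user's slop of 2 gives 3, which is exactly what the longer text needs, whereas the unwidened slop of 2 would fall one short.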

But I'm not sure I understand the 'edit distance' for a phrase with more than 2 words. Does it apply to _all_the_edits_combined needed to bring the quoted phrase to match the indexed phrase, as your calculation suggests?

Also, do any "boundary cases" (as mentioned in your comment) come to mind?

> Also, this code suggestion simplifies in the case that the analyzer in effect may emit more than one
> term at the same position - for example when expanding the query with synonyms, or when keeping
> originals and stemmed forms - in that case just comparing pp[0] and pp[pp.length-1] is insufficient,
> and the positions should be examined while looping the phrase terms, something like this:

I don't understand what you mean by saying that it simplifies, since you already gave the simplification in your first example, which I think would work with or without synonyms, so there would be no need to walk through each distance as shown in your later code.

> int dpos = pp[i+1] - pp[i]; // for each i from 0 to pp.length - 2
> if (dpos > 1)
>     slop += (dpos - 1);
>
> Haven't tested this - just to give you an idea what to try next.

Thanks for your input, I will experiment with some code that considers the original PQ positions when considering the slop value of any generated SpanNearQuery.

-Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


paul at metajure

Feb 1, 2012, 1:37 PM

Post #4 of 5 (424 views)
RE: Phrase Queries vs. SpanTermQueries exact phrases vs. stop words [In reply to]

>Doron wrote:
> > int gap = (pp[pp.length] - pp[0]) - (pp.length - 1);

int gap = (pp[pp.length-1] - pp[0]) - (pp.length - 1);

Don't want to cause an IndexOutOfBoundsException
-Paul



cdoronc at gmail

Feb 1, 2012, 1:44 PM

Post #5 of 5 (425 views)
Re: Phrase Queries vs. SpanTermQueries exact phrases vs. stop words [In reply to]

> int gap = (pp[pp.length-1] - pp[0]) - (pp.length - 1);
>
> Don't want to cause an IndexOutOfBoundsException


Right... that's what I meant with "(boundary cases)"...
