
Mailing List Archive: Lucene: Java-User

DisjunctionMaxQuery and scoring

 

 



bimargulies at gmail

Apr 19, 2012, 10:26 AM

Post #1 of 17 (888 views)
DisjunctionMaxQuery and scoring

I am trying to solve a problem using DisjunctionMaxQuery.


Consider a query like:

a:b OR c:d OR e:f OR ...
name:richard OR name:dick OR name:dickie OR name:rich ...

At most, one of the richard names matches. So the match score gets
dragged down by the long list of things that don't match, as the list
can get quite long.

It seemed to me, upon reading the documentation, that I could cure
this problem by creating a query tree that used DisjunctionMaxQuery
around all those nicknames. However, when I built a boolean query that
had, as a clause, a DisjunctionMaxQuery in the place of a pile of
these individual Term queries, the score and the explanation did not
change at all -- in particular, the coord term shows the same number
of total terms. So it looks as if the children of the disjunction
still count.

Is there a way to control that term? Or a better way to express this?
Thinking SQL for a moment, what I'm trying to express is

name IN (richard, dick, dickie, rich)

as a single term query. Reading the javadoc, I see
MultiTermQuery, and I am wondering whether it is what we want.
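
A minimal sketch of the arithmetic behind this complaint, assuming the classic coord factor of overlap / maxOverlap that the "coord(9/20)" style of explain output suggests. It is a simplified model rather than Lucene code: the class and method names are hypothetical, and the clause scores are made-up numbers, not output from a real index.

```java
// Simplified model, not Lucene code: a flat disjunction scores as
// (sum of matching clause scores) * coord. All names here are hypothetical.
final class CoordSketch {

    // Classic coord factor: the fraction of clauses that matched.
    static float coord(int overlap, int maxOverlap) {
        return maxOverlap == 0 ? 0f : overlap / (float) maxOverlap;
    }

    // A clause score of 0 stands for "this clause did not match".
    static float flatScore(float[] clauseScores) {
        float sum = 0f;
        int overlap = 0;
        for (float s : clauseScores) {
            if (s > 0f) {
                sum += s;
                overlap++;
            }
        }
        return sum * coord(overlap, clauseScores.length);
    }
}
```

With one field clause and one matching nickname, flatScore(new float[] {1f, 1f}) gives 2 * 2/2 = 2.0. Adding three nicknames that cannot match, flatScore(new float[] {1f, 1f, 0f, 0f, 0f}), gives 2 * 2/5 = 0.8, although the document matched exactly as well as before.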

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


rcmuir at gmail

Apr 19, 2012, 10:34 AM

Post #2 of 17 (882 views)
Re: DisjunctionMaxQuery and scoring [In reply to]

On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies <bimargulies [at] gmail> wrote:
> Is there a way to control that term? Or a better way to express this?
> Thinking SQL for a moment, what I'm trying to express is
>
>   name IN (richard, dick, dickie, rich)
>

I think you just want to disable coord() here? You can do this for
that particular boolean query by passing true to the ctor:

public BooleanQuery(boolean disableCoord)

--
lucidimagination.com



bimargulies at gmail

Apr 19, 2012, 12:49 PM

Post #3 of 17 (879 views)
Re: DisjunctionMaxQuery and scoring [In reply to]

On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir <rcmuir [at] gmail> wrote:
> I think you just want to disable coord() here? You can do this for
> that particular boolean query by passing true to the ctor:
>
>  public BooleanQuery(boolean disableCoord)

Rob,

How do nested queries work with respect to this? If I build a boolean
query one of whose clauses is a BooleanQuery with coord turned off,
do just the nested query's own clauses get left out of 'coord'?

If so, then your answer certainly seems to be what the doctor ordered.

--benson


bimargulies at gmail

Apr 19, 2012, 1:03 PM

Post #4 of 17 (871 views)
Re: DisjunctionMaxQuery and scoring [In reply to]

Turning on disableCoord for a nested boolean query does not seem to
change the overall maxCoord term as displayed in explain.



rcmuir at gmail

Apr 19, 2012, 1:21 PM

Post #5 of 17 (870 views)
Re: DisjunctionMaxQuery and scoring [In reply to]

On Thu, Apr 19, 2012 at 3:49 PM, Benson Margulies <bimargulies [at] gmail> wrote:
> Rob,
>
> How do nested queries work with respect to this? If I build a boolean
> query one of whose clauses is a BooleanQuery with coord turned off,
> does just the nested query insides get left out of 'coord'?
>
> If so, then your answer certainly seems to be what the doctor ordered.
>

It applies only to that query itself. So if this BQ is a clause of
another BQ that has coord enabled, disabling coord on it does not
change the top-level BQ's coord.

Note: if you don't want coord at all, you can also plug in a
Similarity whose coord() returns 1, or pick another Similarity such
as BM25: in trunk, only the vector space implementation does
anything with coord().


--
lucidimagination.com



bimargulies at gmail

Apr 19, 2012, 2:05 PM

Post #6 of 17 (876 views)
Re: DisjunctionMaxQuery and scoring [In reply to]

On Thu, Apr 19, 2012 at 4:21 PM, Robert Muir <rcmuir [at] gmail> wrote:
> it applies only to that query itself. So if this BQ is a clause to
> another BQ that has coord enabled,
> that would not change the top-level BQ's coord.
>
> Note: if you don't want coord at all, then you can also plug in a
> Similarity that returns 1,
> or pick another Similarity like BM25: in trunk only the vector space
> impl even does anything for coord()....

Robert, I'm sorry that my density is approaching lead. My problem is
that I want coord, but I want to control which terms are counted and
which are not. I suppose I can accomplish this with my own scorer. My
hope was that there was a way to express "This group of terms counts
as one for coord".

In other words, for a subset of fields in the query, I want to scale
the entire score by the fraction of them that match.

Another way to think about this, which might be no use at all, is to
wonder: is there a way to charge a score penalty for failure to match
a particular query term? That would, from another direction, address
the underlying effect I'm trying to get.


rcmuir at gmail

Apr 19, 2012, 2:10 PM

Post #7 of 17 (912 views)
Re: DisjunctionMaxQuery and scoring [In reply to]

On Thu, Apr 19, 2012 at 5:05 PM, Benson Margulies <bimargulies [at] gmail> wrote:
> Robert, I'm sorry that my density is approaching lead. My problem is
> that I want coord, but I want to control which terms are counted and
> which are not. I suppose I can accomplish this with my own scorer. My
> hope was that there was a way to express "This group of terms counts
> as one for coord".

So just structure your boolean query appropriately?

BQ1(coord=true)
  BQ2(coord=false): 25 terms
  BQ3(coord=false): 87 terms

BQ1's coord is based on how many subscorers match (out of 2: BQ2 and
BQ3). If both match, it's 2/2; otherwise 1/2.

But in this example BQ2 and BQ3 disable coord themselves, hiding the
fact that they accept 25 and 87 terms respectively, so each appears
as a single sub for coord().
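
The structure above can be sketched as a small model, assuming each query node scores as the sum of its matching direct children, multiplied by overlap / total when coord is enabled and by 1 when it is disabled. This is not Lucene code: Node, Leaf, Group and terms are hypothetical names, and every matching term is given a made-up score of 1.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model, not Lucene code. All names are hypothetical.
final class NestedCoordSketch {

    interface Node {
        float score(); // 0 means "did not match"
    }

    static final class Leaf implements Node {
        private final float score;
        Leaf(float score) { this.score = score; }
        public float score() { return score; }
    }

    static final class Group implements Node {
        private final boolean coordEnabled;
        private final List<? extends Node> children;

        Group(boolean coordEnabled, List<? extends Node> children) {
            this.coordEnabled = coordEnabled;
            this.children = children;
        }

        public float score() {
            float sum = 0f;
            int overlap = 0;
            for (Node child : children) {
                float s = child.score();
                if (s > 0f) {
                    sum += s;
                    overlap++;
                }
            }
            if (!coordEnabled || children.isEmpty()) {
                return sum; // coord disabled: behaves like a factor of 1
            }
            // Only direct children are counted, never grandchildren.
            return sum * (overlap / (float) children.size());
        }
    }

    // Builds a group of `total` term leaves, of which `matching` match with score 1.
    static Group terms(int total, int matching, boolean coordEnabled) {
        List<Node> leaves = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            leaves.add(new Leaf(i < matching ? 1f : 0f));
        }
        return new Group(coordEnabled, leaves);
    }
}
```

In this model, a top-level Group with coord enabled over terms(25, 1, false) and terms(87, 0, false) scores 1 * 1/2 = 0.5: the 25 and 87 terms are invisible to the outer coord. Turning coord on for the inner groups instead would first scale the inner score to 1/25 and then halve it again.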

Does this make sense? You can extend this idea to control this however
you want by structuring the BQ appropriately, so that your BQs with
"synonyms" have coord disabled.

--
lucidimagination.com



bimargulies at gmail

Apr 19, 2012, 2:15 PM

Post #8 of 17 (873 views)
Re: DisjunctionMaxQuery and scoring [In reply to]

On Thu, Apr 19, 2012 at 5:10 PM, Robert Muir <rcmuir [at] gmail> wrote:
> So just structure your boolean query appropriately?
>
> BQ1(coord=true)
>  BQ2(coord=false): 25 terms
>  BQ3(coord=false): 87 terms
>
> BQ1's coord is based on how many subscorers match (out of 2, BQ2 and
> BQ3). If both match its 2/2 otherwise 1/2.
>
> But in this example BQ2 and BQ3 disable coord themselves, hiding the
> fact they accept 25 and 87 terms respectively and appearing as a
> single sub for coord().
>
> Does this make sense? you can extend this idea to control this however
> you want by structuring the BQ appropriately so your BQ's with
> "synonyms" have coord=0

Robert,

This makes perfect sense, it is what I thought you meant to begin
with. I tried it and thought that it did not work. Or, perhaps, I am
misreading the 'explain' output. Or, more likely, I goofed altogether.
I'll go back and recheck my results and post some explain output if I
can't find my mistake.

--benson


bimargulies at gmail

Apr 19, 2012, 3:36 PM

Post #9 of 17 (864 views)
Re: DisjunctionMaxQuery and scoring [In reply to]

I see why I'm so confused, but I think I need to construct a simpler test case.

My top-level BooleanQuery, which has disableCoord=false, has 22
clauses. All but three are ordinary SHOULD TermQueries. The remainder
are a spanNear, a nested BooleanQuery, and an empty PhraseQuery
(that's a bug).

However, at the end of the explain trace, I see:

0.45 = coord(9/20)

I think that my nested Boolean, for which I've been
flipping coord on and off to see what happens, is somehow not
participating at all. So switching its coord on and off has no
effect.

Why 20? Why not 22? Is this just an explain quirk? Should I shove all
this code up to 3.6 from 2.9.3 before bugging you further?



rcmuir at gmail

Apr 19, 2012, 4:37 PM

Post #11 of 17 (858 views)
Re: DisjunctionMaxQuery and scoring [In reply to]

On Thu, Apr 19, 2012 at 6:36 PM, Benson Margulies <bimargulies [at] gmail> wrote:
> I see why I'm so confused, but I think I need to construct a simpler test case.
>
> My top-level BooleanQuery, which has disableCoord=false, has 22
> clauses. All but three are ordinary SHOULD TermQueries. the remainder
> are a spanNear and a nested BooleanQuery, and an empty PhraseQuery
> (that's a bug).
>
> However, at the end of the explain trace, I see:
>
> 0.45 = coord(9/20) I think that my nested Boolean, for which I've been
> flipping coord on and off to see what happens, is somehow not
> participating at all. So switching it's coord on and off has no
> effect.
>
> Why 20? Why not 22? Is this just an explain quirk?

I am not sure (and I'm also not sure I understand your example totally),
but it could be as simple as the fact that you have 2 prohibited
(MUST_NOT) clauses. These don't count towards coord().

I think it's hard to tell from your description (just because it doesn't
have all the details). An explain output or a test case or something
like that might be more efficient if it's still not making sense...
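
A small sketch of the counting rule mentioned here, assuming the coord denominator is simply the number of clauses that are not prohibited. The enum and method names are hypothetical stand-ins rather than Lucene classes, and whether this accounts for the 9/20 in the trace is not established at this point in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model, not Lucene code. All names are hypothetical.
final class MaxCoordSketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    // Denominator of coord: prohibited clauses are skipped, everything else counts.
    static int maxCoord(List<Occur> clauses) {
        int count = 0;
        for (Occur occur : clauses) {
            if (occur != Occur.MUST_NOT) {
                count++;
            }
        }
        return count;
    }

    // Convenience builder for a clause list with the given number of each kind.
    static List<Occur> clauses(int should, int mustNot) {
        List<Occur> result = new ArrayList<>();
        for (int i = 0; i < should; i++) result.add(Occur.SHOULD);
        for (int i = 0; i < mustNot; i++) result.add(Occur.MUST_NOT);
        return result;
    }
}
```

Under that assumption, a query with 20 SHOULD clauses and 2 MUST_NOT clauses has 22 clauses but a coord denominator of 20.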

--
lucidimagination.com



dmurga at gmail

Apr 19, 2012, 5:32 PM

Post #12 of 17 (864 views)
Re: DisjunctionMaxQuery and scoring [In reply to]

[apologies for the earlier errant send]

I think
  BooleanQuery bq = new BooleanQuery(true); // disableCoord = true
doesn't quite accomplish the desired "name IN (dick, rich)" scoring
behavior. This is because (name:dick | name:rich) with coord=false would
score the 'document' "Dick Rich" higher than "Rich" because the former has
two term matches and the latter only one. In contrast, I think the desire
is that one and only one of the terms in the document match those in the
BooleanQuery so that "Rich" would score higher than "Dick Rich", given
document length normalization. It's almost like a desire for
  BooleanQuery bq = new BooleanQuery(true);
  bq.set*Maximum*NumberShouldMatch(1);

Is there a good way to accomplish this?
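
A worked sketch of the concern above, assuming an idealized length normalization of 1 / sqrt(number of terms in the field) and an otherwise equal score of 1 for every matching term. Real Lucene norms are stored with reduced precision and term scores also depend on tf and idf, so the numbers are illustrative only; the class and method names are hypothetical.

```java
// Simplified model, not Lucene code. All names are hypothetical.
final class SumOfMatchesSketch {

    // Idealized length normalization: shorter fields score higher per match.
    static double lengthNorm(int fieldLength) {
        return 1.0 / Math.sqrt(fieldLength);
    }

    // Disjunction with coord disabled: every matching term adds its normalized score.
    static double summedScore(int matchingTerms, int fieldLength) {
        return matchingTerms * lengthNorm(fieldLength);
    }
}
```

Here the document "Rich" (one match, one term) scores 1 * 1.0 = 1.0, while "Dick Rich" (two matches, two terms) scores 2 * 0.707 = 1.414, so the summed matches outweigh the length normalization.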



bimargulies at gmail

Apr 19, 2012, 5:42 PM

Post #13 of 17 (862 views)
Re: DisjunctionMaxQuery and scoring [In reply to]

FWIW, there seems to be an explain bug in 2.9.1 that is fixed in
3.6.0, so I'm no longer confused about the actual behavior.




uwe at thetaphi

Apr 19, 2012, 11:15 PM

Post #14 of 17 (852 views)
RE: DisjunctionMaxQuery and scoring [In reply to]

Hi,
> I think
>   BooleanQuery bq = new BooleanQuery(true); // disableCoord = true
> doesn't quite accomplish the desired "name IN (dick, rich)" scoring
> behavior. This is because (name:dick | name:rich) with coord=false would
> score the 'document' "Dick Rich" higher than "Rich" because the former has
> two term matches and the latter only one. In contrast, I think the desire
> is that one and only one of the terms in the document match those in the
> BooleanQuery so that "Rich" would score higher than "Dick Rich", given
> document length normalization. It's almost like a desire for
>   BooleanQuery bq = new BooleanQuery(true);
>   bq.set*Maximum*NumberShouldMatch(1);

In that case DisjunctionMaxQuery is the way to go: it will only count the hit
with the highest score and not add scores (coord or no coord doesn't matter
here).
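
A minimal sketch of that behavior, assuming the combination rule described in the DisjunctionMaxQuery documentation: the maximum subquery score plus a tie-breaker multiplier times the scores of the other matching subqueries. It is a model with hypothetical names, not Lucene code, and the sub-scores used below are made-up values based on an idealized 1 / sqrt(field length) normalization.

```java
// Simplified model, not Lucene code. All names are hypothetical.
final class DisMaxSketch {

    // max + tieBreaker * (sum of the remaining subquery scores).
    // A sub-score of 0 stands for "this subquery did not match".
    static double disMax(double[] subScores, double tieBreaker) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : subScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }
}
```

With a tie-breaker of 0, the document "Rich" (sub-scores 0 and 1.0) scores 1.0 and "Dick Rich" (sub-scores of about 0.707 each) scores about 0.707, so the second match adds nothing. A tie-breaker of 1 would turn the rule back into a plain sum.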




rcmuir at gmail

Apr 19, 2012, 11:27 PM

Post #15 of 17 (856 views)
Re: DisjunctionMaxQuery and scoring [In reply to]

On Thu, Apr 19, 2012 at 8:32 PM, David Murgatroyd <dmurga [at] gmail> wrote:
> In contrast, I think the desire
> is that one and only one of the terms in the document match those in the
> BooleanQuery so that "Rich" would score higher than "Dick Rich", given
> document length normalization. It's almost like a desire for
>   BooleanQuery bq = new BooleanQuery(true);
>   bq.set*Maximum*NumberShouldMatch(1);
>

You can, by returning a customized weight with a coord impl that
PUNISHES documents that match > 1 sub.

Take a look at http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/queries/src/java/org/apache/lucene/queries/BoostingQuery.java
for some inspiration, especially this part:

BooleanQuery result = new BooleanQuery() {
  @Override
  public Weight createWeight(IndexSearcher searcher) throws IOException {
    return new BooleanWeight(searcher, false) {
      @Override
      public float coord(int overlap, int max) {
        // your logic here when overlap == 1, > 1, etc.
        // Placeholder only (not taken from BoostingQuery): full score for a
        // single match, an arbitrary penalty for more than one.
        return overlap == 1 ? 1.0f : 0.5f;
      }
    };
  }
};
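
To see what such a coord could do to the ranking, here is a sketch in plain Java rather than against the Lucene API. It assumes the query score is the sum of the matching clause scores times the custom coord; the penalty of 0.5, the clause scores and all names are hypothetical choices for illustration.

```java
// Simplified model, not Lucene code. All names and numbers are hypothetical.
final class PunishingCoordSketch {

    // Full credit for at most one matching clause, a fixed penalty otherwise.
    static float coord(int overlap, float penalty) {
        return overlap <= 1 ? 1.0f : penalty;
    }

    // Sum of matching clause scores, scaled by the punishing coord.
    // A clause score of 0 stands for "this clause did not match".
    static float score(float[] clauseScores, float penalty) {
        float sum = 0f;
        int overlap = 0;
        for (float s : clauseScores) {
            if (s > 0f) {
                sum += s;
                overlap++;
            }
        }
        return sum * coord(overlap, penalty);
    }
}
```

With a penalty of 0.5, a document matching one clause with score 1.0 keeps its 1.0, while a document matching two clauses with 0.7 each drops from 1.4 to 0.7 and now ranks below it.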

--
lucidimagination.com



uwe at thetaphi

Apr 19, 2012, 11:33 PM

Post #16 of 17 (860 views)
RE: DisjunctionMaxQuery and scoring [In reply to]

Hi,

Ah sorry, I misunderstood: you wanted to score the duplicate match lower! To
achieve this, you have to change the coord function in the
Similarity/BooleanWeight used for this query.

Either way: if you want a group of terms that gets only one score if at least
one of the terms matches (SQL IN), without adding the scores up at all,
DisjunctionMaxQuery is fine. I think this is what Benson asked for.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi


bimargulies at gmail

Apr 20, 2012, 4:16 AM

Post #17 of 17 (858 views)
Re: DisjunctionMaxQuery and scoring [In reply to]

Uwe and Robert,

Thanks. David and I are two peas in one pod here at Basis.

--benson

