
Mailing List Archive: Lucene: Java-User

comparing lucene scores across queries

 

 



patrick.diviacco at gmail

Mar 28, 2011, 12:43 AM

Post #1 of 14 (1235 views)
comparing lucene scores across queries

Hi,

Sorry, I already asked this a few days ago, but I got no reply and I really
need some help with it.

I'm running several queries against a doc collection. The queries are
documents of the collection itself; I need to measure how similar each
document is to the rest of the collection.

Now, Lucene returns a score per query, but I've been told such scores are
not comparable across queries. Is this correct?

For example, aren't these scores comparable?
query1, score:8.324234
query2, score:3.324238

If they aren't, why not? Isn't the score the cosine similarity between the
query vector and the collection's doc vectors? I really need a comparable
measure.

thanks


uwe at thetaphi

Mar 28, 2011, 1:03 AM

Post #2 of 14 (1221 views)
RE: comparing lucene scores across queries [In reply to]

No, scores are in general not comparable between different queries. There
are several reasons:
- Each query has a norm factor that makes scores more comparable when the
queries are sub-clauses of a BooleanQuery. But you are right, this norm
factor should be the same.
- Some queries, like FuzzyQuery, depend on the terms in the index and on
which of them match the query.
- Inside Boolean queries, there is also a coord factor involved.

If you always use the same simple type of query (e.g. a simple TermQuery,
only with a different term) on the same index, you can compare the scores.
As soon as you use complex queries (e.g. several terms combined in a
BooleanQuery, as QueryParser produces), the scores are no longer
comparable.

You can read more about all the factors included in scoring:
http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/search/Similarity.html

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi
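
[Editor's illustration] To make the coord and norm factors mentioned in this reply concrete, here is a minimal plain-Java sketch. It does not use Lucene: the class name ScoreFactorsSketch and the raw sums are hypothetical, and only the two formulas (overlap / maxOverlap for coord, 1 / sqrt(sumOfSquaredWeights) for queryNorm) follow what the DefaultSimilarity documentation describes. Treat it as an assumption-laden toy, not as Lucene's scoring code.

```java
// Plain-Java stand-in for two scoring factors; this is not Lucene code.
class ScoreFactorsSketch {

    // Fraction of the query's clauses that matched the document,
    // following the formula documented for DefaultSimilarity.coord().
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Query normalization, following the formula documented for
    // DefaultSimilarity.queryNorm().
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        // Hypothetical raw sum of per-term contributions, assumed equal
        // for both queries so that only coord differs.
        float rawSum = 6.0f;

        // Query A: all 3 clauses match. Query B: only 2 of 3 match.
        float scoreA = coord(3, 3) * rawSum;
        float scoreB = coord(2, 3) * rawSum;
        System.out.println("with coord:    A=" + scoreA + " B=" + scoreB);

        // With coord fixed at 1, both queries keep the raw sum.
        System.out.println("without coord: A=" + rawSum + " B=" + rawSum);

        // queryNorm depends on the query's own term weights, so two
        // different queries are scaled by different constants.
        System.out.println("queryNorm(4.0)  = " + queryNorm(4.0f));
        System.out.println("queryNorm(16.0) = " + queryNorm(16.0f));
    }
}
```

The point of the toy is only that both factors depend on the query itself, so two scores from two different queries are scaled differently before they reach the caller.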




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


patrick.diviacco at gmail

Mar 28, 2011, 1:08 AM

Post #3 of 14 (1208 views)
Re: comparing lucene scores across queries [In reply to]

Hi, thanks for the reply.

Yeah, I've read the Similarity class documentation several times, but I need
some tips.

My queries are BooleanQueries, but they always have the same structure (the
same structure as the docs; they are actually docs from the collection): 3
fields.

What if I simplify the similarity score by removing the coord factor and
keeping just the cosine similarity, which is comparable?

I want to stress that my Boolean queries are just combinations of
"field:term" items, and I always have the same 3 fields, obviously with
different terms.

Thanks






uwe at thetaphi

Mar 28, 2011, 1:11 AM

Post #4 of 14 (1210 views)
RE: comparing lucene scores across queries [In reply to]

Hi Patrick,

You can disable the coord factor in the constructor of BooleanQuery.

Uwe



patrick.diviacco at gmail

Mar 28, 2011, 1:35 AM

Post #5 of 14 (1206 views)
Re: comparing lucene scores across queries [In reply to]

Cool, so just to be sure: if I disable the coord factor, can I finally
compare my BooleanQuery results?





patrick.diviacco at gmail

Mar 28, 2011, 1:48 AM

Post #6 of 14 (1210 views)
Re: comparing lucene scores across queries [In reply to]

One more thing: instead of extending the BooleanQuery class to remove the
coord factor, can I also extend the Similarity class to do it?

The other question is still open: just to be sure, if I disable the coord
factor, can I finally compare my BooleanQuery results?

thanks



uwe at thetaphi

Mar 28, 2011, 2:36 AM

Post #7 of 14 (1207 views)
RE: comparing lucene scores across queries [In reply to]

Hi,

You don't need to extend BooleanQuery; you can just pass "true" to its
constructor, see: http://s.apache.org/QvK
Of course you can also subclass DefaultSimilarity and return 1 as coord, but
that is more work than passing true to a constructor.

For your type of queries, disabling coord should be enough, but I am not
100% sure! Why not simply try it out?

Uwe
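
[Editor's illustration] The "subclass and return 1 as coord" idea from this reply can be sketched with plain-Java stand-ins. Nothing below is a Lucene class: BaseSimilaritySketch, NoCoordSimilaritySketch, CoordOverrideDemo and the score() helper are hypothetical names, assumed only to show the override pattern. In a real Lucene 3.x application the class to extend would be DefaultSimilarity, as the reply says.

```java
// Hypothetical stand-ins; these are not Lucene classes.
class BaseSimilaritySketch {
    // Same formula as documented for DefaultSimilarity.coord().
    float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }
}

class NoCoordSimilaritySketch extends BaseSimilaritySketch {
    @Override
    float coord(int overlap, int maxOverlap) {
        return 1.0f; // neutral factor: coord no longer affects the score
    }
}

class CoordOverrideDemo {
    // Hypothetical scoring helper: coord times a raw sum of term contributions.
    static float score(BaseSimilaritySketch sim, int overlap, int maxOverlap, float rawSum) {
        return sim.coord(overlap, maxOverlap) * rawSum;
    }

    public static void main(String[] args) {
        float rawSum = 4.5f; // assumed raw sum, for illustration only

        // Default behaviour: 1 of 3 clauses matched, so the sum is scaled down.
        System.out.println(score(new BaseSimilaritySketch(), 1, 3, rawSum));

        // Overridden behaviour: the raw sum passes through unchanged.
        System.out.println(score(new NoCoordSimilaritySketch(), 1, 3, rawSum));
    }
}
```

Passing true to the BooleanQuery constructor, as suggested in the reply, reaches the same result with less code; the subclass route is only worth it when other factors need changing as well.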



patrick.diviacco at gmail

Mar 28, 2011, 2:38 AM

Post #8 of 14 (1212 views)
Re: comparing lucene scores across queries [In reply to]

OK, thanks, I will pass true. Well, I don't know how to verify it: even if I
try it, I get some scores, but I don't know whether comparing them is
reliable.




uwe at thetaphi

Mar 28, 2011, 2:44 AM

Post #9 of 14 (1207 views)
RE: comparing lucene scores across queries [In reply to]

Hi,

As you seem to want to do very specific things, it might still be
interesting to provide a modified Similarity (by subclassing
DefaultSimilarity). You could then, for example, also return 1.0 from
queryNorm() to disable it, since it may also be a problem (though it isn't
for your queries). Theoretically, you can change the Similarity so that only
the cosine similarity is left over, if that is the only part you want to use.

Uwe
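
[Editor's illustration] For reference, this is what a plain cosine similarity between two term-frequency vectors looks like, as a self-contained Java sketch. It is independent of Lucene: the class CosineSketch and its helpers are hypothetical, and raw term counts are assumed as weights (no idf, boosts, or length norms). It is meant as a yardstick for what a "comparable" measure would look like, not as a reproduction of Lucene's score.

```java
import java.util.HashMap;
import java.util.Map;

// Plain cosine similarity over term-frequency vectors; not Lucene code.
class CosineSketch {

    // Builds a term -> count vector from a list of tokens.
    static Map<String, Double> termFrequencies(String... tokens) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : tokens) {
            vector.merge(token, 1.0, Double::sum);
        }
        return vector;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> vector) {
        double sumOfSquares = 0.0;
        for (double weight : vector.values()) {
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Dot product divided by the product of the two lengths.
    // Returns 0 when either vector is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double lengths = norm(a) * norm(b);
        return lengths == 0.0 ? 0.0 : dot / lengths;
    }

    public static void main(String[] args) {
        Map<String, Double> doc1 = termFrequencies("lucene", "score", "query");
        Map<String, Double> doc2 = termFrequencies("lucene", "score", "index");
        Map<String, Double> doc3 = termFrequencies("cooking", "recipes");

        System.out.println("doc1 vs doc1: " + cosine(doc1, doc1));
        System.out.println("doc1 vs doc2: " + cosine(doc1, doc2));
        System.out.println("doc1 vs doc3: " + cosine(doc1, doc3));
    }
}
```

Because both vectors are length-normalized, the result for non-negative weights always lies between 0 and 1 and a document compared with itself scores (up to rounding) 1. That gives a simple sanity check for any modified scoring that is supposed to be comparable across queries.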

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi




patrick.diviacco at gmail

Mar 28, 2011, 3:21 AM

Post #10 of 14 (1228 views)
Permalink
Re: comparing lucene scores across queries [In reply to]

I see. Well, if you say the norm isn't a problem for my case, I will just
disable the coord factor by constructing the query with BooleanQuery(true),
and I should be done.

If this is not correct, please let me know.



hossman_lucene at fucit

Mar 28, 2011, 4:57 PM

Post #11 of 14 (1181 views)
Permalink
Re: comparing lucene scores across queries [In reply to]

: I see, well if you say the norm isn't a problem for my case, I will just
: disable the coord factor by initializing BooleanQuery(true); and I should be
: done.

querynorm shouldn't be a problem (since your booleanqueries all have the
same structure, and don't use query boosts ... i assume) but field norm
might be; i also don't see anything mentioned so far in this thread that
describes how you'll work around the tf and idf values being theoretically
unbounded (unless your docs are all of identical length)

ultimately, attempts at comparing scores across different searches all
come down to normalizing (either explicitly or implicitly) and normalizing
requires that you have a "max possible score" you can normalize relative
to -- not just a "max score for the index", but a max score in the scope
of all theoretical documents (because otherwise the comparison isn't fair
given an arbitrary corpus)

with the default similarity, you can't really define a "max possible
score" for a given query because tf and idf are not bounded functions.
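
To make the "not bounded" point concrete, here is a small sketch in plain
Java. The class and method names are made up for illustration; the two
formulas are the ones documented for DefaultSimilarity in Lucene 3.0
(tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))).

```java
/** Hypothetical helper reproducing the documented default tf and idf formulas. */
final class UnboundedFactorsSketch {
    private UnboundedFactorsSketch() {}

    /** Term-frequency factor: grows with the square root of the raw count. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: grows with the logarithm of the index size. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Neither column levels off: a longer document or a larger index
        // can always produce a larger factor.
        for (int n = 1; n <= 1_000_000; n *= 100) {
            System.out.println(n + "  tf=" + tf(n) + "  idf(docFreq=1)=" + idf(1, n));
        }
    }
}
```

Since both functions keep growing with their inputs, no finite value can
serve as the maximum that every possible document stays below.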


There have been a few nice discussions about this general concept over the
years, here's the first one i found doing a quick search...

http://www.gossamer-threads.com/lists/lucene/java-user/61075





-Hoss



patrick.diviacco at gmail

Mar 29, 2011, 1:31 AM

Post #12 of 14 (1194 views)
Permalink
Re: comparing lucene scores across queries [In reply to]

hey Hoss,

thanks for your reply. I thought I had solved the issue: according to Uwe,
the queries without the coord function were reasonably comparable, but now
you have actually reopened it.

So, I need to be sure I'm making them comparable and I would like to ask the
following.

My BooleanQueries have a similar structure. Important: they only contain
TermQueries. There are always 3 fields, but the number of terms can vary.
This is an example of a BooleanQuery (sorry for the syntax):

field1:term1, SHOULD
field1:term2, SHOULD
field2:term1, SHOULD
field2:term2, SHOULD
field2:term3, SHOULD
field3:term1, SHOULD
...

If it is not clear what the BooleanQueries look like, I can print some of
them for you. They have the same number of fields but different numbers of
terms.

1- Do you still think QueryNorm is not an issue? Funny, because in the
documentation I can read:
"QueryNorm(q) is a normalizing factor used to make scores between queries
comparable. This factor does not affect document ranking (since all ranked
documents are multiplied by the same factor), but rather just attempts to
make scores from different queries (or even different indexes) comparable."

From the documentation, it seems I can compare scores across queries.

2- I don't think I'm using query boosts. Are they enabled by default in
BooleanQuery?

3- FieldNorm is not mentioned in the Similarity class. How can I disable it?
Should I disable it? Is it an issue?

4- If I'm not wrong, Uwe told me I can compute comparable cosine
similarities even with documents of different length. Tf and Idf are
unbounded, and my docs have different lengths. Can't I measure the
similarity between query and doc vectors anyway?

5- Again, I've been told I can compare queries, and from the documentation I
can see that the queryNorm factor normalizes all queries. But you are saying
I should manually normalize them somehow? It is not clear.

thanks
Patrick
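
For reference, the bounded measure that question 4 asks about can be computed
outside Lucene. The following is only a sketch under stated assumptions: the
class name is hypothetical, and the term weights are assumed to have been
extracted already (for example as tf-idf values per "field:term" key). It is
plain cosine similarity, not the score Lucene returns.

```java
import java.util.Map;

/** Hypothetical helper: cosine similarity between two sparse term-weight vectors. */
final class CosineSketch {
    private CosineSketch() {}

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            double wa = e.getValue();
            normA += wa * wa;
            Double wb = b.get(e.getKey());
            if (wb != null) {
                dot += wa * wb; // only terms present in both vectors contribute
            }
        }
        for (double wb : b.values()) {
            normB += wb * wb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With non-negative weights the result always lies between 0 and 1, whatever
the lengths of the two documents, which is why it is comparable across
queries where the raw Lucene score is not.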




uwe at thetaphi

Mar 29, 2011, 1:48 AM

Post #13 of 14 (1197 views)
Permalink
RE: comparing lucene scores across queries [In reply to]

> 1- Do you still think QueryNorm is not an issue ? Funny, because in the
> documentation I can read:
> QueryNorm(q) is a normalizing factor used to make scores between queries
> comparable. This factor does not affect document ranking (since all ranked
> documents are multiplied by the same factor), but rather just attempts to
> make scores from different queries (or even different indexes) comparable.
>
> It seems I can compare queries from the documentation.

But as you are always using the same type of query (TermQuery), the
QueryNorm should not change, so it is no issue at all. It does differ if you
have a variable number of Boolean clauses; in that case the query norm could
help to make the queries comparable. But if you always have the same-looking
BQ with exactly the same number of TQs in it (only different terms), it's not
an issue at all. In all other cases, the query norm helps to compare e.g. a
BQ with 5 TQ clauses with another BQ that has 8 TQ clauses.
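
A small sketch of the 5-versus-8-clause case, in plain Java with a made-up
class name. It assumes the DefaultSimilarity formula from the Lucene 3.0
documentation, queryNorm = 1 / sqrt(sumOfSquaredWeights), with all boosts
left at 1.0 so that each clause contributes idf squared to the sum.

```java
/** Hypothetical helper: default query norm for a BooleanQuery of unboosted TermQuery clauses. */
final class QueryNormSketch {
    private QueryNormSketch() {}

    /** @param idfs one idf value per TermQuery clause */
    static double queryNorm(double[] idfs) {
        double sumOfSquaredWeights = 0.0;
        for (double idf : idfs) {
            sumOfSquaredWeights += idf * idf;
        }
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        double[] five = {2, 2, 2, 2, 2};
        double[] eight = {2, 2, 2, 2, 2, 2, 2, 2};
        // More clauses give a larger sum and therefore a smaller norm,
        // which scales down the longer query's scores.
        System.out.println("5 clauses: " + queryNorm(five));
        System.out.println("8 clauses: " + queryNorm(eight));
    }
}
```

The norm shrinks as clauses are added, so it partly offsets the larger raw
sum a longer query produces; it does not bound the score.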

> 2- I don't think I'm using queryBoosts, are they enabled by default in the
> BooleanQuery ?

Query boosts are only active if you call TermQuery.setBoost() with a value
other than 1.0f.

> 3- FieldNorm is not mentioned in Similarity class. How can I disable it ?
> SHould I disable it ? Is it a issue ?

FieldNorm should not be a problem, as it is computed at index time. So the
same document always has the same FieldNorm (which is a combination of the
length norm and the index-time boosts). If two queries hit the same document,
the scores for this document should be comparable, as the FieldNorm is the
same in both cases.

See point 6) in the Similarity docs: norm(t,d)
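
As a sketch of what that norm(t,d) entry describes: the class below is a
hypothetical plain-Java model, assuming the DefaultSimilarity length norm
from the Lucene 3.0 documentation (1 / sqrt(number of terms in the field))
multiplied by the index-time document and field boosts.

```java
/** Hypothetical model of the index-time field norm (not Lucene's implementation). */
final class FieldNormSketch {
    private FieldNormSketch() {}

    /** Shorter fields get a larger norm, longer fields a smaller one. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Combination of document boost, field boost and length norm. */
    static double fieldNorm(double docBoost, double fieldBoost, int numTerms) {
        return docBoost * fieldBoost * lengthNorm(numTerms);
    }
}
```

The value depends only on the document, so it is the same for every query
that hits it. Lucene additionally stores the norm in a single byte, so the
indexed value is a lossy approximation of this product.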

> 4- If I'm not wrong Uwe told me I can compute comparable cosine
> similarities even with documents of different length. Tf and Idf are
> unbounded, and my docs have different length. Can't I measure the
> similarity between query and doc vectors anyway ?

The field norm normalizes that. So where is the problem?

> 5 - Again, I've been told I can compare queries and from documentation, I
> can see that queryNorm factor normalizes all queries. But you are saying I
> should manually normalize them somehow ? It is not clear

It only affects different queries (e.g. when the number of Boolean clauses
differs or the types of queries differ).

Uwe




patrick.diviacco at gmail

Mar 29, 2011, 1:58 AM

Post #14 of 14 (1181 views)
Permalink
Re: comparing lucene scores across queries [In reply to]

hey Uwe, so from your last answer, I understand I'm done: no need to do
anything, I can already compare the queries.

However, there is actually a misunderstanding: my BooleanQueries have a
variable number of boolean clauses, because the fields are fixed but the
terms per field are not. So, for example, I have:

BooleanQuery1:
field1:term, SHOULD
field1:term, SHOULD
field2:term, SHOULD
field2:term, SHOULD
field2:term, SHOULD
field3:term, SHOULD

BooleanQuery2:
field1:term, SHOULD
field2:term, SHOULD
field3:term, SHOULD
field3:term, SHOULD
field3:term, SHOULD
field3:term, SHOULD
field3:term, SHOULD

Are any of the points we discussed so far no longer valid?

thanks

