
Mailing List Archive: Lucene: General

Score combination - Filtering vs. Querying

 

 



gmuresan at acm

Jun 15, 2011, 7:05 PM

Post #1 of 4 (350 views)
Score combination - Filtering vs. Querying

(From a newbie in Lucene, with IR background)

The issue that I have is well exemplified by section 3.4.5 "Combining
queries: BooleanQuery" in LIA, 2nd ed. The example uses BooleanQuery to
combine
- a TermQuery, for matching document topic, for which the TF-IDF scoring
makes sense; and
- a NumericRangeQuery, whose purpose is to filter by publication date.

I extended the example code to output the query and the explanation:

Title AND Date = +subject:search +pubmonth:[201001 TO 201012]
----------
Lucene in Action, Second Edition
1.6848878 = (MATCH) sum of:
  1.3560408 = (MATCH) weight(subject:search in 9), product of:
    0.9443832 = queryWeight(subject:search), product of:
      2.871802 = idf(docFreq=1, maxDocs=13)
      0.3288469 = queryNorm
    1.435901 = (MATCH) fieldWeight(subject:search in 9), product of:
      1.0 = tf(termFreq(subject:search)=1)
      2.871802 = idf(docFreq=1, maxDocs=13)
      0.5 = fieldNorm(field=subject, doc=9)
  0.3288469 = (MATCH) ConstantScoreQuery(pubmonth:[201001 TO 201012]), product of:
    1.0 = boost
    0.3288469 = queryNorm

Computing a queryNorm for the NumericRangeQuery has no meaning. Instead of
simply filtering by date, this component contributes a substantial amount
(0.3288469) to the overall score (especially if the title match has a low
score).
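
The numbers in the explanation above can be re-derived by hand. Here is a minimal sketch in plain Java (no Lucene dependency), assuming the classic DefaultSimilarity formulas: idf = 1 + ln(maxDocs / (docFreq + 1)) and queryNorm = 1 / sqrt(sum of squared clause weights). The class and method names are made up for illustration. It shows that the range clause enters the norm with weight 1.0 (its boost) and then adds boost * queryNorm to the sum.

```java
// Toy re-computation of the explain() output; names are hypothetical, not Lucene API.
class ClassicScore {
    // Assumed classic idf formula: 1 + ln(maxDocs / (docFreq + 1))
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Assumed classic queryNorm: 1 / sqrt(sum of squared clause weights)
    static double queryNorm(double... clauseWeights) {
        double sumOfSquares = 0.0;
        for (double w : clauseWeights) {
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    // Score of one term clause: queryWeight * fieldWeight
    static double termScore(double idf, double queryNorm, double tf, double fieldNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = tf * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        double idf = idf(1, 13);
        // Two clauses feed the norm: the term (weight = idf) and the
        // constant-score range clause (weight = its boost, 1.0).
        double norm = queryNorm(idf, 1.0);
        double term = termScore(idf, norm, 1.0, 0.5);
        double range = 1.0 * norm; // boost * queryNorm
        System.out.printf("idf=%.6f queryNorm=%.7f term=%.7f range=%.7f total=%.7f%n",
                idf, norm, term, range, term + range);
    }
}
```

The values should agree with the explain output up to small rounding differences, since Lucene computes these quantities in single precision.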

In my own (inherited) application I have multiple textual queries, matching
against different fields, combined with several NumericRangeQueries. The
contributions of the latter to the scores make it hard to control the boosts
of the different fields.

The logical course of action seems to me to be replacing the
NumericRangeQueries with filters. This means removing the NumericRangeQueries
from the overall BooleanQuery and separately building a filter that combines
the corresponding NumericRangeFilters. The options that I have are:
- Use BooleanFilter
- Use ChainedFilter
- In order to change as little code as possible, keep the code that
combines all NumericRangeQueries into a BooleanQuery, and wrap that in a
QueryWrapperFilter.
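
Whichever option is chosen, the idea is the same: each range filter accepts a set of document ids, and requiring all filters to match intersects those sets. A toy model of that idea, using java.util.BitSet rather than any Lucene class (the class, method names and sample values are invented for illustration), might look like this:

```java
import java.util.BitSet;

// Toy model of AND-combining range filters; not Lucene API.
class RangeFilterSketch {
    // Marks every document whose value lies in [min, max] (inclusive).
    static BitSet rangeBits(int[] valuesByDoc, int min, int max) {
        BitSet bits = new BitSet(valuesByDoc.length);
        for (int doc = 0; doc < valuesByDoc.length; doc++) {
            if (valuesByDoc[doc] >= min && valuesByDoc[doc] <= max) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // Intersects all filters: a document survives only if every filter accepts it.
    static BitSet and(BitSet first, BitSet... rest) {
        BitSet result = (BitSet) first.clone();
        for (BitSet other : rest) {
            result.and(other);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] pubmonth = {200912, 201003, 201011, 201101}; // made-up values, one per doc id
        int[] price = {10, 50, 20, 30};                    // made-up second numeric field
        BitSet byDate = rangeBits(pubmonth, 201001, 201012);
        BitSet byPrice = rangeBits(price, 0, 25);
        System.out.println("accepted doc ids: " + and(byDate, byPrice));
    }
}
```

Note that nothing here produces a score: the combined bit set only decides which documents the scoring query is allowed to return.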

Q1: Are there any (performance ?) advantages or disadvantages for each of
these options ?
Q2: Are there any plans to improve Lucene in terms of dealing in a
principled way with this issue of combining TermQueries and
NumericRangeQueries ?


--
View this message in context: http://lucene.472066.n3.nabble.com/Score-combination-Filtering-vs-Querying-tp3070380p3070380.html
Sent from the Lucene - General mailing list archive at Nabble.com.


gmuresan at acm

Jun 15, 2011, 7:32 PM

Post #3 of 4 (335 views)
Re: Score combination - Filtering vs. Querying [In reply to]

...
I've read more forum discussions on this issue and some people point out
(like LIA 2nd ed, p.183, does) that using a filter reduces the number of
documents under consideration and impacts IDF and therefore the overall
score. Moreover, the recommendation in such forum discussions is that,
unless a high performance gain can be obtained via CachingWrapperFilter,
MUST BooleanClauses are preferred to Filters.

This doesn't quite make sense to me: the number of documents in the
collection, the size of the vocabulary, the size of each posting list and
the IDF of each term are known after indexing and should not be affected by
filtering.
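
To make that argument concrete, here is a toy in-memory model (plain Java, hypothetical names, not Lucene code) in which docFreq and IDF are computed from the whole collection, while the filter only restricts which matching documents are returned. The IDF formula is the classic one, 1 + ln(numDocs / (docFreq + 1)), which I am assuming here.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model: collection statistics are independent of the filter.
class FilterKeepsIdf {
    // Number of documents in the whole collection that contain the term.
    static int docFreq(String[][] docs, String term) {
        int df = 0;
        for (String[] doc : docs) {
            for (String t : doc) {
                if (t.equals(term)) {
                    df++;
                    break;
                }
            }
        }
        return df;
    }

    // Assumed classic idf formula.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Ids of documents that contain the term AND are accepted by the filter.
    static List<Integer> hits(String[][] docs, String term, BitSet accepted) {
        List<Integer> result = new ArrayList<>();
        for (int id = 0; id < docs.length; id++) {
            if (!accepted.get(id)) {
                continue; // filtered out: never scored, but still counted in docFreq
            }
            for (String t : docs[id]) {
                if (t.equals(term)) {
                    result.add(id);
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] docs = {{"search", "lucene"}, {"search"}, {"tao"}};
        BitSet accepted = new BitSet();
        accepted.set(1, 3); // the filter accepts doc ids 1 and 2 only
        int df = docFreq(docs, "search");
        System.out.println("hits=" + hits(docs, "search", accepted)
                + " docFreq=" + df + " idf=" + idf(df, docs.length));
    }
}
```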

To test this, I further modified the same LIA example and compared the use
of a BooleanClause and the use of a Filter:

Q = category:/technology/computers/programming/methodology category:/philosophy/eastern +pubmonth:[200501 TO 201012]
----------
Tao Te Ching ???
1.4739084 = (MATCH) product of:
  2.2108626 = (MATCH) sum of:
    1.9717792 = (MATCH) weight(category:/philosophy/eastern in 4), product of:
      0.68659997 = queryWeight(category:/philosophy/eastern), product of:
        2.871802 = idf(docFreq=1, maxDocs=13)
        0.23908332 = queryNorm
      2.871802 = (MATCH) fieldWeight(category:/philosophy/eastern in 4), product of:
        1.0 = tf(termFreq(category:/philosophy/eastern)=1)
        2.871802 = idf(docFreq=1, maxDocs=13)
        1.0 = fieldNorm(field=category, doc=4)
    0.23908332 = (MATCH) ConstantScoreQuery(pubmonth:[200501 TO 201012]), product of:
      1.0 = boost
      0.23908332 = queryNorm
  0.6666667 = coord(2/3)

Q = +(category:/technology/computers/programming/methodology category:/philosophy/eastern) +pubmonth:[200501 TO 201012]
----------
Tao Te Ching ???
1.224973 = (MATCH) sum of:
  0.9858896 = (MATCH) product of:
    1.9717792 = (MATCH) sum of:
      1.9717792 = (MATCH) weight(category:/philosophy/eastern in 4), product of:
        0.68659997 = queryWeight(category:/philosophy/eastern), product of:
          2.871802 = idf(docFreq=1, maxDocs=13)
          0.23908332 = queryNorm
        2.871802 = (MATCH) fieldWeight(category:/philosophy/eastern in 4), product of:
          1.0 = tf(termFreq(category:/philosophy/eastern)=1)
          2.871802 = idf(docFreq=1, maxDocs=13)
          1.0 = fieldNorm(field=category, doc=4)
    0.5 = coord(1/2)
  0.23908332 = (MATCH) ConstantScoreQuery(pubmonth:[200501 TO 201012]), product of:
    1.0 = boost
    0.23908332 = queryNorm

Q = category:/technology/computers/programming/methodology category:/philosophy/eastern
Date = pubmonth:[200501 TO 201112]
----------
Tao Te Ching ???
1.0153353 = (MATCH) product of:
  2.0306706 = (MATCH) sum of:
    2.0306706 = (MATCH) weight(category:/philosophy/eastern in 4), product of:
      0.70710677 = queryWeight(category:/philosophy/eastern), product of:
        2.871802 = idf(docFreq=1, maxDocs=13)
        0.24622406 = queryNorm
      2.871802 = (MATCH) fieldWeight(category:/philosophy/eastern in 4), product of:
        1.0 = tf(termFreq(category:/philosophy/eastern)=1)
        2.871802 = idf(docFreq=1, maxDocs=13)
        1.0 = fieldNorm(field=category, doc=4)
  0.5 = coord(1/2)

Comparing the results, I see that:
- maxDocs and IDF are the same;
- queryNorm and coord can be different. The correct values are the ones
obtained when using Filter; BooleanClauses introduce artificial query terms
that affect these metrics;
- the BooleanClause also introduces a ConstantScoreQuery that further
impacts the "true" score.

I would conclude that from the perspective of obtaining "true" scores, using
Filter is preferred to using MUST BooleanClause in a BooleanQuery.
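
The differences in queryNorm and coord between the three runs can be reproduced with a few lines of arithmetic. This is a sketch in plain Java under stated assumptions: the classic formulas idf = 1 + ln(maxDocs / (docFreq + 1)), queryNorm = 1 / sqrt(sum of squared clause weights) and coord = matching clauses / total clauses, and docFreq=1 for both category terms (the second one is inferred from the reported queryNorm, it is not shown in the output). Class and method names are made up.

```java
// Re-derives the three "Tao Te Ching" scores; hypothetical names, no Lucene dependency.
class FilterVsClause {
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    static double queryNorm(double... clauseWeights) {
        double sumOfSquares = 0.0;
        for (double w : clauseWeights) {
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // Matching category term with tf = 1 and fieldNorm = 1: (idf * norm) * idf
    static double termScore(double idf, double norm) {
        return (idf * norm) * idf;
    }

    // Q = catA catB +range : three clauses in one BooleanQuery, two of them match
    static double flatClauseScore() {
        double idf = idf(1, 13);
        double norm = queryNorm(idf, idf, 1.0);
        return (termScore(idf, norm) + norm) * coord(2, 3);
    }

    // Q = +(catA catB) +range : coord(1/2) applies to the inner query only
    static double nestedClauseScore() {
        double idf = idf(1, 13);
        double norm = queryNorm(idf, idf, 1.0);
        return termScore(idf, norm) * coord(1, 2) + norm;
    }

    // Q = catA catB, date range applied as a filter: no range weight in the norm
    static double filteredScore() {
        double idf = idf(1, 13);
        double norm = queryNorm(idf, idf);
        return termScore(idf, norm) * coord(1, 2);
    }

    public static void main(String[] args) {
        System.out.printf("flat=%.7f nested=%.7f filtered=%.7f%n",
                flatClauseScore(), nestedClauseScore(), filteredScore());
    }
}
```

In the filtered case the range contributes neither to the sum of squared weights nor to the clause count, which is why queryNorm rises from about 0.2391 to about 0.2462 and coord is computed over the two category clauses only.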

The TF-IDF model (as well as other IR models) was developed for text-like
features. The assumptions made in that model do not apply to numeric fields
such as date or longitude/latitude, appropriate for faceted filtering, so
the two models should not be mixed in a common query.

Q3. Considering that all the expert opinions I have read in forums argue
against filtering, is there something that I'm missing?


--
View this message in context: http://lucene.472066.n3.nabble.com/Score-combination-Filtering-vs-Querying-tp3070425p3070439.html


suyalpravesh at yahoo

Jun 24, 2011, 5:52 AM

Post #4 of 4 (309 views)
Re: Score combination - Filtering vs. Querying [In reply to]

>I would conclude that from the perspective of obtaining "true" scores, using
Filter is preferred to using MUST BooleanClause in a BooleanQuery.

Correct, I suppose :)

Use Filters when you only want to restrict the result set without affecting
the score.

>.....using MUST BooleanClause in a BooleanQuery
In that case the clause is parsed by QueryParser as usual, and the resulting
Query contributes to the score.

Thanks
Pravesh

--
View this message in context: http://lucene.472066.n3.nabble.com/Score-combination-Filtering-vs-Querying-tp3070425p3104007.html
