
Mailing List Archive: Lucene: Java-User

There is a mismatch between the score for a wildcard match and an exact match

 

 



paul_t100 at fastmail

Mar 9, 2012, 2:42 AM

Post #1 of 3 (440 views)

There is a mismatch between the score for a wildcard match and an exact
match

I search for

|recording:luve OR recording:luve*
|

And here is the explain output from the search:

|DocNo:0:1.4196585:11111111-1cf0-4d1f-aca7-2a6f89e34b36
1.4196585 = (MATCH) max plus 0.1 times others of:
  0.3763506 = (MATCH) ConstantScore(recording:luve*), product of:
    1.0 = boost
    0.3763506 = queryNorm
  1.3820235 = (MATCH) weight(recording:luve in 0), product of:
    0.7211972 = queryWeight(recording:luve), product of:
      1.9162908 = idf(docFreq=1, maxDocs=5)
      0.3763506 = queryNorm
    1.9162908 = (MATCH) fieldWeight(recording:luve in 0), product of:
      1.0 = tf(termFreq(recording:luve)=1)
      1.9162908 = idf(docFreq=1, maxDocs=5)
      1.0 = fieldNorm(field=recording, doc=0)

DocNo:1:0.3763506:22222222-1cf0-4d1f-aca7-2a6f89e34b36
0.3763506 = (MATCH) max plus 0.1 times others of:
  0.3763506 = (MATCH) ConstantScore(recording:luve*), product of:
    1.0 = boost
    0.3763506 = queryNorm
|
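To make the arithmetic in the explain output easier to follow, here is a minimal plain-Java sketch that recomputes the two scores. It does not use any Lucene classes. The idf formula `1 + ln(maxDocs / (docFreq + 1))` is an assumption that happens to reproduce the `idf(docFreq=1, maxDocs=5)` value shown above, the queryNorm is copied from the output rather than derived, and the class and method names are made up for illustration.

```java
import java.util.Locale;

// Sketch only: plain arithmetic, no Lucene classes involved.
class ExplainArithmetic {
    // Assumed idf formula: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Term clause: queryWeight (idf * queryNorm) times fieldWeight (tf * idf * fieldNorm).
    static double termScore(double tf, double idf, double fieldNorm, double queryNorm) {
        return (idf * queryNorm) * (tf * idf * fieldNorm);
    }

    // Constant-score clause: boost * queryNorm.
    static double constantScore(double boost, double queryNorm) {
        return boost * queryNorm;
    }

    // "max plus tieBreaker times others"; assumes non-negative clause scores.
    static double maxPlusOthers(double tieBreaker, double... clauseScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        double queryNorm = 0.3763506; // copied from the explain output, not derived here
        double idf = idf(1, 5);
        double exact = termScore(1.0, idf, 1.0, queryNorm);
        double wildcard = constantScore(1.0, queryNorm);
        // Doc 0 matches both clauses, doc 1 only the wildcard clause.
        double doc0 = maxPlusOthers(0.1, wildcard, exact);
        double doc1 = maxPlusOthers(0.1, wildcard);
        System.out.printf(Locale.ROOT, "idf=%.7f doc0=%.7f doc1=%.7f%n", idf, doc0, doc1);
    }
}
```

The results agree with the explain output to within float rounding in the last digit.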

In my test I have 5 documents: one contains an exact match, another a
wildcard match, and the other three do not match at all. The score of the
exact match is *1.42* compared to *0.38* for the wildcard match; that's
nearly a factor of *4*. With a much larger index, the gap between an exact
match on a rare term and a wildcard match would be even larger.
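A rough way to see how the gap grows with index size: for tf = 1, fieldNorm = 1 and boost = 1, the queryNorm cancels when the term clause is divided by the constant-score clause, leaving idf squared. The sketch below prints that ratio for a few index sizes. It assumes the idf formula `1 + ln(maxDocs / (docFreq + 1))`, ignores the 0.1 tie-breaker term, and uses index sizes that are purely illustrative.

```java
import java.util.Locale;

// Sketch only: how the exact-vs-wildcard ratio grows with index size for a term with docFreq = 1.
class IdfGrowth {
    // Assumed idf formula: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Term score / constant score with tf = 1, fieldNorm = 1, boost = 1: queryNorm cancels, idf^2 remains.
    static double exactToWildcardRatio(int docFreq, int maxDocs) {
        double i = idf(docFreq, maxDocs);
        return i * i;
    }

    public static void main(String[] args) {
        int[] indexSizes = {5, 1000, 1000000}; // illustrative sizes, not from the thread
        for (int maxDocs : indexSizes) {
            System.out.printf(Locale.ROOT, "maxDocs=%d ratio=%.2f%n",
                    maxDocs, exactToWildcardRatio(1, maxDocs));
        }
    }
}
```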

The whole difference is due to the different scoring mechanism used for
wildcard and exact matches: wildcards don't take tf, idf or lengthnorm into
account, you just get a constant score for each match. Now I'm not
bothered about tf or lengthnorm, in my data domain they don't make much
difference, but the *idf* score is a real killer. Because the matching
term occurs in only one of the 5 documents, its idf is 1.9162908, and since
idf enters both queryWeight and fieldWeight, its contribution is idf
squared, i.e. about *3.67*.

I know this constant score is quicker than calculating the
tf*idf*lengthnorm for each wildcard match but it doesn't make sense to
me for the idf to contribute so much to the score. I also know I can
change the rewrite method but there are two problems with this.

1. Scoring rewrite methods perform less well because they are
   calculating idf, tf and lengthnorm. idf is the only value I need.

2. The ones that do calculate the score don't make much sense either, as
   they would calculate the idf of the matching term even though this
   isn't what was actually searched for. This term could be rarer than
   what I was actually searching for, possibly boosting it higher than
   the exact match.
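The second problem can be illustrated with a small sketch. It models a scoring rewrite in a simplified way: each expanded term is scored with its own idf (tf = 1, fieldNorm = 1, shared queryNorm left out). The document frequencies and index size are hypothetical, and the idf formula `1 + ln(maxDocs / (docFreq + 1))` is an assumption.

```java
import java.util.Locale;

// Sketch only: a wildcard expansion on a rare term can outrank an exact match on a more common term.
class ScoringRewriteSkew {
    // Assumed idf formula: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Simplified per-term score with tf = 1, fieldNorm = 1, boost = 1; queryNorm is omitted.
    static double termScore(int docFreq, int maxDocs) {
        double i = idf(docFreq, maxDocs);
        return i * i;
    }

    public static void main(String[] args) {
        int maxDocs = 1000;     // hypothetical index size
        int docFreqLuve = 50;   // hypothetical: the searched term is fairly common
        int docFreqLuvely = 1;  // hypothetical: the expanded term is rare
        double exact = termScore(docFreqLuve, maxDocs);
        double expansion = termScore(docFreqLuvely, maxDocs);
        System.out.printf(Locale.ROOT, "exact 'luve' match:       %.3f%n", exact);
        System.out.printf(Locale.ROOT, "wildcard 'luvely' match:  %.3f%n", expansion);
        System.out.println("expansion outranks exact match: " + (expansion > exact));
    }
}
```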

(I could also change the similarity class to override the idf
calculation so it always returns 1, but that doesn't make sense because
the idf is very useful for comparing exact matches on different words,

e.g. recording:luve OR recording:luve* OR recording:the OR recording:the*

where I would want matches on *luve* to score higher than matches on the
common word *the*.)

So does a rewrite method already exist, or is it possible to write one,
that just uses the idf of the term I was trying to match? For example,
in this case I search for 'luve' and the wildcard matches 'luvely'; the
rewrite would multiply the luvely match by the squared idf of luve
(about 3.67). This way my wildcard match would be comparable to the exact
match, and I could just change my query to boost the exact match slightly,
so that an exact match always scores higher than a wildcard match but not
too much higher,

i.e.

|recording:luve^1.2 OR recording:luve*
|

and with this mythical rewrite method this would give (depending on
queryNorm):

* Doc 0:0:1.692
* Doc 1:0:1.419
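As a rough model of what such a rewrite would do, the sketch below scores both clauses with the idf of the searched term and applies the 1.2 boost to the exact clause. Everything here is hypothetical: no such rewrite method is being claimed to exist, the idf formula `1 + ln(maxDocs / (docFreq + 1))` is an assumption, and the queryNorm is simply reused from the earlier explain output even though it would change with the boost. The absolute values are therefore not meant to match the figures above; the point is the ratio between the two documents.

```java
import java.util.Locale;

// Sketch only: a hypothetical rewrite that scores wildcard matches with the searched term's idf.
class SearchedTermIdfSketch {
    // Assumed idf formula: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Hypothetical clause score: boost * idf(searched term)^2 * queryNorm (tf = 1, fieldNorm = 1).
    static double clauseScore(double boost, double searchedIdf, double queryNorm) {
        return boost * searchedIdf * searchedIdf * queryNorm;
    }

    // "max plus tieBreaker times others"; assumes non-negative clause scores.
    static double maxPlusOthers(double tieBreaker, double... clauseScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        double searchedIdf = idf(1, 5);  // idf of 'luve' in the 5-document test index
        double queryNorm = 0.3763506;    // reused from the explain output; would differ in practice
        double exact = clauseScore(1.2, searchedIdf, queryNorm);
        double wildcard = clauseScore(1.0, searchedIdf, queryNorm);
        double doc0 = maxPlusOthers(0.1, exact, wildcard); // matches both clauses
        double doc1 = maxPlusOthers(0.1, wildcard);        // matches only the wildcard clause
        System.out.printf(Locale.ROOT, "doc0=%.4f doc1=%.4f ratio=%.2f%n", doc0, doc1, doc0 / doc1);
    }
}
```

In this model the exact match scores 1.3 times the wildcard match (the 1.2 boost plus the 0.1 tie-breaker), independent of queryNorm.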


paul_t100 at fastmail

Mar 9, 2012, 4:23 AM

Post #2 of 3 (402 views)
Re: There is a mismatch between the score for a wildcard match and an exact match

On 09/03/2012 10:42, Paul Taylor wrote:
> There is a mismatch between the score for a wildcard match and an
> exact match
>
I just found that the problem has already been reported at
https://issues.apache.org/jira/browse/LUCENE-2557, but I'm not quite sure
whether there is a solution available yet.


Paul


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


paul_t100 at fastmail

Mar 9, 2012, 3:48 PM

Post #3 of 3 (397 views)
Re: There is a mismatch between the score for a wildcard match and an exact match

On 09/03/2012 12:23, Paul Taylor wrote:
> On 09/03/2012 10:42, Paul Taylor wrote:
>> There is a mismatch between the score for a wildcard match and an
>> exact match
>>
> I just found that the problem has already been reported at
> https://issues.apache.org/jira/browse/LUCENE-2557, but I'm not quite sure
> whether there is a solution available yet.
>
>
> Paul
>
FYI, I seem to have a working version and have posted the answer at
http://stackoverflow.com/questions/9632602/there-is-a-mismatch-between-the-score-for-a-wildcard-match-and-an-exact-match/9642475#9642475

