
Mailing List Archive: Lucene: Java-User
a "fair" similarity
 



lucenelist2005 at danielnaber

Aug 14, 2006, 5:20 PM



Hi,

As some of you may have noticed, Lucene prefers shorter documents over
longer ones: a shorter document gets a higher ranking even if the
ratio "matched terms / total terms in document" is the same.

For example, take these two artificial documents:

doc1: x 2 3 4 5 6 7 8 9 10
doc2: x x 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

When searching for "x", doc1 gets a higher ranking even though "x"
makes up 1/10 of the terms in both documents.

Using this similarity implementation seems to "fix" that:

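import org.apache.lucene.search.DefaultSimilarity;

class MySim extends DefaultSimilarity {

    // Default is 1 / sqrt(numTerms); use 1 / numTerms instead.
    public float lengthNorm(String fieldName, int numTerms) {
        return (float) (1.0 / numTerms);
    }

    // Default is sqrt(freq); use the raw frequency instead.
    public float tf(float freq) {
        return freq;
    }

}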

It's basically the default implementation with the Math.sqrt() calls
removed. Is this the correct approach? Are there any problems I should
expect? So far I have only tested it with the two documents above.
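
For reference, the arithmetic behind both variants can be sketched in plain Java, without Lucene. This assumes the default formulas are tf = sqrt(freq) and lengthNorm = 1/sqrt(numTerms), as the description above implies. The class and method names are made up for illustration, and idf, queryNorm, coord and the way norms are stored in the index are all ignored.

```java
class RawScoreFactors {

    // tf * lengthNorm with the square roots in place
    // (assumed to match the default formulas).
    public static float sqrtFactor(int freq, int numTerms) {
        float tf = (float) Math.sqrt(freq);
        float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));
        return tf * lengthNorm;
    }

    // tf * lengthNorm with the square roots removed, as in MySim.
    public static float linearFactor(int freq, int numTerms) {
        float tf = (float) freq;
        float lengthNorm = (float) (1.0 / numTerms);
        return tf * lengthNorm;
    }

    public static void main(String[] args) {
        // doc1: "x" occurs once in 10 terms; doc2: twice in 20 terms.
        System.out.println("sqrt   doc1: " + sqrtFactor(1, 10));
        System.out.println("sqrt   doc2: " + sqrtFactor(2, 20));
        System.out.println("linear doc1: " + linearFactor(1, 10));
        System.out.println("linear doc2: " + linearFactor(2, 20));
    }
}
```

On these raw factors, doc1 and doc2 come out equal under both sets of formulas (about 0.316 with the square roots, 0.1 without). A ranking difference observed in practice would therefore have to come from something this sketch leaves out, possibly the lossy single-byte encoding of norms in the index.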

The use case is that I want to boost fields, e.g. "body:foo^2 title:blah".
This could lead to strange results if title is already preferred just
because it's shorter.
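
To make the boosting concern concrete, here is a rough sketch with made-up field lengths: a 5-term title and a 500-term body, each containing one match. It only multiplies boost, tf and lengthNorm; idf, coord, queryNorm and norm encoding are ignored, and all names and lengths are hypothetical.

```java
class BoostVersusLengthNorm {

    // boost * sqrt(freq) / sqrt(numTerms): the square-root variant.
    public static double sqrtWeight(int freq, int numTerms, double boost) {
        return boost * Math.sqrt(freq) / Math.sqrt(numTerms);
    }

    // boost * freq / numTerms: the variant with the square roots removed.
    public static double linearWeight(int freq, int numTerms, double boost) {
        return boost * freq / (double) numTerms;
    }

    public static void main(String[] args) {
        int titleLen = 5;   // hypothetical title length in terms
        int bodyLen = 500;  // hypothetical body length in terms

        // Query shape "body:foo^2 title:blah": body boosted by 2, title by 1.
        double sqrtRatio = sqrtWeight(1, titleLen, 1.0) / sqrtWeight(1, bodyLen, 2.0);
        double linearRatio = linearWeight(1, titleLen, 1.0) / linearWeight(1, bodyLen, 2.0);

        System.out.println("title/body weight ratio, sqrt:   " + sqrtRatio);
        System.out.println("title/body weight ratio, linear: " + linearRatio);
    }
}
```

With these assumed lengths, a single title match still outweighs the boosted body match by a factor of about 5 with the square roots and about 50 without them, so the ^2 boost on body does not offset the length effect in either case.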

Regards
Daniel

--
http://www.danielnaber.de


