
Karsten.Konrad at xtramind
Dec 3, 2003, 3:24 PM
Post #2 of 7
Hi,

>> Do they produce same ranking results?

No; Lucene's operations on query weight and length normalization are not equivalent to a vanilla cosine in vector space.

>> I guess the 2nd approach will be more precise but slow.

Query similarity will indeed be faster, but may actually not be worse. A straightforward cosine without the IDF weighting of terms that Lucene applies will almost certainly be less precise if you have documents of different lengths: word occurrence probabilities in texts of different lengths vary greatly, and the cosine between unrelated longer texts will often be greater than that between short texts that actually share a topic, just because of randomly matching non-content words.

If, on the other hand, you choose the right TF/IDF weighting of terms, the cosine in this warped vector space could (a) be equivalent to the one Lucene computes, although that requires some work, or (b) even be better on average. However, the last time I counted, there were about 250 different TF/IDF formulas around in IR publications, machine learning, computational linguistics and so on. Performance depends on domain and language.

But if I were you, I would just start playing and have fun with the stuff...

Karsten

-----Original Message-----
From: Jing Su [mailto:J.Su [at] cs]
Sent: Tuesday, December 2, 2003 18:12
To: lucene-user [at] jakarta
Subject: Document Similarity

Hi,

I have read some posts in the user/developer archives about Lucene-based document similarity comparison. In summary, two approaches are mentioned:

1 - Construct a query from the document;
2 - Represent each document as a vector, then rank according to their distance (cosine).

Do they produce the same ranking results? Is there any other way to do so? I guess the 2nd approach will be more precise but slow.

Thanks.
Jing

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe [at] jakarta
For additional commands, e-mail: lucene-user-help [at] jakarta
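
For anyone who wants to play with the length effect described in the reply, here is a minimal sketch in plain Python (not Lucene code). The four toy texts, the helper names (`tf`, `idf`, `weighted`, `cosine`) and the choice of idf(t) = log(N / df(t)) are all assumptions made for illustration; that formula is just one of the many TF/IDF variants mentioned above and is not Lucene's scoring.

```python
import math
from collections import Counter

def tf(text):
    """Raw term-frequency vector of a whitespace-tokenized text."""
    return Counter(text.lower().split())

def idf(tf_vectors):
    """idf(t) = log(N / df(t)); an assumed variant, one of many in use."""
    n = len(tf_vectors)
    df = Counter()
    for vec in tf_vectors:
        df.update(set(vec))  # count each term once per document
    return {term: math.log(n / count) for term, count in df.items()}

def weighted(tf_vec, idf_map):
    """Scale each term frequency by its idf weight."""
    return {term: freq * idf_map[term] for term, freq in tf_vec.items()}

def cosine(u, v):
    """Cosine of two sparse vectors stored as dicts; 0.0 for a zero vector."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus: two short texts on one topic, two longer unrelated texts.
short_a = tf("the index in lucene ranks documents")
short_b = tf("lucene scores documents in the archive")
long_c = tf("the cook puts the onions in the pan in the kitchen")
long_d = tf("the striker kicks the ball in the net in the stadium")

corpus = [short_a, short_b, long_c, long_d]
idf_map = idf(corpus)

# Plain cosine on raw term frequencies.
raw_short = cosine(short_a, short_b)
raw_long = cosine(long_c, long_d)

# Cosine after TF/IDF weighting; "the" and "in" occur in every text, so idf = 0.
tfidf_short = cosine(weighted(short_a, idf_map), weighted(short_b, idf_map))
tfidf_long = cosine(weighted(long_c, idf_map), weighted(long_d, idf_map))

print(f"raw    short pair: {raw_short:.3f}  long pair: {raw_long:.3f}")
print(f"tf-idf short pair: {tfidf_short:.3f}  long pair: {tfidf_long:.3f}")
```

On this toy data the raw cosine puts the two unrelated longer texts (0.8) ahead of the two short Lucene texts (about 0.667), purely because of repeated "the" and "in". With the assumed IDF weighting the order flips: the short pair scores 0.2 and the unrelated pair drops to 0. A real corpus will behave less cleanly, so treat this as a starting point for experiments only.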