teddyyyy123 at gmail
Apr 25, 2012, 2:20 PM
Additionally, does anybody know roughly how Google does fast ranking for
multi-term queries with AND? (The details are of course secret, but I
guess the main ideas should be common knowledge these days.)
If their postings are sorted in PageRank order, then it's understandable
that a single-term query can quickly return the top-k. But for a
multi-term query they would have to traverse the entire lists to find the
intersection, because the lists are not sorted by docId as they are in
the Lucene paper.
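
For contrast, here is a minimal sketch of the docId-sorted case mentioned above. When both postings lists are sorted by docId, an AND query can leapfrog between them and skip runs of non-matching documents instead of visiting every entry. The function name and the sample lists are hypothetical, and binary search stands in for whatever skip structure a real index would use; this is not Lucene's or Google's actual code.

```python
from bisect import bisect_left


def intersect_sorted(a, b):
    """Intersect two docId-sorted postings lists by leapfrogging."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Document contains both terms.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip ahead in a to the first docId >= b[j].
            i = bisect_left(a, b[j], i + 1)
        else:
            # Skip ahead in b to the first docId >= a[i].
            j = bisect_left(b, a[i], j + 1)
    return result


# Hypothetical postings for the two terms of "yellow dog".
yellow = [2, 5, 9, 14, 21, 30]
dog = [5, 7, 14, 15, 30, 42]
print(intersect_sorted(yellow, dog))  # [5, 14, 30]
```

The skipping relies entirely on a shared docId order. With lists sorted by a static rank such as PageRank, a docId gives no hint about where it sits in the other list, which is exactly the difficulty raised above.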
On Wed, Apr 25, 2012 at 2:13 PM, Yang <teddyyyy123 [at] gmail> wrote:
> I read Doug's paper "Space optimizations for total ranking".
> Since it was written a long time ago, I wonder what algorithms Lucene uses
> now (for postings list traversal, score calculation, and ranking).
> In particular, the total ranking algorithm described there needs to
> traverse the entire postings list for every query term.
> So for a query with very common terms like "yellow dog", either of the 2
> terms may have a very long postings list in web search.
> Are these lists really traversed in full in current Lucene/Solr, or are
> heuristics actually employed to truncate them?
> For returning the top-k results, I can understand that partitioning
> the postings lists across multiple machines, and then combining the top-k
> from each, would work.
> But if we are required to return "the 100th result page", i.e. results
> ranked 991st to 1000th, then each partition would still have to find
> its own top 1000, so partitioning would not help much.
> Overall, are there any up-to-date, detailed docs on the internal
> algorithms of Lucene?
> Thanks a lot
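
The deep-paging concern in the quoted message can be made concrete with a small sketch. It assumes a hypothetical setup in which each partition holds (doc_id, score) pairs and a coordinator merges the per-partition results. The names `shard_top_n` and `fetch_page` are made up for illustration and do not correspond to any Lucene or Solr API.

```python
import heapq


def shard_top_n(scored_docs, n):
    """Return one partition's n best (doc_id, score) pairs, best first."""
    return heapq.nlargest(n, scored_docs, key=lambda d: d[1])


def fetch_page(shards, offset, page_size):
    """Return the global results ranked offset+1 .. offset+page_size."""
    # Any partition could hold all of the best documents, so each one
    # must return offset + page_size candidates, not just page_size.
    n = offset + page_size
    candidates = []
    for shard in shards:
        candidates.extend(shard_top_n(shard, n))
    merged = heapq.nlargest(n, candidates, key=lambda d: d[1])
    return merged[offset:]


# Two hypothetical partitions with made-up scores.
shards = [
    [("a", 0.9), ("b", 0.4), ("c", 0.1)],
    [("d", 0.8), ("e", 0.7), ("f", 0.2)],
]
print(fetch_page(shards, offset=2, page_size=2))  # [('e', 0.7), ('b', 0.4)]
```

For the 100th page of 10 results, `offset` is 990 and every partition has to produce 1000 candidates, so the per-partition work grows with page depth even though only 10 results are returned.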