fancyerii at gmail
Apr 27, 2012, 12:25 AM
On Thu, Apr 26, 2012 at 5:13 AM, Yang <teddyyyy123 [at] gmail> wrote:
> I read the paper by Doug "Space optimizations for total ranking",
> since it was written a long time ago, I wonder what algorithms lucene uses
> (regarding postings list traversal and score calculation, ranking)
> particularly the total ranking algorithm described there needs to traverse
> down the entire postings list for all the query terms,
> so in case of very common query terms like "yellow dog", either of the 2
> terms may have a very very long postings list in case of web search,
> are they all really traversed in current lucene/Solr ? or any heuristics
> to truncate the list are actually employed?
You can read the related papers on early termination; the technique is
closely tied to the ranking algorithm. Lucene currently does very
little in this area, and the same is true of its ranking algorithm.
> in the case of returning top-k results, I can understand that partitioning
> the postings list into multiple machines, and then combining the top-k
That's distributed search, and Solr supports it. Even on a single
node, for a conjunction query (AND query) Lucene uses the skip lists in
the postings to speed things up. For a disjunction query (OR query),
Lucene can use BooleanScorer rather than BooleanScorer2. BooleanScorer
is a TAAT (Term at a Time) algorithm, while BooleanScorer2 is DAAT
(Document at a Time).
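
A rough sketch of the three ideas above, in plain Python rather than
Lucene's Java. Everything here is an assumption for illustration: postings
are sorted lists of doc ids, bisect stands in for a real skip list, and
the score is just a sum of made-up per-term weights, not Lucene's
similarity. None of these names are Lucene APIs.

```python
from bisect import bisect_left
from collections import defaultdict
import heapq


def advance(postings, pos, target):
    """Return the first index >= pos whose doc id is >= target.

    bisect_left plays the role of a skip list here: it jumps over
    doc ids that cannot match instead of stepping through them.
    """
    return bisect_left(postings, target, pos)


def intersect(a, b):
    """AND query: leapfrog between two sorted postings lists."""
    i = j = 0
    hits = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            hits.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, b[j])
        else:
            j = advance(b, j, a[i])
    return hits


def score_taat(index, terms):
    """OR query, term at a time: finish one postings list before the
    next, accumulating partial scores per document."""
    acc = defaultdict(float)
    for term in terms:
        for doc, weight in index.get(term, []):
            acc[doc] += weight
    return dict(acc)


def score_daat(index, terms):
    """OR query, document at a time: merge all postings lists by doc id,
    so each document's score is complete before moving on."""
    streams = [index[t] for t in terms if t in index]
    scores = {}
    for doc, weight in heapq.merge(*streams):
        scores[doc] = scores.get(doc, 0.0) + weight
    return scores
```

Both scoring functions give the same totals; the difference is the
traversal order. TAAT has to keep an accumulator for every matching
document, while DAAT only needs the current document's score at any moment.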
> from each would work,
> but if we are required to return "the 100th result page", i.e. results
> ranked from 990--1000th, then each partition would still have to find out
> the top 1000, so
> partitioning would not help much.
Yes, that's why many search engines do not let users go beyond a
certain page number. In most applications users only look at the top
results, which is why the ranking algorithm matters. If you find that
your users keep paging through to later results, I think you should
reconsider your application: provide more filter conditions or improve
the ranking algorithm.
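
To make the deep-paging cost concrete, here is a small sketch in plain
Python (hypothetical function names, not Solr's distributed search code).
To serve hits offset .. offset+rows-1, every shard has to return its own
top offset+rows, because the coordinator cannot know in advance which
shard holds them.

```python
import heapq


def shard_top(hits, n):
    """Top n (score, doc_id) pairs of one shard, best first."""
    return heapq.nlargest(n, hits)


def fetch_page(shards, offset, rows):
    """Merge per-shard results and return hits offset .. offset+rows-1."""
    need = offset + rows  # each shard must supply this many candidates
    partial = [shard_top(hits, need) for hits in shards]
    # Each partial list is sorted best first, so merge in descending order.
    merged = heapq.merge(*partial, reverse=True)
    return list(merged)[offset:need]
```

For page 100 at 10 rows per page, need is 1000 on every shard, which is
the cost described in the question.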
> overall, is there any up-to-date detailed docs on the internal algorithms
> of lucene?
If you can read Chinese, I recommend
http://www.cnblogs.com/forfuture1978/category/300665.htm. You may also
find some of my blog posts about Lucene/Solr at
blog.csdn.net/fancyerII (I am not very persistent, and I have not kept
up my plan to blog about Lucene/Solr).
Anyhow, the source code is the best resource.
> Thanks a lot