ralf.heyde at gmx
Apr 26, 2012, 12:17 AM
i dont know the correct implementation. All that I can say is, that you should take a look at query optimization in state-of-the-art database systems. The generell solution is to select this part of a query first, which reduces the resultset most.
For example you try to search for a term like "I'm looking for a term" - each token has a selectivity (TF/IDF, histograms (ie. Zipf-Distribution)).
Term / Frequency
I'm - often
looking - normal
for - stopword (too often and maybe ignored)
a - stopword (too often and maybe ignored)
term - rare
So - Lucene might to select all documents, which contain "term", then on the "small" resultset looking for documents matching "looking" ... and so on.
For that reason Lucene stores information like Histograms and other "things" ...
I hope I gave you a simple example, which Lucene might use.
Greets from Berlin,
-------- Original-Nachricht --------
> Datum: Wed, 25 Apr 2012 14:20:10 -0700
> Von: Yang <teddyyyy123 [at] gmail>
> An: java-user [at] lucene
> Betreff: Re: lucene algorithm ?
> additionally, anybody knows roughly (of course the details are a secret,
> but I guess the main ideas should be
> common enough these days) how google does fast ranking in cases of
> multi-term queries with AND ?
> (if their postings are sorted by PageRank order, then it's understandable
> that a single term query would quickly return the top-k, but if it's
> multi-term, they would have to traverse the entire lists to find the
> insersection set, because the lists are not sorted by docId, as in the
> Lucene paper case)
> On Wed, Apr 25, 2012 at 2:13 PM, Yang <teddyyyy123 [at] gmail> wrote:
> > I read the paper by Doug "Space optimizations for total ranking",
> > since it was written a long time ago, I wonder what algorithms lucene
> > (regarding postings list traversal and score calculation, ranking)
> > particularly the total ranking algorithm described there needs to
> > down the entire postings list for all the query terms,
> > so in case of very common query terms like "yellow dog", either of the 2
> > terms may have a very very long postings list in case of web search,
> > are they all really traversed in current lucene/Solr ? or any
> > to truncate the list are actually employed?
> > in the case of returning top-k results, I can understand that
> > the postings list into multiple machines, and then combining the top-k
> > from each would work,
> > but if we are required to return "the 100th result page", i.e. results
> > ranked from 990--1000th, then each partition would still have to find
> > the top 1000, so
> > partitioning would not help much.
> > overall, is there any up-to-date detailed docs on the internal
> > of lucene?
> > Thanks a lot
> > Yang
Empfehlen Sie GMX DSL Ihren Freunden und Bekannten und wir
belohnen Sie mit bis zu 50,- Euro! https://freundschaftswerbung.gmx.de
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene