
marvin at rectangular
Apr 2, 2007, 10:40 AM
Post #1 of 1
(513 views)
Permalink
|
|
Pruning (was PolyFilter and Plans)
|
|
On Apr 2, 2007, at 10:00 AM, Mike Wexler wrote: > What is this pruning. It's a heuristic optimization that can be applied when... 1) You have a way of establishing an absolute rank for all documents -- e.g. page score, date of publication, price. 2) That primary ranking heavily influences which documents you want returned. If your data doesn't lend itself to that kind of primary ranking technique, pruning is of no use to you. The main application is for large search engines that use something akin to Google's (patented) pagerank algorithm. At index time, you sort documents so that the lowest document numbers have the highest rank. At search-time, you "prune" the result set by stopping the hit- collection process early. The idea is that if you are looking for 100 hits, after you've seen 1000 or so you've probably seen most or all of the good ones, so you don't need to process e.g. the other 5,000,000. With a segment-based index like that used in Lucene and KS, you have to process each segment, but even if you have 25 segments, that's only 25,000 docs which must be processed completely. The rest must be "skipped", which isn't free, but is still cheap compared to full scoring. The idea is to pay out a small price in relevance in return for a large performance benefit. Marvin Humphrey Rectangular Research http://www.rectangular.com/
|