
marvin at rectangular
Jun 11, 2007, 5:05 PM
Post #2 of 5
(738 views)
Permalink
|
On Jun 11, 2007, at 2:04 PM, Hans Dieter Pearcey wrote: > I looked through the MultiSearcher source and saw the dire warnings > about "the sort cache problem", but I don't know what that means > and the > missing support for filters doesn't seem to be described at all. Technically, MultiSearcher can handle Filters... it's SearchClient that can't. The problem is the overhead of passing the cached BitVector in a Filter between client and server. There's no fix pending, I'm afraid. The "sort cache problem" is really a problem of collating search results from disparate searchers. For this one, there is a fix available in theory, but it's costly in terms of memory. In KS, the sort cache is an array of 32-bit integers, one per document, corresponding to the "term number" in the lexicon for the sort field. If this is our doc set... my @docs = ( { name => 'fido', species => 'dog' }, # doc 0 { name => 'spot', species => 'dog' }, # doc 1 { name => 'whiskers', species => 'cat' }, # doc 2 ); ... then the lexicon for the field 'species' will look like this... my @lexicon = ( cat, # term 0 dog, # term 1 ); ... and this will be our sort cache: my @sort_cache = ( 1, 1, 0 ); It's dirt cheap to sort docs by term because of the sort cache: just look up the term numbers for the two documents and compare. if ( $sort_cache[$doc_num_a] < $sort_cache[$doc_num_b] ) { ... } The problem is that the central machine doesn't have access to the sort caches on the satellites, so when it comes time to select the creme de la creme from 5 sets of satellite top 10 hits, the central machine doesn't know how docs from different satellites compare. The straightforward solution is to have the satellites send not just doc numbers to the central machine, but doc numbers accompanied by sort field values. # now: ( 451, 20, 52 ) # fixed: ( [ 451, 'cat' ], [ 20, 'dog' ], [ 52, 'human' ] ) Then the central machine could make string comparisons of the sort field values from different machines. Looking up those values is kind of expensive though. Lexicons are only partially kept in memory (1 out of every 128 terms), and with a multi-segment index, you have to perform 1 disk scan per segment. Say you have 10 hits and 25 segments. That's 250 disk seeks to associate each hit with a sort field value. :( To avoid that cost, we might have to load entire lexicons for sort fields into memory. I've been trying to avoid that, but I don't see how. Marvin Humphrey Rectangular Research http://www.rectangular.com/
|