
marvin at rectangular
Jul 10, 2007, 1:24 AM
Post #2 of 2
(446 views)
Permalink
|
On Jul 3, 2007, at 2:24 PM, Nathan Kurz wrote: > On 7/2/07, Marvin Humphrey <marvin [at] rectangular> wrote: >> do { >> ... >> } while (Scorer_Skip_To( self, BitVec_Next_Set_Bit >> (filter_bits) )); > > I was thinking more along the lines of something that would fit into > the search infrastructure. One could create a FilterScorer that would > wrap an existing scorer. Something like (obviously not exact): > > $boolean_scorer = $query->scorer; > $filter_scorer = new FilterScorer($boolean_scorer, @filters); > $filter_scorer->collect(); > > While advancing through documents, FilterScorer works like an > AndScorer, but for scoring it only calls Tally on the internal > BooleanScorer. If one wanted, the actual filters could be > BitVectorScorers that would use a cached BitVector rather than reading > in Postings. Sounds like an idea with potential. I'm not sure you should expect huge gains over the existing Filter implementation, but I'd be happy to be surprised. > how would a Scorer optimized for matching alone > be different? The main difference is that wouldn't have any need for > Tally to be called. There's that. Another big gain would come from a custom Posting type. MatchPosting: <doc_delta>+ ScorePosting: <doc_delta, freq, shared_boost, <position>+ >+ RichPosting: <doc_delta, freq, <position, boost>+ >+ MatchPosting files have less data in them than ScorePosting files. (: Or they would if they were actually done. :) ScorePosting files have less data in them than RichPosting files. Anyone who wants a brief definition of posting, see... http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Docs/ IRTheory.html See these pages for the initial brainstorming sessions: http://wiki.apache.org/lucene-java/FlexibleIndexing http://wiki.apache.org/lucene-java/ ConversationsBetweenDougMarvinAndGrant The implementation is slightly different in KS than was envisioned for Lucene. In Lucene, posting information for different fields resides in shared files. In KS, postings files are broken up by field -- so there's only one Posting format per file. Also, there's no Posting class in Lucene, and the Lucene version of PostingList is TermDocs, which isn't quite the same. In Lucene, you subclass TermDocs to get more data per document. In KS, you subclass Posting, but you still use a PostingList. Eventually, the idea is that people like yourself will create custom Posting subclasses to supply custom Scorer subclasses with exactly the data they need and no more. > I think the only problem with using the existing > Scorers for matching is that they do a lot of unnecessary Tallying if > the score is thrown away. > > So maybe rather than having Tally called as part of the > $scorer->collect() loop, it could be optionally called by the > HitCollector. A TopDocCollector would call Tally and use the score to > order hits, but a SortCollector wouldn't have to. Sure. > HitCollector would > be passed a Scorer instead of just a score. FWIW, Tally doesn't exist in Lucene, which only uses scorer.doc() and scorer.score(). Tally is supposed to be the vessel which carries "your score and more!". :) > > This was a late night idea, but it still seems to make sense to me > this morning. > Does it make sense to you? > >> PostingList objects are iterators that refill from disk; BitVectors >> are memory only. Few objects refilling from disk means fewer disk >> seeks means better performance. > > My intuition is that BitVectors are going to be a win only in very > specific situations. If you can fit your entire index in RAM, the > system page cache is going to treat you well after the first access, > no BitVector needed. If your index is _much_ larger than RAM, the > BitVector is going to be unwieldily large too. If your index is a > size comparable to RAM, even at a 1:32 ratio your (multiple) cached > BitVectors are going to be displacing index that would otherwise have > been cached. > > I wonder if one would get more bang by optimizing the index structure > (as you are already doing) than by resorting to BitVectors. My > personal impulse is to figure out how the indexes could be mmap'ed in > and used directly, thus saving the processing time in reconstituting > them and allowing for better shared resources across multiple search > threads/processes. My guess is that this could be a win over even the > BitVectors. Possible. We'll just need to benchmark it. >> However, BitVectors don't work well remotely, since they're too big >> to pass around -- that's why SearchClient doesn't support filtering. > > This does strike me as a major disadvantage. Almost everything else > I've looked at in KinoSearch seems like it scales quite nicely to > multiple machines. > It would be nice to enable someone to use this for a new Google :). Yes. Google's search results are the the sum of many small indexes, and in another sense, the sum of many scoring algorithms. The KS infrastructure should scale up so that at least large specialized search engines are possible. For the scoring algorithms, we'll have to work together as a community. :) Marvin Humphrey Rectangular Research http://www.rectangular.com/
|