
marvin at rectangular
Jul 18, 2007, 8:56 AM
Post #2 of 5
(795 views)
Permalink
|
On Jul 17, 2007, at 1:50 AM, Nathan Kurz wrote: > The problem is that while it is easy to subclass a term scorer, it's > difficult to get that subclassed scorer to be used. The intent is that each Posting subclass will have a fixed association with a corresponding TermScorer subclass. You're not supposed to be able to override that association without additional subclassing. Furthermore, each Posting subclass shall wholly define a posting file format. This is very different from Lucene. In Lucene, the code to read postings data is spread out over many classes, resulting in a tight binding between code base and file format which impedes extensibility. The postings data itself is spread out over 2-3 index files per field -- .frq and .prx, and optionally .fNUM -- whose formats have slowly become mangled by the persistent nibbling of feeping creatures. Take a look at <http://lucene.apache.org/java/ docs/fileformats.html#Frequencies> and try to imagine writing actual code to support that spec. :) The introduction of the Posting class in KS and the ironclad association of a posting_type with a particular field is an attempt to disentangle that situation by applying the refactoring technique I referenced earlier, "replace conditionals with polymorphism". Instead of keeping track of "hasPositions", "hasBoost", "hasPayload" and such, we point a read function at the postings file which knows *exactly* how to decode it. Posting currently works great for internal use, but it's not quite ready for public subclassing yet. The next step along the path as I see it is to refactor Posting and the classes that touch it so that writing a Posting subclass is as simple as defining... * a write method * a read method * a make_scorer method * a TermScorer subclass that overrides Scorer_Tally In either C or Perl. :) Another advantage of the unified posting file design is superior locality of reference compared with Lucene's spread-out posting files. This is an engineering tradeoff, as when matching only doc number on a ScorePosting field, Lucene has to plow through less data... but when positions are needed, KS does fewer hard disk seeks. (: It would seem natural for us to exploit positions data fully, since we're reading it regardless -- which is one reason I'm enthusiastic about your positional work. :) Lastly, Posting and PostingList also happen to align well with IR theory, making for what seems to me is a more coherent conceptual OO model than Lucene's TermDocs/TermPositions. (A "posting" is one "term" indexing one "document", and each "term" in an index has a "posting list" of documents which it indexes - see <http:// www.rectangular.com/kinosearch/docs/devel/KinoSearch/Docs/ IRTheory.html>.) Here's a rundown of how TermScorer subclasses might interact with their corresponding Posting subclasses: * If a field uses MatchPosting, then iterating over a posting list is really just iterating over a set of document numbers (transported via MatchPosting objects). * ScorePosting adds position, freq, and boost information. * RichPosting is like ScorePosting with enhanced boost info. * A "PartOfSpeechPosting" might consist of a document number and an integer representing "verb", "noun", etc -- allowing someone to perform linguistic analysis of a document collection. * An MMapMatchPosting might encode document numbers using native 32-bit integers rather than delta vints. * A proprietary GraphPosting might include a node_id and a node_type, facilitating the kind of search engine described in that FaceBook blog post -- weighting the document score according to data looked up in an external structure. In all cases, these Posting subclasses would meet the minimum criteria of supplying a document number, but that would be all they had in common. A PartOfSpeech term scorer wouldn't really know what to make of a GraphPosting. > I could bypass PostingList->make_scorer, as PhraseScorer does, and > work directly on the Posting. This feels presumptive, and reliant on > Posting internals. I'm confused by your comfort in having > PhraseScorer do so. The only way to get a PhraseScorer is via the factory method PhraseWeight->make_scorer, which will return undef unless it detects that the posting subclass isa ScorePosting, i.e. it makes positional data available. We're safe. > Either it would > need an ISA() check and a cast, or some way of interrogating the > Scorer as to whether certain features are available.) I don't believe that will be necessary. We just need to make sure that whatever technique we use for inquiring about the positional data is safe. Tally has a num_sproxen member indicating how many ScoreProx objects it has. That number would always be 0 for a Tally supplied by a MatchPostingScorer. You know, as an alternative to deleting all this positional stuff, we could more or less finish it off. :) You and I both need positional data. TermScorer provides it. PhraseScorer provides it modulo a couple glaring, easily-squashed bugs. We just need the BooleanScorer components to provide it, too. The hardest part of what I want to do is figuring out exactly how positional data should contribute to the score. It's going to be tricky enough that I think it will require some trial and error. However, we can defer that task. The positional data itself probably won't change, so we can get the positional data engine working and just not connect it up. One thing I'm concerned about is RAM footprint. Let's try and come up with a worst case scenario. 'the OR (and OR (it OR is)))' KS isn't optimized for book-sized documents, but let's say that's what's in an index, and the query above hits "War and Peace". Each TermScorer will have a Posting object that grows to some large number of KiB or even MiB to accommodate so many positions. That's OK, but is there any danger of geometric growth as our position-resolving algorithm recurses? > I'd like to have a path that would > continue to work even as I move to other posting formats, particularly > the mmap() approach I mentioned. That's not gonna fly. You'll need dedicated Posting subclasses for mmap. MMapMatchPosting, MMapScorePosting, etc. Marvin Humphrey Rectangular Research http://www.rectangular.com/
|