
nate at verse
Jul 6, 2008, 1:35 PM
Post #3 of 3
On Mon, Jun 30, 2008 at 9:22 PM, Marvin Humphrey <marvin [at] rectangular> wrote:
> The main thing to take away from this paper is the simple fact that
> *competing formats exist*. Our goal must be to devise a robust and
> efficient plugin format, one that would allow us to use PForDelta
> compression, what the paper calls "VByte" (basically what KS uses now) or
> any of the other coding strategies.

Yes, that was the point I took home as well. I was a little surprised by the relatively poor performance of the variable byte encoding, though, or conversely by the very good performance of some of the block methods. This probably means I need to be more aware of branch prediction when thinking about optimization.

> The KinoSearch::Posting abstract class currently encapsulates our plugin
> format, but it has some limitations. My original plan was to have a
> one-to-one relationship between Posting subclasses and index formats, but
> that turns out to be insufficient, and the PForDelta algo shows why.

Yes, although I might question the word 'insufficient', as it might be taken to imply we need even more Posting classes to encompass multiway relationships. But I agree that the Posting class requires special thought as to how it will be extended to allow for smooth interaction between custom scorers and custom index formats.

The goal here, in my mind, is to make it possible to write a custom index format that works with all existing scorers for which the index holds the relevant data. Vice versa, it should also be possible to write a custom scorer that makes use of existing indexes without having to modify or subclass these indexes.

> However, Posting was not designed to maintain state well enough to batch
> process -- the writing method just took several arguments describing the
> last posting and the current posting in the loop. We're really going to
> need a dedicated PostingEncoder class to handle something like that. And
> then probably we will need a dedicated PostingDecoder class as well for
> search-time.

This might provide good generality, but I think I prefer a more minimalist solution, with the Posting class acting as a passive container that is filled by the Index and scored by Scorer. The Scorer chooses the class, since it presumably needs all the data to score with and passes it to the Index which fills in the data fields. The index would need some custom logic for parents of the fullest Posting class it can handle, but I think this would be straightforward.

Block compression and the like would all happen within the index, and the Posting classes and Scorers would remain blissfully ignorant. I haven't thought about it, but I think the same benefits would extend to Indexers.

Nathan Kurz
nate [at] verse

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch
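
For readers of the archive, the "passive container" arrangement proposed in the reply above might look roughly like the sketch below. It is an illustration only, not KinoSearch code: the names MatchPosting, FreqPosting, ToyIndex, MatchScorer, FreqScorer and run are all hypothetical, Python is used just for brevity, and the byte layout is one common variable-byte variant (7 data bits per byte, high bit meaning "more bytes follow") rather than the actual KS format. The scorer names the Posting class it needs, the index fills in whatever fields that class has, and the compression stays private to the index.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple, Type


@dataclass
class MatchPosting:
    """Hypothetical minimal posting: carries only a document ID."""
    doc_id: int = 0


@dataclass
class FreqPosting(MatchPosting):
    """Hypothetical fuller posting: adds the term frequency."""
    freq: int = 0


def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, low-order group first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


def vbyte_decode(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one int starting at pos; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear: this was the last byte
            return value, pos
        shift += 7


class ToyIndex:
    """Hypothetical index; its encoding is invisible to scorers."""

    def __init__(self, postings: List[Tuple[int, int]]) -> None:
        # Store (doc_id delta, freq) pairs as variable-byte integers.
        buf = bytearray()
        last = 0
        for doc_id, freq in sorted(postings):
            buf += vbyte_encode(doc_id - last)
            buf += vbyte_encode(freq)
            last = doc_id
        self._buf = bytes(buf)

    def postings(self, posting_class: Type[MatchPosting]) -> Iterator[MatchPosting]:
        """Fill and yield instances of the class the scorer asked for."""
        pos = doc_id = 0
        while pos < len(self._buf):
            delta, pos = vbyte_decode(self._buf, pos)
            freq, pos = vbyte_decode(self._buf, pos)
            doc_id += delta
            posting = posting_class()
            posting.doc_id = doc_id
            # Parents of the fullest class simply get fewer fields filled.
            if isinstance(posting, FreqPosting):
                posting.freq = freq
            yield posting


class MatchScorer:
    """Needs only doc IDs, so it asks for the smallest posting class."""
    posting_class = MatchPosting

    def score(self, posting: MatchPosting) -> float:
        return 1.0


class FreqScorer:
    """Needs term frequency, so it asks for the fuller posting class."""
    posting_class = FreqPosting

    def score(self, posting: FreqPosting) -> float:
        return float(posting.freq)


def run(index: ToyIndex, scorer) -> List[Tuple[int, float]]:
    """The scorer picks the class; the index fills it; the scorer scores it."""
    return [(p.doc_id, scorer.score(p))
            for p in index.postings(scorer.posting_class)]
```

Under these assumptions, replacing the variable-byte layout inside ToyIndex with a block scheme such as PForDelta would leave both scorers and both Posting classes untouched, which is the property the reply argues for.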