
marvin at rectangular
Sep 18, 2008, 9:25 PM
Post #11 of 15
(11742 views)
Permalink
|
On Sep 17, 2008, at 1:16 PM, Nathan Kurz wrote: > But since the processing time is > probably close to proportional to the file size, maybe this is where > Lucene has the advantage. I've now performed the experiment pitting Lucene's TermQuery against KinoSearch's. The results were worse than expected, and extra positions decoding overhead seem to be a bigger factor than I'd thought it would be. For a somewhat common term ("Reuters" in the Reuters corpus), KS is 3x slower. For a very common term ("the"), it's around 6x slower, and that's worse than it sounds because the 6x factor is multiplying a number that's big to begin with. Proliferating positions account for the non-constant slowdown: "the" occurs at many positions, and thus has greater per-document scanning cost in comparison to "Reuters", which most likely appears once per document. > Re-indexing KinoSearch without positions and re-running your previous > searches would also be an inverse way to test this hypothesis. The way to do this in KS is to override FieldSpec->posting() in a subclass so that it returns a MatchPosting instead of a ScorePosting. MatchPosting was a placeholder until today, but now a provisional implementation has been completed for testing purposes. Happily, with MatchPosting, we get much closer to Lucene performance: around 1.3x for "Reuters" and around 1.7x for "the". MatchPostingScorer doesn't presently take document/field boost into account, so it's doing a little less work than Lucene's TermScorer, but nevertheless, the results are instructive. Since we're still slower than Lucene even with MatchPosting, we may want to work on optimizing TermScorer. Probably we need some sort of bulk read -- a la what Lucene does, see below. We may also want to restructure things so that we're reading at the level of PostingList rather than Posting to cut down on method call overhead from nesting. Right now, Lucene's TermScorer.next() calls TermDocs.read(int[],int[]) once every 20 calls. In contrast, KinoSearch's TermScorer_Next calls PList_Next() and PList_Get_Posting() every time. Marvin Humphrey Rectangular Research http://www.rectangular.com/ /** Advances to the next document matching the query. * <br>The iterator over the matching documents is buffered using * {@link TermDocs#read(int[],int[])}. * @return true iff there is another document matching the query. */ public boolean next() throws IOException { pointer++; if (pointer >= pointerMax) { pointerMax = termDocs.read(docs, freqs); // refill buffer if (pointerMax != 0) { pointer = 0; } else { termDocs.close(); // close stream doc = Integer.MAX_VALUE; // set to sentinel value return false; } } doc = docs[pointer]; return true; } _______________________________________________ KinoSearch mailing list KinoSearch [at] rectangular http://www.rectangular.com/mailman/listinfo/kinosearch
|