adamfletcher.work at googlemail
Nov 10, 2007, 3:23 AM
Post #3 of 7
Re: I'm getting fewer than expected results when supplying multiple fields
On (09/11/07 20:51), Marvin Humphrey wrote:
> Hello Adam,
> Thanks for the detailed report.
> > I'm using the devel version (0.20_05).
> Was this index originally built under 0.20_04, and does it have
> deletions? That's one known bug, leading to index corruption.
No, I created the index using 0.20_05 and it doesn't contain any
deletions.
> Also, how many segments are in the index? (You can tell at a glance
> by counting files with a ".cf" extension within the index directory.)
There is one segment file in the index, which is about 69MB in size.
> > Performing a search on another field, fieldx:foo, gives me 4481 hits.
> > I have confirmed that this quantity is correct for this field.
> > When I do the following: search.pl q="all:1 AND fieldx:foo", I get a
> > lower quantity of 4449 hits. I've lost 32 documents.
> This behavior suggests a bug in either ANDScorer or one of the
> PostingList subclasses.
> One possibility is that PostingList is reading records incorrectly,
> so that the iterated doc nums don't match what ought to be in that
I did consider that, particularly when I read about you changing them
to start at 1 (rather than zero), but that change doesn't affect
0.20_05. I added in some debug code to output *my* unique identifier
for each document returned, but that didn't reveal anything more to me.
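
One way to narrow down which documents go missing is to compare identifier sets across the three searches. The following is only a minimal sketch in Python, not KinoSearch code: the function name and the sample identifiers are hypothetical, and it assumes the identifiers from each search have already been collected into lists.

```python
def missing_from_and(ids_a, ids_b, ids_and):
    """Return IDs matched by both single-field searches but absent
    from the combined AND search (hypothetical helper)."""
    expected = set(ids_a) & set(ids_b)
    return sorted(expected - set(ids_and))


# Made-up identifiers standing in for the real per-document keys.
all_hits = ["d1", "d2", "d3", "d4", "d5"]      # e.g. hits for all:1
fieldx_hits = ["d2", "d3", "d5"]               # e.g. hits for fieldx:foo
and_hits = ["d2", "d5"]                        # e.g. hits for the AND query

missing = missing_from_and(all_hits, fieldx_hits, and_hits)
print(missing)
```

With real data, the list of missing identifiers would be the 32 documents to inspect, for instance to see whether they cluster at particular positions in the index.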
> The second possibility is that PostingList is fine, but ANDScorer is
> performing the intersection improperly.
> > Why would the 3 searches not yield the same results?
> They should.
> There are two stages of compilation for that particular query string:
> QueryParser produces a BooleanQuery, and BooleanQuery produces a
> BooleanScorer wrapping an ANDScorer. ANDScorer operates on an array
> of subscorers (in this case there would be two TermScorers in the
> array), and the order in which the subscorers are arranged matters in
> terms of how the intersection algorithm plays out.
> My intuition is that if it's not the deletions issue, then
> ANDScorer_skip_to is to blame. The algo, which is very similar to
> that used by PhraseScorer, is only mildly convoluted, but it happens
> to be hard to write tests for.
> If you can supply a failing test case, I will work with that
> directly. Otherwise, I'll attempt to improve testing for ANDScorer
> and hope that the bug shows itself.
I'll try to do that. I have already tried to rebuild the index so that
it only contains the 2 fields mentioned, and 4481 records, but the
results from that index are correct.
I'll strip out the irrelevant code/data and send my data and test case
to you off-list once I've got a refined example.
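
For reference, the kind of skip-based intersection discussed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual ANDScorer or ANDScorer_skip_to code: the ListScorer class and and_intersect function are hypothetical names, and each subscorer is modelled as a sorted list of doc numbers. The point it shows is that the result should equal the plain set intersection regardless of the order of the subscorers.

```python
from bisect import bisect_left


class ListScorer:
    """Toy stand-in for a term scorer over a sorted list of doc numbers."""

    def __init__(self, doc_nums):
        self.docs = sorted(doc_nums)
        self.pos = -1  # positioned before the first doc

    def doc(self):
        return self.docs[self.pos]

    def next(self):
        """Advance by one doc; return False when exhausted."""
        self.pos += 1
        return self.pos < len(self.docs)

    def skip_to(self, target):
        """Advance to the first doc >= target, never moving backwards."""
        self.pos = bisect_left(self.docs, target, max(self.pos, 0))
        return self.pos < len(self.docs)


def and_intersect(scorers):
    """Collect doc numbers present in every scorer."""
    matches = []
    if not scorers or not all(s.next() for s in scorers):
        return matches
    target = max(s.doc() for s in scorers)
    while True:
        raised = False
        for s in scorers:
            if s.doc() < target:
                if not s.skip_to(target):
                    return matches  # one list is exhausted
                if s.doc() > target:
                    target = s.doc()  # overshoot: raise the target
                    raised = True
        if raised:
            continue  # earlier scorers may lag behind the new target
        matches.append(target)  # every scorer sits on target
        if not scorers[0].next():
            return matches
        target = scorers[0].doc()


a = [1, 3, 5, 7, 9]
b = [3, 4, 7, 10]
forward = and_intersect([ListScorer(a), ListScorer(b)])
reverse = and_intersect([ListScorer(b), ListScorer(a)])
print(forward, reverse)
```

A check like this against a brute-force set intersection, run over randomly generated lists, is one possible way to build the failing test case asked for above.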
KinoSearch mailing list
KinoSearch [at] rectangular