
adamfletcher.work at googlemail
Nov 10, 2007, 3:23 AM
Post #3 of 7
Re: I'm getting fewer than expected results when supplying multiple fields
On (09/11/07 20:51), Marvin Humphrey wrote:
> Hello Adam,
>
> Thanks for the detailed report.
>
> > I'm using the devel version (0.20_05).
>
> Was this index originally built under 0.20_04, and does it have
> deletions? That's one known bug, leading to index corruption.

No, I created the index using 0.20_05 and it doesn't contain any
deletions.

> Also, how many segments are in the index? (You can tell at a glance
> by counting files with a ".cf" extension within the index directory.)

There is one segment file in the index, which is about 69 MB in size.

> > Performing a search on another field, fieldx:foo, gives me 4481 hits.
> > I have confirmed that this quantity is correct for this field.
> >
> > When I do the following : search.pl q="all:1 AND fieldx:foo", I get a
> > lower quantity of 4449 hits. I've lost 32 documents.
>
> This behavior suggests a bug in either ANDScorer or one of the
> PostingList subclasses.
>
> [snip]
>
> One possibility is that PostingList is reading records incorrectly,
> so that the iterated doc nums don't match what ought to be in that
> array.

I did consider that, particularly when I read about you changing doc
nums to start at 1 (rather than zero), but that change doesn't affect
0.20_05. I added some debug code to output *my* unique identifier for
each document returned, but that didn't reveal anything more to me.

> [snip]
>
> The second possibility is that PostingList is fine, but ANDScorer is
> performing the intersection improperly.
>
> > Why would the 3 searches not yield the same results?
>
> They should.
>
> There are two stages of compilation for that particular query string:
> QueryParser produces a BooleanQuery, and BooleanQuery produces a
> BooleanScorer wrapping an ANDScorer. ANDScorer operates on an array
> of subscorers (in this case there would be two TermScorers in the
> array), and the order in which the subscorers are arranged matters in
> terms of how the intersection algorithm plays out.
>
> My intuition is that if it's not the deletions issue, and that
> ANDScorer_skip_to is to blame. The algo, which is very similar to
> that used by PhraseScorer, is only mildly convoluted, but it happens
> to be hard to write tests for.
>
> If you can supply a failing test case, I will work with that
> directly. Otherwise, I'll attempt to improve testing for ANDScorer
> and hope that the bug shows itself.

I'll try to do that. I have already tried rebuilding the index so that
it contains only the 2 fields mentioned and 4481 records, but the
results from that index are correct. I'll strip out the irrelevant
code/data and send my data and test case to you off-list once I've got
a refined example.

Thanks,
Adam

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch
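
For anyone following the thread, the kind of skip_to-driven intersection discussed above can be illustrated with a minimal sketch. This is not KinoSearch's ANDScorer code: `PostingIterator` and `and_intersect` are hypothetical names, and the posting lists are assumed to be plain sorted lists of doc numbers held in memory. It only shows the general technique, and why a correct intersection should return the same hits regardless of how the subscorers are ordered.

```python
class PostingIterator:
    """Hypothetical stand-in for a TermScorer: walks a sorted list of doc nums."""

    def __init__(self, doc_nums):
        self._docs = sorted(doc_nums)
        self._pos = -1  # positioned before the first doc until next()/skip_to()

    def doc(self):
        """Return the current doc num, or None if unpositioned or exhausted."""
        if 0 <= self._pos < len(self._docs):
            return self._docs[self._pos]
        return None

    def next(self):
        """Advance by one doc and return it (None once exhausted)."""
        self._pos += 1
        return self.doc()

    def skip_to(self, target):
        """Advance to the first doc num >= target and return it (None if none)."""
        if self._pos < 0:
            self._pos = 0
        while self._pos < len(self._docs) and self._docs[self._pos] < target:
            self._pos += 1
        return self.doc()


def and_intersect(iterators):
    """Return the doc nums present in every iterator (assumed sketch of AND)."""
    hits = []
    if not iterators:
        return hits
    candidate = iterators[0].next()
    while candidate is not None:
        agreed = True
        for it in iterators:
            found = it.skip_to(candidate)
            if found is None:
                return hits          # one list is exhausted: no more matches
            if found != candidate:
                candidate = found    # overshoot: retry with the larger doc num
                agreed = False
                break
        if agreed:
            hits.append(candidate)
            candidate = iterators[0].next()
    return hits
```

With a sketch like this, swapping the order of the two iterators changes which list drives the skipping but not the resulting hits. An order-dependent hit count like the one reported above is therefore the sort of symptom a test comparing both orders against a plain set intersection would catch.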