
marvin at rectangular
Sep 9, 2007, 3:46 PM
Post #6 of 13
(1299 views)
Permalink
|
On Sep 8, 2007, at 11:28 AM, Nathan Kurz wrote: > On 9/7/07, Matthew O'Connor <matthew.oconnor [at] socialtext> wrote: >> I think I've found a bug with phrase queries. Matthew, excellent job producing a repeatable test scenario. > The problem > was an unsigned integer underflow if the first occurrence of a term > was at a position less than its phrase offset. Nathan, excellent job identifying and fixing the bug. > I've fixed it by changing to using from a u32_t to an i32_t and > casting the compare, which seemed like the least invasive change. I appreciate the light touch. :) What I actually committed was slightly different, and attempts to explain the algorithm more effectively: /* Discard positions that occur too early in the field to match as * a part of the phrase. For example, if the field begins with * "The ants go marching one by one", that initial "The" cannot * match as the second term in a phrase search for * "fight the power". */ target = phrase_offset; while (candidates < candidates_end && *candidates < target) { candidates++; } if (candidates == candidates_end) break; > It does slightly reduce the range of legal positions to 31-bits, but I > don't think this is going to catch anyone. It's already limited to that by code elsewhere, and such a situation is pretty esoteric in any case. With ordinary analyzers, you'd need to have a field value that was longer than 2 GB for that to come into play at all, and even then that assumes each byte is a token holding down a position. Where people might theoretically have gotten tripped up is via accidentally running out of room after $token->set_pos_inc ($large_number) during indexing. I don't think limiting the range of usable positions to 31 bits impedes usability, but *checking* the range, as this bug illustrates, is a good idea. I've added one such check with revision 2526. > I didn't look at the .15 > version, but presume the problem can be fixed in the same simple > manner. Yes. The algorithm hadn't changed. I committed the changes to trunk as 2524 and to maint as 2525. Thanks! > There are probably more invasive alternatives if the range reduction > is unacceptable. Also, it looks possible to speed up the loop a bit, > but probably at the cost of readability. I'll hold off on trying this > unless it's thought to be a bottleneck. I'd be surprised if it were. This code is already significantly different from the Lucene PhraseScorer code and theoretically faster. In Lucene, each position is iterated over with a method call to termPositions.nextPosition(), and objects representing each term in the phrase are kept in a priority queue. In KS, we load all the positions at once, then traverse the arrays with pointer math, using an "anchor set" algorithm not in Lucene. The Lucene technique uses less memory. The KS technique has less function-call overhead. I'm not certain whether that was the right trade to make historically, but I am nearly certain that there's no way to do what we want to do with expanded position scoring without keeping all positions loaded. Marvin Humphrey Rectangular Research http://www.rectangular.com/ _______________________________________________ KinoSearch mailing list KinoSearch [at] rectangular http://www.rectangular.com/mailman/listinfo/kinosearch
|