
marvin at rectangular
Apr 14, 2007, 10:32 AM
Post #2 of 7
On Apr 14, 2007, at 9:26 AM, Simon Cozens wrote:

> use Glob;

Haven't seen 'Glob' before, and it doesn't look like this is it:

http://search.cpan.org/~holt/makepp-1.19/Glob.pm

Can you point me at the right module docs?

> Here's the earliest result when we're retrieving 10:
> % perl test.pl 10
> 523 2005-10-05
>
> And the earliest when we're retrieving 30:
> % perl test.pl 30
> 836 2004-11-29
>
> Now it is of course the earliest of the 10 hits, and the
> earliest of the 30 hits, and of course these are different,
> but what I really wanted was the earliest of the 10 earliest
> hits and earliest of the 30 earliest hits and these ought to
> be the same!
>
> I don't know if this is a bug.

Looks like a bug. Thank you for the report.

If your index has been incrementally updated and consists of more than one segment, you may wish to try the current subversion trunk, version 2306. Recent fixes to MultiTermList may affect the results of sorted search.

    svn co -r 2306 http://www.rectangular.com/svn/kinosearch/trunk kinosearch

If that doesn't solve the problem, we'll need to isolate a failing test case. The relevant test file, t/511-sort_spec.t, does not contain a stress test at present. Your current problems suggest how we might compose one.

> It's a bug if setting a sort_spec
> is expected to sort the document collection, but if it is
> expected merely to sort the result set, then it's just a major
> annoyance; either way it looks like I have to retrieve all the
> hits and then sort them myself.

That would be useless. KinoSearch is expected to work with result sets reaching into the millions.

With a SortSpec in force, KS uses a different criterion for sorting items in the priority queue where hits are collected. Instead of documents with the highest scores winning the most favorable queue positions, documents with the lowest "term number" win out. (Or highest, depending on whether reverse sorting has been requested.)
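To make the intended behavior concrete, here is a minimal sketch of a bounded priority queue that keeps the hits with the lowest sort key. It is written in Python purely for illustration and is not KinoSearch's code: the function name `collect_top_k`, the `(doc_id, term_number)` hit tuples, and the random test data are all assumptions made for this example.

```python
import heapq
import random


def collect_top_k(hits, k):
    """Retain the k hits with the lowest term number (hypothetical helper).

    `hits` is an iterable of (doc_id, term_number) pairs. Returns a list of
    (term_number, doc_id) pairs, best (lowest) first.
    """
    # heapq is a min-heap, so keys are negated: queue[0] is always the
    # *worst* hit currently retained and is the one to evict.
    queue = []
    for doc_id, term_number in hits:
        entry = (-term_number, doc_id)
        if len(queue) < k:
            heapq.heappush(queue, entry)
        elif entry > queue[0]:
            # The new hit beats the worst retained hit: swap it in.
            heapq.heapreplace(queue, entry)
    return sorted((-neg_key, doc_id) for neg_key, doc_id in queue)


# Illustrative data only: 1000 documents with distinct term numbers.
rng = random.Random(0)
hits = list(enumerate(rng.sample(range(100_000), 1000)))

top10 = collect_top_k(hits, 10)
top30 = collect_top_k(hits, 30)

# The best hit must not depend on how many hits were requested.
assert top10[0] == top30[0]
assert top10 == top30[:10]
```

In this sketch, asking for 10 or 30 hits yields the same earliest entry, which is the property the report says is being violated. A stress test along these lines would compare the queue's output against a full sort of the same data.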
During a search, all hits are passed through the hit collector; those with the most desirable values are retained by the queue. No matter how large the priority queue, the "best" should always remain the best. Ordering of documents which "tie" will not necessarily be consistent, but ordering of docs whose sort criteria differ -- as in your case -- ought to be.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/