marvin at rectangular
Mar 7, 2007, 10:05 PM
Post #8 of 11
On Mar 7, 2007, at 8:36 PM, Chris Nandor wrote:
> I was updating my searcher code, and previously I had been setting the
> offset passed to seek() using $hits->total_hits.
I don't understand what the use of this is, unless it's to naively
retrieve the last (worst) matches.
FYI, in KS 0.15 and earlier, calling total_hits before seek()
actually triggers a call to seek(0, 100) internally. It's not
possible to know how many documents a query matches without running
the whole scoring routine.
Note that there's not much difference between calling seek(0, 10) and
seek(0, 100). The only change is the size of a priority queue; the
cost of matching and scoring remains the same.
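To illustrate why the window size barely matters, here is a minimal sketch in Python. It is not KS's actual code: collect_hits, doc_scores and the (score, doc_num) tuples are made-up names for illustration. It assumes only what is described above, namely that every matching document gets scored and that only the best num_wanted survive in a priority queue.

```python
import heapq

def collect_hits(doc_scores, num_wanted):
    """Hypothetical collector: visit every match, keep the best num_wanted.

    doc_scores is an iterable of (doc_num, score) pairs standing in for
    the output of the matching and scoring routine.
    """
    queue = []        # min-heap of (score, doc_num); worst kept hit is queue[0]
    total_hits = 0
    for doc_num, score in doc_scores:
        total_hits += 1   # every match is counted, whatever num_wanted is
        if len(queue) < num_wanted:
            heapq.heappush(queue, (score, doc_num))
        elif queue and score > queue[0][0]:
            # Better than the worst hit kept so far: swap it in.
            heapq.heapreplace(queue, (score, doc_num))
    # Best hits first.
    return total_hits, sorted(queue, reverse=True)
```

The loop runs once per matching document whether num_wanted is 10 or 100; only the heap gets bigger. It also shows why the total hit count isn't available until the whole loop has run.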
KS 0.15 also performed unnecessary seeks in some cases -- for
instance, calling seek(0, 10) when you've already called seek(0, 100)
shouldn't be necessary, but KS was doing that if you called seek(0,
10) after total_hits(). This has changed in 0.20. Credit to Henry
for identifying the issue.
> But now I can't get that
> before I call seek(),
While you were "able" to get it before, you were still paying double:
once for the seek triggered by total_hits, and again for your own seek().
> and as a result, I was passing num_wanted => 0 to
> search(). This bug in my code was causing a bus error in KS.
Heh. I'll go fix that.
> That said, I wonder if 0 or something similar might be a way to denote
> "send everything."
There are memory and performance implications for setting a large
num_wanted. Hits are collected in a priority queue, and the size of
the queue is determined by num_wanted.
> My workaround now is to send $reader->num_docs instead,
> which is fine too, I think.
That will work -- sort of. If your index is large, that's gonna be a
huge priority queue. Each element in the queue is either a ScoreDoc
(16 bytes) or, when sorting, a FieldDoc (20 bytes presently, and
probably about to grow to take in an arbitrary string).
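For a back-of-the-envelope feel, here is a small Python sketch of that arithmetic. The per-element sizes are the ones quoted above; queue_bytes and the 10-million-document index are hypothetical, and the estimate ignores any overhead of the queue itself.

```python
SCORE_DOC_BYTES = 16   # per-element size quoted above
FIELD_DOC_BYTES = 20   # per-element size when sorting, as quoted above

def queue_bytes(num_wanted, sorting=False):
    """Rough lower bound on queue memory: element count times element size."""
    per_element = FIELD_DOC_BYTES if sorting else SCORE_DOC_BYTES
    return num_wanted * per_element

# Hypothetical index of 10 million docs, with num_wanted set to num_docs.
num_docs = 10_000_000
print(queue_bytes(num_docs))                # 160000000 bytes
print(queue_bytes(num_docs, sorting=True))  # 200000000 bytes
```

So on an index that size, "send everything" means a queue of at least 160 MB per search, versus 160 bytes for num_wanted => 10.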