nate at verse
Jul 2, 2007, 2:17 PM
On 7/2/07, Marvin Humphrey <marvin [at] rectangular> wrote:
> Rather than add new public APIs, it's
> time to yank Hits->seek (and simplify Searcher),
This is definitely a direction I would like to see. As a raving
lunatic, I have the nagging fear that I will come across as a raving
lunatic if I keep raving without offering code. But in my crazed
silence, I've been making some stabs at simplifying the search API.
I've been peeling away layers, trying to figure out the minimum that
one can work with.
In short, what I've done is to write a set of more generic search
classes (Perl and C) that don't presume any particular scoring system.
Right now I'm concentrating on my specific application of searching
short documents with Proximity and Scope scoring, but ideally I'll
figure out how to retrofit the existing scoring mechanisms as well.
It's not working yet, and may not be the right direction for
KinoSearch as a whole, but it's helping me understand the task better
and might eventually make for good fodder for a simplified KinoSearch.
The hierarchy I'm working with is much flatter and more
straightforward: a reusable Query produces an index-specific Scorer
that returns a HitCollector: no Searcher, no Weight objects, no
Similarity object, no twisty mazes, no delayed inits, and no
Hits->seek. It's probably not quite generic enough for general use,
but I think it's going to be possible to move it in that direction
once I get it working. It's certainly simpler to comprehend, and I
think it might end up more efficient as well.
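To make the shape of that hierarchy concrete, here is a rough sketch. It is written in Python purely for brevity (the real code is Perl and C), and every name in it (TermQuery, TermScorer, make_scorer, score_all, the dict-based "index") is a hypothetical stand-in, not KinoSearch's API or my actual classes. The scoring is a raw term count, a placeholder for whatever scheme a given Scorer implements.

```python
from dataclasses import dataclass, field


@dataclass
class HitCollector:
    """Accumulates (doc_id, score) pairs. Hypothetical stand-in."""
    hits: list = field(default_factory=list)

    def collect(self, doc_id, score):
        self.hits.append((doc_id, score))

    def top(self, n):
        # Highest score first; ties broken by ascending doc id.
        return sorted(self.hits, key=lambda h: (-h[1], h[0]))[:n]


class TermScorer:
    """Index-specific: bound to one index's postings for one term."""

    def __init__(self, postings):
        self.postings = postings  # {doc_id: term frequency}

    def score_all(self):
        collector = HitCollector()
        for doc_id, freq in sorted(self.postings.items()):
            # Placeholder scoring (raw count), deliberately not TF/IDF.
            collector.collect(doc_id, float(freq))
        return collector


class TermQuery:
    """Reusable: holds the term only, no index state."""

    def __init__(self, term):
        self.term = term

    def make_scorer(self, index):
        return TermScorer(index.get(self.term, {}))


# Two toy "indexes": term -> {doc_id: frequency}.
index_a = {"kino": {1: 2, 4: 1}, "search": {1: 1, 2: 3}}
index_b = {"kino": {7: 5}}

# One Query object, reused against both indexes.
query = TermQuery("kino")
hits_a = query.make_scorer(index_a).score_all().top(10)
hits_b = query.make_scorer(index_b).score_all().top(10)
```

The point is only that the Query carries nothing index-specific, so the same object can produce a Scorer for each index, and nothing sits between the Scorer and the HitCollector.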
I started along the path of trying to subclass the existing Scoring
classes to make them work the way I wanted, but it wasn't working
well. Line by line the existing code is great to work with, but the
overall hierarchy hardcodes TF/IDF scoring at a fairly deep level.
While it's been slow, I've been much happier trying to factor this out
rather than overriding it class by class. And by redoing it, I'm
getting a much clearer understanding of how the current setup works.
I've got tons of questions, but I'll limit myself to two for now:
1) Is there a good reason to keep BooleanScorer at the C level, rather
than moving it up into Perl? Its main function is to create the real
scorer out of the Boolean components (ANDScorer, ORScorer, etc) and it
seems like this real scorer could be assembled just as well by the
query processor. And having this all in Perl would give people a
better example of how to create their own custom Query parsers.
2) The current ORScorer calls Tally on its subscorers at the same time
it is skipping through documents, rather than at the end of the phase.
Is this a good practice that I should emulate? My instinct is that
it would be inefficient for certain types of queries:
((expensive-phrase OR expensive-phrase) AND rare-filter). Or is this
less problematic than it seems? If it's fine in general, then I'd be
tempted to combine the Next and Tally stages more generally.
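To illustrate the concern in question 2, here is a toy sketch, again in Python for brevity and not a port of the real ORScorer. All class names and the tally_calls counter are hypothetical. It runs the same (a OR b) AND rare-filter query twice: once with an OR scorer that tallies its subscorers on every advance, and once with one that defers tallying until the parent actually asks for a score.

```python
class ListScorer:
    """Leaf scorer over sorted doc ids; tally() stands in for the costly step."""

    def __init__(self, doc_ids, weight=1.0):
        self.doc_ids = sorted(doc_ids)
        self.pos = -1
        self.weight = weight
        self.tally_calls = 0

    def next(self):
        self.pos += 1
        return self.pos < len(self.doc_ids)

    def doc(self):
        return self.doc_ids[self.pos]

    def tally(self):
        self.tally_calls += 1
        return self.weight


class OrScorer:
    """Union of subscorers; eager=True tallies on every next()."""

    def __init__(self, subs, eager):
        self.subs = [s for s in subs if s.next()]  # prime, drop empty ones
        self.eager = eager
        self.current = None
        self.score = None

    def next(self):
        if self.current is not None:
            # Advance subscorers sitting on the current doc; drop exhausted ones.
            self.subs = [s for s in self.subs
                         if s.doc() != self.current or s.next()]
        if not self.subs:
            return False
        self.current = min(s.doc() for s in self.subs)
        self.score = self._sum() if self.eager else None
        return True

    def doc(self):
        return self.current

    def _sum(self):
        return sum(s.tally() for s in self.subs if s.doc() == self.current)

    def tally(self):
        if self.score is None:  # deferred: compute only when asked
            self.score = self._sum()
        return self.score


class AndScorer:
    """Intersection; tallies subscorers only on docs that all of them match."""

    def __init__(self, subs):
        self.subs = subs

    def next(self):
        for s in self.subs:
            if not s.next():
                return False
        while True:
            target = max(s.doc() for s in self.subs)
            if all(s.doc() == target for s in self.subs):
                return True
            for s in self.subs:
                while s.doc() < target:
                    if not s.next():
                        return False

    def doc(self):
        return self.subs[0].doc()

    def tally(self):
        return sum(s.tally() for s in self.subs)


def run(eager):
    a = ListScorer([1, 2, 3, 5, 8])        # stands in for an expensive phrase
    b = ListScorer([2, 3, 4, 8, 9])        # stands in for an expensive phrase
    rare = ListScorer([8], weight=0.0)     # stands in for the rare filter
    top = AndScorer([OrScorer([a, b], eager=eager), rare])
    hits = []
    while top.next():
        hits.append((top.doc(), top.tally()))
    return hits, a.tally_calls + b.tally_calls


eager_hits, eager_cost = run(eager=True)
lazy_hits, lazy_cost = run(eager=False)
```

On this toy data both runs return the single hit for doc 8, but the eager version makes 10 tally calls on the two "phrase" scorers where the deferred version makes 2. Whether that gap matters in practice depends on how expensive Tally really is, which is exactly what I'm asking. Incidentally, the scorer tree here is put together by ordinary calling code, which is roughly what I have in mind in question 1.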
nate [at] verse
ps. I like the direction of KinoSearch::Simple, particularly the
integration of the indexing and searching. I'm tempted to think that
rather than calling it 'Simple', you should just call it 'KinoSearch'
and eventually have it be the main API.