
nate at verse
Jul 2, 2007, 2:17 PM
On 7/2/07, Marvin Humphrey <marvin [at] rectangular> wrote:
> Rather than add new public APIs, it's
> time to yank Hits->seek (and simplify Searcher),

This is definitely a direction I would like to see.

As a raving lunatic, I have the nagging fear that I will come across as a raving lunatic if I keep raving without offering code. But in my crazed silence, I've been making some stabs at simplifying the search API. I've been peeling away layers, trying to figure out the minimum that one can work with.

In short, what I've done is to write a set of more generic search classes (Perl and C) that don't presume any particular scoring system. Right now I'm concentrating on my specific application of searching short documents with Proximity and Scope scoring, but ideally I'll figure out how to retrofit the existing scoring mechanisms as well.

It's not working yet, and may not be the right direction for KinoSearch as a whole, but it's helping me understand the task better and might eventually make for good fodder for a simplified KinoSearch API.

The hierarchy I'm working with is much flatter and more straightforward: a reusable Query produces an index-specific Scorer that returns a HitCollector: no Searcher, no Weight objects, no Similarity object, no twisty mazes, no delayed inits, and no Hits->seek.

It's probably not quite generic enough for general use, but I think it's going to be possible to move it in that direction once I get it working. It's certainly simpler to comprehend, and I think it might end up more efficient as well.

I started along the path of trying to subclass the existing Scoring classes to make them work the way I wanted, but it wasn't working well. Line by line the existing code is great to work with, but the overall hierarchy hardcodes TF/IDF scoring at a fairly deep level. While it's been slow, I've been much happier trying to factor this out rather than overriding it class by class.
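
A minimal sketch of the flatter hierarchy described above (a reusable Query produces an index-specific Scorer that returns a HitCollector) might look roughly like this in C. Every type and function name here is hypothetical and chosen for illustration; none of it is taken from KinoSearch, and the constant score is only a placeholder for whatever scoring system gets plugged in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy index: one ascending list of doc ids per term. */
typedef struct {
    const int *doc_ids;
    size_t     count;
} PostingList;

typedef struct {
    const PostingList *lists;      /* indexed by term id */
    size_t             num_terms;
} Index;

/* Reusable and index-independent: holds only what was asked for. */
typedef struct {
    size_t term_id;
} Query;

/* Index-specific: a cursor over one query's postings in one index. */
typedef struct {
    const PostingList *postings;   /* NULL if the term is not in the index */
    size_t             pos;
} Scorer;

#define MAX_HITS 16

typedef struct {
    int    doc_ids[MAX_HITS];
    float  scores[MAX_HITS];
    size_t count;
} HitCollector;

/* Query + Index -> Scorer. No Searcher, Weight, or Similarity in between. */
Scorer query_make_scorer(const Query *q, const Index *ix) {
    assert(q != NULL && ix != NULL);
    Scorer s = { NULL, 0 };
    if (q->term_id < ix->num_terms) {
        s.postings = &ix->lists[q->term_id];
    }
    return s;
}

/* Scorer -> HitCollector. The score is a placeholder constant. */
HitCollector scorer_collect(Scorer *s) {
    assert(s != NULL);
    HitCollector hc = { {0}, {0.0f}, 0 };
    while (s->postings != NULL
           && s->pos < s->postings->count
           && hc.count < MAX_HITS) {
        hc.doc_ids[hc.count] = s->postings->doc_ids[s->pos++];
        hc.scores[hc.count]  = 1.0f;
        hc.count++;
    }
    return hc;
}
```

The same Query value can be handed to query_make_scorer() with a different Index to get a fresh Scorer, which is the sense in which the Query is reusable and the Scorer is index-specific.
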
And by redoing it, I'm getting a much clearer understanding about how the current setup works. I've got tons of questions, but I'll limit myself to two for now:

1) Is there a good reason to keep BooleanScorer at the C level, rather than moving it up into Perl? Its main function is to create the real scorer out of the Boolean components (ANDScorer, ORScorer, etc) and it seems like this real scorer could be assembled just as well by the query processor. And having this all in Perl would give people a better example of how to create their own custom Query parsers.

2) The current ORScorer calls Tally on its subscorers at the same time it is skipping through documents, rather than at the end of the phase. Is this a good practice that I should emulate? My instinct is that it would be inefficient for certain types of queries: ((expensive-phrase OR expensive-phrase) AND rare-filter). Or is this less problematic than it seems? If it's fine in general, then I'd be tempted to combine the Next and Tally stages more generally.

Nathan Kurz
nate [at] verse

ps. I like the direction of KinoSearch::Simple, particularly the integration of the indexing and searching. I'm tempted to think that rather than calling it 'Simple', you should just call it 'KinoSearch' and eventually have it be the main API.
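
To make the concern in question 2 concrete, here is a rough sketch that counts how often an expensive Tally would run for a query shaped like ((A OR B) AND rare-filter), once when tallying during iteration and once when deferring Tally until the filter has confirmed the document. This is a toy model under stated assumptions, not the actual ORScorer: SubScorer, sub_tally, and or_and_filter are hypothetical names, postings are plain sorted arrays, and the "expensive" work is just a counter.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NO_DOC INT_MAX   /* sentinel: subscorer is exhausted */

/* Hypothetical subscorer: a cursor over ascending doc ids. */
typedef struct {
    const int *docs;
    size_t     count;
    size_t     pos;
    int        tally_calls;   /* how many times the costly step ran */
} SubScorer;

static int sub_doc(const SubScorer *s) {
    return s->pos < s->count ? s->docs[s->pos] : NO_DOC;
}

/* Stand-in for expensive scoring work (e.g. phrase matching). */
static float sub_tally(SubScorer *s) {
    s->tally_calls++;
    return 1.0f;
}

/*
 * Walk the union of the subscorers and keep docs that are also in the
 * sorted filter list. If eager is nonzero, Tally runs for every doc the
 * OR visits; otherwise it runs only for docs that pass the filter.
 * Returns the number of hits; *score_sum receives the total score.
 */
int or_and_filter(SubScorer *subs, size_t nsubs,
                  const int *filter, size_t nfilter,
                  int eager, float *score_sum) {
    int    hits = 0;
    size_t f    = 0;
    *score_sum  = 0.0f;

    for (;;) {
        /* "Next" phase: smallest doc id among the subscorers. */
        int doc = NO_DOC;
        for (size_t i = 0; i < nsubs; i++) {
            if (sub_doc(&subs[i]) < doc) doc = sub_doc(&subs[i]);
        }
        if (doc == NO_DOC) break;

        /* Cheap filter check. */
        while (f < nfilter && filter[f] < doc) f++;
        int passes = (f < nfilter && filter[f] == doc);

        /* "Tally" phase: always (eager) or only on a confirmed match. */
        if (eager || passes) {
            float score = 0.0f;
            for (size_t i = 0; i < nsubs; i++) {
                if (sub_doc(&subs[i]) == doc) score += sub_tally(&subs[i]);
            }
            if (passes) {
                hits++;
                *score_sum += score;
            }
        }

        /* Move every subscorer sitting on this doc past it. */
        for (size_t i = 0; i < nsubs; i++) {
            if (sub_doc(&subs[i]) == doc) subs[i].pos++;
        }
    }
    return hits;
}
```

With subscorers matching docs {1, 2, 3, 4, 5} and {2, 4, 6, 8} and a filter containing only doc 4, both modes return the same single hit with the same score, but the eager mode makes 9 Tally calls while the deferred mode makes 2. Whether that gap matters in practice depends on how costly Tally really is compared with Next, which is exactly the open question.
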