
nate at verse
Jul 14, 2007, 5:20 PM
removing position code from scorer subclasses
On 7/13/07, Marvin Humphrey <marvin [at] rectangular> wrote:
> On Jul 13, 2007, at 11:49 AM, Nathan Kurz wrote:
>
> > What should I tackle next?
>
> Remove position-generating code in Scorer subclasses and Tally,
> now that the decision has been made not to implement the
> position-aware BooleanScorer in core.

I think I understand enough to do what you suggest, but could we discuss the end goal a bit first? In particular, I'd like to talk more about how positions will be made available to a parent scorer, and how custom scorers will interface with custom indexes.

As I mentioned in a previous message, I think that PhraseScorer is a good example to think about. If we come up with a solution that allows for a generalized and efficient phrase scorer, I think we'll be on the right path. Here are some premises I'm starting with:

1. There are two stages to a scorer: matching and scoring.
2. Matching is done for every possible document, thus should be optimizable.
3. The scoring stage only occurs for documents that match, so can be slower.

You are proposing that the position information (when needed) be passed along as part of the Tally. This creates problems for scorers like PhraseScorer, which needs[*] to get positions from its subscorers before determining if the document is a match.

The current version of PhraseScorer sidesteps this by working directly off the position data in the Posting. While I like the efficiency of this approach, I'm worried it is not going to extend well to custom index formats. In particular, I don't want the author of a new index format to be required to write a CustomPhraseScorer. I'd also like it to be possible to do phrase-type matching on items other than raw terms: things like "a b|c d".

I think there needs to be a way to signal to a Scorer that its parent wants its positions to be set. You suggested that this be done by subclasses instead, but I don't see this working well.
First, it would double the number of Scorer classes, or worse, it wouldn't double them and if you wanted a position scorer you'd have to subclass the existing scorers yourself. Second, I think that each position-passing subclass is going to end up duplicating most of the code within its 'parent' class (efficient but bug-prone), or doing the work twice (think of a PhraseScorer that has already found a phrase).

I think the path between these rocks is going to involve conflating some of the notions of Posting and Scorer, in particular making position information available from a Scorer in a way analogous to how PhraseScorer currently grabs it directly from Postings. Rather than having the position data within the Tally, I think it needs to be part of the base Scorer class, so that it can always be accessed as directly as Scorer->positions.

Thus here's my view of the end goal of the Scorer classes:

Scorer components (And, Or, Nor, Phrase) are directly reusable.
1. Can be used for match-only (simple -- don't call Tally).
2. Can be used for scoring (great -- call Tally on subscorers as necessary).
3. Make positions of matches available to the parent scorer prior to Tally.

Custom scoring algorithms can layer on top of standard components.
1. Base components do not presume any particular scoring scheme.
2. A Scorer can signal to its subscorers that it needs position data.
3. Scorers fail gracefully if required data is not in the index.

Custom index formats require only a single custom term scorer.
1. This term scorer is the sole interface that a search touches.
2. Scorers don't need to be aware of the underlying index format.
3. A speed-for-size trading mmap'ed index format should be possible.

How do these strike you as goals? These make sense to me, but perhaps I'm confusing a particular implementation with a general need.
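
As a rough illustration of the goals listed above, here is a minimal sketch in Python. It is not KinoSearch code: every class, method, and attribute name (Scorer, TermScorer, OrScorer, PhraseScorer, request_positions, collect) and the toy index layout are assumptions made up for the example. It only tries to show positions living on the base scorer, a parent asking its subscorers for them, and a phrase scorer that handles something like "a b|c d" without touching the index format.

```python
class Scorer:
    """Hypothetical base class: positions live on the scorer, not in a Tally."""

    def __init__(self):
        self.doc = None              # current doc id; None before next() and at the end
        self.positions = []          # match positions in the current doc
        self.want_positions = False  # set by a parent that needs positions

    def request_positions(self):
        self.want_positions = True

    def _exhaust(self):
        self.doc, self.positions = None, []
        return False


class TermScorer(Scorer):
    """The only class that knows the (toy) index format: {doc_id: [positions]}."""

    def __init__(self, postings):
        super().__init__()
        self._postings = postings
        self._docs = iter(sorted(postings))

    def next(self):
        self.doc = next(self._docs, None)
        if self.doc is None:
            return self._exhaust()
        # Skip the position work unless a parent asked for it.
        self.positions = list(self._postings[self.doc]) if self.want_positions else []
        return True


class OrScorer(Scorer):
    def __init__(self, subscorers):
        super().__init__()
        self._subs = list(subscorers)
        self._started = False

    def request_positions(self):
        super().request_positions()
        for s in self._subs:         # pass the request down the tree
            s.request_positions()

    def next(self):
        for s in self._subs:
            # Advance subscorers that are unstarted or sit on the doc just returned.
            if not self._started or s.doc == self.doc:
                s.next()
        self._started = True
        live = [s for s in self._subs if s.doc is not None]
        if not live:
            return self._exhaust()
        self.doc = min(s.doc for s in live)
        if self.want_positions:
            self.positions = sorted(
                {p for s in live if s.doc == self.doc for p in s.positions})
        else:
            self.positions = []
        return True


class PhraseScorer(Scorer):
    """Matches consecutive positions across subscorers; positions = phrase starts."""

    def __init__(self, subscorers):
        super().__init__()
        self._subs = list(subscorers)
        for s in self._subs:
            s.request_positions()    # positions are needed to decide a match at all

    def _phrase_starts(self):
        starts = set(self._subs[0].positions)
        for offset, s in enumerate(self._subs[1:], start=1):
            starts &= {p - offset for p in s.positions}
        return sorted(starts)

    def next(self):
        for s in self._subs:
            if not s.next():
                return self._exhaust()
        while True:
            target = max(s.doc for s in self._subs)
            for s in self._subs:
                while s.doc < target:
                    if not s.next():
                        return self._exhaust()
            if all(s.doc == target for s in self._subs):
                starts = self._phrase_starts()
                if starts:
                    self.doc, self.positions = target, starts
                    return True
                if not self._subs[0].next():
                    return self._exhaust()


def collect(scorer):
    hits = []
    while scorer.next():
        hits.append((scorer.doc, list(scorer.positions)))
    return hits


# Toy index, invented for the example: term -> {doc_id: [positions]}.
index = {
    "a": {1: [0], 2: [0, 5], 3: [2]},
    "b": {1: [1], 3: [7]},
    "c": {2: [6], 3: [3]},
    "d": {1: [2], 2: [7], 3: [9]},
}
b_or_c = OrScorer([TermScorer(index["b"]), TermScorer(index["c"])])
phrase = PhraseScorer([TermScorer(index["a"]), b_or_c, TermScorer(index["d"])])
hits = collect(phrase)
print(hits)  # [(1, [0]), (2, [5])]
```

In this sketch only TermScorer reads the index, so a different index format would need only a different term scorer. A scorer that nobody asks for positions leaves its positions list empty and does no position work.
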
Alternatively, I'm willing to just remove the position data as you suggest, but I think it might produce a better end result if I better understood where we were headed.

Nathan Kurz
nate [at] verse

* According to my premises, which might be wrong. One way out would be to relax the requirement that Tally only be called on Matching documents. This would have things work the way that the current ORScorer does, which seems to be working. But we can come up with cases where the performance of this approach might be poor, and I'm worried that these cases might end up being my normal usage pattern.
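
The cost concern in the footnote can be made concrete with a toy comparison, again in Python and again purely hypothetical: CountingScorer, match_then_tally, and tally_everything are invented names, and the code does not reproduce how ORScorer or Tally actually work. It only counts how often a tally step runs when an AND-style query pairs a very common term with a rare one.

```python
class CountingScorer:
    """Toy term scorer over {doc_id: term_frequency}; counts tally() calls."""

    def __init__(self, postings):
        self.postings = postings
        self.tally_calls = 0

    def matches(self, doc):
        # Cheap membership test standing in for the matching stage.
        return doc in self.postings

    def tally(self, doc):
        # Stand-in for the (assumed more expensive) scoring stage.
        self.tally_calls += 1
        return float(self.postings.get(doc, 0))


def match_then_tally(subs, candidates):
    """Tally only documents that every subscorer matches."""
    hits = {}
    for doc in candidates:
        if all(s.matches(doc) for s in subs):
            hits[doc] = sum(s.tally(doc) for s in subs)
    return hits


def tally_everything(subs, candidates):
    """Tally every candidate, then discard those with a zero component."""
    hits = {}
    for doc in candidates:
        scores = [s.tally(doc) for s in subs]
        if all(score > 0 for score in scores):
            hits[doc] = sum(scores)
    return hits


def make_subs():
    common = {doc: 1 for doc in range(1000)}  # term present in every doc
    rare = {10: 2, 500: 3}                    # term present in two docs
    return [CountingScorer(common), CountingScorer(rare)]


lazy_subs, eager_subs = make_subs(), make_subs()
lazy_hits = match_then_tally(lazy_subs, range(1000))
eager_hits = tally_everything(eager_subs, range(1000))
lazy_calls = sum(s.tally_calls for s in lazy_subs)
eager_calls = sum(s.tally_calls for s in eager_subs)
print(lazy_hits == eager_hits, lazy_calls, eager_calls)  # True 4 2000
```

Both strategies return the same hits here, but the second runs the tally step on every candidate. Whether that matters in practice depends on how expensive a real Tally is and how selective typical queries are, which this toy does not measure.
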