nate at verse
Jul 14, 2007, 5:20 PM
removing position code from scorer subclasses
On 7/13/07, Marvin Humphrey <marvin [at] rectangular> wrote:
> On Jul 13, 2007, at 11:49 AM, Nathan Kurz wrote:
> > What should I tackle next?
> Remove position-generating code in Scorer subclasses and Tally,
> now that the decision has been made not to implement the
> position-aware BooleanScorer in core.
I think I understand enough to do what you suggest, but could we
discuss the end goal a bit first? In particular, I'd like to talk
more about how positions will be made available to a parent scorer,
and how custom scorers will interface with custom indexes.
As I mentioned in a previous message, I think that PhraseScorer is a
good example to think about. If we come up with a solution that
allows for a generalized and efficient phrase scorer, I think we'll be
on the right path. Here are some premises I'm starting with:
1. There are two stages to a scorer: matching and scoring.
2. Matching is done for every possible document, so it should be optimizable.
3. Scoring only occurs for documents that match, so it can be slower.
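
To make the premises concrete, here is a minimal sketch of the
two-stage loop. ToyScorer, matches(), tally() and search() are
hypothetical names invented for illustration, not the current API,
and the in-memory dict stands in for a real posting list.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyScorer:
    """Hypothetical scorer backed by an in-memory {doc_id: term_freq} map."""
    postings: Dict[int, int]

    def matches(self, doc_id: int) -> bool:
        # Stage 1: cheap membership test, run for every candidate document.
        return doc_id in self.postings

    def tally(self, doc_id: int) -> float:
        # Stage 2: costlier scoring, run only for documents that matched.
        return float(self.postings[doc_id])


def search(scorer: ToyScorer, max_doc: int) -> List[Tuple[int, float]]:
    hits = []
    for doc_id in range(max_doc):
        if not scorer.matches(doc_id):
            continue  # non-matching documents never reach the scoring stage
        hits.append((doc_id, scorer.tally(doc_id)))
    return hits
```

The point of the split is that only matches() sits on the hot path;
tally() can afford to be slower because it runs far less often.
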
You are proposing that the position information (when needed) be
passed along as part of the Tally. This creates problems for scorers
like PhraseScorer, which need[*] to get positions from their subscorers
before determining whether the document is a match.
The current version of PhraseScorer sidesteps this by working directly
off the position data in the Posting. While I like the efficiency of
this approach, I'm worried it is not going to extend well to custom
index formats. In particular, I don't want the author of a new index
format to be required to write a CustomPhraseScorer. I'd also like it
to be possible to do phrase-type matching on items other than raw
terms: things like "a b|c d".
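
As a sketch of what phrase-type matching over subscorers might look
like, the code below matches on position sets rather than on raw
Posting data. It assumes "a b|c d" means "a, then b or c, then d",
and every name here (DocPositions, or_positions, phrase_positions)
is hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical per-document data: term -> token positions in that document.
DocPositions = Dict[str, List[int]]


def or_positions(doc: DocPositions, terms: List[str]) -> Set[int]:
    # An OR slot matches wherever any of its alternatives occurs.
    result: Set[int] = set()
    for term in terms:
        result |= set(doc.get(term, []))
    return result


def phrase_positions(doc: DocPositions, slots: List[List[str]]) -> List[int]:
    """Return start positions where slot i matches at start + i for all i."""
    if not slots:
        return []
    candidates = or_positions(doc, slots[0])
    for offset, slot in enumerate(slots[1:], start=1):
        # Shift each slot back by its offset so it lines up with the start.
        candidates &= {p - offset for p in or_positions(doc, slot)}
        if not candidates:
            break  # not a match: stop before any scoring work is done
    return sorted(candidates)


doc = {"a": [0, 5], "b": [1], "c": [6], "d": [2, 7]}
starts = phrase_positions(doc, [["a"], ["b", "c"], ["d"]])  # [0, 5]
```

Note that the phrase matcher never looks at the index format; it only
needs each slot to hand back a set of positions.
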
I think there needs to be a way to signal to a Scorer that its parent
wants its positions to be set. You suggested that this be done by
subclasses instead, but I don't see this working well. First, it
would double the number of Scorer classes, or worse, it wouldn't
double them and if you wanted a position scorer you'd have to subclass
the existing scorers yourself. Second, I think that each
position-passing subclass is going to end up duplicating most of the
code within its 'parent' class (efficient but bug-prone), or doing the
work twice (think of a PhraseScorer that has already found a phrase).
I think the path between these rocks is going to involve conflating
some of the notions of Posting and Scorer, in particular making
position information available from a Scorer in a way analogous to
how PhraseScorer currently grabs it directly from
Postings. Rather than having the position data within the Tally, I
think it needs to be part of the base Scorer class, so that it can
always be accessed as directly as Scorer->positions.
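
Roughly what I have in mind, as a sketch only: the base class carries
the positions, and a term scorer fills them in as it advances. The
class layout and the next() method are assumptions made for
illustration, and the dict stands in for real posting data.

```python
from typing import Dict, List, Optional


class Scorer:
    """Hypothetical base class: every scorer exposes `positions` directly."""

    def __init__(self) -> None:
        self.doc_id: Optional[int] = None
        self.positions: List[int] = []  # positions of the match in doc_id

    def next(self) -> bool:
        raise NotImplementedError


class TermScorer(Scorer):
    """Hypothetical term scorer over {doc_id: [positions]}."""

    def __init__(self, postings: Dict[int, List[int]]) -> None:
        super().__init__()
        self._postings = postings
        self._doc_ids = iter(sorted(postings))

    def next(self) -> bool:
        doc_id = next(self._doc_ids, None)
        if doc_id is None:
            # Exhausted: clear state so a parent never reads stale positions.
            self.doc_id, self.positions = None, []
            return False
        self.doc_id = doc_id
        self.positions = self._postings[doc_id]
        return True
```

A parent scorer would then read sub.positions right after sub.next(),
before deciding whether the document matches and without any Tally.
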
Thus here's my view of the end goal of the Scorer classes:
Scorer components (And, Or, Nor, Phrase) are directly reusable.
1. Can be used for match-only (simple: don't call Tally).
2. Can be used for scoring (great: call Tally on subscorers as necessary).
3. Can make positions of matches available to the parent scorer prior to Tally.
Custom scoring algorithms can layer on top of standard components.
1. Base components do not presume any particular scoring scheme.
2. Scorer can signal to its subscorers that it needs position data.
3. Scorers fail gracefully if required data not in the index.
Custom index formats require only a single custom term scorer.
1. This term scorer is the sole interface that a search touches.
2. Scorers don't need to be aware of underlying index format.
3. An mmap'ed index format that trades size for speed should be possible.
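
One possible shape for the signalling and graceful-failure goals,
sketched under heavy assumptions: CustomTermScorer,
request_positions(), positions_for() and PositionsUnavailable are
all made-up names, and the dicts stand in for whatever the custom
index format actually stores.

```python
from typing import Dict, List, Optional


class PositionsUnavailable(Exception):
    """Raised when a parent asks for positions the index does not store."""


class CustomTermScorer:
    """Hypothetical term scorer: the only class that knows the index layout."""

    def __init__(self, freqs: Dict[int, int],
                 positions: Optional[Dict[int, List[int]]] = None) -> None:
        self._freqs = freqs
        self._positions = positions  # None if the index has no positions
        self.wants_positions = False

    def request_positions(self) -> None:
        # Parent-to-child signal; fail early instead of returning bad data.
        if self._positions is None:
            raise PositionsUnavailable("index stores no position data")
        self.wants_positions = True

    def positions_for(self, doc_id: int) -> List[int]:
        if not self.wants_positions or self._positions is None:
            return []  # skip the decoding work nobody asked for
        return self._positions.get(doc_id, [])


def can_supply_positions(subscorers: List[CustomTermScorer]) -> bool:
    """Return True if every subscorer accepted the request for positions."""
    try:
        for sub in subscorers:
            sub.request_positions()
    except PositionsUnavailable:
        return False
    return True
```

With something like this, a phrase-style parent could check up front
whether the index can serve it, and a match-only search would never
pay for position decoding.
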
How do these strike you as goals? These make sense to me, but perhaps
I'm confusing a particular implementation with a general need.
Alternatively, I'm willing to just remove the position data as you
suggest, but I think it might produce a better end result if I better
understood where we were headed.
nate [at] verse
* According to my premises, which might be wrong. One way out would be
to relax the requirement that Tally only be called on matching
documents. Things would then work the way the current
ORScorer does, which seems to work. But we can come up with
cases where the performance of this approach might be poor, and I'm
worried that these cases might end up being my normal usage pattern.