
marvin at rectangular
Jan 23, 2008, 10:33 PM
Subclassable Highlighter (was Re: KinoSearch feature suggestions)
On Jan 23, 2008, at 6:59 AM, Peter Karman wrote:

> fwiw, Search::Tools offers highlighting and excerpting (snipping) via the
> building of complex regular expressions. See
> http://search.cpan.org/~karman/Search-Tools-0.16/lib/Search/Tools/Snipper.pm
> http://search.cpan.org/~karman/Search-Tools-0.16/lib/Search/Tools/HiLiter.pm
>
> The algorithm I use for snipping/excerpting is slow, and I would love to
> see how a different approach could improve performance. I believe the
> primary reason my approach is slow is that it uses a big regex.

KinoSearch's highlighter is fast because it utilizes information generated
at index time and stored in the "term vectors" file. Each "vectorized"
field's data consists of...

  * Term text.
  * Each term's position in the field, measured in tokens.
  * Each position's start offset, measured in Unicode code points.
  * Each position's end offset, measured in Unicode code points.

Because the start offset and end offset are stored, it is possible to
highlight stemmed terms accurately. For instance, if a field starts off
with "Horses are fast", the stemmed text "hors" is stored along with a
start offset of 0 and an end offset of 6, allowing us to insert
highlighting emphasis marks at those positions. The same technique could be
used to e.g. highlight synonyms after synonym analysis.

The essence of the Highlighter is that after we have a result set, we rerun
the query against the documents one-at-a-time and see what parts are most
important. For this to work, we need...

  * Query/Scorer classes which are capable of telling us why they scored a
    document the way they did. Right now, this is done via
    $query->extract_terms, but that's a crude mechanism that will not hold
    up for esoteric subclasses of Query.
  * Access to the parsed, analyzed document.

If we did not store the "term vectors" information, we would have the
option of rerunning analysis on the fly.
Unfortunately, this doesn't work well if you have either large documents or
costly Analyzer chains. So, storing some serialized version of the parsed
document which can be reassembled into an object quickly will remain a
crucial facet of the KinoSearch highlighter.

I wish it were realistic to perform analysis on the fly, because then it
would not be necessary to worry about the file format of persistent term
vector data within the index. TermVectors probably won't be part of the
official file spec, in order to limit the clutter. However, for backwards
compatibility purposes, we'll still be stuck with the format once it's set.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
KinoSearch mailing list
KinoSearch[at]rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
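
To make the offset-based highlighting from the "Horses are fast" example concrete, here is a minimal sketch in Python. It is not KinoSearch code and does not reflect the actual term vectors file format: the `Posting` class, the `TERM_VECTORS` dictionary, the `highlight` function, and the `<strong>` markers are all hypothetical names chosen for illustration. The offsets for "hors" (0 and 6) come from the example above; the entry for "fast" is an assumed extra, and "are" is left out as if an analyzer had dropped it. The end offset is treated as exclusive, and Python string indices count Unicode code points, which matches the units described in the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One occurrence of an analyzed term in a field (hypothetical layout)."""
    position: int  # token position within the field
    start: int     # start offset, in Unicode code points
    end: int       # end offset, in Unicode code points (exclusive)


FIELD_TEXT = "Horses are fast"

# Hypothetical per-field term vector data, keyed by analyzed term text.
# "hors" is the stemmed form of "Horses"; its offsets point back into
# the original, unstemmed field text.
TERM_VECTORS = {
    "hors": [Posting(position=0, start=0, end=6)],
    "fast": [Posting(position=2, start=11, end=15)],
}


def highlight(text, term_vectors, query_terms,
              pre="<strong>", post="</strong>"):
    """Wrap every stored occurrence of the query terms in pre/post marks."""
    # Collect the (start, end) spans for all matching terms, in text order.
    spans = sorted(
        (p.start, p.end)
        for term in query_terms
        for p in term_vectors.get(term, [])
    )
    pieces = []
    cursor = 0
    for start, end in spans:
        if start < cursor:
            continue  # skip a span that overlaps one already emitted
        pieces.append(text[cursor:start])              # untouched text
        pieces.append(pre + text[start:end] + post)    # highlighted span
        cursor = end
    pieces.append(text[cursor:])                       # trailing text
    return "".join(pieces)


print(highlight(FIELD_TEXT, TERM_VECTORS, ["hors"]))
# <strong>Horses</strong> are fast
```

Because the lookup is by analyzed term and the marks are placed by stored offsets, no regex is run over the document text and the surface form "Horses" is highlighted even though the query term is "hors". A real highlighter would also need to escape the field text for its output format and choose an excerpt window, both of which this sketch leaves out.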