
marvin at rectangular
Jul 16, 2008, 11:21 PM
Post #2 of 2
On Jul 16, 2008, at 6:23 AM, Michael Greb wrote:

>>> Perhaps it may make sense to have an argument that allows you to
>>> specify a character/string to prefer breaking on that defaults to
>>> '\.'.

Please note that Highlighter's API has changed since the last dev release.

Here's Highlighter's current algorithm:

  * Hand the Query a document and ask it what sections of the field in
    question it thinks are important, if any. Any "hot" sections are
    expressed via HighlightSpan objects, which define a start_offset, an
    end_offset, and a floating point "weight".
  * Take all the HighlightSpan objects and create a HeatMap, which muxes
    all the spans plus adds bonus heat whenever spans occur close together.
  * Analyze the HeatMap and find the hottest section of the field, using
    boundaries a little larger than the desired excerpt size. (Right now,
    it's find_best_fragment() that does this, but it's not clear that that
    method needs to be public.)
  * Use Highlighter::find_sentence_boundaries to locate bounds inside and
    immediately outside the hot window.
  * Have Highlighter::raw_excerpt determine the formal boundaries of the
    excerpt. Use sentence boundaries when possible, but apply ellipses
    when necessary.
  * Have Highlighter::highlight_excerpt process the raw excerpt by applying
    Highlighter::highlight and Highlighter::encode.

The question right now is what the APIs should look like for
find_sentence_boundaries() and raw_excerpt(). FWIW, they are surprisingly
hard to implement, because grammatical inconsistencies are hard to avoid
and there are lots of edge cases.

For starters: Right now, find_sentence_boundaries() returns an array of
start offsets delimiting sentence starts. However, this is not ideal; it
would be better to know what the exact end offsets are as well.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
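
To illustrate the point about end offsets, here is a minimal sketch in Python of a boundary finder that returns (start_offset, end_offset) pairs rather than start offsets alone. It is not KinoSearch code and does not reflect the actual find_sentence_boundaries() signature; the name find_sentence_spans and the rule "a sentence ends at '.', '!' or '?' followed by whitespace or end of text" are assumptions made for the example.

```python
import re

# Assumed rule (not KinoSearch's): a sentence ends at a run of '.', '!' or '?'
# that is followed by whitespace or by the end of the text.
_SENTENCE_END = re.compile(r"[.!?]+(?=\s|$)")


def find_sentence_spans(text):
    """Return a list of (start_offset, end_offset) pairs, end exclusive.

    Hypothetical helper for illustration only.
    """
    spans = []
    pos = 0
    length = len(text)
    while pos < length:
        # Skip whitespace so the start offset lands on the sentence's first char.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        match = _SENTENCE_END.search(text, pos)
        if match:
            end = match.end()
        else:
            # No terminator left: the final fragment runs to the last
            # non-whitespace character of the text.
            end = len(text.rstrip())
        spans.append((pos, end))
        pos = end
    return spans


if __name__ == "__main__":
    sample = "One. Two!  Three"
    for start, end in find_sentence_spans(sample):
        print(start, end, repr(sample[start:end]))
```

With both offsets available, a caller such as raw_excerpt() could tell whether the excerpt window stops exactly at a sentence end or cuts into a sentence and needs an ellipsis. The sketch also shows one of the edge cases mentioned above: an abbreviation such as "e.g. " is treated as a sentence end under this simple rule.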