
marvin at rectangular
Jan 28, 2008, 5:27 PM
Post #21 of 62
On Jan 28, 2008, at 3:39 PM, Father Chrysostomos wrote:

> On Jan 27, 2008, at 7:56 PM, Marvin Humphrey wrote:
>
>> my $highlighter = KinoSearch::Highlight::Highlighter->new(
>>     searcher => $searcher,
>>     query    => $query,
>> );
>
> Another problem with this approach is that the highlighter can only
> be used for one query. If a second search is made with the same
> $searcher, another highlighter is needed.

True, but I can't think of where that would cause a problem.  Can you
think of one?

In contrast, if we add a Query member to each HitDoc, that means the
Query will have to be serialized/deserialized if we send the hit over
the network.

Part of the reason that Highlighter's API looks the way it does was the
limitation that Highlighters had to do their work from inside a Hits
object.  That was a kludge, necessitated by the fact that it was
possible to know the doc_num from within the Hits object (and thus
possible to fetch the relevant DocVector), but impossible to know the
doc_num from the hashref returned by $hits->fetch_hit_hashref.

Now that we're about to return a HitDoc object instead of a plain
hashref, we're not bound by that constraint, and I'm very much looking
forward to zapping Hits::create_excerpts.

In fact, we could simplify further.  Now that we don't have to stick
all our excerpts into $hashref->{excerpts}, we can return the excerpts
as scalars, one-at-a-time -- eliminating both add_spec() and
generate_excerpts().

    my $highlighter = KinoSearch::Highlight::Highlighter->new(
        searcher       => $searcher,     # required
        query          => $query,        # required
        field          => 'content',     # required
        excerpt_length => 150,           # default: 200
        formatter      => $formatter,    # default: a SimpleHTMLFormatter
        encoder        => $encoder,      # default: a SimpleHTMLEncoder
    );
    for my $hit ( $hits->fetch_hit ) {
        my $excerpt = $highlighter->single_excerpt($hit);
        ...
    }

Juggling how params get set is a superficial change compared with e.g.
making single_excerpt() public, so it isn't that important.
However, I wonder if this lighter-weight vision for a highlighter makes
you more comfortable.  To my mind, it's OK if highlighters are
ephemeral and you create a new one for each query.

> Unless $searcher can have a ->get_last_query method....

Yikes, that'd be asking for trouble!

> Also, when it comes to the highlight_data method, which class
> should be responsible for removing duplicate HighlightSpans? Should
> I make this a method of Highlighter itself?

When would there be duplicates?  I suppose you'd see the same positions
multiple times for a query like 'lincoln "lincoln bedroom"', but you'd
get different weights.  That query would probably yield two spans with
data like this...

    { start_offset => 15, end_offset => 22, weight => 1.2 }
    { start_offset => 15, end_offset => 30, weight => 3.5 }

... with the second span having a higher weight to reflect the relative
rarity of the phrase compared to the single term.

> I don’t remember whether I told you: I’m working on these changes
> to Highlighter, and I think I will have a patch ready soon.

I'm working on the Doc class right now.  You should see some commits
over the next few hours.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch
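
On the duplicate-span question above, here is a minimal sketch of what
"removing duplicates" might look like if it were done on plain
hashrefs.  Everything in it is an assumption for illustration:
dedupe_spans() is a hypothetical helper, not a KinoSearch method, and
the thread does not settle which class should own this step or how
ties should be resolved.  The sketch treats two spans as duplicates
only when both offsets match, and keeps the one with the highest
weight.  Spans that merely overlap, like the two in the 'lincoln
"lincoln bedroom"' example, are left alone.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of KinoSearch): collapse spans that cover
# exactly the same character range, keeping the highest-weighted one.
sub dedupe_spans {
    my @spans = @_;
    my %best;    # "start-end" => best span seen so far for that range
    for my $span (@spans) {
        my $key = "$span->{start_offset}-$span->{end_offset}";
        if ( !exists $best{$key}
            || $span->{weight} > $best{$key}{weight} )
        {
            $best{$key} = $span;
        }
    }

    # Return the survivors in document order.
    my @unique = sort {
               $a->{start_offset} <=> $b->{start_offset}
            || $a->{end_offset}   <=> $b->{end_offset}
    } values %best;
    return @unique;
}

# The two spans from the example, plus one exact repeat of the first.
my @spans = (
    { start_offset => 15, end_offset => 22, weight => 1.2 },
    { start_offset => 15, end_offset => 30, weight => 3.5 },
    { start_offset => 15, end_offset => 22, weight => 1.2 },
);

for my $span ( dedupe_spans(@spans) ) {
    printf "%d-%d weight %.1f\n",
        @{$span}{qw( start_offset end_offset weight )};
}
```

Keying on the offset pair rather than on start_offset alone is what
keeps the term span and the phrase span distinct, which matches the
point that they carry different weights.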