Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: kinosearch: discuss

Subclassable Highlighter (was: Re: KinoSearch feature suggestions)

 

 

First page Previous page 1 2 3 Next page Last page  View All kinosearch discuss RSS feed   Index | Next | Previous | View Threaded


sprout at cpan

Jan 23, 2008, 12:48 PM

Post #1 of 62 (7570 views)
Permalink
Subclassable Highlighter (was: Re: KinoSearch feature suggestions)

On Jan 23, 2008, at 6:32 AM, Marvin Humphrey wrote:

>>> How about if we outsource excerpting to subclasses of a new class,
>>> KinoSearch::Highlight::Excerpter?
>>
>> I think I can have a patch for this in a couple of days.
>
> Sweet. :)

Since the highlighter’s main job is to create the excerpt, I think it
would actually be better if we made it easy to subclass by dividing up
its _gen_excerpt method.

So, we’d have:

• gen_excerpt

This will call starts_and_ends and calc_best_location, then pass
beginning and ending offsets for the excerpt to
gen_excerpt_from_offsets. A subclass can override this to call the
latter multiple times.

• starts_and_ends

Just _starts_and_ends renamed, so that subclasses can call it while
still using the public API.

• calc_best_location

_calc_best_location renamed, and made to return a list in list context.

• get_excerpt_from_offsets

This will ‘round off’ the offsets passed to it to the nearest sentence
boundary, if possible, and then call format_excerpt (passing it a
couple of flags to indicate whether ellipsis marks are needed).

• format_excerpt

This will take of all the formatting, calling the formatter and
encoder as needed, and adding ellipsis marks.


Please let me know if this is too complex and there is a better way I
haven’t thought of....

>
>
>> But the *offsets* of the page breaks need to be recorded. Counting
>> is not sufficient. I still have to think more about how this should
>> work—unless you have some ideas.
>
> We can modify that function to record offsets in a Perl array. This
> (untested) variant renders those offsets as counts of Unicode code
> points:
>
> [...]

I don’t know why I didn’t see this sooner, but the indexer/tokenizer/
whatever doesn’t need to care about form feeds. A highlighter subclass
can use your counting method (or y///) to see how many occur before
the excerpt, so that problem has solved itself, as it were.
_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


sprout at cpan

Jan 23, 2008, 2:10 PM

Post #2 of 62 (6923 views)
Permalink
Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions) [In reply to]

On Jan 23, 2008, at 12:48 PM, I wrote:

> Since the highlighter’s main job is to create the excerpt, I think
> it would actually be better if we made it easy to subclass by
> dividing up its _gen_excerpt method.
>
> So, we’d have:
>
> • gen_excerpt
>
> This will call starts_and_ends and calc_best_location, then pass
> beginning and ending offsets for the excerpt to
> gen_excerpt_from_offsets. A subclass can override this to call the
> latter multiple times.
>
> • starts_and_ends
>
> Just _starts_and_ends renamed, so that subclasses can call it while
> still using the public API.
>
> • calc_best_location
>
> _calc_best_location renamed, and made to return a list in list
> context.
>
> • get_excerpt_from_offsets
>
> This will ‘round off’ the offsets passed to it to the nearest
> sentence boundary, if possible, and then call format_excerpt
> (passing it a couple of flags to indicate whether ellipsis marks are
> needed).
>
> • format_excerpt
>
> This will take of all the formatting, calling the formatter and
> encoder as needed, and adding ellipsis marks.
>

Here is an initial patch. It’s still missing docs and some arg-checking.

The tests in 303-highlighter.t all pass. I’ve not yet tried
subclassing it though.

There is one difference in behaviour, however. The default ellipsis is
not longer passed to the encoder. Is this a problem? (It’s not too
hard to fix this, but it increases the complexity of the code a little.)


Father Chrysostomos


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


sprout at cpan

Jan 23, 2008, 4:43 PM

Post #3 of 62 (6925 views)
Permalink
Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions) [In reply to]

This time with the patch attached. :-)

On Jan 23, 2008, at 2:10 PM, Father Chrysostomos wrote:

> Here is an initial patch. It˘s still missing docs and some arg-
> checking.
>
> The tests in 303-highlighter.t all pass. I˘ve not yet tried
> subclassing it though.
>
> There is one difference in behaviour, however. The default ellipsis
> is not longer passed to the encoder. Is this a problem? (It˘s not
> too hard to fix this, but it increases the complexity of the code a
> little.)
>
>
> Father Chrysostomos
>
Attachments: ks-highlighter.diff (11.5 KB)


marvin at rectangular

Jan 23, 2008, 10:33 PM

Post #4 of 62 (6919 views)
Permalink
Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions) [In reply to]

On Jan 23, 2008, at 12:48 PM, Father Chrysostomos wrote:

> Since the highlighter’s main job is to create the excerpt, I think
> it would actually be better if we made it easy to subclass by
> dividing up its _gen_excerpt method.

OK, I can see the rationale for this. The Excerpter would probably
need access to the Formatter and the Encoder anyway. At that point,
it's not really an excerpter -- it's a highlighter.

However, I think that the internal methods of Highlighter ought to
remain internal. They were not designed to be public. The division
of labor amongst them isn't particularly elegant or clean. They call
other, non-public methods within the KinoSearch suite.

I think we need the public interface for Highlighter to be more
general. And perhaps it usage shouldn't be integrated into Hits as
it is now. I think this, which is along the lines of Peter's
Search::Tools::HiLiter, would be the better:

my $highlighter = KinoSearch::Highlight::Highlighter->new(
searcher => $searcher,
);
my $hits = $searcher->search( query => $query_string );
while ( my $hit = $hits->fetch_hit ) {
my $excerpts = $highlighter->generate_excerpts($hit);
...
}

In fact, using Search::Tools::HiLiter would look almost the same:

my $highliter = Search::Tools::HiLiter->new( query =>
$query_string );
my $hits = $searcher->search( query => $query_string );
while ( my $hit = $hits->fetch_hit_hashref ) {
my $excerpt = $highliter->light( $hit->{content} );
...
}

The only public methods on Highlighter would be generate_excerpts
($hit), get_formatter(), and get_encoder(). If we add others -- and
I can see how that would benefit you -- they should be more coherent
than the current internal methods.

Having the Highlighter operate standalone a la Search::Tools::HiLiter
just makes more sense. Unfortunately, the problem with that design
is that $hits->fetch_hit_hashref returns a plain old hashref, and
that's not enough. Crucially, the hashref does not convey the
document number, which is needed to retrieve the DocVector associated
with the hit from the $searcher.

I've pondered how to associate the return value of $hits-
>fetch_hit_hashref with a document number for a while, without
arriving at a satisfactory solution.

1) Stuffing it into the hashref like $hashref->{_kino_doc_num} is
cheesy. It's unexpected, messes up the parallels with DBI's
fetch_row_hashref, and would get in the way if someone wanted to
iterate over their fields using Perl's hash-manipulation functions.

2) It might make sense to implement KinoSearch::Document as a blessed
hashref. However, it would be the only class in all of KS which
didn't subclass KinoSearch::Obj, and thus which couldn't be used at
the C level.

3) Implementing KinoSearch::Document as a standard KS object and
giving it a $document->to_hashref method would work...

while ( my $hit = $hits->fetch_hit ) {
my $hashref = $hit->to_hashref;
my $excerpts = $highlighter->generate_excerpts($hit);
...
}

... but it's a little less elegant than returning a plain old hashref
and there's extra overhead from unnecessary string copying.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


sprout at cpan

Jan 24, 2008, 9:45 AM

Post #5 of 62 (6920 views)
Permalink
Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions) [In reply to]

On Jan 23, 2008, at 10:33 PM, Marvin Humphrey wrote:

> However, I think that the internal methods of Highlighter ought to
> remain internal. They were not designed to be public. The division
> of labor amongst them isn't particularly elegant or clean. They
> call other, non-public methods within the KinoSearch suite.
>
> I think we need the public interface for Highlighter to be more
> general. And perhaps it usage shouldn't be integrated into Hits as
> it is now. I think this, which is along the lines of Peter's
> Search::Tools::HiLiter, would be the better:
>
> my $highlighter = KinoSearch::Highlight::Highlighter->new(
> searcher => $searcher,
> );
> my $hits = $searcher->search( query => $query_string );
> while ( my $hit = $hits->fetch_hit ) {
> my $excerpts = $highlighter->generate_excerpts($hit);
> ...
> }

Sounds good.

> The only public methods on Highlighter would be
> generate_excerpts($hit), get_formatter(), and get_encoder(). If we
> add others -- and I can see how that would benefit you -- they
> should be more coherent than the current internal methods.

I˘d certainly like to avoid copying and pasting the code for
calculating the best location and for ˇrounding˘ the ends to the
nearest sentence. What would you suggest (or dictate, since you˘re in
charge :-) that the methods be?

>
>
> Having the Highlighter operate standalone a la
> Search::Tools::HiLiter just makes more sense. Unfortunately, the
> problem with that design is that $hits->fetch_hit_hashref returns a
> plain old hashref, and that's not enough. Crucially, the hashref
> does not convey the document number, which is needed to retrieve the
> DocVector associated with the hit from the $searcher.
>
> I've pondered how to associate the return value of $hits-
> >fetch_hit_hashref with a document number for a while, without
> arriving at a satisfactory solution.
>
> 1) Stuffing it into the hashref like $hashref->{_kino_doc_num} is
> cheesy. It's unexpected, messes up the parallels with DBI's
> fetch_row_hashref, and would get in the way if someone wanted to
> iterate over their fields using Perl's hash-manipulation functions.
>
> 2) It might make sense to implement KinoSearch::Document as a
> blessed hashref. However, it would be the only class in all of KS
> which didn't subclass KinoSearch::Obj, and thus which couldn't be
> used at the C level.
>
> 3) Implementing KinoSearch::Document as a standard KS object and
> giving it a $document->to_hashref method would work...
>
> while ( my $hit = $hits->fetch_hit ) {
> my $hashref = $hit->to_hashref;
> my $excerpts = $highlighter->generate_excerpts($hit);
> ...
> }
>
> ... but it's a little less elegant than returning a plain old
> hashref and there's extra overhead from unnecessary string copying.

The object could have %{} overloading, but that would cause extra
overhead, as the method would have to check ˇcaller˘ each time, to
make sure it˘s not KS:H:Highlighter. But since the number of hits is
likely to be relatively small, maybe that˘s not too bad.

Another thing you could do is to use Hash::Util::FieldHash(::Compat)
to assign a document number to the hash ref. This would be quite
inelegant, though, and it˘s not clear exactly which class should own
the field hash.


Father Chrysostomos


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


marvin at rectangular

Jan 25, 2008, 2:11 AM

Post #6 of 62 (6920 views)
Permalink
Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions) [In reply to]

On Jan 24, 2008, at 9:45 AM, Father Chrysostomos wrote:

> I’d certainly like to avoid copying and pasting the code for
> calculating the best location and for ‘rounding’ the ends to the
> nearest sentence.

Okeedoke.

> What would you suggest (or dictate, since you’re in charge :-) that
> the methods be?

Well, let's brainstorm.

Right now, we're using Query->extract_terms as the raw input to the
Highlighter. There's a nasty, undocumented kludge in there:
PhraseQuery returns an arrayref, while everything else returns an
array, allowing Highlighter to differentiate between terms that
should only match if they're in a phrase and terms that should match
everywhere. Then Highlighter basically duplicates the phrase
matching logic of PhraseScorer and then conflates all phrase-matching
positions with all other positions.

This is not an extensible approach. Highlighter would need to be
modified to duplicate the matching logic of any arbitrary Query/
Scorer regime in order to add its positions.

The only way to acquire highlight data extensibly is to go back up
the chain to Query.

my $highlight_data = $query->highlight_data($doc_vector,
$field_name);

In its most basic form, the highlight data could be an array of
positions. However, I think it ought to be something richer -- an
array of HighlightSpan objects.

my $highlight_span = KinoSearch::Highlight::HighlightSpan->new(
start_offset => 0,
end_offset => 16,
weight => 3.0
);

Highlighter can offer a public method, heat_map(), which takes an
array of HighlightSpan objects as input, and returns a
KinoSearch::Highlight::HeatMap object. This object would serve as a
vessel for the kind of information currently conveyed via
_starts_and_ends and _calc_best_location. In theory, a HeatMap
object might supply an array of float, one per character in the
field; in practice, we'll need to dial that back.

The default Highlighter would use the HeatMap to find a single
contiguous snippet. Your subclass would use it to find multiple
snippets.

As for the "rounding the ends" code... maybe a method called
find_sentence_boundaries? generate_excerpts() can then make use of
the boundary information however it sees fit.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


sprout at cpan

Jan 25, 2008, 8:28 AM

Post #7 of 62 (6915 views)
Permalink
Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions) [In reply to]

On Jan 25, 2008, at 2:11 AM, Marvin Humphrey wrote:

> my $highlight_data = $query->highlight_data($doc_vector,
> $field_name);
>
> In its most basic form, the highlight data could be an array of
> positions.

Is there any reason this needs to be an array, rather than a list?

@highlight_data = ...

> However, I think it ought to be something richer -- an array of
> HighlightSpan objects.
>
> my $highlight_span = KinoSearch::Highlight::HighlightSpan->new(
> start_offset => 0,
> end_offset => 16,
> weight => 3.0
> );
>
> Highlighter can offer a public method, heat_map(), which takes an
> array of HighlightSpan objects as input, and returns a
> KinoSearch::Highlight::HeatMap object. This object would serve as a
> vessel for the kind of information currently conveyed via
> _starts_and_ends and _calc_best_location. In theory, a HeatMap
> object might supply an array of float, one per character in the
> field; in practice, we'll need to dial that back.
>
> The default Highlighter would use the HeatMap to find a single
> contiguous snippet. Your subclass would use it to find multiple
> snippets.
>
> As for the "rounding the ends" code... maybe a method called
> find_sentence_boundaries? generate_excerpts() can then make use of
> the boundary information however it sees fit.

Sounds more like a utility function to me: just pass it a string and
it returns a two-element list, the number of chars to chop of each
end. Could this be exported?


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


sprout at cpan

Jan 25, 2008, 1:39 PM

Post #8 of 62 (6930 views)
Permalink
Re: Subclassable Highlighter [In reply to]

On Jan 24, 2008, at 9:45 AM, Father Chrysostomos wrote:

> The [hit] object could have %{} overloading, but that would cause
> extra overhead, as the method would have to check ˇcaller˘ each
> time, to make sure it˘s not KS:H:Highlighter. But since the number
> of hits is likely to be relatively small, maybe that˘s not too bad.
>

Here˘s a patch that implements KS:S:Hits that way:
Attachments: ks-hits.diff (5.43 KB)


marvin at rectangular

Jan 26, 2008, 6:36 AM

Post #9 of 62 (6921 views)
Permalink
Re: Subclassable Highlighter [In reply to]

On Jan 25, 2008, at 1:39 PM, Father Chrysostomos wrote:

>
> On Jan 24, 2008, at 9:45 AM, Father Chrysostomos wrote:
>
>> The [hit] object could have %{} overloading, but that would cause
>> extra overhead, as the method would have to check ‘caller’ each
>> time, to make sure it’s not KS:H:Highlighter. But since the number
>> of hits is likely to be relatively small, maybe that’s not too bad.

> Here’s a patch that implements KS:S:Hits that way:

This works. :)

In fact, it works better than you think.

I see that you're working off of one of the CPAN releases, since
you're basing KinoSearch::Search::Hit on KinoSearch::Util::Class,
which has gone away in svn trunk. You can check out a copy of trunk
here:

svn co http://www.rectangular.com/svn/kinosearch/trunk ks

Please note that trunk isn't stable -- the file format is in flux.
(In case you're running anything live that would be affected.)

In trunk, everything is now a subclass of KinoSearch::Obj, which uses
the inside-out model for Perl-space member vars. Because Obj isn't
hash-based, there's no need to check caller():

use overload
fallback => 1,
'%{}' => \&get_hashref;

There happens to be a glitch, but I'm sure it's solvable. When I add
overloading to a class based on Obj, it works in 5.10.0 but fails in
5.8.8. This is presumably due to a bug in a perlapi function back in
5.8.8: sv_bless() (or whatever) isn't adding overload magic. We'll
have to do that manually. However, I'm not quite sure what the exact
XS incantation is, so I'll have to work up a test case and post to
PerlMonks.

This hashref-overloading is cool enough that I'm going to bring back
the Doc class and use it for that:

# (both params are optional)
my $doc = KinoSearch::Doc->new(
fields => { title => 'foo' },
boost => 3,
);
$doc->{content} = "foo foo";

# (passing a plain hashref will trigger an internal call to Doc-
>new)
$invindexer->add_doc($doc);

In fact, I think Hit should be a subclass of Doc, with the only
change being that $hit->get_boost throws an error (because Doc's
index-time boost is not preserved accurately in the index).

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


sprout at cpan

Jan 26, 2008, 2:09 PM

Post #10 of 62 (6919 views)
Permalink
Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions) [In reply to]

On Jan 23, 2008, at 10:33 PM, Marvin Humphrey wrote:

>
> On Jan 23, 2008, at 12:48 PM, Father Chrysostomos wrote:
>
>> Since the highlighter˘s main job is to create the excerpt, I think
>> it would actually be better if we made it easy to subclass by
>> dividing up its _gen_excerpt method.
>
> OK, I can see the rationale for this. The Excerpter would probably
> need access to the Formatter and the Encoder anyway. At that point,
> it's not really an excerpter -- it's a highlighter.
>
> However, I think that the internal methods of Highlighter ought to
> remain internal. They were not designed to be public. The division
> of labor amongst them isn't particularly elegant or clean. They
> call other, non-public methods within the KinoSearch suite.
>
> I think we need the public interface for Highlighter to be more
> general. And perhaps it usage shouldn't be integrated into Hits as
> it is now. I think this, which is along the lines of Peter's
> Search::Tools::HiLiter, would be the better:
>
> my $highlighter = KinoSearch::Highlight::Highlighter->new(
> searcher => $searcher,
> );
> my $hits = $searcher->search( query => $query_string );
> while ( my $hit = $hits->fetch_hit ) {
> my $excerpts = $highlighter->generate_excerpts($hit);
> ...
> }
>
> In fact, using Search::Tools::HiLiter would look almost the same:
>
> my $highliter = Search::Tools::HiLiter->new( query =>
> $query_string );
> my $hits = $searcher->search( query => $query_string );
> while ( my $hit = $hits->fetch_hit_hashref ) {
> my $excerpt = $highliter->light( $hit->{content} );
> ...
> }
>
> The only public methods on Highlighter would be
> generate_excerpts($hit), get_formatter(), and get_encoder(). If we
> add others -- and I can see how that would benefit you -- they
> should be more coherent than the current internal methods.

Do you mean to eliminate add_spec? For which fields will the default
highlighter create excerpts? Might I suggest the following:

1) Keep add_spec.

2) Forget about get_(formatter|encoder), since each spec might have a
different one.

3) Make generate_excerpts call generate_excerpt (_gen_excerpt
renamed); or maybe we should call it single_excerpt, to differentiate
between it and the former more easily. single_excerpt will be called
with its current args, and can be overridden in a subclass. The $spec
passed to singe_excerpt can be documented to contain the args passed
to add_spec, with default filled in. So $spec->{limit} should be
removed and calculated in the default single_excerpt method instead of
in add_spec.

4) All the suggestions you made in the other message (heat_map, $query-
>highlight_data, and find_sentence_boundaries).


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


sprout at cpan

Jan 26, 2008, 2:12 PM

Post #11 of 62 (6899 views)
Permalink
Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions) [In reply to]

On Jan 26, 2008, at 2:09 PM, I wrote:

> 3) Make generate_excerpts call generate_excerpt (_gen_excerpt
> renamed); or maybe we should call it single_excerpt, to
> differentiate between it and the former more easily. single_excerpt
> will be called with its current args, and can be overridden in a
> subclass. The $spec passed to singe_excerpt can be documented to
> contain the args passed to add_spec, with default filled in.

I mean with defaults used where args were omitted.


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


sprout at cpan

Jan 27, 2008, 1:49 PM

Post #12 of 62 (6924 views)
Permalink
Re: Subclassable Highlighter [In reply to]

On Jan 26, 2008, at 2:09 PM, I wrote:

> 1) Keep add_spec.
>
> 2) Forget about get_(formatter|encoder), since each spec might have
> a different one.
>
> 3) Make generate_excerpts call generate_excerpt (_gen_excerpt
> renamed); or maybe we should call it single_excerpt, to
> differentiate between it and the former more easily. single_excerpt
> will be called with its current args, and can be overridden in a
> subclass. The $spec passed to singe_excerpt can be documented to
> contain the args passed to add_spec, with default filled in. So
> $spec->{limit} should be removed and calculated in the default
> single_excerpt method instead of in add_spec.
>
> 4) All the suggestions you made in the other message (heat_map,
> $query->highlight_data, and find_sentence_boundaries).
>

I˘m trying to implement this now. I˘ve noticed what I believe to be a
couple of bugs in _calc_best_location, but I want to check with you to
make sure:

sub _calc_best_location {
...
for my $loc_index ( 0 .. $#$posits ) {
# only score positions that are in range
my $location = $posits->[$loc_index][0];
my $other_loc_index = $loc_index - 1;
while ( $other_loc_index > 0 ) {

Should this not be >= ?

my $diff = $location - $posits->[$other_loc_index][0];
last if $diff > $window;
my $num_tokens_at_pos = $posits->[$other_loc_index][2];
$locations{$location}
+= ( 1 / ( 1 + log($diff) ) ) * $num_tokens_at_pos;
--$other_loc_index;
}
$other_loc_index = $loc_index + 1;
while ( $other_loc_index <= $#$posits ) {
my $diff = $posits->[$other_loc_index] - $location;

Shouldn˘t $posits->[$other_loc_index] have [0] on the end?

last if $diff > $window;
my $num_tokens_at_pos = $posits->[$other_loc_index][2];
$locations{$location}
+= ( 1 / ( 1 + log($diff) ) ) * $num_tokens_at_pos;
++$other_loc_index;
}
}

# return the highest scoring position
return ( sort { $locations{$b} <=> $locations{$a} } keys
%locations )[0];
}


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


marvin at rectangular

Jan 27, 2008, 5:14 PM

Post #13 of 62 (6906 views)
Permalink
Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions) [In reply to]

On Jan 26, 2008, at 2:09 PM, Father Chrysostomos wrote:
> Do you mean to eliminate add_spec?

No, that was just an oversight while writing untested code for email.

> 2) Forget about get_(formatter|encoder), since each spec might have
> a different one.

Yes. I'd unconsciously reverted to the old API, where there was
only one formatter/encoder per highlighter. (: It's because I make
mistakes like these that KS has arg checking everywhere. :)

> 3) Make generate_excerpts call generate_excerpt (_gen_excerpt
> renamed); or maybe we should call it single_excerpt, to
> differentiate between it and the former more easily. single_excerpt
> will be called with its current args, and can be overridden in a
> subclass. The $spec passed to singe_excerpt can be documented to
> contain the args passed to add_spec, with default filled in. So
> $spec->{limit} should be removed and calculated in the default
> single_excerpt method instead of in add_spec.

Sounds well thought through. I concur with making single_excerpt
public() with that API.

We'll need to add one extra named arg to the add_spec list. "hit"?
Or actually, how about "doc"?

# User code:
my $highlighter = KinoSearch::Highlight::Highlighter->new(
searcher => $searcher,
);
$highlighter->add_spec( name => 'content' );
my $excerpts = $highlighter->generate_excerpts($hit);

# Internally, highlighter calls single_excerpt:
for my $spec ( @{ $specs{$$self} } ) {
$excerpts->{ $spec->{name} } = $self->single_excerpt(
%$spec,
doc => $hit,
);
}

The DocVector object would be retrieved within single_excerpt() --
which becomes possible once the Highlighter gets a Searcher at
construction time.

I'm a little uncertain about dedicating the name "Hit" to the class
for the documents that Hits::fetch_hit returns. Sure, it works, but
"hit" is used elsewhere, e.g. the HitCollector class, which doesn't
deal with *this* kind of "hit". These are essentially Doc objects.
So I'm thinking make them a subclass of Doc called HitDoc, and have
the named arg for single_excerpt() be "doc". Sound good?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


marvin at rectangular

Jan 27, 2008, 5:34 PM

Post #14 of 62 (6931 views)
Permalink
Re: Subclassable Highlighter [In reply to]

On Jan 27, 2008, at 1:49 PM, Father Chrysostomos wrote:

> I’ve noticed what I believe to be a couple of bugs in
> _calc_best_location, but I want to check with you to make sure:
>
> sub _calc_best_location {
> ...
> for my $loc_index ( 0 .. $#$posits ) {
> # only score positions that are in range
> my $location = $posits->[$loc_index][0];
> my $other_loc_index = $loc_index - 1;
> while ( $other_loc_index > 0 ) {
>
> Should this not be >= ?

Oof. Coming back to it, this code is kind of hard to grok. :\ And
test (so there are no tests for it).

I think you're right.

>
> my $diff = $location - $posits->[$other_loc_index][0];
> last if $diff > $window;
> my $num_tokens_at_pos = $posits->[$other_loc_index][2];
> $locations{$location}
> += ( 1 / ( 1 + log($diff) ) ) * $num_tokens_at_pos;
> --$other_loc_index;
> }
> $other_loc_index = $loc_index + 1;
> while ( $other_loc_index <= $#$posits ) {
> my $diff = $posits->[$other_loc_index] - $location;
>
> Shouldn’t $posits->[$other_loc_index] have [0] on the end?

Yes.

Existing high-level tests for Highlighter still pass after making
these changes, so I've commited them as r2956. Thank you for finding
the bugs.

This code will end up in HeatMap. I look forward to refactoring it
for clarity and writing up some tests for it.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


sprout at cpan

Jan 27, 2008, 6:26 PM

Post #15 of 62 (6912 views)
Permalink
Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions) [In reply to]

On Jan 27, 2008, at 5:14 PM, Marvin Humphrey wrote:

> We'll need to add one extra named arg to the add_spec list. "hit"?
> Or actually, how about "doc"?
>
> # User code:
> my $highlighter = KinoSearch::Highlight::Highlighter->new(
> searcher => $searcher,
> );
> $highlighter->add_spec( name => 'content' );
> my $excerpts = $highlighter->generate_excerpts($hit);
>
> # Internally, highlighter calls single_excerpt:
> for my $spec ( @{ $specs{$$self} } ) {
> $excerpts->{ $spec->{name} } = $self->single_excerpt(
> %$spec,
> doc => $hit,
> );
> }

Actually what I˘ve done so far hash ->singe_excerpt($hit, \%spec), but
I think what you have is better. I˘ve also made doc_vector an
attribute of Hit (see the attached file [if I remember to attach it
after typing this message]). Now I˘m not certain that the hit needs a
reference to the doc number, but it˘s in there. Also, it has a
reference to the query, so that single_excerpt can call

$hit_doc->highlight_data( $excerpt_field )

and just has to pass one arg. This highlight_data method calls the
method of the same name on the query and then sorts its return value
and removes duplicates.

> The DocVector object would be retrieved within single_excerpt() --
> which becomes possible once the Highlighter gets a Searcher at
> construction time.

DocVector is currently documented as a private class. Do we want a
ˇpublicly subclassable˘ method to have to deal with it?
>
>
> I'm a little uncertain about dedicating the name "Hit" to the class
> for the documents that Hits::fetch_hit returns. Sure, it works, but
> "hit" is used elsewhere, e.g. the HitCollector class, which doesn't
> deal with *this* kind of "hit". These are essentially Doc objects.
> So I'm thinking make them a subclass of Doc called HitDoc, and have
> the named arg for single_excerpt() be "doc". Sound good?

This sounds fine to me. But I don˘t know what the Doc is currently
for....

Also, do we need a HighlightSpan object? Won˘t a simple hash do?
Likewise with a heat map.


Father Chrysostomos
Attachments: Hit.pm (2.15 KB)


sprout at cpan

Jan 27, 2008, 6:28 PM

Post #16 of 62 (6907 views)
Permalink
Re: Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions) [In reply to]

On Jan 27, 2008, at 6:26 PM, I wrote:

> Actually what I˘ve done so far hash ->singe_excerpt($hit, \%spec),...

Oops! I can˘t spell ˇhas.˘ :-)
_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


marvin at rectangular

Jan 27, 2008, 7:56 PM

Post #17 of 62 (6918 views)
Permalink
Re: Subclassable Highlighter (was: Re: KinoSearch feature suggestions) [In reply to]

On Jan 27, 2008, at 6:26 PM, Father Chrysostomos wrote:

> Actually what I’ve done so far hash ->singe_excerpt($hit, \%spec),
> but I think what you have is better.

Glad it works for you. With only a couple exceptions, argument lists
in KS take one of the following two forms:

$foo->($single_arg);
$foo->(%labeled_params);

> I’ve also made doc_vector an attribute of Hit (see the attached
> file [if I remember to attach it after typing this message]). Now
> I’m not certain that the hit needs a reference to the doc number,
> but it’s in there.

People who don't want highlighting shouldn't have to pay the penalty
of retrieving the DocVector by default.
In any case, having HitDoc carry all that information seems kind of
messy.

The doc_num is the key that enables you to get at all that info --
and it's just a nice, featherweight integer.

> Also, it has a reference to the query, so that single_excerpt can call
>
> $hit_doc->highlight_data( $excerpt_field )
>
> and just has to pass one arg.

Oh, yeah -- forgot about that.

How about supplying the query as an argument to Highlighter's
constructor? That's how Search::Tools::HiLiter works.

my $hiliter = Search::Tools::HiLiter->new(
query => $query,
);

my $highlighter = KinoSearch::Highlight::Highlighter->new(
searcher => $searcher,
query => $query,
);

If the 'query' param is just a query string, we can ask $searcher to
perform its default parsing (currently in
KinoSearch::Search::Searchable::_prepare_simple_search). If it's an
object, we assume that it's the same Query object that was supplied
to $searcher->search.

> This highlight_data method calls the method of the same name on the
> query and then sorts its return value and removes duplicates.
>
>> The DocVector object would be retrieved within single_excerpt() --
>> which becomes possible once the Highlighter gets a Searcher at
>> construction time.
>
> DocVector is currently documented as a private class. Do we want a
> ‘publicly subclassable’ method to have to deal with it?

Ultimately, we want the information that is currently in DocVector to
be available via a public API.

The word "vector" is only there for Lucene legacy reasons.
TermVector data in Lucene is used for other things besides
highlighting -- but IMO it's only highlighting that's crucial.
Highlighting is really important.

I think that as we write the Query methods for obtaining
HighlightSpans, we might have ideas about how to improve DocVector/
TermVector. By the time that's done, hopefully we'll have an
acceptable public API.

Following up on this, I think FieldSpec::vectorized should be
replaced by FieldSpec::highlightable. highlightable() (or maybe just
"highlight"?) should default to 0, unlike the way
FieldSpec::vectorized() currently defaults to 1. However, the schema
used by KinoSearch::Simple will enable highlighting by default, so
that novice users will still have it easy.

> But I don’t know what the Doc is currently for....

It will be clear once Doc is integrated into the indexing stage.

I have a cargo-cult solution that makes overloading work correctly
with KinoSearch::Obj subclasses under 5.8.8, but I'm trying to
understand why it works by spelunking the relevant parts of the Perl
source, and that's slowing me down.

> Also, do we need a HighlightSpan object? Won’t a simple hash do?
> Likewise with a heat map.

There are a number of reasons to use objects instead of hashes.

First, auto-vivification is evil. Mistyped hash keys should not not
result in subtly incorrect behavior, e.g. because a default was used
instead of the supplied arg.

Second, classes which pass around hashes are tightly coupled, and
thus fragile and convoluted. Spelunking the code for the old
Test::Harness (which I had to do when hacking on David Wheeler's
JavaScript port of it) was really unpleasant for this reason. Using
an intermediate class (with accessor methods) to convey data between
two classes forces much more robust and well-defined interaction.

Third, a lot of this stuff is going to get ported to C eventually,
where using a dedicated class is actually easier than passing around
a Hash object because hash manipulation syntax isn't built into the
language.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


sprout at cpan

Jan 27, 2008, 10:46 PM

Post #18 of 62 (6931 views)
Permalink
Re: Subclassable Highlighter [In reply to]

On Jan 27, 2008, at 7:56 PM, Marvin Humphrey wrote:

> How about supplying the query as an argument to Highlighter's
> constructor? That's how Search::Tools::HiLiter works.
>
> my $hiliter = Search::Tools::HiLiter->new(
> query => $query,
> );
>
> my $highlighter = KinoSearch::Highlight::Highlighter->new(
> searcher => $searcher,
> query => $query,
> );
>
> If the 'query' param is just a query string, we can ask $searcher to
> perform its default parsing (currently in
> KinoSearch::Search::Searchable::_prepare_simple_search). If it's an
> object, we assume that it's the same Query object that was supplied
> to $searcher->search.

The only thing that bothers me is that the same objects have to be
passed around so much:

$hits = $searcher->search( query => $query );
$highlighter = new KinoSearch::Highlight::Highlighter
searcher => $searcher, # I want to eliminate these two
query => $query; # extra mentions from user code
$highlighter ->add_spec(...);
$hits->create_excerpts(
highlighter => $highligher
);

Having said that, I can live with it. (You know, we could have the
searcher and query passed to the HitDoc, to avoid the DocVector
overhead you mention.)

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


peter at peknet

Jan 28, 2008, 6:26 AM

Post #19 of 62 (6910 views)
Permalink
Re: Subclassable Highlighter [In reply to]

On 01/28/2008 12:46 AM, Father Chrysostomos wrote:
>
> On Jan 27, 2008, at 7:56 PM, Marvin Humphrey wrote:

>
> The only thing that bothers me is that the same objects have to be
> passed around so much:
>
> $hits = $searcher->search( query => $query );
> $highlighter = new KinoSearch::Highlight::Highlighter
> searcher => $searcher, # I want to eliminate these two
> query => $query; # extra mentions from user code
> $highlighter ->add_spec(...);
> $hits->create_excerpts(
> highlighter => $highligher
> );
>

IIRC, there are places in Search::Tools where, if not present, the query is extracted from
whatever *is* present. E.g. this naive code:

package KinoSearch::Highlight::Highlighter;
sub new {
my $self = shift;
my %args = @_;
my $searcher = $args{searcher} || croak;
my $query = $args{query} || $searcher->query || croak;
# yada yada
}

--
Peter Karman . peter [at] peknet . http://peknet.com/


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


sprout at cpan

Jan 28, 2008, 3:39 PM

Post #20 of 62 (6902 views)
Permalink
Re: Subclassable Highlighter [In reply to]

On Jan 27, 2008, at 7:56 PM, Marvin Humphrey wrote:

> my $highlighter = KinoSearch::Highlight::Highlighter->new(
> searcher => $searcher,
> query => $query,
> );

Another problem with this approach is that the highlighter can only be
used for one query. If a second search is made with the same
$searcher, another highlighter is needed.

Unless $searcher can have a ->get_last_query method....


Also, when it comes to the highlight_data method, which class should
be responsible for removing duplicate HighlightSpans? Should I make
this a method of Highlighter itself?


I don˘t remember whether I told you: I˘m working on these changes to
Highlighter, and I think I will have a patch ready soon.


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


marvin at rectangular

Jan 28, 2008, 5:27 PM

Post #21 of 62 (6903 views)
Permalink
Re: Subclassable Highlighter [In reply to]

On Jan 28, 2008, at 3:39 PM, Father Chrysostomos wrote:

> On Jan 27, 2008, at 7:56 PM, Marvin Humphrey wrote:
>
>> my $highlighter = KinoSearch::Highlight::Highlighter->new(
>> searcher => $searcher,
>> query => $query,
>> );
>
> Another problem with this approach is that the highlighter can only
> be used for one query. If a second search is made with the same
> $searcher, another highlighter is needed.

True, but I can't think of where that would cause a problem. Can you
think of one?

In contrast, if we add a Query member to each HitDoc, that means the
Query will have to be serialized/deserialized if we send the hit over
the network.

Part of the reason that Highlighter's API looks the way it does was
the limitation that Highlighters had to do their work from inside a
Hits object. That was a kludge, necessitated by the fact that it was
possible to know the doc_num from within the Hits object (and thus
possible to fetch the relevant DocVector), but impossible to know the
doc_num from the hashref returned by $hits->fetch_hit_hashref.

Now that we're about to return a HitDoc object instead of a plain
hashref, we're not bound by that constraint, and I'm very much
looking forward to zapping Hits::create_excerpts.

In fact, we could simplify further. Now that we don't have to stick
all our excerpts into $hashref->{excerpts}, we can return the
excerpts as scalars, one-at-a-time -- eliminating both add_spec() and
generate_excerpts().

my $highlighter = KinoSearch::Highlight::Highlighter->new(
searcher => $searcher, # required
query => $query, # required
field => 'content', # required
excerpt_length => 150, # default: 200
formatter => $formatter, # default: a SimpleHTMLFormatter
encoder => $encoder, # default: a SimpleHTMLEncoder
);
for my $hit ( $hits->fetch_hit ) {
my $excerpt = $highlighter->single_excerpt($hit);
...
}

Juggling how params get set is a superficial change compared with
e.g. making single_excerpt() public, so it isn't that important.
However, I wonder if this lighter-weight vision for a highlighter
makes you more comfortable. To my mind, it's OK if highlighters are
ephemeral and you create a new one for each query.

> Unless $searcher can have a ->get_last_query method....

Yikes, that'd be asking for trouble!

> Also, when it comes to the highlight_data method, which class
> should be responsible for removing duplicate HighlightSpans? Should
> I make this a method of Highlighter itself?

When would there be duplicates? I suppose you'd see the same
positions multiple times for a query like 'lincoln "lincoln
bedroom"', but you'd get different weights. That query would
probably yield two spans with data like this...

{ start_offset => 15, end_offset => 22, weight 1.2 }
{ start_offset => 15, end_offset => 30, weight 3.5 }

... with the second span having a higher weight to reflect the
relative rarity of the phrase compared to the single term.

> I don’t remember whether I told you: I’m working on these changes
> to Highlighter, and I think I will have a patch ready soon.

I'm working on the Doc class right now. You should see some commits
over the next few hours.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


sprout at cpan

Jan 28, 2008, 6:12 PM

Post #22 of 62 (6898 views)
Permalink
Re: Subclassable Highlighter [In reply to]

On Jan 28, 2008, at 5:27 PM, Marvin Humphrey wrote:

>
> On Jan 28, 2008, at 3:39 PM, Father Chrysostomos wrote:
>
>> On Jan 27, 2008, at 7:56 PM, Marvin Humphrey wrote:
>>
>>> my $highlighter = KinoSearch::Highlight::Highlighter->new(
>>> searcher => $searcher,
>>> query => $query,
>>> );
>>
>> Another problem with this approach is that the highlighter can only
>> be used for one query. If a second search is made with the same
>> $searcher, another highlighter is needed.
>
> True, but I can't think of where that would cause a problem. Can
> you think of one?

I thought of this only when I looked in 303-highlight.t, which does
just that. I don˘t see how it would cause a problem in a real-world
situation.

> In fact, we could simplify further. Now that we don't have to stick
> all our excerpts into $hashref->{excerpts}, we can return the
> excerpts as scalars, one-at-a-time -- eliminating both add_spec()
> and generate_excerpts().
>
> my $highlighter = KinoSearch::Highlight::Highlighter->new(
> searcher => $searcher, # required
> query => $query, # required
> field => 'content', # required
> excerpt_length => 150, # default: 200
> formatter => $formatter, # default: a SimpleHTMLFormatter
> encoder => $encoder, # default: a SimpleHTMLEncoder
> );
> for my $hit ( $hits->fetch_hit ) {

while(my $hit = $hits->fetch_hit) { # :-)

>
> my $excerpt = $highlighter->single_excerpt($hit);
> ...
> }
>
> Juggling how params get set is a superficial change compared with
> e.g. making single_excerpt() public, so it isn't that important.
> However, I wonder if this lighter-weight vision for a highlighter
> makes you more comfortable.

It certainly does. Having the constructor and add_spec combined makes
it much easier to use.

>> Also, when it comes to the highlight_data method, which class
>> should be responsible for removing duplicate HighlightSpans? Should
>> I make this a method of Highlighter itself?
>
> When would there be duplicates?

Well, the current _starts_and_ends checks for them. I just thought it
must be doing it for a reason.

>> I don˘t remember whether I told you: I˘m working on these changes
>> to Highlighter, and I think I will have a patch ready soon.
>
> I'm working on the Doc class right now. You should see some commits
> over the next few hours.

The patch I thought I would have ready soon isn˘t currently working.
It appears to be highlighting words more or less at random. I˘m
sending it anyway just to show you what I˘m doing. Parts of it will
need to be re-factored, too, now that I˘ve read this message of yours.
Attachments: ks-highlighter.diff (49.3 KB)


sprout at cpan

Jan 29, 2008, 4:40 PM

Post #23 of 62 (6893 views)
Permalink
Re: Subclassable Highlighter [In reply to]

On Jan 28, 2008, at 5:27 PM, Marvin Humphrey wrote:

> When would there be duplicates? I suppose you'd see the same
> positions multiple times for a query like 'lincoln "lincoln
> bedroom"', but you'd get different weights. That query would
> probably yield two spans with data like this...
>
> { start_offset => 15, end_offset => 22, weight 1.2 }
> { start_offset => 15, end_offset => 30, weight 3.5 }

I see this is not the same weight that the highlighter is currently
using (1 for each TermQuery, the number of terms for each
PhraseQuery). Should we use $query->make_weight->get_value?

> I'm working on the Doc class right now. You should see some commits
> over the next few hours.

Do you intend to make $reader->fetch_doc return a HitDoc?


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


marvin at rectangular

Jan 29, 2008, 5:48 PM

Post #24 of 62 (6896 views)
Permalink
Re: Subclassable Highlighter [In reply to]

On Jan 29, 2008, at 4:40 PM, Father Chrysostomos wrote:

>
>> { start_offset => 15, end_offset => 22, weight 1.2 }
>> { start_offset => 15, end_offset => 30, weight 3.5 }
>
> I see this is not the same weight that the highlighter is currently
> using (1 for each TermQuery, the number of terms for each
> PhraseQuery). Should we use $query->make_weight->get_value?

Yes, I was thinking something like that. It gets a little convoluted.

I think we should divide and conquer, focusing first on HitDoc
(almost done), HighlightSpan (trivial) and HeatMap (harder).

>> I'm working on the Doc class right now. You should see some
>> commits over the next few hours.
>
> Do you intend to make $reader->fetch_doc return a HitDoc?

Yes, exactly. That means a HitDoc has to implement serialization so
that it can be sent across a network, which is tricky. I'm building
up to that.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


sprout at cpan

Jan 29, 2008, 5:57 PM

Post #25 of 62 (6907 views)
Permalink
Re: Subclassable Highlighter [In reply to]

On Jan 29, 2008, at 5:48 PM, Marvin Humphrey wrote:

> I think we should divide and conquer, focusing first on HitDoc
> (almost done), HighlightSpan (trivial) and HeatMap (harder).

Well, here˘s HighlightSpan:
Attachments: HighlightSpan.pm (1.57 KB)

First page Previous page 1 2 3 Next page Last page  View All kinosearch discuss RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.