
marvin at rectangular
Apr 23, 2008, 9:06 PM
Post #15 of 17
On Apr 18, 2008, at 6:22 AM, jack_tanner [at] yahoo wrote:

> Right. How about something like this:
>
> $doc1 = $invindex->get_doc(id_field => 'doc_id', id_value => $id1);
> $doc2 = $invindex->get_doc(id_field => 'doc_id', id_value => $id2);
>
> I like that this gets the doc from the invindex rather than a
> searcher.

InvIndex is a low-level class. (FYI, it's actually something different in maint and devel, but in both cases it's low-level.) KinoSearch::Index::IndexReader, which has a private fetch_doc() method, more closely resembles what you're looking for.

    # private method
    my $doc = $reader->fetch_doc($doc_num);

Searchable also specs a fetch_doc() method, which is implemented in Searcher as a call to $self->{reader}->fetch_doc().

However, those fetch_doc() methods operate on KinoSearch's internal document numbers. KS document numbers aren't presently a part of the official API, and they change over time, which makes them both confusing and of limited use.

What you're talking about is adding some sort of retrieve-by-primary-key facility to KS.

> It makes clear that it returns a doc, not a hit.

FYI, in devel, $hits->fetch_hit() returns a HitDoc, which is a subclass of Doc.

> It either succeeds (we get *the* doc, not any other doc), or fails.

Well, this is really database territory, which isn't KinoSearch's element. Adding primary key constraints is something that could potentially be done via a KSx subclass, but it would be very awkward in core.

You can fake a retrieval by primary key something like this:

    package MySearcher;
    use base qw( KinoSearch::Searcher );

    # Return the single best hit for an exact match on the key field.
    sub im_feeling_lucky {
        my ( $self, $key ) = @_;
        my $term_query = KinoSearch::Search::TermQuery->new(
            field => 'pri_key_field',
            term  => $key,
        );
        my $hits = $self->search(
            query      => $term_query,
            num_wanted => 1,    # assumed value: only the top hit is needed
        );
        return $hits->fetch_hit;
    }

That will work if you have your own exterior mechanism for guaranteeing the uniqueness of a particular field during indexing.
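One possible shape for such an exterior mechanism is a guard that remembers which keys have already been handed to the indexer. The following is only a minimal sketch: UniqueKeyGuard and its claim() method are hypothetical names, not part of KinoSearch, and the in-memory hash only protects a single indexing session. Anything that has to survive across sessions would need a persistent store instead.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch: remembers which primary keys
# have already been passed along and rejects repeats.
package UniqueKeyGuard;

sub new {
    my ($class) = @_;
    return bless { seen => {} }, $class;
}

# Returns true the first time a key is claimed, false on any repeat.
sub claim {
    my ( $self, $key ) = @_;
    return 0 if exists $self->{seen}{$key};
    $self->{seen}{$key} = 1;
    return 1;
}

package main;

my $guard = UniqueKeyGuard->new;
my @docs  = (
    { pri_key_field => 'a1', content => 'first document' },
    { pri_key_field => 'b2', content => 'second document' },
    { pri_key_field => 'a1', content => 'duplicate key' },
);

my @accepted;
for my $doc (@docs) {
    if ( $guard->claim( $doc->{pri_key_field} ) ) {
        # This is the point where the doc would be passed to the indexer.
        push @accepted, $doc;
    }
    else {
        warn "skipping duplicate key '$doc->{pri_key_field}'\n";
    }
}
```

Only documents that pass the guard reach the indexer, so a TermQuery on pri_key_field can match at most one of them.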
> $similarity = $doc1->get_cosine($doc2);
>
> And more generally,
>
> $similarity = $doc1->get_similarity($doc2, $my_similarity_fxn);

Interesting. Similarity measures are implemented using pluggable classes in KinoSearch, which suggests this...

    my $sim   = KinoSearch::Search::Similarity->new;
    my $score = $sim->cosine( $doc1, $doc2 );

Doc objects are just collections of stored fields, though. They have no idea what terms they contain. They have no idea how they're parsed, and a Similarity object wouldn't have any idea how to parse them either.

But here's where some fruitful possibilities arise. Currently, KinoSearch writes a part of the index called "term vectors", for which Highlighter is the primary consumer. The term vector information consists of lists of the terms present in each field, along with frequency, positions, start_offsets, and end_offsets. KS accesses this information like so:

    # Fetch a DocVector object, from which TermVector objects may be extracted.
    my $doc_vec = $searcher->fetch_doc_vec($doc_num);

The following cosine() method could theoretically work, because at least all the information that's needed is present:

    my $score = $sim->cosine( $doc_vec1, $doc_vec2 );

However, we'd need to expose a few more public APIs.

First, we need a way of obtaining document numbers from a search. The easiest way to make this happen is to expose get_doc_num for HitDoc. (There are other places as well, that's just the easiest and it would work for our purposes.)

Second we need to expose DocVector, or rather, an improvement upon DocVector because DocVector isn't ready for prime-time. What Highlighter and you really need is a pre-analyzed document. (Highlighter could actually work by analyzing fields on the fly -- indeed, Lucene's highlighter can be set up that way -- except for the fact that analyzing on the fly can be unacceptably slow for large documents or costly analyzers.)

The questions are...

* What's a better name than DocVector? AnalyzedDoc?
* Should we store any other information besides the terms and their positions, start_offsets and end_offsets?
* How should the data file be formatted?

This is something I really want to nail in the file format, because that's the hardest thing to change.

> - KS retrieval is asymmetrical (and that's fine). Let
> similarity(I,A,B) be a function that specifies document A as query
> against index I, iterates over the hits until it gets to document B,
> and returns the score of document B. Then similarity(I,A,B) !=
> similarity(I,B,A). I handled this by retrieving both
> similarity(I,A,B) and similarity(I,B,A) and taking the average.
>
> - One issue that still puzzles me is that KS is apparently capable
> of a hit score greater than 1! Is that really true?

Yeah, absolutely. It's the same way with Lucene, and KS scoring is directly derived from the Lucene scoring model. Lucene and KS only care about coarse relative ranking, so there are some adulterations and approximations in the similarity calculations.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch
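For contrast with the query-style scores discussed above, here is a minimal sketch of a plain cosine measure over per-document term counts. It assumes each document has already been reduced to a hash of term => frequency. Those hashes are stand-ins for whatever a DocVector-like object might eventually expose, and the cosine() function below is illustrative Perl, not a KinoSearch API. It uses raw counts with no tf-idf weighting.

```perl
use strict;
use warnings;

# Cosine similarity between two term-frequency maps ( term => count ).
sub cosine {
    my ( $tf_a, $tf_b ) = @_;
    my ( $dot, $sq_a, $sq_b ) = ( 0, 0, 0 );

    for my $term ( keys %$tf_a ) {
        $sq_a += $tf_a->{$term}**2;

        # Only terms present in both documents contribute to the dot product.
        $dot += $tf_a->{$term} * $tf_b->{$term} if exists $tf_b->{$term};
    }
    $sq_b += $_**2 for values %$tf_b;

    # An empty document has no direction; treat its similarity as 0.
    return 0 unless $sq_a && $sq_b;
    return $dot / sqrt( $sq_a * $sq_b );
}

# Made-up term counts for two documents.
my %doc1 = ( kino   => 2, search => 1, index  => 1 );
my %doc2 = ( search => 3, index  => 1, lucene => 1 );

my $ab = cosine( \%doc1, \%doc2 );
my $ba = cosine( \%doc2, \%doc1 );
printf "cosine(A,B) = %.4f, cosine(B,A) = %.4f\n", $ab, $ba;
```

Computed this way, the measure is symmetric, so no averaging of the two directions is needed, and with non-negative counts it stays between 0 and 1.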