
marvin at rectangular
Jan 23, 2008, 10:31 PM
Post #1 of 6
(977 views)
Permalink
|
|
Wildcards (was Re: KinoSearch feature suggestions)
|
|
On Jan 23, 2008, at 6:59 AM, Peter Karman wrote: > Yes, Swish-e supports the '*' to match 1 or more characters at the > end of the word, and > '?' to match exactly one character anywhere in the word. However, > Swish-e does that by > means of a 256-byte wide lookup table (iirc), which works only > because Swish-e supports > single-byte encodings. Interesting. Is the lookup table used when matching terms, or also postings? Is there a clear division between the two stages, as there is in KS? Accumulating the set of terms that match the wildcard isn't that hard. Here's a simple example with a trailing wildcard. my $wildcard_query_string = "pet*"; my ($frag) = $wildcard_query_string =~ /(.*?)\*/; my $lexicon = $index_reader->look_up_field("content"); $lexicon->seek($frag); my @terms; while ( $lexicon->next ) { my $term = $lexicon->get_term; last unless index( $term->get_text, $frag ) == 0; push @terms, $term; } From there, we assemble a priority queue of PostingLists: my $pri_q = KinoSearch::Search::PListQueue->new( size => scalar @terms ); for my $term (@terms) { my $posting_list = $index_reader->posting_list($term); $pri_q->insert($posting_list); } The problem we have now is that the priority queue of PostingLists probably isn't a good way to zip through a lot of matching terms. There's going to be some disk seeking, as the results for "peter" and "petroleum" and "petunia" are interleaved. Hmm... If we punt on scoring, it might make sense from an i/o standpoint to iterate through all the matches up front and save a BitVector with matching doc nums set. Actually, if we iterate up front, we could find out the IDF of the fragment and then use that to assess a crude score. However, we wouldn't have TF info available unless we use something bigger than a BitVector to hold the temporary results. Marvin Humphrey Rectangular Research http://www.rectangular.com/ _______________________________________________ KinoSearch mailing list KinoSearch [at] rectangular http://www.rectangular.com/mailman/listinfo/kinosearch
|