
marvin at rectangular
Jan 25, 2008, 11:34 AM
Post #2 of 4
(750 views)
Permalink
|
|
Re: Wildcards (was: Re: KinoSearch feature suggestions)
[In reply to]
|
|
On Jan 25, 2008, at 8:38 AM, Father Chrysostomos wrote: > I think I’m confused as to what the lexicon is for. I've just committed some revised docs for Lexicon. Please let me know if this is sufficiently clear: KinoSearch::Index::Lexicon - Iterator for a field's Terms. =head1 SYNOPSIS my $lexicon = $index_reader->blank_lexicon('content'); while ( $lexicon->next ) { print $lexicon->get_term->get_text . "\n"; } =head1 DESCRIPTION A Lexicon is an iterator which provides access to all the unique Terms for a given field in sorted order. If an index consists of two documents with a 'content' field holding "three blind mice" and "three musketeers" respectively, then the code from the synopsis above would print this: blind mice musketeers three > In your earlier example, you used > > my $lexicon = $reader->look_up_field($field); I had just changed the name of this method to be "blank_lexicon". my $lexicon = $reader->blank_lexicon($field); However, that name choice seems to have been a bad call, judging by your reaction: > Why would we need to create a blank one? Let's see if we can't come up with a less confusing solution. Here's the rationale for the current names: IndexReader provides factory methods which supply both Lexicons and PostingLists. my $seeked_lexicon = $reader->lexicon($term); my $unseeked_lexicon = $reader->blank_lexicon($field); my $seeked_posting_list = $reader->posting_list($term); my $unseeked_posting_list = $reader->blank_posting_list($field); blank_lexicon() and blank_posting_list() both return uninitialized iterators. lexicon() and posting_list() both return iterators which have been pre-located via seek($term). With the name change, I'd hoped to eliminate any ambiguity as to whether "look_up_field" would return a Lexicon or a PostingList, and also psychologically reinforce the implication of parallel behavior between "blank_lexicon" and "blank_posting_list". One possible fix would be to supply only two methods, and have them behave polymorphically based on whether a Term or a plain string representing a field name was supplied: my $seeked_lexicon = $reader->lexicon($term); my $unseeked_lexicon = $reader->lexicon($field); my $seeked_posting_list = $reader->posting_list($term); my $unseeked_posting_list = $reader->posting_list($field); That would be fine at the Perl level. However, I don't want to do that at the C level, where these methods are actually implemented, and it would be best if the C API and the Perl API resembled each other as closely as possible. So here's my latest thought: my $seeked_lexicon = $reader->lexicon( 'content', 'foo' ); my $unseeked_lexicon = $reader->lexicon( 'content', undef ); my $seeked_posting_list = $reader->posting_list( 'content', 'foo' ); my $unseeked_posting_list = $reader->posting_list( 'content', undef ); Does that make sense? > Or is the idea to have a lexicon that covers multiple fields? Lexicons explicitly do *not* cover multiple fields. In this, they differ from the corresponding Lucene class, TermEnum, which always iterates over all terms in all fields, sorting first by field name then by term text. The file formats differ as well: for KS, each field gets a dedicated file for simplicity's sake, while in Lucene, values for multiple fields coexist within the same file. There are two reasons behind this: 1. At some point, I intend to add support for specifying custom sorting routines via FieldSpec, which would affect the order in which the Terms in the Lexicon are sorted. Managing multiple comparison functions within a single Lexicon object would be complex and ugly. 2. At some point, it would be nice to support non-text fields. Hmm. Thinking over the second point, perhaps it would be best if Lexicons only stored field values rather than terms. In Lucene, that wouldn't work because TermEnum objects handle multiple fields, but in KS, the field is fixed. Making such a change wouldn't be trivial, but it's probably worthwhile. > Another thing: Since ‘pet*’ is essentially a type of simple > regular expression, why not provide support for Regexp queries? It > should be no less efficient if we look for a literal prefix > (completely untested): > > # get the literal prefix of the regexp, if any. > if($self->{re} =~ > /^ > (?: # prefix for qr//'s, without allowing /i : > \(\? ([a-hj-z]*) (?:-[a-z]*)?: > )? > (\\[GA]|\^) # anchor > ([^#\$()*+.?[\]\\^]+) # literal pat (no metachars or > comments) > /x > ) {{ > my ($mod,$anchor,$prefix) = ($1,$2,$3); > $anchor eq '^' and $mod =~ /m/ and last; > $mod =~ /x/ and $prefix =~ s/\s+//g; > $self->{prefix} = $prefix; > }} A RegexQuery class would be nice to have, but it would have some significant limitations. If it used the existing KS index data structures, it would not behave like a typical SQL regex or LIKE query, matching the regex against the non-tokenized contents for each field. If you did something like this... my $regex_query = KSx::Search::RegexQuery->new( field => 'content', regex => qr/three blind/, ); ... and the 'content' field was tokenized, the regex wouldn't match against any of the values in the Lexicon, since e.g. "blind" doesn't match qr/three blind/. In contrast, a trailing wildcard is relatively easy to implement, because you can find all the terms that match with this routine: my ($wild_text) = "pet*" =~ /^(.*?)\*/; my $lexicon = $index_reader->posting_list( 'content', $wild_text ); while (1) { my $term = $lexicon->get_term; last unless $term; my $term_text = $term->get_text; last unless index( $term_text, $wild_text ) == 0; push @term_texts, $term_text; last unless $lexicon->next; } A non-trailing wildcard is more expensive to implement, because it requires iterating over the entire Lexicon. > Then a wild card query could be a subclass that does the following > to its input: > > $str = quotemeta $str; > for($str) { > s/\\\*/.*/g; > s/(?:\.\*){2,}/.*/g; > s/^/^/; > s/\z/\\z/; > } Using regexes to modify regexes is probably not something I would have thought to do, but that's all the better. My goal is to provide the scaffolding. I'm focused on getting Lexicon and PostingList right, so that you can use or abuse them as you wish. :) Marvin Humphrey Rectangular Research http://www.rectangular.com/ _______________________________________________ KinoSearch mailing list KinoSearch [at] rectangular http://www.rectangular.com/mailman/listinfo/kinosearch
|