
marvin at rectangular
Sep 10, 2008, 8:57 PM
r3868 - in trunk/perl/lib/KinoSearch/Docs: . Cookbook
Author: creamyg
Date: 2008-09-10 20:57:22 -0700 (Wed, 10 Sep 2008)
New Revision: 3868

Modified:
   trunk/perl/lib/KinoSearch/Docs/Cookbook.pod
   trunk/perl/lib/KinoSearch/Docs/Cookbook/CachedSearcher.pod
   trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQuery.pod
   trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQueryParser.pod
Log:
Write another draft of the Cookbook.

Modified: trunk/perl/lib/KinoSearch/Docs/Cookbook/CachedSearcher.pod
===================================================================
--- trunk/perl/lib/KinoSearch/Docs/Cookbook/CachedSearcher.pod 2008-09-11 01:43:18 UTC (rev 3867)
+++ trunk/perl/lib/KinoSearch/Docs/Cookbook/CachedSearcher.pod 2008-09-11 03:57:22 UTC (rev 3868)
@@ -5,22 +5,22 @@
 
 =head1 ABSTRACT
 
-At the core of every Searcher object is an IndexReader, and when an
-IndexReader object is created, a small portion of the InvIndex is loaded into
-memory. Additional caches are filled as relevant queries arrive.
+When a L<Searcher|KinoSearch::Searcher> object is created, a small portion of
+the invindex is loaded into memory; additional caches are filled as relevant
+queries arrive. For small document collections on lightly-loaded servers, the
+time it takes to warm up the Searcher isn't worth worrying about. For large
+document collections or busy servers, though, the warmup time may become
+significant, in which case reusing the Searcher is likely to speed up your
+application.
 
-For small document collections on lightly-loaded servers, the time to warm up
-the Searcher/Reader isn't worth worrying about. For large document
-collections or busy servers, the warmup time may become significant, in which
-case reusing the Searcher is likely to speed up your application.
-
 =head1 FastCGI
 
-A script running under standard CGI runs once per request. In contrast, a
-script running on FastCGI webserver using the CGI::Fast module from CPAN
-starts upon the first request then executes a loop once per request.
+A script running under standard CGI runs once per request; in contrast, a
+script running on a FastCGI-enabled webserver using the CGI::Fast module from
+CPAN starts up on the first request then executes a loop once per request.
 
-Create your Searcher outside this loop:
+Create your Searcher outside this loop, so that the object persists over
+multiple requests:
 
     my $searcher = KinoSearch::Searcher->new(
         invindex => MySchema->read('/path/to/invindex/')
@@ -77,8 +77,9 @@
     fetch hits    0.006    0.008    75.602%
     _stop_        0.000    0.008     0.186%
 
-As the numbers indicate, for a simple term query, the time to initialize the
-Searcher overwhelms the time to execute the search and return results.
+Its clear from those numbers that for a simple term query, the time it takes
+to initialize the Searcher swamps the time it takes to execute the search and
+return results.
 
 =head1 COPYRIGHT
 

Modified: trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQuery.pod
===================================================================
--- trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQuery.pod 2008-09-11 01:43:18 UTC (rev 3867)
+++ trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQuery.pod 2008-09-11 03:57:22 UTC (rev 3868)
@@ -39,10 +39,10 @@
 
 =back
 
-PrefixQuery on its own isn't enough because Query objects are mainly
-containers for metadata describing what to search for, and as such they can't
-do much -- they merely express a spec and leave the implementation of that
-spec to their companion classes.
+The PrefixQuery class on its own isn't enough because a Query object's role is
+limited to expressing an abstract specification for the search. A Query is
+basically nothing but metadata; execution is left to the Query's companion
+Compiler and Scorer.
 
 Here's a simplified sketch illustrating how a Searcher's search() method ties
 together the three classes.
@@ -57,13 +57,13 @@
 
 =head2 PrefixQuery
 
-The PrefixQuery class has two basic attributes: a query string and a field
+Our PrefixQuery class will have two attributes: a query string and a field
 name.
 
     package PrefixQuery;
     use base qw( KinoSearch::Search::Query );
     use Carp;
-    
+
     # Inside-out member vars and hand-rolled accessors.
     my %query_string;
     my %field;
@@ -77,19 +77,14 @@
         my $query_string = delete $args{query_string};
         my $field = delete $args{field};
         my $self = $class->SUPER::new(%args);
-    
-        # Validate and assign required parameters.
         confess("'query_string' param is required")
             unless defined $query_string;
+        confess("Invalid query_string: '$query_string'")
+            unless $query_string =~ /\*\s*$/;
         confess("'field' param is required")
             unless defined $field;
         $query_string{$$self} = $query_string;
         $field{$$self} = $field;
-    
-        # Only support trailing wildcards, i.e. "hous*" but not "hou*s".
-        confess("Invalid query_string: '$query_string'")
-            unless $query_string =~ /\*\s*$/;
-    
         return $self;
     }
 
@@ -123,7 +118,7 @@
 
 Searchable objects have access to certain statistical information about the
 collections they represent; for instance, a Searchable can tell you how many
-documents there are...
+documents are in the collection...
 
     my $maximum_number_of_docs_in_collection = $searchable->max_docs;
 
@@ -148,7 +143,7 @@
 
     sub make_scorer {
         my ( $self, $index_reader ) = @_;
-    
+
         # Acquire a Lexicon and seek it to our query string.
         my $substring = $self->get_parent->get_query_string;
         $substring =~ s/\*.\s*$//;
@@ -156,7 +151,7 @@
         my $lexicon = $index_reader->lexicon( field => $field );
         return unless $lexicon;
         $lexicon->seek($substring);
-    
+
         # Accumulate PostingLists for each matching term.
         my @posting_lists;
         while ( defined( my $term = $lexicon->get_term ) ) {
@@ -171,17 +166,19 @@
             last unless $lexicon->next;
         }
         return unless @posting_lists;
-    
+
         return PrefixScorer->new( posting_lists => \@posting_lists );
     }
 
 PrefixCompiler gets access to an
 L<IndexReader|KinoSearch::Search::IndexReader> object when make_scorer() gets
-called. From the IndexReader we acquire a Lexicon, which is a list of a
-field's unique terms; we iterate over the terms in the Lexicon, acquiring a
-PostingList for each term that matches our prefix.
+called. From the IndexReader we acquire a
+L<Lexicon|KinoSearch::Index::Lexicon>, which is an iterator for a field's
+unique terms; we scan through the Lexicon's terms, acquiring a
+L<PostingList|KinoSearch::Index::PostingList> for each term that matches our
+prefix.
 
-Each of these PostingList objects represents a list of documents which match
+Each of these PostingList objects represents a set of documents which match
 the query.
 
 =head2 PrefixScorer
@@ -190,17 +187,17 @@
 
     package PrefixScorer;
     use base qw( KinoSearch::Search::Scorer );
-    
+
     # Inside-out member vars.
     my %doc_nums;
     my %tally;
     my %tick;
-    
+
     sub new {
         my ( $class, %args ) = @_;
         my $posting_lists = delete $args{posting_lists};
         my $self = $class->SUPER::new(%args);
-    
+
         # Cheesy but simple way of interleaving PostingList doc sets.
         my %all_doc_nums;
         for my $posting_list (@$posting_lists) {
@@ -210,11 +207,11 @@
         }
         my @doc_nums = sort { $a <=> $b } keys %all_doc_nums;
         $doc_nums{$$self} = \@doc_nums;
-    
+
         $tick{$$self} = -1;
         $tally{$$self} = KinoSearch::Search::Tally->new;
         $tally{$$self}->set_score(1.0); # fixed score of 1.0
-    
+
         return $self;
     }
 
@@ -259,13 +256,10 @@
         return $tally{$$self};
     }
 
-=head1 CONCLUSION
+=head1 Usage
 
-To see PrefixQuery action, try feeding it the query string in the sample US
-constitution search.cgi app.
-
-If you're feeling ambitious, you can also try extending
-KinoSearch::QueryParser to support PrefixQuery, as described in
+To try out PrefixQuery, insert the FlatQueryParser module (which supports
+PrefixQuery) into the search.cgi sample app, as described in
 L<KinoSearch::Docs::Cookbook::CustomQueryParser>.
 
 =head1 COPYRIGHT

Modified: trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQueryParser.pod
===================================================================
--- trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQueryParser.pod 2008-09-11 01:43:18 UTC (rev 3867)
+++ trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQueryParser.pod 2008-09-11 03:57:22 UTC (rev 3868)
@@ -5,7 +5,7 @@
 
 =head1 ABSTRACT
 
-Create a custom search query language, using KinoSearch::QueryParser and
+Implement a custom search query language using KinoSearch::QueryParser and
 Parse::RecDescent.
 
 =head1 Grammar-based vs. hand-rolled
@@ -48,7 +48,7 @@
 We'll use a fixed field name of "content", and a fixed choice of English
 PolyAnalyzer.
 
-    package SimpleQueryParser;
+    package FlatQueryParser;
     use KinoSearch::Search::TermQuery;
     use KinoSearch::Search::PhraseQuery;
     use KinoSearch::Search::ORQuery;
@@ -83,8 +83,8 @@
         );
     }
 
-This private _tokenize() method treats double-quote delimited material as a
-phrase and everything else as a term:
+Our private _tokenize() method treats double-quote delimited material as a
+single token and splits on whitespace everywhere else.
 
     sub _tokenize {
         my ( $self, $query_string ) = @_;
@@ -106,7 +106,9 @@
 
 The main parsing routine creates an array of tokens by calling _tokenize(),
 runs the tokens through through the PolyAnalyzer, creates TermQuery or
-PhraseQuery objects, and adds each of the sub-queries to the primary ORQuery.
+PhraseQuery objects according to how many tokens emerge from the
+PolyAnalyzer's split() method, and adds each of the sub-queries to the primary
+ORQuery.
 
     sub parse {
         my ( $self, $query_string ) = @_;
@@ -204,11 +206,9 @@
 this time -- KinoSearch::QueryParser's constructor requires a Schema which
 conveys field and Analyzer information, so we can just defer to that.
 
-    package SimpleQueryParser;
+    package FlatQueryParser;
     use base ( KinoSearch::QueryParser );
 
-    ...
-
     our %rd_parser;
 
     sub new {
@@ -272,7 +272,7 @@
 and if multiple fields are required, creates an ORQuery which mults out e.g.
 C<foo> into C<(title:foo OR content:foo)>.
 
-=head1 Extending the query language.
+=head1 Extending the query language
 
 To add support for trailing wildcards to our query language, first we need to
 modify our grammar, adding a C<prefix_query> production and tweaking the
@@ -283,7 +283,7 @@
         | prefix_query
         | term_query
 
-    preix_query:
+    prefix_query:
         /(\w+\*)/
         { KinoSearch::Search::LeafQuery->new( text => $1 ) }
 
@@ -310,12 +310,12 @@
         }
     }
 
-=head1 USAGE
+=head1 Usage
 
 Insert any of our custom parsers into the search.cgi sample app to get a feel
 for how they behave:
 
-    my $parser = SimpleQueryParser->new( schema => $searcher->get_schema );
+    my $parser = FlatQueryParser->new( schema => $searcher->get_schema );
     my $query = $parser->parse( $cgi->param('q') || '' );
     my $hits = $searcher->search(
         query => $query,

Modified: trunk/perl/lib/KinoSearch/Docs/Cookbook.pod
===================================================================
--- trunk/perl/lib/KinoSearch/Docs/Cookbook.pod 2008-09-11 01:43:18 UTC (rev 3867)
+++ trunk/perl/lib/KinoSearch/Docs/Cookbook.pod 2008-09-11 03:57:22 UTC (rev 3868)
@@ -4,16 +4,10 @@
 
 =head1 DESCRIPTION
 
-Each of the recipes in the Cookbook uses the completed
-L<Tutorial|KinoSearch::Docs::Tutorial> application as its point of departure.
-The materials can be found in the C<sample> directory at the root of the
-KinoSearch distribution:
+The Cookbook provides thematic documentation covering some of KinoSearch's
+more sophisticated features. For a step-by-step introduction to KinoSearch,
+see L<KinoSearch::Docs::Tutorial>.
 
-    sample/USConSchema.pm # custom KinoSearch::Schema subclass
-    sample/invindexer.plx # indexing app
-    sample/search.cgi # search app
-    sample/us_constitution # html documents
-
 =head2 Chapters
 
 =over
@@ -21,7 +15,7 @@
 =item *
 
 L<KinoSearch::Docs::Cookbook::CachedSearcher> - Improve search-time
-performance under FastCGI or mod_perl by reusing a cached Searcher/IndexReader.
+performance under FastCGI or mod_perl by reusing a cached Searcher.
 
 =item *
 
@@ -31,11 +25,22 @@
 
 =item *
 
-L<KinoSearch::Docs::Cookbook::CustomQueryParser> - Create a custom search
-query language, using KinoSearch::QueryParser and Parse::RecDescent.
+L<KinoSearch::Docs::Cookbook::CustomQueryParser> - Define your own custom
+search query syntax using KinoSearch::QueryParser and Parse::RecDescent.
 
 =back
 
+=head2 Materials
+
+Some of the recipes in the Cookbook reference the completed
+L<Tutorial|KinoSearch::Docs::Tutorial> application. These materials can be
+found in the C<sample> directory at the root of the KinoSearch distribution:
+
+    sample/USConSchema.pm # custom KinoSearch::Schema subclass
+    sample/invindexer.pl # indexing app
+    sample/search.cgi # search app
+    sample/us_constitution # html documents
+
 =head1 COPYRIGHT
 
 Copyright 2005-2008 Marvin Humphrey

_______________________________________________
kinosearch-commits mailing list
kinosearch-commits [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch-commits