Mailing List Archive: kinosearch: discuss

Re: OpenQueryParser

marvin at rectangular

Apr 28, 2008, 1:00 PM

On Apr 27, 2008, at 3:07 PM, Nathan Kurz wrote:

> You could have the Parser build a tree with a special field type of
> 'any', which then gets expanded out to multiple fields at a later
> stage.

I think the most elegant solution is to use undef/NULL for the 'any'
field type. (We'll have to modify TermQuery etc. to accept an undef
value for 'field'.)
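
A minimal sketch of what that could look like, assuming a constructor
that only insists on 'term'. Sketch::TermQuery and is_any_field() are
hypothetical names for illustration, not the real KinoSearch classes:

```perl
use strict;
use warnings;

# Hypothetical stand-in for TermQuery; not the real KinoSearch class.
package Sketch::TermQuery;

sub new {
    my ( $class, %args ) = @_;
    die "'term' is required" unless defined $args{term};

    # An undef 'field' is accepted and means "any field".
    return bless { field => $args{field}, term => $args{term} }, $class;
}

sub field        { return $_[0]->{field} }
sub term         { return $_[0]->{term} }
sub is_any_field { return !defined $_[0]->{field} }

package main;

my $any_field = Sketch::TermQuery->new( field => undef,   term => 'foo' );
my $one_field = Sketch::TermQuery->new( field => 'title', term => 'foo' );

# The first query is still waiting to be expanded; the second is not.
for my $query ( $any_field, $one_field ) {
    my $label = $query->is_any_field ? 'any field' : $query->field;
    print $query->term, " => $label\n";
}
```

Anything downstream that assumes 'field' is a string would need the
same kind of check.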

> I'd sort of like to have this stage anyway, since it keeps the
> Parser more independent of the Index, and would let me do tricks like
> replacing OrScorer with MyOrScorer.

OK, let's run with the idea of an intermediate representation before
we get to Query objects.

The final output of the parser *has* to be a Query object, because
this code snippet has to work:

my $query = $query_parser->parse($query_string);
my $hits = $searcher->search( query => $query );

Internally, though, we can divide up parse() into two stages:

my $tree = $query_parser->tree($query_string);
my $query = $query_parser->build($tree);

The tree stage would be constructed from "ParseNode" objects and would
be single-field (except where the field gets explicitly specified via
the query string).

Hmm. I think we'll need ParseNode subclasses like ANDNode, ORNode,
PhraseNode, and so on. Are we any better off than when the tree was
being built with ANDQuery, ORQuery, etc? I don't see how we are.

So... keep the two-stage compilation, but have both stages output
Query objects and just make it possible to walk child nodes for
ANDQuery, ORQuery, etc.
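
Roughly, and only as a sketch: the Sketch::* classes, the 'fields'
constructor argument, and the children() accessor below are all
assumptions made for illustration. The tree() stage here is a toy that
ANDs whitespace-separated terms, not a real grammar. The point is that
both stages return Query objects and build() only needs to walk child
nodes:

```perl
use strict;
use warnings;

package Sketch::TermQuery;
sub new { my ( $class, %args ) = @_; return bless {%args}, $class }

package Sketch::PolyQuery;    # hypothetical shared base for AND/OR nodes
sub new { my ($class) = @_; return bless { children => [] }, $class }

sub add_child {
    my ( $self, %args ) = @_;
    push @{ $self->{children} }, $args{query};
    return;
}
sub children { return @{ $_[0]->{children} } }

package Sketch::ANDQuery; our @ISA = ('Sketch::PolyQuery');
package Sketch::ORQuery;  our @ISA = ('Sketch::PolyQuery');

package Sketch::QueryParser;

sub new {
    my ( $class, %args ) = @_;
    return bless { fields => $args{fields} }, $class;
}

# Stage 1: toy grammar. Every term is ANDed and its field is left undef.
sub tree {
    my ( $self, $query_string ) = @_;
    my $and_query = Sketch::ANDQuery->new;
    for my $term ( split ' ', $query_string ) {
        my $term_query
            = Sketch::TermQuery->new( field => undef, term => $term );
        $and_query->add_child( query => $term_query );
    }
    return $and_query;
}

# Stage 2: walk the tree, expanding each undef-field term into an
# ORQuery across all known fields. Other nodes are copied as they are.
sub build {
    my ( $self, $node ) = @_;
    if ( $node->isa('Sketch::PolyQuery') ) {
        my $copy = ref($node)->new;
        $copy->add_child( query => $self->build($_) ) for $node->children;
        return $copy;
    }
    return $node if defined $node->{field};
    my $or_query = Sketch::ORQuery->new;
    for my $field ( @{ $self->{fields} } ) {
        my $term_query = Sketch::TermQuery->new(
            field => $field,
            term  => $node->{term},
        );
        $or_query->add_child( query => $term_query );
    }
    return $or_query;
}

# The public entry point still returns a Query object.
sub parse {
    my ( $self, $query_string ) = @_;
    return $self->build( $self->tree($query_string) );
}

package main;

my $query_parser = Sketch::QueryParser->new( fields => [ 'title', 'body' ] );
my $query        = $query_parser->parse('foo bar');
```

Swapping in a different build() (or a different scorer later on) would
not require touching tree().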

> Instead of trying to build an
> optimizing Parser, you could do the optimizations and checks in a
> separate pass and keep the Parser simpler.

It seems like a sound principle that the parser should not optimize.
Where do you normally perform optimization? Not in the lexer, not in
the parser... optimization is normally the compiler's job.

And that finally suggests a decent replacement name for the Weight
class: it should be renamed to KinoSearch::Search::Compiler.

To refine what we want out of the parser class, though... its output
ought to be an abstract syntax tree built from Query objects, rather than
a parse tree per se. (<http://en.wikipedia.org/wiki/Abstract_syntax_tree>)
For example, we want the output of the parser for both of these query
string inputs...

'foo -bar'
'foo AND NOT bar'

... to be exactly equivalent to this:

my $foo_query = TermQuery->new( field => undef, term => 'foo' );
my $bar_query = TermQuery->new( field => undef, term => 'bar' );
my $not_bar_query = NOTQuery->new( query => $bar_query );
my $and_query = ANDQuery->new;
$and_query->add_child( query => $foo_query );
$and_query->add_child( query => $not_bar_query );

That's lossy -- because we lose the ability to recreate the input
query string -- but we don't lose any intent and it's not a true
optimization.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


nate at verse

Apr 28, 2008, 3:38 PM

On Mon, Apr 28, 2008 at 2:00 PM, Marvin Humphrey <marvin [at] rectangular> wrote:
> I think the most elegant solution is to use undef/NULL for the 'any' field
> type. (We'll have to modify TermQuery etc. to accept an undef value for
> 'field'.)

Sounds good.

> The final output of the parser *has* to be a Query object, because this
> code snippet has to work:
>
> my $query = $query_parser->parse($query_string);
> my $hits = $searcher->search( query => $query );

Fine by me.

> So... keep the two-stage compilation, but have both stages output Query
> objects and just make it possible to walk child nodes for ANDQuery, ORQuery,
> etc.

Yes, definitely simpler this way.

> And that finally suggests a decent replacement name for the Weight class:
> it should be renamed to KinoSearch::Search::Compiler.

This certainly makes me happier than Weight.

> To refine what we want out of the parser class, though... its output ought
> to be an abstract syntax tree built from Query objects, rather than a parse
> tree per se. (<http://en.wikipedia.org/wiki/Abstract_syntax_tree>)

Pedantic, but true :). I don't see a great benefit in distinguishing
these two (an AST looks a lot to me like a Parse Tree for slightly
different input), but I will attempt to change my terminology.

> For example, we
> want the output of the parser for both of these query string inputs...
>
> 'foo -bar'
> 'foo AND NOT bar'
>
> ... to be exactly equivalent to this:
> [snip]

Yes. Although, personally, I'm not above canonicalizing the input
string as text before passing it to the parser. One could allow the
Parser to directly accept either of these, and massage the other into
form. This is much simpler than writing your own parser if you just
want a small change in the grammar.
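
Something like this, as a rough sketch. canonicalize_query_string() is
a hypothetical helper, not anything in KinoSearch, and it assumes the
Parser itself only understands the 'AND NOT' form. It ignores quoted
phrases and a leading '-' at the very start of the string:

```perl
use strict;
use warnings;

# Hypothetical pre-processing step: rewrite the "-term" shorthand into
# "AND NOT term" before the string reaches the parser.
sub canonicalize_query_string {
    my ($query_string) = @_;

    # Only a '-' that follows whitespace and precedes a word character
    # is rewritten, so hyphenated words such as "well-known" are kept.
    $query_string =~ s/(?<=\s)-(?=\w)/AND NOT /g;
    return $query_string;
}

my $canonical = canonicalize_query_string('foo -bar');
print "$canonical\n";
```

A string already written as 'foo AND NOT bar' passes through unchanged.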

> That's lossy -- because we lose the ability to recreate the input query
> string -- but we don't lose any intent and it's not a true optimization.

True, but I don't think this is a problem. It's probably best to save
the initial query string at some top level, so it can be redisplayed
later, but this can be outside KinoSearch core.

Nathan Kurz
nate [at] verse
