Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: kinosearch: discuss

OpenQueryParser (was "opening up the scorers")

 

 

kinosearch discuss RSS feed   Index | Next | Previous | View Threaded


marvin at rectangular

Apr 23, 2008, 9:21 PM

Post #1 of 2 (825 views)
Permalink
OpenQueryParser (was "opening up the scorers")

On Apr 22, 2008, at 2:39 PM, Nathan Kurz wrote:

> On Mon, Apr 21, 2008 at 10:54 PM, Marvin Humphrey
> <marvin [at] rectangular> wrote:
>> Instead of opening up the core class, I'd be more inclined to write
>> and
>> release KSx::Search::OpenQueryParser, which would look a lot like the
>> current QueryParser but single-field and with factory methods.
>
> I'm pretty happy with that approach, although I think there might be a
> little more safe maneuvering room before the slope gets slippery.
> Opening up the API to allow syntax changes seems excessive;

I think another nice feature for OpenQueryParser would be to base it
on Parse::YAPP or something like that, and have the grammar be
configurable via a constructor param. The core QueryParser is based
off of regexes, which is fast and dependency-free but not extensible.
A grammar-based QueryParser would offer more opportunities for
customization.

The problem faced by any of these single-field parsers, though, is
that things get messy when you try to combine queries that involve
multiple fields, which is a very common practical need. Say you're
searching for "foo AND NOT bar", you parse that twice for the "title"
and "content" fields, then join the two parsed queries with an
ORQuery. You end up with something like this:

title:(foo AND NOT bar) OR content:(foo AND NOT bar)

Unfortunately, ORing the two result sets together means that any
document where the title matches 'foo AND NOT bar' will match
regardless of whether the content field contains 'bar' -- and that's
probably not what the end user wants.

This was a bug in KS that got fixed a while back, and it took making
QueryParser multi-field to fix it. What QueryParser does now is
essentially this:

(title:foo OR content:foo) AND NOT (title:bar OR content:bar)

I don't see a way to fix that problem except at a low-level via a
multi-field parser. Do you?

> I ignored
> QueryParser and wrote my own, but my fear is that the burden of
> writing a parser is going to stop anyone from casually experimenting
> with different scorers.

I can see that. There are some custom Query subclass concepts that
don't require mods to QueryParser to be practical, but there are lots
that do.

> A stray thought: QueryParser implies that it is parsing a Query,
> whereas it's probably clearer to think of it as building a query from
> some text, with the output tree being the actual Query. I don't
> suppose that QueryBuilder strikes you as a clearer name? It would
> make it clearer what it does...

It's arguable. QueryParser does parse a query string, after all.

>>> I strongly think you want to 'return the universe' [for a bare NOT
>>> query].
>>>
>>
>> Returning the universe is a perfectly reasonable behavior for some
>> applications. However, I strongly disagree that it should be the
>> default
>> behavior for the core QueryParser.
>>
>> If I write NOTQuery, at least then it becomes possible to implement
>> your
>> desired behavior. It's probably best if I focus my energies on
>> that task.
>
> Probably an agree to disagree sort of situation.

The goal is to behave as an end user typing into a search box on a
website would expect. The big web search engine sites set the trends,
and KinoSearch's core QueryParser follows.

However, I think it would be reasonable for OpenQueryParser to have
return-the-universe behavior by default. How easy would it be to
switch it off?

> My main preference would be to have the Scorer
> capable of ordering and returning large numbers of results without
> blowing up --- whether it does so by default is merely a detail.

KS won't blow up, because the standard TopDocs search uses a finite-
sized HitQueue to order results on the fly as scoring proceeds rather
than accumulating a giant array of hits and sorting by score at the end.

> So yes, implementing a NOTQuery that the default parser optimizes out
> would be just fine for my purposes, although I might try to argue that
> this optimization should take place at some later stage to allow for a
> simpler Parser.

I don't think we can move it outside of QueryParser. We'd have to
adapt every search method individually, which wouldn't be practical or
wise.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


nate at verse

Apr 27, 2008, 3:07 PM

Post #2 of 2 (762 views)
Permalink
Re: OpenQueryParser (was "opening up the scorers") [In reply to]

On Wed, Apr 23, 2008 at 10:21 PM, Marvin Humphrey
<marvin [at] rectangular> wrote:
> The problem faced by any of these single-field parsers, though, is that
> things get messy when you try to combine queries that involve multiple
> fields, which is a very common practical need.
> ...
> I don't see a way to fix that problem except at a low-level via a
> multi-field parser. Do you?

You could have the Parser build a tree with a special field type of
'any', which then gets expanded out to multiple fields at a later
stage. I'd sort of like to have this stage anyway, since it keep the
Parser more independent of the Index, and would let me do tricks like
replacing OrScorer with MyOrScorer. Instead of trying to build an
optimizing Parser, you could do the optimizations and checks in a
separate pass and keep the Parser simpler.

> > A stray thought: QueryParser implies that it is parsing a Query,
> > whereas it's probably clearer to think of it as building a query from
> > some text, with the output tree being the actual Query. I don't
> > suppose that QueryBuilder strikes you as a clearer name? It would
> > make it clearer what it does...
> >
>
> It's arguable. QueryParser does parse a query string, after all.

I think that's part of the problem. In my mind, a Query is just
string, not a Tree. Having a QueryParser that parses a Query (string)
and returns a ParseTree would be great. Having it parse a query and
return a Query is confusing.

> The goal is to behave as an end user typing into a search box on a website
> would expect. The big web search engine sites set the trends, and
> KinoSearch's core QueryParser follows.

Do users really expect this behaviour, or is a shortcut taken by
programmers? Realizing that probably only a tiny number of end users
ever use stop words at all, if _I_ were to type '-foo' into a site
search box, I would expect it to return all documents that do not
contain the word 'foo', probably ordered by popularity. This would
certainly be more useful than claiming that no documents match. That
said, despite urging you to make KinoSearch more general, I agree that
out of the box it should work the way that users expect as a site
search engine, and that any other uses should be secondary.

> > My main preference would be to have the Scorer
> > capable of ordering and returning large numbers of results without
> > blowing up --- whether it does so by default is merely a detail.
> >
>
> KS won't blow up, because the standard TopDocs search uses a finite-sized
> HitQueue to order results on the fly as scoring proceeds rather than
> accumulating a giant array of hits and sorting by score at the end.

'Blow up' was sloppy speech on my part. 'Grind to a halt' would
probably be closer. I'd like to have a Scorer that is either smart
enough to avoid processing the entire index for queries that match
(almost) every document, or fast enough that processing the entire
index is no big deal.

I haven't thought about it for a while, but at one point I had a
scheme to do this with a minimum document number and a maximum
document score. If the whole HitQueue was at the max score, you could
return early. If a max score occurs at less than the minimum document
number, you skip it as already returned. This would let you
semi-efficiently do things like return hits 1,000,000 to 1,000,100,
although sometimes you'd need a second pass to pick up stragglers.

Nathan Kurz
nate [at] verse

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch

kinosearch discuss RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.