
nate at verse
Apr 27, 2008, 3:07 PM
Post #2 of 2
Re: OpenQueryParser (was "opening up the scorers")
On Wed, Apr 23, 2008 at 10:21 PM, Marvin Humphrey <marvin [at] rectangular> wrote:
> The problem faced by any of these single-field parsers, though, is that
> things get messy when you try to combine queries that involve multiple
> fields, which is a very common practical need.
> ...
> I don't see a way to fix that problem except at a low-level via a
> multi-field parser. Do you?

You could have the Parser build a tree with a special field type of 'any', which then gets expanded out to multiple fields at a later stage. I'd sort of like to have this stage anyway, since it keeps the Parser more independent of the Index, and would let me do tricks like replacing OrScorer with MyOrScorer. Instead of trying to build an optimizing Parser, you could do the optimizations and checks in a separate pass and keep the Parser simpler.

> > A stray thought: QueryParser implies that it is parsing a Query,
> > whereas it's probably clearer to think of it as building a query from
> > some text, with the output tree being the actual Query. I don't
> > suppose that QueryBuilder strikes you as a clearer name? It would
> > make it clearer what it does...
>
> It's arguable. QueryParser does parse a query string, after all.

I think that's part of the problem. In my mind, a Query is just a string, not a Tree. Having a QueryParser that parses a Query (string) and returns a ParseTree would be great. Having it parse a query and return a Query is confusing.

> The goal is to behave as an end user typing into a search box on a website
> would expect. The big web search engine sites set the trends, and
> KinoSearch's core QueryParser follows.

Do users really expect this behaviour, or is it a shortcut taken by programmers? Realizing that probably only a tiny number of end users ever use stop words at all, if _I_ were to type '-foo' into a site search box, I would expect it to return all documents that do not contain the word 'foo', probably ordered by popularity. This would certainly be more useful than claiming that no documents match.

That said, despite urging you to make KinoSearch more general, I agree that out of the box it should work the way that users expect as a site search engine, and that any other uses should be secondary.

> > My main preference would be to have the Scorer
> > capable of ordering and returning large numbers of results without
> > blowing up --- whether it does so by default is merely a detail.
>
> KS won't blow up, because the standard TopDocs search uses a finite-sized
> HitQueue to order results on the fly as scoring proceeds rather than
> accumulating a giant array of hits and sorting by score at the end.

'Blow up' was sloppy speech on my part. 'Grind to a halt' would probably be closer. I'd like to have a Scorer that is either smart enough to avoid processing the entire index for queries that match (almost) every document, or fast enough that processing the entire index is no big deal.

I haven't thought about it for a while, but at one point I had a scheme to do this with a minimum document number and a maximum document score. If the whole HitQueue was at the max score, you could return early. If a max score occurs at less than the minimum document number, you skip it as already returned. This would let you semi-efficiently do things like return hits 1,000,000 to 1,000,100, although sometimes you'd need a second pass to pick up stragglers.

Nathan Kurz
nate [at] verse

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch
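
To make the 'any' field suggestion in the reply concrete, here is a minimal Python sketch of a separate expansion pass over a parse tree. It is not KinoSearch code: the node classes `Term`, `And`, `Or`, `Not`, the `ANY` marker, and the function `expand_any` are all hypothetical names invented for illustration, and the tree shape is an assumption. The parser would emit `Term(ANY, ...)` without knowing the index's fields, and a later stage would rewrite each such leaf into an OR across the real fields.

```python
from dataclasses import dataclass
from typing import Tuple, Union

ANY = "any"  # hypothetical placeholder field emitted by the parser


@dataclass(frozen=True)
class Term:
    field: str
    text: str


@dataclass(frozen=True)
class Or:
    children: Tuple["Node", ...]


@dataclass(frozen=True)
class And:
    children: Tuple["Node", ...]


@dataclass(frozen=True)
class Not:
    child: "Node"


Node = Union[Term, Or, And, Not]


def expand_any(node, fields):
    """Return a copy of the tree with every 'any' term spread over `fields`."""
    if isinstance(node, Term):
        if node.field != ANY:
            return node  # explicit field: leave untouched
        return Or(tuple(Term(f, node.text) for f in fields))
    if isinstance(node, Not):
        return Not(expand_any(node.child, fields))
    # And / Or: rebuild the same node type around the expanded children
    return type(node)(tuple(expand_any(c, fields) for c in node.children))


# Usage: the field list is supplied only at expansion time.
tree = And((Term(ANY, "foo"), Not(Term("title", "bar"))))
expanded = expand_any(tree, ["title", "body"])
```

Because the field list only appears in the expansion call, the parser itself never needs to see the index. The same pass is also a natural place for the checks, optimizations, or scorer substitutions mentioned in the reply.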
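
The early-exit scheme described at the end of the reply can be sketched roughly as follows. This is a loose reading of it, not KinoSearch's HitQueue: the function `top_hits` and its parameters `k`, `max_score`, and `min_doc` are hypothetical, and the sketch assumes that documents arrive in increasing document-number order, that no score exceeds `max_score`, and that ties are broken in favour of lower document numbers.

```python
import heapq


def top_hits(scored_docs, k, max_score, min_doc=0):
    """Collect the best k hits from (doc_num, score) pairs.

    Returns (hits, scanned), where hits is sorted best-first and scanned is
    the number of documents examined before stopping.
    """
    heap = []  # min-heap of (score, -doc_num); the weakest hit is heap[0]
    scanned = 0
    for doc_num, score in scored_docs:
        scanned += 1
        if score >= max_score and doc_num < min_doc:
            continue  # treated as already returned on an earlier page
        entry = (score, -doc_num)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)
        if len(heap) == k and heap[0][0] >= max_score:
            break  # queue is full of ceiling scores; later docs cannot win
    ordered = sorted(heap, reverse=True)
    return [(-neg_doc, score) for score, neg_doc in ordered], scanned


# Usage: a query that matches every document with the same top score.
docs = [(i, 1.0) for i in range(1000)]
page1, scanned1 = top_hits(docs, 3, max_score=1.0)
page2, scanned2 = top_hits(docs, 3, max_score=1.0, min_doc=3)
```

Here the first page stops after three documents and the second after six, rather than scanning all 1000. The sketch only pages cleanly through hits at the maximum score; lower-scoring stragglers would still need the second pass mentioned in the reply.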