
marvin at rectangular
May 22, 2007, 2:03 PM
Post #2 of 3
On May 22, 2007, at 11:51 AM, Chris Nandor wrote:

> First, why does this return no results (from 601-queryparser.t)?
>
>     my @docs = ( 'x', 'y', 'z', 'x a', 'x a b', 'x a b c', 'x foo a b c d', );
>
>     '-x -c' => [ 0, 0, 0, 0, ],

QueryParser does not obey perfect boolean logic, preferring instead to imitate the interfaces of popular web search engines. The way that query gets parsed, it doesn't match anything: it produces two negated boolean clauses, which effectively act as filters. There's nothing positive to pass through those filters, though, so you get no hits.

We don't want to have the query string '-a_term_that_isnt_in_any_docs_SDUEFAHF' return the entire collection by default. If you search for '-foo' at Google, Yahoo, Wikipedia, etc, you get no matches. KinoSearch behaves the same way.

> And if not, then is there a decent workaround for this?

That depends on your application, and for the time being it requires a non-public kludge which may or may not do what you want.

What you'd need to do is combine QueryParser's output with a query that matches all documents. A MatchAllDocsQuery and associated Weight/Scorer classes would fit the bill, but they don't exist as of yet. In their absence, a MatchFieldQuery might serve as a substitute, so long as the field is present in all docs.

> Second question: what's left before beta/release of KS 0.20?

There are four mandatory items left.

* QueryParser should have the 'field:foo' syntax switched off by default. That way query strings like 'http://www.foo.com' and 'PHP::Interpreter' will behave more sensibly.

* BooleanScorer should take position into account. Tally is the vessel that's supposed to facilitate this, but it needs to be augmented to deal with fields. This won't be a visible API change, but it will affect performance (negatively) and relevance ranking (positively), so it will affect how people tune high-performance applications.

* KinoSearch::Simple::HTML needs to be finished. This involves writing an Analyzer which integrates an HTML parser and uses RichPosting's capabilities to store per-position boost based on visual weight of text.

* Windows compatibility needs to be restored. For this, I'll have to scare up access to a Windows box I can install MSVC on.

There are two other items that I would have liked to have gotten in.

* Finish MatchPosting.

* Add some non-text field types: integers, floats, epoch, etc. I think Swish3 will need these, and I have a plan for how I think they will work. This is a complicated task, though, and we might discover that an optimum implementation involves breaking compat. Now would be the best time to do that.

However, I'm not going to hold up 0.20 for those. MatchPosting will be easy and can be added later. Adding non-text field types is ambitious, and waiting for that feature might be making the perfect the enemy of the good.

FWIW, there have been many other forces tugging at me over the last couple weeks -- a cousin's wedding, ramp-up of a significant new contract, plus a bunch of random stuff. However, I'm contractually obligated to finish KinoSearch::Simple::HTML, and terribly excited about all the other work that's gone in and eager to release.

After completing the checklist above, we'll be looking at release candidates for 0.20. And once 0.20 proves itself stable, I anticipate releasing 1.0. As far as my plans for 1.0 are concerned, we're perilously close to feature-complete.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
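
As a postscript, here is a rough toy model of the "negated clauses act as filters" behavior described above, and of what combining the parsed query with a match-all query would buy you. It is a sketch only: it is written in Python rather than Perl, none of these names (`parse`, `search`, `match_all`) are KinoSearch API, and it assumes positive terms are simply ORed together and that documents are tokenized on whitespace.

```python
# Toy model only: nothing here is KinoSearch API.
DOCS = ['x', 'y', 'z', 'x a', 'x a b', 'x a b c', 'x foo a b c d']


def parse(query):
    """Split a query string into positive terms and negated ('-term') terms."""
    positive, negated = [], []
    for token in query.split():
        if token.startswith('-') and len(token) > 1:
            negated.append(token[1:])
        else:
            positive.append(token)
    return positive, negated


def search(query, docs=DOCS, match_all=False):
    """Return the docs that survive the query.

    Positive terms select candidates (ORed together, an assumption of this
    sketch); negated terms only filter candidates out. With match_all=True
    every doc is a candidate, which stands in for combining the parsed
    query with a query that matches all documents.
    """
    positive, negated = parse(query)
    hits = []
    for doc in docs:
        terms = set(doc.split())
        is_candidate = match_all or any(t in terms for t in positive)
        is_filtered = any(t in terms for t in negated)
        if is_candidate and not is_filtered:
            hits.append(doc)
    return hits


print(search('-x -c'))                  # nothing positive to filter, so no hits
print(search('-x -c', match_all=True))  # every doc is a candidate, then filtered
print(search('a -c'))                   # positive term selects, negation filters
```

In this model `search('-x -c')` comes back empty because no document ever becomes a candidate, while the `match_all=True` variant returns `['y', 'z']`, the documents containing neither `x` nor `c`. Whether that is the behavior you want depends on your application, as noted above.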