marvin at rectangular
May 22, 2007, 2:03 PM
Post #2 of 3
On May 22, 2007, at 11:51 AM, Chris Nandor wrote:
> First, why does this return no results (from 601-queryparser.t)?
> my @docs = ( 'x', 'y', 'z', 'x a', 'x a b', 'x a b c',
>     'x foo a b c d', );
> '-x -c' => [ 0, 0, 0, 0, ],
QueryParser does not obey perfect boolean logic, preferring instead
to imitate the interfaces of popular web search engines.
The way that query gets parsed, it doesn't match anything: it
produces two negated boolean clauses, which effectively act as
filters. There's nothing positive to pass through those filters,
though, so you get no hits.
We don't want the query string '-a_term_that_isnt_in_any_docs_SDUEFAHF'
to return the entire collection by default. If you search for '-foo' at
Google, Yahoo, Wikipedia, etc., you get no matches. KinoSearch behaves
the same way.
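To make the "filters with nothing to filter" idea concrete, here is a toy
model in Python. It is not KinoSearch code: the search() function and its
OR-style handling of positive terms are assumptions made purely for
illustration, and only the document list comes from the quoted test.

```python
# Toy model only; none of these names exist in KinoSearch.
DOCS = ['x', 'y', 'z', 'x a', 'x a b', 'x a b c', 'x foo a b c d']


def search(query, docs=DOCS):
    """Return the indexes of docs matching a whitespace-separated query.

    Bare terms are positive clauses (assumed here to be OR'd together).
    Terms prefixed with '-' are negated clauses, which only filter.
    """
    positive = set()
    negated = set()
    for token in query.split():
        if token.startswith('-'):
            negated.add(token[1:])
        else:
            positive.add(token)

    hits = []
    for i, doc in enumerate(docs):
        words = set(doc.split())
        if not (positive & words):
            # No positive clause matched, so there is nothing to filter.
            continue
        if negated & words:
            # A negated term is present, so the doc is filtered out.
            continue
        hits.append(i)
    return hits
```

In this model, search('-x -c') has no positive clause, so no document ever
becomes a candidate and the result is empty, while search('a -c') has a
positive clause for the negation to filter.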
> And if not, then is there a
> decent workaround for this?
That depends on your application, and for the time being it requires
a non-public kludge which may or may not do what you want.
What you'd need to do is combine QueryParser's output with a query
that matches all documents. A MatchAllDocsQuery and associated
Weight/Scorer classes would fit the bill, but they don't exist yet.
In their absence, a MatchFieldQuery might serve as a
substitute, so long as the field is present in all docs.
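The shape of that workaround can be sketched with another toy model in
Python. Again, this is not KinoSearch code: match_field(),
apply_negations(), and negation_only_search() are hypothetical helpers, and
the documents are invented for the example. The point is that a
"field is present" query supplies the positive candidates which the negated
clauses then filter, and that a document lacking the field never becomes a
candidate.

```python
# Toy model only; these helpers are hypothetical, not KinoSearch APIs.
DOCS = [
    {'content': 'x'},
    {'content': 'y'},
    {'content': 'z'},
    {'content': 'x a b c'},
    {'title': 'untitled'},  # deliberately has no 'content' field
]


def match_field(docs, field):
    """Stand-in for a match-field query: every doc that has the field."""
    return [i for i, doc in enumerate(docs) if field in doc]


def apply_negations(candidates, docs, field, negated_terms):
    """Drop any candidate whose field text contains a negated term."""
    negated = set(negated_terms)
    return [
        i for i in candidates
        if not (negated & set(docs[i].get(field, '').split()))
    ]


def negation_only_search(docs, field, negated_terms):
    """Combine the match-field candidates with the negated filters."""
    candidates = match_field(docs, field)
    return apply_negations(candidates, docs, field, negated_terms)
```

Here negation_only_search(DOCS, 'content', ['x', 'c']) keeps the 'y' and 'z'
documents. The last document contains neither 'x' nor 'c' but is still
missing from the results, because it has no 'content' field. That is the
caveat about the field needing to be present in all docs.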
> Second question: what's left before beta/release of KS 0.20?
There are four mandatory items left.
* QueryParser should have the 'field:foo' syntax switched off by
default. That way query strings like 'http://www.foo.com' and
'PHP::Interpreter' will behave more sensibly.
* BooleanScorer should take position into account. Tally is the
vessel that's supposed to facilitate this, but it needs to be
augmented to deal with fields. This won't be a visible API change,
but it will affect performance (negatively) and relevance ranking
(positively), so it will affect how people tune high-performance
* KinoSearch::Simple::HTML needs to be finished. This involves
writing an Analyzer which integrates an HTML parser and uses
RichPosting's capabilities to store per-position boost based on
visual weight of text.
* Windows compatibility needs to be restored. For this, I'll have to
scare up access to a Windows box I can install MSVC on.
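As an illustration of the first item, here is a toy clause parser in Python
showing why 'field:foo' syntax interferes with query strings that happen to
contain a colon. It is a sketch under assumptions, not QueryParser's actual
grammar: parse_clause() and its field_syntax flag are hypothetical.

```python
import re

# Hypothetical pattern: a word, a colon, then the rest of the token.
FIELD_CLAUSE = re.compile(r'^(\w+):(.+)$')


def parse_clause(token, field_syntax=False):
    """Return a (field, text) pair for one query token.

    With field_syntax on, 'name:rest' is split at the first colon.
    With it off, the whole token is treated as plain text.
    """
    if field_syntax:
        match = FIELD_CLAUSE.match(token)
        if match:
            return (match.group(1), match.group(2))
    return (None, token)
```

With the flag on, 'http://www.foo.com' is read as a search of a field named
'http', and 'PHP::Interpreter' as a search of a field named 'PHP'. With the
flag off, both tokens pass through intact.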
There are two other items that I would have liked to have gotten in.
* Finish MatchPosting.
* Add some non-text field types: integers, floats, epoch, etc.
Swish3 will need these, and I have a plan for how I think they will
work. This is a complicated task, though, and we might discover that
an optimum implementation involves breaking compat. Now would be the
best time to do that.
However, I'm not going to hold up 0.20 for those. MatchPosting will
be easy and can be added later. Adding non-text field types is
ambitious, and waiting for that feature might be making the perfect
the enemy of the good.
FWIW, there have been many other forces tugging at me over the last
couple weeks -- a cousin's wedding, ramp-up of a significant new
contract, plus a bunch of random stuff. However, I'm contractually
obligated to finish KinoSearch::Simple::HTML, and terribly excited
about all the other work that's gone in and eager to release.
After completing the checklist above, we'll be looking at release
candidates for 0.20. And once 0.20 proves itself stable, I
anticipate releasing 1.0. As far as my plans for 1.0 are concerned,
we're perilously close to feature-complete.