marvin at rectangular
Jul 2, 2007, 8:18 AM
On Jun 29, 2007, at 5:45 AM, Hans Dieter Pearcey wrote:
> I don't see how I'd do this just in terms of matching. Maybe I don't
> understand SHOULD?
If you add two clauses to a BooleanQuery with SHOULD, then their
result sets get OR'd together.
$bool_query->add_clause( query => $term_query_a, occur => 'SHOULD' );
$bool_query->add_clause( query => $term_query_b, occur => 'SHOULD' );
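For matching purposes, the effect can be pictured as a set union over document numbers. What follows is a rough pure-Perl sketch of that idea only, not KinoSearch's internals; the sub name or_doc_sets and the sample doc numbers are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: union of two lists of matching doc numbers,
# mimicking what two SHOULD clauses do to the set of matches.
sub or_doc_sets {
    my ( $docs_a, $docs_b ) = @_;
    my %seen;
    $seen{$_} = 1 for @$docs_a, @$docs_b;
    return [ sort { $a <=> $b } keys %seen ];
}

# Doc 4 matches both clauses but appears only once in the result.
my $matches = or_doc_sets( [ 1, 4, 7 ], [ 4, 9 ] );
print join( ', ', @$matches ), "\n";    # prints "1, 4, 7, 9"
```

A document matching either clause is a hit, and one matching both is still a single hit.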
> If some particular selection mechanism is available both as a Query
> and as a
> Filter -- e.g. BooleanQuery, which you can also use as part of a
> Queryfilter --
> is there any reason to prefer one over the other, assuming that you
> are (as I
> am) only interested in matching, not scoring? Do Filters have any
> kind of
> startup overhead compared to Queries, etc.?
If you don't care about scoring and you can reuse Filters, you should
use as many as practical.
Scorers require hitting the disk.
QueryFilters and PolyFilters, once their internal caches are warmed, do not.
The startup cost for a RangeFilter only happens once per field per
IndexReader, when a portion of that field's lexicon is read into
memory. The main per-query cost is a single burst of disk activity
to look up the search term and assign it a "term number" based on
where it falls in the lexicon, after which everything else is CPU
crunching and memory access.
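The "term number" idea can be sketched in a few lines of pure Perl. This is an illustration under stated assumptions, not the actual RangeFilter code: the lexicon here is a small in-memory array (so no disk lookup is modeled), and the names term_number, docs_in_range and @doc_term_nums are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sorted lexicon for one field, already read into memory.
my @lexicon = qw( apple banana cherry grape melon );

# Binary search: number of lexicon entries sorting before $term,
# i.e. the slot where $term falls in the lexicon.
sub term_number {
    my ( $lex, $term ) = @_;
    my ( $lo, $hi ) = ( 0, scalar @$lex );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $lex->[$mid] lt $term ) { $lo = $mid + 1 }
        else                         { $hi = $mid }
    }
    return $lo;
}

# Hypothetical per-document cache: the term number of each doc's value.
my @doc_term_nums
    = map { term_number( \@lexicon, $_ ) } qw( melon apple grape banana );

# Inclusive range check done with integer comparisons only.
sub docs_in_range {
    my ( $lex, $doc_nums, $lower, $upper ) = @_;
    my $min = term_number( $lex, $lower );
    my $max = term_number( $lex, $upper );
    # Include the upper term itself when it is present in the lexicon.
    $max++ if $max < @$lex && $lex->[$max] eq $upper;
    return [ grep { $doc_nums->[$_] >= $min && $doc_nums->[$_] < $max }
            0 .. $#$doc_nums ];
}

my $hits = docs_in_range( \@lexicon, \@doc_term_nums, 'b', 'grape' );
print join( ', ', @$hits ), "\n";    # prints "2, 3"
```

Once the two endpoints have been turned into numbers, each document costs only two integer comparisons, which is the "CPU crunching and memory access" part.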
>> I think the ultimate solution will be to make MatchFieldQuery public
>> and give it a constant score which defaults to zero. Then it could
>> be combined with a RangeFilter to produce the same effect as a
>> ConstantScoreRangeQuery. MatchFieldQuery is relatively simple, and
>> lets you do things that require kludges otherwise.
> I had found MatchFieldQuery, and thought that that might work, but
> didn't know
> enough internals to be sure. I like this idea. What can I do to
> make it work?
Sorry for the delayed response -- I had to think this over.
I've resisted making MatchFieldQuery public because I didn't feel
like its API was mature enough. I'm still not sure about it, and I
don't want to add it to the list of things that have to get done
prior to the release of 0.20. For the time being, I suggest you go
ahead and use MatchFieldQuery as is, but mark that aspect of your
module experimental. Looking forward, you can help move things along
by participating in design discussions about subclassing strategies.
A lot of the KS public API and class design is pretty solid. To
touch on one aspect, I'm pleased that the Query components allow you
to create your own query building mechanism as an alternative to
QueryParser. I'm also more certain than ever that the decision to
limit QueryParser to a much simpler syntax than its Lucene
counterpart was the right one. What you are doing demonstrates that
it is possible to write custom KSx extensions to play the Query-
building role, and if someone wants to write a Lucene-ish query
parser that supports syntax like 'boost^3', they can. Core
KinoSearch, by opting out of the more complex high-level task, lowers
its support costs and maintains greater flexibility.
This is successful modularization, "divide and conquer", "loose
coupling", etc, in action. Every class has its own reasonably
contained problem domain. There are no "God Objects" that know too
much or do too much. The components tolerate being assembled into
many different configurations.
The main goal of KinoSearch 0.30 will be to reproduce this
flexibility across more phases of search and indexing. Scorer should
be public and it should not be so challenging to subclass. If that
were already the case, somebody could whip up KSx::Search::RangeQuery
and you could use it without waiting for me to act.
For 0.20, though, it's time to think reductively (to echo a sentiment
expressed by Nathan Kurz). Rather than add new public APIs, it's
time to yank Hits->seek (and simplify Searcher), migrate some
documentation out of POD and onto the new wiki, and possibly
redact the public APIs for Analyzer, Token, and TokenBatch, marking
them as experimental once again so that we have the option to modify