marvin at rectangular
Jul 2, 2007, 1:36 PM
On Jul 2, 2007, at 8:38 AM, Hans Dieter Pearcey wrote:
>> If you add two clauses to a BooleanQuery with SHOULD, then their
>> result sets get OR'd together.
>> $bool_query->add_clause( query => $term_query_a, occur => 'SHOULD' );
>> $bool_query->add_clause( query => $term_query_b, occur => 'SHOULD' );
> Is this true even when (like me) you are only interested in matching?
In theory the (unfinished) MatchPosting class is supposed to help out
with situations like yours. However, because it doesn't store token
position, it doesn't support phrase matching, and maybe it needs to
> Also, is there some reason that this isn't documented?
No, there's no reason. I guess I thought the capabilities were
implied by the class name. Looks like usability testing has revealed
a flaw! ;)
> a Query that did something like "match all documents"
That would be the as-yet-non-existent MatchAllDocsQuery, which would
have an interface similar to MatchFieldQuery. The difference between
the two would be analogous to the difference between 'SELECT doc_num'
and 'SELECT doc_num WHERE foo IS NOT NULL'.
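
The SQL analogy can be illustrated with a small sketch in plain Perl. Nothing here is KinoSearch API: the `%docs` hash, `match_all_docs`, and `match_field` are hypothetical stand-ins, assuming only that a "match all" query returns every document number and a "match field" query returns those documents that have a value for the named field.

```perl
use strict;
use warnings;

# Hypothetical toy "index": doc_num => hash of stored fields.
# This is an illustration only, not KinoSearch code.
my %docs = (
    0 => { title => 'alpha', foo => 'x' },
    1 => { title => 'beta' },
    2 => { title => 'gamma', foo => 'y' },
);

# Like 'SELECT doc_num': every document matches.
sub match_all_docs {
    my ($docs) = @_;
    return [ sort { $a <=> $b } keys %$docs ];
}

# Like 'SELECT doc_num WHERE foo IS NOT NULL':
# only documents that have a value for the given field match.
sub match_field {
    my ( $docs, $field ) = @_;
    my @hits = grep { defined $docs->{$_}{$field} } keys %$docs;
    return [ sort { $a <=> $b } @hits ];
}

my $all      = match_all_docs( \%docs );
my $with_foo = match_field( \%docs, 'foo' );

print "all docs:      @$all\n";
print "docs with foo: @$with_foo\n";
```

In this toy data, document 1 has no `foo` field, so it appears in the first result but not in the second.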
>> If you don't care about scoring and you can reuse Filters, you should
>> use as many as practical.
> What if I can't reuse Filters, but I don't care about scoring?
If you can't reuse QueryFilters or PolyFilters, they offer no
advantage. They're probably mildly less efficient than just adding a
clause to the query.
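
A rough way to see why reuse matters, assuming (and this is an assumption, not a description of the QueryFilter or PolyFilter internals) that a filter earns its keep by computing its set of matching documents once and caching it. The `make_cached_filter` helper and the `$scans` counter below are hypothetical names used only for illustration.

```perl
use strict;
use warnings;

# Counts how many times the expensive "scan the index" step runs.
my $scans = 0;

# Hypothetical stand-in for a filter: the first call scans all docs and
# caches the matching doc numbers; later calls return the cached set.
sub make_cached_filter {
    my ($matcher) = @_;
    my $cache;
    return sub {
        my ($docs) = @_;
        if ( !$cache ) {
            $scans++;
            my @hits = grep { $matcher->( $docs->{$_} ) } keys %$docs;
            $cache = { map { $_ => 1 } @hits };
        }
        return $cache;
    };
}

my %docs = (
    0 => { lang => 'en' },
    1 => { lang => 'de' },
    2 => { lang => 'en' },
);

# One filter reused across three searches: the scan runs once.
my $english_only = make_cached_filter( sub { $_[0]{lang} eq 'en' } );
$english_only->( \%docs ) for 1 .. 3;
my $scans_with_reuse = $scans;

# A fresh filter per search: every search pays for its own scan.
$scans = 0;
make_cached_filter( sub { $_[0]{lang} eq 'en' } )->( \%docs ) for 1 .. 3;
my $scans_without_reuse = $scans;

print "scans with reuse:    $scans_with_reuse\n";
print "scans without reuse: $scans_without_reuse\n";
```

Under that assumption, a filter that is built, used once, and thrown away does the same work as an ordinary query clause plus the overhead of building the cache.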
>> This is successful modularization, "divide and conquer", "loose
>> coupling", etc, in action. Every class has its own reasonably
>> contained problem domain. There are no "God Objects" that know too
>> much or do too much. The components tolerate being assembled into
>> many different configurations.
> I agree that core KS has taken the right direction here.
> The one place where this seems less true is the distinction between
> scoring and matching, as I noted previously.
Lucene has a family of Query subclasses called SpanQueries, which I
have not ported and don't intend to ever put in KinoSearch's core.
What I'd like to do is make it possible for someone to write a
KSx::Spans distro. It might even include a KSx::Spans::QueryParser
subclass that uses SpanTermQuery in place of TermQuery and so on.
Analogously, it should be possible to create a suite of
Queries/Scorers which are optimized for matching alone. I believe that the
changes to KinoSearch's file format in 0.20 and the introduction of
Posting should facilitate this, but the OO infrastructure needs more
In the meantime, the current KS query classes don't exactly suck if
you just need matching. :)
> My guess (because I don't know anything about IR theory, or whatever)
> is that you assumed that of COURSE people wouldn't want just matching
> and not scoring,
It's true that returning results ranked by relevancy is something I
put a high priority on, but I've definitely thought about other
cases. It's just that unstructured search is a more pressing
problem. There are a lot of good databases out there. KS shouldn't
aspire to compete with PostgreSQL.
> Of course, I'm approaching it from a different direction, so I have
> assumptions; I want to treat KS more like a traditional database,
> which means I have different expectations, 'unique' constraints,
> stuff like that.
KinoSearch is always going to be optimized for the use case of a
large number of queries against a single view of an index.
I don't think we'll have to make a choice between matching alone and
matching with scoring, though. It should be possible to support both