marvin at rectangular
Jan 27, 2008, 8:10 PM
Post #6 of 21
On Jan 25, 2008, at 12:11 PM, Nathan Kurz wrote:
> Does one really want a search for "speling OR spelling" to prefer
> the mis-speling?
Heh. Perhaps one way to solve this would be via a special TermQuery
subclass that overrides the IDF calculation.
Or maybe the default TermQuery class can do flat scoring and
TFIDFTermQuery would override? I imagine that would make you happy. ;)
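To make the idea concrete, here's a toy sketch (illustrative Python, not actual KS API — the class names and the exact IDF formula are just placeholders): a flat-scoring term scorer as the base class, with a TF/IDF subclass that reinstates the weighting.

```python
import math

class TermScorer:
    """Flat scoring: every matching doc gets the same score, no matter
    how rare the term is or how often it occurs in the doc."""
    def __init__(self, df, num_docs):
        self.df = df              # number of docs containing the term
        self.num_docs = num_docs  # total docs in the index

    def score(self, tf):
        return 1.0

class TFIDFTermScorer(TermScorer):
    """Subclass that overrides score() to reinstate TF/IDF weighting."""
    def score(self, tf):
        idf = 1.0 + math.log(self.num_docs / (1.0 + self.df))
        return tf * idf
```

With 1000 docs, "speling" (df=1) out-scores "spelling" (df=500) under the TF/IDF subclass, while the flat base class treats them identically — which is exactly the behavior Nathan was asking for.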
> In my opinion, one wants the parser to have access to the TF
> information and to (optionally) use it when creating the query.
IDF is known when compiling the Query to a Weight to a Scorer, but TF
is per-document. You aren't going to have access to TF at
Scorer-compile time.
> one wants the IDF information to be available to the scorer for
> its optional use. But the scorer should not care directly about TF,
> only about the weight that has been input for each query term.
Well, this is certainly do-able in theory.
> The core should provide rock solid Boolean components, and a
> means of plugging in alternate parsers and scorers.
Nicely put. I concur.
>> TF/IDF is hard to beat as a default system.
> TF/IDF is an excellent means for sorting a large database of full text
> news articles by relevance based on naively entered keywords. To a
> reasonable approximation, web search can be viewed in this light. But
> its utility in other situations varies :).
TF/IDF needs to continue to be the IR model you get when you fire up
standard KS. But the idea of focusing on pure boolean components is
attractive. It would be killer if we could abstract TF/IDF to a
plug-in layer.

KinoSearch's Query/Weight/Scorer compile phase, though less
convoluted than the Lucene model, is still complex. Perhaps we can
divide things up into layers and make the bottom layer boolean-only
and simpler. To get the good TF/IDF scoring we still have to do some
complex weighting, but maybe we can move that up into
KinoSearch::Search::TFIDF::TFIDFTermQuery and such.
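The layering I have in mind looks roughly like this toy sketch (illustrative Python with invented data, not KS internals): the bottom layer answers only the boolean question "which docs match?", and a separate top layer applies externally supplied per-term weights to whatever the boolean layer produces.

```python
# Toy postings: term -> {doc_id: term frequency}.  Invented data.
postings = {
    "spelling": {1: 2, 2: 1, 3: 4},
    "bee":      {2: 1, 3: 1, 5: 2},
}

def and_matcher(terms):
    """Boolean bottom layer: which docs match ALL terms.
    Knows nothing about scoring."""
    doc_sets = [set(postings[t]) for t in terms]
    return sorted(set.intersection(*doc_sets))

def weighted_scores(terms, weights):
    """Scoring top layer: stacked on the boolean layer's output,
    applying per-term weights handed in from outside."""
    scores = {}
    for doc_id in and_matcher(terms):
        scores[doc_id] = sum(weights[t] * postings[t][doc_id]
                             for t in terms)
    return scores
```

The point is that TF/IDF weighting (or any other model) lives entirely in the choice of `weights` and the top layer — the boolean layer stays rock solid and simple.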
>> However, I'd like to make it possible to override, not just at
>> search time,
>> but at index time.
> I'm not sure I understand this. Is this in the sense of making
> certain parts of the index optional, or does it go deeper than this?
I don't think I'd stated it that well. ;)
KinoSearch is, at its heart, a segmented inverted indexer. Right
now, each segment has four main components:
* Doc Storage
* Term Vectors
* Lexicon (term dictionary)
* Postings (the inverted index proper)
I'd like to make it possible to add other components.
Consider the problem of determining whether a particular lat/lon
pairing falls within a given square. You can use doubled-up
RangeFilters for that, but it's not efficient. You have to take the
intersection of two slices of the index: EVERYTHING with a $lat
between $lat_min and $lat_max, and EVERYTHING with a $lon between
$lon_min and $lon_max -- thus you end up churning through a lot of
docs that are *way* outside the box.
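You can see the churn with a quick back-of-the-envelope simulation (illustrative Python over random points, nothing KS-specific):

```python
import random
random.seed(42)

# 10,000 hypothetical docs, each with a random lat/lon pair.
docs = [(random.uniform(-90, 90), random.uniform(-180, 180))
        for _ in range(10_000)]

lat_min, lat_max = 30.0, 45.0
lon_min, lon_max = -130.0, -100.0

# Two independent range filters, then their intersection:
lat_slice = {i for i, (lat, _) in enumerate(docs)
             if lat_min <= lat <= lat_max}
lon_slice = {i for i, (_, lon) in enumerate(docs)
             if lon_min <= lon <= lon_max}
in_box = lat_slice & lon_slice

# Total docs the two slices sweep through vs. docs actually in the box:
churned = len(lat_slice) + len(lon_slice)
```

Each slice covers roughly a twelfth of the globe here, so the two filters together churn through on the order of 1600 docs to find the few dozen that are actually inside the box.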
R-trees are a more efficient data structure for geospatial
searching. However, there's no RTreeWriter writing R-tree data to
each segment in KS by default. I'd like to write one and make it
easy to integrate via InvIndexer/SegWriter.
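The win from a spatial structure comes from pruning whole regions at once. Here's a deliberately crude sketch of the idea (illustrative Python — a one-level, lat-sorted approximation, not a real R-tree and not what an RTreeWriter would actually emit): each leaf records its bounding box, and a query skips any leaf whose box misses the query box without touching its points.

```python
import random
random.seed(7)

points = [(random.uniform(-90, 90), random.uniform(-180, 180))
          for _ in range(2_000)]

def build_leaves(pts, leaf_size=64):
    """Toy one-level 'R-tree': sort points by latitude, chunk them
    into leaves, and record each leaf's bounding box."""
    pts = sorted(pts)
    leaves = []
    for i in range(0, len(pts), leaf_size):
        chunk = pts[i:i + leaf_size]
        lats = [p[0] for p in chunk]
        lons = [p[1] for p in chunk]
        leaves.append(((min(lats), max(lats),
                        min(lons), max(lons)), chunk))
    return leaves

def box_query(leaves, lat_min, lat_max, lon_min, lon_max):
    """Skip whole leaves whose bounding box misses the query box."""
    hits, examined = [], 0
    for (lo_lat, hi_lat, lo_lon, hi_lon), chunk in leaves:
        if (hi_lat < lat_min or lo_lat > lat_max or
                hi_lon < lon_min or lo_lon > lon_max):
            continue  # entire leaf pruned without touching its points
        examined += len(chunk)
        hits += [(lat, lon) for lat, lon in chunk
                 if lat_min <= lat <= lat_max
                 and lon_min <= lon <= lon_max]
    return hits, examined

leaves = build_leaves(points)
hits, examined = box_query(leaves, 30.0, 45.0, -130.0, -100.0)
```

Because the leaves are lat-sorted, most of them fall entirely outside the query's latitude range and get pruned, so `examined` stays a small fraction of the 2000 points — a real R-tree partitions in both dimensions and prunes far more aggressively.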
A fellow actually hacked R-trees into Lucene -- the project is called
"GeoLucene" -- but for a variety of reasons, it wasn't very easy or
clean.
I think it's possible to make KS quite friendly to such extensions --
much more friendly than Lucene, with its crazy file format tightly
coupled to a gigantic core code base.
>> That's the rationale behind the introduction of the abstract
>> base classes KinoSearch::Index::Reader and
>> KinoSearch::Index::Writer. My hope is to write KSx::RTree as the
>> first distro to use these capabilities.
> I've been watching the commits, but haven't really had an idea of
> where you are going. Could you offer an overview when you have the
> time?
Well, a lot of what's gone in during the last month or so has been
straightforward porting of Perl code to C code. Figuring out an
elegant way for abstract methods to call back to the host language
from C, allowing them to be overridden via *either* Perl OR C, was a
real breakthrough. I'd like to finish the porting task and get some
experimental Java bindings up and running. One motivation is to make
it possible to benchmark KS using the Lucene benchmarking contrib code
directly -- eliminating the need to port it.
But beyond that, a central goal is to make KS as extensible as
possible, so that it is reasonably easy for motivated hackers -- like
you, like the GeoLucene guy -- to try out IR models that are suitable
for use within the context of a segmented inverted index.
Powerful search engines (Google being the archetype) don't rely on
one IR mechanism alone, but rather balance a slew of them. You can
do this in KS at a basic level by indexing both stemmed and unstemmed
versions of the same text, increasing the size of the index and your
search-time costs, but improving search accuracy: a search for
"horsing" still matches "horse", "horses", etc but *prefers* the
exact match of "horsing".
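In miniature, the stemmed-plus-unstemmed trick looks like this (illustrative Python; the stem table is a stand-in for a real stemmer, and the boost value is arbitrary):

```python
from collections import defaultdict

# Stand-in stem table; a real stemmer would go here.
STEMS = {"horsing": "horse", "horses": "horse", "horse": "horse"}

def stem(word):
    return STEMS.get(word, word)

docs = {1: "horsing", 2: "horses", 3: "horse"}

# Index every token twice: once exact, once stemmed.
index = defaultdict(set)
for doc_id, word in docs.items():
    index[("exact", word)].add(doc_id)
    index[("stem", stem(word))].add(doc_id)

def search(query, exact_boost=2.0):
    """The stemmed field recalls all the variants; the exact field's
    boost pushes the literal match to the top of the ranking."""
    scores = defaultdict(float)
    for doc_id in index[("stem", stem(query))]:
        scores[doc_id] += 1.0
    for doc_id in index[("exact", query)]:
        scores[doc_id] += exact_boost
    return sorted(scores, key=scores.get, reverse=True)
```

A search for "horsing" recalls all three docs via the stemmed field, but the exact-field boost ranks the "horsing" doc first — recall from one field, precision from the other.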
In order to improve search accuracy beyond the limits of TF/IDF,
especially when dealing with large collections, we need to be able to
scale up both by spreading to multiple machines AND by layering
different IR models on top of each other. That's where KS is headed,
and as things progress, I'm more and more confident that it's going
to work out well.
KinoSearch mailing list
KinoSearch [at] rectangular