
marvin at rectangular
Jan 27, 2008, 8:10 PM
Post #6 of 21
(2464 views)
Permalink
|
On Jan 25, 2008, at 12:11 PM, Nathan Kurz wrote: > Does one really want a search for "speling OR spelling" to prefer > the mis-speling? Heh. Perhaps one way to solve this would be via a special TermQuery subclass that overrides the IDF calculation. Or maybe the default TermQuery class can do flat scoring and TFIDFTermQuery would override? I imagine that would make you happy. ;) > In my opinion, one wants the parser to have access to the TF > information and to (optionally) use it when creating the query. IDF is known when compiling the Query to a Weight to a Scorer, but TF is per-document. You aren't going to have access to TF at the Scorer- compilation stage. > And > one wants the the IDF information to be available to the scorer for > it's optional use. But the scorer should not care directly about TF, > only about the weight that has been input for each query term. Well, this is certainly do-able in theory. > The core should provide rock solid Boolean components, and a > means of plugging in alternate parsers and scorers. Nicely put. I concur. >> TF/IDF is hard to beat as a default system. > > TF/IDF is an excellent means for sorting a large database of full text > news articles by relevance based on naively entered keywords. To a > reasonable approximation, web search can be viewed in this light. But > its utility in other situations varies :). Heh. :) TF/IDF needs to continue to be the IR model you get when you fire up standard KS. But the idea of focusing on pure boolean components is attractive. It would be killer if we could abstract TF/IDF to a higher level. KinoSearch's Query/Weight/Scorer compile phase, though less convoluted than the Lucene model, is still complex. Perhaps we can divide things up into layers and make the bottom layer boolean-only and simpler. To get the good TF/IDF scoring we still have to do some complex weighting, but maybe we can move that up into KinoSearch::Search::TFIDF::TFIDFTermQuery and such. >> However, I'd like to make it possible to override, not just at >> search time, >> but at index time. > > I'm not sure I understand this. Is this in the sense of making > certain parts of the index optional, or does it go deeper than this? I don't think I'd stated it that well. ;) KinoSearch is, at its heart, a segmented inverted indexer. Right now, each segment has four main components: * Doc Storage * Postings * Lexicons * Term Vectors I'd like to make it possible to add other components. Consider the problem of determining whether a particular lat/lon pairing falls within a given square. You can use doubled-up RangeFilters for that, but it's not efficient. You have to take the intersection of two slices of the index: EVERYTHING with a $lat between $lat_min and $lat_max, and EVERYTHING with a $lon between $lon_min and $lon_max -- thus you end up churning through a lot of docs that are *way* outside the box. R-trees are a more efficient data structure for geospatial searching. However, there's no RTreeWriter writing R-tree data to each segment in KS by default. I'd like to write one and make it easy to integrate via InvIndexer/SegWriter. A fellow actually hacked R-trees into Lucene -- the project is called "GeoLucene" -- but for a variety of reasons, it wasn't very easy or elegant. http://www.doc.ic.ac.uk/~es106/thesis/ MultidimensionalIndexingInLucene.pdf https://sourceforge.net/projects/geolucene/ http://www.gossamer-threads.com/lists/lucene/java-dev/53378 I think it's possible to make KS quite friendly to such extensions -- much more friendly than Lucene, with its crazy file format tightly coupled to a gigantic core code base. >> That's the rationale behind the introduction of the abstract >> base classes KinoSearch::Index::Reader and >> KinoSearch::Index::Writer. My hope is to write KSx::RTree as the >> first distro to use these capabilities. > > I've been watching the commits, but haven't really had an idea of > where you are going. Could you offer an overview when you have the > time? Well, a lot of what's gone in during the last month or so has been straightforward porting of Perl code to C code. Figuring out an elegant way for abstract methods to call back to the host language from C, allowing them to be overridden via *either* Perl OR C, was a real breakthrough. I'd like to finish the porting task and get some experimental Java bindings up and running. One motivation is to make it possible to benchmark KS use the Lucene benchmarking contrib code directly -- eliminating the need to port it. But beyond that, a central goal is to make KS as extensible as possible, so that it is reasonably easy for motivated hackers -- like you, like the GeoLucene guy -- to try out IR models that are suitable for use within the context of a segmented inverted index. Powerful search engines (Google being the archetype) don't rely on one IR mechanism alone, but rather balance a slew of them. You can do this in KS at a basic level by indexing both stemmed and unstemmed versions of the same text, increasing the size of the index and your search-time costs, but improving search accuracy: a search for "horsing" still matches "horse", "horses", etc but *prefers* the exact match of "horsing". In order to improve search accuracy beyond the limits of TF/IDF, especially when dealing with large collections, we need to be able to scale up both by spreading to multiple machines AND by layering different IR models on top of each other. That's where KS is headed, and as things progress, I'm more and more confident that it's going to work out well. Marvin Humphrey Rectangular Research http://www.rectangular.com/ _______________________________________________ KinoSearch mailing list KinoSearch [at] rectangular http://www.rectangular.com/mailman/listinfo/kinosearch
|