
marvin at rectangular
Jun 7, 2007, 9:43 PM
Post #2 of 2
On Jun 7, 2007, at 4:56 PM, Nathan Kurz wrote:

> Are there any examples yet of using customized
> rich position formats?

The RichPosting file format is done. BooleanScorer does not yet take position into account by default. That's near the top of my TODO list.

There are not yet any meaningful examples using RichPosition. KinoSearch::Simple::HTML will be the first. That's just below the adaptations to BooleanScorer on my TODO list.

> For my particular use case, I'd like to be able to provide a strong
> boost to docs in which the query terms occur within the same
> 'sentence', which would be delineated by the occurrence of some
> regexp. Rather than storing position as a single increasing int, I'd
> store it as a (sentence_num, word_num) pair, with word_num increasing
> monotonically until a new sentence is started, at which point
> sentence_num is incremented and word_num is restarted at zero. The
> scorer would then give a bonus if all the terms share a common
> sentence number.

The adaptations to BooleanScorer will make it behave in a way which is somewhat similar to this, especially if your custom Tokenizer separates sentences by injecting a large position increment.

> I started looking at the code, and it seems like this would be
> possible if I define a custom tokenizer, a custom posting, and a
> custom scorer (what else?),

A custom Query and Weight.

Exposing an API which allows people to create their own custom scoring algorithms relatively easily would be the ultimate ambition for KinoSearch. I REALLY, REALLY want to make that happen.

The New York Times recently published an article about the process by which Google tweaks its search engine.

http://www.nytimes.com/2007/06/03/business/yourmoney/03google.html

THAT is the direction I want to take.

There's pressure on indexers like KS and Lucene to behave more like relational databases -- people want transactions, instant updates, and so on. But the RDBMS path is well-traveled.
The unexplored, exciting territory is all in the search domain. I would love to sculpt KinoSearch into a modular open source framework where people could try out scoring ideas like yours easily.

Many of the puzzle pieces are already in place: Schema, Posting, and the recent Analyzer API changes Peter Karman and I hammered out all help. Crucially, I think the file format design is done and will support future expansion without problems. I believe the tight coupling between file format and code base that KinoSearch inherited from Lucene has been released.

At the same time... it's important to just get 0.20 out, and make the API stable enough that people can build apps with it safely. Plus, for the next couple months I have a full-time contract job unrelated to KS. So I just need to fix bugs, hack the last few items of the TODO list and release an official version, and try to avoid being seduced by this really, really interesting but difficult problem. :)

> but I can't figure out how to do this
> easily without just editing some of the existing classes in place. It
> seems like it should be possible to do this with some artful
> subclassing, I can't figure out how to do it.

Absolutely, it should be possible. But there are still a lot of rough private APIs that would need to be refined and made public.

> For example,
> Index::SegWriter seems hard-coded to use the built-in
> Analysis::TokenBatch in a way that I'm not sure how to override
> gracefully. Subclass SegWriter and then redefine add_doc? Then call
> a subclassed InvIndex which call my subclass?

I like your basic idea. How would you like your Analyzer subclass to look? And how would you like to change SegWriter->add_doc?
I think the high level code would look like this:

    my $custom_seg_writer = CustomSegWriter->new;
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex   => $invindex,
        seg_writer => $custom_seg_writer,
    );

We'd keep InvIndexer simple and make subclassing SegWriter the point of departure for advanced users.

SegWriter->add_doc would need to be broken up somehow to make it more customizable. That would either happen by adding methods to SegWriter itself, or more likely, by pushing bigger chunks of responsibility down to components like DocWriter and Analyzer.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
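
The idea discussed in this thread, separating sentences by injecting a large position increment so that a position-aware scorer can tell whether terms share a sentence, can be illustrated with a small standalone sketch. This is plain Perl and does not use any KinoSearch API; the names positions_for, share_sentence and SENTENCE_GAP are hypothetical, the sentence-boundary regexp is only an example, and the sketch assumes no sentence is longer than SENTENCE_GAP tokens.

```perl
use strict;
use warnings;

# Hypothetical constant: the gap added to the position counter at each
# sentence boundary. Assumed to exceed the longest sentence, in tokens.
use constant SENTENCE_GAP => 1000;

# Tokenize text and record positions, jumping ahead by SENTENCE_GAP at
# every sentence boundary. Returns a hash ref mapping each lowercased
# term to an array ref of its positions.
sub positions_for {
    my ($text) = @_;
    my %positions;
    my $pos = 0;
    for my $sentence ( split /(?<=[.!?])\s+/, $text ) {
        for my $word ( $sentence =~ /(\w+)/g ) {
            push @{ $positions{ lc $word } }, $pos++;
        }
        $pos += SENTENCE_GAP;    # the injected position increment
    }
    return \%positions;
}

# Return 1 if some occurrence of the first term has every other term
# within SENTENCE_GAP positions of it, which under the assumption above
# means all terms occur in one sentence. Return 0 otherwise.
sub share_sentence {
    my ( $positions, @terms ) = @_;
    my ( $first, @rest ) = map { $positions->{ lc $_ } || [] } @terms;
    return 0 unless $first;      # no terms given
    ANCHOR: for my $p (@$first) {
        for my $list (@rest) {
            next ANCHOR unless grep { abs( $_ - $p ) < SENTENCE_GAP } @$list;
        }
        return 1;
    }
    return 0;
}

my $positions = positions_for("The quick fox jumps. The lazy dog sleeps.");
for my $pair ( [qw(quick fox)], [qw(fox dog)] ) {
    my $verdict = share_sentence( $positions, @$pair ) ? "same" : "different";
    print "@$pair: $verdict sentence\n";
}
```

A real scorer would turn the yes/no answer into a score bonus; the sketch only shows that a single integer position with large gaps carries enough information to recover sentence membership, without storing an explicit (sentence_num, word_num) pair.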