
sprout at cpan
Feb 29, 2008, 10:11 AM
Post #1 of 3
On Feb 26, 2008, at 8:30 PM, Marvin Humphrey wrote:

> On Feb 15, 2008, at 10:25 PM, Father Chrysostomos wrote:
>
>> I’ve written some code that follows approach #1 above, namely, it
>> iterates through the posting lists one after the other, keeping a
>> list of doc nums that have been seen. It counts them afterwards, to
>> get an accurate ‘doc_freq’.
>
> PostingList objects have a get_doc_freq() method, so you can just do
> this:
>
>   my $doc_freq = 0;
>   $doc_freq += $_->get_doc_freq for @posting_lists;

There is a problem with this approach that is best demonstrated with an
example: If there are two documents, one containing ‘dog’ and ‘dot,’ and
the other containing just ‘dog’, and the search term is ‘do*’, then the
doc freq should be 2, since the term matches two docs. The doc freqs of
the individual terms, ‘dog’ and ‘dot’, are 2 and 1, respectively, so if
we add them together we get 3, and if we average them out, we get 1.5,
neither of which is the right answer.

>> Is this something you would be willing to include in core, so I
>> don’t have to repeat it in multiple subclasses?
>
> That approach, which is the one used in
> KinoSearch::Docs::Cookbook::WildCardQuery, really isn't a very good
> option -- it's just makes for the simplest and shortest code sample.
>
> First, you lose all the information other than document numbers.
> When iterating over a PostingList, you'd typically want to access
> info like the number of times the term appears in the document.

Thank you for pointing this out. I’ve just realised that the
WildCardQuery implementation I’m working on iterates through them twice.
I’ll optimise it later.

> It would be overkill to add a large, complex CompositePostingList
> class to KS right now, in order to avoid short-term code duplication.

Fair enough.

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch
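The ‘dog’/‘dot’ example above can be reproduced in plain Perl. This is only a rough sketch: the %postings hash is a hypothetical stand-in for real posting lists (it is not part of the KinoSearch API), and the doc nums 1 and 2 are made up for illustration. It compares the summed per-term doc freqs with the number of distinct documents matched.

```perl
use strict;
use warnings;

# Hypothetical stand-in for posting lists: each term maps to the
# doc nums of the documents that contain it.
my %postings = (
    dog => [ 1, 2 ],
    dot => [ 1 ],
);

# Terms matched by the wildcard 'do*'.
my @matching_terms = grep { /^do/ } sort keys %postings;

# Summing the per-term doc freqs counts doc 1 twice.
my $summed = 0;
$summed += scalar @{ $postings{$_} } for @matching_terms;

# Recording each doc num in a hash counts every document once.
my %seen;
$seen{$_} = 1 for map { @{ $postings{$_} } } @matching_terms;
my $doc_freq = scalar keys %seen;

print "summed: $summed, distinct: $doc_freq\n";
```

With this data the summed figure is 3 and the distinct count is 2, which matches the numbers in the example.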