
Mailing List Archive: kinosearch: discuss

Re: Wildcards

 

 



sprout at cpan

Feb 29, 2008, 10:11 AM

Post #1 of 3

On Feb 26, 2008, at 8:30 PM, Marvin Humphrey wrote:

>
> On Feb 15, 2008, at 10:25 PM, Father Chrysostomos wrote:
>
>> I’ve written some code that follows approach #1 above, namely, it
>> iterates through the posting lists one after the other, keeping a
>> list of doc nums that have been seen. It counts them afterwards, to
>> get an accurate ‘doc_freq’.
>
> PostingList objects have a get_doc_freq() method, so you can just do
> this:
>
> my $doc_freq = 0;
> $doc_freq += $_->get_doc_freq for @posting_lists;

There is a problem with this approach that is best demonstrated with
an example: If there are two documents, one containing ‘dog’ and
‘dot’, and the other containing just ‘dog’, and the search term is
‘do*’, then the doc freq should be 2, since the term matches two docs.
The doc freqs of the individual terms are 2 (‘dog’) and 1 (‘dot’), so if
we add them together we get 3, and if we average them out, we get 1.5,
neither of which is the right answer.
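
The ‘dog’/‘dot’ example can be reproduced with plain Perl data. This is only a sketch: the %postings hash is a hypothetical stand-in for real posting lists (term => doc numbers) and does not use the KinoSearch API. It contrasts summing the per-term doc freqs with counting distinct doc numbers.

```perl
use strict;
use warnings;

# Hypothetical stand-in for posting lists: term => list of doc numbers.
my %postings = (
    dog => [ 1, 2 ],
    dot => [ 1 ],
);

# Terms matching the wildcard 'do*'.
my @matching = grep { /^do/ } sort keys %postings;

# Naive approach: add up the per-term doc freqs (counts doc 1 twice).
my $summed = 0;
$summed += scalar @{ $postings{$_} } for @matching;

# Union approach: record each doc number once, then count the keys.
my %seen;
for my $term (@matching) {
    $seen{$_} = 1 for @{ $postings{$term} };
}
my $doc_freq = scalar keys %seen;

print "summed: $summed, distinct: $doc_freq\n";    # summed: 3, distinct: 2
```

The hash of seen doc numbers is what makes the count accurate; it is also why the posting lists have to be read in full.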

>
>
>> Is this something you would be willing to include in core, so I
>> don’t have to repeat it in multiple subclasses?
>
>
> That approach, which is the one used in
> KinoSearch::Docs::Cookbook::WildCardQuery, really isn't a very good
> option -- it just makes for the simplest and shortest code sample.
>
> First, you lose all the information other than document numbers.
> When iterating over a PostingList, you'd typically want to access
> info like the number of times the term appears in the document.

Thank you for pointing this out. I’ve just realised that the
WildCardQuery implementation I’m working on iterates through them
twice. I’ll optimise it later.
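
One way to avoid the second pass might look like the sketch below. It assumes each posting can be reduced to a [ doc_num, freq ] pair; that layout is hypothetical and is not the KinoSearch PostingList interface. A single loop accumulates the within-document frequency per doc number, and the doc freq falls out as the number of keys.

```perl
use strict;
use warnings;

# Hypothetical stand-in for posting lists: each posting is [ doc_num, freq ].
my %postings = (
    dog => [ [ 1, 3 ], [ 2, 1 ] ],
    dot => [ [ 1, 2 ] ],
);

# One pass over every matching list: sum the within-document frequency
# for each doc number, so nothing but the doc number is thrown away.
my %freq_for_doc;
for my $term ( grep { /^do/ } keys %postings ) {
    for my $posting ( @{ $postings{$term} } ) {
        my ( $doc_num, $freq ) = @$posting;
        $freq_for_doc{$doc_num} += $freq;
    }
}

# The number of distinct doc numbers is the wildcard's doc freq.
my $doc_freq = scalar keys %freq_for_doc;

print "doc_freq: $doc_freq\n";
for my $doc_num ( sort { $a <=> $b } keys %freq_for_doc ) {
    print "doc $doc_num matched $freq_for_doc{$doc_num} time(s)\n";
}
```

Holding every doc number in a hash costs memory proportional to the number of matching documents, so this only illustrates the idea on small data.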


> It would be overkill to add a large, complex CompositePostingList
> class to KS right now, in order to avoid short-term code duplication.

Fair enough.


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


marvin at rectangular

Feb 29, 2008, 2:16 PM

Post #2 of 3

On Feb 29, 2008, at 10:11 AM, Father Chrysostomos wrote:

> If there are two documents, one containing ‘dog’ and ‘dot’, and the
> other containing just ‘dog’, and the search term is ‘do*’, then the
> doc freq should be 2, since the term matches two docs. The doc freqs
> of the individual terms are 2 (‘dog’) and 1 (‘dot’), so if we add them
> together we get 3, and if we average them out, we get 1.5, neither
> of which is the right answer.

Hrmm, yes, you're right about that. There doesn't seem to be a good
way to get the doc_freq for a wildcard at search-time without
iterating through the posting lists. If we really want to know that
number, I think the only way is to generate it at index-time. We're
back to the "index all substrings" approach -- which you weren't all
that enthused about.
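
As a rough illustration of generating the number at index-time, the sketch below records, for every prefix of every term, which docs contain it. It is an assumption-laden toy: the %docs hash and %docs_for_prefix structure are hypothetical, nothing here is KinoSearch code, and only prefixes are indexed (enough for a trailing wildcard such as ‘do*’) rather than all substrings.

```perl
use strict;
use warnings;

# Hypothetical documents: doc number => the terms it contains.
my %docs = (
    1 => [ 'dog', 'dot' ],
    2 => [ 'dog' ],
);

# "Index time": for each prefix of each term, remember the docs holding it.
my %docs_for_prefix;
for my $doc_num ( keys %docs ) {
    for my $term ( @{ $docs{$doc_num} } ) {
        for my $len ( 1 .. length($term) ) {
            $docs_for_prefix{ substr( $term, 0, $len ) }{$doc_num} = 1;
        }
    }
}

# "Search time": the doc freq for 'do*' is a lookup, with no posting
# lists to iterate.
my $doc_freq = scalar keys %{ $docs_for_prefix{'do'} || {} };

print "doc_freq for do*: $doc_freq\n";    # doc_freq for do*: 2
```

The trade-off is index size: every term contributes one entry per prefix, and covering all substrings would multiply that further.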

Are you going to be able to do what you need to do using the public
API for PostingList now exposed?

> Thank you for pointing this out. I’ve just realised that the
> WildCardQuery implementation I’m working on iterates through them
> twice. I’ll optimise it later.


OK. Please keep in mind that iterating through posting lists for
common terms is where the bulk of the cost lies when searching large
indexes.

>> It would be overkill to add a large, complex CompositePostingList
>> class to KS right now, in order to avoid short-term code duplication.
>
> Fair enough.


I should add a couple comments... To scale up well, a
CompositePostingList class would need to be implemented in C. Same
thing for a WildCardAnalyzer. Neither of them really needs to be in
core -- they'll use C APIs that are planned to be exposed. But we
have to get to the point where they actually are exposed, so I'm going
to continue focusing on developing the C API.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




sprout at cpan

Feb 29, 2008, 2:34 PM

Post #3 of 3

On Feb 29, 2008, at 2:16 PM, Marvin Humphrey wrote:

> Are you going to be able to do what you need to do using the public
> API for PostingList now exposed?

Yes, I think so. I’m just about to write tests for my wild-card
module, so I’ll find out soon.
