Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: kinosearch: discuss

get doc/query similarity

 

 

kinosearch discuss RSS feed   Index | Next | Previous | View Threaded


jack_tanner at yahoo

Apr 10, 2008, 2:33 PM

Post #1 of 17 (2385 views)
Permalink
get doc/query similarity

Hi there,

Thanks for making KinoSearch. I've a two questions.

1) I'd like to compute TF and IDF between a query and one specific
indexed document. What's the best way to do that?

2) Same as (1), but the query is also an indexed document. Does anything
change?

Thanks in advance.

P.S. FYI, I could not subscribe to this list, post messages, or apparently even e-mail marvin at rectangular directly from my hotmail account.



__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


marvin at rectangular

Apr 10, 2008, 3:30 PM

Post #2 of 17 (2287 views)
Permalink
Re: get doc/query similarity [In reply to]

On Apr 10, 2008, at 2:33 PM, jack_tanner [at] yahoo wrote:

> 1) I'd like to compute TF and IDF between a query and one specific
> indexed document. What's the best way to do that?

Hmm, IDF for a *query*, not just a term? A query could be a lot of
different things. To know the IDF, you have to know how many
documents the query matches. To do that for an arbitrary query, you
have to run a search. KinoSearch::Search::Similarity has a private
idf() method, but it works on terms, not arbitrary queries...

Let's assume you mean a term, for the sake of getting things started.
Let's also assume that you don't really mean "one specific document",
even though that's exactly what you said. :)

Here's some code that goes in that general direction: it prints out TF
for each document which matches a specific term. It requires svn
trunk and uses some private methods:

my $invindex = MySchema->open('/path/to/invindex');
my $reader = KinoSearch::Index::IndexReader->open(
invindex => $invindex,
);
my $posting_list = $reader->posting_list(
field => 'title',
term => 'foo',
);
my $sim = $invindex->get_schema->fetch_sim('title');
while ( my $doc_num = $posting_list->next ) {
my $doc = $reader->fetch_doc($doc_num);
my $posting = $posting_list->get_posting;
my $num_occurrences = $posting->get_freq;
my $tf = $sim->tf($freq);
print "'$doc->{title}' FREQ: $num_occurrences TF: $tf\n";
}

> P.S. FYI, I could not subscribe to this list, post messages, or
> apparently even e-mail marvin at rectangular directly from my
> hotmail account.

Interesting. I received your private email and wrote back. Maybe
hotmail is blocking rectangular.com or something. AOL blockaded me
once because the previous tenants on the Comcast IP block
rectangular.com got assigned to weren't good netizens.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


jack_tanner at yahoo

Apr 11, 2008, 12:07 PM

Post #3 of 17 (2283 views)
Permalink
Re: get doc/query similarity [In reply to]

> From: Marvin Humphrey <marvin [at] rectangular>
>
> Let's assume you mean a term, for the sake of getting things started.
> Let's also assume that you don't really mean "one specific document",
> even though that's exactly what you said. :)

Thanks for that example. Let me be more clear about what is desired: I need to compute the similarity of two indexed documents. I'd like it if the metric was more sophisticated than mere term overlap. At a minimum, it could be Jaccard (i.e., doc length-normalized term overlap). It would be preferable to have something that takes corpus statistics into account. For example, if in my corpus some term T has high TF and low IDF (occurs often and in many docs), then such a term could be downweighted. Could you suggest a way of doing this? Ideally with KS 0.162?

> Interesting. I received your private email and wrote back. Maybe
> hotmail is blocking rectangular.com or something. AOL blockaded me
> once because the previous tenants on the Comcast IP block
> rectangular.com got assigned to weren't good netizens.

I never got your response at my hotmail address, not even in the spam folder. If you like, I could forward a complaint to Hotmail's postmaster. Please send a test e-mail to my @yahoo and cc my @hotmail, and I'll forward that along.




__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


jack_tanner at yahoo

Apr 15, 2008, 7:13 AM

Post #4 of 17 (2278 views)
Permalink
Re: get doc/query similarity [In reply to]

Ping? Still trying to compute similarity of two indexed docs... A weighted cosine or some such.

Thanks in advance.

----- Original Message ----
> From: "jack_tanner [at] yahoo" <jack_tanner [at] yahoo>
> To: KinoSearch discussion forum <kinosearch [at] rectangular>
> Sent: Friday, April 11, 2008 3:07:05 PM
> Subject: Re: [KinoSearch] get doc/query similarity
>
> From: Marvin Humphrey
> >
> > Let's assume you mean a term, for the sake of getting things started.
> > Let's also assume that you don't really mean "one specific document",
> > even though that's exactly what you said. :)
>
> Thanks for that example. Let me be more clear about what is desired: I need to
> compute the similarity of two indexed documents. I'd like it if the metric was
> more sophisticated than mere term overlap. At a minimum, it could be Jaccard
> (i.e., doc length-normalized term overlap). It would be preferable to have
> something that takes corpus statistics into account. For example, if in my
> corpus some term T has high TF and low IDF (occurs often and in many docs), then
> such a term could be downweighted. Could you suggest a way of doing this?
> Ideally with KS 0.162?
>
> > Interesting. I received your private email and wrote back. Maybe
> > hotmail is blocking rectangular.com or something. AOL blockaded me
> > once because the previous tenants on the Comcast IP block
> > rectangular.com got assigned to weren't good netizens.
>
> I never got your response at my hotmail address, not even in the spam folder. If
> you like, I could forward a complaint to Hotmail's postmaster. Please send a
> test e-mail to my @yahoo and cc my @hotmail, and I'll forward that along.
>
>
>
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam? Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com
>
>
> _______________________________________________
> KinoSearch mailing list
> KinoSearch [at] rectangular
> http://www.rectangular.com/mailman/listinfo/kinosearch
>




____________________________________________________________________________________
Be a better friend, newshound, and
know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


marvin at rectangular

Apr 15, 2008, 10:30 AM

Post #5 of 17 (2275 views)
Permalink
Re: get doc/query similarity [In reply to]

> Ping? Still trying to compute similarity of two indexed docs... A
> weighted cosine or some such.

I've started a reply to this several times, then balled it up and
ashcanned it. I understand what you want theoretically, and the
document frequency and term frequency information is in the index and
accessible at least via private APIS. The question is how to achieve
whatever your end goal is efficiently and conveniently.

> Thanks for that example. Let me be more clear about what is desired:
> I need to compute the similarity of two indexed documents.

Are you doing something akin to a "more like this" query? What does
the end user API look like?

The brute force way is to take the contents of a document or possibly
a distillation of the contents and use that as your query, hand off to
a Searcher and see what the search gives back. That gives you a bunch
of docs, though -- not just one. You can constrain the search by
adding a "primary key"-type requirement, though performance of such a
search might be a concern with large indexes due to the way KS
compiles its queries.

If that doesn't meet your needs, then I'm not sure how to answer. I'm
approaching this like a Query/Scorer design question -- I assume that
you need not only to compare two documents, but that you need to do it
*more than once*. Is that right?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


nate at verse

Apr 15, 2008, 12:19 PM

Post #6 of 17 (2276 views)
Permalink
Re: get doc/query similarity [In reply to]

On Tue, Apr 15, 2008 at 8:13 AM, <jack_tanner [at] yahoo> wrote:
> Ping? Still trying to compute similarity of two indexed docs... A weighted cosine or some such.

Hi Jack --

Like Marvin asks, it would help to know some more details about what
you hoping to do. If you don't need exact control of the algorithm,
it sounds like you should be able to generate a long Boolean query
based on the contents of the initial document, and let the default
TF/IDF scorer handle the details.

This is going to be a pretty expensive query, though, and depending on
your usage patterns you might want to precompute these. Depending on
the overlap of your documents and how heavily you make use of
stop-words, presume you may have to sift through about half your
corpus, either from disk or memory depending on your situation.

If you need more control over the scoring, Marvin may try to convince
you to use the Index API's directly. Don't let him get off this easy!
Having thought about this for a good three minutes :), I'm convinced
this is a good test for the flexibility of the KinoSearch
architecture. With a couple custom Scorers, and a custom Collector if
you need all combinations, you should be able to make this work very
well.

That said, if speed is a priority, if you need arbitrary matches, and
if you can't precompute the correlations, there might be some fancier
approaches to use. For example, doing the scoring term by term rather
than doc by doc could save you a lot of function call overhead and be
several times faster. This might require some more custom pieces,
though. But I'm not sure which ones --- get Marvin to write that
overview doc of how all the pieces fit together. ;)

Nathan Kurz
nate [at] verse

ps. Marvin --- the term-by-term approach might be a useful general
optimization for a special purpose additive OrScorer. It's a
speed-memory tradeoff: instead of computing a final score for each
document and moving on, you allocate an array of scores with an entry
for each doc in your corpus. For each term occurrence, you add a
partial score to the doc slot in the array. Because this can be done
in a tight loop, this can be really fast, especially if the score
array can fit in L2 and if you read in the occurrence data non-cached.

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


marvin at rectangular

Apr 15, 2008, 1:11 PM

Post #7 of 17 (2275 views)
Permalink
Re: get doc/query similarity [In reply to]

On Apr 15, 2008, at 12:19 PM, Nathan Kurz wrote:

> This is going to be a pretty expensive query, though, and depending on
> your usage patterns you might want to precompute these. Depending on
> the overlap of your documents and how heavily you make use of
> stop-words, presume you may have to sift through about half your
> corpus, either from disk or memory depending on your situation.

Right.

Gory details: KS can't do super-effective primary key query
optimization because of index data compression. It doesn't know
*where* in the big blob of postings data the bit relating to document
primary_key_id=23412 lies. It's smart enough to stop scanning when no
more docs can match, but it still has to scan each posting list up to
that point.

> ps. Marvin --- the term-by-term approach might be a useful general
> optimization for a special purpose additive OrScorer.

Yeah, term-at-a-time scoring is great stuff, it's just that the
combining scorers in KS all need to go doc-at-a-time in order to
handle boolean constraints without blowing up.

I've been thinking about adding new public classes ORQuery, ANDQuery,
ANDNOTQuery and ANDORQuery. BooleanQuery would either be deprecated
or removed; the logic from the compilation phase of BooleanScorer's
first iteration would be moved to QueryParser.

Historical note: the original BooleanScorer, used in KS maint, is not
a wrapper around other combining scorers, it's an altogether different
beast. The SHOULD/MUST/MUST_NOT BooleanClause API fits that
implementation, but is, uh, less boolean than the modern component-
based approach allows.

Would that help with A) a term-at-a-time ORScorer, and B) your
subclasses?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


jack_tanner at yahoo

Apr 15, 2008, 9:33 PM

Post #8 of 17 (2278 views)
Permalink
Re: get doc/query similarity [In reply to]

> From: Marvin Humphrey <marvin [at] rectangular>
>
> I've started a reply to this several times, then balled it up and
> ashcanned it. I understand what you want theoretically, and the
> document frequency and term frequency information is in the index and
> accessible at least via private APIS. The question is how to achieve
> whatever your end goal is efficiently and conveniently.

So, you're asking me why exactly I want to go shoot myself in the foot. :)

The setting is NOT a general IR application. I'm working with a very small corpus, and expensive operations are just fine with me.

This is a kind of an duplicate detection task. I have a corpus of documents written by a known, small set of authors. I want to rank the authors w.r.t. how much they repeat themselves. To do that, I want to take all docs written by the same author, compute their pairwise similarities, and then average those similarities. (Probably just take the mean.) I'm going to repeat this for all authors. At the end, I have a "repetitiveness" score for each author. This score is the actual end goal.

> The brute force way is to take the contents of a document or possibly
> a distillation of the contents and use that as your query, hand off to
> a Searcher and see what the search gives back. That gives you a bunch
> of docs, though -- not just one. You can constrain the search by
> adding a "primary key"-type requirement, though performance of such a
> search might be a concern with large indexes due to the way KS
> compiles its queries.

I can definitely do that, and then just loop over the hits until I get the doc of interest. The only problem is if the doc of interest is not retrieved at all... but then I can assign that a score of 0.




____________________________________________________________________________________
Be a better friend, newshound, and
know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


marvin at rectangular

Apr 15, 2008, 11:59 PM

Post #9 of 17 (2271 views)
Permalink
Re: get doc/query similarity [In reply to]

On Apr 15, 2008, at 9:33 PM, jack_tanner [at] yahoo wrote:
> This is a kind of an duplicate detection task. I have a corpus of
> documents written by a known, small set of authors. I want to rank
> the authors w.r.t. how much they repeat themselves. To do that, I
> want to take all docs written by the same author, compute their
> pairwise similarities, and then average those similarities.
> (Probably just take the mean.) I'm going to repeat this for all
> authors. At the end, I have a "repetitiveness" score for each
> author. This score is the actual end goal.

Neat. Not that this is what you're doing, but I can imagine something
like this being used as a supervisory tool for people who get paid for
generating content when the primary criteria is volume rather than
quality. Copy-and-paste documents with minor variations would appear
tightly grouped in vector space.

>> The brute force way is to take the contents of a document or possibly
>> a distillation of the contents and use that as your query, hand off
>> to
>> a Searcher and see what the search gives back. That gives you a
>> bunch
>> of docs, though -- not just one. You can constrain the search by
>> adding a "primary key"-type requirement, though performance of such a
>> search might be a concern with large indexes due to the way KS
>> compiles its queries.
>
> I can definitely do that, and then just loop over the hits until I
> get the doc of interest. The only problem is if the doc of interest
> is not retrieved at all... but then I can assign that a score of 0.


Please let us know how it goes.

I suggest using only one field, otherwise you might get some
distortions and exaggerations in the scoring curves as artifacts of
the query parsing wizard.

You may also run afoul of the max_clause_count of 1024 in BooleanQuery
because the queries will have so many components. To defeat this in
0.1x, add this to your code:

# hack to override safety feature
local
$KinoSearch::Search::BooleanQuery::instance_vars{max_clause_count}
= $a_really_big_number;

KinoSearch's scoring model uses Lucene's slight variant on vanilla TF/
IDF. Length normalization is in there; the resolution is low, but
that shouldn't matter. The one thing that's a little unusual is the
addition of a "coord" function which boosts OR'd queries when multiple
clauses match. It will affect your scores, but probably not too much
since the formula is proportional: num_matchers / max_matchers.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


jack_tanner at yahoo

Apr 16, 2008, 7:43 AM

Post #10 of 17 (2267 views)
Permalink
Re: get doc/query similarity [In reply to]

> From: Marvin Humphrey <marvin [at] rectangular>
>
> Neat. Not that this is what you're doing, but I can imagine something
> like this being used as a supervisory tool for people who get paid for
> generating content when the primary criteria is volume rather than
> quality. Copy-and-paste documents with minor variations would appear
> tightly grouped in vector space.

Bingo. There are many situations where the authors are generating content under incentives, and where quality may be degraded in favor of volume or some other parameters.

> Please let us know how it goes.

Will do, and thanks for your time and advice. I still think it'd be nice if KS exposed such a similarity computation via the API; it'd be much more efficient that way.




____________________________________________________________________________________
Be a better friend, newshound, and
know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


marvin at rectangular

Apr 16, 2008, 11:04 AM

Post #11 of 17 (2263 views)
Permalink
Re: get doc/query similarity [In reply to]

On Apr 16, 2008, at 7:43 AM, jack_tanner [at] yahoo wrote:

> Will do, and thanks for your time and advice. I still think it'd be
> nice if KS exposed such a similarity computation via the API; it'd
> be much more efficient that way.

I agree, and I would have liked to have discussed that. Had you not
been constrained by having to use the maint branch, I might have
steered things in that direction.

A lot of best work on KS, both high-level design and low-level code,
has arisen from collaborations between myself and someone who has an
itch to scratch. I'm always on the lookout for such potential partners.

In your case, though, my impression was that you were quite
knowledgeable, but that your project did not need the devel branch
badly enough to guarantee sustained momentum over the course of what
would likely be a drawn-out design discussion.

Exposing similarity measures would be superficially easy -- all the
relevant material is in KinoSearch::Search::Similarity. However, the
actual APIs to interface with the math in Similarity are internal and
not set up for use the way you described your needs. The bigger
problems were how to get at "an indexed document", how to list its
terms, and so on, outside of the context of the existing search API.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


jack_tanner at yahoo

Apr 18, 2008, 6:22 AM

Post #12 of 17 (2253 views)
Permalink
Re: get doc/query similarity [In reply to]

> From: Marvin Humphrey <marvin [at] rectangular>
>
> In your case, though, my impression was that you were quite
> knowledgeable, but that your project did not need the devel branch
> badly enough to guarantee sustained momentum over the course of what
> would likely be a drawn-out design discussion.

Oh, I'm always up for a discussion. As to whether I'm knowledgeable enough, no idea. It may just seem that way. :) But I'm happy to try to help.

> Exposing similarity measures would be superficially easy -- all the
> relevant material is in KinoSearch::Search::Similarity. However, the
> actual APIs to interface with the math in Similarity are internal and
> not set up for use the way you described your needs. The bigger
> problems were how to get at "an indexed document", how to list its
> terms, and so on, outside of the context of the existing search API.

Right. How about something like this:

$doc1 = $invindex->get_doc(id_field => 'doc_id', id_value => $id1);
$doc2 = $invindex->get_doc(id_field => 'doc_id', id_value => $id2);

I like that this gets the doc from the invindex rather than a searcher. It makes clear that it returns a doc, not a hit. It either succeeds (we get *the* doc, not any other doc), or fails.

$similarity = $doc1->get_cosine($doc2);

And more generally,

$similarity = $doc1->get_similarity($doc2, $my_similarity_fxn);

At indexing time, we probably do this:

$invindexer->spec_field(
name => 'doc_id',
analyzed => 0,
vectorized => 0,
indexed => 1,
);

On another note, here are my notes from implementing my doc/doc similarity code.

- As you point out, it'd be nice to get back an indexed document. I sidestepped this by recreating a document for each query from scratch. These were Boolean OR queries with lots of clauses.

- KS retrieval is asymmetrical (and that's fine). Let similarity(I,A,B) be a function that specifies document A as query against index I, iterates over the hits until it gets to document B, and returns the score of document B. Then similarity(I,A,B) != similarity(I,B,A). I handled this by retrieving both similarity(I,A,B) and similarity(I,B,A) and taking the average.

- One issue that still puzzles me is that KS is apparently capable of a hit score greater than 1! Is that really true?

- Here's a sample output:

Redundancy computation for author: 3600
textID 25 similar to 23 at 3.02506327629089
textID 25 similar to 24 at 3.00168991088867
textID 25 similar to 22 at 2.99010539054871
textID 25 similar to 21 at 1.60162734985352
textID 22 similar to 23 at 2.88369727134705
textID 22 similar to 25 at 2.82533693313599
textID 22 similar to 24 at 2.2787299156189
textID 22 similar to 21 at 1.63472175598145
textID 21 similar to 22 at 2.0984148979187
textID 21 similar to 25 at 1.85871315002441
textID 21 similar to 23 at 1.83884310722351
textID 21 similar to 24 at 1.51606929302216
textID 24 similar to 25 at 3.20844388008118
textID 24 similar to 23 at 2.80741047859192
textID 24 similar to 22 at 2.6320960521698
textID 24 similar to 21 at 1.36477994918823
textID 23 similar to 25 at 2.93027353286743
textID 23 similar to 22 at 2.9176025390625
textID 23 similar to 24 at 2.4804675579071
textID 23 similar to 21 at 1.44625544548035
author 3600 redundancy = 47.3403416872025 / 20 = 2.367017




____________________________________________________________________________________
Be a better friend, newshound, and
know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


nate at verse

Apr 19, 2008, 10:54 AM

Post #13 of 17 (2252 views)
Permalink
Re: get doc/query similarity [In reply to]

On Fri, Apr 18, 2008 at 7:22 AM, <jack_tanner [at] yahoo> wrote:
> $doc1 = $invindex->get_doc(id_field => 'doc_id', id_value => $id1);
> $doc2 = $invindex->get_doc(id_field => 'doc_id', id_value => $id2);

Something like this seems like a fine idea. Being able to pull a doc
out of the index seems useful.

> $similarity = $doc1->get_cosine($doc2);
> $similarity = $doc1->get_similarity($doc2, $my_similarity_fxn);

I'm less happy with this approach. This relates to my general feeling
that the way to solve this problem is not by adding new interfaces,
but by generalizing the scoring mechanism to allow alternative scorers
like this one. The 'right' solution would be to create a
CosineScorer, and make it work with the existing infrastructure.
Adding convenience methods to score individual documents would be
great, but should be added to the Scorer (or Similarity? or Weight?)
rather than to the Doc.

For background, one of the ways I was abusing KinoSearch in the past
was trying to use it as a framework for the Netflix Prize, which is a
contest to predict movie ratings: http://www.netflixprize.com
Loosely, my Docs were Users, my Terms were Movies. Much like Jack was
doing for similarity between documents, I wanted to be use Kinosearch
to do kNN searches for similar users. If you squint hard enough, this
problem is really similar to full text search. I failed and went with
a custom framework, but I don't see any reason why KinoSearch couldn't
accommodate other similar abuses.

Nathan Kurz
nate [at] verse

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


jack_tanner at yahoo

Apr 20, 2008, 8:45 PM

Post #14 of 17 (2263 views)
Permalink
Re: get doc/query similarity [In reply to]

> From: Nathan Kurz <nate [at] verse>
>
> > $similarity = $doc1->get_cosine($doc2);
> > $similarity = $doc1->get_similarity($doc2, $my_similarity_fxn);
>
> I'm less happy with this approach. This relates to my general feeling
> that the way to solve this problem is not by adding new interfaces,
> but by generalizing the scoring mechanism to allow alternative scorers
> like this one. The 'right' solution would be to create a
> CosineScorer, and make it work with the existing infrastructure.
> Adding convenience methods to score individual documents would be
> great, but should be added to the Scorer (or Similarity? or Weight?)
> rather than to the Doc.

I can dig that.




____________________________________________________________________________________
Be a better friend, newshound, and
know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


marvin at rectangular

Apr 23, 2008, 9:06 PM

Post #15 of 17 (2208 views)
Permalink
Re: get doc/query similarity [In reply to]

On Apr 18, 2008, at 6:22 AM, jack_tanner [at] yahoo wrote:

> Right. How about something like this:
>
> $doc1 = $invindex->get_doc(id_field => 'doc_id', id_value => $id1);
> $doc2 = $invindex->get_doc(id_field => 'doc_id', id_value => $id2);
>
> I like that this gets the doc from the invindex rather than a
> searcher.

InvIndex is a low-level class. (FYI, it's actually something
different in maint and devel, but in both cases it's low-level).
KinoSearch::Index::IndexReader, which has a private fetch_doc()
method, more closely resembles what you're looking for.

# private method
my $doc = $reader->fetch_doc($doc_num);

Searchable also specs a fetch_doc() method which is implemented in
Searcher as a call to $self->{reader}->fetch_doc().

However, those fetch_doc() methods operate on KinoSearch's internal
document numbers. KS document numbers aren't presently a part of
official API, and they change over time, which makes them both
confusing and of limited use.

What you're talking about is adding some sort of retrieve-by-primary-
key facility to KS.

> It makes clear that it returns a doc, not a hit.

FYI, in devel, $hits->fetch_hit() returns a HitDoc, which is a
subclass of Doc.

> It either succeeds (we get *the* doc, not any other doc), or fails.

Well, this is really database territory, which isn't KinoSearch's
element. Adding primary key constraints is something that could
potentially be done via a KSx subclass, but it would be very awkward
in core.

You can fake a retrieval by primary key something like this:

package MySearcher;
use base qw( KinoSearch::Searcher );

sub im_feeling_lucky {
my ( $self, $key ) = @_;
my $termquery = KinoSearch::Search::TermQuery->new(
field => 'pri_key_field',
term => $key,
);
my $hits = $searcher->search( query => $term_query, num_wanted
return $hits->fetch_hit;
}

That will work if you have your own exterior mechanism for
guaranteeing the uniqueness of a particular field during indexing.

> $similarity = $doc1->get_cosine($doc2);
>
> And more generally,
>
> $similarity = $doc1->get_similarity($doc2, $my_similarity_fxn);

Interesting. Similarity measures are implemented using pluggable
classes in KinoSearch, which suggests this...

my $sim = KinoSearch::Search::Similarity->new;
my $score = $sim->cosine( $doc1, $doc2 );

Doc objects are just collections of stored fields, though. They have
no idea what terms they contain. They have no idea how they're
parsed, and a Similarity object wouldn't have any idea how to parse
them either.

But here's where some fruitful possibilities arise.

Currently, KinoSearch writes a part of the index called "term
vectors", for which Highlighter is the primary consumer. The term
vector information consists of lists of the terms present in each
field, along with frequency, positions, start_offsets, and
end_offsets. KS accesses this information like so:

# Fetch a DocVector object, from which TermVector objects may be
extracted.
my $doc_vec = $searcher->fetch_doc_vec($doc_num);

The following cosine() method could theoretically work, because at
least all the information that's needed is present:

my $score = $sim->cosine( $doc_vec1, $doc_vec2 );

However, we'd need to expose a few more public APIs.

First, we need a way of obtaining document numbers from a search. The
easiest way to make this happen is to expose get_doc_num for HitDoc.
(There are other places as well, that's just the easiest and it would
work for our purposes.)

Second we need to expose DocVector, or rather, an improvement upon
DocVector because DocVector isn't ready for prime-time.

What Highlighter and you really need is a pre-analyzed document.
(Highlighter could actually work by analyzing fields on the fly --
indeed, Lucene's highlighter can be set up that way -- except for the
fact that analyzing on the fly can be unacceptably slow for large
documents or costly analyzers.) The questions are...

* What's a better name than DocVector? AnalyzedDoc?
* Should we store any other information besides the terms and
their positions, start_offsets and end_offsets?
* How should the data file be formatted?

This is something I really want to nail in the file format, because
that's the hardest thing to change.

> - KS retrieval is asymmetrical (and that's fine). Let
> similarity(I,A,B) be a function that specifies document A as query
> against index I, iterates over the hits until it gets to document B,
> and returns the score of document B. Then similarity(I,A,B) !=
> similarity(I,B,A). I handled this by retrieving both
> similarity(I,A,B) and similarity(I,B,A) and taking the average.
>
> - One issue that still puzzles me is that KS is apparently capable
> of a hit score greater than 1! Is that really true?

Yeah, absolutely. It's the same way with Lucene, and KS scoring is
directly derived from the Lucene scoring model. Lucene and KS only
care about coarse relative ranking, so there are some adulterations
and approximations in the similarity calculations.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


jack_tanner at yahoo

Apr 27, 2008, 1:49 PM

Post #16 of 17 (2202 views)
Permalink
Re: get doc/query similarity [In reply to]

----- Original Message ----
> From: Marvin Humphrey <marvin [at] rectangular>

[.snip discussion of retrieval of a doc via IndexReader or a Searcher]

OK, I grant that KS doesn't want to be an RDB. What seems to me to be a good idea is a method of retrieving *one specific* doc from the index, given some key, that is relatively cheaper than doing a search.

> You can fake a retrieval by primary key something like this:
>
> package MySearcher;
> use base qw( KinoSearch::Searcher );
>
> sub im_feeling_lucky {
> my ( $self, $key ) = @_;
> my $termquery = KinoSearch::Search::TermQuery->new(
> field => 'pri_key_field',
> term => $key,
> );
> my $hits = $searcher->search( query => $term_query, num_wanted
> => 1 );
> return $hits->fetch_hit;
> }
>
> That will work if you have your own exterior mechanism for
> guaranteeing the uniqueness of a particular field during indexing.

There's nothing wrong with that API. Suppose we could extend that by telling KS that we indeed have an exterior mechanism of guaranteeing the uniqueness of pri_key_field. It could then stop searching the moment it gets the first hit, and return that. If we fail in our uniqueness guarantee, well, it still only returns one hit, and we can't know deterministically which one.

Moreover, KS could have a special-cased optimization for searching that field, perhaps requireing some syntax restrictions on the value (numeric only? single token only?).

By the way, there seems to be a related mechanism for $invindexer->delete_docs_by_term().

> First, we need a way of obtaining document numbers from a search. The
> easiest way to make this happen is to expose get_doc_num for HitDoc.
> (There are other places as well, that's just the easiest and it would
> work for our purposes.)

But you just said you don't want to expose internal KS doc numbers?

> * What's a better name than DocVector? AnalyzedDoc?

Yes.

> * Should we store any other information besides the terms and
> their positions, start_offsets and end_offsets?

You probably want to make sure that term frequencies are accessible, as well as left/right positional context. I'm not making claims about how these should be stored, just that they're accessible efficiently.

One similarity metric that's useful to compute is doc-doc similarity over token or character n-grams. How would one do that in our brave new world?

> * How should the data file be formatted?

That's a bit beyond my reach.

> Yeah, absolutely. It's the same way with Lucene, and KS scoring is
> directly derived from the Lucene scoring model. Lucene and KS only
> care about coarse relative ranking, so there are some adulterations
> and approximations in the similarity calculations.

Something that may be useful is a toggle to normalize the returned scores...

Can one do pseudo-relevance feedback using KS? That is, run a search, get some hits, then use the hits as the terms for a new search. Optionally, loop over the hits and exclude unwanted docs before executing the new search.






____________________________________________________________________________________
Be a better friend, newshound, and
know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


marvin at rectangular

Apr 28, 2008, 2:11 PM

Post #17 of 17 (2193 views)
Permalink
Re: get doc/query similarity [In reply to]

On Apr 27, 2008, at 1:49 PM, jack_tanner [at] yahoo wrote:

> What seems to me to be a good idea is a method of retrieving *one
> specific* doc from the index, given some key, that is relatively
> cheaper than doing a search.

Here's the slightly more optimized version -- which, unlike the
earlier version, will work with a Searcher but not a MultiSearcher.
Also, it returns the first hit rather than the highest scoring hit.

package MySearcher;
use base qw( KinoSearch::Searcher );

sub single_hit {
my ( $self, $key ) = @_;
my $reader = $self->get_reader;
my $posting_list = $reader->posting_list(
field => 'pri_key_field',
term => $key,
);
return unless $posting_list;
my $doc_num = $posting_list->next;
return unless $doc_num;
return $reader->fetch_doc($doc_num);
}

If the term is unique, I doubt that the there will be significant
performance differences between the two implementations. The i/o
costs are exactly the same -- the searching technique just involves a
few more wrappers.

The more important difference is that you can get a PostingList from
an IndexReader, but not from a Searcher. PostingLists, like
IndexReaders, are tied to individual machines. Searcher is a single-
machine subclass of Searchable; and Searchable's interface is designed
to work with document collections which may or may not reside on a
single machine.

I realize that sounds kind of complicated, but the point is that we
have to balance two competing design goals with Searcher. On one hand,
it's a single entrance point in a large, flexible retrieval API -- and
it has to remain a responsible role player within the larger system.
On the other hand, it's the most popular entrance point, so it has to
be easy to use without needing to grok the whole system.

> Suppose we could extend that by telling KS that we indeed have an
> exterior mechanism of guaranteeing the uniqueness of pri_key_field.
> It could then stop searching the moment it gets the first hit, and
> return that.

As soon as the Searcher exhausts the posting list, scoring stops. So
the cost to score all hits actually depends on whether you
successfully enforce uniqueness from outside. :)

> Moreover, KS could have a special-cased optimization for searching
> that field, perhaps requireing some syntax restrictions on the value
> (numeric only? single token only?).

You'll need to set the field to be unanalyzed; otherwise you have add
an Analyzer into the mix.

> By the way, there seems to be a related mechanism for $invindexer-
> >delete_docs_by_term().

InvIndexer is always operating on a single machine.

Hmm. Maybe InvIndexer should be a subclass of a more general abstract
class defining an indexing interface which isn't necessarily tied to
one machine -- i.e. an index-time counterpart to Searchable.

>> First, we need a way of obtaining document numbers from a search.
>> The
>> easiest way to make this happen is to expose get_doc_num for HitDoc.
>> (There are other places as well, that's just the easiest and it would
>> work for our purposes.)
>
> But you just said you don't want to expose internal KS doc numbers?

I've been trying to keep them hidden, and we've gotten pretty far
without them.

But it was inevitable that doc numbers would get exposed eventually,
because they are needed by many of the low-level, expert components
we're starting to expose: PostingList, HitCollector, etc.

The downside of exposing KinoSearch's document numbers is that their
behavior isn't intuitive: they're consistent for the life of a
Searchable or IndexReader, but outside of that, they sometimes change.

>> * Should we store any other information besides the terms and
>> their positions, start_offsets and end_offsets?
>
> You probably want to make sure that term frequencies are accessible,

Good thought. They're already in there, but I had neglected to
mention it.

> as well as left/right positional context.

I think you're just using different names for what I'm calling
"start_offset" and "end_offset", which are measured in Unicode code
points from the top of the field.

> One similarity metric that's useful to compute is doc-doc similarity
> over token or character n-grams. How would one do that in our brave
> new world?

I think you'd want to actually index the n-grams by specifying some
sort of n-gram Analyzer subclass, then use the same techniques we've
been discussing.

> Something that may be useful is a toggle to normalize the returned
> scores...


They did that with the Lucene version of Hits and it seems to have
resulted in a lot of confusion. I think it's misleading, because it
gives the illusion of an absolute similarity measure but not the
reality.

You can always normalize scores yourself off of the score of the first
HitDoc returned by the Hits object. (Though that technique doesn't
work with sorted results because the first hit probably isn't the
highest scoring.)

> Can one do pseudo-relevance feedback using KS? That is, run a
> search, get some hits, then use the hits as the terms for a new
> search. Optionally, loop over the hits and exclude unwanted docs
> before executing the new search.


You can do that, but I've never been impressed with how "more like
this" queries behave using vanilla TF-IDF. Rare terms in the feedback
documents which are unrelated to the original query -- especially
proper names -- tend to spawn high-scoring irrelevant results.

To make such a feedback system work well, I think you'd need be
working with a decomposed term space a la LSA and apply the term space
of the original query as a filter on the term space of each feedback
doc.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch

kinosearch discuss RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.