
Mailing List Archive: Lucene: Java-User

similarity function

 

 



joel at su3analytics

Oct 28, 2009, 6:29 AM

Post #1 of 3 (483 views)
similarity function

Hi,

Given a query with multiple terms, e.g. fish oil, and searching across
multiple fields e.g.

query= fieldA:fish fieldA:oil fieldB:fish fieldB:oil etc...

I don't want to give any more weight to documents that match the same
word multiple times (either in the same, or different fields). I am only
interested in lending additional weight to a match of both words (fish
and oil) in the SAME field.

So for example if I have documents:

Doc1
fieldA=fish is good for you
fieldB=vegetable oil and sunflower oil is good for you

and Doc2
fieldA=fish oil is good for you
fieldB=bla bla bla

with the default similarity I would have 3 term matches in document 1
(fish, oil, oil) and 2 in document 2 (fish, oil), but I only want to
count 2 term matches in document 1 (fish, oil) and I want to give
increased weight to the two matches in document 2 because they occur in
the same field (fieldA).

Any ideas? Is there a simple way to achieve this? (it goes without
saying I want to match both documents, i.e. don't want to use quotes
"fish oil")

Thanks,
Joel



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


joel at su3analytics

Oct 28, 2009, 6:41 AM

Post #2 of 3 (453 views)
Re: similarity function

I suppose this could be summarised as:

"how do I set the score of each document result to be the score of the
field that best matches the search terms"?


-----Original Message-----
From: Joel Halbert <joel [at] su3analytics>
Reply-To: java-user [at] lucene
To: Lucene Users <java-user [at] lucene>
Subject: similarity function
Date: Wed, 28 Oct 2009 13:29:29 +0000



hossman_lucene at fucit

Nov 8, 2009, 4:04 PM

Post #3 of 3 (381 views)
Re: similarity function

: "how do I set the score of each document result to be the score of the
: field that best matches the search terms"?

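you'll want something like this (written against the 2.9-era API;
listOfFields and listOfWords are your own collections)...

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    // tie breaker 0.0: only the best-scoring field contributes to the doc score
    DisjunctionMaxQuery dq = new DisjunctionMaxQuery(0.0f);
    for (String fieldname : listOfFields) {
        BooleanQuery bq = new BooleanQuery();
        for (String word : listOfWords) {
            bq.add(new TermQuery(new Term(fieldname, word)), BooleanClause.Occur.SHOULD);
        }
        bq.setMinimumNumberShouldMatch(1); // match if at least one word is in this field
        dq.add(bq); // one BooleanQuery per field
    }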


...the DisjunctionMaxQuery will only take its score from whichever of the
BooleanQueries scores highest, and the setMinimumNumberShouldMatch(1) will
ensure that those boolean queries will match as long as at least one of the
words is found in that field, but the more words that match the higher the score.
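
To see why taking the best field's score gives the ranking Joel asked for, here is a rough standalone sketch in plain Java. It is a toy model, not Lucene's real scoring formula: each field scores the number of distinct query words it contains, and the document takes the maximum over its fields (the tie breaker 0.0 case). The class and method names (BestFieldScore, fieldScore, docScore) are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of max-over-fields scoring with a flat tf; not Lucene's real formula. */
final class BestFieldScore {

    /** Number of distinct query words present in one field's text. */
    static int fieldScore(String fieldText, List<String> words) {
        // Naive whitespace tokenizer, assumed here only for illustration.
        Set<String> tokens =
                new HashSet<>(Arrays.asList(fieldText.toLowerCase().split("\\s+")));
        int matched = 0;
        for (String w : new HashSet<>(words)) {
            if (tokens.contains(w.toLowerCase())) {
                matched++; // repeats of the same word in the field add nothing
            }
        }
        return matched;
    }

    /** Document score = best single field; other fields contribute nothing. */
    static int docScore(Map<String, String> doc, List<String> words) {
        int best = 0;
        for (String text : doc.values()) {
            best = Math.max(best, fieldScore(text, words));
        }
        return best;
    }
}
```

With the two documents from the original post and the words fish and oil, Doc1 scores 1 under this model (no single field has both words) and Doc2 scores 2 (fieldA has both), so Doc2 ranks first while both still match.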

then all you need to do is modify your similarity class to change the tf()
function so that a doc doesn't get a really high score just for matching
one word many many times.
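
As a rough illustration of the tf() change, the sketch below contrasts the square-root curve that DefaultSimilarity uses with a flat variant where any number of occurrences counts once. It is plain Java with no Lucene dependency, and the names TfCurves, defaultTf and flatTf are hypothetical. In Lucene itself the flat version would presumably go into an override of tf(float freq) in a DefaultSimilarity subclass that you set on the searcher.

```java
/** Standalone contrast of two tf curves; a sketch, not a Lucene Similarity. */
final class TfCurves {

    /** The sqrt(freq) curve that Lucene's DefaultSimilarity uses for tf. */
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Flat variant: a term that occurs at all counts once, repeats add nothing. */
    static float flatTf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }
}
```

With the flat curve, the two occurrences of oil in Doc1's fieldB contribute the same as a single occurrence. Note that idf and length normalization would still affect the real score unless those are adjusted as well.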


-Hoss

