Mailing List Archive: Lucene: Java-User

Term Boost Threshold


ihasmax at gmail

Nov 13, 2009, 2:09 PM

Post #1 of 10 (1037 views)
Term Boost Threshold

Hi,
I am trying to move from a system where I counted the frequency of terms by
hand in a highlighter to determine if a result was useful to me. In an
earlier post on this list someone suggested I could boost the terms that are
useful to me and only accept hits above a certain threshold. However, in my
tests, I can't seem to find a deterministic way of calculating a threshold.

Here is an example of what I mean:
My query: "John Smith" "John Smith Manufacturing" "San Francisco"
"California"

Results are only useful to me if they contain the first term "John Smith"
and/or the second term "John Smith Manufacturing" or any combination with
the other San Fran and California terms. However, results with just "San
Francisco" or "California" can be ignored.

I tried something like "John Smith"^200 "John Smith Manufacturing"^100 "San
Francisco"^2 "California"^1

But I can't seem to find a good method of calculating a cut-off score and
filtering out the results that are only San Fran or California using the
term boosting and resulting score. I also don't care about frequency,
meaning that I want the result even if John Smith occurs once, and I don't
want a document with "San Francisco" a million times to score higher than
the single result for John Smith.

Sorry if that's confusing.

Any ideas?

Thanks,
Max


jake.mannix at gmail

Nov 13, 2009, 2:16 PM

Post #2 of 10 (1010 views)
Re: Term Boost Threshold [In reply to]

Hi Max,

You want a query like

("San Francisco" OR "California") AND ("John Smith" OR "John Smith
Manufacturing")

essentially? You can give Lucene exactly this query and it will require that
either "John Smith" or "John Smith Manufacturing" be present, but it will score
results that also contain one or more of San Fran or CA higher. In fact,
results that match all terms will score highest.

Does that help?

-jake



ihasmax at gmail

Nov 13, 2009, 2:24 PM

Post #3 of 10 (1009 views)
Re: Term Boost Threshold [In reply to]

> You want a query like
>
> ("San Francisco" OR "California") AND ("John Smith" OR "John Smith
> Manufacturing")
>

Won't this require San Francisco or California to be present? I do not
require them to be, I only require "John Smith" OR "John Smith
Manufacturing", but I want to get a bigger score if the city and state are
mentioned along with it, so that's why I was thinking of using different
term boosts. The most important term is the name, then the company,
then the city, and then the state. Finding each one increases the
quality of the result for me.

Thanks.

-max


jake.mannix at gmail

Nov 13, 2009, 2:29 PM

Post #4 of 10 (1014 views)
Re: Term Boost Threshold [In reply to]

Did I do that wrong? I always mess up the AND/OR human-readable form
of this - it's clearer when you use +/- unary operators instead:

query: "San Francisco" "California" +("John Smith" "John Smith
Manufacturing")

Here the San Fran and CA clauses are optional, and the ("John Smith" OR
"John Smith Manufacturing") is required.

-jake
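
The required/optional semantics described above can be sketched in plain Java. This is only an illustration, not Lucene code: phrase matching is approximated with a case-insensitive substring test, and the class and method names (`ClauseSemantics`, `accepts`, `optionalHits`) are made up for the example.

```java
final class ClauseSemantics {
    // Phrases inside the +( ... ) group: at least one must be present.
    static final String[] REQUIRED = {"John Smith", "John Smith Manufacturing"};
    // Phrases outside the group: optional, they only influence ranking.
    static final String[] OPTIONAL = {"San Francisco", "California"};

    // Counts how many of the given phrases occur in the text (case-insensitive substring test).
    static int count(String text, String[] phrases) {
        String lower = text.toLowerCase();
        int n = 0;
        for (String p : phrases) {
            if (lower.contains(p.toLowerCase())) {
                n++;
            }
        }
        return n;
    }

    // A document is a hit only if the required group matches.
    static boolean accepts(String text) {
        return count(text, REQUIRED) > 0;
    }

    // Number of optional clauses present; more of them should mean a better rank.
    static int optionalHits(String text) {
        return count(text, OPTIONAL);
    }

    public static void main(String[] args) {
        String[] docs = {
            "John Smith moved to San Francisco, California",
            "John Smith Manufacturing",
            "San Francisco, California"
        };
        for (String d : docs) {
            System.out.println(accepts(d) + " / optional=" + optionalHits(d) + " : " + d);
        }
    }
}
```

The third document is rejected even though it matches both optional phrases, which is the behaviour Max asked for.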



ihasmax at gmail

Nov 13, 2009, 3:35 PM

Post #5 of 10 (999 views)
Re: Term Boost Threshold [In reply to]

> query: "San Francisco" "California" +("John Smith" "John Smith
> Manufacturing")
>
> Here the San Fran and CA clauses are optional, and the ("John Smith" OR
> "John Smith Manufacturing") is required.
>

Thanks Jake, that works nicely.

Now, I would like to know exactly what term was found. For example, if a
result comes back from the query above, how do I know whether John Smith was
found, or both John Smith and his company, or just John Smith Manufacturing
was found? The way I am doing that right now is using a highlighter (which
unfortunately breaks up "John Smith" into <b>John</b><b>Smith</b>) and
combining the terms that are to be highlighted and keeping track of them so
I know they were found. If there was a simple way to just check which part
of that query was matched that would be awesome. This is why I was thinking
of using the term boosting and using a threshold to say "Well, if the score
is above this value, then I can assume that "John Smith" was found, but if
the score is under a certain threshold, I can say that only his company was
found", without having to use the highlighter and noting when a term I'm
looking for is to be highlighted. Is there a solution?

Thanks,
Max


jake.mannix at gmail

Nov 13, 2009, 3:48 PM

Post #6 of 10 (1005 views)
Re: Term Boost Threshold [In reply to]

On Fri, Nov 13, 2009 at 3:35 PM, Max Lynch <ihasmax [at] gmail> wrote:

> > query: "San Francisco" "California" +("John Smith" "John Smith
> > Manufacturing")
> >
> > Here the San Fran and CA clauses are optional, and the ("John Smith" OR
> > "John Smith Manufacturing") is required.
> >
>
> Thanks Jake, that works nicely.
>
> Now, I would like to know exactly what term was found. For example, if a
> result comes back from the query above, how do I know whether John Smith
> was
> found, or both John Smith and his company, or just John Smith Manufacturing
> was found?


In general, this is actually very hard. Lucene does not even keep track itself
of which terms in a given query matched a given document, but you really
just need to know which terms matched in the final "top hits" you're showing
to the user, right? What is this information used for / why do you want to
know which term hit?

-jake
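
If only the final hits need to be classified, one option is to check each hit's text against the query phrases after the search has run. The sketch below is plain Java, not a Lucene API. It assumes the application can get at the text of each hit, and it uses a case-insensitive substring test, which can disagree with what the analyzer indexed. `MatchedPhrases` and `matched` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class MatchedPhrases {
    // Returns the query phrases that occur in the text, in the order they were given.
    static List<String> matched(String text, String... phrases) {
        String lower = text.toLowerCase();
        List<String> found = new ArrayList<>();
        for (String p : phrases) {
            if (lower.contains(p.toLowerCase())) {
                found.add(p);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String doc = "John Smith opened an office in San Francisco.";
        System.out.println(matched(doc,
                "John Smith", "John Smith Manufacturing", "San Francisco", "California"));
    }
}
```

Because every phrase is tested as a whole, "John Smith" is reported as one match and not as two separately highlighted words.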




ihasmax at gmail

Nov 13, 2009, 4:02 PM

Post #7 of 10 (1004 views)
Re: Term Boost Threshold [In reply to]

> > Now, I would like to know exactly what term was found. For example, if a
> > result comes back from the query above, how do I know whether John Smith
> > was
> > found, or both John Smith and his company, or just John Smith
> Manufacturing
> > was found?
>
>
> In general, this is actually very hard. Lucene does not even keep track
> itself
> of which terms in a given query matched a given document, but you really
> just need to know which terms matched in the final "top hits" you're
> showing
> to the user, right? What is this information used for / why do you want to
> know which term hit?


Well I use results that have a name match as more important than ones with a
company match, and ones with both are the most important. I was hoping term
boosting would help me mathematically detect these cases (for example, a
firstname + company match would have detectably higher score) without having
to use a highlighter for what is clearly not its purpose. I also am not
using a traditional search display, so every result I find is important and
there is no pagination (it's a background search).

Is it possible to do this with term boosting? Otherwise my highlighter
solution works for the time being, it's just slow.

Thanks,
Max


jake.mannix at gmail

Nov 13, 2009, 4:17 PM

Post #8 of 10 (1008 views)
Re: Term Boost Threshold [In reply to]

On Fri, Nov 13, 2009 at 4:02 PM, Max Lynch <ihasmax [at] gmail> wrote:

> Well I use results that have a name match as more important than ones with
> a
> company match, and ones with both are the most important. I was hoping
> term
> boosting would help me mathematically detect these cases (for example, a
> firstname + company match would have detectably higher score) without
> having
> to use a highlighter for what is clearly not its purpose. I also am not
> using a traditional search display, so every result I find is important and
> there is no pagination (it's a background search).
>

Well already, without doing any boosting, documents matching more of the terms
in your query will score higher. If you really want to make this effect more
pronounced, yes, you can boost the more important query terms higher.

-jake


ihasmax at gmail

Nov 13, 2009, 4:21 PM

Post #9 of 10 (994 views)
Re: Term Boost Threshold [In reply to]

> Well already, without doing any boosting, documents matching more of the
> terms in your query will score higher. If you really want to make this
> effect more pronounced, yes, you can boost the more important query terms
> higher.
>
> -jake
>

But there isn't a way to determine exactly what boosted term made up the
final score?


jake.mannix at gmail

Nov 13, 2009, 4:27 PM

Post #10 of 10 (995 views)
Re: Term Boost Threshold [In reply to]

On Fri, Nov 13, 2009 at 4:21 PM, Max Lynch <ihasmax [at] gmail> wrote:

> > Well already, without doing any boosting, documents matching more of the
> > terms in your query will score higher. If you really want to make this
> > effect more pronounced, yes, you can boost the more important query terms
> > higher.
> >
> > -jake
> >
>
> But there isn't a way to determine exactly what boosted term made up the
> final score?
>

Not really. Scores are normalized relative to the top score by default, so
only the relative ordering of results has meaning.

-jake
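
A small numeric sketch of why a fixed cut-off does not carry over between result sets once scores are normalized. It assumes normalization means dividing every raw score by the top raw score, as the reply describes. The raw scores are invented for illustration, and `ScoreNormalization` is a hypothetical name, not a Lucene class.

```java
final class ScoreNormalization {
    // Divides every score by the largest one, so the top hit always ends up at 1.0.
    static double[] normalize(double[] raw) {
        double max = 0.0;
        for (double s : raw) {
            max = Math.max(max, s);
        }
        double[] out = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = max > 0 ? raw[i] / max : 0.0;
        }
        return out;
    }

    public static void main(String[] args) {
        // Two made-up result sets; both contain a hit with raw score 2.0.
        double[] first = normalize(new double[] {4.0, 2.0, 1.0});
        double[] second = normalize(new double[] {8.0, 2.0});
        System.out.println("raw 2.0 in first set  -> " + first[1]);
        System.out.println("raw 2.0 in second set -> " + second[1]);
    }
}
```

The same raw score lands on a different normalized value in each set, so a threshold tuned on one query says little about another, while the ordering within each set is unchanged.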
