
Mailing List Archive: Lucene: Java-User

SweetSpotSimilarity

 

 



akos.tajti at gmail

Jul 20, 2011, 9:54 AM

Post #1 of 15
SweetSpotSimilarity

Dear List,

in our application there are many long documents that we index. Previously we had a problem with lucene's scoring: some documents got low scores because of their lengths. Then we started to use SweetSpotSimilarity and it seemed to solve the problem. But now we face another difficulty: it's hard to set the correct parameters for SweetSpotSimilarity. For example, we want the title of a page to always have a higher boost than its content, no matter how long the content is.
Do you have any idea?

Thanks in advance,
Ákos Tajti


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


ian.lea at gmail

Jul 21, 2011, 12:52 AM

Post #2 of 15
Re: SweetSpotSimilarity [In reply to]

Have you tried query time boosting of title queries? title:lucene^4
content:lucene. Might be easier than fiddling with sweetspot
arguments, although I see from the javadocs that "A per field min/max
can be specified if different fields have different sweet spots". Not
sure if that is relevant to you or not.


--
Ian.





paul at metajure

Feb 15, 2012, 5:24 PM

Post #3 of 15
RE: SweetSpotSimilarity [In reply to]

I'd love to hear what you find out. I have been working with this also.
I only changed the sweet spot to a slightly larger range than the one in the original paper (but kept the same steepness), and I tweaked the sloppy freq, by playing with sloppyFreq(distance), so that multiple occurrences of a phrase are not scored as strongly as they are in DefaultSimilarity.
hyperbolicTf() only comes into play if you override the tf method in your own subclass to call it instead of the baselineTf it normally calls. I also didn't get what it was trying to do.
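A minimal sketch of that wiring, assuming the Lucene 3.x contrib-misc API (the exact tf signature to override can vary between versions, so this is illustrative, not definitive):

```java
import org.apache.lucene.misc.SweetSpotSimilarity;

// SweetSpotSimilarity.tf() normally delegates to baselineTf(); hyperbolicTf()
// is only used if a subclass routes tf through it explicitly, like this.
// Overriding both overloads covers whichever one the scorer calls.
public class HyperbolicSweetSpotSimilarity extends SweetSpotSimilarity {
    @Override
    public float tf(int freq) {
        return hyperbolicTf(freq);
    }

    @Override
    public float tf(float freq) {
        return hyperbolicTf(freq);
    }
}
```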

-Paul

> -----Original Message-----
> From: Peyman Faratin [mailto:peyman [at] robustlinks]
> Sent: Wednesday, February 15, 2012 6:40 AM
> To: java-user [at] lucene
> Subject: SweetSpotSimilarity
>
> Hi
>
> I have a noobie question. I am trying to use the SweetSpotSimilarity (SSS) class.
>
> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/contrib-misc/org/apache/lucene/misc/SweetSpotSimilarity.html
>
> I understand the scoring behavior of Lucene
>
> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/Similarity.html
>
> And I am aware that SweetSpotSimilarity resulted from this paper
>
> http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
>
> However, I was wondering if there was a resource that explained (and gave examples) of how SSS
> works and what each parameter (hyperbolic, etc) means. I know this is a Lucene list but I am actually
> trying to use SSS in solr. I am aware of SOLR-1365 patch that makes configuring SSS easier (in the
> schema.xml)
>
> https://issues.apache.org/jira/browse/SOLR-1365
>
> but I am having trouble understanding what some of the 13 parameters are and how they map to SSS.
>
> Thank you
>
> Peyman
>
>




hossman_lucene at fucit

Feb 15, 2012, 8:36 PM

Post #4 of 15
RE: SweetSpotSimilarity [In reply to]

: sloppyFreq(distance). hyperbolicTf() only comes into play if you
: override the tf method in your own subclass to call it instead of the
: baselineTf which it normally calls. I also didn't get what it was
: trying to do.

Correct, as documented...

http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/contrib-misc/org/apache/lucene/misc/SweetSpotSimilarity.html

"For tf, baselineTf and hyperbolicTf functions are provided, which
subclasses can choose between."

tf() ... "Delegates to baselineTf"

hyperbolicTf ... "This code is provided as a convenience for subclasses
that want to use a hyperbolic tf function."

As for what hyperbolicTf is trying to do ... it creates a hyperbolic
function letting you specify a hard max no matter how many terms there
are.
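Stripped of the Lucene plumbing, that hard max is easy to see by evaluating the javadoc formula directly. A standalone sketch in plain Java, using default-style constants (min=0, max=2, base=1.3, xoffset=10); this is just the bare curve, not the Lucene class itself:

```java
// Standalone evaluation of the hyperbolicTf curve from the
// SweetSpotSimilarity javadocs. The tanh-shaped term swings between -1
// and +1, so the whole expression is pinned between MIN and MAX.
class HyperbolicTfSketch {
    static final double MIN = 0.0, MAX = 2.0, BASE = 1.3, XOFFSET = 10.0;

    // tf(x) = min + (max-min)/2 * ( (b^(x-o) - b^-(x-o)) / (b^(x-o) + b^-(x-o)) + 1 )
    static double hyperbolicTf(double freq) {
        double up = Math.pow(BASE, freq - XOFFSET);
        double down = Math.pow(BASE, -(freq - XOFFSET));
        return MIN + (MAX - MIN) / 2.0 * ((up - down) / (up + down) + 1.0);
    }

    public static void main(String[] args) {
        // At the offset, the curve sits exactly halfway between min and max.
        System.out.println(hyperbolicTf(10));   // 1.0
        // However large freq gets, tf never exceeds max -- the hard cap.
        System.out.println(hyperbolicTf(1000)); // 2.0
    }
}
```

Compare with DefaultSimilarity's unbounded sqrt(freq): here a term repeated a thousand times can never contribute more than MAX.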

: > And I am aware that SweetSpotSimilarity resulted from this paper
: >
: > http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf

For the record, that paper did not result in SSS -- I wrote SSS ~Dec 2005
and contributed it to Apache a few months later on behalf of CNET Networks,
where I developed it to solve some specific problems we had with
product data...

https://issues.apache.org/jira/browse/LUCENE-577
http://mail-archives.apache.org/mod_mbox/lucene-dev/200605.mbox/%3CF9F270C4-FA1E-460F-A54F-E2E56AAD0286%40rectangular.com%3E
(and subsequent replies)

...Doron wrote the paper later, although you'll note lots of discussions
around that time on the mailing list about customizing Similarity based
on domain-specific data -- the concepts certainly weren't novel.

: > However, I was wondering if there was a resource that explained (and gave examples) of how SSS
: > works and what each parameter (hyperbolic, etc) means. I know this is a Lucene list but I am actually

The functions are pretty clearly spelled out in the javadocs -- you just
set the options on the class to control the constant values of the
functions. The easiest way to understand them is probably to use
something like gnuplot to graph them using various values for the
constants, and then compare to graphs of the corresponding functions from
DefaultSimilarity.




-Hoss



paul at metajure

Feb 17, 2012, 11:41 AM

Post #5 of 15
RE: SweetSpotSimilarity [In reply to]

> -----Original Message-----
> From: Chris Hostetter [mailto:hossman_lucene [at] fucit]
> As for what hyperbolicTf is trying to do ... it creates a hyperbolic function letting you specify a hard max
> no matter how many terms there are.

A picture -- or more precisely a graph -- would be worth a thousand words. As it says in issue 577, "a hyperbolic tf function which is best explained by graphing the equation". That's great, but I couldn't find "Mark [Bennet's] nifty graph [...] (linked from his email)." Can anyone help locate what sounds like a useful resource?

The JavaDoc (which Chris probably also wrote way back when) says hyperbolic TANGENT function (http://www.dplot.com/fct_tanh.htm). That at least clarifies the basic shape, even if I (and apparently others, judging from the yearly questions on the Lucene list) have yet to work out the full impact of all the parameters, or how the hyperbolic tangent compares to the 1 / sqrt(freq + C) of the baseline -- which, I believe, degenerates to the DefaultSimilarity.tf formula when used with the defaults.

Another problem mentioned in the e-mail thread Chris linked is "people who know the 'sweetspot' of their data", but I have yet to find a definition of what is meant by "sweetspot", so I couldn't say whether I know my data's sweet spot or not.
Another question is how the tf_hyper_offset parameter should be chosen. It appears to be the inflection point of the tanh equation, but what term count should a caller consider centering there (or, put differently, where should the roughly level part of the graph sit)? Or more simply: why 10?
Any thoughts from anyone?

I also note that the JavaDoc says that the default tf_hyper_base ("the base value to be used in the exponential for the hyperbolic function") is e. But checking the code, the default is actually 1.3 (less than half of e). Should I file a doc bug?

To summarize: Does anyone have any resources along the lines of graphs of these (or any other) tf functions, general discussion of document collection sweet spot, and any insight into parameters of this class (hyperbolic tangent or otherwise)?

-Paul






hossman_lucene at fucit

Feb 28, 2012, 3:14 PM

Post #6 of 15
RE: SweetSpotSimilarity [In reply to]

: A picture -- or more precisely a graph -- would be worth a 1000 words.

Fair enough. I think the reason I never committed one initially was
that the formula in the javadocs was trivial to plot in gnuplot...

gnuplot> min=0
gnuplot> max=2
gnuplot> base=1.3
gnuplot> xoffset=10
gnuplot> set yrange [0:3]
gnuplot> set xrange [0:20]
gnuplot> tf(x)=min+(max-min)/2*(((base**(x-xoffset)-base**-(x-xoffset))/(base**(x-xoffset)+base**-(x-xoffset)))+1)
gnuplot> plot tf(x)

I'll try to get some graphs committed and linked from the javadocs to
make it clearer how tweaking the settings affects the formula.

: Another problem mentioned in the e-mail thread Chris linked is "people
: who know the 'sweetspot' of their data.", but I have yet to find a
: definition of what is meant by "sweetspot", so I couldn't say whether I
: know my data's sweet spot or not.

Hmmm... sorry, I kind of just always took it as self-evident. I'm not even
sure how to define it ... the sweetspot is "the sweetspot" ... the range
of good values such that things not in the sweetspot are atypical and
"less good".

To give a practical example: when I was working with product data we found
that the sweetspot for the length of a product name was between 4 and 10
terms. Products with fewer than 4 terms in the name field were usually
junk products (ie: "ram" or "mouse"), and products with more than 10 terms
in the name were usually junk products that had keyword stuffing going on.

Likewise we determined that for fields like the "product description" the
sweetspot for tf matching was around 1-5 (if I remember correctly) ...
because no one term appeared in a "well written" product description more
than 5 times -- any more than that was keyword spamming.

Every catalog of products is going to be different, and every domain is
going to be *much* different (ie: if you search books or encyclopedia
articles then the sweetspots are going to be much larger).
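The length-norm side of that example can be sketched in plain Java. This is just the plateau formula from the javadocs with a 4-10 term sweet spot and an assumed steepness of 0.5 (the real class takes these constants via setLengthNormFactors); it is a sketch of the curve, not the Lucene implementation:

```java
// Standalone evaluation of the SweetSpotSimilarity length-norm plateau.
// Inside [MIN, MAX] the penalty term is zero, so every length scores 1.0;
// outside it the norm falls off at a rate controlled by STEEPNESS.
class LengthNormSketch {
    static final int MIN = 4, MAX = 10;
    static final double STEEPNESS = 0.5; // assumed value for illustration

    // 1 / sqrt( steepness * (|l-min| + |l-max| - (max-min)) + 1 )
    static double lengthNorm(int numTerms) {
        double penalty = Math.abs(numTerms - MIN) + Math.abs(numTerms - MAX)
                       - (MAX - MIN);
        return 1.0 / Math.sqrt(STEEPNESS * penalty + 1.0);
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(4));  // 1.0 -- edge of the sweet spot
        System.out.println(lengthNorm(7));  // 1.0 -- inside: no length penalty
        System.out.println(lengthNorm(20)); // ~0.30 -- probable keyword stuffing
    }
}
```

Inside the 4-10 range the norm is flat, so all "good" lengths score equally; steepness only controls how fast scores fall off outside the plateau.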

: Another question is how the tf_hyper_offset parameter might be
: considered. It appears to be the inflexion point of the tanh equation,
: but what term count might a caller consider centering there ( or

Right ... it's the center of your sweetspot if you use hyperbolicTf; use
the value that makes sense for your data.

: I also note that the JavaDoc says that the default tf_hyper_base ("the
: base value to be used in the exponential for the hyperbolic function ")
: value is e. But checking the code the default is actually 1.3 (less than
: half e). Should I file a doc bug?

I'll fix that (if I remember correctly, "e" is the canonical value
typically used in hyperbolics for some reason, but for tf purposes it
made for a curve that was too steep to be generally useful by default, so
we changed it as soon as it was committed) ... thanks for pointing out the doc mistake.


-Hoss



hossman_lucene at fucit

Feb 28, 2012, 5:14 PM

Post #7 of 15
RE: SweetSpotSimilarity [In reply to]

: i'll try to get some graphs commited and linked to from the javadocs that
: make it more clear how tweaking the settings affect the formula

http://svn.apache.org/viewvc?rev=1294920&view=rev



-Hoss



paul at metajure

Mar 1, 2012, 10:40 AM

Post #8 of 15
RE: SweetSpotSimilarity [In reply to]

Hi Chris,

I didn't see your response until now. Thanks.
Actually, I was recently playing in fooplot, an online plotting tool (one of many), examining the various formulas to get a better handle on what they do.

Thanks for the discussion of 'sweetspot'. I'm thinking this might help others going forward who come across sweet spot and wonder what it is all about.
So the sweet spot is the range beyond which things get too "junky". Interesting! Now I'll have to get my head around that idea, not for fields like your (product) descriptions, but for actual documents written by users (aka legal documents). There ARE ridiculous examples in legal documents -- things like giant long-running class action lawsuits that, when printed, are measured in meters. But maybe the upper tail would not drop off as fast as in your product "description" field example, or maybe sweet spot is not really a sensible idea for body fields, which run from very small to occasionally very large. It also might be the case that cover letters and e-mails, while short, are not really something to heavily discount. The lower discount range can be ignored by setting the min of any sweet spot to 1. Then one starts to wonder whether there really is any level area at all.

It is hard to put it all together, but I do appreciate the fact that all (nearly all?) of the scoring formula is contained in the class Similarity, though that presents its own interesting problem.
When I get that deep in the code, the issue is not simply the shape of the equation, but questions like how tweaking any parameter affects the overall document scores. For example, consider the comments about "steepness" related to length norm. They cover (some of) the mathematics of the equation, but until one spends time with that equation and understands where all the pieces fit together, I doubt it jumps out at most folks what larger or smaller values mean for terms and resulting document scores.

One part of the Similarity API that is hard to tease out is when each piece is called -- the simplest distinction being index time vs. search time. There are some clues, but when a coder contemplating such an override is looking at a method that contains the actual equation, it is hard to put it all back together after having "spelunked" down all kinds of interesting "twisty little passages all the same", past Weight, Scorer, and all their friends, through calls to deprecated APIs (3.4), to get to an actual formula. It is also not easy for an API documenter like you because, while there is a normal place where each bit of the equation enters the overall scoring formula, there is no guarantee that some variation of all the related classes will call things in the normal manner. So I understand your challenge. In everyone's defense (and for readers of this discussion): some of the best documentation for the bigger picture is the abstract class Similarity, even though it contains no formulas.

If I get this all figured out myself, maybe I'll submit a talk "changing document relevancy for newbies" or "What happens if I pull THIS lever?" :-)

The following is one variation of a plot of computeLengthNorm as shown in fooplot

http://fooplot.com/index.php?&type0=0&type1=0&type2=0&type3=0&type4=0&y0=&y1=0.1&y2=%281.0%20/%20sqrt%28%280.5*%28abs%28x-100%29%20%2B%20abs%28x%20-%2050000%29%20-%20%2850000-100%29%29%29%2B%201.0%29%29&y3=&y4=&r0=&r1=&r2=&r3=&r4=&px0=&px1=&px2=&px3=&px4=&py0=&py1=&py2=&py3=&py4=&smin0=0&smin1=0&smin2=0&smin3=0&smin4=0&smax0=2pi&smax1=2pi&smax2=2pi&smax3=2pi&smax4=2pi&thetamin0=0&thetamin1=0&thetamin2=0&thetamin3=0&thetamin4=0&thetamax0=2pi&thetamax1=2pi&thetamax2=2pi&thetamax3=2pi&thetamax4=2pi&ipw=1&ixmin=-50&ixmax=150&iymin=-0.5&iymax=1.5&igx=10&igy=0.25&igl=1&igs=1&iax=0&ila=1&xmin=-50&xmax=150&ymin=-0.5&ymax=1.5

It is hard to say where the best place is for graphs and any such helpful discussion: online or in the source tree.

-Paul





hossman_lucene at fucit

Mar 5, 2012, 11:26 AM

Post #9 of 15
RE: SweetSpotSimilarity [In reply to]

: very small to occasionally very large. It also might be the case that
: cover letters and e-mails while short might not be really something to
: heavily discount. The lower discount range can be ignored by setting
: the min of any sweet spot to 1. Then one starts to wonder if there is
: really is any level area.

I would definitely not suggest using SSS for fields like legal brief text
or emails where there is huge variability in the length of the content --
I can't think of any context where a "short" email is definitively
better/worse than a "long" email. More traditional TF/IDF seems like it
would make more sense there.

: When I get that deep in the code the issue is not simply the shape of
: the equation, but issues like how tweaking any parameters effects the
: overall document scores. For example, consider the comments about
: "steepness" related to length norm. It talks (some) mathematics of the
: equation, but until one spends some time with that equation and
: understanding where they all fit together, I doubt it jumps out at most
: folks what large or smaller values mean for terms and resulting document
: scores.
:
: One obvious hard to tease out part of the Similarity API is when each
: part is called -- the simplest being index time vs. search time -- there

Well ... hopefully the Similarity docs and the docs on Lucene scoring
have filled in most of those blanks before you drill down into the
specifics of how SSS works. If not, then any concrete improvements you can
suggest would certainly be appreciated...

https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/index.html
https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/org/apache/lucene/search/similarities/Similarity.html

https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/site/build/site/scoring.html?view=co


-Hoss



paul at metajure

Mar 5, 2012, 3:01 PM

Post #10 of 15
RE: SweetSpotSimilarity [In reply to]

> I would definitely not suggest using SSS for fields like legal brief text or emails where there is huge
> variability in the length of the content -- i can't think of any context where a "short" email is
> definitively better/worse then a "long" email. more traditional TF/IDF seems like it would make more
> sense there.

I was coming to a similar conclusion.

> well ... hopefully the Similarity docs and the the docs on Lucene scoring have filled in most of those
> blanks before you drill down into the specifics of how SSS work. if not, then any concrete
> improvements you can suggest would certainly be apprecaited...
>
> https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/index.html
> https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/org/apache/lucene/search/similarities/Similarity.html
>
> https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/site/build/site/scoring.html?view=co

Thanks for the links.
The first thing I notice is that what is listed at the top of Similarity has totally changed. Great stuff about the object interaction: for example, I didn't understand how the Weight object fit in until reading that.
But I see I got what I asked for. Someone thought describing the object interaction was more important than the scoring formula itself. I'll chew on it (but I'm currently using the 3.4 code).

My only thought is that the new stuff seems to come at the expense of the formulas listed in the old class overview for Similarity.
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/Similarity.html
I would think that some of the old math, particularly the formulas as they correspond to the methods, would still be useful information, even if I can't claim to know where it might be placed.

Maybe something like the site scoring page could talk about how the arithmetic maps to the methods and how phrase scoring messes with scoring.
Just my $0.02

thanks

-Paul




paul at metajure

Mar 5, 2012, 3:15 PM

Post #11 of 15
RE: SweetSpotSimilarity [In reply to]

> -----Original Message-----
> My only thought is that the new stuff seems to be at the expense of the formulas listed in the old
> class overview for Similarity.
> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/searc
> h/Similarity.html

Oops, my bad.

The arithmetic is now in the new 4.0 class:
https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

Thanks again for the additional object life-cycle and interaction discussion. Hopefully it won't be too much for my old brain. :-)

-Paul



rcmuir at gmail

Mar 5, 2012, 3:24 PM

Post #12 of 15
Re: SweetSpotSimilarity [In reply to]

On Mon, Mar 5, 2012 at 6:01 PM, Paul Hill <paul [at] metajure> wrote:
> My only thought is that the new stuff seems to be at the expense of the formulas listed in the old class overview for Similarity.

Hello,

What was previously Similarity in older releases has moved to
TFIDFSimilarity: it extends Similarity and exposes a vector-space API,
with the same formulas in the javadocs:
https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

The difference is that in 4.0 the idea is to support other scoring
models beyond the vector space model: that's why, if you start looking
at other subclasses of Similarity, you will find more options (e.g.
probabilistic models).

This change is described in CHANGES.txt (below). I hope it's not
confusing: if you have ideas to improve the javadocs and present this
stuff better for migrating users, that would be very helpful.

* LUCENE-2392, LUCENE-3299: Decoupled vector space scoring from
Query/Weight/Scorer. If you extended Similarity directly before, you should
extend TFIDFSimilarity instead. Similarity is now a lower-level API to
implement other scoring algorithms. See MIGRATE.txt for more details.

* LUCENE-2959: Added a variety of different relevance ranking systems to Lucene.

- Added Okapi BM25, Language Models, Divergence from Randomness, and
Information-Based Models. The models are pluggable, support all of lucene's
features (boosts, slops, explanations, etc) and queries (spans, etc).

- All models default to the same index-time norm encoding as
DefaultSimilarity, so you can easily try these out/switch back and
forth/run experiments and comparisons without reindexing. Note: most of
the models do rely upon index statistics that are new in Lucene 4.0, so
for existing 3.x indexes its a good idea to upgrade your index to the
new format with IndexUpgrader first.

- Added a new subclass SimilarityBase which provides a simplified API
for plugging in new ranking algorithms without dealing with all of the
nuances and implementation details of Lucene.

- For example, to use BM25 for all fields:
searcher.setSimilarity(new BM25Similarity());

If you instead want to apply different similarities (e.g. ones with
different parameter values or different algorithms entirely) to different
fields, implement PerFieldSimilarityWrapper with your per-field logic.
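A minimal sketch of that per-field wiring, assuming the Lucene 4.0 similarities API (the field name "body" here is just an illustration):

```java
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.search.similarities.PerFieldSimilarityWrapper;
import org.apache.lucene.search.similarities.Similarity;

// Route each field to its own Similarity; anything not listed falls
// back to the classic vector-space scoring.
public class MyPerFieldSimilarity extends PerFieldSimilarityWrapper {
    private final Similarity defaultSim = new DefaultSimilarity();
    private final Similarity bodySim = new BM25Similarity();

    @Override
    public Similarity get(String field) {
        return "body".equals(field) ? bodySim : defaultSim;
    }
}
// Usage: searcher.setSimilarity(new MyPerFieldSimilarity());
```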



--
lucidimagination.com



paul_t100 at fastmail

Mar 6, 2012, 2:08 AM

Post #13 of 15
Re: SweetSpotSimilarity [In reply to]

On 05/03/2012 19:26, Chris Hostetter wrote:
> : very small to occasionally very large. It also might be the case that
> : cover letters and e-mails, while short, might not really be something to
> : heavily discount. The lower discount range can be ignored by setting
> : the min of any sweet spot to 1. Then one starts to wonder if there
> : really is any level area.
>
> I would definitely not suggest using SSS for fields like legal brief text
> or emails where there is huge variability in the length of the content --
> i can't think of any context where a "short" email is definitively
> better/worse than a "long" email. more traditional TF/IDF seems like it
> would make more sense there.
>
> : When I get that deep in the code the issue is not simply the shape of
> : the equation, but issues like how tweaking any parameter affects the
> : overall document scores. For example, consider the comments about
> : "steepness" related to length norm. It discusses (some of) the mathematics
> : of the equation, but until one spends some time with that equation and
> : understands where the pieces fit together, I doubt it jumps out at most
> : folks what larger or smaller values mean for terms and resulting document
> : scores.
> :
> : One obviously hard-to-tease-out part of the Similarity API is when each
> : part is called -- the simplest being index time vs. search time -- there
>
> well ... hopefully the Similarity docs and the docs on Lucene scoring
> have filled in most of those blanks before you drill down into the
> specifics of how SSS works. if not, then any concrete improvements you can
> suggest would certainly be appreciated...
>
> https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/index.html
> https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/org/apache/lucene/search/similarities/Similarity.html
>
> https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/site/build/site/scoring.html?view=co
>
>
Chapter 12, Document Ranking, in Hibernate Search in Action gives a
thorough explanation of Lucene scoring and the Similarity class, which
I've found helpful. I think it's worth mentioning as it's not the most
obvious book for this subject.

Paul
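[Editor's note: for readers trying to follow the "steepness" discussion above, the length norm that SweetSpotSimilarity computes is spelled out in the Lucene 3.x javadocs, and its plateau/fall-off behaviour is easy to see by evaluating the formula directly. The sketch below is self-contained — it does not use Lucene itself — and the min/max/steepness values are purely illustrative.]

```java
public class SweetSpotNorm {

    // Documented SweetSpotSimilarity length norm (Lucene 3.x javadocs):
    //   1 / sqrt( steepness * (|len - min| + |len - max| - (max - min)) + 1 )
    // For min <= len <= max the inner term is 0, so the norm is exactly 1.0
    // (the "sweet spot" plateau); outside it, the norm falls off at a rate
    // controlled by steepness.
    public static float lengthNorm(int numTerms, int min, int max, float steepness) {
        return (float) (1.0 / Math.sqrt(
                steepness * (Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min)) + 1.0));
    }

    public static void main(String[] args) {
        // Every length inside the sweet spot gets the full norm ...
        System.out.println(lengthNorm(100, 50, 500, 0.5f));  // 1.0
        System.out.println(lengthNorm(500, 50, 500, 0.5f));  // 1.0
        // ... while longer documents are discounted, more sharply as
        // steepness grows.
        System.out.println(lengthNorm(2000, 50, 500, 0.5f)); // ~0.026
        System.out.println(lengthNorm(2000, 50, 500, 1.0f)); // ~0.018
    }
}
```

Setting min to 1, as discussed above, removes the discount for short documents entirely: every length up to max then sits on the plateau.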



paul_t100 at fastmail

Mar 6, 2012, 2:57 PM

Post #14 of 15 (647 views)
Permalink
Re: SweetSpotSimilarity [In reply to]

On 05/03/2012 23:24, Robert Muir wrote:
> On Mon, Mar 5, 2012 at 6:01 PM, Paul Hill<paul [at] metajure> wrote:
>>> I would definitely not suggest using SSS for fields like legal brief text or emails where there is huge
>>> variability in the length of the content -- i can't think of any context where a "short" email is
>>> definitively better/worse than a "long" email. more traditional TF/IDF seems like it would make more
>>> sense there.
>> I was coming to a similar conclusion.
>>
>>> well ... hopefully the Similarity docs and the docs on Lucene scoring have filled in most of those
>>> blanks before you drill down into the specifics of how SSS works. if not, then any concrete
>>> improvements you can suggest would certainly be appreciated...
>>>
>>> https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/index.html
>>> https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/org/apache/lucene/search/similarities/Similarity.html
>>>
>>> https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/site/build/site/scoring.html?view=co
>> Thanks for the links.
>> The first thing I notice is that what is listed at the top of Similarity is totally changed. Great stuff about the object interaction. For example, I didn't understand how the Weight object fits in until reading that.
>> But I see I got what I asked for. Someone thought describing the object interaction was more important than the scoring formula itself. I'll chew on it (but I'm currently using the 3.4 code).
>>
>> My only thought is that the new stuff seems to come at the expense of the formulas listed in the old class overview for Similarity.
> Hello,
>
> what was previously Similarity in older releases has moved to
> TFIDFSimilarity: it extends Similarity and exposes a vector-space API,
> with the same formulas in its javadocs:
> https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
>
Looks good. Do you know if this stuff will make it into 3.6?

Paul



rcmuir at gmail

Mar 6, 2012, 2:59 PM

Post #15 of 15 (652 views)
Permalink
Re: SweetSpotSimilarity [In reply to]

On Tue, Mar 6, 2012 at 5:57 PM, Paul Taylor <paul_t100 [at] fastmail> wrote:
>> Hello,
>>
>> what was previously Similarity in older releases has moved to
>> TFIDFSimilarity: it extends Similarity and exposes a vector-space API,
>> with the same formulas in its javadocs:
>>
>> https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
>>
> Looks good. Do you know if this stuff will make it into 3.6?
>
> Paul

It won't (it will have to wait for 4.0), for two main reasons:

* The changes are fairly intrusive and would break API
backwards-compatibility everywhere.
* All of the added scoring systems require index statistics that are
new in Lucene 4.0.

The second is the biggest dealbreaker: it would be difficult to wedge
those statistics into the Lucene 3.x implementation because terms aren't
cleanly separated per field, etc.

--
lucidimagination.com

