
Mailing List Archive: Lucene: Java-User

Weighted cosine similarity calculation using Lucene

 

 



kasunp at opensource

Apr 20, 2012, 1:20 AM

Post #1 of 3 (452 views)
Permalink
Weighted cosine similarity calculation using Lucene

I have documents that are marked up with Taxonomy and Ontology terms
separately.
When I calculate document similarity, I want to give higher weight to
those Taxonomy and Ontology terms.


When I index a document, I define the document content, the Taxonomy terms,
and the Ontology terms as separate Fields, like this:


Field ontologyTerm = new Field("fiboterms", fiboTermList[curDocNo],
        Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);

Field taxonomyTerm = new Field("taxoterms", taxoTermList[curDocNo],
        Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);

Field document = new Field(docNames[curDocNo], strRdElt,
        Field.TermVector.YES);



I'm using Lucene's TermFreqVector functions on the index to calculate TF-IDF
values, and then calculating the cosine similarity between two documents from
those TF-IDF values.
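
For reference, the two quantities named here are commonly written as follows. This is only a sketch: the post does not say which idf variant is used, so the plain log(N/df) form is an assumption, with N the number of documents and df_t the number of documents containing term t.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t},
\qquad
\cos(d_1, d_2) =
  \frac{\sum_{t} w_{t,d_1}\, w_{t,d_2}}
       {\sqrt{\sum_{t} w_{t,d_1}^{2}}\;\sqrt{\sum_{t} w_{t,d_2}^{2}}}
```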


To give weight to the Ontology and Taxonomy terms when calculating the cosine
similarity, I could programmatically multiply the Taxonomy and Ontology
term frequencies by a defined weight factor before calculating the TF-IDF
scores. Will this give higher weight to the Taxonomy and Ontology terms in
the document similarity calculation?
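
The idea described above can be tried outside Lucene with plain maps. The following is a minimal sketch, not Lucene API: the class `WeightedCosine` and its methods are hypothetical names, the term-frequency maps stand in for whatever is read from the term vectors, and the idf values are assumed to be computed elsewhere. It also assumes that a term appearing in several fields is merged into a single dimension of the document vector.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: weighted TF-IDF cosine over plain maps.
final class WeightedCosine {

    // Adds weight * tf for each term of one field into the combined document vector.
    // Terms with the same text in different fields share one dimension (an assumption).
    static void addField(Map<String, Double> docVector,
                         Map<String, Integer> fieldTf, double weight) {
        for (Map.Entry<String, Integer> e : fieldTf.entrySet()) {
            docVector.merge(e.getKey(), weight * e.getValue(), Double::sum);
        }
    }

    // Multiplies each weighted term frequency by the term's idf (a missing idf counts as 0).
    static Map<String, Double> toTfIdf(Map<String, Double> weightedTf,
                                       Map<String, Double> idf) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : weightedTf.entrySet()) {
            out.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 0.0));
        }
        return out;
    }

    // Builds a TF-IDF vector from the content field (weight 1) and one marked-up field.
    static Map<String, Double> docVector(Map<String, Integer> content,
                                         Map<String, Integer> marked,
                                         double markedWeight,
                                         Map<String, Double> idf) {
        Map<String, Double> v = new HashMap<>();
        addField(v, content, 1.0);
        addField(v, marked, markedWeight);
        return toTfIdf(v, idf);
    }

    // Cosine similarity of two sparse vectors; returns 0 if either vector is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up term frequencies for two documents that share one taxonomy term.
        Map<String, Integer> contentA = new HashMap<>();
        contentA.put("bank", 2);
        contentA.put("loan", 1);
        Map<String, Integer> contentB = new HashMap<>();
        contentB.put("bank", 1);
        contentB.put("river", 3);
        Map<String, Integer> taxo = new HashMap<>();
        taxo.put("equity", 1);

        // idf fixed at 1.0 for every term to keep the example easy to follow.
        Map<String, Double> idf = new HashMap<>();
        for (String t : new String[] {"bank", "loan", "river", "equity"}) {
            idf.put(t, 1.0);
        }

        double plain = cosine(docVector(contentA, taxo, 1.0, idf),
                              docVector(contentB, taxo, 1.0, idf));
        double boosted = cosine(docVector(contentA, taxo, 5.0, idf),
                                docVector(contentB, taxo, 5.0, idf));
        System.out.println("weight 1: " + plain);
        System.out.println("weight 5: " + boosted);
    }
}
```

In this toy data the shared taxonomy term dominates both vectors once its frequency is scaled, so the similarity rises with the weight. Note that the same weight has to be applied to every document being compared; scaling only one side changes nothing, because cosine similarity ignores the overall length of a vector.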


Are there Lucene functions that can give higher weight to certain fields
when calculating TF-IDF values using TermFreqVector? Can I just use the
setBoost() function for this purpose, and if so, how?

--
Regards

Kasun Perera


erickerickson at gmail

Apr 20, 2012, 4:44 AM

Post #2 of 3 (456 views)
Permalink
Re: Weighted cosine similarity calculation using Lucene [In reply to]

Maybe I'm missing something here, but why not just boost the
terms in the fields at query time?

Best
Erick


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


kasunp at opensource

Apr 20, 2012, 7:30 AM

Post #3 of 3 (445 views)
Permalink
Re: Weighted cosine similarity calculation using Lucene [In reply to]

Hi Erick

On Fri, Apr 20, 2012 at 5:14 PM, Erick Erickson <erickerickson [at] gmail> wrote:

> Maybe I'm missing something here, but why not just boost the
> terms in the fields at query time?
>

Yes, I can boost the fields at query time. But I'm using the
TermFreqVector to get term frequencies, then calculating TF-IDF values for
the documents, and then calculating the cosine similarity from those values.
The Field.setBoost() function has no effect on term frequencies.
Is there any other way to do the boosting that will affect the
term frequencies?

Thanks
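
One alternative that leaves the term frequencies untouched is to compute a cosine similarity per field and then take a weighted average of the per-field scores. The sketch below is only an illustration under that assumption: `FieldWeightedSimilarity` and its methods are hypothetical names, not Lucene API, and the per-field TF-IDF vectors are assumed to have been built already from the term vectors.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical helper, not part of Lucene: combines per-field cosines by field weight.
final class FieldWeightedSimilarity {

    // Cosine similarity of two sparse vectors; returns 0 if either vector is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Weighted average of per-field cosines. Each document maps a field name to
    // that field's TF-IDF vector; a field missing from a document scores 0.
    static double combined(Map<String, Map<String, Double>> docA,
                           Map<String, Map<String, Double>> docB,
                           Map<String, Double> fieldWeights) {
        double weightedSum = 0, totalWeight = 0;
        for (Map.Entry<String, Double> fw : fieldWeights.entrySet()) {
            Map<String, Double> a =
                    docA.getOrDefault(fw.getKey(), Collections.<String, Double>emptyMap());
            Map<String, Double> b =
                    docB.getOrDefault(fw.getKey(), Collections.<String, Double>emptyMap());
            weightedSum += fw.getValue() * cosine(a, b);
            totalWeight += fw.getValue();
        }
        return totalWeight == 0 ? 0.0 : weightedSum / totalWeight;
    }
}
```

With field names such as "taxoterms" and "fiboterms" as keys, raising a field's weight raises the share of the final score that comes from agreement in that field, while each field's own cosine stays in its usual range.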




--
Regards

Kasun Perera
