
Mailing List Archive: Lucene: Java-User

A simple Vector Space Model and TFIDF usage

 

 



amir.jadidi at yahoo

Jun 29, 2009, 12:10 PM

Post #1 of 3
A simple Vector Space Model and TFIDF usage

Hi,
It's my first experiment with Lucene. Please help me.
I'm going to index a set of documents and create a feature vector for each of them. Each vector contains all the terms that belong to the document, weighted using TF-IDF.
After that I want to compute the cosine similarity between all pairs of documents and produce a doc-doc similarity matrix. My document set is large, so it's important to have a scalable implementation.
Could you please give me a guideline or to-do list?
Thank you and kind regards.


gsingers at apache

Jun 30, 2009, 9:13 AM

Post #2 of 3
Re: A simple Vector Space Model and TFIDF usage

On Jun 29, 2009, at 3:10 PM, Amir Hossein Jadidinejad wrote:

> Hi,
> It's my first experiment with Lucene. Please help me.
> I'm going to index a set of documents and create a feature vector
> for each of them. Each vector contains all the terms that belong to
> the document, weighted using TF-IDF.
> After that I want to compute the cosine similarity between all
> documents and produce a doc-doc similarity matrix. My document set
> is large and it's important to have a scalable implementation.


See Mahout (http://lucene.apache.org/mahout). The utils module has a
class called LuceneIterable that the o.a.mahout.utils.vectors.Driver
program can use to convert a Lucene index into a Mahout Vector
representation, which can then be used to create a doc-doc similarity
matrix. It uses Hadoop, so you can go as big as you want.

See http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
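
To make the target concrete, here is a minimal in-memory sketch in plain Java of what the question asks for: one sparse TF-IDF vector per document, then a doc-doc cosine similarity matrix. It does not use Mahout, Hadoop, or Lucene. The class name TfIdfCosine, its method names, and the weighting tf * (1 + ln(N / df)) are assumptions chosen for illustration; this is not how the Mahout Driver is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration only; not part of Lucene or Mahout.
class TfIdfCosine {

    /** Builds one sparse TF-IDF vector (term -> weight) per tokenized document. */
    public static List<Map<String, Double>> tfIdfVectors(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();            // document frequency per term
        List<Map<String, Integer>> counts = new ArrayList<>(); // raw term counts per document
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            for (String term : tf.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
            counts.add(tf);
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : counts) {
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                // Assumed weighting: tf * (1 + ln(N / df)); other TF-IDF variants exist.
                double idf = 1.0 + Math.log((double) docs.size() / df.get(e.getKey()));
                vec.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    /** Cosine similarity of two sparse vectors; returns 0 if either vector is empty. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Symmetric doc-doc similarity matrix; compares every pair, so O(n^2) in documents. */
    public static double[][] similarityMatrix(List<Map<String, Double>> vectors) {
        int n = vectors.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(vectors.get(i), vectors.get(j));
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        // Toy, already-tokenized documents (made up for this example).
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("apache", "lucene", "index"),
                Arrays.asList("apache", "lucene", "search"),
                Arrays.asList("hadoop", "cluster"));
        for (double[] row : similarityMatrix(tfIdfVectors(docs))) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because this compares every pair of documents in memory, it only suits small collections; for a large document set the distributed route described above is the one to follow.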

-Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search




kamal.najib at mytum

Jul 2, 2009, 1:49 AM

Post #3 of 3
Re: A simple Vector Space Model and TFIDF usage

Hello Amir,
As far as I understand, you have two sets of documents, say set1 and set2. To get the similarity between the documents of the two sets, index the docs of one set and search with each doc of the other set as a query; the score of each hit then gives you the similarity between the two documents. So:
1. Index the docs of set1.
2. For each doc-element from set2:
   create a query that contains the content text of the doc-element;
   search with it in your indexed docs from set1.
From the hits you get the similarity score between the doc-element and every hit.

The directory where your indexed docs are saved represents the vector space model you want to build. If you want to see how Lucene computes the score, you can use the Explanation and Similarity classes in the Lucene API; you will see that Lucene treats documents and queries in the same way a vector space model does. The Explanation output shows that Lucene uses TF, IDF and DF to compute the resulting score.
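
For orientation, the score that an Explanation breaks down has roughly the following shape in Lucene's default Similarity. This is written from the Similarity javadoc of the Lucene 2.x line as best I recall it, so treat the exact factors as an assumption and check the javadoc of the version you use.

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
  \sum_{t \in q} \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{boost}(t)\cdot\mathrm{norm}(t,d)

\mathrm{tf}(t,d) = \sqrt{\mathrm{freq}(t,d)}, \qquad
\mathrm{idf}(t) = 1 + \ln\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}
```

Note that this is a query-to-document score, not a normalized cosine: it is not symmetric between the two documents and is not bounded by 1, so it ranks hits but is not directly the value a doc-doc cosine matrix would contain.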
Best regards.
Kamal.
 
 

