Mailing List Archive: Lucene: Java-User

Calculating IDF value more efficiently

 

 



kasunp at opensource

Apr 27, 2012, 7:38 PM


This is my program to calculate the TF-IDF value for a document in a collection
of documents. It works fine, but it takes a lot of time to calculate
the "IDF" values (finding the number of documents that contain a particular
term).

Is there a more efficient way of finding the number of documents that contain
a particular term?

freq = termsFreq.getTermFrequencies();
terms = termsFreq.getTerms();

int noOfTerms = terms.length;

score = new float[noOfTerms];
DefaultSimilarity simi = new DefaultSimilarity();

for (i = 0; i < noOfTerms; i++) {
    // Runs a full search for every term; this is the slow part.
    int noofDocsContainTerm = noOfDocsContainTerm(terms[i]);

    float tf = simi.tf(freq[i]);
    float idf = simi.idf(noofDocsContainTerm, noOfDocs);

    score[i] = tf * idf;
}

////

public int noOfDocsContainTerm(String querystr)
        throws CorruptIndexException, IOException, ParseException {

    QueryParser qp = new QueryParser(Version.LUCENE_35, "docuemnt",
            new StandardAnalyzer(Version.LUCENE_35));

    Query q = qp.parse(querystr);

    int hitsPerPage = docNames.length; // maximum number of search results
    IndexSearcher searcher = new IndexSearcher(ramMemDir, true);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);

    searcher.search(q, collector);

    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    return hits.length;
}


--
Regards

Kasun Perera


rcmuir at gmail

May 6, 2012, 4:32 PM

Re: Calculating IDF value more efficiently

Look at IndexReader.docFreq
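
A rough sketch of why this helps, not the original poster's code. In
Lucene 3.x, IndexReader.docFreq(Term) returns the document count that the
index already keeps for each term, so no query has to be parsed and no hits
have to be collected. In the program above, the per-term call would
presumably become something like reader.docFreq(new Term("docuemnt", terms[i]))
on an IndexReader opened once outside the loop, assuming the strings in the
term vector are the indexed terms of that field. The Java below is a
self-contained stand-in that needs no Lucene jar: DocFreqSketch,
buildDocFreq and score are hypothetical names, the map plays the role of the
index's per-term counts, and tf/idf reproduce the DefaultSimilarity formulas
of Lucene 3.x as I understand them.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an index: document frequencies are computed once
// and then looked up per term, instead of running one search per term.
class DocFreqSketch {

    // Builds term -> number of documents containing that term, in one pass.
    static Map<String, Integer> buildDocFreq(List<String> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            // A set, so a term repeated inside one document counts once.
            Set<String> unique =
                    new HashSet<>(Arrays.asList(doc.trim().toLowerCase().split("\\s+")));
            for (String term : unique) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Assumed to match DefaultSimilarity.tf in Lucene 3.x: sqrt(freq).
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Assumed to match DefaultSimilarity.idf in Lucene 3.x:
    // ln(numDocs / (docFreq + 1)) + 1.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Scores every term of one document with a constant-time lookup per term.
    static float[] score(String[] terms, int[] freqs,
                         Map<String, Integer> df, int numDocs) {
        float[] score = new float[terms.length];
        for (int i = 0; i < terms.length; i++) {
            int docFreq = df.getOrDefault(terms[i], 0);
            score[i] = tf(freqs[i]) * idf(docFreq, numDocs);
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "lucene search index", "lucene scoring", "plain text");
        Map<String, Integer> df = buildDocFreq(docs);
        float[] s = score(new String[] {"lucene", "text"},
                          new int[] {4, 1}, df, docs.size());
        System.out.println(Arrays.toString(s));
    }
}
```

One caveat with the real call: as far as I know, docFreq does not subtract
deleted documents until their segments are merged, so the count can differ
slightly from what a search returns on an index with deletions.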




--
lucidimagination.com
