Mailing List Archive: Lucene: Java-User

delete entries from posting list Lucene 4.0

 

 



zpvie at yahoo

Mar 19, 2012, 3:24 AM

Post #1 of 7 (597 views)
delete entries from posting list Lucene 4.0

I need to delete entries from a posting list. How can I do this in Lucene 4.0?
I need it to test different pruning algorithms.

Thanks in advance

ZP


--
View this message in context: http://lucene.472066.n3.nabble.com/delete-entries-from-posting-list-Lucene-4-0-tp3838649p3838649.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


ab at getopt

Mar 19, 2012, 3:35 AM

Post #2 of 7 (585 views)
Re: delete entries from posting list Lucene 4.0 [In reply to]

On 19/03/2012 11:24, Zeynep P. wrote:
> I need to delete entries from a posting list. How can I do this in Lucene 4.0?
> I need it to test different pruning algorithms.
>
> Thanks in advance

http://issues.apache.org/jira/browse/LUCENE-1812
http://issues.apache.org/jira/browse/LUCENE-2632

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com




zpvie at yahoo

Mar 19, 2012, 6:57 AM

Post #3 of 7 (579 views)
Re: delete entries from posting list Lucene 4.0 [In reply to]

That is perfect
Thank you very much

Best regards
ZP



zpvie at yahoo

Mar 27, 2012, 11:25 AM

Post #4 of 7 (562 views)
Re: delete entries from posting list Lucene 4.0 [In reply to]

While using the pruning package, I noticed that RIDFTermPruningPolicy
calculates ridf as follows:
Math.log(1 - Math.pow(Math.E, termPositions.freq() / maxDoc)) - df

However, according to the original paper (Blanco et al.) on residual idf,
it should be -log(df/D) + log(1 - e^(-tf/D)), with a minus sign in the
exponent. So the Math.pow call in the code should be
Math.pow(Math.E, -(termPositions.freq() / maxDoc)).

Am I missing something in the calculation, or is this a bug?

Thanks in advance
ZP
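
To make the effect of the exponent's sign concrete, here is a minimal standalone sketch. It uses no Lucene classes; the class name, method names, and the sample values (freq = 3, maxDoc = 1000) are hypothetical, and it isolates only the log term, not the "- df" part of the posted expression. The third method assumes both operands are ints in the real code, which may or may not be the case.

```java
// Standalone sketch; does not use any Lucene classes.
final class RidfExponentDemo {

    // Log term as posted: positive exponent (the suspected bug).
    // For freq > 0, e^(freq/maxDoc) > 1, so the argument of log is negative.
    static double positiveExponent(int freq, int maxDoc) {
        return Math.log(1 - Math.pow(Math.E, (double) freq / maxDoc));
    }

    // Log term as in the paper's formula: negative exponent.
    // For freq > 0, 0 < 1 - e^(-freq/maxDoc) < 1, so the result is finite.
    static double negativeExponent(int freq, int maxDoc) {
        return Math.log(1 - Math.exp(-(double) freq / maxDoc));
    }

    // Positive exponent with int / int division (assumption: both operands
    // are ints). The quotient truncates to 0 whenever freq < maxDoc.
    static double intDivision(int freq, int maxDoc) {
        return Math.log(1 - Math.pow(Math.E, freq / maxDoc));
    }

    public static void main(String[] args) {
        int freq = 3;      // hypothetical within-document frequency
        int maxDoc = 1000; // hypothetical number of documents
        System.out.println("positive exponent: " + positiveExponent(freq, maxDoc));
        System.out.println("negative exponent: " + negativeExponent(freq, maxDoc));
        System.out.println("int division:      " + intDivision(freq, maxDoc));
    }
}
```

Under these assumptions, the positive exponent yields NaN (log of a negative number), the integer-division variant yields negative infinity (log of zero), and only the negative exponent gives a finite value.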




ab at getopt

Mar 29, 2012, 2:14 AM

Post #5 of 7 (561 views)
Re: delete entries from posting list Lucene 4.0 [In reply to]

On 27/03/2012 20:25, Zeynep P. wrote:
> While using the pruning package, I noticed that RIDFTermPruningPolicy
> calculates ridf as follows:
> Math.log(1 - Math.pow(Math.E, termPositions.freq() / maxDoc)) - df
>
> However, according to the original paper (Blanco et al.) on residual idf,
> it should be -log(df/D) + log(1 - e^(-tf/D)), with a minus sign in the
> exponent. So the Math.pow call in the code should be
> Math.pow(Math.E, -(termPositions.freq() / maxDoc)).
>
> Am I missing something in the calculation, or is this a bug?

Hmm, good question! After rechecking the original paper and then our
implementation, I think that this is indeed a bug and we should add the
minus sign there, but the formula may be completely broken either way.
The paper that you mention
(http://www.dc.fi.udc.es/~barreiro/publications/blanco_barreiro_ecir2007.pdf)
says:

"Residual idf is defined in [3] as the difference between the observed
idf (IDF) and the idf expected under the assumption that the terms
follow an independence model, such as Poisson (IDF^). [...] If tf is the
total number of tokens for a term t, then the ridf devised by a Poisson
distribution is

RIDF = IDF − IDF^ = −log(df/D) + log(1 − e^(-tf/D)) [2]
"

Since the purpose of the RIDF metric is to select informative words
collection-wide, and not per-document, then it makes sense that they use
a collection-wide metric like IDF as a baseline vs. another
collection-wide metric based on total term frequency, or rather the
total number of term occurrences in a collection.

The problem in our implementation is that we use a within-document term
frequency (the number of occurrences of t in the current document) and
not a collection-wide term frequency... so, it looks to me that the fix
would be to first fully traverse the doc enumeration and calculate the
total number of term occurrences in all documents (e.g. in
RIDFTermPruningPolicy.initPositionsTerm(..) ), and use this value in the
formula in place of termPositions.freq().
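
The fix described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the code that went into RIDFTermPruningPolicy: the int array stands in for the per-document frequencies seen while traversing the doc enumeration, the class and method names are hypothetical, and the natural logarithm is assumed.

```java
// Standalone sketch; does not use any Lucene classes.
final class CollectionRidfSketch {

    // Sums per-document frequencies into a collection-wide occurrence count.
    // The array is a stand-in for one full pass over the term's postings.
    static long totalTermFreq(int[] perDocFreqs) {
        long total = 0;
        for (int f : perDocFreqs) {
            total += f;
        }
        return total;
    }

    // RIDF = IDF - expected IDF = -log(df/D) + log(1 - e^(-tf/D)),
    // where tf is the collection-wide number of occurrences of the term.
    static double ridf(int df, long totalTf, int numDocs) {
        double idf = -Math.log((double) df / numDocs);
        double expectedIdf = -Math.log(1 - Math.exp(-(double) totalTf / numDocs));
        return idf - expectedIdf;
    }

    public static void main(String[] args) {
        int[] perDocFreqs = {1, 4, 2, 3}; // hypothetical postings of one term
        int df = perDocFreqs.length;      // documents containing the term
        int numDocs = 1000;               // hypothetical collection size D
        long tf = totalTermFreq(perDocFreqs);
        System.out.println("tf = " + tf + ", RIDF = " + ridf(df, tf, numDocs));
    }
}
```

With this definition, a term that occurs exactly once in each of its documents (tf equal to df) gets an RIDF close to zero, while a term whose occurrences are concentrated in few documents gets a larger value.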

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com




ab at getopt

Apr 2, 2012, 10:02 AM

Post #6 of 7 (534 views)
Re: delete entries from posting list Lucene 4.0 [In reply to]

On 29/03/2012 11:14, Andrzej Bialecki wrote:

> The problem in our implementation is that we use a within-document term
> frequency (the number of occurrences of t in the current document) and
> not a collection-wide term frequency... so, it looks to me that the fix
> would be to first fully traverse the doc enumeration and calculate the
> total number of term occurrences in all documents (e.g. in
> RIDFTermPruningPolicy.initPositionsTerm(..) ), and use this value in the
> formula in place of termPositions.freq().
>

This is the fix that I implemented. It is now committed to branch_3x and
will be included in release 3.6.

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com




zpvie at yahoo

Apr 23, 2012, 10:52 AM

Post #7 of 7 (463 views)
Re: delete entries from posting list Lucene 4.0 [In reply to]

Hi,

Thanks for the fix.

Do you also know of any free collections for testing pruning approaches?
Almost all the papers use TREC collections, which I don't have.
For now, I use the Reuters-21578 collection and Carmel's extension of
Kendall's tau to measure similarity, but I need a collection with
relevance judgements.

Thanks in advance,
Best Regards
ZP

