Mailing List Archive: Lucene: Java-Dev

[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene


jira at apache

Nov 29, 2009, 8:22 PM

Post #1 of 9 (381 views)
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene

[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783530#action_12783530 ]

Otis Gospodnetic edited comment on LUCENE-2091 at 11/30/09 4:21 AM:
--------------------------------------------------------------------

Has anyone compared this particular BM25 implementation to Lucene's current quasi-VSM approach in terms of:
* any of the relevance eval methods
* indexing performance
* search performance
* ...

Aha, I found something:
http://markmail.org/message/c2r4v7zj7mduzs5d

Also, this issue is marked as contrib/*. Should this not go straight to core, so more people actually use this and provide feedback? Who knows, there is a chance (ha!) BM25 might turn out better than the current approach, and become the default.

> Add BM25 Scoring to Lucene
> --------------------------
>
> Key: LUCENE-2091
> URL: https://issues.apache.org/jira/browse/LUCENE-2091
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Yuval Feinstein
> Priority: Minor
> Fix For: 3.1
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime somewhat.
> I would like to contribute the code to Lucene under contrib.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 29, 2009, 8:45 PM

Post #2 of 9 (354 views)
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783532#action_12783532 ]

Robert Muir edited comment on LUCENE-2091 at 11/30/09 4:45 AM:
---------------------------------------------------------------

Otis, attached is a graph I produced from the Hamshahri corpus, comparing four combinations:
* Lucene SimpleAnalyzer
* Lucene SimpleAnalyzer + BM25
* Lucene PersianAnalyzer
* Lucene PersianAnalyzer + BM25

The Hamshahri corpus contains a standardized encoding of Persian (i.e. the normalization filter is a no-op),
so any analyzer gain is strictly due to "stopwords", although in Persian I wouldn't call some of these words.

This was mostly to show that the analyzer is actually useful, i.e. the scoring system can't completely make up for a lack of support like this.

By the way, you can play around with the openrelevance svn and duplicate my experiments on this same corpus yourself if you want. There's an Indonesian corpus there too. I've also tested Hindi with this impl.


> Add BM25 Scoring to Lucene
> --------------------------
>
> Key: LUCENE-2091
> URL: https://issues.apache.org/jira/browse/LUCENE-2091
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Yuval Feinstein
> Priority: Minor
> Fix For: 3.1
>
> Attachments: persianlucene.jpg
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime somewhat.
> I would like to contribute the code to Lucene under contrib.



jira at apache

Dec 1, 2009, 8:02 AM

Post #3 of 9 (336 views)
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784270#action_12784270 ]

Joaquin Perez-Iglesias edited comment on LUCENE-2091 at 12/1/09 4:01 PM:
-------------------------------------------------------------------------

Hi Otis, Robert and Yuval.

I developed this add-on for Lucene in 2008, for some experiments that I was doing, and I would like to express my impressions about this.

In my experience, and after reading a lot of papers, I have never found a case where the Lucene-VSM implementation outperforms BM25.
BM25 (with standard parameters) outperforms Lucene-VSM; moreover, there is room for improvement if the parameters are set specifically for the collection. I published some results with the Eurogov collection some time ago.

I can show you now some experiments with the TREC Disk4&5 collection. These results were obtained with default parameters on the Robust track topics. As you can see, BM25 improves on the Lucene-VSM ranking function.

        MAP      P@5
VSM     0.2079   0.4096
BM25    0.2340   0.4578
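
For readers unfamiliar with the two measures in the table, here is a minimal sketch of how precision at k and average precision are usually computed for a single topic (MAP is the mean of average precision over all topics), together with the relative gains implied by the figures above. The toy ranking and the function names are illustrative assumptions, not part of the experiments reported here.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k results that are relevant (1 = relevant, 0 = not)."""
    return sum(ranked_relevance[:k]) / k


def average_precision(ranked_relevance, total_relevant):
    """Mean of the precision values taken at each rank holding a relevant document."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0


def relative_gain(baseline, candidate):
    """Relative improvement of candidate over baseline."""
    return (candidate - baseline) / baseline


# Hypothetical ranking for one topic; not taken from the TREC runs.
toy_ranking = [1, 0, 1, 1, 0]
print(precision_at_k(toy_ranking, 5))
print(average_precision(toy_ranking, total_relevant=3))

# Relative gains implied by the table above.
print(f"MAP gain: {relative_gain(0.2079, 0.2340):.1%}")
print(f"P@5 gain: {relative_gain(0.4096, 0.4578):.1%}")
```

On the quoted numbers this works out to a relative gain of roughly 12 to 13 percent for both measures.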



This implementation is getting more popular, and I know that some people are using it in their research, so it would be really nice if at some point it were included in the core.

The only concerns that I have about it are related to:
- Only simple boolean queries based on terms are supported (with the operators OR, AND, NOT). For instance, it does not support PhraseQuery.
- IDF cannot be calculated at a document level (this is important for BM25F).
- Another issue is related to computing the average document length, but this could be easily solved.
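
To show why the average document length matters, here is a rough sketch of the textbook Okapi BM25 term score, in which the document length is normalised against the collection average. This is not the code from the patch: the IDF variant, the function names, and the default b=0.75 are assumptions made for illustration, and k1=2 simply echoes the value discussed elsewhere in this thread.

```python
import math


def bm25_idf(doc_freq, num_docs):
    # One common smoothed IDF variant (an assumption); the "1 +" keeps it positive.
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=2.0, b=0.75):
    # Length normalisation: documents longer than average are penalised.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturating term-frequency component.
    tf_part = tf * (k1 + 1.0) / (tf + k1 * length_norm)
    return bm25_idf(doc_freq, num_docs) * tf_part


# Hypothetical collection of three documents, lengths in tokens.
doc_lengths = [120, 80, 400]
avg_doc_len = sum(doc_lengths) / len(doc_lengths)

short_doc = bm25_term_score(tf=3, doc_len=80, avg_doc_len=avg_doc_len, doc_freq=2, num_docs=3)
long_doc = bm25_term_score(tf=3, doc_len=400, avg_doc_len=avg_doc_len, doc_freq=2, num_docs=3)
print(avg_doc_len, short_doc, long_doc)
```

With the same term frequency, the shorter document scores higher, which is only possible if the average length is known at scoring time.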


These issues are described in detail in the documentation that I made public on my website.

Thanks to all for your interest and work.

Joaquin Perez-Iglesias



jira at apache

Dec 3, 2009, 2:58 AM

Post #4 of 9 (332 views)
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785264#action_12785264 ]

Robert Muir edited comment on LUCENE-2091 at 12/3/09 10:56 AM:
---------------------------------------------------------------

Hi Yuval, I see your patch, I can help with some relevance testing and comments.

I don't know if it should be assigned to me; maybe we can trick one of the devs who knows the scoring system really well into looking at it, especially regarding performance and things like that.

Here is the first thing I noticed, maybe I am completely stupid but I never understood this:

I don't understand why we need BM25Boolean.* and everything like that. I don't understand why these are necessary, they seem to be duplicates of BooleanQuery etc and just sum up subscorers or whatever.

So in my usage I dropped them. I just have BM25TermQuery, BM25TermScorer, and BM25Parameters, and to use it, I override a method in QueryParser.

edit: by the way, I don't want to imply that what I am doing is "best" either, because I don't think it is, only that this would be one way to simplify the code a lot as a first step.


> Add BM25 Scoring to Lucene
> --------------------------
>
> Key: LUCENE-2091
> URL: https://issues.apache.org/jira/browse/LUCENE-2091
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Yuval Feinstein
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2091.patch, persianlucene.jpg
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime somewhat.
> I would like to contribute the code to Lucene under contrib.



jira at apache

Dec 3, 2009, 3:06 AM

Post #5 of 9 (331 views)
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785271#action_12785271 ]

Uwe Schindler edited comment on LUCENE-2091 at 12/3/09 11:04 AM:
-----------------------------------------------------------------

I was wondering about the separate BooleanQuery, too, as it is almost simply a copy (of an old version of it). The bigger question is why we need the BM25 classes at all: why should it not be possible to use normal term queries and other query types together with BM25 by just changing some scoring defaults? So replace Similarity and maybe have a switch inside the Scorers. TermQuery could then be switched to BM25 mode and use another Scorer or something like that.

That was just my first impression; these additional classes do not look like a good public API to me. Query classes should be abstract wrappers for Weights and Scorers. The internal implementation, like BM25 or conventional scoring, should be hidden from the user (with maybe a property, e.g. on the IndexSearcher, to use BM25 scoring). This way, it could also be used for other query types (not only TermQ/BQ), e.g. for function queries (to further change the score), FuzzyQuery, and so on.
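
The idea of hiding the scoring model behind the searcher can be sketched roughly as follows. This is a language-neutral illustration in Python, not Lucene code: the class names only borrow Lucene's vocabulary, and every class, method, and parameter here is hypothetical. Only the term-frequency part of the score is modelled, to keep the sketch short.

```python
import math


class SqrtTfSimilarity:
    """Stand-in for conventional scoring: square root of the term frequency."""

    def tf_weight(self, tf):
        return math.sqrt(tf)


class BM25TfSimilarity:
    """Stand-in for BM25-style scoring: tf / (tf + k1)."""

    def __init__(self, k1=2.0):
        self.k1 = k1

    def tf_weight(self, tf):
        return tf / (tf + self.k1)


class TermQuery:
    """The query only describes what to match; it knows nothing about scoring."""

    def __init__(self, field, term):
        self.field = field
        self.term = term


class IndexSearcher:
    """The scoring model is a property of the searcher, not of the query."""

    def __init__(self, docs, similarity):
        self.docs = docs
        self.similarity = similarity

    def search(self, query):
        hits = []
        for doc_id, doc in enumerate(self.docs):
            tf = doc.get(query.field, []).count(query.term)
            if tf > 0:
                hits.append((doc_id, self.similarity.tf_weight(tf)))
        # Highest score first; ties broken by document id.
        return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


docs = [{"body": ["bm25"] * 4}, {"body": ["bm25", "okapi"]}, {"body": ["vsm"]}]
query = TermQuery("body", "bm25")
classic_hits = IndexSearcher(docs, SqrtTfSimilarity()).search(query)
bm25_hits = IndexSearcher(docs, BM25TfSimilarity()).search(query)
print(classic_hits)
print(bm25_hits)
```

The same TermQuery object is used in both searches; only the searcher's scoring property changes, which is the kind of API separation argued for above.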

If what I said is complete nonsense, don't hurt me, I do not know much about BM25, but for me it is an implementation detail and not part of a public API.



jira at apache

Dec 3, 2009, 2:18 PM

Post #6 of 9 (326 views)
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785569#action_12785569 ]

Uwe Schindler edited comment on LUCENE-2091 at 12/3/09 10:17 PM:
-----------------------------------------------------------------

Thanks for the explanation!

About the IDF: The problem with a per-document IDF in Lucene would be that most users also add fields that are e.g. catch-all fields (which would be the per-doc IDF), but in addition they add special fields like numeric fields (which would not produce a good IDF, but at the moment this IDF is ignored). Some users also add fields simply for sorting. So an IDF for documents is impossible with Lucene. You can only use e.g. catch-all fields (which are always a good idea for non-fielded searches, because ORing all fields together is slower than just indexing the same terms a second time in a catch-all field); e.g. "contents" contains all terms from "title", "subject", "mailtext" as an example for emails. But the IDF for BM25F could be taken from the "contents" field even when searching only for a title.
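
A small sketch of the catch-all idea, using plain Python dictionaries rather than a real index. The field names follow the email example above; the sample messages and helper names are made up for illustration.

```python
# Hypothetical, already-tokenised email documents.
emails = [
    {"title": ["lucene", "scoring"], "subject": ["bm25"], "mailtext": ["bm25", "beats", "vsm"]},
    {"title": ["bm25"], "subject": ["lucene"], "mailtext": ["patch", "attached"]},
    {"title": ["release"], "subject": ["vote"], "mailtext": ["lucene", "release"]},
]
TEXT_FIELDS = ("title", "subject", "mailtext")


def add_catch_all(doc):
    """Index the same terms a second time in a 'contents' field."""
    enriched = dict(doc)
    enriched["contents"] = [term for field in TEXT_FIELDS for term in doc.get(field, [])]
    return enriched


def doc_freq(docs, field, term):
    """Number of documents whose given field contains the term."""
    return sum(1 for doc in docs if term in doc.get(field, []))


indexed = [add_catch_all(doc) for doc in emails]

# docFreq seen by a title-only search versus docFreq taken from the catch-all field.
print(doc_freq(indexed, "title", "bm25"), doc_freq(indexed, "contents", "bm25"))
print(doc_freq(indexed, "title", "lucene"), doc_freq(indexed, "contents", "lucene"))
```

The catch-all docFreq counts each document once no matter which field the term came from, so it can serve as a document-level statistic even when the query targets only "title".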



jira at apache

Dec 4, 2009, 2:20 AM

Post #7 of 9 (315 views)
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785840#action_12785840 ]

Joaquin Perez-Iglesias edited comment on LUCENE-2091 at 12/4/09 10:19 AM:
--------------------------------------------------------------------------

Yes sorry.

Basically, what we are trying to do is constrain the effect of the raw frequency (saturate the frequency).
In Lucene this is done with the square root of the frequency; another classical approach
is to use the log. With both approaches we avoid giving a linear 'importance' to the frequency.

BM25 is a bit tricky: it parametrises the 'saturation' of the frequency with a parameter k1, using the
equation weight(t)/(weight(t)+k1). Usually k1 is set to 2, but it can be set per collection.
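
The three ways of damping the raw frequency can be compared with a short sketch. It assumes that weight(t) is simply the raw term frequency and that the log variant is 1 + ln(tf); both are simplifications chosen for illustration.

```python
import math


def sqrt_tf(tf):
    """Square-root damping."""
    return math.sqrt(tf)


def log_tf(tf):
    """Logarithmic damping (one common variant, assumed here)."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


def bm25_tf(tf, k1=2.0):
    """BM25-style saturation: approaches 1 as tf grows, never exceeds it."""
    return tf / (tf + k1)


for tf in (1, 2, 4, 16, 64, 256):
    print(f"tf={tf:3d}  sqrt={sqrt_tf(tf):6.2f}  log={log_tf(tf):5.2f}  bm25={bm25_tf(tf):.3f}")
```

The square root and the log keep growing without bound, only more slowly than the raw count, whereas the BM25 form is bounded; a larger k1 makes it saturate later.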

(Uwe) Regarding the IDF issue, I believe that the more correct approach (in theoretical terms) would be to use the docFreq over the fields the user wants to search, but I don't think that this can be done.
For example, if we have indexed three fields, F1, F2 and F3, and the user wants to search on F1 and F2, there is no way to compute docFreq over both fields. With a catch-all field we have docFreq for all fields.

So maybe the best available approach would be to use IDF per field. What do you think?




jira at apache

Dec 4, 2009, 11:17 AM

Post #8 of 9 (308 views)
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786087#action_12786087 ]

Joaquin Perez-Iglesias edited comment on LUCENE-2091 at 12/4/09 7:17 PM:
-------------------------------------------------------------------------

Yes, you are right. What I meant was related to multifield queries: if you search a:F1^F2, the right approach would be to compute IDF with docFreq(a,F1^F2), which in my understanding cannot be done.

If I'm right Lucene does weight(a)*idf(a,F1) + weight(a)*idf(a,F2), and the correct approach would be weight(a)*idf(a,F1^F2).
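
A numeric sketch of the difference between the two formulas, and of why the combined docFreq cannot be derived from the per-field values alone. The plain log(N/df) IDF, the document counts, and the function names are assumptions chosen only for illustration.

```python
import math


def idf(doc_freq, num_docs):
    """Plain log(N / df); a simplification used only for this comparison."""
    return math.log(num_docs / doc_freq)


def union_doc_freq_bounds(df_f1, df_f2, num_docs):
    """Per-field docFreq values only bound the docFreq over both fields."""
    return max(df_f1, df_f2), min(num_docs, df_f1 + df_f2)


num_docs = 1000
df_f1, df_f2 = 10, 200   # hypothetical docFreq of term a in F1 and in F2
df_union = 205           # hypothetical; would have to be counted per document
weight = 1.0

per_field_sum = weight * idf(df_f1, num_docs) + weight * idf(df_f2, num_docs)
combined = weight * idf(df_union, num_docs)
low, high = union_doc_freq_bounds(df_f1, df_f2, num_docs)

print(per_field_sum, combined)
print(low, high)
```

Any value between the two bounds is consistent with the per-field statistics, so the exact combined docFreq needs information about which documents contain the term in which field.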

That's the reason why Uwe (and I) suggested using IDF per field in the previous case, and, if the query is executed on every field, using a kind of catch-all field to compute docFreq over all fields.

(Michael)
In summary, it would be nice to have:

1. docFreq at document level, something like "int docFreq(term, doc_id)" that returns the number of documents where the term occurs; if that is not possible, a catch-all field will be enough.
2. The Collection Average Document Length and the Collection Average Field Length (for each field).
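
Both statistics in the list above are straightforward to compute when the tokenised documents are at hand. The sketch below uses plain Python dictionaries instead of an index; the sample documents and function names are hypothetical.

```python
# Hypothetical, already-tokenised documents with two text fields.
docs = [
    {"title": ["bm25", "scoring"], "body": ["okapi", "bm25", "in", "lucene"]},
    {"title": ["persian", "analyzer"], "body": ["stopwords", "matter"]},
    {"title": ["bm25f"], "body": ["per", "field", "weights", "and", "lengths", "bm25"]},
]
FIELDS = ("title", "body")


def document_level_doc_freq(docs, term, fields=FIELDS):
    """Number of documents containing the term in at least one field."""
    return sum(1 for doc in docs if any(term in doc.get(f, []) for f in fields))


def average_field_length(docs, field):
    """Collection average length (in tokens) of one field."""
    return sum(len(doc.get(field, [])) for doc in docs) / len(docs)


def average_document_length(docs, fields=FIELDS):
    """Collection average length (in tokens) over all listed fields."""
    return sum(len(doc.get(f, [])) for doc in docs for f in fields) / len(docs)


print(document_level_doc_freq(docs, "bm25"))
print(average_field_length(docs, "title"), average_field_length(docs, "body"))
print(average_document_length(docs))
```

In a real index these values would more likely be accumulated at indexing time than recomputed per query, but that is a design choice outside what this thread specifies.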

I don't think that we need "How many times does term T occur in all fields for doc D"; frequency is needed per field and not per document.

I don't know too much about the implementation of PhraseQuery, but I think that it should be possible to implement BM25F for it (and any other query type), as long as the frequency and docFreq of the phrase/terms are available.

At this point it is not supported in the patch, but I don't see any reason why it couldn't be implemented; what I don't really know is how to do it :-).




jira at apache

Dec 4, 2009, 11:19 AM

Post #9 of 9 (306 views)
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786087#action_12786087 ]

Joaquin Perez-Iglesias edited comment on LUCENE-2091 at 12/4/09 7:17 PM:
-------------------------------------------------------------------------

Yes, you are right. What I meant was related to multifield queries: if you search a:F1^F2, the right approach would be to compute IDF with docFreq(a,F1^F2), which in my understanding cannot be done.

If I'm right Lucene does weight(a)*idf(a,F1) + weight(a)*idf(a,F2), and the correct approach would be weight(a)*idf(a,F1^F2).

That's the reason why Uwe (and I) suggested using IDF per field in the previous case, and, if the query is executed on every field, using a kind of catch-all field to compute docFreq over all fields.

(Michael)
In summary, it would be nice to have:

1. docFreq at document level, something like "int docFreq(term, doc_id)" that returns the number of documents where the term occurs; if that is not possible, a catch-all field will be enough.
2. The Collection Average Document Length and the Collection Average Field Length (for each field).

I don't think that we need "How many times does term T occur in all fields for doc D"; frequency is needed per field and not per document.

I don't know too much about the implementation of PhraseQuery, but I think that it should be possible to implement BM25F for it (and any other query type), as long as the frequency and docFreq of the phrase/terms are available.

At this point it is not supported in the patch, but I don't see any reason why it couldn't be implemented; what I don't really know is how to do it :-).


