
Mailing List Archive: Lucene: Java-User

Scoring formula - Average number of terms in IDF

 

 



v.verroios at di

Nov 10, 2009, 4:32 AM

Post #1 of 7 (1127 views)
Scoring formula - Average number of terms in IDF

Hi,

I want to change Lucene's default scoring formula, and one of the
changes is to the idf term. Specifically, I want the idf method of the
Similarity class to take into account the average number of terms per
document in the indexed collection.

To change the scoring formula, I plan to implement a subclass of
DefaultSimilarity and install it by calling IndexWriter.setSimilarity
before indexing and Searcher.setSimilarity before searching.
Because Lucene asks for the new class to be set while the index is being
created, I wonder whether it is possible to have a scoring formula whose
idf term includes the average number of terms per document, since that
average is available only after all the documents have been indexed.

So, is there a way to access the average number of terms per document
inside the idf method of the Similarity class?

Thank you in advance
--
View this message in context: http://old.nabble.com/Scoring-formula---Average-number-of-terms-in-IDF-tp26282578p26282578.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


v.verroios at di

Dec 15, 2009, 2:04 AM

Post #2 of 7 (945 views)
Re: Scoring formula - Average number of terms in IDF [In reply to]

Any ideas, please?


lucene at mikemccandless

Dec 17, 2009, 6:15 AM

Post #3 of 7 (935 views)
Re: Scoring formula - Average number of terms in IDF [In reply to]

There have been some discussions here:

https://issues.apache.org/jira/browse/LUCENE-2091

about how Lucene could track the average field/doc length, but they are
just brainstorming-type discussions for now.

You could always do something approximate outside of Lucene. E.g., make
a TokenFilter that counts how many tokens are produced for each
field/doc, aggregate and store that yourself, and use it in your
Similarity implementation.

Mike
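
The aggregation step suggested above could be sketched roughly as follows. This is only an illustration in plain Java with no Lucene classes: AverageLengthTracker and its methods are hypothetical names, and the sketch assumes the per-document token count has already been obtained (for example by a counting TokenFilter, which is not shown).

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not part of Lucene): keeps a running average of
 * tokens per document, so the value can be read at any time while
 * documents are still being indexed.
 */
final class AverageLengthTracker {
    private final AtomicLong totalTokens = new AtomicLong();
    private final AtomicLong totalDocs = new AtomicLong();

    /** Call once per document with the number of tokens it produced. */
    void recordDocument(long tokenCount) {
        if (tokenCount < 0) {
            throw new IllegalArgumentException("tokenCount must be >= 0");
        }
        totalTokens.addAndGet(tokenCount);
        totalDocs.incrementAndGet();
    }

    /**
     * Average tokens per document seen so far, or 0.0 if none were recorded.
     * The two counters are updated separately, so a read that races with a
     * concurrent recordDocument call may be slightly off.
     */
    double average() {
        long docs = totalDocs.get();
        return docs == 0 ? 0.0 : (double) totalTokens.get() / docs;
    }

    /** Number of documents recorded so far. */
    long documentCount() {
        return totalDocs.get();
    }
}
```

Because the average is recomputed from two running totals, it does not require all documents to be present up front; the value simply becomes final once the last document has been recorded.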



v.verroios at di

Dec 17, 2009, 7:50 AM

Post #4 of 7 (932 views)
Re: Scoring formula - Average number of terms in IDF [In reply to]

If I follow your approach and compute the average outside of Lucene while
I'm building the index (for performance reasons I can't wait for all the
documents to arrive before indexing them), the average for a collection
will be ready only once all of its documents have been indexed.
Lucene states that the new Similarity class must be set with
IndexWriter.setSimilarity() and used while the index is built, and at
that point the average isn't ready yet. Is there a way to overcome this?
And if the score is calculated only when searching the index, not while
the index is being created, what will the performance consequences be?

(Mike, thank you for your response.)




lucene at mikemccandless

Dec 17, 2009, 8:07 AM

Post #5 of 7 (930 views)
Re: Scoring formula - Average number of terms in IDF [In reply to]

IndexWriter uses Similarity.lengthNorm to create a norm (boost for the
field, per document) based on the length of the field... it doesn't
invoke the other methods on Similarity.

Are you saying you need to know the avg across the whole corpus before
computing that boost?

Mike
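
The index-time versus search-time split described above can be illustrated with a small stand-in written in plain Java. It does not extend Lucene's Similarity; TwoPhaseScoringSketch and the combine parameter are hypothetical. The two formulas are the ones DefaultSimilarity is understood to use in this Lucene generation (1/sqrt(numTerms) for lengthNorm, ln(numDocs/(docFreq+1)) + 1 for idf), so check them against your version's source. How the average should enter the idf is not specified in this thread, so the caller supplies that rule.

```java
import java.util.function.DoubleBinaryOperator;

/**
 * Hypothetical stand-in (not a Lucene class) showing which value is
 * needed at index time and which only at search time.
 */
final class TwoPhaseScoringSketch {
    private final double avgTermsPerDoc;        // known only after indexing
    private final DoubleBinaryOperator combine; // (classicIdf, avg) -> custom idf

    TwoPhaseScoringSketch(double avgTermsPerDoc, DoubleBinaryOperator combine) {
        this.avgTermsPerDoc = avgTermsPerDoc;
        this.combine = combine;
    }

    /** Index-time quantity: depends only on this field's own length. */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Classic idf, independent of any corpus-wide average. */
    static float classicIdf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Search-time quantity: the only place the average is consulted. */
    float idf(int docFreq, int numDocs) {
        return (float) combine.applyAsDouble(
                classicIdf(docFreq, numDocs), avgTermsPerDoc);
    }
}
```

For example, new TwoPhaseScoringSketch(20.0, (idf, avg) -> idf * avg) would simply multiply the classic idf by the average; that rule is a placeholder, not a recommended formula. The point is that lengthNorm never reads the average, so an object like this can be constructed after indexing has finished.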



v.verroios at di

Dec 18, 2009, 2:12 AM

Post #6 of 7 (924 views)
Re: Scoring formula - Average number of terms in IDF [In reply to]

The average is used only in the idf method of the Similarity class. So I
guess there is a workaround for what I want to do. Can you give me a
reference in the Lucene docs on how an IndexWriter uses the provided
Similarity class?

Thanks again for your time and your help.




lucene at mikemccandless

Dec 18, 2009, 3:58 AM

Post #7 of 7 (917 views)
Re: Scoring formula - Average number of terms in IDF [In reply to]

I'm not sure this specific detail (how IndexWriter uses Similarity) is
documented -- the best "documentation" is the source code ;)

Have a look at oal.index.NormsWriterPerField (that is,
org.apache.lucene.index.NormsWriterPerField). That's where the
default indexing chain asks Similarity to create the norm.

Mike

