Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-User

Different score for the same documents

 

 

Lucene java-user RSS feed   Index | Next | Previous | View Threaded


kenji.tsuruoka at gmail

Nov 2, 2009, 4:45 AM

Post #1 of 4 (518 views)
Permalink
Different score for the same documents

Dear. Lucene users.

Hi.
I have tried to index and search MEDLINE abstracts by LUCENE.

And there were some problems in the search results.
That is Lucene has assigned different ranks for the exactly same
documents.

I didn't know the input documents for the index contain duplicate
documents at the first time.
I have solve the problem by making all input documents UNIQUE for the
index.

But I want to know how and why the situation was happened.

The duplicate document is as follows:

_pubmed_id=13029105:1952Nov15
_ArticleTitle_
<s n="1">Experimental diabetes and clinical diabetes.</s>
_pubmed_id_end_

There are TWO exactly same documents in "index".
And their rankings by Lucene are 3 and 18.

I have known texts in XML/HTML data should be extracted before indexing.
Anyway, I haven't done this work now.

Please let me know the reason why the same documents were shown
different ranks.

Best,
K

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


erickerickson at gmail

Nov 2, 2009, 4:59 AM

Post #2 of 4 (479 views)
Permalink
Re: Different score for the same documents [In reply to]

What were their scores? I'm assuming that by "rank" you mean
the order in which the documents were returned, not the raw Lucene
score.

Lucene uses the insertion order to break ties. That is, two documents
with the same score will the appear in the order of their (internal)
Lucene doc ID.

So is it possible that *all* of the documents that appear between these
two have the exact same score for that query? That seems a bit
unlikely, but it's worth checking before going much further.....

Best
Erick

On Mon, Nov 2, 2009 at 7:45 AM, kenji tsuruoka <kenji.tsuruoka [at] gmail>wrote:

> Dear. Lucene users.
>
> Hi.
> I have tried to index and search MEDLINE abstracts by LUCENE.
>
> And there were some problems in the search results.
> That is Lucene has assigned different ranks for the exactly same documents.
>
> I didn't know the input documents for the index contain duplicate documents
> at the first time.
> I have solve the problem by making all input documents UNIQUE for the
> index.
>
> But I want to know how and why the situation was happened.
>
> The duplicate document is as follows:
>
> _pubmed_id=13029105:1952Nov15
> _ArticleTitle_
> <s n="1">Experimental diabetes and clinical diabetes.</s>
> _pubmed_id_end_
>
> There are TWO exactly same documents in "index".
> And their rankings by Lucene are 3 and 18.
>
> I have known texts in XML/HTML data should be extracted before indexing.
> Anyway, I haven't done this work now.
>
> Please let me know the reason why the same documents were shown different
> ranks.
>
> Best,
> K
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
> For additional commands, e-mail: java-user-help [at] lucene
>
>


kenji.tsuruoka at gmail

Nov 2, 2009, 5:11 AM

Post #3 of 4 (474 views)
Permalink
Re: Different score for the same documents [In reply to]

Thank you Erick.

What you mentioned is right.
The two same documents were shown at the 3rd and 18th.

So do you mean documents between the 3rd and the 18th (at least) in
the Lucene results have the same score?

Cheers,
K

On Nov 2, 2009, at 9:59 PM, Erick Erickson wrote:

> What were their scores? I'm assuming that by "rank" you mean
> the order in which the documents were returned, not the raw Lucene
> score.
>
> Lucene uses the insertion order to break ties. That is, two documents
> with the same score will the appear in the order of their (internal)
> Lucene doc ID.
>
> So is it possible that *all* of the documents that appear between
> these
> two have the exact same score for that query? That seems a bit
> unlikely, but it's worth checking before going much further.....
>
> Best
> Erick
>
> On Mon, Nov 2, 2009 at 7:45 AM, kenji tsuruoka <kenji.tsuruoka [at] gmail
> >wrote:
>
>> Dear. Lucene users.
>>
>> Hi.
>> I have tried to index and search MEDLINE abstracts by LUCENE.
>>
>> And there were some problems in the search results.
>> That is Lucene has assigned different ranks for the exactly same
>> documents.
>>
>> I didn't know the input documents for the index contain duplicate
>> documents
>> at the first time.
>> I have solve the problem by making all input documents UNIQUE for the
>> index.
>>
>> But I want to know how and why the situation was happened.
>>
>> The duplicate document is as follows:
>>
>> _pubmed_id=13029105:1952Nov15
>> _ArticleTitle_
>> <s n="1">Experimental diabetes and clinical diabetes.</s>
>> _pubmed_id_end_
>>
>> There are TWO exactly same documents in "index".
>> And their rankings by Lucene are 3 and 18.
>>
>> I have known texts in XML/HTML data should be extracted before
>> indexing.
>> Anyway, I haven't done this work now.
>>
>> Please let me know the reason why the same documents were shown
>> different
>> ranks.
>>
>> Best,
>> K
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
>> For additional commands, e-mail: java-user-help [at] lucene
>>
>>



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


erickerickson at gmail

Nov 2, 2009, 5:17 AM

Post #4 of 4 (471 views)
Permalink
Re: Different score for the same documents [In reply to]

That's exactly the question. If all 16 documents have exactly the same
score, then
the internal tie-breaking is your answer. They would also all have strictly
increasing doc IDs.

But I'd check to see the scores before accepting this explanation because
I find it unlikely that all 16 docs have identical scores. But it's worth
checking
out before looking for more complex answers.....

Best
Erick

On Mon, Nov 2, 2009 at 8:11 AM, kenji tsuruoka <kenji.tsuruoka [at] gmail>wrote:

> Thank you Erick.
>
> What you mentioned is right.
> The two same documents were shown at the 3rd and 18th.
>
> So do you mean documents between the 3rd and the 18th (at least) in the
> Lucene results have the same score?
>
> Cheers,
> K
>
>
> On Nov 2, 2009, at 9:59 PM, Erick Erickson wrote:
>
> What were their scores? I'm assuming that by "rank" you mean
>> the order in which the documents were returned, not the raw Lucene
>> score.
>>
>> Lucene uses the insertion order to break ties. That is, two documents
>> with the same score will the appear in the order of their (internal)
>> Lucene doc ID.
>>
>> So is it possible that *all* of the documents that appear between these
>> two have the exact same score for that query? That seems a bit
>> unlikely, but it's worth checking before going much further.....
>>
>> Best
>> Erick
>>
>> On Mon, Nov 2, 2009 at 7:45 AM, kenji tsuruoka <kenji.tsuruoka [at] gmail
>> >wrote:
>>
>> Dear. Lucene users.
>>>
>>> Hi.
>>> I have tried to index and search MEDLINE abstracts by LUCENE.
>>>
>>> And there were some problems in the search results.
>>> That is Lucene has assigned different ranks for the exactly same
>>> documents.
>>>
>>> I didn't know the input documents for the index contain duplicate
>>> documents
>>> at the first time.
>>> I have solve the problem by making all input documents UNIQUE for the
>>> index.
>>>
>>> But I want to know how and why the situation was happened.
>>>
>>> The duplicate document is as follows:
>>>
>>> _pubmed_id=13029105:1952Nov15
>>> _ArticleTitle_
>>> <s n="1">Experimental diabetes and clinical diabetes.</s>
>>> _pubmed_id_end_
>>>
>>> There are TWO exactly same documents in "index".
>>> And their rankings by Lucene are 3 and 18.
>>>
>>> I have known texts in XML/HTML data should be extracted before indexing.
>>> Anyway, I haven't done this work now.
>>>
>>> Please let me know the reason why the same documents were shown different
>>> ranks.
>>>
>>> Best,
>>> K
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
>>> For additional commands, e-mail: java-user-help [at] lucene
>>>
>>>
>>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
> For additional commands, e-mail: java-user-help [at] lucene
>
>

Lucene java-user RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.