
Mailing List Archive: Lucene: Java-User

Different replicas return different scores

 

 



yuvalf at answers

Feb 9, 2010, 6:26 AM

Post #1 of 3
Different replicas return different scores

We are running a large sharded Lucene-based application.
Our configuration supports near real-time updates by incrementally
updating documents (using delete then add) on the shards.
Every shard is replicated to several machines in order to improve performance.
We replicate the shard by sending the same deletion and addition commands to all the replicas,
where they may be performed in a different order. (We delete a set of documents, say 1000 at a time,
then add them one by one, semi-asynchronously.)
Lately we have noticed subtle differences in query scores across different replicas of the same shard.
Further investigation showed that the only noticeable difference between the replicas was the index directory structure:
1. Different replicas have different sets of segments: most segment files are the same, but some are different.
2. The number of deleted documents differs between two replicas of the same shard.
Is this a known behavior of Java Lucene?
How can we change this behavior? We want all replicas to return exactly the same score for each query hit.
(We would rather not optimize the index, as we believe this would harm performance.)

TIA,
Yuval and Ophir


ian.lea at gmail

Feb 9, 2010, 8:12 AM

Post #2 of 3
Re: Different replicas return different scores

Since the update commands may run in a different order on different
replicas, you might get different sets of segments because merges happen
to be triggered at different points in the different batches of
updates. But you shouldn't have different numbers of deleted docs if
you have really been applying the same updates to all the replicas.
Could some updates have been missed? Or docs added and then deleted, or
something similar? Maybe there are other variations between the replicas
and that is causing the variation in query scores.
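
One mechanism that could connect these observations: in Lucene's classic scoring, the IDF factor is computed from the index-wide document count (maxDoc) and the term's document frequency (docFreq), and both still count deleted documents until the segments holding them are merged. Two replicas with the same live documents but different amounts of unmerged deletions can therefore produce slightly different scores. The sketch below is plain Java, not Lucene code; the class name and all numbers are hypothetical, and the formula is the classic idf = 1 + ln(maxDoc / (docFreq + 1)) as I understand it from DefaultSimilarity.

```java
// Illustrative sketch only: reproduces the classic IDF formula outside Lucene
// to show how leftover deletions can shift scores. All figures are made up.
final class IdfSketch {

    /**
     * Classic Lucene-style IDF. Both arguments are assumed to include
     * documents that are marked deleted but not yet merged away.
     */
    static double idf(int docFreq, int maxDoc) {
        return 1.0 + Math.log(maxDoc / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Hypothetical shard: 1,000,000 live documents, 5,000 of which contain the term.
        int liveDocs = 1000000;
        int liveDocFreq = 5000;

        // Replica A has merged away most of its deletions.
        double idfA = idf(liveDocFreq + 10, liveDocs + 2000);
        // Replica B applied the same updates but still carries more deleted docs.
        double idfB = idf(liveDocFreq + 400, liveDocs + 60000);

        System.out.println("replica A idf = " + idfA);
        System.out.println("replica B idf = " + idfB);
        System.out.println("difference    = " + Math.abs(idfA - idfB));
    }
}
```

If this is the cause, the replicas would only agree exactly when their segment and deletion state is identical, which is what the master-index approach below would give.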

As an alternative approach you could have one master index per shard
that takes all the updates and then send that index out to the shard
servers. If you don't use compound file format, and don't optimize,
the file changes are typically quite small with default or sensible
merge settings and can be distributed quickly using rsync. You can
have more control by using MergePolicy and friends.
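
To illustrate why the file changes tend to be small: Lucene index files are largely written once and not modified afterwards, so after a batch of updates the master and a replica mostly share the same files, and only new files need copying while obsolete ones are removed. The sketch below is plain Java under that assumption; the class name, the size-based comparison, and the file names are hypothetical, and a real setup would let rsync make this decision.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: decides which index files a replica would need
// to fetch or drop, given file-name -> size listings of both directories.
final class IndexSyncPlan {
    final Set<String> toCopy = new TreeSet<String>();
    final Set<String> toDelete = new TreeSet<String>();

    static IndexSyncPlan plan(Map<String, Long> master, Map<String, Long> replica) {
        IndexSyncPlan p = new IndexSyncPlan();
        // Copy files that are missing on the replica or whose size differs.
        for (Map.Entry<String, Long> e : master.entrySet()) {
            Long replicaSize = replica.get(e.getKey());
            if (replicaSize == null || !replicaSize.equals(e.getValue())) {
                p.toCopy.add(e.getKey());
            }
        }
        // Delete files the master no longer has (for example, merged-away segments).
        for (String name : replica.keySet()) {
            if (!master.containsKey(name)) {
                p.toDelete.add(name);
            }
        }
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical listings: one new segment on the master, one stale segment on the replica.
        Map<String, Long> master = new HashMap<String, Long>();
        master.put("_1.fdt", 100L);
        master.put("_2.fdt", 50L);
        master.put("segments_3", 20L);

        Map<String, Long> replica = new HashMap<String, Long>();
        replica.put("_1.fdt", 100L);
        replica.put("_0.fdt", 70L);
        replica.put("segments_2", 20L);

        IndexSyncPlan p = plan(master, replica);
        System.out.println("copy:   " + p.toCopy);
        System.out.println("delete: " + p.toDelete);
    }
}
```

With the compound file format, each segment is packed into a single file, so any new or merged segment has to be transferred as one whole file.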

What version of Lucene are you running?


--
Ian.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


yuvalf at answers

Feb 9, 2010, 10:59 PM

Post #3 of 3
RE: Different replicas return different scores

Thanks for these directions, Ian.
We are running Lucene 2.9.1 on CentOS 5 64-bit machines.
We do use the compound file format and will look into switching to the non-compound (multi-file) format,
although I believe this will create too many files.
We will also consider the rsync option.
Thanks again,
-- Yuval

