
Mailing List Archive: Lucene: Java-Dev

IndexWriter.updateDocument performance improvement

 

 



bogdan at ecstend

Nov 20, 2009, 4:21 AM

Post #1 of 4
IndexWriter.updateDocument performance improvement

Hi,

One of my application's use cases involves updating the index with
10 to 10k docs every few minutes. Because we maintain a PK for each
doc, we have to use IndexWriter.updateDocument to stay consistent.

The average time for an update when we commit every 10k docs is around
17ms (the IndexWriter buffer is 100MB). I profiled the application for
several hours and noticed that most of the time is spent in
IndexWriter.applyDeletes()->TermDocs.seek(). I changed
BufferedDeletes.terms from a HashMap to a TreeMap so that the terms are
ordered, which reduces the number of random seeks on the disk.
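
As a rough illustration of why the map type matters here (this is not the
actual Lucene patch), the sketch below buffers the same delete terms into a
HashMap and a TreeMap and prints the order in which each map hands them
back. The class name, the "pk:NNNN" term strings, and the Integer values are
hypothetical stand-ins; Lucene's real BufferedDeletes uses its own Term and
value types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a buffered-deletes map; not Lucene's own class.
class SortedDeleteTermsDemo {

    // Returns the keys in the order the map iterates over them.
    static List<String> iterationOrder(Map<String, Integer> bufferedTerms) {
        return new ArrayList<String>(bufferedTerms.keySet());
    }

    // Buffers the given delete terms into the target map, in arrival order.
    // The value is just a running counter used as placeholder data.
    static Map<String, Integer> buffer(Map<String, Integer> target, String... terms) {
        int counter = 0;
        for (String term : terms) {
            target.put(term, counter++);
        }
        return target;
    }

    public static void main(String[] args) {
        String[] arrivals = {"pk:0042", "pk:0007", "pk:0913", "pk:0100"};

        // HashMap: iteration order depends on hash codes, not on term order.
        System.out.println("HashMap: "
            + iterationOrder(buffer(new HashMap<String, Integer>(), arrivals)));

        // TreeMap: iteration follows the natural (lexicographic) key order.
        System.out.println("TreeMap: "
            + iterationOrder(buffer(new TreeMap<String, Integer>(), arrivals)));
    }
}
```

With a TreeMap the terms come back in ascending order, so a consumer that
looks each term up in a sorted on-disk dictionary moves forward through it
instead of jumping around.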

I ran my tests again with the patched Lucene 2.9.1 and the time
dropped from 17ms to 2ms. The index is 18GB and holds 70 million docs.

I cannot send a patch because my company has some strict and
time-consuming policies about open source, but the change is small and
can be applied easily.

Regards,
Bogdan

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


yonik at lucidimagination

Nov 20, 2009, 5:11 AM

Post #2 of 4
Re: IndexWriter.updateDocument performance improvement

Thanks Bogdan, I've been meaning to bring this up.
Solr used a TreeMap in the past (when it handled its own deletes) for
exactly the same reason. In my profiling, I've also seen applyDeletes()
taking the bulk of the time with small/simple document indexing.

So we should definitely go in sorted order (either via a TreeMap or by
sorting the HashMap).
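
The second option mentioned above, keeping the HashMap and sorting only when
the deletes are applied, could look roughly like the sketch below. It uses
only JDK collections; the class name, the "pk:NNNN" strings, and the Integer
values are hypothetical placeholders rather than Lucene's actual types.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: buffer in a HashMap, sort once at apply time.
class SortAtApplyDemo {

    // Copies the keys out of the unordered map and sorts them once.
    static List<String> sortedTerms(Map<String, Integer> bufferedTerms) {
        List<String> terms = new ArrayList<String>(bufferedTerms.keySet());
        Collections.sort(terms);
        return terms;
    }

    public static void main(String[] args) {
        Map<String, Integer> buffered = new HashMap<String, Integer>();
        buffered.put("pk:0042", 0);
        buffered.put("pk:0913", 1);
        buffered.put("pk:0007", 2);

        // Walk the terms in ascending order; real code would seek to each
        // term in the index at this point instead of printing it.
        for (String term : sortedTerms(buffered)) {
            System.out.println(term + " -> " + buffered.get(term));
        }
    }
}
```

The trade-off is where the ordering cost is paid: a TreeMap pays O(log n) on
every buffered term, while this variant keeps constant-time inserts and pays
one O(n log n) sort per apply.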

-Yonik
http://www.lucidimagination.com



lucene at mikemccandless

Nov 20, 2009, 6:43 AM

Post #3 of 4
Re: IndexWriter.updateDocument performance improvement

+1

I'll open an issue.

Mike



lucene at mikemccandless

Nov 20, 2009, 9:11 AM

Post #4 of 4
Re: IndexWriter.updateDocument performance improvement

Opened LUCENE-2086.

Mike

