Mailing List Archive: Lucene: Java-User

Optimize for large index size

 

 


vivextra at gmail

Jan 18, 2008, 1:31 AM

Post #1 of 7
Optimize for large index size

Hi,

We are using Lucene 2.2. Our index has reached 70 GB (within 3-4
days) and is growing. We run optimize frequently (once every hour)
because of the large number of index updates, which can be up to 100K
new documents every minute. Every now and then the optimize takes 3-4
hours to complete and uses up to 8 GB of memory (our limit). This
makes the whole system slow. A few questions:

1) Is there any alternative to optimize? That is, can we do without
optimize and still have our search fast?
2) What's the best way to use optimize, i.e. how can we make
optimize much faster and use less memory?
3) Is there a way to partition the indexes using Lucene? Let's say we
partition daily, so we have to optimize only the daily indexes and not
the whole thing.

Our mergeFactor=200 and maxMergeDocs=99999

Thanks,
-vivek

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


lucene at mikemccandless

Jan 18, 2008, 2:37 AM

Post #2 of 7
Re: Optimize for large index size

vivek sar wrote:

> Hi,
>
> We are using Lucene 2.2. Our index has reached 70 GB (within 3-4
> days) and is growing. We run optimize frequently (once every hour)
> because of the large number of index updates, which can be up to 100K
> new documents every minute. Every now and then the optimize takes 3-4
> hours to complete and uses up to 8 GB of memory (our limit). This
> makes the whole system slow. A few questions:
>
> 1) Is there any alternative to optimize? That is, can we do without
> optimize and still have our search fast?

In Lucene 2.3 (coming out shortly) there is a new "partial optimize"
method that takes an int maxNumSegments. It will optimize your index
down to that many segments. This lets you reduce the cost of optimizing
while still getting faster searching. Maybe try that?

Also, Lucene 2.3 has sizable speedups to indexing, and uses RAM
buffering more efficiently. This will let you hold more documents in
RAM before flushing a new segment, which in turn should reduce your
merging cost.

> 2) What's the best way to use optimize, i.e. how can we make
> optimize much faster and use less memory?

Fundamentally optimize is quite time consuming because it has to do
massive segment merging (the final merge being the worst).

However, it is not supposed to be so memory consuming. How are you
measuring memory usage? (Try using java -verbose:gc to see actual
heap usage after a full GC.) What kind of documents are you creating?
E.g., one known memory issue is if you have many diverse fields, all
with norms enabled. Norms are not stored sparsely, so this will
consume a lot of RAM during optimize and during searching.
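
As a rough back-of-the-envelope sketch of why non-sparse norms add up: assuming one norm byte per document for every field that has norms enabled (the encoding used by Lucene 2.x), the cost scales with documents times fields, no matter how few documents actually contain a given field. The class name, document count, and field count below are illustrative assumptions, not figures measured in this thread.

```java
// Rough estimate of norms memory, assuming one byte per document for
// every field that has norms enabled (norms are not stored sparsely).
class NormsEstimate {
    /** Bytes needed to hold norms for numDocs documents across fieldsWithNorms fields. */
    static long normsBytes(long numDocs, int fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long docs = 100_000_000L; // hypothetical document count
        int fields = 30;          // hypothetical number of fields with norms
        long bytes = normsBytes(docs, fields);
        System.out.println((bytes / (1024 * 1024)) + " MB of norms");
    }
}
```

Halving the number of fields that carry norms halves this estimate, which is why disabling norms on fields that do not need them is worth testing.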

> 3) Is there a way to partition the indexes using Lucene? Let's say we
> partition daily, so we have to optimize only the daily indexes and not
> the whole thing.

Yes, you can do this, and run searches over these indices (use
MultiSearcher). You can then merge them into a single index, using
addIndexesNoOptimize. But, this is not really different from using
optimize(int maxNumSegments).

> Our mergeFactor=200 and maxMergeDocs=99999

In Lucene 2.3, segment merging is done in background threads.
Given that, I think you'd want to decrease mergeFactor so that
merging takes place while you are indexing. (Test to be sure.)
That should make your optimize call less costly.


vivextra at gmail

Jan 18, 2008, 9:50 PM

Post #3 of 7
Re: Optimize for large index size

Thanks, Michael, for the feedback. A couple more questions:

1) Doesn't Lucene do some sort of optimization internally based on
mergeFactor, i.e., if the number of segments grows beyond the
mergeFactor, Lucene automatically merges them into one segment? Is
this different from optimization? Does optimize do more than this?
The reason we keep a high mergeFactor (200) is so that Lucene doesn't
do frequent optimization on its own.

2) Do you know any approximate release date for 2.3?


We do have around 30 fields in our index (over 10 are untokenized; can
I just make them NO_NORMS?).

Thanks,
-vivek


lucene at mikemccandless

Jan 19, 2008, 3:25 AM

Post #4 of 7
Re: Optimize for large index size

vivek sar wrote:

> Thanks Michael for the feedback. Couple more questions,
>
> 1) Doesn't Lucene do some sort of optimization internally based on
> mergeFactor, i.e., if the number of segments grows beyond the
> mergeFactor, Lucene automatically merges them into one segment? Is
> this different from optimization? Does optimize do more than this?
> The reason we keep a high mergeFactor (200) is so that Lucene doesn't
> do frequent optimization on its own.

Lucene does periodically merge segments, but this is in general a
lower cost operation than optimize. With mergeFactor 10, after you
have flushed 10 segments (call these "level 0" segments), they will be
merged together into a level 1 segment. Then, 10 flushes later, you get
another level 1 segment, and so on. Once you have 10 level 1 segments,
they are merged into a level 2 segment, etc.

So over time this results in a logarithmic segment structure, whereby
you have < 10 segments at each level and each level is 10X the
size of the previous one (unless you start doing deletes...).

The merges can cascade, which means at certain times this merging does
in fact equate to an optimize. But that does not happen very often.
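
The level structure described above can be sketched with a toy simulation. This is a simplified model under the assumptions stated in this post (equal-sized flushes, no deletes, a merge as soon as a level holds mergeFactor segments); the class and method names are hypothetical, and it is not Lucene's actual merge policy code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of level-based segment merging; not Lucene's real merge policy.
class MergeSim {
    /**
     * Simulates the given number of segment flushes and returns how many
     * segments remain at each level (index 0 = level 0).
     */
    static List<Integer> segmentsPerLevel(int flushes, int mergeFactor) {
        List<Integer> levels = new ArrayList<>();
        for (int i = 0; i < flushes; i++) {
            int level = 0;
            while (true) {
                if (levels.size() == level) {
                    levels.add(0);
                }
                levels.set(level, levels.get(level) + 1);
                if (levels.get(level) < mergeFactor) {
                    break;
                }
                // mergeFactor segments at this level become one segment at
                // the next level; looping again lets the merge cascade.
                levels.set(level, 0);
                level++;
            }
        }
        return levels;
    }

    /** Total number of segments left in the index after the flushes. */
    static int totalSegments(int flushes, int mergeFactor) {
        int total = 0;
        for (int n : segmentsPerLevel(flushes, mergeFactor)) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(segmentsPerLevel(1234, 10)); // segments at levels 0..3
        System.out.println(totalSegments(1234, 10));
        System.out.println(totalSegments(1234, 200));
    }
}
```

In this model, 1234 flushes leave 10 segments with mergeFactor 10 but 40 segments with mergeFactor 200, so a later optimize has more segments to merge in the second case. After exactly 1000 flushes with mergeFactor 10, the cascade leaves a single segment, which is the case where merging equates to an optimize.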

Whereas optimize() always forces merging down to a single segment,
which is extremely costly.

I'm guessing, with Lucene 2.3, you will win with a lower mergeFactor
because this allows background threads to merge as you go. Then at
the end there will be fewer segments that optimize has to merge.

> 2) Do you know any approximate release date for 2.3?

Actually, any day now. The final vote to release 2.3 is underway:

http://markmail.org/message/x66w2c5b5psvhc54

> We do have around 30 fields in our index (over 10 are untokenized; can
> I just make them NO_NORMS?).

Yes, definitely test this and see if it reduces memory usage. You
have to fully rebuild your index because norms are "contagious",
meaning if any document has norms turned on, then the segment holding
that doc will have norms allocated for all docs, and when that segment
gets merged, all merged docs then have norms allocated, etc.

Mike


otis_gospodnetic at yahoo

Jan 19, 2008, 10:30 PM

Post #5 of 7
Re: Optimize for large index size

In addition to what Mike already said:

maxMergeDocs=99999 -- do you really mean maxMergeDocs and not maxBufferedDocs?

A larger maxBufferedDocs will speed up indexing.

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


vivextra at gmail

Jan 20, 2008, 3:42 AM

Post #6 of 7
Re: Optimize for large index size

My maxBufferedDocs is 1000; do you recommend a bigger value than that?
What's a good number for a very high indexing rate (10K new
documents every min)?


lucene at mikemccandless

Jan 20, 2008, 4:53 AM

Post #7 of 7
Re: Optimize for large index size

Once you upgrade to 2.3, it's best to flush by RAM
(writer.setRAMBufferSizeMB) instead of by document count.

Generally, the more RAM the better, up to a point. You should
also be sure not to use so much RAM that your JVM must GC too often
or hits an OOM error, or that your machine starts swapping.
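
To illustrate the difference between the two flush triggers, here is a toy comparison. It assumes a fixed, hypothetical buffered size per document and a hypothetical buffer limit; the class and method names are made up for this sketch, and it does not reflect how IndexWriter accounts for its buffer internally.

```java
import java.util.Arrays;

// Toy comparison of two flush triggers. Document sizes are hypothetical,
// and this is not how IndexWriter measures its buffer internally.
class FlushSim {
    /** Flushes triggered when every maxBufferedDocs documents force a flush. */
    static int flushesByDocCount(int numDocs, int maxBufferedDocs) {
        return numDocs / maxBufferedDocs;
    }

    /** Flushes triggered when the buffered bytes reach ramBufferBytes. */
    static int flushesByRam(int[] docSizesBytes, long ramBufferBytes) {
        int flushes = 0;
        long used = 0;
        for (int size : docSizesBytes) {
            used += size;
            if (used >= ramBufferBytes) {
                flushes++;
                used = 0;
            }
        }
        return flushes;
    }

    public static void main(String[] args) {
        int[] docs = new int[10_000];  // one minute of indexing at 10K docs/min
        Arrays.fill(docs, 1_000);      // assume about 1 KB buffered per document
        System.out.println(flushesByDocCount(docs.length, 1_000));
        System.out.println(flushesByRam(docs, 5_000_000L));
    }
}
```

With small documents, a fixed document count flushes many small segments even though plenty of buffer memory is still free, while a RAM-based trigger adapts to the actual document size. Fewer, larger flushed segments mean less merging later.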

Mike
