Mailing List Archive: Lucene: Java-User

best practice on too many files vs IO overhead


istvan.soos at gmail

Nov 27, 2009, 1:23 AM

Post #1 of 6 (977 views)
best practice on too many files vs IO overhead

Hi,

I have a requirement that involves frequent, batched updates of my Lucene
index. This is done by an in-memory queue and a process that periodically
wakes up and processes that queue into the Lucene index.
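
For illustration, the queue-and-periodic-worker setup described above might look roughly like the following. This is a minimal sketch using only the JDK, not the poster's actual code: `BatchingUpdateQueue`, `enqueue`, `flush`, and `startPeriodicFlush` are hypothetical names, and the batch handler stands in for whatever code really writes the batch to the Lucene index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper: collects pending updates and hands them over in batches.
final class BatchingUpdateQueue<T> {
    private final BlockingQueue<T> pending = new LinkedBlockingQueue<>();
    private final Consumer<List<T>> batchHandler; // assumed to apply the batch to the index

    BatchingUpdateQueue(Consumer<List<T>> batchHandler) {
        this.batchHandler = batchHandler;
    }

    // Called by producers; does not block because the queue is unbounded.
    void enqueue(T update) {
        pending.add(update);
    }

    // Drains everything queued so far and passes it to the handler as one batch.
    // Returns the batch size; an empty queue does not invoke the handler.
    int flush() {
        List<T> batch = new ArrayList<>();
        pending.drainTo(batch);
        if (!batch.isEmpty()) {
            batchHandler.accept(batch);
        }
        return batch.size();
    }

    // Wakes up at a fixed interval and flushes. The caller is responsible for
    // shutting the returned executor down.
    ScheduledExecutorService startPeriodicFlush(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::flush, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

With 1-10 updates per minute, `startPeriodicFlush(6)` to `startPeriodicFlush(60)` would cover the stated range; the important part is that each flush is one batch, so the cost of whatever follows a batch (commit, optimize, reader refresh) is paid once per cycle.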

If I do not optimize my index, I get a "too many open files"
exception (yeah, right, I can raise the OS's limit a bit, but that
just postpones the exception).
If I do optimize my index, I get a very large IO overhead (as
it re-reads and rewrites the whole index).

Right now I'm optimizing the index on each batch cycle, but as my
index size quickly grows to around 1GB, I experience great overhead in
the IO operations. The updates have to happen frequently (1-10 times per
minute), so I'm looking for advice on how to solve this issue. I might
split the index, but that way I'd hit the "too many open files" error
sooner, and in the end the IO overhead remains...

Any suggestions?
Thanks,
Istvan

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


lucene at mikemccandless

Nov 27, 2009, 2:37 AM

Post #2 of 6 (923 views)
Re: best practice on too many files vs IO overhead [In reply to]

Are you sure you're closing all readers that you're opening?

It's surprising with normal usage of Lucene that you'd run out of
descriptors, with its default mergeFactor (have you increased the
mergeFactor)?

You can also enable compound file, which uses far fewer file
descriptors, at some cost to indexing performance.

Also, a partial optimize (ie optimize(N)) does less IO but still
substantially reduces segment count of the index.
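
To make the trade-off between segment count, compound files, and open readers concrete, here is some back-of-the-envelope arithmetic. It is a rough model and not Lucene code: `DescriptorEstimate` is a hypothetical name, the files-per-segment figures are illustrative assumptions (the real number depends on the Lucene version and index features), and it ignores that readers may share some segment files.

```java
// Rough model, not part of Lucene: estimates how many file descriptors are
// held when every open reader keeps every file of every segment open.
final class DescriptorEstimate {
    static long estimate(int openReaders, int segmentsPerReader, int filesPerSegment) {
        if (openReaders < 0 || segmentsPerReader < 0 || filesPerSegment < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        return (long) openReaders * segmentsPerReader * filesPerSegment;
    }

    public static void main(String[] args) {
        // Assumed numbers for illustration only.
        System.out.println(estimate(1, 20, 1));   // one reader, 20 segments, compound file
        System.out.println(estimate(1, 20, 8));   // same, assuming 8 files per non-compound segment
        System.out.println(estimate(600, 20, 1)); // 600 readers that were never closed
    }
}
```

Under these assumptions a single reader stays far below even a limit of 256, while a growing number of unclosed readers passes any fixed limit eventually. Reducing the segment count, for example with optimize(N), lowers the middle factor without rewriting the whole index.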

Mike



istvan.soos at gmail

Nov 27, 2009, 2:48 AM

Post #3 of 6 (923 views)
Re: best practice on too many files vs IO overhead [In reply to]

On Fri, Nov 27, 2009 at 11:37 AM, Michael McCandless
<lucene [at] mikemccandless> wrote:
> Are you sure you're closing all readers that you're opening?

Absolutely. :) (okay, never say this, but I had bugz because of this
previously so I'm pretty sure that one is ok).

> It's surprising with normal usage of Lucene that you'd run out of
> descriptors, with its default mergeFactor (have you increased the
> mergeFactor)?

Default merge factor. (On Mac, the default maxfiles is 256; however,
I've run out of descriptors even at 10240 when I hadn't called
optimize.)

> You can also enable compound file, which uses far fewer file
> descriptors, at some cost to indexing performance.

I thought this was the default, but I'll check...

> Also, a partial optimize (ie optimize(N)) does less IO but still
> substantially reduces segment count of the index.

I wasn't aware of this, thanks, I'll try it!

Regards,
Istvan



lucene at mikemccandless

Nov 27, 2009, 3:02 AM

Post #4 of 6 (919 views)
Re: best practice on too many files vs IO overhead [In reply to]

If in fact you are using CFS (it is the default), and your OS is
letting you use 10240 descriptors, and you haven't changed the
mergeFactor, then something is seriously wrong. I would triple check
that all readers are being closed.

Or... if you list the index directory, how many files do you see?
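
One way to answer that question programmatically is sketched below. It uses only the JDK; `IndexDirectoryStats` and `countFiles` are hypothetical names, and the index is assumed to live in a plain filesystem directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical diagnostic helper, not part of Lucene.
final class IndexDirectoryStats {
    // Counts regular files directly inside the given directory (non-recursive).
    static long countFiles(Path indexDir) throws IOException {
        // Files.list holds a directory handle open, so close the stream when done.
        try (Stream<Path> entries = Files.list(indexDir)) {
            return entries.filter(Files::isRegularFile).count();
        }
    }
}
```

If the directory holds only a modest number of files while the process still runs out of descriptors, the descriptors are probably not going to the current index files, which points back at readers that were never closed.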

Mike



istvan.soos at gmail

Nov 27, 2009, 3:11 AM

Post #5 of 6 (925 views)
Re: best practice on too many files vs IO overhead [In reply to]

You were right, my bad...

I have an async reader close that runs on a schedule (after the writer
refreshes the index, so as not to interrupt ongoing searches), but while
I'd set up the scheduling for my first two indexes, I forgot it for
my third... oh dear...
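
An alternative to a separately scheduled close, which can be forgotten as it was here, is to count references so that the old reader is closed by whoever releases it last. The following is a minimal sketch under that assumption, using only the JDK; `RefCounted` and `ReaderHolder` are hypothetical names, and any `Closeable` stands in for the Lucene reader.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper: closes the resource when the last reference is released.
final class RefCounted<R extends Closeable> {
    private final R resource;
    private final AtomicInteger refs = new AtomicInteger(1); // 1 = the holder's own reference

    RefCounted(R resource) { this.resource = resource; }

    R get() { return resource; }

    // Takes a reference; returns false if the resource was already fully released.
    boolean tryAcquire() {
        while (true) {
            int n = refs.get();
            if (n == 0) return false;
            if (refs.compareAndSet(n, n + 1)) return true;
        }
    }

    // Drops a reference and closes the resource when none remain.
    void release() throws IOException {
        if (refs.decrementAndGet() == 0) resource.close();
    }
}

// Hypothetical holder for the reader currently used by searches.
final class ReaderHolder<R extends Closeable> {
    private volatile RefCounted<R> current;

    ReaderHolder(R initial) { current = new RefCounted<>(initial); }

    // Searches call acquire(), use get(), then call release() in a finally block.
    RefCounted<R> acquire() {
        while (true) {
            RefCounted<R> ref = current;
            if (ref.tryAcquire()) return ref;
            // Lost a race with swap(); retry against the new current reference.
        }
    }

    // Installs a refreshed reader. The old one is closed immediately if idle,
    // otherwise when the last in-flight search releases it.
    synchronized void swap(R refreshed) throws IOException {
        RefCounted<R> old = current;
        current = new RefCounted<>(refreshed);
        old.release();
    }
}
```

With this shape, adding a third index only requires another `ReaderHolder`; there is no per-index close schedule to remember, and ongoing searches are still not interrupted.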

Thanks anyway for the info, it was useful indeed.
Regards,
Istvan



lucene at mikemccandless

Nov 27, 2009, 3:12 AM

Post #6 of 6 (925 views)
Re: best practice on too many files vs IO overhead [In reply to]

Phew :) Thanks for bringing closure!

Mike

