
Mailing List Archive: Lucene: Java-User

java gc with a frequently changing index?

 

 



tsturge at metaweb

Jul 25, 2007, 11:41 AM

Post #1 of 7
java gc with a frequently changing index?

Hi,

I am indexing a set of constantly changing documents. The change rate is
moderate (about 10 docs/sec over a 10M document collection with a 6G
total size) but I want to be right up to date (ideally within a second
but within 5 seconds is acceptable) with the index.

Right now I have code that adds new documents to the index and deletes
old ones using updateDocument() in the 2.1 IndexWriter. In order to see
the changes, I need to recreate the IndexReader/IndexSearcher every
second or so. I am not calling optimize() on the index in the writer,
and the mergeFactor is 10.

The problem I am facing is that java gc is terrible at collecting the
IndexSearchers I am discarding. I usually have a 3msec query time, but I
get gc pauses of 300msec to 3 sec (I assume it is collecting the
"tenured" generation in these pauses, which holds my old IndexSearchers).

I've tried "-Xincgc", "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" and
calling System.gc() right after I close the old index without much luck
(I get the pauses down to 1sec, but get 3x as many. I want < 25 msec
pauses). So my question is, should I be avoiding reloading my index in
this way? Should I keep a separate IndexReader (which only deletes old
documents) and one for new documents? Is there a standard technique for
a quickly changing index?

Thanks,

Tim


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


erickerickson at gmail

Jul 28, 2007, 12:46 PM

Post #2 of 7
Re: java gc with a frequently changing index? [In reply to]

Why do you believe that it's the gc? I admit I just scanned your
e-mail, but I *do* know that the first search (especially sorts) on
a newly-opened IndexReader incurs a bunch of overhead. Could
that be what you're seeing?

I'm not sure there is a "best practice", but I have seen two
solutions mentioned, both more complex than opening/closing
the reader.

1> open the reader in the background, fire a few "warmup" queries
at it, then switch it with the one you actually use to answer queries.
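Suggestion 1> can be sketched roughly as follows. This is a minimal illustration of the warm-then-swap pattern only; `Searcher` here is a placeholder interface standing in for Lucene's IndexSearcher, not the real API:

```java
import java.util.concurrent.atomic.AtomicReference;

// Warm a freshly opened searcher in the background, then swap it in
// atomically so live queries never pay the first-search cost.
public class WarmingSwapper {
    /** Placeholder for an IndexSearcher-like object (hypothetical). */
    public interface Searcher {
        String search(String query);
        void close();
    }

    private final AtomicReference<Searcher> current = new AtomicReference<>();

    public WarmingSwapper(Searcher initial) {
        current.set(initial);
    }

    /** Queries always go to whichever searcher is currently live. */
    public String search(String query) {
        return current.get().search(query);
    }

    /** Warm the fresh searcher off-line, then swap and close the old one. */
    public void swapIn(Searcher fresh, String[] warmupQueries) {
        for (String q : warmupQueries) {
            fresh.search(q);      // pay the warmup cost here, not on a user query
        }
        Searcher old = current.getAndSet(fresh);
        if (old != null) {
            old.close();          // release the stale reader
        }
    }
}
```

The swap itself is a single atomic reference exchange, so query threads never see a half-initialized searcher.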

2> Use a RAMDirectory to hold your new entries for some period
of time. You'd have to do some fancy dancing to keep this straight
since you're updating documents, but it might be viable. The scheme
is something like
Open your FSDIR
Open a RAMdir.

Add all new documents to BOTH of them. When servicing a query,
look in both indexes, but only re-open the RAMdir reader for
each query. Note that since, when you open a reader, it
takes a snapshot of the index, these two views will be disjoint. When you
get your results back, you'll have to do something about the documents
from the FSdir that have been replaced in the RAMdir, which is where
the fancy dancing part comes in. But I leave that as an exercise for
the reader.

Periodically, shut everything down and repeat. The point here is that
you can (probably) close/open your RAMdir with very small costs and
have the whole thing be up to date.
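The "fancy dancing" step can be sketched as a merge where the fresh in-memory copy of a document shadows the stale on-disk one. Each index's result set is modeled here as a map from a stable document key to its hit; these maps are hypothetical stand-ins for real FSDirectory/RAMDirectory searches:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Merge hits from the big on-disk index and the small in-memory delta,
// letting the RAMdir copy replace the stale FSdir copy of the same doc.
public class DualIndexMerge {
    public static List<String> merge(Map<String, String> fsHits,
                                     Map<String, String> ramHits) {
        // LinkedHashMap preserves the FSdir ordering while letting
        // RAMdir entries overwrite replaced documents in place.
        Map<String, String> merged = new LinkedHashMap<>(fsHits);
        merged.putAll(ramHits);   // newer RAMdir copy wins on key clash
        return new ArrayList<>(merged.values());
    }
}
```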

There'll be some coordination issues, and you'll have to cope with data
integrity if your process barfs before you've closed your FSDir....

Or, you could ask whether 5 seconds is really necessary. I've seen a lot
of times when "real time" could be 5 minutes and nobody would really
complain, and other times when it really is critical. But that's between you
and your Product Manager....

Hope this helps
Erick



markrmiller at gmail

Jul 30, 2007, 1:19 PM

Post #3 of 7
Re: java gc with a frequently changing index? [In reply to]

I believe there is an issue in JIRA that handles reopening an IndexReader
without reopening segments that have not changed.



markrmiller at gmail

Jul 30, 2007, 1:23 PM

Post #4 of 7
Re: java gc with a frequently changing index? [In reply to]

And by the way, I cannot see it ever making sense to keep reopening an index
reader every second or so. It has to be MUCH more efficient to wait even
2 or 4 seconds... even that is going to be pretty nasty, but you have
to allow for a bit of batching, man. You will waste so much time opening those
readers that it's not going to be real-time anyway. You are just going to be
in a world of slow.



tsturge at metaweb

Jul 30, 2007, 1:29 PM

Post #5 of 7
Re: java gc with a frequently changing index? [In reply to]

Thanks for the reply Erick,

I believe it is the gc for four reasons:

- I've tried the "warmup" approach alredy and it didn't change the
situation.

- The server completely pauses for several seconds. I run jstack to find
out where the pause is, and it also pauses for several seconds before
telling me the server is doing something perfectly innocuous. If I was
stuck in some search overhead, I would expect jstack to tell me where
(and I would expect the where to be somewhere interesting and vaguely
repeatable)

- The impact is very uneven. Over 50000 queries (sequentially) I get
49500 at 3 msec, 450 at 300 msec and 50 at 3 sec or more (ouch). I
really would be much happier with a consistent 10msec (which adds up to
the same amount of time in total) or even 25msec

- "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" changes the pauses (I get
100 msec and 1 sec pauses instead, but 5x as many for slower overall
time; 1 sec is far too slow)
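The third point above can be sanity-checked with simple arithmetic: 49,500 queries at 3 msec plus 450 at 300 msec plus 50 at 3,000 msec comes to about 433 seconds in total, roughly the same as 50,000 queries at a flat 10 msec (500 seconds), which is why a consistent latency would cost nearly nothing overall:

```java
// Aggregate-latency check for the measured distribution above.
public class LatencyTotals {
    /** Each bucket is {query count, per-query msec}. */
    public static long totalMillis(long[][] buckets) {
        long total = 0;
        for (long[] b : buckets) {
            total += b[0] * b[1];   // count * per-query msec
        }
        return total;
    }

    public static void main(String[] args) {
        long observed = totalMillis(new long[][] {
            {49500, 3}, {450, 300}, {50, 3000}   // measured buckets
        });
        long flat = 50000L * 10;                 // a consistent 10 msec
        System.out.println(observed + " vs " + flat);  // 433500 vs 500000
    }
}
```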

Your solution looks possible, but seems really too complex for what I am
trying to do (which is basic incremental update). What I really am
looking for is a way to avoid reopening the first segment of my FSDir. I
have a single 6G segment, and then another 20-50 segments with updates,
but they are <100M in total size. So if I could have Lucene open just
the segments file and the new or changed *.del and *.cfs files (without
reopening the unchanged *.cfs files) that would be a huge win for me I
think.

It strikes me this should be possible with a thin but complex layer
between the SegmentReader and MultiReader, and perhaps a way to get
SegmentReader to update what *.del file it is using. I'm just curious
why this doesn't already exist.
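The idea above can be sketched in plain Java: cache per-segment readers keyed by segment name, and on each "reopen" rebuild only the segments whose version (for example, the *.del generation) has changed. `SegmentHandle` is a hypothetical stand-in for Lucene's per-segment reader, not a real class:

```java
import java.util.HashMap;
import java.util.Map;

// Reopen only the segments that changed; the big unchanged segment
// keeps its already-open handle (and its already-warmed state).
public class IncrementalReopen {
    public static class SegmentHandle {
        final String name;
        final long version;
        SegmentHandle(String name, long version) {
            this.name = name;
            this.version = version;
        }
    }

    private final Map<String, SegmentHandle> open = new HashMap<>();

    /** Sync the cache with the current segments file; returns rebuild count. */
    public int reopen(Map<String, Long> currentSegments) {
        int reopened = 0;
        // Drop handles for segments that were merged away.
        open.keySet().retainAll(currentSegments.keySet());
        for (Map.Entry<String, Long> e : currentSegments.entrySet()) {
            SegmentHandle h = open.get(e.getKey());
            if (h == null || h.version != e.getValue()) {
                // Only new or changed segments pay the open cost.
                open.put(e.getKey(), new SegmentHandle(e.getKey(), e.getValue()));
                reopened++;
            }
        }
        return reopened;
    }
}
```

With a 6G base segment and 20-50 small update segments, each reopen would then touch only the small segments.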

Tim





tsturge at metaweb

Jul 30, 2007, 2:16 PM

Post #6 of 7
Re: java gc with a frequently changing index? [In reply to]

Oh, yeah, I know now :-). But I really do have a requirement to show
search results from items that came in 5 seconds ago. We have an
application where a common usage pattern is

add an item
navigate to another item
search for the first item (to associate it with the second item)

and the gap between step 1 and step 3 is not very long.

Right now, I get a notification within a few hundred msec that the item
has been added. I just don't see why it is hard (in theory anyway,
Lucene's implementation notwithstanding) to put that on the end of the
index I'm currently searching. I have lots of CPU available.

Can you tell me the JIRA issue? What kind of patch would Lucene devs be
likely to accept (do I need to get it 100% done, or is 80% of the way
interesting?)

Tim





kroepke at classdump

Jul 30, 2007, 3:48 PM

Post #7 of 7
Re: java gc with a frequently changing index? [In reply to]

Hi Tim!

On Jul 25, 2007, at 8:41 PM, Tim Sturge wrote:

> I am indexing a set of constantly changing documents. The change
> rate is moderate (about 10 docs/sec over a 10M document collection
> with a 6G total size) but I want to be right up to date (ideally
> within a second but within 5 seconds is acceptable) with the index.

We have a change rate of between 2-3 and 60 docs/sec over a somewhat
smaller index (but not too much smaller). We are actually reopening
IndexSearchers every five seconds, or whenever the number of index
changes exceeds a certain threshold (100 changes IIRC). The latter is
to guard against spikes in updates that we'd like to see reflected
earlier. This is purely an implementation detail, though.
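A minimal sketch of such a reopen policy in Java (class, field, and
threshold names are mine, chosen for illustration; the actual code is
not shown here): reopen when the current searcher is older than five
seconds, or sooner if pending changes pile up past a threshold.

```java
// Illustrative reopen policy: refresh the IndexSearcher when it is
// older than MAX_AGE_MS, or earlier if at least MAX_PENDING changes
// have accumulated since the last reopen.
public class ReopenPolicy {
    private static final long MAX_AGE_MS = 5000; // hard 5-second ceiling
    private static final int MAX_PENDING = 100;  // spike guard

    private long lastReopenMs;
    private int pendingChanges;

    public ReopenPolicy(long nowMs) {
        this.lastReopenMs = nowMs;
    }

    /** Call once per document add/update/delete. */
    public void recordChange() {
        pendingChanges++;
    }

    /** True if a fresh IndexSearcher should be opened at time nowMs. */
    public boolean shouldReopen(long nowMs) {
        return (nowMs - lastReopenMs) >= MAX_AGE_MS
            || pendingChanges >= MAX_PENDING;
    }

    /** Call after a new searcher has been opened. */
    public void reopened(long nowMs) {
        lastReopenMs = nowMs;
        pendingChanges = 0;
    }
}
```

The point of the second trigger is exactly the spike case described
above: a burst of updates becomes visible without waiting out the full
five-second interval.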

> Right now I have code that adds new documents to the index and
> deletes old ones using updateDocument() in the 2.1 IndexWriter. In
> order to see the changes, I need to recreate the IndexReader/
> IndexSearcher every second or so. I am not calling optimize() on
> the index in the writer, and the mergeFactor is 10.

Is there a separation between the code that inserts/updates and the
code that searches? We have that distinction and it's been working
great. It might not be possible for your application (I simply don't
know what your objectives are) but it might be worth considering. In
other words, we have separate VMs doing the updates and the searches,
so we can set different heap sizes and GC strategies.

> The problem I am facing is that java gc is terrible at collecting
> the IndexSearchers I am discarding. I usually have a 3msec query
> time, but I get gc pauses of 300msec to 3 sec (I assume it is
> collecting the "tenured" generation in these pauses, which is my
> old IndexSearcher)

We used to have that, too, until we switched GC algorithms. It was
unbearable.

> I've tried "-Xincgc", "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"
> and calling System.gc() right after I close the old index without
> much luck (I get the pauses down to 1sec, but get 3x as many. I
> want < 25 msec pauses). So my question is, should I be avoiding
> reloading my index in this way? Should I keep a separate
> IndexReader (which only deletes old documents) and one for new
> documents? Is there a standard technique for a quickly changing index?

So, these are the settings we use for the search application (this is
Java 6, though, YMMV):
-XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode
-XX:+CMSIncrementalPacing
-XX:CMSIncrementalDutyCycleMin=0
-XX:CMSIncrementalDutyCycle=10

You might have to tweak the generation sizes for your application.
That is rather tricky business, but
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps

might help you figure out what the correct sizes are. Those
settings should also tell you whether your tweaks are actually
working for you.
System.gc() is just asking for trouble, really. I have yet to see a
situation where it really helped me. The best way is to figure out
the right settings for the GC itself, and then forget about it. It
actually took some experimenting and load-testing to find the right
mixture for us.

GC pauses aren't user-noticeable in our application (which is web-
based). Given our architecture we have a certain amount of latency
between a document change and its reflection in the index, but it is
not limited by GC. The machines are 64bit P4 Xeons with 4GB RAM, so
nothing out of the ordinary.
Java 6 made a noticeable difference for us, on the order of some 10%
performance increase, both in load average and response time.
We have yet to encounter problems with it...

The updating part of the application runs with a simple
-XX:+UseParallelGC and its max heap size is much smaller.

Also we are using a custom refcounted scheme for index searchers, so
that new requests always get the latest IndexSearcher opened. We
reopen searchers constantly, as I mentioned above. This pretty much
ensures that we meet our 5 second max delay time. I cannot say that
it actually takes that long to reopen, though we have made some
modifications to the Lucene core which should make it even slower to
reopen and write to disc. So I guess this is not your bottleneck,
either.
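A hypothetical sketch of such a refcounting scheme (class and method
names are mine; the actual implementation is not shown in this thread):
each request acquires a reference before searching and releases it when
done, so an old IndexSearcher is closed only after its last in-flight
query finishes, while new requests already go to the fresh one.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative refcounted holder for a shared resource such as an
// IndexSearcher. The creator holds one reference; the wrapped resource
// is closed exactly once, when the count drops to zero.
public class RefCounted<T> {
    private final T resource;
    private final Runnable closer; // e.g. () -> searcher.close()
    private final AtomicInteger refs = new AtomicInteger(1); // creator's ref

    public RefCounted(T resource, Runnable closer) {
        this.resource = resource;
        this.closer = closer;
    }

    /** Take a reference before using the resource. */
    public T acquire() {
        refs.incrementAndGet();
        return resource;
    }

    /** Drop a reference; closes the resource on the last release. */
    public void release() {
        if (refs.decrementAndGet() == 0) {
            closer.run();
        }
    }
}
```

When a new searcher is opened, the manager swaps it in and calls
release() on the old wrapper for the creator's reference; straggling
queries on the old searcher finish normally and the close happens
only after the last of them releases.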

HTH,
-k
--
Kay Röpke
http://classdump.org/





