
tsturge at metaweb
Jul 30, 2007, 2:16 PM
Post #6 of 7
(1123 views)
Permalink
|
|
Re: java gc with a frequently changing index?
[In reply to]
|
|
Oh, yeah, I know now :-). But I really do have a requirement to show search results from items that came in 5 seconds ago. We have an application where a common usage pattern is add an item navigate to another item search for the first item (to associate it with the second item) and the gap between step 1 and step 3 is not very long. Right now, I get a notification within a few hundred msec that the item has been added. I just don't see why it is hard (in theory anyway, lucene's implementation notwithstanding) to put that on the end of the index I'm currently searching. I have lots of CPU available. Can you tell me the JIRA issue? What kind of patch would lucene devs be likely to accept (do I need to get it 100% done, or is 80% of the way interesting?) Tim Mark Miller wrote: > And by the way, I cannot see it ever making sense to keep reopening an index > reader every second or so. It has to be MUCH more efficient to even wait > every 2 or 4 seconds...even that is going to be pretty nasty, but you have > to allow for a bit of batch man. You will waste so much time opening those > readers that its not going to be real-time anyway. You are just going to be > in a world of slow. > > On 7/30/07, Mark Miller <markrmiller[at]gmail.com> wrote: > >> I believe there is an issue in JIRA that handles reopening an IndexReader >> without reopening segments that have not changed. >> >> On 7/30/07, Tim Sturge < tsturge[at]metaweb.com> wrote: >> >>> Thanks for the reply Erick, >>> >>> I believe it is the gc for four reasons: >>> >>> - I've tried the "warmup" approach alredy and it didn't change the >>> situation. >>> >>> - The server completely pauses for several seconds. I run jstack to find >>> out where the pause is, and it also pauses for several seconds before >>> telling me the server is doing something perfectly innocuous. If I was >>> stuck in some search overhead, I would expect jstack to tell me where >>> (and I would expect the where to be somewhere interesting and vaguely >>> repeatable) >>> >>> - The impact is very uneven. Over 50000 queries (sequentially) I get >>> 49500 at 3 msec, 450 at 300 msec and 50 at 3 sec or more (ouch). I >>> really would be much happier with a consistent 10msec (which adds up to >>> the same amount of time in total) or even 25msec >>> >>> - "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" changes the pauses (I get >>> 100 msec and 1 sec pauses instead, but 5x as many for slower overall >>> time; 1 sec is far too slow) >>> >>> Your solution looks possible, but seems really too complex for what I am >>> trying to do (which is basic incremental update). What I really am >>> looking for is a way to avoid reopening the first segment of my FSDir. I >>> >>> have a single 6G segment, and then another 20-50 segments with updates, >>> but they are <100M in total size. So if I could have lucene open just >>> the segments file and the new or changed *.del and *.cfs files (without >>> reopening the unchanged *.cfs files) that would be a huge win for me I >>> think. >>> >>> It strikes me this should be possible with a thin but complex layer >>> between the SegmentReader and MultiReader, and perhaps a way to get >>> SegmentReader to update what *.del file it is using. I'm just curious >>> why this doesn't already exist. >>> >>> Tim >>> >>> Erick Erickson wrote: >>> >>>> Why do you believe that it's the gc? I admit i just scanned your >>>> e-mail, but I *do* know that the first search (especially sorts) on >>>> a newly-opened IndexReader incure a bunch of overhead. Could >>>> that be what you're seeing? >>>> >>>> I'm not sure there is a "best practice", but I have seen two >>>> solutions mentioned, both more complex than opening/closing >>>> the reader. >>>> 1> open the reader in the background, fire a few "warmup" queries >>>> at it, then switch it with the one you actually use to answer queries. >>>> >>>> 2> Use a RAMDirectory to hold your new entries for some period >>>> of time. You'd have to do some fancy dancing to keep this straight >>>> since you're updating documents, but it might be viable. The scheme >>>> is something like >>>> Open your FSDIR >>>> Open a RAMdir. >>>> >>>> Add all new documents to BOTH of them. When servicing a query, >>>> look in both indexes, but you only open/close the RAMdir for >>>> every query. Note that since, when you open a reader, it >>>> takes a snapshot of the index, these two views will be disjoint. When >>>> >>> you >>> >>>> get your results back, you'll have to do something about the documents >>>> >>>> from the FSdir that have been replaced in the RAMdir, which is where >>>> the fancy dancing part comes in. But I leave that as an exercise for >>>> the reader. >>>> >>>> Periodically, shut everything down and repeat. The point here is that >>>> you can (probably) close/open your RAMdir with very small costs and >>>> have the whole thing be up to date. >>>> >>>> There'll be some coordination issues, and you'll have to cope with >>>> >>> data >>> >>>> integrity if your process barfs before you've closed your FSDir.... >>>> >>>> Or, you could ask whether 5 seconds is really necessary.I've seen a >>>> >>> lot >>> >>>> of times when "real time" could be 5 minutes and nobody would really >>>> complain, and other times when it really is critical. But that's >>>> >>> between you >>> >>>> and our Product Manager.... >>>> >>>> Hope this helps >>>> Erick >>>> >>>> On 7/25/07, Tim Sturge <tsturge[at]metaweb.com> wrote: >>>> >>>> >>>>> Hi, >>>>> >>>>> I am indexing a set of constantly changing documents. The change rate >>>>> >>> is >>> >>>>> moderate (about 10 docs/sec over a 10M document collection with a 6G >>>>> total size) but I want to be right up to date (ideally within a >>>>> >>> second >>> >>>>> but within 5 seconds is acceptable) with the index. >>>>> >>>>> Right now I have code that adds new documents to the index and >>>>> >>> deletes >>> >>>>> old ones using updateDocument() in the 2.1 IndexWriter. In order to >>>>> >>> see >>> >>>>> the changes, I need to recreate the IndexReader/IndexSearcher every >>>>> second or so. I am not calling optimize() on the index in the writer, >>>>> and the mergeFactor is 10. >>>>> >>>>> The problem I am facing is that java gc is terrible at collecting the >>>>> >>>>> IndexSearchers I am discarding. I usually have a 3msec query time, >>>>> >>> but I >>> >>>>> get gc pauses of 300msec to 3 sec (I assume is is collecting the >>>>> "tenured" generation in these pauses, which is my old IndexSearcher) >>>>> >>>>> I've tried "-Xincgc", "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" and >>>>> calling System.gc() right after I close the old index without much >>>>> >>> luck >>> >>>>> (I get the pauses down to 1sec, but get 3x as many. I want < 25 msec >>>>> pauses). So my question is, should I be avoiding reloading my index >>>>> >>> in >>> >>>>> this way? Should I keep a separate IndexReader (which only deletes >>>>> >>> old >>> >>>>> documents) and one for new documents? Is there a standard technique >>>>> >>> for >>> >>>>> a quickly changing index? >>>>> >>>>> Thanks, >>>>> >>>>> Tim >>>>> >>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org >>>>> For additional commands, e-mail: java-user-help[at]lucene.apache.org >>>>> >>>>> >>>>> >>>>> >>>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org >>> For additional commands, e-mail: java-user-help[at]lucene.apache.org >>> >>> >>> > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org For additional commands, e-mail: java-user-help[at]lucene.apache.org
|