
dmitrys at earthlink
Aug 9, 2002, 6:47 AM
Post #2 of 2
(84 views)
Permalink
|
Doug Cutting wrote: > [ I've moved this discussion to lucene-dev. -drc ] > > Dmitry Serebrennikov wrote: > >> I was just thinking about doing something similar, but after looking >> at your code I thought couldn't the same thing be done by >> manipulating the mergeFactor on the existing IndexWriter? It already >> indexes n documents into memory before writing a new disk segment. I >> just looked at it again but I can't see without a detailed study >> whether the mergeFactor applies to merging from RAM to disk only or >> for merging on-disk segments as well. If it applies to both, perhaps >> we could add a different field to the IndexWriter to allow the two >> values to be different? Am I missing something? > > > There'd be more to it than manipulating mergeFactor, but you're right, > IndexWriter already buffers indexes in a RAMDirectory. And it would > be nice to use that RAMDirectory more extensively, controlled > separately from mergeFactor. Peter achieves this by using two > IndexWriters, but probably this should be built-in to IndexWriter. > > Also, it would be better to limit things not based on the number of > documents indexed in RAM, but on the amount of memory used. > RAMDirectory could be altered to keep track of how big it is. We > could, by default, allow buffering of up to, say, 10MB of index before > things are flushed. This would speed up and reduce the number of > files when batch indexing. > > The changes to RAMDirectory should be straightforward. New buffers > are only allocated in the flushBuffer implementation, and they're only > freed by deleteFile and renameFile. > > The changes to IndexWriter are more complicated. The simplest > approach would be to just buffer all single-document indexes in memory > and merge them all at once when flushing. However, merging thousands > of documents at once would use a lot of memory, and is probably not > acceptable. So there probably needs to be two stacks of segment > indexes, one for those in memory and one for those on disk. (The > latter is what is contained in the "segments" file.) > > Both could use an algorithm like that used by the current stack. Each > time a new document is added, its index is pushed onto a stack, and > merging is attempted. When segments are merged, the old indexes are > removed from the top of the stack, and the new index replaces them on > the top of the stack. > > The current algorithm to figure out when to merge is roughly: > > int n = 1; > while (stack[top] through stack[top-mergeFactor] > all contain n or fewer documents) { > merge(stack[top] through stack[top-mergeFactor]); > n *= mergeFactor; > } > > The new case that would occur with two stacks is when the in-memory > index is flushed to disk as a single segment. This could put a > segment on top of the disk-based stack that's larger than those > beneath it, which can't currently happen, and breaks the above > algorithm. I think the fix is simple though: > > int n = stack[top].numDocs; > while (stack[top] through stack[top-mergeFactor] > all contain n or fewer documents) { > merge(stack[top] through stack[top-mergeFactor]); > n *= mergeFactor; > } > > Does that look right? The goals are to: > - keep only O(log(N)) segments, so searching is fast > - only merge each document O(log(N)) times, so indexing is fast > > Dmitry, are you interested in working on this? Well, frankly, I didn't realize how complicated this was... Yes, I am interested in working on this and it would be a great addition to Lucene, but there are other things in my work queue way ahead of this one. Peter's solution suddenly looks much more appealing :) Dmitry. > > > Doug > > > > -- > To unsubscribe, e-mail: > <mailto:lucene-dev-unsubscribe [at] jakarta> > For additional commands, e-mail: > <mailto:lucene-dev-help [at] jakarta> > > -- To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe [at] jakarta> For additional commands, e-mail: <mailto:lucene-dev-help [at] jakarta>
|