
buschmic at gmail
Nov 9, 2009, 7:10 AM
Post #3 of 9
Re: Questions about doc store files (.cfx)
On 11/9/09 2:56 AM, Michael McCandless wrote:
> I think you're asking about the benefit of using "shared doc stores"
> at all?
>
> CFX is just the compound format of these shared files; if compound
> file is off, then they are still shared, just as separate (.fdx/t,
> .tvx/d/f) files.

Oh yeah, that's true. I do mean the shared doc stores in general.

> For building up a single large index, I suspect the win is
> sizable, if you store fields and compute term vectors. You save a lot
> of IO by not merging these files within that one IndexWriter session.
>
> That said, the win is probably less than it used to be, now that we
> bulk-copy when merging these files. Previously, without bulk copy, it
> also consumed a lot of CPU to merge the files.
>
> And it's true that the gains only apply within one IW session, so I'd
> expect this means in practice that when building a huge index from
> scratch you see sizable gains, but then when rolling smallish updates
> into the index over time, there's no real gain. Though that's
> something we could [alternatively] pursue improving (e.g. if we
> allowed a single segment to reference multiple doc stores).

Ok, thanks for clarifying.

> I do think keeping the IO cost down during merging is important;
> removing shared doc stores would be a step backwards (though,
> I agree, it would simplify things).

Well, I was just wondering if you or anyone else had any numbers that
quantify the benefits of the shared stores. If they really help a lot,
I agree it's a good thing to have them. But they do add a layer of
complexity to the code (and to the way one has to think about
segments), so if the win is smallish they might not be worth it.

Btw: I'm not trying to say it's required to remove them for parallel
indexing. It'd just be simpler without them. You can think of a
segmented parallel index as a matrix of segments, and of the shared
doc stores as merging multiple cells in a single row or column of a
spreadsheet. It'd be a bit easier if that wasn't possible and it was
always a true matrix.

 Michael

> Mike
>
> On Mon, Nov 9, 2009 at 3:17 AM, Michael Busch <buschmic [at] gmail> wrote:
>
>> Hi,
>>
>> I'm wondering about the benefits of having the .cfx files. The main
>> advantage is that you avoid merging (copying) stored fields and
>> TermVectors during segment merge, right? And I think .cfx files are
>> only shared across segments if the same IndexWriter is used to flush
>> multiple segments and then to commit all those segments in a single
>> transaction. Then those segments share the same .cfx file, correct?
>> And in such a case .cfx files are also not merged into .cfs files?
>>
>> How big is the win of using .cfx files, usually? I'm wondering
>> because the .cfx file is the only one that spans multiple segments
>> and therefore adds more complexity to the code. For parallel indexing
>> it'd be nice not to have those kinds of files that belong to multiple
>> segments, especially when we want to update certain fields.
>>
>> Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene
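
To make the trade-off discussed above a bit more concrete, here is a minimal sketch of the idea, not Lucene's actual code. The class `SharedDocStoreSketch`, the `Segment` descriptor, and the method `mustCopyDocStores` are all hypothetical names invented for illustration. The sketch assumes that a merge can skip copying stored fields and term vectors only when every merged segment points into the same shared doc store and the segments cover a contiguous range of it; otherwise the doc stores have to be copied.

```java
import java.util.Arrays;
import java.util.List;

public class SharedDocStoreSketch {

    /** Hypothetical, simplified segment descriptor (an assumption, not Lucene's SegmentInfo). */
    public static final class Segment {
        final String name;
        final int docCount;
        final String docStore;     // name of the shared doc store this segment points into
        final int docStoreOffset;  // index of this segment's first document within that store

        public Segment(String name, int docCount, String docStore, int docStoreOffset) {
            this.name = name;
            this.docCount = docCount;
            this.docStore = docStore;
            this.docStoreOffset = docStoreOffset;
        }
    }

    /**
     * Returns true if merging these segments would have to copy stored fields
     * and term vectors, under the assumption stated above: the copy is skipped
     * only if all segments share one doc store and are contiguous within it.
     */
    public static boolean mustCopyDocStores(List<Segment> toMerge) {
        String store = null;
        int expectedOffset = -1;
        for (Segment s : toMerge) {
            if (store == null) {
                store = s.docStore;                  // first segment fixes the store
            } else if (!store.equals(s.docStore)     // different store: must copy
                    || s.docStoreOffset != expectedOffset) { // gap in the store: must copy
                return true;
            }
            expectedOffset = s.docStoreOffset + s.docCount;
        }
        return false;
    }

    public static void main(String[] args) {
        // Three segments flushed in one writer session, all pointing into store "_0".
        List<Segment> sameSession = Arrays.asList(
                new Segment("_0", 10, "_0", 0),
                new Segment("_1", 5, "_0", 10),
                new Segment("_2", 7, "_0", 15));
        System.out.println(mustCopyDocStores(sameSession)); // false

        // A segment from a later session has its own store, so the merge must copy.
        List<Segment> acrossSessions = Arrays.asList(
                new Segment("_2", 7, "_0", 15),
                new Segment("_3", 4, "_3", 0));
        System.out.println(mustCopyDocStores(acrossSessions)); // true
    }
}
```

In this model the "merged cells" of the matrix analogy are the runs of segments that share one `docStore` value; a true matrix would correspond to every segment having its own store, in which case every merge copies.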