
Mailing List Archive: Lucene: Java-Dev

Questions about doc store files (.cfx)

 

 



buschmic at gmail

Nov 9, 2009, 12:17 AM

Post #1 of 9
Questions about doc store files (.cfx)

Hi,

I'm wondering about the benefits of having the .cfx files. The main
advantage is that you avoid merging (copying) stored fields and
TermVectors during segment merge, right? And I think .cfx files are only
shared across segments if the same IndexWriter is used to flush multiple
segments and then to commit all those segments in a single transaction.
Then those segments share the same .cfx file, correct? And in such a
case .cfx files are also not merged into .cfs files?

How big is the win of using .cfx files usually? I'm wondering, because
the .cfx file is the only one that spans multiple segments and
therefore adds more complexity to the code. For parallel indexing it'd
be nice to not have those kinds of files that belong to multiple
segments, especially when we want to update certain fields.

Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


lucene at mikemccandless

Nov 9, 2009, 2:56 AM

Post #2 of 9
Re: Questions about doc store files (.cfx)

I think you're asking about the benefit of using "shared doc stores" at
all?

CFX is just the compound format of these shared files; if compound
file is off, then they are still shared, just as separate (.fdx/t,
.tvx/d/f) files.
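
As a rough illustration of that distinction, the sketch below classifies index file names by extension. The extensions are the ones named in this message (.fdx/.fdt for stored fields, .tvx/.tvd/.tvf for term vectors) plus .cfx; the class and method names (DocStoreFiles, isDocStoreFile) are hypothetical and not part of Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not a Lucene class: tells doc store files apart
// from the other files of a segment by looking at the extension only.
final class DocStoreFiles {
    // Separate doc store files, plus their compound form (.cfx).
    static final Set<String> DOC_STORE_EXTENSIONS =
        new HashSet<>(Arrays.asList("fdx", "fdt", "tvx", "tvd", "tvf", "cfx"));

    static boolean isDocStoreFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot >= 0 && DOC_STORE_EXTENSIONS.contains(fileName.substring(dot + 1));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"_0.cfx", "_0.fdt", "_0.cfs", "segments_2"}) {
            System.out.println(name + " -> doc store file: " + isDocStoreFile(name));
        }
    }
}
```

Whether compound file is on or off only changes which of these names show up on disk, not whether the doc store is shared.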

For building up a single large index, I suspect the win is
sizable, if you store fields and compute term vectors. You save a lot
of IO not merging these files, within that one IndexWriter session.

That said, the win is probably less than it used to be, now that we
bulk-copy when merging these files. Previously, without bulk copy, it
also consumed a lot of CPU to merge the files.
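
To get a feel for where the IO saving comes from, here is a toy model. It is not Lucene code and not a benchmark. It assumes equal-sized flushed segments, a simple level-based merge with a fixed merge factor, and made-up sizes for the postings and the doc store of one flushed segment; all names (DocStoreMergeCost, mergeBytes) are hypothetical. With shared doc stores the model charges a merge only for the postings, since stored fields and term vectors are not copied within the one session.

```java
// Toy cost model (assumption-laden, not Lucene code): counts the bytes
// that merges copy while one writer session flushes many segments.
final class DocStoreMergeCost {
    static long mergeBytes(int flushes, int mergeFactor, long postingsBytes,
                           long docStoreBytes, boolean sharedDocStores) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        // With shared doc stores a merge rewrites postings only.
        long bytesPerFlushedUnit = postingsBytes + (sharedDocStores ? 0 : docStoreBytes);
        int[] countAtLevel = new int[64]; // segments currently waiting at each level
        long copied = 0;
        for (int i = 0; i < flushes; i++) {
            int level = 0;
            long units = 1; // flushed segments contained in one segment at this level
            // Cascade: mergeFactor segments at one level merge into one at the next.
            while (++countAtLevel[level] == mergeFactor) {
                countAtLevel[level] = 0;
                units *= mergeFactor;
                copied += units * bytesPerFlushedUnit;
                level++;
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        // Made-up sizes: the doc store is 4x the postings per flushed segment.
        System.out.println("shared doc stores:  " + mergeBytes(100, 10, 1, 4, true));
        System.out.println("private doc stores: " + mergeBytes(100, 10, 1, 4, false));
    }
}
```

In this model the gap between the two numbers grows with the ratio of doc store size to postings size, which matches the intuition that the win is largest when you store fields and term vectors.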

And it's true that the gains only apply within one IW session, so I'd
expect this means in practice when building a huge index from scratch
you see sizable gains, but then when rolling smallish updates into the
index over time, there's no real gain. Though that's something we could
[alternatively] pursue improving (eg if we allowed a single segment to
reference multiple doc stores).

I do think keeping the IO cost down during merging is important;
removing shared doc stores would be a step backwards (though,
I agree, would simplify things).

Mike



buschmic at gmail

Nov 9, 2009, 7:10 AM

Post #3 of 9
Re: Questions about doc store files (.cfx)

On 11/9/09 2:56 AM, Michael McCandless wrote:
> I think you're asking about the benefit of using "shared doc stores" at
> all?
>
> CFX is just the compound format of these shared files; if compound
> file is off, then they are still shared, just as separate (.fdx/t,
> .tvx/d/f) files.
>
>
Oh yeah, that's true. I do mean the shared doc stores in general.

> For building up a single large index, I suspect the win is
> sizable, if you store fields and compute term vectors. You save a lot
> of IO not merging these files, within that one IndexWriter session.
>
> That said, the win is probably less than it used to be, now that we
> bulk-copy when merging these files. Previously, without bulk copy, it
> also consumed a lot of CPU to merge the files.
>
> And it's true that the gains only apply within one IW session, so I'd
> expect this means in practice when building a huge index from scratch
> you see sizable gains, but then when rolling smallish updates into the
> index over time, there's no real gain. Though that's something we could
> [alternatively] pursue improving (eg if we allowed a single segment to
> reference multiple doc stores).
>
>

Ok, thanks for clarifying.

> I do think keeping the IO cost down during merging is important;
> removing shared doc stores would be a step backwards (though,
> I agree, would simplify things).
>
>

Well, I was just wondering if you or anyone else had any numbers that
quantify the benefits of the shared stores. If it really helps a lot I
agree it's a good thing to have them. But they do add a layer of
complexity to the code (and to the way one has to think about segments),
so if the win is smallish this might not be desirable. Btw: I'm not
trying to say it's required to remove them for parallel indexing. It'd
just be simpler without them. You can think of a segmented parallel
index as a matrix of segments, and of the shared doc stores as merging
multiple cells in a single row or column of a spreadsheet. It'd be a bit
easier if that wasn't possible and it always was a true matrix.

Michael




lucene at mikemccandless

Nov 9, 2009, 9:00 AM

Post #4 of 9
Re: Questions about doc store files (.cfx)

On Mon, Nov 9, 2009 at 10:10 AM, Michael Busch <buschmic [at] gmail> wrote:
> Well, I was just wondering if you or anyone else had any numbers that
> quantify the benefits of the shared stores. If it really helps a lot I agree
> it's a good thing to have them. But they do add a layer of complexity to the
> code (and to the way one has to think about segments), so if the win is
> smallish this might not be desirable

Alas, I don't have any benchmarks offhand... if you want to run one,
you should be able to hardwire flushDocStores=true in
IndexWriter.doFlushInternal? I think that'd turn off the sharing
without breaking things (run the tests to be sure ;) ).

> Btw: I'm not trying to say it's
> required to remove them for parallel indexing. It'd just be simpler
> without them. You can think of a segmented parallel index as a matrix of
> segments, and of the shared doc stores as merging multiple cells in a
> single row or column of a spreadsheet. It'd be a bit easier if that wasn't
> possible and it always was a true matrix.

I agree, not sharing the stores would make things simpler. Wouldn't
the parallel indexes be able to "privately" share their own stores?
Ie, how the sharing happens need not be in sync across the main &
parallel indexes?

Mike



buschmic at gmail

Nov 9, 2009, 5:40 PM

Post #5 of 9
Re: Questions about doc store files (.cfx)

On 11/9/09 9:00 AM, Michael McCandless wrote:
> Alas, I don't have any benchmarks offhand... if you want to run one,
> you should be able to hardwire flushDocStores=true in
> IndexWriter.doFlushInternal? I think that'd turn off the sharing
> without breaking things (run the tests to be sure ;) ).
>
>

Yes, I'm pretty sure that works. I think I've even done that in the
LUCENE-1879 patch (which works with Lucene 2.4).
>> Btw: I'm not trying to say it's
>> required to remove them for parallel indexing. It'd just be simpler
>> without them. You can think of a segmented parallel index as a matrix of
>> segments, and of the shared doc stores as merging multiple cells in a
>> single row or column of a spreadsheet. It'd be a bit easier if that wasn't
>> possible and it always was a true matrix.
>>
> I agree, not sharing the stores would make things simpler. Wouldn't
> the parallel indexes be able to "privately" share their own stores?
> Ie, how the sharing happens need not be in sync across the main &
> parallel indexes?
>
>

I think that should be ok with parallel indexing, as long as we can
always select all corresponding segments from *all* parallel indexes for
a merge to keep the docIds in sync.

That actually leads me to another question: Let's say you have three
segments a, b, c. b and c share the same doc store. You perform deletes
on a and b. Then you call expungeDeletes(). Normally that call should
only merge a and b, because c doesn't have any deletes. But b and c have
to participate in the same merge, because they share the same doc store,
right? So would it merge all three segments?

If that's the case (that b and c must be part of the same merge) then it
would make the parallel indexing more difficult. The reason is that if
two parallel indexes 1 and 2 can decide on their own how to share e.g.
doc stores across segments, then we might end up in a situation where 1a
and 1b share the same doc store, and 2b and 2c share the same doc store.
Then if index 1 needs to merge 1a and 1b, it can't assume that this
merge is allowed. There would have to be someone on top of the whole
thing who decides that all three segments need to be merged at the same
time, because b is connected to a and c in the two parallel indexes. I
wouldn't like such a restriction very much.
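
The knock-on effect described here can be sketched as a small closure computation. This models only the hypothetical restriction under discussion ("segments that share a doc store must be merged together"), not what Lucene actually does; the class and method names are made up. Segment positions a, b, c are shared by both parallel indexes, and each index contributes its own sharing groups.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model: IF segments sharing a doc store had to merge together,
// which segment positions would a requested merge drag in across all
// parallel indexes?
final class ParallelMergeClosure {
    static Set<String> closure(Set<String> requested, List<Set<String>> sharingGroups) {
        Set<String> result = new TreeSet<>(requested);
        boolean grew = true;
        while (grew) { // repeat until no sharing group adds a new position
            grew = false;
            for (Set<String> group : sharingGroups) {
                if (!Collections.disjoint(result, group) && result.addAll(group)) {
                    grew = true;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> groups = new ArrayList<>();
        groups.add(new TreeSet<>(Arrays.asList("a", "b"))); // index 1: 1a and 1b share a doc store
        groups.add(new TreeSet<>(Arrays.asList("b", "c"))); // index 2: 2b and 2c share a doc store
        Set<String> requested = new TreeSet<>(Arrays.asList("a", "b"));
        System.out.println(closure(requested, groups)); // all three positions end up in the merge
    }
}
```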

We could think about allowing merges like ab->d, even if b,c share the
same doc store. That would mean copying the b part of the shared bc doc
store into the new segment d. Then until c gets deleted the stored docs
of b would be on disk twice and require more disk space temporarily.

Well maybe there is already a solution for all this in the code and I'm
just not aware of it?

Michael




buschmic at gmail

Nov 9, 2009, 9:06 PM

Post #6 of 9
Re: Questions about doc store files (.cfx)

On 11/9/09 5:40 PM, Michael Busch wrote:
> I think that should be ok with parallel indexing, as long as we can
> always select all corresponding segments from *all* parallel indexes
> for a merge to keep the docIds in sync.
>
> That actually leads me to another question: Let's say you have three
> segments a, b, c. b and c share the same doc store. You perform
> deletes on a and b. Then you call expungeDeletes(). Normally that call
> should only merge a and b, because c doesn't have any deletes. But b
> and c have to participate in the same merge, because they share the
> same doc store, right? So would it merge all three segments?
>
> If that's the case (that b and c must be part of the same merge) then
> it would make the parallel indexing more difficult. The reason is that
> if two parallel indexes 1 and 2 can decide on their own how to share
> e.g. doc stores across segments, then we might end up in a situation
> where 1a and 1b share the same doc store, and 2b and 2c share the same
> doc store. Then if index 1 needs to merge 1a and 1b, it can't assume
> that this merge is allowed. There would have to be someone on top of
> the whole thing who decides that all three segments need to be merged
> at the same time, because b is connected to a and c in the two
> parallel indexes. I wouldn't like such a restriction very much.
>
> We could think about allowing merges like ab->d, even if b,c share the
> same doc store. That would mean copying the b part of the shared bc
> doc store into the new segment d. Then until c gets deleted the stored
> docs of b would be on disk twice and require more disk space temporarily.
>

I think this is exactly what happens? I wrote a small test program that
creates a situation like the one described above in the "expungeDeletes"
scenario. It ends up with a docstore containing docs from two segments,
but after expungeDeletes only one segment references the docstore. The
non-deleted docs from the other segment end up in a new segment, so they
are on disk twice (once orphaned in the old docstore, once in the new
segment).
Is that the desired behavior?

Michael



lucene at mikemccandless

Nov 10, 2009, 1:57 AM

Post #7 of 9
Re: Questions about doc store files (.cfx)

On Tue, Nov 10, 2009 at 12:06 AM, Michael Busch <buschmic [at] gmail> wrote:
> I think this is exactly what happens? I wrote a small test program that
> creates a situation like the one described above in the "expungeDeletes"
> scenario. It ends up with a docstore containing docs from two segments, but
> after expungeDeletes only one segment references the docstore. The non-deleted
> docs from the other segment end up in a new segment, so they are on disk
> twice (once orphaned in the old docstore, once in the new segment).
> Is that the desired behavior?

Right, this is what happens -- since segment C wasn't merged, it
remains as the only segment still referencing the shared doc stores,
and, yes, this does result in duplicate storage for some docs (until C
is merged away). IFD (IndexFileDeleter) keeps track of whether a given set of doc stores
is still referenced.
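
The a/b/c scenario from this thread can be walked through with a toy model. This is a sketch under simplifying assumptions, not how IndexWriter or IndexFileDeleter are implemented: it counts whole documents instead of bytes, gives every merged segment a private doc store, and drops a doc store as soon as no segment references it. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy model of shared doc stores and reference tracking (not Lucene code).
final class SharedDocStoreDemo {
    private final Map<String, String> storeOf = new LinkedHashMap<>(); // segment -> doc store
    private final Map<String, Integer> liveDocs = new HashMap<>();     // segment -> live docs
    private final Map<String, Integer> storedDocs = new HashMap<>();   // doc store -> docs on disk

    void addSegment(String segment, String docStore, int docs) {
        storeOf.put(segment, docStore);
        liveDocs.put(segment, docs);
        storedDocs.merge(docStore, docs, Integer::sum);
    }

    // Deletes only mark docs; nothing is removed from the doc store.
    void delete(String segment, int count) {
        liveDocs.merge(segment, -count, Integer::sum);
    }

    // Merge the inputs into a new segment with its own doc store,
    // then drop every doc store that no remaining segment references.
    void merge(String newSegment, String... inputs) {
        int live = 0;
        for (String s : inputs) {
            live += liveDocs.remove(s);
            storeOf.remove(s);
        }
        addSegment(newSegment, newSegment, live);
        Set<String> referenced = new HashSet<>(storeOf.values());
        storedDocs.keySet().retainAll(referenced);
    }

    int storedOnDisk() {
        int sum = 0;
        for (int n : storedDocs.values()) sum += n;
        return sum;
    }

    int live() {
        int sum = 0;
        for (int n : liveDocs.values()) sum += n;
        return sum;
    }

    public static void main(String[] args) {
        SharedDocStoreDemo index = new SharedDocStoreDemo();
        index.addSegment("a", "a", 100);
        index.addSegment("b", "bc", 100); // b and c share doc store "bc"
        index.addSegment("c", "bc", 100);
        index.delete("a", 10);
        index.delete("b", 10);
        index.merge("d", "a", "b");       // what expungeDeletes does in this scenario
        System.out.println("live=" + index.live() + " stored=" + index.storedOnDisk());
        index.merge("e", "c");            // c is merged away, "bc" becomes unreferenced
        System.out.println("live=" + index.live() + " stored=" + index.storedOnDisk());
    }
}
```

After the first merge the model holds more stored docs than live docs, because b's documents still sit in the "bc" store that c keeps alive; once c is merged away the two counts match again.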

I think in practice this should not result in too much duplication.
If C is large, it's likely to have accumulated deletes as well. If C
is small, it's likely to get merged away in the course of normal
merging.

But, if we are really concerned with it, we could modify the merge
policy to bias its selection on this ("remove stores that are wasting
too much space") basis.

I think this makes the parallel index's job simpler, right? Ie, how
the segments are sharing the stores within their own index does not
restrict what merging is done.

Mike



buschmic at gmail

Nov 10, 2009, 10:18 AM

Post #8 of 9
Re: Questions about doc store files (.cfx)

On 11/10/09 1:57 AM, Michael McCandless wrote:
>
>> I think this is exactly what happens? I wrote a small test program that
>> creates a situation like the one described above in the "expungeDeletes"
>> scenario. It ends up with a docstore containing docs from two segments, but
>> after expungeDeletes only one segment references the docstore. The non-deleted
>> docs from the other segment end up in a new segment, so they are on disk
>> twice (once orphaned in the old docstore, once in the new segment).
>> Is that the desired behavior?
>>
> Right, this is what happens -- since segment C wasn't merged, it
> remains as the only segment still referencing the shared doc stores,
> and, yes, this does result in duplicate storage for some docs (until C
> is merged away). IFD keeps track of whether a given set of doc stores
> is still referenced.
>
>

OK, thanks for clarifying!

> I think in practice this should not result in too much duplication.
> If C is large, it's likely to have accumulated deletes as well. If C
> is small, it's likely to get merged away in the course of normal
> merging.
>
>

I agree - it shouldn't happen very often. I just wasn't sure what the
current behavior in this corner case was and wanted to understand it.

> But, if we are really concerned with it, we could modify the merge
> policy to bias its selection on this ("remove stores that are wasting
> too much space") basis.
>

I'm not too concerned, because I also don't think this should happen
very often.

> I think this makes the parallel index's job simpler, right? Ie, how
> the segments are sharing the stores within their own index does not
> restrict what merging is done.
>
>

Yes exactly. It won't prevent us from keeping the parallel indexes
independent in this regard.

Then the compound (.cfx and .cfs) files are rather orthogonal to this. I
talked to Marvin at ApacheCon; in Lucy he wants to have all the compound
file support in the store package, separately from the indexer. I think
that would make sense in Lucene too; there's no real need to have
it tightly integrated in the IndexWriter and SegmentMerger. We can
generalize the compound file concept further, so that with parallel
indexes the files can be selected in either direction for inclusion in a
compound file.

E.g. if we separated the inverted index and store, so that they are
logically two parallel index components, then the .cfx file as it works
now would contain files from two parallel index components (term vectors
from inverted index, stored fields from the store). This is fine if you
don't want to update those components individually and can remain this
way for the default IndexWriter implementation. But if we generalize the
compound concept, then people can alter this behavior to better suit
their update requirements.
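
One way to picture "selecting files in either direction" is the sketch below. It is only an illustration of the idea, not a proposed API: the index is treated as a matrix whose rows are segments and whose columns are parallel components, and a compound file collects either one row or one column. The names (CompoundGrouping, Direction, the "segment.component" file labels) are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: groups the cells of a segment x component matrix
// into compound files, either per segment (row) or per component (column).
final class CompoundGrouping {
    enum Direction { BY_SEGMENT, BY_COMPONENT }

    static Map<String, List<String>> group(List<String> segments,
                                           List<String> components,
                                           Direction direction) {
        Map<String, List<String>> compounds = new TreeMap<>();
        for (String segment : segments) {
            for (String component : components) {
                String key = (direction == Direction.BY_SEGMENT) ? segment : component;
                compounds.computeIfAbsent(key, k -> new ArrayList<>())
                         .add(segment + "." + component);
            }
        }
        return compounds;
    }

    public static void main(String[] args) {
        List<String> segments = Arrays.asList("s1", "s2");
        List<String> components = Arrays.asList("inverted", "store");
        System.out.println(group(segments, components, Direction.BY_SEGMENT));
        System.out.println(group(segments, components, Direction.BY_COMPONENT));
    }
}
```

Grouping by component keeps each component's files together across segments, which is the arrangement that would suit updating one component without touching the other.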

I think this would actually be a very clean design (even though it might
sound complicated here).



lucene at mikemccandless

Nov 10, 2009, 10:46 AM

Post #9 of 9
Re: Questions about doc store files (.cfx)

On Tue, Nov 10, 2009 at 1:18 PM, Michael Busch <buschmic [at] gmail> wrote:

> I talked to Marvin at ApacheCon; in Lucy he wants to have all the compound
> file support in the store package, separately from the indexer. I think that
> would make sense in Lucene too; there's no real need to have it
> tightly integrated in the IndexWriter and SegmentMerger. We can generalize
> the compound file concept further, so that with parallel indexes the files
> can be selected in either direction for inclusion in a compound file.
>
> E.g. if we separated the inverted index and store, so that they are
> logically two parallel index components, then the .cfx file as it works now
> would contain files from two parallel index components (term vectors from
> inverted index, stored fields from the store). This is fine if you don't
> want to update those components individually and can remain this way for the
> default IndexWriter implementation. But if we generalize the compound
> concept, then people can alter this behavior to better suit their update
> requirements.
>
> I think this would actually be a very clean design (even though it might
> sound complicated here).

This sounds great!

Mike
