
Mailing List Archive: Lucene: Java-Dev

possible segment merge improvement?

 

 



rengels at ix

Oct 31, 2007, 9:28 PM

Post #1 of 14
possible segment merge improvement?

Currently, when merging segments, every document is parsed and then
rewritten, since the field numbers may differ between the segments
(compressed data is not uncompressed in the latest versions).

It would seem that in many (if not most) uses of Lucene, the fields
stored within each document in an index are relatively static,
probably changing for all documents added after point X, if at all.

Why not check the fields dictionary for the segments being merged,
and if the same, just copy the binary data directly?

In the common case this should be a vast improvement.

Anyone worked on anything like this? Am I missing something?

Robert Engels



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


chenjian1227 at gmail

Oct 31, 2007, 10:30 PM

Post #2 of 14
Re: possible segment merge improvement?

Hi, Robert,

That's a brilliant idea! Thanks so much for suggesting that.

Cheers,

Jian



rengels at ix

Oct 31, 2007, 11:06 PM

Post #3 of 14
Re: possible segment merge improvement?

It seems that the following are needed:

FieldInfos.hashCode(); // to allow for fast equals failure
FieldInfos.equals();

for most efficient buffer reuse during merge to avoid GC, add

int FieldsReader.doclength(int doc);
int FieldsReader.binarydoc(int doc,byte[] buffer);

this will allow the caller to reuse the existing buffer if large
enough, or create a new one

and

FieldsWriter.addBinaryDocument(byte[] buffer,int len);

All of the above methods are trivial.

SegmentMerger just needs to be changed to compare the readers to be
merged, and if all have equal FieldInfos, then use a short circuit
copy similar to

byte[] buffer = new byte[1024];

for each reader {
  for doc in reader {
    if doc not deleted {
      int len = reader.doclength(doc);
      if (len > buffer.length) {
        buffer = new byte[len*2]; // allow for growth
      }
      reader.binarydoc(doc, buffer);
      newsegment.addBinaryDocument(buffer, len);
    }
  }
}
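
The loop above can be sketched as runnable Java. This is only an illustration: `RawDocReader` and `InMemoryReader` are hypothetical stand-ins for the proposed FieldsReader additions (they are not Lucene classes), and a `ByteArrayOutputStream` stands in for the new segment's field data file.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the proposed FieldsReader methods; not Lucene's API. */
interface RawDocReader {
    int maxDoc();
    boolean isDeleted(int doc);
    int length(int doc);              // size of the stored record in bytes
    int doc(int doc, byte[] buffer);  // copies the record into buffer, returns its length
}

/** In-memory reader used only to exercise the copy loop. */
final class InMemoryReader implements RawDocReader {
    private final List<byte[]> docs = new ArrayList<>();
    private final List<Boolean> deleted = new ArrayList<>();

    InMemoryReader add(byte[] record, boolean isDeleted) {
        docs.add(record);
        deleted.add(isDeleted);
        return this;
    }

    public int maxDoc() { return docs.size(); }
    public boolean isDeleted(int doc) { return deleted.get(doc); }
    public int length(int doc) { return docs.get(doc).length; }

    public int doc(int doc, byte[] buffer) {
        byte[] src = docs.get(doc);
        System.arraycopy(src, 0, buffer, 0, src.length);
        return src.length;
    }
}

final class RawCopyMerge {
    /** Copies every non-deleted record to out; returns the number of documents copied. */
    static int copyLiveDocs(List<RawDocReader> readers, ByteArrayOutputStream out) {
        byte[] buffer = new byte[1024];   // reused across documents to limit garbage
        int docCount = 0;
        for (RawDocReader reader : readers) {
            for (int doc = 0; doc < reader.maxDoc(); doc++) {
                if (reader.isDeleted(doc)) {
                    continue;             // skip deleted docs
                }
                int len = reader.length(doc);
                if (len > buffer.length) {
                    buffer = new byte[len * 2];   // allow for growth
                }
                reader.doc(doc, buffer);
                out.write(buffer, 0, len);
                docCount++;
            }
        }
        return docCount;
    }
}
```

The buffer is only reallocated when a record is larger than anything seen so far, so a merge of many small documents allocates once.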





rengels at ix

Oct 31, 2007, 11:30 PM

Post #4 of 14
Re: possible segment merge improvement?

Actually, slightly better signatures would use method overloading:

// length of document in bytes
int FieldsReader.length(int doc);
// read a formatted document into a buffer
int FieldsReader.doc(int doc, byte[] buffer);

// write an already formatted document from a buffer
void FieldsWriter.addDocument(byte[] buffer, int len);




lucene at mikemccandless

Nov 1, 2007, 3:04 AM

Post #5 of 14
Re: possible segment merge improvement?

"robert engels" <rengels [at] ix> wrote:

> Why not check the fields dictionary for the segments being merged,
> and if the same, just copy the binary data directly?

+1

While Lucene does not have a global field schema/semantics, unlike eg
KinoSearch, I think for many apps the fields are in fact static.

In KinoSearch, merging of stored fields & term vectors is always a
fast concatenation of the entry for that document, whereas Lucene must
re-interpret/re-number all fields on the doc, in general. In fact I
think that KinoSearch stores field names directly in the index (ie,
not numbers).

If we make this change to Lucene then for those apps that effectively
have a static field schema (because all docs always have matching
fields), we can get the same performance that KinoSearch always gets
during its merging of stored fields & term vectors. For all other
apps we must continue to re-interpret each field on each document.

Mike



yonik at apache

Nov 1, 2007, 7:10 AM

Post #6 of 14
Re: possible segment merge improvement?

On 11/1/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> "robert engels" <rengels [at] ix> wrote:
>
> > Why not check the fields dictionary for the segments being merged,
> > and if the same, just copy the binary data directly?
>
> +1
>
> While Lucene does not have a global field schema/semantics, unlike eg
> KinoSearch, I think for many apps the fields are in fact static.
>
> In KinoSearch, merging of stored fields & term vectors is always a
> fast concatenation of the entry for that document, whereas Lucene must
> re-interpret/re-number all fields on the doc, in general. In fact I
> think that KinoSearch stores field names directly in the index (ie,
> not numbers).
>
> If we make this change to Lucene then for those apps that effectively
> have a static field schema (because all docs always have matching
> fields), we can get the same performance that KinoSearch always gets
> during its merging of stored fields & term vectors.

Does "all docs have matching fields" mean that the fields must be
present (as well as identically typed) on each doc, or could they
still be sparse? If they can be sparse, how do you avoid
renumbering???

-Yonik



DORONC at il

Nov 1, 2007, 7:33 AM

Post #7 of 14
Re: possible segment merge improvement?

yseeley [at] gmail wrote on 01/11/2007 16:10:27:

> > If we make this change to Lucene then for those apps that effectively
> > have a static field schema (because all docs always have matching
> > fields), we can get the same performance that KinoSearch always gets
> > during its merging of stored fields & term vectors.
>
> Does "all docs have matching fields" mean that the fields must be
> present (as well as identically typed) on each doc, or could they
> still be sparse? If they can be sparse, how do you avoid
> renumbering???

Perhaps I interpreted this optimization proposal wrong.

My understanding is that this is for stored fields data
in the field data (.fdt) file, where FieldNum might
need to be changed, in:

DocFieldData --> FieldCount, <FieldNum, Bits, Value>^FieldCount

My reading of Robert's suggestion is that when we know that
the FieldInfos of the resulting segment is identical to the
FieldInfos of a certain (sub)segment being merged, then
there is no need to parse and rewrite the field data for all
docs of that (sub)segment; rather, they can be written as is.
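
To illustrate why a changed FieldNum forces a parse of each record, here is a minimal Java sketch. It assumes a deliberately simplified record layout of single bytes (field count, then field number, value length and value bytes per field); the real .fdt encoding differs, and the class and method names are hypothetical.

```java
final class FieldRecordRemap {
    /**
     * Rewrites the field numbers of one record using oldToNew[oldNum] = newNum.
     * Simplified layout (an assumption, not the real .fdt format):
     * [fieldCount] then, per field, [fieldNum][valueLen][value bytes...].
     * All numbers and lengths are assumed to fit in one non-negative byte.
     */
    static byte[] remap(byte[] record, int[] oldToNew) {
        byte[] out = record.clone();
        int pos = 0;
        int fieldCount = record[pos++];
        for (int i = 0; i < fieldCount; i++) {
            out[pos] = (byte) oldToNew[record[pos]];  // renumber this field
            pos++;
            int valueLen = record[pos++];
            pos += valueLen;                          // must walk past the value to find the next field
        }
        return out;
    }

    /** True when every field keeps its number, i.e. the segments' numbering matches. */
    static boolean isIdentity(int[] oldToNew) {
        for (int i = 0; i < oldToNew.length; i++) {
            if (oldToNew[i] != i) {
                return false;
            }
        }
        return true;
    }

    /** Returns the record untouched when no renumbering is needed. */
    static byte[] merge(byte[] record, int[] oldToNew) {
        return isIdentity(oldToNew) ? record : remap(record, oldToNew);
    }
}
```

When the mapping is the identity, `merge` never looks inside the record, which is the short circuit the proposal relies on.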

Doron




yonik at apache

Nov 1, 2007, 8:47 AM

Post #8 of 14
Re: possible segment merge improvement?

On 11/1/07, Doron Cohen <DORONC [at] il> wrote:
> My reading of Robert's suggestion is that when we know that
> FieldInfos of the resulted segment is identical to the
> FieldInfos of a certain (sub) segment being merged then
> there is no need to parse+rewrite the field data for all
> docs of that (sub)segment, rather they can be written as is.

Ah right... so for sparse fields it really depends on the order
documents were added to the segment I imagine.
If a document w/o all fields is added first, I guess the field numbers
would be different in the segments. Also, people should take care to
add fields in the same order (first doc in the segment will define the
fieldname->fieldnumber ordering I think)
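
The ordering effect described here can be shown with a small Java sketch. It assumes, as the message suggests, that a field gets the next free number the first time it is seen in a segment; `FirstSeenNumbering` is a hypothetical helper, not a Lucene class.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FirstSeenNumbering {
    /**
     * Assigns field numbers in order of first appearance.
     * Each inner list holds the field names of one document, in the order added.
     */
    static Map<String, Integer> number(List<List<String>> docs) {
        Map<String, Integer> numbers = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String field : doc) {
                // only the first occurrence of a name allocates a number
                numbers.putIfAbsent(field, numbers.size());
            }
        }
        return numbers;
    }
}
```

With a segment whose first document has fields (title, body), the mapping is title=0, body=1. With a segment whose first document has only body and whose second has (title, body), the mapping is body=0, title=1, so the two segments hold the same fields but cannot be byte-copied into each other.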

-Yonik



marvin at rectangular

Nov 1, 2007, 9:46 AM

Post #9 of 14
Re: possible segment merge improvement?

On Nov 1, 2007, at 3:04 AM, Michael McCandless wrote:

> In KinoSearch, merging of stored fields & term vectors is always a
> fast concatenation of the entry for that document, whereas Lucene must
> re-interpret/re-number all fields on the doc, in general. In fact I
> think that KinoSearch stores field names directly in the index (ie,
> not numbers).

Yes, that's right. <http://xrl.us/73dx> (Link to
mail-archives.apache.org)

Ferret and KS had both previously implemented Robert's suggested mod,
where no remaps take place if field numbers can be matched up. KS
also expended extra effort to keep field numbers consistent (and I
think Ferret did too) -- but the possibility that we would have to
remap couldn't ever be eliminated.

Going with field names rather than numbers allowed KS to eliminate a
big chunk of code. For the price of a small increase in index size,
the segment merging process for stored fields and term vectors got
much simpler. No more parsing, no more remapping -- it became
possible to read the record naively as one chunk and copy it, no
matter what.
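
A rough Java sketch of the idea of keying stored values by field name follows. The encoding here is invented for illustration (it is not KinoSearch's or Lucene's on-disk format), and `NameKeyedRecords` is a hypothetical class. Because each record carries its own field names, merging is plain concatenation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NameKeyedRecords {
    /** Encodes one document as: field count, then (name, value) string pairs. */
    static byte[] encode(Map<String, String> fields) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(fields.size());
            for (Map.Entry<String, String> e : fields.entrySet()) {
                out.writeUTF(e.getKey());    // the name itself, not a segment-local number
                out.writeUTF(e.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** "Merges" records from any segments by copying them as opaque chunks. */
    static byte[] concat(byte[]... records) {
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        for (byte[] record : records) {
            merged.write(record, 0, record.length);
        }
        return merged.toByteArray();
    }

    /** Decodes a concatenation of records back into documents. */
    static List<Map<String, String>> decodeAll(byte[] merged) {
        List<Map<String, String>> docs = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(merged))) {
            while (in.available() > 0) {
                int count = in.readInt();
                Map<String, String> fields = new LinkedHashMap<>();
                for (int i = 0; i < count; i++) {
                    String name = in.readUTF();
                    fields.put(name, in.readUTF());
                }
                docs.add(fields);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }
}
```

The cost is the one mentioned above: every record repeats its field names, so the data grows somewhat in exchange for a merge that needs no parsing.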

If Lucene were to go this route, my suggestion would be to start a
new subclass of FieldsWriter that uses different index extensions.
(KS uses .ds and .dsx: "Document Storage".) Individual
SegmentReaders can then decide which subclass to use based on which
files are detected.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





marvin at rectangular

Nov 1, 2007, 9:49 AM

Post #10 of 14
Re: possible segment merge improvement?

On Nov 1, 2007, at 7:10 AM, Yonik Seeley wrote:

> Does "all docs have matching fields" mean that the fields must be
> present (as well as identically typed) on each doc, or could they
> still be sparse? If they can be sparse, how do you avoid
> renumbering???

The fields still get renumbered, but if you key values off of field
names in the files, these parts of the index don't know anything
about field numbers.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





rengels at ix

Nov 1, 2007, 10:27 AM

Post #11 of 14
Re: possible segment merge improvement?

I have looked into modifying FieldInfos to keep the fields sorted by
field name, so the user would not be forced to add the fields in the
same order.

Sparse documents are really not a problem: after the first merge of
such a document, it will pick up the other fields from the other
segments, after which it will merge "as the same".

I had to add getFieldInfos() to SegmentReader to make all of this
work. I did not need to modify FieldInfos or FieldInfo - I do the
equality checks in SegmentMerger, and only perform them once.

Code looks as follows:

private final int mergeFields() throws IOException {
  fieldInfos = new FieldInfos(); // merge field names
  int docCount = 0;
  for (int i = 0; i < readers.size(); i++) {
    IndexReader reader = (IndexReader) readers.elementAt(i);
    if (reader instanceof SegmentReader) {
      SegmentReader sreader = (SegmentReader) reader;
      for (int j = 0; j < sreader.getFieldInfos().size(); j++) {
        FieldInfo fi = sreader.getFieldInfos().fieldInfo(j);
        fieldInfos.add(fi.name, fi.isIndexed, fi.storeTermVector,
            fi.storePositionWithTermVector, fi.storeOffsetWithTermVector,
            !reader.hasNorms(fi.name));
      }
    } else {
      addIndexed(reader, fieldInfos,
          reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION_OFFSET),
          true, true, true);
      addIndexed(reader, fieldInfos,
          reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION),
          true, true, false);
      addIndexed(reader, fieldInfos,
          reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_OFFSET),
          true, false, true);
      addIndexed(reader, fieldInfos,
          reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR),
          true, false, false);
      addIndexed(reader, fieldInfos,
          reader.getFieldNames(IndexReader.FieldOption.INDEXED),
          false, false, false);
      fieldInfos.add(
          reader.getFieldNames(IndexReader.FieldOption.UNINDEXED), false);
    }
  }
  fieldInfos.write(directory, segment + ".fnm");

  SegmentReader[] sreaders = new SegmentReader[readers.size()];
  for (int i = 0; i < readers.size(); i++) {
    IndexReader reader = (IndexReader) readers.elementAt(i);
    boolean same = reader.getFieldNames().size() == fieldInfos.size()
        && reader instanceof SegmentReader;
    if (same) {
      SegmentReader sreader = (SegmentReader) reader;
      for (int j = 0; same && j < fieldInfos.size(); j++) {
        same = fieldInfos.fieldName(j).equals(
            sreader.getFieldInfos().fieldName(j));
      }
      if (same)
        sreaders[i] = sreader;
    }
  }

  byte[] buffer = new byte[1024];

  // merge field values
  FieldsWriter fieldsWriter = new FieldsWriter(directory, segment,
      fieldInfos);

  try {
    for (int i = 0; i < readers.size(); i++) {
      IndexReader reader = (IndexReader) readers.elementAt(i);
      SegmentReader sreader = sreaders[i];
      int maxDoc = reader.maxDoc();
      for (int j = 0; j < maxDoc; j++)
        if (!reader.isDeleted(j)) { // skip deleted docs
          if (sreader != null) {
            int len = sreader.length(j);
            if (len > buffer.length) {
              buffer = new byte[len * 2];
            }
            sreader.document(buffer, j, len);
            fieldsWriter.addDocument(buffer, len);
          } else {
            fieldsWriter.addDocument(reader.document(j));
          }
          docCount++;
        }
    }
  } finally {
    fieldsWriter.close();
  }
  return docCount;
}




lucene at mikemccandless

Nov 2, 2007, 11:23 AM

Post #12 of 14
Re: possible segment merge improvement?

OK, I got Robert's optimization working on the current trunk ... I
will open a Jira issue with the patch.

Mike



yonik at apache

Nov 2, 2007, 11:40 AM

Post #13 of 14
Re: possible segment merge improvement?

On 11/1/07, robert engels <rengels [at] ix> wrote:
> I have looked into modifying FieldInfos to keep the fields sorted by
> field name, so the user would not be forced to add the fields in the
> same order.
>
> Sparse documents are really not a problem. Since after the first
> merge of that document it will pickup the other fields from the other
> segments, after which it will merge "as the same".

Only when the field numbers happen to match up though, right?
There could be number mismatches far after the first merge, depending
on what fields were encountered first in those segments.

Aside: renumbering fields is another area where using byte counts
instead of char counts should really speed things up.

-Yonik



rengels at ix

Nov 2, 2007, 11:50 AM

Post #14 of 14
Re: possible segment merge improvement?

Sort of (if I understand you).

Eventually the segments (after merging) converge to having the same
fields in the same order.

New segments are mostly merged only with other new segments (which
probably have the same fields).

When a "newer" segment is merged with an "older" one, you will not be
able to optimize the process (some complex change/mapping code might
be able to do a better job than the current brute-force read all /
write all method).

If the fields were always kept sorted, you would have a better chance
of having the fields dictionaries of the various segments match up.
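
A small Java sketch of the sorted-dictionary idea, under the assumption that field numbers are simply positions in the name-sorted list; `SortedNumbering` is a hypothetical helper and not how FieldInfos currently assigns numbers.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

final class SortedNumbering {
    /** Numbers fields by their position in sorted name order, ignoring insertion order. */
    static Map<String, Integer> number(Collection<String> fieldNames) {
        Map<String, Integer> numbers = new LinkedHashMap<>();
        for (String name : new TreeSet<>(fieldNames)) {  // TreeSet sorts and de-duplicates
            numbers.put(name, numbers.size());
        }
        return numbers;
    }
}
```

Two segments that saw (title, body) and (body, title) both end up with body=0, title=1, so their dictionaries compare equal. A segment that has an extra field still gets a different numbering, which is the "newer merged with older" case that cannot be short-circuited.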

At least for us, our fields dictionaries are VERY static, and
constant across all documents (we partition different document types
into separate indexes), so this optimization is a big help.


