
lucene at mikemccandless
Nov 2, 2007, 11:23 AM
Post #12 of 14
OK, I got Robert's optimization working on the current trunk ... I
will open a Jira issue with the patch.

Mike

"robert engels" <rengels [at] ix> wrote:

> I have looked into modifying FieldInfos to keep the fields sorted by
> field name, so the user would not be forced to add the fields in the
> same order.
>
> Sparse documents are really not a problem, since after the first
> merge of that document it will pick up the other fields from the other
> segments, after which it will merge "as the same".
>
> I had to add getFieldInfos() to SegmentReader to make all of this
> work. I did not need to modify FieldInfos or FieldInfo - I do the
> equality checks in SegmentMerger, and only perform them once.
>
> Code looks as follows:
>
> private final int mergeFields() throws IOException {
>   fieldInfos = new FieldInfos(); // merge field names
>   int docCount = 0;
>   for (int i = 0; i < readers.size(); i++) {
>     IndexReader reader = (IndexReader) readers.elementAt(i);
>     if (reader instanceof SegmentReader) {
>       SegmentReader sreader = (SegmentReader) reader;
>       for (int j = 0; j < sreader.getFieldInfos().size(); j++) {
>         FieldInfo fi = sreader.getFieldInfos().fieldInfo(j);
>         fieldInfos.add(fi.name, fi.isIndexed, fi.storeTermVector,
>             fi.storePositionWithTermVector,
>             fi.storeOffsetWithTermVector,
>             !reader.hasNorms(fi.name));
>       }
>     } else {
>       addIndexed(reader, fieldInfos, reader.getFieldNames(
>           IndexReader.FieldOption.TERMVECTOR_WITH_POSITION_OFFSET),
>           true, true, true);
>       addIndexed(reader, fieldInfos, reader.getFieldNames(
>           IndexReader.FieldOption.TERMVECTOR_WITH_POSITION),
>           true, true, false);
>       addIndexed(reader, fieldInfos, reader.getFieldNames(
>           IndexReader.FieldOption.TERMVECTOR_WITH_OFFSET),
>           true, false, true);
>       addIndexed(reader, fieldInfos, reader.getFieldNames(
>           IndexReader.FieldOption.TERMVECTOR), true, false, false);
>       addIndexed(reader, fieldInfos, reader.getFieldNames(
>           IndexReader.FieldOption.INDEXED), false, false, false);
>       fieldInfos.add(reader.getFieldNames(
>           IndexReader.FieldOption.UNINDEXED), false);
>     }
>   }
>   fieldInfos.write(directory, segment + ".fnm");
>
>   SegmentReader[] sreaders = new SegmentReader[readers.size()];
>   for (int i = 0; i < readers.size(); i++) {
>     IndexReader reader = (IndexReader) readers.elementAt(i);
>     boolean same = reader.getFieldNames().size() == fieldInfos.size()
>         && reader instanceof SegmentReader;
>     if (same) {
>       SegmentReader sreader = (SegmentReader) reader;
>       for (int j = 0; same && j < fieldInfos.size(); j++) {
>         same = fieldInfos.fieldName(j).equals(
>             sreader.getFieldInfos().fieldName(j));
>       }
>       if (same)
>         sreaders[i] = sreader;
>     }
>   }
>
>   byte[] buffer = new byte[1024];
>
>   // merge field values
>   FieldsWriter fieldsWriter = new FieldsWriter(directory, segment,
>       fieldInfos);
>
>   try {
>     for (int i = 0; i < readers.size(); i++) {
>       IndexReader reader = (IndexReader) readers.elementAt(i);
>       SegmentReader sreader = sreaders[i];
>       int maxDoc = reader.maxDoc();
>       for (int j = 0; j < maxDoc; j++)
>         if (!reader.isDeleted(j)) { // skip deleted docs
>           if (sreader != null) {
>             int len = sreader.length(j);
>             if (len > buffer.length) {
>               buffer = new byte[len * 2];
>             }
>             sreader.document(buffer, j, len);
>             fieldsWriter.addDocument(buffer, len);
>           } else {
>             fieldsWriter.addDocument(reader.document(j));
>           }
>           docCount++;
>         }
>     }
>   } finally {
>     fieldsWriter.close();
>   }
>   return docCount;
> }
>
>
> On Nov 1, 2007, at 10:47 AM, Yonik Seeley wrote:
>
> > On 11/1/07, Doron Cohen <DORONC [at] il> wrote:
> >> My reading of Robert's suggestion is that when we know that
> >> FieldInfos of the resulted segment is identical to the
> >> FieldInfos of a certain (sub) segment being merged then
> >> there is no need to parse+rewrite the field data for all
> >> docs of that (sub)segment, rather they can be written as is.
> >
> > Ah right... so for sparse fields it really depends on the order
> > documents were added to the segment I imagine.
> > If a document w/o all fields is added first, I guess the field numbers
> > would be different in the segments. Also, people should take care to
> > add fields in the same order (first doc in the segment will define the
> > fieldname->fieldnumber ordering I think)
> >
> > -Yonik
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
> > For additional commands, e-mail: java-dev-help [at] lucene
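
For readers following the thread, here is a minimal standalone sketch of the ordering issue Yonik and Robert describe. It does not use Lucene. The class FieldOrderSketch and the methods numberFields and canCopyRaw are hypothetical names invented for illustration, and the sketch assumes field numbers are assigned in first-seen order and compares field names only (the real FieldInfos also carries per-field flags, which are ignored here).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the fieldname->fieldnumber mapping; not Lucene API.
final class FieldOrderSketch {

    // Assigns field numbers in first-seen order: the index in the returned
    // list plays the role of the field number.
    static List<String> numberFields(List<List<String>> fieldLists) {
        List<String> byNumber = new ArrayList<>();
        for (List<String> fields : fieldLists) {
            for (String name : fields) {
                if (!byNumber.contains(name)) {
                    byNumber.add(name);
                }
            }
        }
        return byNumber;
    }

    // Mirrors the "same" check in the quoted mergeFields(): equal size and
    // equal field name at every field number.
    static boolean canCopyRaw(List<String> segment, List<String> merged) {
        return segment.equals(merged);
    }

    public static void main(String[] args) {
        // Segment A: the first document has both fields.
        List<String> segA = numberFields(Arrays.asList(
                Arrays.asList("title", "body"), Arrays.asList("title")));
        // Segment B: a sparse document comes first, so the numbering flips.
        List<String> segB = numberFields(Arrays.asList(
                Arrays.asList("body"), Arrays.asList("title", "body")));
        // The merged numbering is built by adding each segment's fields in turn.
        List<String> merged = numberFields(Arrays.asList(segA, segB));

        System.out.println(merged);                   // [title, body]
        System.out.println(canCopyRaw(segA, merged)); // true
        System.out.println(canCopyRaw(segB, merged)); // false
    }
}
```

In this toy model segment A could be copied without re-parsing its stored fields, while segment B would fall back to the document-by-document path. Keeping field names sorted, as Robert mentions considering, would give both segments the same numbering regardless of insertion order.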