
Mailing List Archive: Lucene: Java-User

Lucene Document order not being maintained?


daniel.armbrust.list at gmail

Apr 5, 2006, 8:24 AM

Post #1 of 19
Lucene Document order not being maintained?

I'm using Lucene 1.9.1, and I'm seeing some odd behavior that I hope
someone can help me with.

My application counts on Lucene maintaining the order of the documents
exactly the same as how I insert them. Lucene is supposed to maintain
document order, even across index merges, correct?

My indexing process works as follows (some of this is a holdover from
the time before Lucene had a compound file format, so bear with me):

I open a file-based index using a merge factor of 90 and, in my
current test, the compound index format. When I have added 100,000
documents, I close this index, and start on a new index. I continue
this until I'm done with all of the documents. Then, as a last step, I
open up a new empty index, and I call addIndexes(Directory[]) - and I
pass in the directories in the same order that I created them.


This allows me to use higher merge factors without running into file
handle issues, and without having to call optimize.

The problem I am seeing right now is that, when I look into my
large combined index with Luke, document number 899 is the 899th
document that I added. However, Document 900 is the 49860th document
that I added. This continues until Document 910, where it suddenly
jumps to the 99720th document.

Is this a bug, or am I misusing something in the API?

Thanks,

Dan


--
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


hossman_lucene at fucit

Apr 5, 2006, 10:03 AM

Post #2 of 19
Re: Lucene Document order not being maintained? [In reply to]

: exactly the same as how I insert them. Lucene is supposed to maintain
: document order, even across index merges, correct?

Lucene definitely maintains index order for document additions -- but I
don't know if any similar claim has been made about merging whole indexes.

: this until I'm done with all of the documents. Then, as a last step, I
: open up a new empty index, and I call addIndexes(Directory[]) - and I
: pass in the directories in the same order that I created them.
...
: The problem that I am seeing right now, is that when I look into my
: large combined index with Luke, Document number 899 is the 899th
: document that I added. However, Document 900 is the 49860th document
: that I added. This continues until Document 910, where it suddenly
: jumps to the 99720th document.

As I said, I'm not sure if it's a bug or undefined behavior, but
can you post a self contained JUnit test demonstrating this? -- that way
people can look at exactly what is going on (if it is a bug).




-Hoss



daniel.armbrust.list at gmail

Apr 5, 2006, 1:08 PM

Post #3 of 19
Re: Lucene Document order not being maintained? [In reply to]

Well, I set out to write a JUnit test case to quickly show this... but
I'm having a heck of a time doing it. With relatively small numbers of
documents containing very few fields... I haven't been able to recreate
the out-of-order problem. However, with my real process, with a ton
more data, I can recreate it every single time I index (it even gets the
same documents out of order, consistently).

I'll continue to try to generate a test case that gets the docs out of
order... but if someone in the know could answer authoritatively whether
or not Lucene is supposed to maintain document order when you merge
multiple indexes together, that would be great.

Thanks,

Dan


yseeley at gmail

Apr 5, 2006, 1:17 PM

Post #4 of 19
Re: Lucene Document order not being maintained? [In reply to]

On 4/5/06, Dan Armbrust <daniel.armbrust.list [at] gmail> wrote:
> I'll continue to try to generate a test case that gets the docs out of
> order... but if someone in the know could answer authoritatively whether

I browsed the code for IndexWriter.addIndexes(Dir[]), and it looks
like it should preserve order.
The directories are added in order, and the segments for each
directory are added in order. The merging code is shared, so that
shouldn't do anything different than normal segment merges.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server


yseeley at gmail

Apr 5, 2006, 1:21 PM

Post #5 of 19
Re: Lucene Document order not being maintained? [In reply to]

On 4/5/06, Dan Armbrust <daniel.armbrust.list [at] gmail> wrote:
> I haven't been able to recreate
> the out-of-order problem. However, with my real process, with a ton
> more data, I can recreate it every single time I index (it even gets the
> same documents out of order, consistently).

If you have enough file handles, you can test if it's a Lucene problem
or your app by opening a MultiReader over all the indexes and testing
if the documents are in the order you think they are *before* merging.

-Yonik


daniel.armbrust.list at gmail

Apr 5, 2006, 1:48 PM

Post #6 of 19
Re: Lucene Document order not being maintained? [In reply to]

Yonik Seeley wrote:
> On 4/5/06, Dan Armbrust <daniel.armbrust.list [at] gmail> wrote:
>> I'll continue to try to generate a test case that gets the docs out of
>> order... but if someone in the know could answer authoritatively whether
>
> I browsed the code for IndexWriter.addIndexes(Dir[]), and it looks
> like it should preserve order.
> The directories are added in order, and the segments for each
> directory are added in order. The merging code is shared, so that
> shouldn't do anything different than normal segment merges.

Thanks for checking, Yonik. I'm fairly certain that this is a Lucene bug
then - I will try to come up with a reproducible test case.

My load code is pretty simple... whenever I create a new document, I put
in a field that contains a counter of the load order.

When I look at the individual indexes, things are fine - but after it
merges them, I get a significant percentage of documents which have been
reordered.
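
A minimal sketch of the kind of check described above, assuming the load-order counters have already been read back from the index in document-number order. The class and method names are hypothetical and no Lucene API is used; how the counters are read from the stored field is left out.

```java
// Hypothetical helper, not part of Lucene: given the load-order counters read
// back in document-number order, report where they first stop increasing.
final class LoadOrderCheck {

    /**
     * Returns the position of the first counter that is not greater than its
     * predecessor, or -1 if the counters are strictly increasing.
     */
    static int firstOutOfOrder(long[] counters) {
        for (int i = 1; i < counters.length; i++) {
            if (counters[i] <= counters[i - 1]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] perIndex = {1, 2, 3, 4, 5};              // an individual index: in order
        long[] merged = {898, 899, 49860, 49861, 900};  // illustrative values only
        System.out.println(firstOutOfOrder(perIndex));  // prints -1
        System.out.println(firstOutOfOrder(merged));    // prints 4
    }
}
```

Running the same check on each individual index and then on the merged index would show whether the reordering appears only after the merge.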

One other thing I can look into - I've been building these indexes on a
64-bit Linux machine, using a 64-bit JVM. I need to see if the same
error happens on 32-bit Windows....

Dan


hossman_lucene at fucit

Apr 5, 2006, 1:57 PM

Post #7 of 19
Re: Lucene Document order not being maintained? [In reply to]

: Well, I set out to write JUnit test case to quickly show this... but
: I'm having a heck of a time doing it. With relatively small numbers of
: documents containing very few fields... I haven't been able to recreate
: the out-of-order problem. However, with my real process, with a ton
: more data, I can recreate it every single time I index (it even gets the
: same documents out of order, consistently).

It's very possible that the problem is specific to large numbers of
documents/indexes, or that it's specific to FSDirectory - so if you can't
reproduce it with a handful of docs on a RAMDirectory, don't shy away from
making a test case that creates ten 1GB indexes in ./test-doc-order-on-merge
or something like that, if it's the only way to reproduce the problem.

Just warn us if it's not obvious from the code that it does that :)




-Hoss



yseeley at gmail

Apr 5, 2006, 2:03 PM

Post #8 of 19
Re: Lucene Document order not being maintained? [In reply to]

On 4/5/06, Dan Armbrust <daniel.armbrust.list [at] gmail> wrote:
> I will try to come up with a reproduceable test case.

If you reproduce it, I'll fix it :-)

For your test case, try lowering numbers, such as maxBufferedDocs=2,
mergeFactor=2 or 3
to create more segments more quickly and cause more merges with fewer documents.
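
To see roughly why small values trigger merges so much sooner, here is a toy model in plain Java. It is not Lucene's actual merge policy: it simply assumes that every maxBufferedDocs documents are flushed as one new segment, and that whenever mergeFactor segments pile up on one level they are merged into a single segment on the next level. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only (an assumption of this sketch, not Lucene code).
final class MergeCountModel {

    /** Counts the merges triggered while adding docCount documents under the toy model. */
    static int countMerges(int docCount, int maxBufferedDocs, int mergeFactor) {
        if (maxBufferedDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("maxBufferedDocs >= 1 and mergeFactor >= 2 required");
        }
        List<Integer> segmentsPerLevel = new ArrayList<>();
        int merges = 0;
        int flushes = docCount / maxBufferedDocs; // only full buffers are flushed in this model
        for (int f = 0; f < flushes; f++) {
            int level = 0;
            while (true) {
                if (segmentsPerLevel.size() <= level) {
                    segmentsPerLevel.add(0);
                }
                int count = segmentsPerLevel.get(level) + 1;
                if (count < mergeFactor) {
                    segmentsPerLevel.set(level, count);
                    break;
                }
                // mergeFactor segments on this level: merge them into one on the next level
                segmentsPerLevel.set(level, 0);
                merges++;
                level++;
            }
        }
        return merges;
    }

    public static void main(String[] args) {
        System.out.println(countMerges(16, 2, 2));   // prints 7
        System.out.println(countMerges(16, 10, 90)); // prints 0
    }
}
```

Under these assumptions, 16 documents already cause 7 merges with maxBufferedDocs=2 and mergeFactor=2, while a merge factor of 90 causes none, which is consistent with a small test only showing the problem once the numbers are lowered.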


-Yonik


cutting at apache

Apr 5, 2006, 2:23 PM

Post #9 of 19
Re: Lucene Document order not being maintained? [In reply to]

Dan Armbrust wrote:
> My indexing process works as follows (and some of this is hold-over from
> the time before lucene had a compound file format - so bear with me)
>
> I open up a File based index - using a merge factor of 90, and in my
> current test, the compound index format. When I have added 100,000
> documents, I close this index, and start on a new index. I continue
> this until I'm done with all of the documents. Then, as a last step, I
> open up a new empty index, and I call addIndexes(Directory[]) - and I
> pass in the directories in the same order that I created them.
>
> This allows me to use higher merge factors without running into file
> handle issues, and without having to call optimize.

As others have noted, this should work correctly.

I assume that your merge factor when calling addIndexes() is less than
90. If it's 90, then what you're doing is the same as Lucene would
automatically do. I think you could save yourself a lot of trouble if
you simply lowered your merge factor substantially and then indexed
everything in one pass. To make things go faster, set
maxBufferedDocs=100 or larger. This should be as fast as what you're
doing now and a lot simpler.

Or is that the part where I was supposed to "bear with" you?

Doug


yseeley at gmail

Apr 5, 2006, 2:50 PM

Post #10 of 19
Re: Lucene Document order not being maintained? [In reply to]

On 4/5/06, Doug Cutting <cutting [at] apache> wrote:

> As others have noted, this should work correctly.

One slight oddity I noticed with addIndexes(Dir[]) is that merging
starts at one past the first new segment added (not the first new
segment). It doesn't seem like that should hurt much though.


-Yonik


daniel.armbrust.list at gmail

Apr 5, 2006, 2:56 PM

Post #11 of 19
Re: Lucene Document order not being maintained? [In reply to]

Yonik Seeley wrote:
> For your test case, try lowering numbers, such as maxBufferedDocs=2,
> mergeFactor=2 or 3
> to create more segments more quickly and cause more merges with fewer documents.

Good suggestion. A merge factor of 2 made it happen much more quickly.
Bug is filed:

http://issues.apache.org/jira/browse/LUCENE-540

A JUnit test case is attached (although it may not be in the proper format
for Lucene, I think it's pretty straightforward).

Dan


daniel.armbrust.list at gmail

Apr 5, 2006, 2:59 PM

Post #12 of 19
Re: Lucene Document order not being maintained? [In reply to]

Doug Cutting wrote:
>
> I assume that your merge factor when calling addIndexes() is less than
> 90. If it's 90, then what you're doing is the same as Lucene would
> automatically do. I think you could save yourself a lot of trouble if
> you simply lowered your merge factor substantially and then indexed
> everything in one pass. To make things go faster, set
> maxBufferedDocs=100 or larger. This should be as fast as what you're
> doing now and a lot simpler.
>
> Or is that the part where I was supposed to "bear with" you?
>
> Doug
>

Yep. This code was written when I had to index tons of stuff on Linux,
and was constantly running into file handle issues (even with low merge
factors). I ended up writing a wrapper for Lucene that handled it all
for me, and I've just been reusing it. Then today, I ran into this
issue. It may be time to rework some of the wrapper to take advantage
of the Lucene updates :)

Dan


yseeley at gmail

Apr 5, 2006, 3:00 PM

Post #13 of 19
Re: Lucene Document order not being maintained? [In reply to]

On 4/5/06, Dan Armbrust <daniel.armbrust.list [at] gmail> wrote:
> Yonik Seeley wrote:
> > For your test case, try lowering numbers, such as maxBufferedDocs=2,
> > mergeFactor=2 or 3
> > to create more segments more quickly and cause more merges with fewer documents.
>
> Good suggestion. A merge factor of 2 made it happen much more quickly.
> Bug is filed:
>
> http://issues.apache.org/jira/browse/LUCENE-540

Thanks Dan, I'll look into it tonight, as promised.

-Yonik


yseeley at gmail

Apr 5, 2006, 3:52 PM

Post #14 of 19
Re: Lucene Document order not being maintained? [In reply to]

Ah Ha! I found the problem.

SegmentInfos.read(Directory directory) reads the segment info in reverse order!
I gotta go home now... I'll look into the right fix later (it depends
on what else uses that method...)

FYI, I managed to reproduce it with only 3 documents in each index.

-Yonik


yseeley at gmail

Apr 5, 2006, 3:57 PM

Post #15 of 19
Re: Lucene Document order not being maintained? [In reply to]

Spoke too soon... the loop counter goes down to zero, but it looks
like the segments are added in order.

for (int i = input.readInt(); i > 0; i--) { // read segmentInfos
  SegmentInfo si =
    new SegmentInfo(input.readString(), input.readInt(), directory);
  addElement(si);
}
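
The same point can be checked in isolation. The following is only an illustrative sketch, with a plain iterator standing in for the index input and hypothetical class and method names: a loop counter that runs down to zero still appends entries in the order they are stored, because each read consumes the next entry sequentially.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative stand-in for a sequential read loop; not Lucene code.
final class CountdownReadDemo {

    /** Reads `count` names sequentially with a loop counter that runs down to zero. */
    static List<String> readAll(Iterator<String> input, int count) {
        List<String> infos = new ArrayList<>();
        for (int i = count; i > 0; i--) { // the counter decreases...
            infos.add(input.next());      // ...but each read appends the next stored entry
        }
        return infos;
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("_a", "_b", "_c");
        System.out.println(readAll(stored.iterator(), stored.size())); // prints [_a, _b, _c]
    }
}
```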

On 4/5/06, Yonik Seeley <yseeley [at] gmail> wrote:
> Ah Ha! I found the problem.
>
> SegmentInfos.read(Directory directory) reads the segment info in reverse order!
> I gotta go home now... I'll look into the right fix later (it depends
> on what else uses that method...)
>
> FYI, I managed to reproduce it with only 3 documents in each index.

-Yonik


yseeley at gmail

Apr 5, 2006, 4:23 PM

Post #16 of 19
Re: Lucene Document order not being maintained? [In reply to]

I realized what the real problem was during the drive home.

Merged segments are added after all other segments, instead of at the
spot where the original segments resided.
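
As a rough illustration of that difference (a sketch on a plain list of segment names, not Lucene's own code; the class and method names are hypothetical), compare appending the merged segment at the end with putting it where the merged range started:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: segments are modelled as names in a list.
final class MergePlacementDemo {

    /** The behaviour described as the problem: drop [minSegment, end) and append the result. */
    static List<String> mergeAppend(List<String> segments, int minSegment, int end) {
        List<String> result = new ArrayList<>(segments);
        String merged = String.join("+", result.subList(minSegment, end));
        for (int i = end - 1; i >= minSegment; i--) {
            result.remove(i);
        }
        result.add(merged);
        return result;
    }

    /** Order-preserving alternative: the merged segment takes the place of the originals. */
    static List<String> mergeInPlace(List<String> segments, int minSegment, int end) {
        List<String> result = new ArrayList<>(segments);
        String merged = String.join("+", result.subList(minSegment, end));
        for (int i = end - 1; i > minSegment; i--) {
            result.remove(i);
        }
        result.set(minSegment, merged);
        return result;
    }

    public static void main(String[] args) {
        List<String> segments = Arrays.asList("A", "B", "C", "D");
        System.out.println(mergeAppend(segments, 1, 3));  // prints [A, D, B+C]
        System.out.println(mergeInPlace(segments, 1, 3)); // prints [A, B+C, D]
    }
}
```

When the merged range runs to the end of the list, the two methods return the same result, so only a merge that stops short of the end can change the order.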

I'll propose a patch soon...

-Yonik


yseeley at gmail

Apr 5, 2006, 4:35 PM

Post #17 of 19
Re: Lucene Document order not being maintained? [In reply to]

OK, the following patch seems to work for me!
You might want to try it out on your larger test Dan.

The first part probably isn't necessary (the base=start instead of
start+1), but the second part is.

-Yonik



Index: org/apache/lucene/index/IndexWriter.java
===================================================================
--- org/apache/lucene/index/IndexWriter.java (revision 391084)
+++ org/apache/lucene/index/IndexWriter.java (working copy)
@@ -569,7 +569,7 @@

     // merge newly added segments in log(n) passes
     while (segmentInfos.size() > start+mergeFactor) {
-      for (int base = start+1; base < segmentInfos.size(); base++) {
+      for (int base = start; base < segmentInfos.size(); base++) {
         int end = Math.min(segmentInfos.size(), base+mergeFactor);
         if (end-base > 1)
           mergeSegments(base, end);
@@ -710,9 +710,9 @@
       infoStream.println(" into "+mergedName+" ("+mergedDocCount+" docs)");
     }

-    for (int i = end-1; i >= minSegment; i--) // remove old infos & add new
+    for (int i = end-1; i > minSegment; i--) // remove old infos & add new
       segmentInfos.remove(i);
-    segmentInfos.addElement(new SegmentInfo(mergedName, mergedDocCount,
+    segmentInfos.set(minSegment, new SegmentInfo(mergedName, mergedDocCount,
                                             directory));

     // close readers before we attempt to delete now-obsolete segments
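
The effect of the two hunks can be tried out on a small stand-alone model. This is a sketch under stated assumptions, not the real IndexWriter: each added index is modelled as one single-document segment, start is taken to be 0 (an empty target index), and the final clean-up is assumed to concatenate the remaining segments in list order. The class and method names are hypothetical; only the loop shape is taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone model of the merge loop shown in the patch; not Lucene code.
final class AddIndexesOrderDemo {

    /** Merges segments [minSegment, end); `inPlace` selects the patched placement. */
    static void mergeSegments(List<List<Integer>> segments, int minSegment, int end, boolean inPlace) {
        List<Integer> merged = new ArrayList<>();
        for (int i = minSegment; i < end; i++) {
            merged.addAll(segments.get(i));
        }
        if (inPlace) { // patched: keep the merged segment at the original position
            for (int i = end - 1; i > minSegment; i--) {
                segments.remove(i);
            }
            segments.set(minSegment, merged);
        } else {       // unpatched: remove the range and append the result at the end
            for (int i = end - 1; i >= minSegment; i--) {
                segments.remove(i);
            }
            segments.add(merged);
        }
    }

    /** Runs the merge loop over one-document segments and returns the final document order. */
    static List<Integer> run(int segmentCount, int mergeFactor, boolean patched) {
        List<List<Integer>> segments = new ArrayList<>();
        int start = 0; // assumption: the target index is empty before the segments are added
        for (int doc = 0; doc < segmentCount; doc++) {
            List<Integer> segment = new ArrayList<>();
            segment.add(doc);
            segments.add(segment);
        }
        while (segments.size() > start + mergeFactor) {
            for (int base = patched ? start : start + 1; base < segments.size(); base++) {
                int end = Math.min(segments.size(), base + mergeFactor);
                if (end - base > 1) {
                    mergeSegments(segments, base, end, patched);
                }
            }
        }
        List<Integer> order = new ArrayList<>();
        for (List<Integer> segment : segments) {
            order.addAll(segment); // assumed final pass: concatenate in list order
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 2, false)); // prints [0, 3, 1, 2]
        System.out.println(run(4, 2, true));  // prints [0, 1, 2, 3]
    }
}
```

With four segments and a merge factor of 2, the unpatched variant of this model ends up with document 3 ahead of documents 1 and 2, while the patched variant keeps insertion order.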


yseeley at gmail

Apr 5, 2006, 5:49 PM

Post #18 of 19
Re: Lucene Document order not being maintained? [In reply to]

addIndexes(Dir[]) was the only user of mergeSegments() that passed an
endpoint that wasn't the end of the segment list, and hence the only
caller to mergeSegments() that will see a change of behavior.

Given that, I feel comfortable enough to commit this.

-Yonik


daniel.armbrust.list at gmail

Apr 5, 2006, 7:38 PM

Post #19 of 19
Re: Lucene Document order not being maintained? [In reply to]

Thanks guys... as always, Lucene (and especially the people behind
it) is top notch.

Less than 6 hours from the time I figured out that the bug was in
Lucene (and not my code, which is usually the case), it's already
fixed (I'm going to assume - I'll test it tomorrow when I get to work).

Amazing.

Thanks again,

Dan
