
Mailing List Archive: Lucene: Java-User

Problem with near realtime search

 

 



Harald.Kirsch at raytion

Aug 3, 2012, 6:41 AM

Post #1 of 5 (956 views)
Problem with near realtime search

I am trying to (mis)use Lucene a bit like a NoSQL database or, rather, a
persistent map. I am adding 38000 documents to the index at a rate of
1000/s. Because each add may actually be an update, I do a
read/change/write sequence for each of the documents.

All goes well until, just after writing the last item, I run a query
that retrieves about 16000 documents. All docIds are collected in a
Collector, and, yes, I make sure to rebase the docIds. Then I iterate
over all docIds found and retrieve the documents basically like this:

for (int docId : docIds) {
    Document d = getSearcher().doc(docId);
    ..
}

where getSearcher() uses IndexReader.openIfChanged() to always get the
most current searcher and makes sure to eventually close the old searcher.


At document 15940 I get an exception like this:

Exception in thread "main" java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=1 (got docID=1)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:490)
    at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:568)
    at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:264)

I can get rid of the exception in one of two ways, neither of which I like:

1) Put a Thread.sleep(1000) just before running the query+document
retrieval part.

2) Use the same IndexSearcher to retrieve all documents instead of
calling getSearcher for each document retrieval.

This is just a single-threaded test program. Besides the main thread, I
only see Lucene merge threads in jvisualvm. A breakpoint on the
exception shows that org.apache.lucene.index.DirectoryReader.document
does seem to have the wrong segments, which triggers the exception.

Since Lucene 3.6.1 has been in production use for some time, I doubt it
is a bug in Lucene, but I don't see what I am doing wrong. It might be
connected to trying to get the freshest IndexReader for retrieving
documents.

Any better ideas or explanations?

Harald.

--
Harald Kirsch


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


simon.willnauer at gmail

Aug 3, 2012, 8:24 AM

Post #2 of 5 (920 views)
Re: Problem with near realtime search [In reply to]

Hey Harald,

if you fetch documents with a possibly different searcher (reader) than
the one you used for the search, you will run into problems with the doc
IDs, since they might change during the request. I suggest you use
SearcherManager or NRTManager and carry the searcher reference along
when you collect the stored values. Just keep around the searcher you
used, and NRTManager / SearcherManager will do the job for you.
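
The advice above can be sketched roughly as follows. This is a plain-Java
model, not Lucene code: the classes PinnedView, ViewManager and
PinnedViewDemo are hypothetical stand-ins for a point-in-time searcher
and its manager, and positions in a list play the role of doc IDs. The
point it illustrates is that one view is acquired once, used for both the
search and every document fetch, and released at the end. In Lucene the
corresponding calls would be SearcherManager.acquire() and
SearcherManager.release().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical immutable point-in-time view; list positions act as doc IDs. */
final class PinnedView {
    private final List<String> docs;
    // Starts at 1: the manager itself holds one reference.
    private final AtomicInteger refCount = new AtomicInteger(1);
    private boolean closed = false;

    PinnedView(List<String> docs) { this.docs = new ArrayList<>(docs); }

    void incRef() { refCount.incrementAndGet(); }

    void decRef() { if (refCount.decrementAndGet() == 0) closed = true; }

    boolean isClosed() { return closed; }

    /** Returns the IDs of all docs containing the text; valid for this view only. */
    List<Integer> search(String text) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (docs.get(i).contains(text)) ids.add(i);
        }
        return ids;
    }

    String doc(int id) {
        if (closed) throw new IllegalStateException("view already closed");
        if (id < 0 || id >= docs.size()) {
            throw new IllegalArgumentException(
                "docID must be >= 0 and < maxDoc=" + docs.size() + " (got docID=" + id + ")");
        }
        return docs.get(id);
    }
}

/** Hypothetical manager that hands out the current view and swaps in new ones. */
final class ViewManager {
    private PinnedView current;

    ViewManager(List<String> docs) { current = new PinnedView(docs); }

    synchronized PinnedView acquire() { current.incRef(); return current; }

    void release(PinnedView view) { view.decRef(); }

    /** Publishes a new view; the old one closes only after all users released it. */
    synchronized void refresh(List<String> newDocs) {
        PinnedView old = current;
        current = new PinnedView(newDocs);
        old.decRef();
    }
}

final class PinnedViewDemo {
    /** Searches and fetches against ONE acquired view, even if the index changes in between. */
    static List<String> searchAndFetch(ViewManager manager, String text, Runnable indexChange) {
        PinnedView view = manager.acquire();
        try {
            List<Integer> ids = view.search(text);
            indexChange.run(); // simulates a merge/refresh between search and fetch
            List<String> result = new ArrayList<>();
            for (int id : ids) result.add(view.doc(id));
            return result;
        } finally {
            manager.release(view);
        }
    }

    public static void main(String[] args) {
        ViewManager m = new ViewManager(Arrays.asList("apple pie", "old junk", "apple tart"));
        List<String> hits = searchAndFetch(m, "apple",
            () -> m.refresh(Arrays.asList("apple pie", "apple tart")));
        System.out.println(hits); // [apple pie, apple tart]
    }
}
```

In this model, fetching ID 2 from the refreshed view instead of the
pinned one would fail with the same kind of "docID must be >= 0 and <
maxDoc" error as in the original report, because the refreshed view only
holds two entries.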

simon



Harald.Kirsch at raytion

Aug 3, 2012, 10:38 PM

Post #3 of 5 (914 views)
Re: Problem with near realtime search [In reply to]

Hello Simon,

thanks for the information. I really thought that once a docId is
assigned, it is kept until the document is deleted. The only problem I
would have expected is docIds that no longer refer to a document
because it was deleted in the meantime. But this is clearly not the case
in my setup.

But if docIds change during index rearrangement, then this would of
course completely explain the symptoms I saw.

So docIds can definitely change under the hood?

Harald.



--
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49-211-550266-0
Fax +49-211-550266-19
http://www.raytion.com



Harald.Kirsch at raytion

Aug 3, 2012, 10:58 PM

Post #4 of 5 (915 views)
Re: Problem with near realtime search [In reply to]

Hello Simon,

now that I know what to search for, I found

http://wiki.apache.org/lucene-java/LuceneFAQ#When_is_it_possible_for_document_IDs_to_change.3F

So that clearly explains this issue for me.

Many thanks for your help.

Harald






simon.willnauer at gmail

Aug 5, 2012, 1:10 AM

Post #5 of 5 (918 views)
Re: Problem with near realtime search [In reply to]

Hey Harald,

On Sat, Aug 4, 2012 at 7:58 AM, Harald Kirsch <Harald.Kirsch [at] raytion> wrote:
> Hello Simon,
>
> now that I knew what to search for I found
>
> http://wiki.apache.org/lucene-java/LuceneFAQ#When_is_it_possible_for_document_IDs_to_change.3F
>
> So that clearly explains this issue for me.
>
> Many thanks for your help.

No worries. The document IDs are only fixed per segment (right now;
even that can change in the future). If you delete a document, its ID is
still set, but it is marked as deleted. Now, if you merge two segments,
your doc gets a new ID in the merged segment (it might happen that the
ID stays the same for some segments, but that is just an implementation
detail and not generally true). If you add a document and commit it, you
have a new segment which is part of the index. Even if you handle
per-segment IDs in a collector, the global ID you build changes again
with newly flushed segments.
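
A toy model of the behaviour described above, as a hedged sketch only:
ToySegment, ToyIndex and DocIdShiftDemo are hypothetical classes, not
Lucene's, and the merge here simply concatenates all live documents. It
assumes a global ID is the segment's docBase (the sum of maxDoc over the
preceding segments) plus the segment-local ID, which is what rebasing in
a collector amounts to.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical segment: documents are addressed by position (segment-local ID). */
final class ToySegment {
    final List<String> docs = new ArrayList<>();
    final List<Boolean> deleted = new ArrayList<>();

    int add(String doc) {
        docs.add(doc);
        deleted.add(false);
        return docs.size() - 1;
    }

    /** Deleting only marks the slot; the local IDs of other docs do not move. */
    void delete(int localId) { deleted.set(localId, true); }

    int maxDoc() { return docs.size(); }
}

/** Hypothetical index made of segments. */
final class ToyIndex {
    final List<ToySegment> segments = new ArrayList<>();

    /** Global ID = docBase of the segment + local ID, or -1 if absent or deleted. */
    int globalIdOf(String doc) {
        int docBase = 0;
        for (ToySegment s : segments) {
            for (int i = 0; i < s.maxDoc(); i++) {
                if (!s.deleted.get(i) && s.docs.get(i).equals(doc)) return docBase + i;
            }
            docBase += s.maxDoc();
        }
        return -1;
    }

    int maxDoc() {
        int total = 0;
        for (ToySegment s : segments) total += s.maxDoc();
        return total;
    }

    /** Merges all segments into one, dropping deleted docs and renumbering the rest. */
    void mergeAll() {
        ToySegment merged = new ToySegment();
        for (ToySegment s : segments) {
            for (int i = 0; i < s.maxDoc(); i++) {
                if (!s.deleted.get(i)) merged.add(s.docs.get(i));
            }
        }
        segments.clear();
        segments.add(merged);
    }
}

final class DocIdShiftDemo {
    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        ToySegment first = new ToySegment();
        first.add("a");
        first.add("b");
        ToySegment second = new ToySegment();
        second.add("c");
        index.segments.add(first);
        index.segments.add(second);

        int before = index.globalIdOf("c");      // docBase 2 + local ID 0
        first.delete(0);                          // "a" is only marked as deleted
        int afterDelete = index.globalIdOf("c"); // unchanged by the delete
        index.mergeAll();                         // "a" is dropped, "b" and "c" move down
        int afterMerge = index.globalIdOf("c");

        System.out.println(before + " " + afterDelete + " " + afterMerge); // 2 2 1
    }
}
```

After the merge in this model, maxDoc drops from 3 to 2, so an ID of 2
collected before the merge is out of range for a reader opened after it.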

simon
