
Mailing List Archive: Lucene: Java-User

Lucene's internal doc ID space

 

 



jong.lucene at gmail

May 11, 2012, 4:56 AM

Post #1 of 5 (457 views)
Lucene's internal doc ID space

When I update a document in Lucene (i.e., re-index it), I have to delete
the existing document and create a new one. My understanding is that this
assigns a new doc ID to the newly created document. If that is the case,
is it true that the system can rather quickly run out of doc ID space
(which is about 2 billion, since the doc ID data type is integer) if the
update frequency is extremely high in an application?

So, my question is -

1. Does Lucene always increment the doc ID for a newly created document
(hence the risk of running out of ID space), just like an auto-increment
column in a database does? Or does it re-use the numbers that are
currently not in use (i.e. those IDs that were once assigned but have
since been deleted)?

2. If Lucene can recycle old IDs, it would be even better if I could force
it to re-use a particular doc ID when updating a document by deleting the
old one and creating a new one. This scheme would allow me to reference this
doc ID from another doc in the index as if it were a foreign key value that
doesn't change upon reindexing. I didn't see anything like this in the API,
but is it possible at all?

3. If Lucene does not recycle old IDs, how do people deal with this issue
when designing a system with extremely high re-indexing frequency?

Thanks in advance for help
/Jong


trejkaz at trypticon

May 12, 2012, 12:14 AM

Post #2 of 5 (411 views)
Re: Lucene's internal doc ID space

On Fri, May 11, 2012 at 9:56 PM, Jong Kim <jong.lucene [at] gmail> wrote:
> 2. If Lucene can recycle old IDs, it would be even better if I could force
> it to re-use a particular doc ID when updating a document by deleting old
> one and creating new one. This scheme will allow me to reference this doc
> ID from another doc in the index as if it was a foreign key value that
> doesn't change upon reindexing. I didn't see anything like this in the API,
> but is it ever possible?

Not answering the question, but this would be really awesome as we
could then actually replace documents instead of deleting an old one
and adding a new one.

(But I figure if it were possible, the replace method would work like that.)

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


simon.willnauer at googlemail

May 12, 2012, 1:36 AM

Post #3 of 5 (414 views)
Re: Lucene's internal doc ID space

On Fri, May 11, 2012 at 7:56 AM, Jong Kim <jong.lucene [at] gmail> wrote:
> When I update a document in Lucene (i.e., re-indexing), I have to delete
> the existing document, and create a new one. My understanding is that this
> assigns a new doc ID for the newly created document. If that is the case,
> is it true that the system can rather quickly run out of doc ID space
> (which is about 2 billion since doc ID data type is integer) if the update
> frequency is extremely high in an application?

Document IDs in Lucene are per segment, i.e. they are always relative
to a segment. There is certainly a limitation here, in two places:
1. in the API, where all methods accepting internal doc IDs expect int,
not long; and 2. at the segment level. Basically, you are going to run
into problems if you have more than Integer.MAX_VALUE documents in one
index. You can work around that if everything is done "per-segment"; in
that case the limitation only applies to a single segment.

Running out of "ids" won't be an issue, as they are all relative to
their segment, i.e. you can update a single document forever and never
run out of ids.
>
> So, my question is -
>
> 1. Does Lucene always increment the doc ID for newly created document
> (hence, the risk of running out of ID space) just like auto increment
> column in the database does? Or does it re-use the numbers that are
> currently not in use (i.e. those IDs that were once assigned but since
> deleted)?
>
> 2. If Lucene can recycle old IDs, it would be even better if I could force
> it to re-use a particular doc ID when updating a document by deleting old
> one and creating new one. This scheme will allow me to reference this doc
> ID from another doc in the index as if it was a foreign key value that
> doesn't change upon reindexing. I didn't see anything like this in the API,
> but is it ever possible?
>
> 3. If Lucene does not recycle old IDs, how do people deal with this issue
> when designing a system with extremely high re-indexing frequency?

The Lucene internal ids should not be used in the application
integrating Lucene, or at least not in the way you would use a primary
"auto-incremented" key in a DB. You can specify your own "id" field
and reuse the ids (you actually have to if you want to update).
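
To make the "own id field" idea concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class StableKeyIndex and its methods are hypothetical names made up for illustration. It only shows that the application key stays stable while the internal doc ID changes on every update. In Lucene itself, the corresponding call is IndexWriter.updateDocument(Term, Document), which deletes the documents matching the term and adds the new one.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only (hypothetical, not Lucene code): a stable application key survives updates. */
final class StableKeyIndex {
    // application key -> current internal doc ID
    private final Map<String, Integer> keyToDocId = new HashMap<>();
    // live internal doc ID -> stored content
    private final Map<Integer, String> contents = new HashMap<>();
    // Simplification: IDs are handed out in order and never compacted here.
    private int nextDocId = 0;

    /** Delete any document with this key, then add the new version under a fresh internal ID. */
    int update(String key, String content) {
        Integer old = keyToDocId.remove(key);
        if (old != null) {
            contents.remove(old); // the old internal ID is now dead
        }
        int docId = nextDocId++;
        keyToDocId.put(key, docId);
        contents.put(docId, content);
        return docId;
    }

    /** Look a document up by its application key, never by internal ID. */
    String get(String key) {
        Integer docId = keyToDocId.get(key);
        return docId == null ? null : contents.get(docId);
    }

    public static void main(String[] args) {
        StableKeyIndex index = new StableKeyIndex();
        int first = index.update("user-42", "version 1");
        int second = index.update("user-42", "version 2");
        System.out.println("internal ID changed: " + (first != second));
        System.out.println("key still resolves to: " + index.get("user-42"));
    }
}
```

Other documents would then reference "user-42" (the application key) as their foreign key, not the internal doc ID.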

does that make sense?

simon
>
> Thanks in advance for help
> /Jong



valeri.felberg at gmail

May 12, 2012, 6:12 AM

Post #4 of 5 (420 views)
Re: Lucene's internal doc ID space

> the Document IDs in Lucene are per segment. ie. they are always
> segment based.

@Simon I'm just wondering: if the document IDs are per segment, how
does it work if I call Searcher.search(Query, int) and get TopDocs
referencing ScoreDocs which contain document IDs? What happens if
there are two matching documents in different segments? How does
Lucene know which segment is meant if I call Searcher.doc(docId) with
some docId from the search result?



lucene at mikemccandless

May 12, 2012, 1:12 PM

Post #5 of 5 (429 views)
Re: Lucene's internal doc ID space

On Sat, May 12, 2012 at 9:12 AM, Valeriy Felberg
<valeri.felberg [at] gmail> wrote:
>> the Document IDs in Lucene are per segment. ie. they are always
>> segment based.
>
> @Simon I'm just wondering: If the document IDs are per segment how
> does it work if I call Searcher.search(Query, int) and get TopDocs
> referencing ScoreDocs which contain document IDs? What happens if
> there are two matching documents in different segments? How does
> Lucene know which segment is meant if I call Searcher.doc(docId) with
> some docId from the search result?

The per-segment docIDs are "rebased" before Searcher.search returns,
i.e. turned into global docIDs against the top-level reader.
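
As a rough illustration of what "rebasing" means, here is a sketch in plain Java. It is a toy model under stated assumptions, not Lucene's implementation: the class DocIdRebase, its method names, and the segment sizes are all made up. Each segment gets a starting offset equal to the total size of the segments before it; a global docID is that offset plus the per-segment docID, and going back means finding the segment whose range contains the global docID.

```java
/** Toy model of docID rebasing (hypothetical names; segment sizes are made-up numbers). */
final class DocIdRebase {
    /** starts[i] is the global docID of the first document in segment i. */
    static int[] docStarts(int[] segmentSizes) {
        int[] starts = new int[segmentSizes.length];
        int base = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            starts[i] = base;
            base += segmentSizes[i];
        }
        return starts;
    }

    /** Per-segment docID -> global docID. */
    static int toGlobal(int[] starts, int segment, int localDocId) {
        return starts[segment] + localDocId;
    }

    /** Binary search for the last segment whose start is <= globalDocId. */
    static int segmentOf(int[] starts, int globalDocId) {
        int lo = 0;
        int hi = starts.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (starts[mid] <= globalDocId) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] starts = docStarts(new int[] {5, 3, 4}); // three segments
        int global = toGlobal(starts, 1, 1);           // doc 1 of segment 1
        int segment = segmentOf(starts, global);
        int local = global - starts[segment];
        System.out.println("global=" + global + " segment=" + segment + " local=" + local);
    }
}
```

This is why a docID taken from a search result can be resolved without ambiguity: the global number alone determines both the segment and the position inside it.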

Also: when a merge runs, it removes any deleted docIDs (thus
renumbering all non-deleted docIDs)...
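
The renumbering on merge can be pictured with another small plain-Java sketch. Again, this is a toy model and not Lucene code; the class MergeRenumber and the liveness flags are hypothetical. Deleted documents are dropped and the surviving ones are packed into a dense range starting at 0, which is why internal docIDs cannot be treated as stable keys.

```java
import java.util.Arrays;

/** Toy model of merge-time renumbering (hypothetical names, not Lucene code). */
final class MergeRenumber {
    /** Returns an old -> new docID map; deleted documents map to -1. */
    static int[] compact(boolean[] live) {
        int[] oldToNew = new int[live.length];
        int next = 0;
        for (int old = 0; old < live.length; old++) {
            oldToNew[old] = live[old] ? next++ : -1;
        }
        return oldToNew;
    }

    public static void main(String[] args) {
        // Documents 1 and 4 were deleted before the merge.
        boolean[] live = {true, false, true, true, false};
        System.out.println(Arrays.toString(compact(live)));
    }
}
```

In this example the document that used to be number 2 becomes number 1 after the merge, so anything that stored the old number would now point at the wrong document.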

Mike McCandless

http://blog.mikemccandless.com

