
Mailing List Archive: Lucene: Java-User

delete by docid in lucene 4

 

 



sean.bridges at gmail

Jul 11, 2012, 6:09 PM

Post #1 of 14 (375 views)
delete by docid in lucene 4

Is it possible to delete by docId in lucene 4? I can delete by docid
in lucene 3 using IndexReader.deleteDocument(int docId), but that
method is gone in lucene 4, and IndexWriter only allows deleting by
Term or Query.

This is our use case - In our system, each document is identified by
a unique serial id. If an error occurs, we may index the same message
multiple times. When an index grows large enough, we stop adding to
it, and optimize the index. During optimization, if we see multiple
docs with the same serialid, we delete all but the first, as all
documents with the same serialid are the same.

Thanks,

Sean

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


simon.willnauer at gmail

Jul 12, 2012, 12:42 AM

Post #2 of 14 (374 views)
Re: delete by docid in lucene 4 [In reply to]

On Thu, Jul 12, 2012 at 3:09 AM, Sean Bridges <sean.bridges [at] gmail> wrote:
> Is it possible to delete by docId in lucene 4? I can delete by docid
> in lucene 3 using IndexReader.deleteDocument(int docId), but that
> method is gone in lucene 4, and IndexWriter only allows deleting by
> Term or Query.

That is correct. In Lucene 4, IndexReader is really just a reader!
>
> This is our use case - In our system, each document is identified by
> a unique serial id. If an error occurs, we may index the same message
> multiple times. When an index grows large enough, we stop adding to
> it, and optimize the index. During optimization, if we see multiple
> docs with the same serialid, we delete all but the first, as all
> documents with the same serialid are the same.

I am wondering why you don't use the IW#updateDocument(Term,Doc)
method. Do you rely on multiple versions of the same doc? With Lucene
4, relying on the doc ID can become very tricky. If you use multiple
threads, you create a lot of segments, which can be merged in any order,
so you can't tell whether a document ID maintains happened-before
semantics at all.

Can you tell us more about your use case and why you are using deleteByDocID?

simon




erouse at comsquared

Jul 12, 2012, 6:54 AM

Post #3 of 14 (370 views)
RE: delete by docid in lucene 4 [In reply to]

I get around this by creating an id based term like:

new Term(Constants.DEFAULT_ID_FIELD, id)



sean.bridges at gmail

Jul 12, 2012, 8:41 AM

Post #4 of 14 (369 views)
Re: delete by docid in lucene 4 [In reply to]

We have indexer machines which are fed documents by other machines.
If an error occurs (a machine crashing, etc.), the same document may be
sent to an indexer multiple times. Serial ids are assigned before
documents reach the indexer, so a document may be in the index
multiple times, each time with the same serial id.

When the index gets large enough, the indexer will stop writing to the
index and upload it to another machine, which keeps the index
forever. Before we upload the index, we call forceMerge(1) on it and
gather some stats about the index, such as max and min serial id and
total documents. While calculating max and min serial id, if we see a
duplicate serial id, we call IndexReader.deleteDocument(int docId).

We could check for duplicate serial ids while indexing, but that is
racy, and not as efficient.
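
For readers following along, the single pass described above can be sketched in plain Java without any Lucene calls. This is only an illustration under assumptions: the long[] stands in for reading each document's serial id in docid order, and the names DuplicateScan, Result and scan are hypothetical, not part of Lucene or of Sean's system.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: one pass that gathers stats and flags repeated serial ids.
final class DuplicateScan {

    /** Stats plus the positions (stand-ins for doc ids) that repeat an earlier serial id. */
    static final class Result {
        final long minSerialId;
        final long maxSerialId;
        final int uniqueCount;
        final List<Integer> duplicatePositions;

        Result(long minSerialId, long maxSerialId, int uniqueCount,
               List<Integer> duplicatePositions) {
            this.minSerialId = minSerialId;
            this.maxSerialId = maxSerialId;
            this.uniqueCount = uniqueCount;
            this.duplicatePositions = duplicatePositions;
        }
    }

    /** Scans serial ids in document order; the first occurrence of each id is kept. */
    static Result scan(long[] serialIds) {
        Set<Long> seen = new HashSet<>();
        List<Integer> duplicates = new ArrayList<>();
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int pos = 0; pos < serialIds.length; pos++) {
            long id = serialIds[pos];
            min = Math.min(min, id);
            max = Math.max(max, id);
            // Set.add returns false when the id was already seen, i.e. a duplicate.
            if (!seen.add(id)) {
                duplicates.add(pos);
            }
        }
        return new Result(min, max, seen.size(), duplicates);
    }
}
```

For the ids {7, 3, 7, 9, 3} the scan reports min 3, max 9, three unique ids, and positions 2 and 4 as the duplicates to drop. Note that a delete by Term on the serial id field would remove the first copy as well, which is why the pass needs a per-document handle.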

Thanks,

Sean




sean.bridges at gmail

Jul 12, 2012, 8:49 AM

Post #5 of 14 (368 views)
Re: delete by docid in lucene 4 [In reply to]

Does that return a Term which matches the lucene docId? What is the
value of Constants.DEFAULT_ID_FIELD ?

Thanks,
Sean

On Thu, Jul 12, 2012 at 6:54 AM, Edward W. Rouse <erouse [at] comsquared> wrote:
> I get around this by creating an id based term like:
>
> new Term(Constants.DEFAULT_ID_FIELD, id)


uwe at thetaphi

Jul 12, 2012, 9:27 AM

Post #6 of 14 (373 views)
RE: delete by docid in lucene 4 [In reply to]

The trick is to index not with addDocument(Document) but instead with
updateDocument(Term, Document). Lucene then adds the document atomically
while deleting any previous documents with the given term (which is your
unique ID). If the key does not exist, it simply indexes without deleting
anything.
This way you always have only one document with the same Term (== your unique
ID).
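
The difference between the two calls can be shown with a toy model in plain Java. This is a sketch of the semantics only, not Lucene code: ToyIndex and its methods are hypothetical names, the list stands in for the index, and the real updateDocument(Term, Document) performs the delete and the add atomically inside IndexWriter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for an index keyed by a unique serial id.
final class ToyIndex {

    static final class Doc {
        final String serialId;
        final String body;

        Doc(String serialId, String body) {
            this.serialId = serialId;
            this.body = body;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Models addDocument: always appends, so a resent message is stored twice. */
    void addDocument(String serialId, String body) {
        docs.add(new Doc(serialId, body));
    }

    /** Models updateDocument: drop every doc with the same key, then append the new one. */
    void updateDocument(String serialId, String body) {
        docs.removeIf(d -> d.serialId.equals(serialId));
        docs.add(new Doc(serialId, body));
    }

    /** Number of stored docs carrying the given serial id. */
    int count(String serialId) {
        int n = 0;
        for (Doc d : docs) {
            if (d.serialId.equals(serialId)) {
                n++;
            }
        }
        return n;
    }

    int size() {
        return docs.size();
    }
}
```

Sending the same message twice through addDocument leaves two copies under its serial id; sending it twice through updateDocument leaves one, and a key that was never seen is simply added.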

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi




sean.bridges at gmail

Jul 12, 2012, 9:55 AM

Post #7 of 14 (367 views)
Re: delete by docid in lucene 4 [In reply to]

Thanks for the tip.

Does using updateDocument instead of addDocument affect
indexing/search performance?

Sean



erouse at comsquared

Jul 12, 2012, 10:33 AM

Post #8 of 14 (366 views)
RE: delete by docid in lucene 4 [In reply to]

Constants.DEFAULT_ID_FIELD is the name of our unique documentId. The lucene
docId has no purpose for us as we consider it for internal use by lucene
only and use our own id for document tracking purposes.

> -----Original Message-----
> From: Sean Bridges [mailto:sean.bridges [at] gmail]
> Sent: Thursday, July 12, 2012 11:50 AM
> To: java-user [at] lucene
> Subject: Re: delete by docid in lucene 4
>
> Does that return a Term which matches the lucene docId? What is the
> value of Constants.DEFAULT_ID_FIELD ?


simon.willnauer at gmail

Jul 12, 2012, 11:53 AM

Post #9 of 14 (361 views)
Re: delete by docid in lucene 4 [In reply to]

On Thu, Jul 12, 2012 at 6:55 PM, Sean Bridges <sean.bridges [at] gmail> wrote:
> Thanks for the tip.
>
> Does using updateDocument instead of addDocument affect
> indexing/search performance?

It does affect indexing performance compared to addDocument, but the
cost is probably minor compared to your analysis chain. I wouldn't worry
about updateDocument; it's really the only sensible way to use Lucene.
Is there a reason you didn't use it before? What is your ingest rate / doc
throughput, and at what point would you get concerned?

simon


sean.bridges at gmail

Jul 12, 2012, 12:50 PM

Post #10 of 14 (361 views)
Re: delete by docid in lucene 4 [In reply to]

I never used updateDocument() due to ignorance.

We are indexing several hundred documents per second, and most of the
analysis takes place on the non-indexer machines to reduce load on
the indexers. For our use case, deleteDocument(int docId) will be
faster, as there are very few duplicates, but I don't know if the
difference is significant.

It would be nice to have a deleteDocument(int docId) in IndexWriter.
It seems like it would be easy to add, as DocumentsWriter already has a
deletedDocID. I can file a JIRA issue and submit a patch if this is
something that you guys would accept.

Sean



uwe at thetaphi

Jul 12, 2012, 1:08 PM

Post #11 of 14 (362 views)
RE: delete by docid in lucene 4 [In reply to]

Hi Sean,

Without checking the performance in your case, it makes no sense to discuss
about this. Lucene 4.0 changed a lot, there are several improvements. Please
read the following:

- Because of the new term dictionary, Term lookups on non-existing terms are
fail-fast; they don't do any disk IO in most cases. You can do tens of
thousands of those per second on a simple laptop.
- DocumentsWriter uses internal Lucene DocIDs, but those are not global and
therefore not useful for you. They are only valid for one index segment, and
only temporarily, until IndexWriter merges segments again (possibly in
another thread).
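
To illustrate the second point, here is a minimal sketch of why a docID
stops being valid after a merge. It is a toy in-memory model, not Lucene
code: the class DocIdShiftDemo, its Segment type, and the methods docIdOf
and mergeAll are hypothetical names invented for this example. The only
assumption is that a docID is a position (segment base plus offset within
the segment) and that merging drops deleted documents and renumbers the
rest.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene code): docIDs are positions, so merging renumbers them. */
final class DocIdShiftDemo {

    /** A toy segment: application keys plus one delete flag per document. */
    static final class Segment {
        final List<String> keys = new ArrayList<>();
        final List<Boolean> deleted = new ArrayList<>();

        void add(String key) {
            keys.add(key);
            deleted.add(false);
        }
    }

    /** Index-wide docID of the first live document with this key, or -1 if absent. */
    static int docIdOf(List<Segment> segments, String key) {
        int base = 0; // a segment's docIDs start after all documents of earlier segments
        for (Segment s : segments) {
            for (int i = 0; i < s.keys.size(); i++) {
                if (!s.deleted.get(i) && s.keys.get(i).equals(key)) {
                    return base + i;
                }
            }
            base += s.keys.size(); // deleted documents still occupy a docID until merged
        }
        return -1;
    }

    /** Merges all segments into one, dropping deleted documents and renumbering the rest. */
    static List<Segment> mergeAll(List<Segment> segments) {
        Segment merged = new Segment();
        for (Segment s : segments) {
            for (int i = 0; i < s.keys.size(); i++) {
                if (!s.deleted.get(i)) {
                    merged.add(s.keys.get(i));
                }
            }
        }
        List<Segment> result = new ArrayList<>();
        result.add(merged);
        return result;
    }

    public static void main(String[] args) {
        Segment first = new Segment();
        first.add("serial-1");
        first.add("serial-2");
        first.deleted.set(0, true); // serial-1 is deleted but still occupies docID 0

        Segment second = new Segment();
        second.add("serial-3");

        List<Segment> index = new ArrayList<>();
        index.add(first);
        index.add(second);

        System.out.println("before merge: " + docIdOf(index, "serial-3"));           // 2
        System.out.println("after merge:  " + docIdOf(mergeAll(index), "serial-3")); // 1
    }
}
```

In this model the same document is found at docID 2 before the merge and at
docID 1 afterwards, while its key ("serial-3") stays the same.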

So: always use updateDocument when you put new documents into the index,
and give every document the unique ID from your pool. Lucene's document IDs
are purely internal and, especially in 4.0's IndexWriter, no longer constant
(they can easily change after reopening an index, depending on merge policy,
or after getting a new realtime reader). To uniquely identify documents later
you *have* to use your own key field.
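
As a rough sketch of the semantics described above (delete anything that
already carries the key, then add), here is a toy in-memory model. It is
not Lucene code and does not use IndexWriter; the class UpdateByKeyDemo,
its Doc type, and the serialId/body fields are hypothetical names chosen
for this example. In Lucene itself the corresponding call is
IndexWriter.updateDocument(Term, Document), with the Term built from your
own key field.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene code) of "delete everything with this key, then add". */
final class UpdateByKeyDemo {

    static final class Doc {
        final String serialId; // the application's own unique key
        final String body;
        boolean live = true;

        Doc(String serialId, String body) {
            this.serialId = serialId;
            this.body = body;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Plain add: a message that is sent twice ends up in the index twice. */
    void addDocument(String serialId, String body) {
        docs.add(new Doc(serialId, body));
    }

    /** Update: marks earlier documents with the same key as deleted, then adds the new one. */
    void updateDocument(String serialId, String body) {
        for (Doc d : docs) {
            if (d.live && d.serialId.equals(serialId)) {
                d.live = false;
            }
        }
        docs.add(new Doc(serialId, body));
    }

    /** Number of live documents carrying the given key. */
    int liveCount(String serialId) {
        int n = 0;
        for (Doc d : docs) {
            if (d.live && d.serialId.equals(serialId)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        UpdateByKeyDemo withAdd = new UpdateByKeyDemo();
        withAdd.addDocument("42", "hello");
        withAdd.addDocument("42", "hello"); // resent after a crash
        System.out.println("addDocument:    " + withAdd.liveCount("42"));    // 2

        UpdateByKeyDemo withUpdate = new UpdateByKeyDemo();
        withUpdate.updateDocument("42", "hello");
        withUpdate.updateDocument("42", "hello"); // resent after a crash
        System.out.println("updateDocument: " + withUpdate.liveCount("42")); // 1
    }
}
```

If the key is not present yet, the update path deletes nothing and behaves
like a plain add, so it can be used for every document.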

Lucene 4.0 is different from previous versions; deleting by internal Lucene
docId will not come back.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi



simon.willnauer at gmail

Jul 12, 2012, 3:17 PM

Post #12 of 14 (370 views)
Permalink
Re: delete by docid in lucene 4 [In reply to]

Sean, seriously: at a couple of hundred docs a second, don't bother, just
use updateDocument. My benchmarks show only a smallish impact during
indexing, especially with concurrent flushing in Lucene 4. I don't know
how resource-intensive your analysis chain is, but on a decent machine you
can easily go > 20k docs a second with updateDocument.

If you want to give deleteByDocid a try for kicks, I'd be curious how
you solve some of the really tricky issues! :)

simon



lucene at mikemccandless

Jul 12, 2012, 3:25 PM

Post #13 of 14 (361 views)
Permalink
Re: delete by docid in lucene 4 [In reply to]

On Thu, Jul 12, 2012 at 6:17 PM, Simon Willnauer
<simon.willnauer [at] gmail> wrote:
> Sean seriously a couple of hundred docs a second, don't bother just
> use updateDocument. My benchmarks show that there is only a smallish
> impact during indexing especially with concurrent flushing in lucene
> 4. I don't know how resource intensive your analysis chain is but on a
> decent machine you can easily go > 20k docs a second with
> updateDocument.
>
> If you want to give deleteByDocid a try for kicks I'd be curious how
> you solve some of the really tricky issues! :)

This (adding delete-by-docID to IndexWriter) has been requested fairly
frequently...

But the problem is that docIDs can suddenly change whenever a merge
commits, so I don't see how we can add it in general.

That said, there is an initial patch here:

https://issues.apache.org/jira/browse/LUCENE-4203

It adds IW.tryDeleteDocument(AtomicReader reader, int docID), with the
requirement that the reader is a near-real-time reader obtained from
the writer. The delete will succeed (return true) if that reader has
not yet been merged away, else it fails (returns false) and you have
to do the delete the "normal" way (by Term).
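
The "try by docID, else fall back to the key" pattern described above can
be sketched with a toy in-memory model. This is not Lucene code and not the
patch itself: TryDeleteDemo, ToyWriter, Snapshot, and the generation counter
are hypothetical stand-ins for the writer, the near-real-time reader, and
"has this reader been merged away". Note that in this model, as with
deleting by Term, the fallback removes every document carrying the key.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene code) of "try delete by docID, else fall back to delete by key". */
final class TryDeleteDemo {

    /** A point-in-time view; only valid while the writer's numbering is unchanged. */
    static final class Snapshot {
        final int generation;

        Snapshot(int generation) {
            this.generation = generation;
        }
    }

    static final class ToyWriter {
        private final List<String> keys = new ArrayList<>(); // position == docID, null == deleted
        private int generation = 0;                          // bumped whenever a merge renumbers

        void add(String key) {
            keys.add(key);
        }

        Snapshot openSnapshot() {
            return new Snapshot(generation);
        }

        /** Compacts away deleted slots, which changes docIDs and invalidates old snapshots. */
        void merge() {
            keys.removeIf(k -> k == null);
            generation++;
        }

        /** Fast path: succeeds only if the snapshot still matches the current numbering. */
        boolean tryDelete(Snapshot snapshot, int docId) {
            if (snapshot.generation != generation || docId < 0 || docId >= keys.size()
                    || keys.get(docId) == null) {
                return false;
            }
            keys.set(docId, null);
            return true;
        }

        /** Slow path: deletes every document carrying the key, whatever its docID is now. */
        void deleteByKey(String key) {
            for (int i = 0; i < keys.size(); i++) {
                if (key.equals(keys.get(i))) {
                    keys.set(i, null);
                }
            }
        }

        int liveCount() {
            int n = 0;
            for (String k : keys) {
                if (k != null) {
                    n++;
                }
            }
            return n;
        }
    }

    /** Returns true if the docID fast path was used, false if it fell back to the key. */
    static boolean delete(ToyWriter writer, Snapshot snapshot, int docId, String key) {
        if (writer.tryDelete(snapshot, docId)) {
            return true;
        }
        writer.deleteByKey(key);
        return false;
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        writer.add("a");
        writer.add("b");
        writer.add("c");
        Snapshot snapshot = writer.openSnapshot();

        System.out.println(delete(writer, snapshot, 1, "b")); // true: snapshot is current
        writer.merge();                                       // "c" moves from docID 2 to 1
        System.out.println(delete(writer, snapshot, 1, "c")); // false: stale, deleted by key
        System.out.println(writer.liveCount());               // 1
    }
}
```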

I won't have much time to get back to that issue in the near future so
feel free to take it!

Mike McCandless

http://blog.mikemccandless.com



sean.bridges at gmail

Jul 12, 2012, 9:39 PM

Post #14 of 14 (358 views)
Permalink
Re: delete by docid in lucene 4 [In reply to]

Thanks for the advice everyone, I'll try updateDocument() for now.

Sean

