
Mailing List Archive: Lucene: Java-User

delete a document from indexwriter

 

 



cambazz at gmail

Jan 18, 2008, 6:22 AM

Post #1 of 11 (4236 views)
Permalink
delete a document from indexwriter

Hello,

How do I delete a specific document from an IndexWriter? I understand there
is deleteDocuments(term), which deletes all the documents matching the term.
But what if I want to delete a document that is identified by more than one
term? I can find the document with a boolean query, and then get the
doc id.
I know that doc ids are temporary, but can I not use one for the delete?

IndexReader has a delete by doc id method, but I am not sure how to use this
when using an indexwriter.

Best,
C.B.


lucene at mikemccandless

Jan 19, 2008, 3:07 AM

Post #2 of 11 (4139 views)
Permalink
Re: delete a document from indexwriter [In reply to]

Good question....

So far, this method has not been carried over to IndexWriter because
in general it's not really safe, since there's no way to get an
accurate docID from IndexWriter itself.

You can't really "know" when IndexWriter does merges that compact
deletes and thus change docIDs. So, if you open a reader on the
side, get a docID you want to delete, and then go and ask IndexWriter
to delete that docID, you may in fact delete the wrong document. In
2.3, where segment merges are now done with a background thread, it's
even worse, because a merge could complete and be committed, thus
changing docIDs, at any time...
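
The hazard described above can be sketched with a toy model. This is plain Java, not Lucene code: a docID is assumed to be nothing more than a position in a list, and the hypothetical mergeCompactingDeletes method stands in for a segment merge that drops deleted documents and renumbers the survivors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (not Lucene code): a docID is just a position in a list. */
final class DocIdShiftDemo {

    /** Simulates a merge that drops deleted documents and renumbers the rest. */
    static List<String> mergeCompactingDeletes(List<String> docs, boolean[] deleted) {
        List<String> merged = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            if (!deleted[docId]) {
                merged.add(docs.get(docId)); // the survivor takes the next free docID
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a", "b", "c", "d");
        boolean[] deleted = {false, true, false, false}; // "b" was deleted earlier

        // docID of "c" as seen by a reader that was opened before the merge
        int staleDocId = docs.indexOf("c");
        List<String> merged = mergeCompactingDeletes(docs, deleted);

        System.out.println("before merge: docID " + staleDocId + " -> " + docs.get(staleDocId));
        System.out.println("after merge:  docID " + staleDocId + " -> " + merged.get(staleDocId));
    }
}
```

In this model the stale docID 2 names "c" before the merge and "d" after it, so a delete issued with the old number would remove the wrong document.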

See complex discussion here:

http://markmail.org/message/wxqel3gd6cmavk5a

As of 2.3, the low level infrastructure was added to IndexWriter for
deleting by document ID, but this is not exposed publicly (this was a
side effect of LUCENE-1112). It's only used, internally, to delete a
document if an exception is hit while indexing it. In theory, you
could then subclass IndexWriter and tap into this infrastructure to
delete by docID, but, you're entering dangerous territory!

Do you have a specific use case in mind here? I think we'd like to
make this option available someday in IndexWriter, but doing so now
(when there is no way to get a "reliable" docID) seems too dangerous...

Mike


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


cambazz at gmail

Jan 21, 2008, 5:54 AM

Post #3 of 11 (4117 views)
Permalink
Re: delete a document from indexwriter [In reply to]

Hello Mike;

How about deleting by a compound term?

For example, say I have a document with two fields, srcId and dstId,
and I want to delete the document where srcId=1 and dstId=2.

Right now there is IndexWriter.deleteDocuments(Term t), but with that I
can only delete, let's say, where srcId=something.

I am sure there is a workaround but I could not find it.

Best,



lucene at mikemccandless

Jan 21, 2008, 6:28 AM

Post #4 of 11 (4122 views)
Permalink
Re: delete a document from indexwriter [In reply to]

For this case, too, you will need to use an IndexReader, or use
IndexSearcher to run that particular search and then delete the
docIDs returned using the IndexReader.

Though, be sure to first iterate through all hits, gathering all
docIDs. And then in 2nd pass, do the deletions. Otherwise you'll
hit this issue:

https://issues.apache.org/jira/browse/LUCENE-1096

(Unless you're using 2.3).
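
The two-pass pattern can be sketched as follows. This is a toy in-memory model in plain Java, not Lucene code: the Edge class, the list standing in for the index, and the method names are all hypothetical. With a real index, pass one would walk the search results and record the document numbers, and pass two would hand each number to the reader's delete-by-docID method.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory stand-in for an index of edge documents (not Lucene code). */
final class TwoPassDelete {

    static final class Edge {
        final int srcId;
        final int dstId;
        boolean deleted;

        Edge(int srcId, int dstId) {
            this.srcId = srcId;
            this.dstId = dstId;
        }
    }

    /** Pass 1: walk every hit and only record the matching docIDs. */
    static List<Integer> collectDocIds(List<Edge> index, int srcId, int dstId) {
        List<Integer> docIds = new ArrayList<>();
        for (int docId = 0; docId < index.size(); docId++) {
            Edge e = index.get(docId);
            if (!e.deleted && e.srcId == srcId && e.dstId == dstId) {
                docIds.add(docId);
            }
        }
        return docIds;
    }

    /** Pass 2: apply the deletions once the result set is no longer being read. */
    static int deleteAll(List<Edge> index, List<Integer> docIds) {
        for (int docId : docIds) {
            index.get(docId).deleted = true;
        }
        return docIds.size();
    }

    public static void main(String[] args) {
        List<Edge> index = new ArrayList<>();
        index.add(new Edge(1, 2));
        index.add(new Edge(1, 3));
        index.add(new Edge(1, 2));

        List<Integer> docIds = collectDocIds(index, 1, 2);
        System.out.println("deleted " + deleteAll(index, docIds) + " document(s)");
    }
}
```

Keeping the two passes separate means the result set is never modified while it is still being iterated, which is the point of the advice above.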

You can also use Solr, which provides "delete by query".

Mike



cambazz at gmail

Jan 21, 2008, 6:51 AM

Post #5 of 11 (4120 views)
Permalink
Re: delete a document from indexwriter [In reply to]

Hello Michael;

How can I set things up so that the reader and the writer are in the same state?
I can call the getIndexReader method of the IndexSearcher, but when I delete
documents through the reader, how will this interact with the writer?
I have disabled autoflush and use my own logic to do flushes, since I
have very small documents. However, I am a little confused about how to
use IndexWriter, IndexReader and IndexSearcher together. Currently
I have an IndexWriter and an IndexSearcher; I write documents and
call flush() myself. In this scenario,
how can I access a reader so that my logic still works?

Best.
-C.B.






lucene at mikemccandless

Jan 21, 2008, 6:55 AM

Post #6 of 11 (4116 views)
Permalink
Re: delete a document from indexwriter [In reply to]

You will have to close the IndexWriter.

Only one "writer" may be open at once on an index, where "writer"
includes an IndexReader that has done some deletes (the first time
you delete a document using a reader, it will acquire the write.lock,
which will fail if you have another writer open on that index).
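
The locking rule above can be modelled roughly like this. It is a toy sketch in plain Java, not Lucene code: ToyIndex, ToyWriter and ToyReader are hypothetical names, an AtomicBoolean stands in for write.lock, and an IllegalStateException stands in for whatever the real lock failure looks like.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of a single per-index write lock (not Lucene code). */
final class WriteLockDemo {

    /** Stands in for the index directory and its write.lock. */
    static final class ToyIndex {
        private final AtomicBoolean writeLock = new AtomicBoolean(false);

        boolean tryObtainWriteLock() {
            return writeLock.compareAndSet(false, true);
        }

        void releaseWriteLock() {
            writeLock.set(false);
        }
    }

    /** A writer takes the lock as soon as it is opened. */
    static final class ToyWriter {
        private final ToyIndex index;

        ToyWriter(ToyIndex index) {
            if (!index.tryObtainWriteLock()) {
                throw new IllegalStateException("write lock already held");
            }
            this.index = index;
        }

        void close() {
            index.releaseWriteLock();
        }
    }

    /** A reader holds no lock until its first delete. */
    static final class ToyReader {
        private final ToyIndex index;
        private boolean holdsLock;

        ToyReader(ToyIndex index) {
            this.index = index;
        }

        void deleteDocument(int docId) {
            if (!holdsLock) {
                if (!index.tryObtainWriteLock()) {
                    throw new IllegalStateException("write lock already held");
                }
                holdsLock = true;
            }
            // A real implementation would mark docId as deleted here.
        }

        void close() {
            if (holdsLock) {
                index.releaseWriteLock();
                holdsLock = false;
            }
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        ToyWriter writer = new ToyWriter(index);
        ToyReader reader = new ToyReader(index); // opening a reader takes no lock

        try {
            reader.deleteDocument(0);
        } catch (IllegalStateException e) {
            System.out.println("delete failed while the writer is open: " + e.getMessage());
        }

        writer.close();           // close the writer first...
        reader.deleteDocument(0); // ...then the reader can take the lock
        System.out.println("delete succeeded after closing the writer");
        reader.close();
    }
}
```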

Mike



cambazz at gmail

Jan 21, 2008, 7:10 AM

Post #7 of 11 (4115 views)
Permalink
Re: delete a document from indexwriter [In reply to]

Yes, I noticed
http://www.archivum.info/java-dev [at] lucene/2006-09/msg00065.html

Somehow I gotta do my delete within the same writer. I could use another
field that combines both src and dst field, and use this field without
storing but still a waste of resources.
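
A minimal sketch of that combined-field idea, in plain Java rather than Lucene code. The edgeKey helper, the "_" separator and the HashMap standing in for the index are assumptions made for illustration. In Lucene the key would presumably be indexed as a single unanalyzed, unstored term so that IndexWriter.deleteDocuments(Term) can match it exactly.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy sketch of a combined src/dst key (not Lucene code). */
final class EdgeKeyDemo {

    /**
     * Builds one term value from both ids. The separator matters: without it,
     * (1, 23) and (12, 3) would both become "123".
     */
    static String edgeKey(int srcId, int dstId) {
        return srcId + "_" + dstId;
    }

    public static void main(String[] args) {
        // The map stands in for "documents addressable by a single term".
        Map<String, String> byKey = new HashMap<>();
        byKey.put(edgeKey(1, 2), "edge 1 -> 2");
        byKey.put(edgeKey(1, 3), "edge 1 -> 3");

        // Deleting "srcId=1 AND dstId=2" becomes a delete by one exact term.
        byKey.remove(edgeKey(1, 2));
        System.out.println(byKey.keySet());
    }
}
```

The cost is one extra indexed term per document, which is the waste of resources mentioned above; the benefit is that the delete stays inside the writer.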

I wonder if IndexWriter could be modified to accept a boolean query with termA
and termB instead of a single term?
I read the code up to the part where the deleted term is added to a HashMap,
but could not follow it after that.

Best,
-C.B.



synchronized private void addDeleteTerm(Term term, int docCount) {
  Num num = (Num) bufferedDeleteTerms.get(term);
  if (num == null) {
    bufferedDeleteTerms.put(term, new Num(docCount));
    // This is coarse approximation of actual bytes used:
    numBytesUsed += (term.field().length() + term.text().length()) * BYTES_PER_CHAR
        + 4 + 5 * OBJECT_HEADER_BYTES + 5 * OBJECT_POINTER_BYTES;
    if (ramBufferSize != IndexWriter.DISABLE_AUTO_FLUSH
        && numBytesUsed > ramBufferSize) {
      bufferIsFull = true;
    }
  } else {
    num.setNum(docCount);
  }
  numBufferedDeleteTerms++;
}




Best Regards,




cambazz at gmail

Jan 22, 2008, 1:43 AM

Post #8 of 11 (4107 views)
Permalink
Re: delete a document from indexwriter [In reply to]

I am looking at the IndexWriter source code - and I could not find a method
(private) to delete by doc id.
Where is it hiding?

Best,
-C.B.



lucene at mikemccandless

Jan 22, 2008, 2:05 AM

Post #9 of 11 (4113 views)
Permalink
Re: delete a document from indexwriter [In reply to]

Cam Bazz wrote:

> Yes, I noticed
> http://www.archivum.info/java-dev [at] lucene/2006-09/msg00065.html
>
> Somehow I gotta do my delete within the same writer. I could use another
> field that combines both src and dst field, and use this field without
> storing but still a waste of resources.
>
> I wonder if IndexWriter can be modified to accept a boolean query with termA
> and termB - instead of single term?

If we go this route, we should add deleteByQuery to IndexWriter,
which I think is a good idea, but just hasn't been done yet....

It was discussed before, in the context of LUCENE-565, but never
actually added to IndexWriter. I think the approach would be similar
to how IndexWriter now deletes by Term.

> I read the code until the part where deleted the term is added to a hashmap,
> but could not follow on after that.

Basically, the terms are buffered into a HashMap. Then when a flush
happens, a SegmentReader is opened one by one on all segments, and
the delete is applied to each segment. Special logic is used to
apply the delete to the just-flushed segment since you have to take
care to only delete docIDs up until the point when the delete by term
was called.
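
The "only up until the point when the delete was called" rule can be sketched with a toy model. This is plain Java, not the Lucene implementation: it models only the in-memory buffer that becomes the just-flushed segment, gives each document a single term for brevity, and all class and method names are hypothetical. The buffered map mirrors the term-to-docCount bookkeeping visible in addDeleteTerm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of buffered delete-by-term (not the Lucene implementation). */
final class BufferedDeletesDemo {

    // One term per buffered document, to keep the sketch short.
    private final List<String> bufferedDocs = new ArrayList<>();
    // term -> number of documents that were buffered when the delete was requested
    private final Map<String, Integer> bufferedDeleteTerms = new HashMap<>();

    void addDocument(String term) {
        bufferedDocs.add(term);
    }

    /** Nothing is deleted yet; the term and the current doc count are remembered. */
    void deleteDocuments(String term) {
        bufferedDeleteTerms.put(term, bufferedDocs.size());
    }

    /** Applies the buffered deletes and returns the surviving docs as "term#docID". */
    List<String> flush() {
        List<String> survivors = new ArrayList<>();
        for (int docId = 0; docId < bufferedDocs.size(); docId++) {
            String term = bufferedDocs.get(docId);
            Integer limit = bufferedDeleteTerms.get(term);
            // Only documents added before the delete call are affected.
            boolean deleted = limit != null && docId < limit;
            if (!deleted) {
                survivors.add(term + "#" + docId);
            }
        }
        bufferedDocs.clear();
        bufferedDeleteTerms.clear();
        return survivors;
    }

    public static void main(String[] args) {
        BufferedDeletesDemo writer = new BufferedDeletesDemo();
        writer.addDocument("x");     // docID 0
        writer.deleteDocuments("x"); // should hit docID 0 only
        writer.addDocument("x");     // docID 1, added after the delete
        writer.addDocument("y");     // docID 2
        System.out.println(writer.flush());
    }
}
```

Here the document added after the delete call survives the flush even though it carries the same term, which is the behaviour the special logic is there to guarantee.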

Mike



lucene at mikemccandless

Jan 22, 2008, 2:07 AM

Post #10 of 11 (4104 views)
Permalink
Re: delete a document from indexwriter [In reply to]

Exactly, that is the method...

Mike

Cam Bazz wrote:

> Hello,
>
> Did you mean the
>
> synchronized private void addDeleteDocID(int docId) {
> bufferedDeleteDocIDs.add(new Integer(docId));
> numBytesUsed += OBJECT_HEADER_BYTES + BYTES_PER_INT +
> OBJECT_POINTER_BYTES;
> }
>
>
> this however does not delete document but rather marks it in
> bufferedDeleteDocIDs right?
>
> On Jan 22, 2008 11:51 AM, Michael McCandless <
> lucene [at] mikemccandless> wrote:
>
> Woops, sorry, it's hiding in DocumentsWriter: the addDeleteDocID
> method.
>
> And, it's private. So you will actually have to modify the Lucene
> core sources (not just subclass IndexWriter) to experiment with
> deleting by docID in IndexWriter. If you find a clean way to safely
> expose this functionality, please post back with your results!
>
> Mike
>
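The risk discussed earlier in the thread (a merge compacts deletes and docIDs shift) can be illustrated with a small stand-alone sketch. This is not Lucene code: the class `DocIdShiftDemo` and the method `compact` are hypothetical stand-ins, and a "segment" is modeled as a plain list in which a document's ID is simply its position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not Lucene's implementation: a docID is a list position.
class DocIdShiftDemo {

    /** Drops deleted docs, as a merge that compacts deletes would; survivors are renumbered. */
    static List<String> compact(List<String> docs, boolean[] deleted) {
        List<String> merged = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            if (!deleted[docId]) {
                merged.add(docs.get(docId));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("edge A-B", "edge A-C", "edge B-C", "edge C-D");
        boolean[] deleted = {false, true, false, false}; // "edge A-C" was deleted earlier

        // docID obtained from a reader opened "on the side"
        int staleDocId = docs.indexOf("edge B-C");

        // A merge completes in the meantime and compacts the deleted doc away.
        List<String> merged = compact(docs, deleted);

        System.out.println("wanted to delete: " + docs.get(staleDocId));   // edge B-C
        System.out.println("docID now refers to: " + merged.get(staleDocId)); // edge C-D
    }
}
```

In this toy model the stale docID ends up pointing at a different document after compaction, which is the failure mode described for deleting by docID through IndexWriter.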


lucene at mikemccandless

Jan 22, 2008, 2:08 AM

Post #11 of 11 (4112 views)
Re: delete a document from indexwriter [In reply to]

Well, docIDs are used all over the place in the index. Sometimes
they key into an index file "linearly", as with the stored fields and
term vectors index files, but other times they are encoded, e.g., in
the posting lists.

Mike
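A rough sketch of the two usages mentioned above, under stated assumptions: it is not Lucene's actual file-format code. The class `DocIdLayoutSketch`, the fixed entry width `ENTRY_BYTES`, and the gap encoding shown here are illustrative choices; the real on-disk formats may differ in detail.

```java
import java.util.Arrays;

// Illustrative only: shows a docID used as a linear key and as an encoded value.
class DocIdLayoutSketch {

    // Assumed fixed width of one entry in a per-document index file.
    static final int ENTRY_BYTES = 8;

    /** "Linear" keying: with fixed-width entries, the docID alone gives the file offset. */
    static long indexEntryOffset(int docId) {
        return (long) docId * ENTRY_BYTES;
    }

    /** Postings-style encoding: store the gaps between sorted docIDs, not the IDs themselves. */
    static int[] toDeltas(int[] sortedDocIds) {
        int[] deltas = new int[sortedDocIds.length];
        int previous = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            deltas[i] = sortedDocIds[i] - previous;
            previous = sortedDocIds[i];
        }
        return deltas;
    }

    /** Decoding is a running sum over the gaps. */
    static int[] fromDeltas(int[] deltas) {
        int[] docIds = new int[deltas.length];
        int running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            docIds[i] = running;
        }
        return docIds;
    }

    public static void main(String[] args) {
        System.out.println(indexEntryOffset(5)); // prints 40
        int[] deltas = toDeltas(new int[] {3, 7, 8, 20});
        System.out.println(Arrays.toString(deltas));             // prints [3, 4, 1, 12]
        System.out.println(Arrays.toString(fromDeltas(deltas))); // prints [3, 7, 8, 20]
    }
}
```

Either way the docID acts as a position within the index rather than a disk address chosen per document, which is why renumbering during a merge touches so many files.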

Cam Bazz wrote:

> Yes, I have found it. However, it is not for regular indexing. I am
> using Lucene as a graph database, so I add edges. An edge is a
> document with srcId and dstId fields.
> However, I am using a caching scheme which employs an LRU, a
> WeakHashMap, and two linked lists for keeping track of reads and
> writes. (It is not working yet, but I will tell you when it does.)
>
> About doc IDs being transparent: do the doc IDs point to places on
> disk?
>
> Best Regards,
> C.A.
>
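For the edge use case quoted above, one possible workaround (an assumption, not something proposed in the thread) is to give each edge document a single unique key built from srcId and dstId, so that the existing deleteDocuments(term) can target exactly one document without any docID. The sketch below is plain Java with no Lucene dependency: `EdgeKeySketch`, `edgeKey`, and the "_" separator are hypothetical, and the map merely stands in for an index keyed by that term.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: the map stands in for an index with one key term per edge.
class EdgeKeySketch {

    /** Builds one key per (srcId, dstId) pair; the separator keeps (1, 23) and (12, 3) distinct. */
    static String edgeKey(long srcId, long dstId) {
        return srcId + "_" + dstId;
    }

    public static void main(String[] args) {
        Map<String, String> edgesByKey = new LinkedHashMap<>();
        edgesByKey.put(edgeKey(1, 23), "edge 1 -> 23");
        edgesByKey.put(edgeKey(12, 3), "edge 12 -> 3");

        // Deleting by key removes exactly one edge, whatever its docID would be.
        edgesByKey.remove(edgeKey(12, 3));

        System.out.println(edgesByKey.keySet()); // prints [1_23]
    }
}
```

With Lucene, the same key would presumably be indexed as a single untokenized field and passed as the term to deleteDocuments(term); that wiring is left out here because it needs the Lucene jar.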
