Mailing List Archive: Lucene: Java-User

Deleted document terms

 

 



jdp2000 at gmail

Aug 26, 2008, 12:56 AM

Post #1 of 5
Deleted document terms

Hi,

I just discovered some strange behaviour with deleted documents. I do a
search for documents with a certain query and delete one using
IndexWriter.deleteDocuments(Term) using a key for the term. Then I repeat
the search and the document is still there because I use a custom
HitCollector which does not check IndexReader.isDeleted(int). That is all
expected.

But when I try to show the deleted document by searching by key using the
same term it was deleted with, it is not found. So it seems that the term
(id:MYKEY) is removed from the index.

So I was surprised that the term for the id was removed but not the other
terms for the document.

But I guess this makes sense and I just need to check
IndexReader.isDeleted().

Does this all sound like correct behaviour?

Thanks,

John
--
View this message in context: http://www.nabble.com/Deleted-document-terms-tp19157027p19157027.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


lucene at mikemccandless

Aug 26, 2008, 1:45 AM

Post #2 of 5
Re: Deleted document terms [In reply to]

John Patterson wrote:

> I just discovered some strange behaviour with deleted documents. I do a
> search for documents with a certain query and delete one using
> IndexWriter.deleteDocuments(Term) using a key for the term. Then I repeat
> the search and the document is still there because I use a custom
> HitCollector which does not check IndexReader.isDeleted(int). That is all
> expected.

Hmm -- once a document is deleted, your HitCollector won't ever see
it. During searching, isDeleted is called per document at a very low
level.

If your HitCollector is seeing it, it sounds like it wasn't really
deleted. Are you sure you closed the IndexWriter and then reopened
your searcher, so that the searcher will see the deletion?
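
The point about reopening the searcher can be pictured with a toy model. The sketch below is plain Java and not Lucene code: ToyIndex, ToyReader and ToyIndexDemo are hypothetical names invented for illustration. It assumes only what is described above, namely that a delete marks whole documents and that an already-open reader keeps working on the snapshot it was opened with.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: this is NOT Lucene code. It mimics a writer that marks
// documents as deleted and a reader that works on a point-in-time snapshot.
class ToyIndex {
    // term -> ids of the documents containing that term
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private final BitSet deleted = new BitSet();
    private int nextDocId = 0;

    // Add a document made of exact terms and return its id.
    int addDocument(String... terms) {
        int docId = nextDocId++;
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
        }
        return docId;
    }

    // Mark every document containing the term as deleted; postings stay as they are.
    void deleteDocuments(String term) {
        for (int docId : postings.getOrDefault(term, Collections.<Integer>emptyList())) {
            deleted.set(docId);
        }
    }

    // Open a reader that sees the index exactly as it is right now.
    ToyReader openReader() {
        Map<String, List<Integer>> copy = new HashMap<>();
        postings.forEach((term, docs) -> copy.put(term, new ArrayList<>(docs)));
        return new ToyReader(copy, (BitSet) deleted.clone());
    }
}

class ToyReader {
    private final Map<String, List<Integer>> postings;
    private final BitSet deleted;

    ToyReader(Map<String, List<Integer>> postings, BitSet deleted) {
        this.postings = postings;
        this.deleted = deleted;
    }

    // Return live documents for a term; deleted ones are skipped for every term.
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (int docId : postings.getOrDefault(term, Collections.<Integer>emptyList())) {
            if (!deleted.get(docId)) {
                hits.add(docId);
            }
        }
        return hits;
    }
}

class ToyIndexDemo {
    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("id:MYKEY", "body:lucene");
        ToyReader before = index.openReader();   // opened before the delete
        index.deleteDocuments("id:MYKEY");
        ToyReader after = index.openReader();    // reopened after the delete
        System.out.println(before.search("body:lucene")); // stale reader still returns doc 0
        System.out.println(after.search("body:lucene"));  // reopened reader returns nothing
        System.out.println(after.search("id:MYKEY"));     // same answer when searching by key
    }
}
```

In this model a reader opened before the delete still returns the document for every term, and a reader opened afterwards returns it for none, so a mixed result points at something other than the deletion itself.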

> But when I try to show the deleted document by searching by key using the
> same term it was deleted with, it is not found. So it seems that the term
> (id:MYKEY) is removed from the index.

This is odd -- the document should either be deleted (entirely), or
not. You shouldn't get different behavior if you search for the doc
one way vs another.

> So I was surprised that the term for the id was removed but not the other
> terms for the document.

That makes two of us!

Mike



kalanir at gmail

Aug 26, 2008, 2:16 AM

Post #3 of 5
Re: Deleted document terms [In reply to]

Hi John,

Are you sure you made the id "tokenized" while indexing? I overcame this
issue by making the field used for the deletion a tokenized field, as
below.

document.add(new Field("id", id, Field.Store.YES, Field.Index.TOKENIZED));



Thanks




--
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa


jdp2000 at gmail

Aug 26, 2008, 2:45 AM

Post #4 of 5
Re: Deleted document terms [In reply to]

That was the problem - the id was not tokenized. Thanks for your help.



--
View this message in context: http://www.nabble.com/Deleted-document-terms-tp19157027p19158657.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




lucene at mikemccandless

Aug 26, 2008, 3:54 AM

Post #5 of 5
Re: Deleted document terms [In reply to]

Normally an ID should be indexed as Field.Index.UN_TOKENIZED.

Mike
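
One way to see why an un-tokenized id is the usual choice: a delete by Term compares the term text exactly and does not run it through an analyzer, while a tokenized field stores whatever the analyzer produced. The sketch below is plain Java, not Lucene's analysis chain. ToyFieldIndexing and its methods are hypothetical names, and the "analyzer" (split on non-alphanumeric characters, then lowercase) is only an assumption about what a typical analyzer might do to an id value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model only: this is NOT Lucene's analysis chain.
class ToyFieldIndexing {
    // Analyzed ("tokenized") indexing: the value may become several altered terms.
    // Assumed analyzer: split on non-alphanumeric characters, then lowercase.
    static List<String> tokenized(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Un-analyzed ("un-tokenized") indexing: the whole value is one exact term.
    static List<String> untokenized(String value) {
        List<String> terms = new ArrayList<>();
        terms.add(value);
        return terms;
    }

    // An exact-term lookup compares the given text to the indexed terms as-is.
    static boolean exactTermMatches(List<String> indexedTerms, String termText) {
        return indexedTerms.contains(termText);
    }

    public static void main(String[] args) {
        String id = "MY-KEY-42";
        System.out.println(tokenized(id));                         // [my, key, 42]
        System.out.println(untokenized(id));                       // [MY-KEY-42]
        System.out.println(exactTermMatches(tokenized(id), id));   // false
        System.out.println(exactTermMatches(untokenized(id), id)); // true
    }
}
```

With the un-tokenized variant the same string can be used to index, look up and delete the document. With the tokenized variant an exact term only matches if it happens to equal one of the analyzer's output tokens.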

John Patterson wrote:

> That was the problem - the id was not tokenized. Thanks for your help.


