
Mailing List Archive: Lucene: Java-User

Deleted documents in the index.

 

 



nageshblore at gmail

Jul 25, 2008, 4:17 AM

Post #1 of 7 (385 views)
Permalink
Deleted documents in the index.

Hi,
I am writing a class to report on an index. The documents in this index were
updated using the IndexWriter.updateDocument(Term, Document) method, that is,
they were deleted and added again. My aim is to see which documents (and
their fields) are present in the index. Since each document was updated (i.e.
deleted and added), it is marked as deleted, and so I cannot obtain a
Document object for the updated document.

How do I report on such documents?

for (int i = 1; i < numDocs; i++) {
    // ir is an IndexReader object
    if (ir.isDeleted(i)) {
        bw.write("Document " + i + " has been deleted.");
        bw.newLine();
    } else {
        Document d = getDocument(ir, i);

        List<Field> l = d.getFields();
        int numFields = l.size();
        bw.write("Document has " + numFields + " fields as follows");
        bw.newLine();

        for (int j = 0; j < numFields; j++) {
            String fieldName = l.get(j).name();
            bw.write("\t Field : " + fieldName + " Value : "
                    + d.getField(fieldName).stringValue());
            bw.newLine();
        }
    }
}


nageshblore at gmail

Jul 25, 2008, 5:12 AM

Post #2 of 7 (353 views)
Permalink
Deleted documents in the index. [In reply to]

Hi,
I think, the earlier mail didn't make it through.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


lucene at mikemccandless

Jul 25, 2008, 5:18 AM

Post #3 of 7 (353 views)
Permalink
Re: Deleted documents in the index. [In reply to]

When you call updateDocument, the old document is deleted and a
wholly new document is added. So the "else" clause in your loop
will report on the newly added documents (you won't miss any).

Mike
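
To make the "deleted, then added as a wholly new document" behaviour concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class ToyIndex and its methods are hypothetical, and the real updateDocument selects the old document by a Term rather than by number. The only point it illustrates is that the replacement gets a new document number while the old slot stays behind, flagged as deleted.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model only: this is NOT Lucene's implementation or API.
class ToyIndex {
    private final List<String> slots = new ArrayList<>(); // one slot per document number ever assigned
    private final BitSet deleted = new BitSet();           // slots flagged as deleted

    // Append a document and return its newly assigned document number.
    int add(String doc) {
        slots.add(doc);
        return slots.size() - 1;
    }

    // Model of an update: flag the old slot as deleted, then append the replacement.
    int update(int oldDocNum, String newDoc) {
        deleted.set(oldDocNum);
        return add(newDoc);
    }

    boolean isDeleted(int docNum) {
        return deleted.get(docNum);
    }

    // Number of slots ever assigned, deleted ones included.
    int maxDoc() {
        return slots.size();
    }

    // Number of live (non-deleted) documents.
    int numDocs() {
        return slots.size() - deleted.cardinality();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("a-v1");
        index.add("b-v1");
        int newNum = index.update(0, "a-v2");
        System.out.println("replacement got document number " + newNum);
        System.out.println("slot 0 deleted: " + index.isDeleted(0));
        System.out.println("numDocs=" + index.numDocs() + " maxDoc=" + index.maxDoc());
    }
}
```

In this model the live count and the size of the number range drift apart as soon as one document is updated, at least until the deleted slots are reclaimed.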



nageshblore at gmail

Jul 25, 2008, 5:44 AM

Post #4 of 7 (357 views)
Permalink
Re: Deleted documents in the index. [In reply to]

Hi Michael,
Thanks for your response. Yes, I got that.

I guess my question is: how do I access the newly added document? In
other words, if the index initially had 20 docs of which 10 were
updated (that is, deleted and then added), how do I access the updated
ones?

Initially there was no check for deletion, that is, I did not call
IndexReader.isDeleted(int). There was only the for loop, which would fail
when obtaining a 'deleted' document with the following exception:

java.lang.IllegalArgumentException: attempt to access a deleted document
at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:331)
at org.apache.lucene.index.MultiReader.document(MultiReader.java:108)
at org.apache.lucene.index.IndexReader.document(IndexReader.java:437)
etc.

Regards,
Nagesh



lucene at mikemccandless

Jul 25, 2008, 6:47 AM

Post #5 of 7 (351 views)
Permalink
Re: Deleted documents in the index. [In reply to]

Oh, I think I see the problem -- instead of numDocs in your for loop
(which I assume came from IndexReader.numDocs()) change that to maxDoc
(IndexReader.maxDoc()).

Mike
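
The effect of the loop bound can be checked with a small stand-alone sketch. It does not call Lucene; it only models the scenario from this thread (20 documents, 10 of them updated) under the assumption that the updated documents were numbers 0..9 and their replacements were appended as 20..29 with no merge in between. The class ScanBounds and the method liveVisited are made up for illustration.

```java
import java.util.BitSet;

// Toy model of document numbering; it does not call Lucene.
class ScanBounds {
    // Count the live (non-deleted) slots visited by a loop over [0, bound).
    static int liveVisited(BitSet deleted, int bound) {
        int live = 0;
        for (int i = 0; i < bound; i++) {
            if (!deleted.get(i)) {
                live++;
            }
        }
        return live;
    }

    public static void main(String[] args) {
        // Assumed layout: 30 slots in total, slots 0..9 deleted by the updates.
        int maxDoc = 30;
        BitSet deleted = new BitSet(maxDoc);
        deleted.set(0, 10); // flags 0..9 (the end index is exclusive)
        int numDocs = maxDoc - deleted.cardinality(); // 20 live documents

        System.out.println("bound numDocs: " + liveVisited(deleted, numDocs) + " live documents reported");
        System.out.println("bound maxDoc:  " + liveVisited(deleted, maxDoc) + " live documents reported");
    }
}
```

Under these assumptions the numDocs bound stops at slot 19 and reports only 10 of the 20 live documents, while the maxDoc bound reaches all 20. In the original loop this corresponds to using ir.maxDoc() as the bound and keeping the isDeleted check. Note also that Lucene document numbers start at 0, so a loop that starts at 1 skips the first document.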



nageshblore at gmail

Jul 25, 2008, 7:09 AM

Post #6 of 7 (343 views)
Permalink
Re: Deleted documents in the index. [In reply to]

Hi Michael,
The numDocs did come from IndexReader.numDocs().

Hmm... let me try with maxDoc.

Nagesh



nageshblore at gmail

Jul 25, 2008, 7:56 AM

Post #7 of 7 (349 views)
Permalink
Re: Deleted documents in the index. [In reply to]

Hey Michael,
maxDoc() did the trick! Thanks!

I have some reading to do about numDocs() and maxDoc()...


Nagesh

