
Mailing List Archive: Lucene: Java-User

Why is the old value still in the index

 

 



paul_t100 at fastmail

Dec 16, 2011, 8:54 AM

Post #1 of 9
Why is the old value still in the index

I'm adding documents to an index. At a later date I modify a document
and update the index, close the writer, and open a new IndexReader. My
IndexReader iterates over the terms for that field, and docFreq() returns
one as I would expect. However, the iterator returns both the old value
of the document and the new value. I don't expect (or want) the old value
to still be in the index, so why is this?


This full test program generates:

TermDocsFreq1
test
TermDocsFreq1
test
test2

I don't expect to see 'test' listed the second time.


package com.jthink.jaikoz;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.*;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;


public class LuceneTest
{
    public static void main(String[] args)
    {
        try
        {
            String FIELD1 = "field1";
            RAMDirectory dir = new RAMDirectory();

            // Index a single document whose field value is "test"
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_35,
                    new StandardAnalyzer(Version.LUCENE_35));
            IndexWriter iw = new IndexWriter(dir, iwc);
            Document document = new Document();
            document.add(new Field(FIELD1, "test", Field.Store.YES,
                    Field.Index.ANALYZED));
            iw.addDocument(document);
            iw.close();

            // List the terms of the field as seen by a fresh reader
            IndexReader ir = IndexReader.open(dir, true);
            TermEnum terms = ir.terms(new Term(FIELD1));
            System.out.println("TermDocsFreq" + terms.docFreq());
            do
            {
                if (terms.term() != null)
                {
                    System.out.println(terms.term().text());
                }
            }
            while (terms.next() && terms.term().field().equals(FIELD1));

            // Update the document so that its value becomes "test2"
            IndexWriterConfig iwc2 = new IndexWriterConfig(Version.LUCENE_35,
                    new StandardAnalyzer(Version.LUCENE_35));
            iw = new IndexWriter(dir, iwc2);
            document = new Document();
            document.add(new Field(FIELD1, "test2", Field.Store.YES,
                    Field.Index.ANALYZED));
            iw.updateDocument(new Term(FIELD1, "term1"), document);
            iw.close();

            // List the terms again with a new reader
            ir = IndexReader.open(dir, true);
            terms = ir.terms(new Term(FIELD1));
            System.out.println("TermDocsFreq" + terms.docFreq());
            do
            {
                if (terms.term() != null)
                {
                    System.out.println(terms.term().text());
                }
            }
            while (terms.next() && terms.term().field().equals(FIELD1));
        }
        catch (Exception ex)
        {
            ex.printStackTrace();
        }
    }
}


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


ian.lea at gmail

Dec 16, 2011, 9:10 AM

Post #2 of 9
Re: Why is the old value still in the index

Shouldn't

iw.updateDocument(new Term(FIELD1,"term1"),document);

be

iw.updateDocument(new Term(FIELD1,"test"),document);

if you want to replace the first doc?


--
Ian.



paul_t100 at fastmail

Dec 16, 2011, 9:19 AM

Post #3 of 9
Re: Why is the old value still in the index

On 16/12/2011 17:10, Ian Lea wrote:
> Shouldn't
>
> iw.updateDocument(new Term(FIELD1,"term1"),document);
>
> be
>
> iw.updateDocument(new Term(FIELD1,"test"),document);
>
> if you want to replace the first doc?
Hmm, you are right. If I change it, I then get

TermDocsFreq1
test
TermDocsFreq1
test2



(but that doesn't resolve the problem with my real code, which doesn't
seem to have this mistake :()

What I don't understand, then, is why in the incorrect example I don't get

TermDocsFreq2

if I've actually created another document rather than updating one?



Carl.Austin at baesystemsdetica

Dec 16, 2011, 9:32 AM

Post #4 of 9
RE: Why is the old value still in the index

The docFreq() call returns the number of documents that contain the
current term in the enum, not a count across all terms in the term enum.

Also be aware of this, from the Lucene wiki: "Once a document is deleted it
will not appear in TermDocs nor TermPositions enumerations, nor any
search results. Attempts to load the document will result in an
exception. The presence of this document may still be reflected in the
docFreq statistics, and thus alter search scores, though this will be
corrected eventually as segments containing deletions are merged."

You can check more accurately by using TermDocs if you need to.
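
To illustrate the first point: docFreq() describes only the term the enumeration is currently positioned on, so printing it once before the loop reports the count for the first term alone. The sketch below is plain Java, not Lucene code. ToyTermEnum is a hypothetical stand-in for TermEnum (it only mimics the term()/docFreq()/next() cursor shape), and the two-term index is made-up data. It reads docFreq() inside the loop so that each term gets its own count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PerTermDocFreq {

    /** Hypothetical stand-in for a term enumeration: a cursor over sorted terms. */
    public static final class ToyTermEnum {
        private final List<Map.Entry<String, Integer>> entries;
        private int pos = 0; // the cursor starts ON the first term

        public ToyTermEnum(TreeMap<String, Integer> termToDocFreq) {
            this.entries = new ArrayList<>(termToDocFreq.entrySet());
        }

        /** Current term text, or null when the enumeration is exhausted. */
        public String term() {
            return pos < entries.size() ? entries.get(pos).getKey() : null;
        }

        /** Number of documents containing the CURRENT term only. */
        public int docFreq() {
            return pos < entries.size() ? entries.get(pos).getValue() : -1;
        }

        /** Advances the cursor; returns false when no terms remain. */
        public boolean next() {
            return ++pos < entries.size();
        }
    }

    /** Collects "term=docFreq" for every term, reading docFreq inside the loop. */
    public static List<String> listTermsWithDocFreq(ToyTermEnum terms) {
        List<String> out = new ArrayList<>();
        if (terms.term() == null) {
            return out;
        }
        do {
            out.add(terms.term() + "=" + terms.docFreq());
        } while (terms.next());
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> index = new TreeMap<>();
        index.put("test", 1);  // made-up data: one document contains "test"
        index.put("test2", 1); // and one document contains "test2"
        System.out.println(listTermsWithDocFreq(new ToyTermEnum(index)));
    }
}
```

Run as is, main prints [test=1, test2=1]: two terms, each with a document frequency of one, which is why a single docFreq() call before the loop never shows 2.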



uwe at thetaphi

Dec 16, 2011, 9:43 AM

Post #5 of 9
RE: Why is the old value still in the index

Hi,
> I'm adding documents to an index, at a later date I modify a document and
> update the index, close the writer and open a new IndexReader. My
> indexreader iterates over terms for that field and docFreq() returns one as I
> would expect, however the iterator returns both the old value of the document
> and the new value, I don't expect (or want) the old value to still be in the index,
> so why is this.

That is all as expected. Updating a document in a Lucene index is an atomic
delete/add operation. Deleting in Lucene just marks the document for
deletion, but it is still there (search results won't return it). The
consequence is that all of its terms are still in the terms index and the
document frequencies still count both documents. This *may* cause scoring
problems in indexes with many deletes (those go away once merging removes
the deleted documents, see below), but this is known behaviour (see the
wiki, javadocs, ...).

Once you add more documents the index will merge segments, and that will
make the deleted documents disappear. If you really want to remove the old
documents with all their terms right away (this is very expensive), you can
call IndexWriter.forceMergeDeletes():
http://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/index/IndexWriter.html#forceMergeDeletes()

The way inverted indexes work makes it impossible to update the terms
index afterwards.
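
The behaviour described above can be sketched with a toy inverted index in plain Java. This is an illustration under loud assumptions, not Lucene's implementation: ToyIndex and all of its methods are hypothetical names, each document holds a single unanalyzed term, and there are no segments (so it also does not model anything Lucene does per segment). It shows an update as delete-then-add, a delete as a mere mark, and a merge as the step that finally drops the orphaned term.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

public class ToyIndex {
    // Term dictionary: term -> IDs of the documents containing that term.
    private TreeMap<String, TreeSet<Integer>> postings = new TreeMap<>();
    // Stored value of each document, indexed by document ID.
    private List<String> stored = new ArrayList<>();
    // Deletion marks: a set bit means "deleted"; the postings stay untouched.
    private BitSet deleted = new BitSet();

    public void addDocument(String value) {
        int docId = stored.size();
        stored.add(value);
        postings.computeIfAbsent(value, k -> new TreeSet<>()).add(docId);
    }

    /** Delete-then-add: marks every document containing oldTerm, then appends the new one. */
    public void updateDocument(String oldTerm, String newValue) {
        TreeSet<Integer> docs = postings.get(oldTerm);
        if (docs != null) {
            for (int docId : docs) {
                deleted.set(docId);
            }
        }
        addDocument(newValue);
    }

    /** Terms as the dictionary lists them, including terms of deleted documents. */
    public List<String> terms() {
        return new ArrayList<>(postings.keySet());
    }

    /** Raw statistic: still counts documents that are only marked as deleted. */
    public int docFreq(String term) {
        TreeSet<Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    /** Postings filtered through the deletion marks. */
    public List<Integer> liveDocs(String term) {
        List<Integer> live = new ArrayList<>();
        for (int docId : postings.getOrDefault(term, new TreeSet<>())) {
            if (!deleted.get(docId)) {
                live.add(docId);
            }
        }
        return live;
    }

    /** Merge: rebuilds the index from the live documents only, dropping orphaned terms. */
    public void mergeDeletes() {
        List<String> survivors = new ArrayList<>();
        for (int docId = 0; docId < stored.size(); docId++) {
            if (!deleted.get(docId)) {
                survivors.add(stored.get(docId));
            }
        }
        postings = new TreeMap<>();
        stored = new ArrayList<>();
        deleted = new BitSet();
        for (String value : survivors) {
            addDocument(value);
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("test");
        index.addDocument("other");
        index.updateDocument("test", "test2");
        System.out.println(index.terms() + " docFreq(test)=" + index.docFreq("test")
                + " live=" + index.liveDocs("test"));
        index.mergeDeletes();
        System.out.println(index.terms());
    }
}
```

In this toy, the first line printed is [other, test, test2] docFreq(test)=1 live=[] and the second is [other, test2]: the old term stays visible in the dictionary and in docFreq until the merge rewrites the index, while a walk over the live postings already skips the deleted document.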

Uwe




paul_t100 at fastmail

Dec 16, 2011, 12:54 PM

Post #6 of 9
Re: Why is the old value still in the index

Hi

Thanks, I think you might have it. But tell me, if forceMergeDeletes() is
a bad idea, is there a query I can run that just returns all docs, rather
than the iteration I use? (What I want is the value of a particular
field in each doc.)

Paul



paul_t100 at fastmail

Dec 16, 2011, 1:58 PM

Post #7 of 9
Re: Why is the old value still in the index

Never mind, I've got it working by adding another field to the index that
always has the same value, which I can search on.

Thanks, Paul



rene.a.hackl at gmx

Dec 16, 2011, 2:51 PM

Post #8 of 9
Re: Why is the old value still in the index

Maybe you could just use MatchAllDocsQuery?

http://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/search/MatchAllDocsQuery.html


Rene



paul_t100 at fastmail

Dec 16, 2011, 3:05 PM

Post #9 of 9
Re: Why is the old value still in the index

On 16/12/2011 22:51, Rene Hackl-Sommer wrote:
> Maybe you could just use MatchAllDocsQuery?
>
> http://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/search/MatchAllDocsQuery.html
>
>
> Rene
Ah, thanks Rene, that's what I wanted.

Paul
