
Mailing List Archive: Lucene: Java-User

Problem with TermVector offsets and positions not being preserved

 

 



tmoleary at uw

Jul 19, 2012, 4:16 PM

Post #1 of 13
Problem with TermVector offsets and positions not being preserved

I created an index using Lucene 3.6.0 in which I specified that a certain text field in each document should be indexed, stored, analyzed with no norms, with term vectors, offsets and positions. Later I looked at that index in Luke, and it said that term vectors were created for this field, but offsets and positions were not. The code I used for indexing couldn't be simpler. It looks like this for the relevant field:

doc.add(new Field("ReportText", reportTextContents, Field.Store.YES, Field.Index.ANALYZED_NO_NORMS, Field.TermVector.WITH_POSITIONS_OFFSETS));

The indexer adds these documents to the index and commits them. I ran the indexer in a debugger and watched the Lucene code set the Field instance variables called storeTermVector, storeOffsetWithTermVector and storePositionWithTermVector to true for this field.

When the indexing was done, I ran a simple program in a debugger that opens an index, reads each document and writes out its information as XML. The values of storeOffsetWithTermVector and storePositionWithTermVector in the ReportText Field objects were false. Is there something other than specifying Field.TermVector.WITH_POSITIONS_OFFSETS when constructing a Field that needs to be done in order for offsets and positions to be saved in the index? Or are there circumstances under which the Field.TermVector setting for a Field object is ignored? This doesn't make sense to me, and I could swear that offsets and positions were being saved in some older indexes I created that I unfortunately no longer have around for comparison. I'm sure that I am just overlooking something or have made some kind of mistake, but I can't see what it is at the moment. Thanks for any help or advice you can give me.
Mike
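
For readers unfamiliar with the terms used above, here is a rough sketch of what a term vector with positions and offsets records for one field of one document. It is a simplified stand-in and not Lucene's on-disk format: the class TermVectorSketch, its Entry type and the whitespace tokenizer are all hypothetical, chosen only to show the difference between a term's position (its token index) and its offsets (start and end character indexes in the original text).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermVectorSketch {

    /** Per-term data when frequencies, positions and offsets are all kept. */
    public static final class Entry {
        public final List<Integer> positions = new ArrayList<Integer>();
        public final List<int[]> offsets = new ArrayList<int[]>(); // {start, end}

        public int freq() {
            return positions.size();
        }
    }

    /** Splits on whitespace, lowercases, and records position and offsets per term. */
    public static Map<String, Entry> build(String text) {
        Map<String, Entry> vector = new TreeMap<String, Entry>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // skip whitespace between tokens
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // consume one token
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase();
                Entry entry = vector.get(term);
                if (entry == null) {
                    entry = new Entry();
                    vector.put(term, entry);
                }
                entry.positions.add(position++);           // token index in the field
                entry.offsets.add(new int[] {start, i});   // character span in the text
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Entry> e : build("chest pain no chest trauma").entrySet()) {
            StringBuilder sb = new StringBuilder(e.getKey());
            sb.append(" freq=").append(e.getValue().freq());
            sb.append(" positions=").append(e.getValue().positions);
            for (int[] span : e.getValue().offsets) {
                sb.append(" [").append(span[0]).append(',').append(span[1]).append(')');
            }
            System.out.println(sb);
        }
    }
}
```

A term vector stored without positions and offsets would keep only the term and its frequency; the question in this thread is whether the two extra lists were really dropped from the index or only appear to be missing.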


rcmuir at gmail

Jul 20, 2012, 6:10 AM

Post #2 of 13
Re: Problem with TermVector offsets and positions not being preserved

Hi Mike:

I wrote up some tests last night against 3.6 trying to find some way
to reproduce what you are seeing, e.g. adding additional segments with
the field specified without term vectors, without tv offsets, omitting
TF, and merging them and checking everything out. I couldn't find any
problems.

Can you provide more information?




--
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


tmoleary at uw

Jul 20, 2012, 2:10 PM

Post #3 of 13
RE: Problem with TermVector offsets and positions not being preserved

Hi Robert,
I put together the following two small applications to try to separate the problem I am having from my own software and any bugs it contains. One of the applications is called CreateTestIndex, and it comes with the Lucene In Action book's source code that you can download from Manning Publications. I changed it a tiny bit to get rid of a special analyzer that is irrelevant to what I am looking at, to get rid of a few warnings about deprecated functions, and to add a loop that writes names of fields and their TermVector, offset and position settings to the console.

The other application is called DumpIndex, and I got it from a web site somewhere about 6 months ago. I changed a few lines to get rid of deprecated function warnings and added the same line of code to it that writes field information to the console.

What I am seeing is that when I run CreateTestIndex, when the fields are first created, added to a document, and are about to be added to the index, the fields for which Field.TermVector.WITH_POSITIONS_OFFSETS is specified correctly print out that the values of field.isTermVectorStored(), field.isStoreOffsetWithTermVector() and field.isStorePositionWithTermVector() are true. When I run DumpIndex on the index that was created, those fields print out true for field.isTermVectorStored() and false for the other two functions.
Thanks,
Mike

This is the source code for CreateTestIndex:

////////////////////////////////////////////////////////////////////////////////
package myLucene;

/**
* Copyright Manning Publications Co.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific lan
*/

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.text.ParseException;

public class CreateTestIndex {

  public static Document getDocument(String rootDir, File file) throws IOException {
    Properties props = new Properties();
    props.load(new FileInputStream(file));

    Document doc = new Document();

    // category comes from relative path below the base directory
    String category = file.getParent().substring(rootDir.length()); //1
    category = category.replace(File.separatorChar, '/'); //1

    String isbn = props.getProperty("isbn"); //2
    String title = props.getProperty("title"); //2
    String author = props.getProperty("author"); //2
    String url = props.getProperty("url"); //2
    String subject = props.getProperty("subject"); //2

    String pubmonth = props.getProperty("pubmonth"); //2

    System.out.println(title + "\n" + author + "\n" + subject + "\n" + pubmonth + "\n" + category + "\n---------");

    doc.add(new Field("isbn", // 3
                      isbn, // 3
                      Field.Store.YES, // 3
                      Field.Index.NOT_ANALYZED)); // 3
    doc.add(new Field("category", // 3
                      category, // 3
                      Field.Store.YES, // 3
                      Field.Index.NOT_ANALYZED)); // 3
    doc.add(new Field("title", // 3
                      title, // 3
                      Field.Store.YES, // 3
                      Field.Index.ANALYZED, // 3
                      Field.TermVector.WITH_POSITIONS_OFFSETS)); // 3
    doc.add(new Field("title2", // 3
                      title.toLowerCase(), // 3
                      Field.Store.YES, // 3
                      Field.Index.NOT_ANALYZED_NO_NORMS, // 3
                      Field.TermVector.WITH_POSITIONS_OFFSETS)); // 3

    // split multiple authors into unique field instances
    String[] authors = author.split(","); // 3
    for (String a : authors) { // 3
      doc.add(new Field("author", // 3
                        a, // 3
                        Field.Store.YES, // 3
                        Field.Index.NOT_ANALYZED, // 3
                        Field.TermVector.WITH_POSITIONS_OFFSETS)); // 3
    }

    doc.add(new Field("url", // 3
                      url, // 3
                      Field.Store.YES, // 3
                      Field.Index.NOT_ANALYZED_NO_NORMS)); // 3
    doc.add(new Field("subject", // 3 //4
                      subject, // 3 //4
                      Field.Store.YES, // 3 //4
                      Field.Index.ANALYZED, // 3 //4
                      Field.TermVector.WITH_POSITIONS_OFFSETS)); // 3 //4

    doc.add(new NumericField("pubmonth", // 3
                             Field.Store.YES, // 3
                             true).setIntValue(Integer.parseInt(pubmonth))); // 3

    Date d; // 3
    try { // 3
      d = DateTools.stringToDate(pubmonth); // 3
    } catch (ParseException pe) { // 3
      throw new RuntimeException(pe); // 3
    } // 3
    doc.add(new NumericField("pubmonthAsDay") // 3
            .setIntValue((int) (d.getTime()/(1000*3600*24)))); // 3

    for (String text : new String[] {title, subject, author, category}) { // 3 // 5
      doc.add(new Field("contents", text, // 3 // 5
                        Field.Store.NO, Field.Index.ANALYZED, // 3 // 5
                        Field.TermVector.WITH_POSITIONS_OFFSETS)); // 3 // 5
    }

    List<Fieldable> fields = doc.getFields();

    for (Fieldable field : fields) {
      System.out.println(field.name() + " " + field.isTermVectorStored() + " " +
          field.isStoreOffsetWithTermVector() + " " + field.isStorePositionWithTermVector());
    }
    return doc;
  }

  private static void findFiles(List<File> result, File dir) {
    for (File file : dir.listFiles()) {
      if (file.getName().endsWith(".properties")) {
        result.add(file);
      } else if (file.isDirectory()) {
        findFiles(result, file);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    String dataDir = args[0];
    String indexDir = args[1];
    List<File> results = new ArrayList<File>();
    findFiles(results, new File(dataDir));
    System.out.println(results.size() + " books to index");
    Directory dir = FSDirectory.open(new File(indexDir));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
    IndexWriter w = new IndexWriter(dir, config);
    for (File file : results) {
      Document doc = getDocument(dataDir, file);
      w.addDocument(doc);
    }
    w.close();
    dir.close();
  }
}

/*
#1 Get category
#2 Pull fields
#3 Add fields to Document instance
#4 Flag subject field
#5 Add catch-all contents field
#6 Custom analyzer to override multi-valued position increment
*/
////////////////////////////////////////////////////////////////////////////////
And for DumpIndex:
////////////////////////////////////////////////////////////////////////////////
package myLucene;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;

import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;

import org.apache.lucene.store.FSDirectory;

import java.io.File;
import java.io.IOException;

import javax.xml.stream.FactoryConfigurationError;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/**
* Dumps a Lucene index as XML. Dumps all documents with their fields and values to stdout.
*
* Blog post at
* http://ktulu.com.ar/blog/2009/10/12/dumping-lucene-indexes-as-xml/
*
* @author Luis Parravicini
*/
public class DumpIndex {
  /**
   * Reads the index from the directory passed as argument or "index" if no arguments are given.
   */
  public static void main(String[] args) throws Exception {
    String index = (args.length > 0 ? args[0] : "index");

    new DumpIndex(index).dump();
  }

  private String dir;

  public DumpIndex(String dir) {
    this.dir = dir;
  }

  public void dump() throws XMLStreamException, FactoryConfigurationError, CorruptIndexException, IOException {
    XMLStreamWriter out = XMLOutputFactory.newInstance().createXMLStreamWriter(System.out);
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(dir)));

    out.writeStartDocument();
    out.writeStartElement("documents");

    for (int i = 0; i < reader.numDocs(); i++) {
      dumpDocument(reader.document(i), out);
    }
    out.writeEndElement();
    out.writeEndDocument();
    out.flush();
    reader.close();
  }

  private void dumpDocument(Document document, XMLStreamWriter out) throws XMLStreamException {
    out.writeStartElement("document");

    for (Fieldable field : document.getFields()) {
      System.out.println(field.name() + " " + field.isTermVectorStored() + " " +
          field.isStoreOffsetWithTermVector() + " " + field.isStorePositionWithTermVector());

      out.writeStartElement("field");
      out.writeAttribute("name", field.name());
      out.writeAttribute("value", field.stringValue());
      out.writeEndElement();
    }
    out.writeEndElement();
  }
}
////////////////////////////////////////////////////////////////////////////////



tmoleary at uw

Jul 20, 2012, 3:53 PM

Post #4 of 13
RE: Problem with TermVector offsets and positions not being preserved

I neglected to mention that CreateTestIndex uses a collection of data files with .properties extensions that are included in the Lucene In Action source code download.
Mike



rcmuir at gmail

Jul 20, 2012, 4:04 PM

Post #5 of 13
Re: Problem with TermVector offsets and positions not being preserved

I think it's wrong for DumpIndex to look at term vector information
on the Document that was retrieved from IndexReader.document;
that's basically just a way of getting access to your stored fields.

This tool should be using something like IndexReader.getTermFreqVector
for the document to determine if it has term vectors.
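
A minimal sketch of that approach against the Lucene 3.6 API, assuming the Lucene 3.6 core jar is on the classpath. The class name TermVectorCheck, the helpers inspect and buildSampleIndex, and the in-memory sample document with its "title" field are hypothetical and only serve the illustration. The null checks rely on my understanding that, in 3.x, TermPositionVector.getTermPositions and getOffsets return null when that information was not stored; treat that as an assumption to verify against your version.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

import java.io.IOException;

public class TermVectorCheck {

    /** Returns {hasVector, hasPositions, hasOffsets} for one field of one document. */
    public static boolean[] inspect(IndexReader reader, int docId, String field) throws IOException {
        // Ask the index for the term vector itself, not the stored Document.
        TermFreqVector tfv = reader.getTermFreqVector(docId, field);
        if (tfv == null || tfv.size() == 0) {
            return new boolean[] {tfv != null, false, false};
        }
        boolean positions = false;
        boolean offsets = false;
        if (tfv instanceof TermPositionVector) {
            TermPositionVector tpv = (TermPositionVector) tfv;
            // Assumption: null means "not stored" for the first term of the vector.
            positions = tpv.getTermPositions(0) != null;
            offsets = tpv.getOffsets(0) != null;
        }
        return new boolean[] {true, positions, offsets};
    }

    /** Builds a one-document in-memory index (hypothetical sample data). */
    public static Directory buildSampleIndex() throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriterConfig config =
            new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, config);
        Document doc = new Document();
        doc.add(new Field("title", "Lucene in Action",
                          Field.Store.YES, Field.Index.ANALYZED,
                          Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(doc);
        writer.close();
        return dir;
    }

    public static void main(String[] args) throws IOException {
        Directory dir = buildSampleIndex();
        IndexReader reader = IndexReader.open(dir);
        try {
            boolean[] r = inspect(reader, 0, "title");
            System.out.println("vector=" + r[0] + " positions=" + r[1] + " offsets=" + r[2]);
        } finally {
            reader.close();
        }
    }
}
```

The same inspect call could replace the field.isStoreOffsetWithTermVector() and field.isStorePositionWithTermVector() printout in DumpIndex, passing the document number and field.name().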

On Fri, Jul 20, 2012 at 5:10 PM, Mike O'Leary <tmoleary [at] uw> wrote:
> Hi Robert,
> I put together the following two small applications to try to separate the problem I am having from my own software and any bugs it contains. One of the applications is called CreateTestIndex, and it comes with the Lucene In Action book's source code that you can download from Manning Publications. I changed it a tiny bit to get rid of a special analyzer that is irrelevant to what I am looking at, to get rid of a few warnings about deprecated functions, and to add a loop that writes names of fields and their TermVector, offset and position settings to the console.
>
> The other application is called DumpIndex, and I got it from a web site somewhere about 6 months ago. I changed a few lines to get rid of deprecated function warnings and added the same line of code to it that writes field information to the console.
>
> What I am seeing is that when I run CreateTestIndex, when the fields are first created, added to a document, and are about to be added to the index, the fields for which Field.TermVector.WITH_POSITIONS_OFFSETS is specified correctly print out that the values of field.isTermVectorStored(), field.isStoreOffsetWithTermVector() and field.isStorePositionWithTermVector() are true. When I run DumpIndex on the index that was created, those fields print out true for field.isTermVectorStored() and false for the other two functions.
> Thanks,
> Mike
>
> This is the source code for CreateTestIndex:
>
> ////////////////////////////////////////////////////////////////////////////////
> package myLucene;
>
> /**
> * Copyright Manning Publications Co.
> *
> * Licensed under the Apache License, Version 2.0 (the "License");
> * you may not use this file except in compliance with the License.
> * You may obtain a copy of the License at
> *
> * http://www.apache.org/licenses/LICENSE-2.0
> *
> * Unless required by applicable law or agreed to in writing, software
> * distributed under the License is distributed on an "AS IS" BASIS,
> * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> * See the License for the specific lan
> */
>
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.Fieldable;
> import org.apache.lucene.document.NumericField;
> import org.apache.lucene.document.DateTools;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.util.Version;
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import java.util.Properties;
> import java.util.Date;
> import java.util.List;
> import java.util.ArrayList;
> import java.text.ParseException;
>
> public class CreateTestIndex {
>
> public static Document getDocument(String rootDir, File file) throws IOException {
> Properties props = new Properties();
> props.load(new FileInputStream(file));
>
> Document doc = new Document();
>
> // category comes from relative path below the base directory
> String category = file.getParent().substring(rootDir.length()); //1
> category = category.replace(File.separatorChar, '/'); //1
>
> String isbn = props.getProperty("isbn"); //2
> String title = props.getProperty("title"); //2
> String author = props.getProperty("author"); //2
> String url = props.getProperty("url"); //2
> String subject = props.getProperty("subject"); //2
>
> String pubmonth = props.getProperty("pubmonth"); //2
>
> System.out.println(title + "\n" + author + "\n" + subject + "\n" + pubmonth + "\n" + category + "\n---------");
>
> doc.add(new Field("isbn", // 3
> isbn, // 3
> Field.Store.YES, // 3
> Field.Index.NOT_ANALYZED)); // 3
> doc.add(new Field("category", // 3
> category, // 3
> Field.Store.YES, // 3
> Field.Index.NOT_ANALYZED)); // 3
> doc.add(new Field("title", // 3
> title, // 3
> Field.Store.YES, // 3
> Field.Index.ANALYZED, // 3
> Field.TermVector.WITH_POSITIONS_OFFSETS)); // 3
> doc.add(new Field("title2", // 3
> title.toLowerCase(), // 3
> Field.Store.YES, // 3
> Field.Index.NOT_ANALYZED_NO_NORMS, // 3
> Field.TermVector.WITH_POSITIONS_OFFSETS)); // 3
>
> // split multiple authors into unique field instances
> String[] authors = author.split(","); // 3
> for (String a : authors) { // 3
> doc.add(new Field("author", // 3
> a, // 3
> Field.Store.YES, // 3
> Field.Index.NOT_ANALYZED, // 3
> Field.TermVector.WITH_POSITIONS_OFFSETS)); // 3
> }
>
> doc.add(new Field("url", // 3
> url, // 3
> Field.Store.YES, // 3
> Field.Index.NOT_ANALYZED_NO_NORMS)); // 3
> doc.add(new Field("subject", // 3 //4
> subject, // 3 //4
> Field.Store.YES, // 3 //4
> Field.Index.ANALYZED, // 3 //4
> Field.TermVector.WITH_POSITIONS_OFFSETS)); // 3 //4
>
> doc.add(new NumericField("pubmonth", // 3
> Field.Store.YES, // 3
> true).setIntValue(Integer.parseInt(pubmonth))); // 3
>
> Date d; // 3
> try { // 3
> d = DateTools.stringToDate(pubmonth); // 3
> } catch (ParseException pe) { // 3
> throw new RuntimeException(pe); // 3
> } // 3
> doc.add(new NumericField("pubmonthAsDay") // 3
> .setIntValue((int) (d.getTime()/(1000*3600*24)))); // 3
>
> for(String text : new String[] {title, subject, author, category}) { // 3 // 5
> doc.add(new Field("contents", text, // 3 // 5
> Field.Store.NO, Field.Index.ANALYZED, // 3 // 5
> Field.TermVector.WITH_POSITIONS_OFFSETS)); // 3 // 5
> }
>
> List<Fieldable> fields = doc.getFields();
>
> for (Fieldable field : fields) {
> System.out.println(field.name() + " " + field.isTermVectorStored() + " " +
> field.isStoreOffsetWithTermVector() + " " + field.isStorePositionWithTermVector());
> }
> return doc;
> }
>
> private static void findFiles(List<File> result, File dir) {
> for(File file : dir.listFiles()) {
> if (file.getName().endsWith(".properties")) {
> result.add(file);
> } else if (file.isDirectory()) {
> findFiles(result, file);
> }
> }
> }
>
> public static void main(String[] args) throws IOException {
> String dataDir = args[0];
> String indexDir = args[1];
> List<File> results = new ArrayList<File>();
> findFiles(results, new File(dataDir));
> System.out.println(results.size() + " books to index");
> Directory dir = FSDirectory.open(new File(indexDir));
> IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
> IndexWriter w = new IndexWriter(dir, config);
> for(File file : results) {
> Document doc = getDocument(dataDir, file);
> w.addDocument(doc);
> }
> w.close();
> dir.close();
> }
> }
>
> /*
> #1 Get category
> #2 Pull fields
> #3 Add fields to Document instance
> #4 Flag subject field
> #5 Add catch-all contents field
> #6 Custom analyzer to override multi-valued position increment
> */
> ////////////////////////////////////////////////////////////////////////////////
> And for DumpIndex:
> ////////////////////////////////////////////////////////////////////////////////
> package myLucene;
>
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Fieldable;
>
> import org.apache.lucene.index.CorruptIndexException;
> import org.apache.lucene.index.IndexReader;
>
> import org.apache.lucene.store.FSDirectory;
>
> import java.io.File;
> import java.io.IOException;
>
> import javax.xml.stream.FactoryConfigurationError;
> import javax.xml.stream.XMLOutputFactory;
> import javax.xml.stream.XMLStreamException;
> import javax.xml.stream.XMLStreamWriter;
>
> /**
> * Dumps a Lucene index as XML. Dumps all documents with their fields and values to stdout.
> *
> * Blog post at
> * http://ktulu.com.ar/blog/2009/10/12/dumping-lucene-indexes-as-xml/
> *
> * @author Luis Parravicini
> */
> public class DumpIndex {
> /**
> * Reads the index from the directory passed as argument or "index" if no arguments are given.
> */
> public static void main(String[] args) throws Exception {
> String index = (args.length > 0 ? args[0] : "index");
>
> new DumpIndex(index).dump();
> }
>
> private String dir;
>
> public DumpIndex(String dir) {
> this.dir = dir;
> }
>
> public void dump() throws XMLStreamException, FactoryConfigurationError, CorruptIndexException, IOException {
> XMLStreamWriter out = XMLOutputFactory.newInstance().createXMLStreamWriter(System.out);
> IndexReader reader = IndexReader.open(FSDirectory.open(new File(dir)));
>
> out.writeStartDocument();
> out.writeStartElement("documents");
>
> for (int i = 0; i < reader.numDocs(); i++) {
> dumpDocument(reader.document(i), out);
> }
> out.writeEndElement();
> out.writeEndDocument();
> out.flush();
> reader.close();
> }
>
> private void dumpDocument(Document document, XMLStreamWriter out) throws XMLStreamException {
> out.writeStartElement("document");
>
> for (Fieldable field : document.getFields()) {
> System.out.println(field.name() + " " + field.isTermVectorStored() + " " +
> field.isStoreOffsetWithTermVector() + " " + field.isStorePositionWithTermVector());
>
> out.writeStartElement("field");
> out.writeAttribute("name", field.name());
> out.writeAttribute("value", field.stringValue());
> out.writeEndElement();
> }
> out.writeEndElement();
> }
> }
> ////////////////////////////////////////////////////////////////////////////////
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir [at] gmail]
> Sent: Friday, July 20, 2012 6:11 AM
> To: java-user [at] lucene
> Subject: Re: Problem with TermVector offsets and positions not being preserved
>
> Hi Mike:
>
> I wrote up some tests last night against 3.6 trying to find some way to reproduce what you are seeing, e.g. adding additional segments with the field specified without term vectors, without tv offsets, omitting TF, and merging them and checking everything out. I couldn't find any problems.
>
> Can you provide more information?
>
> On Thu, Jul 19, 2012 at 7:16 PM, Mike O'Leary <tmoleary [at] uw> wrote:
>> I created an index using Lucene 3.6.0 in which I specified that a certain text field in each document should be indexed, stored, analyzed with no norms, with term vectors, offsets and positions. Later I looked at that index in Luke, and it said that term vectors were created for this field, but offsets and positions were not. The code I used for indexing couldn't be simpler. It looks like this for the relevant field:
>>
>> doc.add(new Field("ReportText", reportTextContents, Field.Store.YES,
>> Field.Index.ANALYZED_NO_NORMS,
>> Field.TermVector.WITH_POSITIONS_OFFSETS));
>>
>> The indexer adds these documents to the index and commits them. I ran the indexer in a debugger and watched the Lucene code set the Field instance variables called storeTermVector, storeOffsetWithTermVector and storePositionWithTermVector to true for this field.
>>
>> When the indexing was done, I ran a simple program in a debugger that opens an index, reads each document and writes out its information as XML. The values of storeOffsetWithTermVector and storePositionWithTermVector in the ReportText Field objects were false. Is there something other than specifying Field.TermVector.WITH_POSITIONS_OFFSETS when constructing a Field that needs to be done in order for offsets and positions to be saved in the index? Or are there circumstances under which the Field.TermVector setting for a Field object is ignored? This doesn't make sense to me, and I could swear that offsets and positions were being saved in some older indexes I created that I unfortunately no longer have around for comparison. I'm sure that I am just overlooking something or have made some kind of mistake, but I can't see what it is at the moment. Thanks for any help or advice you can give me.
>> Mike
>
>
>
> --
> lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
> For additional commands, e-mail: java-user-help [at] lucene
>



--
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


tmoleary at uw

Jul 20, 2012, 5:24 PM

Post #6 of 13 (532 views)
Permalink
RE: Problem with TermVector offsets and positions not being preserved [In reply to]

Hi Robert,
I'm not trying to determine whether a document has term vectors; I'm trying to determine whether the term vectors that are in the index have offsets and positions stored. Shouldn't the Field instance variables called storeOffsetWithTermVector and storePositionWithTermVector be set to true for a field that is defined to store offsets and positions in term vectors? They are set to true in 3.5, but not in 3.6. When I open an index that I created with 3.6 in Luke, it says the fields in question have term vectors enabled, but offsets and positions are not stored.

Maybe once term vectors with offsets and positions are created, it doesn't matter anymore what the values of storeOffsetWithTermVector and storePositionWithTermVector happen to be. But I'd like to find out for sure whether offsets and positions are being handled correctly in 3.6, because I need to produce indexes that a co-worker can use with a UI that uses fast vector term highlighting, and I'd like to be sure I have created indexes that work for him.
Thanks,
Mike

-----Original Message-----
From: Robert Muir [mailto:rcmuir [at] gmail]
Sent: Friday, July 20, 2012 4:05 PM
To: java-user [at] lucene
Subject: Re: Problem with TermVector offsets and positions not being preserved

I think it's wrong for DumpIndex to look at term vector information from the Document that was retrieved from IndexReader.document; that's basically just a way of getting access to your stored fields.

This tool should be using something like IndexReader.getTermFreqVector for the document to determine if it has term vectors.




--
lucidimagination.com




rcmuir at gmail

Jul 20, 2012, 5:59 PM

Post #7 of 13 (528 views)
Permalink
Re: Problem with TermVector offsets and positions not being preserved [In reply to]

On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary <tmoleary [at] uw> wrote:
> Hi Robert,
> I'm not trying to determine whether a document has term vectors, I'm trying to determine whether the term vectors that are in the index have offsets and positions stored.

Right: what I'm trying to tell you is that storing offsets and positions
is not an index-wide setting for a field: it's per-document.

I think all the tools you are using to check these values are not
doing it correctly:
1. DumpIndex is wrongly using values from the Document returned by
IndexReader.document(), but that doesn't and never did retrieve these
values (it would take 2 extra disk seeks per document to figure out the
term vector flags).
2. I haven't looked at Luke, but it's probably printing the "global"
bits from FieldInfos. We used to write some bits for these options. I
don't even know what the purpose was: since these options can be
switched on and off at a per-document level, index-wide bits make no
sense. Because of this we stopped writing these bits in 3.6 (we only
record in FieldInfos whether the field has any term vectors at all),
and that's probably what's confusing you there.

Again, if you really want to validate that a specific document has
offsets/positions in its term vectors, you need to check that specific
document with IndexReader.getTermFreqVector. There is no other way,
since this can be controlled on a per-document basis for a field.
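
A minimal sketch of that per-document check, assuming Lucene 3.6 is on the classpath. The class name TermVectorCheck, the helpers describe and buildDemoIndex, and the sample field contents are illustrative and do not come from this thread. The two demo documents deliberately use different term vector options for the same field, to show why one index-wide flag cannot describe them.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

import java.io.IOException;

public class TermVectorCheck {

    /** Reports what the term vector of one field in one document actually contains. */
    public static String describe(IndexReader reader, int docId, String field) throws IOException {
        TermFreqVector tfv = reader.getTermFreqVector(docId, field);
        if (tfv == null) {
            return "no term vector";
        }
        if (!(tfv instanceof TermPositionVector)) {
            // Terms and frequencies only: neither positions nor offsets were stored.
            return "term vector only";
        }
        TermPositionVector tpv = (TermPositionVector) tfv;
        // Each accessor returns null when that kind of data was not stored.
        boolean positions = tpv.getTermPositions(0) != null;
        boolean offsets = tpv.getOffsets(0) != null;
        return "positions=" + positions + " offsets=" + offsets;
    }

    /** Hypothetical demo index: same field name, different term vector options per document. */
    public static Directory buildDemoIndex() throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriterConfig config =
            new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, config);

        Document full = new Document();
        full.add(new Field("subject", "lucene term vectors", Field.Store.YES,
            Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(full);

        Document plain = new Document();
        plain.add(new Field("subject", "lucene term vectors", Field.Store.YES,
            Field.Index.ANALYZED, Field.TermVector.YES));
        writer.addDocument(plain);

        writer.close();
        return dir;
    }

    public static void main(String[] args) throws IOException {
        IndexReader reader = IndexReader.open(buildDemoIndex());
        for (int docId = 0; docId < reader.maxDoc(); docId++) {
            System.out.println("doc " + docId + ": " + describe(reader, docId, "subject"));
        }
        reader.close();
    }
}
```

Running describe over every document id (skipping deleted ones in a real index) is one way to confirm that an index built for highlighting really carries positions and offsets, instead of relying on the flags of a Document loaded from stored fields.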


--
lucidimagination.com



tmoleary at uw

Jul 26, 2012, 3:50 PM

Post #8 of 13 (504 views)
Permalink
RE: Problem with TermVector offsets and positions not being preserved [In reply to]

Hi Robert,
Thanks for your help. This cleared up all of the things I was having trouble understanding about offsets and positions in term vectors.
Mike




ab at getopt

Jul 27, 2012, 6:10 AM

Post #9 of 13 (502 views)
Permalink
Re: Problem with TermVector offsets and positions not being preserved [In reply to]

On 27/07/2012 00:50, Mike O'Leary wrote:
> Hi Robert,
> Thanks for your help. This cleared up all of the things I was having trouble understanding about offsets and positions in term vectors.
> Mike
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir [at] gmail]
> Sent: Friday, July 20, 2012 5:59 PM
> To: java-user [at] lucene
> Subject: Re: Problem with TermVector offsets and positions not being preserved
>
> On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary <tmoleary [at] uw> wrote:
>> Hi Robert,
>> I'm not trying to determine whether a document has term vectors, I'm trying to determine whether the term vectors that are in the index have offsets and positions stored.
>
> Right: what i'm trying to tell you is that offsets and positions is not an index-wide setting for a field: its per-document.
>
> I think all the tools you are using to check these values are not doing it correctly:
> 1. DumpIndex is wrongly using values from the Document returned by IndexReader.document(), but that doesn't and never did retrieve these values (it would be 2 extra disk seeks per document to figure out the term vector flags)
> 2. I havent looked at Luke, but its probably printing the "global" bits from FieldInfos. It used to be that we wrote some bits for these options, I don't ever know what the purpose was since these options can be controlled on/off at a per-document level: they make no sense.
> Because of this we stopped writing these bits in 3.6 (we only write into FieldInfos if the field has any term vectors at all), and thats probably whats confusing you there.

Catching up with this thread ... Luke 4.0-ALPHA makes a similar mistake.
I fixed this in svn (to be released in a week or so) so that:

* Luke now actually checks whether a doc has term vectors for a
particular field and adjusts the field flags based on the
presence/absence of a term vector. FieldInfos were not enough to handle
some combinations.

* Luke doesn't show the offsets/positions flags in the document view,
since they are not known in advance. However, the pop-up that shows a
term vector correctly shows positions and offsets if available (or
blanks if not available).


--
Best regards,
Andrzej Bialecki
http://www.sigram.com, blog http://www.sigram.com/blog
___.,___,___,___,_._. __________________<><____________________
[___||.__|__/|__||\/|: Information Retrieval, System Integration
___|||__||..\|..||..|: Contact: info at sigram dot com




rcmuir at gmail

Jul 27, 2012, 6:24 AM

Post #10 of 13 (505 views)
Permalink
Re: Problem with TermVector offsets and positions not being preserved [In reply to]

On Fri, Jul 27, 2012 at 9:10 AM, Andrzej Bialecki <ab [at] getopt> wrote:
>
> Catching up with this thread ... Luke 4.0-ALPHA makes a similar mistake. I
> fixed this in svn (to be released in a week or so) so that:
>
> * Luke now actually checks whether a doc has term vectors for a particular
> field and adjusts the field flags based on the presence/absence of a term
> vector. FieldInfos were not enough to handle some combinations.
>
> * Luke doesn't show the offsets/positions flags in the document view, since
> they are not known in advance. However, the pop-up that shows a term vector
> correctly shows positions and offsets if available (or blanks if not
> available).
>

Thanks Andrzej!

I can't remember in which issue we stopped writing those bits (maybe
https://issues.apache.org/jira/browse/LUCENE-3679 ?)... It wasn't
until this email that I remembered it.

But if I recall, there might have been problems: I know there was a lot
of sneakiness to try to handle the corner cases so the bits would be
"correct", but nothing in Lucene really used these bits... and I don't
think CheckIndex ever actually validated that, if the offsets bit was
set in FieldInfos, at least 1 doc (even if deleted) actually had
them, and so on.

The worst part is, I don't actually understand the use case for this
being configurable on a per-document basis for a field, I actually
think this is confusing...

--
lucidimagination.com



tmoleary at uw

Aug 22, 2012, 2:23 PM

Post #11 of 13 (426 views)
RE: Problem with TermVector offsets and positions not being preserved [In reply to]

I have one more question about term vector positions and offsets being preserved. My co-worker is working on updating the documents in an index with a field that contains a numerical value derived from the term frequencies and inverse document frequencies of terms in the document. His first pass at doing this calculates these values, writes them along with document ids to a text file and then updates the documents by reading lines from the file, searching for the document that contains the id, adding the field to the document, and replacing the document in the index. Some of the fields in these documents have term vectors with offsets and positions. After the revised document is updated in the index, those fields' term vector offsets and positions are still found. After closing the searcher, reader and writer that are used in this process, the fields that have term vectors no longer have positions and offsets in them. His code looks like this:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, _analyzer);
IndexWriter writer = new IndexWriter(indexDir, config);
IndexReader reader = IndexReader.open(writer, true);
IndexSearcher searcher = new IndexSearcher(reader);

while ((s = in.readLine()) != null) {
    // Each input line has the form "<docId>,<value>".
    String[] tokens = s.split(",");
    float fieldValue = Float.parseFloat(tokens[1].trim());
    NumericField nField = new NumericField("freqVal", Field.Store.YES, true);
    nField.setFloatValue(fieldValue);
    String docId = tokens[0].trim();
    Term docIdTerm = new Term("DocId", docId);
    TermQuery query = new TermQuery(docIdTerm);
    TopDocs hits = searcher.search(query, 2);

    if (hits.scoreDocs.length != 1) {
        throw new Exception("Unexpected number of documents in index with docId = " + docId);
    }
    int docNum = hits.scoreDocs[0].doc;
    // Fetch the stored document, add the new field, and replace the old document.
    Document doc = searcher.doc(docNum);
    doc.add(nField);
    writer.updateDocument(docIdTerm, doc);
}
displayTermVectorInfo(dir); // for debugging
writer.close();
displayTermVectorInfo(dir); // for debugging
reader.close();
searcher.close();

private static void displayTermVectorInfo(Directory dir) throws IOException, CorruptIndexException {
    IndexReader reader = null;

    try {
        reader = IndexReader.open(dir);

        for (int i = 0; i < reader.numDocs(); i++) {
            Document doc = reader.document(i);
            List<Fieldable> docFields = doc.getFields();

            for (Fieldable field : docFields) {
                TermFreqVector termFreqVector = reader.getTermFreqVector(i, field.name());

                // Only vectors stored with positions/offsets implement TermPositionVector.
                if (termFreqVector != null && termFreqVector instanceof TermPositionVector) {
                    TermPositionVector termPositionVector = (TermPositionVector)termFreqVector;
                    System.out.println("Field " + field.name());

                    for (int j = 0; j < termFreqVector.size(); j++) {
                        TermVectorOffsetInfo[] offsets = termPositionVector.getOffsets(j);

                        for (TermVectorOffsetInfo offsetInfo : offsets) {
                            System.out.println("offset: " + offsetInfo.getStartOffset() + " " + offsetInfo.getEndOffset());
                        }
                    }
                    for (int k = 0; k < termFreqVector.size(); k++) {
                        int[] positions = termPositionVector.getTermPositions(k);

                        for (int position : positions) {
                            System.out.println("position: " + position);
                        }
                    }
                }
            }
        }
    } finally {
        if (reader != null) {
            reader.close();
        }
    }
}

The first time displayTermVectorInfo is called, it displays offsets and positions for the fields that have term vectors with offsets and positions. The second time it is called, it doesn't display anything because none of the term vectors satisfy termFreqVector instanceof TermPositionVector. Is it supposed to work this way? What is it about closing the writer that alters the term vectors in the affected fields? Is there a way to add a field to the documents in an index in which this doesn't occur?
Thanks,
Mike


-----Original Message-----
From: Robert Muir [mailto:rcmuir [at] gmail]
Sent: Friday, July 20, 2012 5:59 PM
To: java-user [at] lucene
Subject: Re: Problem with TermVector offsets and positions not being preserved

On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary <tmoleary [at] uw> wrote:
> Hi Robert,
> I'm not trying to determine whether a document has term vectors, I'm trying to determine whether the term vectors that are in the index have offsets and positions stored.

Right: what I'm trying to tell you is that offsets and positions are not an index-wide setting for a field: it's per-document.

I think all the tools you are using to check these values are not doing it correctly:
1. DumpIndex is wrongly using values from the Document returned by IndexReader.document(), but that doesn't and never did retrieve these values (it would be 2 extra disk seeks per document to figure out the term vector flags).
2. I haven't looked at Luke, but it's probably printing the "global" bits from FieldInfos. It used to be that we wrote some bits for these options. I don't even know what the purpose was, since these options can be controlled on/off at a per-document level: they make no sense. Because of this we stopped writing these bits in 3.6 (we only write into FieldInfos if the field has any term vectors at all), and that's probably what's confusing you there.

Again, if you really want to validate that a specific document has offsets/positions in its term vectors, you need to check that specific document with IndexReader.getTermFreqVector. There is no other way, since this can be controlled on a per-document basis for a field.


--
lucidimagination.com




rcmuir at gmail

Aug 24, 2012, 9:51 AM

Post #12 of 13 (416 views)
Re: Problem with TermVector offsets and positions not being preserved [In reply to]

Calling IndexReader.document() does not restore your 'original Document'
completely. This is really an age-old trap.
So don't update documents this way: it's fine to fetch their contents,
but nothing goes through the effort to ensure that things like term
vector parameters are the same as what you originally provided. This
would require extra disk seeks.

See https://issues.apache.org/jira/browse/LUCENE-3312 for an effort to
fix this trap for Google Summer of Code.

--
lucidworks.com



tmoleary at uw

Aug 24, 2012, 11:02 AM

Post #13 of 13 (414 views)
RE: Problem with TermVector offsets and positions not being preserved [In reply to]

So for Lucene 3.6, is the right way to do this to create a new Document and add new Fields based on the old Fields (with the settings you want them to have for term vector offsets and positions, etc.) and then call updateDocument on that new Document?
Thanks,
Mike
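
A minimal sketch of the approach asked about above, assuming the application keeps its own record of each field's term vector setting. It is a toy model in plain Java, not the Lucene API: RebuildSketch, ModelField, TermVectorMode, loadStored and rebuild are all hypothetical names. loadStored mimics what this thread reports for a document fetched from the index (names and values survive, the positions/offsets setting does not), and rebuild creates fresh fields whose settings come from a schema map. In Lucene 3.6 the rebuild step would presumably create each field with the Field constructor shown in the first post, passing Field.TermVector.WITH_POSITIONS_OFFSETS where needed, before calling updateDocument.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types belong to Lucene. */
final class RebuildSketch {

    /** Stand-in for the term vector options a field can be indexed with. */
    enum TermVectorMode { NO, YES, WITH_POSITIONS_OFFSETS }

    /** Stand-in for a field: name, stored value, and the indexing option. */
    static final class ModelField {
        final String name;
        final String value;
        final TermVectorMode termVector;

        ModelField(String name, String value, TermVectorMode termVector) {
            this.name = name;
            this.value = value;
            this.termVector = termVector;
        }
    }

    /**
     * Models fetching a stored document: the value comes back, but only the
     * "has term vectors at all" bit survives, not positions/offsets.
     */
    static List<ModelField> loadStored(List<ModelField> indexed) {
        List<ModelField> loaded = new ArrayList<ModelField>();
        for (ModelField f : indexed) {
            TermVectorMode mode =
                (f.termVector == TermVectorMode.NO) ? TermVectorMode.NO : TermVectorMode.YES;
            loaded.add(new ModelField(f.name, f.value, mode));
        }
        return loaded;
    }

    /**
     * Builds new fields from the loaded values, taking each field's option
     * from a schema the application maintains (assumption: such a map exists).
     * Fields missing from the schema fall back to no term vectors.
     */
    static List<ModelField> rebuild(List<ModelField> loaded, Map<String, TermVectorMode> schema) {
        List<ModelField> fresh = new ArrayList<ModelField>();
        for (ModelField f : loaded) {
            TermVectorMode mode = schema.getOrDefault(f.name, TermVectorMode.NO);
            fresh.add(new ModelField(f.name, f.value, mode));
        }
        return fresh;
    }

    public static void main(String[] args) {
        Map<String, TermVectorMode> schema = new HashMap<String, TermVectorMode>();
        schema.put("DocId", TermVectorMode.NO);
        schema.put("ReportText", TermVectorMode.WITH_POSITIONS_OFFSETS);

        List<ModelField> original = new ArrayList<ModelField>();
        original.add(new ModelField("DocId", "42", TermVectorMode.NO));
        original.add(new ModelField("ReportText", "some report text",
                                    TermVectorMode.WITH_POSITIONS_OFFSETS));

        List<ModelField> loaded = loadStored(original);
        List<ModelField> fresh = rebuild(loaded, schema);

        System.out.println("loaded:  " + loaded.get(1).termVector);
        System.out.println("rebuilt: " + fresh.get(1).termVector);
    }
}
```

The point of the schema map is that the index itself is not treated as the source of truth for indexing options; the application that originally chose them supplies them again on every update.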


