Mailing List Archive: Lucene: Java-User

indexing xml messages


v.sevel at lombardodier

Nov 2, 2009, 11:40 PM

indexing xml messages

Hi, the following JUnit test fails on 3 of its searches (the ones commented out below):

@Test
public void indexXML() throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, analyzer, true,
            IndexWriter.MaxFieldLength.LIMITED);
    Document doc = new Document();
    String xml = FileHelper.readFileContent("lucene_work/myxml.xml");
    doc.add(new Field("myxml", xml, Field.Store.YES,
            Field.Index.ANALYZED));
    doc.add(new Field("id", "1", Field.Store.YES,
            Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);
    writer.close();

    // only searching, so read-only=true
    IndexReader reader = IndexReader.open(dir, true);
    Searcher searcher = new IndexSearcher(reader);
    // Assert.assertEquals(1, searcher.search(
    //         new TermQuery(new Term("myxml", "123AB")), 1).totalHits);
    Assert.assertEquals(1, searcher.search(
            new TermQuery(new Term("myxml", "reference")), 1).totalHits);
    // Assert.assertEquals(1, searcher.search(
    //         new TermQuery(new Term("myxml", "operationImpact")), 1).totalHits);
    Assert.assertEquals(1, searcher.search(
            new TermQuery(new Term("myxml", "data")), 1).totalHits);
    // Assert.assertEquals(1, searcher.search(
    //         new TermQuery(new Term("myxml", "EFG")), 1).totalHits);
    searcher.close();
    reader.close();
}

Given this XML message:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<operationImpact>
    <reference value="123AB"/>
    <data>EFG</data>
</operationImpact>

How do I get this to work? My goal is to be able to do full text search on
XML documents. This includes tags, attribute values and tag values.

Thanks,
vince
--
View this message in context: http://old.nabble.com/indexing-xml-messages-tp26160016p26160016.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


ian.lea at gmail

Nov 3, 2009, 1:08 AM

Re: indexing xml messages

StandardAnalyzer will, amongst other things, convert everything to
lowercase at index time. The text of a TermQuery is not analyzed, which
means that term queries on mixed or upper case text will fail to match.
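
To make the mismatch concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: analyze() is a hypothetical stand-in that only splits on non-alphanumeric characters and lowercases, which is much cruder than what StandardAnalyzer really does, and termMatches() mimics an exact, unanalyzed term lookup.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class LowercaseTermDemo {

    // Toy stand-in for index-time analysis (an assumption, not StandardAnalyzer):
    // split on anything that is not a letter or digit, then lowercase each token.
    static Set<String> analyze(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Exact lookup: the query text is compared as given, with no analysis applied.
    static boolean termMatches(Set<String> indexedTerms, String term) {
        return indexedTerms.contains(term);
    }

    public static void main(String[] args) {
        String xml = "<operationImpact><reference value=\"123AB\"/>"
                + "<data>EFG</data></operationImpact>";
        Set<String> indexed = analyze(xml);

        // Mixed or upper case query text misses the lowercased index terms.
        System.out.println("EFG -> " + termMatches(indexed, "EFG"));
        // Lowercasing the query text first makes the lookup succeed.
        System.out.println("efg -> " + termMatches(indexed, "efg"));
        System.out.println("operationimpact -> "
                + termMatches(indexed, "operationImpact".toLowerCase(Locale.ROOT)));
    }
}
```

In the real test the equivalent change would be to lowercase the term text before building each TermQuery, or to build the query through a query parser that uses the same analyzer as the index.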

There is some info on indexing XML docs in the FAQ
http://wiki.apache.org/lucene-java/LuceneFAQ#How_can_I_index_XML_documents.3F
and I'm sure that Google would find loads more stuff.
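
One common approach, sketched here under assumptions rather than taken from the FAQ, is to parse the XML first and index the extracted pieces instead of the raw markup. The class and method names below (XmlTextExtractor, extract, collect) are made up for illustration; only the JDK's DOM parser is used. It collects tag names, attribute values and tag values in document order.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XmlTextExtractor {

    // Parses the XML and returns tag names, attribute values and
    // non-blank text content, in document order.
    static List<String> extract(String xml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> out = new ArrayList<>();
        collect(dom.getDocumentElement(), out);
        return out;
    }

    // Depth-first walk over the DOM tree.
    private static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.add(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                out.add(attrs.item(i).getNodeValue());
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
                + "<operationImpact><reference value=\"123AB\"/>"
                + "<data>EFG</data></operationImpact>";
        System.out.println(extract(xml));
    }
}
```

For the message in this thread the extracted values are operationImpact, reference, 123AB, data and EFG. Each could then be added to a Lucene field; whether they are lowercased at index time still depends on the analyzer chosen for that field.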

And Luke is invaluable for seeing what your index really holds.


--
Ian.


