Mailing List Archive: Lucene: Java-User

Creating an index from an XML file using Lucene in Java

 

 



fayyazuddin at gmail

Jul 27, 2008, 10:58 AM

Post #1 of 7 (271 views)
Creating an index from an XML file using Lucene in Java

Dear fellow Java/Lucene developers:

I have a question about creating an index from an XML document so that it can
be searched with the Lucene API in Java.

I am searching Shakespeare's "Hamlet", which I have as an XML document. I
want to include commentary on each scene and make that commentary searchable
for the user as well. At present, I search through a set of <SPEECH>
elements, each of which holds one character's dialogue. With my new
arrangement, each scene, which is made up of several characters' speeches,
will be enclosed in a pair of <SCENE></SCENE> tags and will have a
<SCENE-COMMENTARY></SCENE-COMMENTARY> element at the top that provides the
commentary for the scene that follows.

How would I modify my indexing code (which follows the XML document) to
create an index that lets the user search the <SCENE-COMMENTARY> section just
as easily as the text contained in the <SPEECH> elements? Once I have
accomplished this, I would like to search the commentary and display the
results to the user just as I do for hits in the <SPEECH> elements.
I have also listed the code for searching the current index.

Thanks in advance to everyone who replies.

Sincerely;
Fayyaz


Here is the xml snippet for the play:

<PLAY>
<TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE>
<SCENE>
<SCENE-COMMENTARY>Here is where I will include commentary on the scene that
follows, which I would also like to make searchable to the
user.</SCENE-COMMENTARY>
<SPEECH>
<REFERENCE>ACT 1, SCENE 1</REFERENCE>
<SPEAKER>LORD POLONIUS</SPEAKER>
<LINES>Yet here, Laertes! aboard, aboard, for shame!
The wind sits in the shoulder of your sail,
And you are stay'd for. There; my blessing with thee!
And these few precepts in thy memory
See thou character. Give thy thoughts no tongue,
Nor any unproportioned thought his act.
Be thou familiar, but by no means vulgar.
Those friends thou hast, and their adoption tried,
Grapple them to thy soul with hoops of steel;
But do not dull thy palm with entertainment
Of each new-hatch'd, unfledged comrade. Beware
Of entrance to a quarrel, but being in,
Bear't that the opposed may beware of thee.
Give every man thy ear, but few thy voice;
Take each man's censure, but reserve thy judgment.
Costly thy habit as thy purse can buy,
But not express'd in fancy; rich, not gaudy;
For the apparel oft proclaims the man,
And they in France of the best rank and station
Are of a most select and generous chief in that.
Neither a borrower nor a lender be;
For loan oft loses both itself and friend,
And borrowing dulls the edge of husbandry.
This above all: to thine ownself be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.
Farewell: my blessing season this in thee!</LINES>
</SPEECH>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<REFERENCE>ACT 1, SCENE 2</REFERENCE>
<LINES>To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of despised love, the law's delay,
The insolence of office and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscover'd country from whose bourn
No traveller returns, puzzles the will
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all;
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought,
And enterprises of great pith and moment
With this regard their currents turn awry,
And lose the name of action.--Soft you now!
The fair Ophelia! Nymph, in thy orisons
Be all my sins remember'd.</LINES>
</SPEECH>
<SPEECH>
<REFERENCE>ACT 1, SCENE 3</REFERENCE>
<SPEAKER>HAMLET</SPEAKER>
<LINES>To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of despised love, the law's delay,
The insolence of office and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscover'd country from whose bourn
No traveller returns, puzzles the will
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all;
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought,
And enterprises of great pith and moment
With this regard their currents turn awry,
And lose the name of action.--Soft you now!
The fair Ophelia! Nymph, in thy orisons
Be all my sins remember'd.</LINES>
</SPEECH>
<SPEECH>
<REFERENCE>ACT 1, SCENE 4</REFERENCE>
<SPEAKER>HAMLET</SPEAKER>
<LINES>To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of despised love, the law's delay,
The insolence of office and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscover'd country from whose bourn
No traveller returns, puzzles the will
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all;
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought,
And enterprises of great pith and moment
With this regard their currents turn awry,
And lose the name of action.--Soft you now!
The fair Ophelia! Nymph, in thy orisons
Be all my sins remember'd.</LINES>
</SPEECH>
</SCENE>
</PLAY>


Here is my indexing code:

package hamlet;

import java.io.InputStream;
import java.io.IOException;
import java.io.File;
import java.io.FileInputStream;
import java.util.Iterator;
import java.util.HashMap;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.SAXException;
import org.xml.sax.Attributes;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;

public class HamletHandler extends DefaultHandler implements DocumentHandler
{

//the directory that stores xml files
private final String dataDir = "c:\\dataD";
//the directory that is used to store lucene index
private final String indexDir = "c:\\indexD";

private StringBuffer elementBuffer=new StringBuffer();
private HashMap attributeMap;
private Document doc;
static IndexWriter indexWriter;


public Document getDocument(InputStream is) throws DocumentHandlerException
{
// TODO Auto-generated method stub
SAXParserFactory spf=SAXParserFactory.newInstance();

try{
SAXParser parser=spf.newSAXParser();
parser.parse(is, this);
}
catch(IOException e){
throw new DocumentHandlerException("Cannot parse XML document", e);
}

catch(ParserConfigurationException e){
throw new DocumentHandlerException("Cannot parse XML document", e);
}

catch(SAXException e){
throw new DocumentHandlerException("Cannot parse XML document", e);
}

return doc;
}

public void startDocument(){
//doc=new Document();
}

public void startElement(String uri, String localName, String qName,
Attributes atts) throws SAXException{

if(qName.equals("SPEECH")){
doc=new Document();
}
elementBuffer.setLength(0);
//attributeMap.clear();
if(atts.getLength()>0){
attributeMap=new HashMap();
for(int i=0; i<atts.getLength(); i++){
attributeMap.put(atts.getQName(i), atts.getValue(i));
}
}
}
public void characters(char[] text, int start, int length){
elementBuffer.append(text, start, length);

}

public void endElement(String uri, String localName, String qName) throws
SAXException{

try {

if(qName.equals("REFERENCE")){
Field reference = new Field(qName, elementBuffer.toString(),
Field.Store.YES, Field.Index.NO, Field.TermVector.NO);
doc.add(reference);
}

else if(qName.equals("SPEAKER")){
Field speaker = new Field(qName, elementBuffer.toString(),
Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
speaker.setBoost(2.0f);
doc.add(speaker);
}
else if(qName.equals("LINES")){
Field lines = new Field(qName, elementBuffer.toString(),
Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
lines.setBoost(1.0f);
doc.add(lines);
indexWriter.addDocument(doc);
}
else{
return;
}

} catch (CorruptIndexException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}



}
/**
* @param args
*/
public static void main(String[] args) throws Exception{
File index=new File("c:\\Documents and Settings\\Fayyazuddin A Syed\\My Documents\\indexD");
Directory fsDirectory = FSDirectory.getDirectory(index);
Analyzer analyzer = new StandardAnalyzer();
indexWriter = new IndexWriter(fsDirectory, analyzer, true);
HamletHandler handler=new HamletHandler();
Document doc=handler.getDocument(new FileInputStream(new File(args[0])));
int numIndexed=indexWriter.docCount();
System.out.println(numIndexed);
indexWriter.optimize();
indexWriter.close();

}

}


and here is my searcher code:

package search;

/*
* Searcher.java
*
* Created on August 6, 2007, 8:46 PM
*
* To change this template, choose Tools | Template Manager
* and open the template in the editor.
*/



import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer ;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.FuzzyLikeThisQuery;
import org.apache.lucene.search.Query ;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.NullFragmenter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.queryParser.QueryParser;
/**
*
*
*/
public class Searcher {

/** Creates a new instance of Searcher */

/**
* @param args the command line arguments
*/
public static void main(String[] args) throws Exception{

Searcher searchDoc=new Searcher();
File indexDir=new File("c:\\Documents and Settings\\Fayyazuddin A Syed\\My Documents\\indexD");
String q="SLINGS AND ARROWS";
String s="think~";
if (s.contains("?") || s.contains("*")){
System.out.println("this is a wildcard search");
}
else if (s.contains("~")){
System.out.println("this is a fuzzy search");
}
else {
System.out.println("this is a normal search");
}

if(!indexDir.exists() || !indexDir.isDirectory()){
throw new Exception(indexDir + " does not exist or is not a directory.");
}
//searchDoc.wildSearch(indexDir);
searchDoc.search(indexDir, q);
//searchDoc.fuzzySearch(indexDir);


}

public List search(File indexDir, String q) throws Exception {

List searchResult = new ArrayList();
Directory fsDir=FSDirectory.getDirectory(indexDir);
IndexSearcher is=new IndexSearcher(fsDir);

Analyzer analyser = new StandardAnalyzer();
Query parser=new QueryParser("LINES", analyser).parse(q);
long start=new Date().getTime();
Hits hits=is.search(parser);
long end=new Date().getTime();
QueryScorer scorer = new QueryScorer(parser);
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("", "");
Highlighter highlighter = new Highlighter(formatter, scorer);
Highlighter high = new Highlighter(formatter, scorer);
Fragmenter fragmenter = new NullFragmenter();
Fragmenter fragment = new SimpleFragmenter(250);
highlighter.setTextFragmenter(fragmenter);
high.setTextFragmenter(fragment);

for(int i=0; i<hits.length(); i++){
Document doc=hits.doc(i);
String lns = doc.get("LINES");
TokenStream lines = analyser.tokenStream("LINES", new
StringReader(lns));
CachingTokenFilter filter = new CachingTokenFilter(lines);
String highlightedLines = highlighter.getBestFragment(filter, lns);
filter.reset();
String highlight = high.getBestFragment(filter, lns);
SearchResult resultBean = new SearchResult();
resultBean.setReference(hits.doc(i).get("REFERENCE"));
resultBean.setNarrator(hits.doc(i).get("SPEAKER"));
resultBean.setHitResult(highlight);
resultBean.setQuote(highlightedLines);
searchResult.add(resultBean);
System.out.println(resultBean.getReference());
System.out.println(resultBean.getNarrator());
System.out.println(resultBean.getHitResult());
System.out.println("");
System.out.println(resultBean.getQuote());
System.out.println("");
}

System.err.println("Found " + hits.length() + " document(s)(in " +
(end-start) + " milliseconds) that matched query '" + q + "':");

return searchResult;
}

public List wildSearch(File indexDir) throws Exception {

List searchResult = new ArrayList();
Directory fsDir=FSDirectory.getDirectory(indexDir);
IndexSearcher is = new IndexSearcher(fsDir);
IndexReader ir = IndexReader.open(fsDir);
Analyzer analyser = new StandardAnalyzer();
Query parser=new WildcardQuery(new Term("LINES", "the*"));
parser=parser.rewrite(ir);
long start=new Date().getTime();
Hits hits=is.search(parser);
long end=new Date().getTime();
QueryScorer scorer = new QueryScorer(parser);
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("", "");
Highlighter highlighter = new Highlighter(formatter, scorer);
Highlighter high = new Highlighter(formatter, scorer);
Fragmenter fragmenter = new NullFragmenter();
Fragmenter fragment = new SimpleFragmenter(250);
highlighter.setTextFragmenter(fragmenter);
high.setTextFragmenter(fragment);

for(int i=0; i<hits.length(); i++){
Document doc=hits.doc(i);
String lns = doc.get("LINES");
TokenStream lines = analyser.tokenStream("LINES", new
StringReader(lns));
CachingTokenFilter filter = new CachingTokenFilter(lines);
String highlightedLines = highlighter.getBestFragment(filter,
lns);
filter.reset();
String highlight = high.getBestFragment(filter, lns);
SearchResult resultBean = new SearchResult();
resultBean.setNarrator(hits.doc(i).get("SPEAKER"));
resultBean.setHitResult(highlight);
resultBean.setQuote(highlightedLines);
searchResult.add(resultBean);
System.out.println(resultBean.getNarrator());
System.out.println(resultBean.getHitResult());
System.out.println("");
System.out.println(resultBean.getQuote());
System.out.println("");
}

System.err.println("Found " + hits.length() + " document(s)(in " +
(end-start) + " milliseconds) that matched query '" + "':");

return searchResult;
}

public List fuzzySearch(File indexDir) throws Exception {

List searchResult = new ArrayList();
Directory fsDir=FSDirectory.getDirectory(indexDir);
IndexSearcher is = new IndexSearcher(fsDir);
IndexReader ir = IndexReader.open(fsDir);
Analyzer analyser = new StandardAnalyzer();
Query parser=new FuzzyQuery(new Term("LINES", "the~"));
parser=parser.rewrite(ir);
long start=new Date().getTime();
Hits hits=is.search(parser);
long end=new Date().getTime();
QueryScorer scorer = new QueryScorer(parser);
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("", "");
Highlighter highlighter = new Highlighter(formatter, scorer);
Highlighter high = new Highlighter(formatter, scorer);
Fragmenter fragmenter = new NullFragmenter();
Fragmenter fragment = new SimpleFragmenter(250);
highlighter.setTextFragmenter(fragmenter);
high.setTextFragmenter(fragment);

for(int i=0; i<hits.length(); i++){
Document doc=hits.doc(i);
String lns = doc.get("LINES");
TokenStream lines = analyser.tokenStream("LINES", new
StringReader(lns));
CachingTokenFilter filter = new CachingTokenFilter(lines);
String highlightedLines = highlighter.getBestFragment(filter,
lns);
filter.reset();
String highlight = high.getBestFragment(filter, lns);
SearchResult resultBean = new SearchResult();
resultBean.setNarrator(hits.doc(i).get("SPEAKER"));
resultBean.setHitResult(highlight);
resultBean.setQuote(highlightedLines);
searchResult.add(resultBean);
System.out.println(resultBean.getNarrator());
System.out.println(resultBean.getHitResult());
System.out.println("");
System.out.println(resultBean.getQuote());
System.out.println("");
}

System.err.println("Found " + hits.length() + " document(s)(in " +
(end-start) + " milliseconds) that matched query '" + "':");

return searchResult;
}
}



--
View this message in context: http://www.nabble.com/Creating-an-index-from-an-XML-file-using-Lucene-in-Java-tp18678779p18678779.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


karsten-lucene at fiz-technik

Jul 27, 2008, 11:25 AM

Post #2 of 7 (264 views)
Re: Creating an index from an XML file using Lucene in Java [In reply to]

Hi Fayyaz,

From my point of view, this is not a Lucene question.

If I understand your SAX handler correctly, you start a new document at each
<SPEECH> start tag and finish it at each </LINES> end tag.
So if you know that the SCENE-COMMENTARY elements and the SPEECH elements
never overlap, you can treat both the same way: create the document at the
element's start tag and add it to the index at the element's end tag.
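
To see where a document would begin and end, it can help to list the order in which the SAX parser reports tags. The following is only a minimal sketch, assuming the JDK's built-in SAX parser; the names SaxEventTrace and trace are hypothetical and not part of the original code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records the order of SAX start/end callbacks.
class SaxEventTrace {

    /** Returns "start:NAME" / "end:NAME" entries in the order the parser reports them. */
    static List<String> trace(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalStateException("Cannot parse XML", e);
        }
        return events;
    }

    public static void main(String[] args) {
        String xml = "<SCENE>"
                + "<SCENE-COMMENTARY>notes</SCENE-COMMENTARY>"
                + "<SPEECH><SPEAKER>HAMLET</SPEAKER><LINES>To be</LINES></SPEECH>"
                + "</SCENE>";
        for (String event : trace(xml)) {
            System.out.println(event);
        }
    }
}
```

Because the commentary element closes before the first speech opens, each of the two can own its document from its start tag to its end tag without interfering with the other.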

Best regards

Karsten

P.S. Is this homework?


fayyazuddin at gmail

Jul 27, 2008, 5:01 PM

Post #3 of 7 (258 views)
Re: Creating an index from an XML file using Lucene in Java [In reply to]

I think I understand what you are saying, but I was hoping you could clarify
a little further. In the startElement method, I have the following:

if(qName.equals("SPEECH")){
doc=new Document();
}

Are you saying that I should add an identical block of code for
<SCENE-COMMENTARY> as well, and include a similar clause in the endElement
method? For example:

else if(qName.equals("SCENE-COMMENTARY")){
Field lines = new Field(qName,
elementBuffer.toString(), Field.Store.YES, Field.Index.TOKENIZED,
Field.TermVector.YES);
lines.setBoost(1.0f);
doc.add(lines);
indexWriter.addDocument(doc);
}

Does it also matter where in the if/else-if chain I test for the
"SCENE-COMMENTARY" tag? I.e., should it come first or last, or does the
order not matter?
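
On the ordering question, a small sketch may help. Each closing tag carries exactly one qName, so the equals tests are mutually exclusive and at most one branch can match, whichever order they are written in. The class and method names below are hypothetical, and the branches only return labels instead of building Lucene fields.

```java
// Hypothetical illustration: the same mutually exclusive tests in two different orders.
class BranchOrderSketch {

    /** SCENE-COMMENTARY is tested last. */
    static String commentaryLast(String qName) {
        if (qName.equals("REFERENCE")) {
            return "reference";
        } else if (qName.equals("SPEAKER")) {
            return "speaker";
        } else if (qName.equals("LINES")) {
            return "lines";
        } else if (qName.equals("SCENE-COMMENTARY")) {
            return "commentary";
        } else {
            return "ignored";
        }
    }

    /** SCENE-COMMENTARY is tested first. */
    static String commentaryFirst(String qName) {
        if (qName.equals("SCENE-COMMENTARY")) {
            return "commentary";
        } else if (qName.equals("LINES")) {
            return "lines";
        } else if (qName.equals("SPEAKER")) {
            return "speaker";
        } else if (qName.equals("REFERENCE")) {
            return "reference";
        } else {
            return "ignored";
        }
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"REFERENCE", "SPEAKER", "LINES", "SCENE-COMMENTARY", "TITLE"}) {
            System.out.println(tag + " -> " + commentaryFirst(tag) + " / " + commentaryLast(tag));
        }
    }
}
```

What does matter is that doc already exists when the commentary branch runs: in the sample XML the <SCENE-COMMENTARY> element comes before the first <SPEECH>, so a document has to be created for it in startElement as well.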

Just wondering.
Thanks again for your prompt reply.
Sincerely;
Fayyaz

P.S. This is actually a personal project, as I have developed an interest
in Information Retrieval and simply wanted to work on a creative project to
help me develop my skills. :-)




karsten-lucene at fiz-technik

Jul 28, 2008, 1:54 AM

Post #4 of 7 (251 views)
Re: Creating an index from an XML file using Lucene in Java [In reply to]

Hi Fayyaz,

Again, this is about the SAX handler, not about Lucene.

My understanding of what you want:
1. one Lucene document for each SPEECH element (already implemented)
2. one Lucene document for each SCENE-COMMENTARY element (not implemented
yet).

Correct?

If yes, you can write the following in startElement:
if(qName.equals("SPEECH") ||
qName.equals("SCENE-COMMENTARY")){
doc=new Document();
}
and

public void endElement(String uri, String localName, String qName) throws
SAXException{
...
else if(qName.equals("SCENE-COMMENTARY")){
Field lines = new Field(qName, elementBuffer.toString(), Field.Store.YES,
Field.Index.TOKENIZED, Field.TermVector.YES);
doc.add(lines);
}
...
if(qName.equals("SPEECH") || qName.equals("SCENE-COMMENTARY")){
indexWriter.addDocument(doc);
}

(This replaces the "indexWriter.addDocument(doc);" call in the
if(qName.equals("LINES")){ block.)
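
The approach above can be sketched end to end as follows. This is only a sketch under stated assumptions: it uses the JDK's SAX parser, a plain list of maps (added) stands in for indexWriter.addDocument(doc), and the names SceneIndexSketch and parse are hypothetical. One detail to watch when merging the snippet into the original handler: the original if/else chain ends with "else { return; }", which would be taken for </SPEECH> and skip the final check, so the sketch has no early return.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: one record per SPEECH and one per SCENE-COMMENTARY.
class SceneIndexSketch extends DefaultHandler {

    // Stand-in for the Lucene index: each map plays the role of one Document.
    final List<Map<String, String>> added = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private Map<String, String> doc;

    private static boolean startsDocument(String qName) {
        return qName.equals("SPEECH") || qName.equals("SCENE-COMMENTARY");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (startsDocument(qName)) {
            doc = new LinkedHashMap<>();
        }
        text.setLength(0); // start collecting this element's text
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        boolean isField = qName.equals("REFERENCE") || qName.equals("SPEAKER")
                || qName.equals("LINES") || qName.equals("SCENE-COMMENTARY");
        if (doc != null && isField) {
            doc.put(qName, text.toString().trim());
        }
        // No early return above, so this check is reached for every closing tag.
        if (doc != null && startsDocument(qName)) {
            added.add(doc); // the real handler would call indexWriter.addDocument(doc) here
            doc = null;
        }
    }

    /** Parses the XML and returns the records that would have been indexed. */
    static List<Map<String, String>> parse(String xml) {
        SceneIndexSketch handler = new SceneIndexSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse XML", e);
        }
        return handler.added;
    }

    public static void main(String[] args) {
        String xml = "<SCENE><SCENE-COMMENTARY>Notes on the scene</SCENE-COMMENTARY>"
                + "<SPEECH><REFERENCE>ACT 1, SCENE 2</REFERENCE><SPEAKER>HAMLET</SPEAKER>"
                + "<LINES>To be, or not to be</LINES></SPEECH></SCENE>";
        for (Map<String, String> record : parse(xml)) {
            System.out.println(record);
        }
    }
}
```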



Best regards
Karsten

P.S.:
If you want to learn Java, I really like
http://www.java-hamster-modell.de/
Possibly there is an English version somewhere?




fayyazuddin at gmail

Jul 28, 2008, 7:08 AM

Post #5 of 7 (250 views)
Re: Creating an index from an XML file using Lucene in Java [In reply to]

Thanks, Karsten, for your reply. I will implement your solution tonight, but
I do have a quick follow-up question. I understand how you are handling the
"SCENE-COMMENTARY" tag, but since at present I am working with the "LINES"
tag, shouldn't I continue using that instead of the "SPEECH" element? Is
there any reason why I should switch? I will try both approaches tonight and
let you know how it goes.
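
One way to compare the two choices is to look at what has been collected at the moment the document is added. The sketch below is hypothetical (FlushPointSketch and build are invented names, and plain maps stand in for Lucene documents); it assumes that a document is captured as it stands when it is added, so fields attached afterwards do not reach the index. It replays the closing-tag events of one speech in which <LINES> is not the last child.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: compares adding the record at </LINES> with adding it at </SPEECH>.
class FlushPointSketch {

    /**
     * Replays closing-tag events given as {qName, text} pairs and snapshots the
     * record when the closing tag named by flushTag is seen.
     */
    static List<Map<String, String>> build(String[][] endEvents, String flushTag) {
        List<Map<String, String>> index = new ArrayList<>();
        Map<String, String> doc = new LinkedHashMap<>();
        for (String[] event : endEvents) {
            String qName = event[0];
            if (!qName.equals("SPEECH")) {
                doc.put(qName, event[1]);
            }
            if (qName.equals(flushTag)) {
                // Copy the record: later changes to doc must not alter what was added.
                index.add(new LinkedHashMap<>(doc));
            }
            if (qName.equals("SPEECH")) {
                doc = new LinkedHashMap<>(); // the next speech starts with an empty record
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[][] events = {
            {"REFERENCE", "ACT 1, SCENE 2"},
            {"LINES", "To be, or not to be"},
            {"SPEAKER", "HAMLET"},
            {"SPEECH", ""}
        };
        System.out.println("added at </LINES>:  " + build(events, "LINES"));
        System.out.println("added at </SPEECH>: " + build(events, "SPEECH"));
    }
}
```

In the sample XML above, <LINES> is always the last child of <SPEECH>, so both choices give the same result there; adding at </SPEECH> simply stops depending on that child order.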

Thanks again for your help, I really appreciate it.

Take care.
Sincerely;
Fayyaz





fayyazuddin at gmail

Jul 28, 2008, 10:31 AM

Post #6 of 7 (245 views)
Re: Creating an index from an XML file using Lucene in Java [In reply to]

Hi Karsten:

I have another follow-up question for you. Once I create the index the way
you suggested, how would I modify my code to search it?


At present, I have the following code:

Analyzer analyser = new StandardAnalyzer();
Query parser=new QueryParser("LINES", analyser).parse(q);
long start=new Date().getTime();
Hits hits=is.search(parser);
long end=new Date().getTime();
QueryScorer scorer = new QueryScorer(parser);

I use this in the search() method of my Searcher class. The method returns a
List of JavaBean objects, each holding the result of one hit. The difference
now is that I will also be searching another element whose fields are
slightly different: if a hit is found in the <SCENE-COMMENTARY> element, I
won't be providing a <SPEAKER>, just the act and the scene. Based on the
above code, I get the feeling that I would need to create another Query
object for the hits found in the <SCENE-COMMENTARY> element. Am I right, or
is there a way to use the same Query object for both? Would I need to create
a new JavaBean class to store these results? Finally, would I be able to
display the results to the user in order from strongest match to weakest? At
the moment my results are created and presented to the user as follows:

for(int i=0; i<hits.length(); i++){
Document doc=hits.doc(i);
String lns = doc.get("LINES");
TokenStream lines = analyser.tokenStream("LINES", new
StringReader(lns));
CachingTokenFilter filter = new CachingTokenFilter(lines);
String highlightedLines = highlighter.getBestFragment(filter, lns);
filter.reset();
String highlight = high.getBestFragment(filter, lns);
SearchResult resultBean = new SearchResult();
resultBean.setReference(hits.doc(i).get("REFERENCE"));
resultBean.setNarrator(hits.doc(i).get("SPEAKER"));
resultBean.setHitResult(highlight);
resultBean.setQuote(highlightedLines);
searchResult.add(resultBean);
System.out.println(resultBean.getReference());
System.out.println(resultBean.getNarrator());
System.out.println(resultBean.getHitResult());
System.out.println("");
System.out.println(resultBean.getQuote());
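
One possible way to handle both kinds of hit with a single result type is sketched below. It is a sketch only, using plain Java rather than the Lucene API: HitSketch, ResultSketch, describe and render are hypothetical names, and a map of stored field values stands in for a retrieved Document (whose get() returns null for a field the document does not have). The explicit sort is only there to make the ordering visible; as far as I know, Hits already come back ordered from the highest score to the lowest. A single Query can cover both fields if it is built over both field names, for which Lucene's MultiFieldQueryParser is one candidate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one search hit: stored field values plus a relevance score.
class HitSketch {
    final Map<String, String> stored;
    final float score;

    HitSketch(Map<String, String> stored, float score) {
        this.stored = stored;
        this.score = score;
    }
}

// Hypothetical sketch: formats speech hits and commentary hits with the same code path.
class ResultSketch {

    /** Formats one hit, whichever kind of document it came from. */
    static String describe(Map<String, String> stored) {
        String commentary = stored.get("SCENE-COMMENTARY");
        if (commentary != null) {
            // Commentary records carry no SPEAKER field in this sketch.
            return "Commentary: " + commentary;
        }
        return stored.get("SPEAKER") + " (" + stored.get("REFERENCE") + "): "
                + stored.get("LINES");
    }

    /** Returns the formatted hits, strongest match first. */
    static List<String> render(List<HitSketch> hits) {
        List<HitSketch> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Float.compare(b.score, a.score));
        List<String> lines = new ArrayList<>();
        for (HitSketch hit : sorted) {
            lines.add(describe(hit.stored));
        }
        return lines;
    }
}
```

With this shape, the existing SearchResult bean could probably be reused by leaving the narrator empty for commentary hits, rather than introducing a second bean class.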

Thanks so much once again for your time and patience Karsten, I really do
appreciate it.

Take care.
Sincerely;
Fayyaz


Karsten F. wrote:
>
> Hi Fayyaz,
>
> again, this is about SAX-Handler not about lucene.
>
> My understanding of what you want:
> 1. one lucene document for each SPEECH-Element (already implemented)
> 2. one lucene document for each SCENE-COMMENTARY-Element (not implemented
> yet).
>
> correct?
>
> If yes, you can write
> if(qName.equals("SPEECH") ||
> qName.equals("SCENE-COMMENTARY")){
> doc=new Document();
> }
> and
>
> public void endElement(String uri, String localName, String qName) throws
> SAXException{
> ...
> else if(qName.equals("SCENE-COMMENTARY")){
> Field lines = new Field(qName, elementBuffer.toString(), Field.Store.YES,
> Field.Index.TOKENIZED, Field.TermVector.YES);
> doc.add(lines);
> }
> ...
> if(qName.equals("SPEECH") || qName.equals("SCENE-COMMENTARY")){
> indexWriter.addDocument(doc);
> }
>
> (instead of "indexWriter.addDocument(doc);" in block of
> if(qName.equals("LINES")){ )
>
>
>
> Best regards
> Karsten
>
> P.S.:
> If you want to learn Java:
> I really like
> http://www.java-hamster-modell.de/
> Possibly there is an English version somewhere?
>
>
> syedfa wrote:
>>
>> I think I understand what you are saying, but I was hoping you could
>> clarify a little further. in the start-element method, I have the
>> following:
>>
>> if(qName.equals("SPEECH")){
>> doc=new Document();
>> }
>>
>> are you saying that I should add an identical block of code for
>> <SCENE-COMMENTARY> as well, and include a similar clause in the
>> endElement method as well? i.e.
>>
>> else if(qName.equals("SCENE-COMMENTARY")){
>> Field lines = new Field(qName,
>> elementBuffer.toString(), Field.Store.YES, Field.Index.TOKENIZED,
>> Field.TermVector.YES);
>> lines.setBoost(1.0f);
>> doc.add(lines);
>> indexWriter.addDocument(doc);
>> }
>>
>> Does it also matter where in the if/else if clauses I mention the
>> "SCENE-COMMENTARY" tag? ie. should I mention it first? last? or does
>> the order matter?
>>
>> Just wondering.
>> Thanks again for your prompt reply.
>> Sincerely;
>> Fayyaz
>>
>> P.S. This is actually a personal project, as I have developed an
>> interest in Information Retrieval and simply wanted to work on a creative
>> project to help me develop my skills. :-)
>>
>
>
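
The handler layout Karsten describes in the quote above (one index document per SPEECH element and one per SCENE-COMMENTARY element, added when the element closes) can be sketched roughly as follows. This is only an illustration under assumptions: a plain Map stands in for a Lucene Document so the example runs without Lucene, and the names PlayHandlerSketch, docs and parse are made up for the sketch. In the real handler, the docs.add(doc) line would be the indexWriter.addDocument(doc) call.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: a Map stands in for a Lucene Document.
class PlayHandlerSketch extends DefaultHandler {
    final List<Map<String, String>> docs = new ArrayList<>();
    private final StringBuilder elementBuffer = new StringBuilder();
    private Map<String, String> doc;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elementBuffer.setLength(0);
        // Both element types open a fresh document.
        if (qName.equals("SPEECH") || qName.equals("SCENE-COMMENTARY")) {
            doc = new LinkedHashMap<>();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        elementBuffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String text = elementBuffer.toString().trim();
        boolean isField = qName.equals("REFERENCE") || qName.equals("SPEAKER")
                || qName.equals("LINES") || qName.equals("SCENE-COMMENTARY");
        if (doc != null && isField) {
            doc.put(qName, text); // field name is the element name
        }
        // The document is stored once, when its enclosing element closes.
        if (qName.equals("SPEECH") || qName.equals("SCENE-COMMENTARY")) {
            docs.add(doc); // with Lucene: indexWriter.addDocument(doc)
            doc = null;
        }
        elementBuffer.setLength(0);
    }

    // Parses the XML text and returns the collected stand-in documents.
    static List<Map<String, String>> parse(String xml) {
        PlayHandlerSketch handler = new PlayHandlerSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.docs;
    }

    public static void main(String[] args) {
        String xml = "<PLAY><SCENE>"
                + "<SCENE-COMMENTARY>Notes on the scene.</SCENE-COMMENTARY>"
                + "<SPEECH><REFERENCE>ACT 1, SCENE 1</REFERENCE>"
                + "<SPEAKER>LORD POLONIUS</SPEAKER>"
                + "<LINES>Yet here, Laertes!</LINES></SPEECH>"
                + "</SCENE></PLAY>";
        for (Map<String, String> d : parse(xml)) {
            System.out.println(d);
        }
    }
}
```

With this layout the order of the if/else-if branches does not matter, because each branch tests a different element name. A commentary document carries only the SCENE-COMMENTARY field, so any code that reads LINES, SPEAKER or REFERENCE from a hit has to expect null for such documents.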

--
View this message in context: http://www.nabble.com/Creating-an-index-from-an-XML-file-using-Lucene-in-Java-tp18678779p18695513.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


fayyazuddin at gmail

Jul 28, 2008, 9:40 PM

Post #7 of 7 (243 views)
Permalink
Re: Creating an index from an XML file using Lucene in Java [In reply to]

Dear Karsten:

Sorry for the multiple posts, but I have made some progress. I think that in
order to search multiple fields, I should be using the MultiFieldQueryParser
class and simply pass it a String array containing the fields I wish to
search over. My follow-up question to you is this: how do I highlight the
results returned from the MultiFieldQueryParser? As of this moment, my
Searcher code looks like this:

List searchResult = new ArrayList();
Directory fsDir = FSDirectory.getDirectory(indexDir);
IndexSearcher is = new IndexSearcher(fsDir);

// Search both the commentary and the speech text.
String[] fields = {"SCENE-COMMENTARY", "LINES"};
Analyzer analyser = new StandardAnalyzer();
// Despite its name, "parser" holds the parsed Query.
Query parser = new MultiFieldQueryParser(fields, analyser).parse(q);
//parser.setAllowLeadingWildcard(true);
long start = new Date().getTime();
Hits hits = is.search(parser);
long end = new Date().getTime();
QueryScorer scorer = new QueryScorer(parser);
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("", "");
Highlighter highlighter = new Highlighter(formatter, scorer);
//Highlighter highlighter = new Highlighter(scorer);
Highlighter high = new Highlighter(formatter, scorer);
//Highlighter high = new Highlighter(scorer);
Fragmenter fragmenter = new NullFragmenter();
Fragmenter fragment = new SimpleFragmenter(250);
highlighter.setTextFragmenter(fragmenter);
high.setTextFragmenter(fragment);

for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    String com = doc.get("SCENE-COMMENTARY");
    String lns = doc.get("LINES");
    //String spkr = doc.get("SPEAKER");
    TokenStream lines = analyser.tokenStream("LINES", new StringReader(lns));
    CachingTokenFilter filter = new CachingTokenFilter(lines);
    //TokenStream speaker = analyser.tokenStream("SPEAKER", new StringReader(spkr));
    String highlightedLines = highlighter.getBestFragment(filter, lns);
    filter.reset();
    String highlight = high.getBestFragment(filter, lns);
    SearchResult resultBean = new SearchResult();
    resultBean.setReference(hits.doc(i).get("REFERENCE"));
    resultBean.setNarrator(hits.doc(i).get("SPEAKER"));
    resultBean.setHitResult(highlight);
    resultBean.setQuote(highlightedLines);
    searchResult.add(resultBean);
    System.out.println(resultBean.getReference());
    System.out.println(resultBean.getNarrator());
    System.out.println(resultBean.getHitResult());
    System.out.println("");
    System.out.println(resultBean.getQuote());
    System.out.println("");
}

System.err.println("Found " + hits.length() + " document(s)(in "
    + (end - start) + " milliseconds) that matched query '" + q + "':");

return searchResult;
}
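
One thing that stands out in the loop above is that it always reads and highlights the LINES field. If the index holds one document per SPEECH and one per SCENE-COMMENTARY, as Karsten suggested, a commentary hit has no LINES value, so doc.get("LINES") would be null for it. A possible approach, sketched here in plain Java with a Map standing in for the stored fields of a hit, is to pick whichever searched field the hit actually contains and highlight that one. The names HighlightFieldSketch and pickField are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: the Map stands in for the stored fields of one hit,
// i.e. storedFields.get(name) plays the role of doc.get(name).
class HighlightFieldSketch {
    static final List<String> SEARCHED_FIELDS = Arrays.asList("LINES", "SCENE-COMMENTARY");

    // Returns the name of the first searched field that is stored
    // in this hit, or null if the hit has none of them.
    static String pickField(Map<String, String> storedFields) {
        for (String field : SEARCHED_FIELDS) {
            if (storedFields.get(field) != null) {
                return field;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> commentaryHit = new HashMap<>();
        commentaryHit.put("SCENE-COMMENTARY", "Notes on the scene.");

        String field = pickField(commentaryHit);
        String text = commentaryHit.get(field);
        // In the search loop, field and text would replace the
        // hard-coded "LINES" and lns in the tokenStream and
        // getBestFragment calls.
        System.out.println(field + ": " + text);
    }
}
```

The bean setters for REFERENCE and SPEAKER would need a similar null check, since a commentary document does not carry those fields either.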

Thanks again for all of your help, I do sincerely appreciate it.

Take care.
Fayyaz



