
Mailing List Archive: Lucene: Java-User

Getting the frequencies by corresponding order of documents were indexed

 

 



kasunp at opensource

May 11, 2012, 12:58 AM

Post #1 of 5 (169 views)

I have a collection of documents (say 10 documents) and I'm indexing them
this way, storing the term vector:

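// Lucene 3.5. content, docNames, docNo and pathToIndex are defined elsewhere.
StringReader strRdElt = new StringReader(content);
Document doc = new Document();
String docname = docNames[docNo]; // not used in this snippet

// Tokenized, unstored field whose term vector is stored.
doc.add(new Field("doccontent", strRdElt, Field.TermVector.YES));

IndexWriter iW;
try {
    NIOFSDirectory dir = new NIOFSDirectory(new File(pathToIndex));
    iW = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_35,
            new StandardAnalyzer(Version.LUCENE_35)));
    iW.addDocument(doc);
    iW.close();
} catch (IOException e) {
    // Error handling was not shown in the original snippet.
    throw new RuntimeException(e);
}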

After indexing all the documents, I get the term frequencies of each
document this way:


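// Open the index read-only.
IndexReader re = IndexReader.open(FSDirectory.open(new File(pathToIndex)), true);
TermFreqVector[] termsFreq = new TermFreqVector[noOfDocs];
for (int i = 0; i < noOfDocs; i++) {
    // i is Lucene's internal document number.
    termsFreq[i] = re.getTermFreqVector(i, "doccontent");
}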

My problem is that the term frequency vectors do not come back in the
corresponding order. For example, for the 2nd document that I indexed, I get
its terms and term frequencies at "termsFreq[9]".

What is the reason for that? How can I get the term frequencies in the
order in which I indexed the documents?


--
Regards

Kasun Perera


ian.lea at gmail

May 11, 2012, 4:22 AM

Post #2 of 5 (156 views)

Can't spot anything obviously wrong in your code and what you are
trying to do should work. Are you positive that what you think is the
second doc is really being added second? You only show one doc being
added. Are there already 7 docs in the index before you start?


--
Ian.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


kasunp at opensource

May 11, 2012, 4:35 AM

Post #3 of 5 (159 views)

On Fri, May 11, 2012 at 4:52 PM, Ian Lea <ian.lea [at] gmail> wrote:

> Can't spot anything obviously wrong in your code and what you are
> trying to do should work. Are you positive that what you think is the
> second doc is really being added second? You only show one doc being
> added. Are there already 7 docs in the index before you start?
>

Hi Ian,

Yes, I'm sure the 2nd doc is added second; I stepped through with the
debugger several times to confirm it. If I index 10 documents, I get 10
term frequency vectors, but their positions are changed. I gave doc #2 as an
example; the 5th term frequency vector also corresponds to some other doc,
and so on.

I found a way to work around this, but it may not be efficient. I store
another field at indexing time, and based on the content of that field I'm
able to map each doc to its term frequency vector. Is there a more
efficient way? Could this be a bug in Lucene?

Thanks


--
Regards

Kasun Perera
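
[Editor's sketch] The workaround described in this post, keying each term
frequency vector by an extra ID field rather than by its position, can be
illustrated without Lucene. The class and method names below are
hypothetical, and IndexedDoc is only a stand-in for what would be read back
from the index (for example, a stored name field obtained through
IndexReader.document(i) together with the vector for the same i). It is a
minimal sketch under those assumptions, not the poster's actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VectorsByDocName {

    /** Hypothetical stand-in for one document read back from the index. */
    static final class IndexedDoc {
        final String docName;                 // value of the extra ID field
        final Map<String, Integer> termFreqs; // term -> frequency in "doccontent"

        IndexedDoc(String docName, Map<String, Integer> termFreqs) {
            this.docName = docName;
            this.termFreqs = termFreqs;
        }
    }

    /** Keys every term-frequency map by document name, ignoring read order. */
    static Map<String, Map<String, Integer>> byName(List<IndexedDoc> docsInIndexOrder) {
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        for (IndexedDoc d : docsInIndexOrder) {
            result.put(d.docName, d.termFreqs);
        }
        return result;
    }

    /** Returns the term-frequency maps in the caller's own order (docNames). */
    static List<Map<String, Integer>> inOriginalOrder(String[] docNames,
                                                      List<IndexedDoc> docsInIndexOrder) {
        Map<String, Map<String, Integer>> lookup = byName(docsInIndexOrder);
        List<Map<String, Integer>> ordered = new ArrayList<>();
        for (String name : docNames) {
            ordered.add(lookup.get(name)); // null if that name was never indexed
        }
        return ordered;
    }

    public static void main(String[] args) {
        String[] docNames = {"a.txt", "b.txt", "c.txt"};
        // Simulated read order that differs from the order in docNames.
        List<IndexedDoc> fromIndex = new ArrayList<>();
        fromIndex.add(new IndexedDoc("c.txt", Collections.singletonMap("lucene", 2)));
        fromIndex.add(new IndexedDoc("a.txt", Collections.singletonMap("index", 1)));
        fromIndex.add(new IndexedDoc("b.txt", Collections.singletonMap("merge", 3)));
        System.out.println(inOriginalOrder(docNames, fromIndex));
    }
}
```

Building the lookup is one pass over the documents, so the extra field adds
little cost compared with reading the vectors themselves.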


ian.lea at gmail

May 11, 2012, 5:50 AM

Post #4 of 5 (159 views)

What version of Lucene are you using? If not the latest, try that.
If you really think there is a Lucene bug, post a small self-contained
test case that demonstrates the problem.


--
Ian.



erickerickson at gmail

May 14, 2012, 4:30 AM

Post #5 of 5 (142 views)

In general you can't rely on anything like this. I admit the merge
stuff isn't my area of expertise, but when segments are merged,
there's no guarantee that they're merged in order. In general
the internal Lucene doc ID should be treated as predictable only
for closed segments.

Your solution of using your own unique ID is much better.

Best
Erick
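
[Editor's sketch] A toy model of the point made above: if a document's
number is its position across all segments, merging segments out of order
changes the numbers. This is not Lucene's merge code. The class, the method
names, and the choice to place the merged segment at the end are
assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocNumberShift {

    /** Doc number = position of the document when segments are read in sequence. */
    static int docNumber(List<List<String>> segments, String docName) {
        int n = 0;
        for (List<String> segment : segments) {
            for (String name : segment) {
                if (name.equals(docName)) {
                    return n;
                }
                n++;
            }
        }
        return -1; // not found
    }

    /** Toy merge: combines segments i and j into one segment placed at the end. */
    static List<List<String>> mergeToEnd(List<List<String>> segments, int i, int j) {
        List<String> merged = new ArrayList<>(segments.get(i));
        merged.addAll(segments.get(j));
        List<List<String>> result = new ArrayList<>();
        for (int k = 0; k < segments.size(); k++) {
            if (k != i && k != j) {
                result.add(segments.get(k)); // untouched segments keep their relative order
            }
        }
        result.add(merged);
        return result;
    }

    public static void main(String[] args) {
        // Three one-document segments, in the order the documents were added.
        List<List<String>> segments = new ArrayList<>();
        segments.add(Arrays.asList("doc1"));
        segments.add(Arrays.asList("doc2"));
        segments.add(Arrays.asList("doc3"));
        System.out.println(docNumber(segments, "doc2")); // prints 1

        // Merge the first and third segments; doc2's segment is now first.
        List<List<String>> after = mergeToEnd(segments, 0, 2);
        System.out.println(docNumber(after, "doc2"));    // prints 0
    }
}
```

In this model doc2 was added second but ends up with number 0, which is why
a separate ID of your own is the safer key.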
