Mailing List Archive: Lucene: Java-User

Getting terms from unstored fields, doc-wise

 

 



phaninra at gmail

Jul 26, 2012, 9:56 AM

Post #1 of 6 (289 views)

Hi,
I've an index to analyze (manually). Unfortunately, I cannot rebuild
the index. Some of the fields are 'unstored'. I was wondering whether
there's any way to get the terms from an unstored field for each doc.
Positional information is not necessary. Lucene version is 3.5.

The reason I am trying to get those terms is so that I can add that field to my
own index for every doc. And, yes, there's another id-type field which
allows me to identify the document in both indices.

Any guidance is highly appreciated.

Thanks,
Phani


in.abdul at gmail

Jul 26, 2012, 11:46 AM

Post #2 of 6 (280 views)

No, it's not possible to get data that was not stored.
THANKS AND REGARDS,
SYED ABDUL KATHER


phaninra at gmail

Jul 26, 2012, 1:04 PM

Post #3 of 6 (278 views)

Thanks for the reply Abdul.

I was exploring the API and I think we can retrieve all those words by
using a brute-force approach.

1) Get all the terms using indexReader.terms()

2) Process the term only if it belongs to the target field.

3) Get all the docs using indexReader.termDocs(term);

4) So, we have the term-doc pairs at this point.

Is there any better approach other than the above forever-taking procedure?

Thanks,
Phanindra





findbestopensource at gmail

Jul 26, 2012, 11:11 PM

Post #4 of 6 (279 views)

Hi

If the data is not stored, it cannot be retrieved in its original form.
Using IndexReader as you described, you can retrieve the list of terms
indexed for each doc. If the field was analyzed, those terms are the
analyzer's output, so you may not get the exact original data.
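
To make the point about analysis concrete, here is a small Java sketch. The analyzer is a toy written for this example (lowercasing, splitting on non-letters, and a made-up stop list), not one of Lucene's analyzers, and `LossyIndexingSketch` and `analyzedTerms` are hypothetical names. It shows two different texts ending up with the same set of terms once positions are ignored.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class LossyIndexingSketch {

    // Hypothetical stop list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "is", "of"));

    // Toy analyzer (not a Lucene class): lowercase, split on non-letters,
    // drop stop words. Returning a sorted set mimics what is left when
    // positional information is ignored.
    static Set<String> analyzedTerms(String text) {
        Set<String> terms = new TreeSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> a = analyzedTerms("The Index is a list of Terms");
        Set<String> b = analyzedTerms("terms, list, index!");
        System.out.println(a);           // the terms that survive analysis
        System.out.println(a.equals(b)); // both texts yield the same terms
    }
}
```

Case, punctuation, stop words and word order are all gone from the term set, so neither original text can be rebuilt from it.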

Regards
Aditya
www.findbestopensource.com


ab at getopt

Jul 27, 2012, 6:15 AM

Post #5 of 6 (278 views)

On 26/07/2012 22:04, Phanindra R wrote:
> Thanks for the reply Abdul.
>
> I was exploring the API and I think we can retrieve all those words by
> using a brute-force approach.
>
> 1) Get all the terms using indexReader.terms()
>
> 2) Process the term only if it belongs to the target field.
>
> 3) Get all the docs using indexReader.termDocs(term);
>
> 4) So, we have the term-doc pairs at this point.

This procedure is implemented in Luke (http://code.google.com/p/luke) in
the "Reconstruct & Edit" function. In case of larger indexes it's indeed
a time-consuming procedure.
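
As a rough illustration of the four steps quoted above, here is a minimal Java sketch. It does not call Lucene: the postings of the target field are modelled as a plain sorted map (a hypothetical stand-in), and `UninvertSketch` and `uninvert` are names made up for this example. With Lucene 3.x, the outer loop would presumably walk the `TermEnum` returned by `indexReader.terms(...)` and the inner loop the `TermDocs` returned by `indexReader.termDocs(term)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class UninvertSketch {

    // postings: term text -> numbers of the docs containing that term,
    // for ONE field (assumed to be filtered to the target field already).
    // Returns the "un-inverted" view: doc number -> terms of that doc.
    static Map<Integer, List<String>> uninvert(SortedMap<String, int[]> postings) {
        Map<Integer, List<String>> termsByDoc = new TreeMap<>();
        // Steps 1 and 2: enumerate the terms of the target field.
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            // Step 3: enumerate the docs that contain this term.
            for (int docId : entry.getValue()) {
                // Step 4: record the (doc, term) pair, grouped by doc.
                termsByDoc.computeIfAbsent(docId, k -> new ArrayList<>())
                          .add(entry.getKey());
            }
        }
        return termsByDoc;
    }

    public static void main(String[] args) {
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("index", new int[] {0, 2});
        postings.put("lucene", new int[] {0, 1});
        postings.put("term", new int[] {1, 2});
        System.out.println(uninvert(postings));
    }
}
```

Running `main` prints `{0=[index, lucene], 1=[lucene, term], 2=[index, term]}`. The work is one pass over every posting of the field, which is why it gets slow on large indexes. The doc numbers are internal to the index, so they would still have to be matched against the id-type field mentioned earlier in the thread.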

>
> Is there any better approach other than the above forever-taking procedure?

No. Indexing is usually a lossy process - some data is irretrievably
lost - and the resulting data structure is not optimized for
re-assembling the original content. If you need to retrieve the original
content you have to store it, either using stored fields or in an
external system.


--
Best regards,
Andrzej Bialecki
http://www.sigram.com, blog http://www.sigram.com/blog
___.,___,___,___,_._. __________________<><____________________
[___||.__|__/|__||\/|: Information Retrieval, System Integration
___|||__||..\|..||..|: Contact: info at sigram dot com




phaninra at gmail

Jul 27, 2012, 3:15 PM

Post #6 of 6 (279 views)

Thanks a lot, Aditya and Andrzej. Your responses were really helpful.

