
Mailing List Archive: Lucene: Java-User

Field value vs TokenStream

 

 



schnober at ids-mannheim

Apr 18, 2012, 8:00 AM

Post #1 of 3
Field value vs TokenStream

Dear list,
I'm studying the Lucene index file formats and I wonder: after having
initialized a field with Field(String name, String value, Field.Store
store, Field.Index index), where is the value String stored?

I understand that the chosen analyzer does its processing on that value,
including tokenization, and returns a TokenStream from which the Indexer
retrieves the attributes that it stores in the index.
When I use a binary editor to inspect the term infos (tis) file in the
index directory, I can see every single token (term).
For experimentation, I implemented an analyzer that transforms the
value passed to the field, and I noticed the following: the TokenStream
still correctly generates the terms that end up being stored in the tis
file, but the original input value is still displayed as the field value
when I retrieve a document from the index and print it with
Document.toString(). I tried to inspect the Field's TokenStream, but
tokenStreamValue() returns null; is that normal when retrieving a
document from an existing index?

Can someone let me know what happens to a Field's value string and at
which point in the pipeline it is replaced by the (term) attributes
generated by the TokenStream?

Thank you very much!
Best,
Carsten


--
Carsten Schnober
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP -- Korpusanalyseplattform der nächsten Generation
http://korap.ids-mannheim.de/ | Tel.: +49-(0)621-1581-238

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


uwe at thetaphi

Apr 18, 2012, 11:06 AM

Post #2 of 3
RE: Field value vs TokenStream [In reply to]

Hi,

You should read up on the difference between "stored" and
"indexed" fields. The tokens in the ".tis" file are in fact the analyzed
tokens retrieved from the TokenStream; this is controlled by the
Field.Index parameter. The Field.Store parameter has nothing to do with
indexing: if a field is marked as "stored", the full and unchanged string /
binary value is stored in the stored fields file (".fdt"). Stored fields are
used, e.g., to display search results after a search has executed (because
the tokens alone do not help with displaying results).

In general, for every field you should think about what you want to do with
it: index it if you want to search on it; store it if you want the value to
be displayed in the search results (available via
IndexReader/IndexSearcher.document()). In most cases only one of the two
options is really needed. I prefer to keep stored and indexed fields
completely separate, with different field names. For example, a stored field
can hold an XML file used for search result display that has nothing to do
with the field used for retrieval, whereas tokenizing and indexing that raw
XML would not be useful.
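
[Editorial sketch] The stored/indexed split described above can be
illustrated with a small toy model in plain Java. This is not Lucene code
and does not reflect Lucene's on-disk format: the class
StoredVsIndexedSketch, its methods, and the boolean store/index flags
(standing in for Field.Store and Field.Index) are all hypothetical names
chosen for illustration, and analyze() is only a crude stand-in for an
analyzer. It shows one value going down two independent paths: the
unchanged string into a "stored" map, and the analyzed terms into a
separate term-to-documents map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy model only: illustrates the two write paths, not Lucene's implementation. */
class StoredVsIndexedSketch {

    /** Plays the role of the stored fields file: docId -> (field name -> original value). */
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    /** Plays the role of the term dictionary plus postings: "field:term" -> doc ids. */
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    /** Stand-in for an analyzer: lower-cases, then splits on anything that is not a letter or digit. */
    static List<String> analyze(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Creates an empty document and returns its id. */
    int newDocument() {
        storedFields.add(new HashMap<>());
        return storedFields.size() - 1;
    }

    /** The two flags are independent: a field may be stored, indexed, both, or neither. */
    void addField(int docId, String name, String value, boolean store, boolean index) {
        if (store) {
            // Stored path: the original value is kept verbatim, never analyzed.
            storedFields.get(docId).put(name, value);
        }
        if (index) {
            // Indexed path: only the analyzed terms are recorded, not the original string.
            for (String term : analyze(value)) {
                List<Integer> docs = postings.computeIfAbsent(name + ":" + term, k -> new ArrayList<>());
                if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                    docs.add(docId);
                }
            }
        }
    }

    /** Returns the original value if the field was stored, otherwise null. */
    String storedValue(int docId, String name) {
        return storedFields.get(docId).get(name);
    }

    /** Returns the ids of documents whose indexed field contains the given term. */
    List<Integer> search(String name, String term) {
        return postings.getOrDefault(name + ":" + term, Collections.emptyList());
    }

    public static void main(String[] args) {
        StoredVsIndexedSketch idx = new StoredVsIndexedSketch();
        int doc = idx.newDocument();
        // Separate field names for retrieval and for display, as suggested above.
        idx.addField(doc, "body", "Hello, Lucene World!", false, true);
        idx.addField(doc, "display", "<title>Hello, Lucene World!</title>", true, false);

        System.out.println(idx.search("body", "lucene"));    // [0]
        System.out.println(idx.storedValue(doc, "body"));    // null (indexed only)
        System.out.println(idx.storedValue(doc, "display")); // <title>Hello, Lucene World!</title>
    }
}
```

In this model the stored side never sees the analyzer's output, which is
consistent with the observation in the original question that a retrieved
document still shows the unmodified input value even though the terms in
the index were transformed.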

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi






schnober at ids-mannheim

Apr 20, 2012, 1:56 AM

Post #3 of 3
Re: Field value vs TokenStream [In reply to]

On 18.04.2012 20:06, Uwe Schindler wrote:

Hi,

> You should inform yourself about the difference between "stored" and
> "indexed" fields: The tokens in the ".tis" file are in fact the analyzed
> tokens retrieved from the TokenStream. This is controlled by the Field
> parameter Field.Index. The Field.Store parameter has nothing to do with
> indexing: if a field is marked as "stored", the full and unchanged string /
> binary is stored in the stored fields file (".fdt"). Stored fields are used

Thanks for that clarification!
Best,
Carsten

--
Carsten Schnober
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP -- Korpusanalyseplattform der nächsten Generation
http://korap.ids-mannheim.de/ | Tel.: +49-(0)621-1581-238
