Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-User

token positions

 

 

Lucene java-user RSS feed   Index | Next | Previous | View Threaded


ctignor at thinkmap

Nov 17, 2009, 10:45 AM

Post #1 of 2 (371 views)
Permalink
token positions

Hello,

Hoping someone might clear up a question for me:

When Tokenizing we provide the start and end character offsets for each
token locating it within the source text.

If I tokenize the text "word" and then search for the term "word" in the
same field, how can I recover this character offset information in the
matching documents to precisely locate the word? I have been storing this
character info myself using payload data but if lucene stores it, then I am
doing so needlessly. If recovering this character offset info isn't
possible, what is this character offset info used for?

thanks so much,

C>T>

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


lucene at mikemccandless

Nov 17, 2009, 11:18 AM

Post #2 of 2 (331 views)
Permalink
Re: token positions [In reply to]

The character offset info is only stored if you enable
Field.TermVector.WITH_OFFSETS or WITH_POSITIONS_OFFSETS on the field.

Then, it can only be retrieved if you get the term vectors for that
document, and locate the term & specific occurrence that you're
interested in.

This is likely quite a bit more costly than what you're doing now
(using payloads) so if you retrieve this eg for every hit it's
probably best to stick with payloads.

Mike

On Tue, Nov 17, 2009 at 1:45 PM, Christopher Tignor
<ctignor [at] thinkmap> wrote:
> Hello,
>
> Hoping someone might clear up a question for me:
>
> When Tokenizing we provide the start and end character offsets for each
> token locating it within the source text.
>
> If I tokenize the text "word" and then search for the term "word" in the
> same field, how can I recover this character offset information in the
> matching documents to precisely locate the word?  I have been storing this
> character info myself using payload data but if lucene stores it, then I am
> doing so needlessly.  If recovering this character offset info isn't
> possible, what is this character offset info used for?
>
> thanks so much,
>
> C>T>
>
> --
> TH!NKMAP
>
> Christopher Tignor | Senior Software Architect
> 155 Spring Street NY, NY 10012
> p.212-285-8600 x385 f.212-285-8999
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene

Lucene java-user RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.