Mailing List Archive: Lucene: Java-User

Token character positions

 

 


ctignor at thinkmap

Nov 17, 2009, 7:37 AM

Post #1 of 2 (447 views)
Token character positions

Hello,

Hoping someone might clear up a question for me:

When tokenizing, we provide the start and end character offsets for each
token, locating it within the source text.

If I tokenize the text "word" and then search for the term "word" in the
same field, how can I recover this character offset information in the
matching documents to precisely locate the word? I have been storing this
character info myself using payload data, but if Lucene stores it, then I am
doing so needlessly. If recovering this character offset info isn't
possible, what is this character offset info used for?

thanks so much,

C>T>

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999


gsingers at apache

Nov 18, 2009, 3:24 AM

Post #2 of 2 (405 views)
Re: Token character positions

On Nov 17, 2009, at 10:37 AM, Christopher Tignor wrote:

> Hello,
>
> Hoping someone might clear up a question for me:
>
> When tokenizing, we provide the start and end character offsets for each
> token, locating it within the source text.
>
> If I tokenize the text "word" and then search for the term "word" in the
> same field, how can I recover this character offset information in the
> matching documents to precisely locate the word? I have been storing this
> character info myself using payload data, but if Lucene stores it, then I am
> doing so needlessly. If recovering this character offset info isn't
> possible, what is this character offset info used for?

Lucene doesn't currently store offset information in the postings, so you are not duplicating anything.

There are four possible ways to store it that I know of, one of which is under construction now and will eventually be the best solution:

1. Payloads - in fact, there is a TokenFilter in the payloads package under contrib/analysis that does just this.
2. Term Vectors - these store lots of info along with the offsets. I've often used them along with SpanQueries to get precise locations.
3. Hack up the highlighter (assuming you aren't just doing this for highlighting).
4. Flexible indexing (future) - create your index your way and store the offset info in a strongly typed payload. This will require writing your own code.
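
To make option 1 concrete, here is a minimal sketch of packing a token's start and end offsets into a payload-sized byte array and reading them back. It uses only java.nio and makes no Lucene calls; the class name OffsetPayload, the method names, and the 8-byte layout (two big-endian ints) are assumptions chosen for illustration, not necessarily what the contrib TokenFilter does.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Lucene: encodes a token's character
// offsets as 8 bytes (start offset, then end offset, each a big-endian int).
class OffsetPayload {

    // Pack the two offsets into a byte array suitable for use as payload data.
    static byte[] encode(int startOffset, int endOffset) {
        return ByteBuffer.allocate(8)
                .putInt(startOffset)
                .putInt(endOffset)
                .array();
    }

    // Unpack the offsets; returns {startOffset, endOffset}.
    static int[] decode(byte[] payload) {
        if (payload == null || payload.length != 8) {
            throw new IllegalArgumentException("expected an 8-byte payload");
        }
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new int[] { buf.getInt(), buf.getInt() };
    }

    public static void main(String[] args) {
        String text = "a word here";
        int start = text.indexOf("word");      // offset of the first character
        int end = start + "word".length();     // offset just past the last character

        byte[] payload = encode(start, end);   // what would be stored at index time
        int[] offsets = decode(payload);       // what would be read back at search time

        System.out.println(text.substring(offsets[0], offsets[1])); // prints "word"
    }
}
```

In a real index the encoded bytes would be attached to each token at analysis time and read back from the term positions of the matching documents; that wiring depends on the Lucene version and is left out here.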

-Grant