
Mailing List Archive: Lucene: Java-User

Getting left and right offsets of term search results

 

 



till.kolter at googlemail

Oct 9, 2009, 9:11 AM

Post #1 of 3 (652 views)

I am quite new to Lucene, but I have searched the FAQs and consulted
the mailing list archive. I also debugged through the source code.

I have written an Analyzer that sends a stream through a whole
pipeline of linguistic processing and uses the internal
representation to construct a TokenStream that tokenizes chunks
(semantic units). The TermAttribute string holds the abstract
representation of each unit. For further use (for instance,
highlighting the results in the text), I need access to the
OffsetAttribute that I defined in my TokenStream implementation. As
in StandardTokenizer, I defined an OffsetAttribute to save the left and
right offsets of the original chunks.

Now I want to search for all documents containing an
"AdjectivePhrase", get those APs from the Documents and highlight all
APs in the found documents.

I tried to find results by getting TermPositions with
"Reader.termPositions(term)" and then iterating over the positions, but
the positions only represent the left offset.

Is there another function to get structured results from term queries
over documents, where I can get the whole set of attributes that I
constructed in the TokenStream with addAttribute(Class)? I did not
find such a function, but I guess I don't know all of Lucene's
retrieval methods yet. For my search I used the IndexSearcher.

Thanks
Till Kolter

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


dcausse at spotter

Oct 9, 2009, 10:16 AM

Post #2 of 3 (599 views)
Re: Getting left and right offsets of term search results

Hi,

We also index linguistic data, but (someone correct me if I'm wrong) you
have to work with what the Lucene index offers.
You can store the following.
Usable on the search side:
- a term (TermAttribute)
- the position of the term (PositionIncrementAttribute)
- an arbitrary payload (PayloadAttribute)
Usable once you have found results:
- a TermVector (with no extra attributes, or with OffsetAttribute and/or PositionIncrementAttribute)
- any data you stored in a field (arbitrary data)

OffsetAttribute values are stored in the TermVector (if you requested
that at index time). You can't search the data within the
TermPositionVector, but you can iterate over your results and ask the
reader to return the TermPositionVector for a specific document and field.

Lucene can't store arbitrary Attributes; they are only useful within an
analysis pipeline. If you want to search on this information, you have
to serialize it into the term itself (e.g. add a character at the end of
the term to describe the part of speech), and into the payload for
position-specific information (e.g. a relation id, a paragraph id, or
whatever you want: it's a byte[]).
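
As a rough illustration of the two ideas above, here is a minimal sketch in plain Java. It only shows the encoding side: the class name AnnotatedTerm, the '|' separator and the 4-byte relation id layout are assumptions made up for this example, not part of Lucene, and handing the resulting string and byte[] to TermAttribute and PayloadAttribute is left out.

```java
import java.nio.ByteBuffer;

/** Sketch only: the names and the '|' separator are illustrative, not Lucene API. */
final class AnnotatedTerm {
    // Hypothetical separator; choose a character that never occurs in your terms or tags.
    static final char SEP = '|';

    /** Builds the term text to index, e.g. "tall" + "ADJ" becomes "tall|ADJ". */
    static String encodeTerm(String text, String tag) {
        if (text.indexOf(SEP) >= 0 || tag.indexOf(SEP) >= 0) {
            throw new IllegalArgumentException("separator must not occur in text or tag");
        }
        return text + SEP + tag;
    }

    /** Returns the part before the last separator, or the whole term if there is none. */
    static String textOf(String term) {
        int i = term.lastIndexOf(SEP);
        return i < 0 ? term : term.substring(0, i);
    }

    /** Returns the part after the last separator, or "" if there is none. */
    static String tagOf(String term) {
        int i = term.lastIndexOf(SEP);
        return i < 0 ? "" : term.substring(i + 1);
    }

    /** Position-specific data for the payload: a relation id as 4 big-endian bytes (assumed layout). */
    static byte[] encodeRelationId(int relationId) {
        return ByteBuffer.allocate(4).putInt(relationId).array();
    }

    /** Reads the relation id back from the payload bytes. */
    static int decodeRelationId(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt();
    }

    public static void main(String[] args) {
        String term = encodeTerm("tall", "ADJ");
        byte[] payload = encodeRelationId(42);
        System.out.println(term + " tag=" + tagOf(term) + " relation=" + decodeRelationId(payload));
    }
}
```

Because the tag is part of the term text, a query has to use the same encoded form as the index does.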

With those techniques you can do many things. You have to be inventive,
but with payloads you can do very interesting things.
You can also store the offsets inside the payload and not bother with
term vectors at all!
There are really hundreds of ways to deal with linguistic data
inside Lucene. What is hard is dealing with relations, but
a triple store is probably better suited to that.
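
To make the "offsets inside the payload" idea concrete, here is a minimal plain-Java sketch of one possible byte layout. The class name OffsetPayload and the 8-byte format (two big-endian ints, start then end) are assumptions for illustration only; attaching the bytes to a token and reading them back from the index is not shown.

```java
import java.nio.ByteBuffer;

/** Sketch only: one possible byte[] layout for a token's start and end offsets. */
final class OffsetPayload {
    /** Packs the left and right character offsets into 8 bytes (start first, then end). */
    static byte[] encode(int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid offsets: " + start + ".." + end);
        }
        return ByteBuffer.allocate(8).putInt(start).putInt(end).array();
    }

    /** Unpacks the payload into {start, end}. */
    static int[] decode(byte[] payload) {
        if (payload.length != 8) {
            throw new IllegalArgumentException("expected 8 bytes, got " + payload.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(payload);
        int start = buf.getInt();
        int end = buf.getInt();
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        String text = "a very tall tree";
        int[] offsets = decode(encode(2, 11));
        // Uses the decoded offsets to cut the original chunk out of the text.
        System.out.println(text.substring(offsets[0], offsets[1]));
    }
}
```

A fixed-width layout keeps decoding trivial; a variable-length encoding would save space at the cost of more code.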

I also suggest storing a serialized form of your internal
representation in the index; it may be more flexible than using the
TermPositionVector.
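
One way this suggestion could look, sketched in plain Java: the chunks of a document are flattened into a single string that could be kept in a stored field, then parsed back after a search and used for highlighting. The ChunkCodec class, the "label:start-end;..." format and the square-bracket markers are all hypothetical choices for this example; a real internal representation would need its own format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: a made-up text format for storing annotated chunks alongside a document. */
final class ChunkCodec {
    /** One annotated span: a label plus [start, end) character offsets. */
    static final class Chunk {
        final String label;
        final int start;
        final int end;

        Chunk(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    /** Serializes chunks as "label:start-end;label:start-end". Labels must not contain ';'. */
    static String serialize(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) sb.append(';');
            sb.append(c.label).append(':').append(c.start).append('-').append(c.end);
        }
        return sb.toString();
    }

    /** Parses the format written by serialize. */
    static List<Chunk> parse(String s) {
        List<Chunk> out = new ArrayList<>();
        if (s.isEmpty()) return out;
        for (String part : s.split(";")) {
            int colon = part.lastIndexOf(':');
            int dash = part.lastIndexOf('-');
            out.add(new Chunk(part.substring(0, colon),
                    Integer.parseInt(part.substring(colon + 1, dash)),
                    Integer.parseInt(part.substring(dash + 1))));
        }
        return out;
    }

    /** Brackets every chunk with the given label; assumes chunks are sorted and do not overlap. */
    static String highlight(String text, List<Chunk> chunks, String label) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        for (Chunk c : chunks) {
            if (!c.label.equals(label)) continue;
            sb.append(text, pos, c.start).append('[').append(text, c.start, c.end).append(']');
            pos = c.end;
        }
        return sb.append(text.substring(pos)).toString();
    }

    public static void main(String[] args) {
        String text = "a very tall tree near a bright red door";
        String stored = serialize(Arrays.asList(
                new Chunk("AP", 2, 11), new Chunk("NP", 12, 16), new Chunk("AP", 24, 34)));
        System.out.println(highlight(text, parse(stored), "AP"));
    }
}
```

With this approach the search only has to find the matching documents; the offsets for highlighting come from the stored string rather than from the term vector.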

Hope it helps.


--
David Causse
Spotter
http://www.spotter.com/



till.kolter at googlemail

Oct 12, 2009, 9:35 AM

Post #3 of 3 (573 views)
Re: Getting left and right offsets of term search results

Thanks a lot. I think TermPositionVector will solve my problem,
although it seems a little slow.

Concerning the term representation: our data is far more complex than
just phrasal annotation. That was only an example, because I am not
allowed to talk about our internal organisation. I will look into the
Payload class; it should help me come up with a solution.



