
Mailing List Archive: Lucene: Java-User

Lucene 4 - POS and Syntactic Tagging

 

 



mmcguir3 at hawk

Mar 14, 2012, 9:37 AM

Post #1 of 5
Lucene 4 - POS and Syntactic Tagging

I'm working on a project where I need to tag both the part of speech and
other syntactic information on tokens so that this information is
searchable. I have read the threads on the mailing list regarding part
of speech tagging here
<http://mail-archives.apache.org/mod_mbox/lucene-java-user/201105.mbox/%3CBANLkTimwqcQ_GF2pxE8Hyc_R75NcWDRWbQ [at] mail%3E>
and the many responses to similar questions. To me, inserting 0-increment
tokens seems rather clunky, especially when TypeAttribute appears to be
what one would want to use. Does Lucene do anything extra depending on
whether the type is left at its default, "word"? Is it possible to write
a search that combines multiple token attributes (i.e., a search for
CharTermAttribute "dog" followed by a TypeAttribute of verb)?

Also, if I were to use 0-increment tokens for tagging, would data like
document length or sumTotalTermFreq differ from that of a document indexed
without these tags? How would I counteract these differences, if any occur?
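
As a rough illustration of what zero-increment tagging does to positions and token counts, here is a plain-Java sketch. It does not call Lucene; the names ZeroIncrementSketch, Tok, positions and overlapCount are made up for this example, and it assumes the usual convention that a first token with increment 1 lands at position 0. It only shows the arithmetic: tag tokens share a position with the word they annotate, yet they still add to the raw token count. Whether a particular Lucene statistic includes them is exactly the open question above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a token stream with position increments; not Lucene API.
class ZeroIncrementSketch {

    /** A token plus its position increment (0 = stacked on the previous token). */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** Absolute position of each token: running sum of increments, starting at -1. */
    static List<Integer> positions(List<Tok> toks) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posInc;
            out.add(pos);
        }
        return out;
    }

    /** Number of tokens that were stacked on an existing position. */
    static int overlapCount(List<Tok> toks) {
        int n = 0;
        for (Tok t : toks) {
            if (t.posInc == 0) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "dog barks" with a POS tag stacked on each word.
        List<Tok> toks = Arrays.asList(
            new Tok("dog", 1), new Tok("NOUN", 0),
            new Tok("barks", 1), new Tok("VERB", 0));
        System.out.println("positions:      " + positions(toks));
        System.out.println("raw tokens:     " + toks.size());
        System.out.println("without tags:   " + (toks.size() - overlapCount(toks)));
    }
}
```

Subtracting the stacked tokens from the raw count recovers the length the untagged document would have had, which is the kind of correction the question asks about.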

Thanks,
Mark McGuire


paul at metajure

Apr 2, 2012, 9:49 AM

Post #2 of 5
RE: Lucene 4 - POS and Syntactic Tagging

> Mark McGuire wrote:
> I'm working on a project where I need to tag both the part of speech and other syntactic information on tokens

To pick up on this thread from a few weeks back.

I've never done this myself, but I think that your desire to put extra information that is not really a token in the index at a particular location is exactly what Payloads are for.
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/

The above article even mentions:
"A payload can be used to store weights for specific terms or things like part of speech tags or other semantic information. "

I don't think "searching on attributes" is the right way to describe it. Attributes are features of some Lucene objects, a way to ask a complex object for something. Some attributes return information from the index, but attributes themselves are not stored in the index; tokens and payloads are. I'm sure my understanding is incomplete too: using a type other than "word" seems like a reasonable way to go, but I can't see how to get a query to search on a particular type of token.
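
To make the payload suggestion a little more concrete, here is a minimal plain-Java model of the idea: each term occurrence carries a small byte array, and a lookup first finds the term and then checks the bytes stored at each position. This is only a sketch under assumptions. PayloadSketch, Posting, add and positionsWithTag are invented names, not Lucene classes or methods, and in Lucene itself reading payloads back at search time would need payload-aware query code rather than a plain term query.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of "term positions with payloads"; not Lucene API.
class PayloadSketch {

    /** One occurrence of a term: where it is, plus opaque bytes attached to it. */
    static final class Posting {
        final int position;
        final byte[] payload;
        Posting(int position, byte[] payload) { this.position = position; this.payload = payload; }
    }

    private final Map<String, List<Posting>> postings = new HashMap<>();
    private int nextPosition = 0;

    /** Index one token; the POS tag is stored as a payload, not as a term. */
    void add(String term, String posTag) {
        byte[] payload = posTag.getBytes(StandardCharsets.UTF_8);
        postings.computeIfAbsent(term, k -> new ArrayList<>())
                .add(new Posting(nextPosition++, payload));
    }

    /** Positions where the term occurs AND its payload decodes to the given tag. */
    List<Integer> positionsWithTag(String term, String posTag) {
        List<Integer> hits = new ArrayList<>();
        for (Posting p : postings.getOrDefault(term, new ArrayList<Posting>())) {
            if (posTag.equals(new String(p.payload, StandardCharsets.UTF_8))) {
                hits.add(p.position);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PayloadSketch idx = new PayloadSketch();
        // "the dog will dog him"
        idx.add("the", "det");
        idx.add("dog", "noun");
        idx.add("will", "aux");
        idx.add("dog", "verb");
        idx.add("him", "pron");
        System.out.println(idx.positionsWithTag("dog", "verb"));
        System.out.println(idx.positionsWithTag("dog", "noun"));
    }
}
```

The point of the design is that the tag never becomes a term of its own, so term counts stay as they were; the cost is that every match has to be post-filtered by decoding the payload.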

-Paul


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


kuro at basistech

Apr 9, 2012, 1:10 PM

Post #3 of 5
Re: Lucene 4 - POS and Syntactic Tagging

If you want to search on part-of-speech tag, I'd just make a parallel
field ("text_pos" for the field "text", for example) and search on that
field (text_pos:noun).

Kuro



kuro at basistech

Apr 9, 2012, 11:57 PM

Post #4 of 5
Re: Lucene 4 - POS and Syntactic Tagging

Please disregard my earlier suggestion. It is a bad idea. Almost every text
would contain a verb, a noun, etc., so searching a field that holds only
POS tags won't make sense. Maybe the parallel field should instead hold the
lemma (dictionary form) plus the part-of-speech tag, put together as a
single token like "like_verb" or "lemming_propernoun"?
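
A small plain-Java sketch of the lemma-plus-tag idea, under stated assumptions: the tokens arrive already tagged, the field names "text" and "text_pos" follow the earlier message, and ParallelFieldSketch, Tagged, addDocument and search are hypothetical names standing in for a real index. It shows why the combined token is selective where a bare tag is not.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy inverted index with a parallel "lemma_pos" field; not Lucene API.
class ParallelFieldSketch {

    /** A token as a tagger might deliver it. */
    static final class Tagged {
        final String surface, lemma, pos;
        Tagged(String surface, String lemma, String pos) {
            this.surface = surface; this.lemma = lemma; this.pos = pos;
        }
    }

    // field name -> term -> ids of documents containing that term
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();

    /** Index the surface form in "text" and lemma_pos in "text_pos". */
    void addDocument(int docId, List<Tagged> tokens) {
        for (Tagged tok : tokens) {
            post("text", tok.surface.toLowerCase(Locale.ROOT), docId);
            post("text_pos", tok.lemma + "_" + tok.pos, docId);
        }
    }

    private void post(String field, String term, int docId) {
        index.computeIfAbsent(field, f -> new HashMap<>())
             .computeIfAbsent(term, t -> new TreeSet<>())
             .add(docId);
    }

    /** Exact term lookup in one field; empty set if nothing matches. */
    Set<Integer> search(String field, String term) {
        Map<String, Set<Integer>> terms = index.get(field);
        if (terms == null || !terms.containsKey(term)) return new TreeSet<Integer>();
        return terms.get(term);
    }

    public static void main(String[] args) {
        ParallelFieldSketch idx = new ParallelFieldSketch();
        idx.addDocument(0, Arrays.asList(            // "I like lemmings"
            new Tagged("I", "i", "pronoun"),
            new Tagged("like", "like", "verb"),
            new Tagged("lemmings", "lemming", "noun")));
        idx.addDocument(1, Arrays.asList(            // "run like lemmings"
            new Tagged("run", "run", "verb"),
            new Tagged("like", "like", "prep"),
            new Tagged("lemmings", "lemming", "noun")));
        System.out.println("text:like          -> " + idx.search("text", "like"));
        System.out.println("text_pos:like_verb -> " + idx.search("text_pos", "like_verb"));
    }
}
```

Both documents contain the word "like", but only the first uses it as a verb, so the "text_pos" lookup narrows the result where a tag-only field would not.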

On 4/9/12 1:10 PM, T. Kuro Kurosaka wrote:
> If you want to search on part-of-speech tag, I'd just make a parallel
> field ("text_pos" for the field "text", for example) and search on
> that field (text_pos:noun).
>
> Kuro


uwe at thetaphi

Apr 10, 2012, 12:22 AM

Post #5 of 5
RE: Lucene 4 - POS and Syntactic Tagging

Hi,

A simple way to get this is to make the "type" part of the term text.
This does not hurt your search, because the type is added on both the
index and the query side (the Analyzer simply appends the type to the
term text in both cases): "term#word". Of course, you can also have a
second field without that additional information (if you want to search
without it).

Appending the token type to the term can be done with a TokenFilter that
calls termAttribute.append("#").append(typeAttribute.type()). Be sure to
use this analyzer on both the query and the indexing side, possibly with
PerFieldAnalyzerWrapper to limit it to specific fields.
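
The transformation itself is tiny. The sketch below models it in plain Java so it can run without Lucene on the classpath; TypedTermSketch, Token, appendType and analyze are hypothetical names, and in a real analyzer the same append would sit inside the filter's incrementToken(), working on the term and type attributes instead of a Token object. It also shows why the same step has to run on both sides: a query term only matches if it went through the identical rewrite.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of "append the type to the term text"; not Lucene API.
class TypedTermSketch {

    /** A token as the tagger would emit it: term text plus a type such as "verb". */
    static final class Token {
        final String term;
        final String type;
        Token(String term, String type) { this.term = term; this.type = type; }
    }

    /** What the filter would do per token: term + "#" + type. */
    static String appendType(Token t) {
        return new StringBuilder(t.term).append('#').append(t.type).toString();
    }

    /** Applies the same rewrite to a whole token sequence. */
    static List<String> analyze(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(appendType(t));
        }
        return out;
    }

    public static void main(String[] args) {
        // Index side: the stored terms carry their type.
        Set<String> indexed = new HashSet<>(analyze(Arrays.asList(
            new Token("the", "det"),
            new Token("dog", "noun"),
            new Token("barks", "verb"))));

        // Query side: the query token goes through the same rewrite.
        System.out.println(indexed.contains(appendType(new Token("dog", "noun")))); // true
        System.out.println(indexed.contains(appendType(new Token("dog", "verb")))); // false
        System.out.println(indexed.contains("dog"));                                // false
    }
}
```

The last lookup is the reason for the optional second field: a bare "dog" no longer matches anything in the typed field.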

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi



