
Mailing List Archive: Lucene: Java-User

Access next token in a stream

 

 



dameriangr at gmail

Feb 9, 2012, 11:18 AM

Post #1 of 7 (500 views)
Access next token in a stream

Hello, I want to implement a custom filter. My question is quite simple,
but I cannot find a solution to it no matter how I try:

How can I access the TermAttribute of the token that follows the one I
currently have in my stream?

For example, in the phrase "My name is James Bond", if I am at
the token [My], I would like to be able to check the TermAttribute of
the following token [name] and fix my position increment accordingly.

Thank you in advance!

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


sarowe at syr

Feb 9, 2012, 11:54 AM

Post #2 of 7 (494 views)
RE: Access next token in a stream [In reply to]

Hi Damerian,

One way to handle your scenario is to hold on to the previous token, and only emit a token after you reach at least the second token (or at end-of-stream). Your incrementToken() method could look something like:

1. Get current attributes: input.incrementToken()
2. If previous token does not exist:
2a. Store current attributes as previous token (see AttributeSource#cloneAttributes)
2b. Get current attributes: input.incrementToken()
3. Check for & store conditions that will affect previous token's attributes
4. Store current attributes as next token (see AttributeSource#cloneAttributes)
5. Copy previous token into current attributes (see AttributeSource#copyTo);
the target will be "this", which is an AttributeSource.
6. Make changes based on conditions found in step #3 above
7. set previous token = next token
8. return true

(Everywhere I say "token" I mean "instance of AttributeSource".)

The final token in the input stream will need special handling, as will single-token input streams.
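
The steps above amount to keeping one token in hand while reading the next. A minimal plain-Java sketch of that buffering follows. It uses strings instead of Lucene attribute snapshots, and the class `LookaheadTokens` with its `next`, `peek` and `describe` methods are hypothetical names for illustration, not Lucene API. In a real TokenFilter the held token would be an attribute snapshot, for example one taken with the `cloneAttributes()`/`copyTo()` pair mentioned above or with `captureState()`/`restoreState()`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical model of a filter that always knows the token after the one it emits. */
final class LookaheadTokens {
    private final Iterator<String> input;
    private String held;      // the token read ahead but not yet emitted
    private boolean primed;   // true once the first token has been read

    LookaheadTokens(Iterator<String> input) {
        this.input = input;
    }

    /** Returns the next token to emit, or null at end of stream. */
    String next() {
        if (!primed) {
            // Step 2: nothing is held yet, so read the very first token.
            held = input.hasNext() ? input.next() : null;
            primed = true;
        }
        String current = held;   // step 5: the held token is the one to emit
        if (current != null) {
            // Steps 1 and 4: read ahead and hold the following token.
            held = input.hasNext() ? input.next() : null;
        }
        return current;
    }

    /** The token that follows the one most recently returned, or null if none. */
    String peek() {
        return held;
    }

    /** Demo helper: pairs every token with its successor. */
    static List<String> describe(List<String> tokens) {
        LookaheadTokens stream = new LookaheadTokens(tokens.iterator());
        List<String> out = new ArrayList<>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            out.add(t + "->" + stream.peek());
        }
        return out;
    }
}
```

The last token and a single-token stream both fall out of the same code path here: `peek()` simply returns null when nothing follows.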

Good luck,
Steve



dameriangr at gmail

Feb 9, 2012, 1:15 PM

Post #3 of 7 (495 views)
Re: Access next token in a stream [In reply to]

Hi Steve,
Thank you for your immediate reply. I will try your solution, but I feel
that it does not solve my case.
What I am trying to make is a filter that joins together two
terms/tokens that start with a capital letter (it tries to find all
the names/surnames and make each pair one token). So in my aforementioned
example, when I examine [James], even if I store the TermAttribute in a
temporary token, how can I check the next one, [Bond], and join them
without actually emitting [James] on its own (and therefore creating a
term for it in my inverted index)?
Thank you again for your insight. I would really appreciate any other
views on the matter.

Regards, Damerian




sarowe at syr

Feb 9, 2012, 1:51 PM

Post #4 of 7 (491 views)
RE: Access next token in a stream [In reply to]

Damerian,

The technique I mentioned would work for you with a little tweaking: when you see consecutive capitalized tokens, then just set the CharTermAttribute to the joined tokens, and clear the previous token.

Another idea: you could use ShingleFilter with min size = max size = 2, and then use a following Filter extending FilteringTokenFilter, with an accept() method that examines shingles and rejects ones that don't qualify, something like the following. (Notes: this is untested; I assume you will use the default shingle token separator " "; and this filter will reject all non-shingle terms, so you won't get anything but names, even if you configure ShingleFilter to emit single tokens):

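public final class MyNameFilter extends FilteringTokenFilter {
  // One capitalized word followed by one or more whitespace-separated capitalized words.
  private static final Pattern NAME_PATTERN
      = Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  // A constructor that passes the input TokenStream to super(...) is also needed;
  // its exact signature depends on the Lucene version.

  @Override public boolean accept() throws IOException {
    // CharTermAttribute is a CharSequence, so it can be matched directly.
    return NAME_PATTERN.matcher(termAtt).matches();
  }
}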
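
To see which two-word shingles the pattern above would keep, here is a minimal plain-Java check that needs no Lucene on the classpath. The `ShingleNameCheck` class and its `shingles` and `names` helpers are hypothetical stand-ins: `shingles` only imitates size-2 shingling with a single-space separator and is not ShingleFilter itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ShingleNameCheck {
    // Same pattern as in MyNameFilter: a capitalized word, then one or more capitalized words.
    static final Pattern NAME_PATTERN =
        Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");

    /** Imitates two-token shingles joined by one space (an assumption, not ShingleFilter). */
    static List<String> shingles(String text) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(words[i] + " " + words[i + 1]);
        }
        return out;
    }

    /** Keeps only the shingles that the pattern accepts in full. */
    static List<String> names(String text) {
        List<String> kept = new ArrayList<>();
        for (String shingle : shingles(text)) {
            if (NAME_PATTERN.matcher(shingle).matches()) {
                kept.add(shingle);
            }
        }
        return kept;
    }
}
```

Calling `ShingleNameCheck.names("My name is James Bond")` returns a list containing only `James Bond`. The shingle `My name` is rejected because its second word is lowercase, which also shows why every non-name term disappears from the output.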

Steve



dameriangr at gmail

Feb 9, 2012, 1:59 PM

Post #5 of 7 (490 views)
Re: Access next token in a stream [In reply to]

I think my solution is almost complete now. Only one question: you mentioned
"clear the previous token". Is there a built-in method for doing that?
In the beginning I thought that if I put my new token at the same
position (a position increment of 0) it would "overwrite" the previous one, but all I
achieved was to inject an additional token. My method so far is
this:

@Override
public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
        return false;
    }
    // Case where the previous token did NOT start with a capital letter followed by small letters
    if (previousTokenCanditateMainName == false) {
        if (CheckIfMainName(termAtt.term())) {
            previousTokenCanditateMainName = true;
            tempString = this.termAtt.term(); // This is the token I need to "delete"
            // myToken.offsetAtt = this.offsetAtt;
            tempStartOffset = this.offsetAtt.startOffset();
            tempEndOffset = this.offsetAtt.endOffset();
            // this.nextInputStreamToken.clearAttributes();

            return true;
        } else {
            return true;
        }
    } // Case where the previous token WAS a proper name (starting with a capital and continuing with small letters)
    else {
        if (CheckIfMainName(termAtt.term())) {
            previousTokenCanditateMainName = false;
            posIncrAtt.setPositionIncrement(0);
            String myString = tempString + TOKEN_SEPARATOR + this.termAtt.term();

            // termAtt.setTermBuffer(myString, tempStartOffset, myString.length());
            termAtt.setTermBuffer(tempString + TOKEN_SEPARATOR + this.termAtt.term());
            offsetAtt.setOffset(tempStartOffset, this.offsetAtt.endOffset());
            return true;
        } else {
            previousTokenCanditateMainName = false;
            return true;
        }
    }
}

The CheckIfMainName() method is a simple custom-made method that decides
whether the token fulfills the criteria.
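
The message does not show that method. A hypothetical version, assuming the criterion is the one stated in the code comments (one capital letter followed only by small letters), might look like the sketch below. The class name `ProperNameCheck` is invented for illustration.

```java
final class ProperNameCheck {
    /** True when the term is one uppercase letter followed by one or more lowercase letters. */
    static boolean checkIfMainName(String term) {
        if (term == null || term.length() < 2) {
            return false;
        }
        if (!Character.isUpperCase(term.charAt(0))) {
            return false;
        }
        // Every remaining character must be a lowercase letter.
        for (int i = 1; i < term.length(); i++) {
            if (!Character.isLowerCase(term.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Under this assumption `James` qualifies, while `name`, `USA` and a lone `A` do not.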

Once again, thank you very much for your help and for the time you
have spent helping me.

regards
/Damerian



sarowe at syr

Feb 9, 2012, 2:12 PM

Post #6 of 7 (492 views)
RE: Access next token in a stream [In reply to]

Damerian,

When I said "clear the previous token", I was referring to the pseudo-code I gave in my first response to you. There is no built-in method to do that. If you want to conditionally output tokens, you should store AttributeSource clones, as in my pseudo-code.
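
A minimal plain-Java sketch of that idea for the name-joining case: a capitalized token is held back instead of being emitted, and is later released either joined with the next capitalized token or on its own. Strings stand in for AttributeSource clones, and `NameJoiner`, `join` and `isCapitalized` are hypothetical names, so this models the control flow only, not Lucene's API.

```java
import java.util.ArrayList;
import java.util.List;

final class NameJoiner {
    /** Assumed criterion: the token starts with an uppercase letter. */
    static boolean isCapitalized(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    /** Joins each pair of consecutive capitalized tokens into one output token. */
    static List<String> join(List<String> tokens) {
        List<String> out = new ArrayList<>();
        String pending = null;                   // held-back candidate, not yet emitted
        for (String token : tokens) {
            if (pending == null) {
                if (isCapitalized(token)) {
                    pending = token;             // hold it; decide after seeing the next token
                } else {
                    out.add(token);
                }
            } else if (isCapitalized(token)) {
                out.add(pending + " " + token);  // emit the joined token only
                pending = null;
            } else {
                out.add(pending);                // no partner: release the held token as-is
                out.add(token);
                pending = null;
            }
        }
        if (pending != null) {
            out.add(pending);                    // end of stream: flush the last held token
        }
        return out;
    }
}
```

Because the first capitalized token is never emitted before its successor has been seen, [James] does not reach the output on its own when [Bond] follows. In a real incrementToken(), which returns one token per call, the branch that releases two tokens would need to keep the second one buffered until the next call.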

Steve



dameriangr at gmail

Feb 9, 2012, 2:14 PM

Post #7 of 7 (494 views)
Re: Access next token in a stream [In reply to]

Steve, one last thank you! I gained valuable knowledge tonight!

/Damerian

