
dameriangr at gmail
Feb 9, 2012, 2:14 PM
Post #7 of 7
On 9/2/2012 11:12 PM, Steven A Rowe wrote:
> Damerian,
>
> When I said "clear the previous token", I was referring to the
> pseudo-code I gave in my first response to you. There is no built-in
> method to do that. If you want to conditionally output tokens, you
> should store AttributeSource clones, as in my pseudo-code.
>
> Steve
>
>> -----Original Message-----
>> From: Damerian [mailto:dameriangr [at] gmail]
>> Sent: Thursday, February 09, 2012 5:00 PM
>> To: java-user [at] lucene
>> Subject: Re: Access next token in a stream
>>
>> On 9/2/2012 10:51 PM, Steven A Rowe wrote:
>>> Damerian,
>>>
>>> The technique I mentioned would work for you with a little tweaking:
>>> when you see consecutive capitalized tokens, then just set the
>>> CharTermAttribute to the joined tokens, and clear the previous token.
>>>
>>> Another idea: you could use ShingleFilter with min size = max size = 2,
>>> and then use a following Filter extending FilteringTokenFilter, with an
>>> accept() method that examines shingles and rejects ones that don't
>>> qualify, something like the following.
>>> (Notes: this is untested; I assume you will use the default shingle
>>> token separator " "; and this filter will reject all non-shingle terms,
>>> so you won't get anything but names, even if you configure
>>> ShingleFilter to emit single tokens):
>>>
>>> public final class MyNameFilter extends FilteringTokenFilter {
>>>   private static final Pattern NAME_PATTERN
>>>       = Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");
>>>   private final CharTermAttribute termAtt =
>>>       addAttribute(CharTermAttribute.class);
>>>   @Override public boolean accept() throws IOException {
>>>     return NAME_PATTERN.matcher(termAtt).matches();
>>>   }
>>> }
>>>
>>> Steve
>>>
>>>> -----Original Message-----
>>>> From: Damerian [mailto:dameriangr [at] gmail]
>>>> Sent: Thursday, February 09, 2012 4:15 PM
>>>> To: java-user [at] lucene
>>>> Subject: Re: Access next token in a stream
>>>>
>>>> On 9/2/2012 8:54 PM, Steven A Rowe wrote:
>>>>> Hi Damerian,
>>>>>
>>>>> One way to handle your scenario is to hold on to the previous token,
>>>>> and only emit a token after you reach at least the second token (or
>>>>> at end-of-stream). Your incrementToken() method could look something
>>>>> like:
>>>>>
>>>>> 1. Get current attributes: input.incrementToken()
>>>>> 2. If previous token does not exist:
>>>>>    2a. Store current attributes as previous token (see
>>>>>        AttributeSource#cloneAttributes)
>>>>>    2b. Get current attributes: input.incrementToken()
>>>>> 3. Check for & store conditions that will affect previous token's
>>>>>    attributes
>>>>> 4. Store current attributes as next token (see
>>>>>    AttributeSource#cloneAttributes)
>>>>> 5. Copy previous token into current attributes (see
>>>>>    AttributeSource#copyTo);
>>>>>    the target will be "this", which is an AttributeSource.
>>>>> 6. Make changes based on conditions found in step #3 above
>>>>> 7. set previous token = next token
>>>>> 8. return true
>>>>>
>>>>> (Everywhere I say "token" I mean "instance of AttributeSource".)
>>>>>
>>>>> The final token in the input stream will need special handling, as
>>>>> will single-token input streams.
>>>>>
>>>>> Good luck,
>>>>> Steve
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Damerian [mailto:dameriangr [at] gmail]
>>>>>> Sent: Thursday, February 09, 2012 2:19 PM
>>>>>> To: java-user [at] lucene
>>>>>> Subject: Access next token in a stream
>>>>>>
>>>>>> Hello, I want to implement my custom filter. My question is quite
>>>>>> simple, but I cannot find a solution to it no matter how I try:
>>>>>>
>>>>>> How can I access the TermAttribute of the token that follows the
>>>>>> one I currently have in my stream?
>>>>>>
>>>>>> For example, in the phrase "My name is James Bond", if let's say I
>>>>>> am at the token [My], I would like to be able to check the
>>>>>> TermAttribute of the following token [name] and fix my position
>>>>>> increment accordingly.
>>>>>>
>>>>>> Thank you in advance!
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
>>>>>> For additional commands, e-mail: java-user-help [at] lucene
>>>>
>>>> Hi Steve,
>>>> Thank you for your immediate reply. I will try your solution, but I
>>>> feel that it does not solve my case.
>>>> What I am trying to make is a filter that joins together two
>>>> terms/tokens that start with a capital letter (it is trying to find
>>>> all the Names/Surnames and make them one token). So in my
>>>> aforementioned example, when I examine [James], even if I store the
>>>> TermAttribute in a temporary token, how can I check the next one,
>>>> [Bond], to join them without actually emitting a token (and therefore
>>>> creating a term in my inverted index) that has [James] on its own?
>>>> Thank you again for your insight, and I would really appreciate any
>>>> other views on the matter.
>>>>
>>>> Regards, Damerian
>>
>> I think my solution is almost complete now. Only one question: you
>> mentioned "clear the previous token". Is there a built-in method for
>> doing that?
>> In the beginning I thought that if I gave my new token a position
>> increment of 0 it would "overwrite" the previous one, but what I
>> succeeded in doing was simply to inject code.. My method that does
>> that so far is this:
>>
>> @Override
>> public boolean incrementToken() throws IOException {
>>     if (!input.incrementToken()) {
>>         return false;
>>     }
>>     // Case where the previous token WAS NOT starting with a capital
>>     // letter and the rest small
>>     if (previousTokenCanditateMainName == false) {
>>         if (CheckIfMainName(termAtt.term())) {
>>             previousTokenCanditateMainName = true;
>>             tempString = this.termAtt.term();       /*This is the*/
>>             // myToken.offsetAtt=this.offsetAtt;    /*Token i need to "delete"*/
>>             tempStartOffset = this.offsetAtt.startOffset();
>>             tempEndOffset = this.offsetAtt.endOffset();
>>             //this.nextInputStreamToken.clearAttributes();
>>
>>             return true;
>>         } else {
>>             return true;
>>         }
>>     } // Case where the previous token WAS a proper name (starting
>>       // with a capital and continuing with small letters)
>>     else {
>>         if (CheckIfMainName(termAtt.term())) {
>>             previousTokenCanditateMainName = false;
>>             posIncrAtt.setPositionIncrement(0);
>>             String myString = tempString + TOKEN_SEPARATOR + this.termAtt.term();
>>
>>             //termAtt.setTermBuffer(myString, tempStartOffset, myString.length());
>>             termAtt.setTermBuffer(tempString + TOKEN_SEPARATOR + this.termAtt.term());
>>             offsetAtt.setOffset(tempStartOffset, this.offsetAtt.endOffset());
>>             return true;
>>         } else {
>>             previousTokenCanditateMainName = false;
>>             return true;
>>         }
>>     }
>> }
>>
>> The CheckIfMainName() method is a simple custom-made method to decide
>> whether the token fulfills the criteria.
>>
>> Once again, thank you very much for your help and the time that you
>> spent helping me.
>>
>> regards
>> /Damerian

Steve, one last thank you! I gained valuable knowledge tonight!
/Damerian
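
Editor's sketch: the point Steve makes in this thread is that a token cannot be "un-emitted" once incrementToken() has returned true for it, so a capitalized token has to be held back until the following token has been inspected. The plain-Java example below illustrates only that control flow on a list of strings. It is not Lucene code and has not been run against any Lucene version. The class NameJoiner, its join() method, and the CAPITALIZED pattern (a stand-in for CheckIfMainName(), loosely borrowed from the NAME_PATTERN above) are hypothetical names introduced for illustration. In a real TokenFilter the held token would be a stored copy of the attributes (for example via AttributeSource#cloneAttributes, as Steve suggests) rather than a String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Plain-Java illustration of "hold the previous token until the next one is seen". */
final class NameJoiner {
    // Assumption: a "name" token is an uppercase letter followed by non-whitespace.
    private static final Pattern CAPITALIZED = Pattern.compile("\\p{Lu}\\S*");
    private static final String TOKEN_SEPARATOR = " ";

    private NameJoiner() {}

    /** Joins each pair of consecutive capitalized tokens into a single token. */
    static List<String> join(List<String> tokens) {
        List<String> out = new ArrayList<>();
        String held = null; // capitalized token seen but not yet emitted
        for (String token : tokens) {
            boolean capitalized = CAPITALIZED.matcher(token).matches();
            if (held != null && capitalized) {
                // Two capitalized tokens in a row: emit only the joined form.
                out.add(held + TOKEN_SEPARATOR + token);
                held = null;
            } else {
                if (held != null) {
                    // The held token was not followed by a name: emit it unchanged.
                    out.add(held);
                    held = null;
                }
                if (capitalized) {
                    held = token; // defer the decision until the next token
                } else {
                    out.add(token);
                }
            }
        }
        if (held != null) {
            out.add(held); // end of stream: flush the last held token
        }
        return out;
    }
}
```

With the example phrase from the original question, NameJoiner.join(List.of("My", "name", "is", "James", "Bond")) returns [My, name, is, James Bond]: [James] never appears on its own, while [My] is emitted unchanged once [name] shows it is not part of a pair. The end-of-stream flush corresponds to the special handling of the final token that Steve mentions. Like the code posted in the thread, the sketch joins tokens pairwise only, so three capitalized tokens in a row yield one joined token plus one single token.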