
Mailing List Archive: Lucene: Java-User

How to implement a GivenCharFilter using incrementToken

 

 



se.dxt at hotmail

Nov 24, 2009, 8:15 PM

Post #1 of 3 (1186 views)
How to implement a GivenCharFilter using incrementToken

Hi,
I am finding it very hard to implement a GivenCharFilter (extends
TokenFilter) using incrementToken. My requirement is this: I want to
analyze a StringReader("axb xxa xx c") into these tokens, written as
term(startOffset, endOffset, posIncr):
a(0,1,1) b(2,3,1) a(4,5,1) c(6,7,1).
First I use a WhiteSpaceFilter to split the input into tokens, and then I
apply GivenCharFilter (assume it filters "x", as above). The problem is that
when incrementToken is called, I get the term attribute "axb", but at that
point I need to return two terms, "a" and "b". How can I do that, given that
each call returns only one term?
Please help me.

My code looks like this:

============================================

public class GivenCharFilter extends TokenFilter {

    private char filterChar = 'x';
    private TermAttribute termAtt;
    private OffsetAttribute offsetAtt;
    private PositionIncrementAttribute posIncrAtt;

    public GivenCharFilter() {
        init();
    }

    public void init() {
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
        offsetAtt = (OffsetAttribute) addAttribute(OffsetAttribute.class);
        posIncrAtt = (PositionIncrementAttribute)
                addAttribute(PositionIncrementAttribute.class);
    }

    /** Advances to the next token; returns false at end of stream. */
    public final boolean incrementToken() throws IOException {
        // How to?
    }

}
================================

Regards,
Xiaotao Deng
--
View this message in context: http://old.nabble.com/How-to-implement-a-GivenCharFilter-using-incrementToken-tp26507318p26507318.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


uwe at thetaphi

Nov 25, 2009, 2:05 AM

Post #2 of 3 (1146 views)
RE: How to implement a GivenCharFilter using incrementToken [In reply to]

I do not completely understand your request; maybe you can tell us some more
about the requirements of your implementation.

The example you have given is invalid, as offsets should always refer to the
original position in the source stream, so should be:

a(0,1,1) b(2,3,1) a(6,7,1) c(11,12,1).

The second problem: why extend TokenFilter? The super constructor call is
wrong; it must be given another TokenStream as input. Maybe you want to write
a Tokenizer that reads from the stream?

Should a and b always be single characters, or can they be any term?

If the above offsets are correct and what you want (otherwise it would make
no sense), the following would work without any custom class:

Reader input = new StringReader("axb xxa xx c");
NormalizeCharMap map = new NormalizeCharMap();
map.add("x", " "); // replace all x by whitespace
input = new MappingCharFilter(map, input);
TokenStream stream = new WhitespaceTokenizer(input);
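
To see why this yields the offsets listed above, here is a minimal sketch in plain Java with no Lucene dependency. It only models the idea: each filter character is replaced by a single space, so the text keeps its length and every token keeps its original position. The class and method names (MappedWhitespaceSketch, mapToSpaces, tokenize) are hypothetical and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java model of "map filter chars to spaces, then split on whitespace". */
class MappedWhitespaceSketch {

    /** Replaces every filter character with one space, so the length is unchanged. */
    static String mapToSpaces(String text, Set<Character> filterChars) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(filterChars.contains(c) ? ' ' : c);
        }
        return sb.toString();
    }

    /** Splits on whitespace and formats each token as term(start,end,posIncr). */
    static List<String> tokenize(String mapped) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < mapped.length()) {
            if (Character.isWhitespace(mapped.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < mapped.length() && !Character.isWhitespace(mapped.charAt(i))) {
                i++;
            }
            // Every token follows the previous one directly, so posIncr is 1.
            tokens.add(mapped.substring(start, i) + "(" + start + "," + i + ",1)");
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<Character> filterChars = new HashSet<>();
        filterChars.add('x');
        String mapped = mapToSpaces("axb xxa xx c", filterChars);
        // prints: a(0,1,1) b(2,3,1) a(6,7,1) c(11,12,1)
        System.out.println(String.join(" ", tokenize(mapped)));
    }
}
```

Because the replacement is one character for one character, no offset correction is needed in this model.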

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi







se.dxt at hotmail

Nov 25, 2009, 5:04 PM

Post #3 of 3 (1133 views)
RE: How to implement a GivenCharFilter using incrementToken [In reply to]

The example you have given is invalid, as offsets should always refer to the
original position in the source stream, so should be:

a(0,1,1) b(2,3,1) a(6,7,1) c(11,12,1).

Deng: My concern is matching. If (case 1) I index "axxxb" and then search
for "axb", or (case 2) I index "axb" and then search for "axxxxb", would
either one match? That is why I want to ignore the distance that the filtered
characters put between terms.
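
A small plain-Java sketch of this concern, under the assumption that whitespace and any run of the filter character both act as separators: all three inputs then reduce to the same term sequence, so the number of filter characters between "a" and "b" no longer matters when terms are compared. FilterCharTerms and terms are hypothetical names used only for illustration; this is not Lucene code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Plain-Java model: reduce a text to its terms, ignoring runs of a filter character. */
class FilterCharTerms {

    /** Splits on whitespace and on runs of filterChar; empty pieces are dropped. */
    static List<String> terms(String text, char filterChar) {
        String separators =
                "(?:\\s|" + Pattern.quote(String.valueOf(filterChar)) + ")+";
        List<String> out = new ArrayList<>();
        for (String piece : text.split(separators)) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(terms("axxxb", 'x'));  // prints: [a, b]
        System.out.println(terms("axb", 'x'));    // prints: [a, b]
        System.out.println(terms("axxxxb", 'x')); // prints: [a, b]
    }
}
```

If the indexed text and the query text both pass through the same analysis, case 1 and case 2 compare the same sequence of terms.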

The second problem: why extend TokenFilter? The super constructor call is
wrong; it must be given another TokenStream as input.
Deng: You are right :) In my project, the old version implemented this using
"nextToken" and extended TokenFilter. Now I want to rewrite it using the
incrementToken method, so I want to know how to achieve that. Maybe it is
impossible.

Should a and b always be single characters, or can they be any term?
Deng: I used single characters for a, b and c only to keep the example
simple; they can be any term. Also, "x" can be a set containing many filter
characters (maybe more than 1,000).

If the above offsets are correct and what you want (otherwise it would make
no sense), the following would work without any custom class:

Reader input = new StringReader("axb xxa xx c");
NormalizeCharMap map = new NormalizeCharMap();
map.add("x", " "); // replace all x by whitespace
input = new MappingCharFilter(map, input);
TokenStream stream = new WhitespaceTokenizer(input);

Deng: That's a good idea, and it answers my question. Thank you very much.

Another question:
When incrementToken gets a token such as "abc#def#hi", how can I keep the
remainder "def#hi" and return the term "abc", so that the next call to
incrementToken continues with "def#hi"? ("#" stands for a token separator,
and I want to get three tokens, abc, def and hi, from the original token.)
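
One common pattern for this kind of question is to buffer the pieces of the current upstream token and hand them out one per call, pulling the next upstream token only when the buffer is empty. The sketch below models that pattern in plain Java with no Lucene dependency. SplittingFilterSketch, its term() accessor and the Iterator used as the upstream source are all hypothetical stand-ins. In a real TokenFilter the offsets and position increment would also have to be set for each piece, and the captureState() and restoreState() methods of AttributeSource may help carry the other attributes across calls.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.regex.Pattern;

/** Plain-Java stand-in for a token filter that splits one token into several. */
class SplittingFilterSketch {
    private final Iterator<String> input;                     // upstream tokens
    private final String separatorRegex;                      // quoted separator
    private final Deque<String> pending = new ArrayDeque<>(); // pieces not yet returned
    private String term;                                      // plays the role of the term attribute

    SplittingFilterSketch(Iterator<String> input, char separator) {
        this.input = input;
        this.separatorRegex = Pattern.quote(String.valueOf(separator));
    }

    /** Advances to the next piece; returns false when the input is exhausted. */
    boolean incrementToken() {
        // Refill the buffer from upstream only when no pieces are left over.
        while (pending.isEmpty()) {
            if (!input.hasNext()) {
                return false;
            }
            for (String piece : input.next().split(separatorRegex)) {
                if (!piece.isEmpty()) {
                    pending.add(piece);
                }
            }
        }
        term = pending.poll();
        return true;
    }

    String term() {
        return term;
    }

    public static void main(String[] args) {
        SplittingFilterSketch filter =
                new SplittingFilterSketch(Arrays.asList("abc#def#hi").iterator(), '#');
        while (filter.incrementToken()) {
            System.out.println(filter.term()); // prints abc, def, hi on separate lines
        }
    }
}
```

The loop around the refill step matters: an upstream token that consists only of separators produces no pieces, and the sketch then moves on to the next upstream token instead of returning an empty term.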





