
Mailing List Archive: Lucene: Java-User

How to deal with Token in the new TS API

 

 



serera at gmail

Nov 22, 2009, 3:12 AM

Post #1 of 24 (2193 views)
How to deal with Token in the new TS API

Hi

I started to migrate my Analyzers, Tokenizer, TokenStreams and TokenFilters
to the new API. Since the entire set of classes handled Token before, I
decided to not change it for now, and was happy to discover that Token
extends AttributeImpl, which makes the migration easier.

So I started w/ my Tokenizer. I had a "private final Token token =
addAttribute(Token.class);" line. I got startled when I received
"java.lang.IllegalArgumentException: Could not find implementing class for
org.apache.lucene.analysis.Token". I checked my classpath, tried to run from
eclipse and cmd-line, nada. I then checked the source code, and discovered
that the default attribute factory adds an "Impl" to the class name. So:

1) Phew ... nothing's wrong w/ my classpath.
2) Mental note - read the documentation more closely: in package.html it's
said that if you implement an Attribute, make sure to add Impl to its class
name, or otherwise you'll need to provide your own AttributeFactory.
3) But, why is the exception so vague? If Lucene adds "Impl" to the class
name that I pass, shouldn't it also say that "... class for ....NameImpl"?
That way, I'd see TokenImpl and immediately figure out that I should read
the documentation.
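
As a rough illustration of the lookup described above, here is a minimal sketch of a factory that appends "Impl" to the name it is given, and that names the attempted class in the exception as suggested in point 3. It is a toy model and not Lucene's source: ImplLookupSketch, ColorAttribute, TokenLike, implNameFor and findImplClass are all hypothetical names, and the wording of the extended message is an assumption.

```java
// Toy model of an "append Impl to the name" lookup. NOT Lucene's code;
// every class and method name here is hypothetical.
final class ImplLookupSketch {

    /** Stand-in for an attribute interface. */
    public interface ColorAttribute {}

    /** Found by convention: interface name + "Impl". */
    public static class ColorAttributeImpl implements ColorAttribute {}

    /** Stand-in for a class like Token: no "TokenLikeImpl" exists. */
    public static class TokenLike implements ColorAttribute {}

    /** Returns the class name the lookup will try to load. */
    static String implNameFor(Class<?> attClass) {
        return attClass.getName() + "Impl";
    }

    /** Loads the implementing class and reports the attempted name on failure. */
    static Class<?> findImplClass(Class<?> attClass) {
        String implName = implNameFor(attClass);
        try {
            return Class.forName(implName, true, attClass.getClassLoader());
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException(
                "Could not find implementing class for " + attClass.getName()
                + " (tried to load " + implName + ")");
        }
    }

    public static void main(String[] args) {
        // The interface resolves to its Impl by naming convention.
        System.out.println(findImplClass(ColorAttribute.class).getSimpleName());
        // A concrete class has no matching "...Impl", so the lookup fails.
        try {
            findImplClass(TokenLike.class);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the attempted name in the message, a reader would see "TokenLikeImpl" and could tell at once that the naming convention, not the classpath, is the issue.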

I then went on to read about AttributeFactory, and was wondering in the
process why the hell do I need to implement one which is marked EXPERT
whereas I use a "basic" Lucene class, when I discovered that Token includes
a TokenAttributeFactory. So:

1) Good ! I don't need to implement an AttributeFactory.
2) Why isn't it mentioned in the documentation? If Token was kept for easy
migration from pre-2.9 API, I'd expect this to appear very clearly in
package.html. Something like "if you're migrating from pre-2.9 API and would
like to keep using Token, MAKE SURE TO CALL
super(Token.TOKEN_ATTRIBUTE_FACTORY) IN YOUR TOKENIZER". Something like
that, maybe with less upper-casing.

I went on and moved the addAttribute line to inside the ctor, after I call
super(...). But then something else hit me. In my TokenFilters I call
input.hasAttribute(Token.class) to ensure the input TS will process Token. I
was surprised to find out this method returns 'false'. Debug-tracing the
code I discovered that when I call addAttribute, all the Attribute classes
Token implements are added to the map, but not Token itself. So:

1) Hmmm ... not so easy to migrate my Token-based API to the new API ...
2) I assume getAttribute(Token.class) won't work either ... so what benefit
did I get from calling addAttribute(Token.class) in the first place? Now I
need, in my consumer API, to rebuild a Token on every incrementToken call?
3) Isn't that a crime? I added X and called has(X) and got false ... again
documentation could help, but I get a sense that this is buggy behavior.
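
To make the observed behavior concrete, here is a minimal sketch of an attribute map that registers one implementation object under every attribute interface it implements, but never under its own class. It is a simplified model of what the debug trace above suggests, not Lucene's AttributeSource; AttributeMapSketch, TermLike, OffsetLike, TokenLike and addImpl are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model: one impl object, registered once per attribute interface.
// NOT Lucene's code; all names are hypothetical.
final class AttributeMapSketch {

    interface Attribute {}
    interface TermLike extends Attribute {}
    interface OffsetLike extends Attribute {}

    /** Stand-in for Token: one class implementing several attribute interfaces. */
    static class TokenLike implements TermLike, OffsetLike {}

    private final Map<Class<? extends Attribute>, Attribute> byInterface =
        new LinkedHashMap<>();

    /** Registers impl under each attribute interface found on its class hierarchy. */
    void addImpl(Attribute impl) {
        for (Class<?> c = impl.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                if (iface != Attribute.class && Attribute.class.isAssignableFrom(iface)) {
                    byInterface.putIfAbsent(iface.asSubclass(Attribute.class), impl);
                }
            }
        }
    }

    boolean hasAttribute(Class<? extends Attribute> attClass) {
        return byInterface.containsKey(attClass);
    }

    Attribute getAttribute(Class<? extends Attribute> attClass) {
        return byInterface.get(attClass);
    }

    public static void main(String[] args) {
        AttributeMapSketch source = new AttributeMapSketch();
        source.addImpl(new TokenLike());
        System.out.println(source.hasAttribute(TokenLike.class)); // false
        System.out.println(source.hasAttribute(TermLike.class));  // true
        // Both interfaces map to the very same object.
        System.out.println(source.getAttribute(TermLike.class)
            == source.getAttribute(OffsetLike.class));            // true
    }
}
```

In this model the map is keyed by interface only, which is why a lookup by the concrete class finds nothing even though the object is present.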

Before you answer that I can call getAttribute(TermAttribute.class),
remember that I started this email as a user that wants to migrate to a new
API, and the documentation says I can use Token for easier migration. So
using all the other attributes is a less preferred option now, especially as
I'm not going to introduce, at the moment, new attributes, but just continue
to work with the 'default' ones.

Any help will be appreciated. I really hope I'm missing something basic ...

Shai


serera at gmail

Nov 22, 2009, 3:50 AM

Post #2 of 24 (2142 views)
Re: How to deal with Token in the new TS API [In reply to]

To add to my previous email, if I do the following:

StringReader sr = new StringReader("hello world");
TokenStream ts = new WhitespaceTokenizer(Token.TOKEN_ATTRIBUTE_FACTORY, sr);

for (Iterator<Class<? extends Attribute>> iter =
         ts.getAttributeClassesIterator(); iter.hasNext();) {
    Class<? extends Attribute> type = iter.next();
    System.out.println(type);
}

TermAttribute ta = ts.getAttribute(TermAttribute.class);
OffsetAttribute oa = ts.getAttribute(OffsetAttribute.class);

while (ts.incrementToken()) {
    System.out.println(ta + " " + oa);
}

Then it prints:

interface org.apache.lucene.analysis.tokenattributes.TermAttribute
interface org.apache.lucene.analysis.tokenattributes.TypeAttribute
interface org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute
interface org.apache.lucene.analysis.tokenattributes.FlagsAttribute
interface org.apache.lucene.analysis.tokenattributes.OffsetAttribute
interface org.apache.lucene.analysis.tokenattributes.PayloadAttribute
(hello,0,5) (hello,0,5)
(world,6,11) (world,6,11)

Reason for all the attributes - I use Token.TOKEN_ATTRIBUTE_FACTORY.
WhitespaceTokenizer, through CharTokenizer, adds just Term and Offset
attributes. However, TokenAttributeFactory's createAttributeInstance code
adds Token itself every time. That's because the code:

return attClass.isAssignableFrom(Token.class) ? new Token() :
delegate.createAttributeInstance(attClass);

always returns new Token(), since every Token can be assigned to
TermAttribute or OffsetAttribute. Shouldn't it be the other way around?
I.e., we want to add Tokens, not classes Token implements. So I think it
should be Token.class.isAssignableFrom(attClass), so that only Token and its
sub-classes get added by this factory, and otherwise it calls the delegate?
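
Since the question turns on the direction of Class.isAssignableFrom, here is a small stand-alone sketch comparing the two directions on stand-in types. Only isAssignableFrom itself is a real API; TermLike, OffsetLike, CustomLike, TokenLike, MyToken and the two helper methods are hypothetical names used for illustration.

```java
// Compares the two directions of Class.isAssignableFrom on toy types.
// All type and method names here are hypothetical stand-ins.
final class AssignableDirectionSketch {

    interface TermLike {}
    interface OffsetLike {}
    interface CustomLike {}

    /** Stand-in for Token. */
    static class TokenLike implements TermLike, OffsetLike {}

    /** Stand-in for a user subclass of Token. */
    static class MyToken extends TokenLike {}

    /** Direction quoted from the factory: can a TokenLike be used as attClass? */
    static boolean tokenCanServe(Class<?> attClass) {
        return attClass.isAssignableFrom(TokenLike.class);
    }

    /** Direction proposed in the post: is attClass TokenLike or a subclass of it? */
    static boolean isTokenOrSubclass(Class<?> attClass) {
        return TokenLike.class.isAssignableFrom(attClass);
    }

    public static void main(String[] args) {
        System.out.println(tokenCanServe(TermLike.class));     // true
        System.out.println(tokenCanServe(CustomLike.class));   // false
        System.out.println(tokenCanServe(MyToken.class));      // false
        System.out.println(isTokenOrSubclass(TermLike.class)); // false
        System.out.println(isTokenOrSubclass(MyToken.class));  // true
    }
}
```

The first direction answers "which requested interfaces can one TokenLike object satisfy", while the second answers "was a Token-derived class requested"; they select different sets of classes.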

Reason for the double printing ... the actual instance that gets added to
the map is of Token. Therefore regardless if I call
getAttribute(TermAttribute) or getAttribute(OffsetAttribute), I get the
Token instance. And when I print it, it calls Token.toString().

It's strange ... I can't "addA(Token) -- hasA(Token)" but I can "addA(Token)
-- hasA(Term) -- getA(Term) -- cast to Token" ...

I don't know if this is a bug or not, but it's strange.

Shai



uwe at thetaphi

Nov 22, 2009, 4:10 AM

Post #3 of 24 (2154 views)
RE: How to deal with Token in the new TS API [In reply to]

> To add to my previous email, If I do the following:
>
> StringReader sr = new StringReader("hello world");
> TokenStream ts = new WhitespaceTokenizer(Token.TOKEN_ATTRIBUTE_FACTORY,
> sr);
>
> for (Iterator<Class<? extends Attribute>> iter =
> ts.getAttributeClassesIterator(); iter.hasNext();) {
> Class< ? extends Attribute> type = iter.next();
> System.out.println(type);
> }
>
> TermAttribute ta = ts.getAttribute(TermAttribute.class);
> OffsetAttribute oa = ts.getAttribute(OffsetAttribute.class);
>
> while (ts.incrementToken()) {
> System.out.println(ta + " " + oa);
> }
>
> Then it prints:
>
> interface org.apache.lucene.analysis.tokenattributes.TermAttribute
> interface org.apache.lucene.analysis.tokenattributes.TypeAttribute
> interface
> org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute
> interface org.apache.lucene.analysis.tokenattributes.FlagsAttribute
> interface org.apache.lucene.analysis.tokenattributes.OffsetAttribute
> interface org.apache.lucene.analysis.tokenattributes.PayloadAttribute
> (hello,0,5) (hello,0,5)
> (world,6,11) (world,6,11)

That is correct, because you are iterating the attribute instances.

> Reason for all the attributes - I use Token.TOKEN_ATTRIBUTE_FACTORY.
> WhitespaceTokenizer, through CharTokenizer, adds just Term and Offset
> attributes. However, TokenAttributeFactory's createAttributeInstance code
> adds Token itself every time. That's because the code:
>
> return attClass.isAssignableFrom(Token.class) ? new Token() :
> delegate.createAttributeInstance(attClass);
>
> always returns new Token(), since every Token can be assigned to
> TermAttribute or OffsetAttribute. Shouldn't it be the other way around?

No, that is exactly correct. If you add a TermAttribute to the TS and use
the Token attribute factory, it *must* add all implemented attributes. This
is why you cannot rely on (unused) attributes not already being in the TS.
And by the way, Token is only added once to the TS; all 6 attributes (after
a call to addAttribute) will return the same instance!

> I.e., we want to add Tokens, not classes Token implements. So I think it
> should be Token.class.isAssignableFrom(attClass), so that only Token and its
> sub-classes get added by this factory, and otherwise it calls the delegate?

The AttributeSource only allows *one* instance per impl, so if you add one
Token you cannot add more of them. Conversely, the TS will then
automatically have all attributes that Token implements.

> Reason for the double printing ... the actual instance that gets added to
> the map is of Token. Therefore regardless if I call
> getAttribute(TermAttribute) or getAttribute(OffsetAttribute), I get the
> Token instance. And when I print it, it calls Token.toString().

The double printing cannot be removed. The simplest way is to use
TokenStream.toString() instead; it will give you a full snapshot as a
string. This is exactly why Attribute does not implement toString(). The
println works because javac casts to (Object).

> It's strange ... I can't "addA(Token) -- hasA(Token)" but I can
> "addA(Token)
> -- hasA(Term) -- getA(Term) -- cast to Token" ...
>
> I don't know if this is a bug or not, but it's strange.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


serera at gmail

Nov 22, 2009, 4:30 AM

Post #4 of 24 (2141 views)
Re: How to deal with Token in the new TS API [In reply to]

Thanks Uwe for the response, however that doesn't get me anywhere. I already
know that Token is added once, and that after I add Token I cannot add more
of them. And I understand why the double printing.

I want to add Token.class, and then work w/ Token. Not TermAttribute,
PosIncrAttribute, OffsetAttribute, PayloadAttribute and TypeAttribute (these
are the five attributes I'm using from Token). Why can't the code add Token
to the attributes map? If all of these are anyway mapped to the same
instance, what problems will it cause?

What I'll do for now is call addAttribute(Token.class) which will return me
a Token. But, per the other thread, this behavior is buggy IMO, because I'd
then rely on the input TS to support Token, which may not be the case ...
So perhaps I can move to check whether all the attributes that I care about
are there. But this just complicates the code. If Token was added to the
Attributes map, I wouldn't need to do this juggling ...

Shai



uwe at thetaphi

Nov 22, 2009, 4:35 AM

Post #5 of 24 (2146 views)
RE: How to deal with Token in the new TS API [In reply to]

> I started to migrate my Analyzers, Tokenizer, TokenStreams and
> TokenFilters
> to the new API. Since the entire set of classes handled Token before, I
> decided to not change it for now, and was happy to discover that Token
> extends AttributeImpl, which makes the migration easier.
>
> So I started w/ my Tokenizer. I had a "private final Token token =
> addAttribute(Token.class);" line. I got startled when I received
> "java.lang.IllegalArgumentException: Could not find implementing class for
> org.apache.lucene.analysis.Token". I checked my classpath, tried to run
> from
> eclipse and cmd-line, nada. I then checked the source code, and discovered
> that the default attribute factory adds an "Impl" to the class name. So:

That's a problem of 2.9 using Java 1.4. addAttribute only accepts Class<T>,
where T extends Attribute, not AttributeImpl (the catch is that Token also
implements these interfaces, but you should only pass attributes as
interfaces!). Generics cannot prevent passing Token.class, because Token
implements Attribute. The method should only get attribute interfaces as
its parameter; I have no idea how to enforce this with generics.

Maybe we should add an extra check to the input parameter like if
(!clazz.isInterface()) throw new IAE with a good explanation instead of
trying to load a class.
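
The check proposed above could look roughly like the following sketch. It is only an illustration of the idea, not a patch against Lucene: InterfaceCheckSketch, TermLike, TokenLike and requireAttributeInterface are hypothetical names, and the message text is an assumption.

```java
// Sketch of an up-front "must be an interface" check. NOT Lucene's code;
// all names and the message wording are hypothetical.
final class InterfaceCheckSketch {

    interface Attribute {}
    interface TermLike extends Attribute {}

    /** Stand-in for Token: a concrete class that also satisfies T extends Attribute. */
    static class TokenLike implements TermLike {}

    /**
     * The generic bound cannot tell an interface from a class implementing it,
     * so the distinction has to be checked at runtime.
     */
    static <T extends Attribute> Class<T> requireAttributeInterface(Class<T> attClass) {
        if (!attClass.isInterface()) {
            throw new IllegalArgumentException(
                attClass.getName() + " is a class, not an attribute interface; "
                + "pass the interface it implements instead");
        }
        return attClass;
    }

    public static void main(String[] args) {
        // Compiles and passes: an attribute interface.
        System.out.println(requireAttributeInterface(TermLike.class).getSimpleName());
        // Also compiles, because TokenLike satisfies the bound, but fails at runtime.
        try {
            requireAttributeInterface(TokenLike.class);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of failing early is that the caller gets an explanation of the misuse rather than a class-loading error for a name they never typed.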

> 1) Phew ... nothing's wrong w/ my classpath.
> 2) Mental note - read the documentation more closely: in package.html it's
> said that if you implement an Attribute, make sure to add Impl to its
> class
> name, or otherwise you'll need to provide your own AttributeFactory.

All six default attributes have a corresponding Impl that is loaded by the
default AttributeFactory. Token is *not* an attribute; it is an Impl that
serves as a replacement for the 6 basic Impl classes.

> 3) But, why is the exception so vague? If Lucene adds "Impl" to the class
> name that I pass, shouldn't it also say that "... class for ....NameImpl"?
> That way, I'd see TokenImpl and immediately figure out that I should read
> the documentation.

The exception says exactly what's happening.

> I then went on to read about AttributeFactory, and was wondering in the
> process why the hell do I need to implement one which is marked EXPERT
> whereas I use a "basic" Lucene class, when I discovered that Token
> includes
> a TokenAttributeFactory. So:
>
> 1) Good ! I don't need to implement an AttributeFactory.

This was a fault in 2.9: Token.TOKEN_ATTRIBUTE_FACTORY was not available,
so you had to implement it yourself. In 3.0 it's available.

> 2) Why isn't it mentioned in the documentation? If Token was kept for easy
> migration from pre-2.9 API, I'd expect this to appear very clearly in
> package.html. Something like "if you're migrating from pre-2.9 API and
> would
> like to keep using Token, MAKE SURE TO CALL
> super(Token.TOKEN_ATTRIBUTE_FACTORY) IN YOUR TOKENIZER". Something like
> that, maybe with less upper-casing.

Token is *not* kept for easy migration; it is kept for two reasons:
a) For supporting the old next() API
b) As a class behind the implementation. Using the new API, you always have
to write your code using *interfaces*, not the impls. So if you call
addAttribute, you get the interface impl reference back and you should only
use it as the interface (never cast it; in 2.9 you have to cast, but in 3.0
it should stay T extends Attribute, as addAttribute(Class<T>) enforces).
Never do:

Token tok = (Token) addAttribute(TermAttribute.class)

Because you can not rely on the fact that the return value is Token, even if
you use AttributeFactory (in 2.9, with old API support enabled, the returned
class is TokenWrapper, also implementing the attributes).

Always do:
TermAttribute tok = addAttribute(TermAttribute.class)

So all your TokenStream impls should never rely on implementations; only use
the interfaces. And as the interfaces do not support toString, clone,
copyTo, ..., do not use those. Use captureState and so on. Everything else
could easily break if you do not have control over the whole Tokenizer chain!

> I went on and moved the addAttribute line to inside the ctor, after I call
> super(...). But then something else hit me. In my TokenFilters I call
> input.hasAttribute(Token.class) to ensure the input TS will process Token.
> I
> was surprised to find out this method returns 'false'.

hasAttribute accepts only Class<? extends Attribute>, and ? must be an
interface, not an implementation.

> Debug-tracing the
> code I discovered that when I call addAttribute, all the Attribute classes
> Token implements are added to the map, but not Token itself. So:

That cannot happen, as addAttribute(Token.class) does not work. Do you mean
addAttributeImpl() or are you using the factory?

> 1) Hmmm ... not so easy to migrate my Token-based API to the new API ...

Token is not an interface that extends Attribute.

> 2) I assume getAttribute(Token.class) won't work either ... so what
> benefit
> did I get from calling addAttribute(Token.class) in the first place? Now I
> need, in my consumer API, to rebuild a Token on every incrementToken call?

It will never work. You have to use a factory.

> 3) Isn't that a crime? I added X and called has(X) and got false ... again
> documentation could help, but I get a sense that this is buggy behavior.

As said before, checking whether attributes are available is not
recommended. Always register your attributes using addAttribute. If the
attribute is there, it is returned; if not, it is created with default
content.
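
The get-or-create behavior described here can be modeled in a few lines. This is a toy sketch, not Lucene's AttributeSource: GetOrCreateSketch, TermLike and TermLikeImpl are hypothetical names, and the Supplier parameter is an assumption standing in for the attribute factory (the real addAttribute takes only the class).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Toy model of "return it if present, otherwise create it with defaults".
// NOT Lucene's code; all names and the Supplier parameter are hypothetical.
final class GetOrCreateSketch {

    interface Attribute {}

    interface TermLike extends Attribute {
        String term();
        void setTerm(String term);
    }

    static class TermLikeImpl implements TermLike {
        private String term = ""; // default content
        public String term() { return term; }
        public void setTerm(String term) { this.term = term; }
    }

    private final Map<Class<? extends Attribute>, Attribute> attributes = new HashMap<>();

    /** Returns the registered attribute, creating and registering it on first use. */
    <T extends Attribute> T addAttribute(Class<T> attClass, Supplier<? extends T> factory) {
        Attribute existing = attributes.get(attClass);
        if (existing == null) {
            existing = factory.get();
            attributes.put(attClass, existing);
        }
        return attClass.cast(existing);
    }

    public static void main(String[] args) {
        GetOrCreateSketch stream = new GetOrCreateSketch();
        // Producer and consumer both just ask for the interface.
        TermLike producerView = stream.addAttribute(TermLike.class, TermLikeImpl::new);
        TermLike consumerView = stream.addAttribute(TermLike.class, TermLikeImpl::new);
        producerView.setTerm("hello");
        System.out.println(consumerView.term()); // hello
    }
}
```

Because every caller goes through the same call, no caller needs a separate existence check, and both sides only ever hold the interface type.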

> Before you answer that I can call getAttribute(TermAttribute.class),
> remember that I started this email as a user that wants to migrate to a
> new
> API, and the documentation says I can use Token for easier migration. So
> using all the other attributes is a less preferred option now, especially
> as
> I'm not going to introduce, at the moment, new attributes, but just
> continue
> to work with the 'default' ones.
>
> Any help will be appreciated. I really hope I'm missing something basic
> ...

All there is to say is: you cannot use Token inside your implementations of
TokenStreams. You can only back all basic attributes with Token for
efficiency.

Uwe




uwe at thetaphi

Nov 22, 2009, 4:36 AM

Post #6 of 24 (2146 views)
RE: How to deal with Token in the new TS API [In reply to]

>
> I want to add Token.class, and then work w/ Token. Not TermAttribute,
> PosIncrAttribute, OffsetAttribute, PayloadAttribute and TypeAttribute (these
> are the five attributes I'm using from Token). Why can't the code add Token
> to the attributes map? If all of these are anyway mapped to the same
> instance, what problems will it cause?

That is simply not possible (see the last sentence of my other mail).

Uwe




serera at gmail

Nov 22, 2009, 4:44 AM

Post #7 of 24 (2141 views)
Re: How to deal with Token in the new TS API [In reply to]

But I do use addAttribute(Token.class), so I don't understand why you say
it's not possible. And I completely don't understand why the new API allows
me to just work w/ interfaces and not impls ... A while ago I got the
impression that we're trying to get rid of interfaces because they're not
easy to maintain back-compat with ...

Shai



uwe at thetaphi

Nov 22, 2009, 4:58 AM

Post #8 of 24 (2143 views)
RE: How to deal with Token in the new TS API [In reply to]

> But I do use addAttribute(Token.class), so I don't understand why you say
> it's not possible. And I completely don't understand why the new API allows
> me to just work w/ interfaces and not impls ... A while ago I got the
> impression that we're trying to get rid of interfaces because they're not
> easy to maintain back-compat with ...

addAttribute(Token.class) should throw an exception, but it doesn't (it's a
bug in 3.0). addAttribute should only work with interfaces; it also accepts
Token because the AttributeFactory accepts it - bang.

Sorry, but you can only pass attribute interface class literals to
addAttribute/getAttribute/hasAttribute and so on.

Sorry.

Uwe




serera at gmail

Nov 22, 2009, 5:28 AM

Post #9 of 24 (2147 views)
Re: How to deal with Token in the new TS API [In reply to]

ok so from what I understand, I should stop working w/ Token, and move to
working w/ the Attributes.

addAttribute indeed does not work. Even though it does not throw an
exception, if I call in.addAttribute(Token.class), I get a new instance of
Token and not the one that was added by in. So this is even more severe
than just not blocking this option.

I thought I could switch to addAttributeImpl, but that won't help me,
because I won't be able to call getAttribute(Token.class).

So this leaves me w/ just working w/ the interfaces.

What do I need to do in order to clone an attribute? Previously I used
token.copyTo(target). How can I do it now if I don't have copyTo and/or
clone on the interfaces?

Shai



uwe at thetaphi

Nov 22, 2009, 5:33 AM

Post #10 of 24 (2149 views)
RE: How to deal with Token in the new TS API [In reply to]

Use captureState and save the state somewhere. You can restore the state
with restoreState to the TokenStream. CachingTokenFilter does this.

So the new API uses the State object to put away tokens for later reference.
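
As a rough illustration of putting a token away and bringing it back, here is
a simplified model. It is not Lucene's API: MiniTokenState and its fields are
hypothetical, and the nested State class only imitates the idea of an opaque
snapshot that can be restored but not queried.

```java
// Simplified model of capture/restore; NOT Lucene's AttributeSource or State.
class MiniTokenState {
    // The "live" attribute values, overwritten for every new token.
    String term = "";
    int startOffset;
    int endOffset;

    // Opaque snapshot: callers can keep it and restore it, but not inspect it.
    static final class State {
        private final String term;
        private final int startOffset;
        private final int endOffset;

        private State(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Overwrites the live values, as a stream would when it advances.
    void setToken(String term, int startOffset, int endOffset) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    // Copies the current values into a snapshot.
    State captureState() {
        return new State(term, startOffset, endOffset);
    }

    // Copies a previously captured snapshot back into the live values.
    void restoreState(State state) {
        this.term = state.term;
        this.startOffset = state.startOffset;
        this.endOffset = state.endOffset;
    }
}
```

The point of the model is that the snapshot survives later changes to the
live values and can be written back whenever the stored token is needed.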

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi



serera at gmail

Nov 22, 2009, 5:57 AM

Post #11 of 24 (2140 views)
Re: How to deal with Token in the new TS API [In reply to]

Perhaps I misunderstand something. The current use case I'm trying to solve
is - I have an abbreviations TokenFilter which reads a token and stores it.
If the next token is end-of-sentence, it checks whether the previous one is
in the abbreviations list, and discards the end-of-sentence token. I need to
store the first token somewhere so I can reference it.

Example: "hello mr. shai"
First token = hello -> store it and return
Second token = mr -> store it and return
Third token = "." -> check if "mr" is an abbreviation, if so don't return
".".
Fourth token = "shai" -> store it and return.
...

How do I store "mr" (or any of the others)? It was easy w/ copyTo. If I
captureState, I get a State, but I can't query it for a TermAttribute. Any
ideas?

Shai



serera at gmail

Nov 22, 2009, 6:27 AM

Post #12 of 24 (2129 views)
Re: How to deal with Token in the new TS API [In reply to]

What I've done is:

State state = in.captureState();
...
// Upon new call to incrementToken().
State tmp = in.captureState();
in.restoreState(state);
// check if termAttribute is an abbreviation.
If not : in.restoreState(tmp);

But that seems like a lot of capturing/restoring to me ... how expensive is that?

Shai



uwe at thetaphi

Nov 22, 2009, 6:34 AM

Post #13 of 24 (2134 views)
RE: How to deal with Token in the new TS API [In reply to]

If you just want to look up whether "Mr" is an abbreviation, why not look it up
when you handle that token and set a boolean variable in the TS
(lastTokenWasAbbreviation)? When you process the ".", remove it if the
boolean is set.
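
A minimal sketch of that flag-based idea, working on plain strings instead of
a real TokenStream. FlagAbbreviationFilter and its filter method are
hypothetical names, and the sketch assumes "." is never itself listed as an
abbreviation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Plain-string model of the flag approach; not a Lucene TokenFilter.
class FlagAbbreviationFilter {
    static List<String> filter(Set<String> abbreviations, String... tokens) {
        List<String> out = new ArrayList<String>();
        boolean lastTokenWasAbbreviation = false;
        for (String token : tokens) {
            // Drop a period only when it directly follows an abbreviation.
            boolean drop = ".".equals(token) && lastTokenWasAbbreviation;
            // The lookup happens for every token, whether or not a period follows.
            lastTokenWasAbbreviation = abbreviations.contains(token);
            if (!drop) {
                out.add(token);
            }
        }
        return out;
    }
}
```

No previous token has to be stored here, at the price of one set lookup per
token.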

Uwe



serera at gmail

Nov 22, 2009, 6:37 AM

Post #14 of 24 (2130 views)
Re: How to deal with Token in the new TS API [In reply to]

Because that'd mean I'd check for abbreviations on every token, which is a
big performance loss. My way, I only check for abbreviations when I encounter
a "." (not even on all end-of-sentence tokens).

Why can't State offer a "getAttribute" like AttributeSource?

Shai



uwe at thetaphi

Nov 22, 2009, 6:42 AM

Post #15 of 24 (2135 views)
RE: How to deal with Token in the new TS API [In reply to]

> Because that'd mean I'll check for abbreviations for every token. Which is a
> big performance loss. That way, I can just check abbr if I encountered a "."
> (not even all end-of-sentence tokens).

OK, then simply copy the term to a String and store it. The cost is the same
as cloning/copying. If you find the ".", use the String and look it up.

> Why can't State offer a "getAttribute" like AttributeSource?

Because State is optimized for fast restore. In previous 2.9 versions State
was itself an AttributeSource instance, but the capture/store was very, very
slow.

If you wanted to inspect a State, you would need to iterate over all its
attributes and find the correct one, which is also slow. The best option is to
simply copy the term text into a String. You must create new objects in all
cases, even with clone/copy.
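
A minimal sketch of the store-the-term idea, again on plain strings rather
than a real TokenStream. LazyAbbreviationFilter is a hypothetical name; the
abbreviation set is consulted only when a "." arrives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Plain-string model of "remember the previous term"; not a Lucene TokenFilter.
class LazyAbbreviationFilter {
    static List<String> filter(Set<String> abbreviations, String... tokens) {
        List<String> out = new ArrayList<String>();
        String previousTerm = null; // copy of the last emitted term
        for (String token : tokens) {
            // The set lookup runs only for a period that follows some term.
            if (".".equals(token)
                    && previousTerm != null
                    && abbreviations.contains(previousTerm)) {
                previousTerm = null; // the period is dropped, nothing to remember
                continue;
            }
            previousTerm = token;
            out.add(token);
        }
        return out;
    }
}
```

Compared with a per-token flag, this trades one stored term per token for far
fewer lookups when periods are rare.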

Uwe



serera at gmail

Nov 22, 2009, 10:52 AM

Post #16 of 24 (2108 views)
Permalink
Re: How to deal with Token in the new TS API [In reply to]

Yes I can clone the term itself by instantiating a TermAttributeImpl, which
is better than storing the String, because the latter always allocates
char[], while the former will reuse the char[] if it's big enough.
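The buffer-reuse argument above can be illustrated with a small Lucene-free sketch. ReusableTerm, copyFrom and allocations are hypothetical names invented for this illustration; this is not Lucene's TermAttributeImpl, only a model of the idea that a long-lived holder allocates a new char[] only when an incoming term does not fit.

```java
/** Hypothetical reusable term holder; a stand-in for the idea, not Lucene's TermAttributeImpl. */
final class ReusableTerm {
    private char[] buffer = new char[8];
    private int length;
    private int allocations = 1; // counts the initial buffer

    /** Copies len chars from src into the internal buffer, growing it only if needed. */
    void copyFrom(char[] src, int offset, int len) {
        if (len > buffer.length) {
            // Grow at least geometrically so repeated long terms do not reallocate each time.
            buffer = new char[Math.max(len, buffer.length * 2)];
            allocations++;
        }
        System.arraycopy(src, offset, buffer, 0, len);
        length = len;
    }

    /** Number of char[] buffers allocated so far. */
    int allocations() {
        return allocations;
    }

    /** Materializes the stored term as a String (this does allocate). */
    String term() {
        return new String(buffer, 0, length);
    }
}
```

Under these assumptions, storing each term as a String would allocate once per token, while the holder above allocates only when a term is longer than anything seen before.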

What if State included a HashMap of all attributes, in addition to its
"linked-list" structure?

Anyway, you mention that I can iterate over all the attributes of a State,
but it's not clear to me how to do that, since I don't see any relevant
method in its API. Am I missing something?

Shai

On Sun, Nov 22, 2009 at 4:42 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

> > Because that'd mean I'll check for abbreviations for every token. Which
> is
> > a
> > big performance loss. That way, I can just check abbr if I encountered a
> > "."
> > (not even all end-of-sentence tokens).
>
> OK, than simply copy the term to a String and store it. The cost is the
> same
> like cloning/copying. If you find the ".", use the String and look it up.
>
> > Why can't State offer a "getAttribute" like AttributeSource?
>
> Because State is optimized for fast restore. In previous 2.9 versions State
> was itself an AttributeSource instance, but the capture/store was very,
> very
> slow.
>
> If you want to check an State, you would have need to iterate over all
> attributes and find the correct one, which is also slow. The best is to
> simply clone the term text as a string. You must create new objects in all
> cases, even with clone/copy.
>
> Uwe
>


uwe at thetaphi

Nov 22, 2009, 11:03 AM

Post #17 of 24 (2108 views)
Permalink
RE: How to deal with Token in the new TS API [In reply to]

I said you *could*, if it were exposed. But State is a holder class without
functionality. Because its internals are implementation dependent, maybe we
will add such a thing in the future. But: if State contained a real map, it
would be slow, because each captureState call would need to fill the map.
And: if you use Token as the AttributeImpl, the State will contain only one
entry. You cannot control which attribute is implemented by which impl, so
the map approach would never work correctly.

You can allocate a TermAttributeImpl and copyTo, but you should create the
instance using the same factory as the TokenStream uses:

TermAttribute copy = (TermAttribute)
    getAttributeFactory().createAttributeInstance(TermAttribute.class);

That guarantees that both are of the same implementation type.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi




uwe at thetaphi

Nov 22, 2009, 11:14 AM

Post #18 of 24 (2101 views)
Permalink
RE: How to deal with Token in the new TS API [In reply to]

Another idea: create an AttributeSource instance once in your TokenStream
using the AttributeSource.cloneAttributes() call. You can use this copy of
the attributes in parallel and, for example, update its TermAttribute as you
go. If you want to look at the last token, just look into the copied
AttributeSource. The addAttribute/getAttribute calls on this source can be
made after cloning.

Class initializer:
private final AttributeSource lastState = cloneAttributes();
private final TermAttribute lastTermAtt =
    lastState.addAttribute(TermAttribute.class);

incrementToken:

if (input.incrementToken()) {
  if (lastTermAtt.checkSomethingAsYouProposed) {
    blubber...
  }
  // Save the current term. copyTo is declared on AttributeImpl, not on the
  // TermAttribute interface, hence the casts.
  ((AttributeImpl) termAtt).copyTo((AttributeImpl) lastTermAtt);
  return true;
} else return false;
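
The lookback pattern described above can be modelled without Lucene. The sketch below is only an illustration under stated assumptions: AbbreviationLookback, filter and remember are hypothetical names, tokens are plain Strings rather than attributes, and the reused char[] plays the role that lastTermAtt plays in the outline above. The abbreviation set is consulted only when a "." arrives, which is the cost profile asked for earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Lucene-free model of a one-token lookback filter; all names here are hypothetical. */
final class AbbreviationLookback {
    private final Set<String> abbreviations;
    // Reused buffer holding a copy of the previous term (the role of lastTermAtt).
    private char[] lastTerm = new char[16];
    private int lastLength = 0;

    AbbreviationLookback(Set<String> abbreviations) {
        this.abbreviations = abbreviations;
    }

    /** Returns the tokens to keep; a "." that directly follows a known abbreviation is dropped. */
    List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            // The set lookup (and its String allocation) happens only for "." tokens.
            boolean drop = token.equals(".")
                    && abbreviations.contains(new String(lastTerm, 0, lastLength));
            if (!drop) {
                kept.add(token);
            }
            remember(token);
        }
        return kept;
    }

    /** Copies the token into the reused buffer, growing it only when the token does not fit. */
    private void remember(String token) {
        if (token.length() > lastTerm.length) {
            lastTerm = new char[token.length()];
        }
        token.getChars(0, token.length(), lastTerm, 0);
        lastLength = token.length();
    }
}
```

With the thread's example input "hello mr. shai" split into the tokens hello, mr, ".", shai and an abbreviation set containing "mr", the "." is dropped and the other three tokens pass through. In a real TokenFilter the same decision would be made inside incrementToken() against the copied attribute instead of a List.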



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi




serera at gmail

Nov 22, 2009, 11:14 AM

Post #19 of 24 (2106 views)
Permalink
Re: How to deal with Token in the new TS API [In reply to]

Did you mean something like:

TermAttributeImpl termBuf = (TermAttributeImpl)
input.getAttributeFactory().createAttributeInstance(TermAttribute.class);

I need to use the methods on TermAttributeImpl like clear() ...

Shai

On Sun, Nov 22, 2009 at 9:03 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

> I said, you *could* if it would be exposed. But the State is a holder class
> without functionality. Because the internals are impl dependent, maybe we
> will add such thing in future. But: If the state contains a real map, it
> would be slow, because each captureState call would need to fill the map,
> which is slow. And: If you use the Token as AttImpl, the state will only
> contain one entry. You cannot control which attribute is implemented by
> what impl, so the map approach would never work correct.
>
> You can allocate a TermAttributeImpl and copyTo, but you should create the
> instance using the same factory as the tokenstream uses:
>
> TermAttribute copy = (TermAttribute)
> getAttributeFactory().createAttributeInstance(TermAttribute.class);
>
> By that you guarantee, that both are from the same implementation type.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi


uwe at thetaphi

Nov 22, 2009, 11:21 AM

Post #20 of 24 (2098 views)
Permalink
RE: How to deal with Token in the new TS API [In reply to]

The cast to TermAttributeImpl may not work if the factory creates a Token,
so declare termBuf as TermAttribute (without Impl).

To call clear(), you can always downcast the interface to AttributeImpl, or
keep a second variable of that type. Alternatively, use my second approach
(the cloneAttributes() one).
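
For readers following along, here is a minimal sketch of why the concrete cast is fragile while the interface-plus-base-class casts are not. It deliberately does not use Lucene: TermAttr, AttrImpl, TermAttrImpl, TokenLike and CastDemo are hypothetical stand-ins for TermAttribute, AttributeImpl, TermAttributeImpl and Token, and a plain Supplier plays the role of the AttributeFactory.

```java
import java.util.function.Supplier;

// Stand-in for an attribute interface (hypothetical, not Lucene's TermAttribute).
interface TermAttr {
    String term();
    void setTerm(String term);
}

// Stand-in for AttributeImpl: the base class that carries clear() and copyTo().
abstract class AttrImpl {
    abstract void clear();
    abstract void copyTo(AttrImpl target);
}

// Dedicated implementation, analogous to TermAttributeImpl.
class TermAttrImpl extends AttrImpl implements TermAttr {
    private String term = "";
    public String term() { return term; }
    public void setTerm(String term) { this.term = term; }
    void clear() { term = ""; }
    void copyTo(AttrImpl target) { ((TermAttr) target).setTerm(term); }
}

// All-in-one implementation, analogous to Token backing the term attribute.
class TokenLike extends AttrImpl implements TermAttr {
    private String term = "";
    public String term() { return term; }
    public void setTerm(String term) { this.term = term; }
    void clear() { term = ""; }
    void copyTo(AttrImpl target) { ((TermAttr) target).setTerm(term); }
}

class CastDemo {
    // Safe: relies only on the interface and the common base class,
    // so it works whichever implementation the factory hands out.
    static TermAttr createAndClear(Supplier<? extends AttrImpl> factory) {
        TermAttr termBuf = (TermAttr) factory.get();
        termBuf.setTerm("mr");
        ((AttrImpl) termBuf).clear();
        return termBuf;
    }

    // Fragile: assumes the factory creates the dedicated implementation.
    // Returns false when the cast fails with a ClassCastException.
    static boolean concreteCastWorks(Supplier<? extends AttrImpl> factory) {
        try {
            TermAttrImpl impl = (TermAttrImpl) factory.get();
            return impl != null;
        } catch (ClassCastException e) {
            return false;
        }
    }
}
```

With a factory that creates TokenLike instances, createAndClear still works, while concreteCastWorks reports a failed cast. That is the situation described above when the factory creates a Token.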

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi


> -----Original Message-----
> From: Shai Erera [mailto:serera [at] gmail]
> Sent: Sunday, November 22, 2009 8:15 PM
> To: java-user [at] lucene
> Subject: Re: How to deal with Token in the new TS API
>
> Did you mean something like:
>
> TermAttributeImpl termBuf = (TermAttributeImpl)
> input.getAttributeFactory().createAttributeInstance(TermAttribute.class);
>
> I need to use the methods on TermAttributeImpl like clear() ...
>
> Shai




uwe at thetaphi

Nov 22, 2009, 11:22 AM

Post #21 of 24 (2096 views)
Permalink
RE: How to deal with Token in the new TS API [In reply to]

Sorry, small error in my last mail: copyTo() is declared on AttributeImpl,
not on the attribute interfaces, so both sides need a cast.

Class initializer:

  private final AttributeSource lastState = cloneAttributes();
  private final TermAttribute lastTermAtt =
      lastState.addAttribute(TermAttribute.class);

incrementToken:

  if (input.incrementToken()) {
    // lastTermAtt still holds the previous token here
    if (lastTermAtt.checkSomethingAsYouProposed) {
      blubber...
    }
    // save current state:
    ((AttributeImpl) termAtt).copyTo((AttributeImpl) lastTermAtt);
    return true;
  } else return false;
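
To make the control flow concrete for the abbreviation use case from earlier in the thread, here is a rough sketch of the same "remember the last term, check it only when a '.' arrives" pattern. It is an assumption-laden model rather than Lucene code: AbbreviationDotFilter is a hypothetical class that works on a plain list of strings, and a reused StringBuilder stands in for lastTermAtt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a token filter. It models the pattern with
// plain strings and is not built on Lucene's TokenStream classes.
class AbbreviationDotFilter {
    private final Set<String> abbreviations;
    // Reused buffer holding the previous term; plays the role of lastTermAtt.
    private final StringBuilder lastTerm = new StringBuilder();

    AbbreviationDotFilter(Set<String> abbreviations) {
        this.abbreviations = abbreviations;
    }

    List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        lastTerm.setLength(0); // start with no previous term
        for (String term : tokens) {
            // The abbreviation lookup runs only when the current token is ".".
            boolean drop = term.equals(".")
                    && abbreviations.contains(lastTerm.toString());
            if (!drop) {
                out.add(term);
            }
            // Save the current term for the next iteration (the copyTo step).
            lastTerm.setLength(0);
            lastTerm.append(term);
        }
        return out;
    }
}
```

For the "hello mr. shai" example, a filter built with "mr" in its abbreviation set passes through hello, mr and shai and drops the "." in between; a "." after a non-abbreviation is kept.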

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi


> -----Original Message-----
> From: Uwe Schindler [mailto:uwe [at] thetaphi]
> Sent: Sunday, November 22, 2009 8:14 PM
> To: java-user [at] lucene
> Subject: RE: How to deal with Token in the new TS API
>
> Another idea, what you can also do is, create an AttributeSource instance
> in your TokenStream one time using the AttributeSource.cloneAttributes()
> call. You can use this copy of the attributes in parallel and maybe update
> the TermAttribute there and so on. If you want to look at the last token,
> just look into the copied attributesource. The calls to
> addAttribute/getAttribute of this source can be done after cloning.
>
> Class Initializer:
> private final AttributeSource lastState = cloneAttributes();
> private final TermAttribute lastTermAtt =
> lastState.addAttribute(TermAttribute.class);
>
> incrementToken:
>
> if (input.incrementToken()) {
> if (lastTermAtt.checkSomethingAsYouProposed) {
> blubber...
> }
> termAtt.copyTo(lastTermAtt); // save current state
> return true;
> } else return false;
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
>
> >
> > > <uwe [at] thetaphi
> >
> > > > >
> >
> > > > > > > wrote:
> >
> > > > > > > >> >
> >
> > > > > > > >> > > > But I do use addAttribute(Token.class), so I don't
> >
> > > > understand
> >
> > > > > > why
> >
> > > > > > > >> you
> >
> > > > > > > >> > say
> >
> > > > > > > >> > > > it's not possible. And I completely don't understand
> > why
> >
> > > the
> >
> > > > > new
> >
> > > > > > > API
> >
> > > > > > > >> > > > allows
> >
> > > > > > > >> > > > me to just work w/ interfaces and not impls ... A
> while
> >
> > > ago
> >
> > > > I
> >
> > > > > > got
> >
> > > > > > > >> the
> >
> > > > > > > >> > > > impression that we're trying to get rid of interfaces
> >
> > > > because
> >
> > > > > > > >> they're
> >
> > > > > > > >> > not
> >
> > > > > > > >> > > > easy to maintain back-compat with ...
> >
> > > > > > > >> > >
> >
> > > > > > > >> > > AddAttribute(Token.class) should throw an Exception,
> but
> > it
> >
> > > > > > doesn't
> >
> > > > > > > >> > (it's a
> >
> > > > > > > >> > > bug in 3.0). addAttribute should only affect
> interfaces,
> > it
> >
> > > > > also
> >
> > > > > > > >> accepts
> >
> > > > > > > >> > > Token, because the AttributeFactory accepts it - bang.
> >
> > > > > > > >> > >
> >
> > > > > > > >> > > Sorry, but you can only pass attribute class literals
> to
> >
> > > > > > > >> > > addAttribute/getAttribute/hasAttribute and so on.
> >
> > > > > > > >> > >
> >
> > > > > > > >> > > Sorry.
> >
> > > > > > > >> > >
> >
> > > > > > > >> > > Uwe
> >
> > > > > > > >> > >
> >
> > > > > > > >> > >
> >
> > > > > > > >> > >
> >
> > > > > > ----------------------------------------------------------------
> --
> > -
> >
> > > > > > > --
> >
> > > > > > > >> > > To unsubscribe, e-mail:
> >
> > > > java-user-unsubscribe [at] lucene
> >
> > > > > > > >> > > For additional commands, e-mail: java-user-
> >
> > > > > help [at] lucene
> >
> > > > > > > >> > >
> >
> > > > > > > >> > >
> >
> > > > > > > >>
> >
> > > > > > > >>
> >
> > > > > > > >>
> >
> > > > -------------------------------------------------------------------
> >
> > > > > --
> >
> > > > > > > >> To unsubscribe, e-mail: java-user-
> > unsubscribe [at] lucene
> >
> > > > > > > >> For additional commands, e-mail: java-user-
> >
> > > help [at] lucene
> >
> > > > > > > >>
> >
> > > > > > > >>
> >
> > > > > > > >
> >
> > > > > >
> >
> > > > > >
> >
> > > > > > ----------------------------------------------------------------
> --
> > --
> >
> > > -
> >
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
> >
> > > > > > For additional commands, e-mail: java-user-
> help [at] lucene
> >
> > > > > >
> >
> > > > > >
> >
> > > >
> >
> > > >
> >
> > > > --------------------------------------------------------------------
> -
> >
> > > > To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
> >
> > > > For additional commands, e-mail: java-user-help [at] lucene
> >
> > > >
> >
> > > >
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
> For additional commands, e-mail: java-user-help [at] lucene



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


serera at gmail

Nov 22, 2009, 11:23 AM

Post #22 of 24 (2095 views)
Re: How to deal with Token in the new TS API [In reply to]

I assume termAtt is the input's TermAttribute, right? Therefore it has no
copyTo ...

What I've done so far is create a TermAttribute like you proposed (fixed
from my previous TermAttributeImpl):

TermAttribute clone = (TermAttribute)
input.getAttributeFactory().createAttributeInstance(TermAttribute.class);

and to clear it I just call clone.setTermLength(0);

This at least works for me.
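
For readers following along, the idea above (keep one reusable copy of the previous term, and only consult the abbreviation list when a "." arrives) can be sketched without any Lucene classes. This is a minimal illustration, not the filter discussed in the thread: `AbbreviationSketch`, `TermBuffer` and `filter` are hypothetical names, and plain strings stand in for the TokenStream and its TermAttribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Lucene-free sketch: remembers the previous term in a reusable char[] buffer. */
class AbbreviationSketch {

    /** Hypothetical stand-in for a cached term attribute; it grows only when needed. */
    static final class TermBuffer {
        private char[] chars = new char[16];
        private int length = 0;

        void copyFrom(String term) {
            if (term.length() > chars.length) {
                chars = new char[term.length()]; // allocate only when the buffer is too small
            }
            term.getChars(0, term.length(), chars, 0);
            length = term.length();
        }

        void clear() {
            length = 0; // same idea as setTermLength(0): keep the array, forget the content
        }

        String term() {
            return new String(chars, 0, length);
        }
    }

    /** Drops a "." token when the token right before it is a known abbreviation. */
    static List<String> filter(List<String> tokens, Set<String> abbreviations) {
        TermBuffer last = new TermBuffer();
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            // The abbreviation lookup happens only when a period shows up.
            if (token.equals(".") && abbreviations.contains(last.term())) {
                last.clear();
                continue; // swallow the period after an abbreviation
            }
            last.copyFrom(token);
            out.add(token);
        }
        return out;
    }
}
```

With the tokens "hello", "mr", ".", "shai" and an abbreviation set containing "mr", the sketch returns "hello", "mr", "shai". In a real TokenFilter the loop body would live in incrementToken() and the buffer would be the TermAttribute instance created through the attribute factory.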

Shai

On Sun, Nov 22, 2009 at 9:14 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

> Another idea: you can also create an AttributeSource instance in your
> TokenStream one time using the AttributeSource.cloneAttributes() call.
> You can use this copy of the attributes in parallel and maybe update the
> TermAttribute there and so on. If you want to look at the last token, just
> look into the copied AttributeSource. The calls to addAttribute/getAttribute
> of this source can be done after cloning.
>
> Class Initializer:
>   private final AttributeSource lastState = cloneAttributes();
>   private final TermAttribute lastTermAtt =
>       lastState.addAttribute(TermAttribute.class);
>
> incrementToken:
>
>   if (input.incrementToken()) {
>     if (lastTermAtt.checkSomethingAsYouProposed) {
>       blubber...
>     }
>     termAtt.copyTo(lastTermAtt); // save current state
>     return true;
>   } else return false;


serera at gmail

Nov 22, 2009, 11:25 AM

Post #23 of 24 (2095 views)
Re: How to deal with Token in the new TS API [In reply to]

Ok I see you fixed it at the same time I sent the email :).

I think I get it ... so far.

So far I have had to cache just TermAttribute. I think it'll get messy when I
need to cache more, like Type and PositionIncrement. But I haven't reached
those yet. Perhaps instead of creating many types of clones, I'll create a
Token and populate it w/ what I need, just for convenience ...
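
As a rough picture of that "one holder instead of many clones" idea, here is a small Lucene-free sketch. `TokenSnapshot` and its methods are hypothetical names chosen for illustration, and the default values ("word", 1) are assumptions made for the example, not something the thread specifies.

```java
/** Lucene-free sketch: one reusable holder instead of one clone per attribute. */
class TokenSnapshot {
    private char[] termChars = new char[16];
    private int termLength = 0;
    private String type = "word";        // assumed default, for illustration only
    private int positionIncrement = 1;   // assumed default, for illustration only

    /** Overwrites the snapshot in place; the char[] is reused when it is large enough. */
    void capture(CharSequence term, String type, int positionIncrement) {
        if (term.length() > termChars.length) {
            termChars = new char[term.length()];
        }
        for (int i = 0; i < term.length(); i++) {
            termChars[i] = term.charAt(i);
        }
        this.termLength = term.length();
        this.type = type;
        this.positionIncrement = positionIncrement;
    }

    /** Resets to the assumed defaults without releasing the buffer. */
    void clear() {
        termLength = 0;
        type = "word";
        positionIncrement = 1;
    }

    String term() { return new String(termChars, 0, termLength); }
    String type() { return type; }
    int positionIncrement() { return positionIncrement; }
}
```

A filter would call capture(...) once per token with the values read from its attributes, then read the snapshot back when the next token arrives. Whether a populated Token or a holder like this is more convenient depends on how many attributes actually need caching.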

Thanks,
Shai

On Sun, Nov 22, 2009 at 9:23 PM, Shai Erera <serera [at] gmail> wrote:

> I assume termAtt is the input's TermAttribute, right? Therefore it has no
> copyTo ...
>
> What I've done so far is create a TermAttribute like you proposed (fixed
> from my previous TermAttributeImpl):
>
> TermAttribute clone = (TermAttribute)
> input.getAttributeFactory().createAttributeInstance(TermAttribute.class);
>
> and to clear() it I just clone.setTermLength(0);
>
> This at least works for me.
>
> Shai
>
>
> On Sun, Nov 22, 2009 at 9:14 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>
>> Another idea, what you can also do is, create an AttributeSource instance
>> in
>> your TokenStream one time using the AttributeSource.cloneAttributes()
>> call.
>> You can use this copy of the attributes in parallel and maybe update the
>> TermAttribute there and so on. If you want to look at the last token, just
>> look into the copied attributesource. The calls to
>> addAttribute/getAttribute
>> of this source can be done after cloning.
>>
>> Class Initializer:
>> private final AttributeSource lastState = cloneAttributes();
>> private final TermAttribute lastTermAtt =
>> lastState.addAttribute(TermAttribute.class);
>>
>> incrementToken:
>>
>> if (input.incrementToken()) {
>> if (lastTermAtt.checkSomethingAsYouProposed) {
>> blubber...
>> }
>> termAtt.copyTo(lastTermAtt); // save current state
>> return true;
>> } else return false;
>>
>>
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe [at] thetaphi
>>
>>
>> > -----Original Message-----
>> > From: Uwe Schindler [mailto:uwe [at] thetaphi]
>> > Sent: Sunday, November 22, 2009 8:03 PM
>> > To: java-user [at] lucene
>> > Subject: RE: How to deal with Token in the new TS API
>> >
>> > I said, you *could* if it would be exposed. But the State is a holder
>> > class
>> > without functionality. Because the internals are impl dependent, maybe
>> we
>> > will add such thing in future. But: If the state contains a real map, it
>> > would be slow, because each captureState call would need to fill the
>> map,
>> > which is slow. And: If you use the Token as AttImpl, the state will only
>> > contain one entry. You cannot control which attribute is implemented by
>> > what
>> > impl, so the map approach would never work correct.
>> >
>> >
>> >
>> > You can allocate a TermAttributeImpl and copyTo, but you should create
>> the
>> > instance using the same factory as the tokenstream uses:
>> >
>> >
>> >
>> > TermAttribute copy = (TermAttribute)
>> > getAttributeFactory().createAttributeInstance(TermAttribute.class);
>> >
>> >
>> >
>> > By that you guarantee, that both are from the same implementation type.
>> >
>> >
>> >
>> > -----
>> >
>> > Uwe Schindler
>> >
>> > H.-H.-Meier-Allee 63, D-28213 Bremen
>> >
>> > http://www.thetaphi.de
>> >
>> > eMail: uwe [at] thetaphi
>> >
>> >
>> >
>> > > -----Original Message-----
>> >
>> > > From: Shai Erera [mailto:serera [at] gmail]
>> >
>> > > Sent: Sunday, November 22, 2009 7:53 PM
>> >
>> > > To: java-user [at] lucene
>> >
>> > > Subject: Re: How to deal with Token in the new TS API
>> >
>> > >
>> >
>> > > Yes I can clone the term itself by instantiating a TermAttributeImpl,
>> >
>> > > which
>> >
>> > > is better than storing the String, because the latter always allocates
>> >
>> > > char[], while the former will reuse the char[] if it's big enough.
>> >
>> > >
>> >
>> > > What if State included a HashMap of all attributes, in addition to its
>> >
>> > > "linked-list" structure?
>> >
>> > >
>> >
>> > > Anyway, you mention that I can iterate on all Attributes of a State,
>> but
>> >
>> > > it's not clear to me how to do it, since I don't see any relevant
>> method
>> >
>> > > in
>> >
>> > > its API. Am I missing something?
>> >
>> > >
>> >
>> > > Shai
>> >
>> > >
>> >
>> > > On Sun, Nov 22, 2009 at 4:42 PM, Uwe Schindler <uwe [at] thetaphi>
>> wrote:
>> >
>> > >
>> >
>> > > > > Because that'd mean I'll check for abbreviations for every token.
>> >
>> > > Which
>> >
>> > > > is
>> >
>> > > > > a
>> >
>> > > > > big performance loss. That way, I can just check abbr if I
>> > encountered
>> >
>> > > a
>> >
>> > > > > "."
>> >
>> > > > > (not even all end-of-sentence tokens).
>> >
>> > > >
>> >
>> > > > OK, than simply copy the term to a String and store it. The cost is
>> > the
>> >
>> > > > same
>> >
>> > > > like cloning/copying. If you find the ".", use the String and look
>> it
>> >
>> > > up.
>> >
>> > > >
>> >
>> > > > > Why can't State offer a "getAttribute" like AttributeSource?
>> >
>> > > >
>> >
>> > > > Because State is optimized for fast restore. In previous 2.9
>> versions
>> >
>> > > State
>> >
>> > > > was itself an AttributeSource instance, but the capture/store was
>> > very,
>> >
>> > > > very
>> >
>> > > > slow.
>> >
>> > > >
>> >
>> > > > If you want to check an State, you would have need to iterate over
>> all
>> >
>> > > > attributes and find the correct one, which is also slow. The best is
>> > to
>> >
>> > > > simply clone the term text as a string. You must create new objects
>> in
>> >
>> > > all
>> >
>> > > > cases, even with clone/copy.
>> >
>> > > >
>> >
>> > > > Uwe
>> >
>> > > >
>> >
>> > > > > Shai
>> >
>> > > > >
>> >
>> > > > > On Sun, Nov 22, 2009 at 4:34 PM, Uwe Schindler <uwe [at] thetaphi>
>> >
>> > > wrote:
>> >
>> > > > >
>> >
>> > > > > > If you just want to lookup if "Mr" is an abbreviation, why not
>> > look
>> >
>> > > it
>> >
>> > > > > up
>> >
>> > > > > > when you handle that token and set a boolean variable in the TS
>> >
>> > > > > > (lastTokenWasAbbreviation). When you process the ".", remove it
>> if
>> >
>> > > the
>> >
>> > > > > > Boolean is set.
>> >
>> > > > > >
>> >
>> > > > > > Uwe
>> >
>> > > > > >
>> >
>> > > > > > -----
>> >
>> > > > > > Uwe Schindler
>> >
>> > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> >
>> > > > > > http://www.thetaphi.de
>> >
>> > > > > > eMail: uwe [at] thetaphi
>> >
>> > > > > >
>> >
>> > > > > >
>> >
>> > > > > > > -----Original Message-----
>> >
>> > > > > > > From: Shai Erera [mailto:serera [at] gmail]
>> >
>> > > > > > > Sent: Sunday, November 22, 2009 3:28 PM
>> >
>> > > > > > > To: java-user [at] lucene
>> >
>> > > > > > > Subject: Re: How to deal with Token in the new TS API
>> >
>> > > > > > >
>> >
>> > > > > > > What I've done is:
>> >
>> > > > > > >
>> >
>> > > > > > > State state = in.captureState();
>> >
>> > > > > > > ...
>> >
>> > > > > > > // Upon new call to incrementToken().
>> >
>> > > > > > > State tmp = in.captureState();
>> >
>> > > > > > > in.restoreState(state);
>> >
>> > > > > > > // check if termAttribute is an abbreviation.
>> >
>> > > > > > > If not : in.restoreState(tmp);
>> >
>> > > > > > >
>> >
>> > > > > > > But seems a lot of capturing/restoring to me ... how expensive
>> > is
>> >
>> > > > > that?
>> >
>> > > > > > >
>> >
>> > > > > > > Shai
>> >
>> > > > > > >
>> >
>> > > > > > > On Sun, Nov 22, 2009 at 3:57 PM, Shai Erera <serera [at] gmail
>> >
>> >
>> > > > wrote:
>> >
>> > > > > > >
>> >
>> > > > > > > > Perhaps I misunderstand something. The current use case I'm
>> >
>> > > trying
>> >
>> > > > > to
>> >
>> > > > > > > solve
>> >
>> > > > > > > > is - I have an abbreviations TokenFilter which reads a token
>> > and
>> >
>> > > > > stores
>> >
>> > > > > > > it.
>> >
>> > > > > > > > If the next token is end-of-sentence, it checks whether the
>> >
>> > > > previous
>> >
>> > > > > > one
>> >
>> > > > > > > is
>> >
>> > > > > > > > in the abbreviations list, and discards the end-of-sentence
>> >
>> > > token.
>> >
>> > > > I
>> >
>> > > > > > > need to
>> >
>> > > > > > > > store the first token somewhere so I can reference it.
>> >
>> > > > > > > >
>> >
>> > > > > > > > Example: "hello mr. shai"
>> >
>> > > > > > > > First token = hello -> store it and return
>> >
>> > > > > > > > Second token = mr -> store it and return
>> >
>> > > > > > > > Third token = "." -> check if "mr" is an abbreviation, if so
>> >
>> > > don't
>> >
>> > > > > > > return
>> >
>> > > > > > > > ".".
>> >
>> > > > > > > > Fourth token = "shai" -> store it and return.
>> >
>> > > > > > > > ...
>> >
>> > > > > > > >
>> >
>> > > > > > > > How do I store "mr" (or any of the others)? It was easy w/
>> >
>> > > copyTo.
>> >
>> > > > > If I
>> >
>> > > > > > > > captureState, I get a State, but I can't query it for a
>> >
>> > > > > TermAttribute.
>> >
>> > > > > > > Any
>> >
>> > > > > > > > ideas?
>> >
>> > > > > > > >
>> >
>> > > > > > > > Shai
>> >
>> > > > > > > >
>> >
>> > > > > > > >
>> >
>> > > > > > > > On Sun, Nov 22, 2009 at 3:33 PM, Uwe Schindler
>> > <uwe [at] thetaphi>
>> >
>> > > > > > wrote:
>> >
>> > > > > > > >
>> >
>> > > > > > > >> Use captureState and save the state somewhere. You can
>> > restore
>> >
>> > > the
>> >
>> > > > > > > state
>> >
>> > > > > > > >> with restoreState to the TokenStream. CachingTokenFilter
>> does
>> >
>> > > > this.
>> >
>> > > > > > > >>
>> >
>> > > > > > > >> So the new API uses the State object to put away tokens for
>> >
>> > > later
>> >
>> > > > > > > >> reference.
>> >
>> > > > > > > >>
>> >
>> > > > > > > >> -----
>> >
>> > > > > > > >> Uwe Schindler
>> >
>> > > > > > > >> H.-H.-Meier-Allee 63, D-28213 Bremen
>> >
>> > > > > > > >> http://www.thetaphi.de
>> >
>> > > > > > > >> eMail: uwe [at] thetaphi
>> >
>> > > > > > > >>
>> >
>> > > > > > > >> > -----Original Message-----
>> >
>> > > > > > > >> > From: Shai Erera [mailto:serera [at] gmail]
>> >
>> > > > > > > >> > Sent: Sunday, November 22, 2009 2:29 PM
>> >
>> > > > > > > >> > To: java-user [at] lucene
>> >
>> > > > > > > >> > Subject: Re: How to deal with Token in the new TS API
>> >
>> > > > > > > >> >
>> >
>> > > > > > > >> > ok so from what I understand, I should stop working w/
>> > Token,
>> >
>> > > > and
>> >
>> > > > > > > move
>> >
>> > > > > > > >> to
>> >
>> > > > > > > >> > working w/ the Attributes.
>> >
>> > > > > > > >> >
>> >
>> > > > > > > >> > addAttribute indeed does not work. Even though it does
>> not
>> >
>> > > > > through
>> >
>> > > > > > an
>> >
>> > > > > > > >> > exception, if I call in.addAttribute(Token.class), I get
>> a
>> >
>> > > new
>> >
>> > > > > > > instance
>> >
>> > > > > > > >> of
>> >
>> > > > > > > >> > Token and not the one that was added by in. So this is even more severe
>> > > > > > > >> > than just not blocking this option.
>> > > > > > > >> >
>> > > > > > > >> > I thought I could move to addAttributeImpl, but that won't help me,
>> > > > > > > >> > because I won't be able to call getAttribute(Token.class).
>> > > > > > > >> >
>> > > > > > > >> > So this leaves me w/ just working w/ the interfaces.
>> > > > > > > >> >
>> > > > > > > >> > What do I need to do in order to clone an attribute? Previously I used
>> > > > > > > >> > token.copyTo(target). How can I do it now if I don't have copyTo on the
>> > > > > > > >> > interfaces, and/or clone?
>> > > > > > > >> >
>> > > > > > > >> > Shai
>> > > > > > > >> >
>> > > > > > > >> > On Sun, Nov 22, 2009 at 2:58 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>> > > > > > > >> >
>> > > > > > > >> > > > But I do use addAttribute(Token.class), so I don't understand why you say
>> > > > > > > >> > > > it's not possible. And I completely don't understand why the new API allows
>> > > > > > > >> > > > me to just work w/ interfaces and not impls ... A while ago I got the
>> > > > > > > >> > > > impression that we're trying to get rid of interfaces because they're not
>> > > > > > > >> > > > easy to maintain back-compat with ...
>> > > > > > > >> > >
>> > > > > > > >> > > addAttribute(Token.class) should throw an Exception, but it doesn't (it's a
>> > > > > > > >> > > bug in 3.0). addAttribute should only accept interfaces, but it also accepts
>> > > > > > > >> > > Token, because the AttributeFactory accepts it - bang.
>> > > > > > > >> > >
>> > > > > > > >> > > Sorry, but you can only pass attribute class literals to
>> > > > > > > >> > > addAttribute/getAttribute/hasAttribute and so on.
>> > > > > > > >> > >
>> > > > > > > >> > > Sorry.
>> > > > > > > >> > >
>> > > > > > > >> > > Uwe
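
For readers following along, the "interfaces only" rule discussed above can be modelled in plain Java. This is only a sketch under stated assumptions: `InterfaceOnlySourceSketch`, its nested `Attribute`, `TermAttribute`, `TermAttributeImpl` and `ConcreteToken` types are hypothetical stand-ins, not Lucene classes, and the lookup simply appends "Impl" to the interface name as described in the first post of this thread. It rejects concrete classes up front and names the class it tried to load in the error message.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class InterfaceOnlySourceSketch {
    /** Hypothetical marker interface for all attributes in this sketch. */
    public interface Attribute {}

    public interface TermAttribute extends Attribute {
        String term();
        void setTerm(String term);
    }

    /** Found by the lookup below because its name is the interface name plus "Impl". */
    public static class TermAttributeImpl implements TermAttribute {
        private String term = "";
        public String term() { return term; }
        public void setTerm(String term) { this.term = term; }
    }

    /** A concrete class, playing the role Token plays in the thread. */
    public static final class ConcreteToken extends TermAttributeImpl {}

    private final Map<Class<? extends Attribute>, Attribute> attributes = new LinkedHashMap<>();

    /** Returns the single shared instance for an attribute interface, creating it on first use. */
    public <A extends Attribute> A addAttribute(Class<A> attClass) {
        if (!attClass.isInterface()) {
            throw new IllegalArgumentException(
                "addAttribute() only accepts attribute interfaces, got: " + attClass.getName());
        }
        Attribute existing = attributes.get(attClass);
        if (existing == null) {
            String implName = attClass.getName() + "Impl";
            try {
                existing = (Attribute) Class.forName(implName).getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                // Name the class that was actually tried, so the naming convention is visible.
                throw new IllegalArgumentException(
                    "Could not find implementing class for " + attClass.getName()
                        + " (tried " + implName + ")", e);
            }
            attributes.put(attClass, existing);
        }
        return attClass.cast(existing);
    }

    public static void main(String[] args) {
        InterfaceOnlySourceSketch source = new InterfaceOnlySourceSketch();
        TermAttribute first = source.addAttribute(TermAttribute.class);
        TermAttribute second = source.addAttribute(TermAttribute.class);
        System.out.println(first == second); // same shared instance
        try {
            source.addAttribute(ConcreteToken.class);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check like this, passing a concrete class fails immediately instead of silently handing back a fresh instance that nobody else in the chain shares.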


uwe at thetaphi

Nov 22, 2009, 11:30 AM

Post #24 of 24 (2099 views)
Permalink
RE: How to deal with Token in the new TS API [In reply to]

To call clear(), you can always downcast to AttributeImpl. But be aware that
this may also clear other attributes (for example, if the impl is a Token). So
setting termLength to 0 is the fastest approach if you only need the term
att.
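
The difference between the two ways of emptying a cached term can be shown without Lucene on the classpath. The following is a minimal model, assuming a hypothetical `CombinedImpl` class that, like Token, backs several attributes with one object. None of these names are Lucene's, and the real clear() may reset more or different fields.

```java
public class ClearVersusTruncate {
    /** Hypothetical stand-in for an impl that, like Token, backs several attributes at once. */
    public static final class CombinedImpl {
        private char[] termBuffer = new char[16];
        private int termLength = 0;
        private int positionIncrement = 1;

        public void setTerm(String text) {
            if (text.length() > termBuffer.length) {
                termBuffer = new char[text.length()];
            }
            text.getChars(0, text.length(), termBuffer, 0);
            termLength = text.length();
        }

        public String term() {
            return new String(termBuffer, 0, termLength);
        }

        /** Touches only the term; the buffer is kept for reuse. */
        public void setTermLength(int length) {
            if (length < 0 || length > termBuffer.length) {
                throw new IllegalArgumentException("bad length: " + length);
            }
            termLength = length;
        }

        public void setPositionIncrement(int increment) { positionIncrement = increment; }
        public int getPositionIncrement() { return positionIncrement; }

        /** Resets every attribute this object implements, not only the term. */
        public void clear() {
            termLength = 0;
            positionIncrement = 1;
        }
    }

    public static void main(String[] args) {
        CombinedImpl atts = new CombinedImpl();
        atts.setTerm("mr");
        atts.setPositionIncrement(3);

        atts.setTermLength(0); // only the term is emptied
        System.out.println(atts.term().isEmpty() + " " + atts.getPositionIncrement()); // true 3

        atts.setTerm("mr");
        atts.clear(); // the position increment is reset as well
        System.out.println(atts.term().isEmpty() + " " + atts.getPositionIncrement()); // true 1
    }
}
```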

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi


> -----Original Message-----
> From: Shai Erera [mailto:serera [at] gmail]
> Sent: Sunday, November 22, 2009 8:26 PM
> To: java-user [at] lucene
> Subject: Re: How to deal with Token in the new TS API
>
> Ok I see you fixed it at the same time I sent the email :).
>
> I think I get it ... so far.
>
> So far I had to cache just TermAttribute. I think it'll get messy when I'll
> need to cache more, like Type and PositionIncrement. But I haven't reached
> those yet. Perhaps instead of creating many types of clones, I'll create
> Token and populate it w/ what I need, just for convenience ...
>
> Thanks,
> Shai
>
> On Sun, Nov 22, 2009 at 9:23 PM, Shai Erera <serera [at] gmail> wrote:
>
> > I assume termAtt is the input's TermAttribute, right? Therefore it has no
> > copyTo ...
> >
> > What I've done so far is create a TermAttribute like you proposed (fixed
> > from my previous TermAttributeImpl):
> >
> > TermAttribute clone = (TermAttribute)
> >     input.getAttributeFactory().createAttributeInstance(TermAttribute.class);
> >
> > and to clear() it I just clone.setTermLength(0);
> >
> > This at least works for me.
> >
> > Shai
> >
> >
> > On Sun, Nov 22, 2009 at 9:14 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
> >
> >> Another idea: you can also create an AttributeSource instance in your
> >> TokenStream one time using the AttributeSource.cloneAttributes() call.
> >> You can use this copy of the attributes in parallel and maybe update the
> >> TermAttribute there and so on. If you want to look at the last token, just
> >> look into the copied AttributeSource. The calls to addAttribute/getAttribute
> >> of this source can be done after cloning.
> >>
> >> Class Initializer:
> >> private final AttributeSource lastState = cloneAttributes();
> >> private final TermAttribute lastTermAtt =
> >>     lastState.addAttribute(TermAttribute.class);
> >>
> >> incrementToken:
> >>
> >> if (input.incrementToken()) {
> >>   if (lastTermAtt.checkSomethingAsYouProposed) {
> >>     blubber...
> >>   }
> >>   termAtt.copyTo(lastTermAtt); // save current state
> >>   return true;
> >> } else return false;
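
The "remember the last token and look at it only when needed" pattern from the message above can be tried out in plain Java. This is a sketch under stated assumptions: tokens are modelled as strings from an iterator, and `LookBehindFilterSketch`, `TermCopy` and `filter` are hypothetical names, not Lucene API. The copy of the previous term reuses its char[] when it is big enough, and the abbreviation lookup runs only when a "." arrives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class LookBehindFilterSketch {
    /** Reusable copy of the previous term: the buffer grows only for a longer term. */
    static final class TermCopy {
        private char[] buffer = new char[16];
        private int length = 0;

        void copyFrom(String term) {
            if (term.length() > buffer.length) {
                buffer = new char[term.length()];
            }
            term.getChars(0, term.length(), buffer, 0);
            length = term.length();
        }

        String text() {
            return new String(buffer, 0, length);
        }
    }

    /** Drops a "." token when the token right before it is a known abbreviation. */
    static List<String> filter(Iterator<String> input, Set<String> abbreviations) {
        List<String> out = new ArrayList<>();
        TermCopy last = new TermCopy();
        while (input.hasNext()) {
            String term = input.next();
            // Short-circuit: the set lookup happens only for "." tokens.
            boolean drop = term.equals(".") && abbreviations.contains(last.text());
            if (!drop) {
                out.add(term);
            }
            last.copyFrom(term); // save the current term for the next round
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> abbreviations = new HashSet<>(Arrays.asList("mr", "dr"));
        List<String> tokens = Arrays.asList("hello", "mr", ".", "shai", ".");
        System.out.println(filter(tokens.iterator(), abbreviations)); // [hello, mr, shai, .]
    }
}
```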
> >>
> >> > -----Original Message-----
> >> > From: Uwe Schindler [mailto:uwe [at] thetaphi]
> >> > Sent: Sunday, November 22, 2009 8:03 PM
> >> > To: java-user [at] lucene
> >> > Subject: RE: How to deal with Token in the new TS API
> >> >
> >> > I said you *could* if it were exposed. But the State is a holder class
> >> > without functionality. Because the internals are impl dependent, maybe we
> >> > will add such a thing in the future. But: if the State contained a real map,
> >> > it would be slow, because each captureState call would need to fill the map.
> >> > And: if you use the Token as AttImpl, the state will only contain one entry.
> >> > You cannot control which attribute is implemented by what impl, so the map
> >> > approach would never work correctly.
> >> >
> >> > You can allocate a TermAttributeImpl and copyTo, but you should create the
> >> > instance using the same factory as the TokenStream uses:
> >> >
> >> > TermAttribute copy = (TermAttribute)
> >> >     getAttributeFactory().createAttributeInstance(TermAttribute.class);
> >> >
> >> > By that you guarantee that both are from the same implementation type.
> >> >
> >> > > -----Original Message-----
> >> > > From: Shai Erera [mailto:serera [at] gmail]
> >> > > Sent: Sunday, November 22, 2009 7:53 PM
> >> > > To: java-user [at] lucene
> >> > > Subject: Re: How to deal with Token in the new TS API
> >> > >
> >> > > Yes I can clone the term itself by instantiating a TermAttributeImpl, which
> >> > > is better than storing the String, because the latter always allocates
> >> > > char[], while the former will reuse the char[] if it's big enough.
> >> > >
> >> > > What if State included a HashMap of all attributes, in addition to its
> >> > > "linked-list" structure?
> >> > >
> >> > > Anyway, you mention that I can iterate on all Attributes of a State, but
> >> > > it's not clear to me how to do it, since I don't see any relevant method in
> >> > > its API. Am I missing something?
> >> > >
> >> > > Shai
> >> > >
> >> > > On Sun, Nov 22, 2009 at 4:42 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
> >> > >
> >> > > > > Because that'd mean I'll check for abbreviations for every token, which is
> >> > > > > a big performance loss. That way, I can just check abbr if I encountered a
> >> > > > > "." (not even all end-of-sentence tokens).
> >> > > >
> >> > > > OK, then simply copy the term to a String and store it. The cost is the same
> >> > > > as cloning/copying. If you find the ".", use the String and look it up.
> >> > > >
> >> > > > > Why can't State offer a "getAttribute" like AttributeSource?
> >> > > >
> >> > > > Because State is optimized for fast restore. In previous 2.9 versions State
> >> > > > was itself an AttributeSource instance, but the capture/store was very,
> >> > > > very slow.
> >> > > >
> >> > > > If you want to check a State, you would need to iterate over all
> >> > > > attributes and find the correct one, which is also slow. The best is to
> >> > > > simply clone the term text as a String. You must create new objects in all
> >> > > > cases, even with clone/copy.
> >> > > >
> >> > > > Uwe
> >> > > >
> >> > > > > Shai
> >> > > > >
> >> > > > > On Sun, Nov 22, 2009 at 4:34 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
> >> > > > >
> >> > > > > > If you just want to look up whether "Mr" is an abbreviation, why not look it
> >> > > > > > up when you handle that token and set a boolean variable in the TS
> >> > > > > > (lastTokenWasAbbreviation)? When you process the ".", remove it if the
> >> > > > > > boolean is set.
> >> > > > > >
> >> > > > > > Uwe
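
The boolean-flag variant suggested above needs no stored copy of the previous token at all. Below is a small sketch of it, assuming tokens are plain strings in a list. `AbbreviationFlagSketch` and `filter` are hypothetical names; only `lastTokenWasAbbreviation` comes from the message. The trade-off raised in the reply holds here too: the abbreviation set is consulted for every emitted token, not only when a "." shows up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AbbreviationFlagSketch {
    /** Drops a "." that directly follows an abbreviation, tracking only one boolean. */
    static List<String> filter(List<String> tokens, Set<String> abbreviations) {
        List<String> out = new ArrayList<>();
        boolean lastTokenWasAbbreviation = false;
        for (String term : tokens) {
            if (term.equals(".") && lastTokenWasAbbreviation) {
                lastTokenWasAbbreviation = false; // the "." is swallowed
                continue;
            }
            out.add(term);
            // Lookup happens for every emitted token, which is the cost of this approach.
            lastTokenWasAbbreviation = abbreviations.contains(term);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> abbreviations = new HashSet<>(Arrays.asList("mr"));
        System.out.println(filter(Arrays.asList("hello", "mr", ".", "shai"), abbreviations));
        // [hello, mr, shai]
    }
}
```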
> >> > > > > > > -----Original Message-----
> >> > > > > > > From: Shai Erera [mailto:serera [at] gmail]
> >> > > > > > > Sent: Sunday, November 22, 2009 3:28 PM
> >> > > > > > > To: java-user [at] lucene
> >> > > > > > > Subject: Re: How to deal with Token in the new TS API
> >> > > > > > >
> >> > > > > > > What I've done is:
> >> > > > > > >
> >> > > > > > > State state = in.captureState();
> >> > > > > > > ...
> >> > > > > > > // Upon new call to incrementToken().
> >> > > > > > > State tmp = in.captureState();
> >> > > > > > > in.restoreState(state);
> >> > > > > > > // check if termAttribute is an abbreviation.
> >> > > > > > > If not : in.restoreState(tmp);
> >> > > > > > >
> >> > > > > > > But seems a lot of capturing/restoring to me ... how expensive is that?
> >> > > > > > >
> >> > > > > > > Shai
> >> > > > > > >
> >> > > > > > > On Sun, Nov 22, 2009 at 3:57 PM, Shai Erera <serera [at] gmail> wrote:
> >> > > > > > >
> >> > > > > > > > Perhaps I misunderstand something. The current use case I'm trying to solve
> >> > > > > > > > is - I have an abbreviations TokenFilter which reads a token and stores it.
> >> > > > > > > > If the next token is end-of-sentence, it checks whether the previous one is
> >> > > > > > > > in the abbreviations list, and discards the end-of-sentence token. I need to
> >> > > > > > > > store the first token somewhere so I can reference it.
> >> > > > > > > >
> >> > > > > > > > Example: "hello mr. shai"
> >> > > > > > > > First token = hello -> store it and return
> >> > > > > > > > Second token = mr -> store it and return
> >> > > > > > > > Third token = "." -> check if "mr" is an abbreviation, if so don't return ".".
> >> > > > > > > > Fourth token = "shai" -> store it and return.
> >> > > > > > > > ...
> >> > > > > > > >
> >> > > > > > > > How do I store "mr" (or any of the others)? It was easy w/ copyTo. If I
> >> > > > > > > > captureState, I get a State, but I can't query it for a TermAttribute. Any
> >> > > > > > > > ideas?
> >> > > > > > > >
> >> > > > > > > > Shai
> >> > > > > > > >
> >> > > > > > > > On Sun, Nov 22, 2009 at 3:33 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
> >> > > > > > > >
> >> > > > > > > >> Use captureState and save the state somewhere. You can restore the state
> >> > > > > > > >> with restoreState to the TokenStream. CachingTokenFilter does this.
> >> > > > > > > >>
> >> > > > > > > >> So the new API uses the State object to put away tokens for later
> >> > > > > > > >> reference.
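
The capture/restore idea above can be pictured as an opaque snapshot that is handed back to the stream later. The sketch below is a simplified model only: `CaptureRestoreSketch`, `Attributes` and its nested `State` are hypothetical stand-ins, and the real State internals differ. It shows why a captured state can be restored but not queried directly, since the snapshot exposes no getters of its own.

```java
public class CaptureRestoreSketch {
    /** Hypothetical stand-in for the mutable attributes of a token stream. */
    static final class Attributes {
        String term = "";
        int positionIncrement = 1;

        /** Opaque snapshot: it can be restored, but callers cannot read from it. */
        static final class State {
            private final String term;
            private final int positionIncrement;

            private State(String term, int positionIncrement) {
                this.term = term;
                this.positionIncrement = positionIncrement;
            }
        }

        State captureState() {
            return new State(term, positionIncrement);
        }

        void restoreState(State state) {
            term = state.term;
            positionIncrement = state.positionIncrement;
        }
    }

    public static void main(String[] args) {
        Attributes atts = new Attributes();
        atts.term = "mr";
        Attributes.State previous = atts.captureState();

        atts.term = ".";
        Attributes.State current = atts.captureState();

        atts.restoreState(previous); // look back at the earlier token
        System.out.println(atts.term); // mr

        atts.restoreState(current); // return to the current token
        System.out.println(atts.term); // .
    }
}
```

Looking back at one earlier token this way costs two captures and two restores per check, which is the overhead questioned later in the thread.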
> >> > > > > > > >> > -----Original Message-----
> >> > > > > > > >> > From: Shai Erera [mailto:serera [at] gmail]
> >> > > > > > > >> > Sent: Sunday, November 22, 2009 2:29 PM
> >> > > > > > > >> > To: java-user [at] lucene
> >> > > > > > > >> > Subject: Re: How to deal with Token in the new TS API
> >> > > > > > > >> >
> >> > > > > > > >> > ok so from what I understand, I should stop working w/ Token, and move to
> >> > > > > > > >> > working w/ the Attributes.
> >> > > > > > > >> >
> >> > > > > > > >> > addAttribute indeed does not work. Even though it does not throw an
> >> > > > > > > >> > exception, if I call in.addAttribute(Token.class), I get a new instance of
> >> > > > > > > >> > Token and not the one that was added by in. So this is even more severe
> >> > > > > > > >> > than just not blocking this option.
> >> > > > > > > >> >
> >> > > > > > > >> > I thought I could move to addAttributeImpl, but that won't help me,
> >> > > > > > > >> > because I won't be able to call getAttribute(Token.class).
> >> > > > > > > >> >
> >> > > > > > > >> > So this leaves me w/ just working w/ the interfaces.
> >> > > > > > > >> >
> >> > > > > > > >> > What do I need to do in order to clone an attribute? Previously I used
> >> > > > > > > >> > token.copyTo(target). How can I do it now if I don't have copyTo on the
> >> > > > > > > >> > interfaces, and/or clone?
> >> > > > > > > >> >
> >> > > > > > > >> > Shai
> >> > > > > > > >> >
> >> > > > > > > >> > On Sun, Nov 22, 2009 at 2:58 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
> >> > > > > > > >> >
> >> > > > > > > >> > > > But I do use addAttribute(Token.class), so I don't understand why you say
> >> > > > > > > >> > > > it's not possible. And I completely don't understand why the new API allows
> >> > > > > > > >> > > > me to just work w/ interfaces and not impls ... A while ago I got the
> >> > > > > > > >> > > > impression that we're trying to get rid of interfaces because they're not
> >> > > > > > > >> > > > easy to maintain back-compat with ...
> >> > > > > > > >> > >
> >> > > > > > > >> > > addAttribute(Token.class) should throw an Exception, but it doesn't (it's a
> >> > > > > > > >> > > bug in 3.0). addAttribute should only accept interfaces, but it also accepts
> >> > > > > > > >> > > Token, because the AttributeFactory accepts it - bang.
> >> > > > > > > >> > >
> >> > > > > > > >> > > Sorry, but you can only pass attribute class literals to
> >> > > > > > > >> > > addAttribute/getAttribute/hasAttribute and so on.
> >> > > > > > > >> > >
> >> > > > > > > >> > > Sorry.
> >> > > > > > > >> > >
> >> > > > > > > >> > > Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene
