Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev

Bug in StandardAnalyzer + StopAnalyzer?

 

 

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded


eransevi at gmail

Nov 15, 2009, 8:19 AM

Post #1 of 14 (527 views)
Permalink
Bug in StandardAnalyzer + StopAnalyzer?

Hi,
when changing my code to support the not-so-new reusableTokenStream I
noticed that in the cases when a SavedStream class was used in an analyzer
(Standard,Stop and maybe others as well) the reset() method is called on the
tokenizer instead of on the filter.

The filter implementation of reset() calls the inner filters+input reset()
methods, but the tokenizer reset() method can't do that.
I think this bug hasn't caused any errors so far since none of the filters
used in the analyzers overrides the reset() method, but it might cause
problems if the implementation changes in the future.

the fix is very simple. for example (in StandardAnalyzer):

if (streams == null) {
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new StandardTokenizer(matchVersion, reader);
streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
streams.filteredTokenStream = new
LowerCaseFilter(streams.filteredTokenStream);
streams.filteredTokenStream = new
StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),

streams.filteredTokenStream, stopSet);
} else {
streams.tokenStream.reset(reader);
}

should become:

if (streams == null) {
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new StandardTokenizer(matchVersion, reader);
streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
streams.filteredTokenStream = new
LowerCaseFilter(streams.filteredTokenStream);
streams.filteredTokenStream = new
StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),

streams.filteredTokenStream, stopSet);
} else {
streams.filteredTokenStream.reset(); // changed line.
}


What do you think?

Eran.


uwe at thetaphi

Nov 15, 2009, 8:30 AM

Post #2 of 14 (502 views)
Permalink
RE: Bug in StandardAnalyzer + StopAnalyzer? [In reply to]

It must call both reset on the top-level TokenStream and reset(Reader) on
the Tokenizer-. If the latter is not done, how should the TokenStream get
his new Reader?



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Eran Sevi [mailto:eransevi [at] gmail]
Sent: Sunday, November 15, 2009 5:19 PM
To: java-dev [at] lucene
Subject: Bug in StandardAnalyzer + StopAnalyzer?



Hi,
when changing my code to support the not-so-new reusableTokenStream I
noticed that in the cases when a SavedStream class was used in an analyzer
(Standard,Stop and maybe others as well) the reset() method is called on the
tokenizer instead of on the filter.

The filter implementation of reset() calls the inner filters+input reset()
methods, but the tokenizer reset() method can't do that.
I think this bug hasn't caused any errors so far since none of the filters
used in the analyzers overrides the reset() method, but it might cause
problems if the implementation changes in the future.

the fix is very simple. for example (in StandardAnalyzer):

if (streams == null) {
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new StandardTokenizer(matchVersion, reader);
streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
streams.filteredTokenStream = new
LowerCaseFilter(streams.filteredTokenStream);
streams.filteredTokenStream = new
StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion
),

streams.filteredTokenStream, stopSet);
} else {
streams.tokenStream.reset(reader);
}

should become:

if (streams == null) {
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new StandardTokenizer(matchVersion, reader);
streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
streams.filteredTokenStream = new
LowerCaseFilter(streams.filteredTokenStream);
streams.filteredTokenStream = new
StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion
),

streams.filteredTokenStream, stopSet);
} else {
streams.filteredTokenStream.reset(); // changed line.
}


What do you think?

Eran.


eransevi at gmail

Nov 15, 2009, 8:50 AM

Post #3 of 14 (494 views)
Permalink
Re: Bug in StandardAnalyzer + StopAnalyzer? [In reply to]

Good point. I missed that part :) since only the tokenizer uses the reader,
we must call it directly.

So the reset() on the filteredTokenStream was omitted on purpose because
there's not underlying implementation? or is it really missing?

On Sun, Nov 15, 2009 at 6:30 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

> It must call both reset on the top-level TokenStream and reset(Reader) on
> the Tokenizer-. If the latter is not done, how should the TokenStream get
> his new Reader?
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Eran Sevi [mailto:eransevi [at] gmail]
> *Sent:* Sunday, November 15, 2009 5:19 PM
> *To:* java-dev [at] lucene
> *Subject:* Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> Hi,
> when changing my code to support the not-so-new reusableTokenStream I
> noticed that in the cases when a SavedStream class was used in an analyzer
> (Standard,Stop and maybe others as well) the reset() method is called on the
> tokenizer instead of on the filter.
>
> The filter implementation of reset() calls the inner filters+input reset()
> methods, but the tokenizer reset() method can't do that.
> I think this bug hasn't caused any errors so far since none of the filters
> used in the analyzers overrides the reset() method, but it might cause
> problems if the implementation changes in the future.
>
> the fix is very simple. for example (in StandardAnalyzer):
>
> if (streams == null) {
> streams = new SavedStreams();
> setPreviousTokenStream(streams);
> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
> streams.filteredTokenStream = new
> StandardFilter(streams.tokenStream);
> streams.filteredTokenStream = new
> LowerCaseFilter(streams.filteredTokenStream);
> streams.filteredTokenStream = new
> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>
> streams.filteredTokenStream, stopSet);
> } else {
> streams.tokenStream.reset(reader);
> }
>
> should become:
>
> if (streams == null) {
> streams = new SavedStreams();
> setPreviousTokenStream(streams);
> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
> streams.filteredTokenStream = new
> StandardFilter(streams.tokenStream);
> streams.filteredTokenStream = new
> LowerCaseFilter(streams.filteredTokenStream);
> streams.filteredTokenStream = new
> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>
> streams.filteredTokenStream, stopSet);
> } else {
> streams.filteredTokenStream.reset(); // changed line.
> }
>
>
> What do you think?
>
> Eran.
>


uwe at thetaphi

Nov 15, 2009, 8:55 AM

Post #4 of 14 (494 views)
Permalink
RE: Bug in StandardAnalyzer + StopAnalyzer? [In reply to]

It should be there... But ist unimplemented in the TokenFilters used by
Standard/Stop Analyzer. Buf for consistency it should be there. I'll talk
with Robert Muir about it.



Uwe



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
<http://www.thetaphi.de> http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Eran Sevi [mailto:eransevi [at] gmail]
Sent: Sunday, November 15, 2009 5:51 PM
To: java-dev [at] lucene
Subject: Re: Bug in StandardAnalyzer + StopAnalyzer?



Good point. I missed that part :) since only the tokenizer uses the reader,
we must call it directly.

So the reset() on the filteredTokenStream was omitted on purpose because
there's not underlying implementation? or is it really missing?

On Sun, Nov 15, 2009 at 6:30 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

It must call both reset on the top-level TokenStream and reset(Reader) on
the Tokenizer-. If the latter is not done, how should the TokenStream get
his new Reader?



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Eran Sevi [mailto:eransevi [at] gmail]
Sent: Sunday, November 15, 2009 5:19 PM
To: java-dev [at] lucene
Subject: Bug in StandardAnalyzer + StopAnalyzer?



Hi,
when changing my code to support the not-so-new reusableTokenStream I
noticed that in the cases when a SavedStream class was used in an analyzer
(Standard,Stop and maybe others as well) the reset() method is called on the
tokenizer instead of on the filter.

The filter implementation of reset() calls the inner filters+input reset()
methods, but the tokenizer reset() method can't do that.
I think this bug hasn't caused any errors so far since none of the filters
used in the analyzers overrides the reset() method, but it might cause
problems if the implementation changes in the future.

the fix is very simple. for example (in StandardAnalyzer):

if (streams == null) {
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new StandardTokenizer(matchVersion, reader);
streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
streams.filteredTokenStream = new
LowerCaseFilter(streams.filteredTokenStream);
streams.filteredTokenStream = new
StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion
),

streams.filteredTokenStream, stopSet);
} else {
streams.tokenStream.reset(reader);
}

should become:

if (streams == null) {
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new StandardTokenizer(matchVersion, reader);
streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
streams.filteredTokenStream = new
LowerCaseFilter(streams.filteredTokenStream);
streams.filteredTokenStream = new
StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion
),

streams.filteredTokenStream, stopSet);
} else {
streams.filteredTokenStream.reset(); // changed line.
}


What do you think?

Eran.


uwe at thetaphi

Nov 15, 2009, 9:02 AM

Post #5 of 14 (507 views)
Permalink
RE: Bug in StandardAnalyzer + StopAnalyzer? [In reply to]

I checked again, reset() on the top filter does not need to be there, as the
indexer calls it automatically as the first call after reusableTokenStream.
For reusing only reset(Reader) must be called. It's a little bit strange
that both methods have the same name, the reset(Reader) one has a completely
different meaning.



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Uwe Schindler [mailto:uwe [at] thetaphi]
Sent: Sunday, November 15, 2009 5:56 PM
To: java-dev [at] lucene
Subject: RE: Bug in StandardAnalyzer + StopAnalyzer?



It should be there... But ist unimplemented in the TokenFilters used by
Standard/Stop Analyzer. Buf for consistency it should be there. I'll talk
with Robert Muir about it.



Uwe



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
<http://www.thetaphi.de> http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Eran Sevi [mailto:eransevi [at] gmail]
Sent: Sunday, November 15, 2009 5:51 PM
To: java-dev [at] lucene
Subject: Re: Bug in StandardAnalyzer + StopAnalyzer?



Good point. I missed that part :) since only the tokenizer uses the reader,
we must call it directly.

So the reset() on the filteredTokenStream was omitted on purpose because
there's not underlying implementation? or is it really missing?

On Sun, Nov 15, 2009 at 6:30 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

It must call both reset on the top-level TokenStream and reset(Reader) on
the Tokenizer-. If the latter is not done, how should the TokenStream get
his new Reader?



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Eran Sevi [mailto:eransevi [at] gmail]
Sent: Sunday, November 15, 2009 5:19 PM
To: java-dev [at] lucene
Subject: Bug in StandardAnalyzer + StopAnalyzer?



Hi,
when changing my code to support the not-so-new reusableTokenStream I
noticed that in the cases when a SavedStream class was used in an analyzer
(Standard,Stop and maybe others as well) the reset() method is called on the
tokenizer instead of on the filter.

The filter implementation of reset() calls the inner filters+input reset()
methods, but the tokenizer reset() method can't do that.
I think this bug hasn't caused any errors so far since none of the filters
used in the analyzers overrides the reset() method, but it might cause
problems if the implementation changes in the future.

the fix is very simple. for example (in StandardAnalyzer):

if (streams == null) {
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new StandardTokenizer(matchVersion, reader);
streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
streams.filteredTokenStream = new
LowerCaseFilter(streams.filteredTokenStream);
streams.filteredTokenStream = new
StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion
),

streams.filteredTokenStream, stopSet);
} else {
streams.tokenStream.reset(reader);
}

should become:

if (streams == null) {
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new StandardTokenizer(matchVersion, reader);
streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
streams.filteredTokenStream = new
LowerCaseFilter(streams.filteredTokenStream);
streams.filteredTokenStream = new
StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion
),

streams.filteredTokenStream, stopSet);
} else {
streams.filteredTokenStream.reset(); // changed line.
}


What do you think?

Eran.


eransevi at gmail

Nov 15, 2009, 9:10 AM

Post #6 of 14 (495 views)
Permalink
Re: Bug in StandardAnalyzer + StopAnalyzer? [In reply to]

OK.
Thanks for the clarification.
it's a bit confusing - maybe the comments should be updated so other people
won't do the same mistake as I did.

On Sun, Nov 15, 2009 at 7:02 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

> I checked again, reset() on the top filter does not need to be there, as
> the indexer calls it automatically as the first call after
> reusableTokenStream. For reusing only reset(Reader) must be called. It’s a
> little bit strange that both methods have the same name, the reset(Reader)
> one has a completely different meaning.
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Uwe Schindler [mailto:uwe [at] thetaphi]
> *Sent:* Sunday, November 15, 2009 5:56 PM
>
> *To:* java-dev [at] lucene
> *Subject:* RE: Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> It should be there... But ist unimplemented in the TokenFilters used by
> Standard/Stop Analyzer. Buf for consistency it should be there. I’ll talk
> with Robert Muir about it.
>
>
>
> Uwe
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Eran Sevi [mailto:eransevi [at] gmail]
> *Sent:* Sunday, November 15, 2009 5:51 PM
> *To:* java-dev [at] lucene
> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> Good point. I missed that part :) since only the tokenizer uses the reader,
> we must call it directly.
>
> So the reset() on the filteredTokenStream was omitted on purpose because
> there's not underlying implementation? or is it really missing?
>
> On Sun, Nov 15, 2009 at 6:30 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>
> It must call both reset on the top-level TokenStream and reset(Reader) on
> the Tokenizer-. If the latter is not done, how should the TokenStream get
> his new Reader?
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Eran Sevi [mailto:eransevi [at] gmail]
> *Sent:* Sunday, November 15, 2009 5:19 PM
> *To:* java-dev [at] lucene
> *Subject:* Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> Hi,
> when changing my code to support the not-so-new reusableTokenStream I
> noticed that in the cases when a SavedStream class was used in an analyzer
> (Standard,Stop and maybe others as well) the reset() method is called on the
> tokenizer instead of on the filter.
>
> The filter implementation of reset() calls the inner filters+input reset()
> methods, but the tokenizer reset() method can't do that.
> I think this bug hasn't caused any errors so far since none of the filters
> used in the analyzers overrides the reset() method, but it might cause
> problems if the implementation changes in the future.
>
> the fix is very simple. for example (in StandardAnalyzer):
>
> if (streams == null) {
> streams = new SavedStreams();
> setPreviousTokenStream(streams);
> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
> streams.filteredTokenStream = new
> StandardFilter(streams.tokenStream);
> streams.filteredTokenStream = new
> LowerCaseFilter(streams.filteredTokenStream);
> streams.filteredTokenStream = new
> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>
> streams.filteredTokenStream, stopSet);
> } else {
> streams.tokenStream.reset(reader);
> }
>
> should become:
>
> if (streams == null) {
> streams = new SavedStreams();
> setPreviousTokenStream(streams);
> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
> streams.filteredTokenStream = new
> StandardFilter(streams.tokenStream);
> streams.filteredTokenStream = new
> LowerCaseFilter(streams.filteredTokenStream);
> streams.filteredTokenStream = new
> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>
> streams.filteredTokenStream, stopSet);
> } else {
> streams.filteredTokenStream.reset(); // changed line.
> }
>
>
> What do you think?
>
> Eran.
>
>
>


rcmuir at gmail

Nov 15, 2009, 9:11 AM

Post #7 of 14 (495 views)
Permalink
Re: Bug in StandardAnalyzer + StopAnalyzer? [In reply to]

right, this is a consistency issue.
when reusing token streams, we should call reset(Reader) on the tokenizer,
and also call reset() on the chain, and it should be passed down entire
chain if all filters call super.reset()

the reason you don't see it happening in StandardAnalyzer, is because none
of the filters keep any state that needs to be reset.

on the other hand, look at ThaiAnalyzer in contrib, it has both resets
because its ThaiWordFilter keeps state.

maybe for consistency, it would be best to do both resets all the time, to
set a good example.

On Sun, Nov 15, 2009 at 11:55 AM, Uwe Schindler <uwe [at] thetaphi> wrote:

> It should be there... But ist unimplemented in the TokenFilters used by
> Standard/Stop Analyzer. Buf for consistency it should be there. I’ll talk
> with Robert Muir about it.
>
>
>
> Uwe
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Eran Sevi [mailto:eransevi [at] gmail]
> *Sent:* Sunday, November 15, 2009 5:51 PM
>
> *To:* java-dev [at] lucene
> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> Good point. I missed that part :) since only the tokenizer uses the reader,
> we must call it directly.
>
> So the reset() on the filteredTokenStream was omitted on purpose because
> there's not underlying implementation? or is it really missing?
>
> On Sun, Nov 15, 2009 at 6:30 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>
> It must call both reset on the top-level TokenStream and reset(Reader) on
> the Tokenizer-. If the latter is not done, how should the TokenStream get
> his new Reader?
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Eran Sevi [mailto:eransevi [at] gmail]
> *Sent:* Sunday, November 15, 2009 5:19 PM
> *To:* java-dev [at] lucene
> *Subject:* Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> Hi,
> when changing my code to support the not-so-new reusableTokenStream I
> noticed that in the cases when a SavedStream class was used in an analyzer
> (Standard,Stop and maybe others as well) the reset() method is called on the
> tokenizer instead of on the filter.
>
> The filter implementation of reset() calls the inner filters+input reset()
> methods, but the tokenizer reset() method can't do that.
> I think this bug hasn't caused any errors so far since none of the filters
> used in the analyzers overrides the reset() method, but it might cause
> problems if the implementation changes in the future.
>
> the fix is very simple. for example (in StandardAnalyzer):
>
> if (streams == null) {
> streams = new SavedStreams();
> setPreviousTokenStream(streams);
> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
> streams.filteredTokenStream = new
> StandardFilter(streams.tokenStream);
> streams.filteredTokenStream = new
> LowerCaseFilter(streams.filteredTokenStream);
> streams.filteredTokenStream = new
> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>
> streams.filteredTokenStream, stopSet);
> } else {
> streams.tokenStream.reset(reader);
> }
>
> should become:
>
> if (streams == null) {
> streams = new SavedStreams();
> setPreviousTokenStream(streams);
> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
> streams.filteredTokenStream = new
> StandardFilter(streams.tokenStream);
> streams.filteredTokenStream = new
> LowerCaseFilter(streams.filteredTokenStream);
> streams.filteredTokenStream = new
> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>
> streams.filteredTokenStream, stopSet);
> } else {
> streams.filteredTokenStream.reset(); // changed line.
> }
>
>
> What do you think?
>
> Eran.
>
>
>



--
Robert Muir
rcmuir [at] gmail


rcmuir at gmail

Nov 15, 2009, 9:13 AM

Post #8 of 14 (492 views)
Permalink
Re: Bug in StandardAnalyzer + StopAnalyzer? [In reply to]

Uwe, not so sure it doesn't need to be there, what about other consumers
such as QueryParser?

On Sun, Nov 15, 2009 at 12:02 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

> I checked again, reset() on the top filter does not need to be there, as
> the indexer calls it automatically as the first call after
> reusableTokenStream. For reusing only reset(Reader) must be called. It’s a
> little bit strange that both methods have the same name, the reset(Reader)
> one has a completely different meaning.
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Uwe Schindler [mailto:uwe [at] thetaphi]
> *Sent:* Sunday, November 15, 2009 5:56 PM
>
> *To:* java-dev [at] lucene
> *Subject:* RE: Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> It should be there... But ist unimplemented in the TokenFilters used by
> Standard/Stop Analyzer. Buf for consistency it should be there. I’ll talk
> with Robert Muir about it.
>
>
>
> Uwe
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Eran Sevi [mailto:eransevi [at] gmail]
> *Sent:* Sunday, November 15, 2009 5:51 PM
> *To:* java-dev [at] lucene
> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> Good point. I missed that part :) since only the tokenizer uses the reader,
> we must call it directly.
>
> So the reset() on the filteredTokenStream was omitted on purpose because
> there's not underlying implementation? or is it really missing?
>
> On Sun, Nov 15, 2009 at 6:30 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>
> It must call both reset on the top-level TokenStream and reset(Reader) on
> the Tokenizer-. If the latter is not done, how should the TokenStream get
> his new Reader?
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Eran Sevi [mailto:eransevi [at] gmail]
> *Sent:* Sunday, November 15, 2009 5:19 PM
> *To:* java-dev [at] lucene
> *Subject:* Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> Hi,
> when changing my code to support the not-so-new reusableTokenStream I
> noticed that in the cases when a SavedStream class was used in an analyzer
> (Standard,Stop and maybe others as well) the reset() method is called on the
> tokenizer instead of on the filter.
>
> The filter implementation of reset() calls the inner filters+input reset()
> methods, but the tokenizer reset() method can't do that.
> I think this bug hasn't caused any errors so far since none of the filters
> used in the analyzers overrides the reset() method, but it might cause
> problems if the implementation changes in the future.
>
> the fix is very simple. for example (in StandardAnalyzer):
>
> if (streams == null) {
> streams = new SavedStreams();
> setPreviousTokenStream(streams);
> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
> streams.filteredTokenStream = new
> StandardFilter(streams.tokenStream);
> streams.filteredTokenStream = new
> LowerCaseFilter(streams.filteredTokenStream);
> streams.filteredTokenStream = new
> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>
> streams.filteredTokenStream, stopSet);
> } else {
> streams.tokenStream.reset(reader);
> }
>
> should become:
>
> if (streams == null) {
> streams = new SavedStreams();
> setPreviousTokenStream(streams);
> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
> streams.filteredTokenStream = new
> StandardFilter(streams.tokenStream);
> streams.filteredTokenStream = new
> LowerCaseFilter(streams.filteredTokenStream);
> streams.filteredTokenStream = new
> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>
> streams.filteredTokenStream, stopSet);
> } else {
> streams.filteredTokenStream.reset(); // changed line.
> }
>
>
> What do you think?
>
> Eran.
>
>
>



--
Robert Muir
rcmuir [at] gmail


uwe at thetaphi

Nov 15, 2009, 9:17 AM

Post #9 of 14 (502 views)
Permalink
RE: Bug in StandardAnalyzer + StopAnalyzer? [In reply to]

Even QueryParser calls reset() as first call. Also BaseTokenStreamTestCase
does it.



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Robert Muir [mailto:rcmuir [at] gmail]
Sent: Sunday, November 15, 2009 6:14 PM
To: java-dev [at] lucene
Subject: Re: Bug in StandardAnalyzer + StopAnalyzer?



Uwe, not so sure it doesn't need to be there, what about other consumers
such as QueryParser?

On Sun, Nov 15, 2009 at 12:02 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

I checked again, reset() on the top filter does not need to be there, as the
indexer calls it automatically as the first call after reusableTokenStream.
For reusing only reset(Reader) must be called. It's a little bit strange
that both methods have the same name, the reset(Reader) one has a completely
different meaning.



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Uwe Schindler [mailto:uwe [at] thetaphi]
Sent: Sunday, November 15, 2009 5:56 PM


To: java-dev [at] lucene

Subject: RE: Bug in StandardAnalyzer + StopAnalyzer?



It should be there... But ist unimplemented in the TokenFilters used by
Standard/Stop Analyzer. Buf for consistency it should be there. I'll talk
with Robert Muir about it.



Uwe



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
<http://www.thetaphi.de> http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Eran Sevi [mailto:eransevi [at] gmail]
Sent: Sunday, November 15, 2009 5:51 PM
To: java-dev [at] lucene
Subject: Re: Bug in StandardAnalyzer + StopAnalyzer?



Good point. I missed that part :) since only the tokenizer uses the reader,
we must call it directly.

So the reset() on the filteredTokenStream was omitted on purpose because
there's not underlying implementation? or is it really missing?

On Sun, Nov 15, 2009 at 6:30 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

It must call both reset on the top-level TokenStream and reset(Reader) on
the Tokenizer-. If the latter is not done, how should the TokenStream get
his new Reader?



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Eran Sevi [mailto:eransevi [at] gmail]
Sent: Sunday, November 15, 2009 5:19 PM
To: java-dev [at] lucene
Subject: Bug in StandardAnalyzer + StopAnalyzer?



Hi,
when changing my code to support the not-so-new reusableTokenStream I
noticed that in the cases when a SavedStream class was used in an analyzer
(Standard,Stop and maybe others as well) the reset() method is called on the
tokenizer instead of on the filter.

The filter implementation of reset() calls the inner filters+input reset()
methods, but the tokenizer reset() method can't do that.
I think this bug hasn't caused any errors so far since none of the filters
used in the analyzers overrides the reset() method, but it might cause
problems if the implementation changes in the future.

the fix is very simple. for example (in StandardAnalyzer):

if (streams == null) {
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new StandardTokenizer(matchVersion, reader);
streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
streams.filteredTokenStream = new
LowerCaseFilter(streams.filteredTokenStream);
streams.filteredTokenStream = new
StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion
),

streams.filteredTokenStream, stopSet);
} else {
streams.tokenStream.reset(reader);
}

should become:

if (streams == null) {
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new StandardTokenizer(matchVersion, reader);
streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
streams.filteredTokenStream = new
LowerCaseFilter(streams.filteredTokenStream);
streams.filteredTokenStream = new
StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion
),

streams.filteredTokenStream, stopSet);
} else {
streams.filteredTokenStream.reset(); // changed line.
}


What do you think?

Eran.






--
Robert Muir
rcmuir [at] gmail


rcmuir at gmail

Nov 15, 2009, 9:39 AM

Post #10 of 14 (500 views)
Permalink
Re: Bug in StandardAnalyzer + StopAnalyzer? [In reply to]

ok, at one point i do not think BaseTokenStreamTestCase did.

if this is the case, then its the consumer's responsibility to call reset,
and we should remove extra resets() inside reusableTokenStream() from
analyzers that have it... and probably improve the docs of this contract.

On Sun, Nov 15, 2009 at 12:17 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

> Even QueryParser calls reset() as first call. Also
> BaseTokenStreamTestCase does it.
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Robert Muir [mailto:rcmuir [at] gmail]
> *Sent:* Sunday, November 15, 2009 6:14 PM
>
> *To:* java-dev [at] lucene
> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> Uwe, not so sure it doesn't need to be there, what about other consumers
> such as QueryParser?
>
> On Sun, Nov 15, 2009 at 12:02 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>
> I checked again, reset() on the top filter does not need to be there, as
> the indexer calls it automatically as the first call after
> reusableTokenStream. For reusing only reset(Reader) must be called. It’s a
> little bit strange that both methods have the same name, the reset(Reader)
> one has a completely different meaning.
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Uwe Schindler [mailto:uwe [at] thetaphi]
> *Sent:* Sunday, November 15, 2009 5:56 PM
>
>
> *To:* java-dev [at] lucene
>
> *Subject:* RE: Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> It should be there... But ist unimplemented in the TokenFilters used by
> Standard/Stop Analyzer. Buf for consistency it should be there. I’ll talk
> with Robert Muir about it.
>
>
>
> Uwe
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Eran Sevi [mailto:eransevi [at] gmail]
> *Sent:* Sunday, November 15, 2009 5:51 PM
> *To:* java-dev [at] lucene
> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> Good point. I missed that part :) since only the tokenizer uses the reader,
> we must call it directly.
>
> So the reset() on the filteredTokenStream was omitted on purpose because
> there's not underlying implementation? or is it really missing?
>
> On Sun, Nov 15, 2009 at 6:30 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>
> It must call both reset on the top-level TokenStream and reset(Reader) on
> the Tokenizer-. If the latter is not done, how should the TokenStream get
> his new Reader?
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Eran Sevi [mailto:eransevi [at] gmail]
> *Sent:* Sunday, November 15, 2009 5:19 PM
> *To:* java-dev [at] lucene
> *Subject:* Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> Hi,
> when changing my code to support the not-so-new reusableTokenStream I
> noticed that in the cases when a SavedStream class was used in an analyzer
> (Standard,Stop and maybe others as well) the reset() method is called on the
> tokenizer instead of on the filter.
>
> The filter implementation of reset() calls the inner filters+input reset()
> methods, but the tokenizer reset() method can't do that.
> I think this bug hasn't caused any errors so far since none of the filters
> used in the analyzers overrides the reset() method, but it might cause
> problems if the implementation changes in the future.
>
> the fix is very simple. for example (in StandardAnalyzer):
>
> if (streams == null) {
> streams = new SavedStreams();
> setPreviousTokenStream(streams);
> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
> streams.filteredTokenStream = new
> StandardFilter(streams.tokenStream);
> streams.filteredTokenStream = new
> LowerCaseFilter(streams.filteredTokenStream);
> streams.filteredTokenStream = new
> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>
> streams.filteredTokenStream, stopSet);
> } else {
> streams.tokenStream.reset(reader);
> }
>
> should become:
>
> if (streams == null) {
> streams = new SavedStreams();
> setPreviousTokenStream(streams);
> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
> streams.filteredTokenStream = new
> StandardFilter(streams.tokenStream);
> streams.filteredTokenStream = new
> LowerCaseFilter(streams.filteredTokenStream);
> streams.filteredTokenStream = new
> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>
> streams.filteredTokenStream, stopSet);
> } else {
> streams.filteredTokenStream.reset(); // changed line.
> }
>
>
> What do you think?
>
> Eran.
>
>
>
>
>
>
> --
> Robert Muir
> rcmuir [at] gmail
>



--
Robert Muir
rcmuir [at] gmail


uwe at thetaphi

Nov 15, 2009, 9:45 AM

Post #11 of 14 (492 views)
Permalink
RE: Bug in StandardAnalyzer + StopAnalyzer? [In reply to]

Yes, but on the other hand it does not hurt to automaticall reset in the
analyzer.... *krr* I do not know how to proceed. I think we should keep it
as it was since the beginning of Lucene (call to reset inside analyzer, QP)
and document it correctly.



You are right, at the beginning, BaseTokenStreamTestCase and many other
hardcoded tests did not call reset. Now, the test also calls end() and
close().



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Robert Muir [mailto:rcmuir [at] gmail]
Sent: Sunday, November 15, 2009 6:40 PM
To: java-dev [at] lucene
Subject: Re: Bug in StandardAnalyzer + StopAnalyzer?



ok, at one point i do not think BaseTokenStreamTestCase did.

if this is the case, then its the consumer's responsibility to call reset,
and we should remove extra resets() inside reusableTokenStream() from
analyzers that have it... and probably improve the docs of this contract.

On Sun, Nov 15, 2009 at 12:17 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

Even QueryParser calls reset() as first call. Also BaseTokenStreamTestCase
does it.



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Robert Muir [mailto:rcmuir [at] gmail]
Sent: Sunday, November 15, 2009 6:14 PM


To: java-dev [at] lucene
Subject: Re: Bug in StandardAnalyzer + StopAnalyzer?



Uwe, not so sure it doesn't need to be there, what about other consumers
such as QueryParser?

On Sun, Nov 15, 2009 at 12:02 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

I checked again, reset() on the top filter does not need to be there, as the
indexer calls it automatically as the first call after reusableTokenStream.
For reusing only reset(Reader) must be called. It's a little bit strange
that both methods have the same name, the reset(Reader) one has a completely
different meaning.



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Uwe Schindler [mailto:uwe [at] thetaphi]
Sent: Sunday, November 15, 2009 5:56 PM


To: java-dev [at] lucene

Subject: RE: Bug in StandardAnalyzer + StopAnalyzer?



It should be there... But ist unimplemented in the TokenFilters used by
Standard/Stop Analyzer. Buf for consistency it should be there. I'll talk
with Robert Muir about it.



Uwe



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
<http://www.thetaphi.de> http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Eran Sevi [mailto:eransevi [at] gmail]
Sent: Sunday, November 15, 2009 5:51 PM
To: java-dev [at] lucene
Subject: Re: Bug in StandardAnalyzer + StopAnalyzer?



Good point. I missed that part :) since only the tokenizer uses the reader,
we must call it directly.

So the reset() on the filteredTokenStream was omitted on purpose because
there's not underlying implementation? or is it really missing?

On Sun, Nov 15, 2009 at 6:30 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

It must call both reset on the top-level TokenStream and reset(Reader) on
the Tokenizer-. If the latter is not done, how should the TokenStream get
his new Reader?



-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi

_____

From: Eran Sevi [mailto:eransevi [at] gmail]
Sent: Sunday, November 15, 2009 5:19 PM
To: java-dev [at] lucene
Subject: Bug in StandardAnalyzer + StopAnalyzer?



Hi,
when changing my code to support the not-so-new reusableTokenStream I
noticed that in the cases when a SavedStream class was used in an analyzer
(Standard,Stop and maybe others as well) the reset() method is called on the
tokenizer instead of on the filter.

The filter implementation of reset() calls the inner filters+input reset()
methods, but the tokenizer reset() method can't do that.
I think this bug hasn't caused any errors so far since none of the filters
used in the analyzers overrides the reset() method, but it might cause
problems if the implementation changes in the future.

the fix is very simple. for example (in StandardAnalyzer):

if (streams == null) {
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new StandardTokenizer(matchVersion, reader);
streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
streams.filteredTokenStream = new
LowerCaseFilter(streams.filteredTokenStream);
streams.filteredTokenStream = new
StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion
),

streams.filteredTokenStream, stopSet);
} else {
streams.tokenStream.reset(reader);
}

should become:

if (streams == null) {
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new StandardTokenizer(matchVersion, reader);
streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
streams.filteredTokenStream = new
LowerCaseFilter(streams.filteredTokenStream);
streams.filteredTokenStream = new
StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion
),

streams.filteredTokenStream, stopSet);
} else {
streams.filteredTokenStream.reset(); // changed line.
}


What do you think?

Eran.






--
Robert Muir
rcmuir [at] gmail




--
Robert Muir
rcmuir [at] gmail


rcmuir at gmail

Nov 15, 2009, 10:33 AM

Post #12 of 14 (494 views)
Permalink
Re: Bug in StandardAnalyzer + StopAnalyzer? [In reply to]

Uwe, I will throw another twist at it: I do not like the name of
Tokenizer.reset(Reader) method.
I wish it was called .setReader() or something else, I think it is confusing
for Tokenizer to have .reset() and .reset(Reader), when the latter should
hardly ever be overridden.

Great example of how insane this is right now, is StandardTokenizer:

@Override
public void reset(Reader reader) throws IOException {
super.reset(reader);
reset();
}

LOL!

On Sun, Nov 15, 2009 at 12:45 PM, Uwe Schindler <uwe [at] thetaphi> wrote:

> Yes, but on the other hand it does not hurt to automaticall reset in the
> analyzer.... **krr** I do not know how to proceed. I think we should keep
> it as it was since the beginning of Lucene (call to reset inside analyzer,
> QP) and document it correctly.
>
>
>
> You are right, at the beginning, BaseTokenStreamTestCase and many other
> hardcoded tests did not call reset. Now, the test also calls end() and
> close().
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Robert Muir [mailto:rcmuir [at] gmail]
> *Sent:* Sunday, November 15, 2009 6:40 PM
>
> *To:* java-dev [at] lucene
> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> ok, at one point i do not think BaseTokenStreamTestCase did.
>
> if this is the case, then its the consumer's responsibility to call reset,
> and we should remove extra resets() inside reusableTokenStream() from
> analyzers that have it... and probably improve the docs of this contract.
>
> On Sun, Nov 15, 2009 at 12:17 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>
> Even QueryParser calls reset() as first call. Also BaseTokenStreamTestCase
> does it.
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Robert Muir [mailto:rcmuir [at] gmail]
> *Sent:* Sunday, November 15, 2009 6:14 PM
>
>
> *To:* java-dev [at] lucene
> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> Uwe, not so sure it doesn't need to be there, what about other consumers
> such as QueryParser?
>
> On Sun, Nov 15, 2009 at 12:02 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>
> I checked again, reset() on the top filter does not need to be there, as
> the indexer calls it automatically as the first call after
> reusableTokenStream. For reusing only reset(Reader) must be called. It’s a
> little bit strange that both methods have the same name, the reset(Reader)
> one has a completely different meaning.
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Uwe Schindler [mailto:uwe [at] thetaphi]
> *Sent:* Sunday, November 15, 2009 5:56 PM
>
>
> *To:* java-dev [at] lucene
>
> *Subject:* RE: Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> It should be there... But ist unimplemented in the TokenFilters used by
> Standard/Stop Analyzer. Buf for consistency it should be there. I’ll talk
> with Robert Muir about it.
>
>
>
> Uwe
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Eran Sevi [mailto:eransevi [at] gmail]
> *Sent:* Sunday, November 15, 2009 5:51 PM
> *To:* java-dev [at] lucene
> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> Good point. I missed that part :) since only the tokenizer uses the reader,
> we must call it directly.
>
> So the reset() on the filteredTokenStream was omitted on purpose because
> there's not underlying implementation? or is it really missing?
>
> On Sun, Nov 15, 2009 at 6:30 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>
> It must call both reset on the top-level TokenStream and reset(Reader) on
> the Tokenizer-. If the latter is not done, how should the TokenStream get
> his new Reader?
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
> ------------------------------
>
> *From:* Eran Sevi [mailto:eransevi [at] gmail]
> *Sent:* Sunday, November 15, 2009 5:19 PM
> *To:* java-dev [at] lucene
> *Subject:* Bug in StandardAnalyzer + StopAnalyzer?
>
>
>
> Hi,
> when changing my code to support the not-so-new reusableTokenStream I
> noticed that in the cases when a SavedStream class was used in an analyzer
> (Standard,Stop and maybe others as well) the reset() method is called on the
> tokenizer instead of on the filter.
>
> The filter implementation of reset() calls the inner filters+input reset()
> methods, but the tokenizer reset() method can't do that.
> I think this bug hasn't caused any errors so far since none of the filters
> used in the analyzers overrides the reset() method, but it might cause
> problems if the implementation changes in the future.
>
> the fix is very simple. for example (in StandardAnalyzer):
>
> if (streams == null) {
> streams = new SavedStreams();
> setPreviousTokenStream(streams);
> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
> streams.filteredTokenStream = new
> StandardFilter(streams.tokenStream);
> streams.filteredTokenStream = new
> LowerCaseFilter(streams.filteredTokenStream);
> streams.filteredTokenStream = new
> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>
> streams.filteredTokenStream, stopSet);
> } else {
> streams.tokenStream.reset(reader);
> }
>
> should become:
>
> if (streams == null) {
> streams = new SavedStreams();
> setPreviousTokenStream(streams);
> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
> streams.filteredTokenStream = new
> StandardFilter(streams.tokenStream);
> streams.filteredTokenStream = new
> LowerCaseFilter(streams.filteredTokenStream);
> streams.filteredTokenStream = new
> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>
> streams.filteredTokenStream, stopSet);
> } else {
> streams.filteredTokenStream.reset(); // changed line.
> }
>
>
> What do you think?
>
> Eran.
>
>
>
>
>
>
> --
> Robert Muir
> rcmuir [at] gmail
>
>
>
>
> --
> Robert Muir
> rcmuir [at] gmail
>



--
Robert Muir
rcmuir [at] gmail


eransevi at gmail

Nov 16, 2009, 12:04 AM

Post #13 of 14 (472 views)
Permalink
Re: Bug in StandardAnalyzer + StopAnalyzer? [In reply to]

I think it's always better to do what you can inside the code and don't
leave it for the clients that calls the method (just code duplication and a
potential place for error if it's forgotten).
So if it's not complicated I think the call to reset() on the chain of
filters should be the responsibility of each analyzer and not the indexer/QP
and others.

Changing the method name is also a good idea in my opinion.

On Sun, Nov 15, 2009 at 8:33 PM, Robert Muir <rcmuir [at] gmail> wrote:

> Uwe, I will throw another twist at it: I do not like the name of
> Tokenizer.reset(Reader) method.
> I wish it was called .setReader() or something else, I think it is
> confusing for Tokenizer to have .reset() and .reset(Reader), when the latter
> should hardly ever be overridden.
>
> Great example of how insane this is right now, is StandardTokenizer:
>
> @Override
> public void reset(Reader reader) throws IOException {
> super.reset(reader);
> reset();
> }
>
> LOL!
>
>
> On Sun, Nov 15, 2009 at 12:45 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>
>> Yes, but on the other hand it does not hurt to automaticall reset in the
>> analyzer.... **krr** I do not know how to proceed. I think we should keep
>> it as it was since the beginning of Lucene (call to reset inside analyzer,
>> QP) and document it correctly.
>>
>>
>>
>> You are right, at the beginning, BaseTokenStreamTestCase and many other
>> hardcoded tests did not call reset. Now, the test also calls end() and
>> close().
>>
>>
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe [at] thetaphi
>> ------------------------------
>>
>> *From:* Robert Muir [mailto:rcmuir [at] gmail]
>> *Sent:* Sunday, November 15, 2009 6:40 PM
>>
>> *To:* java-dev [at] lucene
>> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>>
>>
>>
>> ok, at one point i do not think BaseTokenStreamTestCase did.
>>
>> if this is the case, then its the consumer's responsibility to call reset,
>> and we should remove extra resets() inside reusableTokenStream() from
>> analyzers that have it... and probably improve the docs of this contract.
>>
>> On Sun, Nov 15, 2009 at 12:17 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>>
>> Even QueryParser calls reset() as first call. Also BaseTokenStreamTestCase
>> does it.
>>
>>
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe [at] thetaphi
>> ------------------------------
>>
>> *From:* Robert Muir [mailto:rcmuir [at] gmail]
>> *Sent:* Sunday, November 15, 2009 6:14 PM
>>
>>
>> *To:* java-dev [at] lucene
>> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>>
>>
>>
>> Uwe, not so sure it doesn't need to be there, what about other consumers
>> such as QueryParser?
>>
>> On Sun, Nov 15, 2009 at 12:02 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>>
>> I checked again, reset() on the top filter does not need to be there, as
>> the indexer calls it automatically as the first call after
>> reusableTokenStream. For reusing only reset(Reader) must be called. It’s a
>> little bit strange that both methods have the same name, the reset(Reader)
>> one has a completely different meaning.
>>
>>
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe [at] thetaphi
>> ------------------------------
>>
>> *From:* Uwe Schindler [mailto:uwe [at] thetaphi]
>> *Sent:* Sunday, November 15, 2009 5:56 PM
>>
>>
>> *To:* java-dev [at] lucene
>>
>> *Subject:* RE: Bug in StandardAnalyzer + StopAnalyzer?
>>
>>
>>
>> It should be there... But ist unimplemented in the TokenFilters used by
>> Standard/Stop Analyzer. Buf for consistency it should be there. I’ll talk
>> with Robert Muir about it.
>>
>>
>>
>> Uwe
>>
>>
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe [at] thetaphi
>> ------------------------------
>>
>> *From:* Eran Sevi [mailto:eransevi [at] gmail]
>> *Sent:* Sunday, November 15, 2009 5:51 PM
>> *To:* java-dev [at] lucene
>> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>>
>>
>>
>> Good point. I missed that part :) since only the tokenizer uses the
>> reader, we must call it directly.
>>
>> So the reset() on the filteredTokenStream was omitted on purpose because
>> there's not underlying implementation? or is it really missing?
>>
>> On Sun, Nov 15, 2009 at 6:30 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>>
>> It must call both reset on the top-level TokenStream and reset(Reader) on
>> the Tokenizer-. If the latter is not done, how should the TokenStream get
>> his new Reader?
>>
>>
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe [at] thetaphi
>> ------------------------------
>>
>> *From:* Eran Sevi [mailto:eransevi [at] gmail]
>> *Sent:* Sunday, November 15, 2009 5:19 PM
>> *To:* java-dev [at] lucene
>> *Subject:* Bug in StandardAnalyzer + StopAnalyzer?
>>
>>
>>
>> Hi,
>> when changing my code to support the not-so-new reusableTokenStream I
>> noticed that in the cases when a SavedStream class was used in an analyzer
>> (Standard,Stop and maybe others as well) the reset() method is called on the
>> tokenizer instead of on the filter.
>>
>> The filter implementation of reset() calls the inner filters+input reset()
>> methods, but the tokenizer reset() method can't do that.
>> I think this bug hasn't caused any errors so far since none of the filters
>> used in the analyzers overrides the reset() method, but it might cause
>> problems if the implementation changes in the future.
>>
>> the fix is very simple. for example (in StandardAnalyzer):
>>
>> if (streams == null) {
>> streams = new SavedStreams();
>> setPreviousTokenStream(streams);
>> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
>> streams.filteredTokenStream = new
>> StandardFilter(streams.tokenStream);
>> streams.filteredTokenStream = new
>> LowerCaseFilter(streams.filteredTokenStream);
>> streams.filteredTokenStream = new
>> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>>
>> streams.filteredTokenStream, stopSet);
>> } else {
>> streams.tokenStream.reset(reader);
>> }
>>
>> should become:
>>
>> if (streams == null) {
>> streams = new SavedStreams();
>> setPreviousTokenStream(streams);
>> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
>> streams.filteredTokenStream = new
>> StandardFilter(streams.tokenStream);
>> streams.filteredTokenStream = new
>> LowerCaseFilter(streams.filteredTokenStream);
>> streams.filteredTokenStream = new
>> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>>
>> streams.filteredTokenStream, stopSet);
>> } else {
>> streams.filteredTokenStream.reset(); // changed line.
>> }
>>
>>
>> What do you think?
>>
>> Eran.
>>
>>
>>
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir [at] gmail
>>
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir [at] gmail
>>
>
>
>
> --
> Robert Muir
> rcmuir [at] gmail
>


rcmuir at gmail

Nov 16, 2009, 6:38 AM

Post #14 of 14 (465 views)
Permalink
Re: Bug in StandardAnalyzer + StopAnalyzer? [In reply to]

Thanks for your feedback. I think we can improve this consistency with this
issue: LUCENE-2034
In this issue Simon is proposing a subclass that makes it easier to reuse
tokenstreams.

and for the StandardTokenizer, we can consider removing this entire
override, reset(Reader) calls reset(), there is no need to reset()
StandardTokenizer THREE times...

On Mon, Nov 16, 2009 at 3:04 AM, Eran Sevi <eransevi [at] gmail> wrote:

> I think it's always better to do what you can inside the code and don't
> leave it for the clients that calls the method (just code duplication and a
> potential place for error if it's forgotten).
> So if it's not complicated I think the call to reset() on the chain of
> filters should be the responsibility of each analyzer and not the indexer/QP
> and others.
>
> Changing the method name is also a good idea in my opinion.
>
>
> On Sun, Nov 15, 2009 at 8:33 PM, Robert Muir <rcmuir [at] gmail> wrote:
>
>> Uwe, I will throw another twist at it: I do not like the name of
>> Tokenizer.reset(Reader) method.
>> I wish it was called .setReader() or something else, I think it is
>> confusing for Tokenizer to have .reset() and .reset(Reader), when the latter
>> should hardly ever be overridden.
>>
>> Great example of how insane this is right now, is StandardTokenizer:
>>
>> @Override
>> public void reset(Reader reader) throws IOException {
>> super.reset(reader);
>> reset();
>> }
>>
>> LOL!
>>
>>
>> On Sun, Nov 15, 2009 at 12:45 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>>
>>> Yes, but on the other hand it does not hurt to automaticall reset in
>>> the analyzer.... **krr** I do not know how to proceed. I think we should
>>> keep it as it was since the beginning of Lucene (call to reset inside
>>> analyzer, QP) and document it correctly.
>>>
>>>
>>>
>>> You are right, at the beginning, BaseTokenStreamTestCase and many other
>>> hardcoded tests did not call reset. Now, the test also calls end() and
>>> close().
>>>
>>>
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: uwe [at] thetaphi
>>> ------------------------------
>>>
>>> *From:* Robert Muir [mailto:rcmuir [at] gmail]
>>> *Sent:* Sunday, November 15, 2009 6:40 PM
>>>
>>> *To:* java-dev [at] lucene
>>> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>>>
>>>
>>>
>>> ok, at one point i do not think BaseTokenStreamTestCase did.
>>>
>>> if this is the case, then its the consumer's responsibility to call
>>> reset, and we should remove extra resets() inside reusableTokenStream() from
>>> analyzers that have it... and probably improve the docs of this contract.
>>>
>>> On Sun, Nov 15, 2009 at 12:17 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>>>
>>> Even QueryParser calls reset() as first call. Also
>>> BaseTokenStreamTestCase does it.
>>>
>>>
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: uwe [at] thetaphi
>>> ------------------------------
>>>
>>> *From:* Robert Muir [mailto:rcmuir [at] gmail]
>>> *Sent:* Sunday, November 15, 2009 6:14 PM
>>>
>>>
>>> *To:* java-dev [at] lucene
>>> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>>>
>>>
>>>
>>> Uwe, not so sure it doesn't need to be there, what about other consumers
>>> such as QueryParser?
>>>
>>> On Sun, Nov 15, 2009 at 12:02 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>>>
>>> I checked again, reset() on the top filter does not need to be there, as
>>> the indexer calls it automatically as the first call after
>>> reusableTokenStream. For reusing only reset(Reader) must be called. It’s a
>>> little bit strange that both methods have the same name, the reset(Reader)
>>> one has a completely different meaning.
>>>
>>>
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: uwe [at] thetaphi
>>> ------------------------------
>>>
>>> *From:* Uwe Schindler [mailto:uwe [at] thetaphi]
>>> *Sent:* Sunday, November 15, 2009 5:56 PM
>>>
>>>
>>> *To:* java-dev [at] lucene
>>>
>>> *Subject:* RE: Bug in StandardAnalyzer + StopAnalyzer?
>>>
>>>
>>>
>>> It should be there... But ist unimplemented in the TokenFilters used by
>>> Standard/Stop Analyzer. Buf for consistency it should be there. I’ll talk
>>> with Robert Muir about it.
>>>
>>>
>>>
>>> Uwe
>>>
>>>
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: uwe [at] thetaphi
>>> ------------------------------
>>>
>>> *From:* Eran Sevi [mailto:eransevi [at] gmail]
>>> *Sent:* Sunday, November 15, 2009 5:51 PM
>>> *To:* java-dev [at] lucene
>>> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>>>
>>>
>>>
>>> Good point. I missed that part :) since only the tokenizer uses the
>>> reader, we must call it directly.
>>>
>>> So the reset() on the filteredTokenStream was omitted on purpose because
>>> there's not underlying implementation? or is it really missing?
>>>
>>> On Sun, Nov 15, 2009 at 6:30 PM, Uwe Schindler <uwe [at] thetaphi> wrote:
>>>
>>> It must call both reset on the top-level TokenStream and reset(Reader) on
>>> the Tokenizer-. If the latter is not done, how should the TokenStream get
>>> his new Reader?
>>>
>>>
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: uwe [at] thetaphi
>>> ------------------------------
>>>
>>> *From:* Eran Sevi [mailto:eransevi [at] gmail]
>>> *Sent:* Sunday, November 15, 2009 5:19 PM
>>> *To:* java-dev [at] lucene
>>> *Subject:* Bug in StandardAnalyzer + StopAnalyzer?
>>>
>>>
>>>
>>> Hi,
>>> when changing my code to support the not-so-new reusableTokenStream I
>>> noticed that in the cases when a SavedStream class was used in an analyzer
>>> (Standard,Stop and maybe others as well) the reset() method is called on the
>>> tokenizer instead of on the filter.
>>>
>>> The filter implementation of reset() calls the inner filters+input
>>> reset() methods, but the tokenizer reset() method can't do that.
>>> I think this bug hasn't caused any errors so far since none of the
>>> filters used in the analyzers overrides the reset() method, but it might
>>> cause problems if the implementation changes in the future.
>>>
>>> the fix is very simple. for example (in StandardAnalyzer):
>>>
>>> if (streams == null) {
>>> streams = new SavedStreams();
>>> setPreviousTokenStream(streams);
>>> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
>>> streams.filteredTokenStream = new
>>> StandardFilter(streams.tokenStream);
>>> streams.filteredTokenStream = new
>>> LowerCaseFilter(streams.filteredTokenStream);
>>> streams.filteredTokenStream = new
>>> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>>>
>>> streams.filteredTokenStream, stopSet);
>>> } else {
>>> streams.tokenStream.reset(reader);
>>> }
>>>
>>> should become:
>>>
>>> if (streams == null) {
>>> streams = new SavedStreams();
>>> setPreviousTokenStream(streams);
>>> streams.tokenStream = new StandardTokenizer(matchVersion, reader);
>>> streams.filteredTokenStream = new
>>> StandardFilter(streams.tokenStream);
>>> streams.filteredTokenStream = new
>>> LowerCaseFilter(streams.filteredTokenStream);
>>> streams.filteredTokenStream = new
>>> StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>>>
>>> streams.filteredTokenStream, stopSet);
>>> } else {
>>> streams.filteredTokenStream.reset(); // changed line.
>>> }
>>>
>>>
>>> What do you think?
>>>
>>> Eran.
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Robert Muir
>>> rcmuir [at] gmail
>>>
>>>
>>>
>>>
>>> --
>>> Robert Muir
>>> rcmuir [at] gmail
>>>
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir [at] gmail
>>
>
>


--
Robert Muir
rcmuir [at] gmail

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.