
Mailing List Archive: Lucene: Java-Dev

Token termBuffer issues

 

 



yonik at apache

Jul 19, 2007, 3:23 PM

Post #1 of 24 (7978 views)
Permalink
Token termBuffer issues

I had previously missed the changes to Token that add support for
using an array (termBuffer):

+ // For better indexing speed, use termBuffer (and
+ // termBufferOffset/termBufferLength) instead of termText
+ // to save new'ing a String per token
+ char[] termBuffer;
+ int termBufferOffset;
+ int termBufferLength;

While I think this approach would have been best to start off with
rather than String,
I'm concerned that it will do little more than add overhead at this
point, resulting in slower code, not faster.

- If any tokenizer or token filter tries setting the termBuffer, any
downstream components would need to check for both. It could be made
backward compatible by constructing a string on demand, but that will
really slow things down, unless the whole chain is converted to only
using the char[] somehow.

- It doesn't look like the indexing code currently pays any attention
to the char[], right?

- What if both the String and char[] are set? A filter that doesn't
know better sets the String... this doesn't clear the char[]
currently, should it?

Thoughts?

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


lucene at mikemccandless

Jul 19, 2007, 5:27 PM

Post #2 of 24 (7856 views)
Permalink
Re: Token termBuffer issues [In reply to]

"Yonik Seeley" <yonik [at] apache> wrote:
> I had previously missed the changes to Token that add support for
> using an array (termBuffer):
>
> + // For better indexing speed, use termBuffer (and
> + // termBufferOffset/termBufferLength) instead of termText
> + // to save new'ing a String per token
> + char[] termBuffer;
> + int termBufferOffset;
> + int termBufferLength;
>
> While I think this approach would have been best to start off with
> rather than String,
> I'm concerned that it will do little more than add overhead at this
> point, resulting in slower code, not faster.
>
> - If any tokenizer or token filter tries setting the termBuffer, any
> downstream components would need to check for both. It could be made
> backward compatible by constructing a string on demand, but that will
> really slow things down, unless the whole chain is converted to only
> using the char[] somehow.

Good point: if your analyzer/tokenizer produces char[] tokens then
your downstream filters would have to accept char[] tokens.

I think on-demand constructing a String (and saving it as termText)
would be an OK solution? Why would that be slower than having to make
a String in the first place (if we didn't have the char[] API)? It's
at least graceful degradation.
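
Just to sketch what I mean (illustrative only, not committed code; it assumes
the package-private termText/termBuffer fields shown above):

  // Build the String lazily from the char[] the first time a legacy
  // caller asks for it, so String-based filters keep working unchanged.
  public String termText() {
    if (termText == null && termBuffer != null) {
      termText = new String(termBuffer, termBufferOffset, termBufferLength);
    }
    return termText;
  }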

> - It doesn't look like the indexing code currently pays any attention
> to the char[], right?

It does, in DocumentsWriter.addPosition().

> - What if both the String and char[] are set? A filter that doesn't
> know better sets the String... this doesn't clear the char[]
> currently, should it?

Currently the char[] wins, but good point: seems like each setter
should null out the other one?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


yonik at apache

Jul 20, 2007, 11:52 AM

Post #3 of 24 (7842 views)
Permalink
Re: Token termBuffer issues [In reply to]

On 7/19/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> "Yonik Seeley" <yonik [at] apache> wrote:
> > I had previously missed the changes to Token that add support for
> > using an array (termBuffer):
> >
> > + // For better indexing speed, use termBuffer (and
> > + // termBufferOffset/termBufferLength) instead of termText
> > + // to save new'ing a String per token
> > + char[] termBuffer;
> > + int termBufferOffset;
> > + int termBufferLength;
> >
> > While I think this approach would have been best to start off with
> > rather than String,
> > I'm concerned that it will do little more than add overhead at this
> > point, resulting in slower code, not faster.
> >
> > - If any tokenizer or token filter tries setting the termBuffer, any
> > downstream components would need to check for both. It could be made
> > backward compatible by constructing a string on demand, but that will
> > really slow things down, unless the whole chain is converted to only
> > using the char[] somehow.
>
> Good point: if your analyzer/tokenizer produces char[] tokens then
> your downstream filters would have to accept char[] tokens.
>
> I think on-demand constructing a String (and saving it as termText)
> would be an OK solution? Why would that be slower than having to make
> a String in the first place (if we didn't have the char[] API)? It's
> at least graceful degradation.

It's the rule rather than the exception though. Pretty much
everything is based on String.

> > - It doesn't look like the indexing code currently pays any attention
> > to the char[], right?
>
> It does, in DocumentsWriter.addPosition().

Ah, thanks.

> > - What if both the String and char[] are set? A filter that doesn't
> > know better sets the String... this doesn't clear the char[]
> > currently, should it?
>
> Currently the char[] wins, but good point: seems like each setter
> should null out the other one?

Certainly the String setter should null the char[] (that's the only
way to keep back compatibility), and probably vice-versa.

Note that there are many existing filters that directly access and
manipulate the package protected String termText. These will need to
be changed.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


hossman_lucene at fucit

Jul 20, 2007, 4:28 PM

Post #4 of 24 (7859 views)
Permalink
Re: Token termBuffer issues [In reply to]

: > Currently the char[] wins, but good point: seems like each setter
: > should null out the other one?
:
: Certainly the String setter should null the char[] (that's the only
: way to keep back compatibility), and probably vice-versa.

i haven't really thought about this before today (i missed seeing the
char[] stuff get added to Token as well) but if we're confident the char[]
stuff is the direction we want to go, then i believe the cleanest forward
migration plan is...

1) deprecate Token.termText, Token.getTermText(), Token.setTermText()
2) make Token.setTermBuffer() null out Token.termText (and document that)
3) make Token.setTermText() null out Token.termBuffer (see the sketch after
this list)
4) refactor all of the "if (null == termBuffer)" logic in
DocumentsWriter into the Token class, a la...

  public final char[] termBuffer() {
    initTermBuffer();
    return termBuffer;
  }
  public final int termBufferOffset() {
    initTermBuffer();
    return termBufferOffset;
  }
  public final int termBufferLength() {
    initTermBuffer();
    return termBufferLength;
  }
  private void initTermBuffer() {
    if (null != termBuffer) return;
    termBufferOffset = 0;
    termBufferLength = termText.length();
    termBuffer = new char[termBufferLength];
    termText.getChars(0, termBufferLength, termBuffer, 0);
  }

...such that DocumentsWriter never uses termText, just termBuffer
5) at some point down the road, modify all of the "core" TokenStreams to
use termBuffer instead of termText
6) at some point way down the road, delete the deprecated
methods/variables and the Token.initTermBuffer method.
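
A rough sketch of what steps 2 and 3 could look like (hypothetical setter
bodies, not committed code):

  public void setTermBuffer(char[] buffer, int offset, int length) {
    termBuffer = buffer;
    termBufferOffset = offset;
    termBufferLength = length;
    termText = null;    // step 2: the char[] setter invalidates the String
  }
  public void setTermText(String text) {
    termText = text;
    termBuffer = null;  // step 3: the String setter invalidates the char[]
  }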

Unless I've missed something, the end result should be...

a) existing TokenStreams that use termText exclusively and don't know
anything about termBuffer will have the exact same performance
characteristics that they currently do (a char[] will be created on demand
the first time termBuffer is used -- by DocumentsWriter)

b) TokenStreams which wind up being a mix of old and new code using both
termText and termBuffer should work correctly in any combination.

c) new TokenStreams that use termBuffer exclusively should work fine, and
have decent performance even with the overhead of the initTermBuffer()
call (which will get better once the deprecated legacy termText usage can
be removed).


Side note: Token.toString() is currently jacked in cases where termBuffer is
used.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


lucene at mikemccandless

Jul 21, 2007, 6:12 AM

Post #5 of 24 (7868 views)
Permalink
Re: Token termBuffer issues [In reply to]

"Chris Hostetter" <hossman_lucene [at] fucit> wrote:
>
> : > Currently the char[] wins, but good point: seems like each setter
> : > should null out the other one?
> :
> : Certainly the String setter should null the char[] (that's the only
> : way to keep back compatibility), and probably vice-versa.
>
> i haven't really thought about this before today (i missed seeing the
> char[] stuff get added to Token as well) but if we're confident the
> char[]
> stuff is the direction we want to go, then i believe the cleanest forward
> migration plan is...
>
> 1) deprecate Token.termText, Token.getTermText(), Token.setTermText
> 2) make Token.setTermBuffer() null out Token.termText (document)
> 3) make Token.setTermText() null out Token.termBuffer
> 4) refactor all of the "if (null == termBuffer)" logic in
> DocumentsWriter into the Token class, a la...
>   public final char[] termBuffer() {
>     initTermBuffer();
>     return termBuffer;
>   }
>   public final int termBufferOffset() {
>     initTermBuffer();
>     return termBufferOffset;
>   }
>   public final int termBufferLength() {
>     initTermBuffer();
>     return termBufferLength;
>   }
>   private void initTermBuffer() {
>     if (null != termBuffer) return;
>     termBufferOffset = 0;
>     termBufferLength = termText.length();
>     termBuffer = new char[termBufferLength];
>     termText.getChars(0, termBufferLength, termBuffer, 0);
>   }
> ...such that DocumentsWriter never uses termText, just termBuffer
> 5) at some point down the road, modify all of the "core" TokenStreams to
> use termBuffer instead of termText
> 6) at some point way down the road, delete the deprecated
> methods/variables and the Token.initTermBuffer method.
>
> Unless I've missed something, the end result should be...
>
> a) existing TokenStreams that use termText exclusively and don't know
> anything about termBuffer will have the exact same performance
> characteristics that they currently do (a char[] will be created on
> demand
> the first time termBuffer is used -- by DocumentsWriter)
>
> b) TokenStreams which wind up being a mix of old and new code using both
> termText and termBuffer should work correctly in any combination.
>
> c) new TokenStreams that use termBuffer exclusively should work fine, and
> have decent performance even with the overhead of the initTermBuffer()
> call (which will get better once the deprecated legacy termText usage can
> be removed).

I like this approach Hoss!

I will open an issue and work on it; I'd like to finish up through
your step 5 above. This way "out of the box" performance of Lucene is
improved, for people who use the core analyzers and filters.

To further improve "out of the box" performance I would really also
like to fix the core analyzers, when possible, to re-use a single
Token instance for each Token they produce. This would then mean no
objects are created as you step through Tokens in the TokenStream
... so this would give the best performance.

The problem is, this would be a sneaky API change and would for
example break anyone who gathers the Tokens into an array and
processes them later (eg maybe highlighter does?).

Maybe one way to migrate to this would be to add a new method "Token
nextDirect()" to TokenStream which is allowed to return a Token that
may be recycled. This means callers of nextDirect() must make their
own copy of the Token if they intend to keep it for later usage. It
would default to "return next()" and I would then default "next()" to
call nextDirect() but make a full copy of what's
returned. DocumentsWriter would use this to step through the tokens.

Analyzers would then implement either next() or nextDirect(). We
would fix all core analyzers to implement nextDirect(), and then all
other analyzers would remain backwards compatible.

From a caller's standpoint the only difference between nextDirect()
and next() is that next() guarantees to give you a "full private copy"
of the Token but nextDirect() does not.
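
Roughly like this (illustrative only; a concrete stream would have to
override at least one of the two methods, since these defaults call each
other):

  public abstract class TokenStream {
    // Default next(): delegate to nextDirect() and hand back a full
    // private copy, preserving today's "caller owns the Token" semantics.
    public Token next() throws IOException {
      Token t = nextDirect();
      return t == null ? null : (Token) t.clone();  // assumes a public Token.clone()
    }
    // Default nextDirect(): streams that only implement next() still work;
    // what they return is already a private copy, so recycling is a no-op.
    public Token nextDirect() throws IOException {
      return next();
    }
  }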

We could also punt on this entirely since it is always possible for
people to make their own analyzers that re-use Tokens, but I think
having decent "out of the box" performance with Lucene is important.
Ie, Lucene's defaults should be set up so that you get decent
performance without changing anything; you should not have to work
hard to get good performance and unfortunately today you still do....

> Side note: Token.toString() is currently jacked in cases where termBuffer is
> used.

Woops -- sorry -- I will add test case & fix it.


> Note that there are many existing filters that directly access and
> manipulate the package protected String termText. These will need to
> be changed.

Hmmm ... is it too late to make all of Token's package protected attrs
private, so we require use of getters? Or, maybe just make the new
ones (termBuffer*) private and then leave termText package protected
but deprecate it and add a comment saying that core analyzers and
filters have switched to using termBuffer and so you may get an NPE if
you access termText directly? Plus fix all core analyzers to not
directly access termText and use termBuffer instead...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


yonik at apache

Jul 21, 2007, 7:28 AM

Post #6 of 24 (7868 views)
Permalink
Re: Token termBuffer issues [In reply to]

On 7/21/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> To further improve "out of the box" performance I would really also
> like to fix the core analyzers, when possible, to re-use a single
> Token instance for each Token they produce. This would then mean no
> objects are created as you step through Tokens in the TokenStream
> ... so this would give the best performance.

How much better I wonder? Small object allocation & reclaiming is
supposed to be very good in current JVMs.
You got good performance out of reusing Field objects, I think,
partially because of the convoluted logic-tree in the constructor,
including interning the field name.

Token is a much lighter weight class, with no unpredictable branches,
less state (except we increased it with the termBuffer stuff ;-) and
no string interning (which presumably includes synchronization).

> The problem is, this would be a sneaky API change and would for
> example break anyone who gathers the Tokens into an array and
> processes them later (eg maybe highlighter does?).

I think there is probably quite a bit of code out there that needs to
look at tokens in context (and thus does some sort of buffering).

> Maybe one way to migrate to this would be to add a new method "Token
> nextDirect()" to TokenStream which is allowed to return a Token that
> may be recycled. This means callers of nextDirect() must make their
> own copy of the Token if they intend to keep it for later usage. It
> would default to "return next()" and I would then default "next()" to
> call nextDirect() but make a full copy of what's
> returned. DocumentsWriter would use this to step through the tokens.
>
> Analyzers would then implement either next() or nextDirect(). We
> would fix all core analyzers to implemente nextDirect(), and then all
> other analyzers would remain backwards compatible.
>
> From a caller's standpoint the only difference between nextDirect()
> and next() is that next() guarantees to give you a "full private copy"
> of the Token but nextDirect() does not.
>
> We could also punt on this entirely since it is always possible for
> people to make their own analyzers that re-use Tokens, but I think
> having decent "out of the box" performance with Lucene is important.
> Ie, Lucene's defaults should be set up so that you get decent
> performance without changing anything; you should not have to work
> hard to get good performance and unfortunately today you still do....

I'd be surprised if we can gain meaningful performance with token
reuse in a typical full-text analysis chain (stemming, stop-words,
etc). Have you done any measurements, or is it a hunch at this
point?

> Hmmm ... is it too late to make all of Token's package protected attrs
> private, so we require use of getters?

They are package protected, which means that they are safe to change
because we own all the code in that package.

Another thing that would slow down slightly with termBuffer use is
hash lookups (because Strings cache their hashCode). So an analyzer
with a synonym filter and a stop filter would end up calculating that
twice (or constructing a String). I don't know if it would be worth
maintaining that in the Token or not...
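
Just to illustrate the per-lookup work a char[]-keyed hash has to redo each
time (a String pays this once and caches the result):

  // Hash over a term held in a char[] slice; recomputed on every lookup
  // unless Token were to cache it, which is the open question above.
  static int hashChars(char[] buf, int offset, int length) {
    int code = 0;
    for (int i = 0; i < length; i++) {
      code = 31 * code + buf[offset + i];
    }
    return code;
  }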

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


markrmiller at gmail

Jul 21, 2007, 8:02 AM

Post #7 of 24 (7866 views)
Permalink
Re: Token termBuffer issues [In reply to]

> The problem is, this would be a sneaky API change and would for
> example break anyone who gathers the Tokens into an array and
> processes them later (eg maybe highlighter does?).
>
org.apache.lucene.analysis.CachingTokenFilter does this.

- Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


lucene at mikemccandless

Jul 21, 2007, 9:14 AM

Post #8 of 24 (7863 views)
Permalink
Re: Token termBuffer issues [In reply to]

"Yonik Seeley" <yonik [at] apache> wrote:
> On 7/21/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> > To further improve "out of the box" performance I would really also
> > like to fix the core analyzers, when possible, to re-use a single
> > Token instance for each Token they produce. This would then mean no
> > objects are created as you step through Tokens in the TokenStream
> > ... so this would give the best performance.
>
> How much better I wonder? Small object allocation & reclaiming is
> supposed to be very good in current JVMs.
> You got good performance out of reusing Field objects, i think,
> partially because of the convoluted logic-tree in the constructor,
> including interning the field name.
>
> Token is a much lighter weight class, with no unpredictable branches,
> less state (except we increased it with the termBuffer stuff ;-) and
> no string interning (which presumably includes synchronization).

Good question ... I will run some benchmarks to see if it's
worthwhile.

> > The problem is, this would be a sneaky API change and would for
> > example break anyone who gathers the Tokens into an array and
> > processes them later (eg maybe highlighter does?).
>
> I think there is probably quite a bit of code out there that needs to
> look at tokens in context (and thus does some sort of buffering).

Agreed, so we can't change the API. So the next/nextDirect proposal
should work well: it doesn't change the API yet would allow consumers
that don't require "full private copy" of every Token, like
DocumentsWriter, to have good performance.

> > Maybe one way to migrate to this would be to add a new method "Token
> > nextDirect()" to TokenStream which is allowed to return a Token that
> > may be recycled. This means callers of nextDirect() must make their
> > own copy of the Token if they intend to keep it for later usage. It
> > would default to "return next()" and I would then default "next()" to
> > call nextDirect() but make a full copy of what's
> > returned. DocumentsWriter would use this to step through the tokens.
> >
> > Analyzers would then implement either next() or nextDirect(). We
> > would fix all core analyzers to implemente nextDirect(), and then all
> > other analyzers would remain backwards compatible.
> >
> > From a caller's standpoint the only difference between nextDirect()
> > and next() is that next() guarantees to give you a "full private copy"
> > of the Token but nextDirect() does not.
> >
> > We could also punt on this entirely since it is always possible for
> > people to make their own analyzers that re-use Tokens, but I think
> > having decent "out of the box" performance with Lucene is important.
> > Ie, Lucene's defaults should be set up so that you get decent
> > performance without changing anything; you should not have to work
> > hard to get good performance and unfortunately today you still do....
>
> I'd be surprised if we can gain meaningful performance with token
> reuse in a typical full-text analysis chain (stemming, stop-words,
> etc). Have you done any measurements, or is it a hunch at this
> point?

Will test.

> > Hmmm ... is it too late to make all of Token's package protected attrs
> > private, so we require use of getters?
>
> They are package protected, which means that they are safe to change
> because we own all the code in that package.

OK, I was unsure whether this is considered an API change since users
can make their own classes in this package too. So I will make all of
them (including Payload) private.

> Another thing that would slow down slightly with termBuffer use is
> hash lookups (because Strings cache their hashCode). So an anlalyzer
> with a synonym filter and a stop filter would end up calculating that
> twice (or constructing a String). I don't know if it would be worth
> maintaining that in the Token or not...

Good point; I will benchmark before/after to assess...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


eksdev at yahoo

Jul 21, 2007, 9:23 AM

Post #9 of 24 (7854 views)
Permalink
Re: Token termBuffer issues [In reply to]

On 7/21/07, Michael McCandless <lucene [at] mikemccandless> wrote:
>> To further improve "out of the box" performance I would really also
>> like to fix the core analyzers, when possible, to re-use a single
>> Token instance for each Token they produce. This would then mean no
>> objects are created as you step through Tokens in the TokenStream
>> ... so this would give the best performance.

>How much better I wonder? Small object allocation & reclaiming is
>supposed to be very good in current JVMs.


Sorry, I cannot give you exact numbers now, but I know for sure that we decided to move the "real analysis" into a separate phase that runs before entering the Lucene TokenStream and indexing, precisely because of the String in Token, and then do just simple whitespace tokenisation during indexing. That was not done just for fun; there was real benefit in it.
The performance issue is in making transformations on tokens during analysis (where this applies). You gave a very nice example, stemming, which itself generates new Strings; another nice example is NGram generation in SpellChecker, which generates rather huge numbers of small objects.

The simplest model, tokenize (without modifying) then index, ironically also benefits from char[]: things go really fast in general, so the new String() on the way gets noticed in the profiler. While testing the new indexing code from Mike, we also changed our vanilla Tokenizer to use termBuffer and there was again something like a 10-15% boost.

It's been a while since then so I do not know the exact numbers, but I have learned this many times the hard way: nothing beats char[] when it comes to text processing.

To stop bothering you people: IMHO there is hard work to be done in the Analyzer chain before Token is ready for prime time in Lucene core, and that is where String overproduction hurts.









---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


eksdev at yahoo

Jul 21, 2007, 9:31 AM

Post #10 of 24 (7839 views)
Permalink
Re: Token termBuffer issues [In reply to]

As a side note, and maybe a good point to mention here: the clever people
from the mg4j project solved this polymorphic need for char[]/String (fast
modification, compactness, fast hashing, safety) using something called
MutableString; it is worth reading the javadoc there.





---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


DORONC at il

Jul 22, 2007, 8:41 PM

Post #11 of 24 (7847 views)
Permalink
Re: Token termBuffer issues [In reply to]

> Agreed, so we can't change the API. So the next/nextDirect proposal
> should work well: it doesn't change the API yet would allow consumers
> that don't require "full private copy" of every Token, like
> DocumentsWriter, to have good performance.

If we eventually go this way, my preferred API for reuse is
next(Token resToken)
where a non-null resToken reads as
"if supported, please use this object for the returned result"


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


lucene at mikemccandless

Jul 24, 2007, 11:40 AM

Post #12 of 24 (7844 views)
Permalink
Re: Token termBuffer issues [In reply to]

OK, I ran some benchmarks here.

The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
Wikipedia content using LowerCaseTokenizer + StopFilter +
PorterStemFilter. I think it's worth pursuing!

Here are the optimizations I tested:

* Change core analyzers to re-use a single Token instance and reuse
the char[] termBuffer (using a new method "boolean next(Token t)" so it's
backwards compatible); a rough sketch follows after this list.

* For the StopFilter I created a new helper class (CharArraySet) to
create a hash set that can key off of char[]'s without having to
new a String.

* Fix the analyzer to re-use the same tokenizer across documents &
fields (rather than new'ing one every time)
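
A very rough sketch of that first change (illustrative only, not the
committed CharTokenizer code; termBuffer() and setTermLength() are assumed
helpers, and offsets/buffer growth are omitted for brevity):

  // Fill the caller-supplied Token's char[] in place: no new String and
  // no new Token per term. Assumes t already carries a large-enough buffer.
  public boolean next(Token t) throws IOException {
    char[] buf = t.termBuffer();
    int len = 0;
    int c = input.read();
    while (c != -1 && Character.isWhitespace((char) c)) {
      c = input.read();                      // skip leading whitespace
    }
    if (c == -1) return false;               // end of stream
    while (c != -1 && !Character.isWhitespace((char) c)) {
      if (len < buf.length) {
        buf[len++] = Character.toLowerCase((char) c);
      }
      c = input.read();
    }
    t.setTermLength(len);
    return true;
  }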

I ran tests with "java -server -Xmx1024M", running on an Intel Core II
Duo box with Debian Linux 2.6.18 kernel and a RAID 5 IO system.

I index all text (every single term) in Wikipedia, pulling from a
single line file (I'm using the patch from LUCENE-947 that adds
line-file creation & indexing to contrib/benchmark).

First I create a single large file that has one doc per line from
Wikipedia content, using this alg:

docs.dir=enwiki
doc.maker=org.apache.lucene.benchmark.byTask.feeds.DirDocMaker
line.file.out=/lucene/wikifull.txt
doc.maker.forever=false
{WriteLineDoc()}: *

Resulting file is 8.4 GB and 3.2 million docs. Then I indexed it
using this alg:

analyzer=org.apache.lucene.analysis.LowercaseStopPorterAnalyzer
directory=FSDirectory
ram.flush.mb=64
max.field.length=2147483647
compound=false
max.buffered=70000
doc.add.log.step=5000
docs.file=/lucene/wikifull.txt
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
doc.tokenized=true
doc.maker.forever=false

ResetSystemErase
CreateIndex
{ "All"
{AddDoc}: *
}

CloseIndex

RepSumByPref All
RepSumByPref AddDoc

Resulting index is 2.2 GB.

The LowercaseStopPorterAnalyzer just looks like this:

  public class LowercaseStopPorterAnalyzer extends Analyzer {

    Tokenizer tokenizer;
    TokenStream stream;

    public final TokenStream tokenStream(String fieldName, Reader reader) {
      if (tokenizer == null) {
        tokenizer = new LowerCaseTokenizer(reader);
        stream = new PorterStemFilter(new StopFilter(tokenizer, StopAnalyzer.ENGLISH_STOP_WORDS));
      } else
        tokenizer.reset(reader);
      return stream;
    }
  }

I then record the elapsed time reported by the "All" task. I ran each
test twice and took the faster time:

JDK 5 Trunk 21 min 41 sec
JDK 5 New 18 min 54 sec
-> 12.8% faster

JDK 6 Trunk 21 min 43 sec
JDK 6 New 17 min 59 sec
-> 17.2% faster

It's rather odd that we see better gains in JDK 6 ... I had expected
the reverse (assuming GC performance is better in JDK 6 than JDK 5).

I also think it's quite cool that we can index all of Wikipedia in 18
minutes :) That works out to ~8 MB/sec.

I will open an issue...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


lucene at mikemccandless

Jul 24, 2007, 11:42 AM

Post #13 of 24 (7828 views)
Permalink
Re: Token termBuffer issues [In reply to]

"Doron Cohen" <DORONC [at] il> wrote:
> > Agreed, so we can't change the API. So the next/nextDirect proposal
> > should work well: it doesn't change the API yet would allow consumers
> > that don't require "full private copy" of every Token, like
> > DocumentsWriter, to have good performance.
>
> If we eventually go this way, my preferred API for reuse is
> next(Token resToken)
> where a non-null resToken reads as
> "if supported, please use this object for the returned result"

OK, I'm taking this approach right now in the benchmarks I just
tested. I added this method:

boolean next(Token resToken)

which returns true if it has updated resToken with another token,
else false if end-of-stream was hit.
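
A consumer like DocumentsWriter would then drive it roughly like this
(sketch only; the consumer-side call is hypothetical):

  Token scratch = new Token();
  while (stream.next(scratch)) {
    // scratch is only guaranteed valid until the next call,
    // so consume it immediately rather than holding a reference
    addPosition(fieldName, scratch);   // hypothetical consumer-side method
  }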

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


yonik at apache

Jul 24, 2007, 1:15 PM

Post #14 of 24 (7817 views)
Permalink
Re: Token termBuffer issues [In reply to]

On 7/24/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> OK, I ran some benchmarks here.
>
> The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
> Wikipedia content using LowerCaseTokenizer + StopFilter +
> PorterStemFilter. I think it's worth pursuing!

Did you try it w/o token reuse (reuse tokens only when mutating, not
when creating new tokens from the tokenizer)?
It would be interesting to see what's attributable to Token reuse only
(after core filters have been optimized to use the char[] setters,
etc).

We've had issues in the past regarding errors with filters dealing
with token properties:
1) filters creating a new token from an old token, but forgetting
to set positionIncrement
2) legacy filters losing "new" information such as payloads when
creating new tokens, because those properties didn't exist when the filter was written.

#1 is solved by token mutation because there are setters for the value
(before, the filter author was forced to create a new token, unless
they could access the package-private String).

#2 can now be solved by clone() (another relatively new addition)

So what new problems might crop up with token reuse?
- a filter reusing a token, but not zeroing out something new like
payloads because they didn't exist when the filter was authored (the
opposite problem from before)

Would a Token.clear() be needed for use by (primarily) tokenizers?

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


lucene at mikemccandless

Jul 24, 2007, 2:45 PM

Post #15 of 24 (7854 views)
Permalink
Re: Token termBuffer issues [In reply to]

"Yonik Seeley" <yonik [at] apache> wrote:
> On 7/24/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> > OK, I ran some benchmarks here.
> >
> > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
> > Wikipedia content using LowerCaseTokenizer + StopFilter +
> > PorterStemFilter. I think it's worth pursuing!
>
> Did you try it w/o token reuse (reuse tokens only when mutating, not
> when creating new tokens from the tokenizer)?

I haven't tried this variant yet. I guess for long filter chains the
GC cost of the tokenizer making the initial token should shrink as a
fraction of the overall time. Though I think we should still re-use the
initial token since it should (?) only help.

> It would be interesting to see what's attributable to Token reuse only
> (after core filters have been optimized to use the char[] setters,
> etc).

Good question; it could be the gains are mostly from switching to
char[] termBuffer and less so from Token instance re-use. Too many
tests to try :)

> We've had issues in the past regarding errors with filters dealing
> with token properties:
> 1) filters creating a new token from and old token, but forgetting
> about setting positionIncrement
> 2) legacy filters losing "new" information such as payloads when
> creating , because they didn't exist when the filter was written.
>
> #1 is solved by token mutation because there are setters for the value
> (before, the filter author was forced to create a new token, unless
> they could access the package-private String).

Ahhh, good!

> #2 can now be solved by clone() (another relatively new addition)
>
> So what new problems might crop up with token reuse?
> - a filter reusing a token, but not zeroing out something new like
> payloads because they didn't exist when the filter was authored (the
> opposite problem from before)
>
> Would a Token.clear() be needed for use by (primarily) tokenizers?

Hmm, good point; I like the clear() idea. I will add that.
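
Roughly something like this (sketch only, based on the fields discussed in
this thread; the committed version may differ):

  public void clear() {
    payload = null;          // don't leak the previous token's payload
    positionIncrement = 1;   // back to the default increment
    termText = null;
    termBufferLength = 0;    // keep the char[] itself for re-use
    // startOffset/endOffset/type are left for the tokenizer to overwrite
  }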

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


DORONC at il

Jul 24, 2007, 2:51 PM

Post #16 of 24 (7833 views)
Permalink
Re: Token termBuffer issues [In reply to]

"Michael McCandless" wrote:

> boolean next(Token resToken)
>
> which returns true if it has updated resToken with another token,
> else false if end-of-stream was hit.

I would actually prefer
Token next(Token resToken)
because:
- this way the API with reuse is very much like the one without reuse
(except for the reusable param), which is easier for app devs.
- it allows returning a useful result also in cases where the specific
implementation does not support reuse, or null was passed (indicating reuse
is not desired), or something in the data/state inhibited reuse.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


lucene at mikemccandless

Jul 24, 2007, 3:07 PM

Post #17 of 24 (7841 views)
Permalink
Re: Token termBuffer issues [In reply to]

"Doron Cohen" <DORONC [at] il> wrote:
> "Michael McCandless" wrote:
>
> > boolean next(Token resToken)
> >
> > which returns true if it has updated resToken with another token,
> > else false if end-of-stream was hit.
>
> I would actually prefer
> Token next(Token resToken)
> because:
> - this was the API with reuse is very much like the one without reuse
> (except for the reusable param) - easier for app dev.
> - it allows to return useful result also in cases where the specific
> implementation does not support reuse, or null was passed (indicating
> reuse
> is not desired), or something in the data/state inhibited reuse.

I see. So the callee is allowed to return a different token than
resToken? Ie passing "resToken" is just a suggestion that you may use
resToken to return your result but you are not required to. OK I'll
try this approach...
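
So, for example, a buffering filter could do something like this (sketch
only; "pending" is an assumed LinkedList field holding buffered output):

  public Token next(Token resToken) throws IOException {
    if (!pending.isEmpty()) {
      // buffered output: return our own Token and ignore the suggestion
      return (Token) pending.removeFirst();
    }
    // otherwise pass the reusable Token down the chain
    return input.next(resToken);
  }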

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


yonik at apache

Jul 24, 2007, 3:09 PM

Post #18 of 24 (7838 views)
Permalink
Re: Token termBuffer issues [In reply to]

On 7/24/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> "Yonik Seeley" <yonik [at] apache> wrote:
> > On 7/24/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> > > OK, I ran some benchmarks here.
> > >
> > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> > > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
> > > Wikipedia content using LowerCaseTokenizer + StopFilter +
> > > PorterStemFilter. I think it's worth pursuing!
> >
> > Did you try it w/o token reuse (reuse tokens only when mutating, not
> > when creating new tokens from the tokenizer)?
>
> I haven't tried this variant yet. I guess for long filter chains the
> GC cost of the tokenizer making the initial token should go down as
> overall part of the time. Though I think we should still re-use the
> initial token since it should (?) only help.

If it weren't any slower, that would be great... but I worry about
filters that need buffering (either on the input side or the output
side) and how that interacts with filters that try and reuse.

Filters that do output buffering could be slowed down if they must copy
the token state to the passed token. I like Doron's idea that a new
Token could be returned anyway.

The extra complexity seemingly involved in trying to make both
scenarios perform well is what prompts me to ask what the performance
gain will be.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


lucene at mikemccandless

Jul 25, 2007, 2:59 AM

Post #19 of 24 (7834 views)
Permalink
Re: Token termBuffer issues [In reply to]

"Yonik Seeley" <yonik [at] apache> wrote:
> On 7/24/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> > "Yonik Seeley" <yonik [at] apache> wrote:
> > > On 7/24/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> > > > OK, I ran some benchmarks here.
> > > >
> > > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> > > > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
> > > > Wikipedia content using LowerCaseTokenizer + StopFilter +
> > > > PorterStemFilter. I think it's worth pursuing!
> > >
> > > Did you try it w/o token reuse (reuse tokens only when mutating, not
> > > when creating new tokens from the tokenizer)?
> >
> > I haven't tried this variant yet. I guess for long filter chains the
> > GC cost of the tokenizer making the initial token should go down as
> > overall part of the time. Though I think we should still re-use the
> > initial token since it should (?) only help.
>
> If it weren't any slower, that would be great... but I worry about
> filters that need buffering (either on the input side or the output
> side) and how that interacts with filters that try and reuse.

OK I will tease out this effect & measure performance impact.

This would mean that the tokenizer must not only produce a new Token
instance for each term but also cannot re-use the underlying char[]
buffer in that token, right? EG with mods for CharTokenizer I re-use
its "char[] buffer" with every Token, but I'll change that to be a new
buffer for each Token for this test.

> Tokens that do output buffering could be slowed down if they must copy
> the token state to the passed token. I like Doron's idea that a new
> Token could be returned anyway.
>
> The extra complexity seemingly involved in trying to make both
> scenarios perform well is what prompts me to ask what the performance
> gain will be.

Yes I like Doron's idea too -- it's just a "suggestion" to use the
input Token if it's convenient.

I think the resulting API is fairly simple with this change: if you
(the consumer) want a "full private copy" of the Token (like
QueryParser, Highlighter, CachingTokenFilter, a filter that does input
buffering, etc.) you call input.next(). If instead you can
handle re-use because you will consume this Token once, right now, and
never look at it again (like DocumentsWriter), then you call the
next(Token) API.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


yonik at apache

Jul 25, 2007, 7:45 AM

Post #20 of 24 (7811 views)
Permalink
Re: Token termBuffer issues [In reply to]

On 7/25/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> "Yonik Seeley" <yonik [at] apache> wrote:
> > On 7/24/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> > > "Yonik Seeley" <yonik [at] apache> wrote:
> > > > On 7/24/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> > > > > OK, I ran some benchmarks here.
> > > > >
> > > > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> > > > > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
> > > > > Wikipedia content using LowerCaseTokenizer + StopFilter +
> > > > > PorterStemFilter. I think it's worth pursuing!
> > > >
> > > > Did you try it w/o token reuse (reuse tokens only when mutating, not
> > > > when creating new tokens from the tokenizer)?
> > >
> > > I haven't tried this variant yet. I guess for long filter chains the
> > > GC cost of the tokenizer making the initial token should go down as
> > > overall part of the time. Though I think we should still re-use the
> > > initial token since it should (?) only help.
> >
> > If it weren't any slower, that would be great... but I worry about
> > filters that need buffering (either on the input side or the output
> > side) and how that interacts with filters that try and reuse.
>
> OK I will tease out this effect & measure performance impact.
>
> This would mean that the tokenizer must not only produce new Token
> instance for each term but also cannot re-use the underlying char[]
> buffer in that token, right?

If the tokenizer can actually change the contents of the char[], then
yes, it seems like when next() is called rather than next(Token), a
new char[] would need to be allocated.

> EG with mods for CharTokenizer I re-use
> its "char[] buffer" with every Token, but I'll change that to be a new
> buffer for each Token for this test.

It's not just for a test, right? If next() is called, it can't reuse
the char[]. And there is no getting around the fact that some
tokenizers will need to call next() because of buffering.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


lucene at mikemccandless

Jul 25, 2007, 8:30 AM

Post #21 of 24 (7800 views)
Permalink
Re: Token termBuffer issues [In reply to]

"Yonik Seeley" <yonik [at] apache> wrote:
> On 7/25/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> > "Yonik Seeley" <yonik [at] apache> wrote:
> > > On 7/24/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> > > > "Yonik Seeley" <yonik [at] apache> wrote:
> > > > > On 7/24/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> > > > > > OK, I ran some benchmarks here.
> > > > > >
> > > > > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> > > > > > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
> > > > > > Wikipedia content using LowerCaseTokenizer + StopFilter +
> > > > > > PorterStemFilter. I think it's worth pursuing!
> > > > >
> > > > > Did you try it w/o token reuse (reuse tokens only when mutating, not
> > > > > when creating new tokens from the tokenizer)?
> > > >
> > > > I haven't tried this variant yet. I guess for long filter chains the
> > > > GC cost of the tokenizer making the initial token should go down as
> > > > overall part of the time. Though I think we should still re-use the
> > > > initial token since it should (?) only help.
> > >
> > > If it weren't any slower, that would be great... but I worry about
> > > filters that need buffering (either on the input side or the output
> > > side) and how that interacts with filters that try and reuse.
> >
> > OK I will tease out this effect & measure performance impact.
> >
> > This would mean that the tokenizer must not only produce new Token
> > instance for each term but also cannot re-use the underlying char[]
> > buffer in that token, right?
>
> If the tokenizer can actually change the contents of the char[], then
> yes, it seems like when next() is called rather than next(Token), a
> new char[] would need to be allocated.

Right. So I'm now testing "reuse all" vs "tokenizer makes a full copy
but filters get to re-use it".

> > EG with mods for CharTokenizer I re-use
> > its "char[] buffer" with every Token, but I'll change that to be a new
> > buffer for each Token for this test.
>
> It's not just for a test, right? If next() is called, it can't reuse
> the char[]. And there is no getting around the fact that some
> tokenizers will need to call next() because of buffering.

Correct -- the way I'm doing this now is in TokenStream.java I have a
default "Token next()" which calls "next(Token result)" but makes a
complete copy before returning it. This keeps full backwards
compatibility even in the case where a consumer wants a private copy
(calls next()) but the provider only provides the "re-use" API
(next(Token result)).
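
Roughly (sketch only; the committed method may differ in details):

  // Legacy entry point: delegate to the re-use API, then return a full
  // private copy so callers of next() keep today's ownership guarantees.
  public Token next() throws IOException {
    Token result = next(new Token());
    return result == null ? null : (Token) result.clone();
  }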

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


steven_parkes at esseff

Jul 26, 2007, 12:24 PM

Post #22 of 24 (7814 views)
Permalink
RE: Token termBuffer issues [In reply to]

> First I create a single large file that has one doc per line from
> Wikipedia content, using this alg

Anybody disagree that the 1-line-per-doc format is better (at least for
Wikipedia)? If so, I'll get rid of the intermediate one-file-per-doc
step.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


yonik at apache

Jul 26, 2007, 12:56 PM

Post #23 of 24 (7787 views)
Permalink
Re: Token termBuffer issues [In reply to]

On 7/26/07, Steven Parkes <steven_parkes [at] esseff> wrote:
> First I create a single large file that has one doc per line
> from
> Wikipedia content, using this alg
>
> Anybody disagree that the 1-line-per-doc format is better (at least for
> Wikipedia)? If so, I'll get rid of the intermediate one-file-per-doc
> step.

+1
opening a lot of files is an overhead, and something we aren't trying to test.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


lucene at mikemccandless

Jul 30, 2007, 5:46 AM

Post #24 of 24 (7754 views)
Permalink
Re: Token termBuffer issues [In reply to]

"Michael McCandless" <lucene [at] mikemccandless> wrote:
> "Yonik Seeley" <yonik [at] apache> wrote:
> > On 7/25/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> > > "Yonik Seeley" <yonik [at] apache> wrote:
> > > > On 7/24/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> > > > > "Yonik Seeley" <yonik [at] apache> wrote:
> > > > > > On 7/24/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> > > > > > > OK, I ran some benchmarks here.
> > > > > > >
> > > > > > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> > > > > > > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
> > > > > > > Wikipedia content using LowerCaseTokenizer + StopFilter +
> > > > > > > PorterStemFilter. I think it's worth pursuing!
> > > > > >
> > > > > > Did you try it w/o token reuse (reuse tokens only when mutating, not
> > > > > > when creating new tokens from the tokenizer)?
> > > > >
> > > > > I haven't tried this variant yet. I guess for long filter chains the
> > > > > GC cost of the tokenizer making the initial token should go down as
> > > > > overall part of the time. Though I think we should still re-use the
> > > > > initial token since it should (?) only help.
> > > >
> > > > If it weren't any slower, that would be great... but I worry about
> > > > filters that need buffering (either on the input side or the output
> > > > side) and how that interacts with filters that try and reuse.
> > >
> > > OK I will tease out this effect & measure performance impact.
> > >
> > > This would mean that the tokenizer must not only produce new Token
> > > instance for each term but also cannot re-use the underlying char[]
> > > buffer in that token, right?
> >
> > If the tokenizer can actually change the contents of the char[], then
> > yes, it seems like when next() is called rather than next(Token), a
> > new char[] would need to be allocated.
>
> Right. So I'm now testing "reuse all" vs "tokenizer makes a full copy
> but filters get to re-use it".

OK, I tested this case where CharTokenizer makes a new Token (and new
char[] array) for every token instead of re-using each. This way is
6% slower than fully re-using the Token (585 sec -> 618 sec) -- using
same test as described in
https://issues.apache.org/jira/browse/LUCENE-969.

> > > EG with mods for CharTokenizer I re-use
> > > its "char[] buffer" with every Token, but I'll change that to be a new
> > > buffer for each Token for this test.
> >
> > It's not just for a test, right? If next() is called, it can't reuse
> > the char[]. And there is no getting around the fact that some
> > tokenizers will need to call next() because of buffering.
>
> Correct -- the way I'm doing this now is in TokenStream.java I have a
> default "Token next()" which calls "next(Token result)" but makes a
> complete copy before returning it. This keeps full backwards
> compatiblity even in the case where a consumer wants a private copy
> (calls next()) but the provider only provides the "re-use" API
> (next(Token result)).
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
> For additional commands, e-mail: java-dev-help [at] lucene
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene
