
Mailing List Archive: Lucene: Java-Dev

Post mortem kudos for (LUCENE-843) :)

 

 



eksdev at yahoo

Jul 12, 2007, 10:17 AM

Post #1 of 18 (5201 views)
Post mortem kudos for (LUCENE-843) :)

Let the numbers speak,

INDEX SIZE: 58 million docs, 2.5 GB on disk
- two tokenized fields, each averaging 4 tokens (rather small), approx. 2 million unique tokens
- one binary stored field (one VInt)
- HW: commodity AMD PC, 2.8 GHz (or so), 2 GB RAM, single disk, Windows XP 64-bit, 32-bit JVM 6.0

Before LUCENE-843, indexing speed was 5-6k documents per second (and I believed this was already as fast as it gets).
After (yesterday's trunk version): 60-65k documents per second!
All (exhaustive!) tests pass on this index.

Settings: autocommit = false, 24 MB RAM buffer, and char[] instead of String for Token (this is why we split analysis into two phases, leaving only simple whitespace tokenization to the Lucene Analyzer).


Brilliant work, nothing more and nothing less!













---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org


markrmiller at gmail

Jul 12, 2007, 10:50 AM

Post #2 of 18 (5059 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

My results have been fantastic as well (though not as fantastic as yours
-- I have larger docs and use StandardAnalyzer). Mr. McCandless has been
on a tear this year. A Lucene machine.

Kudos as well,

- Mark



lucene at mikemccandless

Jul 12, 2007, 12:22 PM

Post #3 of 18 (5056 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

Thank you for the compliments, and thank you for being such early
adopter testers! I'm very glad you didn't hit any issues :)

> before LUCENE-843 indexing speed was 5-6k records per second (and I
> believed this was already as fast as it gets)
> after (trunk version yesterday) 60-65k documents per second! All
> (exhaustive!) tests pass on this index.

Wow, 10X speedup is even faster than my fastest results!

> autocommit = false, 24M RAMBuffer, using char[] instead of String
> for Token (this was the reason we separated Analysis in two phases,
> leaving for Lucene Analyzer only simple whitespace tokenization)

Looks like you're doing everything right to get fastest performance.

You can also re-use the Document & Field instances, and also the Token
instance in your analyzer and that should help more.
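
A minimal sketch of what reusing the token instance over a char[] can look like. CharToken and WhitespaceCharTokenizer are hypothetical stand-ins written for this illustration, not Lucene classes; the point is only that one token object is filled in again and again instead of allocating a new one (or a new String) per token.

```java
class TokenReuseDemo {
    /** Hypothetical stand-in for a reusable token: a window onto a shared char[]. */
    static final class CharToken {
        char[] buffer;   // shared input buffer, never copied
        int start;       // offset of the token's first character
        int length;      // number of characters in the token
    }

    /** Hypothetical whitespace tokenizer that fills a caller-supplied token. */
    static final class WhitespaceCharTokenizer {
        private char[] input;
        private int limit;
        private int pos;

        /** Points the tokenizer at new input without allocating anything. */
        void reset(char[] input, int limit) {
            this.input = input;
            this.limit = limit;
            this.pos = 0;
        }

        /** Fills the given token with the next token; returns false at end of input. */
        boolean next(CharToken token) {
            while (pos < limit && Character.isWhitespace(input[pos])) pos++;
            if (pos == limit) return false;
            int start = pos;
            while (pos < limit && !Character.isWhitespace(input[pos])) pos++;
            token.buffer = input;
            token.start = start;
            token.length = pos - start;
            return true;
        }
    }

    /** Tokenizes text with a single reused token and joins the tokens with '|'. */
    static String tokenize(String text) {
        char[] chars = text.toCharArray();
        WhitespaceCharTokenizer tokenizer = new WhitespaceCharTokenizer();
        tokenizer.reset(chars, chars.length);
        CharToken token = new CharToken();   // allocated once, reused for every token
        StringBuilder out = new StringBuilder();
        while (tokenizer.next(token)) {
            if (out.length() > 0) out.append('|');
            out.append(token.buffer, token.start, token.length);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("  post mortem   kudos "));
    }
}
```

The consumer has to be finished with a token before asking for the next one, since the same object is overwritten on every call.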

Was 24M (and not more) clearly the fastest performance?

Also note that you must work around LUCENE-845 (still open):

http://issues.apache.org/jira/browse/LUCENE-845

You should set your maxBufferedDocs to something "close to but always
above" how many docs actually get flushed after 24 MB RAM is full,
else you could spend way too much time merging. I'm working on
LUCENE-845 now but not yet sure when it will be resolved...

Mike



eksdev at yahoo

Jul 13, 2007, 1:33 AM

Post #4 of 18 (5056 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

Hi Mike,
> Was 24M (and not more) clearly the fastest performance?

No, 24M is close to the optimum. Adding memory up to 32M makes things slightly faster, with the maximum at 32M. Beyond that, things start getting slower (slowly).

We are not yet completely done with tuning, especially regarding the two tips you mentioned in this mail.
Fields are already reused, but:

1. Reusing Document: there is one new Vector() in there (and at these speeds, something like this makes a difference!!!)
in Document: List fields = new Vector(); (by the way, must this be a synchronized Vector? Why not ArrayList? Would it make any difference?)

2. Reusing Field: excuse my ignorance, but how can I do it? With Document it is easy:
luceneDocument.add(field)
luceneDocument.removeFields(name) // Wouldn't it be better to have luceneDocument.removeAllFields()?

3. "LUCENE-845": Whoops, I totally overlooked this one! And I am sure my maxBufferedDocs is well under what fits in 24 MB?!? Any good tip on how to determine a good number? Count the added docs and see how far this number goes before flush() triggers (how do I detect when a flush by RAM gets triggered?), and then add 10% to this number...



lucene at mikemccandless

Jul 13, 2007, 5:13 AM

Post #5 of 18 (5051 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

"eks dev" <eksdev[at]yahoo.co.uk> wrote:

> > Was 24M (and not more) clearly the fastest performance?
>
> No, this is kind of optimum. Throwing more memory up to 32M makes things
> slightly faster at slow rate, having maximum at 32. After that things
> start getting slower (slowly)

Interesting. This matches the experience Doron had where adding more
RAM actually slowed things down a bit (posted to
LUCENE-843).

> We are not yet completely done with tuning, especially with two tips
> you mentioned in this mail.
> Fields are already reused, but

Super.

> 1. Reusing Document, this is one new Vector() in there (and at these
> speeds, something like this makes difference!!!)
> in Document List fields = new Vector(); (by the way, must this be
> synchronized Vector? Why not ArrayList? Any difference from it)

Oh yeah, it would be good to not "new Vector()" every time.

What I did in the benchmarking for LUCENE-843 was make a single
Document, make my N fields (using my own class that implements
Fieldable but lets me change the value), add these fields to the
Document, and then hold onto the fields as local variables (textField,
titleField, idField, etc.).

Then for each doc I just set the field values
(textField.setValue(...), etc.) and then call writer.addDocument(doc).
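
The reuse pattern described above can be sketched as follows. ReusableField, Doc and RecordingWriter are hypothetical stand-ins for a custom Fieldable with a setter, Document and IndexWriter; they are not the Lucene classes, so treat this as an illustration of the loop structure only.

```java
import java.util.ArrayList;
import java.util.List;

class DocumentReuseSketch {
    /** Hypothetical stand-in for a custom Fieldable that exposes a setter. */
    static final class ReusableField {
        final String name;
        private String value;
        ReusableField(String name) { this.name = name; }
        void setValue(String value) { this.value = value; }
        String value() { return value; }
    }

    /** Hypothetical stand-in for Document: a field list that is built once. */
    static final class Doc {
        final List<ReusableField> fields = new ArrayList<>();
        void add(ReusableField field) { fields.add(field); }
    }

    /** Hypothetical stand-in for IndexWriter: it only records what it was given. */
    static final class RecordingWriter {
        final List<String> indexed = new ArrayList<>();
        void addDocument(Doc doc) {
            StringBuilder sb = new StringBuilder();
            for (ReusableField f : doc.fields) {
                if (sb.length() > 0) sb.append(", ");
                sb.append(f.name).append('=').append(f.value());
            }
            indexed.add(sb.toString());   // values are consumed before returning
        }
    }

    static List<String> indexAll(String[][] records) {
        // Built once, outside the loop: one document and its fields.
        Doc doc = new Doc();
        ReusableField idField = new ReusableField("id");
        ReusableField titleField = new ReusableField("title");
        doc.add(idField);
        doc.add(titleField);

        RecordingWriter writer = new RecordingWriter();
        for (String[] record : records) {
            // Per record: only the values change, nothing is allocated here.
            idField.setValue(record[0]);
            titleField.setValue(record[1]);
            writer.addDocument(doc);
        }
        return writer.indexed;
    }

    public static void main(String[] args) {
        System.out.println(indexAll(new String[][] {{"1", "kudos"}, {"2", "thanks"}}));
    }
}
```

The pattern only works if the writer has finished reading the field values by the time addDocument returns, and if each indexing thread owns its own document and fields.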

> 2. Reusing Field, excuse my ignorance, but how I can do it? with Document
> is easy with
> luceneDocument.add(field)
> luceneDocument.removeFields(name) //Wouldn't be better to have
> luceneDocument.removeAllFields()

Yeah it's not so easy now: Field.java does not have setters.

You have to make your own class that implements Fieldable (or
subclasses AbstractField) and adds your own setters. Field.java is
also [currently] final so you can't subclass it.

In the benchmarking code (see patch in
http://issues.apache.org/jira/browse/LUCENE-947) I created a
ReusableStringField that lets you setStringValue(...). You could use
that as your Field class.

Alternatively you can make a "ReusableStringReader" (there's one in
DocumentsWriter in the trunk now) and then use the normal Field class
but pass in your instance of ReusableStringReader. This approach
could be faster if you implemented it to use a char[] instead of a
String (the current one in DocumentsWriter reads a String).
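
For the char[] variant, a reusable reader could look roughly like the sketch below. It is not the ReusableStringReader from DocumentsWriter; the class name and the init(...) method are assumptions made for this example, and only java.io.Reader from the JDK is used.

```java
import java.io.Reader;

class ReusableCharArrayReader extends Reader {
    private char[] buf = new char[0];
    private int pos;
    private int end;

    /** Re-points this reader at new content; the array is used as is, not copied. */
    void init(char[] buf, int offset, int length) {
        this.buf = buf;
        this.pos = offset;
        this.end = offset + length;
    }

    @Override
    public int read(char[] dest, int off, int len) {
        if (len == 0) return 0;
        if (pos >= end) return -1;                 // end of the current content
        int n = Math.min(len, end - pos);
        System.arraycopy(buf, pos, dest, off, n);
        pos += n;
        return n;
    }

    @Override
    public void close() {
        // Nothing to release; the instance stays usable after another init(...).
    }

    /** Demo helper: drains the reader into a String using a small chunk buffer. */
    static String drain(ReusableCharArrayReader reader) {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[4];
        int n;
        while ((n = reader.read(chunk, 0, chunk.length)) != -1) {
            sb.append(chunk, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ReusableCharArrayReader reader = new ReusableCharArrayReader();
        char[] first = "first document".toCharArray();
        reader.init(first, 0, first.length);
        System.out.println(drain(reader));
        char[] second = "second one".toCharArray();
        reader.init(second, 0, second.length);     // same reader, new content
        System.out.println(drain(reader));
    }
}
```

One such instance per field would be handed to the Field once and re-initialized with the next document's characters before each addDocument call.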

> 3. "LUCENE-845" Whoops, I totally overlooked this one! And I am sure my
> maxBufferedDocs is well under what fits in 24Mb?!? Any good tip on how
> to determine good number: count added docs and see how far this number
> goes before flush() triggers (how I detect when flush by ram gets
> triggered?) and than add 10% to this number...

Whoa, OK.

First you need to figure out how many docs are "typically" getting
flushed at 24 MB. Easiest way would be to call
writer.setInfoStream(System.out) and look for the lines that say
"flush postings as segment XXX numDocs=YYY". Likely your YYY is
"fairly" close every time since your docs are so predictable in size.

Then, set your maxBufferedDocs anywhere above YYY and below 10 * YYY
and you shouldn't hit LUCENE-845 (actually 5.5 * YYY is best since it
is the midpoint of that range and gives you max safety margin). Note
that you should call setMaxBufferedDocs(...) first and then call
setRAMBufferSizeMB(...), in that order. If you do it backwards then
the writer will flush at exactly that number of buffered docs instead.
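
The recipe above can be sketched as a small helper that scans captured infoStream output and derives a maxBufferedDocs value. The regular expression assumes the log line looks exactly like the wording quoted above, which may not match the real output character for character; the class and method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MaxBufferedDocsFromLog {
    // Assumed shape of the flush line, taken from the wording in this thread.
    private static final Pattern FLUSH_LINE =
        Pattern.compile("flush postings as segment \\S+ numDocs=(\\d+)");

    /** Returns the largest numDocs (YYY) seen in the captured output, or -1 if none. */
    static int maxFlushedDocs(String infoStreamOutput) {
        int max = -1;
        Matcher m = FLUSH_LINE.matcher(infoStreamOutput);
        while (m.find()) {
            max = Math.max(max, Integer.parseInt(m.group(1)));
        }
        return max;
    }

    /** Midpoint of the window (YYY, 10 * YYY), i.e. 5.5 * YYY. */
    static int recommend(int flushedDocs) {
        return (int) Math.round(5.5 * flushedDocs);
    }

    public static void main(String[] args) {
        // Made-up sample lines, only to exercise the parser.
        String sample = "flush postings as segment _a numDocs=1000\n"
                      + "flush postings as segment _b numDocs=1020\n";
        int yyy = maxFlushedDocs(sample);
        System.out.println("YYY=" + yyy + " maxBufferedDocs=" + recommend(yyy));
        // The result would then be applied in the order given above:
        //   setMaxBufferedDocs(...) first, then the RAM buffer size.
    }
}
```

Taking the largest observed YYY keeps the setting above every flush seen so far; this relies on the flush sizes being close to each other, as expected for documents of predictable size.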

Mike



gsingers at apache

Jul 13, 2007, 5:59 AM

Post #6 of 18 (5057 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

This is good stuff... Might be good to put an organized version of
this up on the Wiki under Best Practices.

On Jul 13, 2007, at 8:13 AM, Michael McCandless wrote:

>
> Yeah it's not so easy now: Field.java does not have setters.
>
> You have to make your own class that implements Fieldable (or
> subclasses AbstractField) and adds your own setters. Field.java is
> also [currently] final so you can't subclass it.
>

Should we consider putting in these changes? I think it might be a
little weird on the Search side to have setters for Field and it
sounds like it could cause trouble for people esp. in a threaded
indexing situation, but maybe I am mistaken?

At any rate, it sounds like these would be good contributions as long
as they are well documented.


-Grant





eksdev at yahoo

Jul 13, 2007, 6:14 AM

Post #7 of 18 (5049 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

Thanks Mike!
Well, this worked well for me :)
logger.info("buffered documents: " + lineCntr + " Buffer size: " + ixWriter.getRAMBufferSizeMB() + "Mb ; from that currently in use: " + ixWriter.ramSizeInBytes());

So I now know the size per doc in the RAM buffer; the rest is easy, just * 5.5.
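
The arithmetic behind "size per doc, then * 5.5" can be written out as below. The helper names and the sample numbers are made up for illustration, and the estimate assumes RAM use grows roughly linearly with the number of buffered docs and that the sample is taken before any flush has happened.

```java
class MaxBufferedDocsFromRam {
    /**
     * Estimates how many docs fit in the RAM buffer, from one sample of
     * (docs buffered so far, bytes currently in use) taken before any flush.
     */
    static long docsPerFlush(long bufferedDocs, long ramBytesUsed, double ramBufferMB) {
        double bufferBytes = ramBufferMB * 1024 * 1024;
        // bufferBytes / (ramBytesUsed / bufferedDocs), rearranged to divide only once
        return (long) (bufferBytes * bufferedDocs / ramBytesUsed);
    }

    /** 5.5 times the estimated docs per flush, as suggested in the thread. */
    static long recommend(long docsPerFlush) {
        return Math.round(5.5 * docsPerFlush);
    }

    public static void main(String[] args) {
        // Made-up sample: 10,000 docs buffered while 1 MB of a 24 MB buffer is in use.
        long perFlush = docsPerFlush(10_000, 1_048_576, 24.0);
        System.out.println("docs per flush: " + perFlush);
        System.out.println("maxBufferedDocs: " + recommend(perFlush));
    }
}
```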

Fieldable, am I blind!

But anyhow, it also works with Field, as you said, using a ReusableCharArrayReader instance: you just have to "reinit" this instance and that's all.

Not tested yet with these changes (no more allocations now, except for the binary stored Field, which I am too lazy to change on a Friday afternoon, and a correctly set maxBufferedDocs).
I do not know why, but contrary to the logic, I would be surprised if it goes much faster than the last time; 60k/sec is simply really fast.

Whatever comes out, an "only 10X" indexing boost is something I can write home about :)

thanks again and nice weekend!



eksdev at yahoo

Jul 13, 2007, 7:27 AM

Post #8 of 18 (5058 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

>Interesting. This matches the experience Doron had where adding more
>RAM actually slowed things down a bit (posted to
>LUCENE-843).

I know this intrigues you, so here is our fresh experience:

The bigger the RAM buffer, the faster the indexing; this holds until you hit some limit that starts irritating gc(). But this limit is somehow "natural" and is given by the environment (available RAM, the competing OS file cache and who knows what else)... Basically, we concluded after testing that more memory means more speed (this is kind of ideal scaling, one more proof of the algorithmic strength).

This test showed 110k docs/second at 32 MB (which we found to be optimal for our needs, as it only slowly speeds up after that, to 123k docs/sec at 256 MB).

I suspect this phenomenon in our last test, and what Doron mentioned, was due to the wrong maxBufferedDocs. I have no other explanation.

Basically, we achieved an almost 20X speed-up just by having LUCENE-843 and your valuable comments on how to utilize this nice machine called Lucene.
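
One way to express the "optimal for our needs" choice is to take the smallest buffer whose throughput is within some tolerance of the best measured rate. The sketch below uses only the two data points reported above (32 MB and 256 MB); the helper name and the tolerance values are assumptions, not something stated in the thread.

```java
import java.util.Map;
import java.util.TreeMap;

class RamBufferChoice {
    /**
     * Returns the smallest buffer size (MB) whose measured rate is within
     * `tolerance` (a fraction, e.g. 0.15) of the best measured rate.
     */
    static int smallestGoodEnough(TreeMap<Integer, Double> docsPerSecByBufferMB, double tolerance) {
        double best = 0;
        for (double rate : docsPerSecByBufferMB.values()) {
            best = Math.max(best, rate);
        }
        // TreeMap iterates in ascending key order, so the first hit is the smallest size.
        for (Map.Entry<Integer, Double> e : docsPerSecByBufferMB.entrySet()) {
            if (e.getValue() >= best * (1.0 - tolerance)) {
                return e.getKey();
            }
        }
        throw new IllegalArgumentException("no measurements given");
    }

    public static void main(String[] args) {
        TreeMap<Integer, Double> measured = new TreeMap<>();
        measured.put(32, 110_000.0);    // docs/sec at 32 MB, from the post above
        measured.put(256, 123_000.0);   // docs/sec at 256 MB, from the post above
        System.out.println(smallestGoodEnough(measured, 0.15) + " MB");
    }
}
```

With a 15% tolerance the 32 MB setting is chosen, since 110k is within 15% of 123k; a stricter 5% tolerance would pick 256 MB instead.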










lucene at mikemccandless

Jul 13, 2007, 7:35 AM

Post #9 of 18 (5049 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

"Grant Ingersoll" <gsingers[at]apache.org> wrote:

> This is good stuff... Might be good to put a organized version of
> this up on the Wiki under Best Practices

I agree! I will update the ImproveIndexingSpeed page:

http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

with these suggestions.

> On Jul 13, 2007, at 8:13 AM, Michael McCandless wrote:
>
> > Yeah it's not so easy now: Field.java does not have setters.
> >
> > You have to make your own class that implements Fieldable (or
> > subclasses AbstractField) and adds your own setters. Field.java is
> > also [currently] final so you can't subclass it.
> >
>
> Should we consider putting in these changes? I think it might be a
> little weird on the Search side to have setters for Field and it
> sounds like it could cause trouble for people esp. in a threaded
> indexing situation, but maybe I am mistaken?

I think adding setters would be reasonable, if we document clearly
that they are advanced, be careful about threads, use at your own risk
sort of methods? Are there any concerns with that approach? If not
I'll open an issue and do it... this just makes it easier for people
to maximize indexing performance "out of the box".

Mike



peterlkeegan at gmail

Jul 13, 2007, 7:39 AM

Post #10 of 18 (5048 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

I see you have a 64-bit OS but 32-bit JVM. Have you tried a 64-bit JVM (with
a bigger RAM buffer)?

Peter




lucene at mikemccandless

Jul 13, 2007, 7:41 AM

Post #11 of 18 (5052 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

"eks dev" <eksdev[at]yahoo.co.uk> wrote:
> >Interesting. This matches the experience Doron had where adding more
> >RAM actually slowed things down a bit (posted to
> >LUCENE-843).
>
> I know this intrigues you, so our fresh experience:

Yes it does! Thanks :)

> The bigger the RAM Buffer the faster indexing, this holds until you hit
> some limit that starts irritating gc(). But this limit is somehow
> "natural" and is given by the environment (available RAM, competing OS
> File cache and who knows what else)... basically , we concluded after
> testing the more memory, more speed is to expect (this is kind of ideal
> scaling, one proof more of the algorithmic strength).

OK, this is a good datapoint. It sounds like one has to test in their
own environment to find the optimal performance / RAM usage tradeoff.

> This test showed 110k Docs/second at 32Mb (what we found to be optimal
> for our needs, as it slowly speeds-up after that to 123k Docs/sec at
> 256Mb)
>
> I suspect this phenomena on our last test and what Doron mentioned was
> due to the wrong maxBufferedDocs. Have no other explanation
>
> Basically, we achieved almost 20 X speed-up by just having LUCEN-843 and
> your valuable comments on how to utilize this nice machine called
> Lucene.

WOW! That's great :)

Mike



eksdev at yahoo

Jul 13, 2007, 7:48 AM

Post #12 of 18 (5055 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

No, we have only 2G on this machine, and a 32-bit JVM can utilize most of it (1.6G), leaving something to the OS; it is also generally faster due to shorter pointers and more available registers on 32-bit. Lucene is not memory hungry for *our setup* and works perfectly fine with a 32-bit JVM...

Actually, our experience with Lucene-based apps showed that it is often better to give more memory to the OS and less to the JVM so that the OS file cache can show its magic. Indexing works more than happily with -Xmx512m.





yonik at apache

Jul 13, 2007, 7:52 AM

Post #13 of 18 (5054 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

On 7/13/07, eks dev <eksdev[at]yahoo.co.uk> wrote:
> no, we have only 2G on this machine and 32bit jvm can utilize most of it (1.6G) leaving something to the OS, and is generally faster due to shorter pointers and more available registers on 32bit.

Less available registers in 32 bit mode :-)
Right about the larger pointers though... they can make things
slightly slower (cache effects, etc).

-Yonik



eksdev at yahoo

Jul 13, 2007, 7:56 AM

Post #14 of 18 (5056 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

>OK, this is a good datapoint. It sound like one has to test in their
>own environment to find the optimal performance / RAM usage tradeoff.

Absolutely! Especially when you are trying to find the max RAM for buffers, sooner or later you hit gc(), and this is 100% environment dependent and has nothing to do with Lucene; it would happen with anything.

Our problem in the first setup was that we reached the maximum much, much earlier, at 32 MB (so no gc() or other environment effects were kicking in; this had something to do with the wrongly set maxBufferedDocs, but I do not understand it well enough to say exactly what).








eksdev at yahoo

Jul 13, 2007, 8:04 AM

Post #15 of 18 (5049 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

Thanks for correcting me.
As always, when it gets low, dirty and hardwired, call Yonik to help :)

I do not know exactly why, but we tried it a while ago and all our Lucene apps were faster on a 32-bit JVM.

Would you expect it to be faster on a 64-bit JVM? Is it worth digging deeper?





yonik at apache

Jul 13, 2007, 8:10 AM

Post #16 of 18 (5057 views)
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

On 7/13/07, eks dev <eksdev[at]yahoo.co.uk> wrote:
> Would you expect it to be faster on 64bit jvm, is it worth digging deeper?
No.

It depends on the app. Certain apps do better with more registers,
but Java can often be a bit slower because it's so object heavy
(bigger pointers in every object mean that less fits in the L1
cache, etc).
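
A rough back-of-the-envelope sketch of the cache effect described above. The 64-byte cache line and the 32 KB L1 data cache are assumptions chosen for illustration, not figures from this thread, and real object layouts also contain headers and non-pointer fields.

```java
// Illustrative arithmetic only: how many object references fit in cache
// with 4-byte (32 bit) versus 8-byte (64 bit) pointers.
final class PointerFootprint {
    static final int CACHE_LINE_BYTES = 64;      // assumed cache line size
    static final int L1_DATA_BYTES = 32 * 1024;  // assumed L1 data cache size

    // Number of references that fit in a single cache line.
    static int refsPerCacheLine(int pointerBytes) {
        return CACHE_LINE_BYTES / pointerBytes;
    }

    // Number of references that fit in the whole (assumed) L1 data cache.
    static int refsInL1(int pointerBytes) {
        return L1_DATA_BYTES / pointerBytes;
    }

    public static void main(String[] args) {
        for (int pointerBytes : new int[] {4, 8}) {
            System.out.println(pointerBytes + "-byte pointers: "
                    + refsPerCacheLine(pointerBytes) + " per cache line, "
                    + refsInL1(pointerBytes) + " in L1");
        }
    }
}
```

Under these assumptions, doubling the pointer size halves the number of references that fit in each cache line and in L1 as a whole.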

I could see a smart JVM trying to use 32 bit pointers in 64 bit
mode... but I don't know if anyone has done it.

-Yonik



peterlkeegan at gmail

Jul 17, 2007, 12:17 PM

Post #17 of 18 (5028 views)
Permalink
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

I did some performance comparison testing of Lucene 2.0 vs. trunk (with
LUCENE-843). I'm seeing at least a 4X increase in indexing rate with the new
DocumentsWriter in LUCENE-843 (still doing single-threaded indexing). Better
yet, the total time to build the index is much shorter because I can now
build the entire 3GB index (900K docs) in one segment in RAM (using
FSDirectory) and flush it to disk at the end. Before, I had to build smaller
segments (20K docs), merge after 20 segments and then optimize at the end.
The memory usage with LUCENE-843 is much lower, presumably because stored
fields and term vectors no longer sit in RAM.

I also observed a 20-25% gain by reusing the Field objects. Implementing my
own Fieldable class was too complicated, so I simply extended the Field
class (after removing final) and added 2 setter methods:

    public void setValue(String value) {
        this.fieldsData = value;
    }
    public void setValue(byte[] value) {
        this.fieldsData = value;
    }

Since this improved performance significantly, I would vote to either add
setters to Field or make it extendable.
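
A minimal, self-contained sketch of the reuse pattern described above. `ReusableField` is a hypothetical stand-in written for this example, not Lucene's `Field` class, and the loop in `indexAll` only imitates a writer that consumes each value before the next one is set. The `instancesCreated` counter exists purely to make the saving visible.

```java
// Hypothetical stand-in for a field with setters; NOT Lucene's Field class.
final class ReusableField {
    static int instancesCreated = 0;  // demo-only allocation counter

    private final String name;
    private Object fieldsData;

    ReusableField(String name, String value) {
        this.name = name;
        this.fieldsData = value;
        instancesCreated++;
    }

    // The two setters from the post: swap the value in place.
    void setValue(String value) { this.fieldsData = value; }
    void setValue(byte[] value) { this.fieldsData = value; }

    String name() { return name; }
    Object value() { return fieldsData; }
}

final class FieldReuseSketch {
    // Imitates an indexing loop: one field instance serves every document.
    static java.util.List<String> indexAll(String[] bodies) {
        java.util.List<String> indexed = new java.util.ArrayList<String>();
        ReusableField body = new ReusableField("body", "");
        for (String text : bodies) {
            body.setValue(text);  // reuse instead of allocating a new field
            // Stand-in for the writer: the value is consumed immediately.
            indexed.add(body.name() + "=" + body.value());
        }
        return indexed;
    }

    public static void main(String[] args) {
        java.util.List<String> out = indexAll(new String[] {"first", "second", "third"});
        System.out.println(out + " using " + ReusableField.instancesCreated + " field instance(s)");
    }
}
```

The pattern relies on the consumer finishing with the value before `setValue` is called again, so a shared instance would presumably not be safe across indexing threads unless each thread keeps its own.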

Kudos to Mike for this huge improvement!

Peter

On 7/13/07, Michael McCandless <lucene[at]mikemccandless.com> wrote:
>
> "Grant Ingersoll" <gsingers[at]apache.org> wrote:
>
> > This is good stuff... Might be good to put an organized version of
> > this up on the Wiki under Best Practices
>
> I agree! I will update the ImproveIndexingSpeed page:
>
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>
> with these suggestions.
>
> > On Jul 13, 2007, at 8:13 AM, Michael McCandless wrote:
> >
> > > Yeah it's not so easy now: Field.java does not have setters.
> > >
> > > You have to make your own class that implements Fieldable (or
> > > subclasses AbstractField) and adds your own setters. Field.java is
> > > also [currently] final so you can't subclass it.
> > >
> >
> > Should we consider putting in these changes? I think it might be a
> > little weird on the Search side to have setters for Field and it
> > sounds like it could cause trouble for people esp. in a threaded
> > indexing situation, but maybe I am mistaken?
>
> I think adding setters would be reasonable, if we document clearly
> that they are "advanced, be careful about threads, use at your own
> risk" sort of methods. Are there any concerns with that approach? If not
> I'll open an issue and do it... this just makes it easier for people
> to maximize indexing performance "out of the box".
>
> Mike


lucene at mikemccandless

Jul 17, 2007, 3:52 PM

Post #18 of 18 (5028 views)
Permalink
Re: Post mortem kudos for (LUCENE-843) :) [In reply to]

"Peter Keegan" <peterlkeegan[at]gmail.com> wrote:
> I did some performance comparison testing of Lucene 2.0 vs. trunk (with
> LUCENE-843). I'm seeing at least a 4X increase in indexing rate with the new
> DocumentsWriter in LUCENE-843 (still doing single-threaded indexing). Better
> yet, the total time to build the index is much shorter because I can now
> build the entire 3GB index (900K docs) in one segment in RAM (using
> FSDirectory) and flush it to disk at the end. Before, I had to build smaller
> segments (20K docs), merge after 20 segments and then optimize at the end.

Awesome :)

> The memory usage with LUCENE-843 is much lower, presumably because stored
> fields and term vectors no longer sit in RAM.

Right, not buffering the stored fields & term vectors in RAM is a big
win. In addition, storing the postings in RAM as a single shared
hash table backed by a pool of large byte[] arrays, rather than in
separate 1 KB buffers for the files of a single-document segment,
also improves RAM efficiency.
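
To illustrate the pooling idea, here is a simplified sketch of carving small slices out of a few large shared byte[] blocks instead of allocating many small buffers. `SimpleBytePool`, its 32 KB block size, and its methods are hypothetical and are not the code in LUCENE-843.

```java
// Hypothetical, simplified byte pool; not the LUCENE-843 implementation.
final class SimpleBytePool {
    static final int BLOCK_SIZE = 32 * 1024;  // assumed block size

    private final java.util.List<byte[]> blocks = new java.util.ArrayList<byte[]>();
    private int upto = BLOCK_SIZE;  // write position in the current block; "full" forces the first allocation

    // Reserve `size` contiguous bytes and return their global offset in the pool.
    int allocate(int size) {
        if (size <= 0 || size > BLOCK_SIZE) {
            throw new IllegalArgumentException("size must be in 1.." + BLOCK_SIZE);
        }
        if (upto + size > BLOCK_SIZE) {
            // Current block cannot hold the slice: start a new large block.
            // Any leftover tail of the old block is simply wasted in this sketch.
            blocks.add(new byte[BLOCK_SIZE]);
            upto = 0;
        }
        int offset = (blocks.size() - 1) * BLOCK_SIZE + upto;
        upto += size;
        return offset;
    }

    void writeByte(int offset, byte b) {
        blocks.get(offset / BLOCK_SIZE)[offset % BLOCK_SIZE] = b;
    }

    byte readByte(int offset) {
        return blocks.get(offset / BLOCK_SIZE)[offset % BLOCK_SIZE];
    }

    int blockCount() {
        return blocks.size();
    }
}
```

Many small slices share one large array here, so the per-array overhead is paid once per block rather than once per small buffer.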

In my tests, using Europarl content with small docs (~100 terms = ~550
bytes per doc) and with stored fields & term vectors enabled, the RAM
efficiency is 44X better than before.

> I also observed a 20-25% gain by reusing the Field objects. Implementing my
> own Fieldable class was too complicated, so I simply extended the Field
> class (after removing final) and added 2 setter methods:
>
>     public void setValue(String value) {
>         this.fieldsData = value;
>     }
>     public void setValue(byte[] value) {
>         this.fieldsData = value;
>     }
>
> Since this improved performance significantly, I would vote to either add
> setters to Field or make it extendable.

OK I've opened LUCENE-963 for this & attached a patch.

> Kudos to Mike for this huge improvement!

Thanks!

Mike
