
Mailing List Archive: Lucene: Java-Dev

optimize() method call

 

 



gsingers at apache

Apr 6, 2007, 3:53 PM

Post #1 of 8 (5777 views)
optimize() method call

I was looking at the javadocs for the optimize() call on IndexWriter
which contain a great amount of detail about what happens, but very
little guidance on when. I would like to add more on when. I
generally do optimize after I finish my indexing, which is pretty
straightforward to determine when one has a more or less static
collection. What isn't so clear to me, because I haven't dealt with it
much, is when optimize should be called in environments that are
frequently updated.

Here's what I have for text so far:
*
* <p>It is recommended that this method be called upon completion
* of indexing. In environments with frequent updates optimize is
* best FILL IN HERE
* </p>

Essentially, I am wondering what the best practices are for calling
optimize, especially in a frequent update environment. My gut
feeling is that it should just be scheduled to be done on a regular
basis, ideally when there is a lull. The docs allude to the fact
that search performance will be better, but has anyone quantified
it? The mergeFactor docs say that a smaller merge factor results in
faster searches on unoptimized indices (I presume that means faster
relative to higher merge factors, but still not as fast as
optimized, correct?). If it hasn't been quantified, maybe I will try
to whip up a benchmark for it.

So, do people in these types of environments typically schedule
optimize to occur at night or every few hours, or what? I know, "It
depends...", I am just wondering if there is a general consensus that
would be useful to pass along to readers.

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


rengels at ix

Apr 7, 2007, 12:00 AM

Post #2 of 8 (5615 views)
Re: optimize() method call [In reply to]

I think this is great, and it gave me an idea. What if another thread could call a "stop optimize" which would stop the optimize once it reached a consistent state (not in the middle of a segment merge)?

We schedule our optimizes for the "lull" time period, but with 24/7 operation this could be hard to find.

Being able to stop and then resume the optimize seems like a great idea.
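
The stop-and-resume idea above can be sketched as a cooperative flag that is checked only between merges. This is a toy model under stated assumptions, not Lucene's API: the class StoppableOptimize, its methods, and the representation of a segment as a plain document count are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model, not Lucene's API: a segment is just its document count.
class StoppableOptimize {
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);
    private final List<Integer> segments = new ArrayList<>();
    private final int mergeFactor;

    StoppableOptimize(int mergeFactor, int... segmentSizes) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        this.mergeFactor = mergeFactor;
        for (int size : segmentSizes) {
            segments.add(size);
        }
    }

    // Safe to call from another thread; takes effect before the next merge starts.
    void requestStop() {
        stopRequested.set(true);
    }

    // Merges until one segment remains or a stop is requested.
    // Returns true only if the index ended up as a single segment.
    boolean optimize() {
        while (segments.size() > 1) {
            if (stopRequested.get()) {
                return false; // we are between merges, so the segment list is consistent
            }
            mergeOnce();
        }
        return true;
    }

    // Clears the stop request and continues from the current segment list.
    boolean resume() {
        stopRequested.set(false);
        return optimize();
    }

    // The two accessors below are meant for the optimizing thread only.
    int segmentCount() {
        return segments.size();
    }

    int totalDocs() {
        int total = 0;
        for (int size : segments) {
            total += size;
        }
        return total;
    }

    // Merges up to mergeFactor of the newest segments into one.
    private void mergeOnce() {
        int count = Math.min(mergeFactor, segments.size());
        int merged = 0;
        for (int i = 0; i < count; i++) {
            merged += segments.remove(segments.size() - 1);
        }
        segments.add(merged);
    }

    public static void main(String[] args) {
        StoppableOptimize index = new StoppableOptimize(2, 10, 10, 10, 10, 10);
        index.requestStop(); // e.g. a shutdown arrives before the first merge
        System.out.println("finished: " + index.optimize() + ", segments: " + index.segmentCount());
        System.out.println("finished: " + index.resume() + ", segments: " + index.segmentCount());
    }
}
```

Checking the flag only between merges is what keeps the stop point consistent: a merge that has started always runs to completion, so a stop request can be delayed by at most one merge.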



otis_gospodnetic at yahoo

Apr 7, 2007, 10:01 PM

Post #3 of 8 (5605 views)
Re: optimize() method call [In reply to]

I'd advise against calling optimize() at all in an environment whose indices are constantly updated. That's what mergeFactor helps with. Keep it low, and Lucene itself will merge segments more often. If one still wants to call optimize(), you'd want to know how long it would take on an index of your size; if you've got enough lull time, do it, otherwise postpone it.

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
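
To make the mergeFactor point concrete, here is a minimal simulation of a simplified merge model. It assumes every flush creates one segment, and that whenever mergeFactor segments of the same level exist they are merged into one segment of the next level. The class and method names are hypothetical, and Lucene's real merge policy has more knobs than this.

```java
// Simplified model of level-based merging; not Lucene code.
class MergeFactorSim {
    // Returns how many segments exist after the given number of flushes.
    static int segmentsAfter(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        int[] perLevel = new int[32]; // enough levels for any int flush count
        for (int i = 0; i < flushes; i++) {
            perLevel[0]++; // a flush adds one level-0 segment
            // Cascade: mergeFactor segments on one level become one on the next.
            for (int level = 0; perLevel[level] == mergeFactor; level++) {
                perLevel[level] = 0;
                perLevel[level + 1]++;
            }
        }
        int total = 0;
        for (int count : perLevel) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("mergeFactor=10: " + segmentsAfter(999, 10) + " segments");
        System.out.println("mergeFactor=2:  " + segmentsAfter(999, 2) + " segments");
    }
}
```

In this model, 999 flushes leave 27 segments with a merge factor of 10 but only 8 with a merge factor of 2, which is the sense in which a low merge factor keeps a frequently updated index closer to optimized without an explicit optimize() call. The price is more merge work during indexing.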



cutting at apache

Apr 9, 2007, 11:18 AM

Post #4 of 8 (5602 views)
Re: optimize() method call [In reply to]

Otis Gospodnetic wrote:
> I'd advise against calling optimize() at all in an environment whose indices are constantly updated.

+1

Doug



adb at teamware

Apr 11, 2007, 6:41 PM

Post #5 of 8 (5604 views)
Re: optimize() method call [In reply to]

Robert Engels wrote:
> I think this is great, and it gave me an idea. What if another thread could
> call a "stop optimize" which would stop the optimize after it came to a
> consistent state (not in the middle of a segment merge).
>
> We schedule our optimizes for the "lull" time period, but with 24/7 operation
> this could be hard to find.
>
> Being able to stop and then resume the optimize seems like a great idea.

+1. It would be useful in shutdown cases where immediate shutdown is needed, or
to allow a scheduled backup to kick in at a fixed time, rather than having to
wait for optimize to complete. Or is there another way to interrupt optimize
safely?

Antony





grant.ingersoll at gmail

Apr 18, 2007, 1:29 PM

Post #6 of 8 (5560 views)
Re: optimize() method call [In reply to]

Has anyone done any benchmarking to approximate how long it takes to
optimize indexes of different sizes? Is the merging linear, sub-linear,
etc.?


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/





timmsc at aol

Apr 18, 2007, 4:54 PM

Post #7 of 8 (5555 views)
Re: optimize() method call [In reply to]

In a brief test I did, indexing 500K documents and optimizing every 10K
documents, I found that indexing time is constant (flat) and optimize()
time increases linearly.

-Sean
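
As a rough worked estimate of what that observation implies for total cost: assume, hypothetically, that each optimize() rewrites every document indexed so far at a cost of c per document. The symbols N (total documents), k (documents between optimizes), m, and c are introduced here for illustration and do not come from the test itself.

```latex
% m optimizes in total; the i-th one rewrites i*k documents.
m = \frac{N}{k} = \frac{500{,}000}{10{,}000} = 50
\qquad
T_{\text{optimize}} = \sum_{i=1}^{m} c \, i \, k
  = c \, k \, \frac{m(m+1)}{2}
  = c \cdot 10{,}000 \cdot 1275
  = 12{,}750{,}000 \, c
```

Under that assumption the cumulative optimize work is about 25.5 times the cost of a single optimize over the final 500K-document index, and it grows quadratically with the number of optimize calls even though each individual call is linear in the index size.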



steven_parkes at esseff

Apr 18, 2007, 5:43 PM

Post #8 of 8 (5597 views)
RE: optimize() method call [In reply to]

I think it can be greater than linear. It would be linear if optimize only
copied each segment into the result once. However, it will only merge
maxMerge segments at a time, so in some cases, some segment data is
going to be copied more than once. So something like O(n log n)?
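
The O(n log n) estimate can be checked with a small simulation that counts how many documents get copied when at most maxMerge segments are merged at a time. It is a sketch under simplifying assumptions (equal-sized starting segments, merges taken in FIFO order). The class and method names are hypothetical and this is not how Lucene's optimize is implemented.

```java
import java.util.ArrayDeque;

// Counts document copies for a hierarchical merge; a model, not Lucene code.
class OptimizeCostSim {
    static long docsCopied(int segments, int docsPerSegment, int maxMerge) {
        if (maxMerge < 2) {
            throw new IllegalArgumentException("maxMerge must be >= 2");
        }
        ArrayDeque<Long> queue = new ArrayDeque<>();
        for (int i = 0; i < segments; i++) {
            queue.add((long) docsPerSegment);
        }
        long copied = 0;
        while (queue.size() > 1) {
            int count = Math.min(maxMerge, queue.size());
            long merged = 0;
            for (int i = 0; i < count; i++) {
                merged += queue.poll(); // oldest segments are merged first
            }
            copied += merged;  // every document in the merge is written once
            queue.add(merged); // the result competes in later merges
        }
        return copied;
    }

    public static void main(String[] args) {
        System.out.println("1000 segments, 10 at a time: " + docsCopied(1000, 1, 10));
        System.out.println("1000 segments, one pass:     " + docsCopied(1000, 1, 1000));
    }
}
```

With 1000 single-document segments merged 10 at a time, each document is copied three times (3000 copies in total), while a single pass over all 1000 segments copies each document once. In this model the number of copies per document grows as the logarithm, base maxMerge, of the segment count, which is where the n log n shape comes from.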

