
Mailing List Archive: Lucene: Java-User

Flush by RAM size question...

 

 



erickerickson at gmail

Jan 20, 2008, 6:48 AM


About flush by RAM

I was playing around with something similar (roll-my-own) on the 2.1
codebase and had the quirk of a possible *very* large incoming document.
As in 250M. So I had to put in some logic to say, in effect, "if the
incoming doc is completely ridiculous, flush now". I should say that I was
impressed that we could even index the bloody thing at all!

Is this something that still needs to be guarded against in 2.3? In other
words, should the flush size be chosen so that (current RAM size + the
increment caused by the biggest doc possible in your data set) stays below
the threshold?

You see the problem here. In the silliest case, where I have one HUGE
document that barely fit in memory, I'd have to set the threshold very low,
flushing early and often, unless there was a "flush now" bit of logic for
silly docs.
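
The "flush now" guard described above might look roughly like the
following. This is a toy model in plain Java, not Lucene code: the class
OversizedDocGuard, the hugeDocMb cutoff, and the per-document size
estimates are all hypothetical, and flush() only records what it would
have written out.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy RAM-buffered indexer with a guard for outlier documents (hypothetical). */
class OversizedDocGuard {
    private final long flushThresholdMb; // normal flush-by-RAM limit
    private final long hugeDocMb;        // assumed cutoff for a "ridiculous" doc
    private long bufferedMb = 0;
    private final List<String> flushLog = new ArrayList<>();

    OversizedDocGuard(long flushThresholdMb, long hugeDocMb) {
        this.flushThresholdMb = flushThresholdMb;
        this.hugeDocMb = hugeDocMb;
    }

    void addDocument(long estimatedDocMb) {
        // Guard: empty the buffer before taking on an outlier, so the peak
        // is the outlier alone rather than outlier plus pending docs.
        if (estimatedDocMb >= hugeDocMb && bufferedMb > 0) {
            flush("pre-huge-doc");
        }
        bufferedMb += estimatedDocMb;
        // Regular flush-by-RAM check, done after the doc is buffered.
        if (bufferedMb > flushThresholdMb) {
            flush("threshold");
        }
    }

    private void flush(String reason) {
        flushLog.add(reason + ":" + bufferedMb);
        bufferedMb = 0;
    }

    List<String> getFlushLog() { return flushLog; }

    long getBufferedMb() { return bufferedMb; }
}
```

With a guard like this the normal threshold can stay at a size that suits
the typical docs; only the rare monster forces an early flush.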

If you must know, the huge doc was the 23 volume "encyclopedia of Michigan
Civil War Volunteers". Yeah, yeah, sure. I could have done other things
than index it as a single doc, but since indexing speed wasn't really an
issue, and the PM wanted it that way, and all it meant was that indexing
took 6 hours rather than, perhaps, 4 on a static data set, I didn't care
enough to do more work.

Don't get me wrong, having a flush by RAM size is sweet. And for any
reasonable corpus, especially one with relatively constant input docs, it
should be very nice indeed. I'm wondering about the outlier cases since I
seem to run into them, siiiggghh. But "that's why they pay me the big
bucks" <G>.

Thanks
Erick


lucene at mikemccandless

Jan 20, 2008, 8:59 AM


Hi Erick,

Yes, you do still need to guard against this case in 2.3. IndexWriter
checks the RAM usage after each doc is processed and flushes when
that's over the limit.

However, the memory consumed by a very large doc should be quite a bit
less than before, because in 2.3 IndexWriter makes more efficient
use of RAM.
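
To make the consequence concrete, here is a small model of a limit that is
checked only after each document has been buffered. It is a sketch under
stated assumptions, not IndexWriter's actual accounting:
FlushAfterDocModel, peakBufferedMb and maxSafeLimitMb are made-up names,
and each doc's RAM increment is assumed to be known up front in MB. (In
2.3 the real limit is set with IndexWriter.setRAMBufferSizeMB.)

```java
/** Model of flush-by-RAM where the limit is checked after each doc (hypothetical). */
class FlushAfterDocModel {

    /** Peak buffered size when the check runs only after a doc is fully buffered. */
    static long peakBufferedMb(long limitMb, long[] docIncrementsMb) {
        long buffered = 0;
        long peak = 0;
        for (long increment : docIncrementsMb) {
            buffered += increment;            // whole doc is in RAM before any check
            peak = Math.max(peak, buffered);
            if (buffered > limitMb) {
                buffered = 0;                 // flush once usage is over the limit
            }
        }
        return peak;
    }

    /**
     * Largest flush limit that keeps the peak within a RAM budget, given the
     * biggest single-doc increment expected in the data set. Returns 0 when
     * the biggest doc alone does not fit in the budget.
     */
    static long maxSafeLimitMb(long ramBudgetMb, long largestDocIncrementMb) {
        return Math.max(0, ramBudgetMb - largestDocIncrementMb);
    }
}
```

In this model the peak can exceed the limit by up to the largest single
increment, which is why the outlier still has to be budgeted for or
guarded against separately.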

Mike



erickerickson at gmail

Jan 20, 2008, 4:36 PM


Michael:

Thanks, that's what I figured, but it's nice to have confirmed.

Erick

