
Mailing List Archive: Lucene: Java-User

suppressing FreqProxPostingsArray

 

 



ken.mccracken at gmail

Mar 19, 2012, 12:29 PM

Post #1 of 3
suppressing FreqProxPostingsArray

Hi,

I am using lucene-3.5 and getting an OutOfMemoryError on a large indexing
task of 100M documents. Each document has 3 UUIDs as separate field
values. I am using Store.YES on 1 of them and Store.NO on the others, and
Index.NOT_ANALYZED_NO_NORMS on all three. I am also explicitly calling
field.setIndexOptions(IndexOptions.DOCS_ONLY); and setting
indexWriterConfig.setTermIndexInterval(termIndexInterval); to 1024.

Is there any reason FreqProxTermsWriterPerField.FreqProxPostingsArray needs
to be constructed even though I have the positions etc. suppressed? It
seems that the reason I get an OutOfMemoryError is that 7 int[] arrays,
each of size proportional to the number of unique terms, are being
constructed; however, at least some of them are probably wasteful given my
indexing configuration.

Any help is appreciated.

Thanks,
-Ken

[junit] Error:
[junit] Exception in thread "Thread-18" java.lang.OutOfMemoryError: Java heap space
[junit]     at org.apache.lucene.index.ParallelPostingsArray.<init>(ParallelPostingsArray.java:35)
[junit]     at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:190)
[junit]     at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:204)
[junit]     at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:48)
[junit]     at org.apache.lucene.index.TermsHashPerField.growParallelPostingsArray(TermsHashPerField.java:137)
[junit]     at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:440)
[junit]     at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:94)
[junit]     at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:278)


lucene at mikemccandless

Mar 19, 2012, 2:32 PM

Post #2 of 3
Re: suppressing FreqProxPostingsArray

Hmm, I agree we could be more RAM efficient if the field is DOCS_ONLY.

We shouldn't have to allocate/use docFreqs, lastDocCodes,
lastPositions arrays (3 of the 7); the others are still needed, I
think.
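
To put rough numbers on the array count discussed here, the sketch below
does the arithmetic in plain Java. It is only a back-of-the-envelope
estimate under stated assumptions: each parallel int[] holds one 4-byte slot
per buffered unique term, every UUID is a distinct term, and nothing is
flushed in between. The class and method names are hypothetical and are not
part of Lucene; term bytes and other indexing buffers are not counted.

```java
// Rough estimate only: counts the parallel int[] slots, nothing else.
final class PostingsRamEstimate {
    static final int BYTES_PER_INT = 4;

    // Bytes of int[] storage per buffered unique term for a given array count.
    static long bytesPerTerm(int parallelArrays) {
        return (long) parallelArrays * BYTES_PER_INT;
    }

    // Total bytes if uniqueTerms terms were buffered at once (assumes no flush).
    static long totalBytes(long uniqueTerms, int parallelArrays) {
        return uniqueTerms * bytesPerTerm(parallelArrays);
    }

    public static void main(String[] args) {
        long docs = 100_000_000L;   // figure from the original post
        int uuidsPerDoc = 3;        // assumption: each UUID is a distinct term
        long terms = docs * uuidsPerDoc;

        // All 7 arrays versus the 4 that remain if 3 could be skipped.
        System.out.println("7 arrays: " + totalBytes(terms, 7) + " bytes");
        System.out.println("4 arrays: " + totalBytes(terms, 4) + " bytes");
    }
}
```

Under these assumptions the saving from skipping 3 of the 7 arrays is 12
bytes per unique term, which matters only for the terms buffered between
flushes rather than for the whole index.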

But, that said, you shouldn't hit OOME, as long as your max heap size
is large enough (and your IndexWriterConfig's RAMBufferSizeMB is
small enough); Lucene should simply flush a new segment once the
buffered documents are using too much RAM.
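
The flush-on-RAM-budget behaviour described here can be pictured with a toy
model. The sketch below is not Lucene code and does not reflect how
IndexWriter accounts for memory internally; RamBudgetBuffer and its methods
are hypothetical names, and the per-document byte cost is simply passed in
by the caller. It only shows the idea that buffered documents are written
out as a new segment once a configured budget is reached, so heap use stays
bounded by the budget plus one document.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Lucene): flush a "segment" whenever the RAM budget is reached.
final class RamBudgetBuffer {
    private final long budgetBytes;
    private long usedBytes = 0;
    private int bufferedDocs = 0;
    // Number of documents in each flushed segment, in flush order.
    private final List<Integer> flushedSegmentSizes = new ArrayList<>();

    RamBudgetBuffer(long budgetBytes) {
        this.budgetBytes = budgetBytes;
    }

    // Buffer one document; flush if the budget is now met or exceeded.
    void addDocument(long docBytes) {
        usedBytes += docBytes;
        bufferedDocs++;
        if (usedBytes >= budgetBytes) {
            flush();
        }
    }

    // Record the buffered documents as a segment and reset the buffer.
    void flush() {
        if (bufferedDocs == 0) {
            return;
        }
        flushedSegmentSizes.add(bufferedDocs);
        usedBytes = 0;
        bufferedDocs = 0;
    }

    long usedBytes() {
        return usedBytes;
    }

    List<Integer> flushedSegmentSizes() {
        return flushedSegmentSizes;
    }

    public static void main(String[] args) {
        RamBudgetBuffer buffer = new RamBudgetBuffer(100);
        for (int i = 0; i < 10; i++) {
            buffer.addDocument(30);
        }
        System.out.println("segments: " + buffer.flushedSegmentSizes());
        System.out.println("still buffered: " + buffer.usedBytes() + " bytes");
    }
}
```

In this model the budget must be comfortably smaller than the available
heap, which mirrors the point above about RAMBufferSizeMB relative to the
max heap size.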

Hmm, and assuming you don't index massive documents. How many UUIDs per document?

Mike McCandless

http://blog.mikemccandless.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


ken.mccracken at gmail

Mar 20, 2012, 3:20 PM

Post #3 of 3
Re: suppressing FreqProxPostingsArray

Hi Mike,

Thanks for the response. We will do some more investigation. We will
look to see if there is a clean way to suppress at least the extra 3
array allocations.

Cheers,

-Ken

