
Mailing List Archive: Lucene: Java-User

Sort runs out of memory

 

 



rbart at cs

May 17, 2012, 2:03 PM

Sort runs out of memory

Hi all,

I am running Lucene 3.6 in a system that indexes about 4 billion documents
across several indexes, and I'm hoping to get documents in order of a
certain NumericField.

I've tried using Lucene's Sort implementation, but it looks like it tries
to do the entire sort in memory by allocating a huge array with space for
every document in the index. On my index, this quickly runs out of memory.

Instead, I've switched to using several NumericRangeQueries to approximate
getting documents in a certain order. However, NumericRangeQueries seem
slow.

Are there any alternatives or better ways of getting documents in order of
a NumericField for a very large index?

--
Rob


te at statsbiblioteket

May 21, 2012, 12:54 AM

Re: Sort runs out of memory

On Thu, 2012-05-17 at 23:03 +0200, Robert Bart wrote:
> I am running Lucene 3.6 in a system that indexes about 4 billion documents
> across several indexes, and I'm hoping to get documents in order of a
> certain NumericField.

What is the maximum size of any single index, in terms of number of
documents? What is the type of the NumericField?

> I've tried using Lucene's Sort implementation, but it looks like it tries
> to do the entire sort in memory by allocating a huge array with space for
> every document in the index.

The FieldCache allocates an array of length #documents, of the same
type as your NumericField. The sort itself is of the sliding window
type, meaning that its memory use only grows linearly with the number
of documents wanted in the response. Do you require millions of
documents to be returned as part of a search?
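
To illustrate the sliding-window idea, here is a minimal sketch in plain
Java. It is not Lucene's collector code. It keeps a bounded heap of the n
best documents seen so far, so the extra memory grows with n rather than
with the index size. The class name TopNByValue is hypothetical, and the
values array is an assumption standing in for the FieldCache array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Lucene.
final class TopNByValue {

    /** Returns the doc ids of the n smallest values, in ascending value order. */
    static List<Integer> topN(int[] values, int n) {
        // Max-heap on value: the head is the worst of the n best docs so far.
        PriorityQueue<Integer> heap = new PriorityQueue<>(Math.max(1, n), (a, b) -> {
            int c = Integer.compare(values[b], values[a]);
            return c != 0 ? c : Integer.compare(b, a);
        });
        for (int doc = 0; doc < values.length; doc++) {
            if (heap.size() < n) {
                heap.add(doc);
            } else if (n > 0 && values[doc] < values[heap.peek()]) {
                // The new doc beats the current worst, so it replaces it.
                heap.poll();
                heap.add(doc);
            }
        }
        // Only the n survivors are sorted, never the whole index.
        List<Integer> result = new ArrayList<>(heap);
        result.sort((a, b) -> {
            int c = Integer.compare(values[a], values[b]);
            return c != 0 ? c : Integer.compare(a, b);
        });
        return result;
    }
}
```

The heap never holds more than n entries, which is why the per-document
value array, and not the sort step, dominates memory on a large index.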

Sanity check: You do specify the type when performing a sorted search,
right? If not, the values will be treated as Strings.

> On my index, this quickly runs out of memory.

Assuming that your largest index is 1B documents and that your
NumericField is of type Integer, the FieldCache's values for the sort
should take up 1B * 4 bytes = 4 GB. Are you hoping for less?
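
The same arithmetic can be sketched for the other primitive widths. This
is a rough estimate only: it counts the raw value array and ignores JVM
object overhead. The class and method names are hypothetical, and the 1B
document count is the assumption from above.

```java
// Hypothetical estimator, not part of Lucene.
final class FieldCacheEstimate {

    /** Raw size in bytes of one value array: one entry per document. */
    static long bytesNeeded(long numDocs, int bytesPerValue) {
        return Math.multiplyExact(numDocs, (long) bytesPerValue);
    }

    public static void main(String[] args) {
        long docs = 1_000_000_000L; // assumed size of the largest index
        System.out.println("byte:  " + bytesNeeded(docs, Byte.BYTES) + " bytes");
        System.out.println("short: " + bytesNeeded(docs, Short.BYTES) + " bytes");
        System.out.println("int:   " + bytesNeeded(docs, Integer.BYTES) + " bytes");
        System.out.println("long:  " + bytesNeeded(docs, Long.BYTES) + " bytes");
    }
}
```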

> Are there any alternatives or better ways of getting documents in order of
> a NumericField for a very large index?

Be sure to select the type of NumericField to be as small as possible.
If you have few unique sort values (e.g. 17, 80, 2000 and 5678), you
might map them down (to 0, 1, 2 and 3 for this example) and store them
as a byte.
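
A minimal sketch of that mapping in plain Java, assuming the full set of
sort values is known before indexing. OrdinalMapper and its methods are
hypothetical names. Because the dictionary is sorted, ordering documents
by ordinal gives the same order as ordering them by the original value.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene.
final class OrdinalMapper {

    /** Sorted distinct values; the array index of a value is its ordinal. */
    static int[] dictionary(int[] values) {
        return Arrays.stream(values).distinct().sorted().toArray();
    }

    /** Replaces every value by its ordinal, stored as a byte. */
    static byte[] toOrdinals(int[] values, int[] dict) {
        // Java bytes are signed, so stay within 0..127 to keep the order intact.
        if (dict.length > 128) {
            throw new IllegalArgumentException("too many unique values for a byte");
        }
        byte[] ords = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            int ord = Arrays.binarySearch(dict, values[i]);
            if (ord < 0) {
                throw new IllegalArgumentException("value not in dictionary: " + values[i]);
            }
            ords[i] = (byte) ord;
        }
        return ords;
    }
}
```

The ordinal would then be indexed in place of the original value, and
the dictionary kept around to translate results back.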

Currently Lucene only supports primitive types for numerics in the
FieldCache, so the smallest one is byte. It is possible to use only
ceil(log2(#unique_values)) bits/document, although that requires a bit
of custom coding.
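
As a rough idea of what that custom coding could look like, here is a
sketch of a bit-packed ordinal array in plain Java. It is an illustration
under my own assumptions, not an existing Lucene class: PackedOrdinals
and its methods are hypothetical, and hooking it into sorting would still
need a custom comparator.

```java
// Hypothetical bit-packed array, not part of Lucene.
final class PackedOrdinals {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedOrdinals(int size, int uniqueValues) {
        bitsPerValue = bitsRequired(uniqueValues);
        mask = (1L << bitsPerValue) - 1;
        blocks = new long[(int) (((long) size * bitsPerValue + 63) / 64)];
    }

    /** ceil(log2(uniqueValues)), with a minimum of 1 bit. */
    static int bitsRequired(int uniqueValues) {
        if (uniqueValues < 1) {
            throw new IllegalArgumentException("need at least one value");
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(uniqueValues - 1));
    }

    void set(int index, int ordinal) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = ordinal & mask;
        blocks[block] = (blocks[block] & ~(mask << shift)) | (value << shift);
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            // The value straddles two blocks: store its high bits in the next one.
            long highMask = (1L << spill) - 1;
            blocks[block + 1] = (blocks[block + 1] & ~highMask)
                    | (value >>> (bitsPerValue - spill));
        }
    }

    int get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            // Fetch the high bits from the next block.
            value |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return (int) (value & mask);
    }
}
```

With 4 unique values this needs 2 bits/document, so 1B documents would
take roughly 250 MB instead of 1 GB as bytes.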

Regards,
Toke Eskildsen


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


goksron at gmail

May 23, 2012, 3:27 PM

Re: Sort runs out of memory

The Trie type can be tuned for range queries vs. single queries. This
seems to be explained in an email and nowhere else:

http://www.lucidimagination.com/search/document/c501f59515a9eece

--
Lance Norskog
goksron [at] gmail
