
Mailing List Archive: Lucene: Java-Dev

[jira] [Commented] (LUCENE-2357) Reduce transient RAM usage while merging by using packed ints array for docID re-mapping

 

 



jira at apache

May 28, 2012, 7:53 AM

Post #1 of 18 (220 views)

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284423#comment-13284423 ]

Michael McCandless commented on LUCENE-2357:
--------------------------------------------

Patch looks great!

Do we really need to implement DocMap.equals/hashCode? Costly equals methods scare me... can we simply throw UnsupportedOperationException (UOE) from these methods? Nobody should be calling them, I think.

Instead of looping through numDocs summing up the del count, I think you should be able to set numDeletes = reader.numDeletedDocs()? And maybe just consolidate those two build methods...



> Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
> ----------------------------------------------------------------------------------------
>
> Key: LUCENE-2357
> URL: https://issues.apache.org/jira/browse/LUCENE-2357
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Priority: Minor
> Fix For: 4.1
>
> Attachments: LUCENE-2357.patch
>
>
> We allocate this int[] to remap docIDs due to compaction of deleted ones.
> This uses a lot of RAM for large segment merges, and can fail to allocate due to fragmentation on 32-bit JREs.
> Now that we have packed ints, a simple fix would be to use a packed int array... and maybe instead of storing the absolute docID in the mapping, we could store the number of deleted docs seen so far (so the remap would do a lookup then a subtract). This may add some CPU cost to merging but should bring down transient RAM usage quite a bit.
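
The "lookup then a subtract" idea in the description can be sketched as follows. This is only an illustration, not the code from the patch: the class name DelCountDocMap is made up, a java.util.BitSet stands in for the segment's live docs, and a plain int[] stands in for the packed ints array so that the example runs without Lucene on the classpath.

```java
import java.util.BitSet;

/** Sketch: remap docIDs by storing the number of deleted docs seen so far. */
final class DelCountDocMap {
    // delsBefore[i] = number of deleted docs with an ID smaller than i.
    // A plain int[] is used here as a stand-in for a packed ints array.
    private final int[] delsBefore;
    private final BitSet liveDocs;

    DelCountDocMap(BitSet liveDocs, int maxDoc) {
        this.liveDocs = liveDocs;
        this.delsBefore = new int[maxDoc];
        int dels = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            delsBefore[doc] = dels;
            if (!liveDocs.get(doc)) {
                dels++;
            }
        }
    }

    /** Returns the compacted docID, or -1 if the doc is deleted. */
    int map(int docID) {
        // One lookup, then one subtraction.
        return liveDocs.get(docID) ? docID - delsBefore[docID] : -1;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet(6);
        live.set(0, 6);   // docs 0..5 start out live
        live.clear(1);    // doc 1 is deleted
        live.clear(4);    // doc 4 is deleted
        DelCountDocMap map = new DelCountDocMap(live, 6);
        for (int doc = 0; doc < 6; doc++) {
            System.out.println(doc + " -> " + map.map(doc));
        }
    }
}
```

The stored values never exceed the number of deleted docs, which is why a packed array can get away with fewer than 32 bits per entry when deletes are sparse.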

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

May 28, 2012, 8:35 AM

Post #2 of 18 (209 views)

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284437#comment-13284437 ]

Adrien Grand commented on LUCENE-2357:
--------------------------------------

I implemented equals only for testing purposes (see TestSegmentMerger.java) and then hashCode for consistency. I can move the equals code to the test case if you prefer.

Regarding numDeletedDocs, I tried to add the following assert
{code}
assert docCount == reader.reader.numDocs() : "docCount=" + docCount + ", numDocs=" + reader.reader.numDocs();
{code}
to line 321 of SegmentMerger (before applying the patch) and it fails across a large number of tests (try running TestAddIndexes a few times, for example, and at least one of the {{testWithPendingDeletes*}} tests should fail). There used to be an assert in SegmentMerger but it was removed in r1148938 (http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/SegmentMerger.java?r1=1147671&r2=1148938&pathrev=1148938&diff_format=h), so I assumed that {{numDeletedDocs()}} was unreliable and that the del count should be computed from {{liveDocs}}. I am not familiar enough with the merge process to know whether some invariants are broken or not. Should I open a bug?
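
As a rough illustration of counting deletes from the live docs instead of trusting a cached count, here is a minimal sketch. It is not the patch's code: java.util.BitSet stands in for the reader's live docs, and the names DelCount, countDeletes and cachedDelCount are hypothetical.

```java
import java.util.BitSet;

final class DelCount {
    /** Counts deleted docs by scanning the live-docs bits. */
    static int countDeletes(BitSet liveDocs, int maxDoc) {
        int numDeletes = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (!liveDocs.get(doc)) {
                numDeletes++;
            }
        }
        return numDeletes;
    }

    public static void main(String[] args) {
        int maxDoc = 8;
        BitSet liveDocs = new BitSet(maxDoc);
        liveDocs.set(0, maxDoc);   // all docs start out live
        liveDocs.clear(2);         // one delete
        int cachedDelCount = 1;    // hypothetical count recorded at this point
        liveDocs.clear(5);         // a delete that arrives afterwards
        int counted = countDeletes(liveDocs, maxDoc);
        System.out.println("cached=" + cachedDelCount + ", counted=" + counted);
    }
}
```

With a BitSet the same number is also available as maxDoc minus liveDocs.cardinality(); the explicit loop is shown only because it mirrors the per-document scan discussed in this thread.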



jira at apache

May 28, 2012, 10:35 AM

Post #3 of 18 (209 views)

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284490#comment-13284490 ]

Michael McCandless commented on LUCENE-2357:
--------------------------------------------

bq. I implemented equals only for testing purposes (see TestSegmentMerger.java) and then hashCode for consistency. I can move the equals code to the test case if you prefer.

Ahh, OK. Yeah I think move it to the test case? Thanks.

bq. There used to be an assert in SegmentMerger but it was removed in r1148938

Ahh you're right! Hmm, but, we actually know the accurate delCount higher up; let me tweak the patch a bit to pass this down, so we don't have to re-count it separately.

bq. so I assumed the numDeletedDocs() was unreliable and the del count should be computed from liveDocs. I am not familiar enough with the merge process to know whether some invariants are broken or not. Should I open a bug?

As far as I know, it's only unreliable in this context (a SegmentReader passed to SegmentMerger for merging); this is because we allow newly marked deleted docs to happen concurrently up until the moment we need to pass the SegmentReader (SR) instance to the merger (search for "// Must sync to ensure BufferedDeletesStream" in IndexWriter.java)... but it would be nice to fix that, so I think open a new issue (it won't block this one)? We should be able to make a new SR instance, sharing the same core as the current one but using the correct delCount...



jira at apache

May 28, 2012, 12:04 PM

Post #4 of 18 (205 views)

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284518#comment-13284518 ]

Dawid Weiss commented on LUCENE-2357:
-------------------------------------

bq. It would be cleaner (but I think hairier)

Is "cleaner hairier" code a new oxymoron? :)



jira at apache

May 28, 2012, 12:24 PM

Post #5 of 18 (207 views)

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284525#comment-13284525 ]

Michael McCandless commented on LUCENE-2357:
--------------------------------------------

bq. Is "cleaner hairier" code a new oxymoron? :)

I guess so!!

What I meant was... it would be best if the SegmentReader's numDeletedDocs were always correct, but fixing that in IndexWriter is somewhat tricky. I.e., the fix could be hairy but the end result ("SegmentReader.numDeletedDocs can always be trusted") would be cleaner...



jira at apache

May 28, 2012, 2:37 PM

Post #6 of 18 (206 views)

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284558#comment-13284558 ]

Adrien Grand commented on LUCENE-2357:
--------------------------------------

I just created LUCENE-4080 for the SegmentReader.numDeletedDocs() issue.



jira at apache

May 29, 2012, 3:47 AM

Post #7 of 18 (206 views)

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284718#comment-13284718 ]

Michael McCandless commented on LUCENE-2357:
--------------------------------------------

bq. I just created LUCENE-4080 for the SegmentReader.numDeletedDocs() issue.

Thanks Adrien.



jira at apache

May 29, 2012, 12:37 PM

Post #8 of 18 (210 views)

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285045#comment-13285045 ]

Michael McCandless commented on LUCENE-2357:
--------------------------------------------

Thanks Adrien, new patch looks great... I'll commit shortly.



jira at apache

Jun 1, 2012, 8:39 AM

Post #9 of 18 (205 views)

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287482#comment-13287482 ]

Otis Gospodnetic commented on LUCENE-2357:
------------------------------------------

Woho, I love seeing old issues getting love like this! :)
Has anyone measured (or at least eyeballed) how much RAM this saves?


> Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
> ----------------------------------------------------------------------------------------
>
> Key: LUCENE-2357
> URL: https://issues.apache.org/jira/browse/LUCENE-2357
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-2357.patch, LUCENE-2357.patch, LUCENE-2357.patch, LUCENE-2357.patch, LUCENE-2357.patch


jira at apache

Jun 1, 2012, 9:04 AM

Post #10 of 18 (206 views)

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287499#comment-13287499 ]

Adrien Grand commented on LUCENE-2357:
--------------------------------------

Hi Otis,

Before this change, each doc map used ~ {{maxDoc * 32}} bits, while they now use ~ {{maxDoc * lg(min(numDocs, numDeletedDocs))}} bits (where lg is the ceiling of the base-2 logarithm). So even in the worst case (numDocs = numDeletedDocs = maxDoc / 2), the improvement is {{(32 - lg(maxDoc / 2)) / 32 = (33 - lg(maxDoc)) / 32}}. On a segment with maxDoc=10000000, this is a 28% improvement. But the improvement is much better when the number of deleted documents is close to 0 or to maxDoc. For example, if your segment has maxDoc=10000000 and numDeletedDocs=100000, the improvement ({{(32 - lg(min(numDocs, numDeletedDocs))) / 32}}) is close to 50%. If numDeletedDocs=100, the improvement is close to 80%.
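
The arithmetic can be checked with a small stand-alone sketch that evaluates the general formula (32 - lg(min(numDocs, numDeletedDocs))) / 32. The class and method names are made up for this example, and lg is implemented as the ceiling of log2, as defined in the comment.

```java
final class DocMapSavings {
    /** Ceiling of log2(x) for x >= 1. */
    static int lg(long x) {
        return 64 - Long.numberOfLeadingZeros(x - 1);
    }

    /** Fraction of RAM saved compared to 32 bits per document. */
    static double savings(long maxDoc, long numDeletedDocs) {
        long numDocs = maxDoc - numDeletedDocs;
        // Guard against lg(0) when a segment has no deletes or no live docs.
        long smaller = Math.max(1, Math.min(numDocs, numDeletedDocs));
        return (32 - lg(smaller)) / 32.0;
    }

    public static void main(String[] args) {
        long maxDoc = 10_000_000L;
        for (long dels : new long[] {100L, 100_000L, maxDoc / 2}) {
            System.out.printf("numDeletedDocs=%d -> %.1f%% saved%n",
                    dels, 100 * savings(maxDoc, dels));
        }
    }
}
```

Evaluated this way, 100 deletes need 7 bits per value (about 78% saved), 100000 deletes need 17 bits (about 47% saved), and the half-deleted case needs 23 bits (9/32, about 28% saved).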



jira at apache

Jun 1, 2012, 9:16 AM

Post #11 of 18 (206 views)

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287508#comment-13287508 ]

Adrien Grand commented on LUCENE-2357:
--------------------------------------

This is the theoretical improvement. However, in order not to slow merging down too much, I instantiate the {{PackedInts.Mutable}} that holds the doc map with {{acceptableOverheadRatio=PackedInts.FAST=50%}} (see LUCENE-4062), so the actual improvement might be a little worse than the theoretical improvement. If you are more interested in memory usage than in merge speed, you could still reach the theoretical improvement by replacing {{PackedInts.FAST}} with {{PackedInts.COMPACT}} in {{MergeState.DocMap.build}}.
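
To give a feel for why an acceptable overhead ratio can cost memory, here is a toy model of the trade-off. It is an assumption made for illustration, not Lucene's actual selection logic: the class OverheadRatioSketch, the method chooseBits and the list of byte-aligned widths are all hypothetical, and only the 0% and 50% ratios are taken from this thread.

```java
final class OverheadRatioSketch {
    // Ratios quoted in this thread; defined locally, independent of Lucene.
    static final float COMPACT = 0f;
    static final float FAST = 0.5f;

    // Widths assumed (for this sketch only) to be cheaper to read and write.
    private static final int[] ALIGNED = {8, 16, 24, 32, 48, 64};

    /** Picks an aligned width if it fits within the allowed overhead. */
    static int chooseBits(int bitsRequired, float acceptableOverheadRatio) {
        int maxBits = bitsRequired + (int) (bitsRequired * acceptableOverheadRatio);
        for (int width : ALIGNED) {
            if (width >= bitsRequired && width <= maxBits) {
                return width;
            }
        }
        return bitsRequired; // no aligned width is affordable: pack tightly
    }

    public static void main(String[] args) {
        for (int bits : new int[] {7, 17, 23}) {
            System.out.println(bits + " bits required: COMPACT -> "
                    + chooseBits(bits, COMPACT) + ", FAST -> " + chooseBits(bits, FAST));
        }
    }
}
```

In this model, a 17-bit value is stored in 24 bits under the 50% ratio but stays at 17 bits under the 0% ratio, which is the kind of gap between theoretical and actual improvement described above.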



jira at apache

Jun 1, 2012, 9:20 AM

Post #12 of 18 (204 views)

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287510#comment-13287510 ]

Robert Muir commented on LUCENE-2357:
-------------------------------------

Do we have any high-level idea of what the performance cost of COMPACT vs FAST is for merging?
(e.g. typical case of Lucene40 codec). Is COMPACT maybe a good tradeoff?



jira at apache

Jun 1, 2012, 9:33 AM

Post #13 of 18 (206 views)

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287521#comment-13287521 ]

Adrien Grand commented on LUCENE-2357:
--------------------------------------

While working on LUCENE-4062, I had in mind that {{FAST}} (50%) would be OK for transient data structures, while {{COMPACT}} (0%) and {{DEFAULT}} (20%) would be better for big and long-lived structures, depending on the performance requirements. However, it is true that the DocMap might not be the bottleneck for merging (especially since this operation involves disk accesses). I can try to run some benchmarks next week to find out whether {{COMPACT}} (or maybe {{DEFAULT}}) could be a better tradeoff.



jira at apache

Jun 6, 2012, 3:01 PM

Post #14 of 18 (187 views)

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290477#comment-13290477 ]

Adrien Grand commented on LUCENE-2357:
--------------------------------------

I ran a quick test that indexes a few million documents with only one field (indexed, not stored, not analyzed, no term vectors, ...) with different ratios of deleted documents, RAM buffer sizes (between 1 and 50 MB) and merge factors (between 3 and 20). The global speedup with {{PackedInts.FAST}} was between 0.2% and 1.7% compared to {{PackedInts.COMPACT}} (although I ran this test on a low-end computer; other people might get slightly better results with the {{FAST}} version on a better machine). This is probably not worth the potential memory overhead. Would anyone object to replacing {{FAST}} with {{COMPACT}} for the docmaps instantiation?

> Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
> ----------------------------------------------------------------------------------------
>
> Key: LUCENE-2357
> URL: https://issues.apache.org/jira/browse/LUCENE-2357
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-2357.patch, LUCENE-2357.patch, LUCENE-2357.patch, LUCENE-2357.patch, LUCENE-2357.patch
>
>
> We allocate this int[] to remap docIDs due to compaction of deleted ones.
> This uses a lot of RAM for large segment merges, and can fail to allocate due to fragmentation on 32-bit JREs.
> Now that we have packed ints, a simple fix would be to use a packed int array... and maybe instead of storing abs docID in the mapping, we could store the number of del docs seen so far (so the remap would do a lookup then a subtract). This may add some CPU cost to merging but should bring down transient RAM usage quite a bit.
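
The idea in the description above (store the number of deleted docs seen so far, then remap with a lookup and a subtract) can be sketched in plain Java. This is a minimal illustration, not the committed patch: the class name, the boolean[] standing in for the segment's live docs, and the hand-rolled bit packing are all assumptions made to keep the example self-contained instead of depending on Lucene's packed ints classes.

```java
/** Sketch only: names and layout are hypothetical, not Lucene's DocMap. */
final class DelCountDocMap {
    private final boolean[] live;   // stand-in for the segment's live docs
    private final long[] blocks;    // delete counts, bitsPerValue bits each
    private final int bitsPerValue;

    private DelCountDocMap(boolean[] live, long[] blocks, int bitsPerValue) {
        this.live = live;
        this.blocks = blocks;
        this.bitsPerValue = bitsPerValue;
    }

    static DelCountDocMap build(boolean[] live) {
        int numDeletes = 0;
        for (boolean isLive : live) {
            if (!isLive) numDeletes++;
        }
        // The largest stored value is numDeletes, so this many bits suffice.
        int bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(numDeletes));
        long[] blocks = new long[(int) (((long) live.length * bits + 63) >>> 6)];
        int delsSoFar = 0;
        for (int doc = 0; doc < live.length; doc++) {
            if (!live[doc]) delsSoFar++;
            writePacked(blocks, bits, doc, delsSoFar);
        }
        return new DelCountDocMap(live.clone(), blocks, bits);
    }

    /** Returns the compacted docID, or -1 if the document is deleted. */
    int remap(int doc) {
        if (!live[doc]) return -1;
        // Lookup, then subtract: new ID = old ID - deletes seen so far.
        return doc - (int) readPacked(blocks, bitsPerValue, doc);
    }

    int bitsPerValue() {
        return bitsPerValue;
    }

    // Each slot is written once into a zeroed array, so OR-ing is enough.
    private static void writePacked(long[] blocks, int bits, int index, long value) {
        long bitPos = (long) index * bits;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] |= value << shift;
        if (shift + bits > 64) {
            // The value straddles two blocks; store the high bits in the next one.
            blocks[block + 1] |= value >>> (64 - shift);
        }
    }

    private static long readPacked(long[] blocks, int bits, int index) {
        long bitPos = (long) index * bits;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        if (shift + bits > 64) {
            value |= blocks[block + 1] << (64 - shift);
        }
        return value & ((1L << bits) - 1);
    }

    public static void main(String[] args) {
        boolean[] live = {true, false, true, true, false, true};
        DelCountDocMap map = build(live);
        for (int doc = 0; doc < live.length; doc++) {
            System.out.println(doc + " -> " + map.remap(doc));
        }
    }
}
```

Because the stored values are bounded by the number of deletes rather than by maxDoc, a segment with few deletions needs only a handful of bits per document, at the price of the extra subtract (and the bit unpacking) on every remap.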



jira at apache

Jun 7, 2012, 3:19 AM

Post #15 of 18 (169 views)
Permalink
[jira] [Commented] (LUCENE-2357) Reduce transient RAM usage while merging by using packed ints array for docID re-mapping [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290918#comment-13290918 ]

Michael McCandless commented on LUCENE-2357:
--------------------------------------------

I think changing to COMPACT is good....

> Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
> ----------------------------------------------------------------------------------------
>
> Key: LUCENE-2357
> URL: https://issues.apache.org/jira/browse/LUCENE-2357
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 4.0, 5.0
>
> Attachments: LUCENE-2357.patch, LUCENE-2357.patch, LUCENE-2357.patch, LUCENE-2357.patch, LUCENE-2357.patch
>
>
> We allocate this int[] to remap docIDs due to compaction of deleted ones.
> This uses a lot of RAM for large segment merges, and can fail to allocate due to fragmentation on 32-bit JREs.
> Now that we have packed ints, a simple fix would be to use a packed int array... and maybe instead of storing abs docID in the mapping, we could store the number of del docs seen so far (so the remap would do a lookup then a subtract). This may add some CPU cost to merging but should bring down transient RAM usage quite a bit.



jira at apache

Jun 8, 2012, 9:13 AM

Post #16 of 18 (169 views)
Permalink
[jira] [Commented] (LUCENE-2357) Reduce transient RAM usage while merging by using packed ints array for docID re-mapping [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291840#comment-13291840 ]

Adrien Grand commented on LUCENE-2357:
--------------------------------------

I am going to commit this change next week unless someone objects.



jira at apache

Jun 8, 2012, 1:28 PM

Post #17 of 18 (167 views)
Permalink
[jira] [Commented] (LUCENE-2357) Reduce transient RAM usage while merging by using packed ints array for docID re-mapping [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291967#comment-13291967 ]

Simon Willnauer commented on LUCENE-2357:
-----------------------------------------

s/(Adrien Grand via Mike McCandless)/(Adrien Grand)

otherwise +1



jira at apache

Jun 12, 2012, 3:51 AM

Post #18 of 18 (159 views)
Permalink
[jira] [Commented] (LUCENE-2357) Reduce transient RAM usage while merging by using packed ints array for docID re-mapping [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293504#comment-13293504 ]

Adrien Grand commented on LUCENE-2357:
--------------------------------------

Committed (r1349234 on trunk and r1349241 on branch 4.x).

