Mailing List Archive: Lucene: Java-Dev

[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)



jira at apache

Jul 11, 2012, 6:48 AM

Post #1 of 12 (174 views)
Permalink
[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411495#comment-13411495 ]

Han Jiang edited comment on LUCENE-3892 at 7/11/12 1:47 PM:
------------------------------------------------------------

bq. Actually I think you want them to be suppressed, so that the original exception is seen?

That wasn't my intent, actually. I think an exception from out.close() should be thrown, shouldn't it? closeWhileHandlingException() would suppress such exceptions, so in this patch I use out.close() instead of IOUtils.closeWhileHandlingException().
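
To make the distinction concrete, here is a minimal sketch of the two closing strategies being discussed. The class and helper names (CloseSketch, closeSuppressing, finish) are hypothetical and only mimic the general idea; this is not the code of Lucene's IOUtils or of the patch.

```java
import java.io.Closeable;
import java.io.IOException;

/** Sketch only: names are hypothetical, this is not Lucene's IOUtils. */
final class CloseSketch {

  /** Closes c and swallows any IOException, so an earlier failure is not masked. */
  static void closeSuppressing(Closeable c) {
    try {
      if (c != null) {
        c.close();
      }
    } catch (IOException ignored) {
      // Intentionally dropped: the caller is already handling another exception.
    }
  }

  /**
   * Closes out after a write attempt.
   * On success, a close() failure is the only error, so it propagates.
   * On failure, the close() error is suppressed in favor of the original one.
   */
  static void finish(Closeable out, boolean success) throws IOException {
    if (success) {
      out.close();
    } else {
      closeSuppressing(out);
    }
  }
}
```

On the success path the plain close() lets a failed flush surface to the caller; the suppressing variant only makes sense when another exception is already in flight.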

was (Author: billy):
bq. Actually I think you want them to be suppressed, so that the original exception is seen?

Not my idea actually, I think the exception should be thrown for out.close()? closeWhileHandlingException() will suppress those exceptions.

> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-3892
> URL: https://issues.apache.org/jira/browse/LUCENE-3892
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.1
>
> Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-for&pfor.patch, LUCENE-3892-handle_open_files.patch, LUCENE-3892_for.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Jul 13, 2012, 9:01 AM

Post #2 of 12 (165 views)
Permalink
[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413781#comment-13413781 ]

Han Jiang edited comment on LUCENE-3892 at 7/13/12 4:00 PM:
------------------------------------------------------------

This patch cuts the extra header and merges numBytes into the header. Also, it stores only one int when a whole block shares the same value. So no matter which strategy we use (d-gap, or d-gap minus 1), it will work well.

And here are the changes between different methods, postingsFormat=PFor:
{noformat}
method            base  all_0s  all_1s  all_Vs  all_Vs+header_cut
index time(s):     324     315     279     279                291
index size(MB):    591     591     589     589                577
{noformat}

postingsFormat=For:
{noformat}
method            base  all_Vs+header_cut
index time(s):     250                251
index size(MB):    611                598
{noformat}

where base refers to the version in which numBits==0 isn't supported, all_0s refers to the last patch, and all_Vs+header_cut refers to this patch. As for PFor, the index size is now almost equal to Lucene40 (590.7M vs 590.0M).
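
The two ideas in this patch (one merged header int, and a single stored value for a constant block) can be sketched roughly as below. The bit layout, the 6-bit width for numBits, and all names are assumptions made for illustration; the actual patch may lay the header out differently.

```java
/** Sketch only: the header layout and names are assumptions, not the patch's format. */
final class BlockHeaderSketch {
  private static final int NUM_BITS_WIDTH = 6; // enough room for numBits in 0..32
  private static final int NUM_BITS_MASK = (1 << NUM_BITS_WIDTH) - 1;

  /** Packs numBits into the low bits and numBytes into the remaining bits of one int. */
  static int encodeHeader(int numBits, int numBytes) {
    return (numBytes << NUM_BITS_WIDTH) | numBits;
  }

  static int numBits(int header) {
    return header & NUM_BITS_MASK;
  }

  static int numBytes(int header) {
    return header >>> NUM_BITS_WIDTH;
  }

  /** True when every value in the block is identical, so one int can represent it. */
  static boolean allEqual(int[] block) {
    for (int i = 1; i < block.length; i++) {
      if (block[i] != block[0]) {
        return false;
      }
    }
    return true;
  }

  /** Gaps between sorted docIDs; pass minus=1 for the "d-gap minus 1" strategy. */
  static int[] dGaps(int[] docIDs, int previousDocID, int minus) {
    int[] gaps = new int[docIDs.length];
    int prev = previousDocID;
    for (int i = 0; i < docIDs.length; i++) {
      gaps[i] = docIDs[i] - prev - minus;
      prev = docIDs[i];
    }
    return gaps;
  }
}
```

A run of consecutive docIDs gives a block of all 1s with plain d-gaps and all 0s with d-gap minus 1; either way allEqual() is true, which is why the all-values-the-same case covers both strategies.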

was (Author: billy):
This patch cut the extra header and merge numBytes into header. Also, it store only one int when a whole block share the same value. So no matter which strategy we use (d-gap, or d-gap minus 1), it will work well.

And here are the changes between different methods:
{noformat}
method raw all_0s all_1s all_Vs all_Vs+header_cut
index time(s): 324 315 279 279 291
index size(MB): 591 591 589 589 577
{noformat}
where raw refers to the version when numBits==0 isn't supported, all_0s refers to last patch, all_Vs+header_cut refers to this patch

> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-3892
> URL: https://issues.apache.org/jira/browse/LUCENE-3892
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.1
>
> Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-for&pfor.patch, LUCENE-3892-handle_open_files.patch, LUCENE-3892_for.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


jira at apache

Jul 20, 2012, 6:54 AM

Post #3 of 12 (161 views)
Permalink
[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419144#comment-13419144 ]

Han Jiang edited comment on LUCENE-3892 at 7/20/12 1:52 PM:
------------------------------------------------------------

An initial try with PackedInts in the current trunk version. I replaced all the int[] buffers with long[] buffers so we can use the API directly. I don't quite understand the Writer part yet, so for now each long value is saved one by one.

However, it is the Reader part we are concerned with:
{noformat}
Task            QPS base  StdDev base  QPS packed  StdDev packed  Pct diff
AndHighHigh        29.60         1.56       23.78           0.51  -25% - -13%
AndHighMed         74.68         3.92       53.15           2.31  -35% - -21%
Fuzzy1             88.23         1.21       87.13           1.41   -4% -   1%
Fuzzy2             30.09         0.45       29.47           0.47   -5% -   1%
IntNRQ             41.96         3.88       38.16           2.48  -22% -   6%
OrHighHigh         17.56         0.34       15.45           0.15  -14% -  -9%
OrHighMed          34.71         0.76       30.77           0.53  -14% -  -7%
PKLookup          111.00         1.90      110.52           1.59   -3% -   2%
Phrase              9.03         0.23        7.62           0.41  -22% -  -8%
Prefix3           123.56         8.42      110.94           5.43  -20% -   1%
Respell           102.37         1.11      101.79           1.38   -2% -   1%
SloppyPhrase        3.97         0.19        3.52           0.07  -17% -  -4%
SpanNear            8.24         0.18        7.22           0.25  -17% -  -7%
Term               45.16         3.15       37.47           2.32  -27% -  -5%
TermBGroup1M       17.19         1.09       15.86           0.77  -17% -   3%
TermBGroup1M1P     23.47         1.66       20.43           1.16  -23% -  -1%
TermGroup1M        19.20         1.14       17.73           0.84  -16% -   2%
Wildcard           42.75         3.27       36.75           1.96  -24% -  -1%
{noformat}

Maybe we should try PACKED_SINGLE_BLOCK for some special values of numBits, instead of using PACKED all the time?

Thanks to Adrien, we have a more direct API in LUCENE-4239; I'm trying that now.
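
For readers less familiar with what a packed-ints reader has to do per value, here is a rough, self-contained sketch of contiguous bit packing into a long[]. It is not Lucene's PackedInts implementation and does not reproduce its on-disk bit order; PackedSketch and its methods are hypothetical names, and it assumes 1 <= bitsPerValue <= 32 with every value fitting in that many bits.

```java
/** Sketch only: illustrative bit packing, not Lucene's PackedInts format. */
final class PackedSketch {

  /** Packs values back to back; a value may straddle two 64-bit blocks. */
  static long[] pack(int[] values, int bitsPerValue) {
    long totalBits = (long) values.length * bitsPerValue;
    long[] blocks = new long[(int) ((totalBits + 63) / 64)];
    long bitPos = 0;
    for (int v : values) {
      int idx = (int) (bitPos >>> 6);   // which long holds the first bit
      int shift = (int) (bitPos & 63);  // bit offset inside that long
      long val = v & 0xFFFFFFFFL;
      blocks[idx] |= val << shift;
      if (shift + bitsPerValue > 64) {
        // The high part spills into the next block.
        blocks[idx + 1] |= val >>> (64 - shift);
      }
      bitPos += bitsPerValue;
    }
    return blocks;
  }

  /** Reverses pack(): extracts count values of bitsPerValue bits each. */
  static int[] unpack(long[] blocks, int count, int bitsPerValue) {
    int[] out = new int[count];
    long mask = (1L << bitsPerValue) - 1;
    long bitPos = 0;
    for (int i = 0; i < count; i++) {
      int idx = (int) (bitPos >>> 6);
      int shift = (int) (bitPos & 63);
      long val = blocks[idx] >>> shift;
      if (shift + bitsPerValue > 64) {
        val |= blocks[idx + 1] << (64 - shift);
      }
      out[i] = (int) (val & mask);
      bitPos += bitsPerValue;
    }
    return out;
  }
}
```

The straddling branch is the extra work a contiguous layout pays for its density. As far as I understand it, a single-block layout avoids that branch by never letting a value cross a 64-bit boundary, at the cost of some wasted bits for bit widths that do not divide 64.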

was (Author: billy):
An initial try with PackedInts in current trunk version. I replaced all the int[] buffer with long[] buffer so we can use the API directly. I don't quite understand the Writer part, so we have to save each long value one by one.

However, it is the Reader part we are concerned:
{format}
Task QPS base StdDev base QPS packedStdDev packed Pct diff
AndHighHigh 29.60 1.56 23.78 0.51 -25% - -13%
AndHighMed 74.68 3.92 53.15 2.31 -35% - -21%
Fuzzy1 88.23 1.21 87.13 1.41 -4% - 1%
Fuzzy2 30.09 0.45 29.47 0.47 -5% - 1%
IntNRQ 41.96 3.88 38.16 2.48 -22% - 6%
OrHighHigh 17.56 0.34 15.45 0.15 -14% - -9%
OrHighMed 34.71 0.76 30.77 0.53 -14% - -7%
PKLookup 111.00 1.90 110.52 1.59 -3% - 2%
Phrase 9.03 0.23 7.62 0.41 -22% - -8%
Prefix3 123.56 8.42 110.94 5.43 -20% - 1%
Respell 102.37 1.11 101.79 1.38 -2% - 1%
SloppyPhrase 3.97 0.19 3.52 0.07 -17% - -4%
SpanNear 8.24 0.18 7.22 0.25 -17% - -7%
Term 45.16 3.15 37.47 2.32 -27% - -5%
TermBGroup1M 17.19 1.09 15.86 0.77 -17% - 3%
TermBGroup1M1P 23.47 1.66 20.43 1.16 -23% - -1%
TermGroup1M 19.20 1.14 17.73 0.84 -16% - 2%
Wildcard 42.75 3.27 36.75 1.96 -24% - -1%
{format}

Maybe we should try PACKED_SINGLE_BLOCK for some special value of numBits, instead of using PACKED all the time?

Thanks to Adrien, we have a more direct API in LUCENE-4239, I'm trying that now.

> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-3892
> URL: https://issues.apache.org/jira/browse/LUCENE-3892
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.1
>
> Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-blockFor-with-packedints.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-for&pfor.patch, LUCENE-3892-handle_open_files.patch, LUCENE-3892-pfor-compress-iterate-numbits.patch, LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


jira at apache

Jul 20, 2012, 12:01 PM

Post #4 of 12 (161 views)
Permalink
[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419444#comment-13419444 ]

Han Jiang edited comment on LUCENE-3892 at 7/20/12 6:59 PM:
------------------------------------------------------------

So I changed the patch to use readBytes():

base: PackedInts.getReaderNoHeader().get(long[]); file I/O is handled by PackedInts.
comp: PackedInts.getDecoder().decode(LongBuffer,LongBuffer); a byte[] holds the compressed block, with ByteBuffer.wrap().asLongBuffer() as a wrapper.

Well, not as expected.
{noformat}
Task            QPS base  StdDev base  QPS comp  StdDev comp  Pct diff
AndHighHigh        23.78         1.06     23.38         0.42   -7% -   4%
AndHighMed         52.06         3.28     50.82         1.21  -10% -   6%
Fuzzy1             88.56         0.59     88.98         2.38   -2% -   3%
Fuzzy2             28.80         0.36     28.97         0.83   -3% -   4%
IntNRQ             41.92         1.67     41.34         0.50   -6% -   3%
OrHighHigh         15.85         0.45     15.89         0.39   -4% -   5%
OrHighMed          20.38         0.61     20.50         0.62   -5% -   6%
PKLookup          110.72         2.19    111.74         2.53   -3% -   5%
Phrase              7.51         0.12      7.05         0.18   -9% -  -2%
Prefix3           106.27         2.65    105.37         1.13   -4% -   2%
Respell           112.03         0.81    112.79         2.71   -2% -   3%
SloppyPhrase       15.43         0.48     14.92         0.27   -7% -   1%
SpanNear            3.52         0.10      3.41         0.06   -7% -   1%
Term               39.19         1.34     39.04         0.81   -5% -   5%
TermBGroup1M       18.45         0.68     18.33         0.56   -7% -   6%
TermBGroup1M1P     22.78         0.90     22.26         0.56   -8% -   4%
TermGroup1M        19.50         0.73     19.42         0.63   -7% -   6%
Wildcard           29.56         1.13     29.18         0.28   -5% -   3%
{noformat}
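
The wrapper used on the comp side is plain JDK API. A minimal sketch of viewing a compressed byte[] as longs follows; LongViewSketch and asLongs are hypothetical names, and how the patch feeds the buffer to the decoder is not shown here.

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

/** Sketch only: views the first length bytes of a byte[] as big-endian longs. */
final class LongViewSketch {

  /** No copy is made; the LongBuffer reads straight from the backing array. */
  static LongBuffer asLongs(byte[] compressed, int length) {
    // ByteBuffer defaults to big-endian, which matches how DataOutput-style
    // writers usually lay out a long (an assumption about the writer side).
    return ByteBuffer.wrap(compressed, 0, length).asLongBuffer();
  }
}
```

The view holds length / 8 longs; any trailing bytes that do not fill a whole long are simply not visible through it.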

was (Author: billy):
base: PackedInts.getReaderNoHeader().get(long[]), file io is handled by PackedInts.

comp:
PackedInts.getDecoder().decode(LongBuffer,LongBuffer), use byte[] to hold the compressed block, and ByteBuffer.wrap().asLongBuffer as a wrapper.

Well, not as expected.
{noformat}
Task QPS base StdDev base QPS comp StdDev comp Pct diff
AndHighHigh 23.78 1.06 23.38 0.42 -7% - 4%
AndHighMed 52.06 3.28 50.82 1.21 -10% - 6%
Fuzzy1 88.56 0.59 88.98 2.38 -2% - 3%
Fuzzy2 28.80 0.36 28.97 0.83 -3% - 4%
IntNRQ 41.92 1.67 41.34 0.50 -6% - 3%
OrHighHigh 15.85 0.45 15.89 0.39 -4% - 5%
OrHighMed 20.38 0.61 20.50 0.62 -5% - 6%
PKLookup 110.72 2.19 111.74 2.53 -3% - 5%
Phrase 7.51 0.12 7.05 0.18 -9% - -2%
Prefix3 106.27 2.65 105.37 1.13 -4% - 2%
Respell 112.03 0.81 112.79 2.71 -2% - 3%
SloppyPhrase 15.43 0.48 14.92 0.27 -7% - 1%
SpanNear 3.52 0.10 3.41 0.06 -7% - 1%
Term 39.19 1.34 39.04 0.81 -5% - 5%
TermBGroup1M 18.45 0.68 18.33 0.56 -7% - 6%
TermBGroup1M1P 22.78 0.90 22.26 0.56 -8% - 4%
TermGroup1M 19.50 0.73 19.42 0.63 -7% - 6%
Wildcard 29.56 1.13 29.18 0.28 -5% - 3%
{noformat}

> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-3892
> URL: https://issues.apache.org/jira/browse/LUCENE-3892
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.1
>
> Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-for&pfor.patch, LUCENE-3892-handle_open_files.patch, LUCENE-3892-pfor-compress-iterate-numbits.patch, LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


jira at apache

Jul 30, 2012, 10:37 AM

Post #5 of 12 (148 views)
Permalink
[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425010#comment-13425010 ]

Han Jiang edited comment on LUCENE-3892 at 7/30/12 5:36 PM:
------------------------------------------------------------

Previous experiments showed a net loss with the packed ints API; however, there were slight differences between the two sides, e.g. the all-values-the-same case was not handled equally. I suppose these two patches should make the comparison fair enough.

Base: BlockForPF + hardcoded decoder
Comp: BlockForPF + PackedInts.Decoder
{noformat}
Task            QPS base  StdDev base  QPS comp  StdDev comp  Pct diff
AndHighHigh        25.66         0.31     22.61         1.21  -17% -  -6%
AndHighMed         74.17         1.45     59.48         3.62  -26% - -13%
Fuzzy1             95.60         1.51     96.06         2.22   -3% -   4%
Fuzzy2             28.67         0.50     28.51         0.75   -4% -   3%
IntNRQ             33.31         0.60     30.73         1.51  -13% -  -1%
OrHighHigh         17.58         0.59     16.22         1.18  -17% -   2%
OrHighMed          34.42         0.93     32.14         2.33  -15% -   2%
PKLookup          217.08         4.25    213.76         1.37   -4% -   1%
Phrase              6.10         0.12      5.34         0.07  -15% -  -9%
Prefix3            77.27         1.26     70.42         2.87  -13% -  -3%
Respell            92.91         1.34     92.61         1.83   -3% -   3%
SloppyPhrase        5.35         0.16      5.00         0.29  -14% -   1%
SpanNear            6.05         0.15      5.47         0.07  -12% -  -6%
Term               37.62         0.32     33.08         1.70  -17% -  -6%
TermBGroup1M       17.45         0.64     16.40         0.73  -13% -   1%
TermBGroup1M1P     25.20         0.69     23.47         1.24  -14% -   0%
TermGroup1M        18.53         0.65     17.40         0.76  -13% -   1%
Wildcard           44.39         0.49     40.51         1.69  -13% -  -3%
{noformat}

Hmm, it's quite strange that we are already getting a perf loss with the baseline patch:

Base: BlockForPF in current branch
Comp: BlockForPF + hardcoded decoder (patch file)
{noformat}
Task            QPS base  StdDev base  QPS comp  StdDev comp  Pct diff
AndHighHigh        26.71         0.98     24.15         0.82  -15% -  -2%
AndHighMed         73.37         5.01     61.30         1.97  -24% -  -7%
Fuzzy1             85.73         4.95     84.30         1.79   -9% -   6%
Fuzzy2             30.15         2.05     29.52         0.66  -10% -   7%
IntNRQ             38.56         1.69     36.91         1.27  -11% -   3%
OrHighHigh         16.98         1.48     16.82         0.94  -13% -  14%
OrHighMed          34.60         2.79     34.70         2.22  -13% -  16%
PKLookup          214.93         3.99    213.86         1.23   -2% -   1%
Phrase             11.53         0.23     10.75         0.42  -12% -  -1%
Prefix3           107.15         3.83    102.12         2.69  -10% -   1%
Respell            87.41         5.41     86.08         1.76   -9% -   7%
SloppyPhrase        5.90         0.15      5.66         0.21   -9% -   2%
SpanNear            4.99         0.12      4.79         0.01   -6% -  -1%
Term               49.37         2.38     45.53         0.49  -12% -  -2%
TermBGroup1M       17.23         0.40     16.44         0.53   -9% -   0%
TermBGroup1M1P     22.02         0.50     22.42         0.60   -3% -   7%
TermGroup1M        13.65         0.29     13.05         0.28   -8% -   0%
Wildcard           48.73         2.01     46.35         1.31  -11% -   2%
{noformat}
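
A note on reading the Pct diff column: the ranges in these tables appear consistent with a worst-case/best-case comparison of the two QPS values widened by their standard deviations, truncated to whole percent. That formula is inferred from the numbers in this thread, not taken from the benchmark tool's source, so treat the sketch below (with hypothetical names) as an assumption.

```java
/** Sketch only: formula inferred from the tables, not from the benchmark tool. */
final class PctDiffSketch {

  /** Pessimistic bound: slowest plausible comp against fastest plausible base. */
  static int low(double qpsBase, double sdBase, double qpsComp, double sdComp) {
    double bestBase = qpsBase + sdBase;
    double worstComp = qpsComp - sdComp;
    return (int) (100.0 * (worstComp - bestBase) / bestBase);
  }

  /** Optimistic bound: fastest plausible comp against slowest plausible base. */
  static int high(double qpsBase, double sdBase, double qpsComp, double sdComp) {
    double worstBase = qpsBase - sdBase;
    double bestComp = qpsComp + sdComp;
    return (int) (100.0 * (bestComp - worstBase) / worstBase);
  }
}
```

For the AndHighHigh row of the second table above (26.71, 0.98, 24.15, 0.82) this gives -15 and -2, matching the reported range. A range that straddles zero therefore means the difference is within noise.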

was (Author: billy):
Previous experiments showed a net loss with packed ints API, however there're slight difference e.g. all-value-the-same case is not handled equally. I suppose these two patches should make the comparison fair enough.

Base: BlockForPF + hardwired decoder
Comp: BlockForPF + PackedInts.Decoder
{noformat}
Task QPS base StdDev base QPS comp StdDev comp Pct diff
AndHighHigh 25.66 0.31 22.61 1.21 -17% - -6%
AndHighMed 74.17 1.45 59.48 3.62 -26% - -13%
Fuzzy1 95.60 1.51 96.06 2.22 -3% - 4%
Fuzzy2 28.67 0.50 28.51 0.75 -4% - 3%
IntNRQ 33.31 0.60 30.73 1.51 -13% - -1%
OrHighHigh 17.58 0.59 16.22 1.18 -17% - 2%
OrHighMed 34.42 0.93 32.14 2.33 -15% - 2%
PKLookup 217.08 4.25 213.76 1.37 -4% - 1%
Phrase 6.10 0.12 5.34 0.07 -15% - -9%
Prefix3 77.27 1.26 70.42 2.87 -13% - -3%
Respell 92.91 1.34 92.61 1.83 -3% - 3%
SloppyPhrase 5.35 0.16 5.00 0.29 -14% - 1%
SpanNear 6.05 0.15 5.47 0.07 -12% - -6%
Term 37.62 0.32 33.08 1.70 -17% - -6%
TermBGroup1M 17.45 0.64 16.40 0.73 -13% - 1%
TermBGroup1M1P 25.20 0.69 23.47 1.24 -14% - 0%
TermGroup1M 18.53 0.65 17.40 0.76 -13% - 1%
Wildcard 44.39 0.49 40.51 1.69 -13% - -3%
{noformat}

> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-3892
> URL: https://issues.apache.org/jira/browse/LUCENE-3892
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.1
>
> Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-blockFor&hardcode(base).patch, LUCENE-3892-blockFor&packedecoder(comp).patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-handle_open_files.patch, LUCENE-3892-pfor-compress-iterate-numbits.patch, LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


jira at apache

Aug 7, 2012, 8:35 PM

Post #6 of 12 (136 views)
Permalink
[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396987#comment-13396987 ]

Han Jiang edited comment on LUCENE-3892 at 8/8/12 3:34 AM:
-----------------------------------------------------------

Oh, thank you Mike! I haven't thought too much about those skipping policies.

bq. Up above, in ForFactory, when we readInt() to get numBytes ... it seems like we could stuff the header numBits into that same int and save checking that in FORUtil.decompress....
Ah, yes, I just forgot to remove the redundant code. Here is an initial try at removing the header and calling ForDecompressImpl directly in readBlock(), with For and blockSize=128. Data in brackets shows the prior benchmark.
{noformat}
Task            QPS Base  StdDev Base  QPS For  StdDev For  Pct diff     (prior)
Phrase              4.99         0.37     3.57        0.26  -38% - -17%  (-44% - -18%)
AndHighMed         28.91         2.17    22.66        0.82  -29% - -12%  (-38% -  -9%)
SpanNear            2.72         0.14     2.22        0.13  -26% -  -8%  (-36% -  -8%)
SloppyPhrase        4.24         0.26     3.70        0.16  -21% -  -3%  (-33% -  -6%)
Respell            40.71         2.59    37.66        1.36  -16% -   2%  (-18% -   0%)
Fuzzy1             43.22         2.01    40.66        0.32  -10% -   0%  (-12% -   0%)
Fuzzy2             16.25         0.90    15.64        0.26  -10% -   3%  (-12% -   3%)
Wildcard           19.07         0.86    19.07        0.73   -8% -   8%  (-21% -   3%)
AndHighHigh         7.76         0.47     7.77        0.15   -7% -   8%  (-21% -  10%)
PKLookup           87.50         4.56    88.51        1.24   -5% -   8%  ( -2% -   5%)
TermBGroup1M       20.42         0.87    21.32        0.74   -3% -  12%  (  2% -  10%)
OrHighMed           5.33         0.68     5.61        0.14   -9% -  23%  (-16% -  25%)
OrHighHigh          4.43         0.53     4.69        0.12   -8% -  23%  (-15% -  24%)
TermGroup1M        13.30         0.34    14.31        0.40    2% -  13%  (  0% -  13%)
TermBGroup1M1P     20.92         0.59    23.71        0.86    6% -  20%  ( -1% -  22%)
Prefix3            30.30         1.41    35.14        1.76    5% -  27%  (-14% -  21%)
IntNRQ              3.90         0.54     4.58        0.47   -7% -  50%  (-25% -  33%)
Term               42.17         1.55    52.33        2.57   13% -  35%  (  1% -  33%)
{noformat}
-The improvement is quite general. However, I still suppose this just benefits from fewer method calls. I'm trying to change the PFor code to remove those nested calls.- (this is not actually true, since I was using percentage diff instead of QPS during comparison)

bq. Get more direct access to the file as an int[]; ...
Ok, this will be considered when the pfor+pulsing is completed. I'm just curious why we don't have readInts in ora.util yet...

bq. Skipping: can we partially decode a block? ...
The pfor-opt approach (encode the lower bits of each exception in the normal area, and the other bits in the exception area) naturally fits "partially decode a block"; that will be possible when we optimize skipping queries.
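
As a rough illustration of the split described above, the sketch below keeps the low numBits of every value in a "normal" array and stores only the remaining high bits of the exceptions separately. The class, field names and in-memory arrays are hypothetical; the real patch works on packed, serialized blocks, and this assumes non-negative values and 1 <= numBits <= 31.

```java
/** Sketch only: in-memory model of a low-bits/high-bits exception split. */
final class PForSplitSketch {
  final int numBits;
  final int[] low;      // low numBits of every value ("normal area")
  final int[] excIndex; // positions of values that did not fit
  final int[] excHigh;  // remaining high bits of those values ("exception area")

  private PForSplitSketch(int numBits, int[] low, int[] excIndex, int[] excHigh) {
    this.numBits = numBits;
    this.low = low;
    this.excIndex = excIndex;
    this.excHigh = excHigh;
  }

  static PForSplitSketch encode(int[] values, int numBits) {
    int mask = (1 << numBits) - 1;
    int count = 0;
    for (int v : values) {
      if ((v >>> numBits) != 0) {
        count++;
      }
    }
    int[] low = new int[values.length];
    int[] idx = new int[count];
    int[] high = new int[count];
    int e = 0;
    for (int i = 0; i < values.length; i++) {
      low[i] = values[i] & mask;
      int h = values[i] >>> numBits;
      if (h != 0) {
        idx[e] = i;
        high[e] = h;
        e++;
      }
    }
    return new PForSplitSketch(numBits, low, idx, high);
  }

  /** Full decode: normal area first, then patch the exceptions back in. */
  int[] decode() {
    int[] out = low.clone();
    for (int e = 0; e < excIndex.length; e++) {
      out[excIndex[e]] |= excHigh[e] << numBits;
    }
    return out;
  }
}
```

Because the normal area is complete on its own for every non-exception value, a reader could in principle decode it first and patch exceptions only when needed, which is one way to read the "partially decode a block" remark.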

was (Author: billy):
Oh, thank you Mike! I haven't thought too much about those skipping policies.

bq. Up above, in ForFactory, when we readInt() to get numBytes ... it seems like we could stuff the header numBits into that same int and save checking that in FORUtil.decompress....
Ah, yes, I just forgot to remove the redundant codes. Here is a initial try to remove header and call ForDecompressImpl directly in readBlock():with For, blockSize=128. Data in bracket show prior benchmark.
{noformat}
Task QPS Base StdDev Base QPS For StdDev For Pct diff
Phrase 4.99 0.37 3.57 0.26 -38% - -17% (-44% - -18%)
AndHighMed 28.91 2.17 22.66 0.82 -29% - -12% (-38% - -9%)
SpanNear 2.72 0.14 2.22 0.13 -26% - -8% (-36% - -8%)
SloppyPhrase 4.24 0.26 3.70 0.16 -21% - -3% (-33% - -6%)
Respell 40.71 2.59 37.66 1.36 -16% - 2% (-18% - 0%)
Fuzzy1 43.22 2.01 40.66 0.32 -10% - 0% (-12% - 0%)
Fuzzy2 16.25 0.90 15.64 0.26 -10% - 3% (-12% - 3%)
Wildcard 19.07 0.86 19.07 0.73 -8% - 8% (-21% - 3%)
AndHighHigh 7.76 0.47 7.77 0.15 -7% - 8% (-21% - 10%)
PKLookup 87.50 4.56 88.51 1.24 -5% - 8% ( -2% - 5%)
TermBGroup1M 20.42 0.87 21.32 0.74 -3% - 12% ( 2% - 10%)
OrHighMed 5.33 0.68 5.61 0.14 -9% - 23% (-16% - 25%)
OrHighHigh 4.43 0.53 4.69 0.12 -8% - 23% (-15% - 24%)
TermGroup1M 13.30 0.34 14.31 0.40 2% - 13% ( 0% - 13%)
TermBGroup1M1P 20.92 0.59 23.71 0.86 6% - 20% ( -1% - 22%)
Prefix3 30.30 1.41 35.14 1.76 5% - 27% (-14% - 21%)
IntNRQ 3.90 0.54 4.58 0.47 -7% - 50% (-25% - 33%)
Term 42.17 1.55 52.33 2.57 13% - 35% ( 1% - 33%)
{noformat}
The improvement is quite general. However, I still suppose this just benefits from less method calling. I'm trying to change the PFor codes, and remove those nested call.

bq. Get more direct access to the file as an int[]; ...
Ok, this will be considered when the pfor+pulsing is completed. I'm just curious why we don't have readInts in ora.util yet...

bq. Skipping: can we partially decode a block? ...
The pfor-opt approach(encode lower bits of exception in normal area, and other bits in exception area) natually fits "partially decode a block", that'll be possible when we optimize skipping queries.

> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-3892
> URL: https://issues.apache.org/jira/browse/LUCENE-3892
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.1
>
> Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-blockFor&hardcode(base).patch, LUCENE-3892-blockFor&packedecoder(comp).patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-handle_open_files.patch, LUCENE-3892-pfor-compress-iterate-numbits.patch, LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


jira at apache

Aug 7, 2012, 8:37 PM

Post #7 of 12 (133 views)
Permalink
[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397228#comment-13397228 ]

Han Jiang edited comment on LUCENE-3892 at 8/8/12 3:35 AM:
-----------------------------------------------------------

And the result for PFor (blockSize=128):
{noformat}
Task            QPS Base  StdDev Base  QPS PFor  StdDev PFor  Pct diff     (prior)
Phrase              4.87         0.36      3.39         0.18  -38% - -20%  (-47% - -25%)
AndHighMed         27.78         2.35     21.13         0.52  -31% - -14%  (-37% - -15%)
SpanNear            2.70         0.14      2.20         0.11  -26% -  -9%  (-36% - -13%)
SloppyPhrase        4.17         0.15      3.77         0.21  -17% -   0%  (-30% -  -6%)
Respell            39.97         1.56     37.65         1.95  -14% -   3%  (-15% -   2%)
Wildcard           19.08         0.77     18.33         0.92  -12% -   5%  (-17% -   3%)
Fuzzy1             42.29         1.13     40.78         1.44   -9% -   2%  (-11% -   1%)
AndHighHigh         7.61         0.55      7.45         0.08   -9% -   6%  (-19% -   6%)
Fuzzy2             15.79         0.55     15.64         0.70   -8% -   7%  (-11% -   6%)
PKLookup           86.71         2.13     88.92         2.24   -2% -   7%  ( -2% -   7%)
TermGroup1M        13.04         0.23     14.03         0.40    2% -  12%  (  1% -   9%)
IntNRQ              3.97         0.48      4.35         0.61  -15% -  41%  (-16% -  24%)
TermBGroup1M1P     21.04         0.35     23.20         0.60    5% -  14%  (  0% -  14%)
TermBGroup1M       19.27         0.47     21.28         0.84    3% -  17%  (  1% -  10%)
OrHighHigh          4.13         0.47      4.63         0.27   -5% -  34%  (-14% -  27%)
OrHighMed           4.95         0.59      5.58         0.34   -5% -  35%  (-14% -  27%)
Prefix3            30.33         1.36     34.26         2.14    1% -  25%  ( -6% -  20%)
Term               41.99         1.19     50.75         1.72   13% -  28%  (  2% -  26%)
{noformat}
-It works, and it is quite interesting that StdDev for the Term query is reduced significantly.- (same as the last comment: when comparing the two versions directly (method call vs. unfolded), the improvement is somewhat noisy)

was (Author: billy):
And result for PFor(blocksize=128):
{noformat}
Task QPS Base StdDev Base QPS PFor StdDev PFor Pct diff
Phrase 4.87 0.36 3.39 0.18 -38% - -20% (-47% - -25%)
AndHighMed 27.78 2.35 21.13 0.52 -31% - -14% (-37% - -15%)
SpanNear 2.70 0.14 2.20 0.11 -26% - -9% (-36% - -13%)
SloppyPhrase 4.17 0.15 3.77 0.21 -17% - 0% (-30% - -6%)
Respell 39.97 1.56 37.65 1.95 -14% - 3% (-15% - 2%)
Wildcard 19.08 0.77 18.33 0.92 -12% - 5% (-17% - 3%)
Fuzzy1 42.29 1.13 40.78 1.44 -9% - 2% (-11% - 1%)
AndHighHigh 7.61 0.55 7.45 0.08 -9% - 6% (-19% - 6%)
Fuzzy2 15.79 0.55 15.64 0.70 -8% - 7% (-11% - 6%)
PKLookup 86.71 2.13 88.92 2.24 -2% - 7% ( -2% - 7%)
TermGroup1M 13.04 0.23 14.03 0.40 2% - 12% ( 1% - 9%)
IntNRQ 3.97 0.48 4.35 0.61 -15% - 41% (-16% - 24%)
TermBGroup1M1P 21.04 0.35 23.20 0.60 5% - 14% ( 0% - 14%)
TermBGroup1M 19.27 0.47 21.28 0.84 3% - 17% ( 1% - 10%)
OrHighHigh 4.13 0.47 4.63 0.27 -5% - 34% (-14% - 27%)
OrHighMed 4.95 0.59 5.58 0.34 -5% - 35% (-14% - 27%)
Prefix3 30.33 1.36 34.26 2.14 1% - 25% ( -6% - 20%)
Term 41.99 1.19 50.75 1.72 13% - 28% ( 2% - 26%)
{noformat}
It works, and it is quite interesting that StdDev for Term query is reduced significantly.

> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-3892
> URL: https://issues.apache.org/jira/browse/LUCENE-3892
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.1
>
> Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-blockFor&hardcode(base).patch, LUCENE-3892-blockFor&packedecoder(comp).patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-handle_open_files.patch, LUCENE-3892-pfor-compress-iterate-numbits.patch, LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.


jira at apache

Aug 7, 2012, 8:37 PM

Post #8 of 12 (133 views)
[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397228#comment-13397228 ]

Han Jiang edited comment on LUCENE-3892 at 8/8/12 3:35 AM:
-----------------------------------------------------------

And result for PFor(blocksize=128):
{noformat}
Task QPS Base StdDev Base QPS PFor StdDev PFor Pct diff
Phrase 4.87 0.36 3.39 0.18 -38% - -20% (-47% - -25%)
AndHighMed 27.78 2.35 21.13 0.52 -31% - -14% (-37% - -15%)
SpanNear 2.70 0.14 2.20 0.11 -26% - -9% (-36% - -13%)
SloppyPhrase 4.17 0.15 3.77 0.21 -17% - 0% (-30% - -6%)
Respell 39.97 1.56 37.65 1.95 -14% - 3% (-15% - 2%)
Wildcard 19.08 0.77 18.33 0.92 -12% - 5% (-17% - 3%)
Fuzzy1 42.29 1.13 40.78 1.44 -9% - 2% (-11% - 1%)
AndHighHigh 7.61 0.55 7.45 0.08 -9% - 6% (-19% - 6%)
Fuzzy2 15.79 0.55 15.64 0.70 -8% - 7% (-11% - 6%)
PKLookup 86.71 2.13 88.92 2.24 -2% - 7% ( -2% - 7%)
TermGroup1M 13.04 0.23 14.03 0.40 2% - 12% ( 1% - 9%)
IntNRQ 3.97 0.48 4.35 0.61 -15% - 41% (-16% - 24%)
TermBGroup1M1P 21.04 0.35 23.20 0.60 5% - 14% ( 0% - 14%)
TermBGroup1M 19.27 0.47 21.28 0.84 3% - 17% ( 1% - 10%)
OrHighHigh 4.13 0.47 4.63 0.27 -5% - 34% (-14% - 27%)
OrHighMed 4.95 0.59 5.58 0.34 -5% - 35% (-14% - 27%)
Prefix3 30.33 1.36 34.26 2.14 1% - 25% ( -6% - 20%)
Term 41.99 1.19 50.75 1.72 13% - 28% ( 2% - 26%)
{noformat}
-It works, and it is quite interesting that StdDev for Term query is reduced significantly.- (Same as the last comment: when comparing the two versions directly (method call vs. unfolded), the improvement is somewhat noisy.)
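
For readers of the table above: the "Pct diff" range is not explained in this thread, but the first range in each row appears consistent with a worst case and best case computed from the mean QPS plus or minus one standard deviation, truncated toward zero. The sketch below is an inference from the numbers, not the benchmark tool's actual code; {{PctDiffSketch}} and {{pctDiffRange}} are hypothetical names.

```java
// Hypothetical helper (not part of Lucene or the benchmark scripts):
// reproduces the first "Pct diff" range from QPS mean and StdDev.
final class PctDiffSketch {
    static int[] pctDiffRange(double qpsBase, double sdBase,
                              double qpsCmp, double sdCmp) {
        // Worst case: comparison one StdDev low, baseline one StdDev high.
        double worst = (qpsCmp - sdCmp) / (qpsBase + sdBase) - 1.0;
        // Best case: comparison one StdDev high, baseline one StdDev low.
        double best = (qpsCmp + sdCmp) / (qpsBase - sdBase) - 1.0;
        // The (int) cast truncates toward zero, which matches the table rows.
        return new int[] {(int) (100.0 * worst), (int) (100.0 * best)};
    }

    public static void main(String[] args) {
        int[] term = pctDiffRange(41.99, 1.19, 50.75, 1.72);
        System.out.println("Term: " + term[0] + "% - " + term[1] + "%");
        int[] phrase = pctDiffRange(4.87, 0.36, 3.39, 0.18);
        System.out.println("Phrase: " + phrase[0] + "% - " + phrase[1] + "%");
    }
}
```

With the Term row (41.99, 1.19, 50.75, 1.72) this gives 13% - 28%, and with the Phrase row it gives -38% - -20%, matching the table. A range that straddles 0% means the difference is within the measured noise.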

was (Author: billy):
And result for PFor(blocksize=128):
{noformat}
Task QPS Base StdDev Base QPS PFor StdDev PFor Pct diff
Phrase 4.87 0.36 3.39 0.18 -38% - -20% (-47% - -25%)
AndHighMed 27.78 2.35 21.13 0.52 -31% - -14% (-37% - -15%)
SpanNear 2.70 0.14 2.20 0.11 -26% - -9% (-36% - -13%)
SloppyPhrase 4.17 0.15 3.77 0.21 -17% - 0% (-30% - -6%)
Respell 39.97 1.56 37.65 1.95 -14% - 3% (-15% - 2%)
Wildcard 19.08 0.77 18.33 0.92 -12% - 5% (-17% - 3%)
Fuzzy1 42.29 1.13 40.78 1.44 -9% - 2% (-11% - 1%)
AndHighHigh 7.61 0.55 7.45 0.08 -9% - 6% (-19% - 6%)
Fuzzy2 15.79 0.55 15.64 0.70 -8% - 7% (-11% - 6%)
PKLookup 86.71 2.13 88.92 2.24 -2% - 7% ( -2% - 7%)
TermGroup1M 13.04 0.23 14.03 0.40 2% - 12% ( 1% - 9%)
IntNRQ 3.97 0.48 4.35 0.61 -15% - 41% (-16% - 24%)
TermBGroup1M1P 21.04 0.35 23.20 0.60 5% - 14% ( 0% - 14%)
TermBGroup1M 19.27 0.47 21.28 0.84 3% - 17% ( 1% - 10%)
OrHighHigh 4.13 0.47 4.63 0.27 -5% - 34% (-14% - 27%)
OrHighMed 4.95 0.59 5.58 0.34 -5% - 35% (-14% - 27%)
Prefix3 30.33 1.36 34.26 2.14 1% - 25% ( -6% - 20%)
Term 41.99 1.19 50.75 1.72 13% - 28% ( 2% - 26%)
{noformat}
-It works, and it is quite interesting that StdDev for Term query is reduced significantly. - (Same as the last comment: when comparing the two versions directly (method call vs. unfolded), the improvement is somewhat noisy.)


jira at apache

Aug 7, 2012, 8:43 PM

Post #9 of 12 (134 views)
[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399883#comment-13399883 ]

Han Jiang edited comment on LUCENE-3892 at 8/8/12 3:41 AM:
-----------------------------------------------------------

Yes, really interesting. And that should make sense. -As far as I know, a method with exception handling may be much slower than a simple if-statement check.- (Hmm, now I think this is not true; the improvement should mainly come from the framework change.) Here is part of the results from my test, with Mike's patch:
{noformat}
OrHighMed 2.53 0.31 2.57 0.13 -13% - 21%
Wildcard 3.86 0.12 3.94 0.38 -10% - 15%
OrHighHigh 1.57 0.18 1.61 0.08 -12% - 21%
TermBGroup1M1P 1.93 0.03 2.48 0.10 21% - 35%
TermGroup1M 1.37 0.02 1.81 0.05 26% - 37%
TermBGroup1M 1.17 0.02 1.64 0.07 32% - 47%
Term 2.92 0.13 4.46 0.23 38% - 68%
{noformat}

was (Author: billy):
Yes, really interesting. And that should make sense. As far as I know, a method with exception handling may be much slower than a simple if-statement check. Here is part of the results from my test, with Mike's patch:
{noformat}
OrHighMed 2.53 0.31 2.57 0.13 -13% - 21%
Wildcard 3.86 0.12 3.94 0.38 -10% - 15%
OrHighHigh 1.57 0.18 1.61 0.08 -12% - 21%
TermBGroup1M1P 1.93 0.03 2.48 0.10 21% - 35%
TermGroup1M 1.37 0.02 1.81 0.05 26% - 37%
TermBGroup1M 1.17 0.02 1.64 0.07 32% - 47%
Term 2.92 0.13 4.46 0.23 38% - 68%
{noformat}


jira at apache

Aug 8, 2012, 10:48 AM

Post #10 of 12 (134 views)
[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431117#comment-13431117 ]

Han Jiang edited comment on LUCENE-3892 at 8/8/12 5:47 PM:
-----------------------------------------------------------

And results for skipMultiplier, using the current value of 8 as the baseline: http://pastebin.com/TG4C6u6S
Somewhat noisy, but or-queries benefit a little when skipMultiplier=32.
And results when blockSize is fixed to 64: http://pastebin.com/FQBiKGim

was (Author: billy):
And results for skipMultiplier, using the current value of 8 as the baseline: http://pastebin.com/TG4C6u6S
Somewhat noisy, but or-queries benefit a little when skipMultiplier=32.

> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-3892
> URL: https://issues.apache.org/jira/browse/LUCENE-3892
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.1
>
> Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-blockFor&hardcode(base).patch, LUCENE-3892-blockFor&packedecoder(comp).patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints.patch, LUCENE-3892-bulkVInt.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-handle_open_files.patch, LUCENE-3892-pfor-compress-iterate-numbits.patch, LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.


jira at apache

Aug 8, 2012, 12:38 PM

Post #11 of 12 (133 views)
[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431324#comment-13431324 ]

Adrien Grand edited comment on LUCENE-3892 at 8/8/12 7:38 PM:
--------------------------------------------------------------

I made some changes to the {{BlockPacked}} codec:
- encoding and decoding using int[] instead of long[]
- selection of the format based on a configurable overhead ratio.

The results are encouraging (using acceptableOverheadRatio = PackedInts.DEFAULT = 20%):
{noformat}
Task QPS 3892 StdDev 3892 QPS 3892-packed StdDev 3892-packed Pct diff
PKLookup 256.93 8.89 256.85 7.47 -6% - 6%
OrHighLow 145.14 9.86 145.14 9.35 -12% - 14%
Respell 110.26 1.84 110.27 2.01 -3% - 3%
AndHighHigh 112.97 0.81 113.19 2.17 -2% - 2%
Fuzzy1 102.15 1.47 102.86 3.13 -3% - 5%
OrHighHigh 94.56 6.56 95.43 6.35 -11% - 15%
Fuzzy2 42.49 0.77 42.89 1.43 -4% - 6%
OrHighMed 175.30 11.34 177.42 10.83 -10% - 14%
AndHighLow 1925.02 23.92 1952.57 48.68 -2% - 5%
HighPhrase 8.96 0.41 9.11 0.46 -7% - 11%
Wildcard 189.79 2.13 193.12 1.57 0% - 3%
HighSpanNear 6.47 0.15 6.59 0.25 -4% - 8%
Prefix3 256.67 2.58 262.40 2.84 0% - 4%
LowTerm 1746.52 52.80 1789.54 54.30 -3% - 8%
HighTerm 238.70 13.46 245.63 16.60 -9% - 16%
MedTerm 923.64 38.19 951.18 46.85 -5% - 12%
AndHighMed 364.46 3.65 377.09 10.03 0% - 7%
IntNRQ 56.58 1.02 58.84 0.80 0% - 7%
HighSloppyPhrase 11.73 0.30 12.40 0.62 -2% - 13%
LowSpanNear 29.64 0.96 32.44 0.98 2% - 16%
MedSpanNear 22.96 0.72 25.16 0.85 2% - 16%
MedPhrase 40.99 1.25 45.09 1.24 3% - 16%
LowSloppyPhrase 37.88 0.99 41.98 1.49 4% - 17%
LowPhrase 64.40 2.04 71.84 1.41 5% - 17%
MedSloppyPhrase 42.29 1.16 47.32 1.54 5% - 18%
{noformat}

I hope this will be confirmed on your computers this time. :-)
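
As a rough illustration of what "selection of the format based on a configurable overhead ratio" can mean: the sketch below rounds a bits-per-value up to a byte-aligned width whenever the extra bits stay within the allowed overhead. It is a simplified assumption, not the actual {{PackedInts}} selection logic (which also considers single-block formats); {{OverheadRatioSketch}}, {{chooseBitsPerValue}} and the list of aligned widths are hypothetical.

```java
// Hypothetical, simplified sketch; not the real PackedInts implementation.
final class OverheadRatioSketch {
    // Widths assumed to decode quickly because they fall on byte boundaries.
    static final int[] ALIGNED_WIDTHS = {8, 16, 24, 32, 64};

    static int chooseBitsPerValue(int bitsPerValue, float acceptableOverheadRatio) {
        // Largest width we may use without exceeding the overhead budget.
        int maxBitsPerValue =
            bitsPerValue + (int) (bitsPerValue * acceptableOverheadRatio);
        for (int width : ALIGNED_WIDTHS) {
            if (bitsPerValue <= width && width <= maxBitsPerValue) {
                return width; // trade a little space for faster decoding
            }
        }
        return bitsPerValue; // no aligned width fits the budget: stay compact
    }

    public static void main(String[] args) {
        float ratio = 0.2f; // the 20% figure quoted above
        for (int bpv : new int[] {5, 7, 14}) {
            System.out.println(bpv + " -> " + chooseBitsPerValue(bpv, ratio));
        }
    }
}
```

Under these assumptions and a 20% ratio, 7 bits per value is widened to 8 and 14 to 16, while 5 stays at 5 because 8 bits would cost more than the budget allows.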

was (Author: jpountz):
I made some changes to the {{BlockPacked}} codec:
- encoding and decoding using int[] instead of long[]
- selection of the format based on a configurable overhead ratio.

The results are encouraging:
{noformat}
Task QPS 3892 StdDev 3892 QPS 3892-packed StdDev 3892-packed Pct diff
PKLookup 256.93 8.89 256.85 7.47 -6% - 6%
OrHighLow 145.14 9.86 145.14 9.35 -12% - 14%
Respell 110.26 1.84 110.27 2.01 -3% - 3%
AndHighHigh 112.97 0.81 113.19 2.17 -2% - 2%
Fuzzy1 102.15 1.47 102.86 3.13 -3% - 5%
OrHighHigh 94.56 6.56 95.43 6.35 -11% - 15%
Fuzzy2 42.49 0.77 42.89 1.43 -4% - 6%
OrHighMed 175.30 11.34 177.42 10.83 -10% - 14%
AndHighLow 1925.02 23.92 1952.57 48.68 -2% - 5%
HighPhrase 8.96 0.41 9.11 0.46 -7% - 11%
Wildcard 189.79 2.13 193.12 1.57 0% - 3%
HighSpanNear 6.47 0.15 6.59 0.25 -4% - 8%
Prefix3 256.67 2.58 262.40 2.84 0% - 4%
LowTerm 1746.52 52.80 1789.54 54.30 -3% - 8%
HighTerm 238.70 13.46 245.63 16.60 -9% - 16%
MedTerm 923.64 38.19 951.18 46.85 -5% - 12%
AndHighMed 364.46 3.65 377.09 10.03 0% - 7%
IntNRQ 56.58 1.02 58.84 0.80 0% - 7%
HighSloppyPhrase 11.73 0.30 12.40 0.62 -2% - 13%
LowSpanNear 29.64 0.96 32.44 0.98 2% - 16%
MedSpanNear 22.96 0.72 25.16 0.85 2% - 16%
MedPhrase 40.99 1.25 45.09 1.24 3% - 16%
LowSloppyPhrase 37.88 0.99 41.98 1.49 4% - 17%
LowPhrase 64.40 2.04 71.84 1.41 5% - 17%
MedSloppyPhrase 42.29 1.16 47.32 1.54 5% - 18%
{noformat}

I hope this will be confirmed on your computers this time. :-)


jira at apache

Aug 9, 2012, 4:06 AM

Post #12 of 12 (132 views)
[jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431709#comment-13431709 ]

Adrien Grand edited comment on LUCENE-3892 at 8/9/12 11:05 AM:
---------------------------------------------------------------

The comment you added in 1371011 on the value of {{BLOCK_SIZE}} caught my attention: I think that BLOCK_SIZE should be at least 64 with PackedInts encoding/decoding since these conversions are long-aligned (I backported your two commits and added a comment about this). For example, the {{PACKED}} 7-bit encoder cannot encode fewer than 64 values in one iteration.

In case someone really wants to use smaller block sizes (e.g. 32), I think it should still perform pretty well if {{acceptableOverheadRatio >= ~25%}} (in that case, all bits-per-value in the [1-24] range either use a {{PACKED_SINGLE_BLOCK}} encoder or an 8-bit, 16-bit or 24-bit {{PACKED}} encoder).
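
To make the alignment argument concrete, here is a minimal sketch, assuming that a long-aligned {{PACKED}} encoder works on the smallest group of values whose total bit length ends exactly on a 64-bit boundary. {{PackedIterationSketch}} and {{valuesPerIteration}} are hypothetical names used only for illustration, not Lucene API.

```java
// Hypothetical sketch of the long-alignment arithmetic; not Lucene code.
final class PackedIterationSketch {
    static int gcd(int a, int b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    // Smallest number of values whose packed size is a whole number of
    // 64-bit longs: lcm(bitsPerValue, 64) / bitsPerValue.
    static int valuesPerIteration(int bitsPerValue) {
        return 64 / gcd(bitsPerValue, 64);
    }

    public static void main(String[] args) {
        for (int bpv : new int[] {7, 8, 16, 24}) {
            System.out.println(bpv + " bits per value -> "
                + valuesPerIteration(bpv) + " values per iteration");
        }
    }
}
```

Under this assumption, 7 bits per value needs 64 values (7 longs) per iteration, whereas 8, 16 and 24 bits per value need only 8, 4 and 8 values respectively, which is why those widths would still fit a block size of 32.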

Do we plan to make the block size configurable?

was (Author: jpountz):
The comment you added in 1371011 on the value of {{BLOCK_SIZE}} caught my attention: I think that BLOCK_SIZE should be at least 64 with PackedInts encoding/decoding since these conversions are long-aligned (I backported your two commits and added a comment about this). For example, the {{PACKED}} 7-bit encoder cannot encode fewer than 64 values in one iteration.

In case someone really wants to use smaller block sizes (e.g. 32), I think it should still perform pretty well if {{acceptableOverheadRatio >= ~25%}} (in that case, all bits-per-value in the [1-24] range either use a {{PACKED_SINGLE_BLOCK}} encoder or an 8-bit, 16-bit or 24-bit {{PACKED}} decoder).

Do we plan to make the block size configurable?

> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-3892
> URL: https://issues.apache.org/jira/browse/LUCENE-3892
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.1
>
> Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-blockFor&hardcode(base).patch, LUCENE-3892-blockFor&packedecoder(comp).patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints.patch, LUCENE-3892-bulkVInt.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-for&pfor-with-javadoc.patch, LUCENE-3892-handle_open_files.patch, LUCENE-3892-non-specialized.patch, LUCENE-3892-pfor-compress-iterate-numbits.patch, LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.
