Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

 

 

First page Previous page 1 2 3 4 5 Next page Last page  View All Lucene java-dev RSS feed   Index | Next | Previous | View Threaded


markrmiller at gmail

Nov 1, 2009, 6:31 AM

Post #76 of 110 (789 views)
Permalink
Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

Yeah, this is what I saw when profiling a week or two ago. Pretty much
everything was the same - the only diff I can remember was that the
collect method bubbled up slightly different (though ended up roughly
the same), and for the single queue, after the collect method bubbled
up, the compare bottom method quickly bubbled up to 2% at the end. No
other method of interest appeared to be involved and everything else
looked relatively the same.

Michael McCandless (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772309#action_12772309 ]
>
> Michael McCandless commented on LUCENE-1997:
> --------------------------------------------
>
> The number of insertions into the queue is miniscule for these tests.
> EG with topN=10, the query "1" against the 5M wikipedia index, causes
> 110 insertions.
>
> Even at topN=1000 we see only 8053 insertions.
>
> So, the time difference of these runs is really driven by the "compare
> to bottom" check that's done for every hit.
>
> What baffles me is even if I take the inline-single-PQ from the last
> patch, and instead of invoking a separate class's
> "IntDocValues.intValue(doc)" I look it up straight from the int[] I
> get from FieldCache, I'm still seeing worse performance vs trunk.
>
> I think at this point this test is chasing java ghosts, so, we really
> can't conclude much.
>
> Also, I think, if you are sorting by native value per doc, likely the
> fastest way to take "bottom" into account is to push the check all the
> way down into the bottom TermScorers that're pulling docIDs from the
> posting lists. Ie, if your queue has converged, and you know a given
> doc must have value < 7 (say) to get into the queue, you can check
> each doc's value immediately on pulling it from the posting list and
> skip it if it won't compete (and, if you don't require exact total
> hits count).
>
> For queries that are otherwise costly, this can save alot of CPU.
> This is what the source code specialization (LUCENE-1594) does, and it
> results in excellent gains (if I'm remembering right!).
>
>
>
>> Explore performance of multi-PQ vs single-PQ sorting API
>> --------------------------------------------------------
>>
>> Key: LUCENE-1997
>> URL: https://issues.apache.org/jira/browse/LUCENE-1997
>> Project: Lucene - Java
>> Issue Type: Improvement
>> Components: Search
>> Affects Versions: 2.9
>> Reporter: Michael McCandless
>> Assignee: Michael McCandless
>> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>>
>>
>> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
>> where a simpler (non-segment-based) comparator API is proposed that
>> gathers results into multiple PQs (one per segment) and then merges
>> them in the end.
>> I started from John's multi-PQ code and worked it into
>> contrib/benchmark so that we could run perf tests. Then I generified
>> the Python script I use for running search benchmarks (in
>> contrib/benchmark/sortBench.py).
>> The script first creates indexes with 1M docs (based on
>> SortableSingleDocSource, and based on wikipedia, if available). Then
>> it runs various combinations:
>> * Index with 20 balanced segments vs index with the "normal" log
>> segment size
>> * Queries with different numbers of hits (only for wikipedia index)
>> * Different top N
>> * Different sorts (by title, for wikipedia, and by random string,
>> random int, and country for the random index)
>> For each test, 7 search rounds are run and the best QPS is kept. The
>> script runs singlePQ then multiPQ, and records the resulting best QPS
>> for each and produces table (in Jira format) as output.
>>
>
>


--
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 2, 2009, 1:54 PM

Post #77 of 110 (777 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772710#action_12772710 ]

John Wang commented on LUCENE-1997:
-----------------------------------

Hi Michael:

Any plans/decisions on moving forward with multiQ within Lucene? I am planning on making the change locally for my project, but I would rather not duplicate the work if you are planning on doing this within lucene.

Thanks

-John

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 2, 2009, 2:54 PM

Post #78 of 110 (775 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772735#action_12772735 ]

Michael McCandless commented on LUCENE-1997:
--------------------------------------------

Hi John -- it seems unlikely to happen any time too soon. We're still iterating here, and there are concerns (eg increased memory usage) with the multi PQ approach.

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 2, 2009, 3:04 PM

Post #79 of 110 (778 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772741#action_12772741 ]

Jake Mannix commented on LUCENE-1997:
-------------------------------------

The current concern is to do with the memory? I'm more concerned with the weird "java ghosts" that are flying around, sometimes swaying results by 20-40%... the memory could only be an issue on a setup with hundreds of segments and sorting the top 1000 values (do we really try to optimize for this performance case?). In the normal case (no more than tens of segments, and the top 10 or 100 hits), we're talking about what, 100-1000 PQ entries?

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


markrmiller at gmail

Nov 2, 2009, 3:10 PM

Post #80 of 110 (783 views)
Permalink
Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

There are plenty of Lucene users that do go 1000 in. We've been
calling it "deep paging" at LI. I like that name :)

- Mark

http://www.lucidimagination.com (mobile)

On Nov 2, 2009, at 6:04 PM, "Jake Mannix (JIRA)" <jira [at] apache>
wrote:

>
> [ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772741#action_12772741
> ]
>
> Jake Mannix commented on LUCENE-1997:
> -------------------------------------
>
> The current concern is to do with the memory? I'm more concerned
> with the weird "java ghosts" that are flying around, sometimes
> swaying results by 20-40%... the memory could only be an issue on a
> setup with hundreds of segments and sorting the top 1000 values (do
> we really try to optimize for this performance case?). In the
> normal case (no more than tens of segments, and the top 10 or 100
> hits), we're talking about what, 100-1000 PQ entries?
>
>> Explore performance of multi-PQ vs single-PQ sorting API
>> --------------------------------------------------------
>>
>> Key: LUCENE-1997
>> URL: https://issues.apache.org/jira/browse/LUCENE-1997
>> Project: Lucene - Java
>> Issue Type: Improvement
>> Components: Search
>> Affects Versions: 2.9
>> Reporter: Michael McCandless
>> Assignee: Michael McCandless
>> Attachments: LUCENE-1997.patch, LUCENE-1997.patch,
>> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch,
>> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch,
>> LUCENE-1997.patch
>>
>>
>> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-
>> dev,
>> where a simpler (non-segment-based) comparator API is proposed that
>> gathers results into multiple PQs (one per segment) and then merges
>> them in the end.
>> I started from John's multi-PQ code and worked it into
>> contrib/benchmark so that we could run perf tests. Then I generified
>> the Python script I use for running search benchmarks (in
>> contrib/benchmark/sortBench.py).
>> The script first creates indexes with 1M docs (based on
>> SortableSingleDocSource, and based on wikipedia, if available). Then
>> it runs various combinations:
>> * Index with 20 balanced segments vs index with the "normal" log
>> segment size
>> * Queries with different numbers of hits (only for wikipedia index)
>> * Different top N
>> * Different sorts (by title, for wikipedia, and by random string,
>> random int, and country for the random index)
>> For each test, 7 search rounds are run and the best QPS is kept. The
>> script runs singlePQ then multiPQ, and records the resulting best QPS
>> for each and produces table (in Jira format) as output.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
> For additional commands, e-mail: java-dev-help [at] lucene
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jake.mannix at gmail

Nov 2, 2009, 3:13 PM

Post #81 of 110 (778 views)
Permalink
Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

Ok, and how many of those users are also running on indices with hundreds of
segments?

-jake

On Mon, Nov 2, 2009 at 3:10 PM, Mark Miller <markrmiller [at] gmail> wrote:

> There are plenty of Lucene users that do go 1000 in. We've been calling it
> "deep paging" at LI. I like that name :)
>
> - Mark
>
> http://www.lucidimagination.com (mobile)
>
>
> On Nov 2, 2009, at 6:04 PM, "Jake Mannix (JIRA)" <jira [at] apache> wrote:
>
>
>> [
>> https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772741#action_12772741
>> ]
>>
>> Jake Mannix commented on LUCENE-1997:
>> -------------------------------------
>>
>> The current concern is to do with the memory? I'm more concerned with the
>> weird "java ghosts" that are flying around, sometimes swaying results by
>> 20-40%... the memory could only be an issue on a setup with hundreds of
>> segments and sorting the top 1000 values (do we really try to optimize for
>> this performance case?). In the normal case (no more than tens of segments,
>> and the top 10 or 100 hits), we're talking about what, 100-1000 PQ entries?
>>
>> Explore performance of multi-PQ vs single-PQ sorting API
>>> --------------------------------------------------------
>>>
>>> Key: LUCENE-1997
>>> URL: https://issues.apache.org/jira/browse/LUCENE-1997
>>> Project: Lucene - Java
>>> Issue Type: Improvement
>>> Components: Search
>>> Affects Versions: 2.9
>>> Reporter: Michael McCandless
>>> Assignee: Michael McCandless
>>> Attachments: LUCENE-1997.patch, LUCENE-1997.patch,
>>> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch,
>>> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>>>
>>>
>>> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
>>> where a simpler (non-segment-based) comparator API is proposed that
>>> gathers results into multiple PQs (one per segment) and then merges
>>> them in the end.
>>> I started from John's multi-PQ code and worked it into
>>> contrib/benchmark so that we could run perf tests. Then I generified
>>> the Python script I use for running search benchmarks (in
>>> contrib/benchmark/sortBench.py).
>>> The script first creates indexes with 1M docs (based on
>>> SortableSingleDocSource, and based on wikipedia, if available). Then
>>> it runs various combinations:
>>> * Index with 20 balanced segments vs index with the "normal" log
>>> segment size
>>> * Queries with different numbers of hits (only for wikipedia index)
>>> * Different top N
>>> * Different sorts (by title, for wikipedia, and by random string,
>>> random int, and country for the random index)
>>> For each test, 7 search rounds are run and the best QPS is kept. The
>>> script runs singlePQ then multiPQ, and records the resulting best QPS
>>> for each and produces table (in Jira format) as output.
>>>
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
>> For additional commands, e-mail: java-dev-help [at] lucene
>>
>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
> For additional commands, e-mail: java-dev-help [at] lucene
>
>


markrmiller at gmail

Nov 2, 2009, 3:13 PM

Post #82 of 110 (795 views)
Permalink
Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

Also, eg is Latin for roughly "for example" - so it's not "the"
concern ;)

- Mark

http://www.lucidimagination.com (mobile)

On Nov 2, 2009, at 6:04 PM, "Jake Mannix (JIRA)" <jira [at] apache>
wrote:

>
> [ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772741#action_12772741
> ]
>
> Jake Mannix commented on LUCENE-1997:
> -------------------------------------
>
> The current concern is to do with the memory? I'm more concerned
> with the weird "java ghosts" that are flying around, sometimes
> swaying results by 20-40%... the memory could only be an issue on a
> setup with hundreds of segments and sorting the top 1000 values (do
> we really try to optimize for this performance case?). In the
> normal case (no more than tens of segments, and the top 10 or 100
> hits), we're talking about what, 100-1000 PQ entries?
>
>> Explore performance of multi-PQ vs single-PQ sorting API
>> --------------------------------------------------------
>>
>> Key: LUCENE-1997
>> URL: https://issues.apache.org/jira/browse/LUCENE-1997
>> Project: Lucene - Java
>> Issue Type: Improvement
>> Components: Search
>> Affects Versions: 2.9
>> Reporter: Michael McCandless
>> Assignee: Michael McCandless
>> Attachments: LUCENE-1997.patch, LUCENE-1997.patch,
>> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch,
>> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch,
>> LUCENE-1997.patch
>>
>>
>> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-
>> dev,
>> where a simpler (non-segment-based) comparator API is proposed that
>> gathers results into multiple PQs (one per segment) and then merges
>> them in the end.
>> I started from John's multi-PQ code and worked it into
>> contrib/benchmark so that we could run perf tests. Then I generified
>> the Python script I use for running search benchmarks (in
>> contrib/benchmark/sortBench.py).
>> The script first creates indexes with 1M docs (based on
>> SortableSingleDocSource, and based on wikipedia, if available). Then
>> it runs various combinations:
>> * Index with 20 balanced segments vs index with the "normal" log
>> segment size
>> * Queries with different numbers of hits (only for wikipedia index)
>> * Different top N
>> * Different sorts (by title, for wikipedia, and by random string,
>> random int, and country for the random index)
>> For each test, 7 search rounds are run and the best QPS is kept. The
>> script runs singlePQ then multiPQ, and records the resulting best QPS
>> for each and produces table (in Jira format) as output.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
> For additional commands, e-mail: java-dev-help [at] lucene
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


markrmiller at gmail

Nov 2, 2009, 3:16 PM

Post #83 of 110 (774 views)
Permalink
Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

That I don't know. Probably fewer than more, but that's just me
guessing based on common sense :)

- Mark

http://www.lucidimagination.com (mobile)

On Nov 2, 2009, at 6:13 PM, Jake Mannix <jake.mannix [at] gmail> wrote:

> Ok, and how many of those users are also running on indices with
> hundreds of segments?
>
> -jake
>
> On Mon, Nov 2, 2009 at 3:10 PM, Mark Miller <markrmiller [at] gmail>
> wrote:
> There are plenty of Lucene users that do go 1000 in. We've been
> calling it "deep paging" at LI. I like that name :)
>
> - Mark
>
> http://www.lucidimagination.com (mobile)
>
>
> On Nov 2, 2009, at 6:04 PM, "Jake Mannix (JIRA)" <jira [at] apache>
> wrote:
>
>
> [ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772741#action_12772741
> ]
>
> Jake Mannix commented on LUCENE-1997:
> -------------------------------------
>
> The current concern is to do with the memory? I'm more concerned
> with the weird "java ghosts" that are flying around, sometimes
> swaying results by 20-40%... the memory could only be an issue on a
> setup with hundreds of segments and sorting the top 1000 values (do
> we really try to optimize for this performance case?). In the
> normal case (no more than tens of segments, and the top 10 or 100
> hits), we're talking about what, 100-1000 PQ entries?
>
> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch,
> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch,
> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch,
> LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
> For additional commands, e-mail: java-dev-help [at] lucene
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
> For additional commands, e-mail: java-dev-help [at] lucene
>
>


jake.mannix at gmail

Nov 2, 2009, 3:27 PM

Post #84 of 110 (787 views)
Permalink
Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

Right, that's my common sense saying that I'm guessing that worrying about
memory usage of the order of numSegments * numResultsReturned is rarely
going to be an issue.

I think I actually took latin once, so I may have even known what "eg"
literally meant back at some point in prehistory. I just have seen how
these discussions can catch on one particular thing that is brought up, and
then derailed into long banterings back and forth on that topic like it was
the most important in the world for ever (for example, how I'm highlighting
this one parenthetical remark which Mike made), losing what the actual
consensus and disagreement was.

As far as I could see, in general, for most common use cases, the
performance was so comparable as to be changeable by wind patterns, and ram
usage is such that in comparison to how much memory a lucene system which is
asking for thousands of hits to be returned across hundred-segment indices
(for typical LogMergePolicy, this would only happen once you get up toward a
billion documents) is probably already so large that worrying about another
10KB here and there is really going to get washed away in the same way that
the leading order complexity is totally dominating the CPU time.

The real thing to focus on is the API itself, for maximal ease of use for
the next possibly 3 years (if the timeframe of 2.x is any indicator).

-jake

On Mon, Nov 2, 2009 at 3:16 PM, Mark Miller <markrmiller [at] gmail> wrote:

> That I don't know. Probably fewer than more, but that's just me guessing
> based on common sense :)
>
>
> - Mark
>
> http://www.lucidimagination.com (mobile)
>
> On Nov 2, 2009, at 6:13 PM, Jake Mannix <jake.mannix [at] gmail> wrote:
>
> Ok, and how many of those users are also running on indices with hundreds
> of segments?
>
> -jake
>
> On Mon, Nov 2, 2009 at 3:10 PM, Mark Miller < <markrmiller [at] gmail>
> markrmiller [at] gmail> wrote:
>
>> There are plenty of Lucene users that do go 1000 in. We've been calling it
>> "deep paging" at LI. I like that name :)
>>
>> - Mark
>>
>> <http://www.lucidimagination.com>http://www.lucidimagination.com(mobile)
>>
>>
>> On Nov 2, 2009, at 6:04 PM, "Jake Mannix (JIRA)" < <jira [at] apache>
>> jira [at] apache> wrote:
>>
>>
>>> [
>>> <https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772741#action_12772741>
>>> https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772741#action_12772741
>>> ]
>>>
>>> Jake Mannix commented on LUCENE-1997:
>>> -------------------------------------
>>>
>>> The current concern is to do with the memory? I'm more concerned with
>>> the weird "java ghosts" that are flying around, sometimes swaying results by
>>> 20-40%... the memory could only be an issue on a setup with hundreds of
>>> segments and sorting the top 1000 values (do we really try to optimize for
>>> this performance case?). In the normal case (no more than tens of segments,
>>> and the top 10 or 100 hits), we're talking about what, 100-1000 PQ entries?
>>>
>>> Explore performance of multi-PQ vs single-PQ sorting API
>>>> --------------------------------------------------------
>>>>
>>>> Key: LUCENE-1997
>>>> URL: <https://issues.apache.org/jira/browse/LUCENE-1997>
>>>> https://issues.apache.org/jira/browse/LUCENE-1997
>>>> Project: Lucene - Java
>>>> Issue Type: Improvement
>>>> Components: Search
>>>> Affects Versions: 2.9
>>>> Reporter: Michael McCandless
>>>> Assignee: Michael McCandless
>>>> Attachments: LUCENE-1997.patch, LUCENE-1997.patch,
>>>> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch,
>>>> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>>>>
>>>>
>>>> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
>>>> where a simpler (non-segment-based) comparator API is proposed that
>>>> gathers results into multiple PQs (one per segment) and then merges
>>>> them in the end.
>>>> I started from John's multi-PQ code and worked it into
>>>> contrib/benchmark so that we could run perf tests. Then I generified
>>>> the Python script I use for running search benchmarks (in
>>>> contrib/benchmark/sortBench.py).
>>>> The script first creates indexes with 1M docs (based on
>>>> SortableSingleDocSource, and based on wikipedia, if available). Then
>>>> it runs various combinations:
>>>> * Index with 20 balanced segments vs index with the "normal" log
>>>> segment size
>>>> * Queries with different numbers of hits (only for wikipedia index)
>>>> * Different top N
>>>> * Different sorts (by title, for wikipedia, and by random string,
>>>> random int, and country for the random index)
>>>> For each test, 7 search rounds are run and the best QPS is kept. The
>>>> script runs singlePQ then multiPQ, and records the resulting best QPS
>>>> for each and produces table (in Jira format) as output.
>>>>
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> -
>>> You can reply to this email to add a comment to the issue online.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: <java-dev-unsubscribe [at] lucene>
>>> java-dev-unsubscribe [at] lucene
>>> For additional commands, e-mail: <java-dev-help [at] lucene>
>>> java-dev-help [at] lucene
>>>
>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: <java-dev-unsubscribe [at] lucene>
>> java-dev-unsubscribe [at] lucene
>> For additional commands, e-mail: <java-dev-help [at] lucene>
>> java-dev-help [at] lucene
>>
>>
>


markrmiller at gmail

Nov 2, 2009, 3:34 PM

Post #85 of 110 (781 views)
Permalink
Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

Yeah, I was half joking about the Latin - I'm sure you do know what it
means. But you still chose to reword it as "the" concern. And so I
chose to flaunt my ridiculously miniscule grasp of high school Latin :)

- Mark

http://www.lucidimagination.com (mobile)

On Nov 2, 2009, at 6:27 PM, Jake Mannix <jake.mannix [at] gmail> wrote:

> Right, that's my common sense saying that I'm guessing that worrying
> about memory usage of the order of numSegments * numResultsReturned
> is rarely going to be an issue.
>
> I think I actually took latin once, so I may have even known what
> "eg" literally meant back at some point in prehistory. I just have
> seen how these discussions can catch on one particular thing that is
> brought up, and then derailed into long banterings back and forth on
> that topic like it was the most important in the world for ever (for
> example, how I'm highlighting this one parenthetical remark which
> Mike made), losing what the actual consensus and disagreement was.
>
> As far as I could see, in general, for most common use cases, the
> performance was so comparable as to be changeable by wind patterns,
> and ram usage is such that in comparison to how much memory a lucene
> system which is asking for thousands of hits to be returned across
> hundred-segment indices (for typical LogMergePolicy, this would only
> happen once you get up toward a billion documents) is probably
> already so large that worrying about another 10KB here and there is
> really going to get washed away in the same way that the leading
> order complexity is totally dominating the CPU time.
>
> The real thing to focus on is the API itself, for maximal ease of
> use for the next possibly 3 years (if the timeframe of 2.x is any
> indicator).
>
> -jake
>
> On Mon, Nov 2, 2009 at 3:16 PM, Mark Miller <markrmiller [at] gmail>
> wrote:
> That I don't know. Probably fewer than more, but that's just me
> guessing based on common sense :)
>
>
> - Mark
>
> http://www.lucidimagination.com (mobile)
>
> On Nov 2, 2009, at 6:13 PM, Jake Mannix <jake.mannix [at] gmail> wrote:
>
>> Ok, and how many of those users are also running on indices with
>> hundreds of segments?
>>
>> -jake
>>
>> On Mon, Nov 2, 2009 at 3:10 PM, Mark Miller <markrmiller [at] gmail>
>> wrote:
>> There are plenty of Lucene users that do go 1000 in. We've been
>> calling it "deep paging" at LI. I like that name :)
>>
>> - Mark
>>
>> http://www.lucidimagination.com (mobile)
>>
>>
>> On Nov 2, 2009, at 6:04 PM, "Jake Mannix (JIRA)" <jira [at] apache>
>> wrote:
>>
>>
>> [ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772741#action_12772741
>> ]
>>
>> Jake Mannix commented on LUCENE-1997:
>> -------------------------------------
>>
>> The current concern is to do with the memory? I'm more concerned
>> with the weird "java ghosts" that are flying around, sometimes
>> swaying results by 20-40%... the memory could only be an issue on
>> a setup with hundreds of segments and sorting the top 1000 values
>> (do we really try to optimize for this performance case?). In the
>> normal case (no more than tens of segments, and the top 10 or 100
>> hits), we're talking about what, 100-1000 PQ entries?
>>
>> Explore performance of multi-PQ vs single-PQ sorting API
>> --------------------------------------------------------
>>
>> Key: LUCENE-1997
>> URL: https://issues.apache.org/jira/browse/LUCENE-1997
>> Project: Lucene - Java
>> Issue Type: Improvement
>> Components: Search
>> Affects Versions: 2.9
>> Reporter: Michael McCandless
>> Assignee: Michael McCandless
>> Attachments: LUCENE-1997.patch, LUCENE-1997.patch,
>> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch,
>> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch,
>> LUCENE-1997.patch
>>
>>
>> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-
>> dev,
>> where a simpler (non-segment-based) comparator API is proposed that
>> gathers results into multiple PQs (one per segment) and then merges
>> them in the end.
>> I started from John's multi-PQ code and worked it into
>> contrib/benchmark so that we could run perf tests. Then I generified
>> the Python script I use for running search benchmarks (in
>> contrib/benchmark/sortBench.py).
>> The script first creates indexes with 1M docs (based on
>> SortableSingleDocSource, and based on wikipedia, if available). Then
>> it runs various combinations:
>> * Index with 20 balanced segments vs index with the "normal" log
>> segment size
>> * Queries with different numbers of hits (only for wikipedia index)
>> * Different top N
>> * Different sorts (by title, for wikipedia, and by random string,
>> random int, and country for the random index)
>> For each test, 7 search rounds are run and the best QPS is kept. The
>> script runs singlePQ then multiPQ, and records the resulting best QPS
>> for each and produces table (in Jira format) as output.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
>> For additional commands, e-mail: java-dev-help [at] lucene
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
>> For additional commands, e-mail: java-dev-help [at] lucene
>>
>>
>


jira at apache

Nov 2, 2009, 3:36 PM

Post #86 of 110 (778 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772754#action_12772754 ]

John Wang commented on LUCENE-1997:
-----------------------------------

Hi Michael:

Thanks for the heads up. I will work on it locally then.

I am a bit confused here with memory, since most users don't go beyond page one, I can't see memory is even a concern here comparing to the amount of memory lucene uses overall. Am I missing something?

-John

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 2, 2009, 3:54 PM

Post #87 of 110 (779 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772764#action_12772764 ]

John Wang commented on LUCENE-1997:
-----------------------------------

I just looked at the most recent patch. Every entry in the PQ is an extra int, so even at the very very very rare and extreme case, 100th page(assuming 10 entries per page) and 100 segment index, we are looking at 400k.
Is this really a concern? I must be missing something...

-John

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 2, 2009, 3:58 PM

Post #88 of 110 (775 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772770#action_12772770 ]

Mark Miller commented on LUCENE-1997:
-------------------------------------

Its not as rare as you think - certainly not 3 verys rare.

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 2, 2009, 4:01 PM

Post #89 of 110 (776 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772773#action_12772773 ]

Earwin Burrfoot commented on LUCENE-1997:
-----------------------------------------

Regarding memory - If I'm not misunderstanding things, you also have to materialize all field values you're sorting on for each doc in each PQ?

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 2, 2009, 4:08 PM

Post #90 of 110 (775 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772776#action_12772776 ]

Mark Miller commented on LUCENE-1997:
-------------------------------------

Yes, but you have to do that with the current method as well. Comes with per segment really.

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 2, 2009, 4:14 PM

Post #91 of 110 (781 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772779#action_12772779 ]

Earwin Burrfoot commented on LUCENE-1997:
-----------------------------------------

Right now DocComparator cheats and stores fields striped, keeping primitives(or in my case any value that can be represented with primitive) - primitive. If the same approach is taken with multiPQs, BOOM! - there goes your API simplicity.

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 2, 2009, 4:36 PM

Post #92 of 110 (772 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772790#action_12772790 ]

John Wang commented on LUCENE-1997:
-----------------------------------

Mark:

100th page at the same time index is at 100 segments? How many very's would you give it?

Earwin:

Field values are in FieldCache. Not in the PQ. It is the PQ's memory consumption is at question here (If I am not misunderstanding) You only materialize after the merge, which even at the N*Very case, it is only a page worth, which is the same as the singlePQ approach.

-John

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 2, 2009, 4:38 PM

Post #93 of 110 (778 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772792#action_12772792 ]

Mark Miller commented on LUCENE-1997:
-------------------------------------

bq. 100th page at the same time index is at 100 segments? How many very's would you give it?

I'm not claiming 100th page with many segments - I have no info on that, and I agree it would be more rare. But it has come to my attention that 100th page is more common than I would have thought.

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 2, 2009, 4:50 PM

Post #94 of 110 (778 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772794#action_12772794 ]

John Wang commented on LUCENE-1997:
-----------------------------------

Mark:

The point of discussion is memory, unless a few hundred K of memory consumption implies a "huge perf drop". (I see you are being conservative and using only 1 huge :) )

Even with 100 segment which I am guessing you agree that it is rare, it is 400K, (in this discussion, I am using it as an upper bound, perhaps I should state it more explicitly) and thus my inability to understand that being a memory concern.

BTW, I am interested the percentage of "deep paging" you are seeing. You argue it is not rare, do you have some concrete numbers? The stats I have seen from our production logs and also web search logs when I was working on that, the percentage is very very very very very (5 very's) low. (sharp drop usually is at page 4, let alone page 100)

-John


> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 2, 2009, 4:56 PM

Post #95 of 110 (776 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772798#action_12772798 ]

Mark Miller commented on LUCENE-1997:
-------------------------------------

bq. The point of discussion is memory, unless a few hundred K of memory consumption implies a "huge perf drop". (I see you are being conservative and using only 1 huge )

I know, I was purposely avoiding getting into the mem argument and just focusing on how rare the situation is. And whether there is going to be a huge perf drop with queue sizes of 1000, I just don't know. The tests have been changing a lot - which is why I think its a little early to come to final conclusions.

bq. Even with 100 segment which I am guessing you agree that it is rare, it is 400K, (in this discussion, I am using it as an upper bound, perhaps I should state it more explicitly) and thus my inability to understand that being a memory concern.

Yes - I do agree its rare.

bq. BTW, I am interested the percentage of "deep paging" you are seeing. You argue it is not rare, do you have some concrete numbers? The stats I have seen from our production logs and also web search logs when I was working on that, the percentage is very very very very very (5 very's) low. (sharp drop usually is at page 4, let alone page 100)

I don't have numbers I can share - but this isn't for situations with users paging through an interface (like a web search page) - its users that are using Lucene for other tasks - and there are plenty of those. Lucene is used a lot for websites with users click through 10 results at a time - but its also used in many, many other apps (and I do mean two manys ! :) )

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 2, 2009, 5:04 PM

Post #96 of 110 (775 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772809#action_12772809 ]

Mark Miller commented on LUCENE-1997:
-------------------------------------

Actually - while I cannot share any current info I have, I'll share an example from my last job. I worked on a system that librarians used to maintain a newspaper archive. The feed for the paper would come in daily and the librarians would "enhance" the data - adding keywords, breaking up stories, etc. Then reporters or end users could search this data. Librarians, who I learned are odd in there requirements by nature, insisted on bringing in thousands of results that they could scroll through at a time. This was demanded at paper after paper. So we regularly fed back up to 5000 thousand results at a time with our software (though they'd have preferred no limit - "what are you talking about ! I want them all!" - we made them click more buttons for that :) ). Thats just one small example, but I know for a fact there are many, many more.

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 2, 2009, 8:54 PM

Post #97 of 110 (767 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772877#action_12772877 ]

John Wang commented on LUCENE-1997:
-----------------------------------

Another observation, with multiQ approach, seems there would be no need for the set of OutOfOrder*Comparators. Because we would know each Q corresponds to 1 segment, and would not matter if docs come out of order.
Please correct me if I am misunderstanding here.

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 3, 2009, 2:19 AM

Post #98 of 110 (753 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773000#action_12773000 ]

Earwin Burrfoot commented on LUCENE-1997:
-----------------------------------------

bq. though they'd have preferred no limit - "what are you talking about ! I want them all!"
Same here. People searching on a job site don't really care for top 10 vacancies/resumes, they want eeeeeverything! that matched their requirements. :)

John: on the other hand, we here already have code to merge results coming from different shards, with stripes, primitives and whistles to appease GC. Might as well reuse it asis to merge between segments.

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 3, 2009, 2:45 AM

Post #99 of 110 (748 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773012#action_12773012 ]

Michael McCandless commented on LUCENE-1997:
--------------------------------------------

bq. Another observation, with multiQ approach, seems there would be no need for the set of OutOfOrder*Comparators.

I think it'd still be beneficial to differentiate In vs OutOf order collectors, because even within 1 segment if you know the docIDs arrive in order then the compare bottom is cheaper (need not break ties).

bq. Even with 100 segment which I am guessing you agree that it is rare, it is 400K,

Don't forget that this is multiplied by however many queries are currently in flight.

Yonik also raised an important difference of the single PQ API (above: https://issues.apache.org/jira/browse/LUCENE-1997?focusedCommentId=12771008&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12771008), ie, the fact that it references "slots" instead of "docid" means you can cache something private in your "slots" based on previous compare/copy.

Since each approach has distinct advantages, why not offer both ("simple" and "expert") comparator extensions APIs?

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 3, 2009, 10:51 AM

Post #100 of 110 (700 views)
Permalink
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773114#action_12773114 ]

Jake Mannix commented on LUCENE-1997:
-------------------------------------

bq. Since each approach has distinct advantages, why not offer both ("simple" and "expert") comparator extensions APIs?

+1 from me on this one, as long as the simpler one is around. I'll bet we'll find that we regret keeping the "expert" one by 3.2 or so though, but I'll take any compromise which gets the simpler API in there.

bq. Don't forget that this is multiplied by however many queries are currently in flight.

Sure, so if you're running with 100 queries per second on a single shard (pretty fast!), with 100 segments, and you want to do sorting by value on the top 1000 values (how far down the long tail of extreme cases are we at now? Do librarians hit their search servers with 100 QPS and have indices poorly built with hundreds of segments and can't take downtime to *ever* optimize?), we're now talking about 40MB.

*Forty megabytes*. On a beefy machine which is supposed to be handling 100QPS across an index big enough to need 100 segments. How much heap would such a machine already be allocating? 4GB? 6? More?

We're talking about less than 1% of the heap is being used by the multiPQ approach in comparison to singlePQ.

> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests. Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available). Then
> it runs various combinations:
> * Index with 20 balanced segments vs index with the "normal" log
> segment size
> * Queries with different numbers of hits (only for wikipedia index)
> * Different top N
> * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept. The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene

First page Previous page 1 2 3 4 5 Next page Last page  View All Lucene java-dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.