
Mailing List Archive: Linux: Kernel

[PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

 

 



raghavendra.kt at linux

May 13, 2012, 11:45 AM

Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

On 05/07/2012 08:22 PM, Avi Kivity wrote:

I could not come up with pv-flush results (also, Nikunj had clarified that
the result was on non-PLE hardware).

> I'd like to see those numbers, then.
>
> Ingo, please hold on the kvm-specific patches, meanwhile.
>

3 guests with 8GB RAM each, 1 used for kernbench
(kernbench -f -H -M -o 20) and the others for cpu hogs (a shell script
running hackbench in a loop; a sketch follows below).

1x: no hogs
2x: 8 hogs in one guest
3x: 8 hogs each in two guests
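
(For reference, a minimal sketch of the hog script; the hackbench
arguments are an assumption, not necessarily what was actually run:)

  #!/bin/bash
  # Start 8 "hog" loops in the guest; each loop re-runs hackbench forever
  # so the vcpus stay busy and keep contending for the host pcpus.
  NR_HOGS=8
  for i in $(seq $NR_HOGS); do
      while true; do
          hackbench 10 > /dev/null 2>&1   # 10 groups; argument form depends on the hackbench variant
      done &
  done
  wait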

kernbench on PLE:
Machine: IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPUs, 32
cores, with 8 online cpus and 4*64GB RAM.

The average is taken over 4 iterations with 3 runs each (4*3=12), and
the stdev is calculated over the mean reported in each run.


A): 8 vcpu guest

                  BASE                    BASE+patch              %improvement w.r.t.
                  mean (sd)               mean (sd)               patched kernel time
case 1*1x:        61.7075   (1.17872)     60.93     (1.475625)    1.27605
case 1*2x:        107.2125  (1.3821349)   97.506675 (1.3461878)   9.95401
case 1*3x:        144.3515  (1.8203927)   138.9525  (0.58309319)  3.8855
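
(The %improvement column is computed relative to the patched kernel time;
e.g. for case 1*1x:)

  # %improvement w.r.t. patched kernel time = (base - patched) / patched * 100
  awk 'BEGIN { base = 61.7075; patched = 60.93;
               printf "%.5f\n", (base - patched) / patched * 100 }'   # prints 1.27605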


B): 16 vcpu guest

                  BASE                    BASE+patch              %improvement w.r.t.
                  mean (sd)               mean (sd)               patched kernel time
case 2*1x:        70.524    (1.5941395)   69.68866  (1.9392529)   1.19867
case 2*2x:        133.0738  (1.4558653)   124.8568  (1.4544986)   6.58114
case 2*3x:        206.0094  (1.3437359)   181.4712  (2.9134116)   13.5218

C): 32 vcpu guest

                  BASE                    BASE+patch              %improvement w.r.t.
                  mean (sd)               mean (sd)               patched kernel time
case 4*1x:        100.61046 (2.7603485)   85.48734  (2.6035035)   17.6905

It seems that while we do not see any improvement in the low contention
case, the benefit becomes evident with overcommit and large guests. I am
continuing the analysis with other benchmarks (now with pgbench, to check
whether it shows an acceptable improvement/degradation in the low
contention case).

Avi,
Can the patch series go ahead for inclusion into the tree, for the
following reasons:

The patch series brings fairness with ticketlocks (and hence
predictability: during contention, a vcpu trying to acquire the lock is
sure to get its turn in fewer than the total number of vcpus contending
for the lock), which is very much desired irrespective of its small
benefit/degradation (if any) in low contention scenarios.

Of course, ticketlocks had the undesirable effect of exploding the LHP
(lock holder preemption) problem, and the series addresses that by
scheduling and sleeping instead of burning cpu time.

Finally, a less famous one: it brings an almost PLE-equivalent capability
to all the non-PLE hardware (TBH I always preferred my experimental
kernel to be compiled in my pv guest; that saves more than 30 min of time
for each run).

It would be nice to see results from anybody else who benefited from, or
was hurt by, the patchset.



nikunj at linux

May 13, 2012, 9:57 PM

Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

On Mon, 14 May 2012 00:15:30 +0530, Raghavendra K T <raghavendra.kt [at] linux> wrote:
> On 05/07/2012 08:22 PM, Avi Kivity wrote:
>
> I could not come with pv-flush results (also Nikunj had clarified that
> the result was on NOn PLE
>
Did you see any issues on PLE?

Regards,
Nikunj



jeremy at goop

May 14, 2012, 12:38 AM

Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

On 05/13/2012 11:45 AM, Raghavendra K T wrote:
> On 05/07/2012 08:22 PM, Avi Kivity wrote:
>
> I could not come with pv-flush results (also Nikunj had clarified that
> the result was on NOn PLE
>
>> I'd like to see those numbers, then.
>>
>> Ingo, please hold on the kvm-specific patches, meanwhile.
>>
>
> 3 guests 8GB RAM, 1 used for kernbench
> (kernbench -f -H -M -o 20) other for cpuhog (shell script with while
> true do hackbench)
>
> 1x: no hogs
> 2x: 8hogs in one guest
> 3x: 8hogs each in two guest
>
> kernbench on PLE:
> Machine : IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU with 32
> core, with 8 online cpus and 4*64GB RAM.
>
> The average is taken over 4 iterations with 3 run each (4*3=12). and
> stdev is calculated over mean reported in each run.
>
>
> A): 8 vcpu guest
>
> BASE BASE+patch %improvement w.r.t
> mean (sd) mean (sd)
> patched kernel time
> case 1*1x: 61.7075 (1.17872) 60.93 (1.475625) 1.27605
> case 1*2x: 107.2125 (1.3821349) 97.506675 (1.3461878) 9.95401
> case 1*3x: 144.3515 (1.8203927) 138.9525 (0.58309319) 3.8855
>
>
> B): 16 vcpu guest
> BASE BASE+patch %improvement w.r.t
> mean (sd) mean (sd)
> patched kernel time
> case 2*1x: 70.524 (1.5941395) 69.68866 (1.9392529) 1.19867
> case 2*2x: 133.0738 (1.4558653) 124.8568 (1.4544986) 6.58114
> case 2*3x: 206.0094 (1.3437359) 181.4712 (2.9134116) 13.5218
>
> B): 32 vcpu guest
> BASE BASE+patch %improvementw.r.t
> mean (sd) mean (sd)
> patched kernel time
> case 4*1x: 100.61046 (2.7603485) 85.48734 (2.6035035) 17.6905

What does the "4*1x" notation mean? Do these workloads have overcommit
of the PCPU resources?

When I measured it, even quite small amounts of overcommit led to large
performance drops with non-pv ticket locks (on the order of 10%
improvements when there were 5 busy VCPUs on a 4 cpu system). I never
tested it on larger machines, but I guess that represents around 25%
overcommit, or 40 busy VCPUs on a 32-PCPU system.

J


raghavendra.kt at linux

May 14, 2012, 1:11 AM

Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

On 05/14/2012 01:08 PM, Jeremy Fitzhardinge wrote:
> On 05/13/2012 11:45 AM, Raghavendra K T wrote:
>> On 05/07/2012 08:22 PM, Avi Kivity wrote:
>>
>> I could not come with pv-flush results (also Nikunj had clarified that
>> the result was on NOn PLE
>>
>>> I'd like to see those numbers, then.
>>>
>>> Ingo, please hold on the kvm-specific patches, meanwhile.
>>>
>>
>> 3 guests 8GB RAM, 1 used for kernbench
>> (kernbench -f -H -M -o 20) other for cpuhog (shell script with while
>> true do hackbench)
>>
>> 1x: no hogs
>> 2x: 8hogs in one guest
>> 3x: 8hogs each in two guest
>>
>> kernbench on PLE:
>> Machine : IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU with 32
>> core, with 8 online cpus and 4*64GB RAM.
>>
>> The average is taken over 4 iterations with 3 run each (4*3=12). and
>> stdev is calculated over mean reported in each run.
>>
>>
>> A): 8 vcpu guest
>>
>> BASE BASE+patch %improvement w.r.t
>> mean (sd) mean (sd)
>> patched kernel time
>> case 1*1x: 61.7075 (1.17872) 60.93 (1.475625) 1.27605
>> case 1*2x: 107.2125 (1.3821349) 97.506675 (1.3461878) 9.95401
>> case 1*3x: 144.3515 (1.8203927) 138.9525 (0.58309319) 3.8855
>>
>>
>> B): 16 vcpu guest
>> BASE BASE+patch %improvement w.r.t
>> mean (sd) mean (sd)
>> patched kernel time
>> case 2*1x: 70.524 (1.5941395) 69.68866 (1.9392529) 1.19867
>> case 2*2x: 133.0738 (1.4558653) 124.8568 (1.4544986) 6.58114
>> case 2*3x: 206.0094 (1.3437359) 181.4712 (2.9134116) 13.5218
>>
>> B): 32 vcpu guest
>> BASE BASE+patch %improvementw.r.t
>> mean (sd) mean (sd)
>> patched kernel time
>> case 4*1x: 100.61046 (2.7603485) 85.48734 (2.6035035) 17.6905
>
> What does the "4*1x" notation mean? Do these workloads have overcommit
> of the PCPU resources?
>
> When I measured it, even quite small amounts of overcommit lead to large
> performance drops with non-pv ticket locks (on the order of 10%
> improvements when there were 5 busy VCPUs on a 4 cpu system). I never
> tested it on larger machines, but I guess that represents around 25%
> overcommit, or 40 busy VCPUs on a 32-PCPU system.

All the above measurements are on the PLE machine. It is a single
32 vcpu guest on 8 pcpus.

(PS: one problem I saw in my kernbench run itself is that the number of
threads spawned was 20 instead of 2 * the number of vcpus. I'll correct
that during the next measurement.)

"even quite small amounts of overcommit lead to large performance drops
with non-pv ticket locks":

This is very much true on a non-PLE machine, where compilation probably
takes even a day vs. just one hour (with just a 1:3x overcommit I had
got a 25x speedup).



raghavendra.kt at linux

May 14, 2012, 2:01 AM

Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

On 05/14/2012 10:27 AM, Nikunj A Dadhania wrote:
> On Mon, 14 May 2012 00:15:30 +0530, Raghavendra K T<raghavendra.kt [at] linux> wrote:
>> On 05/07/2012 08:22 PM, Avi Kivity wrote:
>>
>> I could not come with pv-flush results (also Nikunj had clarified that
>> the result was on NOn PLE
>>
> Did you see any issues on PLE?
>

No, I did not see issues with the setup, but I have not yet had time to
check that out.



JBeulich at suse

May 15, 2012, 4:26 AM

Re: [Xen-devel] [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

>>> On 07.05.12 at 19:25, Ingo Molnar <mingo [at] kernel> wrote:

(apologies for the late reply, the mail just now made it to my inbox
via xen-devel)

> I'll hold off on the whole thing - frankly, we don't want this
> kind of Xen-only complexity. If KVM can make use of PLE then Xen
> ought to be able to do it as well.

It does - for fully virtualized guests. For para-virtualized ones,
it can't (as the hardware feature is an extension to VMX/SVM).

> If both Xen and KVM makes good use of it then that's a different
> matter.

I saw in a later reply that you're now tending towards trying it
out at least - thanks.

Jan



raghavendra.kt at linux

May 15, 2012, 8:19 PM

Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

On 05/14/2012 12:15 AM, Raghavendra K T wrote:
> On 05/07/2012 08:22 PM, Avi Kivity wrote:
>
> I could not come with pv-flush results (also Nikunj had clarified that
> the result was on NOn PLE
>
>> I'd like to see those numbers, then.
>>
>> Ingo, please hold on the kvm-specific patches, meanwhile.
>>
>
> 3 guests 8GB RAM, 1 used for kernbench
> (kernbench -f -H -M -o 20) other for cpuhog (shell script with while
> true do hackbench)
>
> 1x: no hogs
> 2x: 8hogs in one guest
> 3x: 8hogs each in two guest
>
> kernbench on PLE:
> Machine : IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU with 32
> core, with 8 online cpus and 4*64GB RAM.
>
> The average is taken over 4 iterations with 3 run each (4*3=12). and
> stdev is calculated over mean reported in each run.
>
>
> A): 8 vcpu guest
>
> BASE BASE+patch %improvement w.r.t
> mean (sd) mean (sd) patched kernel time
> case 1*1x: 61.7075 (1.17872) 60.93 (1.475625) 1.27605
> case 1*2x: 107.2125 (1.3821349) 97.506675 (1.3461878) 9.95401
> case 1*3x: 144.3515 (1.8203927) 138.9525 (0.58309319) 3.8855
>
>
> B): 16 vcpu guest
> BASE BASE+patch %improvement w.r.t
> mean (sd) mean (sd) patched kernel time
> case 2*1x: 70.524 (1.5941395) 69.68866 (1.9392529) 1.19867
> case 2*2x: 133.0738 (1.4558653) 124.8568 (1.4544986) 6.58114
> case 2*3x: 206.0094 (1.3437359) 181.4712 (2.9134116) 13.5218
>
> B): 32 vcpu guest
> BASE BASE+patch %improvementw.r.t
> mean (sd) mean (sd) patched kernel time
> case 4*1x: 100.61046 (2.7603485) 85.48734 (2.6035035) 17.6905
>
> It seems while we do not see any improvement in low contention case,
> the benefit becomes evident with overcommit and large guests. I am
> continuing analysis with other benchmarks (now with pgbench to check if
> it has acceptable improvement/degradation in low contenstion case).

Here are the results for pgbench and sysbench. These results are from a
single guest.

Machine: IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPUs, 32 cores,
with 8 online cpus and 4*64GB RAM.

Guest config: 8GB RAM

pgbench
==========

unit = tps (higher is better)
pgbench based on pgsql 9.2-dev:
http://www.postgresql.org/ftp/snapshot/dev/ (link given by Attilo)

tool used to collect the benchmark:
git://git.postgresql.org/git/pgbench-tools.git
config: MAX_WORKER=16, SCALE=32, run for NRCLIENTS = 1, 8, 64

Average taken over 10 iterations.
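
(For anyone reproducing this without pgbench-tools, a roughly equivalent
raw pgbench run is sketched below; the database name, run duration and
thread count are assumptions, only the scale and client counts come from
the config above:)

  #!/bin/bash
  # Initialize a scale-32 pgbench database, then measure tps for
  # 1, 8 and 64 clients (mirroring NRCLIENTS above).
  DB=pgbench                               # assumed database name
  pgbench -i -s 32 "$DB"                   # SCALE=32
  for clients in 1 8 64; do
      pgbench -c "$clients" -j 8 -T 60 "$DB"   # -j/-T values are assumptions
  done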

8 vcpu guest

N     base      patch     improvement
1     5271      5235      -0.687679
8     37953     38202      0.651798
64    37546     37774      0.60359


16 vcpu guest

N     base      patch     improvement
1     5229      5239       0.190876
8     34908     36048      3.16245
64    51796     52852      1.99803

sysbench
==========
sysbench 0.4.12, configured for the postgres driver, was run with:
sysbench --num-threads=8/16/32 --max-requests=100000 --test=oltp
--oltp-table-size=500000 --db-driver=pgsql --oltp-read-only run
and analysed with ministat, with
x = patch
+ = base
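
(The comparison itself is just ministat over the per-run numbers; a
sketch, with made-up file names holding one result per line for the 10
runs of each kernel:)

  # x = patch (first file), + = base (second file), 98% confidence
  ministat -c 98 patch_runs.txt base_runs.txt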

8 vcpu guest
---------------
1) num_threads = 8
    N        Min        Max        Median     Avg        Stddev
x  10        20.7805    21.55      20.9667    21.03502   0.22682186
+  10        21.025     22.3122    21.29535   21.41793   0.39542349
Difference at 98.0% confidence
        1.82035% +/- 1.74892%

2) num_threads = 16
    N        Min        Max        Median     Avg        Stddev
x  10        20.8786    21.3967    21.1566    21.14441   0.15490983
+  10        21.3992    21.9437    21.46235   21.58724   0.2089425
Difference at 98.0% confidence
        2.09431% +/- 0.992732%

3) num_threads = 32
    N        Min        Max        Median     Avg        Stddev
x  10        21.1329    21.3726    21.33415   21.2893    0.08324195
+  10        21.5692    21.8966    21.6441    21.65679   0.093430003
Difference at 98.0% confidence
        1.72617% +/- 0.474343%


16 vcpu guest
---------------
1) num_threads = 8
    N        Min        Max        Median     Avg        Stddev
x  10        23.5314    25.6118    24.76145   24.64517   0.74856264
+  10        22.2675    26.6204    22.9131    23.50554   1.345386
No difference proven at 98.0% confidence

2) num_threads = 16
    N        Min        Max        Median     Avg        Stddev
x  10        12.0095    12.2305    12.15575   12.13926   0.070872722
+  10        11.413     11.6986    11.4817    11.493     0.080007819
Difference at 98.0% confidence
        -5.32372% +/- 0.710561%

3) num_threads = 32
    N        Min        Max        Median     Avg        Stddev
x  10        12.1378    12.3567    12.21675   12.22703   0.0670695
+  10        11.573     11.7438    11.6306    11.64905   0.062780221
Difference at 98.0% confidence
        -4.72707% +/- 0.606349%


32 vcpu guest
---------------
1) num_threads = 8
    N        Min        Max        Median     Avg        Stddev
x  10        30.5602    41.4756    37.45155   36.43752   3.5490215
+  10        21.1183    49.2599    22.60845   29.61119   11.269393
No difference proven at 98.0% confidence

2) num_threads = 16
    N        Min        Max        Median     Avg        Stddev
x  10        12.2556    12.9023    12.4968    12.55764   0.25330459
+  10        11.7627    11.9959    11.8419    11.86256   0.088563903
Difference at 98.0% confidence
        -5.53512% +/- 1.72448%

3) num_threads = 32
    N        Min        Max        Median     Avg        Stddev
x  10        16.8751    17.0756    16.97335   16.96765   0.063197191
+  10        21.3763    21.8111    21.6799    21.66438   0.13059888
Difference at 98.0% confidence
        27.6805% +/- 0.690056%


To summarise: with a 32 vcpu guest and num_threads=32 we get around a
27% improvement. On very low/undercommitted systems we may see a very
small improvement or a small acceptable degradation (which it deserves).

(IMO with more overcommit/contention we can get more than 15% for these
benchmarks, and we do.)

Please let me know if you have any suggestions for further runs.
(Currently my PLE machine lease has expired; it may take some time to
come back :()

Ingo, Avi?


>
> Avi,
> Can patch series go ahead for inclusion into tree with following
> reasons:
>
> The patch series brings fairness with ticketlock ( hence the
> predictability, since during contention, vcpu trying
> to acqire lock is sure that it gets its turn in less than total number
> of vcpus conntending for lock), which is very much desired irrespective
> of its low benefit/degradation (if any) in low contention scenarios.
>
> Ofcourse ticketlocks had undesirable effect of exploding LHP problem,
> and the series addresses with improvement in scheduling and sleeping
> instead of burning cpu time.
>
> Finally a less famous one, it brings almost PLE equivalent capabilty to
> all the non PLE hardware (TBH I always preferred my experiment kernel to
> be compiled in my pv guest that saves more than 30 min of time for each
> run).



raghavendra.kt at linux

May 30, 2012, 4:26 AM

Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

On 05/16/2012 08:49 AM, Raghavendra K T wrote:
> On 05/14/2012 12:15 AM, Raghavendra K T wrote:
>> On 05/07/2012 08:22 PM, Avi Kivity wrote:
>>
>> I could not come with pv-flush results (also Nikunj had clarified that
>> the result was on NOn PLE
>>
>>> I'd like to see those numbers, then.
>>>
>>> Ingo, please hold on the kvm-specific patches, meanwhile.
[...]
> To summarise,
> with 32 vcpu guest with nr thread=32 we get around 27% improvement. In
> very low/undercommitted systems we may see very small improvement or
> small acceptable degradation ( which it deserves).
>

For large guests, the current SPIN_THRESHOLD value, along with
ple_window, needed some research/experimentation.

[ Thanks to Jeremy/Nikunj for inputs and help with the result analysis ]

I started with the debugfs spinlock histograms, and ran experiments with
32 and 64 vcpu guests for spin thresholds of 2k, 4k, 8k, 16k, and 32k
with 1vm/2vm/4vm for kernbench, sysbench, ebizzy and hackbench.
[ the spinlock histogram gives a logarithmic view of lock-wait times ]

Machine: PLE machine with 32 cores.

Here is the result summary. The summary has 2 parts:
(1) %improvement w.r.t. the 2K spin threshold,
(2) improvement w.r.t. the sum of the histogram numbers in debugfs (which
gives a rough indication of contention/cpu time wasted).

For example, 98% for the 4k threshold for kbench with 1 vm would imply
that there is a 98% reduction in sigma(histogram values) compared to the
2k case.
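
(The "sum of histogram numbers" can be collected with something like the
snippet below; the histogram file path and its column layout depend on
the spinlock-stats debug patch in use, so both are assumptions here:)

  #!/bin/bash
  # Mount debugfs if needed, then sum the per-bucket counts of one
  # lock-wait histogram (assuming a "<bucket> <count>" line format).
  mountpoint -q /sys/kernel/debug || mount -t debugfs none /sys/kernel/debug
  HISTO=/sys/kernel/debug/kvm-guest/histo_spin_blocked   # assumed path
  awk '{ sum += $2 } END { print "sigma(histogram values) =", sum }' "$HISTO"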

Result for 32 vcpu guest
==========================
+----------------+-----------+-----------+-----------+-----------+
| Base-2k | 4k | 8k | 16k | 32k |
+----------------+-----------+-----------+-----------+-----------+
| kbench-1vm | 44 | 50 | 46 | 41 |
| SPINHisto-1vm | 98 | 99 | 99 | 99 |
| kbench-2vm | 25 | 45 | 49 | 45 |
| SPINHisto-2vm | 31 | 91 | 99 | 99 |
| kbench-4vm | -13 | -27 | -2 | -4 |
| SPINHisto-4vm | 29 | 66 | 95 | 99 |
+----------------+-----------+-----------+-----------+-----------+
| ebizzy-1vm | 954 | 942 | 913 | 915 |
| SPINHisto-1vm | 96 | 99 | 99 | 99 |
| ebizzy-2vm | 158 | 135 | 123 | 106 |
| SPINHisto-2vm | 90 | 98 | 99 | 99 |
| ebizzy-4vm | -13 | -28 | -33 | -37 |
| SPINHisto-4vm | 83 | 98 | 99 | 99 |
+----------------+-----------+-----------+-----------+-----------+
| hbench-1vm | 48 | 56 | 52 | 64 |
| SPINHisto-1vm | 92 | 95 | 99 | 99 |
| hbench-2vm | 32 | 40 | 39 | 21 |
| SPINHisto-2vm | 74 | 96 | 99 | 99 |
| hbench-4vm | 27 | 15 | 3 | -57 |
| SPINHisto-4vm | 68 | 88 | 94 | 97 |
+----------------+-----------+-----------+-----------+-----------+
| sysbnch-1vm | 0 | 0 | 1 | 0 |
| SPINHisto-1vm | 76 | 98 | 99 | 99 |
| sysbnch-2vm | -1 | 3 | -1 | -4 |
| SPINHisto-2vm | 82 | 94 | 96 | 99 |
| sysbnch-4vm | 0 | -2 | -8 | -14 |
| SPINHisto-4vm | 57 | 79 | 88 | 95 |
+----------------+-----------+-----------+-----------+-----------+

result for 64 vcpu guest
=========================
+----------------+-----------+-----------+-----------+-----------+
| Base-2k | 4k | 8k | 16k | 32k |
+----------------+-----------+-----------+-----------+-----------+
| kbench-1vm | 1 | -11 | -25 | 31 |
| SPINHisto-1vm | 3 | 10 | 47 | 99 |
| kbench-2vm | 15 | -9 | -66 | -15 |
| SPINHisto-2vm | 2 | 11 | 19 | 90 |
+----------------+-----------+-----------+-----------+-----------+
| ebizzy-1vm | 784 | 1097 | 978 | 930 |
| SPINHisto-1vm | 74 | 97 | 98 | 99 |
| ebizzy-2vm | 43 | 48 | 56 | 32 |
| SPINHisto-2vm | 58 | 93 | 97 | 98 |
+----------------+-----------+-----------+-----------+-----------+
| hbench-1vm | 8 | 55 | 56 | 62 |
| SPINHisto-1vm | 18 | 69 | 96 | 99 |
| hbench-2vm | 13 | -14 | -75 | -29 |
| SPINHisto-2vm | 57 | 74 | 80 | 97 |
+----------------+-----------+-----------+-----------+-----------+
| sysbnch-1vm | 9 | 11 | 15 | 10 |
| SPINHisto-1vm | 80 | 93 | 98 | 99 |
| sysbnch-2vm | 3 | 3 | 4 | 2 |
| SPINHisto-2vm | 72 | 89 | 94 | 97 |
+----------------+-----------+-----------+-----------+-----------+

From this, a value around the 4k-8k threshold seems to be the optimal
one. [ This is almost in line with the ple_window default. ]
(The lower the spin threshold, the smaller the fraction of spinlock waits
we cover, which results in more halt exits/wakeups.)

[ www.xen.org/files/xensummitboston08/LHP.pdf also has good graphical
detail on covering spinlock waits ]

Above the 8k threshold we see no more contention, but that would mean we
have wasted a lot of cpu time in busy-waiting.
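
(For reference, the PLE window actually in use on the host can be read
from the kvm_intel module parameters; the default is typically 4096:)

  # host side, with the kvm_intel module loaded
  cat /sys/module/kvm_intel/parameters/ple_gap
  cat /sys/module/kvm_intel/parameters/ple_window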

I will get a PLE machine again, and I'll continue experimenting with
further tuning of SPIN_THRESHOLD.



raghavendra.kt at linux

Jun 14, 2012, 5:21 AM

Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

On 05/30/2012 04:56 PM, Raghavendra K T wrote:
> On 05/16/2012 08:49 AM, Raghavendra K T wrote:
>> On 05/14/2012 12:15 AM, Raghavendra K T wrote:
>>> On 05/07/2012 08:22 PM, Avi Kivity wrote:
>>>
>>> I could not come with pv-flush results (also Nikunj had clarified that
>>> the result was on NOn PLE
>>>
>>>> I'd like to see those numbers, then.
>>>>
>>>> Ingo, please hold on the kvm-specific patches, meanwhile.
> [...]
>> To summarise,
>> with 32 vcpu guest with nr thread=32 we get around 27% improvement. In
>> very low/undercommitted systems we may see very small improvement or
>> small acceptable degradation ( which it deserves).
>>
>
> For large guests, current value SPIN_THRESHOLD, along with ple_window
> needed some of research/experiment.
>
> [Thanks to Jeremy/Nikunj for inputs and help in result analysis ]
>
> I started with debugfs spinlock/histograms, and ran experiments with 32,
> 64 vcpu guests for spin threshold of 2k, 4k, 8k, 16k, and 32k with
> 1vm/2vm/4vm for kernbench, sysbench, ebizzy, hackbench.
> [ spinlock/histogram gives logarithmic view of lockwait times ]
>
> machine: PLE machine with 32 cores.
>
> Here is the result summary.
> The summary includes 2 part,
> (1) %improvement w.r.t 2K spin threshold,
> (2) improvement w.r.t sum of histogram numbers in debugfs (that gives
> rough indication of contention/cpu time wasted)
>
> For e.g 98% for 4k threshold kbench 1 vm would imply, there is a 98%
> reduction in sigma(histogram values) compared to 2k case
>
> Result for 32 vcpu guest
> ==========================
> +----------------+-----------+-----------+-----------+-----------+
> | Base-2k | 4k | 8k | 16k | 32k |
> +----------------+-----------+-----------+-----------+-----------+
> | kbench-1vm | 44 | 50 | 46 | 41 |
> | SPINHisto-1vm | 98 | 99 | 99 | 99 |
> | kbench-2vm | 25 | 45 | 49 | 45 |
> | SPINHisto-2vm | 31 | 91 | 99 | 99 |
> | kbench-4vm | -13 | -27 | -2 | -4 |
> | SPINHisto-4vm | 29 | 66 | 95 | 99 |
> +----------------+-----------+-----------+-----------+-----------+
> | ebizzy-1vm | 954 | 942 | 913 | 915 |
> | SPINHisto-1vm | 96 | 99 | 99 | 99 |
> | ebizzy-2vm | 158 | 135 | 123 | 106 |
> | SPINHisto-2vm | 90 | 98 | 99 | 99 |
> | ebizzy-4vm | -13 | -28 | -33 | -37 |
> | SPINHisto-4vm | 83 | 98 | 99 | 99 |
> +----------------+-----------+-----------+-----------+-----------+
> | hbench-1vm | 48 | 56 | 52 | 64 |
> | SPINHisto-1vm | 92 | 95 | 99 | 99 |
> | hbench-2vm | 32 | 40 | 39 | 21 |
> | SPINHisto-2vm | 74 | 96 | 99 | 99 |
> | hbench-4vm | 27 | 15 | 3 | -57 |
> | SPINHisto-4vm | 68 | 88 | 94 | 97 |
> +----------------+-----------+-----------+-----------+-----------+
> | sysbnch-1vm | 0 | 0 | 1 | 0 |
> | SPINHisto-1vm | 76 | 98 | 99 | 99 |
> | sysbnch-2vm | -1 | 3 | -1 | -4 |
> | SPINHisto-2vm | 82 | 94 | 96 | 99 |
> | sysbnch-4vm | 0 | -2 | -8 | -14 |
> | SPINHisto-4vm | 57 | 79 | 88 | 95 |
> +----------------+-----------+-----------+-----------+-----------+
>
> result for 64 vcpu guest
> =========================
> +----------------+-----------+-----------+-----------+-----------+
> | Base-2k | 4k | 8k | 16k | 32k |
> +----------------+-----------+-----------+-----------+-----------+
> | kbench-1vm | 1 | -11 | -25 | 31 |
> | SPINHisto-1vm | 3 | 10 | 47 | 99 |
> | kbench-2vm | 15 | -9 | -66 | -15 |
> | SPINHisto-2vm | 2 | 11 | 19 | 90 |
> +----------------+-----------+-----------+-----------+-----------+
> | ebizzy-1vm | 784 | 1097 | 978 | 930 |
> | SPINHisto-1vm | 74 | 97 | 98 | 99 |
> | ebizzy-2vm | 43 | 48 | 56 | 32 |
> | SPINHisto-2vm | 58 | 93 | 97 | 98 |
> +----------------+-----------+-----------+-----------+-----------+
> | hbench-1vm | 8 | 55 | 56 | 62 |
> | SPINHisto-1vm | 18 | 69 | 96 | 99 |
> | hbench-2vm | 13 | -14 | -75 | -29 |
> | SPINHisto-2vm | 57 | 74 | 80 | 97 |
> +----------------+-----------+-----------+-----------+-----------+
> | sysbnch-1vm | 9 | 11 | 15 | 10 |
> | SPINHisto-1vm | 80 | 93 | 98 | 99 |
> | sysbnch-2vm | 3 | 3 | 4 | 2 |
> | SPINHisto-2vm | 72 | 89 | 94 | 97 |
> +----------------+-----------+-----------+-----------+-----------+
>
> From this, value around 4k-8k threshold seem to be optimal one. [ This
> is amost inline with ple_window default ]
> (lower the spin threshold, we would cover lesser % of spinlocks, that
> would result in more halt_exit/wakeups.
>
> [. www.xen.org/files/xensummitboston08/LHP.pdf also has good graphical
> detail on covering spinlock waits ]
>
> After 8k threshold, we see no more contention but that would mean we
> have wasted lot of cpu time in busy waits.
>
> Will get a PLE machine again, and 'll continue experimenting with
> further tuning of SPIN_THRESHOLD.

Sorry for the delayed response; I was doing a lot of analysis and
experiments.

I continued my experiments with the spin threshold. Unfortunately I could
not settle on which of the 4k/8k thresholds is better, since that depends
on the load and the type of workload.

Here are the results for a 32 vcpu guest for sysbench and kernbench, for
4 vms with 8GB RAM each on the same PLE machine, with:

1x: benchmark running on 1 guest
2x: same benchmark running on 2 guests, and so on

The 1x run is averaged over 8*3 runs,
the 2x run over 4*3 runs,
the 3x run over 6*3,
and the 4x run over 4*3.


kernbench
=========
total_job = 2 * number of vcpus
kernbench -f -H -M -o $total_job
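
(Inside the guest this amounts to something like the following; using
nproc to pick up the vcpu count is just one way of doing it:)

  #!/bin/bash
  # Scale the kernbench job count to twice the guest's (v)cpu count.
  total_job=$(( 2 * $(nproc) ))
  kernbench -f -H -M -o "$total_job"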


+------------+------------+-----------+---------------+---------+
| base | pv_4k | %impr | pv_8k | %impr |
+------------+------------+-----------+---------------+---------+
| 49.98 | 49.147475 | 1.69393 | 50.575567 | -1.17758|
| 106.0051 | 96.668325 | 9.65857 | 91.62165 | 15.6987 |
| 189.82067 | 181.839 | 4.38942 | 188.8595 | 0.508934|
+------------+------------+-----------+---------------+---------+

sysbench
===========
Ran with num_threads = 2 * number of vcpus

sysbench --num-threads=$num_thread --max-requests=100000 --test=oltp
--oltp-table-size=500000 --db-driver=pgsql --oltp-read-only run

32 vcpu
-------

+------------+------------+-----------+---------------+---------+
| base | pv_4k | %impr | pv_8k | %impr |
+------------+------------+-----------+---------------+---------+
| 16.4109 | 12.109988 | 35.5154 | 12.658113 | 29.6473 |
| 14.232712 | 13.640387 | 4.34244 | 14.16485 | 0.479087|
| 23.49685 | 23.196375 | 1.29535 | 19.024871 | 23.506 |
+------------+------------+-----------+---------------+---------+

and the observations are:

1) The 8k threshold does better for medium overcommit, but there PLE has
more control than the pv spinlock.

2) 4k does well for the no-overcommit and high-overcommit cases, and it
also helps more than 8k on non-PLE machines. In the medium overcommit
cases we see less performance benefit due to the increase in halt exits.

I'll continue my analysis.
I have also come up with a directed-yield patch where we do a directed
yield in the vcpu block path instead of a blind schedule. I will do some
more experiments with that and post it as an RFC.

Let me know if you have any comments/suggestions.

