
Mailing List Archive: Linux: Kernel

[PATCH RFC V6 0/11] Paravirtualized ticketlocks

 

 



raghavendra.kt at linux

Mar 21, 2012, 3:20 AM

Post #1 of 36
[PATCH RFC V6 0/11] Paravirtualized ticketlocks

From: Jeremy Fitzhardinge <jeremy.fitzhardinge [at] citrix>

Changes since last posting: (Raghavendra K T)
[.
- Rebased to linux-3.3-rc6.
- used function+enum in place of macro (better type checking)
- use cmpxchg while resetting zero status for possible race
[suggested by Dave Hansen for KVM patches ]
]

This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism.

Ticket locks have an inherent problem in a virtualized case, because
the vCPUs are scheduled rather than running concurrently (ignoring
gang scheduled vCPUs). This can result in catastrophic performance
collapses when the vCPU scheduler doesn't schedule the correct "next"
vCPU, and ends up scheduling a vCPU which burns its entire timeslice
spinning. (Note that this is not the same problem as lock-holder
preemption, which this series also addresses; that's also a problem,
but not catastrophic).

(See Thomas Friebel's talk "Prevent Guests from Spinning Around"
http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)

Currently we deal with this by having PV spinlocks, which adds a layer
of indirection in front of all the spinlock functions, and defining a
completely new implementation for Xen (and for other pvops users, but
there are none at present).

PV ticketlocks keeps the existing ticketlock implementation
(fastpath) as-is, but adds a couple of pvops for the slow paths:

- If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
iterations, then call out to the __ticket_lock_spinning() pvop,
which allows a backend to block the vCPU rather than spinning. This
pvop can set the lock into "slowpath state".

- When releasing a lock, if it is in "slowpath state", then call
__ticket_unlock_kick() to kick the next vCPU in line awake. If the
lock is no longer in contention, it also clears the slowpath flag.
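
As a rough sketch (simplified; not the exact declarations from the
series), the two slow-path hooks amount to:

	struct pv_lock_ops {
		/* called after SPIN_THRESHOLD spin iterations; may block the vCPU */
		void (*lock_spinning)(struct arch_spinlock *lock, __ticket_t ticket);
		/* called at unlock time when the slowpath flag is set */
		void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
	};

In the native case both hooks are patched out, so bare metal does not
pay for them.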

The "slowpath state" is stored in the LSB of the within the lock tail
ticket. This has the effect of reducing the max number of CPUs by
half (so, a "small ticket" can deal with 128 CPUs, and "large ticket"
32768).
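
To make the layout concrete, here is a sketch (simplified from the
series; TICKET_LOCK_INC = 2 comes from the "increment by 2" patch):

	#define TICKET_SLOWPATH_FLAG	((__ticket_t)1)	/* LSB of tail */
	#define TICKET_LOCK_INC		((__ticket_t)2)	/* tickets advance by 2 */

	typedef struct arch_spinlock {
		union {
			__ticketpair_t head_tail;
			struct __raw_tickets {
				__ticket_t head, tail;
			} tickets;
		};
	} arch_spinlock_t;

With an 8-bit __ticket_t, only 7 bits of the tail number tickets, hence
the halved CPU limits above.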

This series provides a Xen implementation, but it should be
straightforward to add a KVM implementation as well.

Overall, it results in a large reduction in code, it makes the native
and virtualized cases closer, and it removes a layer of indirection
around all the spinlock functions.

The fast path (taking an uncontended lock which isn't in "slowpath"
state) is optimal, identical to the non-paravirtualized case.

The inner part of the ticket lock code becomes:

	inc = xadd(&lock->tickets, inc);
	inc.tail &= ~TICKET_SLOWPATH_FLAG;

	if (likely(inc.head == inc.tail))
		goto out;

	for (;;) {
		unsigned count = SPIN_THRESHOLD;

		do {
			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
				goto out;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}
out:	barrier();

which results in:
	push   %rbp
	mov    %rsp,%rbp

	mov    $0x200,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f			# Slowpath if lock in contention

	pop    %rbp
	retq

	### SLOWPATH START
1:	and    $-2,%edx
	movzbl %dl,%esi

2:	mov    $0x800,%eax
	jmp    4f

3:	pause
	sub    $0x1,%eax
	je     5f

4:	movzbl (%rdi),%ecx
	cmp    %cl,%dl
	jne    3b

	pop    %rbp
	retq

5:	callq  *__ticket_lock_spinning
	jmp    2b
	### SLOWPATH END

With CONFIG_PARAVIRT_SPINLOCKS=n, the code changes slightly: the
fastpath case is straight through (taking the lock without contention),
and the spin loop is out of line:

	push   %rbp
	mov    %rsp,%rbp

	mov    $0x100,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f

	pop    %rbp
	retq

	### SLOWPATH START
1:	pause
	movzbl (%rdi),%eax
	cmp    %dl,%al
	jne    1b

	pop    %rbp
	retq
	### SLOWPATH END

The unlock code is complicated by the need to both add to the lock's
"head" and fetch the slowpath flag from "tail". This version of the
patch uses a locked add to do this, followed by a test to see if the
slowflag is set. The lock prefix acts as a full memory barrier, so we
can be sure that other CPUs will have seen the unlock before we read
the flag (without the barrier the read could be fetched from the
store queue before it hits memory, which could result in a deadlock).

Since this is all unnecessary complication if you're not using PV
ticket locks, the patch also uses the jump-label machinery to keep the
standard "add"-based unlock in the non-PV case:

	if (TICKET_SLOWPATH_FLAG &&
	    unlikely(static_branch(&paravirt_ticketlocks_enabled))) {
		arch_spinlock_t prev;

		prev = *lock;
		add_smp(&lock->tickets.head, TICKET_LOCK_INC);

		/* add_smp() is a full mb() */

		if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
			__ticket_unlock_slowpath(lock, prev);
	} else
		__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);

which generates:
	push   %rbp
	mov    %rsp,%rbp

	nop5			# replaced by 5-byte jmp 2f when PV enabled

	# non-PV unlock
	addb   $0x2,(%rdi)

1:	pop    %rbp
	retq

	### PV unlock ###
2:	movzwl (%rdi),%esi	# Fetch prev

	lock addb $0x2,(%rdi)	# Do unlock

	testb  $0x1,0x1(%rdi)	# Test flag
	je     1b		# Finished if not set

	### Slow path ###
	add    $2,%sil		# Add "head" in old lock state
	mov    %esi,%edx
	and    $0xfe,%dh	# clear slowflag for comparison
	movzbl %dh,%eax
	cmp    %dl,%al		# If head == tail (uncontended)
	je     4f		# clear slowpath flag

	# Kick next CPU waiting for lock
3:	movzbl %sil,%esi
	callq  *pv_lock_ops.kick

	pop    %rbp
	retq

	# Lock no longer contended - clear slowflag
4:	mov    %esi,%eax
	lock cmpxchg %dx,(%rdi)	# cmpxchg to clear flag
	cmp    %si,%ax
	jne    3b		# If clear failed, then kick

	pop    %rbp
	retq

So when not using PV ticketlocks, the unlock sequence just has a
5-byte nop added to it, and the PV case is reasonably straightforward
aside from requiring a "lock add".

Note that the patch series needs the jump-label split patch posted at
https://lkml.org/lkml/2012/2/21/167 to avoid a cyclic header
dependency (for the jump-label machinery).

TODO: remove CONFIG_PARAVIRT_SPINLOCK once everybody is convinced.

Results:
=======
machine:
IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU, 8 cores, 64GB RAM

OS: enterprise linux
Gcc Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.5 20110214

base kernel = 3.3-rc6 (cloned Sunday 4th March)
unit = tps (higher is better)
benchmark = pgbench based on pgsql 9.2-dev:
http://www.postgresql.org/ftp/snapshot/dev/ (link given by Attilo)

tool used to collect benchmark: git://git.postgresql.org/git/pgbench-tools.git
config is the same as the tool's default except MAX_WORKER=8

Average taken over 10 iterations, analysed with the ministat tool.

BASE (CONFIG_PARAVIRT_SPINLOCK = n)
==========================================
------ scale=1 (32MB shared buf) ----------
Client N Min Max Median Avg Stddev
1 x 10 3718.4108 4182.7842 3855.1089 3914.535 196.91943
2 x 10 7462.1997 7921.4638 7855.1965 7808.1603 135.37891
4 x 10 21682.402 23445.941 22151.922 22224.329 507.32299
8 x 10 43309.638 48103.494 45332.24 45593.135 1496.3735
16 x 10 108624.95 109227.45 108997.96 108987.84 210.15136
32 x 10 112582.1 113170.42 112776.92 112830.09 202.70556
64 x 10 100576.34 104011.92 103299.89 103034.24 928.24581
----------------
------ scale=500 (16GB shared buf) ----------
Client N Min Max Median Avg Stddev
1 x 10 3451.9407 3948.3127 3512.2215 3610.6086 201.58491
2 x 10 7311.1769 7383.2552 7341.0847 7342.2349 21.231902
4 x 10 19582.548 26909.72 24778.282 23893.162 2587.6103
8 x 10 52292.765 54561.472 53171.286 53216.256 733.16626
16 x 10 89643.138 90353.598 89970.878 90018.505 213.73589
32 x 10 81010.402 81556.02 81256.217 81247.223 174.31678
64 x 10 83855.565 85048.602 84087.693 84201.86 352.25182
----------------

BASE + jumplabel_split + jeremy patch (CONFIG_PARAVIRT_SPINLOCK = n)
=====================================================
------ scale=1 (32MB shared buf) ----------
Client N Min Max Median Avg Stddev
1 x 10 3669.2156 4102.5109 3732.9526 3784.4072 129.14134
2 x 10 7423.984 7797.5046 7446.8946 7500.2076 119.85178
4 x 10 21332.859 26327.619 24175.239 24084.731 1841.8335
8 x 10 43149.937 49515.406 45779.204 45838.782 2191.6348
16 x 10 109512.27 110407.82 109977.15 110019.72 283.41371
32 x 10 112653.3 113156.22 113023.24 112973.56 151.54906
64 x 10 102816.08 104514.48 103843.95 103658.17 515.10115
----------------
------ scale=500 (16GB shared buf) ----------
Client N Min Max Median Avg Stddev
1 x 10 3501.3548 3985.3114 3609.0236 3705.6665 224.3719
2 x 10 7275.246 9026.7466 7447.4013 7581.6494 512.75417
4 x 10 19506.151 22661.801 20843.142 21154.886 1329.5591
8 x 10 53150.178 55594.073 54132.383 54227.117 728.42913
16 x 10 84281.93 91234.692 90917.411 90249.053 2108.903
32 x 10 80860.018 81500.369 81212.514 81201.361 205.66759
64 x 10 84090.033 85423.79 84505.041 84588.913 436.69012
----------------

BASE + jumplabel_split+ jeremy patch (CONFIG_PARAVIRT_SPINLOCK = y)
=====================================================
------ scale=1 (32MB shared buf) ----------
Client N Min Max Median Avg Stddev
1 x 10 3749.8427 4149.0224 4120.6696 3982.6575 197.32902
2 x 10 7786.4802 8149.0902 7956.6706 7970.5441 94.42967
4 x 10 22053.383 27424.414 23514.166 23698.775 1492.792
8 x 10 44585.203 48082.115 46123.156 46135.687 1232.9399
16 x 10 108290.15 109655.13 108924 108968.59 476.48336
32 x 10 112359.02 112966.97 112570.06 112611.48 180.51304
64 x 10 103020.85 104042.71 103457.83 103496.84 291.19165
----------------
------ scale=500 (16GB shared buf) ----------
Client N Min Max Median Avg Stddev
1 x 10 3462.6179 3898.5392 3871.6231 3738.0069 196.86077
2 x 10 7358.8148 7396.1029 7387.8169 7382.229 13.117357
4 x 10 19734.357 27799.895 21840.41 22964.202 3070.8067
8 x 10 52412.64 55214.305 53481.185 53552.261 878.21383
16 x 10 89862.081 90375.328 90161.886 90139.154 202.49282
32 x 10 80140.853 80898.452 80683.819 80671.361 227.13277
64 x 10 83402.864 84868.355 84311.472 84281.567 428.6501
----------------

Summary of Avg
==============

Client  BASE          Base+patch             Base+patch
        PARAVIRT=n    PARAVIRT=n (%improve)  PARAVIRT=y (%improve)
------ scale=1 (32MB shared buf) ----------
1 3914.535 3784.4072 (-3.32422) 3982.6575 (+1.74025)
2 7808.1603 7500.2076 (-3.94399) 7970.5441 (+2.07967)
4 22224.329 24084.731 (+8.37102) 23698.775 (+6.63438)
8 45593.135 45838.782 (+0.538781) 46135.687 (+1.18999)
16 108987.84 110019.72 (+0.946785) 108968.59 (-0.0176625)
32 112830.09 112973.56 (+0.127156) 112611.48 (-0.193752)
64 103034.24 103658.17 (+0.605556) 103496.84 (+0.448977)

------ scale=500 (~16GB shared buf) ----------
1 3610.6086 3705.6665 (+2.63274) 3738.0069 (+3.52844)
2 7342.2349 7581.6494 (+3.26079) 7382.229 (+0.544713)
4 23893.162 21154.886 (-11.4605) 22964.202 (-3.88797)
8 53216.256 54227.117 (+1.89953) 53552.261 (+0.631395)
16 90018.505 90249.053 (+0.256112) 90139.154 (+0.134027)
32 81247.223 81201.361 (-0.0564475) 80671.361 (-0.708777)
64 84201.86 84588.913 (+0.459673) 84281.567 (+0.0946618)
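
(The %improve columns are (patched_avg - BASE_avg) / BASE_avg * 100;
e.g. for 1 client at scale=1: (3784.4072 - 3914.535) / 3914.535 * 100
= -3.32422.)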

Thoughts? Comments? Suggestions?

Jeremy Fitzhardinge (10):
x86/spinlock: replace pv spinlocks with pv ticketlocks
x86/ticketlock: don't inline _spin_unlock when using paravirt
spinlocks
x86/ticketlock: collapse a layer of functions
xen: defer spinlock setup until boot CPU setup
xen/pvticketlock: Xen implementation for PV ticket locks
xen/pvticketlocks: add xen_nopvspin parameter to disable xen pv
ticketlocks
x86/pvticketlock: use callee-save for lock_spinning
x86/pvticketlock: when paravirtualizing ticket locks, increment by 2
x86/ticketlock: add slowpath logic
xen/pvticketlock: allow interrupts to be enabled while blocking

Stefano Stabellini (1):
xen: enable PV ticketlocks on HVM Xen
---
arch/x86/Kconfig | 3 +
arch/x86/include/asm/paravirt.h | 32 +---
arch/x86/include/asm/paravirt_types.h | 10 +-
arch/x86/include/asm/spinlock.h | 128 ++++++++----
arch/x86/include/asm/spinlock_types.h | 16 +-
arch/x86/kernel/paravirt-spinlocks.c | 18 +--
arch/x86/xen/smp.c | 3 +-
arch/x86/xen/spinlock.c | 383 +++++++++++----------------------
kernel/Kconfig.locks | 2 +-
9 files changed, 245 insertions(+), 350 deletions(-)



avi at redhat

Mar 26, 2012, 7:25 AM

Post #2 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/21/2012 12:20 PM, Raghavendra K T wrote:
> From: Jeremy Fitzhardinge <jeremy.fitzhardinge [at] citrix>
>
> Changes since last posting: (Raghavendra K T)
> [.
> - Rebased to linux-3.3-rc6.
> - used function+enum in place of macro (better type checking)
> - use cmpxchg while resetting zero status for possible race
> [suggested by Dave Hansen for KVM patches ]
> ]
>
> This series replaces the existing paravirtualized spinlock mechanism
> with a paravirtualized ticketlock mechanism.
>
> Ticket locks have an inherent problem in a virtualized case, because
> the vCPUs are scheduled rather than running concurrently (ignoring
> gang scheduled vCPUs). This can result in catastrophic performance
> collapses when the vCPU scheduler doesn't schedule the correct "next"
> vCPU, and ends up scheduling a vCPU which burns its entire timeslice
> spinning. (Note that this is not the same problem as lock-holder
> preemption, which this series also addresses; that's also a problem,
> but not catastrophic).
>
> (See Thomas Friebel's talk "Prevent Guests from Spinning Around"
> http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)
>
> Currently we deal with this by having PV spinlocks, which adds a layer
> of indirection in front of all the spinlock functions, and defining a
> completely new implementation for Xen (and for other pvops users, but
> there are none at present).
>
> PV ticketlocks keeps the existing ticketlock implemenentation
> (fastpath) as-is, but adds a couple of pvops for the slow paths:
>
> - If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
> iterations, then call out to the __ticket_lock_spinning() pvop,
> which allows a backend to block the vCPU rather than spinning. This
> pvop can set the lock into "slowpath state".
>
> - When releasing a lock, if it is in "slowpath state", the call
> __ticket_unlock_kick() to kick the next vCPU in line awake. If the
> lock is no longer in contention, it also clears the slowpath flag.
>
> The "slowpath state" is stored in the LSB of the within the lock tail
> ticket. This has the effect of reducing the max number of CPUs by
> half (so, a "small ticket" can deal with 128 CPUs, and "large ticket"
> 32768).
>
> This series provides a Xen implementation, but it should be
> straightforward to add a KVM implementation as well.
>

Looks like a good baseline on which to build the KVM implementation. We
might need some handshake to prevent interference on the host side with
the PLE code.

--
error compiling committee.c: too many arguments to function



raghavendra.kt at linux

Mar 27, 2012, 12:37 AM

Post #3 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/26/2012 07:55 PM, Avi Kivity wrote:
> On 03/21/2012 12:20 PM, Raghavendra K T wrote:
>> From: Jeremy Fitzhardinge <jeremy.fitzhardinge [at] citrix>
[...]
>>
>> This series provides a Xen implementation, but it should be
>> straightforward to add a KVM implementation as well.
>>
>
> Looks like a good baseline on which to build the KVM implementation. We
> might need some handshake to prevent interference on the host side with
> the PLE code.
>

Avi, thanks for reviewing. True, it is sort of equivalent to PLE on a
non-PLE machine.

Ingo, Peter,
Can you please let us know if this series can be considered for the
next merge window? Or do you still have some concerns that need
addressing?

I shall rebase the patches to 3.3 and resend. (The main difference
would be UNINLINE_SPIN_UNLOCK and jump-label changes to use
static_key_true/false() instead of static_branch().)
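
For reference, a minimal sketch of that jump-label change, reusing the
key from the unlock snippet in the cover letter (this is not taken from
the actual rebase):

	struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;

	static inline bool pv_unlock_slowpath_enabled(void)
	{
		/* 3.3-era form: unlikely(static_branch(&paravirt_ticketlocks_enabled)) */
		return static_key_false(&paravirt_ticketlocks_enabled);
	}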

Thanks,
Raghu



alan.meadows at gmail

Mar 28, 2012, 9:37 AM

Post #4 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

I am happy to see this issue receiving some attention and second the
wish to see these patches be considered for further review and
inclusion in an upcoming release.

Overcommit is not as common in enterprise and single-tenant
virtualized environments as it is in multi-tenant environments, and
frankly we have been suffering.

We have been running an early copy of these patches in our lab and in
a small production node sample set both on 3.2.0-rc4 and 3.3.0-rc6 for
over two weeks now with great success. With the heavy level of
vCPU:pCPU overcommit required for our situation, the patches are
increasing performance by an _order of magnitude_ on our E5645 and
E5620 systems.

Alan Meadows

On Tue, Mar 27, 2012 at 12:37 AM, Raghavendra K T
<raghavendra.kt [at] linux> wrote:
>
> On 03/26/2012 07:55 PM, Avi Kivity wrote:
>>
>> On 03/21/2012 12:20 PM, Raghavendra K T wrote:
>>>
>>> From: Jeremy Fitzhardinge <jeremy.fitzhardinge [at] citrix>
>
> [...]
>
>>>
>>> This series provides a Xen implementation, but it should be
>>> straightforward to add a KVM implementation as well.
>>>
>>
>> Looks like a good baseline on which to build the KVM implementation.  We
>> might need some handshake to prevent interference on the host side with
>> the PLE code.
>>
>
> Avi, Thanks for reviewing. True, it is sort of equivalent to PLE on non
> PLE machine.
>
> Ingo, Peter,
> Can you please let us know if this series can be considered for next merge
> window?
> OR do you still have some concerns that needs addressing.
>
> I shall rebase patches to 3.3 and resend. (main difference would be
> UNINLINE_SPIN_UNLOCK and jump label changes to use static_key_true/false()
> usage instead of static_branch.)
>
> Thanks,
> Raghu
>
>


raghavendra.kt at linux

Mar 28, 2012, 11:21 AM

Post #5 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/28/2012 09:39 PM, Alan Meadows wrote:
> I am happy to see this issue receiving some attention and second the
> wish to see these patches be considered for further review and inclusion
> in an upcoming release.
>
> Overcommit is not as common in enterprise and single-tenant virtualized
> environments as it is in multi-tenant environments, and frankly we have
> been suffering.
>
> We have been running an early copy of these patches in our lab and in a
> small production node sample set both on 3.2.0-rc4 and 3.3.0-rc6 for over
> two weeks now with great success. With the heavy level of vCPU:pCPU
> overcommit required for our situation, the patches are increasing
> performance by an _order of magnitude_ on our E5645 and E5620 systems.
>

Thanks Alan for the support. I feel the timing of this patch was a
little bad though (merge window).

>
>
> Looks like a good baseline on which to build the KVM
> implementation. We
> might need some handshake to prevent interference on the host
> side with
> the PLE code.
>

I think I still missed some point in Avi's comment. I agree that PLE
may be interfering with the above patches (resulting in smaller
performance advantages), but we have not seen performance degradation
with the patches in earlier benchmarks. [Theoretically the patch has a
slight advantage over PLE in that it at least knows who should run next.]

So the TODOs on my list for this are:
1. More analysis of performance on a PLE machine.
2. Seeing how to implement a handshake to increase performance (if the
PLE + patch combination has a slight negative effect).

Sorry that I could not do more analysis on PLE (as promised last time)
because of machine availability.

I'll do some work on this and come back. But in the meantime, I do not
see it as blocking for the next merge window.

>
> Avi, Thanks for reviewing. True, it is sort of equivalent to PLE on
> non PLE machine.
>
> Ingo, Peter,
> Can you please let us know if this series can be considered for next
> merge window?
> OR do you still have some concerns that needs addressing.
>
> I shall rebase patches to 3.3 and resend. (main difference would be
> UNINLINE_SPIN_UNLOCK and jump label changes to use
> static_key_true/false() usage instead of static_branch.)



avi at redhat

Mar 29, 2012, 2:58 AM

Post #6 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/28/2012 08:21 PM, Raghavendra K T wrote:
>
>>
>>
>> Looks like a good baseline on which to build the KVM
>> implementation. We
>> might need some handshake to prevent interference on the host
>> side with
>> the PLE code.
>>
>
> I think I still missed some point in Avi's comment. I agree that PLE
> may be interfering with above patches (resulting in less performance
> advantages). but we have not seen performance degradation with the
> patches in earlier benchmarks. [. theoretically since patch has very
> slight advantage over PLE that atleast it knows who should run next ].

The advantage grows with the vcpu counts and overcommit ratio. If you
have N vcpus and M:1 overcommit, PLE has to guess from N/M queued vcpus
while your patch knows who to wake up.

>
> So TODO in my list on this is:
> 1. More analysis of performance on PLE mc.
> 2. Seeing how to implement handshake to increase performance (if PLE +
> patch combination have slight negative effect).

I can think of two options:
- from the PLE handler, don't wake up a vcpu that is sleeping because it
is waiting for a kick
- look at other sources of pause loops (I guess smp_call_function() is
the significant source) and adjust them to use the same mechanism, and
ask the host to disable PLE exiting.

This can be done incrementally later.

>
> Sorry that, I could not do more analysis on PLE (as promised last time)
> because of machine availability.
>
> I 'll do some work on this and comeback. But in the meantime, I do not
> see it as blocking for next merge window.

--
error compiling committee.c: too many arguments to function



raghavendra.kt at linux

Mar 29, 2012, 11:03 AM

Post #7 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/29/2012 03:28 PM, Avi Kivity wrote:
> On 03/28/2012 08:21 PM, Raghavendra K T wrote:
>>
>>>
>>>
>>> Looks like a good baseline on which to build the KVM
>>> implementation. We
>>> might need some handshake to prevent interference on the host
>>> side with
>>> the PLE code.
>>>
>>
>> I think I still missed some point in Avi's comment. I agree that PLE
>> may be interfering with above patches (resulting in less performance
>> advantages). but we have not seen performance degradation with the
>> patches in earlier benchmarks. [. theoretically since patch has very
>> slight advantage over PLE that atleast it knows who should run next ].
>
> The advantage grows with the vcpu counts and overcommit ratio. If you
> have N vcpus and M:1 overcommit, PLE has to guess from N/M queued vcpus
> while your patch knows who to wake up.
>

Yes. I agree.

>>
>> So TODO in my list on this is:
>> 1. More analysis of performance on PLE mc.
>> 2. Seeing how to implement handshake to increase performance (if PLE +
>> patch combination have slight negative effect).
>
> I can think of two options:

I really like the ideas below. Thanks for that!

> - from the PLE handler, don't wake up a vcpu that is sleeping because it
> is waiting for a kick

How about adding another pass at the beginning of kvm_vcpu_on_spin()
to check if any vcpu is already kicked? This would almost result in
yield_to(kicked_vcpu). IMO this is also worth trying.

Will try the above ideas soon.

> - look at other sources of pause loops (I guess smp_call_function() is
> the significant source) and adjust them to use the same mechanism, and
> ask the host to disable PLE exiting.
>
> This can be done incrementally later.
>

Yes, this can wait a bit.



raghavendra.kt at linux

Mar 30, 2012, 3:07 AM

Post #8 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/29/2012 11:33 PM, Raghavendra K T wrote:
> On 03/29/2012 03:28 PM, Avi Kivity wrote:
>> On 03/28/2012 08:21 PM, Raghavendra K T wrote:

> I really like below ideas. Thanks for that!.
>
>> - from the PLE handler, don't wake up a vcpu that is sleeping because it
>> is waiting for a kick
>
> How about, adding another pass in the beginning of kvm_vcpu_on_spin()
> to check if any vcpu is already kicked. This would almost result in
> yield_to(kicked_vcpu). IMO this is also worth trying.
>
> will try above ideas soon.
>

I have a patch something like the below in mind to try:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d3b98b1..5127668 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * else and called schedule in __vcpu_run. Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
+	 * Priority is given to vcpu that are unhalted.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 3 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			struct task_struct *task = NULL;
 			struct pid *pid;
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && !vcpu->pv_unhalted)
+				continue;
+			else if (pass == 1 && i < last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (pass == 2 && i > last_boosted_vcpu)
 				break;
 			if (vcpu == me)
 				continue;



hpa at zytor

Mar 30, 2012, 1:26 PM

Post #9 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

What is the current status of this patchset? I haven't looked at it too
closely because I have been focused on 3.4 up until now...

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.



tglx at linutronix

Mar 30, 2012, 3:07 PM

Post #10 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On Fri, 30 Mar 2012, H. Peter Anvin wrote:

> What is the current status of this patchset? I haven't looked at it too
> closely because I have been focused on 3.4 up until now...

The real question is whether these heuristics are the correct approach
or not.

If I look at it from the non-virtualized kernel side then this is ass
backwards. We already know that we are holding a spinlock which might
cause other (v)cpus to go into eternal spin. The non-virtualized
kernel solves this by disabling preemption and therefore getting out of
the critical section as fast as possible.
The virtualization problem reminds me a lot of the problem which RT
kernels are observing, where non-raw spinlocks are turned into
"sleeping spinlocks" and can therefore cause throughput issues for
non-RT workloads.

Though the virtualized situation is even worse. Any preempted guest
section which holds a spinlock is prone to cause unbound delays.

The paravirt ticketlock solution can only mitigate the problem, not
solve it. With massive overcommit there is always a way to trigger
worst case scenarios unless you are educating the scheduler to cope
with that.

So if we need to fiddle with the scheduler, and frankly that's the only
way to get a real gain (the numbers achieved by these patches are not
that impressive), then the question arises whether we should turn the
whole thing around.

I know that Peter is going to go berserk on me, but if we are running
a paravirt guest then it's simple to provide a mechanism which allows
the host (aka hypervisor) to check that in the guest just by looking
at some global state.

So if a guest exits due to an external event it's easy to inspect the
state of that guest and avoid scheduling it away when it was interrupted
in a spinlock-held section. That guest/host shared state needs to be
modified to tell the guest to invoke an exit when the last nested
lock has been released.

Of course this needs to be time bound, so a rogue guest cannot
monopolize the cpu forever, but that's the least worrying problem,
simply because a guest which does not get out of a spinlocked
region within a certain amount of time is borked and eligible for
killing anyway.

Thoughts ?

Thanks,

tglx




andi at firstfloor

Mar 30, 2012, 3:18 PM

Post #11 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be

On a large system under high contention sleeping can perform surprisingly
well. pthread mutexes have a tendency to beat kernel spinlocks there.
So avoiding sleeping locks completely (that is essentially what pv
locks are) is not necessarily that good.

Your proposal is probably only a good idea on low-contention
and relatively small systems.

-Andi
--
ak [at] linux -- Speaking for myself only.


tglx at linutronix

Mar 30, 2012, 4:04 PM

Post #12 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On Sat, 31 Mar 2012, Andi Kleen wrote:

> > So if a guest exits due to an external event it's easy to inspect the
> > state of that guest and avoid to schedule away when it was interrupted
> > in a spinlock held section. That guest/host shared state needs to be
>
> On a large system under high contention sleeping can perform surprisingly
> well. pthread mutexes have a tendency to beat kernel spinlocks there.
> So avoiding sleeping locks completely (that is what pv locks are
> essentially) is not necessarily that good.

Care to back that up with numbers and proper trace evidence instead of
handwaving?

I've stared at RT traces and throughput problems on _LARGE_ machines
long enough to know what I'm talking about and I can provide evidence
in a split second.

> Your proposal is probably only a good idea on low contention
> and relatively small systems.

Sigh, you have really no fcking clue what you are talking about.

On RT we observed scalability problems way before hardware was
available to expose them. So what's your point?

Put up or shut up, really!

tglx


andi at firstfloor

Mar 30, 2012, 5:08 PM

Post #13 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On Sat, Mar 31, 2012 at 01:04:41AM +0200, Thomas "Kubys" Gleixner wrote:
> On Sat, 31 Mar 2012, Andi Kleen wrote:
>
> > > So if a guest exits due to an external event it's easy to inspect the
> > > state of that guest and avoid to schedule away when it was interrupted
> > > in a spinlock held section. That guest/host shared state needs to be
> >
> > On a large system under high contention sleeping can perform surprisingly
> > well. pthread mutexes have a tendency to beat kernel spinlocks there.
> > So avoiding sleeping locks completely (that is what pv locks are
> > essentially) is not necessarily that good.
>
> Care to back that up with numbers and proper trace evidence instead of
> handwaving?

E.g. my Plumbers presentation on lock and mm scalability from last year
has some graphs that show this very clearly, plus some additional data
on the mutexes. This compares against the glibc futex locks, which
perform much better than the kernel mutex locks on larger systems under
higher contention.

Given your tone I will not supply an URL. I'm sure you can find it if you
need it.

-Andi

--
ak [at] linux -- Speaking for myself only.


raghavendra.kt at linux

Mar 30, 2012, 5:51 PM

Post #14 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/31/2012 01:56 AM, H. Peter Anvin wrote:
> What is the current status of this patchset? I haven't looked at it too
> closely because I have been focused on 3.4 up until now...
>

Thanks Peter,

Currently targeting the patchset for the next merge window. IMO these
patches are in good shape now. I'll rebase the patches; they need the
"jumplabel split patch" (from Andrew Jones) as a dependency, so I
would fold that patch into the series as well (got an ok from him for
that) and send them ASAP.

- Raghu







vatsa at linux

Mar 30, 2012, 9:07 PM

Post #15 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

* Thomas Gleixner <tglx [at] linutronix> [2012-03-31 00:07:58]:

> I know that Peter is going to go berserk on me, but if we are running
> a paravirt guest then it's simple to provide a mechanism which allows
> the host (aka hypervisor) to check that in the guest just by looking
> at some global state.
>
> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be
> modified to indicate the guest to invoke an exit when the last nested
> lock has been released.

I had attempted something like that long back:

http://lkml.org/lkml/2010/6/3/4

The issue is with ticketlocks though. VCPUs could go into a spin w/o
a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
and releases the lock. VCPU1 is next eligible to take the lock. If
that is not scheduled early enough by the host, then the remaining vcpus
would keep spinning (even though the lock is technically not held by
anybody) w/o making forward progress.

In that situation, what we really need is for the guest to hint to the
host scheduler to schedule VCPU1 early (via yield_to or something similar).
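
[A purely hypothetical sketch of such a hint; nothing like it is in the
posted series. It assumes the guest could map the next ticket to a vcpu
id (which the current patches do not track) and a made-up hypercall
number KVM_HC_SCHED_NEXT:

	/* guest side: ask the host to run the vcpu holding the next ticket */
	static inline void hint_next_ticket_holder(int next_vcpu)
	{
		if (next_vcpu >= 0)
			kvm_hypercall1(KVM_HC_SCHED_NEXT, next_vcpu);
	}

The host side would then do a directed yield_to() on that vcpu's task.]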

The current pv-spinlock patches however do not track which vcpu is
spinning at which head of the ticketlock. I suppose we can consider
that optimization in future and see how much benefit it provides (over
plain yield/sleep the way it's done now).

Do you see any issues if we take in what we have today and address the
finer-grained optimization as next step?

- vatsa



vatsa at linux

Mar 30, 2012, 9:09 PM

Post #16 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

* Srivatsa Vaddagiri <vatsa [at] linux> [2012-03-31 09:37:45]:

> The issue is with ticketlocks though. VCPUs could go into a spin w/o
> a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
> that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
> and releases the lock. VCPU1 is next eligible to take the lock. If

Sorry I meant to say "VCPU2 is next eligible ..."

> that is not scheduled early enough by host, then remaining vcpus would keep
> spinning (even though lock is technically not held by anybody) w/o making
> forward progress.
>
> In that situation, what we really need is for the guest to hint to host
> scheduler to schedule VCPU1 early (via yield_to or something similar).

s/VCPU1/VCPU2 ..

- vatsa



mingo at kernel

Mar 31, 2012, 1:11 AM

Post #17 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

* Andi Kleen <andi [at] firstfloor> wrote:

> > Care to back that up with numbers and proper trace evidence
> > instead of handwaving?
>
> E.g. my plumbers presentations on lock and mm scalability from
> last year has some graphs that show this very clearly, plus
> some additional data on the mutexes. This compares to the
> glibc futex locks, which perform much better than the kernel
> mutex locks on larger systems under higher contention

If you mean these draft slides:

http://www.halobates.de/plumbers-fork-locks_v2.pdf

it has very little verifiable information in it. It just
cryptically says lock hold time "microbenchmark", which might or
might not be a valid measurement.

You could have been honest and straightforward in your first
mail:

"I ran workload X on machine Y, and got results Z."

Instead you are *hindering* the discussion:

> Given your tone I will not supply an URL. [...]

If you meant the above URL then it's not the proper numbers
Thomas asked for, just some vague slides. If you meant something
else then put up or shut up.

Thanks,

Ingo


avi at redhat

Apr 1, 2012, 6:18 AM

Post #18 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/30/2012 01:07 PM, Raghavendra K T wrote:
> On 03/29/2012 11:33 PM, Raghavendra K T wrote:
>> On 03/29/2012 03:28 PM, Avi Kivity wrote:
>>> On 03/28/2012 08:21 PM, Raghavendra K T wrote:
>
>> I really like below ideas. Thanks for that!.
>>
>>> - from the PLE handler, don't wake up a vcpu that is sleeping
>>> because it
>>> is waiting for a kick
>>
>> How about, adding another pass in the beginning of kvm_vcpu_on_spin()
>> to check if any vcpu is already kicked. This would almost result in
>> yield_to(kicked_vcpu). IMO this is also worth trying.
>>
>> will try above ideas soon.
>>
>
> I have patch something like below in mind to try:
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d3b98b1..5127668 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  	 * else and called schedule in __vcpu_run. Hopefully that
>  	 * VCPU is holding the lock that we need and will release it.
>  	 * We approximate round-robin by starting at the last boosted VCPU.
> +	 * Priority is given to vcpu that are unhalted.
>  	 */
> -	for (pass = 0; pass < 2 && !yielded; pass++) {
> +	for (pass = 0; pass < 3 && !yielded; pass++) {
>  		kvm_for_each_vcpu(i, vcpu, kvm) {
>  			struct task_struct *task = NULL;
>  			struct pid *pid;
> -			if (!pass && i < last_boosted_vcpu) {
> +			if (!pass && !vcpu->pv_unhalted)
> +				continue;
> +			else if (pass == 1 && i < last_boosted_vcpu) {
>  				i = last_boosted_vcpu;
>  				continue;
> -			} else if (pass && i > last_boosted_vcpu)
> +			} else if (pass == 2 && i > last_boosted_vcpu)
>  				break;
>  			if (vcpu == me)
>  				continue;
>

Actually I think this is unneeded. The loop tries to find vcpus that
are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
don't match this condition.

--
error compiling committee.c: too many arguments to function



avi at redhat

Apr 1, 2012, 6:31 AM

Post #19 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/31/2012 01:07 AM, Thomas Gleixner wrote:
> On Fri, 30 Mar 2012, H. Peter Anvin wrote:
>
> > What is the current status of this patchset? I haven't looked at it too
> > closely because I have been focused on 3.4 up until now...
>
> The real question is whether these heuristics are the correct approach
> or not.
>
> If I look at it from the non virtualized kernel side then this is ass
> backwards. We know already that we are holding a spinlock which might
> cause other (v)cpus going into eternal spin. The non virtualized
> kernel solves this by disabling preemption and therefor getting out of
> the critical section as fast as possible,
>
> The virtualization problem reminds me a lot of the problem which RT
> kernels are observing where non raw spinlocks are turned into
> "sleeping spinlocks" and therefor can cause throughput issues for non
> RT workloads.
>
> Though the virtualized situation is even worse. Any preempted guest
> section which holds a spinlock is prone to cause unbound delays.
>
> The paravirt ticketlock solution can only mitigate the problem, but
> not solve it. With massive overcommit there is always a way to trigger
> worst case scenarious unless you are educating the scheduler to cope
> with that.
>
> So if we need to fiddle with the scheduler and frankly that's the only
> way to get a real gain (the numbers, which are achieved by this
> patches, are not that impressive) then the question arises whether we
> should turn the whole thing around.
>
> I know that Peter is going to go berserk on me, but if we are running
> a paravirt guest then it's simple to provide a mechanism which allows
> the host (aka hypervisor) to check that in the guest just by looking
> at some global state.
>
> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be
> modified to indicate the guest to invoke an exit when the last nested
> lock has been released.

Interesting idea (I think it has been raised before btw, don't recall
by whom).

One thing about it is that it can give many false positives. Consider a
fine-grained spinlock that is being accessed by many threads. That is,
the lock is taken and released with high frequency, but there is no
contention, because each vcpu is accessing a different instance. So the
host scheduler will needlessly delay preemption of vcpus that happen to
be holding a lock, even though this gains nothing.

A second issue may happen with a lock that is taken and released with
high frequency, with a high hold percentage. The host scheduler may
always sample the guest in a held state, leading it to conclude that
it's exceeding its timeout when in fact the lock is held for a short
time only.

> Of course this needs to be time bound, so a rogue guest cannot
> monopolize the cpu forever, but that's the least to worry about
> problem simply because a guest which does not get out of a spinlocked
> region within a certain amount of time is borked and elegible to
> killing anyway.

Hopefully not killing! Just because a guest doesn't scale well, or even
if it's deadlocked, doesn't mean it should be killed. Just preempt it.

> Thoughts ?

It's certainly interesting. Maybe a combination is worthwhile - prevent
lockholder preemption for a short period of time AND put waiters to
sleep in case that period is insufficient to release the lock.

--
error compiling committee.c: too many arguments to function



raghavendra.kt at linux

Apr 1, 2012, 6:48 AM

Post #20 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 04/01/2012 06:48 PM, Avi Kivity wrote:
> On 03/30/2012 01:07 PM, Raghavendra K T wrote:
>> On 03/29/2012 11:33 PM, Raghavendra K T wrote:
>>> On 03/29/2012 03:28 PM, Avi Kivity wrote:
>>>> On 03/28/2012 08:21 PM, Raghavendra K T wrote:
>>
>>> I really like below ideas. Thanks for that!.
>>>
>>>> - from the PLE handler, don't wake up a vcpu that is sleeping
>>>> because it
>>>> is waiting for a kick
>>>
>>> How about, adding another pass in the beginning of kvm_vcpu_on_spin()
>>> to check if any vcpu is already kicked. This would almost result in
>>> yield_to(kicked_vcpu). IMO this is also worth trying.
>>>
>>> will try above ideas soon.
>>>
>>
>> I have patch something like below in mind to try:
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index d3b98b1..5127668 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>  	 * else and called schedule in __vcpu_run. Hopefully that
>>  	 * VCPU is holding the lock that we need and will release it.
>>  	 * We approximate round-robin by starting at the last boosted VCPU.
>> +	 * Priority is given to vcpu that are unhalted.
>>  	 */
>> -	for (pass = 0; pass < 2 && !yielded; pass++) {
>> +	for (pass = 0; pass < 3 && !yielded; pass++) {
>>  		kvm_for_each_vcpu(i, vcpu, kvm) {
>>  			struct task_struct *task = NULL;
>>  			struct pid *pid;
>> -			if (!pass && i < last_boosted_vcpu) {
>> +			if (!pass && !vcpu->pv_unhalted)
>> +				continue;
>> +			else if (pass == 1 && i < last_boosted_vcpu) {
>>  				i = last_boosted_vcpu;
>>  				continue;
>> -			} else if (pass && i > last_boosted_vcpu)
>> +			} else if (pass == 2 && i > last_boosted_vcpu)
>>  				break;
>>  			if (vcpu == me)
>>  				continue;
>>
>
> Actually I think this is unneeded. The loops tries to find vcpus that
> are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
> don't match this condition.
>

I almost agree. But a thought in the corner of my mind:

Suppose there are 8 vcpus runnable, out of which 4 have been kicked
but are not running; making yield_to target those 4 vcpus would result
in better lock progress, no?

I still have a little problem getting a PLE setup here (so I am
rebasing the patches instead). Once I get PLE running, if the numbers
prove no improvement, I will drop this idea.



avi at redhat

Apr 1, 2012, 6:53 AM

Post #21 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 04/01/2012 04:48 PM, Raghavendra K T wrote:
>>> I have patch something like below in mind to try:
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index d3b98b1..5127668 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>>  	 * else and called schedule in __vcpu_run. Hopefully that
>>>  	 * VCPU is holding the lock that we need and will release it.
>>>  	 * We approximate round-robin by starting at the last boosted VCPU.
>>> +	 * Priority is given to vcpu that are unhalted.
>>>  	 */
>>> -	for (pass = 0; pass < 2 && !yielded; pass++) {
>>> +	for (pass = 0; pass < 3 && !yielded; pass++) {
>>>  		kvm_for_each_vcpu(i, vcpu, kvm) {
>>>  			struct task_struct *task = NULL;
>>>  			struct pid *pid;
>>> -			if (!pass && i < last_boosted_vcpu) {
>>> +			if (!pass && !vcpu->pv_unhalted)
>>> +				continue;
>>> +			else if (pass == 1 && i < last_boosted_vcpu) {
>>>  				i = last_boosted_vcpu;
>>>  				continue;
>>> -			} else if (pass && i > last_boosted_vcpu)
>>> +			} else if (pass == 2 && i > last_boosted_vcpu)
>>>  				break;
>>>  			if (vcpu == me)
>>>  				continue;
>>>
>>
>> Actually I think this is unneeded. The loops tries to find vcpus that
>> are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
>> don't match this condition.
>>
>
>
> I almost agree. But at corner of my thought,
>
> Suppose there are 8 vcpus runnable out of which 4 of them are kicked
> but not running, making yield_to those 4 vcpus would result in better
> lock progress. no?

That's what the code does.

> I still have little problem getting PLE setup, here (instead
> rebasing patches).
> Once I get PLE to get that running, and numbers prove no improvement,
> I will drop this idea.
>

I'm interested in how PLE does vs. your patches, both with PLE enabled
and disabled.

--
error compiling committee.c: too many arguments to function



raghavendra.kt at linux

Apr 1, 2012, 6:56 AM

Post #22 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 04/01/2012 07:23 PM, Avi Kivity wrote:
> On 04/01/2012 04:48 PM, Raghavendra K T wrote:
>>>> I have patch something like below in mind to try:
>
> I'm interested in how PLE does vs. your patches, both with PLE enabled
> and disabled.
>

Sure, will update with the experimental results.



juergen.gross at ts

Apr 1, 2012, 9:36 PM

Post #23 of 36
Re: [Xen-devel] [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/31/2012 12:07 AM, Thomas Gleixner wrote:
> On Fri, 30 Mar 2012, H. Peter Anvin wrote:
>
>> What is the current status of this patchset? I haven't looked at it too
>> closely because I have been focused on 3.4 up until now...
> The real question is whether these heuristics are the correct approach
> or not.
>
> If I look at it from the non virtualized kernel side then this is ass
> backwards. We know already that we are holding a spinlock which might
> cause other (v)cpus going into eternal spin. The non virtualized
> kernel solves this by disabling preemption and therefor getting out of
> the critical section as fast as possible,
>
> The virtualization problem reminds me a lot of the problem which RT
> kernels are observing where non raw spinlocks are turned into
> "sleeping spinlocks" and therefor can cause throughput issues for non
> RT workloads.
>
> Though the virtualized situation is even worse. Any preempted guest
> section which holds a spinlock is prone to cause unbound delays.
>
> The paravirt ticketlock solution can only mitigate the problem, but
> not solve it. With massive overcommit there is always a way to trigger
> worst case scenarious unless you are educating the scheduler to cope
> with that.
>
> So if we need to fiddle with the scheduler and frankly that's the only
> way to get a real gain (the numbers, which are achieved by this
> patches, are not that impressive) then the question arises whether we
> should turn the whole thing around.
>
> I know that Peter is going to go berserk on me, but if we are running
> a paravirt guest then it's simple to provide a mechanism which allows
> the host (aka hypervisor) to check that in the guest just by looking
> at some global state.
>
> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be
> modified to indicate the guest to invoke an exit when the last nested
> lock has been released.
>
> Of course this needs to be time bound, so a rogue guest cannot
> monopolize the cpu forever, but that's the least to worry about
> problem simply because a guest which does not get out of a spinlocked
> region within a certain amount of time is borked and elegible to
> killing anyway.
>
> Thoughts ?

I used this approach in 2008:

http://lists.xen.org/archives/html/xen-devel/2008-12/msg00740.html

It worked very well, but it was rejected at that time. I wouldn't mind trying
it again if there is some support from your side. :-)


Juergen

--

Juergen Gross Principal Developer Operating Systems
PDG ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions e-mail: juergen.gross [at] ts
Domagkstr. 28 Internet: ts.fujitsu.com
D-80807 Muenchen Company details: ts.fujitsu.com/imprint.html



tglx at linutronix

Apr 2, 2012, 2:26 AM

Post #24 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On Sun, 1 Apr 2012, Avi Kivity wrote:
> On 03/31/2012 01:07 AM, Thomas Gleixner wrote:
> > On Fri, 30 Mar 2012, H. Peter Anvin wrote:
> >
> > > What is the current status of this patchset? I haven't looked at it too
> > > closely because I have been focused on 3.4 up until now...
> >
> > The real question is whether these heuristics are the correct approach
> > or not.
> >
> > If I look at it from the non virtualized kernel side then this is ass
> > backwards. We know already that we are holding a spinlock which might
> > cause other (v)cpus going into eternal spin. The non virtualized
> > kernel solves this by disabling preemption and therefor getting out of
> > the critical section as fast as possible,
> >
> > The virtualization problem reminds me a lot of the problem which RT
> > kernels are observing where non raw spinlocks are turned into
> > "sleeping spinlocks" and therefor can cause throughput issues for non
> > RT workloads.
> >
> > Though the virtualized situation is even worse. Any preempted guest
> > section which holds a spinlock is prone to cause unbound delays.
> >
> > The paravirt ticketlock solution can only mitigate the problem, but
> > not solve it. With massive overcommit there is always a way to trigger
> > worst case scenarious unless you are educating the scheduler to cope
> > with that.
> >
> > So if we need to fiddle with the scheduler and frankly that's the only
> > way to get a real gain (the numbers, which are achieved by this
> > patches, are not that impressive) then the question arises whether we
> > should turn the whole thing around.
> >
> > I know that Peter is going to go berserk on me, but if we are running
> > a paravirt guest then it's simple to provide a mechanism which allows
> > the host (aka hypervisor) to check that in the guest just by looking
> > at some global state.
> >
> > So if a guest exits due to an external event it's easy to inspect the
> > state of that guest and avoid to schedule away when it was interrupted
> > in a spinlock held section. That guest/host shared state needs to be
> > modified to indicate the guest to invoke an exit when the last nested
> > lock has been released.
>
> Interesting idea (I think it has been raised before btw, don't recall by
> who).

Someone posted a pointer to that old thread.

> One thing about it is that it can give many false positives. Consider a
> fine-grained spinlock that is being accessed by many threads. That is,
> the lock is taken and released with high frequency, but there is no
> contention, because each vcpu is accessing a different instance. So the
> host scheduler will needlessly delay preemption of vcpus that happen to
> be holding a lock, even though this gains nothing.

You're talking about per-cpu locks, right? I can see the point there,
but per-cpu stuff might be worth annotating to avoid this.

> A second issue may happen with a lock that is taken and released with
> high frequency, with a high hold percentage. The host scheduler may
> always sample the guest in a held state, leading it to conclude that
> it's exceeding its timeout when in fact the lock is held for a short
> time only.

Well, no. You can avoid that.

    host                          VCPU
                                  spin_lock()
                                    modify_global_state()
    exit
    check_global_state()
    mark_global_state()
    reschedule vcpu

                                  spin_unlock()
                                    check_global_state()
                                      trigger_exit()

So when an exit occurs in the locked section, the host can mark
the global state to tell the guest to issue a trap on unlock.
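
A minimal sketch of that shared state (all names hypothetical, just to
pin down the protocol):

	struct pv_spin_state {		/* lives in a guest/host shared page */
		u32 lockdepth;		/* guest: nested spinlocks currently held */
		u32 exit_on_unlock;	/* host: set while a reschedule is deferred */
	};

	/* guest, in spin_lock(): */
	static inline void pv_lock_enter(struct pv_spin_state *s)
	{
		s->lockdepth++;
	}

	/* guest, in spin_unlock(): */
	static inline void pv_lock_exit(struct pv_spin_state *s)
	{
		if (--s->lockdepth == 0 && s->exit_on_unlock) {
			s->exit_on_unlock = 0;
			pv_yield_hypercall();	/* hypothetical: trap to the host */
		}
	}

On every exit the host checks lockdepth; if it is non-zero it defers
the reschedule and sets exit_on_unlock, subject to the time bound
below.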

> > Of course this needs to be time bound, so a rogue guest cannot
> > monopolize the cpu forever, but that's the least to worry about
> > problem simply because a guest which does not get out of a spinlocked
> > region within a certain amount of time is borked and elegible to
> > killing anyway.
>
> Hopefully not killing! Just because a guest doesn't scale well, or even
> if it's deadlocked, doesn't mean it should be killed. Just preempt it.

:)

> > Thoughts ?
>
> It's certainly interesting. Maybe a combination is worthwhile - prevent
> lockholder preemption for a short period of time AND put waiters to
> sleep in case that period is insufficient to release the lock.

Right, but as Srivatsa pointed out this still needs the ticket lock
ordering support to avoid guests being completely starved.

Thanks,

tglx


Ian.Campbell at citrix

Apr 2, 2012, 2:42 AM

Post #25 of 36
Re: [Xen-devel] [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On Fri, 2012-03-30 at 23:07 +0100, Thomas Gleixner wrote:
> So if we need to fiddle with the scheduler and frankly that's the only
> way to get a real gain (the numbers, which are achieved by this
> patches, are not that impressive) then the question arises whether we
> should turn the whole thing around.

It probably doesn't materially affect your core point (which seems valid
to me), but it's worth pointing out that the numbers presented in this
thread are AFAICT mostly focused on ensuring that the impact of
this infrastructure is acceptable on native, rather than showing the
benefits for virtualized workloads.

Ian.

