
Mailing List Archive: Linux: Kernel

[PATCH RFC V6 0/11] Paravirtualized ticketlocks



raghavendra.kt at linux

Mar 21, 2012, 3:20 AM

[PATCH RFC V6 0/11] Paravirtualized ticketlocks

From: Jeremy Fitzhardinge <jeremy.fitzhardinge [at] citrix>

Changes since last posting: (Raghavendra K T)
- Rebased to linux-3.3-rc6.
- Used function+enum in place of macro (better type checking).
- Use cmpxchg while resetting zero status for a possible race
  [suggested by Dave Hansen for KVM patches].

This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism.

Ticket locks have an inherent problem in a virtualized case, because
the vCPUs are scheduled rather than running concurrently (ignoring
gang scheduled vCPUs). This can result in catastrophic performance
collapses when the vCPU scheduler doesn't schedule the correct "next"
vCPU, and ends up scheduling a vCPU which burns its entire timeslice
spinning. (Note that this is not the same problem as lock-holder
preemption, which this series also addresses; that's also a problem,
but not catastrophic).

(See Thomas Friebel's talk "Prevent Guests from Spinning Around"
http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)

Currently we deal with this by having PV spinlocks, which adds a layer
of indirection in front of all the spinlock functions, and defining a
completely new implementation for Xen (and for other pvops users, but
there are none at present).

PV ticketlocks keep the existing ticketlock implementation
(fastpath) as-is, but add a couple of pvops for the slow paths:

- If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
iterations, then call out to the __ticket_lock_spinning() pvop,
which allows a backend to block the vCPU rather than spinning. This
pvop can set the lock into "slowpath state".

- When releasing a lock, if it is in "slowpath state", then call
__ticket_unlock_kick() to kick the next vCPU in line awake. If the
lock is no longer in contention, it also clears the slowpath flag.
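
The two pvops above can be sketched in plain userspace C to show the control flow (a hedged illustration: __ticket_lock_spinning() is the real pvop name from the series, but the stub body here just counts calls and simulates a wake-up; the kernel version blocks the vCPU in the hypervisor):

```c
#include <stdint.h>

#define SPIN_THRESHOLD (1u << 4)  /* tiny threshold so the demo terminates quickly */

static unsigned spinning_calls;   /* counts slowpath entries */

/* Stub pvop: in the series, __ticket_lock_spinning() lets the backend
 * block the vCPU instead of spinning.  Here it just counts the call and
 * pretends the hypervisor woke us after the lock became free. */
static void __ticket_lock_spinning(uint8_t *head, uint8_t my_ticket)
{
    spinning_calls++;
    *head = my_ticket;  /* demo shortcut: simulate being handed the lock */
}

/* Skeleton of the lock side: spin SPIN_THRESHOLD times on the fastpath,
 * then drop into the slowpath pvop and re-arm the spin counter. */
static void ticket_lock(uint8_t *head, uint8_t my_ticket)
{
    for (;;) {
        unsigned count = SPIN_THRESHOLD;
        do {
            if (*head == my_ticket)
                return;  /* our turn: lock acquired */
        } while (--count);
        __ticket_lock_spinning(head, my_ticket);
    }
}
```

The re-armed spin counter after each slowpath call matches the shape of the inner-loop C listing later in this mail.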

The "slowpath state" is stored in the LSB of the lock's tail ticket.
This has the effect of halving the maximum number of CPUs (so a
"small ticket" can deal with 128 CPUs, and a "large ticket" with
32768).
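
Concretely, the encoding can be sketched like this ("small ticket" case; the flag and field names follow the series, but the standalone union here is only illustrative):

```c
#include <stdint.h>

/* "Small ticket" layout: 8-bit head and tail, as used when NR_CPUS < 256.
 * Names follow the series; the union itself is a sketch. */
#define TICKET_SLOWPATH_FLAG ((uint8_t)1)  /* LSB of the tail ticket */
#define TICKET_LOCK_INC      ((uint8_t)2)  /* tickets advance by 2 to skip the flag bit */

typedef union {
    uint16_t head_tail;
    struct { uint8_t head, tail; } tickets;
} tiny_ticketlock_t;

/* Reserving the LSB halves the number of distinct tickets an N-bit
 * tail can hold: 2^(N-1) CPUs instead of 2^N. */
static unsigned ticket_max_cpus(unsigned tail_bits)
{
    return 1u << (tail_bits - 1);
}

static int in_slowpath(const tiny_ticketlock_t *lock)
{
    return lock->tickets.tail & TICKET_SLOWPATH_FLAG;
}
```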

This series provides a Xen implementation, but it should be
straightforward to add a KVM implementation as well.

Overall, it results in a large reduction in code, it makes the native
and virtualized cases closer, and it removes a layer of indirection
around all the spinlock functions.

The fast path (taking an uncontended lock which isn't in "slowpath"
state) is optimal, identical to the non-paravirtualized case.

The inner part of the ticket lock code becomes:

	inc = xadd(&lock->tickets, inc);
	inc.tail &= ~TICKET_SLOWPATH_FLAG;

	if (likely(inc.head == inc.tail))
		goto out;
	for (;;) {
		unsigned count = SPIN_THRESHOLD;
		do {
			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
				goto out;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}
out:	barrier();
which results in:
push %rbp
mov %rsp,%rbp

mov $0x200,%eax
lock xadd %ax,(%rdi)
movzbl %ah,%edx
cmp %al,%dl
jne 1f # Slowpath if lock in contention

pop %rbp
retq

### SLOWPATH START
1: and $-2,%edx
movzbl %dl,%esi

2: mov $0x800,%eax
jmp 4f

3: pause
sub $0x1,%eax
je 5f

4: movzbl (%rdi),%ecx
cmp %cl,%dl
jne 3b

pop %rbp
retq

5: callq *__ticket_lock_spinning
jmp 2b
### SLOWPATH END

With CONFIG_PARAVIRT_SPINLOCKS=n, the code changes slightly: the
fastpath case is straight through (taking the lock without
contention), and the spin loop is out of line:

push %rbp
mov %rsp,%rbp

mov $0x100,%eax
lock xadd %ax,(%rdi)
movzbl %ah,%edx
cmp %al,%dl
jne 1f

pop %rbp
retq

### SLOWPATH START
1: pause
movzbl (%rdi),%eax
cmp %dl,%al
jne 1b

pop %rbp
retq
### SLOWPATH END

The unlock code is complicated by the need to both add to the lock's
"head" and fetch the slowpath flag from "tail". This version of the
patch uses a locked add to do this, followed by a test to see if the
slowflag is set. The lock prefix acts as a full memory barrier, so we
can be sure that other CPUs will have seen the unlock before we read
the flag (without the barrier the read could be fetched from the
store queue before it hits memory, which could result in a deadlock).

Since this is all unnecessary complication if you're not using PV
ticket locks, the code also uses the jump-label machinery to fall back
to the standard "add"-based unlock in the non-PV case:

	if (TICKET_SLOWPATH_FLAG &&
	    unlikely(static_branch(&paravirt_ticketlocks_enabled))) {
		arch_spinlock_t prev;

		prev = *lock;
		add_smp(&lock->tickets.head, TICKET_LOCK_INC);

		/* add_smp() is a full mb() */

		if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
			__ticket_unlock_slowpath(lock, prev);
	} else
		__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
which generates:
push %rbp
mov %rsp,%rbp

nop5 # replaced by 5-byte jmp 2f when PV enabled

# non-PV unlock
addb $0x2,(%rdi)

1: pop %rbp
retq

### PV unlock ###
2: movzwl (%rdi),%esi # Fetch prev

lock addb $0x2,(%rdi) # Do unlock

testb $0x1,0x1(%rdi) # Test flag
je 1b # Finished if not set

### Slow path ###
add $2,%sil # Add "head" in old lock state
mov %esi,%edx
and $0xfe,%dh # clear slowflag for comparison
movzbl %dh,%eax
cmp %dl,%al # If head == tail (uncontended)
je 4f # clear slowpath flag

# Kick next CPU waiting for lock
3: movzbl %sil,%esi
callq *pv_lock_ops.kick

pop %rbp
retq

# Lock no longer contended - clear slowflag
4: mov %esi,%eax
lock cmpxchg %dx,(%rdi) # cmpxchg to clear flag
cmp %si,%ax
jne 3b # If clear failed, then kick

pop %rbp
retq

So when not using PV ticketlocks, the unlock sequence just has a
5-byte nop added to it, and the PV case is reasonably straightforward
aside from requiring a "lock add".
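
For readers who prefer C over asm, the PV-unlock slow path above can be read back roughly as follows (a hedged sketch: the real __ticket_unlock_slowpath() lives in the patches; this just mirrors the compare/cmpxchg/kick structure of the assembly, with GCC's __sync builtin standing in for the kernel's cmpxchg and a counter standing in for pv_lock_ops.kick):

```c
#include <stdint.h>

#define TICKET_SLOWPATH_FLAG ((uint8_t)1)
#define TICKET_LOCK_INC      ((uint8_t)2)

typedef union {
    uint16_t head_tail;
    struct { uint8_t head, tail; } tickets;
} tiny_ticketlock_t;

static unsigned kicks;  /* stands in for the pv_lock_ops.kick call */

/* 'prev' is the lock value sampled just before the unlocking add,
 * exactly as the asm fetches it with movzwl before the lock addb. */
static void ticket_unlock_slowpath(tiny_ticketlock_t *lock, tiny_ticketlock_t prev)
{
    tiny_ticketlock_t old = prev;

    old.tickets.head += TICKET_LOCK_INC;  /* account for the unlock we just did */

    if ((uint8_t)(old.tickets.tail & ~TICKET_SLOWPATH_FLAG) == old.tickets.head) {
        /* Lock looks uncontended: try to clear the slowpath flag. */
        tiny_ticketlock_t cleared = old;
        cleared.tickets.tail &= (uint8_t)~TICKET_SLOWPATH_FLAG;
        if (__sync_bool_compare_and_swap(&lock->head_tail,
                                         old.head_tail, cleared.head_tail))
            return;  /* flag cleared, nobody to kick */
    }
    /* Still contended, or the cmpxchg lost a race: kick the next waiter. */
    kicks++;
}
```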

Note that the patch series needs the jump-label split posted at
https://lkml.org/lkml/2012/2/21/167 to avoid a cyclic header
dependency (to use the jump-label machinery).

TODO: remove CONFIG_PARAVIRT_SPINLOCK once everybody is convinced.

Results:
=======
Machine: IBM xSeries with Intel(R) Xeon(R) X5570 2.93 GHz CPU, 8 cores, 64 GB RAM

OS: enterprise linux
Gcc Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.5 20110214

Base kernel = 3.3-rc6 (cloned Sunday 4th March)
Unit = tps (higher is better)
Benchmark = pgbench based on pgsql 9.2-dev:
http://www.postgresql.org/ftp/snapshot/dev/ (link given by Attilo)

Tool used to collect benchmark results: git://git.postgresql.org/git/pgbench-tools.git
Config is the same as the tool's default except MAX_WORKER=8.

Averages taken over 10 iterations, analysed with the ministat tool.

BASE (CONFIG_PARAVIRT_SPINLOCK = n)
==========================================
------ scale=1 (32MB shared buf) ----------
Client N Min Max Median Avg Stddev
1 x 10 3718.4108 4182.7842 3855.1089 3914.535 196.91943
2 x 10 7462.1997 7921.4638 7855.1965 7808.1603 135.37891
4 x 10 21682.402 23445.941 22151.922 22224.329 507.32299
8 x 10 43309.638 48103.494 45332.24 45593.135 1496.3735
16 x 10 108624.95 109227.45 108997.96 108987.84 210.15136
32 x 10 112582.1 113170.42 112776.92 112830.09 202.70556
64 x 10 100576.34 104011.92 103299.89 103034.24 928.24581
----------------
------ scale=500 (16GB shared buf) ----------
Client N Min Max Median Avg Stddev
1 x 10 3451.9407 3948.3127 3512.2215 3610.6086 201.58491
2 x 10 7311.1769 7383.2552 7341.0847 7342.2349 21.231902
4 x 10 19582.548 26909.72 24778.282 23893.162 2587.6103
8 x 10 52292.765 54561.472 53171.286 53216.256 733.16626
16 x 10 89643.138 90353.598 89970.878 90018.505 213.73589
32 x 10 81010.402 81556.02 81256.217 81247.223 174.31678
64 x 10 83855.565 85048.602 84087.693 84201.86 352.25182
----------------

BASE + jumplabel_split + jeremy patch (CONFIG_PARAVIRT_SPINLOCK = n)
=====================================================
------ scale=1 (32MB shared buf) ----------
Client N Min Max Median Avg Stddev
1 x 10 3669.2156 4102.5109 3732.9526 3784.4072 129.14134
2 x 10 7423.984 7797.5046 7446.8946 7500.2076 119.85178
4 x 10 21332.859 26327.619 24175.239 24084.731 1841.8335
8 x 10 43149.937 49515.406 45779.204 45838.782 2191.6348
16 x 10 109512.27 110407.82 109977.15 110019.72 283.41371
32 x 10 112653.3 113156.22 113023.24 112973.56 151.54906
64 x 10 102816.08 104514.48 103843.95 103658.17 515.10115
----------------
------ scale=500 (16GB shared buf) ----------
Client N Min Max Median Avg Stddev
1 x 10 3501.3548 3985.3114 3609.0236 3705.6665 224.3719
2 x 10 7275.246 9026.7466 7447.4013 7581.6494 512.75417
4 x 10 19506.151 22661.801 20843.142 21154.886 1329.5591
8 x 10 53150.178 55594.073 54132.383 54227.117 728.42913
16 x 10 84281.93 91234.692 90917.411 90249.053 2108.903
32 x 10 80860.018 81500.369 81212.514 81201.361 205.66759
64 x 10 84090.033 85423.79 84505.041 84588.913 436.69012
----------------

BASE + jumplabel_split+ jeremy patch (CONFIG_PARAVIRT_SPINLOCK = y)
=====================================================
------ scale=1 (32MB shared buf) ----------
Client N Min Max Median Avg Stddev
1 x 10 3749.8427 4149.0224 4120.6696 3982.6575 197.32902
2 x 10 7786.4802 8149.0902 7956.6706 7970.5441 94.42967
4 x 10 22053.383 27424.414 23514.166 23698.775 1492.792
8 x 10 44585.203 48082.115 46123.156 46135.687 1232.9399
16 x 10 108290.15 109655.13 108924 108968.59 476.48336
32 x 10 112359.02 112966.97 112570.06 112611.48 180.51304
64 x 10 103020.85 104042.71 103457.83 103496.84 291.19165
----------------
------ scale=500 (16GB shared buf) ----------
Client N Min Max Median Avg Stddev
1 x 10 3462.6179 3898.5392 3871.6231 3738.0069 196.86077
2 x 10 7358.8148 7396.1029 7387.8169 7382.229 13.117357
4 x 10 19734.357 27799.895 21840.41 22964.202 3070.8067
8 x 10 52412.64 55214.305 53481.185 53552.261 878.21383
16 x 10 89862.081 90375.328 90161.886 90139.154 202.49282
32 x 10 80140.853 80898.452 80683.819 80671.361 227.13277
64 x 10 83402.864 84868.355 84311.472 84281.567 428.6501
----------------

Summary of Avg
==============

Client   BASE          Base+patch                Base+patch
         PARAVIRT=n    PARAVIRT=n (%improve)     PARAVIRT=y (%improve)
------ scale=1 (32MB shared buf) ----------
1 3914.535 3784.4072 (-3.32422) 3982.6575 (+1.74025)
2 7808.1603 7500.2076 (-3.94399) 7970.5441 (+2.07967)
4 22224.329 24084.731 (+8.37102) 23698.775 (+6.63438)
8 45593.135 45838.782 (+0.538781) 46135.687 (+1.18999)
16 108987.84 110019.72 (+0.946785) 108968.59 (-0.0176625)
32 112830.09 112973.56 (+0.127156) 112611.48 (-0.193752)
64 103034.24 103658.17 (+0.605556) 103496.84 (+0.448977)

------ scale=500 (~16GB shared buf) ----------
1 3610.6086 3705.6665 (+2.63274) 3738.0069 (+3.52844)
2 7342.2349 7581.6494 (+3.26079) 7382.229 (+0.544713)
4 23893.162 21154.886 (-11.4605) 22964.202 (-3.88797)
8 53216.256 54227.117 (+1.89953) 53552.261 (+0.631395)
16 90018.505 90249.053 (+0.256112) 90139.154 (+0.134027)
32 81247.223 81201.361 (-0.0564475) 80671.361 (-0.708777)
64 84201.86 84588.913 (+0.459673) 84281.567 (+0.0946618)

Thoughts? Comments? Suggestions?

Jeremy Fitzhardinge (10):
x86/spinlock: replace pv spinlocks with pv ticketlocks
x86/ticketlock: don't inline _spin_unlock when using paravirt
spinlocks
x86/ticketlock: collapse a layer of functions
xen: defer spinlock setup until boot CPU setup
xen/pvticketlock: Xen implementation for PV ticket locks
xen/pvticketlocks: add xen_nopvspin parameter to disable xen pv
ticketlocks
x86/pvticketlock: use callee-save for lock_spinning
x86/pvticketlock: when paravirtualizing ticket locks, increment by 2
x86/ticketlock: add slowpath logic
xen/pvticketlock: allow interrupts to be enabled while blocking

Stefano Stabellini (1):
xen: enable PV ticketlocks on HVM Xen
---
arch/x86/Kconfig | 3 +
arch/x86/include/asm/paravirt.h | 32 +---
arch/x86/include/asm/paravirt_types.h | 10 +-
arch/x86/include/asm/spinlock.h | 128 ++++++++----
arch/x86/include/asm/spinlock_types.h | 16 +-
arch/x86/kernel/paravirt-spinlocks.c | 18 +--
arch/x86/xen/smp.c | 3 +-
arch/x86/xen/spinlock.c | 383 +++++++++++----------------------
kernel/Kconfig.locks | 2 +-
9 files changed, 245 insertions(+), 350 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


avi at redhat

Mar 26, 2012, 7:25 AM

Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/21/2012 12:20 PM, Raghavendra K T wrote:
> From: Jeremy Fitzhardinge <jeremy.fitzhardinge [at] citrix>
>
> Changes since last posting: (Raghavendra K T)
> [.
> - Rebased to linux-3.3-rc6.
> - used function+enum in place of macro (better type checking)
> - use cmpxchg while resetting zero status for possible race
> [suggested by Dave Hansen for KVM patches ]
> ]
>
> This series replaces the existing paravirtualized spinlock mechanism
> with a paravirtualized ticketlock mechanism.
>
> Ticket locks have an inherent problem in a virtualized case, because
> the vCPUs are scheduled rather than running concurrently (ignoring
> gang scheduled vCPUs). This can result in catastrophic performance
> collapses when the vCPU scheduler doesn't schedule the correct "next"
> vCPU, and ends up scheduling a vCPU which burns its entire timeslice
> spinning. (Note that this is not the same problem as lock-holder
> preemption, which this series also addresses; that's also a problem,
> but not catastrophic).
>
> (See Thomas Friebel's talk "Prevent Guests from Spinning Around"
> http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)
>
> Currently we deal with this by having PV spinlocks, which adds a layer
> of indirection in front of all the spinlock functions, and defining a
> completely new implementation for Xen (and for other pvops users, but
> there are none at present).
>
> PV ticketlocks keeps the existing ticketlock implemenentation
> (fastpath) as-is, but adds a couple of pvops for the slow paths:
>
> - If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
> iterations, then call out to the __ticket_lock_spinning() pvop,
> which allows a backend to block the vCPU rather than spinning. This
> pvop can set the lock into "slowpath state".
>
> - When releasing a lock, if it is in "slowpath state", the call
> __ticket_unlock_kick() to kick the next vCPU in line awake. If the
> lock is no longer in contention, it also clears the slowpath flag.
>
> The "slowpath state" is stored in the LSB of the within the lock tail
> ticket. This has the effect of reducing the max number of CPUs by
> half (so, a "small ticket" can deal with 128 CPUs, and "large ticket"
> 32768).
>
> This series provides a Xen implementation, but it should be
> straightforward to add a KVM implementation as well.
>

Looks like a good baseline on which to build the KVM implementation. We
might need some handshake to prevent interference on the host side with
the PLE code.

--
error compiling committee.c: too many arguments to function



raghavendra.kt at linux

Mar 27, 2012, 12:37 AM

Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/26/2012 07:55 PM, Avi Kivity wrote:
> On 03/21/2012 12:20 PM, Raghavendra K T wrote:
>> From: Jeremy Fitzhardinge<jeremy.fitzhardinge [at] citrix>
[...]
>>
>> This series provides a Xen implementation, but it should be
>> straightforward to add a KVM implementation as well.
>>
>
> Looks like a good baseline on which to build the KVM implementation. We
> might need some handshake to prevent interference on the host side with
> the PLE code.
>

Avi, thanks for reviewing. True, it is sort of equivalent to PLE on a
non-PLE machine.

Ingo, Peter,
Can you please let us know if this series can be considered for the
next merge window? Or do you still have some concerns that need
addressing?

I shall rebase the patches to 3.3 and resend. (The main difference
would be UNINLINE_SPIN_UNLOCK and jump label changes to use
static_key_true/false() instead of static_branch.)

Thanks,
Raghu



alan.meadows at gmail

Mar 28, 2012, 9:37 AM

Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

I am happy to see this issue receiving some attention and second the
wish to see these patches be considered for further review and
inclusion in an upcoming release.

Overcommit is not as common in enterprise and single-tenant
virtualized environments as it is in multi-tenant environments, and
frankly we have been suffering.

We have been running an early copy of these patches in our lab and in
a small production node sample set both on 3.2.0-rc4 and 3.3.0-rc6 for
over two weeks now with great success. With the heavy level of
vCPU:pCPU overcommit required for our situation, the patches are
increasing performance by an _order of magnitude_ on our E5645 and
E5620 systems.

Alan Meadows

On Tue, Mar 27, 2012 at 12:37 AM, Raghavendra K T
<raghavendra.kt [at] linux> wrote:
>
> On 03/26/2012 07:55 PM, Avi Kivity wrote:
>>
>> On 03/21/2012 12:20 PM, Raghavendra K T wrote:
>>>
>>> From: Jeremy Fitzhardinge<jeremy.fitzhardinge [at] citrix>
>
> [...]
>
>>>
>>> This series provides a Xen implementation, but it should be
>>> straightforward to add a KVM implementation as well.
>>>
>>
>> Looks like a good baseline on which to build the KVM implementation.  We
>> might need some handshake to prevent interference on the host side with
>> the PLE code.
>>
>
> Avi, Thanks for reviewing. True, it is sort of equivalent to PLE on non
> PLE machine.
>
> Ingo, Peter,
> Can you please let us know if this series can be considered for next merge
> window?
> OR do you still have some concerns that needs addressing.
>
> I shall rebase patches to 3.3 and resend. (main difference would be
> UNINLINE_SPIN_UNLOCK and jump label changes to use static_key_true/false()
> usage instead of static_branch.)
>
> Thanks,
> Raghu
>
>


raghavendra.kt at linux

Mar 28, 2012, 11:21 AM

Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/28/2012 09:39 PM, Alan Meadows wrote:
> I am happy to see this issue receiving some attention and second the
> wish to see these patches be considered for further review and inclusion
> in an upcoming release.
>
> Overcommit is not as common in enterprise and single-tenant virtualized
> environments as it is in multi-tenant environments, and frankly we have
> been suffering.
>
> We have been running an early copy of these patches in our lab and in a
> small production node sample set both on 3.2.0-rc4 and 3.3.0-rc6 for over
> two weeks now with great success. With the heavy level of vCPU:pCPU
> overcommit required for our situation, the patches are increasing
> performance by an _order of magnitude_ on our E5645 and E5620 systems.
>

Thanks Alan for the support. I feel the timing of this patch was a
little bad, though (merge window).

>
> Looks like a good baseline on which to build the KVM implementation. We
> might need some handshake to prevent interference on the host side with
> the PLE code.
>

I think I still missed some point in Avi's comment. I agree that PLE
may be interfering with the above patches (resulting in smaller
performance advantages), but we have not seen performance degradation
with the patches in earlier benchmarks. [Theoretically, the patch has
a slight advantage over PLE in that it at least knows who should run
next.]

So the TODOs on my list for this are:
1. More analysis of performance on PLE machines.
2. Seeing how to implement a handshake to increase performance (if the
PLE + patch combination has a slight negative effect).

Sorry that I could not do more analysis on PLE (as promised last time)
because of machine availability.

I'll do some work on this and come back. But in the meantime, I do not
see it as blocking for the next merge window.

>
> Avi, Thanks for reviewing. True, it is sort of equivalent to PLE on
> non PLE machine.
>
> Ingo, Peter,
> Can you please let us know if this series can be considered for next
> merge window?
> OR do you still have some concerns that needs addressing.
>
> I shall rebase patches to 3.3 and resend. (main difference would be
> UNINLINE_SPIN_UNLOCK and jump label changes to use
> static_key_true/false() usage instead of static_branch.)



avi at redhat

Mar 29, 2012, 2:58 AM

Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/28/2012 08:21 PM, Raghavendra K T wrote:
>
>>
>> Looks like a good baseline on which to build the KVM implementation. We
>> might need some handshake to prevent interference on the host side with
>> the PLE code.
>>
>
> I think I still missed some point in Avi's comment. I agree that PLE
> may be interfering with above patches (resulting in less performance
> advantages). but we have not seen performance degradation with the
> patches in earlier benchmarks. [. theoretically since patch has very
> slight advantage over PLE that atleast it knows who should run next ].

The advantage grows with the vcpu counts and overcommit ratio. If you
have N vcpus and M:1 overcommit, PLE has to guess from N/M queued vcpus
while your patch knows who to wake up.
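
That comparison can be put into back-of-envelope numbers (the uniform-guessing model here is an assumption for illustration, not a measurement):

```c
/* With N vcpus at M:1 overcommit, roughly W = N/M vcpus are queued on
 * the host.  A PLE handler guessing uniformly (without repeats) needs
 * (W + 1) / 2 boost attempts on average to hit the right vcpu; a
 * directed kick, as in this series, always needs exactly 1. */
static double avg_ple_guesses(unsigned nr_vcpus, unsigned overcommit)
{
    double queued = (double)nr_vcpus / overcommit;
    return (queued + 1.0) / 2.0;
}
```

For 64 vcpus at 4:1 overcommit that is about 8.5 boost attempts per PLE exit versus one directed kick, and the gap grows linearly with N/M.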

>
> So TODO in my list on this is:
> 1. More analysis of performance on PLE mc.
> 2. Seeing how to implement handshake to increase performance (if PLE +
> patch combination have slight negative effect).

I can think of two options:
- from the PLE handler, don't wake up a vcpu that is sleeping because it
is waiting for a kick
- look at other sources of pause loops (I guess smp_call_function() is
the significant source) and adjust them to use the same mechanism, and
ask the host to disable PLE exiting.

This can be done incrementally later.

>
> Sorry that, I could not do more analysis on PLE (as promised last time)
> because of machine availability.
>
> I 'll do some work on this and comeback. But in the meantime, I do not
> see it as blocking for next merge window.

--
error compiling committee.c: too many arguments to function



raghavendra.kt at linux

Mar 29, 2012, 11:03 AM

Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/29/2012 03:28 PM, Avi Kivity wrote:
> On 03/28/2012 08:21 PM, Raghavendra K T wrote:
>>
>>>
>>> Looks like a good baseline on which to build the KVM implementation. We
>>> might need some handshake to prevent interference on the host side with
>>> the PLE code.
>>>
>>
>> I think I still missed some point in Avi's comment. I agree that PLE
>> may be interfering with above patches (resulting in less performance
>> advantages). but we have not seen performance degradation with the
>> patches in earlier benchmarks. [. theoretically since patch has very
>> slight advantage over PLE that atleast it knows who should run next ].
>
> The advantage grows with the vcpu counts and overcommit ratio. If you
> have N vcpus and M:1 overcommit, PLE has to guess from N/M queued vcpus
> while your patch knows who to wake up.
>

Yes. I agree.

>>
>> So TODO in my list on this is:
>> 1. More analysis of performance on PLE mc.
>> 2. Seeing how to implement handshake to increase performance (if PLE +
>> patch combination have slight negative effect).
>
> I can think of two options:

I really like the ideas below. Thanks for that!

> - from the PLE handler, don't wake up a vcpu that is sleeping because it
> is waiting for a kick

How about adding another pass at the beginning of kvm_vcpu_on_spin()
to check whether any vcpu has already been kicked? This would almost
amount to yield_to(kicked_vcpu). IMO this is also worth trying.

I will try the above ideas soon.

> - look at other sources of pause loops (I guess smp_call_function() is
> the significant source) and adjust them to use the same mechanism, and
> ask the host to disable PLE exiting.
>
> This can be done incrementally later.
>

Yes.. this can wait a bit.



raghavendra.kt at linux

Mar 30, 2012, 3:07 AM

Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On 03/29/2012 11:33 PM, Raghavendra K T wrote:
> On 03/29/2012 03:28 PM, Avi Kivity wrote:
>> On 03/28/2012 08:21 PM, Raghavendra K T wrote:

> I really like below ideas. Thanks for that!.
>
>> - from the PLE handler, don't wake up a vcpu that is sleeping because it
>> is waiting for a kick
>
> How about, adding another pass in the beginning of kvm_vcpu_on_spin()
> to check if any vcpu is already kicked. This would almost result in
> yield_to(kicked_vcpu). IMO this is also worth trying.
>
> will try above ideas soon.
>

I have a patch something like the one below in mind to try:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d3b98b1..5127668 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * else and called schedule in __vcpu_run.  Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
+	 * Priority is given to vcpu that are unhalted.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 3 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			struct task_struct *task = NULL;
 			struct pid *pid;
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && !vcpu->pv_unhalted)
+				continue;
+			else if (pass == 1 && i < last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (pass == 2 && i > last_boosted_vcpu)
 				break;
 			if (vcpu == me)
 				continue;



hpa at zytor

Mar 30, 2012, 1:26 PM

Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

What is the current status of this patchset? I haven't looked at it too
closely because I have been focused on 3.4 up until now...

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.



tglx at linutronix

Mar 30, 2012, 3:07 PM

Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

On Fri, 30 Mar 2012, H. Peter Anvin wrote:

> What is the current status of this patchset? I haven't looked at it too
> closely because I have been focused on 3.4 up until now...

The real question is whether these heuristics are the correct approach
or not.

If I look at it from the non-virtualized kernel side then this is ass
backwards. We already know that we are holding a spinlock which might
cause other (v)cpus to go into an eternal spin. The non-virtualized
kernel solves this by disabling preemption and therefore getting out
of the critical section as fast as possible.

The virtualization problem reminds me a lot of the problem which RT
kernels are observing, where non-raw spinlocks are turned into
"sleeping spinlocks" and therefore can cause throughput issues for
non-RT workloads.

Though the virtualized situation is even worse. Any preempted guest
section which holds a spinlock is prone to cause unbounded delays.

The paravirt ticketlock solution can only mitigate the problem, not
solve it. With massive overcommit there is always a way to trigger
worst-case scenarios unless you educate the scheduler to cope with
that.

So if we need to fiddle with the scheduler, and frankly that's the
only way to get a real gain (the numbers achieved by these patches are
not that impressive), then the question arises whether we should turn
the whole thing around.

I know that Peter is going to go berserk on me, but if we are running
a paravirt guest then it's simple to provide a mechanism which allows
the host (aka hypervisor) to check that in the guest just by looking
at some global state.

So if a guest exits due to an external event, it's easy to inspect the
state of that guest and avoid scheduling it away when it was
interrupted in a spinlock-held section. That guest/host shared state
needs to be modified to tell the guest to invoke an exit when the last
nested lock has been released.

Of course this needs to be time-bound, so a rogue guest cannot
monopolize the cpu forever, but that's the least worrying problem,
simply because a guest which does not get out of a spinlocked region
within a certain amount of time is borked and eligible for killing
anyway.

Thoughts ?

Thanks,

tglx




andi at firstfloor

Mar 30, 2012, 3:18 PM

Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be

On a large system under high contention, sleeping can perform
surprisingly well: pthread mutexes have a tendency to beat kernel
spinlocks there. So avoiding sleeping locks completely (which is
essentially what pv locks are) is not necessarily that good.

Your proposal is probably only a good idea on low contention
and relatively small systems.

-Andi
--
ak [at] linux -- Speaking for myself only.


tglx at linutronix

Mar 30, 2012, 4:04 PM

Post #12 of 36 (420 views)
Permalink
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On Sat, 31 Mar 2012, Andi Kleen wrote:

> > So if a guest exits due to an external event it's easy to inspect the
> > state of that guest and avoid scheduling it away when it was interrupted
> > in a spinlock held section. That guest/host shared state needs to be
>
> On a large system under high contention sleeping can perform surprisingly
> well. pthread mutexes have a tendency to beat kernel spinlocks there.
> So avoiding sleeping locks completely (that is essentially what pv
> locks are) is not necessarily that good.

Care to back that up with numbers and proper trace evidence instead of
handwaving?

I've stared at RT traces and throughput problems on _LARGE_ machines
long enough to know what I'm talking about and I can provide evidence
in a split second.

> Your proposal is probably only a good idea on low contention
> and relatively small systems.

Sigh, you have really no fcking clue what you are talking about.

On RT we observed scalability problems way before hardware was
available to expose them. So what's your point?

Put up or shut up, really!

tglx


andi at firstfloor

Mar 30, 2012, 5:08 PM

Post #13 of 36 (420 views)
Permalink
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On Sat, Mar 31, 2012 at 01:04:41AM +0200, Thomas "Kubys" Gleixner wrote:
> On Sat, 31 Mar 2012, Andi Kleen wrote:
>
> > > So if a guest exits due to an external event it's easy to inspect the
> > > state of that guest and avoid scheduling it away when it was interrupted
> > > in a spinlock held section. That guest/host shared state needs to be
> >
> > On a large system under high contention sleeping can perform surprisingly
> > well. pthread mutexes have a tendency to beat kernel spinlocks there.
> > So avoiding sleeping locks completely (that is essentially what pv
> > locks are) is not necessarily that good.
>
> Care to back that up with numbers and proper trace evidence instead of
> handwaving?

E.g. my Plumbers presentation on lock and mm scalability from last year has
some graphs that show this very clearly, plus some additional data on the
mutexes. This compares them to the glibc futex locks, which perform much
better than the kernel mutex locks on larger systems under higher contention.

Given your tone I will not supply a URL. I'm sure you can find it if you
need it.

-Andi

--
ak [at] linux -- Speaking for myself only.


raghavendra.kt at linux

Mar 30, 2012, 5:51 PM

Post #14 of 36 (419 views)
Permalink
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On 03/31/2012 01:56 AM, H. Peter Anvin wrote:
> What is the current status of this patchset? I haven't looked at it too
> closely because I have been focused on 3.4 up until now...
>

Thanks Peter,

Currently targeting the patchset for the next merge window. IMO these
patches are in good shape now. They need the "jumplabel split patch"
(from Andrew Jones) as a dependency; I would fold that patch into the
series (I got an ok from him for that), rebase, and send them ASAP.

- Raghu







vatsa at linux

Mar 30, 2012, 9:07 PM

Post #15 of 36 (407 views)
Permalink
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

* Thomas Gleixner <tglx [at] linutronix> [2012-03-31 00:07:58]:

> I know that Peter is going to go berserk on me, but if we are running
> a paravirt guest then it's simple to provide a mechanism which allows
> the host (aka hypervisor) to check that in the guest just by looking
> at some global state.
>
> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid scheduling it away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be
> modified to tell the guest to invoke an exit when the last nested
> lock has been released.

I had attempted something like that long back:

http://lkml.org/lkml/2010/6/3/4

The issue is with ticketlocks though. VCPUs could go into a spin w/o
a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
and releases the lock. VCPU1 is next eligible to take the lock. If
that vcpu is not scheduled early enough by the host, then the remaining
vcpus would keep spinning (even though the lock is technically not held
by anybody) w/o making forward progress.

In that situation, what we really need is for the guest to hint to the
host scheduler to schedule VCPU1 early (via yield_to or something similar).

The current pv-spinlock patches, however, do not track which vcpu is
spinning at the head of the ticketlock. I suppose we can consider
that optimization in the future and see how much benefit it provides (over
the plain yield/sleep the way it's done now).

Do you see any issues if we take in what we have today and address the
finer-grained optimization as next step?

- vatsa



vatsa at linux

Mar 30, 2012, 9:09 PM

Post #16 of 36 (407 views)
Permalink
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

* Srivatsa Vaddagiri <vatsa [at] linux> [2012-03-31 09:37:45]:

> The issue is with ticketlocks though. VCPUs could go into a spin w/o
> a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
> that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
> and releases the lock. VCPU1 is next eligible to take the lock. If

Sorry I meant to say "VCPU2 is next eligible ..."

> that is not scheduled early enough by host, then remaining vcpus would keep
> spinning (even though lock is technically not held by anybody) w/o making
> forward progress.
>
> In that situation, what we really need is for the guest to hint to host
> scheduler to schedule VCPU1 early (via yield_to or something similar).

s/VCPU1/VCPU2 ..

- vatsa



mingo at kernel

Mar 31, 2012, 1:11 AM

Post #17 of 36 (406 views)
Permalink
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

* Andi Kleen <andi [at] firstfloor> wrote:

> > Care to back that up with numbers and proper trace evidence
> > instead of handwaving?
>
> E.g. my Plumbers presentation on lock and mm scalability from
> last year has some graphs that show this very clearly, plus
> some additional data on the mutexes. This compares them to the
> glibc futex locks, which perform much better than the kernel
> mutex locks on larger systems under higher contention.

If you mean these draft slides:

http://www.halobates.de/plumbers-fork-locks_v2.pdf

it has very little verifiable information in it. It just
cryptically says lock hold time "microbenchmark", which might or
might not be a valid measurement.

You could have been honest and straightforward in your first
mail:

"I ran workload X on machine Y, and got results Z."

Instead you are *hindering* the discussion:

> Given your tone I will not supply a URL. [...]

If you meant the above URL then it's not the proper numbers
Thomas asked for, just some vague slides. If you meant something
else then put up or shut up.

Thanks,

Ingo


avi at redhat

Apr 1, 2012, 6:18 AM

Post #18 of 36 (399 views)
Permalink
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On 03/30/2012 01:07 PM, Raghavendra K T wrote:
> On 03/29/2012 11:33 PM, Raghavendra K T wrote:
>> On 03/29/2012 03:28 PM, Avi Kivity wrote:
>>> On 03/28/2012 08:21 PM, Raghavendra K T wrote:
>
>> I really like below ideas. Thanks for that!.
>>
>>> - from the PLE handler, don't wake up a vcpu that is sleeping
>>> because it
>>> is waiting for a kick
>>
>> How about, adding another pass in the beginning of kvm_vcpu_on_spin()
>> to check if any vcpu is already kicked. This would almost result in
>> yield_to(kicked_vcpu). IMO this is also worth trying.
>>
>> will try above ideas soon.
>>
>
> I have patch something like below in mind to try:
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d3b98b1..5127668 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
> * else and called schedule in __vcpu_run. Hopefully that
> * VCPU is holding the lock that we need and will release it.
> * We approximate round-robin by starting at the last boosted VCPU.
> + * Priority is given to vcpu that are unhalted.
> */
> - for (pass = 0; pass < 2 && !yielded; pass++) {
> + for (pass = 0; pass < 3 && !yielded; pass++) {
> kvm_for_each_vcpu(i, vcpu, kvm) {
> struct task_struct *task = NULL;
> struct pid *pid;
> - if (!pass && i < last_boosted_vcpu) {
> + if (!pass && !vcpu->pv_unhalted)
> + continue;
> + else if (pass == 1 && i < last_boosted_vcpu) {
> i = last_boosted_vcpu;
> continue;
> - } else if (pass && i > last_boosted_vcpu)
> + } else if (pass == 2 && i > last_boosted_vcpu)
> break;
> if (vcpu == me)
> continue;
>

Actually I think this is unneeded. The loop tries to find vcpus that
are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
don't match this condition.

--
error compiling committee.c: too many arguments to function



avi at redhat

Apr 1, 2012, 6:31 AM

Post #19 of 36 (395 views)
Permalink
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On 03/31/2012 01:07 AM, Thomas Gleixner wrote:
> On Fri, 30 Mar 2012, H. Peter Anvin wrote:
>
> > What is the current status of this patchset? I haven't looked at it too
> > closely because I have been focused on 3.4 up until now...
>
> The real question is whether these heuristics are the correct approach
> or not.
>
> If I look at it from the non virtualized kernel side then this is ass
> backwards. We know already that we are holding a spinlock which might
> cause other (v)cpus going into eternal spin. The non virtualized
> kernel solves this by disabling preemption and therefore getting out of
> the critical section as fast as possible,
>
> The virtualization problem reminds me a lot of the problem which RT
> kernels are observing where non raw spinlocks are turned into
> "sleeping spinlocks" and therefore can cause throughput issues for non
> RT workloads.
>
> Though the virtualized situation is even worse. Any preempted guest
> section which holds a spinlock is prone to cause unbound delays.
>
> The paravirt ticketlock solution can only mitigate the problem, but
> not solve it. With massive overcommit there is always a way to trigger
> worst case scenarios unless you are educating the scheduler to cope
> with that.
>
> So if we need to fiddle with the scheduler and frankly that's the only
> way to get a real gain (the numbers, which are achieved by these
> patches, are not that impressive) then the question arises whether we
> should turn the whole thing around.
>
> I know that Peter is going to go berserk on me, but if we are running
> a paravirt guest then it's simple to provide a mechanism which allows
> the host (aka hypervisor) to check that in the guest just by looking
> at some global state.
>
> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid scheduling it away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be
> modified to tell the guest to invoke an exit when the last nested
> lock has been released.

Interesting idea (I think it has been raised before, btw; I don't recall
by whom).

One thing about it is that it can give many false positives. Consider a
fine-grained spinlock that is being accessed by many threads. That is,
the lock is taken and released with high frequency, but there is no
contention, because each vcpu is accessing a different instance. So the
host scheduler will needlessly delay preemption of vcpus that happen to
be holding a lock, even though this gains nothing.

A second issue may happen with a lock that is taken and released with
high frequency, with a high hold percentage. The host scheduler may
always sample the guest in a held state, leading it to conclude that
it's exceeding its timeout when in fact the lock is held for a short
time only.

> Of course this needs to be time bound, so a rogue guest cannot
> monopolize the cpu forever, but that's the least worrisome
> problem, simply because a guest which does not get out of a spinlocked
> region within a certain amount of time is borked and eligible for
> killing anyway.

Hopefully not killing! Just because a guest doesn't scale well, or even
if it's deadlocked, doesn't mean it should be killed. Just preempt it.

> Thoughts ?

It's certainly interesting. Maybe a combination is worthwhile - prevent
lockholder preemption for a short period of time AND put waiters to
sleep in case that period is insufficient to release the lock.

--
error compiling committee.c: too many arguments to function



raghavendra.kt at linux

Apr 1, 2012, 6:48 AM

Post #20 of 36 (395 views)
Permalink
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On 04/01/2012 06:48 PM, Avi Kivity wrote:
> On 03/30/2012 01:07 PM, Raghavendra K T wrote:
>> On 03/29/2012 11:33 PM, Raghavendra K T wrote:
>>> On 03/29/2012 03:28 PM, Avi Kivity wrote:
>>>> On 03/28/2012 08:21 PM, Raghavendra K T wrote:
>>
>>> I really like below ideas. Thanks for that!.
>>>
>>>> - from the PLE handler, don't wake up a vcpu that is sleeping
>>>> because it
>>>> is waiting for a kick
>>>
>>> How about, adding another pass in the beginning of kvm_vcpu_on_spin()
>>> to check if any vcpu is already kicked. This would almost result in
>>> yield_to(kicked_vcpu). IMO this is also worth trying.
>>>
>>> will try above ideas soon.
>>>
>>
>> I have patch something like below in mind to try:
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index d3b98b1..5127668 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>> * else and called schedule in __vcpu_run. Hopefully that
>> * VCPU is holding the lock that we need and will release it.
>> * We approximate round-robin by starting at the last boosted VCPU.
>> + * Priority is given to vcpu that are unhalted.
>> */
>> - for (pass = 0; pass < 2 && !yielded; pass++) {
>> + for (pass = 0; pass < 3 && !yielded; pass++) {
>> kvm_for_each_vcpu(i, vcpu, kvm) {
>> struct task_struct *task = NULL;
>> struct pid *pid;
>> - if (!pass && i < last_boosted_vcpu) {
>> + if (!pass && !vcpu->pv_unhalted)
>> + continue;
>> + else if (pass == 1 && i < last_boosted_vcpu) {
>> i = last_boosted_vcpu;
>> continue;
>> - } else if (pass && i > last_boosted_vcpu)
>> + } else if (pass == 2 && i > last_boosted_vcpu)
>> break;
>> if (vcpu == me)
>> continue;
>>
>
> Actually I think this is unneeded. The loop tries to find vcpus that
> are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
> don't match this condition.
>

I almost agree. But one lingering thought:

Suppose there are 8 runnable vcpus, out of which 4 are kicked
but not running; wouldn't yielding to those 4 vcpus result in better
lock progress?

I still have a little problem getting a PLE setup here (instead of
rebasing the patches).
Once I get PLE running, if the numbers show no improvement, I
will drop this idea.



avi at redhat

Apr 1, 2012, 6:53 AM

Post #21 of 36 (396 views)
Permalink
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On 04/01/2012 04:48 PM, Raghavendra K T wrote:
>>> I have patch something like below in mind to try:
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index d3b98b1..5127668 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>> * else and called schedule in __vcpu_run. Hopefully that
>>> * VCPU is holding the lock that we need and will release it.
>>> * We approximate round-robin by starting at the last boosted
>>> VCPU.
>>> + * Priority is given to vcpu that are unhalted.
>>> */
>>> - for (pass = 0; pass < 2 && !yielded; pass++) {
>>> + for (pass = 0; pass < 3 && !yielded; pass++) {
>>> kvm_for_each_vcpu(i, vcpu, kvm) {
>>> struct task_struct *task = NULL;
>>> struct pid *pid;
>>> - if (!pass && i < last_boosted_vcpu) {
>>> + if (!pass && !vcpu->pv_unhalted)
>>> + continue;
>>> + else if (pass == 1 && i < last_boosted_vcpu) {
>>> i = last_boosted_vcpu;
>>> continue;
>>> - } else if (pass && i > last_boosted_vcpu)
>>> + } else if (pass == 2 && i > last_boosted_vcpu)
>>> break;
>>> if (vcpu == me)
>>> continue;
>>>
>>
>> Actually I think this is unneeded. The loop tries to find vcpus that
>> are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
>> don't match this condition.
>>
>
>
> I almost agree. But one lingering thought:
>
> Suppose there are 8 runnable vcpus, out of which 4 are kicked
> but not running; wouldn't yielding to those 4 vcpus result in better
> lock progress?

That's what the code does.

> I still have a little problem getting a PLE setup here (instead of
> rebasing the patches).
> Once I get PLE running, if the numbers show no improvement,
> I will drop this idea.
>

I'm interested in how PLE does vs. your patches, both with PLE enabled
and disabled.

--
error compiling committee.c: too many arguments to function



raghavendra.kt at linux

Apr 1, 2012, 6:56 AM

Post #22 of 36 (398 views)
Permalink
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On 04/01/2012 07:23 PM, Avi Kivity wrote:
> On 04/01/2012 04:48 PM, Raghavendra K T wrote:
>>>> I have patch something like below in mind to try:
>
> I'm interested in how PLE does vs. your patches, both with PLE enabled
> and disabled.
>

Sure, I will update with the experimental results.



juergen.gross at ts

Apr 1, 2012, 9:36 PM

Post #23 of 36 (392 views)
Permalink
Re: [Xen-devel] [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On 03/31/2012 12:07 AM, Thomas Gleixner wrote:
> On Fri, 30 Mar 2012, H. Peter Anvin wrote:
>
>> What is the current status of this patchset? I haven't looked at it too
>> closely because I have been focused on 3.4 up until now...
> The real question is whether these heuristics are the correct approach
> or not.
>
> If I look at it from the non virtualized kernel side then this is ass
> backwards. We know already that we are holding a spinlock which might
> cause other (v)cpus going into eternal spin. The non virtualized
> kernel solves this by disabling preemption and therefore getting out of
> the critical section as fast as possible,
>
> The virtualization problem reminds me a lot of the problem which RT
> kernels are observing where non raw spinlocks are turned into
> "sleeping spinlocks" and therefore can cause throughput issues for non
> RT workloads.
>
> Though the virtualized situation is even worse. Any preempted guest
> section which holds a spinlock is prone to cause unbound delays.
>
> The paravirt ticketlock solution can only mitigate the problem, but
> not solve it. With massive overcommit there is always a way to trigger
> worst case scenarios unless you are educating the scheduler to cope
> with that.
>
> So if we need to fiddle with the scheduler and frankly that's the only
> way to get a real gain (the numbers, which are achieved by these
> patches, are not that impressive) then the question arises whether we
> should turn the whole thing around.
>
> I know that Peter is going to go berserk on me, but if we are running
> a paravirt guest then it's simple to provide a mechanism which allows
> the host (aka hypervisor) to check that in the guest just by looking
> at some global state.
>
> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid scheduling it away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be
> modified to tell the guest to invoke an exit when the last nested
> lock has been released.
>
> Of course this needs to be time bound, so a rogue guest cannot
> monopolize the cpu forever, but that's the least worrisome
> problem, simply because a guest which does not get out of a spinlocked
> region within a certain amount of time is borked and eligible for
> killing anyway.
>
> Thoughts ?

I used this approach in 2008:

http://lists.xen.org/archives/html/xen-devel/2008-12/msg00740.html

It worked very well, but it was rejected at that time. I wouldn't mind trying
it again if there is some support from your side. :-)


Juergen

--

Juergen Gross Principal Developer Operating Systems
PDG ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions e-mail: juergen.gross [at] ts
Domagkstr. 28 Internet: ts.fujitsu.com
D-80807 Muenchen Company details: ts.fujitsu.com/imprint.html



tglx at linutronix

Apr 2, 2012, 2:26 AM

Post #24 of 36 (392 views)
Permalink
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On Sun, 1 Apr 2012, Avi Kivity wrote:
> On 03/31/2012 01:07 AM, Thomas Gleixner wrote:
> > On Fri, 30 Mar 2012, H. Peter Anvin wrote:
> >
> > > What is the current status of this patchset? I haven't looked at it too
> > > closely because I have been focused on 3.4 up until now...
> >
> > The real question is whether these heuristics are the correct approach
> > or not.
> >
> > If I look at it from the non virtualized kernel side then this is ass
> > backwards. We know already that we are holding a spinlock which might
> > cause other (v)cpus going into eternal spin. The non virtualized
> > kernel solves this by disabling preemption and therefore getting out of
> > the critical section as fast as possible,
> >
> > The virtualization problem reminds me a lot of the problem which RT
> > kernels are observing where non raw spinlocks are turned into
> > "sleeping spinlocks" and therefore can cause throughput issues for non
> > RT workloads.
> >
> > Though the virtualized situation is even worse. Any preempted guest
> > section which holds a spinlock is prone to cause unbound delays.
> >
> > The paravirt ticketlock solution can only mitigate the problem, but
> > not solve it. With massive overcommit there is always a way to trigger
> > worst case scenarios unless you are educating the scheduler to cope
> > with that.
> >
> > So if we need to fiddle with the scheduler and frankly that's the only
> > way to get a real gain (the numbers, which are achieved by these
> > patches, are not that impressive) then the question arises whether we
> > should turn the whole thing around.
> >
> > I know that Peter is going to go berserk on me, but if we are running
> > a paravirt guest then it's simple to provide a mechanism which allows
> > the host (aka hypervisor) to check that in the guest just by looking
> > at some global state.
> >
> > So if a guest exits due to an external event it's easy to inspect the
> > state of that guest and avoid scheduling it away when it was interrupted
> > in a spinlock held section. That guest/host shared state needs to be
> > modified to tell the guest to invoke an exit when the last nested
> > lock has been released.
>
> Interesting idea (I think it has been raised before btw, don't recall by
> who).

Someone posted a pointer to that old thread.

> One thing about it is that it can give many false positives. Consider a
> fine-grained spinlock that is being accessed by many threads. That is,
> the lock is taken and released with high frequency, but there is no
> contention, because each vcpu is accessing a different instance. So the
> host scheduler will needlessly delay preemption of vcpus that happen to
> be holding a lock, even though this gains nothing.

You're talking about per cpu locks, right? I can see the point there,
but per cpu stuff might be worth annotating to avoid this.

> A second issue may happen with a lock that is taken and released with
> high frequency, with a high hold percentage. The host scheduler may
> always sample the guest in a held state, leading it to conclude that
> it's exceeding its timeout when in fact the lock is held for a short
> time only.

Well, no. You can avoid that.

host                      VCPU
                          spin_lock()
                          modify_global_state()
exit
check_global_state()
mark_global_state()
reschedule vcpu
                          spin_unlock()
                          check_global_state()
                          trigger_exit()

So when an exit occurs in the locked section, the host can mark
the global state to tell the guest to issue a trap on unlock.

> > Of course this needs to be time bound, so a rogue guest cannot
> > monopolize the cpu forever, but that's the least worrisome
> > problem, simply because a guest which does not get out of a spinlocked
> > region within a certain amount of time is borked and eligible for
> > killing anyway.
>
> Hopefully not killing! Just because a guest doesn't scale well, or even
> if it's deadlocked, doesn't mean it should be killed. Just preempt it.

:)

> > Thoughts ?
>
> It's certainly interesting. Maybe a combination is worthwhile - prevent
> lockholder preemption for a short period of time AND put waiters to
> sleep in case that period is insufficient to release the lock.

Right, but as Srivatsa pointed out this still needs the ticket lock
ordering support to avoid guests being completely starved.

Thanks,

tglx


Ian.Campbell at citrix

Apr 2, 2012, 2:42 AM

Post #25 of 36 (392 views)
Permalink
Re: [Xen-devel] [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On Fri, 2012-03-30 at 23:07 +0100, Thomas Gleixner wrote:
> So if we need to fiddle with the scheduler and frankly that's the only
> way to get a real gain (the numbers, which are achieved by this
> patches, are not that impressive) then the question arises whether we
> should turn the whole thing around.

It probably doesn't materially affect your core point (which seems valid
to me), but it's worth pointing out that the numbers presented in this
thread are AFAICT mostly focused on ensuring that the impact of this
infrastructure is acceptable on native hardware, rather than on showing
the benefits for virtualized workloads.

Ian.



raghavendra.kt at linux

Apr 2, 2012, 2:51 AM

Post #26 of 36 (156 views)
Permalink
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On 04/01/2012 07:23 PM, Avi Kivity wrote:
> On 04/01/2012 04:48 PM, Raghavendra K T wrote:
>>>> I have patch something like below in mind to try:
>>>>
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index d3b98b1..5127668 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>>> * else and called schedule in __vcpu_run. Hopefully that
>>>> * VCPU is holding the lock that we need and will release it.
>>>> * We approximate round-robin by starting at the last boosted
>>>> VCPU.
>>>> + * Priority is given to vcpu that are unhalted.
>>>> */
>>>> - for (pass = 0; pass < 2 && !yielded; pass++) {
>>>> + for (pass = 0; pass < 3 && !yielded; pass++) {
>>>> kvm_for_each_vcpu(i, vcpu, kvm) {
>>>> struct task_struct *task = NULL;
>>>> struct pid *pid;
>>>> - if (!pass && i < last_boosted_vcpu) {
>>>> + if (!pass && !vcpu->pv_unhalted)
>>>> + continue;
>>>> + else if (pass == 1 && i < last_boosted_vcpu) {
>>>> i = last_boosted_vcpu;
>>>> continue;
>>>> - } else if (pass && i > last_boosted_vcpu)
>>>> + } else if (pass == 2 && i > last_boosted_vcpu)
>>>> break;
>>>> if (vcpu == me)
>>>> continue;
>>>>
>>>
>>> Actually I think this is unneeded. The loop tries to find vcpus that
>>> are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
>>> don't match this condition.
>>>

Oh! I think I misinterpreted your statement. Hmm, I got it: you meant
to remove the if (vcpu == me) condition.

I got another interesting idea (not sure whether there is some flaw in
it). Basically, I tried a similar idea (to the PLE exit handler) in
vcpu_block.

Instead of blindly scheduling, we try to yield to the vcpu that was
kicked. IMO it may solve some scalability problems and make the LHP
problem shrink further.

I think Thomas would be happy to see the result.

results:
test setup.
===========
Host: i5-2540M CPU @ 2.60GHz laptop with 4cpu w/ hyperthreading. 8GB RAM
guest: 16 vcpu 2GB RAM single guest.

Did kernbench run under guest:
x rc6 with ticketlock (current patchset) + kvm patches
  (CONFIG_PARAVIRT_SPINLOCK=y)
+ rc6 with ticketlock + kvm patches + try_yield_patch (below)
  (YIELD_THRESHOLD=256) (CONFIG_PARAVIRT_SPINLOCK=y)
* rc6 with ticketlock + kvm patches + try_yield_patch
  (YIELD_THRESHOLD=2048) (CONFIG_PARAVIRT_SPINLOCK=y)

N Min Max Median Avg Stddev
x 3 162.45 165.94 165.433 164.60767 1.8857111
+ 3 114.02 117.243 115.953 115.73867 1.6221548
Difference at 95.0% confidence
-29.6882% +/- 2.42192%
* 3 115.823 120.423 117.103 117.783 2.3741946
Difference at 95.0% confidence
-28.4462% +/- 2.9521%


The improvement is ~29% w.r.t. the current patches.

Note: vanilla rc6 (host and guest) with CONFIG_PARAVIRT_SPINLOCK=n
did not finish the kernbench run even after *1hr 45* minutes (the above
kernbench runs took 9 minutes and 6.5 minutes respectively). I did not
try to test it again.


Yes, I understand that I have to do some more testing. The immediate
TODOs for the patch are:

1) move the code to the arch/x86 directory and fill in static inlines
for other archs
2) tweak the YIELD_THRESHOLD value.

Ideas/suggestions welcome

Here is the try_yield_to patch.
---
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5127668..3fa912a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1557,12 +1557,17 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
mark_page_dirty_in_slot(kvm, memslot, gfn);
}

+#define YIELD_THRESHOLD 2048
+static void kvm_vcpu_try_yield_to(struct kvm_vcpu *me);
/*
* The vCPU has executed a HLT instruction with in-kernel mode enabled.
*/
void kvm_vcpu_block(struct kvm_vcpu *vcpu)
{
DEFINE_WAIT(wait);
+ unsigned int loop_count;
+
+ loop_count = 0;

for (;;) {
prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
@@ -1579,7 +1584,10 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
if (signal_pending(current))
break;

- schedule();
+ if (loop_count++ % YIELD_THRESHOLD)
+ schedule();
+ else
+ kvm_vcpu_try_yield_to(vcpu);
}

finish_wait(&vcpu->wq, &wait);
@@ -1593,6 +1601,39 @@ void kvm_resched(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_resched);

+static void kvm_vcpu_try_yield_to(struct kvm_vcpu *me)
+{
+
+ struct kvm *kvm = me->kvm;
+ struct kvm_vcpu *vcpu;
+ int i;
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ struct task_struct *task = NULL;
+ struct pid *pid;
+ if (!vcpu->pv_unhalted)
+ continue;
+ if (waitqueue_active(&vcpu->wq))
+ continue;
+ rcu_read_lock();
+ pid = rcu_dereference(vcpu->pid);
+ if (pid)
+ task = get_pid_task(pid, PIDTYPE_PID);
+ rcu_read_unlock();
+ if (!task)
+ continue;
+ if (task->flags & PF_VCPU) {
+ put_task_struct(task);
+ continue;
+ }
+ if (yield_to(task, 1)) {
+ put_task_struct(task);
+ break;
+ }
+ put_task_struct(task);
+ }
+}
+
void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
struct kvm *kvm = me->kvm;
---

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


raghavendra.kt at linux

Apr 2, 2012, 5:15 AM

Post #27 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On 04/02/2012 03:21 PM, Raghavendra K T wrote:
> On 04/01/2012 07:23 PM, Avi Kivity wrote:
> > On 04/01/2012 04:48 PM, Raghavendra K T wrote:
> >>>> I have patch something like below in mind to try:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5127668..3fa912a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1557,12 +1557,17 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
> mark_page_dirty_in_slot(kvm, memslot, gfn);
> }
>
> +#define YIELD_THRESHOLD 2048
> +static void kvm_vcpu_try_yield_to(struct kvm_vcpu *me);
> /*
> * The vCPU has executed a HLT instruction with in-kernel mode enabled.
> */
> void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> {
[...]
> + if (loop_count++ % YIELD_THRESHOLD)
> + schedule();
> + else
> + kvm_vcpu_try_yield_to(vcpu);
> }
>
> +static void kvm_vcpu_try_yield(struct kvm_vcpu *me)

Yes, it should be kvm_vcpu_try_yield_to. I had changed the name just
before sending; sorry.



raghavendra.kt at linux

Apr 5, 2012, 1:43 AM

Post #28 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On 04/01/2012 07:23 PM, Avi Kivity wrote:
> On 04/01/2012 04:48 PM, Raghavendra K T wrote:
>>>> I have patch something like below in mind to try:
>>>>
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index d3b98b1..5127668 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>>> * else and called schedule in __vcpu_run. Hopefully that
>>>> * VCPU is holding the lock that we need and will release it.
>>>> * We approximate round-robin by starting at the last boosted VCPU.
>>>> + * Priority is given to vcpus that are unhalted.
>>>> */
>>>> - for (pass = 0; pass < 2 && !yielded; pass++) {
>>>> + for (pass = 0; pass < 3 && !yielded; pass++) {
>>>> kvm_for_each_vcpu(i, vcpu, kvm) {
>>>> struct task_struct *task = NULL;
>>>> struct pid *pid;
>>>> - if (!pass && i < last_boosted_vcpu) {
>>>> + if (!pass && !vcpu->pv_unhalted)
>>>> + continue;
>>>> + else if (pass == 1 && i < last_boosted_vcpu) {
>>>> i = last_boosted_vcpu;
>>>> continue;
>>>> - } else if (pass && i > last_boosted_vcpu)
>>>> + } else if (pass == 2 && i > last_boosted_vcpu)
>>>> break;
>>>> if (vcpu == me)
>>>> continue;
>>>>
>>>

[...]

> I'm interested in how PLE does vs. your patches, both with PLE enabled
> and disabled.
>

Here are the results taken on a PLE machine. They seem to support all
our assumptions.
Observations from the results:

1) There is a huge benefit for the non-PLE based configuration
(base_nople vs pv_ple): around 90%.

2) The ticketlock + kvm patches go well with PLE; more benefit is seen,
not degradation (base_ple vs pv_ple).

3) The ticketlock + kvm patches make a non-PLE machine behave almost
like a PLE-enabled machine (base_ple vs pv_nople).

4) The PLE handler modification patches seem to give an advantage
(pv_ple vs pv_ple_optimized). More study is needed, probably with the
higher M/N ratio Avi pointed out.

configurations:

base_nople       = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=n - PLE
base_ple         = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=n + PLE
pv_ple           = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=y + PLE +
                   ticketlock + kvm patches
pv_nople         = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=y - PLE +
                   ticketlock + kvm patches
pv_ple_optimized = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=y + PLE +
                   optimization patch + ticketlock + kvm patches +
                   posted ple_handler modification (yield to kicked vcpu)

Machine: IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU, 32 cores
(8 online) and 4*64GB RAM.

3 guests running with 2GB RAM and 8 vcpus each.

Results:
-------
case A)
1x: 1 kernbench 2 idle
2x: 1 kernbench 1 while1 hog 1 idle
3x: 1 kernbench 2 while1 hog

Average time taken in sec for kernbench run (std). [ lower the value better ]

     base_nople          base_ple            pv_ple              pv_nople            pv_ple_optimized
1x   72.8284 (89.8757)   70.475 (85.6979)    63.5033 (72.7041)   65.7634 (77.0504)   64.3284 (73.2688)
2x   823.053 (1113.05)   110.971 (132.829)   105.099 (128.738)   139.058 (165.156)   106.268 (129.611)
3x   3244.37 (4707.61)   150.265 (184.766)   138.341 (172.69)    139.106 (163.549)   133.238 (168.388)


Percentage improvement calculation w.r.t base_nople
-------------------------------------------------

     base_ple   pv_ple    pv_nople   pv_ple_optimized

1x   3.23143    12.8042   9.70089    11.6713
2x   86.5172    87.2306   83.1046    87.0886
3x   95.3684    95.736    95.7124    95.8933

-------------------
Percentage improvement calculation w.r.t base_ple
-------------------------------------------------

     base_nople   pv_ple    pv_nople   pv_ple_optimized

1x   -3.3393      9.89244   6.68549    8.72167
2x   -641.683     5.29147   -25.3102   4.23804
3x   -2059.1      7.93531   7.42621    11.3313


case B)
all 3 guests running kernbench
Average time taken in sec for kernbench run (std). [ lower the value better ]
Note that std is calculated over 6*3 run averages from all 3 guests
given by kernbench.

base_nople            base_ple                pv_ple                  pv_nople              pv_ple_opt
2886.92 (18.289131)   204.80333 (7.1784039)   200.22517 (10.134804)   202.091 (12.249673)   201.60683 (7.881737)


Percentage improvement calculation w.r.t base_nople
-------------------------------------------------

base_ple pv_ple pv_nople pv_ple_optimized
92.9058 93.0644 93 93.0166



Percentage improvement calculation w.r.t base_ple
-------------------------------------------------

base_nople pv_ple pv_nople pv_ple_optimized
-1309.606 2.2354 1.324 1.5607

I hope the experimental results will convey the same message if somebody
else does the benchmarking.

Also, as Ian pointed out in the thread, the earlier results from Attilio
and me were meant to show that the framework is acceptable on native
hardware.

Is this convincing enough to consider it for the next merge window?

Comments/suggestions please...



avi at redhat

Apr 5, 2012, 2:01 AM

Post #29 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On 04/02/2012 12:51 PM, Raghavendra K T wrote:
> On 04/01/2012 07:23 PM, Avi Kivity wrote:
> > On 04/01/2012 04:48 PM, Raghavendra K T wrote:
> >>>> I have patch something like below in mind to try:
> >>>>
> >>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >>>> index d3b98b1..5127668 100644
> >>>> --- a/virt/kvm/kvm_main.c
> >>>> +++ b/virt/kvm/kvm_main.c
> >>>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
> >>>> * else and called schedule in __vcpu_run. Hopefully that
> >>>> * VCPU is holding the lock that we need and will release it.
> >>>> * We approximate round-robin by starting at the last boosted VCPU.
> >>>> + * Priority is given to vcpus that are unhalted.
> >>>> */
> >>>> - for (pass = 0; pass < 2 && !yielded; pass++) {
> >>>> + for (pass = 0; pass < 3 && !yielded; pass++) {
> >>>> kvm_for_each_vcpu(i, vcpu, kvm) {
> >>>> struct task_struct *task = NULL;
> >>>> struct pid *pid;
> >>>> - if (!pass && i < last_boosted_vcpu) {
> >>>> + if (!pass && !vcpu->pv_unhalted)
> >>>> + continue;
> >>>> + else if (pass == 1 && i < last_boosted_vcpu) {
> >>>> i = last_boosted_vcpu;
> >>>> continue;
> >>>> - } else if (pass && i > last_boosted_vcpu)
> >>>> + } else if (pass == 2 && i > last_boosted_vcpu)
> >>>> break;
> >>>> if (vcpu == me)
> >>>> continue;
> >>>>
> >>>
> >>> Actually I think this is unneeded. The loop tries to find vcpus that
> >>> are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
> >>> don't match this condition.
> >>>
>
> Oh! I think I misinterpreted your statement. hmm I got it. you told to
> remove if (vcpu == me) condition.

No, the entire patch is unneeded. My original comment was:

> from the PLE handler, don't wake up a vcpu that is sleeping because it
> is waiting for a kick

But the PLE handler never wakes up sleeping vcpus anyway.

There's still a conflict with PLE in that it may trigger during the spin
phase and send a random yield_to() somewhere. Maybe it's sufficient to
tune the PLE timeout to be longer than the spinlock timeout.


--
error compiling committee.c: too many arguments to function



avi at redhat

Apr 5, 2012, 2:15 AM

Post #30 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On 04/02/2012 12:26 PM, Thomas Gleixner wrote:
> > One thing about it is that it can give many false positives. Consider a
> > fine-grained spinlock that is being accessed by many threads. That is,
> > the lock is taken and released with high frequency, but there is no
> > contention, because each vcpu is accessing a different instance. So the
> > host scheduler will needlessly delay preemption of vcpus that happen to
> > be holding a lock, even though this gains nothing.
>
> You're talking about per cpu locks, right? I can see the point there,
> but per cpu stuff might be worth annotating to avoid this.

Or any lock which is simply uncontended. Say a single process is
running, the rest of the system is idle. It will take and release many
locks; but it can be preempted at any point by the hypervisor with no
performance loss.

The overhead is arming a timer twice and an extra exit per deferred
context switch. Perhaps not much given that you don't see tons of
context switches on virt workloads, at least without threaded interrupts
(or maybe interrupt threads should override this mechanism, by being
realtime threads).

> > A second issue may happen with a lock that is taken and released with
> > high frequency, with a high hold percentage. The host scheduler may
> > always sample the guest in a held state, leading it to conclude that
> > it's exceeding its timeout when in fact the lock is held for a short
> > time only.
>
> Well, no. You can avoid that.
>
> host                          VCPU
>                               spin_lock()
>                                 modify_global_state()
> exit
> check_global_state()
> mark_global_state()
> reschedule vcpu
>
>                               spin_unlock()
>                                 check_global_state()
>                                 trigger_exit()
>
> So when an exit occures in the locked section, then the host can mark
> the global state to tell the guest to issue a trap on unlock.

Right.

How does this nest? Do we trigger the exit on the outermost unlock?

--
error compiling committee.c: too many arguments to function



raghavendra.kt at linux

Apr 5, 2012, 3:40 AM

Post #31 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On 04/05/2012 02:31 PM, Avi Kivity wrote:
> On 04/02/2012 12:51 PM, Raghavendra K T wrote:
>> On 04/01/2012 07:23 PM, Avi Kivity wrote:
>>> On 04/01/2012 04:48 PM, Raghavendra K T wrote:
>>>>>> I have patch something like below in mind to try:
>>>>>>
>>>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>>>> index d3b98b1..5127668 100644
>>>>>> --- a/virt/kvm/kvm_main.c
>>>>>> +++ b/virt/kvm/kvm_main.c
>>>>>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>>>>> * else and called schedule in __vcpu_run. Hopefully that
>>>>>> * VCPU is holding the lock that we need and will release it.
>>>>>> * We approximate round-robin by starting at the last boosted VCPU.
>>>>>> + * Priority is given to vcpus that are unhalted.
>>>>>> */
>>>>>> - for (pass = 0; pass < 2 && !yielded; pass++) {
>>>>>> + for (pass = 0; pass < 3 && !yielded; pass++) {
>>>>>> kvm_for_each_vcpu(i, vcpu, kvm) {
>>>>>> struct task_struct *task = NULL;
>>>>>> struct pid *pid;
>>>>>> - if (!pass && i < last_boosted_vcpu) {
>>>>>> + if (!pass && !vcpu->pv_unhalted)
>>>>>> + continue;
>>>>>> + else if (pass == 1 && i < last_boosted_vcpu) {
>>>>>> i = last_boosted_vcpu;
>>>>>> continue;
>>>>>> - } else if (pass && i > last_boosted_vcpu)
>>>>>> + } else if (pass == 2 && i > last_boosted_vcpu)
>>>>>> break;
>>>>>> if (vcpu == me)
>>>>>> continue;
>>>>>>
>>>>>
>>>>> Actually I think this is unneeded. The loop tries to find vcpus that
>>>>> are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
>>>>> don't match this condition.
>>>>>
>>
>> Oh! I think I misinterpreted your statement. hmm I got it. you told to
>> remove if (vcpu == me) condition.
>
> No, the entire patch is unneeded. My original comment was:
>
>> from the PLE handler, don't wake up a vcpu that is sleeping because it
>> is waiting for a kick
>
> But the PLE handler never wakes up sleeping vcpus anyway.

I agree with you; it is already doing that. But my approach here is a
little different.

We have 2 classes of vcpus (one a subset of the other when we try to
do yield_to), viz:

1) runnable and kicked < (subset of) 2) just runnable

What we are trying to do here is target 1) first so that we get good
lock progress.

Here is the sequence I was talking about:

vcpu1 releases lock -> finds that vcpu2 is next candidate ->
kick hypercall -> kvm_pv_kick_cpu_op -> set kicked flag ->
vcpu->kick(vcpu2)

At this point vcpu2 is waiting to get scheduled, but the yield_to call
above can wake *anybody*.

I agree this is not serious (rather, it is overhead) when there is a
small number of vcpus, but when we have more vcpus/VMs it may not scale
well. My attempt was to fix that.

Let me know if I am completely missing something.

>
> There's still a conflict with PLE in that it may trigger during the spin
> phase and send a random yield_to() somewhere. Maybe it's sufficient to
> tune the PLE timeout to be longer than the spinlock timeout.
>

Ok... but we should also be cautious that we may halt more often even
though we are about to get the spinlock.
This needs more study.



mtosatti at redhat

Apr 10, 2012, 6:29 PM

Post #32 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On Sat, Mar 31, 2012 at 12:07:58AM +0200, Thomas Gleixner wrote:
> On Fri, 30 Mar 2012, H. Peter Anvin wrote:
>
> > What is the current status of this patchset? I haven't looked at it too
> > closely because I have been focused on 3.4 up until now...
>
> The real question is whether these heuristics are the correct approach
> or not.
>
> If I look at it from the non-virtualized kernel side then this is ass
> backwards. We already know that we are holding a spinlock which might
> cause other (v)cpus to go into an eternal spin. The non-virtualized
> kernel solves this by disabling preemption and therefore getting out of
> the critical section as fast as possible.
>
> The virtualization problem reminds me a lot of the problem which RT
> kernels are observing, where non-raw spinlocks are turned into
> "sleeping spinlocks" and therefore can cause throughput issues for
> non-RT workloads.
>
> Though the virtualized situation is even worse. Any preempted guest
> section which holds a spinlock is prone to cause unbound delays.
>
> The paravirt ticketlock solution can only mitigate the problem, but
> not solve it. With massive overcommit there is always a way to trigger
> worst case scenarios unless you are educating the scheduler to cope
> with that.
>
> So if we need to fiddle with the scheduler, and frankly that's the only
> way to get a real gain (the numbers achieved by these patches are not
> that impressive), then the question arises whether we should turn the
> whole thing around.
>
> I know that Peter is going to go berserk on me, but if we are running
> a paravirt guest then it's simple to provide a mechanism which allows
> the host (aka hypervisor) to check that in the guest just by looking
> at some global state.
>
> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be
> modified to indicate the guest to invoke an exit when the last nested
> lock has been released.

Remember that the host is scheduling other processes than vcpus of guests.

The case where a higher priority task (whatever that task is) interrupts
a vcpu which holds a spinlock should be frequent in an overcommit
scenario. Whenever that is the case, other vcpus _must_ be able to stop
spinning.

Now extrapolate that to guests with a large number of vcpus. There is no
replacement for sleep-in-hypervisor-instead-of-spin.

> Of course this needs to be time bound, so a rogue guest cannot
> monopolize the cpu forever, but that's the least worrying problem,
> simply because a guest which does not get out of a spinlocked
> region within a certain amount of time is borked and eligible for
> killing anyway.
>
> Thoughts ?
>
> Thanks,
>
> tglx


konrad.wilk at oracle

Apr 16, 2012, 8:44 AM

Post #33 of 36
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On Sat, Mar 31, 2012 at 09:37:45AM +0530, Srivatsa Vaddagiri wrote:
> * Thomas Gleixner <tglx [at] linutronix> [2012-03-31 00:07:58]:
>
> > I know that Peter is going to go berserk on me, but if we are running
> > a paravirt guest then it's simple to provide a mechanism which allows
> > the host (aka hypervisor) to check that in the guest just by looking
> > at some global state.
> >
> > So if a guest exits due to an external event it's easy to inspect the
> > state of that guest and avoid to schedule away when it was interrupted
> > in a spinlock held section. That guest/host shared state needs to be
> > modified to indicate the guest to invoke an exit when the last nested
> > lock has been released.
>
> I had attempted something like that long back:
>
> http://lkml.org/lkml/2010/6/3/4
>
> The issue is with ticketlocks though. VCPUs could go into a spin w/o
> a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
> that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
> and releases the lock. VCPU2 is next eligible to take the lock. If
> that is not scheduled early enough by host, then remaining vcpus would keep
> spinning (even though lock is technically not held by anybody) w/o making
> forward progress.
>
> In that situation, what we really need is for the guest to hint to host
> scheduler to schedule VCPU1 early (via yield_to or something similar).
>
> The current pv-spinlock patches however does not track which vcpu is
> spinning at what head of the ticketlock. I suppose we can consider
> that optimization in future and see how much benefit it provides (over
> plain yield/sleep the way its done now).

Right. I think Jeremy played around with this some time?
>
> Do you see any issues if we take in what we have today and address the
> finer-grained optimization as next step?

I think that is the proper course - these patches show
that on baremetal we don't incur performance regressions, and in
the virtualization case we benefit greatly. Since these are the basic
building blocks of a kernel, taking it slow and just adding
this set of patches for v3.5 is a good idea - and then building on top
of that for further refinement.

>
> - vatsa


Ian.Campbell at citrix

Apr 16, 2012, 9:36 AM

Post #34 of 36
Re: [Xen-devel] [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On Mon, 2012-04-16 at 16:44 +0100, Konrad Rzeszutek Wilk wrote:
> On Sat, Mar 31, 2012 at 09:37:45AM +0530, Srivatsa Vaddagiri wrote:
> > * Thomas Gleixner <tglx [at] linutronix> [2012-03-31 00:07:58]:
> >
> > > I know that Peter is going to go berserk on me, but if we are running
> > > a paravirt guest then it's simple to provide a mechanism which allows
> > > the host (aka hypervisor) to check that in the guest just by looking
> > > at some global state.
> > >
> > > So if a guest exits due to an external event it's easy to inspect the
> > > state of that guest and avoid to schedule away when it was interrupted
> > > in a spinlock held section. That guest/host shared state needs to be
> > > modified to indicate the guest to invoke an exit when the last nested
> > > lock has been released.
> >
> > I had attempted something like that long back:
> >
> > http://lkml.org/lkml/2010/6/3/4
> >
> > The issue is with ticketlocks though. VCPUs could go into a spin w/o
> > a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
> > that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
> > and releases the lock. VCPU2 is next eligible to take the lock. If
> > that is not scheduled early enough by host, then remaining vcpus would keep
> > spinning (even though lock is technically not held by anybody) w/o making
> > forward progress.
> >
> > In that situation, what we really need is for the guest to hint to host
> > scheduler to schedule VCPU1 early (via yield_to or something similar).
> >
> > The current pv-spinlock patches however does not track which vcpu is
> > spinning at what head of the ticketlock. I suppose we can consider
> > that optimization in future and see how much benefit it provides (over
> > plain yield/sleep the way its done now).
>
> Right. I think Jeremy played around with this some time?

5/11 "xen/pvticketlock: Xen implementation for PV ticket locks" tracks
which vcpus are waiting for a lock in "cpumask_t waiting_cpus" and
tracks which lock each is waiting for in per-cpu "lock_waiting". This is
used in xen_unlock_kick to kick the right CPU. There's a loop over only
the waiting cpus to figure out who to kick.

> >
> > Do you see any issues if we take in what we have today and address the
> > finer-grained optimization as next step?
>
> I think that is the proper course - these patches show
> that on baremetal we don't incur performance regressions and in
> virtualization case we benefit greatly. Since these are the basic
> building blocks of a kernel - taking it slow and just adding
> this set of patches for v3.5 is a good idea - and then building on top
> of that for further refinement.
>
> >
> > - vatsa
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel [at] lists
> http://lists.xen.org/xen-devel




jeremy at goop

Apr 16, 2012, 9:42 AM

Post #35 of 36
Re: [Xen-devel] [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

On 04/16/2012 09:36 AM, Ian Campbell wrote:
> On Mon, 2012-04-16 at 16:44 +0100, Konrad Rzeszutek Wilk wrote:
>> On Sat, Mar 31, 2012 at 09:37:45AM +0530, Srivatsa Vaddagiri wrote:
>>> * Thomas Gleixner <tglx [at] linutronix> [2012-03-31 00:07:58]:
>>>
>>>> I know that Peter is going to go berserk on me, but if we are running
>>>> a paravirt guest then it's simple to provide a mechanism which allows
>>>> the host (aka hypervisor) to check that in the guest just by looking
>>>> at some global state.
>>>>
>>>> So if a guest exits due to an external event it's easy to inspect the
>>>> state of that guest and avoid to schedule away when it was interrupted
>>>> in a spinlock held section. That guest/host shared state needs to be
>>>> modified to indicate the guest to invoke an exit when the last nested
>>>> lock has been released.
>>> I had attempted something like that long back:
>>>
>>> http://lkml.org/lkml/2010/6/3/4
>>>
>>> The issue is with ticketlocks though. VCPUs could go into a spin w/o
>>> a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
>>> that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
>>> and releases the lock. VCPU2 is next eligible to take the lock. If
>>> that is not scheduled early enough by host, then remaining vcpus would keep
>>> spinning (even though lock is technically not held by anybody) w/o making
>>> forward progress.
>>>
>>> In that situation, what we really need is for the guest to hint to host
>>> scheduler to schedule VCPU1 early (via yield_to or something similar).
>>>
>>> The current pv-spinlock patches however does not track which vcpu is
>>> spinning at what head of the ticketlock. I suppose we can consider
>>> that optimization in future and see how much benefit it provides (over
>>> plain yield/sleep the way its done now).
>> Right. I think Jeremy played around with this some time?
> 5/11 "xen/pvticketlock: Xen implementation for PV ticket locks" tracks
> which vcpus are waiting for a lock in "cpumask_t waiting_cpus" and
> tracks which lock each is waiting for in per-cpu "lock_waiting". This is
> used in xen_unlock_kick to kick the right CPU. There's a loop over only
> the waiting cpus to figure out who to kick.

Yes, and AFAIK the KVM pv-ticketlock patches do the same thing. If a
(V)CPU is asleep, then sending it a kick is pretty much equivalent to a
yield to (not precisely, but it should get scheduled soon enough, and it
won't be competing with a pile of VCPUs with no useful work to do).

J


vatsa at linux

Apr 16, 2012, 7:54 PM

Post #36 of 36
Re: [Xen-devel] [PATCH RFC V6 0/11] Paravirtualized ticketlocks [In reply to]

* Ian Campbell <Ian.Campbell [at] citrix> [2012-04-16 17:36:35]:

> > > The current pv-spinlock patches however does not track which vcpu is
> > > spinning at what head of the ticketlock. I suppose we can consider
> > > that optimization in future and see how much benefit it provides (over
> > > plain yield/sleep the way its done now).
> >
> > Right. I think Jeremy played around with this some time?
>
> 5/11 "xen/pvticketlock: Xen implementation for PV ticket locks" tracks
> which vcpus are waiting for a lock in "cpumask_t waiting_cpus" and
> tracks which lock each is waiting for in per-cpu "lock_waiting". This is
> used in xen_unlock_kick to kick the right CPU. There's a loop over only
> the waiting cpus to figure out who to kick.

Yes, sorry, that's right. We do track who is waiting on what lock at
what position. This can be used to pass directed yield hints to the host
scheduler (in a future optimization patch). What we don't track is the
vcpu owning a lock, which would have allowed other spinning vcpus to do
a directed yield to a vcpu preempted while holding a lock. OTOH that may
be unnecessary if we put in support for deferring preemption of vcpus
that are holding locks.

- vatsa

