
Mailing List Archive: Xen: API

VCPUs-at-startup and VCPUs-max with NUMA node affinity

 

 



James.Bulpin at eu

May 29, 2012, 5:00 AM

Post #1 of 5
VCPUs-at-startup and VCPUs-max with NUMA node affinity

I'm thinking about the interaction of xapi's vCPU management and the future Xen automatic NUMA placement (http://blog.xen.org/index.php/2012/05/16/numa-and-xen-part-ii-scheduling-and-placement/). If a VM has an equal or smaller number of vCPUs than a NUMA node has pCPUs then it makes sense for that VM to have NUMA node affinity. But then what happens if vCPUs are hotplugged to the VM and it now has more vCPUs than the node has pCPUs? I can see several options here:

1. The node is over-provisioned in that the VM's vCPUs contend with each other for the pCPUs - not good

2. The CPU affinity is dropped allowing vCPUs to run on any node - the memory is still on the original node so now we've got a poor placement for vCPUs that happen to end up running on other nodes. This also leads to additional interconnect traffic and possible cache line ping-pong.

3. The vCPUs that cannot fit on the node are given no affinity but those that can retain their node affinity - leads to some vCPUs being better performing than others due to memory (non-)locality. This also leads to some additional interconnect traffic and possible cache line ping-pong.

4. We never let this happen because we only allow node affinity to be set for the maximum vCPU count a VM may have during this boot (VCPUs-max; options 1 to 3 above use VCPUs-at-startup to decide whether to use node affinity).

I'm tempted by #4 because it avoids having to make difficult and workload-dependent decisions when changing vCPU counts. My guess is that many users will have VMs with VCPUs-at-startup==VCPUs-max so it becomes a non-issue anyway. My only real concern is users who regularly run VMs with a small VCPUs-at-startup but with VCPUs-max set to the number of pCPUs in the box, i.e. allowing them to hotplug up to the full resource of the box; under #4 such VMs would never be given node affinity.
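To make option 4 concrete, here is a minimal sketch of the decision it implies. It is not xapi code: the function name `choose_node_affinity`, the `pcpus_per_node` mapping and the tie-break rule are all hypothetical, chosen only to illustrate comparing VCPUs-max (rather than VCPUs-at-startup) against the size of a node.

```python
from typing import Dict, Optional


def choose_node_affinity(vcpus_max: int,
                         pcpus_per_node: Dict[int, int]) -> Optional[int]:
    """Return a NUMA node id to pin the VM to, or None for no node affinity.

    Hypothetical helper illustrating option 4: the check uses VCPUs-max,
    so later vCPU hotplug can never outgrow the chosen node.
    """
    # Nodes with at least as many pCPUs as the VM could ever have vCPUs.
    candidates = [node for node, pcpus in pcpus_per_node.items()
                  if vcpus_max <= pcpus]
    if not candidates:
        # The VM may grow beyond any single node: give it no node affinity.
        return None
    # Assumed tie-break: the smallest node that fits, lowest id among equals.
    return min(candidates, key=lambda node: (pcpus_per_node[node], node))


# A two-node box with 8 pCPUs per node (made-up numbers).
box = {0: 8, 1: 8}
print(choose_node_affinity(4, box))   # VCPUs-max fits within one node
print(choose_node_affinity(16, box))  # VCPUs-max equals the whole box
```

The second call is the case of concern above: a VM that starts small but has VCPUs-max equal to the whole box gets no affinity, whatever its VCPUs-at-startup.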

And a related question: when xapi/xenopsd builds a domain does it have to tell Xen about VCPUs-max or just the number of vCPUs required right now?

Any thoughts?

Thanks,
James


_______________________________________________
Xen-api mailing list
Xen-api [at] lists
http://lists.xen.org/cgi-bin/mailman/listinfo/xen-api


Dave.Scott at eu

May 30, 2012, 2:21 AM

Post #2 of 5
Re: VCPUs-at-startup and VCPUs-max with NUMA node affinity

Hi,

James wrote:
> I'm thinking about the interaction of xapi's vCPU management and the
> future Xen automatic NUMA placement
> (http://blog.xen.org/index.php/2012/05/16/numa-and-xen-part-ii-scheduling-and-placement/).
> If a VM has an equal or smaller number of
> vCPUs than a NUMA node has pCPUs then it makes sense for that VM to
> have NUMA node affinity. But then what happens if vCPUs are hotplugged
> to the VM and it now has more vCPUs than the node has pCPUs? I can see
> several options here:
>
> 1. The node is over-provisioned in that the VM's vCPUs contend with
> each other for the pCPUs - not good

I agree, this doesn't sound good to me either.

> 2. The CPU affinity is dropped allowing vCPUs to run on any node -
> the memory is still on the original node so now we've got a poor
> placement for vCPUs that happen to end up running on other nodes. This
> also leads to additional interconnect traffic and possible cache line
> ping-pong.

This also sounds pretty bad -- it would have been better to stripe the memory across all the banks in the first place!

> 3. The vCPUs that cannot fit on the node are given no affinity but
> those that can retain their node affinity - leads to some vCPUs being
> better performing than others due to memory (non-)locality. This also
> leads to some additional interconnect traffic and possible cache line
> ping-pong.
>
> 4. We never let this happen because we only allow node affinity to be
> set for the maximum vCPU count a VM may have during this boot (VCPUs-
> max; options 1 to 3 above use VCPUs-at-startup to decide whether to use
> node affinity).
>
> I'm tempted by #4 because it avoids having to make difficult and
> workload dependent decisions when changing vCPU counts. My guess is
> that many users will have VMs with VCPUs-at-startup==VCPUs-max so it
> becomes a non-issue anyway.

I agree, this looks like the best solution to me. Also, since we only support vCPU hotplug for PV guests, all HVM guests implicitly have VCPUs-at-startup=VCPUs-max, so that's definitely a fairly common scenario.

> My only real concern is users who
> regularly run VMs with small VCPUs-at-startup but with VCPUs-max being
> the number of pCPUs in the box, i.e. allowing them to hotplug up to the
> full resource of the box.
>
> And a related question: when xapi/xenopsd builds a domain does it have
> to tell Xen about VCPUs-max or just the number of vCPUs required right
> now?

IIRC the domain builder needs to know the VCPUs-max. VCPUs-at-startup is implemented by a protocol over xenstore where there's a directory:

cpu = ""
 0 = ""
  availability = "online"
 1 = ""
  availability = "online"

This tells the PV kernel whether it should disable/hot-unplug certain vCPUs. I'm not sure, but I imagine the guest receives the xenstore watch event, deregisters the vCPU with its scheduler and then issues a hypercall telling Xen to stop scheduling the vCPU too. It certainly has to be a co-operative thing, since if Xen just stopped scheduling a vCPU that would probably have some bad effects on the guest :) It's slightly odd that the protocol allows per-vCPU control, when I'm not convinced that you can meaningfully tell them apart.
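As a rough sketch of how VCPUs-max and VCPUs-at-startup could map onto that directory, the snippet below builds the set of keys a toolstack might write. It only constructs a dictionary and does not talk to xenstore. The helper name `cpu_availability_keys` is hypothetical, the paths are assumed to be relative to the domain's xenstore directory, and the "offline" value is an assumption (only "online" appears in the listing above).

```python
from typing import Dict


def cpu_availability_keys(vcpus_max: int, vcpus_at_startup: int) -> Dict[str, str]:
    """Build cpu/<n>/availability entries for one domain.

    Hypothetical helper: one entry per possible vCPU (VCPUs-max in total),
    with the first VCPUs-at-startup marked "online" and the rest "offline".
    """
    if not 0 < vcpus_at_startup <= vcpus_max:
        raise ValueError("need 0 < VCPUs-at-startup <= VCPUs-max")
    keys = {}
    for n in range(vcpus_max):
        # "offline" is assumed here; the thread only shows "online".
        state = "online" if n < vcpus_at_startup else "offline"
        keys[f"cpu/{n}/availability"] = state
    return keys


# A VM built with VCPUs-max=4 that starts with 2 vCPUs plugged in.
for path, value in cpu_availability_keys(4, 2).items():
    print(f'{path} = "{value}"')
```

Hot-plugging a vCPU would then amount to flipping one entry to "online" and relying on the guest to react to the watch event, as described above.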

Cheers,
Dave




James.Bulpin at eu

May 30, 2012, 2:56 AM

Post #3 of 5
Re: VCPUs-at-startup and VCPUs-max with NUMA node affinity

Thanks Dave.

Dave wrote:
> IIRC the domain builder needs to know the VCPUs-max. VCPUs-at-startup
> is implemented by a protocol over xenstore where there's a directory:
>
> cpu = ""
>  0 = ""
>   availability = "online"
>  1 = ""
>   availability = "online"
>
> Which tells the PV kernel that it should disable/hotunplug (or not)
> certain vCPUs. I'm not sure but I imagine the guest receives the
> xenstore watch event, deregisters the vCPU with its scheduler and then
> issues a hypercall telling xen to stop scheduling the vCPU too. It
> certainly has to be a co-operative thing, since if xen just stopped
> scheduling a vCPU that would probably have some bad effects on the
> guest :) It's slightly odd that the protocol allows per-vCPU control,
> when I'm not convinced that you can meaningfully tell them apart.

Presumably the in-guest CPU enumeration matches the enumeration in xenstore, so it does know which is which. Although xapi currently doesn't allow it, I believe independent hot plug/unplug of vCPUs works in Xen and would be useful where topology is exposed to the guest. For example, if a guest has a unity-pinned set of vCPUs == pCPUs and the user wishes to unplug a vCPU on each socket, that may mean unplugging vCPU 3 (on socket 0) and vCPU 7 (on socket 1). I think this comes in xapi NUMA support phase 2, which reminds me: my original comments were in the context of the current xapi behaviour of not exposing topology to the guest.
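The socket example can be sketched as follows. This is purely illustrative: the `vcpu_to_socket` mapping, the helper name `vcpus_to_unplug` and the policy of picking the highest-numbered vCPU on each socket are assumptions, and the 8-vCPU, 2-socket layout is made up to match the vCPU 3 / vCPU 7 case.

```python
from typing import Dict, List


def vcpus_to_unplug(vcpu_to_socket: Dict[int, int]) -> List[int]:
    """Pick one vCPU per socket to unplug (hypothetical policy).

    Assumes unity pinning, so a vCPU's socket is the socket of the pCPU
    with the same number. Chooses the highest-numbered vCPU on each socket.
    """
    highest: Dict[int, int] = {}
    for vcpu, socket in vcpu_to_socket.items():
        if socket not in highest or vcpu > highest[socket]:
            highest[socket] = vcpu
    return sorted(highest.values())


# Assumed layout: vCPUs 0-3 on socket 0, vCPUs 4-7 on socket 1.
topology = {vcpu: vcpu // 4 for vcpu in range(8)}
print(vcpus_to_unplug(topology))  # [3, 7]
```

Without per-vCPU control in the protocol, a request to "go from 8 to 6 vCPUs" could not express which two should go, which is why the enumeration matters once topology is visible to the guest.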

Cheers,
James



anil at recoil

May 30, 2012, 4:13 AM

Post #4 of 5
Re: VCPUs-at-startup and VCPUs-max with NUMA node affinity

On 29 May 2012, at 13:00, James Bulpin wrote:

> 2. The CPU affinity is dropped allowing vCPUs to run on any node - the memory is still on the original node so now we've got a poor placement for vCPUs that happen to end up running on other nodes. This also leads to additional interconnect traffic and possible cache line ping-pong.

Is there a memory-swap operation available to exchange pages from one NUMA domain for pages from another? I'm thinking of a scenario where CPU hotplugs have led to allocated memory being on the wrong NUMA domain entirely. Is the only way for the guest to resolve this to live migrate back to localhost so that it goes through a suspend/resume cycle?

Right now we see performance like this all the time (on non-NUMA Xen) since memory is usually allocated from a single NUMA domain; e.g. on a 48-core Magny-cours, notice unix domain socket latency grows worse as it spreads away from vCPU 0 (which also happens to be on NUMA domain 0); http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/details/tmpwlnFNM.html

-anil


James.Bulpin at eu

May 30, 2012, 5:39 AM

Post #5 of 5
Re: VCPUs-at-startup and VCPUs-max with NUMA node affinity

Anil wrote:
> Is there a memory-swap operation available to exchange pages from one
> NUMA domain for pages from another? I'm thinking of a scenario where
> CPU hotplugs have led to allocated memory being on the wrong NUMA
> domain entirely. Is the only way for the guest to resolve this to live
> migrate back to localhost so that it goes through a suspend/resume
> cycle?

Whilst a localhost migrate would do the job, it needs enough spare memory on the target node to do it. Dario Faggioli over on xen-devel is working on memory migration, primarily for rebalancing nodes, but it would apply here too. Using VCPUs-max to do placement means that vCPU hotplugging would all be within a node anyway, so this shouldn't be a problem.

> Right now we see performance like this all the time (on non-NUMA Xen)
> since memory is usually allocated from a single NUMA domain; e.g. on a
> 48-core Magny-cours, notice unix domain socket latency grows worse as
> it spreads away from vCPU 0 (which also happens to be on NUMA domain
> 0); http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/details/tmpwlnFNM.html

By non-NUMA I assume you mean numa=off, as was the default before 4.0 (or thereabouts)? I think since then memory is striped so everybody should suffer equally.
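A back-of-the-envelope sketch of "suffer equally": under a deliberately crude model where a vCPU touches every page of the VM equally often, the share of its accesses that are node-local is just the share of the VM's pages on its own node. The helper `local_fraction`, the four-node layout and the page counts are all made up for illustration and say nothing about how Xen actually allocates.

```python
from typing import List


def local_fraction(vcpu_node: int, pages_per_node: List[int]) -> float:
    """Fraction of a VM's pages that are local to the node a vCPU runs on.

    Crude model (an assumption): every page is accessed equally often.
    """
    total = sum(pages_per_node)
    if total == 0:
        raise ValueError("VM has no pages")
    return pages_per_node[vcpu_node] / total


single_node = [1024, 0, 0, 0]    # all memory allocated from node 0
striped = [256, 256, 256, 256]   # memory striped across four nodes

for node in range(4):
    print(node, local_fraction(node, single_node), local_fraction(node, striped))
```

With single-node allocation, only vCPUs on node 0 see local memory and everything else is fully remote; with striping, every vCPU sees the same quarter of its accesses as local.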

Cheers,
James


