Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Xen: Devel

[PATCH] Boot PV guests with more than 128GB (v3) for v3.7.

 

 

Xen devel RSS feed   Index | Next | Previous | View Threaded


konrad.wilk at oracle

Aug 16, 2012, 9:03 AM

Post #1 of 2 (59 views)
Permalink
[PATCH] Boot PV guests with more than 128GB (v3) for v3.7.

Since (v2): http://lists.xen.org/archives/html/xen-devel/2012-07/msg01864.html
- fixed a bug if guest booted with non-PMD aligned size (say, 899MB).
- fixed smack warnings
- moved a memset(xen_start_info->mfn_list, 0xff,.. ) from one patch to another.
Since v1: [http://lists.xen.org/archives/html/xen-devel/2012-07/msg01561.html]
- added more comments, and #ifdefs
- squashed The L4 and L4, L3, and L2 recycle patches together
- Added Acked-by's

These patches are quite baked by now. If you already looked at this before
I would just skip over it and just look at the last patch.

The explanation of these patches is exactly what v1 had:

The details of this problem are nicely explained in:

[PATCH 4/6] xen/p2m: Add logic to revector a P2M tree to use __va
[PATCH 5/6] xen/mmu: Copy and revector the P2M tree.
[PATCH 6/6] xen/mmu: Remove from __ka space PMD entries for

and the supporting patches are just nice optimizations. Pasting in
what those patches mentioned:


During bootup Xen supplies us with a P2M array. It sticks
it right after the ramdisk, as can be seen with a 128GB PV guest:

(certain parts removed for clarity):
xc_dom_build_image: called
xc_dom_alloc_segment: kernel : 0xffffffff81000000 -> 0xffffffff81e43000
(pfn 0x1000 + 0xe43 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1000+0xe43 at 0x7f097d8bf000
xc_dom_alloc_segment: ramdisk : 0xffffffff81e43000 -> 0xffffffff925c7000
(pfn 0x1e43 + 0x10784 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1e43+0x10784 at 0x7f0952dd2000
xc_dom_alloc_segment: phys2mach : 0xffffffff925c7000 -> 0xffffffffa25c7000
(pfn 0x125c7 + 0x10000 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x125c7+0x10000 at 0x7f0942dd2000
xc_dom_alloc_page : start info : 0xffffffffa25c7000 (pfn 0x225c7)
xc_dom_alloc_page : xenstore : 0xffffffffa25c8000 (pfn 0x225c8)
xc_dom_alloc_page : console : 0xffffffffa25c9000 (pfn 0x225c9)
nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 ->
0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 ->
0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 ->
0xffffffffbfffffff, 1 table(s)
nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 ->
0xffffffffa27fffff, 276 table(s)
xc_dom_alloc_segment: page tables : 0xffffffffa25ca000 -> 0xffffffffa26e1000
(pfn 0x225ca + 0x117 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x225ca+0x117 at 0x7f097d7a8000
xc_dom_alloc_page : boot stack : 0xffffffffa26e1000 (pfn 0x226e1)
xc_dom_build_image : virt_alloc_end : 0xffffffffa26e2000
xc_dom_build_image : virt_pgtab_end : 0xffffffffa2800000

So the physical memory and virtual (using __START_KERNEL_map addresses)
layout looks as so:

phys __ka
/------------\ /-------------------\
| 0 | empty | 0xffffffff80000000|
| .. | | .. |
| 16MB | <= kernel starts | 0xffffffff81000000|
| .. | | |
| 30MB | <= kernel ends => | 0xffffffff81e43000|
| .. | & ramdisk starts | .. |
| 293MB | <= ramdisk ends=> | 0xffffffff925c7000|
| .. | & P2M starts | .. |
| .. | | .. |
| 549MB | <= P2M ends => | 0xffffffffa25c7000|
| .. | start_info | 0xffffffffa25c7000|
| .. | xenstore | 0xffffffffa25c8000|
| .. | cosole | 0xffffffffa25c9000|
| 549MB | <= page tables => | 0xffffffffa25ca000|
| .. | | |
| 550MB | <= PGT end => | 0xffffffffa26e1000|
| .. | boot stack | |
\------------/ \-------------------/

As can be seen, the ramdisk, P2M and pagetables are taking
a bit of __ka addresses space. Which is a problem since the
MODULES_VADDR starts at 0xffffffffa0000000 - and P2M sits
right in there! This results during bootup with the inability to
load modules, with this error:

------------[ cut here ]------------
WARNING: at /home/konrad/ssd/linux/mm/vmalloc.c:106
vmap_page_range_noflush+0x2d9/0x370()
Call Trace:
[<ffffffff810719fa>] warn_slowpath_common+0x7a/0xb0
[<ffffffff81030279>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[<ffffffff81071a45>] warn_slowpath_null+0x15/0x20
[<ffffffff81130b89>] vmap_page_range_noflush+0x2d9/0x370
[<ffffffff81130c4d>] map_vm_area+0x2d/0x50
[<ffffffff811326d0>] __vmalloc_node_range+0x160/0x250
[<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
[<ffffffff810c6186>] ? load_module+0x66/0x19c0
[<ffffffff8105cadc>] module_alloc+0x5c/0x60
[<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
[<ffffffff810c5369>] module_alloc_update_bounds+0x19/0x80
[<ffffffff810c70c3>] load_module+0xfa3/0x19c0
[<ffffffff812491f6>] ? security_file_permission+0x86/0x90
[<ffffffff810c7b3a>] sys_init_module+0x5a/0x220
[<ffffffff815ce339>] system_call_fastpath+0x16/0x1b
---[ end trace fd8f7704fdea0291 ]---
vmalloc: allocation failure, allocated 16384 of 20480 bytes
modprobe: page allocation failure: order:0, mode:0xd2

Since the __va and __ka are 1:1 up to MODULES_VADDR and
cleanup_highmap rids __ka of the ramdisk mapping, what
we want to do is similar - get rid of the P2M in the __ka
address space. There are two ways of fixing this:

1) All P2M lookups instead of using the __ka address would
use the __va address. This means we can safely erase from
__ka space the PMD pointers that point to the PFNs for
P2M array and be OK.
2). Allocate a new array, copy the existing P2M into it,
revector the P2M tree to use that, and return the old
P2M to the memory allocate. This has the advantage that
it sets the stage for using XEN_ELF_NOTE_INIT_P2M
feature. That feature allows us to set the exact virtual
address space we want for the P2M - and allows us to
boot as initial domain on large machines.

So we pick option 2).

This patch only lays the groundwork in the P2M code. The patch
that modifies the MMU is called "xen/mmu: Copy and revector the P2M tree."

-- xen/mmu: Copy and revector the P2M tree:

The 'xen_revector_p2m_tree()' function allocates a new P2M tree
copies the contents of the old one in it, and returns the new one.

At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).

We have revectored the P2M tree (and the one for save/restore as well)
to use new shiny __va address to new MFNs. The xen_start_info
has been taken care of already in 'xen_setup_kernel_pagetable()' and
xen_start_info->shared_info in 'xen_setup_shared_info()', so
we are free to roam and delete PMD entries - which is exactly what
we are going to do. We rip out the __ka for the old P2M array.

-- xen/mmu: Remove from __ka space PMD entries for

At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).

The xen_remove_p2m_tree and code around has ripped out the __ka for
the old P2M array.

Here we continue on doing it to where the Xen page-tables were.
It is safe to do it, as the page-tables are addressed using __va.
For good measure we delete anything that is within MODULES_VADDR
and up to the end of the PMD.

At this point the __ka only contains PMD entries for the start
of the kernel up to __brk.


_______________________________________________
Xen-devel mailing list
Xen-devel [at] lists
http://lists.xen.org/xen-devel


konrad.wilk at oracle

Aug 17, 2012, 10:39 AM

Post #2 of 2 (55 views)
Permalink
Re: [PATCH] Boot PV guests with more than 128GB (v3) for v3.7. [In reply to]

On Thu, Aug 16, 2012 at 12:03:18PM -0400, Konrad Rzeszutek Wilk wrote:
> Since (v2): http://lists.xen.org/archives/html/xen-devel/2012-07/msg01864.html
> - fixed a bug if guest booted with non-PMD aligned size (say, 899MB).
> - fixed smack warnings
> - moved a memset(xen_start_info->mfn_list, 0xff,.. ) from one patch to another.

And two more bug-fixes:

From f042050664c97a365e98daf5783f682d734e35f8 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk [at] oracle>
Date: Thu, 16 Aug 2012 16:38:55 -0400
Subject: [PATCH 1/2] xen/p2m: When revectoring deal with holes in the P2M
array.

When we free the PFNs and then subsequently populate them back
during bootup:

Freeing 20000-20200 pfn range: 512 pages freed
1-1 mapping on 20000->20200
Freeing 40000-40200 pfn range: 512 pages freed
1-1 mapping on 40000->40200
Freeing bad80-badf4 pfn range: 116 pages freed
1-1 mapping on bad80->badf4
Freeing badf6-bae7f pfn range: 137 pages freed
1-1 mapping on badf6->bae7f
Freeing bb000-100000 pfn range: 282624 pages freed
1-1 mapping on bb000->100000
Released 283999 pages of unused memory
Set 283999 page(s) to 1-1 mapping
Populating 1acb8a-1f20e9 pfn range: 283999 pages added

We end up having the P2M array (that is the one that was
grafted on the P2M tree) filled with IDENTITY_FRAME or
INVALID_P2M_ENTRY) entries. The patch titled

"xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID."
recycles said slots and replaces the P2M tree leaf's with
&mfn_list[xx] with p2m_identity or p2m_missing.

And re-uses the P2M array sections for other P2M tree leaf's.
For the above mentioned bootup excerpt, the PFNs at
0x20000->0x20200 are going to be IDENTITY based:

P2M[0][256][0] -> P2M[0][257][0] get turned in IDENTITY_FRAME.

We can re-use that and replace P2M[0][256] to point to p2m_identity.
The "old" page (the grafted P2M array provided by Xen) that was at
P2M[0][256] gets put somewhere else. Specifically at P2M[6][358],
b/c when we populate back:

Populating 1acb8a-1f20e9 pfn range: 283999 pages added

we fill P2M[6][358][0] (and P2M[6][358], P2M[6][359], ...) with
the new MFNs.

That is all OK, except when we revector we assume that the PFN
count would be the same in the grafted P2M array and in the
newly allocated. Since that is no longer the case, as we have
holes in the P2M that point to p2m_missing or p2m_identity we
have to take that into account.

[v2: Check for overflow]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk [at] oracle>
---
arch/x86/xen/p2m.c | 14 +++++++++++---
1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index bbfd085..3b5bd7e 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -401,6 +401,7 @@ unsigned long __init xen_revector_p2m_tree(void)
unsigned long va_start;
unsigned long va_end;
unsigned long pfn;
+ unsigned long pfn_free = 0;
unsigned long *mfn_list = NULL;
unsigned long size;

@@ -443,15 +444,22 @@ unsigned long __init xen_revector_p2m_tree(void)
if ((unsigned long)mid_p == INVALID_P2M_ENTRY)
continue;

+ if ((pfn_free + P2M_PER_PAGE) * PAGE_SIZE > size) {
+ WARN(1, "Only allocated for %ld pages, but we want %ld!\n",
+ size, pfn_free + P2M_PER_PAGE);
+ return 0;
+ }
/* The old va. Rebase it on mfn_list */
if (mid_p >= (unsigned long *)va_start && mid_p <= (unsigned long *)va_end) {
unsigned long *new;

- new = &mfn_list[pfn];
+ new = &mfn_list[pfn_free];

copy_page(new, mid_p);
- p2m_top[topidx][mididx] = &mfn_list[pfn];
- p2m_top_mfn_p[topidx][mididx] = virt_to_mfn(&mfn_list[pfn]);
+ p2m_top[topidx][mididx] = &mfn_list[pfn_free];
+ p2m_top_mfn_p[topidx][mididx] = virt_to_mfn(&mfn_list[pfn_free]);
+
+ pfn_free += P2M_PER_PAGE;

}
/* This should be the leafs allocated for identity from _brk. */
--
1.7.7.6



From 60a9b396456a2990bb3490671ca832d36e8dd6aa Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk [at] oracle>
Date: Fri, 17 Aug 2012 09:35:31 -0400
Subject: [PATCH 2/2] xen/mmu: If the revector fails, don't attempt to
revector anything else.

If the P2M revectoring would fail, we would try to continue on by
cleaning the PMD for L1 (PTE) page-tables. The xen_cleanhighmap
is greedy and erases the PMD on both boundaries. Since the P2M
array can share the PMD, we would wipe out part of the __ka
that is still used in the P2M tree to point to P2M leafs.

This fixes it by bypassing the revectoring and continuing on.
If the revector fails, a nice WARN is printed so we can still
troubleshoot this.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk [at] oracle>
---
arch/x86/xen/mmu.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 6019c22..0dac3d2 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1238,7 +1238,8 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
memblock_free(__pa(xen_start_info->mfn_list), size);
/* And revector! Bye bye old array */
xen_start_info->mfn_list = new_mfn_list;
- }
+ } else
+ goto skip;
}
/* At this stage, cleanup_highmap has already cleaned __ka space
* from _brk_limit way up to the max_pfn_mapped (which is the end of
@@ -1259,6 +1260,7 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
* anything at this stage. */
xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);
#endif
+skip:
#endif
xen_post_allocator_init();
}
--
1.7.7.6


_______________________________________________
Xen-devel mailing list
Xen-devel [at] lists
http://lists.xen.org/xen-devel

Xen devel RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.