
Mailing List Archive: Linux: Kernel

[patch 5/5] mm: refault distance-based file cache sizing

 

 



hannes at cmpxchg

May 1, 2012, 1:41 AM

[patch 5/5] mm: refault distance-based file cache sizing

To protect frequently used page cache (workingset) from bursts of less
frequently used or one-shot cache, page cache pages are managed on two
linked lists. The inactive list is where all cache starts out on
fault and ends on reclaim. Pages that get accessed another time while
on the inactive list get promoted to the active list to protect them
from reclaim.
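
(To make the two-list scheme concrete, here is a minimal userspace
sketch of the policy described above; the toy_* names are made up for
illustration and none of this is kernel code.)

#include <stdbool.h>
#include <stdio.h>

/* Toy page: only tracks which list it is on and whether it has been
 * referenced since entering the inactive list. */
struct toy_page {
        bool active;
        bool referenced;
};

/* First fault: the page starts out on the inactive list. */
static void toy_fault(struct toy_page *p)
{
        p->active = false;
        p->referenced = false;
}

/* Another access while inactive promotes the page to the active list. */
static void toy_mark_accessed(struct toy_page *p)
{
        if (!p->active && p->referenced) {
                p->active = true;
                p->referenced = false;
        } else {
                p->referenced = true;
        }
}

/* In this model, reclaim only takes pages from the inactive list. */
static bool toy_reclaimable(const struct toy_page *p)
{
        return !p->active;
}

int main(void)
{
        struct toy_page p;

        toy_fault(&p);
        toy_mark_accessed(&p);  /* first reuse: sets referenced */
        toy_mark_accessed(&p);  /* second reuse: promotes to active */
        printf("reclaimable: %d\n", toy_reclaimable(&p));  /* prints 0 */
        return 0;
}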

Right now we have two main problems.

One stems from numa allocation decisions and how the page allocator
and kswapd interact. Both of them can enter a perfect loop
where kswapd reclaims from the preferred zone of a task, allowing the
task to continuously allocate from that zone. Or, the node distance
can lead the allocator to do direct zone reclaim to stay in the
preferred zone. This may be good for locality, but the task has only
the inactive space of that one zone to get its memory activated.
Forcing the allocator to spread out to lower zones in the right
situation makes the difference between continuous IO to serve the
workingset and taking the numa cost but serving fully from memory.

The other issue is that with the two lists alone, we can never detect
when a new set of data with equal access frequency should be cached if
the size of it is bigger than total/allowed memory minus the active
set. Currently we have the perfect compromise given those
constraints: the active list is not allowed to grow bigger than the
inactive list. This means that we can protect cache from reclaim only
up to half of memory, and don't recognize workingset changes that are
bigger than half of memory.
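
(The current compromise mentioned above is essentially this one
comparison; a hedged userspace sketch, not the kernel function itself.)

#include <stdio.h>

/* Active file pages only keep getting deactivated while the active
 * list is bigger than the inactive list, i.e. cache protection stops
 * at half of memory. */
static int toy_inactive_file_is_low(unsigned long nr_inactive_file,
                                    unsigned long nr_active_file)
{
        return nr_active_file > nr_inactive_file;
}

int main(void)
{
        printf("%d\n", toy_inactive_file_is_low(300, 500)); /* 1: keep deactivating */
        printf("%d\n", toy_inactive_file_is_low(500, 500)); /* 0: protect active list */
        return 0;
}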

This patch tries to solve both problems by adding and making use of a
new metric, the refault distance.

Whenever a file page leaves the inactive list, be it through reclaim
or activation, a global counter is increased, called the "workingset
time". When a page is evicted from memory, a snapshot of the current
workingset time is remembered, so that when the page is refaulted
later, it can be figured out for how long the page has been out of
memory. This is called the refault distance.
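
(Purely as an illustration of that bookkeeping, a userspace sketch
follows; the toy_* names are invented and this is not the patch's API.)

#include <stdio.h>

/* One tick for every page that leaves the inactive list, whether by
 * eviction or by activation. */
static unsigned long toy_workingset_time;

/* On eviction, the current time is remembered as a shadow entry. */
static unsigned long toy_evict(void)
{
        return ++toy_workingset_time;
}

/* On refault, the distance is the workingset time that passed while
 * the page was out of memory. */
static unsigned long toy_refault_distance(unsigned long shadow)
{
        return toy_workingset_time - shadow;
}

int main(void)
{
        unsigned long shadow = toy_evict();     /* page evicted at tick 1 */
        int i;

        for (i = 0; i < 100; i++)               /* 100 more list exits */
                toy_evict();
        printf("refault distance: %lu\n", toy_refault_distance(shadow));
        return 0;                               /* prints 100 */
}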

The observation then is this: if a page is refaulted after N ticks of
working set time, the eviction could have been avoided if the active
list had been N pages smaller and this space available to the inactive
list instead.

We don't have recent usage information for pages on the active list,
so we cannot explicitly compare the refaulting page to the least
frequently used active page. Instead, for each refault with a
distance smaller than the size of the active list, we deactivate an
active page. This way, both the refaulted page and the freshly
deactivated page get placed next to each other on the head of the
inactive list and both have equal chance to get activated. Whichever
wins is probably the more frequently used page.
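
(The rule from the previous paragraph boils down to a single
comparison; again a hedged userspace sketch with invented names, not
the patch's code.)

#include <stdio.h>

/* A refault whose distance is smaller than the active list suggests
 * challenging one active page: deactivate it so it competes with the
 * refaulting page at the head of the inactive list. */
static int toy_should_deactivate(unsigned long refault_distance,
                                 unsigned long nr_active_file)
{
        return refault_distance < nr_active_file;
}

int main(void)
{
        printf("%d\n", toy_should_deactivate(100, 500)); /* 1: challenge one page */
        printf("%d\n", toy_should_deactivate(800, 500)); /* 0: out of reach anyway */
        return 0;
}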

To ensure the spreading of pages across available/allowed zones when
necessary, a per-zone floating proportion of evictions in the system
is maintained, which allows translating the global refault distance of
a page into a distance proportional to the zone's own eviction speed.
When a refaulting page is allocated, for each zone considered in the
first zonelist walk of the allocator, the per-zone distance is
compared to the zone's number of active and free pages. If the
distance is bigger, the allocator moves to the next zone, to see if
it is less utilized (fewer evictions -> smaller distance, potentially
stale active pages, or even free pages) and thus, unlike the preferred
zone, has the potential to hold the page in memory. This way,
non-refault allocations and those that would fit into the preferred
zone stay local, but if we see a chance to keep these pages in memory
long-term by spreading them out, we try to use all the space we can
get and sacrifice locality to save disk IO.
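
(A simplified model of that per-zone decision; the real code uses a
floating proportion, subtracts the dirty reserve from the free pages,
and also tracks a cumulative distance across the zonelist walk, all of
which this hedged sketch with invented names leaves out.)

#include <stdio.h>

/* Scale the global refault distance by the zone's share of recent
 * evictions, then check whether the zone's active plus free pages
 * could have covered that distance. */
static int toy_zone_can_hold(unsigned long refault_distance,
                             unsigned long zone_evictions,
                             unsigned long total_evictions,
                             unsigned long zone_active,
                             unsigned long zone_free)
{
        unsigned long zone_distance;

        if (!total_evictions)
                return 1;
        zone_distance = refault_distance * zone_evictions / total_evictions;
        return zone_distance < zone_active + zone_free;
}

int main(void)
{
        /* Zone that saw 80% of recent evictions and has little room: skip it. */
        printf("%d\n", toy_zone_can_hold(1000, 80, 100, 300, 100)); /* 0 */
        /* Quieter zone with the same room: try to place the refault here. */
        printf("%d\n", toy_zone_can_hold(1000, 20, 100, 300, 100)); /* 1 */
        return 0;
}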

Signed-off-by: Johannes Weiner <hannes [at] cmpxchg>
---
include/linux/mmzone.h | 7 ++
include/linux/swap.h | 9 ++-
mm/Makefile | 1 +
mm/memcontrol.c | 3 +
mm/page_alloc.c | 7 ++
mm/swap.c | 2 +
mm/vmscan.c | 80 +++++++++++++---------
mm/vmstat.c | 4 +
mm/workingset.c | 174 ++++++++++++++++++++++++++++++++++++++++++++++++
9 files changed, 249 insertions(+), 38 deletions(-)
create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 650ba2f..a4da472 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -15,6 +15,7 @@
#include <linux/seqlock.h>
#include <linux/nodemask.h>
#include <linux/pageblock-flags.h>
+#include <linux/proportions.h>
#include <generated/bounds.h>
#include <linux/atomic.h>
#include <asm/page.h>
@@ -115,6 +116,10 @@ enum zone_stat_item {
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
#endif
+ WORKINGSET_SKIP,
+ WORKINGSET_ALLOC,
+ WORKINGSET_STALE,
+ WORKINGSET_STALE_FORCE,
NR_ANON_TRANSPARENT_HUGEPAGES,
NR_VM_ZONE_STAT_ITEMS };

@@ -161,6 +166,7 @@ static inline int is_unevictable_lru(enum lru_list lru)

struct lruvec {
struct list_head lists[NR_LRU_LISTS];
+ long shrink_active;
};

/* Mask used at gathering information at once (see memcontrol.c) */
@@ -372,6 +378,7 @@ struct zone {
/* Fields commonly accessed by the page reclaim scanner */
spinlock_t lru_lock;
struct lruvec lruvec;
+ struct prop_local_percpu evictions;

struct zone_reclaim_stat reclaim_stat;

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 03d327f..cf304ed 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -205,10 +205,11 @@ struct swap_list_t {
#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)

/* linux/mm/workingset.c */
-static inline unsigned long workingset_refault_distance(struct page *page)
-{
- return 0;
-}
+void *workingset_eviction(struct page *);
+void workingset_activation(struct page *);
+unsigned long workingset_refault_distance(struct page *);
+bool workingset_zone_alloc(struct zone *, unsigned long,
+ unsigned long *, unsigned long *);

/* linux/mm/page_alloc.c */
extern unsigned long totalram_pages;
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..bd09137 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -13,6 +13,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
page_isolation.o mm_init.o mmu_context.o percpu.o \
+ workingset.o \
$(mmu-y)
obj-y += init-mm.o

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 58a08fc..10dc07c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1020,6 +1020,9 @@ struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
if (mem_cgroup_disabled())
return &zone->lruvec;

+ if (!memcg)
+ memcg = root_mem_cgroup;
+
mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
return &mz->lruvec;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a13ded1..a6544c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1711,9 +1711,11 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */
+ unsigned long distance, active;

classzone_idx = zone_idx(preferred_zone);
zonelist_scan:
+ distance = active = 0;
/*
* Scan zonelist, looking for a zone with enough free.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
@@ -1726,6 +1728,11 @@ zonelist_scan:
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed_softwall(zone, gfp_mask))
continue;
+ if ((alloc_flags & ALLOC_WMARK_LOW) &&
+ current->refault_distance &&
+ !workingset_zone_alloc(zone, current->refault_distance,
+ &distance, &active))
+ continue;
/*
* When allocating a page cache page for writing, we
* want to get it from a zone that is within its dirty
diff --git a/mm/swap.c b/mm/swap.c
index cc5ce81..3029b40 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -365,6 +365,8 @@ void mark_page_accessed(struct page *page)
PageReferenced(page) && PageLRU(page)) {
activate_page(page);
ClearPageReferenced(page);
+ if (page_is_file_cache(page))
+ workingset_activation(page);
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 44d81f5..a01d123 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -536,7 +536,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
* Same as remove_mapping, but if the page is removed from the mapping, it
* gets returned with a refcount of 0.
*/
-static int __remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping, struct page *page,
+ bool reclaimed)
{
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
@@ -582,10 +583,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
swapcache_free(swap, page);
} else {
void (*freepage)(struct page *);
+ void *shadow = NULL;

freepage = mapping->a_ops->freepage;
-
- __delete_from_page_cache(page, NULL);
+
+ if (reclaimed && page_is_file_cache(page))
+ shadow = workingset_eviction(page);
+ __delete_from_page_cache(page, shadow);
spin_unlock_irq(&mapping->tree_lock);
mem_cgroup_uncharge_cache_page(page);

@@ -608,7 +612,7 @@ cannot_free:
*/
int remove_mapping(struct address_space *mapping, struct page *page)
{
- if (__remove_mapping(mapping, page)) {
+ if (__remove_mapping(mapping, page, false)) {
/*
* Unfreezing the refcount with 1 rather than 2 effectively
* drops the pagecache ref for us without requiring another
@@ -968,7 +972,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
}

- if (!mapping || !__remove_mapping(mapping, page))
+ if (!mapping || !__remove_mapping(mapping, page, true))
goto keep_locked;

/*
@@ -1824,43 +1828,51 @@ static inline int inactive_anon_is_low(struct mem_cgroup_zone *mz)
}
#endif

-static int inactive_file_is_low_global(struct zone *zone)
+static int inactive_file_is_low(unsigned long nr_to_scan,
+ struct mem_cgroup_zone *mz,
+ struct scan_control *sc)
{
- unsigned long active, inactive;
-
- active = zone_page_state(zone, NR_ACTIVE_FILE);
- inactive = zone_page_state(zone, NR_INACTIVE_FILE);
-
- return (active > inactive);
-}
+ unsigned long inactive_ratio;
+ unsigned long inactive;
+ struct lruvec *lruvec;
+ unsigned long active;
+ unsigned long gb;

-/**
- * inactive_file_is_low - check if file pages need to be deactivated
- * @mz: memory cgroup and zone to check
- *
- * When the system is doing streaming IO, memory pressure here
- * ensures that active file pages get deactivated, until more
- * than half of the file pages are on the inactive list.
- *
- * Once we get to that situation, protect the system's working
- * set from being evicted by disabling active file page aging.
- *
- * This uses a different ratio than the anonymous pages, because
- * the page cache uses a use-once replacement algorithm.
- */
-static int inactive_file_is_low(struct mem_cgroup_zone *mz)
-{
- if (!scanning_global_lru(mz))
+ if (!global_reclaim(sc)) /* XXX: integrate hard limit reclaim */
return mem_cgroup_inactive_file_is_low(mz->mem_cgroup,
mz->zone);

- return inactive_file_is_low_global(mz->zone);
+ lruvec = mem_cgroup_zone_lruvec(mz->zone, sc->target_mem_cgroup);
+ if (lruvec->shrink_active > 0) {
+ inc_zone_state(mz->zone, WORKINGSET_STALE);
+ lruvec->shrink_active -= nr_to_scan;
+ return true;
+ }
+ /*
+ * Make sure there is always a reasonable amount of inactive
+ * file pages around to keep the zone reclaimable.
+ */
+ inactive = zone_nr_lru_pages(mz, LRU_INACTIVE_FILE);
+ active = zone_nr_lru_pages(mz, LRU_ACTIVE_FILE);
+ gb = (inactive + active) >> (30 - PAGE_SHIFT);
+ if (gb)
+ inactive_ratio = int_sqrt(10 * gb);
+ else
+ inactive_ratio = 1;
+ if (inactive * inactive_ratio < active) {
+ inc_zone_state(mz->zone, WORKINGSET_STALE_FORCE);
+ return true;
+ }
+ return false;
}

-static int inactive_list_is_low(struct mem_cgroup_zone *mz, int file)
+static int inactive_list_is_low(unsigned long nr_to_scan,
+ struct mem_cgroup_zone *mz,
+ struct scan_control *sc,
+ int file)
{
if (file)
- return inactive_file_is_low(mz);
+ return inactive_file_is_low(nr_to_scan, mz, sc);
else
return inactive_anon_is_low(mz);
}
@@ -1872,7 +1884,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
int file = is_file_lru(lru);

if (is_active_lru(lru)) {
- if (inactive_list_is_low(mz, file))
+ if (inactive_list_is_low(nr_to_scan, mz, sc, file))
shrink_active_list(nr_to_scan, mz, sc, priority, file);
return 0;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f600557..28f4b90 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -718,6 +718,10 @@ const char * const vmstat_text[] = {
"numa_local",
"numa_other",
#endif
+ "workingset_skip",
+ "workingset_alloc",
+ "workingset_stale",
+ "workingset_stale_force",
"nr_anon_transparent_hugepages",
"nr_dirty_threshold",
"nr_dirty_background_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
new file mode 100644
index 0000000..2fc9ac6
--- /dev/null
+++ b/mm/workingset.c
@@ -0,0 +1,174 @@
+/*
+ * Workingset detection
+ *
+ * Copyright (C) 2012 Red Hat, Inc., Johannes Weiner
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/pagemap.h>
+#include <linux/atomic.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+/*
+ * Monotonic workingset clock for non-resident pages. Each page
+ * leaving the inactive list (eviction or activation) is one tick.
+ *
+ * The refault distance of a page is the number of ticks that
+ * occurred between eviction and refault.
+ *
+ * If the inactive list had been bigger by the refault distance in
+ * pages, the refault would not have happened. Or put differently, if
+ * the distance is smaller than the number of active file pages, an
+ * active page needs to be deactivated so that both pages get an equal
+ * chance for activation when there is not enough memory for both.
+ */
+static atomic_t workingset_time;
+
+/*
+ * Per-zone proportional eviction counter to keep track of recent zone
+ * eviction speed and be able to calculate per-zone refault distances.
+ */
+static struct prop_descriptor global_evictions;
+
+/*
+ * Workingset time snapshots are stored in the page cache radix tree
+ * as exceptional entries.
+ */
+#define EV_SHIFT RADIX_TREE_EXCEPTIONAL_SHIFT
+#define EV_MASK (~0UL >> EV_SHIFT)
+
+void *workingset_eviction(struct page *page)
+{
+ unsigned long time;
+
+ prop_inc_percpu(&global_evictions, &page_zone(page)->evictions);
+ time = (unsigned int)atomic_inc_return(&workingset_time);
+
+ return (void *)((time << EV_SHIFT) | RADIX_TREE_EXCEPTIONAL_ENTRY);
+}
+
+void workingset_activation(struct page *page)
+{
+ struct lruvec *lruvec;
+ /*
+ * Refault distance is compared to the number of active pages,
+ * but pages activated after the eviction were hardly the
+ * reason for memory shortness back then. Advancing the clock
+ * on activation compensates for them, so that we compare to
+ * the number of active pages at time of eviction.
+ */
+ atomic_inc(&workingset_time);
+ /*
+ * Furthermore, activations mean that the inactive list is big
+ * enough and that a new workingset is being adapted already.
+ * Deactivation is no longer necessary; even harmful.
+ */
+ lruvec = mem_cgroup_zone_lruvec(page_zone(page), NULL);
+ if (lruvec->shrink_active > 0)
+ lruvec->shrink_active--;
+}
+
+unsigned long workingset_refault_distance(struct page *page)
+{
+ unsigned long time_of_eviction;
+ unsigned long now;
+
+ if (!page)
+ return 0;
+
+ BUG_ON(!radix_tree_exceptional_entry(page));
+
+ time_of_eviction = (unsigned long)page >> EV_SHIFT;
+ now = (unsigned int)atomic_read(&workingset_time) & EV_MASK;
+
+ return (now - time_of_eviction) & EV_MASK;
+}
+EXPORT_SYMBOL(workingset_refault_distance);
+
+bool workingset_zone_alloc(struct zone *zone, unsigned long refault_distance,
+ unsigned long *pdistance, unsigned long *pactive)
+{
+ unsigned long zone_active;
+ unsigned long zone_free;
+ unsigned long missing;
+ long denominator;
+ long numerator;
+
+ /*
+ * Don't put refaulting pages into zones that are already
+ * heavily reclaimed and don't have the potential to hold all
+ * the workingset. Instead go for zones where the zone-local
+ * distance is smaller than the potential inactive list space.
+ * This can be either because there has not been much reclaim
+ * recently (small distance), because the zone is not actually
+ * full (free pages), or because there are just genuinely a
+ * lot of active pages that may be used less frequently than
+ * the refaulting page. Either way, use this potential to
+ * hold the refaulting page long-term instead of beating on
+ * already thrashing higher zones.
+ */
+ prop_fraction_percpu(&global_evictions, &zone->evictions,
+ &numerator, &denominator);
+ missing = refault_distance * numerator;
+ do_div(missing, denominator);
+ *pdistance += missing;
+
+ zone_active = zone_page_state(zone, NR_ACTIVE_FILE);
+ *pactive += zone_active;
+
+ /*
+ * Lower zones may not even be full, and free pages are
+ * potential inactive space, too. But the dirty reserve is
+ * not available to page cache due to lowmem reserves and the
+ * kswapd watermark. Don't include it.
+ */
+ zone_free = zone_page_state(zone, NR_FREE_PAGES);
+ if (zone_free > zone->dirty_balance_reserve)
+ zone_free -= zone->dirty_balance_reserve;
+ else
+ zone_free = 0;
+
+ if (missing >= zone_active + zone_free) {
+ inc_zone_state(zone, WORKINGSET_SKIP);
+ return false;
+ }
+
+ inc_zone_state(zone, WORKINGSET_ALLOC);
+
+ /*
+ * Okay, placement in this zone makes sense, but don't start
+ * actually deactivating pages until all allowed zones are
+ * under equalized pressure, or risk throwing out active pages
+ * from a barely used zone even when the refaulting data set
+ * is bigger than the available memory. To prevent that, look
+ * at the cumulative distance and active pages of all zones
+ * already visited, which normalizes the distance for the case
+ * when higher zones are thrashing and we just started putting
+ * pages in the lower ones.
+ */
+ if (*pdistance < *pactive) {
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_zone_lruvec(zone, NULL);
+ lruvec->shrink_active++;
+ }
+ return true;
+}
+
+static int __init workingset_init(void)
+{
+ extern unsigned long global_dirtyable_memory(void);
+ struct zone *zone;
+ int shift;
+
+ shift = ilog2(global_dirtyable_memory() - 1);
+ prop_descriptor_init(&global_evictions, shift);
+ for_each_zone(zone)
+ prop_local_init_percpu(&zone->evictions);
+ return 0;
+}
+
+module_init(workingset_init);
--
1.7.7.6



minchan at kernel

May 1, 2012, 7:13 AM

Re: [patch 5/5] mm: refault distance-based file cache sizing

Hi Hannes,

On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote:
> To protect frequently used page cache (workingset) from bursts of less
> frequently used or one-shot cache, page cache pages are managed on two
> linked lists. The inactive list is where all cache starts out on
> fault and ends on reclaim. Pages that get accessed another time while
> on the inactive list get promoted to the active list to protect them
> from reclaim.
>
> Right now we have two main problems.
>
> One stems from numa allocation decisions and how the page allocator
> and kswapd interact. The both of them can enter into a perfect loop
> where kswapd reclaims from the preferred zone of a task, allowing the
> task to continuously allocate from that zone. Or, the node distance
> can lead to the allocator to do direct zone reclaim to stay in the
> preferred zone. This may be good for locality, but the task has only

Understood.

> the inactive space of that one zone to get its memory activated.
> Forcing the allocator to spread out to lower zones in the right
> situation makes the difference between continuous IO to serve the
> workingset, or taking the numa cost but serving fully from memory.

It's hard for me to parse your wording.
Could you elaborate on it?
It would be good if you explained it with an example.

>
> The other issue is that with the two lists alone, we can never detect
> when a new set of data with equal access frequency should be cached if
> the size of it is bigger than total/allowed memory minus the active
> set. Currently we have the perfect compromise given those
> constraints: the active list is not allowed to grow bigger than the
> inactive list. This means that we can protect cache from reclaim only

Okay.

> up to half of memory, and don't recognize workingset changes that are
> bigger than half of memory.

Workingset change?
You mean that if the new workingset is bigger than half of memory and is
streamed before being re-touched, we could cache only part of the working
set, because the head pages of the working set would be discarded from the
inactive list by its tail pages?

I'm not sure I totally parsed your point.
Could you explain it in detail? Before reading your approach and diving into
the code, I would like to see the problem clearly.

Thanks.



hannes at cmpxchg

May 1, 2012, 8:38 AM

Re: [patch 5/5] mm: refault distance-based file cache sizing

On Tue, May 01, 2012 at 11:13:30PM +0900, Minchan Kim wrote:
> Hi Hannes,
>
> On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote:
> > To protect frequently used page cache (workingset) from bursts of less
> > frequently used or one-shot cache, page cache pages are managed on two
> > linked lists. The inactive list is where all cache starts out on
> > fault and ends on reclaim. Pages that get accessed another time while
> > on the inactive list get promoted to the active list to protect them
> > from reclaim.
> >
> > Right now we have two main problems.
> >
> > One stems from numa allocation decisions and how the page allocator
> > and kswapd interact. The both of them can enter into a perfect loop
> > where kswapd reclaims from the preferred zone of a task, allowing the
> > task to continuously allocate from that zone. Or, the node distance
> > can lead to the allocator to do direct zone reclaim to stay in the
> > preferred zone. This may be good for locality, but the task has only
>
> Understood.
>
> > the inactive space of that one zone to get its memory activated.
> > Forcing the allocator to spread out to lower zones in the right
> > situation makes the difference between continuous IO to serve the
> > workingset, or taking the numa cost but serving fully from memory.
>
> It's hard to parse your word due to my dumb brain.
> Could you elaborate on it?
> It would be a good if you say with example.

Say your Normal zone is 4G (DMA32 also 4G) and you have 2G of active
file pages in Normal and DMA32 is full of other stuff. Now you access
a new 6G file repeatedly. First it allocates from Normal (preferred),
then tries DMA32 (full), wakes up kswapd and retries all zones. If
kswapd then frees pages at roughly the same pace as the allocator
allocates from Normal, kswapd never goes to sleep and evicts pages
from the 6G file before they can get accessed a second time. Even
though the 6G file could fit in memory (4G Normal + 4G DMA32), the
allocator only uses the 4G Normal zone.

Same applies if you have a load that would fit in the memory of two
nodes but the node distance leads the allocator to do zone_reclaim(),
forcing the pages to stay in one node, again preventing the load
from being fully cached in memory, which is much more expensive than
the foreign node cost.

> > up to half of memory, and don't recognize workingset changes that are
> > bigger than half of memory.
>
> Workingset change?
> You mean if new workingset is bigger than half of memory and it's like
> stream before retouch, we could cache only part of working set because
> head pages on working set would be discared by tail pages of working set
> in inactive list?

Spot-on. I called that 'tail-chasing' in my notes :-) You are in a
perpetual loop of evicting pages that you will need again in a couple
hundred page faults. Those couple hundred page faults are the refault
distance, and my code is able to detect these loops and increase the
space available to the inactive list to end them, if possible.

This is the whole principle of the series.

If such a loop is recognized in a single zone, the allocator goes for
lower zones to increase the inactive space. If such a loop is
recognized over all allowed zones in the zonelist, the active lists
are shrunk to increase the inactive space.


aarcange at redhat

May 1, 2012, 6:57 PM

Re: [patch 5/5] mm: refault distance-based file cache sizing

On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote:
> frequently used active page. Instead, for each refault with a
> distance smaller than the size of the active list, we deactivate an

Shouldn't this be the size of active list + size of inactive list?

If the active list is 500M, inactive 500M and the new working set is
600M, the refault distance will be 600M, it won't be smaller than the
size of the active list, and it won't deactivate the active list as it
should and it won't be detected as working set.

Only the refault distance bigger than inactive+active should not
deactivate the active list if I understand how this works correctly.

> @@ -1726,6 +1728,11 @@ zonelist_scan:
> if ((alloc_flags & ALLOC_CPUSET) &&
> !cpuset_zone_allowed_softwall(zone, gfp_mask))
> continue;
> + if ((alloc_flags & ALLOC_WMARK_LOW) &&
> + current->refault_distance &&
> + !workingset_zone_alloc(zone, current->refault_distance,
> + &distance, &active))
> + continue;
> /*
> * When allocating a page cache page for writing, we
> * want to get it from a zone that is within its dirty

It's a bit hard to see how this may not run oom prematurely if the
distance is always bigger. This is just an implementation question, and
maybe I'm missing a fallback somewhere where we actually allocate
memory from whatever place is available in case no place is ideal.

> + /*
> + * Lower zones may not even be full, and free pages are
> + * potential inactive space, too. But the dirty reserve is
> + * not available to page cache due to lowmem reserves and the
> + * kswapd watermark. Don't include it.
> + */
> + zone_free = zone_page_state(zone, NR_FREE_PAGES);
> + if (zone_free > zone->dirty_balance_reserve)
> + zone_free -= zone->dirty_balance_reserve;
> + else
> + zone_free = 0;

Maybe also remove the high wmark from the sum? It can be some hundred
meg so it's better to take it into account, to have a more accurate
math and locate the best zone that surely fits.

For the same reason it looks like the lowmem reserve should also be
taken into account, on the full sum.

> + if (missing >= zone_active + zone_free) {

This seems a place where to add the zone_inactive too according to my
comment on top.


minchan at kernel

May 1, 2012, 10:21 PM

Re: [patch 5/5] mm: refault distance-based file cache sizing

On 05/02/2012 12:38 AM, Johannes Weiner wrote:

> On Tue, May 01, 2012 at 11:13:30PM +0900, Minchan Kim wrote:
>> Hi Hannes,
>>
>> On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote:
>>> To protect frequently used page cache (workingset) from bursts of less
>>> frequently used or one-shot cache, page cache pages are managed on two
>>> linked lists. The inactive list is where all cache starts out on
>>> fault and ends on reclaim. Pages that get accessed another time while
>>> on the inactive list get promoted to the active list to protect them
>>> from reclaim.
>>>
>>> Right now we have two main problems.
>>>
>>> One stems from numa allocation decisions and how the page allocator
>>> and kswapd interact. The both of them can enter into a perfect loop
>>> where kswapd reclaims from the preferred zone of a task, allowing the
>>> task to continuously allocate from that zone. Or, the node distance
>>> can lead to the allocator to do direct zone reclaim to stay in the
>>> preferred zone. This may be good for locality, but the task has only
>>
>> Understood.
>>
>>> the inactive space of that one zone to get its memory activated.
>>> Forcing the allocator to spread out to lower zones in the right
>>> situation makes the difference between continuous IO to serve the
>>> workingset, or taking the numa cost but serving fully from memory.
>>
>> It's hard to parse your word due to my dumb brain.
>> Could you elaborate on it?
>> It would be a good if you say with example.
>
> Say your Normal zone is 4G (DMA32 also 4G) and you have 2G of active
> file pages in Normal and DMA32 is full of other stuff. Now you access
> a new 6G file repeatedly. First it allocates from Normal (preferred),
> then tries DMA32 (full), wakes up kswapd and retries all zones. If
> kswapd then frees pages at roughly the same pace as the allocator
> allocates from Normal, kswapd never goes to sleep and evicts pages
> from the 6G file before they can get accessed a second time. Even
> though the 6G file could fit in memory (4G Normal + 4G DMA32), the
> allocator only uses the 4G Normal zone.
>
> Same applies if you have a load that would fit in the memory of two
> nodes but the node distance leads the allocator to do zone_reclaim()
> and forcing the pages to stay in one node, again preventing the load
> from being fully cached in memory, which is much more expensive than
> the foreign node cost.
>
>>> up to half of memory, and don't recognize workingset changes that are
>>> bigger than half of memory.
>>
>> Workingset change?
>> You mean if new workingset is bigger than half of memory and it's like
>> stream before retouch, we could cache only part of working set because
>> head pages on working set would be discared by tail pages of working set
>> in inactive list?
>
> Spot-on. I called that 'tail-chasing' in my notes :-) When you are in
> a perpetual loop of evicting pages you will need in a couple hundred
> page faults. Those couple hundred page faults are the refault
> distance and my code is able to detect these loops and increases the
> space available to the inactive list to end them, if possible.
>


Thanks! It would be better to add the above explanation to the cover letter.


> This is the whole principle of the series.
>
> If such a loop is recognized in a single zone, the allocator goes for
> lower zones to increase the inactive space. If such a loop is
> recognized over all allowed zones in the zonelist, the active lists
> are shrunk to increase the inactive space.




--
Kind regards,
Minchan Kim


hannes at cmpxchg

May 1, 2012, 11:23 PM

Re: [patch 5/5] mm: refault distance-based file cache sizing

On Wed, May 02, 2012 at 03:57:41AM +0200, Andrea Arcangeli wrote:
> On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote:
> > frequently used active page. Instead, for each refault with a
> > distance smaller than the size of the active list, we deactivate an
>
> Shouldn't this be the size of active list + size of inactive list?
>
> If the active list is 500M, inactive 500M and the new working set is
> 600M, the refault distance will be 600M, it won't be smaller than the
> size of the active list, and it won't deactivate the active list as it
> should and it won't be detected as working set.
>
> Only the refault distance bigger than inactive+active should not
> deactivate the active list if I understand how this works correctly.

The refault distance is what's missing, not the full reuse frequency.
You ignore the 500M worth of inactive LRU time the page had in memory.
The distance in that scenario would be 100M, the time between eviction
and refault:

        +-----------------------------++-----------------------------+
        |                             ||                             |
        |           inactive          ||            active           |
        +-----------------------------++-----------------------------+
+~~~~~~~------------------------------+
|                                     |
|               new set               |
+~~~~~~~------------------------------+
^       ^
|       |
|       eviction
refault

The ~~~'d part could fit into memory if the active list was 100M
smaller.

> > @@ -1726,6 +1728,11 @@ zonelist_scan:
> > if ((alloc_flags & ALLOC_CPUSET) &&
> > !cpuset_zone_allowed_softwall(zone, gfp_mask))
> > continue;
> > + if ((alloc_flags & ALLOC_WMARK_LOW) &&
> > + current->refault_distance &&
> > + !workingset_zone_alloc(zone, current->refault_distance,
> > + &distance, &active))
> > + continue;
> > /*
> > * When allocating a page cache page for writing, we
> > * want to get it from a zone that is within its dirty
>
> It's a bit hard to see how this may not run oom prematurely if the
> distance is always bigger, this is just an implementation question and
> maybe I'm missing a fallback somewhere where we actually allocate
> memory from whatever place in case no place is ideal.

Sorry, this should be documented better.

The ALLOC_WMARK_LOW check makes sure this only applies in the
fastpath. It will prepare reclaim with lruvec->shrink_active, then
wake up kswapd and retry the zonelist without this constraint.
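
(Roughly, the fallback in question looks like the following; this is a
hedged userspace sketch of the control flow only, and every toy_* name
is invented rather than an actual allocator function.)

#include <stdio.h>
#include <stddef.h>

enum toy_wmark { TOY_WMARK_LOW, TOY_WMARK_MIN };

/* Stand-in for the zonelist walk: pretend the low-watermark pass skips
 * every zone because the refault distance is too large everywhere. */
static void *toy_get_page_from_zonelist(enum toy_wmark wmark,
                                        unsigned long refault_distance)
{
        static char page[4096];

        if (wmark == TOY_WMARK_LOW && refault_distance > 0)
                return NULL;            /* all zones skipped in the fastpath */
        return page;                    /* the slowpath always succeeds here */
}

static void toy_wake_kswapd(void)
{
        /* in the real allocator this kicks background reclaim */
}

static void *toy_alloc_pagecache(unsigned long refault_distance)
{
        void *page;

        /* Fastpath: low watermark plus the refault-distance constraint. */
        page = toy_get_page_from_zonelist(TOY_WMARK_LOW, refault_distance);
        if (page)
                return page;

        /* Slowpath: wake kswapd and retry without the constraint, so a
         * large distance alone never leads to a premature OOM. */
        toy_wake_kswapd();
        return toy_get_page_from_zonelist(TOY_WMARK_MIN, 0);
}

int main(void)
{
        printf("allocated: %s\n", toy_alloc_pagecache(1UL << 20) ? "yes" : "no");
        return 0;
}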

> > + /*
> > + * Lower zones may not even be full, and free pages are
> > + * potential inactive space, too. But the dirty reserve is
> > + * not available to page cache due to lowmem reserves and the
> > + * kswapd watermark. Don't include it.
> > + */
> > + zone_free = zone_page_state(zone, NR_FREE_PAGES);
> > + if (zone_free > zone->dirty_balance_reserve)
> > + zone_free -= zone->dirty_balance_reserve;
> > + else
> > + zone_free = 0;
>
> Maybe also remove the high wmark from the sum? It can be some hundred
> meg so it's better to take it into account, to have a more accurate
> math and locate the best zone that surely fits.
>
> For the same reason it looks like the lowmem reserve should also be
> taken into account, on the full sum.

dirty_balance_reserve IS the sum of the high watermark and the biggest
lowmem reserve for a particular zone, see how it's calculated in
mm/page_alloc.c::calculate_totalreserve_pages().

nr_free - dirty_balance_reserve is the number of pages available to
page cache allocations without keeping kswapd alive or having to dip
into lowmem reserves.

Or did I misunderstand you?


aarcange at redhat

May 2, 2012, 8:11 AM

Re: [patch 5/5] mm: refault distance-based file cache sizing

On Wed, May 02, 2012 at 08:23:09AM +0200, Johannes Weiner wrote:
> On Wed, May 02, 2012 at 03:57:41AM +0200, Andrea Arcangeli wrote:
> > On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote:
> > > frequently used active page. Instead, for each refault with a
> > > distance smaller than the size of the active list, we deactivate an
> >
> > Shouldn't this be the size of active list + size of inactive list?
> >
> > If the active list is 500M, inactive 500M and the new working set is
> > 600M, the refault distance will be 600M, it won't be smaller than the
> > size of the active list, and it won't deactivate the active list as it
> > should and it won't be detected as working set.
> >
> > Only the refault distance bigger than inactive+active should not
> > deactivate the active list if I understand how this works correctly.
>
> The refault distance is what's missing, not the full reuse frequency.
> You ignore the 500M worth of inactive LRU time the page had in memory.
> The distance in that scenario would be 100M, the time between eviction
> and refault:
>
>         +-----------------------------++-----------------------------+
>         |                             ||                             |
>         |           inactive          ||            active           |
>         +-----------------------------++-----------------------------+
> +~~~~~~~------------------------------+
> |                                     |
> |               new set               |
> +~~~~~~~------------------------------+
> ^       ^
> |       |
> |       eviction
> refault
>
> The ~~~'d part could fit into memory if the active list was 100M
> smaller.

Never mind, I see that the refault distance is only going to measure
the amount of the new working set that spilled over the inactive list,
so it would only be 100M in the example.

> > > @@ -1726,6 +1728,11 @@ zonelist_scan:
> > > if ((alloc_flags & ALLOC_CPUSET) &&
> > > !cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > continue;
> > > + if ((alloc_flags & ALLOC_WMARK_LOW) &&
> > > + current->refault_distance &&
> > > + !workingset_zone_alloc(zone, current->refault_distance,
> > > + &distance, &active))
> > > + continue;
> > > /*
> > > * When allocating a page cache page for writing, we
> > > * want to get it from a zone that is within its dirty
> >
> > It's a bit hard to see how this may not run oom prematurely if the
> > distance is always bigger, this is just an implementation question and
> > maybe I'm missing a fallback somewhere where we actually allocate
> > memory from whatever place in case no place is ideal.
>
> Sorry, this should be documented better.
>
> The ALLOC_WMARK_LOW check makes sure this only applies in the
> fastpath. It will prepare reclaim with lruvec->shrink_active, then
> wake up kswapd and retry the zonelist without this constraint.

My point is this is going to change the semantics of ALLOC_WMARK_LOW
to "return OOM randomly even if there's plenty of free memory" instead
of "use only up to the low wmark". I see you want to wake kswapd and
retry with the min wmark after that, but maybe it would be cleaner to
have a new ALLOC_REFAULT_DISTANCE to avoid altering the meaning of
ALLOC_WMARK_LOW. Then add a "|ALLOC_REFAULT_DISTANCE" to the
parameter. It sounds simpler to keep controlling the wmark level
checked with ALLOC_WMARK_LOW|MIN|HIGH without introducing a new special
meaning for the LOW bitflag.

This is only a cleanup though; I believe it works well at runtime.

> > > + /*
> > > + * Lower zones may not even be full, and free pages are
> > > + * potential inactive space, too. But the dirty reserve is
> > > + * not available to page cache due to lowmem reserves and the
> > > + * kswapd watermark. Don't include it.
> > > + */
> > > + zone_free = zone_page_state(zone, NR_FREE_PAGES);
> > > + if (zone_free > zone->dirty_balance_reserve)
> > > + zone_free -= zone->dirty_balance_reserve;
> > > + else
> > > + zone_free = 0;
> >
> > Maybe also remove the high wmark from the sum? It can be some hundred
> > meg so it's better to take it into account, to have a more accurate
> > math and locate the best zone that surely fits.
> >
> > For the same reason it looks like the lowmem reserve should also be
> > taken into account, on the full sum.
>
> dirty_balance_reserve IS the sum of the high watermark and the biggest
> lowmem reserve for a particular zone, see how it's calculated in
> mm/page_alloc.c::calculate_totalreserve_pages().
>
> nr_free - dirty_balance_reserve is the number of pages available to
> page cache allocations without keeping kswapd alive or having to dip
> into lowmem reserves.
>
> Or did I misunderstand you?

No, that's all right then! I didn't realize dirty_balance_reserve
accounts exactly for what I wrote above (high wmark and lowmem
reserve). I've seen it used by page-writeback and I naively assumed it
had to do with dirty page levels, while it has absolutely nothing to
do with writeback or any dirty memory level! Despite its quite
misleading _dirty prefix :)
