Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Linux: Kernel

[RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

 

 

Linux kernel RSS feed   Index | Next | Previous | View Threaded


kamezawa.hiroyu at jp

Oct 27, 2009, 12:45 AM

Post #1 of 17 (351 views)
Permalink
[RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit

On Tue, 27 Oct 2009 15:55:26 +0900
Minchan Kim <minchan.kim [at] gmail> wrote:

> >> Hmm.
> >> I wonder why we consider VM size for OOM kiling.
> >> How about RSS size?
> >>
> >
> > Maybe the current code assumes "Tons of swap have been generated, already" if
> > oom-kill is invoked. Then, just using mm->anon_rss will not be correct.
> >
> > Hm, should we count # of swap entries reference from mm ?....
>
> In Vedran case, he didn't use swap. So, Only considering vm is the problem.
> I think it would be better to consider both RSS + # of swap entries as
> Kosaki mentioned.
>
Then, maybe this kind of patch is necessary.
This is on 2.6.31...then I may have to rebase this to mmotom.
Added more CCs.

Vedran, I'm glad if you can test this patch.


==
Now, oom-killer's score uses mm->total_vm as its base value.
But, in these days, applications like GUI program tend to use
much shared libraries and total_vm grows too high even when
pages are not fully mapped.

For example, running a program "mmap" which allocates 1 GBbytes of
anonymous memory, oom_score top 10 on system will be..

score PID name
89924 3938 mixer_applet2
90210 3942 tomboy
94753 3936 clock-applet
101994 3919 pulseaudio
113525 4028 gnome-terminal
127340 1 init
128177 3871 nautilus
151003 11515 bash
256944 11653 mmap <-----------------use 1G of anon
425561 3829 gnome-session

No one believes gnome-session is more guilty than "mmap".

Instead of total_vm, we should use anon/file/swap usage of a process, I think.
This patch adds mm->swap_usage and calculate oom_score based on
anon_rss + file_rss + swap_usage.
Considering usual applications, this will be much better information than
total_vm. After this patch, the score on my desktop is

score PID name
4033 3176 gnome-panel
4077 3113 xinit
4526 3190 python
4820 3161 gnome-settings-
4989 3289 gnome-terminal
7105 3271 tomboy
8427 3177 nautilus
17549 3140 gnome-session
128501 3299 bash
256106 3383 mmap

This order is not bad, I think.

Note: This adss new counter...then new cost is added.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu [at] jp>
---
include/linux/mm_types.h | 1 +
mm/memory.c | 29 +++++++++++++++++++++--------
mm/oom_kill.c | 12 +++++++++---
mm/rmap.c | 1 +
mm/swapfile.c | 1 +
5 files changed, 33 insertions(+), 11 deletions(-)

Index: linux-2.6.31/include/linux/mm_types.h
===================================================================
--- linux-2.6.31.orig/include/linux/mm_types.h
+++ linux-2.6.31/include/linux/mm_types.h
@@ -228,6 +228,7 @@ struct mm_struct {
*/
mm_counter_t _file_rss;
mm_counter_t _anon_rss;
+ mm_counter_t _swap_usage;

unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
Index: linux-2.6.31/mm/memory.c
===================================================================
--- linux-2.6.31.orig/mm/memory.c
+++ linux-2.6.31/mm/memory.c
@@ -361,12 +361,15 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
return 0;
}

-static inline void add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss)
+static inline
+void add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss, int swaps)
{
if (file_rss)
add_mm_counter(mm, file_rss, file_rss);
if (anon_rss)
add_mm_counter(mm, anon_rss, anon_rss);
+ if (swaps)
+ add_mm_counter(mm, swap_usage, swaps);
}

/*
@@ -562,6 +565,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
&src_mm->mmlist);
spin_unlock(&mmlist_lock);
}
+ if (!is_migration_entry(entry))
+ rss[2]++;
if (is_write_migration_entry(entry) &&
is_cow_mapping(vm_flags)) {
/*
@@ -611,10 +616,10 @@ static int copy_pte_range(struct mm_stru
pte_t *src_pte, *dst_pte;
spinlock_t *src_ptl, *dst_ptl;
int progress = 0;
- int rss[2];
+ int rss[3];

again:
- rss[1] = rss[0] = 0;
+ rss[2] = rss[1] = rss[0] = 0;
dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
if (!dst_pte)
return -ENOMEM;
@@ -645,7 +650,7 @@ again:
arch_leave_lazy_mmu_mode();
spin_unlock(src_ptl);
pte_unmap_nested(src_pte - 1);
- add_mm_rss(dst_mm, rss[0], rss[1]);
+ add_mm_rss(dst_mm, rss[0], rss[1], rss[2]);
pte_unmap_unlock(dst_pte - 1, dst_ptl);
cond_resched();
if (addr != end)
@@ -769,6 +774,7 @@ static unsigned long zap_pte_range(struc
spinlock_t *ptl;
int file_rss = 0;
int anon_rss = 0;
+ int swaps = 0;

pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -838,13 +844,19 @@ static unsigned long zap_pte_range(struc
if (pte_file(ptent)) {
if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
print_bad_pte(vma, addr, ptent, NULL);
- } else if
- (unlikely(!free_swap_and_cache(pte_to_swp_entry(ptent))))
- print_bad_pte(vma, addr, ptent, NULL);
+ } else {
+ swp_entry_t entry = pte_to_swp_entry(ptent);
+
+ if (!is_migration_entry(entry))
+ swaps++;
+
+ if (unlikely(!free_swap_and_cache(entry)))
+ print_bad_pte(vma, addr, ptent, NULL);
+ }
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));

- add_mm_rss(mm, file_rss, anon_rss);
+ add_mm_rss(mm, file_rss, anon_rss, swaps);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);

@@ -2573,6 +2585,7 @@ static int do_swap_page(struct mm_struct
*/

inc_mm_counter(mm, anon_rss);
+ dec_mm_counter(mm, swap_usage);
pte = mk_pte(page, vma->vm_page_prot);
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
Index: linux-2.6.31/mm/rmap.c
===================================================================
--- linux-2.6.31.orig/mm/rmap.c
+++ linux-2.6.31/mm/rmap.c
@@ -834,6 +834,7 @@ static int try_to_unmap_one(struct page
spin_unlock(&mmlist_lock);
}
dec_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, swap_usage);
} else if (PAGE_MIGRATION) {
/*
* Store the pfn of the page in a special migration
Index: linux-2.6.31/mm/swapfile.c
===================================================================
--- linux-2.6.31.orig/mm/swapfile.c
+++ linux-2.6.31/mm/swapfile.c
@@ -830,6 +830,7 @@ static int unuse_pte(struct vm_area_stru
}

inc_mm_counter(vma->vm_mm, anon_rss);
+ dec_mm_counter(vma->vm_mm, swap_usage);
get_page(page);
set_pte_at(vma->vm_mm, addr, pte,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
Index: linux-2.6.31/mm/oom_kill.c
===================================================================
--- linux-2.6.31.orig/mm/oom_kill.c
+++ linux-2.6.31/mm/oom_kill.c
@@ -69,7 +69,8 @@ unsigned long badness(struct task_struct
/*
* The memory size of the process is the basis for the badness.
*/
- points = mm->total_vm;
+ points = get_mm_counter(mm, anon_rss) + get_mm_counter(mm, file_rss)
+ + get_mm_counter(mm, swap_usage);

/*
* After this unlock we can no longer dereference local variable `mm'
@@ -92,8 +93,13 @@ unsigned long badness(struct task_struct
*/
list_for_each_entry(child, &p->children, sibling) {
task_lock(child);
- if (child->mm != mm && child->mm)
- points += child->mm->total_vm/2 + 1;
+ if (child->mm != mm && child->mm) {
+ unsigned long cpoint;
+ /* At considering child, we don't count swap */
+ cpoint = get_mm_counter(child->mm, anon_rss) +
+ get_mm_counter(child->mm, file_rss);
+ points += cpoint/2 + 1;
+ }
task_unlock(child);
}


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


minchan.kim at gmail

Oct 27, 2009, 12:56 AM

Post #2 of 17 (336 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

On Tue, 27 Oct 2009 16:45:26 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu [at] jp> wrote:

> On Tue, 27 Oct 2009 15:55:26 +0900
> Minchan Kim <minchan.kim [at] gmail> wrote:
>
> > >> Hmm.
> > >> I wonder why we consider VM size for OOM kiling.
> > >> How about RSS size?
> > >>
> > >
> > > Maybe the current code assumes "Tons of swap have been generated, already" if
> > > oom-kill is invoked. Then, just using mm->anon_rss will not be correct.
> > >
> > > Hm, should we count # of swap entries reference from mm ?....
> >
> > In Vedran case, he didn't use swap. So, Only considering vm is the problem.
> > I think it would be better to consider both RSS + # of swap entries as
> > Kosaki mentioned.
> >
> Then, maybe this kind of patch is necessary.
> This is on 2.6.31...then I may have to rebase this to mmotom.
> Added more CCs.
>
> Vedran, I'm glad if you can test this patch.
>
>
> ==
> Now, oom-killer's score uses mm->total_vm as its base value.
> But, in these days, applications like GUI program tend to use
> much shared libraries and total_vm grows too high even when
> pages are not fully mapped.
>
> For example, running a program "mmap" which allocates 1 GBbytes of
> anonymous memory, oom_score top 10 on system will be..
>
> score PID name
> 89924 3938 mixer_applet2
> 90210 3942 tomboy
> 94753 3936 clock-applet
> 101994 3919 pulseaudio
> 113525 4028 gnome-terminal
> 127340 1 init
> 128177 3871 nautilus
> 151003 11515 bash
> 256944 11653 mmap <-----------------use 1G of anon
> 425561 3829 gnome-session
>
> No one believes gnome-session is more guilty than "mmap".
>
> Instead of total_vm, we should use anon/file/swap usage of a process, I think.
> This patch adds mm->swap_usage and calculate oom_score based on
> anon_rss + file_rss + swap_usage.
> Considering usual applications, this will be much better information than
> total_vm. After this patch, the score on my desktop is
>
> score PID name
> 4033 3176 gnome-panel
> 4077 3113 xinit
> 4526 3190 python
> 4820 3161 gnome-settings-
> 4989 3289 gnome-terminal
> 7105 3271 tomboy
> 8427 3177 nautilus
> 17549 3140 gnome-session
> 128501 3299 bash
> 256106 3383 mmap
>
> This order is not bad, I think.
>
> Note: This adss new counter...then new cost is added.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu [at] jp>

Thanks for making the patch.
Let's hear other's opinion. :)

--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kamezawa.hiroyu at jp

Oct 27, 2009, 12:56 AM

Post #3 of 17 (338 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

On Tue, 27 Oct 2009 16:45:26 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu [at] jp> wrote:
/*
> * After this unlock we can no longer dereference local variable `mm'
> @@ -92,8 +93,13 @@ unsigned long badness(struct task_struct
> */
> list_for_each_entry(child, &p->children, sibling) {
> task_lock(child);
> - if (child->mm != mm && child->mm)
> - points += child->mm->total_vm/2 + 1;
> + if (child->mm != mm && child->mm) {
> + unsigned long cpoint;
> + /* At considering child, we don't count swap */
> + cpoint = get_mm_counter(child->mm, anon_rss) +
> + get_mm_counter(child->mm, file_rss);
> + points += cpoint/2 + 1;
> + }
> task_unlock(child);

BTW, I'd like to get rid of this code.

Can't we use other techniques for detecting fork-bomb ?

This check can't catch following type, anyway.

fork()
-> fork()
-> fork()
-> fork()
....

but I have no good idea.
What is the difference with task-launcher and fork bomb()...

Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


minchan.kim at gmail

Oct 27, 2009, 1:14 AM

Post #4 of 17 (327 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

On Tue, 27 Oct 2009 16:56:28 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu [at] jp> wrote:

> On Tue, 27 Oct 2009 16:45:26 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu [at] jp> wrote:
> /*
> > * After this unlock we can no longer dereference local variable `mm'
> > @@ -92,8 +93,13 @@ unsigned long badness(struct task_struct
> > */
> > list_for_each_entry(child, &p->children, sibling) {
> > task_lock(child);
> > - if (child->mm != mm && child->mm)
> > - points += child->mm->total_vm/2 + 1;
> > + if (child->mm != mm && child->mm) {
> > + unsigned long cpoint;
> > + /* At considering child, we don't count swap */
> > + cpoint = get_mm_counter(child->mm, anon_rss) +
> > + get_mm_counter(child->mm, file_rss);
> > + points += cpoint/2 + 1;
> > + }
> > task_unlock(child);
>
> BTW, I'd like to get rid of this code.
>
> Can't we use other techniques for detecting fork-bomb ?
>
> This check can't catch following type, anyway.
>
> fork()
> -> fork()
> -> fork()
> -> fork()
> ....
>
> but I have no good idea.
> What is the difference with task-launcher and fork bomb()...
>

I think it's good as-is.
Kernel is hard to know it by effiecient method.
It depends on applications. so Doesnt's task-launcher
like gnome-session have to control his oom_score?

Welcome to any ideas if kernel can do it well.

> Thanks,
> -Kame
>


--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kamezawa.hiroyu at jp

Oct 27, 2009, 1:33 AM

Post #5 of 17 (326 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

On Tue, 27 Oct 2009 17:14:41 +0900
Minchan Kim <minchan.kim [at] gmail> wrote:

> On Tue, 27 Oct 2009 16:56:28 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu [at] jp> wrote:
>
> > On Tue, 27 Oct 2009 16:45:26 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu [at] jp> wrote:
> > /*
> > > * After this unlock we can no longer dereference local variable `mm'
> > > @@ -92,8 +93,13 @@ unsigned long badness(struct task_struct
> > > */
> > > list_for_each_entry(child, &p->children, sibling) {
> > > task_lock(child);
> > > - if (child->mm != mm && child->mm)
> > > - points += child->mm->total_vm/2 + 1;
> > > + if (child->mm != mm && child->mm) {
> > > + unsigned long cpoint;
> > > + /* At considering child, we don't count swap */
> > > + cpoint = get_mm_counter(child->mm, anon_rss) +
> > > + get_mm_counter(child->mm, file_rss);
> > > + points += cpoint/2 + 1;
> > > + }
> > > task_unlock(child);
> >
> > BTW, I'd like to get rid of this code.
> >
> > Can't we use other techniques for detecting fork-bomb ?
> >
> > This check can't catch following type, anyway.
> >
> > fork()
> > -> fork()
> > -> fork()
> > -> fork()
> > ....
> >
> > but I have no good idea.
> > What is the difference with task-launcher and fork bomb()...
> >
>
> I think it's good as-is.
> Kernel is hard to know it by effiecient method.
> It depends on applications. so Doesnt's task-launcher
> like gnome-session have to control his oom_score?
>
> Welcome to any ideas if kernel can do it well.
>
Hmmm, check system-wide fork/sec and fork-depth ? Maybe not difficult to calculate..

Regards,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


minchan.kim at gmail

Oct 27, 2009, 1:52 AM

Post #6 of 17 (325 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

On Tue, 27 Oct 2009 17:33:08 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu [at] jp> wrote:

> On Tue, 27 Oct 2009 17:14:41 +0900
> Minchan Kim <minchan.kim [at] gmail> wrote:
>
> > On Tue, 27 Oct 2009 16:56:28 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu [at] jp> wrote:
> >
> > > On Tue, 27 Oct 2009 16:45:26 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu [at] jp> wrote:
> > > /*
> > > > * After this unlock we can no longer dereference local variable `mm'
> > > > @@ -92,8 +93,13 @@ unsigned long badness(struct task_struct
> > > > */
> > > > list_for_each_entry(child, &p->children, sibling) {
> > > > task_lock(child);
> > > > - if (child->mm != mm && child->mm)
> > > > - points += child->mm->total_vm/2 + 1;
> > > > + if (child->mm != mm && child->mm) {
> > > > + unsigned long cpoint;
> > > > + /* At considering child, we don't count swap */
> > > > + cpoint = get_mm_counter(child->mm, anon_rss) +
> > > > + get_mm_counter(child->mm, file_rss);
> > > > + points += cpoint/2 + 1;
> > > > + }
> > > > task_unlock(child);
> > >
> > > BTW, I'd like to get rid of this code.
> > >
> > > Can't we use other techniques for detecting fork-bomb ?
> > >
> > > This check can't catch following type, anyway.
> > >
> > > fork()
> > > -> fork()
> > > -> fork()
> > > -> fork()
> > > ....
> > >
> > > but I have no good idea.
> > > What is the difference with task-launcher and fork bomb()...
> > >
> >
> > I think it's good as-is.
> > Kernel is hard to know it by effiecient method.
> > It depends on applications. so Doesnt's task-launcher
> > like gnome-session have to control his oom_score?
> >
> > Welcome to any ideas if kernel can do it well.
> >
> Hmmm, check system-wide fork/sec and fork-depth ? Maybe not difficult to calculate..

Yes. We can do anything to achieve the goal in kernel.
Maybe check the time or fork-depth counting.
What I have a concern is how we can do it nicely if it is a serious
problem in kernel. ;)

I think most of program which have many child are victims of OOM killing.
It make sense to me. There is some cases to not make sense like task-launcher.
So I think if task-launcher which is very rare and special program can change
oom_adj by itself, it's good than thing that add new heuristic in kernel.

It's just my opinon. :)

> Regards,
> -Kame
>


--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kamezawa.hiroyu at jp

Oct 27, 2009, 1:56 AM

Post #7 of 17 (328 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

On Tue, 27 Oct 2009 17:52:43 +0900
Minchan Kim <minchan.kim [at] gmail> wrote:

> On Tue, 27 Oct 2009 17:33:08 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu [at] jp> wrote:
>
> > On Tue, 27 Oct 2009 17:14:41 +0900
> > Minchan Kim <minchan.kim [at] gmail> wrote:
> >
> > > On Tue, 27 Oct 2009 16:56:28 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu [at] jp> wrote:
> > >
> > > > On Tue, 27 Oct 2009 16:45:26 +0900
> > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu [at] jp> wrote:
> > > > /*
> > > > > * After this unlock we can no longer dereference local variable `mm'
> > > > > @@ -92,8 +93,13 @@ unsigned long badness(struct task_struct
> > > > > */
> > > > > list_for_each_entry(child, &p->children, sibling) {
> > > > > task_lock(child);
> > > > > - if (child->mm != mm && child->mm)
> > > > > - points += child->mm->total_vm/2 + 1;
> > > > > + if (child->mm != mm && child->mm) {
> > > > > + unsigned long cpoint;
> > > > > + /* At considering child, we don't count swap */
> > > > > + cpoint = get_mm_counter(child->mm, anon_rss) +
> > > > > + get_mm_counter(child->mm, file_rss);
> > > > > + points += cpoint/2 + 1;
> > > > > + }
> > > > > task_unlock(child);
> > > >
> > > > BTW, I'd like to get rid of this code.
> > > >
> > > > Can't we use other techniques for detecting fork-bomb ?
> > > >
> > > > This check can't catch following type, anyway.
> > > >
> > > > fork()
> > > > -> fork()
> > > > -> fork()
> > > > -> fork()
> > > > ....
> > > >
> > > > but I have no good idea.
> > > > What is the difference with task-launcher and fork bomb()...
> > > >
> > >
> > > I think it's good as-is.
> > > Kernel is hard to know it by effiecient method.
> > > It depends on applications. so Doesnt's task-launcher
> > > like gnome-session have to control his oom_score?
> > >
> > > Welcome to any ideas if kernel can do it well.
> > >
> > Hmmm, check system-wide fork/sec and fork-depth ? Maybe not difficult to calculate..
>
> Yes. We can do anything to achieve the goal in kernel.
> Maybe check the time or fork-depth counting.
> What I have a concern is how we can do it nicely if it is a serious
> problem in kernel. ;)
>
yes...only the user knows whether user is wrong, finally. Especially in case
of memory leak.

> I think most of program which have many child are victims of OOM killing.
> It make sense to me. There is some cases to not make sense like task-launcher.
> So I think if task-launcher which is very rare and special program can change
> oom_adj by itself, it's good than thing that add new heuristic in kernel.
>
> It's just my opinon. :)
>
I know KDE already adjsut oom_adj for their 3.5 release ;)
Okay, concentrate on avoiding total_vm issue for a while.

Thanks,
-Kame


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


aarcange at redhat

Oct 27, 2009, 5:38 AM

Post #8 of 17 (308 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

On Tue, Oct 27, 2009 at 04:56:12PM +0900, Minchan Kim wrote:
> Thanks for making the patch.
> Let's hear other's opinion. :)

total_vm is nearly meaningless, especially on 64bit that reduces the
mmap load on libs, I tried to change it to something "physical" (rss,
didn't add swap too) some time ago too, not sure why I didn't manage
to get it in. Trying again surely sounds good. Accounting swap isn't
necessarily good, we may be killing a task that isn't accessing memory
at all. So yes, we free swap but if the task is the "bloater" it's
unlikely to be all in swap as it did all recent activity that lead to
the oom. So I'm unsure if swap is good to account here, but surely I
ack to replace virtual with rss. I would include the whole rss, as the
file one may also be rendered unswappable if it is accessed in a loop
refreshing the young bit all the time.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


vedran.furac at gmail

Oct 27, 2009, 10:41 AM

Post #9 of 17 (304 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

KAMEZAWA Hiroyuki wrote:

> On Tue, 27 Oct 2009 15:55:26 +0900
> Minchan Kim <minchan.kim [at] gmail> wrote:
>
>>>> Hmm.
>>>> I wonder why we consider VM size for OOM kiling.
>>>> How about RSS size?
>>>>
>>> Maybe the current code assumes "Tons of swap have been generated, already" if
>>> oom-kill is invoked. Then, just using mm->anon_rss will not be correct.
>>>
>>> Hm, should we count # of swap entries reference from mm ?....
>> In Vedran case, he didn't use swap. So, Only considering vm is the problem.
>> I think it would be better to consider both RSS + # of swap entries as
>> Kosaki mentioned.
>>
> Then, maybe this kind of patch is necessary.
> This is on 2.6.31...then I may have to rebase this to mmotom.
> Added more CCs.
>
> Vedran, I'm glad if you can test this patch.

Thanks for the patch! I'll test it during this week a report after that.

> Instead of total_vm, we should use anon/file/swap usage of a process, I think.
> This patch adds mm->swap_usage and calculate oom_score based on
> anon_rss + file_rss + swap_usage.

Isn't file_rss shared between processes? Sorry, I'm newbie. :)

% pmap $(pidof test)
29049: ./test
0000000000400000 4K r-x-- /home/vedranf/dev/tmp/test
0000000000600000 4K rw--- /home/vedranf/dev/tmp/test
00002ba362a80000 116K r-x-- /lib/ld-2.10.1.so
00002ba362a9d000 12K rw--- [ anon ]
00002ba362c9c000 4K r---- /lib/ld-2.10.1.so
00002ba362c9d000 4K rw--- /lib/ld-2.10.1.so
00002ba362c9e000 1320K r-x-- /lib/libc-2.10.1.so
00002ba362de8000 2044K ----- /lib/libc-2.10.1.so
00002ba362fe7000 16K r---- /lib/libc-2.10.1.so
00002ba362feb000 4K rw--- /lib/libc-2.10.1.so
00002ba362fec000 1024028K rw--- [ anon ] // <-- This
00007ffff4618000 84K rw--- [ stack ]
00007ffff47b7000 4K r-x-- [ anon ]
ffffffffff600000 4K r-x-- [ anon ]
total 1027648K

I would just look at anon if that's OK (or possible).

> Considering usual applications, this will be much better information than
> total_vm.

Agreed.

> score PID name
> 4033 3176 gnome-panel
> 4077 3113 xinit
> 4526 3190 python
> 4820 3161 gnome-settings-
> 4989 3289 gnome-terminal
> 7105 3271 tomboy
> 8427 3177 nautilus
> 17549 3140 gnome-session
> 128501 3299 bash
> 256106 3383 mmap
>
> This order is not bad, I think.

Yes, this looks much better now. Bash is only having somewhat strangely
high score.

Regards,

Vedran


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


hugh.dickins at tiscali

Oct 27, 2009, 11:39 AM

Post #10 of 17 (303 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

On Tue, 27 Oct 2009, KAMEZAWA Hiroyuki wrote:
> Now, oom-killer's score uses mm->total_vm as its base value.
> But, in these days, applications like GUI program tend to use
> much shared libraries and total_vm grows too high even when
> pages are not fully mapped.
>
> For example, running a program "mmap" which allocates 1 GBbytes of
> anonymous memory, oom_score top 10 on system will be..
>
> score PID name
> 89924 3938 mixer_applet2
> 90210 3942 tomboy
> 94753 3936 clock-applet
> 101994 3919 pulseaudio
> 113525 4028 gnome-terminal
> 127340 1 init
> 128177 3871 nautilus
> 151003 11515 bash
> 256944 11653 mmap <-----------------use 1G of anon
> 425561 3829 gnome-session
>
> No one believes gnome-session is more guilty than "mmap".
>
> Instead of total_vm, we should use anon/file/swap usage of a process, I think.
> This patch adds mm->swap_usage and calculate oom_score based on
> anon_rss + file_rss + swap_usage.
> Considering usual applications, this will be much better information than
> total_vm. After this patch, the score on my desktop is
>
> score PID name
> 4033 3176 gnome-panel
> 4077 3113 xinit
> 4526 3190 python
> 4820 3161 gnome-settings-
> 4989 3289 gnome-terminal
> 7105 3271 tomboy
> 8427 3177 nautilus
> 17549 3140 gnome-session
> 128501 3299 bash
> 256106 3383 mmap
>
> This order is not bad, I think.
>
> Note: This adss new counter...then new cost is added.

I've often thought we ought to supply such a swap_usage statistic;
and show it in /proc/pid/statsomething, presumably VmSwap in
/proc/pid/status, even an additional field on the end of statm.

A slight new cost, yes: doesn't matter at the swapping end, but
would slightly impact fork and exit - I do hope we can afford it,
because I think it should have been available all along.

I've not checked your patch in detail; but I do agree that basing
OOM (physical memory) decisions on total_vm (virtual memory) has
seemed weird, so it's well worth trying this approach. Whether swap
should be included along with rss isn't quite clear to me: I'm not
saying you're wrong, not at all, just that it's not quite obvious.

I've several observations to make about bad OOM kill decisions,
but it's probably better that I make them in the original
"Memory overcommit" thread, rather than divert this thread.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


aarcange at redhat

Oct 27, 2009, 11:47 AM

Post #11 of 17 (306 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

On Tue, Oct 27, 2009 at 06:39:07PM +0000, Hugh Dickins wrote:
> OOM (physical memory) decisions on total_vm (virtual memory) has
> seemed weird, so it's well worth trying this approach. Whether swap

It is weird and wrong, I strongly support fixing it once and for
all. The oom killing should be based on physical info, total_vm is
a very rough approximation of the real info we're interested about
(real RAM utilization of the task).

> should be included along with rss isn't quite clear to me: I'm not
> saying you're wrong, not at all, just that it's not quite obvious.

Agreed it's not obvious. Intuitively I think only including RSS and no
swap is best, but clearly I can't be entirely against including swap
too as there may be scenarios where including swap provides for a
better choice.

My argument for not including swap is that we kill tasks to free RAM
(we don't really care to free swap, system needs RAM at oom time).
Freeing swap won't immediately help because no RAM is freed when swap
is released (sure other tasks that sits huge in RAM can be moved to
swap after swap isn't full but if we immediately killed those tasks
that were huge in RAM in the first place we'd be better off).

> I've several observations to make about bad OOM kill decisions,
> but it's probably better that I make them in the original
> "Memory overcommit" thread, rather than divert this thread.

:)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kamezawa.hiroyu at jp

Oct 27, 2009, 5:13 PM

Post #12 of 17 (295 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

On Tue, 27 Oct 2009 18:41:22 +0100
Vedran Furač <vedran.furac [at] gmail> wrote:

> KAMEZAWA Hiroyuki wrote:
>
> > On Tue, 27 Oct 2009 15:55:26 +0900
> > Minchan Kim <minchan.kim [at] gmail> wrote:
> >
> >>>> Hmm.
> >>>> I wonder why we consider VM size for OOM kiling.
> >>>> How about RSS size?
> >>>>
> >>> Maybe the current code assumes "Tons of swap have been generated, already" if
> >>> oom-kill is invoked. Then, just using mm->anon_rss will not be correct.
> >>>
> >>> Hm, should we count # of swap entries reference from mm ?....
> >> In Vedran case, he didn't use swap. So, Only considering vm is the problem.
> >> I think it would be better to consider both RSS + # of swap entries as
> >> Kosaki mentioned.
> >>
> > Then, maybe this kind of patch is necessary.
> > This is on 2.6.31...then I may have to rebase this to mmotom.
> > Added more CCs.
> >
> > Vedran, I'm glad if you can test this patch.
>
> Thanks for the patch! I'll test it during this week a report after that.
>
> > Instead of total_vm, we should use anon/file/swap usage of a process, I think.
> > This patch adds mm->swap_usage and calculate oom_score based on
> > anon_rss + file_rss + swap_usage.
>
> Isn't file_rss shared between processes? Sorry, I'm newbie. :)
>
It's shared. But in typical case, file_rss will very small at OOM.


> % pmap $(pidof test)
> 29049: ./test
> 0000000000400000 4K r-x-- /home/vedranf/dev/tmp/test
> 0000000000600000 4K rw--- /home/vedranf/dev/tmp/test
> 00002ba362a80000 116K r-x-- /lib/ld-2.10.1.so
> 00002ba362a9d000 12K rw--- [ anon ]
> 00002ba362c9c000 4K r---- /lib/ld-2.10.1.so
> 00002ba362c9d000 4K rw--- /lib/ld-2.10.1.so
> 00002ba362c9e000 1320K r-x-- /lib/libc-2.10.1.so
> 00002ba362de8000 2044K ----- /lib/libc-2.10.1.so
> 00002ba362fe7000 16K r---- /lib/libc-2.10.1.so
> 00002ba362feb000 4K rw--- /lib/libc-2.10.1.so
> 00002ba362fec000 1024028K rw--- [ anon ] // <-- This
> 00007ffff4618000 84K rw--- [ stack ]
> 00007ffff47b7000 4K r-x-- [ anon ]
> ffffffffff600000 4K r-x-- [ anon ]
> total 1027648K
>
> I would just look at anon if that's OK (or possible).
>
> > Considering usual applications, this will be much better information than
> > total_vm.
>
> Agreed.
>
> > score PID name
> > 4033 3176 gnome-panel
> > 4077 3113 xinit
> > 4526 3190 python
> > 4820 3161 gnome-settings-
> > 4989 3289 gnome-terminal
> > 7105 3271 tomboy
> > 8427 3177 nautilus
> > 17549 3140 gnome-session
> > 128501 3299 bash
> > 256106 3383 mmap
> >
> > This order is not bad, I think.
>
> Yes, this looks much better now. Bash is only having somewhat strangely
> high score.
>
It gets half score of mmap....If mmap goes, bash's score will goes down
dramatically. I'll read other's comments and tweak this patch more.

Thanks,
-Kame



> Regards,
>
> Vedran
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo [at] kvack For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont [at] kvack"> email [at] kvack </a>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kamezawa.hiroyu at jp

Oct 27, 2009, 5:22 PM

Post #13 of 17 (298 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

On Tue, 27 Oct 2009 13:38:10 +0100
Andrea Arcangeli <aarcange [at] redhat> wrote:

> On Tue, Oct 27, 2009 at 04:56:12PM +0900, Minchan Kim wrote:
> > Thanks for making the patch.
> > Let's hear other's opinion. :)
>
> total_vm is nearly meaningless, especially on 64bit that reduces the
> mmap load on libs, I tried to change it to something "physical" (rss,
> didn't add swap too) some time ago too, not sure why I didn't manage
> to get it in. Trying again surely sounds good. Accounting swap isn't
> necessarily good, we may be killing a task that isn't accessing memory
> at all. So yes, we free swap but if the task is the "bloater" it's
> unlikely to be all in swap as it did all recent activity that lead to
> the oom. So I'm unsure if swap is good to account here, but surely I
> ack to replace virtual with rss. I would include the whole rss, as the
> file one may also be rendered unswappable if it is accessed in a loop
> refreshing the young bit all the time.
>
I wonder I'll acccounting swap and export it via /proc/<pid>/??? file.
So, I'll divide this patch into 2 part as swap accounting/oom patch.

Considering amount of swap at oom isn't very bad, I think. But using the
same weight to rss and swap is not good, maybe.

Hmm, maybe
anon_rss + file_rss/2 + swap_usage/4 + kosaki's time accounting change
can give us some better value. I'll consider what number is logical and
technically correct, again.

I'll prepare series of 2-4? patches.

Thanks,
-Kame


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kamezawa.hiroyu at jp

Oct 27, 2009, 5:28 PM

Post #14 of 17 (327 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

On Tue, 27 Oct 2009 18:39:07 +0000 (GMT)
Hugh Dickins <hugh.dickins [at] tiscali> wrote:

> On Tue, 27 Oct 2009, KAMEZAWA Hiroyuki wrote:
> > Now, oom-killer's score uses mm->total_vm as its base value.
> > But, in these days, applications like GUI program tend to use
> > much shared libraries and total_vm grows too high even when
> > pages are not fully mapped.
> >
> > For example, running a program "mmap" which allocates 1 GBbytes of
> > anonymous memory, oom_score top 10 on system will be..
> >
> > score PID name
> > 89924 3938 mixer_applet2
> > 90210 3942 tomboy
> > 94753 3936 clock-applet
> > 101994 3919 pulseaudio
> > 113525 4028 gnome-terminal
> > 127340 1 init
> > 128177 3871 nautilus
> > 151003 11515 bash
> > 256944 11653 mmap <-----------------use 1G of anon
> > 425561 3829 gnome-session
> >
> > No one believes gnome-session is more guilty than "mmap".
> >
> > Instead of total_vm, we should use anon/file/swap usage of a process, I think.
> > This patch adds mm->swap_usage and calculate oom_score based on
> > anon_rss + file_rss + swap_usage.
> > Considering usual applications, this will be much better information than
> > total_vm. After this patch, the score on my desktop is
> >
> > score PID name
> > 4033 3176 gnome-panel
> > 4077 3113 xinit
> > 4526 3190 python
> > 4820 3161 gnome-settings-
> > 4989 3289 gnome-terminal
> > 7105 3271 tomboy
> > 8427 3177 nautilus
> > 17549 3140 gnome-session
> > 128501 3299 bash
> > 256106 3383 mmap
> >
> > This order is not bad, I think.
> >
> > Note: This adss new counter...then new cost is added.
>
> I've often thought we ought to supply such a swap_usage statistic;
> and show it in /proc/pid/statsomething, presumably VmSwap in
> /proc/pid/status, even an additional field on the end of statm.
>
Hm, ok. I'll divide this patch into

- replace total_vm with anon_rss + file_rsss (everyone will agree this.)
- add swap usage accounting
- show it via /proc (may need discuss about its style.)
- use the value at oom calculation (need discuss)

> A slight new cost, yes: doesn't matter at the swapping end, but
> would slightly impact fork and exit - I do hope we can afford it,
> because I think it should have been available all along.
>
fork()/exit() uses batched counting. Then, we don't see overhead.


> I've not checked your patch in detail; but I do agree that basing
> OOM (physical memory) decisions on total_vm (virtual memory) has
> seemed weird, so it's well worth trying this approach. Whether swap
> should be included along with rss isn't quite clear to me: I'm not
> saying you're wrong, not at all, just that it's not quite obvious.
>
yes. It just comes from heuristics. It will need discuss/investigation/theory.


> I've several observations to make about bad OOM kill decisions,
> but it's probably better that I make them in the original
> "Memory overcommit" thread, rather than divert this thread.
>

Thanks,
-Kame


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


kamezawa.hiroyu at jp

Oct 27, 2009, 5:32 PM

Post #15 of 17 (294 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

On Tue, 27 Oct 2009 19:47:43 +0100
Andrea Arcangeli <aarcange [at] redhat> wrote:
> > should be included along with rss isn't quite clear to me: I'm not
> > saying you're wrong, not at all, just that it's not quite obvious.
>
> Agreed it's not obvious. Intuitively I think only including RSS and no
> swap is best, but clearly I can't be entirely against including swap
> too as there may be scenarios where including swap provides for a
> better choice.
>
> My argument for not including swap is that we kill tasks to free RAM
> (we don't really care to free swap, system needs RAM at oom time).
> Freeing swap won't immediately help because no RAM is freed when swap
> is released (sure other tasks that sits huge in RAM can be moved to
> swap after swap isn't full but if we immediately killed those tasks
> that were huge in RAM in the first place we'd be better off).
>
Okay.

As first step, I'll divide this into
- replace total_vm with anon_rss/file_rss patch
- swap accounting
- a patch for consider whether swap amount should be included or not.

Then, necessary part will go early. And backport will be easy.

Thanks,
-Kame


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


vedran.furac at gmail

Oct 27, 2009, 5:45 PM

Post #16 of 17 (294 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

KAMEZAWA Hiroyuki wrote:

> Hmm, maybe
> anon_rss + file_rss/2 + swap_usage/4 + kosaki's time accounting change
> can give us some better value. I'll consider what number is logical and
> technically correct, again.

Although my vote doesn't count, from my experience, this formula sounds
like optimal solution. Thanks, hope it gets accepted!

Regards,

Vedran


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


pavel at ucw

Nov 5, 2009, 11:02 AM

Post #17 of 17 (263 views)
Permalink
Re: [RFC][PATCH] oom_kill: avoid depends on total_vm and use real RSS/swap value for oom_score (Re: Memory overcommit [In reply to]

Hi!

> Agreed it's not obvious. Intuitively I think only including RSS and no
> swap is best, but clearly I can't be entirely against including swap
> too as there may be scenarios where including swap provides for a
> better choice.
>
> My argument for not including swap is that we kill tasks to free RAM
> (we don't really care to free swap, system needs RAM at oom time).

System should be out of _virtual_ memory at that point, so yes,
freeing swap should help, too.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Linux kernel RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.