Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Linux: Kernel

[patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant

 

 

Linux kernel RSS feed   Index | Next | Previous | View Threaded


efault at gmx

Nov 22, 2009, 4:09 AM

Post #1 of 10 (229 views)
Permalink
[patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant

sched: fix set_task_cpu() and provide an unlocked runqueue variant.

set_task_cpu() falsifies migration stats by unconditionally generating migration
stats whether a task's cpu actually changed or not. As used in copy_process(),
the runqueue is unlocked, so we need to provide an unlocked variant which does
the locking to provide a write barrier.

Signed-off-by: Mike Galbraith <efault [at] gmx>
Cc: Ingo Molnar <mingo [at] elte>
Cc: Peter Zijlstra <a.p.zijlstra [at] chello>
LKML-Reference: <new-submission>

---
include/linux/sched.h | 5 +++++
kernel/fork.c | 2 +-
kernel/sched.c | 38 ++++++++++++++++++++++++--------------
3 files changed, 30 insertions(+), 15 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2056,7 +2056,6 @@ task_hot(struct task_struct *p, u64 now,
return delta < (s64)sysctl_sched_migration_cost;
}

-
void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
int old_cpu = task_cpu(p);
@@ -2065,10 +2064,10 @@ void set_task_cpu(struct task_struct *p,
*new_cfsrq = cpu_cfs_rq(old_cfsrq, new_cpu);
u64 clock_offset;

- clock_offset = old_rq->clock - new_rq->clock;
-
- trace_sched_migrate_task(p, new_cpu);
+ if (unlikely(old_cpu == new_cpu))
+ goto out;

+ clock_offset = old_rq->clock - new_rq->clock;
#ifdef CONFIG_SCHEDSTATS
if (p->se.wait_start)
p->se.wait_start -= clock_offset;
@@ -2076,22 +2075,33 @@ void set_task_cpu(struct task_struct *p,
p->se.sleep_start -= clock_offset;
if (p->se.block_start)
p->se.block_start -= clock_offset;
+ if (task_hot(p, old_rq->clock, NULL))
+ schedstat_inc(p, se.nr_forced2_migrations);
#endif
- if (old_cpu != new_cpu) {
- p->se.nr_migrations++;
-#ifdef CONFIG_SCHEDSTATS
- if (task_hot(p, old_rq->clock, NULL))
- schedstat_inc(p, se.nr_forced2_migrations);
-#endif
- perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS,
- 1, 1, NULL, 0);
- }
p->se.vruntime -= old_cfsrq->min_vruntime -
- new_cfsrq->min_vruntime;
+ new_cfsrq->min_vruntime;
+ p->se.nr_migrations++;
+ trace_sched_migrate_task(p, new_cpu);
+ perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 1, NULL, 0);

+out:
__set_task_cpu(p, new_cpu);
}

+void set_task_cpu_unlocked(struct task_struct *p, unsigned int new_cpu)
+{
+ unsigned long flags;
+ struct rq *rq, *new_rq = cpu_rq(new_cpu);
+
+ smp_wmb();
+ rq = task_rq_lock(p, &flags);
+ update_rq_clock(rq);
+ if (rq != new_rq)
+ update_rq_clock(new_rq);
+ set_task_cpu(p, new_cpu);
+ task_rq_unlock(rq, &flags);
+}
+
struct migration_req {
struct list_head list;

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -2457,6 +2457,7 @@ static inline unsigned int task_cpu(cons
}

extern void set_task_cpu(struct task_struct *p, unsigned int cpu);
+extern void set_task_cpu_unlocked(struct task_struct *p, unsigned int cpu);

#else

@@ -2469,6 +2470,10 @@ static inline void set_task_cpu(struct t
{
}

+static inline void set_task_cpu_unlocked(struct task_struct *p, unsigned int cpu)
+{
+}
+
#endif /* CONFIG_SMP */

extern void arch_pick_mmap_layout(struct mm_struct *mm);
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -1242,7 +1242,7 @@ static struct task_struct *copy_process(
p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed;
if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
!cpu_online(task_cpu(p))))
- set_task_cpu(p, smp_processor_id());
+ set_task_cpu_unlocked(p, smp_processor_id());

/* CLONE_PARENT re-uses the old parent */
if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


peterz at infradead

Nov 25, 2009, 10:27 AM

Post #2 of 10 (203 views)
Permalink
Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant [In reply to]

On Sun, 2009-11-22 at 13:09 +0100, Mike Galbraith wrote:
> sched: fix set_task_cpu() and provide an unlocked runqueue variant.
>
> set_task_cpu() falsifies migration stats by unconditionally generating migration
> stats whether a task's cpu actually changed or not. As used in copy_process(),
> the runqueue is unlocked, so we need to provide an unlocked variant which does
> the locking to provide a write barrier.
>
> Signed-off-by: Mike Galbraith <efault [at] gmx>
> Cc: Ingo Molnar <mingo [at] elte>
> Cc: Peter Zijlstra <a.p.zijlstra [at] chello>
> LKML-Reference: <new-submission>
>
> ---

> +void set_task_cpu_unlocked(struct task_struct *p, unsigned int new_cpu)
> +{
> + unsigned long flags;
> + struct rq *rq, *new_rq = cpu_rq(new_cpu);
> +
> + smp_wmb();
> + rq = task_rq_lock(p, &flags);
> + update_rq_clock(rq);
> + if (rq != new_rq)
> + update_rq_clock(new_rq);
> + set_task_cpu(p, new_cpu);
> + task_rq_unlock(rq, &flags);
> +}

I've got to ask, what's that barrier for?



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


efault at gmx

Nov 25, 2009, 5:01 PM

Post #3 of 10 (204 views)
Permalink
Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant [In reply to]

On Wed, 2009-11-25 at 19:27 +0100, Peter Zijlstra wrote:
> On Sun, 2009-11-22 at 13:09 +0100, Mike Galbraith wrote:
> > sched: fix set_task_cpu() and provide an unlocked runqueue variant.
> >
> > set_task_cpu() falsifies migration stats by unconditionally generating migration
> > stats whether a task's cpu actually changed or not. As used in copy_process(),
> > the runqueue is unlocked, so we need to provide an unlocked variant which does
> > the locking to provide a write barrier.
> >
> > Signed-off-by: Mike Galbraith <efault [at] gmx>
> > Cc: Ingo Molnar <mingo [at] elte>
> > Cc: Peter Zijlstra <a.p.zijlstra [at] chello>
> > LKML-Reference: <new-submission>
> >
> > ---
>
> > +void set_task_cpu_unlocked(struct task_struct *p, unsigned int new_cpu)
> > +{
> > + unsigned long flags;
> > + struct rq *rq, *new_rq = cpu_rq(new_cpu);
> > +
> > + smp_wmb();
> > + rq = task_rq_lock(p, &flags);
> > + update_rq_clock(rq);
> > + if (rq != new_rq)
> > + update_rq_clock(new_rq);
> > + set_task_cpu(p, new_cpu);
> > + task_rq_unlock(rq, &flags);
> > +}
>
> I've got to ask, what's that barrier for?

It's a leftover from frustrated bug hunting.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


efault at gmx

Nov 25, 2009, 5:31 PM

Post #4 of 10 (207 views)
Permalink
Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant [In reply to]

On Thu, 2009-11-26 at 02:01 +0100, Mike Galbraith wrote:
> On Wed, 2009-11-25 at 19:27 +0100, Peter Zijlstra wrote:

> > I've got to ask, what's that barrier for?
>
> It's a leftover from frustrated bug hunting.

To be more specific, I put it there to ensure that min_vruntimes are
stable. I figured that if try_to_wake_up() needs a barrier to look at
task->state, I had better do the same for the runqueues.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


peterz at infradead

Nov 26, 2009, 1:35 AM

Post #5 of 10 (197 views)
Permalink
Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant [In reply to]

On Thu, 2009-11-26 at 02:31 +0100, Mike Galbraith wrote:
> On Thu, 2009-11-26 at 02:01 +0100, Mike Galbraith wrote:
> > On Wed, 2009-11-25 at 19:27 +0100, Peter Zijlstra wrote:
>
> > > I've got to ask, what's that barrier for?
> >
> > It's a leftover from frustrated bug hunting.
>
> To be more specific, I put it there to ensure that min_vruntimes are
> stable.

min_vruntime should only ever be poked at when holding the respective
rq->lock, even with a barrier a 64bit read on a 32bit machine can go all
funny.

> I figured that if try_to_wake_up() needs a barrier to look at
> task->state, I had better do the same for the runqueues.

Ah, ttwu() has that barrier for another reason. The wmb in ttwu() is to
ensure the wakee sees the state of the waker at the time of waking.

That is, its about ordering things like:


A B


my_cond = true;
wake_process(my_friend);

while (!my_cond)
schedule();


So that you can actually observe my_cond being true once you wakeup
(schedule acts as a mb() when it actually schedules).

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


efault at gmx

Nov 26, 2009, 2:16 AM

Post #6 of 10 (193 views)
Permalink
Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant [In reply to]

On Thu, 2009-11-26 at 10:35 +0100, Peter Zijlstra wrote:
> On Thu, 2009-11-26 at 02:31 +0100, Mike Galbraith wrote:
> > On Thu, 2009-11-26 at 02:01 +0100, Mike Galbraith wrote:
> > > On Wed, 2009-11-25 at 19:27 +0100, Peter Zijlstra wrote:
> >
> > > > I've got to ask, what's that barrier for?
> > >
> > > It's a leftover from frustrated bug hunting.
> >
> > To be more specific, I put it there to ensure that min_vruntimes are
> > stable.
>
> min_vruntime should only ever be poked at when holding the respective
> rq->lock, even with a barrier a 64bit read on a 32bit machine can go all
> funny.

Yeah, but we're looking at an unlocked runqueue. But never mind...

> > I figured that if try_to_wake_up() needs a barrier to look at
> > task->state, I had better do the same for the runqueues.
>
> Ah, ttwu() has that barrier for another reason. The wmb in ttwu() is to
> ensure the wakee sees the state of the waker at the time of waking.
>
> That is, its about ordering things like:
>
>
> A B
>
>
> my_cond = true;
> wake_process(my_friend);
>
> while (!my_cond)
> schedule();
>
>
> So that you can actually observe my_cond being true once you wakeup
> (schedule acts as a mb() when it actually schedules).

(Ah! I think you actually made a wrinkle in grey-ware)

...ATM, I kinda wish I'd not gone off and read barriers.txt. It hasn't
sunk in yet, but certainly has made me _paranoid as hell_ :-)

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


peterz at infradead

Nov 26, 2009, 6:09 AM

Post #7 of 10 (196 views)
Permalink
Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant [In reply to]

On Thu, 2009-11-26 at 11:16 +0100, Mike Galbraith wrote:
> > min_vruntime should only ever be poked at when holding the respective
> > rq->lock, even with a barrier a 64bit read on a 32bit machine can go all
> > funny.
>
> Yeah, but we're looking at an unlocked runqueue. But never mind...

The patch is also poking at rq->clock without rq->lock held... not very
nice.

Gah, I hate that we're doing migration things without holding both rq's,
this is making live so very interesting ;-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


peterz at infradead

Nov 26, 2009, 6:21 AM

Post #8 of 10 (189 views)
Permalink
Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant [In reply to]

On Thu, 2009-11-26 at 15:09 +0100, Peter Zijlstra wrote:
> On Thu, 2009-11-26 at 11:16 +0100, Mike Galbraith wrote:
> > > min_vruntime should only ever be poked at when holding the respective
> > > rq->lock, even with a barrier a 64bit read on a 32bit machine can go all
> > > funny.
> >
> > Yeah, but we're looking at an unlocked runqueue. But never mind...
>
> The patch is also poking at rq->clock without rq->lock held... not very
> nice.
>
> Gah, I hate that we're doing migration things without holding both rq's,
> this is making live so very interesting ;-)

so the problem is this bit in kernel/fork.c, which is ran with
tasklist_lock held for writing:

/*
* The task hasn't been attached yet, so its cpus_allowed mask will
* not be changed, nor will its assigned CPU.
*
* The cpus_allowed mask of the parent may have changed after it was
* copied first time - so re-copy it here, then check the child's CPU
* to ensure it is on a valid CPU (and if not, just force it back to
* parent's CPU). This avoids alot of nasty races.
*/
p->cpus_allowed = current->cpus_allowed;
p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed;
if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
!cpu_online(task_cpu(p))))
set_task_cpu(p, smp_processor_id());


The problem is that that doesn't close any races at all since
tasklist_lock doesn't fully serialize changes to ->cpus_allowed.

In fact, there is nothing that protects that mask at all.

The second problem is that set_task_cpu() is accessing data from both
the old and the new rq, which basically requires is being ran with both
rq's locked, and the regular migration paths do so.

However things like ttwu() try to be cute and do not, opening the doors
to all kinds of funny.

Clearly we don't really want to do double_rq_lock() in ttwu(), that's
one of the hotter paths around (and looking at it we ought to seriously
look at trimming some of it).

Which sort of leaves us in a predicament all-right...

/me goes puzzle.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


efault at gmx

Nov 26, 2009, 6:58 AM

Post #9 of 10 (190 views)
Permalink
Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant [In reply to]

On Thu, 2009-11-26 at 15:09 +0100, Peter Zijlstra wrote:
> On Thu, 2009-11-26 at 11:16 +0100, Mike Galbraith wrote:
> > > min_vruntime should only ever be poked at when holding the respective
> > > rq->lock, even with a barrier a 64bit read on a 32bit machine can go all
> > > funny.
> >
> > Yeah, but we're looking at an unlocked runqueue. But never mind...
>
> The patch is also poking at rq->clock without rq->lock held... not very
> nice.

To the users on the remote runqueue, it doesn't matter, they always
update before using, so will always have accurate time.

What bothers me is that I don't see what prevents a SCHED_IDLE task
making a huge min_vruntime update on the remote CPU while you're
queueing a sleeper here, who used to live over there, while you're
reading.

But, they're all instantaneous values anyway, so I suppose memory speed
is fine. It works... and I really doubt that the yet another paranoid
barrier I just put in will make diddly spit difference, but I did just
put an smp_rmb() there anyway to see with my own eyes ;-)

> Gah, I hate that we're doing migration things without holding both rq's,
> this is making live so very interesting ;-)

Yes.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


efault at gmx

Nov 26, 2009, 7:32 AM

Post #10 of 10 (188 views)
Permalink
Re: [patch] sched: fix set_task_cpu() and provide an unlocked runqueue variant [In reply to]

On Thu, 2009-11-26 at 15:21 +0100, Peter Zijlstra wrote:
> On Thu, 2009-11-26 at 15:09 +0100, Peter Zijlstra wrote:
> > On Thu, 2009-11-26 at 11:16 +0100, Mike Galbraith wrote:
> > > > min_vruntime should only ever be poked at when holding the respective
> > > > rq->lock, even with a barrier a 64bit read on a 32bit machine can go all
> > > > funny.
> > >
> > > Yeah, but we're looking at an unlocked runqueue. But never mind...
> >
> > The patch is also poking at rq->clock without rq->lock held... not very
> > nice.
> >
> > Gah, I hate that we're doing migration things without holding both rq's,
> > this is making live so very interesting ;-)
>
> so the problem is this bit in kernel/fork.c, which is ran with
> tasklist_lock held for writing:
>
> /*
> * The task hasn't been attached yet, so its cpus_allowed mask will
> * not be changed, nor will its assigned CPU.
> *
> * The cpus_allowed mask of the parent may have changed after it was
> * copied first time - so re-copy it here, then check the child's CPU
> * to ensure it is on a valid CPU (and if not, just force it back to
> * parent's CPU). This avoids alot of nasty races.
> */
> p->cpus_allowed = current->cpus_allowed;
> p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed;
> if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
> !cpu_online(task_cpu(p))))
> set_task_cpu(p, smp_processor_id());
>
>
> The problem is that that doesn't close any races at all since
> tasklist_lock doesn't fully serialize changes to ->cpus_allowed.

Well, some stuff can't get at you if you're there, but yes, I was
wondering how fixing it up there was going to guarantee a happy landing
when we get to... wake_up_new_task().

> In fact, there is nothing that protects that mask at all.
>
> The second problem is that set_task_cpu() is accessing data from both
> the old and the new rq, which basically requires is being ran with both
> rq's locked, and the regular migration paths do so.

Yes, and task_cpu() and task_rq() are racy as heck without the lock. It
all goes fuzzy.

sched_class can change out from under you the instant you release the
runqueue lock afaikt, nice level, affinity... etc?

> However things like ttwu() try to be cute and do not, opening the doors
> to all kinds of funny.

Yes, so all the raciness I've been imagining isn't _all_ imaginary.
Yoohoo. Um, I mean damn.

> Clearly we don't really want to do double_rq_lock() in ttwu(), that's
> one of the hotter paths around (and looking at it we ought to seriously
> look at trimming some of it).

No, apparently not. About an hour ago, paranoid little me merely did
lock handoff in ttwu and... wunt (wunt?), and was rewarded with a
deadlocked box a bit after X came up.

WRT lard, yes, it is getting fat. The cache misses of the prefer
sibling thing are hurting very fast threads too. Much reward if you
find a sibling, ~4% pain for TCP_RR with the cache misses and whatnot
you waste looking around for a spot for a pinned ultralight task.

Wish I could find an answer for the sibling thing. Nearly doubles
throughput for some things.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Linux kernel RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.