Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Linux: Kernel

[PATCH] block: strip out locking optimization in put_io_context()

 

 

First page Previous page 1 2 Next page Last page  View All Linux kernel RSS feed   Index | Next | Previous | View Threaded


tj at kernel

Feb 6, 2012, 1:54 PM

Post #1 of 28 (121 views)
Permalink
[PATCH] block: strip out locking optimization in put_io_context()

put_io_context() performed a complex trylock dancing to avoid
deferring ioc release to workqueue. It was also broken on UP because
trylock was always assumed to succeed which resulted in unbalanced
preemption count.

While there are ways to fix the UP breakage, even the most
pathological microbench (forced ioc allocation and tight fork/exit
loop) fails to show any appreciable performance benefit of the
optimization. Strip it out. If there turns out to be workloads which
are affected by this change, simpler optimization from the discussion
thread can be applied later.

Signed-off-by: Tejun Heo <tj [at] kernel>
LKML-Reference: <1328514611.21268.66.camel [at] sli10-conro>
---
I couldn't find any statiscally meaningful advantage of the
optimization with tight fork/exit tests w/ forced ioc creation on
fork, which gotta be the most pathological test case for the code
path. So, let's remove the ugly optimization. If I missed sth, we
can resurrect the simpler optimization later. Jens, this is on top of
linus#master without Shaohua's patch.

Thanks.

block/blk-cgroup.c | 2 -
block/blk-core.c | 2 -
block/blk-ioc.c | 90 +++++-----------------------------------------
block/cfq-iosched.c | 2 -
fs/ioprio.c | 2 -
include/linux/blkdev.h | 3 -
include/linux/iocontext.h | 5 +-
kernel/fork.c | 2 -
8 files changed, 18 insertions(+), 90 deletions(-)

Index: work/block/blk-cgroup.c
===================================================================
--- work.orig/block/blk-cgroup.c
+++ work/block/blk-cgroup.c
@@ -1659,7 +1659,7 @@ static void blkiocg_attach(struct cgroup
ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
if (ioc) {
ioc_cgroup_changed(ioc);
- put_io_context(ioc, NULL);
+ put_io_context(ioc);
}
}
}
Index: work/block/blk-core.c
===================================================================
--- work.orig/block/blk-core.c
+++ work/block/blk-core.c
@@ -642,7 +642,7 @@ static inline void blk_free_request(stru
if (rq->cmd_flags & REQ_ELVPRIV) {
elv_put_request(q, rq);
if (rq->elv.icq)
- put_io_context(rq->elv.icq->ioc, q);
+ put_io_context(rq->elv.icq->ioc);
}

mempool_free(rq, q->rq.rq_pool);
Index: work/block/blk-ioc.c
===================================================================
--- work.orig/block/blk-ioc.c
+++ work/block/blk-ioc.c
@@ -29,21 +29,6 @@ void get_io_context(struct io_context *i
}
EXPORT_SYMBOL(get_io_context);

-/*
- * Releasing ioc may nest into another put_io_context() leading to nested
- * fast path release. As the ioc's can't be the same, this is okay but
- * makes lockdep whine. Keep track of nesting and use it as subclass.
- */
-#ifdef CONFIG_LOCKDEP
-#define ioc_release_depth(q) ((q) ? (q)->ioc_release_depth : 0)
-#define ioc_release_depth_inc(q) (q)->ioc_release_depth++
-#define ioc_release_depth_dec(q) (q)->ioc_release_depth--
-#else
-#define ioc_release_depth(q) 0
-#define ioc_release_depth_inc(q) do { } while (0)
-#define ioc_release_depth_dec(q) do { } while (0)
-#endif
-
static void icq_free_icq_rcu(struct rcu_head *head)
{
struct io_cq *icq = container_of(head, struct io_cq, __rcu_head);
@@ -75,11 +60,8 @@ static void ioc_exit_icq(struct io_cq *i
if (rcu_dereference_raw(ioc->icq_hint) == icq)
rcu_assign_pointer(ioc->icq_hint, NULL);

- if (et->ops.elevator_exit_icq_fn) {
- ioc_release_depth_inc(q);
+ if (et->ops.elevator_exit_icq_fn)
et->ops.elevator_exit_icq_fn(icq);
- ioc_release_depth_dec(q);
- }

/*
* @icq->q might have gone away by the time RCU callback runs
@@ -149,79 +131,29 @@ static void ioc_release_fn(struct work_s
/**
* put_io_context - put a reference of io_context
* @ioc: io_context to put
- * @locked_q: request_queue the caller is holding queue_lock of (hint)
*
* Decrement reference count of @ioc and release it if the count reaches
- * zero. If the caller is holding queue_lock of a queue, it can indicate
- * that with @locked_q. This is an optimization hint and the caller is
- * allowed to pass in %NULL even when it's holding a queue_lock.
+ * zero.
*/
-void put_io_context(struct io_context *ioc, struct request_queue *locked_q)
+void put_io_context(struct io_context *ioc)
{
- struct request_queue *last_q = locked_q;
unsigned long flags;

if (ioc == NULL)
return;

BUG_ON(atomic_long_read(&ioc->refcount) <= 0);
- if (locked_q)
- lockdep_assert_held(locked_q->queue_lock);
-
- if (!atomic_long_dec_and_test(&ioc->refcount))
- return;

/*
- * Destroy @ioc. This is a bit messy because icq's are chained
- * from both ioc and queue, and ioc->lock nests inside queue_lock.
- * The inner ioc->lock should be held to walk our icq_list and then
- * for each icq the outer matching queue_lock should be grabbed.
- * ie. We need to do reverse-order double lock dancing.
- *
- * Another twist is that we are often called with one of the
- * matching queue_locks held as indicated by @locked_q, which
- * prevents performing double-lock dance for other queues.
- *
- * So, we do it in two stages. The fast path uses the queue_lock
- * the caller is holding and, if other queues need to be accessed,
- * uses trylock to avoid introducing locking dependency. This can
- * handle most cases, especially if @ioc was performing IO on only
- * single device.
- *
- * If trylock doesn't cut it, we defer to @ioc->release_work which
- * can do all the double-locking dancing.
+ * Releasing ioc requires reverse order double locking and we may
+ * already be holding a queue_lock. Do it asynchronously from wq.
*/
- spin_lock_irqsave_nested(&ioc->lock, flags,
- ioc_release_depth(locked_q));
-
- while (!hlist_empty(&ioc->icq_list)) {
- struct io_cq *icq = hlist_entry(ioc->icq_list.first,
- struct io_cq, ioc_node);
- struct request_queue *this_q = icq->q;
-
- if (this_q != last_q) {
- if (last_q && last_q != locked_q)
- spin_unlock(last_q->queue_lock);
- last_q = NULL;
-
- if (!spin_trylock(this_q->queue_lock))
- break;
- last_q = this_q;
- continue;
- }
- ioc_exit_icq(icq);
+ if (atomic_long_dec_and_test(&ioc->refcount)) {
+ spin_lock_irqsave(&ioc->lock, flags);
+ if (!hlist_empty(&ioc->icq_list))
+ schedule_work(&ioc->release_work);
+ spin_unlock_irqrestore(&ioc->lock, flags);
}
-
- if (last_q && last_q != locked_q)
- spin_unlock(last_q->queue_lock);
-
- spin_unlock_irqrestore(&ioc->lock, flags);
-
- /* if no icq is left, we're done; otherwise, kick release_work */
- if (hlist_empty(&ioc->icq_list))
- kmem_cache_free(iocontext_cachep, ioc);
- else
- schedule_work(&ioc->release_work);
}
EXPORT_SYMBOL(put_io_context);

@@ -236,7 +168,7 @@ void exit_io_context(struct task_struct
task_unlock(task);

atomic_dec(&ioc->nr_tasks);
- put_io_context(ioc, NULL);
+ put_io_context(ioc);
}

/**
Index: work/block/cfq-iosched.c
===================================================================
--- work.orig/block/cfq-iosched.c
+++ work/block/cfq-iosched.c
@@ -1794,7 +1794,7 @@ __cfq_slice_expired(struct cfq_data *cfq
cfqd->active_queue = NULL;

if (cfqd->active_cic) {
- put_io_context(cfqd->active_cic->icq.ioc, cfqd->queue);
+ put_io_context(cfqd->active_cic->icq.ioc);
cfqd->active_cic = NULL;
}
}
Index: work/fs/ioprio.c
===================================================================
--- work.orig/fs/ioprio.c
+++ work/fs/ioprio.c
@@ -51,7 +51,7 @@ int set_task_ioprio(struct task_struct *
ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
if (ioc) {
ioc_ioprio_changed(ioc, ioprio);
- put_io_context(ioc, NULL);
+ put_io_context(ioc);
}

return err;
Index: work/include/linux/blkdev.h
===================================================================
--- work.orig/include/linux/blkdev.h
+++ work/include/linux/blkdev.h
@@ -399,9 +399,6 @@ struct request_queue {
/* Throttle data */
struct throtl_data *td;
#endif
-#ifdef CONFIG_LOCKDEP
- int ioc_release_depth;
-#endif
};

#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
Index: work/include/linux/iocontext.h
===================================================================
--- work.orig/include/linux/iocontext.h
+++ work/include/linux/iocontext.h
@@ -133,7 +133,7 @@ static inline struct io_context *ioc_tas

struct task_struct;
#ifdef CONFIG_BLOCK
-void put_io_context(struct io_context *ioc, struct request_queue *locked_q);
+void put_io_context(struct io_context *ioc);
void exit_io_context(struct task_struct *task);
struct io_context *get_task_io_context(struct task_struct *task,
gfp_t gfp_flags, int node);
@@ -141,8 +141,7 @@ void ioc_ioprio_changed(struct io_contex
void ioc_cgroup_changed(struct io_context *ioc);
#else
struct io_context;
-static inline void put_io_context(struct io_context *ioc,
- struct request_queue *locked_q) { }
+static inline void put_io_context(struct io_context *ioc) { }
static inline void exit_io_context(struct task_struct *task) { }
#endif

Index: work/kernel/fork.c
===================================================================
--- work.orig/kernel/fork.c
+++ work/kernel/fork.c
@@ -890,7 +890,7 @@ static int copy_io(unsigned long clone_f
return -ENOMEM;

new_ioc->ioprio = ioc->ioprio;
- put_io_context(new_ioc, NULL);
+ put_io_context(new_ioc);
}
#endif
return 0;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


axboe at kernel

Feb 6, 2012, 10:49 PM

Post #2 of 28 (117 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

On 02/06/2012 10:54 PM, Tejun Heo wrote:
> put_io_context() performed a complex trylock dancing to avoid
> deferring ioc release to workqueue. It was also broken on UP because
> trylock was always assumed to succeed which resulted in unbalanced
> preemption count.
>
> While there are ways to fix the UP breakage, even the most
> pathological microbench (forced ioc allocation and tight fork/exit
> loop) fails to show any appreciable performance benefit of the
> optimization. Strip it out. If there turns out to be workloads which
> are affected by this change, simpler optimization from the discussion
> thread can be applied later.
>
> Signed-off-by: Tejun Heo <tj [at] kernel>
> LKML-Reference: <1328514611.21268.66.camel [at] sli10-conro>
> ---
> I couldn't find any statiscally meaningful advantage of the
> optimization with tight fork/exit tests w/ forced ioc creation on
> fork, which gotta be the most pathological test case for the code
> path. So, let's remove the ugly optimization. If I missed sth, we
> can resurrect the simpler optimization later. Jens, this is on top of
> linus#master without Shaohua's patch.

OK, then I'm fine with cleaning it up. Applied, thanks Tejun.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


tj at kernel

Feb 7, 2012, 8:22 AM

Post #3 of 28 (118 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

Hello, Jens.

On Tue, Feb 07, 2012 at 07:49:19AM +0100, Jens Axboe wrote:
> > I couldn't find any statiscally meaningful advantage of the
> > optimization with tight fork/exit tests w/ forced ioc creation on
> > fork, which gotta be the most pathological test case for the code
> > path. So, let's remove the ugly optimization. If I missed sth, we
> > can resurrect the simpler optimization later. Jens, this is on top of
> > linus#master without Shaohua's patch.
>
> OK, then I'm fine with cleaning it up. Applied, thanks Tejun.

Hmmm... how about merging Shaohua's smaller fix first until we figure
out what's going on with the performance regression he's seeing?

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


axboe at kernel

Feb 7, 2012, 8:28 AM

Post #4 of 28 (116 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

On 2012-02-07 17:22, Tejun Heo wrote:
> Hello, Jens.
>
> On Tue, Feb 07, 2012 at 07:49:19AM +0100, Jens Axboe wrote:
>>> I couldn't find any statiscally meaningful advantage of the
>>> optimization with tight fork/exit tests w/ forced ioc creation on
>>> fork, which gotta be the most pathological test case for the code
>>> path. So, let's remove the ugly optimization. If I missed sth, we
>>> can resurrect the simpler optimization later. Jens, this is on top of
>>> linus#master without Shaohua's patch.
>>
>> OK, then I'm fine with cleaning it up. Applied, thanks Tejun.
>
> Hmmm... how about merging Shaohua's smaller fix first until we figure
> out what's going on with the performance regression he's seeing?

That was already merged in my tree. I don't see how it makes much
difference in tracking the regression. You said that removing it made no
difference for the find test case, so I'd be more comfortable getting
rid of the nasty optimization.

I'll send it when I have pending tomorrow, so there's still a full day
to change things. We can just shuffle patches before then, not a
problem.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


torvalds at linux-foundation

Feb 7, 2012, 8:33 AM

Post #5 of 28 (117 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

On Tue, Feb 7, 2012 at 8:28 AM, Jens Axboe <axboe [at] kernel> wrote:
>
> That was already merged in my tree. I don't see how it makes much
> difference in tracking the regression. You said that removing it made no
> difference for the find test case, so I'd be more comfortable getting
> rid of the nasty optimization.

Yeah, please just get rid of the crazy code. Maybe *that* fixes the
regression too, who knows?

For all we know, the "fast case" is what causes extra locking only to
then fail and not even be a fast-path.

I think our default action should always be to simplify and clean up
code, unless you have seriously hard numbers to show that the code
complexity is worth it.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


tj at kernel

Feb 7, 2012, 8:47 AM

Post #6 of 28 (116 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

Hello,

On Tue, Feb 07, 2012 at 08:33:15AM -0800, Linus Torvalds wrote:
> Yeah, please just get rid of the crazy code. Maybe *that* fixes the
> regression too, who knows?
>
> For all we know, the "fast case" is what causes extra locking only to
> then fail and not even be a fast-path.

Yeah, I was about to ask Shaohua to test the version w/o optimization.
With heavily loaded request_queue, trylock failure could be frequent,
which I wasn't testing.

Shaohua, can you please test the version w/o optimization? Also, can
you please give a bit more details on the setup? Are there multiple
swap devices? Is it SSD or rotating disk?

> I think our default action should always be to simplify and clean up
> code, unless you have seriously hard numbers to show that the code
> complexity is worth it.

Sure, it was originally all in put_io_context() and while moving
things to wq, it got progressively complex and I lost sense of
complexity from staring at it too long.

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


tj at kernel

Feb 7, 2012, 9:17 AM

Post #7 of 28 (116 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

On Tue, Feb 07, 2012 at 08:47:35AM -0800, Tejun Heo wrote:
> Yeah, I was about to ask Shaohua to test the version w/o optimization.
> With heavily loaded request_queue, trylock failure could be frequent,
> which I wasn't testing.
>
> Shaohua, can you please test the version w/o optimization? Also, can
> you please give a bit more details on the setup? Are there multiple
> swap devices? Is it SSD or rotating disk?

Can you please also apply the following patch on top of
block/for-linus and see how it behaves?

Thanks.
---
block/blk-ioc.c | 54 +++++++++++++++++-------------------------------------
1 file changed, 17 insertions(+), 37 deletions(-)

Index: work/block/blk-ioc.c
===================================================================
--- work.orig/block/blk-ioc.c
+++ work/block/blk-ioc.c
@@ -79,52 +79,32 @@ static void ioc_release_fn(struct work_s
{
struct io_context *ioc = container_of(work, struct io_context,
release_work);
- struct request_queue *last_q = NULL;
+ unsigned long flags;

- spin_lock_irq(&ioc->lock);
+ /*
+ * Exiting icq may call into put_io_context() through elevator
+ * which will trigger lockdep warning. The ioc's are guaranteed to
+ * be different, use a different locking subclass here. Using
+ * irqsave variant as there's no spin_lock_irq_nested().
+ */
+ spin_lock_irqsave_nested(&ioc->lock, flags, 1);

while (!hlist_empty(&ioc->icq_list)) {
struct io_cq *icq = hlist_entry(ioc->icq_list.first,
struct io_cq, ioc_node);
- struct request_queue *this_q = icq->q;
+ struct request_queue *q = icq->q;

- if (this_q != last_q) {
- /*
- * Need to switch to @this_q. Once we release
- * @ioc->lock, it can go away along with @cic.
- * Hold on to it.
- */
- __blk_get_queue(this_q);
-
- /*
- * blk_put_queue() might sleep thanks to kobject
- * idiocy. Always release both locks, put and
- * restart.
- */
- if (last_q) {
- spin_unlock(last_q->queue_lock);
- spin_unlock_irq(&ioc->lock);
- blk_put_queue(last_q);
- } else {
- spin_unlock_irq(&ioc->lock);
- }
-
- last_q = this_q;
- spin_lock_irq(this_q->queue_lock);
- spin_lock(&ioc->lock);
- continue;
+ if (spin_trylock(q->queue_lock)) {
+ ioc_exit_icq(icq);
+ spin_unlock(q->queue_lock);
+ } else {
+ spin_unlock_irqrestore(&ioc->lock, flags);
+ cpu_relax();
+ spin_lock_irqsave_nested(&ioc->lock, flags, 1);
}
- ioc_exit_icq(icq);
- }
-
- if (last_q) {
- spin_unlock(last_q->queue_lock);
- spin_unlock_irq(&ioc->lock);
- blk_put_queue(last_q);
- } else {
- spin_unlock_irq(&ioc->lock);
}

+ spin_unlock_irqrestore(&ioc->lock, flags);
kmem_cache_free(iocontext_cachep, ioc);
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


shaohua.li at intel

Feb 7, 2012, 4:19 PM

Post #8 of 28 (116 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

2012/2/8 Tejun Heo <tj [at] kernel>:
> Hello,
>
> On Tue, Feb 07, 2012 at 08:33:15AM -0800, Linus Torvalds wrote:
>> Yeah, please just get rid of the crazy code. Maybe *that* fixes the
>> regression too, who knows?
>>
>> For all we know, the "fast case" is what causes extra locking only to
>> then fail and not even be a fast-path.
>
> Yeah, I was about to ask Shaohua to test the version w/o optimization.
> With heavily loaded request_queue, trylock failure could be frequent,
> which I wasn't testing.
>
> Shaohua, can you please test the version w/o optimization?  Also, can
> you please give a bit more details on the setup?  Are there multiple
> swap devices?  Is it SSD or rotating disk?
the test adds mem=4G in a 2 sockets 16 CPU machine.
just make several copy of kernel source in tmpfs (so there is swap
depending on your
memory size) and run kernel build in the kernel source in the meaning time.
there is only one swap device which is a rotating disk.

I'll test both patches soon.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


shaohua.li at intel

Feb 8, 2012, 12:29 AM

Post #9 of 28 (108 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

2012/2/8 Shaohua Li <shaohua.li [at] intel>:
> 2012/2/8 Tejun Heo <tj [at] kernel>:
>> Hello,
>>
>> On Tue, Feb 07, 2012 at 08:33:15AM -0800, Linus Torvalds wrote:
>>> Yeah, please just get rid of the crazy code. Maybe *that* fixes the
>>> regression too, who knows?
>>>
>>> For all we know, the "fast case" is what causes extra locking only to
>>> then fail and not even be a fast-path.
>>
>> Yeah, I was about to ask Shaohua to test the version w/o optimization.
>> With heavily loaded request_queue, trylock failure could be frequent,
>> which I wasn't testing.
>>
>> Shaohua, can you please test the version w/o optimization?  Also, can
>> you please give a bit more details on the setup?  Are there multiple
>> swap devices?  Is it SSD or rotating disk?
> the test adds mem=4G in a 2 sockets 16 CPU machine.
> just make several copy of kernel source in tmpfs (so there is swap
> depending on your
> memory size) and run kernel build in the kernel source in the meaning time.
> there is only one swap device which is a rotating disk.
>
> I'll test both patches soon.
Tried all the options, the regression still exists. Any new idea?
I'll spend some time on it if I can get anything

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


tj at kernel

Feb 8, 2012, 8:29 AM

Post #10 of 28 (105 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

Hello, Shaohua.

On Wed, Feb 08, 2012 at 04:29:53PM +0800, Shaohua Li wrote:
> > the test adds mem=4G in a 2 sockets 16 CPU machine.
> > just make several copy of kernel source in tmpfs (so there is swap
> > depending on your
> > memory size) and run kernel build in the kernel source in the meaning time.
> > there is only one swap device which is a rotating disk.
> >
> > I'll test both patches soon.
>
> Tried all the options, the regression still exists. Any new idea?
> I'll spend some time on it if I can get anything

Can you please try the following one? Thanks a lot!

---
block/blk-cgroup.c | 2
block/blk-core.c | 2
block/blk-ioc.c | 99 +++++++++++++---------------------------------
block/cfq-iosched.c | 2
fs/ioprio.c | 2
include/linux/iocontext.h | 8 +--
kernel/fork.c | 2
7 files changed, 37 insertions(+), 80 deletions(-)

Index: work/block/blk-ioc.c
===================================================================
--- work.orig/block/blk-ioc.c
+++ work/block/blk-ioc.c
@@ -8,9 +8,12 @@
#include <linux/blkdev.h>
#include <linux/bootmem.h> /* for max_pfn/max_low_pfn */
#include <linux/slab.h>
+#include <linux/rwlock.h>

#include "blk.h"

+static DEFINE_RWLOCK(ioc_clear_rwlock);
+
/*
* For io context allocations
*/
@@ -45,7 +48,7 @@ static void ioc_exit_icq(struct io_cq *i
struct request_queue *q = icq->q;
struct elevator_type *et = q->elevator->type;

- lockdep_assert_held(&ioc->lock);
+ //lockdep_assert_held(&ioc->lock);
lockdep_assert_held(q->queue_lock);

radix_tree_delete(&ioc->icq_tree, icq->q->id);
@@ -71,63 +74,6 @@ static void ioc_exit_icq(struct io_cq *i
call_rcu(&icq->__rcu_head, icq_free_icq_rcu);
}

-/*
- * Slow path for ioc release in put_io_context(). Performs double-lock
- * dancing to unlink all icq's and then frees ioc.
- */
-static void ioc_release_fn(struct work_struct *work)
-{
- struct io_context *ioc = container_of(work, struct io_context,
- release_work);
- struct request_queue *last_q = NULL;
-
- spin_lock_irq(&ioc->lock);
-
- while (!hlist_empty(&ioc->icq_list)) {
- struct io_cq *icq = hlist_entry(ioc->icq_list.first,
- struct io_cq, ioc_node);
- struct request_queue *this_q = icq->q;
-
- if (this_q != last_q) {
- /*
- * Need to switch to @this_q. Once we release
- * @ioc->lock, it can go away along with @cic.
- * Hold on to it.
- */
- __blk_get_queue(this_q);
-
- /*
- * blk_put_queue() might sleep thanks to kobject
- * idiocy. Always release both locks, put and
- * restart.
- */
- if (last_q) {
- spin_unlock(last_q->queue_lock);
- spin_unlock_irq(&ioc->lock);
- blk_put_queue(last_q);
- } else {
- spin_unlock_irq(&ioc->lock);
- }
-
- last_q = this_q;
- spin_lock_irq(this_q->queue_lock);
- spin_lock(&ioc->lock);
- continue;
- }
- ioc_exit_icq(icq);
- }
-
- if (last_q) {
- spin_unlock(last_q->queue_lock);
- spin_unlock_irq(&ioc->lock);
- blk_put_queue(last_q);
- } else {
- spin_unlock_irq(&ioc->lock);
- }
-
- kmem_cache_free(iocontext_cachep, ioc);
-}
-
/**
* put_io_context - put a reference of io_context
* @ioc: io_context to put
@@ -135,7 +81,7 @@ static void ioc_release_fn(struct work_s
* Decrement reference count of @ioc and release it if the count reaches
* zero.
*/
-void put_io_context(struct io_context *ioc)
+void put_io_context(struct io_context *ioc, struct request_queue *locked_q)
{
unsigned long flags;

@@ -144,16 +90,26 @@ void put_io_context(struct io_context *i

BUG_ON(atomic_long_read(&ioc->refcount) <= 0);

- /*
- * Releasing ioc requires reverse order double locking and we may
- * already be holding a queue_lock. Do it asynchronously from wq.
- */
- if (atomic_long_dec_and_test(&ioc->refcount)) {
- spin_lock_irqsave(&ioc->lock, flags);
- if (!hlist_empty(&ioc->icq_list))
- schedule_work(&ioc->release_work);
- spin_unlock_irqrestore(&ioc->lock, flags);
+ if (!atomic_long_dec_and_test(&ioc->refcount))
+ return;
+
+ read_lock_irqsave(&ioc_clear_rwlock, flags);
+
+ while (!hlist_empty(&ioc->icq_list)) {
+ struct io_cq *icq = hlist_entry(ioc->icq_list.first,
+ struct io_cq, ioc_node);
+ struct request_queue *q = icq->q;
+
+ if (q != locked_q)
+ spin_lock_nested(q->queue_lock, 1);
+
+ ioc_exit_icq(icq);
+
+ if (q != locked_q)
+ spin_unlock(q->queue_lock);
}
+
+ read_unlock_irqrestore(&ioc_clear_rwlock, flags);
}
EXPORT_SYMBOL(put_io_context);

@@ -168,7 +124,7 @@ void exit_io_context(struct task_struct
task_unlock(task);

atomic_dec(&ioc->nr_tasks);
- put_io_context(ioc);
+ put_io_context(ioc, NULL);
}

/**
@@ -181,6 +137,8 @@ void ioc_clear_queue(struct request_queu
{
lockdep_assert_held(q->queue_lock);

+ write_lock(&ioc_clear_rwlock);
+
while (!list_empty(&q->icq_list)) {
struct io_cq *icq = list_entry(q->icq_list.next,
struct io_cq, q_node);
@@ -190,6 +148,8 @@ void ioc_clear_queue(struct request_queu
ioc_exit_icq(icq);
spin_unlock(&ioc->lock);
}
+
+ write_unlock(&ioc_clear_rwlock);
}

void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_flags,
@@ -208,7 +168,6 @@ void create_io_context_slowpath(struct t
spin_lock_init(&ioc->lock);
INIT_RADIX_TREE(&ioc->icq_tree, GFP_ATOMIC | __GFP_HIGH);
INIT_HLIST_HEAD(&ioc->icq_list);
- INIT_WORK(&ioc->release_work, ioc_release_fn);

/*
* Try to install. ioc shouldn't be installed if someone else
Index: work/include/linux/iocontext.h
===================================================================
--- work.orig/include/linux/iocontext.h
+++ work/include/linux/iocontext.h
@@ -3,7 +3,6 @@

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>
-#include <linux/workqueue.h>

enum {
ICQ_IOPRIO_CHANGED,
@@ -113,8 +112,6 @@ struct io_context {
struct radix_tree_root icq_tree;
struct io_cq __rcu *icq_hint;
struct hlist_head icq_list;
-
- struct work_struct release_work;
};

static inline struct io_context *ioc_task_link(struct io_context *ioc)
@@ -133,7 +130,7 @@ static inline struct io_context *ioc_tas

struct task_struct;
#ifdef CONFIG_BLOCK
-void put_io_context(struct io_context *ioc);
+void put_io_context(struct io_context *ioc, struct request_queue *locked_q);
void exit_io_context(struct task_struct *task);
struct io_context *get_task_io_context(struct task_struct *task,
gfp_t gfp_flags, int node);
@@ -141,7 +138,8 @@ void ioc_ioprio_changed(struct io_contex
void ioc_cgroup_changed(struct io_context *ioc);
#else
struct io_context;
-static inline void put_io_context(struct io_context *ioc) { }
+static inline void put_io_context(struct io_context *ioc,
+ struct request_queue *locked_q) { }
static inline void exit_io_context(struct task_struct *task) { }
#endif

Index: work/block/blk-cgroup.c
===================================================================
--- work.orig/block/blk-cgroup.c
+++ work/block/blk-cgroup.c
@@ -1659,7 +1659,7 @@ static void blkiocg_attach(struct cgroup
ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
if (ioc) {
ioc_cgroup_changed(ioc);
- put_io_context(ioc);
+ put_io_context(ioc, NULL);
}
}
}
Index: work/block/blk-core.c
===================================================================
--- work.orig/block/blk-core.c
+++ work/block/blk-core.c
@@ -642,7 +642,7 @@ static inline void blk_free_request(stru
if (rq->cmd_flags & REQ_ELVPRIV) {
elv_put_request(q, rq);
if (rq->elv.icq)
- put_io_context(rq->elv.icq->ioc);
+ put_io_context(rq->elv.icq->ioc, q);
}

mempool_free(rq, q->rq.rq_pool);
Index: work/block/cfq-iosched.c
===================================================================
--- work.orig/block/cfq-iosched.c
+++ work/block/cfq-iosched.c
@@ -1787,7 +1787,7 @@ __cfq_slice_expired(struct cfq_data *cfq
cfqd->active_queue = NULL;

if (cfqd->active_cic) {
- put_io_context(cfqd->active_cic->icq.ioc);
+ put_io_context(cfqd->active_cic->icq.ioc, cfqd->queue);
cfqd->active_cic = NULL;
}
}
Index: work/fs/ioprio.c
===================================================================
--- work.orig/fs/ioprio.c
+++ work/fs/ioprio.c
@@ -51,7 +51,7 @@ int set_task_ioprio(struct task_struct *
ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
if (ioc) {
ioc_ioprio_changed(ioc, ioprio);
- put_io_context(ioc);
+ put_io_context(ioc, NULL);
}

return err;
Index: work/kernel/fork.c
===================================================================
--- work.orig/kernel/fork.c
+++ work/kernel/fork.c
@@ -890,7 +890,7 @@ static int copy_io(unsigned long clone_f
return -ENOMEM;

new_ioc->ioprio = ioc->ioprio;
- put_io_context(new_ioc);
+ put_io_context(new_ioc, NULL);
}
#endif
return 0;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


torvalds at linux-foundation

Feb 8, 2012, 8:34 AM

Post #11 of 28 (107 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

On Wed, Feb 8, 2012 at 8:29 AM, Tejun Heo <tj [at] kernel> wrote:
>
> Can you please try the following one?  Thanks a lot!

If you can use it as a rwlock, why can't you do it with RCU?

Usually rwlocks are a bad idea. They tend to be more expensive than
spinlocks, and the extra parallelism is almost never noticeable
(except as "more cacheline bounces") for something that is appropriate
for a non-sleeping lock.

There's a *very* few situations where rwlock is the right thing, but
it really almost always is a horribly bad idea.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


tj at kernel

Feb 8, 2012, 8:49 AM

Post #12 of 28 (106 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

Hello, Linus.

On Wed, Feb 08, 2012 at 08:34:53AM -0800, Linus Torvalds wrote:
> On Wed, Feb 8, 2012 at 8:29 AM, Tejun Heo <tj [at] kernel> wrote:
> >
> > Can you please try the following one?  Thanks a lot!
>
> If you can use it as a rwlock, why can't you do it with RCU?

The original locking scheme was using RCU which was very fragile and
broken on corner cases. The locking restructuring was aimed to make
things simpler. While the double locking isn't trivial, it's much
easier to grasp and get right than RCU. We might have to revive RCU
if the regression can't be tackled otherwise and it probably is
possible to do it simpler. Let's see.

> Usually rwlocks are a bad idea. They tend to be more expensive than
> spinlocks, and the extra parallelism is almost never noticeable
> (except as "more cacheline bounces") for something that is appropriate
> for a non-sleeping lock.
>
> There's a *very* few situations where rwlock is the right thing, but
> it really almost always is a horribly bad idea.

I'm still a bit lost on where the regression is coming from and
*suspecting* that queue_lock contention is making the reverse locking
behave much worse than expected, so I mostly wanted to take that out
and see what happens. rwlock might increase locking overhead per try
but it avoids unlock/lock dancing. I'll try to reproduce the
regression in a few days and do better analysis.

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


tj at kernel

Feb 8, 2012, 8:56 AM

Post #13 of 28 (107 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

On Wed, Feb 08, 2012 at 08:49:20AM -0800, Tejun Heo wrote:
> I'm still a bit lost on where the regression is coming from and
> *suspecting* that queue_lock contention is making the reverse locking
> behave much worse than expected, so I mostly wanted to take that out
> and see what happens.

IOW, we can achieve about the same thing by adding another lock in
request_queue. The goal is using an inner lock for ioc clearing so
that queue_lock doesn't have to be grabbed inside ioc lock.

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


tj at kernel

Feb 8, 2012, 9:23 AM

Post #14 of 28 (106 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

On Wed, Feb 08, 2012 at 08:56:11AM -0800, Tejun Heo wrote:
> On Wed, Feb 08, 2012 at 08:49:20AM -0800, Tejun Heo wrote:
> > I'm still a bit lost on where the regression is coming from and
> > *suspecting* that queue_lock contention is making the reverse locking
> > behave much worse than expected, so I mostly wanted to take that out
> > and see what happens.
>
> IOW, we can achieve about the same thing by adding another lock in
> request_queue. The goal is using an inner lock for ioc clearing so
> that queue_lock doesn't have to be grabbed inside ioc lock.

Urgh....... forget about the above message. My head is still booting.
We can't do this w/ per-queue lock as we don't have a way to traverse
the associated queues w/o holding the lock and we need read locking on
ioc exit path as we may recurse through elevator icq exit (it's not
about concurrency). /me goes to brew coffee.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


shaohua.li at intel

Feb 8, 2012, 10:22 PM

Post #15 of 28 (106 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

2012/2/9 Tejun Heo <tj [at] kernel>:
> Hello, Shaohua.
>
> On Wed, Feb 08, 2012 at 04:29:53PM +0800, Shaohua Li wrote:
>> > the test adds mem=4G in a 2 sockets 16 CPU machine.
>> > just make several copy of kernel source in tmpfs (so there is swap
>> > depending on your
>> > memory size) and run kernel build in the kernel source in the meaning time.
>> > there is only one swap device which is a rotating disk.
>> >
>> > I'll test both patches soon.
>>
>> Tried all the options, the regression still exists. Any new idea?
>> I'll spend some time on it if I can get anything
>
> Can you please try the following one?  Thanks a lot!
doesn't work.
I also double confirmed b2efa05265d62 causes the regression
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


tj at kernel

Feb 9, 2012, 9:59 AM

Post #16 of 28 (104 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

Hello,

On Thu, Feb 09, 2012 at 02:22:32PM +0800, Shaohua Li wrote:
> >> Tried all the options, the regression still exists. Any new idea?
> >> I'll spend some time on it if I can get anything
> >
> > Can you please try the following one?  Thanks a lot!
> doesn't work.
> I also double confirmed b2efa05265d62 causes the regression

I'll soon send a RCU based version. I'm still having trouble
reproducing the regression tho. I've tried a few different things.

* Heavy thrashing: Disk IO dominates everything and CPUs aren't too
busy. While how swap behaves affects completion time, I don't see
how CPU locking issue comes into play at all under this
circumstance.

* Some swap load: Simulated w/ 1G memory limit and buliding defconfig
kernel in tmpfs. Swap grows to a couple hundred megabytes but build
time is still dominated by CPU. I didn't see any meanginful
difference before and after the commit - both in terms of wallclock
and CPU times.

Maybe these two were too extreme to show the problem and I need to
push memory limit a bit further, but it would be great if you can give
me a bit more details about your testing.

* How much memory does the machine have? How is the tmpfs setup and
filled up? What's the size of the tmpfs and the output of "free -m"
right before test starts? While the test is running, how occupied
are the CPUs? On test completion, what's the output of "free -m"?

* What exactly is the test and what do you measure? What does "12%
regression" mean? Is it wallclock time or CPU time? If it's CPU
time, does systime increase dominate the regression?

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


torvalds at linux-foundation

Feb 9, 2012, 10:07 AM

Post #17 of 28 (104 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

On Thu, Feb 9, 2012 at 9:59 AM, Tejun Heo <tj [at] kernel> wrote:
>
> * What exactly is the test and what do you measure?  What does "12%
>  regression" mean?  Is it wallclock time or CPU time?  If it's CPU
>  time, does systime increase dominate the regression?

Shaohua, it might be interesting to see a profile of the bad case.

Now, quite often these kinds of things don't show anything at all -
it's just due to cache issues and there's no obvious "we hold spinlock
X for 15 seconds total". But if it's actual lock contention rather
than just "more scheduling of worker threads", it should show up in
the profile quite clearly.

That said, I do think the RCU approach is the right one. The whole
delayed deallocation (and the replacement patch with rwlocks) really
smells like "badly done RCU-like behavior" to me.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


tj at kernel

Feb 9, 2012, 11:24 AM

Post #18 of 28 (95 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

On Thu, Feb 09, 2012 at 10:07:35AM -0800, Linus Torvalds wrote:
> On Thu, Feb 9, 2012 at 9:59 AM, Tejun Heo <tj [at] kernel> wrote:
> >
> > * What exactly is the test and what do you measure?  What does "12%
> >  regression" mean?  Is it wallclock time or CPU time?  If it's CPU
> >  time, does systime increase dominate the regression?
>
> Shaohua, it might be interesting to see a profile of the bad case.

Yeap, if CPUs are taking more time to do stuff, it would be helpful to
obtain before and after profiles.

> Now, quite often these kinds of things don't show anything at all -
> it's just due to cache issues and there's no obvious "we hold spinlock
> X for 15 seconds total". But if it's actual lock contention rather
> than just "more scheduling of worker threads", it should show up in
> the profile quite clearly.

Weird thing is that if it were wq, the rwlock patch should have
removed the regression. It removed the reverse locking and the wq
deferring. It does replace ioc locking with global readlock but it's
weird that that can show up as >10% regression (whatever the measure
may be). Even though kernel compiling is pretty fork/exit
intensive....

> That said, I do think the RCU approach is the right one. The whole
> delayed deallocation (and the replacement patch with rwlocks) really
> smells like "badly done RCU-like behavior" to me.

I'll probably post it in several hours and think it's gonna be pretty
well contained. I probably avoided RCU too hard in that path from the
original scary RCU usage.

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


tj at kernel

Feb 9, 2012, 3:48 PM

Post #19 of 28 (94 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

Hello, Shaohua.

Can you please test the following one? It's probably the simplest
version w/o RCU and wq deferring. RCUfying isn't too bad but I'm
still a bit hesitant because RCU coverage needs to be extended to
request_queue via conditional synchronize_rcu() in queue exit path
(can't enforce delayed RCU free on request_queues and unconditional
synchronize_rcu() may cause excessive delay during boot for certain
configurations). It now can be done in the block core layer proper so
it shouldn't be as bad tho. If this too flops, I'll get to that.

Thanks.
---
block/blk-cgroup.c | 2
block/blk-core.c | 2
block/blk-ioc.c | 112 +++++++++++++++++-----------------------------
block/cfq-iosched.c | 2
fs/ioprio.c | 2
include/linux/blkdev.h | 3 +
include/linux/iocontext.h | 8 +--
kernel/fork.c | 2
8 files changed, 54 insertions(+), 79 deletions(-)

Index: work/block/blk-cgroup.c
===================================================================
--- work.orig/block/blk-cgroup.c
+++ work/block/blk-cgroup.c
@@ -1659,7 +1659,7 @@ static void blkiocg_attach(struct cgroup
ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
if (ioc) {
ioc_cgroup_changed(ioc);
- put_io_context(ioc);
+ put_io_context(ioc, NULL);
}
}
}
Index: work/block/blk-core.c
===================================================================
--- work.orig/block/blk-core.c
+++ work/block/blk-core.c
@@ -642,7 +642,7 @@ static inline void blk_free_request(stru
if (rq->cmd_flags & REQ_ELVPRIV) {
elv_put_request(q, rq);
if (rq->elv.icq)
- put_io_context(rq->elv.icq->ioc);
+ put_io_context(rq->elv.icq->ioc, q);
}

mempool_free(rq, q->rq.rq_pool);
Index: work/block/blk-ioc.c
===================================================================
--- work.orig/block/blk-ioc.c
+++ work/block/blk-ioc.c
@@ -29,6 +29,21 @@ void get_io_context(struct io_context *i
}
EXPORT_SYMBOL(get_io_context);

+/*
+ * Releasing ioc may nest into another put_io_context() leading to nested
+ * fast path release. As the ioc's can't be the same, this is okay but
+ * makes lockdep whine. Keep track of nesting and use it as subclass.
+ */
+#ifdef CONFIG_LOCKDEP
+#define ioc_release_depth(q) ((q) ? (q)->ioc_release_depth : 0)
+#define ioc_release_depth_inc(q) (q)->ioc_release_depth++
+#define ioc_release_depth_dec(q) (q)->ioc_release_depth--
+#else
+#define ioc_release_depth(q) 0
+#define ioc_release_depth_inc(q) do { } while (0)
+#define ioc_release_depth_dec(q) do { } while (0)
+#endif
+
static void icq_free_icq_rcu(struct rcu_head *head)
{
struct io_cq *icq = container_of(head, struct io_cq, __rcu_head);
@@ -71,63 +86,6 @@ static void ioc_exit_icq(struct io_cq *i
call_rcu(&icq->__rcu_head, icq_free_icq_rcu);
}

-/*
- * Slow path for ioc release in put_io_context(). Performs double-lock
- * dancing to unlink all icq's and then frees ioc.
- */
-static void ioc_release_fn(struct work_struct *work)
-{
- struct io_context *ioc = container_of(work, struct io_context,
- release_work);
- struct request_queue *last_q = NULL;
-
- spin_lock_irq(&ioc->lock);
-
- while (!hlist_empty(&ioc->icq_list)) {
- struct io_cq *icq = hlist_entry(ioc->icq_list.first,
- struct io_cq, ioc_node);
- struct request_queue *this_q = icq->q;
-
- if (this_q != last_q) {
- /*
- * Need to switch to @this_q. Once we release
- * @ioc->lock, it can go away along with @cic.
- * Hold on to it.
- */
- __blk_get_queue(this_q);
-
- /*
- * blk_put_queue() might sleep thanks to kobject
- * idiocy. Always release both locks, put and
- * restart.
- */
- if (last_q) {
- spin_unlock(last_q->queue_lock);
- spin_unlock_irq(&ioc->lock);
- blk_put_queue(last_q);
- } else {
- spin_unlock_irq(&ioc->lock);
- }
-
- last_q = this_q;
- spin_lock_irq(this_q->queue_lock);
- spin_lock(&ioc->lock);
- continue;
- }
- ioc_exit_icq(icq);
- }
-
- if (last_q) {
- spin_unlock(last_q->queue_lock);
- spin_unlock_irq(&ioc->lock);
- blk_put_queue(last_q);
- } else {
- spin_unlock_irq(&ioc->lock);
- }
-
- kmem_cache_free(iocontext_cachep, ioc);
-}
-
/**
* put_io_context - put a reference of io_context
* @ioc: io_context to put
@@ -135,7 +93,7 @@ static void ioc_release_fn(struct work_s
* Decrement reference count of @ioc and release it if the count reaches
* zero.
*/
-void put_io_context(struct io_context *ioc)
+void put_io_context(struct io_context *ioc, struct request_queue *locked_q)
{
unsigned long flags;

@@ -144,16 +102,33 @@ void put_io_context(struct io_context *i

BUG_ON(atomic_long_read(&ioc->refcount) <= 0);

- /*
- * Releasing ioc requires reverse order double locking and we may
- * already be holding a queue_lock. Do it asynchronously from wq.
- */
- if (atomic_long_dec_and_test(&ioc->refcount)) {
- spin_lock_irqsave(&ioc->lock, flags);
- if (!hlist_empty(&ioc->icq_list))
- schedule_work(&ioc->release_work);
- spin_unlock_irqrestore(&ioc->lock, flags);
+ if (!atomic_long_dec_and_test(&ioc->refcount))
+ return;
+
+ spin_lock_irqsave_nested(&ioc->lock, flags,
+ ioc_release_depth(locked_q));
+
+ while (!hlist_empty(&ioc->icq_list)) {
+ struct io_cq *icq = hlist_entry(ioc->icq_list.first,
+ struct io_cq, ioc_node);
+ struct request_queue *q = icq->q;
+
+ if (q == locked_q || spin_trylock(q->queue_lock)) {
+ ioc_release_depth_inc(q);
+ ioc_exit_icq(icq);
+ ioc_release_depth_dec(q);
+
+ if (q != locked_q)
+ spin_unlock(q->queue_lock);
+ } else {
+ spin_unlock_irqrestore(&ioc->lock, flags);
+ cpu_relax();
+ spin_lock_irqsave_nested(&ioc->lock, flags,
+ ioc_release_depth(locked_q));
+ }
}
+
+ spin_unlock_irqrestore(&ioc->lock, flags);
}
EXPORT_SYMBOL(put_io_context);

@@ -168,7 +143,7 @@ void exit_io_context(struct task_struct
task_unlock(task);

atomic_dec(&ioc->nr_tasks);
- put_io_context(ioc);
+ put_io_context(ioc, NULL);
}

/**
@@ -208,7 +183,6 @@ void create_io_context_slowpath(struct t
spin_lock_init(&ioc->lock);
INIT_RADIX_TREE(&ioc->icq_tree, GFP_ATOMIC | __GFP_HIGH);
INIT_HLIST_HEAD(&ioc->icq_list);
- INIT_WORK(&ioc->release_work, ioc_release_fn);

/*
* Try to install. ioc shouldn't be installed if someone else
Index: work/block/cfq-iosched.c
===================================================================
--- work.orig/block/cfq-iosched.c
+++ work/block/cfq-iosched.c
@@ -1787,7 +1787,7 @@ __cfq_slice_expired(struct cfq_data *cfq
cfqd->active_queue = NULL;

if (cfqd->active_cic) {
- put_io_context(cfqd->active_cic->icq.ioc);
+ put_io_context(cfqd->active_cic->icq.ioc, cfqd->queue);
cfqd->active_cic = NULL;
}
}
Index: work/fs/ioprio.c
===================================================================
--- work.orig/fs/ioprio.c
+++ work/fs/ioprio.c
@@ -51,7 +51,7 @@ int set_task_ioprio(struct task_struct *
ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
if (ioc) {
ioc_ioprio_changed(ioc, ioprio);
- put_io_context(ioc);
+ put_io_context(ioc, NULL);
}

return err;
Index: work/include/linux/iocontext.h
===================================================================
--- work.orig/include/linux/iocontext.h
+++ work/include/linux/iocontext.h
@@ -3,7 +3,6 @@

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>
-#include <linux/workqueue.h>

enum {
ICQ_IOPRIO_CHANGED,
@@ -113,8 +112,6 @@ struct io_context {
struct radix_tree_root icq_tree;
struct io_cq __rcu *icq_hint;
struct hlist_head icq_list;
-
- struct work_struct release_work;
};

static inline struct io_context *ioc_task_link(struct io_context *ioc)
@@ -133,7 +130,7 @@ static inline struct io_context *ioc_tas

struct task_struct;
#ifdef CONFIG_BLOCK
-void put_io_context(struct io_context *ioc);
+void put_io_context(struct io_context *ioc, struct request_queue *locked_q);
void exit_io_context(struct task_struct *task);
struct io_context *get_task_io_context(struct task_struct *task,
gfp_t gfp_flags, int node);
@@ -141,7 +138,8 @@ void ioc_ioprio_changed(struct io_contex
void ioc_cgroup_changed(struct io_context *ioc);
#else
struct io_context;
-static inline void put_io_context(struct io_context *ioc) { }
+static inline void put_io_context(struct io_context *ioc,
+ struct request_queue *locked_q) { }
static inline void exit_io_context(struct task_struct *task) { }
#endif

Index: work/kernel/fork.c
===================================================================
--- work.orig/kernel/fork.c
+++ work/kernel/fork.c
@@ -890,7 +890,7 @@ static int copy_io(unsigned long clone_f
return -ENOMEM;

new_ioc->ioprio = ioc->ioprio;
- put_io_context(new_ioc);
+ put_io_context(new_ioc, NULL);
}
#endif
return 0;
Index: work/include/linux/blkdev.h
===================================================================
--- work.orig/include/linux/blkdev.h
+++ work/include/linux/blkdev.h
@@ -399,6 +399,9 @@ struct request_queue {
/* Throttle data */
struct throtl_data *td;
#endif
+#ifdef CONFIG_LOCKDEP
+ int ioc_release_depth;
+#endif
};

#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


shaohua.li at intel

Feb 9, 2012, 7:09 PM

Post #20 of 28 (95 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

2012/2/10 Linus Torvalds <torvalds [at] linux-foundation>:
> On Thu, Feb 9, 2012 at 9:59 AM, Tejun Heo <tj [at] kernel> wrote:
>>
>> * What exactly is the test and what do you measure?  What does "12%
>>  regression" mean?  Is it wallclock time or CPU time?  If it's CPU
>>  time, does systime increase dominate the regression?
>
> Shaohua, it might be interesting to see a profile of the bad case.
>
> Now, quite often these kinds of things don't show anything at all -
> it's just due to cache issues and there's no obvious "we hold spinlock
> X for 15 seconds total". But if it's actual lock contention rather
> than just "more scheduling of worker threads", it should show up in
> the profile quite clearly.
>
> That said, I do think the RCU approach is the right one. The whole
> delayed deallocation (and the replacement patch with rwlocks) really
> smells like "badly done RCU-like behavior" to me.
Appears not a lock contention issue. The system is quite idle, about
20% busy. And top shows no cpu is very busy. Before test runs, the
system has about 2G free memory (from vmstat)

system and user time isn't changed. only real time becomes longer.
This suggests IO is slower or there is more IO. But vmstat and iostat
data doesn't show significant difference between the good and bad
cases. There might be some access pattern changed which makes
swap no efficient or working set is wrongly swaped out.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


shli at kernel

Feb 9, 2012, 9:14 PM

Post #21 of 28 (93 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

2012/2/10 Tejun Heo <tj [at] kernel>:
> Hello, Shaohua.
>
> Can you please test the following one?  It's probably the simplest
> version w/o RCU and wq deferring.  RCUfying isn't too bad but I'm
> still a bit hesitant because RCU coverage needs to be extended to
> request_queue via conditional synchronize_rcu() in queue exit path
> (can't enforce delayed RCU free on request_queues and unconditional
> synchronize_rcu() may cause excessive delay during boot for certain
> configurations).  It now can be done in the block core layer proper so
> it shouldn't be as bad tho.  If this too flops, I'll get to that.
doesn't work.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


shli at kernel

Feb 10, 2012, 12:48 AM

Post #22 of 28 (92 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

2012/2/10 Shaohua Li <shli [at] kernel>:
> 2012/2/10 Tejun Heo <tj [at] kernel>:
>> Hello, Shaohua.
>>
>> Can you please test the following one?  It's probably the simplest
>> version w/o RCU and wq deferring.  RCUfying isn't too bad but I'm
>> still a bit hesitant because RCU coverage needs to be extended to
>> request_queue via conditional synchronize_rcu() in queue exit path
>> (can't enforce delayed RCU free on request_queues and unconditional
>> synchronize_rcu() may cause excessive delay during boot for certain
>> configurations).  It now can be done in the block core layer proper so
>> it shouldn't be as bad tho.  If this too flops, I'll get to that.
> doesn't work.
I added trace in the schedule_work code path of put_io_context, which
runs very rare. So it's not lock contention for sure.
Sounds the only difference between the good/bad cases is the good
case runs with rcu_lock_read/rcu_read_unlock. I also checked slab
info, the cfq related slab doesn't use too many memory, unlikely
because rcu latency uses too many memory.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


tj at kernel

Feb 10, 2012, 6:17 PM

Post #23 of 28 (88 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

Hello,

On Fri, Feb 10, 2012 at 04:48:49PM +0800, Shaohua Li wrote:
> >> Can you please test the following one?  It's probably the simplest
> >> version w/o RCU and wq deferring.  RCUfying isn't too bad but I'm
> >> still a bit hesitant because RCU coverage needs to be extended to
> >> request_queue via conditional synchronize_rcu() in queue exit path
> >> (can't enforce delayed RCU free on request_queues and unconditional
> >> synchronize_rcu() may cause excessive delay during boot for certain
> >> configurations).  It now can be done in the block core layer proper so
> >> it shouldn't be as bad tho.  If this too flops, I'll get to that.
> > doesn't work.
> I added trace in the schedule_work code path of put_io_context, which
> runs very rare. So it's not lock contention for sure.
> Sounds the only difference between the good/bad cases is the good
> case runs with rcu_lock_read/rcu_read_unlock. I also checked slab
> info, the cfq related slab doesn't use too many memory, unlikely
> because rcu latency uses too many memory.

Yeah, that makes much more sense. It just isn't hot enough path for
this sort of micro locking changes to matter. I think the problem is
that, after the change, the cfqq aren't being expired immediately on
task exit. ie. While moving the cic destruction to release path, I
accidentally removed exit notification to cfq. I'll come up with a
fix.

Thank you!

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


axboe at kernel

Feb 11, 2012, 3:35 AM

Post #24 of 28 (83 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

On 2012-02-11 03:17, Tejun Heo wrote:
> Hello,
>
> On Fri, Feb 10, 2012 at 04:48:49PM +0800, Shaohua Li wrote:
>>>> Can you please test the following one? It's probably the simplest
>>>> version w/o RCU and wq deferring. RCUfying isn't too bad but I'm
>>>> still a bit hesitant because RCU coverage needs to be extended to
>>>> request_queue via conditional synchronize_rcu() in queue exit path
>>>> (can't enforce delayed RCU free on request_queues and unconditional
>>>> synchronize_rcu() may cause excessive delay during boot for certain
>>>> configurations). It now can be done in the block core layer proper so
>>>> it shouldn't be as bad tho. If this too flops, I'll get to that.
>>> doesn't work.
>> I added trace in the schedule_work code path of put_io_context, which
>> runs very rare. So it's not lock contention for sure.
>> Sounds the only difference between the good/bad cases is the good
>> case runs with rcu_lock_read/rcu_read_unlock. I also checked slab
>> info, the cfq related slab doesn't use too many memory, unlikely
>> because rcu latency uses too many memory.
>
> Yeah, that makes much more sense. It just isn't hot enough path for
> this sort of micro locking changes to matter. I think the problem is
> that, after the change, the cfqq aren't being expired immediately on
> task exit. ie. While moving the cic destruction to release path, I
> accidentally removed exit notification to cfq. I'll come up with a
> fix.

Was just thinking about that last night, the missing slice expire on
task exit makes a LOT more sense than changed locking.

I'm pushing off what I have to Linus today, since I'll be gone skiing
next week. I will check email regularly and be able to apply patches and
so forth, just a heads up on availability.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


shli at kernel

Feb 12, 2012, 5:34 PM

Post #25 of 28 (82 views)
Permalink
Re: [PATCH] block: strip out locking optimization in put_io_context() [In reply to]

2012/2/11 Tejun Heo <tj [at] kernel>:
> Hello,
>
> On Fri, Feb 10, 2012 at 04:48:49PM +0800, Shaohua Li wrote:
>> >> Can you please test the following one?  It's probably the simplest
>> >> version w/o RCU and wq deferring.  RCUfying isn't too bad but I'm
>> >> still a bit hesitant because RCU coverage needs to be extended to
>> >> request_queue via conditional synchronize_rcu() in queue exit path
>> >> (can't enforce delayed RCU free on request_queues and unconditional
>> >> synchronize_rcu() may cause excessive delay during boot for certain
>> >> configurations).  It now can be done in the block core layer proper so
>> >> it shouldn't be as bad tho.  If this too flops, I'll get to that.
>> > doesn't work.
>> I added trace in the schedule_work code path of put_io_context, which
>> runs very rare. So it's not lock contention for sure.
>> Sounds the only difference between the good/bad cases is the good
>> case runs with rcu_lock_read/rcu_read_unlock. I also checked slab
>> info, the cfq related slab doesn't use too many memory, unlikely
>> because rcu latency uses too many memory.
>
> Yeah, that makes much more sense.  It just isn't hot enough path for
> this sort of micro locking changes to matter.  I think the problem is
> that, after the change, the cfqq aren't being expired immediately on
> task exit.  ie. While moving the cic destruction to release path, I
> accidentally removed exit notification to cfq.  I'll come up with a
> fix.
ah, I felt a little strange looking at exit_io_context, but didn't realize
the exit notification is removed (confused by put_io_context). This
makes a lot of sense. Good catch!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First page Previous page 1 2 Next page Last page  View All Linux kernel RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.