Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Linux: Kernel

[PATCH 0/4] Was: deferring __fput()

 

 

Linux kernel RSS feed   Index | Next | Previous | View Threaded


oleg at redhat

Jun 27, 2012, 11:37 AM

Post #1 of 23 (188 views)
Permalink
[PATCH 0/4] Was: deferring __fput()

On 06/25, Oleg Nesterov wrote:
>
> And if it always takes ->pi_lock we do not need the new PF_ or something
> else, exit_task_work() can set task->task_works = NO_MORE under ->pi_lock
> (task_work_run() can check PF_EXITING), and task_work_add() ensures that
> task_works != NO_MORE.
>
> What do you think?

It is not clear to me if you agree or not. So I am simply sending the
patches I have.

Feel free to ignore or re-do.

Seriously, why should we add 2 pointers into task_struct? Sure, this
is minor, but still... But perhaps task_work.c should not play tricks
with the circular list, task_work_run() can reverse the list as you
initially suggested.

Also, I am not sure about "define rcu_head callback_head", this series
doesn't do this. But again, up to you.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


viro at ZenIV

Jun 27, 2012, 9:38 PM

Post #2 of 23 (181 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Wed, Jun 27, 2012 at 08:37:21PM +0200, Oleg Nesterov wrote:
> On 06/25, Oleg Nesterov wrote:
> >
> > And if it always takes ->pi_lock we do not need the new PF_ or something
> > else, exit_task_work() can set task->task_works = NO_MORE under ->pi_lock
> > (task_work_run() can check PF_EXITING), and task_work_add() ensures that
> > task_works != NO_MORE.
> >
> > What do you think?
>
> It is not clear to me if you agree or not. So I am simply sending the
> patches I have.
>
> Feel free to ignore or re-do.
>
> Seriously, why should we add 2 pointers into task_struct? Sure, this
> is minor, but still... But perhaps task_work.c should not play tricks
> with the circular list, task_work_run() can reverse the list as you
> initially suggested.
>
> Also, I am not sure about "define rcu_head callback_head", this series
> doesn't do this. But again, up to you.

Umm... FWIW, my variant circa yesterday is in vfs.git#untested; it seems to survive
on uml/amd64 at least. I'll look through your patches and see what can be nicked.
The list removal logics in mine looks really ugly ;-/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


oleg at redhat

Jun 28, 2012, 9:22 AM

Post #3 of 23 (181 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On 06/28, Al Viro wrote:
>
> The list removal logics in mine looks really ugly ;-/

Basically the code is almost the same as mine.

But task_work_add() can be simplified a bit, no need to check
last != NULL twice.

last = task->task_works ?: twork;
task->last_twork = twork;
twork->next = last->next; /* XXX */
last->next = twork;

If ->task_works == NULL, then XXX is meaningless but safe.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


oleg at redhat

Jun 28, 2012, 9:45 AM

Post #4 of 23 (183 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

Forgot to mention...

And I still think that task_work_add() should not succeed unconditionally,
it synchronize with exit_task_work(). Otherwise keyctl_session_to_parent()
is broken.

On 06/28, Oleg Nesterov wrote:
>
> On 06/28, Al Viro wrote:
> >
> > The list removal logics in mine looks really ugly ;-/
>
> Basically the code is almost the same as mine.
>
> But task_work_add() can be simplified a bit, no need to check
> last != NULL twice.
>
> last = task->task_works ?: twork;
> task->last_twork = twork;
> twork->next = last->next; /* XXX */
> last->next = twork;
>
> If ->task_works == NULL, then XXX is meaningless but safe.
>
> Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


zohar at linux

Jun 28, 2012, 10:30 PM

Post #5 of 23 (179 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Thu, 2012-06-28 at 05:38 +0100, Al Viro wrote:
> On Wed, Jun 27, 2012 at 08:37:21PM +0200, Oleg Nesterov wrote:
> > On 06/25, Oleg Nesterov wrote:
> > >
> > > And if it always takes ->pi_lock we do not need the new PF_ or something
> > > else, exit_task_work() can set task->task_works = NO_MORE under ->pi_lock
> > > (task_work_run() can check PF_EXITING), and task_work_add() ensures that
> > > task_works != NO_MORE.
> > >
> > > What do you think?
> >
> > It is not clear to me if you agree or not. So I am simply sending the
> > patches I have.
> >
> > Feel free to ignore or re-do.
> >
> > Seriously, why should we add 2 pointers into task_struct? Sure, this
> > is minor, but still... But perhaps task_work.c should not play tricks
> > with the circular list, task_work_run() can reverse the list as you
> > initially suggested.
> >
> > Also, I am not sure about "define rcu_head callback_head", this series
> > doesn't do this. But again, up to you.
>
> Umm... FWIW, my variant circa yesterday is in vfs.git#untested; it seems to survive
> on uml/amd64 at least. I'll look through your patches and see what can be nicked.
> The list removal logics in mine looks really ugly ;-/

Still failing to boot. Fails to boot starting with commit "b24dfa6
switch fput to task_work_add".

Mimi

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


viro at ZenIV

Jun 29, 2012, 1:33 AM

Post #6 of 23 (182 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Fri, Jun 29, 2012 at 01:30:38AM -0400, Mimi Zohar wrote:
> On Thu, 2012-06-28 at 05:38 +0100, Al Viro wrote:
> > On Wed, Jun 27, 2012 at 08:37:21PM +0200, Oleg Nesterov wrote:
> > > On 06/25, Oleg Nesterov wrote:
> > > >
> > > > And if it always takes ->pi_lock we do not need the new PF_ or something
> > > > else, exit_task_work() can set task->task_works = NO_MORE under ->pi_lock
> > > > (task_work_run() can check PF_EXITING), and task_work_add() ensures that
> > > > task_works != NO_MORE.
> > > >
> > > > What do you think?
> > >
> > > It is not clear to me if you agree or not. So I am simply sending the
> > > patches I have.
> > >
> > > Feel free to ignore or re-do.
> > >
> > > Seriously, why should we add 2 pointers into task_struct? Sure, this
> > > is minor, but still... But perhaps task_work.c should not play tricks
> > > with the circular list, task_work_run() can reverse the list as you
> > > initially suggested.
> > >
> > > Also, I am not sure about "define rcu_head callback_head", this series
> > > doesn't do this. But again, up to you.
> >
> > Umm... FWIW, my variant circa yesterday is in vfs.git#untested; it seems to survive
> > on uml/amd64 at least. I'll look through your patches and see what can be nicked.
> > The list removal logics in mine looks really ugly ;-/
>
> Still failing to boot. Fails to boot starting with commit "b24dfa6
> switch fput to task_work_add".

Details, please... .config, root fs type, etc. Failed execve() of init is, unfortunately,
not too informative...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


zohar at linux

Jun 29, 2012, 6:02 AM

Post #7 of 23 (179 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Fri, 2012-06-29 at 09:33 +0100, Al Viro wrote:
> On Fri, Jun 29, 2012 at 01:30:38AM -0400, Mimi Zohar wrote:
> > On Thu, 2012-06-28 at 05:38 +0100, Al Viro wrote:
> > > On Wed, Jun 27, 2012 at 08:37:21PM +0200, Oleg Nesterov wrote:
> > > > On 06/25, Oleg Nesterov wrote:
> > > > >
> > > > > And if it always takes ->pi_lock we do not need the new PF_ or something
> > > > > else, exit_task_work() can set task->task_works = NO_MORE under ->pi_lock
> > > > > (task_work_run() can check PF_EXITING), and task_work_add() ensures that
> > > > > task_works != NO_MORE.
> > > > >
> > > > > What do you think?
> > > >
> > > > It is not clear to me if you agree or not. So I am simply sending the
> > > > patches I have.
> > > >
> > > > Feel free to ignore or re-do.
> > > >
> > > > Seriously, why should we add 2 pointers into task_struct? Sure, this
> > > > is minor, but still... But perhaps task_work.c should not play tricks
> > > > with the circular list, task_work_run() can reverse the list as you
> > > > initially suggested.
> > > >
> > > > Also, I am not sure about "define rcu_head callback_head", this series
> > > > doesn't do this. But again, up to you.
> > >
> > > Umm... FWIW, my variant circa yesterday is in vfs.git#untested; it seems to survive
> > > on uml/amd64 at least. I'll look through your patches and see what can be nicked.
> > > The list removal logics in mine looks really ugly ;-/
> >
> > Still failing to boot. Fails to boot starting with commit "b24dfa6
> > switch fput to task_work_add".
>
> Details, please... .config, root fs type, etc. Failed execve() of init is, unfortunately,
> not too informative...

In addition to failing with my tailored .config (eg. IMA, TPM builtin),
it fails with config-3.4.2-1.fc16.x86_64, with new default options,
running on a Lenovo W520.

/dev/sda7 on / type ext4 (rw,relatime,seclabel,i_version,data=ordered)

Mimi

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


viro at ZenIV

Jun 29, 2012, 10:41 AM

Post #8 of 23 (181 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Fri, Jun 29, 2012 at 09:02:05AM -0400, Mimi Zohar wrote:

> In addition to failing with my tailored .config (eg. IMA, TPM builtin),
> it fails with config-3.4.2-1.fc16.x86_64, with new default options,
> running on a Lenovo W520.
>
> /dev/sda7 on / type ext4 (rw,relatime,seclabel,i_version,data=ordered)

Lovely... Smells like something making dracut unhappy. Does it say
anything interesting when booting with rdinitdebug in command line?
What happens with rdshell in there?

I don't have easily accessible Fedora boxen at the moment and it does
look like something fishy going on while initramfs stuff runs...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


zohar at linux

Jun 29, 2012, 2:38 PM

Post #9 of 23 (182 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Fri, 2012-06-29 at 18:41 +0100, Al Viro wrote:
> On Fri, Jun 29, 2012 at 09:02:05AM -0400, Mimi Zohar wrote:
>
> > In addition to failing with my tailored .config (eg. IMA, TPM builtin),
> > it fails with config-3.4.2-1.fc16.x86_64, with new default options,
> > running on a Lenovo W520.
> >
> > /dev/sda7 on / type ext4 (rw,relatime,seclabel,i_version,data=ordered)
>
> Lovely... Smells like something making dracut unhappy. Does it say
> anything interesting when booting with rdinitdebug in command line?
> What happens with rdshell in there?
>
> I don't have easily accessible Fedora boxen at the moment and it does
> look like something fishy going on while initramfs stuff runs...

No difference with rdinitdebug, rdshell, but the message
"ata5: SATA link down (SStatus 0 SControl 300)" prior to "Freeing unused
kernel memory... ", "Failed to execute /init" and the trace back,
probably helps. Don't know how I missed that.

Mimi

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


zohar at linux

Jun 29, 2012, 4:56 PM

Post #10 of 23 (181 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Fri, 2012-06-29 at 17:38 -0400, Mimi Zohar wrote:
> On Fri, 2012-06-29 at 18:41 +0100, Al Viro wrote:
> > On Fri, Jun 29, 2012 at 09:02:05AM -0400, Mimi Zohar wrote:
> >
> > > In addition to failing with my tailored .config (eg. IMA, TPM builtin),
> > > it fails with config-3.4.2-1.fc16.x86_64, with new default options,
> > > running on a Lenovo W520.
> > >
> > > /dev/sda7 on / type ext4 (rw,relatime,seclabel,i_version,data=ordered)
> >
> > Lovely... Smells like something making dracut unhappy. Does it say
> > anything interesting when booting with rdinitdebug in command line?
> > What happens with rdshell in there?
> >
> > I don't have easily accessible Fedora boxen at the moment and it does
> > look like something fishy going on while initramfs stuff runs...
>
> No difference with rdinitdebug, rdshell, but the message
> "ata5: SATA link down (SStatus 0 SControl 300)" prior to "Freeing unused
> kernel memory... ", "Failed to execute /init" and the trace back,
> probably helps. Don't know how I missed that.

Looking at /var/log/messages, seems like the ata4 and ata5 "SATA link
down (SStatus 0 SControl 300)" messages are normal.

ata5: SATA link down (SStatus 0 SControl 300)
Freeing unused kernel memory: 1016k freed
Write protecting the kernel read-only data: 12288k
Freeing unused kernel memory: 1964k freed
Freeing unused kernel memory: 1468k freed
Failed to execute /init
Kernel panic - not syncing. No init Found. Try passing init= option ...
Pid: 1, comm: swapper/0 not tainted 3.5.0-rc1+
Call Trace:
panic
init_post
kernel_init
?do_early_param
kernel_thread_helper
start_kernel

Mimi

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


viro at ZenIV

Jun 29, 2012, 10:02 PM

Post #11 of 23 (179 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Fri, Jun 29, 2012 at 07:56:37PM -0400, Mimi Zohar wrote:
> Looking at /var/log/messages, seems like the ata4 and ata5 "SATA link
> down (SStatus 0 SControl 300)" messages are normal.
>
> ata5: SATA link down (SStatus 0 SControl 300)
> Freeing unused kernel memory: 1016k freed
> Write protecting the kernel read-only data: 12288k
> Freeing unused kernel memory: 1964k freed
> Freeing unused kernel memory: 1468k freed
> Failed to execute /init
> Kernel panic - not syncing. No init Found. Try passing init= option ...
> Pid: 1, comm: swapper/0 not tainted 3.5.0-rc1+
> Call Trace:
> panic
> init_post
> kernel_init
> ?do_early_param
> kernel_thread_helper
> start_kernel

Just to make sure - you are not getting IMA violations among all that? AFAICS,
the damn thing should behave no worse in that respect than your own patch
a while ago, and you haven't mentioned them in this thread, but...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


viro at ZenIV

Jun 29, 2012, 11:24 PM

Post #12 of 23 (181 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Thu, Jun 28, 2012 at 06:45:39PM +0200, Oleg Nesterov wrote:
> Forgot to mention...
>
> And I still think that task_work_add() should not succeed unconditionally,
> it synchronize with exit_task_work(). Otherwise keyctl_session_to_parent()
> is broken.

Hmm... Look: if nothing else, we have
/* the parent mustn't be init and mustn't be a kernel thread */
if (parent->pid <= 1 || !parent->mm)
goto unlock;
in the caller. OTOH, on the exit side we have exit_mm() done first. And
that will have ->mm set to NULL. So we are closing a very narrow race to start
with. So why not do the following and be done with that?

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 0291b3f..f1b59ae 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -1486,6 +1486,7 @@ long keyctl_session_to_parent(void)
oldwork = NULL;
parent = me->real_parent;

+ task_lock(parent);
/* the parent mustn't be init and mustn't be a kernel thread */
if (parent->pid <= 1 || !parent->mm)
goto unlock;
@@ -1529,6 +1530,7 @@ long keyctl_session_to_parent(void)
if (!ret)
newwork = NULL;
unlock:
+ task_unlock(parent);
write_unlock_irq(&tasklist_lock);
rcu_read_unlock();
if (oldwork)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


oleg at redhat

Jun 30, 2012, 10:41 AM

Post #13 of 23 (178 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On 06/30, Al Viro wrote:
>
> On Thu, Jun 28, 2012 at 06:45:39PM +0200, Oleg Nesterov wrote:
> > Forgot to mention...
> >
> > And I still think that task_work_add() should not succeed unconditionally,
> > it synchronize with exit_task_work(). Otherwise keyctl_session_to_parent()
> > is broken.
>
> Hmm... Look: if nothing else, we have
> /* the parent mustn't be init and mustn't be a kernel thread */
> if (parent->pid <= 1 || !parent->mm)
> goto unlock;
> in the caller. OTOH, on the exit side we have exit_mm() done first. And
> that will have ->mm set to NULL. So we are closing a very narrow race to start
> with. So why not do the following and be done with that?

Of course we can fix keyctl_session_to_parent(). But why? I mean, why
do you dislike the idea to put this synchronization into add/run ?

IMO, this makes task_work much less useful/convenient. Every caller
has to fight with this race somehow. And the lock it can take depends
on the context, say, you can't use task_lock() in irq.

IOW, what is wrong with

[PATCH 2/4] task_work: don't rely on PF_EXITING
http://marc.info/?l=linux-kernel&m=134082265321691

and
[PATCH 3/4] task_work: deal with task_work callbacks adding more work
http://marc.info/?t=134082275400004

?

perhaps you do not like the fact that the exiting task takes pi_lock
unconditionally?



and in fact I think that probably it makes sense to change fput,

- task_work_add(current, ...);
+ BUG_ON(task_work_add(current, ...) != 0);

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


zohar at linux

Jul 1, 2012, 12:50 PM

Post #14 of 23 (169 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Sat, 2012-06-30 at 06:02 +0100, Al Viro wrote:
> On Fri, Jun 29, 2012 at 07:56:37PM -0400, Mimi Zohar wrote:
> > Looking at /var/log/messages, seems like the ata4 and ata5 "SATA link
> > down (SStatus 0 SControl 300)" messages are normal.
> >
> > ata5: SATA link down (SStatus 0 SControl 300)
> > Freeing unused kernel memory: 1016k freed
> > Write protecting the kernel read-only data: 12288k
> > Freeing unused kernel memory: 1964k freed
> > Freeing unused kernel memory: 1468k freed
> > Failed to execute /init
> > Kernel panic - not syncing. No init Found. Try passing init= option ...
> > Pid: 1, comm: swapper/0 not tainted 3.5.0-rc1+
> > Call Trace:
> > panic
> > init_post
> > kernel_init
> > ?do_early_param
> > kernel_thread_helper
> > start_kernel
>
> Just to make sure - you are not getting IMA violations among all that?

I'm not running with the IMA-appraisal patches, nor does the
Fedora .config enable IMA. So I'm not getting violations.

> AFAICS,
> the damn thing should behave no worse in that respect than your own patch
> a while ago, and you haven't mentioned them in this thread, but...

I haven't mentioned the "ima: defer calling __fput()" patch, since I've
compiled git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git
#untested with a .config based on config-3.4.2-1.fc16.x86_64 and am
having this problem. No need to add more confusion. The "ima: defer
calling __fput()" will be dropped from the patchset, as soon as the
general method works.

I've isolated the problem to the PF_KTHREAD section of fput().

void fput(struct file *file)
{
if (atomic_long_dec_and_test(&file->f_count)) {
struct task_struct *task = current;
file_sb_list_del(file);
if (unlikely(in_interrupt() || task->flags & PF_KTHREAD)) {
unsigned long flags;
spin_lock_irqsave(&delayed_fput_lock, flags);
list_add(&file->f_u.fu_list, &delayed_fput_list);
schedule_work(&delayed_fput_work);
spin_unlock_irqrestore(&delayed_fput_lock, flags);
return;
}
init_task_work(&file->f_u.fu_rcuhead, ____fput);
task_work_add(task, &file->f_u.fu_rcuhead, true);
}
}

Replacing it with a call to __fput(), the system boots.

thanks,

Mimi

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


viro at ZenIV

Jul 1, 2012, 1:57 PM

Post #15 of 23 (167 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Sun, Jul 01, 2012 at 03:50:02PM -0400, Mimi Zohar wrote:
>
> I haven't mentioned the "ima: defer calling __fput()" patch, since I've
> compiled git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git
> #untested with a .config based on config-3.4.2-1.fc16.x86_64 and am
> having this problem. No need to add more confusion. The "ima: defer
> calling __fput()" will be dropped from the patchset, as soon as the
> general method works.
>
> I've isolated the problem to the PF_KTHREAD section of fput().
>
> void fput(struct file *file)
> {
> if (atomic_long_dec_and_test(&file->f_count)) {
> struct task_struct *task = current;
> file_sb_list_del(file);
> if (unlikely(in_interrupt() || task->flags & PF_KTHREAD)) {
> unsigned long flags;
> spin_lock_irqsave(&delayed_fput_lock, flags);
> list_add(&file->f_u.fu_list, &delayed_fput_list);
> schedule_work(&delayed_fput_work);
> spin_unlock_irqrestore(&delayed_fput_lock, flags);
> return;
> }
> init_task_work(&file->f_u.fu_rcuhead, ____fput);
> task_work_add(task, &file->f_u.fu_rcuhead, true);
> }
> }
>
> Replacing it with a call to __fput(), the system boots.

"it" being just the part under that if (unlikely(...)))? Very interesting... If so, we
have some kernel thread ending up with delayed __fput() which somehow makes dracut (assuimg
you are using fedora initramfs to go with fedora config) unhappy. With your own patch,
doing async __fput() in a lot of cases when this one doesn't delay past the return to
userland managing to survive the boot... I wonder which files end up triggering that fun
and which kernel thread is responsible... Could you slap a printk() in there, showing
file->f_dentry->d_inode->i_mode (octal) and at least file->f_dentry->d_name.name?
Along with the current->comm[], all under that inner if (). And see which ones end up
going that way by the time execve() of /sbin/init fails.

It would be nice to see which sys_mount() calls are made and which (if any) fail, BTW.
I wonder if it even gets to mounting the right root...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


zohar at linux

Jul 1, 2012, 6:46 PM

Post #16 of 23 (164 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Sun, 2012-07-01 at 21:57 +0100, Al Viro wrote:
> On Sun, Jul 01, 2012 at 03:50:02PM -0400, Mimi Zohar wrote:
> > Replacing it with a call to __fput(), the system boots.
>
> "it" being just the part under that if (unlikely(...)))? Very interesting... If so, we
> have some kernel thread ending up with delayed __fput() which somehow makes dracut (assuimg
> you are using fedora initramfs to go with fedora config) unhappy. With your own patch,
> doing async __fput() in a lot of cases when this one doesn't delay past the return to
> userland managing to survive the boot... I wonder which files end up triggering that fun
> and which kernel thread is responsible... Could you slap a printk() in there, showing
> file->f_dentry->d_inode->i_mode (octal) and at least file->f_dentry->d_name.name?
> Along with the current->comm[], all under that inner if (). And see which ones end up
> going that way by the time execve() of /sbin/init fails.

pid=1 uid=0 d_name=init comm=swapper/0 dev="rootfs" mode=100775
pid=1 uid=0 d_name=bash comm=swapper/0 dev="rootfs" mode=100755

> It would be nice to see which sys_mount() calls are made and which (if any) fail, BTW.
> I wonder if it even gets to mounting the right root...


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


viro at ZenIV

Jul 1, 2012, 8:43 PM

Post #17 of 23 (160 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Sun, Jul 01, 2012 at 09:46:31PM -0400, Mimi Zohar wrote:
> On Sun, 2012-07-01 at 21:57 +0100, Al Viro wrote:
> > On Sun, Jul 01, 2012 at 03:50:02PM -0400, Mimi Zohar wrote:
> > > Replacing it with a call to __fput(), the system boots.
> >
> > "it" being just the part under that if (unlikely(...)))? Very interesting... If so, we
> > have some kernel thread ending up with delayed __fput() which somehow makes dracut (assuimg
> > you are using fedora initramfs to go with fedora config) unhappy. With your own patch,
> > doing async __fput() in a lot of cases when this one doesn't delay past the return to
> > userland managing to survive the boot... I wonder which files end up triggering that fun
> > and which kernel thread is responsible... Could you slap a printk() in there, showing
> > file->f_dentry->d_inode->i_mode (octal) and at least file->f_dentry->d_name.name?
> > Along with the current->comm[], all under that inner if (). And see which ones end up
> > going that way by the time execve() of /sbin/init fails.
>
> pid=1 uid=0 d_name=init comm=swapper/0 dev="rootfs" mode=100775
> pid=1 uid=0 d_name=bash comm=swapper/0 dev="rootfs" mode=100755

OK... Here's what I suspect is going on:
* populating initramfs writes binaries there. We open files (for write) from
the kernel thread (there's nothing other than kernel threads at that point), write to
them, then close(). Final fput() gets delayed.
* Then we proceed to execve(). Which means mapping the binary with MAP_DENYWRITE.
Which fails, since there's a struct file still opened for write on that sucker.

Your patch did not delay those fput() - they were done without ->mmap_sem held. So
it survived. Booting without initramfs always survives; booting with initramfs may
or may not survive, depending on the timings - if that scheduled work manages to
run by the time we do those execve(), we win. Note that async_synchronize_full()
done in init_post() might easily affect that, depending on config.

As a quick test, could you try slapping a delay somewhere around the beginning
of init_post() and see if it rescues the system?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


viro at ZenIV

Jul 1, 2012, 10:11 PM

Post #18 of 23 (158 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Mon, Jul 02, 2012 at 04:43:10AM +0100, Al Viro wrote:
> On Sun, Jul 01, 2012 at 09:46:31PM -0400, Mimi Zohar wrote:
> > On Sun, 2012-07-01 at 21:57 +0100, Al Viro wrote:
> > > On Sun, Jul 01, 2012 at 03:50:02PM -0400, Mimi Zohar wrote:
> > > > Replacing it with a call to __fput(), the system boots.
> > >
> > > "it" being just the part under that if (unlikely(...)))? Very interesting... If so, we
> > > have some kernel thread ending up with delayed __fput() which somehow makes dracut (assuimg
> > > you are using fedora initramfs to go with fedora config) unhappy. With your own patch,
> > > doing async __fput() in a lot of cases when this one doesn't delay past the return to
> > > userland managing to survive the boot... I wonder which files end up triggering that fun
> > > and which kernel thread is responsible... Could you slap a printk() in there, showing
> > > file->f_dentry->d_inode->i_mode (octal) and at least file->f_dentry->d_name.name?
> > > Along with the current->comm[], all under that inner if (). And see which ones end up
> > > going that way by the time execve() of /sbin/init fails.
> >
> > pid=1 uid=0 d_name=init comm=swapper/0 dev="rootfs" mode=100775
> > pid=1 uid=0 d_name=bash comm=swapper/0 dev="rootfs" mode=100755
>
> OK... Here's what I suspect is going on:
> * populating initramfs writes binaries there. We open files (for write) from
> the kernel thread (there's nothing other than kernel threads at that point), write to
> them, then close(). Final fput() gets delayed.
> * Then we proceed to execve(). Which means mapping the binary with MAP_DENYWRITE.
> Which fails, since there's a struct file still opened for write on that sucker.
>
> Your patch did not delay those fput() - they were done without ->mmap_sem held. So
> it survived. Booting without initramfs always survives; booting with initramfs may
> or may not survive, depending on the timings - if that scheduled work manages to
> run by the time we do those execve(), we win. Note that async_synchronize_full()
> done in init_post() might easily affect that, depending on config.
>
> As a quick test, could you try slapping a delay somewhere around the beginning
> of init_post() and see if it rescues the system?

Ho-hum... How about this (modulo missing documentation of the whole sad mess):

diff --git a/fs/file_table.c b/fs/file_table.c
index 470da0b..00fd849 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -284,6 +284,11 @@ static void ____fput(struct callback_head *work)
__fput(container_of(work, struct file, f_u.fu_rcuhead));
}

+void flush_delayed_fput(void)
+{
+ delayed_fput(NULL);
+}
+
static DECLARE_WORK(delayed_fput_work, delayed_fput);

void fput(struct file *file)
diff --git a/include/linux/file.h b/include/linux/file.h
index 58bf158..d9a4f5a 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -39,4 +39,6 @@ extern void put_unused_fd(unsigned int fd);

extern void fd_install(unsigned int fd, struct file *file);

+extern void flush_delayed_fput(void);
+
#endif /* __LINUX_FILE_H */
diff --git a/init/main.c b/init/main.c
index b5cc0a7..3f151f6 100644
--- a/init/main.c
+++ b/init/main.c
@@ -68,6 +68,7 @@
#include <linux/shmem_fs.h>
#include <linux/slab.h>
#include <linux/perf_event.h>
+#include <linux/file.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -804,8 +805,8 @@ static noinline int init_post(void)
system_state = SYSTEM_RUNNING;
numa_default_policy();

-
current->signal->flags |= SIGNAL_UNKILLABLE;
+ flush_delayed_fput();

if (ramdisk_execute_command) {
run_init_process(ramdisk_execute_command);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


zohar at linux

Jul 2, 2012, 4:49 AM

Post #19 of 23 (160 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Mon, 2012-07-02 at 06:11 +0100, Al Viro wrote:
> On Mon, Jul 02, 2012 at 04:43:10AM +0100, Al Viro wrote:
> > On Sun, Jul 01, 2012 at 09:46:31PM -0400, Mimi Zohar wrote:
> > > On Sun, 2012-07-01 at 21:57 +0100, Al Viro wrote:
> > > > On Sun, Jul 01, 2012 at 03:50:02PM -0400, Mimi Zohar wrote:
> > > > > Replacing it with a call to __fput(), the system boots.
> > > >
> > > > "it" being just the part under that if (unlikely(...)))? Very interesting... If so, we
> > > > have some kernel thread ending up with delayed __fput() which somehow makes dracut (assuimg
> > > > you are using fedora initramfs to go with fedora config) unhappy. With your own patch,
> > > > doing async __fput() in a lot of cases when this one doesn't delay past the return to
> > > > userland managing to survive the boot... I wonder which files end up triggering that fun
> > > > and which kernel thread is responsible... Could you slap a printk() in there, showing
> > > > file->f_dentry->d_inode->i_mode (octal) and at least file->f_dentry->d_name.name?
> > > > Along with the current->comm[], all under that inner if (). And see which ones end up
> > > > going that way by the time execve() of /sbin/init fails.
> > >
> > > pid=1 uid=0 d_name=init comm=swapper/0 dev="rootfs" mode=100775
> > > pid=1 uid=0 d_name=bash comm=swapper/0 dev="rootfs" mode=100755
> >
> > OK... Here's what I suspect is going on:
> > * populating initramfs writes binaries there. We open files (for write) from
> > the kernel thread (there's nothing other than kernel threads at that point), write to
> > them, then close(). Final fput() gets delayed.
> > * Then we proceed to execve(). Which means mapping the binary with MAP_DENYWRITE.
> > Which fails, since there's a struct file still opened for write on that sucker.
> >
> > Your patch did not delay those fput() - they were done without ->mmap_sem held. So
> > it survived. Booting without initramfs always survives; booting with initramfs may
> > or may not survive, depending on the timings - if that scheduled work manages to
> > run by the time we do those execve(), we win. Note that async_synchronize_full()
> > done in init_post() might easily affect that, depending on config.
> >
> > As a quick test, could you try slapping a delay somewhere around the beginning
> > of init_post() and see if it rescues the system?
>
> Ho-hum... How about this (modulo missing documentation of the whole sad mess):

Sorry, neither adding the delay or this patch helped.

> diff --git a/fs/file_table.c b/fs/file_table.c
> index 470da0b..00fd849 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -284,6 +284,11 @@ static void ____fput(struct callback_head *work)
> __fput(container_of(work, struct file, f_u.fu_rcuhead));
> }
>
> +void flush_delayed_fput(void)
> +{
> + delayed_fput(NULL);
> +}
> +
> static DECLARE_WORK(delayed_fput_work, delayed_fput);
>
> void fput(struct file *file)
> diff --git a/include/linux/file.h b/include/linux/file.h
> index 58bf158..d9a4f5a 100644
> --- a/include/linux/file.h
> +++ b/include/linux/file.h
> @@ -39,4 +39,6 @@ extern void put_unused_fd(unsigned int fd);
>
> extern void fd_install(unsigned int fd, struct file *file);
>
> +extern void flush_delayed_fput(void);
> +
> #endif /* __LINUX_FILE_H */
> diff --git a/init/main.c b/init/main.c
> index b5cc0a7..3f151f6 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -68,6 +68,7 @@
> #include <linux/shmem_fs.h>
> #include <linux/slab.h>
> #include <linux/perf_event.h>
> +#include <linux/file.h>
>
> #include <asm/io.h>
> #include <asm/bugs.h>
> @@ -804,8 +805,8 @@ static noinline int init_post(void)
> system_state = SYSTEM_RUNNING;
> numa_default_policy();
>
> -
> current->signal->flags |= SIGNAL_UNKILLABLE;
> + flush_delayed_fput();
>
> if (ramdisk_execute_command) {
> run_init_process(ramdisk_execute_command);


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


viro at ZenIV

Jul 2, 2012, 5:02 AM

Post #20 of 23 (160 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Mon, Jul 02, 2012 at 07:49:50AM -0400, Mimi Zohar wrote:

> > > > pid=1 uid=0 d_name=init comm=swapper/0 dev="rootfs" mode=100775
> > > > pid=1 uid=0 d_name=bash comm=swapper/0 dev="rootfs" mode=100755
> > >
> > > OK... Here's what I suspect is going on:
> > > * populating initramfs writes binaries there. We open files (for write) from
> > > the kernel thread (there's nothing other than kernel threads at that point), write to
> > > them, then close(). Final fput() gets delayed.
> > > * Then we proceed to execve(). Which means mapping the binary with MAP_DENYWRITE.
> > > Which fails, since there's a struct file still opened for write on that sucker.
> > >
> > > Your patch did not delay those fput() - they were done without ->mmap_sem held. So
> > > it survived. Booting without initramfs always survives; booting with initramfs may
> > > or may not survive, depending on the timings - if that scheduled work manages to
> > > run by the time we do those execve(), we win. Note that async_synchronize_full()
> > > done in init_post() might easily affect that, depending on config.
> > >
> > > As a quick test, could you try slapping a delay somewhere around the beginning
> > > of init_post() and see if it rescues the system?
> >
> > Ho-hum... How about this (modulo missing documentation of the whole sad mess):
>
> Sorry, neither adding the delay or this patch helped.

Really odd. Could you print the error returned by kernel_execve() in run_init_process()?
At least that way we'll get some indication of what's going on there. Another thing:
could you slap matching printks into the nested if() in fput() and the loop in
delayed_fput(), just to see if we do get __fput() done on all the right struct file?
Just "fput: %p", file and "delayed_fput: %p", file would probably be enough.

I'm assuming that I hadn't misparsed what you wrote and that __fput() in nested
if() in fput() was enough to get the thing working. Could you confirm that?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


zohar at linux

Jul 2, 2012, 6:01 AM

Post #21 of 23 (161 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Mon, 2012-07-02 at 13:02 +0100, Al Viro wrote:
> On Mon, Jul 02, 2012 at 07:49:50AM -0400, Mimi Zohar wrote:
>
> > > > > pid=1 uid=0 d_name=init comm=swapper/0 dev="rootfs" mode=100775
> > > > > pid=1 uid=0 d_name=bash comm=swapper/0 dev="rootfs" mode=100755
> > > >
> > > > OK... Here's what I suspect is going on:
> > > > * populating initramfs writes binaries there. We open files (for write) from
> > > > the kernel thread (there's nothing other than kernel threads at that point), write to
> > > > them, then close(). Final fput() gets delayed.
> > > > * Then we proceed to execve(). Which means mapping the binary with MAP_DENYWRITE.
> > > > Which fails, since there's a struct file still opened for write on that sucker.
> > > >
> > > > Your patch did not delay those fput() - they were done without ->mmap_sem held. So
> > > > it survived. Booting without initramfs always survives; booting with initramfs may
> > > > or may not survive, depending on the timings - if that scheduled work manages to
> > > > run by the time we do those execve(), we win. Note that async_synchronize_full()
> > > > done in init_post() might easily affect that, depending on config.
> > > >
> > > > As a quick test, could you try slapping a delay somewhere around the beginning
> > > > of init_post() and see if it rescues the system?
> > >
> > > Ho-hum... How about this (modulo missing documentation of the whole sad mess):
> >
> > Sorry, neither adding the delay or this patch helped.
>
> Really odd. Could you print the error returned by kernel_execve() in run_init_process()?
> At least that way we'll get some indication of what's going on there. Another thing:
> could you slap matching printks into the nested if() in fput() and the loop in
> delayed_fput(), just to see if we do get __fput() done on all the right struct file?
> Just "fput: %p", file and "delayed_fput: %p", file would probably be enough.

ok, it calls delayed_fput(), but never makes it into the delayed_fput()
while statement. I added an additional printk to make sure
delayed_fput() was being called.

delayed_fput:
fput: xxxx
run_init_process: -26
failed to execute /init

run_init_process: -2
run_init_process: -2
run_init_process: -2

delayed_fput:
fput: xxxx
run_init_process: -26

> I'm assuming that I hadn't misparsed what you wrote and that __fput() in nested
> if() in fput() was enough to get the thing working. Could you confirm that?

Yes, everything is exactly as you described.

Mimi

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


viro at ZenIV

Jul 2, 2012, 6:33 AM

Post #22 of 23 (161 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Mon, Jul 02, 2012 at 09:01:31AM -0400, Mimi Zohar wrote:
> ok, it calls delayed_fput(), but never makes it into the delayed_fput()
> while statement.

That's because I'm a blind idiot and managed to miss the idiocy in
delayed_fput() - that
list_splice(&head, &delayed_fput_list);
should've been
list_splice_init(&delayed_fput_list, &head);

Even more embarrassingly, I've actually screwed up the build scripts
and hadn't discovered until now that I'd been testing without initramfs;
as soon as I'd found that, the bug had promptly reproduced itself.

Anyway, I've pushed the fix into vfs.git#master. Should propagate
to git.kernel.org in a few (commit 2ae9632 is the branch head)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


zohar at linux

Jul 2, 2012, 7:50 AM

Post #23 of 23 (162 views)
Permalink
Re: [PATCH 0/4] Was: deferring __fput() [In reply to]

On Mon, 2012-07-02 at 14:33 +0100, Al Viro wrote:

> Anyway, I've pushed the fix into vfs.git#master. Should propagate
> to git.kernel.org in a few (commit 2ae9632 is the branch head)

Thanks! I'm now able to boot with the delayed __fput().

Mimi



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Linux kernel RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.