
Mailing List Archive: Linux: Kernel

[GIT PULL v2] hw-breakpoints: Rewrite on top of perf events

 

 



fweisbec at gmail

Oct 24, 2009, 7:16 AM

Post #1 of 8
[GIT PULL v2] hw-breakpoints: Rewrite on top of perf events

Hi all,

This is the v2 of the hw-breakpoints API rewrite on top of perf events.
You can find the previous version here:
http://lwn.net/Articles/351922/

Changes in v2:

- Follow the perf "event" rename
- The ptrace regression has been fixed (ptrace breakpoint perf events
weren't released when a task ended)
- Drop the struct hw_breakpoint and store generic fields in
perf_event_attr.
- Separate core and arch specific headers, drop
asm-generic/hw_breakpoint.h and create linux/hw_breakpoint.h
- Use new generic len/type for breakpoint
- Handle off case: when breakpoints api is not supported by an arch
- Use proper in-kernel perf api provided by Arjan.

There are still a lot of things that need to be cleaned up, simplified,
and improved (ptrace side, the bp api, etc.). I guess these things can
be done incrementally if you agree.

I've also tried to get an arch-independent api. Generic fields for
breakpoints are stored in the perf_event_attr structure (type, len, addr).
This needs to be discussed and improved before it becomes a perf
userspace ABI. We need to find a structure generic enough to host
the breakpoint parameters, something that fits most archs better
(handling breakpoint ranges in powerpc, etc.).
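
To make the "generic fields, arch-specific encoding" split concrete, here is a minimal sketch of how a generic (type, len, addr) request could be translated into x86 DR7 control bits. The struct bp_generic, enum bp_type and x86_encode_dr7() names are hypothetical and are not the layout used in the patchset; only the DR7 bit positions (enable bits at 2*slot, R/W field at 16+4*slot, LEN field at 18+4*slot) follow the x86 architecture. An arch that supports ranges would need something richer than a single len.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical arch-independent description (not the patchset's layout). */
enum bp_type { BP_EXECUTE, BP_WRITE, BP_READ_WRITE };

struct bp_generic {
    uint64_t addr;      /* address to watch */
    unsigned len;       /* watched length in bytes: 1, 2, 4 or 8 */
    enum bp_type type;  /* access kind that triggers the breakpoint */
};

/*
 * Translate a generic request into x86 DR7 control bits for debug
 * register `slot` (0..3). Returns 0 on success, -1 if x86 cannot
 * express the request.
 */
static int x86_encode_dr7(const struct bp_generic *bp, int slot, uint32_t *dr7)
{
    uint32_t rw, len;

    assert(bp && dr7);
    if (slot < 0 || slot > 3)
        return -1;

    switch (bp->type) {             /* R/W field encoding */
    case BP_EXECUTE:    rw = 0x0; break;
    case BP_WRITE:      rw = 0x1; break;
    case BP_READ_WRITE: rw = 0x3; break;
    default:            return -1;
    }

    switch (bp->len) {              /* LEN field encoding */
    case 1:  len = 0x0; break;
    case 2:  len = 0x1; break;
    case 4:  len = 0x3; break;
    case 8:  len = 0x2; break;      /* 64-bit capable CPUs only */
    default: return -1;
    }

    /* Instruction breakpoints must use length 1 on x86. */
    if (bp->type == BP_EXECUTE && bp->len != 1)
        return -1;
    /* The address must be aligned to the watched length. */
    if (bp->addr & (bp->len - 1))
        return -1;

    *dr7 = (1u << (2 * slot))           /* L<slot>: local enable bit */
         | (rw  << (16 + 4 * slot))     /* R/W<slot> */
         | (len << (18 + 4 * slot));    /* LEN<slot> */
    return 0;
}
```

For example, a 4-byte write watchpoint at an aligned address placed in slot 1 sets L1, R/W1 = 01 and LEN1 = 11; an unaligned address or a length of 3 is rejected before any register would be touched.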

Thanks.

---

The following patchset is available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
perfevents/hw-breakpoint

Arjan van de Ven (1):
      perf/core: Provide a kernel-internal interface to get to performance counters

Frederic Weisbecker (3):
      perf/core: Add a callback to perf events
      hw-breakpoints: Rewrite the hw-breakpoints layer on top of perf events
      hw-breakpoints: Arbitrate access to pmu following registers constraints

Li Zefan (1):
      ksym_tracer: Remove KSYM_SELFTEST_ENTRY

Paul Mundt (1):
      x86/hw-breakpoints: Actually flush thread breakpoints in flush_thread().

 arch/Kconfig                         |    3 +
 arch/x86/include/asm/debugreg.h      |    7 -
 arch/x86/include/asm/hw_breakpoint.h |   58 +++--
 arch/x86/include/asm/processor.h     |   12 +-
 arch/x86/kernel/hw_breakpoint.c      |  376 ++++++++++++++--------
 arch/x86/kernel/process.c            |    9 +-
 arch/x86/kernel/process_32.c         |   26 +--
 arch/x86/kernel/process_64.c         |   26 +--
 arch/x86/kernel/ptrace.c             |  182 +++++++----
 arch/x86/kernel/smpboot.c            |    3 -
 arch/x86/power/cpu.c                 |    6 -
 include/asm-generic/hw_breakpoint.h  |  139 --------
 include/linux/hw_breakpoint.h        |  131 ++++++++
 include/linux/perf_event.h           |   37 ++-
 kernel/exit.c                        |    5 +
 kernel/hw_breakpoint.c               |  595 +++++++++++++++++++++-------------
 kernel/perf_event.c                  |  137 ++++++++-
 kernel/trace/trace.h                 |    1 -
 kernel/trace/trace_entries.h         |    6 +-
 kernel/trace/trace_ksym.c            |  126 ++++----
 kernel/trace/trace_selftest.c        |    2 +-
 21 files changed, 1154 insertions(+), 733 deletions(-)
 delete mode 100644 include/asm-generic/hw_breakpoint.h
 create mode 100644 include/linux/hw_breakpoint.h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


fweisbec at gmail

Oct 24, 2009, 7:19 AM

Post #2 of 8
Re: [GIT PULL v2] hw-breakpoints: Rewrite on top of perf events

On Sat, Oct 24, 2009 at 04:16:52PM +0200, Frederic Weisbecker wrote:
> Hi all,
>
> This is the v2 of the hw-breakpoints API rewrite on top of perf events.
> You can find the previous version here:
> http://lwn.net/Articles/351922/
>
> Changes in v2:
>
> - Follow the perf "event " rename
> - The ptrace regression have been fixed (ptrace breakpoint perf events
> weren't released when a task ended)
> - Drop the struct hw_breakpoint and store generic fields in
> perf_event_attr.
> - Separate core and arch specific headers, drop
> asm-generic/hw_breakpoint.h and create linux/hw_breakpoint.h
> - Use new generic len/type for breakpoint
> - Handle off case: when breakpoints api is not supported by an arch
> - Use proper in-kernel perf api provided by Arjan.
>
> There are still a lot of things that need to be cleaned, simplified,
> improved (ptrace side, the bp api, etc....) I guess these things can
> be done incrementally if you agree.
>
> I've also tried to get an arch-independent api. Generic fields for
> breakpoints are stored in perf_event_attr structure (type, len, addr).
> This needs to be discussed and improved before it becomes a perf
> userspace ABI. We need to find a generic enough structure to host
> the breakpoints parameters, something that can better fit to most arch
> (handling breakpoint ranges in powerpc, etc...).
>
> Thanks.
>
> ---
>
> The following patchset are available in the git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
> perfevents/hw-breakpoint


BTW, this is a branch based on tip:tracing/hw_breakpoint with tip:perf/core
merged inside.



prasad at linux

Oct 26, 2009, 2:31 PM

Post #3 of 8
Re: [GIT PULL v2] hw-breakpoints: Rewrite on top of perf events

On Sat, Oct 24, 2009 at 04:16:52PM +0200, Frederic Weisbecker wrote:
> Hi all,
>
> This is the v2 of the hw-breakpoints API rewrite on top of perf events.
> You can find the previous version here:
> http://lwn.net/Articles/351922/
>
> Changes in v2:
>
> - Follow the perf "event " rename
> - The ptrace regression have been fixed (ptrace breakpoint perf events
> weren't released when a task ended)
> - Drop the struct hw_breakpoint and store generic fields in
> perf_event_attr.
> - Separate core and arch specific headers, drop
> asm-generic/hw_breakpoint.h and create linux/hw_breakpoint.h
> - Use new generic len/type for breakpoint
> - Handle off case: when breakpoints api is not supported by an arch
> - Use proper in-kernel perf api provided by Arjan.
>
> There are still a lot of things that need to be cleaned, simplified,
> improved (ptrace side, the bp api, etc....) I guess these things can
> be done incrementally if you agree.
>
> I've also tried to get an arch-independent api. Generic fields for
> breakpoints are stored in perf_event_attr structure (type, len, addr).
> This needs to be discussed and improved before it becomes a perf
> userspace ABI. We need to find a generic enough structure to host
> the breakpoints parameters, something that can better fit to most arch
> (handling breakpoint ranges in powerpc, etc...).
>

Apart from specific comments about the implementation here, I think
the patchset raises a larger question about the hw-breakpoint layer's
integration with perf-events.

Having seen the proposed changes and explored some of perf_events'
functionality, I'm afraid that integrating hw-breakpoints with perf
doesn't benefit the former as much as originally hoped
(http://lkml.org/lkml/2009/8/26/149).

Some of the prevalent concerns (which have been raised in different
threads earlier) are:

- While kernel-space breakpoints need to reside on every processor
(irrespective of the process in user-space), perf-events' notion of a
counter is always linked to a process context (although there could be
workarounds by making it 'pinned', etc).

- The hw-breakpoint register allocation mechanism is 'greedy', which in
my opinion is better suited to allocating a finite and contended
resource such as debug registers, while that of perf-events can give
rise to roll-backs (with side-effects such as stray exceptions and
race conditions).

- Given that the notion of a per-process context for counters is
well-ingrained into the design of perf-events (even system-wide
counters are sometimes implemented through individual syscalls over
nr_cpus as in builtin-stat.c), it requires huge re-design and
user-space changes.

Stripping the hw-breakpoint layer of its book-keeping/register
allocation features only to replace them with those of perf-events leads
to a poor retrofit. On the other hand, an implementation that enables perf
to use the hw-breakpoint layer (and its APIs) to profile memory accesses
over kernel-space variables (in the context of a process) is very elegant,
modular and fits cleanly within the framework of perf-events as a
new perf-type (refer http://lkml.org/lkml/2009/10/26/467). A working
patchset (under development and containing bugs) is posted for RFC here:
http://lkml.org/lkml/2009/10/26/461

It is my opinion that enhancing perf-layer to profile memory accesses
through hw-breakpoint layer should be preferred over merging them.

Thanks,
K.Prasad



fweisbec at gmail

Oct 29, 2009, 12:07 PM

Post #4 of 8
Re: [GIT PULL v2] hw-breakpoints: Rewrite on top of perf events

2009/10/26 K.Prasad <prasad [at] linux>:
> Outside the specific comments about the implementation here, I think
> the patchset begets a larger question about hw-breakpoint layer's
> integration with perf-events.
>
> Upon being a witness to the proposed changes and after some exploration
> of perf_events' functionality, I'm afraid that hw-breakpoint integration
> with perf doesn't benefit the former as much as originally wished to be
> (http://lkml.org/lkml/2009/8/26/149).
>
> Some of the prevalent concerns (which have been raised in different
> threads earlier) are:
>
> - While kernel-space breakpoints need to reside on every processor
>  (irrespective of the process in user-space), perf-events' notion of a
>  counter is always linked to a process context (although there could be
>  workarounds by making it 'pinned', etc).


No. A counter (let's talk about an event profiling instance now) is not
always attached to a single process.
It is attached to a context. Such contexts are defined by perf as gathering
a group of tasks, or a context can be a whole cpu.

The breakpoint API only supports two kinds of contexts: one task, or every
cpu (or per cpu after your last patchset).

That said, perf events can be enhanced to support the context of a wide counter.


>
> - HW Breakpoints register allocation mechanism is 'greedy', which in my
>  opinion is more suitable for allocating a finite and contended
>  resource such as debug register while that of perf-events can give
>  rise to roll-backs (with side-effects such as stray exceptions and
>  race conditions).


I don't get your point. The only possible rollback is when we allocate
a wide breakpoint (then one per cpu).
If you worry about such races, we can register these breakpoints as
being disabled and enable them once we know the allocation succeeded
for every cpu.


>
> - Given that the notion of a per-process context for counters is
>  well-ingrained into the design of perf-events (even system-wide
>  counters are sometimes implemented through individual syscalls over
>  nr_cpus as in builtin-stat.c), it requires huge re-design and
>  user-space changes.


It doesn't require a huge redesign to support wide perf events.


> Trying to scoop out the hw-breakpoint layer off its book-keeping/register
> allocation features only to replace with that of perf-events leads to a
> poor retrofit. On the other hand, an implementation to enable perf to use
> hw-breakpoint layer (and its APIs) to profile memory accesses over
> kernel-space variables (in the context of a process) is very elegant,
> modular and fits cleanly within the frame-work of the perf-events as a
> new perf-type (refer http://lkml.org/lkml/2009/10/26/467). A working
> patchset (under development and containing bugs) is posted for RFC here:
> http://lkml.org/lkml/2009/10/26/461


The non-perf based api is fine for ptrace, kgdb and ftrace uses.
But it is too limited for perf use.

- It has an ad-hoc context binding (register scheduling) abstraction.
  Perf is able to manage that already: binding to a defined group of
  processes, a cpu, etc.

- It doesn't allow non-pinned events. When a breakpoint is disabled
  (due to a context schedule out), it is only virtually disabled; its
  slot is not freed.

Basically, breakpoints are performance monitoring and debug events,
something that perf can already handle.

The current breakpoint API does all that in an ad-hoc way (debug
register scheduling when a cpu goes up/down, when we context switch,
etc.). It is also not powerful enough to support non-pinned events.

The only downside I can see in perf events: it does not support wide
system contexts. I don't think it requires a huge redesign. But instead
of continuing this ad-hoc context-handling to cover this hole in perf,
why not enhance perf so that it can cover that?


prasad at linux

Nov 1, 2009, 10:25 PM

Post #5 of 8
Re: [GIT PULL v2] hw-breakpoints: Rewrite on top of perf events

On Thu, Oct 29, 2009 at 08:07:15PM +0100, Frederic Weisbecker wrote:
> 2009/10/26 K.Prasad <prasad [at] linux>:
> > Outside the specific comments about the implementation here, I think
> > the patchset begets a larger question about hw-breakpoint layer's
> > integration with perf-events.
> >
> > Upon being a witness to the proposed changes and after some exploration
> > of perf_events' functionality, I'm afraid that hw-breakpoint integration
> > with perf doesn't benefit the former as much as originally wished to be
> > (http://lkml.org/lkml/2009/8/26/149).
> >
> > Some of the prevalent concerns (which have been raised in different
> > threads earlier) are:
> >
> > - While kernel-space breakpoints need to reside on every processor
> >  (irrespective of the process in user-space), perf-events' notion of a
> >  counter is always linked to a process context (although there could be
> >  workarounds by making it 'pinned', etc).
>
>
> No. A counter (let's talk about an event profiling instance now) is not
> always attached to a single process.
> It is attached to a context. Such contexts are defined by perf as gathering
> a group of tasks or it can be a whole cpu.
>

Okay.

> The breakpoint API only supports two kind of contexts: one task, or every
> cpus (or per cpu after your last patchset).
>

Yes, and please see the replies to your concerns below.

> That said, perf events can be enhanced to support the context of a wide counter.
>
>
> >
> > - HW Breakpoints register allocation mechanism is 'greedy', which in my
> >  opinion is more suitable for allocating a finite and contended
> >  resource such as debug register while that of perf-events can give
> >  rise to roll-backs (with side-effects such as stray exceptions and
> >  race conditions).
>
>
> I don't get your point. The only possible rollback is when we allocate
> a wide breakpoint (then one per cpu).
> If you worry about such races, we can register these breakpoints as
> being disabled
> and enable them once we know the allocation succeeded for every cpu.
>
>

Not just stray exceptions, as explained before here:
http://lkml.org/lkml/2009/10/1/76
- Races between the requests (also leading to temporary failure of
all CPU requests) presenting an unclear picture about free debug
registers (making it difficult to predict the need for a retry).

> >
> > - Given that the notion of a per-process context for counters is
> >  well-ingrained into the design of perf-events (even system-wide
> >  counters are sometimes implemented through individual syscalls over
> >  nr_cpus as in builtin-stat.c), it requires huge re-design and
> >  user-space changes.
>
>
> It doesn't require a huge redesign to support wide perf events.
>
>

I contest that :-)...and the sheer amount of code movement, re-design
(including core data structures) in the patchset here:
http://lkml.org/lkml/2009/10/24/53.
And all this with a loss of a well-layered, modular code and a
loss of true system-wide support for bkpt counters!

> > Trying to scoop out the hw-breakpoint layer off its book-keeping/register
> > allocation features only to replace with that of perf-events leads to a
> > poor retrofit. On the other hand, an implementation to enable perf to use
> > hw-breakpoint layer (and its APIs) to profile memory accesses over
> > kernel-space variables (in the context of a process) is very elegant,
> > modular and fits cleanly within the frame-work of the perf-events as a
> > new perf-type (refer http://lkml.org/lkml/2009/10/26/467). A working
> > patchset (under development and containing bugs) is posted for RFC here:
> > http://lkml.org/lkml/2009/10/26/461
>
>
> The non-perf based api is fine for ptrace, kgdb and ftrace uses.
> But it is too limited for perf use.
>
> - It has an ad-hoc context binding (register scheduling) abstraction.
> Perf is able to manage
> that already: binding to defined group of processes, cpu, etc...
>

I don't see what's ad-hoc in the scheduling behaviour of the hw-bkpt
layer. The hw-breakpoint layer does the following with respect to register
scheduling:

- User-space breakpoints are always tied to a thread
  (thread_info/task_struct) and are hence active when the corresponding
  thread is scheduled.

- Kernel-space addresses (requests from in-kernel sources) should be
always active and aren't affected by process context-switches/schedule
operations. Some of the sophisticated mechanisms for scheduling
kernel vs user-space breakpoints (such as trapping syscalls to restore
register context) were pre-empted by the community (as seen here:
http://lkml.org/lkml/2009/3/11/145).

Any further abstraction required by the end-user (as in the case of
perf) can be well-implemented through the powerful breakpoint
interfaces. One instance is perf-events with its unique requirement
that a kernel-space breakpoint needs to be active only when a given
process is active. The hardware breakpoint layer handles this quite well,
as seen here: http://lkml.org/lkml/2009/10/29/300.

> - It doesn't allow non-pinned events, when a breakpoint is disabled
> (due to context schedule out), it is
> only virtually disabled, it's slot is not freed.
>

The <enable><disable>_hw_breakpoint() interfaces are designed that way.
If a user wants the slot to be freed (which is ill-advised for a
requirement here), it can invoke (un)register_kernel_hw_breakpoint()
instead (this would have very little overhead for the 1-CPU case without
IPIs).

> Basically, the breakpoints are performance monitoring and debug
> events. Something
> that perf can already handle.
>
> The current breakpoint API does all that in an ad-hoc way
> (debug register scheduling when cpu get up/down, when we context
> switch, etc...).
> It is also not powerful enough to support non-pinned events.
>
> The only downside I can see in perf events: it does not support wide
> system contexts.
> I don't think it requires a huge redesign. But instead of continuing
> this ad-hoc context-handling
> to cover this hole in perf, why not enhance perf so that it can cover that?

The advantages of having perf-events use the hw-breakpoint layer are
explained here and in many of my previous emails. It entails no loss of
functionality for either perf-events or hw-breakpoints, while allowing
users to harness the power of both.

Thanks,
K.Prasad



fweisbec at gmail

Nov 2, 2009, 6:07 AM

Post #6 of 8
Re: [GIT PULL v2] hw-breakpoints: Rewrite on top of perf events

On Mon, Nov 02, 2009 at 11:55:50AM +0530, K.Prasad wrote:
> > I don't get your point. The only possible rollback is when we allocate
> > a wide breakpoint (then one per cpu).
> > If you worry about such races, we can register these breakpoints as
> > being disabled
> > and enable them once we know the allocation succeeded for every cpu.
> >
> >
>
> Not just stray exceptions, as explained before here:
> http://lkml.org/lkml/2009/10/1/76
> - Races between the requests (also leading to temporary failure of
> all CPU requests) presenting an unclear picture about free debug
> registers (making it difficult to predict the need for a retry).



Ok. But say we have to set a wide breakpoint.
We can create a disabled set of per cpu breakpoints and then enable
them once we are sure every cpu can host one (we have already reserved a slot
for each of these while registering).

Then this race is not there anymore.
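
As a rough illustration of the reserve-then-enable idea above, here is a small user-space model. Everything in it (struct cpu_dbg, wide_bp_register(), the fixed NR_CPUS and NR_SLOTS values) is hypothetical and only models the bookkeeping; it is not code from the patchset. Phase 1 reserves a slot on every cpu without arming anything, so a failure part-way through can be rolled back without a breakpoint ever having fired.

```c
#include <assert.h>
#include <errno.h>

#define NR_CPUS  4
#define NR_SLOTS 4   /* x86 has four debug address registers per cpu */

struct cpu_dbg {
    int reserved;    /* slots handed out on this cpu */
    int armed;       /* slots whose breakpoint is actually enabled */
};

/* Register one wide (system-wide) breakpoint: one slot on every cpu. */
static int wide_bp_register(struct cpu_dbg cpus[NR_CPUS])
{
    int cpu;

    /* Phase 1: reserve a slot everywhere, leaving the breakpoint disabled. */
    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpus[cpu].reserved == NR_SLOTS)
            goto rollback;
        cpus[cpu].reserved++;
    }

    /* Phase 2: every cpu holds a reservation, so arming cannot fail. */
    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        cpus[cpu].armed++;
        assert(cpus[cpu].armed <= cpus[cpu].reserved);
    }
    return 0;

rollback:
    /* Undo only the reservations made so far; nothing was armed. */
    while (--cpu >= 0)
        cpus[cpu].reserved--;
    return -ENOSPC;
}
```

If any cpu is already full, the call returns -ENOSPC and the armed counts are untouched on every cpu, which is the "no stray exceptions" property. The model says nothing about what another requester observes while the reservations are temporarily held.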


> > >
> > > - Given that the notion of a per-process context for counters is
> > >  well-ingrained into the design of perf-events (even system-wide
> > >  counters are sometimes implemented through individual syscalls over
> > >  nr_cpus as in builtin-stat.c), it requires huge re-design and
> > >  user-space changes.
> >
> >
> > It doesn't require a huge redesign to support wide perf events.
> >
> >
>
> I contest that :-)...and the sheer amount of code movement, re-design
> (including core data structures) in the patchset here:
> http://lkml.org/lkml/2009/10/24/53.



This is about rebasing the hw-breakpoints on top of another profiling
infrastructure.

So the fact we had to do a lot of changes looks fair.


> And all this with a loss of a well-layered, modular code and a
> loss of true system-wide support for bkpt counters!


I don't get your point about the loss of well-layered and modular
code.

We are reusing an existing profiling infrastructure that looks pretty well
layered and modular to me:

The scheduling of per task registers is centralized in the core and not in the
arch like it was done with the hardware breakpoint api. The remaining arch bits
are only concerned with writing these registers.
It is not sane to hook on arch switch_to(), cpu hotplug helpers, kexec, etc.
to do an ad-hoc scheduling of perf task registers when we already have
a centralized profiling area that can take these decisions.

So, yes indeed it doesn't support the wide contexts yet.

Let's compare that to the tracing area. What if we didn't have tracepoints
and every tracer put its own tracing callbacks in the area it wants to trace?
We would have a proliferation of ad-hoc tracing function calls.
But we have the tracepoints: a centralized feature that requires just one
callback somewhere in the kernel where we want to hook up, and to which
every tracer can subscribe.
That's more modular and well-layered.

The problem with the lack of wide context support in perf events is pretty
much the same.
The hw-breakpoint api can implement its own ad-hoc one. But it means every
other profiling/debug feature will lack it and need to implement its
own.

So why not improve a centralized profiling subsystem instead of implementing
an ad-hoc one for every profiling class that needs it?



> > The non-perf based api is fine for ptrace, kgdb and ftrace uses.
> > But it is too limited for perf use.
> >
> > - It has an ad-hoc context binding (register scheduling) abstraction.
> > Perf is able to manage
> > that already: binding to defined group of processes, cpu, etc...
> >
>
> I don't see what's ad-hoc in the scheduling behaviour of the hw-bkpt
> layer. Hw-breakpoint layer does the following with respect to register
> scheduling:
>
> - User-space breakpoints are always tied to a thread
> (thread_info/task_struct) and are hence
> active when the corresponding thread is scheduled.



This is what is ad-hoc. You need to hook on switch_to, cpu hotplug
and kexec to update the breakpoint registers. And this is something
that would need to be done in every arch. That looks insane considering
the fact we have a core layer that can handle these decisions already.



> - Kernel-space addresses (requests from in-kernel sources) should be
> always active and aren't affected by process context-switches/schedule
> operations. Some of the sophisticated mechanisms for scheduling
> kernel vs user-space breakpoints (such as trapping syscalls to restore
> register context) were pre-empted by the community (as seen here:
> http://lkml.org/lkml/2009/3/11/145).



Sure. And things have evolved since then. We have a centralized
profiling/event subsystem now.



> Any further abstraction required by the end-user (as in the case of
> perf) can be well-implemented through the powerful breakpoint
> interfaces. For instance - perf-events with its unique requirement
> wherein a kernel-space breakpoint need to be active only when a given
> process is active. Hardware breakpoint layer handles them quite well
> as seen here: http://lkml.org/lkml/2009/10/29/300.




It logically disables/enables the breakpoints but not physically.
Which means a disabled breakpoint still keeps its slot, making
it unavailable for another event; freeing it is required for non-pinned
events.




> > - It doesn't allow non-pinned events, when a breakpoint is disabled
> > (due to context schedule out), it is
> > only virtually disabled, it's slot is not freed.
> >
>
> The <enable><disable>_hw_breakpoint() are designed such. If a user want
> the slot to be freed (which is ill-advised for a requirement here) it
> can invoke (un)register_kernel_hw_breakpoint() instead (would have very
> little overhead for the 1-CPU case without IPIs).


This adds unnecessary overhead. All we want is to update arch registers
when we schedule in/out an event.

We need to be able to free a slot for a non-pinned counter because
an undefined number of events must be able to time share
a single slot (in the worst case).

Calling unregister_kernel_hw_breakpoint() each time we schedule out
a non-pinned counter adds unnecessary overhead. And calling
register_kernel_hw_breakpoint() while enabling one is yet
more unnecessary overhead.

You'd need to check the free slots constraints every time for them.
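
A toy model of what time-sharing a single slot between non-pinned events means: scheduling an event in or out only rewrites the slot, with no allocation or constraint check on that path. The types and functions below (struct bp_event, struct bp_slot, sched_in(), sched_out(), rotate()) are hypothetical, and the plain round-robin order is an assumption made for the sketch, not a description of perf's actual policy.

```c
#include <assert.h>
#include <stddef.h>

struct bp_event {
    unsigned long addr;   /* address the event watches */
    unsigned long runs;   /* how many times it has been scheduled in */
};

struct bp_slot {
    const struct bp_event *owner;   /* event currently programmed, or NULL */
};

/* Scheduling out frees the slot physically, not just logically. */
static void sched_out(struct bp_slot *slot)
{
    slot->owner = NULL;
}

/* Scheduling in only programs the (free) slot; no allocation check. */
static void sched_in(struct bp_slot *slot, struct bp_event *ev)
{
    assert(slot->owner == NULL);
    slot->owner = ev;
    ev->runs++;
}

/* Rotate n events through one slot for the given number of ticks. */
static void rotate(struct bp_slot *slot, struct bp_event *evs,
                   size_t n, size_t ticks)
{
    assert(n > 0);
    for (size_t t = 0; t < ticks; t++) {
        sched_out(slot);
        sched_in(slot, &evs[t % n]);
    }
}
```

With three events and one slot, six ticks give each event two turns. The point of the model is that the slot was reserved once for the whole group, so each switch is a register write rather than a register/unregister pair.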



prasad at linux

Nov 4, 2009, 6:14 AM

Post #7 of 8
Re: [GIT PULL v2] hw-breakpoints: Rewrite on top of perf events

On Mon, Nov 02, 2009 at 03:07:14PM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 02, 2009 at 11:55:50AM +0530, K.Prasad wrote:
> > > I don't get your point. The only possible rollback is when we allocate
> > > a wide breakpoint (then one per cpu).
> > > If you worry about such races, we can register these breakpoints as
> > > being disabled
> > > and enable them once we know the allocation succeeded for every cpu.
> > >
> > >
> >
> > Not just stray exceptions, as explained before here:
> > http://lkml.org/lkml/2009/10/1/76
> > - Races between the requests (also leading to temporary failure of
> > all CPU requests) presenting an unclear picture about free debug
> > registers (making it difficult to predict the need for a retry).
>
>
>
> Ok. But say we have to set a wide breakpoint.
> We can create a disabled set of per cpu breakpoint and then enable
> them once we are sure every cpus can host it (we have already reserved a slot
> for each of these while registering).
>
> Then this race is not there anymore.
>

Let me explain further with an illustration. Assume two contenders for
breakpoints, say A and B, on a machine with 4 CPUs (0-3). Due to several
prior per-CPU requests, CPU0 has one free debug register while CPU3 has
none. 'A' asks for a system-wide breakpoint while 'B' requests a
per-cpu breakpoint on CPU0. Now think of a race between them! If 'A'
begins first, it starts with CPU0 and proceeds by consuming debug
registers on CPUs in ascending order; meanwhile 'B' would return
thinking that its request cannot be serviced. However, 'A' would
eventually fail (since there are no free debug registers on CPU3) and
relinquish the debug register on CPU0 as well (during rollback), but 'B'
has already returned thinking that there are no free debug registers
available. (Registering breakpoint requests in a disabled state only
helps prevent stray exceptions.)

With some book-keeping such races are prevented. Registration requests
would fail with -ENOSPC only if there are genuine users of debug
registers and not because some potential user (who might eventually
rollback) is holding onto the register temporarily.
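
The A/B scenario above can be replayed in a small user-space model. This is only a sketch: cpu_reserve(), race_with_rollback() and race_with_bookkeeping() are hypothetical names, the interleaving is fixed by hand rather than produced by real concurrency, and "book-keeping" is modelled as an up-front check of all CPUs that is assumed to run atomically with respect to other requests.

```c
#include <assert.h>
#include <errno.h>

#define NR_CPUS  4
#define NR_SLOTS 4   /* debug registers per CPU on x86 */

/* Take one debug register on `cpu`, or fail with -ENOSPC. */
static int cpu_reserve(int used[NR_CPUS], int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    if (used[cpu] == NR_SLOTS)
        return -ENOSPC;
    used[cpu]++;
    return 0;
}

/*
 * Rollback scheme: 'A' grabs CPUs in ascending order and undoes its
 * work on failure. 'B' asks for CPU0 right after 'A' has taken its
 * first register. Returns what 'B' was told.
 */
static int race_with_rollback(int used[NR_CPUS])
{
    int cpu, b_result = 0, b_done = 0;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpu_reserve(used, cpu) < 0) {
            while (--cpu >= 0)
                used[cpu]--;        /* 'A' rolls back */
            break;
        }
        if (!b_done) {              /* 'B' arrives in the middle of 'A' */
            b_result = cpu_reserve(used, 0);
            b_done = 1;
        }
    }
    if (!b_done)                    /* 'A' failed before 'B' got a turn */
        b_result = cpu_reserve(used, 0);
    return b_result;
}

/*
 * Book-keeping scheme: 'A' consumes registers only after checking that
 * every CPU can host it, so 'B' never sees a temporary reservation.
 */
static int race_with_bookkeeping(int used[NR_CPUS])
{
    int cpu, fits = 1;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (used[cpu] == NR_SLOTS)
            fits = 0;
    if (fits)
        for (cpu = 0; cpu < NR_CPUS; cpu++)
            used[cpu]++;

    return cpu_reserve(used, 0);    /* 'B' runs after 'A' is decided */
}
```

Starting from usage {3, 0, 0, 4}, the rollback variant tells 'B' -ENOSPC even though CPU0 ends up with a free register again, while the book-keeping variant lets 'B' succeed because 'A' never touched CPU0.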

Such a limitation may not be an immediate worry, but it shouldn't be
a reason to throw away a feature that pre-empts such concerns. Are there
any hw-registers managed by perf-events that come in multiples (like the
debug registers, of which x86 has 4) and have such peculiar needs?

> > > >
> > > > - Given that the notion of a per-process context for counters is
> > > >  well-ingrained into the design of perf-events (even system-wide
> > > >  counters are sometimes implemented through individual syscalls over
> > > >  nr_cpus as in builtin-stat.c), it requires huge re-design and
> > > >  user-space changes.
> > >
> > >
> > > It doesn't require a huge redesign to support wide perf events.
> > >
> > >
> >
> > I contest that :-)...and the sheer amount of code movement, re-design
> > (including core data structures) in the patchset here:
> > http://lkml.org/lkml/2009/10/24/53.
>
>
>
> This is about rebasing the hw-breakpoints on top of another profiling
> infrastructure.
>
> So the fact we had to do a lot of changes looks fair.
>
>
> > And all this with a loss of a well-layered, modular code and a
> > loss of true system-wide support for bkpt counters!
>
>
> I don't get your point about the loss of a well-layered and modular
> code.
>
> We are reusing an existing profiling infrastructure that looks pretty well
> layered and modular to me:
>
> The scheduling of per task registers is centralized in the core and not in the
> arch like it was done with the hardware breakpoint api. The remaining arch bits
> are only a concern of writing these registers.
> It is not sane to hook on arch switch_to(), cpu hotplug helpers, kexec, etc...
> to do an ad-hoc scheduling of perf task register whereas we already have
> a centralized profiling area that can take these decisions.
>
> So, yes indeed it doesn't support the wide contexts yet.
>

User-space scheduling of breakpoint register is the closest that
comes to what perf-events already does...and that's just about a
single-hook in __switch_to(). Have no delusions about huge
duplication! and no concerns about arch-specific code being littered all
over - writing onto debug registers is a processor-specific activity and
hence the arch-specific invocation.

System-wide breakpoints, cpu hotplug helpers, kexec hooks as you
mentioned have not been implemented for perf-events....and in a way it
is of little help there other than for hw-breakpoints (are there any
hw-registers managed by perf that have residual effect i.e. continue to
generate exceptions like hw-breakpoints?)

> Let's compare that to the tracing area. What if we hadn't the tracepoints
> and every tracers put their own tracing callbacks in the area they want to trace.
> We would have a proliferation of ad-hoc tracing functions calls.
> But we have the tracepoints: a centralized feature that only requires just one
> callback somewhere in the kernel where we want to hook up and in which
> every tracers can subscribe.
> That's more modular and well-layered.
>

Comparing this to tracepoints isn't apt here. My limited knowledge
doesn't quickly provide me with an alternate analogy (how about kprobes
+ perf-events integration?...no, even that isn't close).

> The problem with the lack of a wide context support with perf events is pretty
> the same.
> The hw-breakpoint api can implement its ad-hoc one. But it means every
> other profiling/debug features will lack it and need to implement their
> own.
>

Can you cite the need for such features in general perf-events
architecture with examples other than hw-breakpoints? In my opinion, if there
had been a need, perf-events would have included them already (perf top
is the only need that comes close to wanting system-wide support but
even there, it is happy by making one syscall per-cpu).

Integrating the features required by hw-breakpoints with perf-events
is (with apologies for the out-of-context examples) like mixing oil with
water, the proverbial chalk and cheese....they stay together but are
immiscible.

Citing some examples from your patchset, look at the addition of
'callback' function pointer or the addition of length, type fields in
perf_event_attr. Do you find generic use-cases for them in perf-events
(outside hw-breakpoints)? Merging structures to create a generic one,
but only to be used for a specific use-case (hw-breakpoint) doesn't
sound like a good idea, and speculating on future use-cases (not current
ones) has never been welcomed.

> So why not improving a centralized profiling subsystem instead of implementing
> an ad-hoc one for every profiling classes that need it?
>
>
>
> > > The non-perf based api is fine for ptrace, kgdb and ftrace uses.
> > > But it is too limited for perf use.
> > >
> > > - It has an ad-hoc context binding (register scheduling) abstraction.
> > > Perf is able to manage
> > > that already: binding to defined group of processes, cpu, etc...
> > >
> >
> > I don't see what's ad-hoc in the scheduling behaviour of the hw-bkpt
> > layer. Hw-breakpoint layer does the following with respect to register
> > scheduling:
> >
> > - User-space breakpoints are always tied to a thread
> > (thread_info/task_struct) and are hence
> > active when the corresponding thread is scheduled.
>
>
>
> This is what is ad-hoc. You need to hook on switch_to, cpu hotplug
> and kexec to update the breakpoints registers. And this is something
> that would need to be done in every archs. That looks insane considering
> the fact we have a core layer that can handle these decisions already.
>
>

As explained above.

>
> > - Kernel-space addresses (requests from in-kernel sources) should be
> > always active and aren't affected by process context-switches/schedule
> > operations. Some of the sophisticated mechanisms for scheduling
> > kernel vs user-space breakpoints (such as trapping syscalls to restore
> > register context) were pre-empted by the community (as seen here:
> > http://lkml.org/lkml/2009/3/11/145).
>
>
> Sure. And things have evolved since then. We have a centralized
> profiling/event subsystem now.
>
>
> > Any further abstraction required by the end-user (as in the case of
> > perf) can be well-implemented through the powerful breakpoint
> > interfaces. For instance - perf-events with its unique requirement
> > wherein a kernel-space breakpoint need to be active only when a given
> > process is active. Hardware breakpoint layer handles them quite well
> > as seen here: http://lkml.org/lkml/2009/10/29/300.
>
>
> It logically disables/enables the breakpoints but not physically.
> Which means a disabled breakpoint still keeps its slot, making
> it unavailable for another event; freeing it is required for
> non-pinned events.
>
> > > - It doesn't allow non-pinned events, when a breakpoint is disabled
> > > (due to context schedule out), it is
> > > only virtually disabled, its slot is not freed.
> > >
> >
> > The <enable><disable>_hw_breakpoint() are designed such. If a user want
> > the slot to be freed (which is ill-advised for a requirement here) it
> > can invoke (un)register_kernel_hw_breakpoint() instead (would have very
> > little overhead for the 1-CPU case without IPIs).
>
>
> This adds unnecessary overhead. All we want is to update arch registers
> when we schedule in/out an event.
>
> We need to be able to free a slot for non-pinned counter because
> an undefined number of events must be able to be time shared
> in a single slot (in the worst case).
>
> Calling unregister_kernel_breakpoint() each time we schedule out
> a non-pinned counter adds unnecessary overhead. And calling
> register_kernel_breakpoint() while enabling one is yet
> another unnecessary overhead.
>
> You'd need to check the free slots constraints every time for them.
>

As I told you, the register/unregister combination is to be used, and I can
get you some numbers about its overhead if that is of concern to you.
It is the IPI that constitutes a large overhead (not needed for a 1-cpu request),
and not the book-keeping work. A more elegant way would be to use the
modify_kernel_hw_breakpoint() (interface submitted in one of my previous
patches) to simply change the breakpoint address/type/len.

I don't see anything that is required by the perf-event that the
hw-breakpoint layer doesn't provide.

I shall re-state these views in a response to Ingo's mail (that talks about
concerns of duplicate code)...meanwhile I think I should begin
reviewing your patchset (for perf-integration) in case the community
insists on that approach.

Thanks,
K.Prasad

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo [at] vger
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


fweisbec at gmail

Nov 5, 2009, 3:02 AM

Post #8 of 8 (226 views)
Permalink
Re: [GIT PULL v2] hw-breakpoints: Rewrite on top of perf events [In reply to]

On Wed, Nov 04, 2009 at 07:44:25PM +0530, K.Prasad wrote:
> Let me explain further with an illustration..assume two contenders for
> breakpoints say A and B on a machine with 4 CPUs (0-3). Due to several
> prior per-CPU requests CPU0 has one free debug register while CPU3 has
> none. 'A' asks for a system-wide breakpoint while 'B' requests for a
> per-cpu breakpoint on CPU0. Now think of a race between them! If 'A'
> begins first, it starts with CPU0 and proceeds by consuming debug
> registers on CPUs in ascending order, meanwhile 'B' would return
> thinking that its request cannot be serviced. However 'A' would
> eventually fail (since there are no free debug registers on CPU3) and
> relinquish the debug register on CPU0 also (during rollback), but 'B'
> has returned thinking that there are no free debug registers available.
> (registering breakpoint requests in disabled state only helps prevent stray
> exceptions).
>
> With some book-keeping such races are prevented. Registration requests
> would fail with -ENOSPC only if there are genuine users of debug
> registers and not because some potential user (who might eventually
> rollback) is holding onto the register temporarily.


True. But such a race would only happen when perf and the ksym_tracer
or kgdb are launched concurrently.

I can't see such a situation happening easily.
But if that's really a worry, we can lock the register_breakpoint_*
path (or implement wide perf events).
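
The locking option can be sketched in user space as below. This is only a toy model of the slot accounting, assuming 4 CPUs with 4 debug registers each as in the illustration above; reserve_cpu(), reserve_all() and bp_lock are hypothetical names, not functions from the patchset. With one lock around the whole path, a system-wide request checks every CPU before taking anything, so a per-cpu request never sees a register that is only held temporarily.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

#define NR_CPUS   4   /* CPUs 0-3, as in the illustration */
#define NR_SLOTS  4   /* debug registers per CPU on x86 */

/* Number of debug registers currently in use on each CPU. */
static int used[NR_CPUS];
static pthread_mutex_t bp_lock = PTHREAD_MUTEX_INITIALIZER;

/* Reserve one slot on a single CPU; -ENOSPC if that CPU is full. */
int reserve_cpu(int cpu)
{
	int ret = 0;

	pthread_mutex_lock(&bp_lock);
	if (used[cpu] < NR_SLOTS)
		used[cpu]++;
	else
		ret = -ENOSPC;
	pthread_mutex_unlock(&bp_lock);
	return ret;
}

/* Reserve one slot on every CPU, or none at all. */
int reserve_all(void)
{
	int cpu, ret = 0;

	pthread_mutex_lock(&bp_lock);
	/* Check first: nothing is taken unless every CPU has room. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (used[cpu] >= NR_SLOTS) {
			ret = -ENOSPC;
			break;
		}
	}
	/* Commit only when the check passed, so no rollback is needed. */
	if (!ret)
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			used[cpu]++;
	pthread_mutex_unlock(&bp_lock);
	return ret;
}
```

In the scenario above (CPU0 has one free register, CPU3 has none), reserve_all() fails without touching CPU0, and a later reserve_cpu(0) still succeeds.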


> Such a limitation may not be of immediate worry, but shouldn't be
> a reason to throw away a feature that pre-empts such concerns.



When you migrate a feature on top of another subsystem, it happens
that you lose something. If the migration brings nice new features,
then sometimes it's worth migrating and adding some code to make up
for the loss.

But if the loss is about such a tiny, tight race that is unlikely
to happen in the real world (and doesn't imply a crash or something
bad like that), then it's not necessarily worth the effort.


> Are there
> any hw-registers managed by perf-events that are more than one (like
> debug registers which are 4 in x86) with such peculiar needs?


Perhaps. I don't know. If so, we may want to migrate the plural pmu
constraints from hw_breakpoint.c to perf_event.c.

That said this is likely to happen because we don't need registers for that.
Any profiling/tracing/debugging unit based on multiple sources and bound
to tunable contexts may fit into that scheme.
Hardware registers are just a subset of what can be considered as
a source of "performance monitoring unit".

And perf event has been extended enough to expand the possible
sources to "Event monitoring unit", so the possibilities are broad.



> User-space scheduling of breakpoint register is the closest that
> comes to what perf-events already does...and that's just about a
> single-hook in __switch_to(). Have no delusions about huge
> duplication! and no concerns about arch-specific code being littered all
> over - writing onto debug registers is a processor-specific activity and
> hence the arch-specific invocation.



That's because it's so close to perf event context scheduling that
we want to unify it.

This is not only a single hook in __switch_to():

- This is a context bound resource scheduling decision made from arch
in spite of the existing optimized mechanisms implemented in a core
profiling subsystem.

- This is a single hook in __switch_to() in x86. Now multiply that
by $(ls arch/ | wc -l) = 26 (-1 because of arch/Kconfig)
Also, expand that to cpu hotplug hooks, kexec hooks, etc...
And also include the free_thread hooks.

- If we have a ptrace breakpoint, the scheduling decision is made
by the hw_breakpoint API. Otherwise it's made by perf.
Why should we maintain two versions of the scheduling decision?

This is a matter of maintainability.


> System-wide breakpoints, cpu hotplug helpers, kexec hooks as you
> mentioned have not been implemented for perf-events....and in a way it
> is of little help there other than for hw-breakpoints (are there any
> hw-registers managed by perf that have residual effect i.e. continue to
> generate exceptions like hw-breakpoints?)



System wide events are not supported by perf because of a
design decision made for scalability.

Each event is bound to a private buffer (inherited by task
children in the case of task bound counters).
If we have an event that can trigger from every cpu, we can
have multiple concurrent writes in the same buffer and the
profiling would suffer from such contention if we have a
lot of events from several cpus.

Having cpu bound events drops this contention as the events don't
fight against other cpus.

(In the case of task bound events, the contention is there, but
in the window of a task group only).

We manage wide profiling using a collection of per cpu events,
which scales much better.

We could also implement the wide context if that becomes wanted but
it should be used by knowing that it won't scale for high frequency
events in SMP.
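
As a rough illustration of the per cpu argument, here is a toy user-space model. It is not perf's actual ring buffer; struct cpu_buf, record() and total_events() are made-up names. Each cpu writes only to its own buffer, so writers never contend, and the "wide" view is built on the reader side by walking the per cpu buffers.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS  4
#define BUF_LEN  8

/* One private buffer per cpu: only the owning cpu ever writes to it. */
struct cpu_buf {
	unsigned long samples[BUF_LEN];
	size_t head;	/* total number of samples recorded on this cpu */
};

static struct cpu_buf bufs[NR_CPUS];

/* Called on `cpu` when the event fires; touches that cpu's buffer only. */
void record(int cpu, unsigned long sample)
{
	struct cpu_buf *b = &bufs[cpu];

	/* Overwrite the oldest entry once the buffer has wrapped. */
	b->samples[b->head % BUF_LEN] = sample;
	b->head++;
}

/* Reader side: the system-wide count is the sum over all cpus. */
size_t total_events(void)
{
	size_t total = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		total += bufs[cpu].head;
	return total;
}
```

A single shared buffer would need every record() call to serialize on the same head index, which is the contention described above.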

Concerning cpu hotplug helpers, it is implemented by perf events
(perf_event_{exit/init}_cpu() notifiers).

Kexec is a corner case, but we might want to add a kexec callback
to the pmu structure if needed.

Concerning residual effects of lazy breakpoint register switching,
I don't see how the migration to perf brings any problem.


> Comparing this to tracepoints isn't apt here. My limited knowledge
> doesn't quickly provide me with an alternate analogy (how about kprobes
> + perf-events integration?...no, even that isn't close).



Why isn't the tracepoint analogy apt?

It's basically the same.

Say you have func1(), a function in which several subsystems want
to hook:


	void func1(void)
	{
		/* do something cool */
		subsys1_hook();
		subsys2_hook();
		subsys3_hook();
		/* etc... */
	}

Instead of that, we can use tracepoint, so that we can zap all
these hooks and only provide one that will dispatch to the subsys:

	void func1(void)
	{
		trace_func1();	/* dispatches to subsys1_hook(), subsys2_hook(), etc. */
	}

The current situation with hw breakpoints and perf is somewhat the same.
Hw breakpoint is hooking into arch __switch_to(), cpu hotplug, etc...

But perf does too. And perf can act as a dispatcher there. It already
hooks on the scheduler events and cpu hotplug and can take
centralized decisions from these hooks, such as toggling pmus,
and hardware breakpoints can fit there.
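
The dispatcher idea can be sketched in plain C as follows. This is a simplified stand-in, not the kernel's tracepoint implementation; struct hook_point, hook_register() and hook_fire() are hypothetical names. func1() carries a single call site, and any number of subsystems subscribe to it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROBES 8

typedef void (*probe_fn)(void *data);

/* One hook point: a list of subscribers behind a single call site. */
struct hook_point {
	probe_fn probes[MAX_PROBES];
	size_t nr;
};

/* Subscribe a callback; returns -1 when the table is full. */
int hook_register(struct hook_point *hp, probe_fn fn)
{
	if (hp->nr >= MAX_PROBES)
		return -1;
	hp->probes[hp->nr++] = fn;
	return 0;
}

/* Dispatch to every subscriber, in registration order. */
void hook_fire(struct hook_point *hp, void *data)
{
	for (size_t i = 0; i < hp->nr; i++)
		hp->probes[i](data);
}

static struct hook_point func1_hook;

/* Example subscribers: each one just counts its invocation. */
void subsys1_hook(void *data) { (*(int *)data)++; }
void subsys2_hook(void *data) { (*(int *)data)++; }

void func1(void *data)
{
	/* do something cool */
	hook_fire(&func1_hook, data);	/* the only hook in func1() */
}
```

After registering subsys1_hook() and subsys2_hook() on func1_hook, one call to func1() reaches both of them without func1() knowing either by name.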


> Can you cite the need for such features in general perf-events
> architecture with examples other than hw-breakpoints? In my opinion, if there
> had been a need, perf-events would have included them already (perf top
> is the only need that comes close to wanting system-wide support but
> even there, it is happy by making one syscall per-cpu).



It is happy doing so because it scales.



> Integrating the features required by hw-breakpoints with perf-events
> is (with apologies for the out-of-context examples) like mixing oil with
> water, the proverbial chalk and cheese....they stay together but are
> immiscible.
>
> Citing some examples from your patchset, look at the addition of
> 'callback' function pointer or the addition of length, type fields in
> perf_event_attr. Do you find generic use-cases for them in perf-events
> (outside hw-breakpoints)? Merging structures to create a generic one,
> but only to be used for a specific use-case (hw-breakpoint) doesn't
> sound like a good idea, and speculating on future use-cases (not current
> ones) has never been welcomed.



We need to give the parameters for this specific event. The current
struct perf_event_attr is not sufficient for that.
It needs to grow when new events bring new needs. That's
why it has a field for its size, that's why it has reserved fields.
I don't see where the problem is with that. Nothing in the breakpoint
API can help with that either. We still need a suitable userspace
gate to define a breakpoint.

And while at it, we reuse it for in-kernel uses.
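
For illustration, this is roughly how a breakpoint is described through struct perf_event_attr. The field and constant names are those of the uapi headers in later mainline kernels (linux/perf_event.h and linux/hw_breakpoint.h); the names in this patchset may differ. fill_write_bp_attr() is a hypothetical helper.

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>

/* Describe a 4-byte write watchpoint at `addr` (hypothetical helper). */
void fill_write_bp_attr(struct perf_event_attr *attr, unsigned long addr)
{
	memset(attr, 0, sizeof(*attr));
	/* The size field lets the structure grow without breaking the ABI. */
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_BREAKPOINT;
	/* Generic breakpoint parameters: type, address, length. */
	attr->bp_type = HW_BREAKPOINT_W;
	attr->bp_addr = addr;
	attr->bp_len = HW_BREAKPOINT_LEN_4;
	/* Report every hit. */
	attr->sample_period = 1;
}
```

The filled structure would then be handed to perf_event_open(2), which glibc does not wrap, so it is normally invoked through syscall(2).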

If you look at the struct hw_perf_event, you'll find a protean
structure, considering the union inside, so that it can fit the needs
for two different types of events.


> As I told you register/unregister combination is to be used, and I can
> get you some numbers about its overhead if that is of concern to you.
> It is the IPI that constitutes a large overhead (not needed for a 1-cpu request),
> and not the book-keeping work. A more elegant way would be to use the
> modify_kernel_hw_breakpoint() (interface submitted in one of my previous
> patches) to simply change the breakpoint address/type/len.



And btw the IPI that binds an event to a cpu is another example of
something already managed by perf.

My real concern is about the fact that the hw-breakpoint api
would act as an unnecessary midlayer there.
Perf is already able to talk directly to the pmu to enable/disable it,
which in practice is just a write to the debug registers.

Why should we encumber ourselves with such a midlayer?
