mirror of https://github.com/torvalds/linux.git
188 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
06afb0f361 |
tracing updates for v6.13:
- Addition of faultable tracepoints
There's a tracepoint attached to both a system call entry and exit. This
location is known to allow page faults. The tracepoints are called under
an rcu_read_lock() which does not allow faults that can sleep. This limits
the ability of tracepoint handlers to page fault in user space system call
parameters. Now these tracepoints have been made "faultable", allowing the
callbacks to fault in user space parameters and record them.
Note, only the infrastructure has been implemented. The consumers (perf,
ftrace, BPF) now need to have their code modified to allow faults.
- Fix up of BPF code for the tracepoint faultable logic
- Update tracepoints to use the new static branch API
- Remove trace_*_rcuidle() variants and the SRCU protection they used
- Remove unused TRACE_EVENT_FL_FILTERED logic
- Replace strncpy() with strscpy() and memcpy()
- Use replace per_cpu_ptr(smp_processor_id()) with this_cpu_ptr()
- Fix perf events to not duplicate samples when tracing is enabled
- Replace atomic64_add_return(1, counter) with atomic64_inc_return(counter)
- Make stack trace buffer 4K instead of PAGE_SIZE
- Remove TRACE_FLAG_IRQS_NOSUPPORT flag as it was never used
- Get the true return address for function tracer when function graph tracer
is also running.
When function_graph trace is running along with function tracer,
the parent function of the function tracer sometimes is
"return_to_handler", which is the function graph trampoline to record
the exit of the function. Use existing logic that calls into the
fgraph infrastructure to find the real return address.
- Remove (un)regfunc pointers out of tracepoint structure
- Added last minute bug fix for setting pending modules in stack function
filter.
echo "write*:mod:ext3" > /sys/kernel/tracing/stack_trace_filter
Would cause a kernel NULL dereference.
- Minor clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZz6dehQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlQsAP9aB0XGUV3UykvjZuKK84VDZ26a2hZH
X2JDYsNA4luuPAEAz/BG2rnslfMZ04WTMAl8h1eh10lxcuHG0wQMHVBXIwI=
=lzb5
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Addition of faultable tracepoints
There's a tracepoint attached to both a system call entry and exit.
This location is known to allow page faults. The tracepoints are
called under an rcu_read_lock() which does not allow faults that can
sleep. This limits the ability of tracepoint handlers to page fault
in user space system call parameters. Now these tracepoints have been
made "faultable", allowing the callbacks to fault in user space
parameters and record them.
Note, only the infrastructure has been implemented. The consumers
(perf, ftrace, BPF) now need to have their code modified to allow
faults.
- Fix up of BPF code for the tracepoint faultable logic
- Update tracepoints to use the new static branch API
- Remove trace_*_rcuidle() variants and the SRCU protection they used
- Remove unused TRACE_EVENT_FL_FILTERED logic
- Replace strncpy() with strscpy() and memcpy()
- Use replace per_cpu_ptr(smp_processor_id()) with this_cpu_ptr()
- Fix perf events to not duplicate samples when tracing is enabled
- Replace atomic64_add_return(1, counter) with
atomic64_inc_return(counter)
- Make stack trace buffer 4K instead of PAGE_SIZE
- Remove TRACE_FLAG_IRQS_NOSUPPORT flag as it was never used
- Get the true return address for function tracer when function graph
tracer is also running.
When function_graph trace is running along with function tracer, the
parent function of the function tracer sometimes is
"return_to_handler", which is the function graph trampoline to record
the exit of the function. Use existing logic that calls into the
fgraph infrastructure to find the real return address.
- Remove (un)regfunc pointers out of tracepoint structure
- Added last minute bug fix for setting pending modules in stack
function filter.
echo "write*:mod:ext3" > /sys/kernel/tracing/stack_trace_filter
Would cause a kernel NULL dereference.
- Minor clean ups
* tag 'trace-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (31 commits)
ftrace: Fix regression with module command in stack_trace_filter
tracing: Fix function name for trampoline
ftrace: Get the true parent ip for function tracer
tracing: Remove redundant check on field->field in histograms
bpf: ensure RCU Tasks Trace GP for sleepable raw tracepoint BPF links
bpf: decouple BPF link/attach hook and BPF program sleepable semantics
bpf: put bpf_link's program when link is safe to be deallocated
tracing: Replace strncpy() with strscpy() when copying comm
tracing: Add might_fault() check in __DECLARE_TRACE_SYSCALL
tracing: Fix syscall tracepoint use-after-free
tracing: Introduce tracepoint_is_faultable()
tracing: Introduce tracepoint extended structure
tracing: Remove TRACE_FLAG_IRQS_NOSUPPORT
tracing: Replace multiple deprecated strncpy with memcpy
tracing: Make percpu stack trace buffer invariant to PAGE_SIZE
tracing: Use atomic64_inc_return() in trace_clock_counter()
trace/trace_event_perf: remove duplicate samples on the first tracepoint event
tracing/bpf: Add might_fault check to syscall probes
tracing/perf: Add might_fault check to syscall probes
tracing/ftrace: Add might_fault check to syscall probes
...
|
|
|
|
c73eb02a47 |
fgragh: No need to invoke the function call_filter_check_discard()
The function call_filter_check_discard() has been removed in the commit |
|
|
|
0a6c61bc9c |
fgraph: Simplify return address printing in function graph tracer
Simplify return address printing in the function graph tracer by removing fgraph_extras. Since this feature is only used by the function graph tracer and the feature flags can directly accessible from the function graph tracer, fgraph_extras can be removed from the fgraph callback. Cc: Donglin Peng <dolinux.peng@gmail.com> Link: https://lore.kernel.org/172857234900.270774.15378354017601069781.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
49e4154f4b |
tracing: Remove TRACE_EVENT_FL_FILTERED logic
After commit
|
|
|
|
21e92806d3 |
function_graph: Support recording and printing the function return address
When using function_graph tracer to analyze the flow of kernel function
execution, it is often necessary to quickly locate the exact line of code
where the call occurs. While this may be easy at times, it can be more
time-consuming when some functions are inlined or the flow is too long.
This feature aims to simplify the process by recording the return address
of traced funcions and printing it when outputing trace logs.
To enhance human readability, the prefix 'ret=' is used for the kernel return
value, while '<-' serves as the prefix for the return address in trace logs to
make it look more like the function tracer.
A new trace option named 'funcgraph-retaddr' has been introduced, and the
existing option 'sym-addr' can be used to control the format of the return
address.
See below logs with both funcgraph-retval and funcgraph-retaddr enabled.
0) | load_elf_binary() { /* <-bprm_execve+0x249/0x600 */
0) | load_elf_phdrs() { /* <-load_elf_binary+0x84/0x1730 */
0) | __kmalloc_noprof() { /* <-load_elf_phdrs+0x4a/0xb0 */
0) 3.657 us | __cond_resched(); /* <-__kmalloc_noprof+0x28c/0x390 ret=0x0 */
0) + 24.335 us | } /* __kmalloc_noprof ret=0xffff8882007f3000 */
0) | kernel_read() { /* <-load_elf_phdrs+0x6c/0xb0 */
0) | rw_verify_area() { /* <-kernel_read+0x2b/0x50 */
0) | security_file_permission() { /* <-kernel_read+0x2b/0x50 */
0) | selinux_file_permission() { /* <-security_file_permission+0x26/0x40 */
0) | __inode_security_revalidate() { /* <-selinux_file_permission+0x6d/0x140 */
0) 2.034 us | __cond_resched(); /* <-__inode_security_revalidate+0x5f/0x80 ret=0x0 */
0) 6.602 us | } /* __inode_security_revalidate ret=0x0 */
0) 2.214 us | avc_policy_seqno(); /* <-selinux_file_permission+0x107/0x140 ret=0x0 */
0) + 16.670 us | } /* selinux_file_permission ret=0x0 */
0) + 20.809 us | } /* security_file_permission ret=0x0 */
0) + 25.217 us | } /* rw_verify_area ret=0x0 */
0) | __kernel_read() { /* <-load_elf_phdrs+0x6c/0xb0 */
0) | ext4_file_read_iter() { /* <-__kernel_read+0x160/0x2e0 */
Then, we can use the faddr2line to locate the source code, for example:
$ ./scripts/faddr2line ./vmlinux load_elf_phdrs+0x6c/0xb0
load_elf_phdrs+0x6c/0xb0:
elf_read at fs/binfmt_elf.c:471
(inlined by) load_elf_phdrs at fs/binfmt_elf.c:531
Link: https://lore.kernel.org/20240915032912.1118397-1-dolinux.peng@gmail.com
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409150605.HgUmU8ea-lkp@intel.com/
Signed-off-by: Donglin Peng <dolinux.peng@gmail.com>
[ Rebased to handle text_delta offsets ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
|
|
f1f36e22be |
ftrace: Have calltime be saved in the fgraph storage
The calltime field in the shadow stack frame is only used by the function graph tracer and profiler. But now that there's other users of the function graph infrastructure, this adds overhead and wastes space on the shadow stack. Move the calltime to the fgraph data storage, where the function graph and profiler entry functions will save it in its own graph storage and retrieve it in its exit functions. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <olsajiri@gmail.com> Link: https://lore.kernel.org/20240914214827.096968730@goodmis.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
3c9880f3ab |
ftrace: Use a running sleeptime instead of saving on shadow stack
The fgraph "sleep-time" option tells the function graph tracer and the profiler whether to include the time a function "sleeps" (is scheduled off the CPU) in its duration for the function. By default it is true, which means the duration of a function is calculated by the timestamp of when the function was entered to the timestamp of when it exits. If the "sleep-time" option is disabled, it needs to remove the time that the task was not running on the CPU during the function. Currently it is done in a sched_switch tracepoint probe where it moves the "calltime" (time of entry of the function) forward by the sleep time calculated. It updates all the calltime in the shadow stack. This is time consuming for those users of the function graph tracer that does not care about the sleep time. Instead, add a "ftrace_sleeptime" to the task_struct that gets the sleep time added each time the task wakes up. Then have the function entry save the current "ftrace_sleeptime" and on function exit, move the calltime forward by the difference of the current "ftrace_sleeptime" from the saved sleeptime. This removes one dependency of "calltime" needed to be on the shadow stack. It also simplifies the code that removes the sleep time of functions. TODO: Only enable the sched_switch tracepoint when this is needed. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <olsajiri@gmail.com> Link: https://lore.kernel.org/20240914214826.938908568@goodmis.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
4c57d0be52 |
tracing/fgraph: Have fgraph handle previous boot function addresses
Update the function graph code to modify the function addresses for a
previous boot buffer so that it matches the current kallsyms (note this
does not handle module addresses, yet).
After a reboot, instead of seeing:
# trace-cmd show -B boot_mapped | tail -n30
swapper/0-1 [000] d..2. 56.286470: 0) 0.481 us | 0xffffffff925da5c4();
swapper/0-1 [000] d.... 56.286471: 0) 4.065 us | }
swapper/0-1 [000] d.... 56.286471: 0) 4.920 us | }
swapper/0-1 [000] d..1. 56.286472: 0) | 0xffffffff92536254() {
swapper/0-1 [000] d..1. 56.286472: 0) + 28.974 us | 0xffffffff92534e30();
swapper/0-1 [000] d.... 56.286516: 0) + 43.881 us | }
swapper/0-1 [000] d..1. 56.286517: 0) | 0xffffffff925136c4() {
swapper/0-1 [000] d..1. 56.286518: 0) | 0xffffffff92514a14() {
swapper/0-1 [000] d..1. 56.286518: 0) 6.003 us | 0xffffffff92514200();
swapper/0-1 [000] d.... 56.286529: 0) + 11.510 us | }
swapper/0-1 [000] d.... 56.286529: 0) + 12.895 us | }
swapper/0-1 [000] d.... 56.286530: 0) ! 382.884 us | }
swapper/0-1 [000] d..1. 56.286530: 0) | 0xffffffff92536444() {
swapper/0-1 [000] d..1. 56.286531: 0) | 0xffffffff92536254() {
swapper/0-1 [000] d..1. 56.286531: 0) + 26.335 us | 0xffffffff92534e30();
swapper/0-1 [000] d.... 56.286560: 0) + 29.511 us | }
swapper/0-1 [000] d.... 56.286561: 0) + 30.452 us | }
swapper/0-1 [000] d..1. 56.286562: 0) | 0xffffffff9253c014() {
swapper/0-1 [000] d..1. 56.286562: 0) | 0xffffffff9253bed4() {
swapper/0-1 [000] d..1. 56.286563: 0) + 13.465 us | 0xffffffff92536684();
swapper/0-1 [000] d.... 56.286577: 0) + 14.651 us | }
swapper/0-1 [000] d.... 56.286577: 0) + 15.821 us | }
swapper/0-1 [000] d..1. 56.286578: 0) 0.667 us | 0xffffffff92547074();
swapper/0-1 [000] d..1. 56.286579: 0) 0.453 us | 0xffffffff924f35c4();
swapper/0-1 [000] d.... 56.286580: 0) # 3906.348 us | }
swapper/0-1 [000] d..1. 56.286581: 0) | 0xffffffff92531a14() {
swapper/0-1 [000] d..1. 56.286581: 0) 0.518 us | 0xffffffff92505cb4();
swapper/0-1 [000] d..1. 56.286595: 0) | 0xffffffff92db83c4() {
swapper/0-1 [000] d..1. 56.286596: 0) | 0xffffffff92dec2e4() {
swapper/0-1 [000] d..1. 56.286597: 0) | 0xffffffff92db5304() {
It now shows:
# trace-cmd show -B boot_mapped | tail -n30
swapper/0-1 [000] d..2. 363.079099: 0) 0.483 us | preempt_count_sub();
swapper/0-1 [000] d.... 363.079100: 0) 4.112 us | }
swapper/0-1 [000] d.... 363.079101: 0) 4.979 us | }
swapper/0-1 [000] d..1. 363.079101: 0) | disable_local_APIC() {
swapper/0-1 [000] d..1. 363.079102: 0) + 29.153 us | clear_local_APIC.part.0();
swapper/0-1 [000] d.... 363.079148: 0) + 46.517 us | }
swapper/0-1 [000] d..1. 363.079149: 0) | mcheck_cpu_clear() {
swapper/0-1 [000] d..1. 363.079149: 0) | mce_intel_feature_clear() {
swapper/0-1 [000] d..1. 363.079150: 0) 5.871 us | lmce_supported();
swapper/0-1 [000] d.... 363.079161: 0) + 11.340 us | }
swapper/0-1 [000] d.... 363.079161: 0) + 12.638 us | }
swapper/0-1 [000] d.... 363.079162: 0) ! 383.518 us | }
swapper/0-1 [000] d..1. 363.079162: 0) | lapic_shutdown() {
swapper/0-1 [000] d..1. 363.079163: 0) | disable_local_APIC() {
swapper/0-1 [000] d..1. 363.079163: 0) + 26.144 us | clear_local_APIC.part.0();
swapper/0-1 [000] d.... 363.079192: 0) + 29.424 us | }
swapper/0-1 [000] d.... 363.079192: 0) + 30.376 us | }
swapper/0-1 [000] d..1. 363.079193: 0) | restore_boot_irq_mode() {
swapper/0-1 [000] d..1. 363.079194: 0) | native_restore_boot_irq_mode() {
swapper/0-1 [000] d..1. 363.079194: 0) + 13.863 us | disconnect_bsp_APIC();
swapper/0-1 [000] d.... 363.079209: 0) + 14.933 us | }
swapper/0-1 [000] d.... 363.079209: 0) + 16.009 us | }
swapper/0-1 [000] d..1. 363.079210: 0) 0.694 us | hpet_disable();
swapper/0-1 [000] d..1. 363.079211: 0) 0.511 us | iommu_shutdown_noop();
swapper/0-1 [000] d.... 363.079212: 0) # 3980.260 us | }
swapper/0-1 [000] d..1. 363.079212: 0) | native_machine_emergency_restart() {
swapper/0-1 [000] d..1. 363.079213: 0) 0.495 us | tboot_shutdown();
swapper/0-1 [000] d..1. 363.079230: 0) | acpi_reboot() {
swapper/0-1 [000] d..1. 363.079231: 0) | acpi_reset() {
swapper/0-1 [000] d..1. 363.079232: 0) | acpi_os_write_port() {
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20240813171257.478901820@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
|
|
b84214890a |
function_graph: Move graph notrace bit to shadow stack global var
The use of the task->trace_recursion for the logic used for the function graph no-trace was a bit of an abuse of that variable. Now that there exists global vars that are per stack for registered graph traces, use that instead. Link: https://lore.kernel.org/linux-trace-kernel/171509107907.162236.6564679266777519065.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190823.796709456@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
12117f3307 |
function_graph: Move set_graph_function tests to shadow stack global var
The use of the task->trace_recursion for the logic used for the set_graph_function was a bit of an abuse of that variable. Now that there exists global vars that are per stack for registered graph traces, use that instead. Link: https://lore.kernel.org/linux-trace-kernel/171509105520.162236.10339831553995971290.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190823.472955399@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
c132be2c4f |
function_graph: Have the instances use their own ftrace_ops for filtering
Allow for instances to have their own ftrace_ops part of the fgraph_ops that makes the funtion_graph tracer filter on the set_ftrace_filter file of the instance and not the top instance. This uses the new ftrace_startup_subops(), by using graph_ops as the "manager ops" that defines the callback function and adds the functions defined by the filters of the ops for each trace instance. The callback defined by the manager ops will call the registered fgraph ops that were added to the fgraph_array. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509102088.162236.15758883237657317789.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190822.832946261@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
26dda5631d |
ftrace: Allow function_graph tracer to be enabled in instances
Now that function graph tracing can handle more than one user, allow it to be enabled in the ftrace instances. Note, the filtering of the functions is still joined by the top level set_ftrace_filter and friends, as well as the graph and nograph files. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509099743.162236.1699959255446248163.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190822.190630762@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
37238abe3c |
ftrace/function_graph: Pass fgraph_ops to function graph callbacks
Pass the fgraph_ops structure to the function graph callbacks. This will allow callbacks to add a descriptor to a fgraph_ops private field that wil be added in the future and use it for the callbacks. This will be useful when more than one callback can be registered to the function graph tracer. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509098588.162236.4787930115997357578.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/20240603190822.035147698@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Guo Ren <guoren@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
a1be9ccc57 |
function_graph: Support recording and printing the return value of function
Analyzing system call failures with the function_graph tracer can be a
time-consuming process, particularly when locating the kernel function
that first returns an error in the trace logs. This change aims to
simplify the process by recording the function return value to the
'retval' member of 'ftrace_graph_ret' and printing it when outputting
the trace log.
We have introduced new trace options: funcgraph-retval and
funcgraph-retval-hex. The former controls whether to display the return
value, while the latter controls the display format.
Please note that even if a function's return type is void, a return
value will still be printed. You can simply ignore it.
This patch only establishes the fundamental infrastructure. Subsequent
patches will make this feature available on some commonly used processor
architectures.
Here is an example:
I attempted to attach the demo process to a cpu cgroup, but it failed:
echo `pidof demo` > /sys/fs/cgroup/cpu/test/tasks
-bash: echo: write error: Invalid argument
The strace logs indicate that the write system call returned -EINVAL(-22):
...
write(1, "273\n", 4) = -1 EINVAL (Invalid argument)
...
To capture trace logs during a write system call, use the following
commands:
cd /sys/kernel/debug/tracing/
echo 0 > tracing_on
echo > trace
echo *sys_write > set_graph_function
echo *spin* > set_graph_notrace
echo *rcu* >> set_graph_notrace
echo *alloc* >> set_graph_notrace
echo preempt* >> set_graph_notrace
echo kfree* >> set_graph_notrace
echo $$ > set_ftrace_pid
echo function_graph > current_tracer
echo 1 > options/funcgraph-retval
echo 0 > options/funcgraph-retval-hex
echo 1 > tracing_on
echo `pidof demo` > /sys/fs/cgroup/cpu/test/tasks
echo 0 > tracing_on
cat trace > ~/trace.log
To locate the root cause, search for error code -22 directly in the file
trace.log and identify the first function that returned -22. Once you
have identified this function, examine its code to determine the root
cause.
For example, in the trace log below, cpu_cgroup_can_attach
returned -22 first, so we can focus our analysis on this function to
identify the root cause.
...
1) | cgroup_migrate() {
1) 0.651 us | cgroup_migrate_add_task(); /* = 0xffff93fcfd346c00 */
1) | cgroup_migrate_execute() {
1) | cpu_cgroup_can_attach() {
1) | cgroup_taskset_first() {
1) 0.732 us | cgroup_taskset_next(); /* = 0xffff93fc8fb20000 */
1) 1.232 us | } /* cgroup_taskset_first = 0xffff93fc8fb20000 */
1) 0.380 us | sched_rt_can_attach(); /* = 0x0 */
1) 2.335 us | } /* cpu_cgroup_can_attach = -22 */
1) 4.369 us | } /* cgroup_migrate_execute = -22 */
1) 7.143 us | } /* cgroup_migrate = -22 */
...
Link: https://lkml.kernel.org/r/1fc502712c981e0e6742185ba242992170ac9da8.1680954589.git.pengdonglin@sangfor.com.cn
Tested-by: Florian Kauer <florian.kauer@linutronix.de>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Donglin Peng <pengdonglin@sangfor.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
|
|
affc659246 |
tracing: in_irq() cleanup
Replace the obsolete and ambiguos macro in_irq() with new macro in_hardirq(). Link: https://lkml.kernel.org/r/20210930000342.6016-1-changbin.du@gmail.com Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
|
|
|
21ccc9cd72 |
tracing: Disable "other" permission bits in the tracefs files
When building the files in the tracefs file system, do not by default set any permissions for OTH (other). This will make it easier for admins who want to define a group for accessing tracefs and not having to first disable all the permission bits for "other" in the file system. As tracing can leak sensitive information, it should never by default allowing all users access. An admin can still set the permission bits for others to have access, which may be useful for creating a honeypot and seeing who takes advantage of it and roots the machine. Link: https://lkml.kernel.org/r/20210818153038.864149276@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
|
|
|
f2cc020d78 |
tracing: Fix various typos in comments
Fix ~59 single-word typos in the tracing code comments, and fix the grammar in a handful of places. Link: https://lore.kernel.org/r/20210322224546.GA1981273@gmail.com Link: https://lkml.kernel.org/r/20210323174935.GA4176821@gmail.com Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
|
|
|
36590c50b2 |
tracing: Merge irqflags + preempt counter.
The state of the interrupts (irqflags) and the preemption counter are
both passed down to tracing_generic_entry_update(). Only one bit of
irqflags is actually required: The on/off state. The complete 32bit
of the preemption counter isn't needed. Just whether of the upper bits
(softirq, hardirq and NMI) are set and the preemption depth is needed.
The irqflags and the preemption counter could be evaluated early and the
information stored in an integer `trace_ctx'.
tracing_generic_entry_update() would use the upper bits as the
TRACE_FLAG_* and the lower 8bit as the disabled-preemption depth
(considering that one must be substracted from the counter in one
special cases).
The actual preemption value is not used except for the tracing record.
The `irqflags' variable is mostly used only for the tracing record. An
exception here is for instance wakeup_tracer_call() or
probe_wakeup_sched_switch() which explicilty disable interrupts and use
that `irqflags' to save (and restore) the IRQ state and to record the
state.
Struct trace_event_buffer has also the `pc' and flags' members which can
be replaced with `trace_ctx' since their actual value is not used
outside of trace recording.
This will reduce tracing_generic_entry_update() to simply assign values
to struct trace_entry. The evaluation of the TRACE_FLAG_* bits is moved
to _tracing_gen_ctx_flags() which replaces preempt_count() and
local_save_flags() invocations.
As an example, ftrace_syscall_enter() may invoke:
- trace_buffer_lock_reserve() -> … -> tracing_generic_entry_update()
- event_trigger_unlock_commit()
-> ftrace_trace_stack() -> … -> tracing_generic_entry_update()
-> ftrace_trace_userstack() -> … -> tracing_generic_entry_update()
In this case the TRACE_FLAG_* bits were evaluated three times. By using
the `trace_ctx' they are evaluated once and assigned three times.
A build with all tracers enabled on x86-64 with and without the patch:
text data bss dec hex filename
21970669 17084168 7639260 46694097 2c87ed1 vmlinux.old
21970293 17084168 7639260 46693721 2c87d59 vmlinux.new
text shrank by 379 bytes, data remained constant.
Link: https://lkml.kernel.org/r/20210125194511.3924915-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
|
|
60602cb549 |
fgraph: Make overruns 4 bytes in graph stack structure
Inspecting the data structures of the function graph tracer, I found that the overrun value is unsigned long, which is 8 bytes on a 64 bit machine, and not only that, the depth is an int (4 bytes). The overrun can be simply an unsigned int (4 bytes) and pack the ftrace_graph_ret structure better. The depth is moved up next to the func, as it is used more often with func, and improves cache locality. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
|
|
|
22c36b1826 |
tracing: make tracing_init_dentry() returns an integer instead of a d_entry pointer
Current tracing_init_dentry() return a d_entry pointer, while is not necessary. This function returns NULL on success or error on failure, which means there is no valid d_entry pointer return. Let's return 0 on success and negative value for error. Link: https://lkml.kernel.org/r/20200712011036.70948-5-richard.weiyang@linux.alibaba.com Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
|
|
|
bc1a72afdc |
ring-buffer: Rename ring_buffer_read() to read_buffer_iter_advance()
When the ring buffer was first created, the iterator followed the normal producer/consumer operations where it had both a peek() operation, that just returned the event at the current location, and a read(), that would return the event at the current location and also increment the iterator such that the next peek() or read() will return the next event. The only use of the ring_buffer_read() is currently to move the iterator to the next location and nothing now actually reads the event it returns. Rename this function to its actual use case to ring_buffer_iter_advance(), which also adds the "iter" part to the name, which is more meaningful. As the timestamp returned by ring_buffer_read() was never used, there's no reason that this new version should bother having returning it. It will also become a void function. Link: http://lkml.kernel.org/r/20200317213416.018928618@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
|
|
|
1329249437 |
tracing: Make struct ring_buffer less ambiguous
As there's two struct ring_buffers in the kernel, it causes some confusion. The other one being the perf ring buffer. It was agreed upon that as neither of the ring buffers are generic enough to be used globally, they should be renamed as: perf's ring_buffer -> perf_buffer ftrace's ring_buffer -> trace_buffer This implements the changes to the ring buffer that ftrace uses. Link: https://lore.kernel.org/r/20191213140531.116b3200@gandalf.local.home Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
|
|
|
1c5eb4481e |
tracing: Rename trace_buffer to array_buffer
As we are working to remove the generic "ring_buffer" name that is used by both tracing and perf, the ring_buffer name for tracing will be renamed to trace_buffer, and perf's ring buffer will be renamed to perf_buffer. As there already exists a trace_buffer that is used by the trace_arrays, it needs to be first renamed to array_buffer. Link: https://lore.kernel.org/r/20191213153553.GE20583@krava Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
|
|
|
6c77221df9 |
fgraph: Remove redundant ftrace_graph_notrace_addr() test
We already have tested it before. The second one should be removed.
With this change, the performance should have little improvement.
Link: http://lkml.kernel.org/r/20190730140850.7927-1-changbin.du@gmail.com
Cc: stable@vger.kernel.org
Fixes:
|
|
|
|
afbab501c6 |
tracing: Put a margin between flags and duration for wakeup tracers
Don't mix context flags with function duration info.
Instead of this:
# tracer: wakeup_rt
#
# wakeup_rt latency trace v1.1.5 on 5.0.0-rc1-test+
# --------------------------------------------------------------------
# latency: 177 us, #545/545, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8)
# -----------------
# | task: migration/0-11 (uid:0 nice:0 policy:1 rt_prio:99)
# -----------------
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| /
# REL TIME CPU TASK/PID |||| DURATION FUNCTION CALLS
# | | | | |||| | | | | | |
0 us | 0) <idle>-0 | dNh5 | /* 0:120:R + [000] 11: 0:R migration/0 */
2 us | 0) <idle>-0 | dNh5 0.000 us | (null)();
4 us | 0) <idle>-0 | dNh4 | _raw_spin_unlock() {
4 us | 0) <idle>-0 | dNh4 0.304 us | preempt_count_sub();
5 us | 0) <idle>-0 | dNh3 1.063 us | }
5 us | 0) <idle>-0 | dNh3 0.266 us | ttwu_stat();
6 us | 0) <idle>-0 | dNh3 | _raw_spin_unlock_irqrestore() {
6 us | 0) <idle>-0 | dNh3 0.273 us | preempt_count_sub();
6 us | 0) <idle>-0 | dNh2 0.818 us | }
Show this:
# tracer: wakeup
#
# wakeup latency trace v1.1.5 on 4.20.0+
# --------------------------------------------------------------------
# latency: 593 us, #674/674, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
# -----------------
# | task: kworker/0:1H-339 (uid:0 nice:-20 policy:0 rt_prio:0)
# -----------------
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| /
# REL TIME CPU TASK/PID |||| DURATION FUNCTION CALLS
# | | | | |||| | | | | | |
0 us | 0) <idle>-0 | dNs. | | /* 0:120:R + [000] 339💯R kworker/0:1H */
3 us | 0) <idle>-0 | dNs. | 0.000 us | (null)();
67 us | 0) <idle>-0 | dNs. | 0.721 us | ttwu_stat();
69 us | 0) <idle>-0 | dNs. | 0.607 us | _raw_spin_unlock_irqrestore();
71 us | 0) <idle>-0 | .Ns. | 0.598 us | _raw_spin_lock_irq();
72 us | 0) <idle>-0 | .Ns. | 0.584 us | _raw_spin_lock_irq();
73 us | 0) <idle>-0 | dNs. | + 11.118 us | __next_timer_interrupt();
75 us | 0) <idle>-0 | dNs. | | call_timer_fn() {
76 us | 0) <idle>-0 | dNs. | | delayed_work_timer_fn() {
76 us | 0) <idle>-0 | dNs. | | __queue_work() {
...
Link: http://lkml.kernel.org/r/20190101154614.8887-4-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
|
|
9acd8de69d |
function_graph: Support displaying relative timestamp
When function_graph is used for latency tracers, relative timestamp
is more straightforward than absolute timestamp as function trace
does. This change adds relative timestamp support to function_graph
and applies to latency tracers (wakeup and irqsoff).
Instead of:
# tracer: irqsoff
#
# irqsoff latency trace v1.1.5 on 5.0.0-rc1-test
# --------------------------------------------------------------------
# latency: 521 us, #1125/1125, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8)
# -----------------
# | task: swapper/2-0 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: __schedule
# => ended at: _raw_spin_unlock_irq
#
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| /
# TIME CPU TASK/PID |||| DURATION FUNCTION CALLS
# | | | | |||| | | | | | |
124.974306 | 2) systemd-693 | d..1 0.000 us | __schedule();
124.974307 | 2) systemd-693 | d..1 | rcu_note_context_switch() {
124.974308 | 2) systemd-693 | d..1 0.487 us | rcu_preempt_deferred_qs();
124.974309 | 2) systemd-693 | d..1 0.451 us | rcu_qs();
124.974310 | 2) systemd-693 | d..1 2.301 us | }
[..]
124.974826 | 2) <idle>-0 | d..2 | finish_task_switch() {
124.974826 | 2) <idle>-0 | d..2 | _raw_spin_unlock_irq() {
124.974827 | 2) <idle>-0 | d..2 0.000 us | _raw_spin_unlock_irq();
124.974828 | 2) <idle>-0 | d..2 0.000 us | tracer_hardirqs_on();
<idle>-0 2d..2 552us : <stack trace>
=> __schedule
=> schedule_idle
=> do_idle
=> cpu_startup_entry
=> start_secondary
=> secondary_startup_64
Show:
# tracer: irqsoff
#
# irqsoff latency trace v1.1.5 on 5.0.0-rc1-test+
# --------------------------------------------------------------------
# latency: 511 us, #1053/1053, CPU#7 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8)
# -----------------
# | task: swapper/7-0 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: __schedule
# => ended at: _raw_spin_unlock_irq
#
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| /
# REL TIME CPU TASK/PID |||| DURATION FUNCTION CALLS
# | | | | |||| | | | | | |
0 us | 7) sshd-1704 | d..1 0.000 us | __schedule();
1 us | 7) sshd-1704 | d..1 | rcu_note_context_switch() {
1 us | 7) sshd-1704 | d..1 0.611 us | rcu_preempt_deferred_qs();
2 us | 7) sshd-1704 | d..1 0.484 us | rcu_qs();
3 us | 7) sshd-1704 | d..1 2.599 us | }
[..]
509 us | 7) <idle>-0 | d..2 | finish_task_switch() {
510 us | 7) <idle>-0 | d..2 | _raw_spin_unlock_irq() {
510 us | 7) <idle>-0 | d..2 0.000 us | _raw_spin_unlock_irq();
512 us | 7) <idle>-0 | d..2 0.000 us | tracer_hardirqs_on();
<idle>-0 7d..2 543us : <stack trace>
=> __schedule
=> schedule_idle
=> do_idle
=> cpu_startup_entry
=> start_secondary
=> secondary_startup_64
Link: http://lkml.kernel.org/r/20190101154614.8887-2-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
|
|
76b42b63ed |
function_graph: Move ftrace_graph_ret_addr() to fgraph.c
Move the function function_graph_ret_addr() to fgraph.c, as the management of the curr_ret_stack is going to change, and all the accesses to ret_stack needs to be done in fgraph.c. Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
|
|
|
688f7089d8 |
fgraph: Add new fgraph_ops structure to enable function graph hooks
Currently the registering of function graph is to pass in a entry and return function. We need to have a way to associate those functions together where the entry can determine to run the return hook. Having a structure that contains both functions will facilitate the process of converting the code to be able to do such. This is similar to the way function hooks are enabled (it passes in ftrace_ops). Instead of passing in the functions to use, a single structure is passed in to the registering function. The unregister function is now passed in the fgraph_ops handle. When we allow more than one callback to the function graph hooks, this will let the system know which one to remove. Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
|
|
|
c8dd0f4587 |
function_graph: Do not expose the graph_time option when profiler is not configured
When the function profiler is not configured, the "graph_time" option is meaningless, as the function profiler is the only thing that makes use of it. Do not expose it if the profiler is not configured. Link: http://lkml.kernel.org/r/20181123061133.GA195223@google.com Reported-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
|
|
|
761efe8a94 |
function_graph: Remove the use of FTRACE_NOTRACE_DEPTH
The curr_ret_stack is no longer set to a negative value when a function is not to be traced by the function graph tracer. Remove the usage of FTRACE_NOTRACE_DEPTH, as it is no longer needed. Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
|
|
|
9cd2992f2d |
fgraph: Have set_graph_notrace only affect function_graph tracer
In order to make the function graph infrastructure more generic, there can not be code specific for the function_graph tracer in the generic code. This includes the set_graph_notrace logic, that stops all graph calls when a function in the set_graph_notrace is hit. By using the trace_recursion mask, we can use a bit in the current task_struct to implement the notrace code, and move the logic out of fgraph.c and into trace_functions_graph.c and keeps it affecting only the tracer and not all call graph callbacks. Acked-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
|
|
|
d864a3ca88 |
fgraph: Create a fgraph.c file to store function graph infrastructure
As the function graph infrastructure can be used by thing other than tracing, moving the code to its own file out of the trace_functions_graph.c code makes more sense. The fgraph.c file will only contain the infrastructure required to hook into functions and their return code. Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
|
|
|
c43ac4a530 |
tracing: Do not line wrap short line in function_graph_enter()
Commit 588ca1786f2dd ("function_graph: Use new curr_ret_depth to manage
depth instead of curr_ret_stack") removed a parameter from the call
ftrace_push_return_trace() that made it so that the entire call was under 80
characters, but it did not remove the line break. There's no reason to break
that line up, so make it a single line.
Link: http://lkml.kernel.org/r/20181122100322.GN2131@hirez.programming.kicks-ass.net
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
|
|
5cf99a0f31 |
tracing/fgraph: Fix set_graph_function from showing interrupts
The tracefs file set_graph_function is used to only function graph functions
that are listed in that file (or all functions if the file is empty). The
way this is implemented is that the function graph tracer looks at every
function, and if the current depth is zero and the function matches
something in the file then it will trace that function. When other functions
are called, the depth will be greater than zero (because the original
function will be at depth zero), and all functions will be traced where the
depth is greater than zero.
The issue is that when a function is first entered, and the handler that
checks this logic is called, the depth is set to zero. If an interrupt comes
in and a function in the interrupt handler is traced, its depth will be
greater than zero and it will automatically be traced, even if the original
function was not. But because the logic only looks at depth it may trace
interrupts when it should not be.
The recent design change of the function graph tracer to fix other bugs
caused the depth to be zero while the function graph callback handler is
being called for a longer time, widening the race of this happening. This
bug was actually there for a longer time, but because the race window was so
small it seldom happened. The Fixes tag below is for the commit that widen
the race window, because that commit belongs to a series that will also help
fix the original bug.
Cc: stable@kernel.org
Fixes:
|
|
|
|
7c6ea35ef5 |
function_graph: Reverse the order of pushing the ret_stack and the callback
The function graph profiler uses the ret_stack to store the "subtime" and
reuse it by nested functions and also on the return. But the current logic
has the profiler callback called before the ret_stack is updated, and it is
just modifying the ret_stack that will later be allocated (it's just lucky
that the "subtime" is not touched when it is allocated).
This could also cause a crash if we are at the end of the ret_stack when
this happens.
By reversing the order of the allocating the ret_stack and then calling the
callbacks attached to a function being traced, the ret_stack entry is no
longer used before it is allocated.
Cc: stable@kernel.org
Fixes:
|
|
|
|
552701dd0f |
function_graph: Move return callback before update of curr_ret_stack
In the past, curr_ret_stack had two functions. One was to denote the depth
of the call graph, the other is to keep track of where on the ret_stack the
data is used. Although they may be slightly related, there are two cases
where they need to be used differently.
The one case is that it keeps the ret_stack data from being corrupted by an
interrupt coming in and overwriting the data still in use. The other is just
to know where the depth of the stack currently is.
The function profiler uses the ret_stack to save a "subtime" variable that
is part of the data on the ret_stack. If curr_ret_stack is modified too
early, then this variable can be corrupted.
The "max_depth" option, when set to 1, will record the first functions going
into the kernel. To see all top functions (when dealing with timings), the
depth variable needs to be lowered before calling the return hook. But by
lowering the curr_ret_stack, it makes the data on the ret_stack still being
used by the return hook susceptible to being overwritten.
Now that there's two variables to handle both cases (curr_ret_depth), we can
move them to the locations where they can handle both cases.
Cc: stable@kernel.org
Fixes:
|
|
|
|
39eb456dac |
function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack
Currently, the depth of the ret_stack is determined by curr_ret_stack index. The issue is that there's a race between setting of the curr_ret_stack and calling of the callback attached to the return of the function. Commit |
|
|
|
d125f3f866 |
function_graph: Make ftrace_push_return_trace() static
As all architectures now call function_graph_enter() to do the entry work,
no architecture should ever call ftrace_push_return_trace(). Make it static.
This is needed to prepare for a fix of a design bug on how the curr_ret_stack
is used.
Cc: stable@kernel.org
Fixes:
|
|
|
|
8114865ff8 |
function_graph: Create function_graph_enter() to consolidate architecture code
Currently all the architectures do basically the same thing in preparing the
function graph tracer on entry to a function. This code can be pulled into a
generic location and then this will allow the function graph tracer to be
fixed, as well as extended.
Create a new function graph helper function_graph_enter() that will call the
hook function (ftrace_graph_entry) and the shadow stack operation
(ftrace_push_return_trace), and remove the need of the architecture code to
manage the shadow stack.
This is needed to prepare for a fix of a design bug on how the curr_ret_stack
is used.
Cc: stable@kernel.org
Fixes:
|
|
|
|
1fe4293f4b |
tracing: Fix missing return symbol in function_graph output
The function_graph tracer does not show the interrupt return marker for the
leaf entry. On leaf entries, we see an unbalanced interrupt marker (the
interrupt was entered, but nevern left).
Before:
1) | SyS_write() {
1) | __fdget_pos() {
1) 0.061 us | __fget_light();
1) 0.289 us | }
1) | vfs_write() {
1) 0.049 us | rw_verify_area();
1) + 15.424 us | __vfs_write();
1) ==========> |
1) 6.003 us | smp_apic_timer_interrupt();
1) 0.055 us | __fsnotify_parent();
1) 0.073 us | fsnotify();
1) + 23.665 us | }
1) + 24.501 us | }
After:
0) | SyS_write() {
0) | __fdget_pos() {
0) 0.052 us | __fget_light();
0) 0.328 us | }
0) | vfs_write() {
0) 0.057 us | rw_verify_area();
0) | __vfs_write() {
0) ==========> |
0) 8.548 us | smp_apic_timer_interrupt();
0) <========== |
0) + 36.507 us | } /* __vfs_write */
0) 0.049 us | __fsnotify_parent();
0) 0.066 us | fsnotify();
0) + 50.064 us | }
0) + 50.952 us | }
Link: http://lkml.kernel.org/r/1517413729-20411-1-git-send-email-changbin.du@intel.com
Cc: stable@vger.kernel.org
Fixes:
|
|
|
|
b24413180f |
License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
9b130ad5bb |
treewide: make "nr_cpu_ids" unsigned
First, number of CPUs can't be negative number. Second, different signnnedness leads to suboptimal code in the following cases: 1) kmalloc(nr_cpu_ids * sizeof(X)); "int" has to be sign extended to size_t. 2) while (loff_t *pos < nr_cpu_ids) MOVSXD is 1 byte longed than the same MOV. Other cases exist as well. Basically compiler is told that nr_cpu_ids can't be negative which can't be deduced if it is "int". Code savings on allyesconfig kernel: -3KB add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370) function old new delta coretemp_cpu_online 450 512 +62 rcu_init_one 1234 1272 +38 pci_device_probe 374 399 +25 ... pgdat_reclaimable_pages 628 556 -72 select_fallback_rq 446 369 -77 task_numa_find_cpu 1923 1807 -116 Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
1a41442864 |
tracing/fgraph: Have wakeup and irqsoff tracers ignore graph functions too
Currently both the wakeup and irqsoff traces do not handle set_graph_notrace
well. The ftrace infrastructure will ignore the return paths of all
functions leaving them hanging without an end:
# echo '*spin*' > set_graph_notrace
# cat trace
[...]
_raw_spin_lock() {
preempt_count_add() {
do_raw_spin_lock() {
update_rq_clock();
Where the '*spin*' functions should have looked like this:
_raw_spin_lock() {
preempt_count_add();
do_raw_spin_lock();
}
update_rq_clock();
Instead, have the wakeup and irqsoff tracers ignore the functions that are
set by the set_graph_notrace like the function_graph tracer does. Move
the logic in the function_graph tracer into a header to allow wakeup and
irqsoff tracers to use it as well.
Cc: Namhyung Kim <namhyung.kim@lge.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
|
|
794de08a16 |
fgraph: Handle a case where a tracer ignores set_graph_notrace
Both the wakeup and irqsoff tracers can use the function graph tracer when
the display-graph option is set. The problem is that they ignore the notrace
file, and record the entry of functions that would be ignored by the
function_graph tracer. This causes the trace->depth to be recorded into the
ring buffer. The set_graph_notrace uses a trick by adding a large negative
number to the trace->depth when a graph function is to be ignored.
On trace output, the graph function uses the depth to record a stack of
functions. But since the depth is negative, it accesses the array with a
negative number and causes an out of bounds access that can cause a kernel
oops or corrupt data.
Have the print functions handle cases where a tracer still records functions
even when they are in set_graph_notrace.
Also add warnings if the depth is below zero before accessing the array.
Note, the function graph logic will still prevent the return of these
functions from being recorded, which means that they will be left hanging
without a return. For example:
# echo '*spin*' > set_graph_notrace
# echo 1 > options/display-graph
# echo wakeup > current_tracer
# cat trace
[...]
_raw_spin_lock() {
preempt_count_add() {
do_raw_spin_lock() {
update_rq_clock();
Where it should look like:
_raw_spin_lock() {
preempt_count_add();
do_raw_spin_lock();
}
update_rq_clock();
Cc: stable@vger.kernel.org
Cc: Namhyung Kim <namhyung.kim@lge.com>
Fixes:
|
|
|
|
52ffabe384 |
tracing: Make __buffer_unlock_commit() always_inline
The function __buffer_unlock_commit() is called in a few places outside of trace.c. But for the most part, it should really be inlined, as it is in the hot path of the trace_events. For the callers outside of trace.c, create a new function trace_buffer_unlock_commit_nostack(), as the reason it was used was to avoid the stack tracing that trace_buffer_unlock_commit() could do. Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> |
|
|
|
95107b30be |
This release cycle is rather small. Just a few fixes to tracing.
The big change is the addition of the hwlat tracer. It not only detects SMIs, but also other latency that's caused by the hardware. I have detected some latency from large boxes having bus contention. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJX9a77AAoJEKKk/i67LK/8UPEH/jcqMmOMhQYVQsNaJViA5uJM SV96gaLCc9cxXY04Hf7vx8RkVIyIqTCCQZ+RVZt4RSeqpsB2IzZ1u0CNKs2Z0MTv MdvQJoazRoDgVuPzKAsdAlDd0ykqHEFA5ayF3XDK4P2J97La+B4rQIqEiJX/aDrz i0NQQFg2ZF46mXJXn4oXe6nmr6WnbiEduawVjd7JvgILJO2hojDicOTQlNG41Nys 68fOV8mLk0OL7sFRjySLGcbdbKhP2YbNhxILXl8geLgS9+CFZXkE8oTRjjy9IMNA XrqbFLMWaRVv+Nig7bHIWKE8ZErC5WCYUw4LD2GTLMDx5AkAVLGFFp6TOiO4SG8= =ke23 -----END PGP SIGNATURE----- Merge tag 'trace-v4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "This release cycle is rather small. Just a few fixes to tracing. The big change is the addition of the hwlat tracer. It not only detects SMIs, but also other latency that's caused by the hardware. I have detected some latency from large boxes having bus contention" * tag 'trace-v4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Call traceoff trigger after event is recorded ftrace/scripts: Add helper script to bisect function tracing problem functions tracing: Have max_latency be defined for HWLAT_TRACER as well tracing: Add NMI tracing in hwlat detector tracing: Have hwlat trace migrate across tracing_cpumask CPUs tracing: Add documentation for hwlat_detector tracer tracing: Added hardware latency tracer ftrace: Access ret_stack->subtime only in the function profiler function_graph: Handle TRACE_BPUTS in print_graph_comment tracing/uprobe: Drop isdigit() check in create_trace_uprobe |
|
|
|
8861dd303c |
ftrace: Access ret_stack->subtime only in the function profiler
The subtime is used only for function profiler with function graph tracer enabled. Move the definition of subtime under CONFIG_FUNCTION_PROFILER to reduce the memory usage. Also move the initialization of subtime into the graph entry callback. Link: http://lkml.kernel.org/r/20160831025529.24018-1-namhyung@kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> |
|
|
|
613dccdf68 |
function_graph: Handle TRACE_BPUTS in print_graph_comment
It missed to handle TRACE_BPUTS so messages recorded by trace_bputs() will be shown with symbol info unnecessarily. You can see it with the trace_printk sample code: # cd /sys/kernel/tracing/ # echo sys_sync > set_graph_function # echo 1 > options/sym-offset # echo function_graph > current_tracer Note that the sys_sync filter was there to prevent recording other functions and the sym-offset option was needed since the first message was called from a module init function so kallsyms doesn't have the symbol and omitted in the output. # cd ~/build/kernel # insmod samples/trace_printk/trace-printk.ko # cd - # head trace Before: # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) | /* 0xffffffffa0002000: This is a static string that will use trace_bputs */ 1) | /* This is a dynamic string that will use trace_puts */ 1) | /* trace_printk_irq_work+0x5/0x7b [trace_printk]: (irq) This is a static string that will use trace_bputs */ 1) | /* (irq) This is a dynamic string that will use trace_puts */ 1) | /* (irq) This is a static string that will use trace_bprintk() */ 1) | /* (irq) This is a dynamic string that will use trace_printk */ After: # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) | /* This is a static string that will use trace_bputs */ 1) | /* This is a dynamic string that will use trace_puts */ 1) | /* (irq) This is a static string that will use trace_bputs */ 1) | /* (irq) This is a dynamic string that will use trace_puts */ 1) | /* (irq) This is a static string that will use trace_bprintk() */ 1) | /* (irq) This is a dynamic string that will use trace_printk */ Link: http://lkml.kernel.org/r/20160901024354.13720-1-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> |
|
|
|
223918e32a |
ftrace: Add ftrace_graph_ret_addr() stack unwinding helpers
When function graph tracing is enabled for a function, ftrace modifies the stack by replacing the original return address with the address of a hook function (return_to_handler). Stack unwinders need a way to get the original return address. Add an arch-independent helper function for that named ftrace_graph_ret_addr(). This adds two variations of the function: one depends on HAVE_FUNCTION_GRAPH_RET_ADDR_PTR, and the other relies on an index state variable. The former is recommended because, in some cases, the latter can cause problems when the unwinder skips stack frames. It can get out of sync with the ret_stack index and wrong addresses can be reported for the stack trace. Once all arches have been ported to use HAVE_FUNCTION_GRAPH_RET_ADDR_PTR, we can get rid of the distinction. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nilay Vaish <nilayvaish@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/36bd90f762fc5e5af3929e3797a68a64906421cf.1471607358.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
9a7c348ba6 |
ftrace: Add return address pointer to ftrace_ret_stack
Storing this value will help prevent unwinders from getting out of sync with the function graph tracer ret_stack. Now instead of needing a stateful iterator, they can compare the return address pointer to find the right ret_stack entry. Note that an array of 50 ftrace_ret_stack structs is allocated for every task. So when an arch implements this, it will add either 200 or 400 bytes of memory usage per task (depending on whether it's a 32-bit or 64-bit platform). Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nilay Vaish <nilayvaish@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a95cfcc39e8f26b89a430c56926af0bb217bc0a1.1471607358.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |