mirror of https://github.com/torvalds/linux.git
259 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
69c5079b49 |
tracing updates for v6.19:
- Merge branch shared with kprobes on extending trace options
The trace options were defined by a 32 bit variable. This limits the
tracing instances to have a total of 32 different options. As that limit
has been hit, and more options are being added, increase the option mask
to a 64 bit number, doubling the number of options available.
As this is required for the kprobe topic branches as well as the tracing
topic branch, a separate branch was created and merged into both.
- Make trace_user_fault_read() available for the rest of tracing
The function trace_user_fault_read() is used by trace_marker file read to
allow reading user space to be done fast and without locking or
allocations. Make this available so that the system call trace events can
use it too.
- Have system call trace events read user space values
Now that the system call trace events callbacks are called in a faultable
context, take advantage of this and read the user space buffers for
various system calls. For example, show the path name of the openat system
call instead of just showing the pointer to that path name in user space.
Also show the contents of the buffer of the write system call. Several
system call trace events are updated to make tracing into a light weight
strace tool for all applications in the system.
- Update perf system call tracing to do the same
- And a config and syscall_user_buf_size file to control the size of the buffer
Limit the amount of data that can be read from user space. The default
size is 63 bytes but that can be expanded to 165 bytes.
- Allow the persistent ring buffer to print system calls normally
The persistent ring buffer prints trace events by their type and ignores
the print_fmt. This is because the print_fmt may change from kernel to
kernel. As the system call output is fixed by the system call ABI itself,
there's no reason to limit that. This makes reading the system call events
in the persistent ring buffer much nicer and easier to understand.
- Add options to show text offset to function profiler
The function profiler that counts the number of times a function is hit
currently lists all functions by its name and offset. But this becomes
ambiguous when there are several functions with the same name. Add a
tracing option that changes the output to be that of _text+offset
instead. Now a user space tool can use this information to map the
_text+offset to the unique function it is counting.
- Report bad dynamic event command
If a bad command is passed to the dynamic_events file, report it properly
in the error log.
- Clean up tracer options
Clean up the tracer option code a bit, by removing some useless code and
also using switch statements instead of a series of if statements.
- Have tracing options be instance specific
Tracers can have their own options (function tracer, irqsoff tracer,
function graph tracer, etc). But now that the same tracer can be enabled
in multiple trace instances, their options are still global. The API is
per instance, thus changing one affects other instances. This isn't even
consistent, as the option take affect differently depending on when an
tracer started in an instance. Make the options for instances only affect
the instance it is changed under.
- Optimize pid_list lock contention
Whenever the pid_list is read, it uses a spin lock. This happens at every
sched switch. Taking the lock at sched switch can be removed by instead
using a seqlock counter.
- Clean up the trace trigger structures
The trigger code uses two different structures to implement a single
tigger. This was due to trying to reuse code for the two different types
of triggers (always on trigger, and count limited trigger). But by adding
a single field to one structure, the other structure could be absorbed
into the first structure making he code easier to understand.
- Create a bulk garbage collector for trace triggers
If user space has triggers for several hundreds of events and then removes
them, it can take several seconds to complete. This is because each
removal calls the slow tracepoint_synchronize_unregister() that can take
hundreds of milliseconds to complete. Instead, create a helper thread that
will do the clean up. When a trigger is removed, it will create the
kthread if it isn't already created, and then add the trigger to a llist.
The kthread will take the items off the llist, call
tracepoint_synchronize_unregister(), and then remove the items it took
off. It will then check if there's more items to free before sleeping.
This makes user space removing all these triggers to finish in less than a
second.
- Allow function tracing of some of the tracing infrastructure code
Because the tracing code can cause recursion issues if it is traced by the
function tracer the entire tracing directory disables function tracing.
But not all of tracing causes issues if it is traced. Namely, the event
tracing code. Add a config that enables some of the tracing code to be
traced to help in debugging it. Note, when this is enabled, it does add
noise to general function tracing, especially if events are enabled as
well (which is a common case).
- Add boot-time backup instance for persistent buffer
The persistent ring buffer is used mostly for kernel crash analysis in the
field. One issue is that if there's a crash, the data in the persistent
ring buffer must be read before tracing can begin using it. This slows
down the boot process. Once tracing starts in the persistent ring buffer,
the old data must be freed and the addresses no longer match and old
events can't be in the buffer with new events.
Create a way to create a backup buffer that copies the persistent ring
buffer at boot up. Then after a crash, the always on tracer can begin
immediately as well as the normal boot process while the crash analysis
tooling uses the backup buffer. After the backup buffer is finished being
read, it can be removed.
- Enable function graph args and return address options at the same time
Currently the when reading of arguments in the function graph tracer is
enabled, the option to record the parent function in the entry event can
not be enabled. Update the code so that it can.
- Add new struct_offset() helper macro
Add a new macro that takes a pointer to a structure and a name of one of
its members and it will return the offset of that member. This allows the
ring buffer code to simplify the following:
From: size = struct_size(entry, buf, cnt - sizeof(entry->id));
To: size = struct_offset(entry, id) + cnt;
There should be other simplifications that this macro can help out with as
well.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaS9xqxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qj6tAQD4MR1lsE3XpH09asO4CDDfhbtRSQVD
o8bVKVihWx/j5gD/XezjqE2Q2+DO6dhnsQY6pbtNdXoKgaMuQJGA+dvPsQc=
=HilC
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Extend tracing option mask to 64 bits
The trace options were defined by a 32 bit variable. This limits the
tracing instances to have a total of 32 different options. As that
limit has been hit, and more options are being added, increase the
option mask to a 64 bit number, doubling the number of options
available.
As this is required for the kprobe topic branches as well as the
tracing topic branch, a separate branch was created and merged into
both.
- Make trace_user_fault_read() available for the rest of tracing
The function trace_user_fault_read() is used by trace_marker file
read to allow reading user space to be done fast and without locking
or allocations. Make this available so that the system call trace
events can use it too.
- Have system call trace events read user space values
Now that the system call trace events callbacks are called in a
faultable context, take advantage of this and read the user space
buffers for various system calls. For example, show the path name of
the openat system call instead of just showing the pointer to that
path name in user space. Also show the contents of the buffer of the
write system call. Several system call trace events are updated to
make tracing into a light weight strace tool for all applications in
the system.
- Update perf system call tracing to do the same
- And a config and syscall_user_buf_size file to control the size of
the buffer
Limit the amount of data that can be read from user space. The
default size is 63 bytes but that can be expanded to 165 bytes.
- Allow the persistent ring buffer to print system calls normally
The persistent ring buffer prints trace events by their type and
ignores the print_fmt. This is because the print_fmt may change from
kernel to kernel. As the system call output is fixed by the system
call ABI itself, there's no reason to limit that. This makes reading
the system call events in the persistent ring buffer much nicer and
easier to understand.
- Add options to show text offset to function profiler
The function profiler that counts the number of times a function is
hit currently lists all functions by its name and offset. But this
becomes ambiguous when there are several functions with the same
name.
Add a tracing option that changes the output to be that of
'_text+offset' instead. Now a user space tool can use this
information to map the '_text+offset' to the unique function it is
counting.
- Report bad dynamic event command
If a bad command is passed to the dynamic_events file, report it
properly in the error log.
- Clean up tracer options
Clean up the tracer option code a bit, by removing some useless code
and also using switch statements instead of a series of if
statements.
- Have tracing options be instance specific
Tracers can have their own options (function tracer, irqsoff tracer,
function graph tracer, etc). But now that the same tracer can be
enabled in multiple trace instances, their options are still global.
The API is per instance, thus changing one affects other instances.
This isn't even consistent, as the option take affect differently
depending on when an tracer started in an instance. Make the options
for instances only affect the instance it is changed under.
- Optimize pid_list lock contention
Whenever the pid_list is read, it uses a spin lock. This happens at
every sched switch. Taking the lock at sched switch can be removed by
instead using a seqlock counter.
- Clean up the trace trigger structures
The trigger code uses two different structures to implement a single
tigger. This was due to trying to reuse code for the two different
types of triggers (always on trigger, and count limited trigger). But
by adding a single field to one structure, the other structure could
be absorbed into the first structure making he code easier to
understand.
- Create a bulk garbage collector for trace triggers
If user space has triggers for several hundreds of events and then
removes them, it can take several seconds to complete. This is
because each removal calls tracepoint_synchronize_unregister() that
can take hundreds of milliseconds to complete.
Instead, create a helper thread that will do the clean up. When a
trigger is removed, it will create the kthread if it isn't already
created, and then add the trigger to a llist. The kthread will take
the items off the llist, call tracepoint_synchronize_unregister(),
and then remove the items it took off. It will then check if there's
more items to free before sleeping.
This makes user space removing all these triggers to finish in less
than a second.
- Allow function tracing of some of the tracing infrastructure code
Because the tracing code can cause recursion issues if it is traced
by the function tracer the entire tracing directory disables function
tracing. But not all of tracing causes issues if it is traced.
Namely, the event tracing code. Add a config that enables some of the
tracing code to be traced to help in debugging it. Note, when this is
enabled, it does add noise to general function tracing, especially if
events are enabled as well (which is a common case).
- Add boot-time backup instance for persistent buffer
The persistent ring buffer is used mostly for kernel crash analysis
in the field. One issue is that if there's a crash, the data in the
persistent ring buffer must be read before tracing can begin using
it. This slows down the boot process. Once tracing starts in the
persistent ring buffer, the old data must be freed and the addresses
no longer match and old events can't be in the buffer with new
events.
Create a way to create a backup buffer that copies the persistent
ring buffer at boot up. Then after a crash, the always on tracer can
begin immediately as well as the normal boot process while the crash
analysis tooling uses the backup buffer. After the backup buffer is
finished being read, it can be removed.
- Enable function graph args and return address options at the same
time
Currently the when reading of arguments in the function graph tracer
is enabled, the option to record the parent function in the entry
event can not be enabled. Update the code so that it can.
- Add new struct_offset() helper macro
Add a new macro that takes a pointer to a structure and a name of one
of its members and it will return the offset of that member. This
allows the ring buffer code to simplify the following:
From: size = struct_size(entry, buf, cnt - sizeof(entry->id));
To: size = struct_offset(entry, id) + cnt;
There should be other simplifications that this macro can help out
with as well
* tag 'trace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (42 commits)
overflow: Introduce struct_offset() to get offset of member
function_graph: Enable funcgraph-args and funcgraph-retaddr to work simultaneously
tracing: Add boot-time backup of persistent ring buffer
ftrace: Allow tracing of some of the tracing code
tracing: Use strim() in trigger_process_regex() instead of skip_spaces()
tracing: Add bulk garbage collection of freeing event_trigger_data
tracing: Remove unneeded event_mutex lock in event_trigger_regex_release()
tracing: Merge struct event_trigger_ops into struct event_command
tracing: Remove get_trigger_ops() and add count_func() from trigger ops
tracing: Show the tracer options in boot-time created instance
ftrace: Avoid redundant initialization in register_ftrace_direct
tracing: Remove unused variable in tracing_trace_options_show()
fgraph: Make fgraph_no_sleep_time signed
tracing: Convert function graph set_flags() to use a switch() statement
tracing: Have function graph tracer option sleep-time be per instance
tracing: Move graph-time out of function graph options
tracing: Have function graph tracer option funcgraph-irqs be per instance
trace/pid_list: optimize pid_list->lock contention
tracing: Have function graph tracer define options per instance
tracing: Have function tracer define options per instance
...
|
|
|
|
b052d70f7c |
tracing: Merge struct event_trigger_ops into struct event_command
Now that there's pretty much a one to one mapping between the struct event_trigger_ops and struct event_command, there's no reason to have two different structures. Merge the function pointers of event_trigger_ops into event_command. There's one exception in trace_events_hist.c for the event_hist_trigger_named_ops. This has special logic for the init and free function pointers for "named histograms". In this case, allocate the cmd_ops of the event_trigger_data and set it to the proper init and free functions, which are used to initialize and free the event_trigger_data respectively. Have the free function and the init function (on failure) free the cmd_ops of the data element. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251125200932.446322765@kernel.org Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
bdafb4d4cb |
tracing: Remove get_trigger_ops() and add count_func() from trigger ops
The struct event_command has a callback function called get_trigger_ops(). This callback returns the "trigger_ops" to use for the trigger. These ops define the trigger function, how to init the trigger, how to print the trigger and how to free it. The only reason there's a callback function to get these ops is because some triggers have two types of operations. One is an "always on" operation, and the other is a "count down" operation. If a user passes in a parameter to say how many times the trigger should execute. For example: echo stacktrace:5 > events/kmem/kmem_cache_alloc/trigger It will trigger the stacktrace for the first 5 times the kmem_cache_alloc event is hit. Instead of having two different trigger_ops since the only difference between them is the tigger itself (the print, init and free functions are all the same), just use a single ops that the event_command points to and add a function field to the trigger_ops to have a count_func. When a trigger is added to an event, if there's a count attached to it and the trigger ops has the count_func field, the data allocated to represent this trigger will have a new flag set called COUNT. Then when the trigger executes, it will check if the COUNT data flag is set, and if so, it will call the ops count_func(). If that returns false, it returns without executing the trigger. This removes the need for duplicate event_trigger_ops structures. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251125200932.274566147@kernel.org Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
80f0d631dc |
tracing: Fix memory leaks in create_field_var()
The function create_field_var() allocates memory for 'val' through
create_hist_field() inside parse_atom(), and for 'var' through
create_var(), which in turn allocates var->type and var->var.name
internally. Simply calling kfree() to release these structures will
result in memory leaks.
Use destroy_hist_field() to properly free 'val', and explicitly release
the memory of var->type and var->var.name before freeing 'var' itself.
Link: https://patch.msgid.link/20251106120132.3639920-1-zilin@seu.edu.cn
Fixes:
|
|
|
|
f2947c4b7d |
tracing: Rename event_trigger_alloc() to trigger_data_alloc()
The function event_trigger_alloc() creates an event_trigger_data
descriptor and states that it needs to be freed via event_trigger_free().
This is incorrect, it needs to be freed by trigger_data_free() as
event_trigger_free() adds ref counting.
Rename event_trigger_alloc() to trigger_data_alloc() and state that it
needs to be freed via trigger_data_free(). This naming convention
was introducing bugs.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://lore.kernel.org/20250507145455.776436410@goodmis.org
Fixes:
|
|
|
|
54c53dfdb6 |
tracing: Add common_comm to histograms
If one wants to trace the name of the task that wakes up a process and pass that to the synthetic events, there's nothing currently that lets the synthetic events do that. Add a "common_comm" to the histogram logic that allows histograms save the current->comm as a variable that can be passed through and added to a synthetic event: # cd /sys/kernel/tracing # echo 's:wake_lat char[] waker; char[] wakee; u64 delta;' >> dynamic_events # echo 'hist:keys=pid:comm=common_comm:ts=common_timestamp.usecs if !(common_flags & 0x18)' > events/sched/sched_waking/trigger # echo 'hist:keys=next_pid:wake_comm=$comm:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,$wake_comm,next_comm,$delta)' > events/sched/sched_switch/trigger The above will create a synthetic trace event that will save both the name of the waker and the wakee but only if the wakeup did not happen in a hard or soft interrupt context. The "common_comm" is used to save the task->comm at the time of the initial event and is passed via the "comm" variable to the second event, and that is saved as the "waker" field in the "wake_lat" synthetic event. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250407154912.3c6c6246@gandalf.local.home Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
7ab0fc61ce |
tracing: Move histogram trigger variables from stack to per CPU structure
The histogram trigger has three somewhat large arrays on the kernel stack:
unsigned long entries[HIST_STACKTRACE_DEPTH];
u64 var_ref_vals[TRACING_MAP_VARS_MAX];
char compound_key[HIST_KEY_SIZE_MAX];
Checking the function event_hist_trigger() stack frame size, it currently
uses 816 bytes for its stack frame due to these variables!
Instead, allocate a per CPU structure that holds these arrays for each
context level (normal, softirq, irq and NMI). That is, each CPU will have
4 of these structures. This will be allocated when the first histogram
trigger is enabled and freed when the last is disabled. When the
histogram callback triggers, it will request this structure. The request
will disable preemption, get the per CPU structure at the index of the
per CPU variable, and increment that variable.
The callback will use the arrays in this structure to perform its work and
then release the structure. That in turn will simply decrement the per CPU
index and enable preemption.
Moving the variables from the kernel stack to the per CPU structure brings
the stack frame of event_hist_trigger() down to just 112 bytes.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://lore.kernel.org/20250407123851.74ea8d58@gandalf.local.home
Fixes:
|
|
|
|
872a0d90c1 |
tracing: Always use memcpy() in histogram add_to_key()
The add_to_key() function tests if the key is a string or some data. If
it's a string it does some further calculations of the string size (still
truncating it to the max size it can be), and calls strncpy().
If the key isn't as string it calls memcpy(). The interesting point is
that both use the exact same parameters:
strncpy(compound_key + key_field->offset, (char *)key, size);
} else
memcpy(compound_key + key_field->offset, key, size);
As strncpy() is being used simply as a memcpy() for a string, and since
strncpy() is deprecated, just call memcpy() for both memory and string
keys.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/20250403210637.1c477d4a@gandalf.local.home
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
|
|
744fab2d9f |
tracing updates for v6.15:
- Add option traceoff_after_boot In order to debug kernel boot, it sometimes is helpful to enable tracing via the kernel command line. Unfortunately, by the time the login prompt appears, the trace is overwritten by the init process and other user space start up applications. Adding a "traceoff_after_boot" will disable tracing when the kernel passes control to init which will allow developers to be able to see the traces that occurred during boot. - Clean up the mmflags macros that display the GFP flags in trace events The macros to print the GFP flags for trace events had a bit of duplication. The code was restructured to remove duplication and in the process it also adds some flags that were missed before. - Removed some dead code and scripts/draw_functrace.py draw_functrace.py hasn't worked in years and as nobody complained about it, remove it. - Constify struct event_trigger_ops The event_trigger_ops is just a structure that has function pointers that are assigned when the variables are created. These variables should all be constants. - Other minor clean ups and fixes -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ+V9IhQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qr4RAP9JhE3n69pGuOVaJTN/LGLr2Axl59n4 KqZSZS1nUM76/gD6AxYpR7nxyxgJ7VjNkLptS9tSjJVdPDxGAl0v3eO04w4= =SU30 -----END PGP SIGNATURE----- Merge tag 'trace-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Add option traceoff_after_boot In order to debug kernel boot, it sometimes is helpful to enable tracing via the kernel command line. Unfortunately, by the time the login prompt appears, the trace is overwritten by the init process and other user space start up applications. Adding a "traceoff_after_boot" will disable tracing when the kernel passes control to init which will allow developers to be able to see the traces that occurred during boot. - Clean up the mmflags macros that display the GFP flags in trace events The macros to print the GFP flags for trace events had a bit of duplication. The code was restructured to remove duplication and in the process it also adds some flags that were missed before. - Removed some dead code and scripts/draw_functrace.py draw_functrace.py hasn't worked in years and as nobody complained about it, remove it. - Constify struct event_trigger_ops The event_trigger_ops is just a structure that has function pointers that are assigned when the variables are created. These variables should all be constants. - Other minor clean ups and fixes * tag 'trace-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Replace strncpy with memcpy for fixed-length substring copy tracing: Fix synth event printk format for str fields tracing: Do not use PERF enums when perf is not defined tracing: Ensure module defining synth event cannot be unloaded while tracing tracing: fix return value in __ftrace_event_enable_disable for TRACE_REG_UNREGISTER tracing/osnoise: Fix possible recursive locking for cpus_read_lock() tracing: Align synth event print fmt tracing: gfp: vsprintf: Do not print "none" when using %pGg printf format tracepoint: Print the function symbol when tracepoint_debug is set tracing: Constify struct event_trigger_ops scripts/tracing: Remove scripts/tracing/draw_functrace.py tracing: Update MAINTAINERS file to include tracepoint.c tracing/user_events: Slightly simplify user_seq_show() tracing/user_events: Don't use %pK through printk tracing: gfp: Remove duplication of recording GFP flags tracing: Remove orphaned event_trace_printk ring-buffer: Fix typo in comment about header page pointer tracing: Add traceoff_after_boot option |
|
|
|
502d2e71a8 |
tracing: Constify struct event_trigger_ops
'event_trigger_ops mwifiex_if_ops' are not modified in these drivers. Constifying these structures moves some data to a read-only section, so increase overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 31368 9024 6200 46592 b600 kernel/trace/trace_events_trigger.o After: ===== text data bss dec hex filename 31752 8608 6200 46560 b5e0 kernel/trace/trace_events_trigger.o Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/66e8f990e649678e4be37d4d1a19158ca0dea2f4.1741521295.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
0b4ffbe488 |
tracing: Correct the refcount if the hist/hist_debug file fails to open
The function event_{hist,hist_debug}_open() maintains the refcount of
'file->tr' and 'file' through tracing_open_file_tr(). However, it does
not roll back these counts on subsequent failure paths, resulting in a
refcount leak.
A very obvious case is that if the hist/hist_debug file belongs to a
specific instance, the refcount leak will prevent the deletion of that
instance, as it relies on the condition 'tr->ref == 1' within
__remove_instance().
Fix this by calling tracing_release_file_tr() on all failure paths in
event_{hist,hist_debug}_open() to correct the refcount.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Link: https://lore.kernel.org/20250314065335.1202817-1-wutengda@huaweicloud.com
Fixes:
|
|
|
|
6f86bdeab6 |
tracing: Fix bad hist from corrupting named_triggers list
The following commands causes a crash:
~# cd /sys/kernel/tracing/events/rcu/rcu_callback
~# echo 'hist:name=bad:keys=common_pid:onmax(bogus).save(common_pid)' > trigger
bash: echo: write error: Invalid argument
~# echo 'hist:name=bad:keys=common_pid' > trigger
Because the following occurs:
event_trigger_write() {
trigger_process_regex() {
event_hist_trigger_parse() {
data = event_trigger_alloc(..);
event_trigger_register(.., data) {
cmd_ops->reg(.., data, ..) [hist_register_trigger()] {
data->ops->init() [event_hist_trigger_init()] {
save_named_trigger(name, data) {
list_add(&data->named_list, &named_triggers);
}
}
}
}
ret = create_actions(); (return -EINVAL)
if (ret)
goto out_unreg;
[..]
ret = hist_trigger_enable(data, ...) {
list_add_tail_rcu(&data->list, &file->triggers); <<<---- SKIPPED!!! (this is important!)
[..]
out_unreg:
event_hist_unregister(.., data) {
cmd_ops->unreg(.., data, ..) [hist_unregister_trigger()] {
list_for_each_entry(iter, &file->triggers, list) {
if (!hist_trigger_match(data, iter, named_data, false)) <- never matches
continue;
[..]
test = iter;
}
if (test && test->ops->free) <<<-- test is NULL
test->ops->free(test) [event_hist_trigger_free()] {
[..]
if (data->name)
del_named_trigger(data) {
list_del(&data->named_list); <<<<-- NEVER gets removed!
}
}
}
}
[..]
kfree(data); <<<-- frees item but it is still on list
The next time a hist with name is registered, it causes an u-a-f bug and
the kernel can crash.
Move the code around such that if event_trigger_register() succeeds, the
next thing called is hist_trigger_enable() which adds it to the list.
A bunch of actions is called if get_named_trigger_data() returns false.
But that doesn't need to be called after event_trigger_register(), so it
can be moved up, allowing event_trigger_register() to be called just
before hist_trigger_enable() keeping them together and allowing the
file->triggers to be properly populated.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250227163944.1c37f85f@gandalf.local.home
Fixes:
|
|
|
|
66fc6f521a |
tracing/hist: Support POLLPRI event for poll on histogram
Since POLLIN will not be flushed until the hist file is read, the user needs to repeatedly read() and poll() on the hist file for monitoring the event continuously. But the read() is somewhat redundant when the user is only monitoring for event updates. Add POLLPRI poll event on the hist file so the event returns when a histogram is updated after open(), poll() or read(). Thus it is possible to wait for the next event without having to issue a read(). Cc: Shuah Khan <shuah@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173527248770.464571.2536902137325258133.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
1bd13edbbe |
tracing/hist: Add poll(POLLIN) support on hist file
Add poll syscall support on the `hist` file. The Waiter will be waken up when the histogram is updated with POLLIN. Currently, there is no way to wait for a specific event in userspace. So user needs to peek the `trace` periodicaly, or wait on `trace_pipe`. But it is not a good idea to peek at the `trace` for an event that randomly happens. And `trace_pipe` is not coming back until a page is filled with events. This allows a user to wait for a specific event on the `hist` file. User can set a histogram trigger on the event which they want to monitor and poll() on its `hist` file. Since this poll() returns POLLIN, the next poll() will return soon unless a read() happens on that hist file. NOTE: To read the hist file again, you must set the file offset to 0, but just for monitoring the event, you may not need to read the histogram. Cc: Shuah Khan <shuah@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173527247756.464571.14236296701625509931.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
2b36a97aee |
tracing: Switch trace_events_hist.c code over to use guard()
There are a couple functions in trace_events_hist.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.694601480@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
6371b4bc17 |
tracing: Remove redundant check on field->field in histograms
The check on field->field being true is handled as the first check on the cascaded if statement, so the later checks on field->field are redundant because this clause has already been handled. Since this later check is redundant, just remove it. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241107120530.18728-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
242b32d807 |
tracing: Replace strncpy() with strscpy() when copying comm
Replace the depreciated[1] strncpy() calls with strscpy() when copying comm. Link: https://github.com/KSPP/linux/issues/90 [1] Cc: <mhiramat@kernel.org> Cc: <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241031120139.1343025-1-ruanjinjie@huawei.com Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
4a8840af5f |
tracepoints: Use new static branch API
The old static key API is deprecated. Switch to the new one. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/7a08dae3c5eddb14b13864923c1b58ac1f4af83c.1728414936.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
b156040869 |
tracing: Have format file honor EVENT_FILE_FL_FREED
When eventfs was introduced, special care had to be done to coordinate the
freeing of the file meta data with the files that are exposed to user
space. The file meta data would have a ref count that is set when the file
is created and would be decremented and freed after the last user that
opened the file closed it. When the file meta data was to be freed, it
would set a flag (EVENT_FILE_FL_FREED) to denote that the file is freed,
and any new references made (like new opens or reads) would fail as it is
marked freed. This allowed other meta data to be freed after this flag was
set (under the event_mutex).
All the files that were dynamically created in the events directory had a
pointer to the file meta data and would call event_release() when the last
reference to the user space file was closed. This would be the time that it
is safe to free the file meta data.
A shortcut was made for the "format" file. It's i_private would point to
the "call" entry directly and not point to the file's meta data. This is
because all format files are the same for the same "call", so it was
thought there was no reason to differentiate them. The other files
maintain state (like the "enable", "trigger", etc). But this meant if the
file were to disappear, the "format" file would be unaware of it.
This caused a race that could be trigger via the user_events test (that
would create dynamic events and free them), and running a loop that would
read the user_events format files:
In one console run:
# cd tools/testing/selftests/user_events
# while true; do ./ftrace_test; done
And in another console run:
# cd /sys/kernel/tracing/
# while true; do cat events/user_events/__test_event/format; done 2>/dev/null
With KASAN memory checking, it would trigger a use-after-free bug report
(which was a real bug). This was because the format file was not checking
the file's meta data flag "EVENT_FILE_FL_FREED", so it would access the
event that the file meta data pointed to after the event was freed.
After inspection, there are other locations that were found to not check
the EVENT_FILE_FL_FREED flag when accessing the trace_event_file. Add a
new helper function: event_file_file() that will make sure that the
event_mutex is held, and will return NULL if the trace_event_file has the
EVENT_FILE_FL_FREED flag set. Have the first reference of the struct file
pointer use event_file_file() and check for NULL. Later uses can still use
the event_file_data() helper function if the event_mutex is still held and
was not released since the event_file_file() call.
Link: https://lore.kernel.org/all/20240719204701.1605950-1-minipli@grsecurity.net/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Ilkka Naulapää <digirigawa@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Florian Fainelli <florian.fainelli@broadcom.com>
Cc: Alexey Makhalov <alexey.makhalov@broadcom.com>
Cc: Vasavi Sirnapalli <vasavi.sirnapalli@broadcom.com>
Link: https://lore.kernel.org/20240730110657.3b69d3c1@gandalf.local.home
Fixes:
|
|
|
|
4f1991a92c |
tracing histograms: Simplify parse_actions() function
The parse_actions() function uses 'len = str_has_prefix()' to test which action is in the string being parsed. But then it goes and repeats the logic for each different action. This logic can be simplified and duplicate code can be removed as 'len' contains the length of the found prefix which should be used for all actions. Link: https://lore.kernel.org/all/20240107112044.6702cb66@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240107203258.37e26d2b@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andy Shevchenko <andy@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
1cc111b9cd |
tracing: Fix uaf issue when open the hist or hist_debug file
KASAN report following issue. The root cause is when opening 'hist'
file of an instance and accessing 'trace_event_file' in hist_show(),
but 'trace_event_file' has been freed due to the instance being removed.
'hist_debug' file has the same problem. To fix it, call
tracing_{open,release}_file_tr() in file_operations callback to have
the ref count and avoid 'trace_event_file' being freed.
BUG: KASAN: slab-use-after-free in hist_show+0x11e0/0x1278
Read of size 8 at addr ffff242541e336b8 by task head/190
CPU: 4 PID: 190 Comm: head Not tainted 6.7.0-rc5-g26aff849438c #133
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x98/0xf8
show_stack+0x1c/0x30
dump_stack_lvl+0x44/0x58
print_report+0xf0/0x5a0
kasan_report+0x80/0xc0
__asan_report_load8_noabort+0x1c/0x28
hist_show+0x11e0/0x1278
seq_read_iter+0x344/0xd78
seq_read+0x128/0x1c0
vfs_read+0x198/0x6c8
ksys_read+0xf4/0x1e0
__arm64_sys_read+0x70/0xa8
invoke_syscall+0x70/0x260
el0_svc_common.constprop.0+0xb0/0x280
do_el0_svc+0x44/0x60
el0_svc+0x34/0x68
el0t_64_sync_handler+0xb8/0xc0
el0t_64_sync+0x168/0x170
Allocated by task 188:
kasan_save_stack+0x28/0x50
kasan_set_track+0x28/0x38
kasan_save_alloc_info+0x20/0x30
__kasan_slab_alloc+0x6c/0x80
kmem_cache_alloc+0x15c/0x4a8
trace_create_new_event+0x84/0x348
__trace_add_new_event+0x18/0x88
event_trace_add_tracer+0xc4/0x1a0
trace_array_create_dir+0x6c/0x100
trace_array_create+0x2e8/0x568
instance_mkdir+0x48/0x80
tracefs_syscall_mkdir+0x90/0xe8
vfs_mkdir+0x3c4/0x610
do_mkdirat+0x144/0x200
__arm64_sys_mkdirat+0x8c/0xc0
invoke_syscall+0x70/0x260
el0_svc_common.constprop.0+0xb0/0x280
do_el0_svc+0x44/0x60
el0_svc+0x34/0x68
el0t_64_sync_handler+0xb8/0xc0
el0t_64_sync+0x168/0x170
Freed by task 191:
kasan_save_stack+0x28/0x50
kasan_set_track+0x28/0x38
kasan_save_free_info+0x34/0x58
__kasan_slab_free+0xe4/0x158
kmem_cache_free+0x19c/0x508
event_file_put+0xa0/0x120
remove_event_file_dir+0x180/0x320
event_trace_del_tracer+0xb0/0x180
__remove_instance+0x224/0x508
instance_rmdir+0x44/0x78
tracefs_syscall_rmdir+0xbc/0x140
vfs_rmdir+0x1cc/0x4c8
do_rmdir+0x220/0x2b8
__arm64_sys_unlinkat+0xc0/0x100
invoke_syscall+0x70/0x260
el0_svc_common.constprop.0+0xb0/0x280
do_el0_svc+0x44/0x60
el0_svc+0x34/0x68
el0t_64_sync_handler+0xb8/0xc0
el0t_64_sync+0x168/0x170
Link: https://lore.kernel.org/linux-trace-kernel/20231214012153.676155-1-zhengyejian1@huawei.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
|
|
545db7e21e |
tracing/histograms: Simplify last_cmd_set()
Turn a kzalloc()+strcpy()+strncat() into an equivalent and less verbose kasprintf(). Link: https://lore.kernel.org/linux-trace-kernel/30b6fb04dadc10a03cc1ad08f5d8a93ef623a167.1697899346.git.christophe.jaillet@wanadoo.fr Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Mukesh ojha <quic_mojha@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
4b8b390516 |
tracing/histograms: Return an error if we fail to add histogram to hist_vars list
Commit |
|
|
|
6018b585e8 |
tracing/histograms: Add histograms to hist_vars if they have referenced variables
Hist triggers can have referenced variables without having direct
variables fields. This can be the case if referenced variables are added
for trigger actions. In this case the newly added references will not
have field variables. Not taking such referenced variables into
consideration can result in a bug where it would be possible to remove
hist trigger with variables being refenced. This will result in a bug
that is easily reproducable like so
$ cd /sys/kernel/tracing
$ echo 'synthetic_sys_enter char[] comm; long id' >> synthetic_events
$ echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger
$ echo 'hist:keys=common_pid.execname,id.syscall:onmatch(raw_syscalls.sys_enter).synthetic_sys_enter($comm, id)' >> events/raw_syscalls/sys_enter/trigger
$ echo '!hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger
[ 100.263533] ==================================================================
[ 100.264634] BUG: KASAN: slab-use-after-free in resolve_var_refs+0xc7/0x180
[ 100.265520] Read of size 8 at addr ffff88810375d0f0 by task bash/439
[ 100.266320]
[ 100.266533] CPU: 2 PID: 439 Comm: bash Not tainted 6.5.0-rc1 #4
[ 100.267277] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
[ 100.268561] Call Trace:
[ 100.268902] <TASK>
[ 100.269189] dump_stack_lvl+0x4c/0x70
[ 100.269680] print_report+0xc5/0x600
[ 100.270165] ? resolve_var_refs+0xc7/0x180
[ 100.270697] ? kasan_complete_mode_report_info+0x80/0x1f0
[ 100.271389] ? resolve_var_refs+0xc7/0x180
[ 100.271913] kasan_report+0xbd/0x100
[ 100.272380] ? resolve_var_refs+0xc7/0x180
[ 100.272920] __asan_load8+0x71/0xa0
[ 100.273377] resolve_var_refs+0xc7/0x180
[ 100.273888] event_hist_trigger+0x749/0x860
[ 100.274505] ? kasan_save_stack+0x2a/0x50
[ 100.275024] ? kasan_set_track+0x29/0x40
[ 100.275536] ? __pfx_event_hist_trigger+0x10/0x10
[ 100.276138] ? ksys_write+0xd1/0x170
[ 100.276607] ? do_syscall_64+0x3c/0x90
[ 100.277099] ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 100.277771] ? destroy_hist_data+0x446/0x470
[ 100.278324] ? event_hist_trigger_parse+0xa6c/0x3860
[ 100.278962] ? __pfx_event_hist_trigger_parse+0x10/0x10
[ 100.279627] ? __kasan_check_write+0x18/0x20
[ 100.280177] ? mutex_unlock+0x85/0xd0
[ 100.280660] ? __pfx_mutex_unlock+0x10/0x10
[ 100.281200] ? kfree+0x7b/0x120
[ 100.281619] ? ____kasan_slab_free+0x15d/0x1d0
[ 100.282197] ? event_trigger_write+0xac/0x100
[ 100.282764] ? __kasan_slab_free+0x16/0x20
[ 100.283293] ? __kmem_cache_free+0x153/0x2f0
[ 100.283844] ? sched_mm_cid_remote_clear+0xb1/0x250
[ 100.284550] ? __pfx_sched_mm_cid_remote_clear+0x10/0x10
[ 100.285221] ? event_trigger_write+0xbc/0x100
[ 100.285781] ? __kasan_check_read+0x15/0x20
[ 100.286321] ? __bitmap_weight+0x66/0xa0
[ 100.286833] ? _find_next_bit+0x46/0xe0
[ 100.287334] ? task_mm_cid_work+0x37f/0x450
[ 100.287872] event_triggers_call+0x84/0x150
[ 100.288408] trace_event_buffer_commit+0x339/0x430
[ 100.289073] ? ring_buffer_event_data+0x3f/0x60
[ 100.292189] trace_event_raw_event_sys_enter+0x8b/0xe0
[ 100.295434] syscall_trace_enter.constprop.0+0x18f/0x1b0
[ 100.298653] syscall_enter_from_user_mode+0x32/0x40
[ 100.301808] do_syscall_64+0x1a/0x90
[ 100.304748] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 100.307775] RIP: 0033:0x7f686c75c1cb
[ 100.310617] Code: 73 01 c3 48 8b 0d 65 3c 10 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 21 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 35 3c 10 00 f7 d8 64 89 01 48
[ 100.317847] RSP: 002b:00007ffc60137a38 EFLAGS: 00000246 ORIG_RAX: 0000000000000021
[ 100.321200] RAX: ffffffffffffffda RBX: 000055f566469ea0 RCX: 00007f686c75c1cb
[ 100.324631] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 000000000000000a
[ 100.328104] RBP: 00007ffc60137ac0 R08: 00007f686c818460 R09: 000000000000000a
[ 100.331509] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009
[ 100.334992] R13: 0000000000000007 R14: 000000000000000a R15: 0000000000000007
[ 100.338381] </TASK>
We hit the bug because when second hist trigger has was created
has_hist_vars() returned false because hist trigger did not have
variables. As a result of that save_hist_vars() was not called to add
the trigger to trace_array->hist_vars. Later on when we attempted to
remove the first histogram find_any_var_ref() failed to detect it is
being used because it did not find the second trigger in hist_vars list.
With this change we wait until trigger actions are created so we can take
into consideration if hist trigger has variable references. Also, now we
check the return value of save_hist_vars() and fail trigger creation if
save_hist_vars() fails.
Link: https://lore.kernel.org/linux-trace-kernel/20230712223021.636335-1-mkhalfella@purestorage.com
Cc: stable@vger.kernel.org
Fixes:
|
|
|
|
4b512860bd |
tracing: Rename stacktrace field to common_stacktrace
The histogram and synthetic events can use a pseudo event called "stacktrace" that will create a stacktrace at the time of the event and use it just like it was a normal field. We have other pseudo events such as "common_cpu" and "common_timestamp". To stay consistent with that, convert "stacktrace" to "common_stacktrace". As this was used in older kernels, to keep backward compatibility, this will act just like "common_cpu" did with "cpu". That is, "cpu" will be the same as "common_cpu" unless the event has a "cpu" field. In which case, the event's field is used. The same is true with "stacktrace". Also update the documentation to reflect this change. Link: https://lore.kernel.org/linux-trace-kernel/20230523230913.6860e28d@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
e30fbc618e |
tracing/histograms: Allow variables to have some modifiers
Modifiers are used to change the behavior of keys. For instance, they
can grouped into buckets, converted to syscall names (from the syscall
identifier), show task->comm of the current pid, be an array of longs
that represent a stacktrace, and more.
It was found that nothing stopped a value from taking a modifier. As
values are simple counters. If this happened, it would call code that
was not expecting a modifier and crash the kernel. This was fixed by
having the ___create_val_field() function test if a modifier was present
and fail if one was. This fixed the crash.
Now there's a problem with variables. Variables are used to pass fields
from one event to another. Variables are allowed to have some modifiers,
as the processing may need to happen at the time of the event (like
stacktraces and comm names of the current pid). The issue is that it too
uses __create_val_field(). Now that fails on modifiers, variables can no
longer use them (this is a regression).
As not all modifiers are for variables, have them use a separate check.
Link: https://lore.kernel.org/linux-trace-kernel/20230523221108.064a5d82@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes:
|
|
|
|
9f116f76fa |
tracing: Check field value in hist_field_name()
The function hist_field_name() cannot handle being passed a NULL field
parameter. It should never be NULL, but due to a previous bug, NULL was
passed to the function and the kernel crashed due to a NULL dereference.
Mark Rutland reported this to me on IRC.
The bug was fixed, but to prevent future bugs from crashing the kernel,
check the field and add a WARN_ON() if it is NULL.
Link: https://lkml.kernel.org/r/20230302020810.762384440@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Fixes:
|
|
|
|
e0213434fe |
tracing: Do not let histogram values have some modifiers
Histogram values can not be strings, stacktraces, graphs, symbols,
syscalls, or grouped in buckets or log. Give an error if a value is set to
do so.
Note, the histogram code was not prepared to handle these modifiers for
histograms and caused a bug.
Mark Rutland reported:
# echo 'p:copy_to_user __arch_copy_to_user n=$arg2' >> /sys/kernel/tracing/kprobe_events
# echo 'hist:keys=n:vals=hitcount.buckets=8:sort=hitcount' > /sys/kernel/tracing/events/kprobes/copy_to_user/trigger
# cat /sys/kernel/tracing/events/kprobes/copy_to_user/hist
[ 143.694628] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 143.695190] Mem abort info:
[ 143.695362] ESR = 0x0000000096000004
[ 143.695604] EC = 0x25: DABT (current EL), IL = 32 bits
[ 143.695889] SET = 0, FnV = 0
[ 143.696077] EA = 0, S1PTW = 0
[ 143.696302] FSC = 0x04: level 0 translation fault
[ 143.702381] Data abort info:
[ 143.702614] ISV = 0, ISS = 0x00000004
[ 143.702832] CM = 0, WnR = 0
[ 143.703087] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000448f9000
[ 143.703407] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[ 143.704137] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 143.704714] Modules linked in:
[ 143.705273] CPU: 0 PID: 133 Comm: cat Not tainted 6.2.0-00003-g6fc512c10a7c #3
[ 143.706138] Hardware name: linux,dummy-virt (DT)
[ 143.706723] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 143.707120] pc : hist_field_name.part.0+0x14/0x140
[ 143.707504] lr : hist_field_name.part.0+0x104/0x140
[ 143.707774] sp : ffff800008333a30
[ 143.707952] x29: ffff800008333a30 x28: 0000000000000001 x27: 0000000000400cc0
[ 143.708429] x26: ffffd7a653b20260 x25: 0000000000000000 x24: ffff10d303ee5800
[ 143.708776] x23: ffffd7a6539b27b0 x22: ffff10d303fb8c00 x21: 0000000000000001
[ 143.709127] x20: ffff10d303ec2000 x19: 0000000000000000 x18: 0000000000000000
[ 143.709478] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 143.709824] x14: 0000000000000000 x13: 203a6f666e692072 x12: 6567676972742023
[ 143.710179] x11: 0a230a6d6172676f x10: 000000000000002c x9 : ffffd7a6521e018c
[ 143.710584] x8 : 000000000000002c x7 : 7f7f7f7f7f7f7f7f x6 : 000000000000002c
[ 143.710915] x5 : ffff10d303b0103e x4 : ffffd7a653b20261 x3 : 000000000000003d
[ 143.711239] x2 : 0000000000020001 x1 : 0000000000000001 x0 : 0000000000000000
[ 143.711746] Call trace:
[ 143.712115] hist_field_name.part.0+0x14/0x140
[ 143.712642] hist_field_name.part.0+0x104/0x140
[ 143.712925] hist_field_print+0x28/0x140
[ 143.713125] event_hist_trigger_print+0x174/0x4d0
[ 143.713348] hist_show+0xf8/0x980
[ 143.713521] seq_read_iter+0x1bc/0x4b0
[ 143.713711] seq_read+0x8c/0xc4
[ 143.713876] vfs_read+0xc8/0x2a4
[ 143.714043] ksys_read+0x70/0xfc
[ 143.714218] __arm64_sys_read+0x24/0x30
[ 143.714400] invoke_syscall+0x50/0x120
[ 143.714587] el0_svc_common.constprop.0+0x4c/0x100
[ 143.714807] do_el0_svc+0x44/0xd0
[ 143.714970] el0_svc+0x2c/0x84
[ 143.715134] el0t_64_sync_handler+0xbc/0x140
[ 143.715334] el0t_64_sync+0x190/0x194
[ 143.715742] Code: a9bd7bfd 910003fd a90153f3 aa0003f3 (f9400000)
[ 143.716510] ---[ end trace 0000000000000000 ]---
Segmentation fault
Link: https://lkml.kernel.org/r/20230302020810.559462599@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes:
|
|
|
|
f5914b301a |
tracing/histogram: Fix stacktrace key
The current code will always use the current stacktrace as a key even if a stacktrace contained in a specific event field was specified. For example, we expect to use the 'unsigned long[] stack' field in the below event in the histogram: # echo 's:block_lat pid_t pid; u64 delta; unsigned long[] stack;' > /sys/kernel/debug/tracing/dynamic_events # echo 'hist:keys=delta.buckets=100,stack.stacktrace:sort=delta' > /sys/kernel/debug/tracing/events/synthetic/block_lat/trigger But in fact, when we type out the trigger, we see that it's using the plain old global 'stacktrace' as the key, which is just the stacktrace when the event was hit and not the stacktrace contained in the event, which is what we want: # cat /sys/kernel/debug/tracing/events/synthetic/block_lat/trigger hist:keys=delta.buckets=100,stacktrace:vals=hitcount:sort=delta.buckets=100:size=2048 [active] And in fact, there's no code to actually retrieve it from the event, so we need to add HIST_FIELD_FN_STACK and hist_field_stack() to get it and hook it into the trigger code. For now, since the stack is just using dynamic strings, this could just use the dynamic string function, but it seems cleaner to have a dedicated function an be able to tweak independently as necessary. Link: https://lkml.kernel.org/r/11aa614c82976adbfa4ea763dbe885b5fb01d59c.1676063532.git.zanussi@kernel.org Signed-off-by: Tom Zanussi <zanussi@kernel.org> [ Fixed 32bit build warning reported by kernel test robot <lkp@intel.com> ] Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
2bacfd9f7e |
tracing/histogram: Fix a few problems with stacktrace variable printing
Currently, there are a few problems when printing hist triggers and
trace output when using stacktrace variables. This fixes the problems
seen below:
# echo 'hist:keys=delta.buckets=100,stack.stacktrace:sort=delta' > /sys/kernel/debug/tracing/events/synthetic/block_lat/trigger
# cat /sys/kernel/debug/tracing/events/synthetic/block_lat/trigger
hist:keys=delta.buckets=100,stacktrace:vals=hitcount:sort=delta.buckets=100:size=2048 [active]
# echo 'hist:keys=next_pid:ts=common_timestamp.usecs,st=stacktrace if prev_state == 2' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
hist:keys=next_pid:vals=hitcount:ts=common_timestamp.usecs,st=stacktrace.stacktrace:sort=hitcount:size=2048:clock=global if prev_state == 2 [active]
and also in the trace output (should be stack.stacktrace):
{ delta: ~ 100-199, stacktrace __schedule+0xa19/0x1520
Link: https://lkml.kernel.org/r/60bebd4e546728e012a7a2bcbf58716d48ba6edb.1676063532.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
|
|
8261ef2eb3 |
tracing: Add BUILD_BUG() to make sure stacktrace fits in strings
The max string length for a histogram variable is 256 bytes. The max depth of a stacktrace is 16. With 8byte words, that's 16 * 8 = 128. Which can easily fit in the string variable. The histogram stacktrace is being stored in the string value (with the given max length), with the assumption it will fit. To make sure that this is always the case (in the case that the stack trace depth increases), add a BUILD_BUG_ON() to test this. Link: https://lore.kernel.org/linux-trace-kernel/20230214002418.0103b9e765d3e5c374d2aa7d@kernel.org/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
fc1a9dc101 |
tracing/histogram: Don't use strlen to find length of stacktrace variables
Because stacktraces are saved in dynamic strings, trace_event_raw_event_synth() uses strlen to determine the length of the stack. Stacktraces may contain 0-bytes, though, in the saved addresses, so the length found and passed to reserve() will be too small. Fix this by using the first unsigned long in the stack variables to store the actual number of elements in the stack and have trace_event_raw_event_synth() use that to determine the length of the stack. Link: https://lkml.kernel.org/r/1ed6906cd9d6477ef2bd8e63c61de20a9ffe64d7.1676063532.git.zanussi@kernel.org Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
cc5fc8bfc9 |
tracing/histogram: Add stacktrace type
Now that stacktraces can be part of synthetic events, allow a key to be
typed as a stacktrace.
# cd /sys/kernel/tracing
# echo 's:block_lat u64 delta; unsigned long stack[];' >> dynamic_events
# echo 'hist:keys=next_pid:ts=common_timestamp.usecs,st=stacktrace if prev_state == 2' >> events/sched/sched_switch/trigger
# echo 'hist:keys=prev_pid:delta=common_timestamp.usecs-$ts,st2=$st:onmatch(sched.sched_switch).trace(block_lat,$delta,$st2)' >> events/sched/sched_switch/trigger
# echo 'hist:keys=delta.buckets=100,stack.stacktrace:sort=delta' > events/synthetic/block_lat/trigger
# cat events/synthetic/block_lat/hist
# event histogram
#
# trigger info: hist:keys=delta.buckets=100,stacktrace:vals=hitcount:sort=delta.buckets=100:size=2048 [active]
#
{ delta: ~ 0-99, stacktrace:
event_hist_trigger+0x464/0x480
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x250
trace_event_raw_event_sched_switch+0xfc/0x150
__traceiter_sched_switch+0x41/0x60
__schedule+0x448/0x7b0
schedule_idle+0x26/0x40
cpu_startup_entry+0x19/0x20
start_secondary+0xed/0xf0
secondary_startup_64_no_verify+0xe0/0xeb
} hitcount: 6
{ delta: ~ 0-99, stacktrace:
event_hist_trigger+0x464/0x480
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x250
trace_event_raw_event_sched_switch+0xfc/0x150
__traceiter_sched_switch+0x41/0x60
__schedule+0x448/0x7b0
schedule_idle+0x26/0x40
cpu_startup_entry+0x19/0x20
__pfx_kernel_init+0x0/0x10
arch_call_rest_init+0xa/0x24
start_kernel+0x964/0x98d
secondary_startup_64_no_verify+0xe0/0xeb
} hitcount: 3
{ delta: ~ 0-99, stacktrace:
event_hist_trigger+0x464/0x480
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x250
trace_event_raw_event_sched_switch+0xfc/0x150
__traceiter_sched_switch+0x41/0x60
__schedule+0x448/0x7b0
schedule+0x5a/0xb0
worker_thread+0xaf/0x380
kthread+0xe9/0x110
ret_from_fork+0x2c/0x50
} hitcount: 1
{ delta: ~ 100-199, stacktrace:
event_hist_trigger+0x464/0x480
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x250
trace_event_raw_event_sched_switch+0xfc/0x150
__traceiter_sched_switch+0x41/0x60
__schedule+0x448/0x7b0
schedule_idle+0x26/0x40
cpu_startup_entry+0x19/0x20
start_secondary+0xed/0xf0
secondary_startup_64_no_verify+0xe0/0xeb
} hitcount: 15
[..]
{ delta: ~ 8500-8599, stacktrace:
event_hist_trigger+0x464/0x480
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x250
trace_event_raw_event_sched_switch+0xfc/0x150
__traceiter_sched_switch+0x41/0x60
__schedule+0x448/0x7b0
schedule_idle+0x26/0x40
cpu_startup_entry+0x19/0x20
start_secondary+0xed/0xf0
secondary_startup_64_no_verify+0xe0/0xeb
} hitcount: 1
Totals:
Hits: 89
Entries: 11
Dropped: 0
Link: https://lkml.kernel.org/r/20230117152236.167046397@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Ching-lin Yu <chinglinyu@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
|
|
00cf3d672a |
tracing: Allow synthetic events to pass around stacktraces
Allow a stacktrace from one event to be displayed by the end event of a
synthetic event. This is very useful when looking for the longest latency
of a sleep or something blocked on I/O.
# cd /sys/kernel/tracing/
# echo 's:block_lat pid_t pid; u64 delta; unsigned long[] stack;' > dynamic_events
# echo 'hist:keys=next_pid:ts=common_timestamp.usecs,st=stacktrace if prev_state == 1||prev_state == 2' > events/sched/sched_switch/trigger
# echo 'hist:keys=prev_pid:delta=common_timestamp.usecs-$ts,s=$st:onmax($delta).trace(block_lat,prev_pid,$delta,$s)' >> events/sched/sched_switch/trigger
The above creates a "block_lat" synthetic event that take the stacktrace of
when a task schedules out in either the interruptible or uninterruptible
states, and on a new per process max $delta (the time it was scheduled
out), will print the process id and the stacktrace.
# echo 1 > events/synthetic/block_lat/enable
# cat trace
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
kworker/u16:0-767 [006] d..4. 560.645045: block_lat: pid=767 delta=66 stack=STACK:
=> __schedule
=> schedule
=> pipe_read
=> vfs_read
=> ksys_read
=> do_syscall_64
=> 0x966000aa
<idle>-0 [003] d..4. 561.132117: block_lat: pid=0 delta=413787 stack=STACK:
=> __schedule
=> schedule
=> schedule_hrtimeout_range_clock
=> do_sys_poll
=> __x64_sys_poll
=> do_syscall_64
=> 0x966000aa
<...>-153 [006] d..4. 562.068407: block_lat: pid=153 delta=54 stack=STACK:
=> __schedule
=> schedule
=> io_schedule
=> rq_qos_wait
=> wbt_wait
=> __rq_qos_throttle
=> blk_mq_submit_bio
=> submit_bio_noacct_nocheck
=> ext4_bio_write_page
=> mpage_submit_page
=> mpage_process_page_bufs
=> mpage_prepare_extent_to_map
=> ext4_do_writepages
=> ext4_writepages
=> do_writepages
=> __writeback_single_inode
Link: https://lkml.kernel.org/r/20230117152236.010941267@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Ching-lin Yu <chinglinyu@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
|
|
288709c9f3 |
tracing: Allow stacktraces to be saved as histogram variables
Allow to save stacktraces into a histogram variable. This will be used by synthetic events to allow a stacktrace from one event to be passed and displayed by another event. The special keyword "stacktrace" is to be used to trigger a stack trace for the event that the histogram trigger is attached to. echo 'hist:keys=pid:st=stacktrace" > events/sched/sched_waking/trigger Currently nothing can get access to the "$st" variable above that contains the stack trace, but that will soon change. Link: https://lkml.kernel.org/r/20230117152235.856323729@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Ross Zwisler <zwisler@google.com> Cc: Ching-lin Yu <chinglinyu@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
8b152e9150 |
trace_events_hist: add check for return value of 'create_hist_field'
Function 'create_hist_field' is called recursively at
trace_events_hist.c:1954 and can return NULL-value that's why we have
to check it to avoid null pointer dereference.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Link: https://lkml.kernel.org/r/20230111120409.4111-1-n.petrova@fintech.ru
Cc: stable@vger.kernel.org
Fixes:
|
|
|
|
608c6ed333 |
tracing/hist: Fix issue of losting command info in error_log
When input some constructed invalid 'trigger' command, command info
in 'error_log' are lost [1].
The root cause is that there is a path that event_hist_trigger_parse()
is recursely called once and 'last_cmd' which save origin command is
cleared, then later calling of hist_err() will no longer record origin
command info:
event_hist_trigger_parse() {
last_cmd_set() // <1> 'last_cmd' save origin command here at first
create_actions() {
onmatch_create() {
action_create() {
trace_action_create() {
trace_action_create_field_var() {
create_field_var_hist() {
event_hist_trigger_parse() { // <2> recursely called once
hist_err_clear() // <3> 'last_cmd' is cleared here
}
hist_err() // <4> No longer find origin command!!!
Since 'glob' is empty string while running into the recurse call, we
can trickly check it and bypass the call of hist_err_clear() to solve it.
[1]
# cd /sys/kernel/tracing
# echo "my_synth_event int v1; int v2; int v3;" >> synthetic_events
# echo 'hist:keys=pid' >> events/sched/sched_waking/trigger
# echo "hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(\
pid,pid1)" >> events/sched/sched_switch/trigger
# cat error_log
[ 8.405018] hist:sched:sched_switch: error: Couldn't find synthetic event
Command:
hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1)
^
[ 8.816902] hist:sched:sched_switch: error: Couldn't find field
Command:
hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1)
^
[ 8.816902] hist:sched:sched_switch: error: Couldn't parse field variable
Command:
hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1)
^
[ 8.999880] : error: Couldn't find field
Command:
^
[ 8.999880] : error: Couldn't parse field variable
Command:
^
[ 8.999880] : error: Couldn't find field
Command:
^
[ 8.999880] : error: Couldn't create histogram for field
Command:
^
Link: https://lore.kernel.org/linux-trace-kernel/20221207135326.3483216-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <zanussi@kernel.org>
Fixes:
|
|
|
|
82470f7d90 |
tracing/hist: Fix out-of-bound write on 'action_data.var_ref_idx'
When generate a synthetic event with many params and then create a trace
action for it [1], kernel panic happened [2].
It is because that in trace_action_create() 'data->n_params' is up to
SYNTH_FIELDS_MAX (current value is 64), and array 'data->var_ref_idx'
keeps indices into array 'hist_data->var_refs' for each synthetic event
param, but the length of 'data->var_ref_idx' is TRACING_MAP_VARS_MAX
(current value is 16), so out-of-bound write happened when 'data->n_params'
more than 16. In this case, 'data->match_data.event' is overwritten and
eventually cause the panic.
To solve the issue, adjust the length of 'data->var_ref_idx' to be
SYNTH_FIELDS_MAX and add sanity checks to avoid out-of-bound write.
[1]
# cd /sys/kernel/tracing/
# echo "my_synth_event int v1; int v2; int v3; int v4; int v5; int v6;\
int v7; int v8; int v9; int v10; int v11; int v12; int v13; int v14;\
int v15; int v16; int v17; int v18; int v19; int v20; int v21; int v22;\
int v23; int v24; int v25; int v26; int v27; int v28; int v29; int v30;\
int v31; int v32; int v33; int v34; int v35; int v36; int v37; int v38;\
int v39; int v40; int v41; int v42; int v43; int v44; int v45; int v46;\
int v47; int v48; int v49; int v50; int v51; int v52; int v53; int v54;\
int v55; int v56; int v57; int v58; int v59; int v60; int v61; int v62;\
int v63" >> synthetic_events
# echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="bash"' >> \
events/sched/sched_waking/trigger
# echo "hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(\
pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,\
pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,\
pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,\
pid,pid,pid,pid,pid,pid,pid,pid,pid)" >> events/sched/sched_switch/trigger
[2]
BUG: unable to handle page fault for address: ffff91c900000000
PGD 61001067 P4D 61001067 PUD 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 2 PID: 322 Comm: bash Tainted: G W 6.1.0-rc8+ #229
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
RIP: 0010:strcmp+0xc/0x30
Code: 75 f7 31 d2 44 0f b6 04 16 44 88 04 11 48 83 c2 01 45 84 c0 75 ee
c3 cc cc cc cc 0f 1f 00 31 c0 eb 08 48 83 c0 01 84 d2 74 13 <0f> b6 14
07 3a 14 06 74 ef 19 c0 83 c8 01 c3 cc cc cc cc 31 c3
RSP: 0018:ffff9b3b00f53c48 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffffffba958a68 RCX: 0000000000000000
RDX: 0000000000000010 RSI: ffff91c943d33a90 RDI: ffff91c900000000
RBP: ffff91c900000000 R08: 00000018d604b529 R09: 0000000000000000
R10: ffff91c9483eddb1 R11: ffff91ca483eddab R12: ffff91c946171580
R13: ffff91c9479f0538 R14: ffff91c9457c2848 R15: ffff91c9479f0538
FS: 00007f1d1cfbe740(0000) GS:ffff91c9bdc80000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff91c900000000 CR3: 0000000006316000 CR4: 00000000000006e0
Call Trace:
<TASK>
__find_event_file+0x55/0x90
action_create+0x76c/0x1060
event_hist_trigger_parse+0x146d/0x2060
? event_trigger_write+0x31/0xd0
trigger_process_regex+0xbb/0x110
event_trigger_write+0x6b/0xd0
vfs_write+0xc8/0x3e0
? alloc_fd+0xc0/0x160
? preempt_count_add+0x4d/0xa0
? preempt_count_add+0x70/0xa0
ksys_write+0x5f/0xe0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f1d1d0cf077
Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e
fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74
RSP: 002b:00007ffcebb0e568 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000143 RCX: 00007f1d1d0cf077
RDX: 0000000000000143 RSI: 00005639265aa7e0 RDI: 0000000000000001
RBP: 00005639265aa7e0 R08: 000000000000000a R09: 0000000000000142
R10: 000056392639c017 R11: 0000000000000246 R12: 0000000000000143
R13: 00007f1d1d1ae6a0 R14: 00007f1d1d1aa4a0 R15: 00007f1d1d1a98a0
</TASK>
Modules linked in:
CR2: ffff91c900000000
---[ end trace 0000000000000000 ]---
RIP: 0010:strcmp+0xc/0x30
Code: 75 f7 31 d2 44 0f b6 04 16 44 88 04 11 48 83 c2 01 45 84 c0 75 ee
c3 cc cc cc cc 0f 1f 00 31 c0 eb 08 48 83 c0 01 84 d2 74 13 <0f> b6 14
07 3a 14 06 74 ef 19 c0 83 c8 01 c3 cc cc cc cc 31 c3
RSP: 0018:ffff9b3b00f53c48 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffffffba958a68 RCX: 0000000000000000
RDX: 0000000000000010 RSI: ffff91c943d33a90 RDI: ffff91c900000000
RBP: ffff91c900000000 R08: 00000018d604b529 R09: 0000000000000000
R10: ffff91c9483eddb1 R11: ffff91ca483eddab R12: ffff91c946171580
R13: ffff91c9479f0538 R14: ffff91c9457c2848 R15: ffff91c9479f0538
FS: 00007f1d1cfbe740(0000) GS:ffff91c9bdc80000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff91c900000000 CR3: 0000000006316000 CR4: 00000000000006e0
Link: https://lore.kernel.org/linux-trace-kernel/20221207035143.2278781-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes:
|
|
|
|
2cc6a52888 |
tracing/hist: Fix wrong return value in parse_action_params()
When number of synth fields is more than SYNTH_FIELDS_MAX,
parse_action_params() should return -EINVAL.
Link: https://lore.kernel.org/linux-trace-kernel/20221207034635.2253990-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes:
|
|
|
|
ccf47f5cc4 |
tracing: Add nohitcount option for suppressing display of raw hitcount
Add 'nohitcount' ('NOHC' for short) option for suppressing display of
the raw hitcount column in the histogram.
Note that you must specify at least one value except raw 'hitcount'
when you specify this nohitcount option.
# cd /sys/kernel/debug/tracing/
# echo hist:keys=pid:vals=runtime.percent,runtime.graph:sort=pid:NOHC > \
events/sched/sched_stat_runtime/trigger
# sleep 10
# cat events/sched/sched_stat_runtime/hist
# event histogram
#
# trigger info: hist:keys=pid:vals=runtime.percent,runtime.graph:sort=pid:size=2048:nohitcount [active]
#
{ pid: 8 } runtime (%): 3.02 runtime: #
{ pid: 14 } runtime (%): 2.25 runtime:
{ pid: 16 } runtime (%): 2.25 runtime:
{ pid: 26 } runtime (%): 0.17 runtime:
{ pid: 61 } runtime (%): 11.52 runtime: ####
{ pid: 67 } runtime (%): 1.56 runtime:
{ pid: 68 } runtime (%): 0.84 runtime:
{ pid: 76 } runtime (%): 0.92 runtime:
{ pid: 117 } runtime (%): 2.50 runtime: #
{ pid: 146 } runtime (%): 49.88 runtime: ####################
{ pid: 157 } runtime (%): 16.63 runtime: ######
{ pid: 158 } runtime (%): 8.38 runtime: ###
Link: https://lore.kernel.org/linux-trace-kernel/166610814787.56030.4980636083486339906.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
|
|
|
|
a2c54256de |
tracing: Add .graph suffix option to histogram value
Add the .graph suffix which shows the bar graph of the histogram value.
For example, the below example shows that the bar graph
of the histogram of the runtime for each tasks.
------
# cd /sys/kernel/debug/tracing/
# echo hist:keys=pid:vals=runtime.graph:sort=pid > \
events/sched/sched_stat_runtime/trigger
# sleep 10
# cat events/sched/sched_stat_runtime/hist
# event histogram
#
# trigger info: hist:keys=pid:vals=hitcount,runtime.graph:sort=pid:size=2048 [active]
#
{ pid: 14 } hitcount: 2 runtime:
{ pid: 16 } hitcount: 8 runtime:
{ pid: 26 } hitcount: 1 runtime:
{ pid: 57 } hitcount: 3 runtime:
{ pid: 61 } hitcount: 20 runtime: ###
{ pid: 66 } hitcount: 2 runtime:
{ pid: 70 } hitcount: 3 runtime:
{ pid: 72 } hitcount: 2 runtime:
{ pid: 145 } hitcount: 14 runtime: ####################
{ pid: 152 } hitcount: 5 runtime: #######
{ pid: 153 } hitcount: 2 runtime: ####
Totals:
Hits: 62
Entries: 11
Dropped: 0
-------
Link: https://lore.kernel.org/linux-trace-kernel/166610813953.56030.10944148382315789485.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
|
|
|
|
abaa5258ce |
tracing: Add .percent suffix option to histogram values
Add .percent suffix option to show the histogram values in percentage.
This feature is useful when we need yo undersntand the overall trend
for the histograms of large values.
E.g. this shows the runtime percentage for each tasks.
------
# cd /sys/kernel/debug/tracing/
# echo hist:keys=pid:vals=hitcount,runtime.percent:sort=pid > \
events/sched/sched_stat_runtime/trigger
# sleep 10
# cat events/sched/sched_stat_runtime/hist
# event histogram
#
# trigger info: hist:keys=pid:vals=hitcount,runtime.percent:sort=pid:size=2048 [active]
#
{ pid: 8 } hitcount: 7 runtime (%): 4.14
{ pid: 14 } hitcount: 5 runtime (%): 3.69
{ pid: 16 } hitcount: 11 runtime (%): 3.41
{ pid: 61 } hitcount: 41 runtime (%): 19.75
{ pid: 65 } hitcount: 4 runtime (%): 1.48
{ pid: 70 } hitcount: 6 runtime (%): 3.60
{ pid: 72 } hitcount: 2 runtime (%): 1.10
{ pid: 144 } hitcount: 10 runtime (%): 32.01
{ pid: 151 } hitcount: 8 runtime (%): 22.66
{ pid: 152 } hitcount: 2 runtime (%): 8.10
Totals:
Hits: 96
Entries: 10
Dropped: 0
-----
Link: https://lore.kernel.org/linux-trace-kernel/166610813077.56030.4238090506973562347.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
|
|
|
|
5f2e094ed2 |
tracing: Allow multiple hitcount values in histograms
The hitcount is treated specially in the histograms - since it's always expected to be there regardless of whether the user specified anything or not, it's always added as the first histogram value. Currently the code doesn't allow it to be added more than once as a value, which is inconsistent with all the other possible values. It would seem to be a pointless thing to want to do, but other features being added such as percent and graph modifiers don't work properly with the current hitcount restrictions. Fix this by allowing multiple hitcounts to be added. Link: https://lore.kernel.org/linux-trace-kernel/166610812248.56030.16754785928712505251.stgit@devnote2 Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Tom Zanussi <zanussi@kernel.org> |
|
|
|
ef38c79a52 |
tracing: Fix race where histograms can be called before the event
commit |
|
|
|
0a068f4a71 |
tracing/hist: add in missing * in comment blocks
There are a couple of missing * in comment blocks. Fix these. Cleans up two clang warnings: kernel/trace/trace_events_hist.c:986: warning: bad line: kernel/trace/trace_events_hist.c:3229: warning: bad line: Link: https://lkml.kernel.org/r/20221020133019.1547587-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
86087383ec |
tracing/hist: Call hist functions directly via a switch statement
Due to retpolines, indirect calls are much more expensive than direct
calls. The histograms have a select set of functions it uses for the
histograms, instead of using function pointers to call them, create a
hist_fn_call() function that uses a switch statement to call the histogram
functions directly. This gives a 13% speedup to the histogram logic.
Using the histogram benchmark:
Before:
# event histogram
#
# trigger info: hist:keys=delta:vals=hitcount:sort=delta:size=2048 if delta > 0 [active]
#
{ delta: 129 } hitcount: 2213
{ delta: 130 } hitcount: 285965
{ delta: 131 } hitcount: 1146545
{ delta: 132 } hitcount:
|
|
|
|
fb991f1942 |
tracing/histograms: Simplify create_hist_fields()
When I look into implements of create_hist_fields(), I think there can be
following two simplifications:
1. If something wrong happened in parse_var_defs(), free_var_defs() would
have been called in it, so no need goto free again after calling it;
2. After calling create_key_fields(), regardless of the value of 'ret', it
then always runs into 'out: ', so the judge of 'ret' is redundant.
Link: https://lkml.kernel.org/r/20220630013152.164871-1-zhengyejian1@huawei.com
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Tom Rix <trix@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
|
|
7edc3945bd |
tracing/histograms: Fix memory leak problem
This reverts commit |
|
|
|
2d601b9864 |
tracing: Change "char *" string form to "char []"
The "char []" string form declares a single variable. It is better than "char *" which creates two variables in the final assembly. Link: https://lkml.kernel.org/r/20220512143230.28796-1-liqiong@nfschina.com Signed-off-by: liqiong <liqiong@nfschina.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
|
|
|
99696a2592 |
tracing: Fix potential double free in create_var_ref()
In create_var_ref(), init_var_ref() is called to initialize the fields
of variable ref_field, which is allocated in the previous function call
to create_hist_field(). Function init_var_ref() allocates the
corresponding fields such as ref_field->system, but frees these fields
when the function encounters an error. The caller later calls
destroy_hist_field() to conduct error handling, which frees the fields
and the variable itself. This results in double free of the fields which
are already freed in the previous function.
Fix this by storing NULL to the corresponding fields when they are freed
in init_var_ref().
Link: https://lkml.kernel.org/r/20220425063739.3859998-1-keitasuzuki.park@sslab.ics.keio.ac.jp
Fixes:
|