linux/kernel
Christian Brauner 8ec7c826d9 pidfs: persist information
Persist exit and coredump information independent of whether anyone
currently holds a pidfd for the struct pid.

The current scheme allocated pidfs dentries on-demand repeatedly.
This scheme is reaching it's limits as it makes it impossible to pin
information that needs to be available after the task has exited or
coredumped and that should not be lost simply because the pidfd got
closed temporarily. The next opener should still see the stashed
information.

This is also a prerequisite for supporting extended attributes on
pidfds to allow attaching meta information to them.

If someone opens a pidfd for a struct pid a pidfs dentry is allocated
and stashed in pid->stashed. Once the last pidfd for the struct pid is
closed the pidfs dentry is released and removed from pid->stashed.

So if 10 callers create a pidfs dentry for the same struct pid
sequentially, i.e., each closing the pidfd before the other creates a
new one then a new pidfs dentry is allocated every time.

Because multiple tasks acquiring and releasing a pidfd for the same
struct pid can race with each another a task may still find a valid
pidfs entry from the previous task in pid->stashed and reuse it. Or it
might find a dead dentry in there and fail to reuse it and so stashes a
new pidfs dentry. Multiple tasks may race to stash a new pidfs dentry
but only one will succeed, the other ones will put their dentry.

The current scheme aims to ensure that a pidfs dentry for a struct pid
can only be created if the task is still alive or if a pidfs dentry
already existed before the task was reaped and so exit information has
been was stashed in the pidfs inode.

That's great except that it's buggy. If a pidfs dentry is stashed in
pid->stashed after pidfs_exit() but before __unhash_process() is called
we will return a pidfd for a reaped task without exit information being
available.

The pidfds_pid_valid() check does not guard against this race as it
doens't sync at all with pidfs_exit(). The pid_has_task() check might be
successful simply because we're before __unhash_process() but after
pidfs_exit().

Introduce a new scheme where the lifetime of information associated with
a pidfs entry (coredump and exit information) isn't bound to the
lifetime of the pidfs inode but the struct pid itself.

The first time a pidfs dentry is allocated for a struct pid a struct
pidfs_attr will be allocated which will be used to store exit and
coredump information.

If all pidfs for the pidfs dentry are closed the dentry and inode can be
cleaned up but the struct pidfs_attr will stick until the struct pid
itself is freed. This will ensure minimal memory usage while persisting
relevant information.

The new scheme has various advantages. First, it allows to close the
race where we end up handing out a pidfd for a reaped task for which no
exit information is available. Second, it minimizes memory usage.
Third, it allows to remove complex lifetime tracking via dentries when
registering a struct pid with pidfs. There's no need to get or put a
reference. Instead, the lifetime of exit and coredump information
associated with a struct pid is bound to the lifetime of struct pid
itself.

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-5-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19 14:28:24 +02:00
..
bpf - The 11 patch series "Add folio_mk_pte()" from Matthew Wilcox 2025-05-31 15:44:16 -07:00
cgroup cgroup: A fix for v6.16-rc1 2025-06-03 14:12:36 -07:00
configs Kbuild updates for v6.16 2025-06-07 10:05:35 -07:00
debug
dma dma-mapping updates for Linux 6.16: 2025-05-27 20:09:06 -07:00
entry entry: Inline syscall_exit_to_user_mode() 2025-04-29 08:27:10 +02:00
events - The 11 patch series "Add folio_mk_pte()" from Matthew Wilcox 2025-05-31 15:44:16 -07:00
futex - The 2 patch series "zram: support algorithm-specific parameters" from 2025-06-02 16:00:26 -07:00
gcov kbuild: require gcc-8 and binutils-2.30 2025-04-30 21:53:35 +02:00
irq Updates for the MSI subsystem (core code and PCI): 2025-05-27 08:15:26 -07:00
kcsan treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack() 2025-05-08 19:49:33 +02:00
livepatch sched,livepatch: Untangle cond_resched() and live-patching 2025-05-14 13:16:24 +02:00
locking Generic: 2025-06-02 12:24:58 -07:00
module Kbuild updates for v6.16 2025-06-07 10:05:35 -07:00
power - The 11 patch series "Add folio_mk_pte()" from Matthew Wilcox 2025-05-31 15:44:16 -07:00
printk
rcu treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
sched treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
time treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
trace tracing fixes: 2025-06-08 08:19:01 -07:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.kexec - The 3 patch series "hung_task: extend blocking task stacktrace dump to 2025-05-31 19:12:53 -07:00
Kconfig.locks
Kconfig.preempt
Makefile - The 3 patch series "hung_task: extend blocking task stacktrace dump to 2025-05-31 19:12:53 -07:00
acct.c
async.c
audit.c
audit.h
audit_fsnotify.c
audit_tree.c
audit_watch.c
auditfilter.c
auditsc.c
backtracetest.c
bounds.c
capability.c
cfi.c
compat.c
configs.c
context_tracking.c
cpu.c perf: Remove too early and redundant CPU hotplug handling 2025-05-08 21:50:19 +02:00
cpu_pm.c
crash_core.c
crash_dump_dm_crypt.c crash_dump: retrieve dm crypt keys in kdump kernel 2025-05-21 10:48:21 -07:00
crash_reserve.c crash: fix spelling mistake "crahskernel" -> "crashkernel" 2025-05-11 17:54:10 -07:00
cred.c
delayacct.c delayacct: remove redundant code and adjust indentation 2025-05-27 19:40:33 -07:00
dma.c
elfcorehdr.c
exec_domain.c
exit.c - The 3 patch series "hung_task: extend blocking task stacktrace dump to 2025-05-31 19:12:53 -07:00
exit.h
extable.c
fail_function.c
fork.c - The 11 patch series "Add folio_mk_pte()" from Matthew Wilcox 2025-05-31 15:44:16 -07:00
freezer.c
gen_kheaders.sh
groups.c
hung_task.c hung_task: show the blocker task if the task is hung on semaphore 2025-05-11 17:54:08 -07:00
iomem.c
irq_work.c
jump_label.c
kallsyms.c
kallsyms_internal.h
kallsyms_selftest.c
kallsyms_selftest.h
kcmp.c
kcov.c
kexec.c
kexec_core.c kexec: define functions to map and unmap segments 2025-04-29 15:54:53 -04:00
kexec_elf.c
kexec_file.c - The 3 patch series "hung_task: extend blocking task stacktrace dump to 2025-05-31 19:12:53 -07:00
kexec_handover.c kexec: add KHO support to kexec file loads 2025-05-12 23:50:40 -07:00
kexec_internal.h kexec: add KHO support to kexec file loads 2025-05-12 23:50:40 -07:00
kheaders.c
kprobes.c
ksyms_common.c
ksysfs.c
kthread.c treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
latencytop.c
module_signature.c
notifier.c
nsproxy.c kernel/nsproxy: remove unnecessary guards 2025-05-09 13:13:54 +02:00
padata.c padata: do not leak refcount in reorder_work 2025-05-19 13:44:16 +08:00
panic.c - The 3 patch series "hung_task: extend blocking task stacktrace dump to 2025-05-31 19:12:53 -07:00
params.c module: ensure that kobject_put() is safe for module type kobjects 2025-05-07 20:24:59 +02:00
pid.c pidfs: persist information 2025-06-19 14:28:24 +02:00
pid_namespace.c
pid_sysctl.h
profile.c
ptrace.c ptrace: introduce PTRACE_SET_SYSCALL_INFO request 2025-05-11 17:48:15 -07:00
range.c
reboot.c
regset.c
relay.c relay: remove unused relay_late_setup_files 2025-05-11 17:54:09 -07:00
resource.c
resource_kunit.c
rseq.c
scftorture.c
scs.c
seccomp.c
signal.c
smp.c
smpboot.c
smpboot.h
softirq.c
stackleak.c
stacktrace.c
static_call.c
static_call_inline.c
stop_machine.c
sys.c futex: Add basic infrastructure for local task local hash 2025-05-03 12:02:07 +02:00
sys_ni.c
sysctl-test.c
sysctl.c
task_work.c
taskstats.c
torture.c
tracepoint.c
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c
up.c
user-return-notifier.c
user.c
user_namespace.c
usermode_driver.c
utsname.c
utsname_sysctl.c
vhost_task.c
vmcore_info.c crash: export PAGE_UNACCEPTED_MAPCOUNT_VALUE to vmcoreinfo 2025-05-11 17:54:04 -07:00
watch_queue.c
watchdog.c kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count 2025-05-21 10:48:22 -07:00
watchdog_buddy.c
watchdog_perf.c
workqueue.c treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
workqueue_internal.h