linux/kernel
John Stultz c2ae8b0df2 sched/core: Fix psi_dequeue() for Proxy Execution
Currently, if the sleep flag is set, psi_dequeue() doesn't
change any of the psi_flags.

This is because psi_task_switch() will clear TSK_ONCPU as well
as other potential flags (TSK_RUNNING), and the assumption is
that a voluntary sleep always consists of a task being dequeued
followed shortly thereafter by a psi_sched_switch() call.

Proxy Execution changes this expectation, as mutex-blocked tasks
that would normally sleep stay on the runqueue. But in the case
where the mutex-owning task goes to sleep, or the owner is on a
remote cpu, we will then deactivate the blocked task shortly
after.

In that situation, the mutex-blocked task will have had its
TSK_ONCPU cleared when it was switched off the cpu, but it will
stay TSK_RUNNING. Then if we later dequeue it (as currently done
if we hit a case find_proxy_task() can't yet handle, such as the
case of the owner being on another rq or a sleeping owner)
psi_dequeue() won't change any state (leaving it TSK_RUNNING),
as it incorrectly expects a psi_task_switch() call to
immediately follow.

Later on when the task gets woken/re-enqueued, and psi_flags are
set for TSK_RUNNING, we hit an error as the task is already
TSK_RUNNING:

  psi: inconsistent task state! task=188:kworker/28:0 cpu=28 psi_flags=4 clear=0 set=4

To resolve this, extend the logic in psi_dequeue() so that
if the sleep flag is set, we also check if psi_flags have
TSK_ONCPU set (meaning the psi_task_switch() call is imminent) before
we do the shortcut return.

If TSK_ONCPU is not set, that means we've already switched away,
and this psi_dequeue call needs to clear the flags.
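
As a rough illustration of the check described above, here is a small
user-space model, not the kernel code itself. The model_* names, the
struct, and the DEQUEUE_SLEEP and TSK_ONCPU values are assumptions made
for this sketch; only TSK_RUNNING == 4 is taken from the log line above.
Clearing all flags in one assignment stands in for the real state
update.

```c
#include <assert.h>

/* Illustrative values; only TSK_RUNNING == 4 comes from the log line. */
#define TSK_RUNNING   (1u << 2)
#define TSK_ONCPU     (1u << 4)   /* assumed value for this sketch */
#define DEQUEUE_SLEEP 0x1u        /* assumed value for this sketch */

/* Hypothetical stand-in for the psi-related part of a task. */
struct model_task {
	unsigned int psi_flags;
};

/* Old behaviour: any sleep dequeue takes the shortcut return. */
static void model_psi_dequeue_old(struct model_task *p, unsigned int flags)
{
	if (flags & DEQUEUE_SLEEP)
		return;
	p->psi_flags = 0;
}

/*
 * New behaviour: take the shortcut only if TSK_ONCPU is still set,
 * i.e. a task switch that clears the flags is expected to follow.
 */
static void model_psi_dequeue_new(struct model_task *p, unsigned int flags)
{
	if ((flags & DEQUEUE_SLEEP) && (p->psi_flags & TSK_ONCPU))
		return;
	p->psi_flags = 0;
}

/*
 * Replay the scenario from the text: the blocked task was already
 * switched off the cpu (TSK_ONCPU cleared) but stayed TSK_RUNNING,
 * and is only later dequeued with the sleep flag set.
 * Returns the psi_flags left behind after the dequeue.
 */
static unsigned int model_replay(void (*dequeue)(struct model_task *,
						 unsigned int))
{
	struct model_task p = { .psi_flags = TSK_RUNNING | TSK_ONCPU };

	p.psi_flags &= ~TSK_ONCPU;	/* switched away, still on the rq */
	assert(p.psi_flags == TSK_RUNNING);

	dequeue(&p, DEQUEUE_SLEEP);
	return p.psi_flags;
}
```

In this model, replaying with model_psi_dequeue_old leaves TSK_RUNNING
set, which is the stale state that a later enqueue would report as
inconsistent, while model_psi_dequeue_new leaves no flags behind.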

Fixes: be41bde4c3 ("sched: Add an initial sketch of the find_proxy_task() function")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Haiyue Wang <haiyuewa@163.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://patch.msgid.link/20251205012721.756394-1-jstultz@google.com
Closes: https://lore.kernel.org/lkml/20251117185550.365156-1-kprateek.nayak@amd.com/
2025-12-06 10:13:16 +01:00
..
bpf Tree wide cleanup of the remaining users of in_irq() which got replaced 2025-12-02 10:18:49 -08:00
cgroup Update to the time/timers core: 2025-12-02 09:58:33 -08:00
configs
debug
dma dma-mapping fixes for Linux 6.18 2025-11-27 17:29:15 -08:00
entry rseq: Switch to TIF_RSEQ if supported 2025-11-04 08:35:37 +01:00
events Performance events changes for v6.19: 2025-12-01 20:42:01 -08:00
futex Scoped user mode access and related changes: 2025-12-02 08:01:39 -08:00
gcov gcov: add support for GCC 15 2025-11-09 21:19:44 -08:00
irq Updates for [PCI] MSI related code: 2025-12-02 09:35:59 -08:00
kcsan Kernel Concurrency Sanitizer (KCSAN) updates for v6.18 2025-10-02 08:31:44 -07:00
livepatch livepatch: Add CONFIG_KLP_BUILD 2025-10-14 14:50:18 -07:00
locking locking/mutex: Redo __mutex_init() to reduce generated code size 2025-12-01 06:51:57 +01:00
module
power vfs-6.18-rc7.fixes 2025-11-17 09:11:27 -08:00
printk printk changes for 6.18 2025-10-04 11:13:11 -07:00
rcu sched: Provide and use set_need_resched_current() 2025-11-20 22:26:09 +01:00
sched sched/core: Fix psi_dequeue() for Proxy Execution 2025-12-06 10:13:16 +01:00
time Tree wide cleanup of the remaining users of in_irq() which got replaced 2025-12-02 10:18:49 -08:00
trace kernel-6.19-rc1.cred 2025-12-01 13:45:41 -08:00
unwind unwind_user/x86: Teach FP unwind about start of function 2025-10-29 10:29:58 +01:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.kexec kho: warn and fail on metadata or preserved memory in scratch area 2025-11-09 21:19:41 -08:00
Kconfig.locks
Kconfig.preempt
Makefile kho: warn and fail on metadata or preserved memory in scratch area 2025-11-09 21:19:41 -08:00
acct.c act: use credential guards in acct_write_process() 2025-11-04 12:36:49 +01:00
async.c
audit.c
audit.h
audit_fsnotify.c
audit_tree.c mount-related stuff for this cycle 2025-10-03 10:19:44 -07:00
audit_watch.c
auditfilter.c audit/stable-6.18 PR 20250926 2025-09-30 08:22:16 -07:00
auditsc.c
backtracetest.c
bounds.c
capability.c
cfi.c
compat.c
configs.c
context_tracking.c
cpu.c cpumask: Cache num_possible_cpus() 2025-11-25 19:45:40 +01:00
cpu_pm.c
crash_core.c crash: fix crashkernel resource shrink 2025-11-15 10:52:01 -08:00
crash_core_test.c
crash_dump_dm_crypt.c
crash_reserve.c
cred.c kernel-6.19-rc1.cred 2025-12-01 13:45:41 -08:00
delayacct.c
dma.c
elfcorehdr.c
exec_domain.c
exit.c A large overhaul of the restartable sequences and CID management: 2025-12-02 08:48:53 -08:00
exit.h
extable.c
fail_function.c
fork.c A large overhaul of the restartable sequences and CID management: 2025-12-02 08:48:53 -08:00
freezer.c
gen_kheaders.sh
groups.c
hung_task.c
iomem.c
irq_work.c
jump_label.c
kallsyms.c
kallsyms_internal.h
kallsyms_selftest.c kallsyms: use kmalloc_array() instead of kmalloc() 2025-09-28 11:36:14 -07:00
kallsyms_selftest.h
kcmp.c
kcov.c
kexec.c
kexec_core.c
kexec_elf.c
kexec_file.c
kexec_handover.c kho: warn and exit when unpreserved page wasn't preserved 2025-11-09 21:19:47 -08:00
kexec_handover_debug.c kho: warn and fail on metadata or preserved memory in scratch area 2025-11-09 21:19:41 -08:00
kexec_handover_internal.h kho: warn and fail on metadata or preserved memory in scratch area 2025-11-09 21:19:41 -08:00
kexec_internal.h
kheaders.c
kprobes.c
kstack_erase.c
ksyms_common.c
ksysfs.c
kthread.c sched: Rename do_set_cpus_allowed() 2025-10-16 11:13:53 +02:00
latencytop.c
module_signature.c
notifier.c
nscommon.c ns: rename is_initial_namespace() 2025-11-11 10:01:31 +01:00
nsproxy.c nsproxy: fix free_nsproxy() and simplify create_new_namespaces() 2025-11-14 13:10:38 +01:00
nstree.c nstree: fix kernel-doc comments for internal functions 2025-11-14 13:10:38 +01:00
padata.c
panic.c Merge branch 'objtool/core' 2025-11-21 11:21:20 +01:00
params.c
pid.c ns: drop custom reference count initialization for initial namespaces 2025-11-11 10:01:32 +01:00
pid_namespace.c pid: rely on common reference count behavior 2025-11-11 10:01:32 +01:00
pid_sysctl.h
profile.c
ptrace.c rseq: Introduce struct rseq_data 2025-11-04 08:30:50 +01:00
range.c
reboot.c
regset.c
relay.c
resource.c
resource_kunit.c
rseq.c rseq: Switch to fast path processing on exit to user 2025-11-04 08:34:39 +01:00
scftorture.c
scs.c
seccomp.c Performance events updates for v6.18: 2025-09-30 11:11:21 -07:00
signal.c signal: Move MMCID exit out of sighand lock 2025-11-25 19:45:40 +01:00
smp.c
smpboot.c
smpboot.h
softirq.c
stacktrace.c
static_call.c
static_call_inline.c
stop_machine.c
sys.c Patch series in this pull request: 2025-10-02 18:44:54 -07:00
sys_ni.c
sysctl-test.c
sysctl.c
task_work.c task_work: Fix NMI race condition 2025-10-29 10:29:54 +01:00
taskstats.c
torture.c
tracepoint.c
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c
up.c
user-return-notifier.c
user.c ns: drop custom reference count initialization for initial namespaces 2025-11-11 10:01:32 +01:00
user_namespace.c
utsname.c namespace-6.18-rc1 2025-09-29 11:20:29 -07:00
utsname_sysctl.c
vhost_task.c
vmcore_info.c
watch_queue.c watch_queue: Use local kmap in post_one_notification() 2025-11-19 12:17:28 +01:00
watchdog.c
watchdog_buddy.c
watchdog_perf.c
workqueue.c
workqueue_internal.h