mirror of https://github.com/torvalds/linux.git
1809 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
b079d93796 |
sched: Rename do_set_cpus_allowed()
Hopefully saner naming. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> |
|
|
|
abfc01077d |
sched: Fix do_set_cpus_allowed() locking
All callers of do_set_cpus_allowed() only take p->pi_lock, which is not sufficient to actually change the cpumask. Again, this is mostly ok in these cases, but it results in unnecessarily complicated reasoning. Furthermore, there is no reason what so ever to not just take all the required locks, so do just that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> |
|
|
|
942b8db965 |
sched: Fix migrate_disable_switch() locking
For some reason migrate_disable_switch() was more complicated than it needs to be, resulting in mind bending locking of dubious quality. Recognise that migrate_disable_switch() must be called before a context switch, but any place before that switch is equally good. Since the current place results in troubled locking, simply move the thing before taking rq->lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> |
|
|
|
6455ad5346 |
sched: Move sched_class::prio_changed() into the change pattern
Move sched_class::prio_changed() into the change pattern. And while there, extend it with sched_class::get_prio() in order to fix the deadline sitation. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> |
|
|
|
1ae5f5dfe5 |
sched: Cleanup sched_delayed handling for class switches
Use the new sched_class::switching_from() method to dequeue delayed tasks before switching to another class. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> |
|
|
|
637b068282 |
sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern
Add {DE,EN}QUEUE_CLASS and fold the sched_class::switch* methods into
the change pattern. This completes and makes the pattern more
symmetric.
This changes the order of callbacks slightly:
OLD NEW
|
| switching_from()
dequeue_task(); | dequeue_task()
put_prev_task(); | put_prev_task()
| switched_from()
|
... change task ... | ... change task ...
|
switching_to(); | switching_to()
enqueue_task(); | enqueue_task()
set_next_task(); | set_next_task()
prev_class->switched_from() |
switched_to() | switched_to()
|
Notably, it moves the switched_from() callback right after the
dequeue/put. Existing implementations don't appear to be affected by
this change in location -- specifically the task isn't enqueued on the
class in question in either location.
Make (CLASS)^(SAVE|MOVE), because there is nothing to save-restore
when changing scheduling classes.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
|
|
|
|
e9139f765a |
sched: Employ sched_change guards
As proposed a long while ago -- and half done by scx -- wrap the scheduler's 'change' pattern in a guard helper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> |
|
|
|
c5448d46b3 |
Updates for the time(rs) core subsystem:
- Address the inconsistent shutdown sequence of per CPU clockevents on
CPU hotplug, which onoly removed it from the core but failed to invoke
the actual device driver shutdown callback. This keeps the timer
active, which prevents power savings and causes pointless noise in
virtualization.
- Encapsulate the open coded access to the hrtimer clock base, which is a
private implementation detail, so that the implementation can be
changed without breaking a lot of usage sites.
- Enhance the debug output of the clocksource watchdog to provide better
information for analysis.
- The usual set of cleanups and enhancements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmjaT+ETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoT8HD/47VRqHbFKazGcwZ4+OLKfFTq0PKzIU
kIhimNMSKVqiPlmWhud/4idGX5JFqMP3dS9ZaDlp7xCtu1ScCqzH72kbCrab0l4g
wjRgneGWsHhv06PY50Ty9FusFSetkjA8XSYavFV9HZNgCRIvho958akdckpazMcF
i4V5WTIVkwhGynhcGwqBXcmHASUa7x6gEjZYBbX6wspPl2Wk5z+vn/zAVIHAozwS
9GPw8HlUeKqM72U12mWGkt4KLjy2gzoupvx9vD4giXRvqkt09rfHtRqvEdkew+/G
ZANhmTJqfA1AiihYP30fHgY5lQbNQG+9O10UKhlUyrlBKPKHKu6dIuJQRSBy0j59
Bqef4UubPBlMP4xRgTbt11SFrjuqDo68bTIDGmQNQvb1BXSXhGA08PDPbIKkFdJh
8cXwUHzjLkDk0tL6kVXRWlsZcCmjz87TVS7PufGE3EIZkh0J2Ob+g7nFS3PsORKW
65ROSRoYmh4CRh/l9CTNbPWFvz3jahIEnxlrSc70o0A4DUeeWTwT/M+MoNOtSJuV
D4qSD35JsFaEbKFfSJwF/0CybG3Nx3UPI8nvEEYQtS4D+BCJHy7INg1P/NtWw+vb
W7puAFd+WbjSN14uQfYkYWC53vQ22isW0nU8W6G5NbMRD5K4WIld4UhDksJIGCLe
2LZ0VHxv4gaXkg==
=A9Tu
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:
- Address the inconsistent shutdown sequence of per CPU clockevents on
CPU hotplug, which only removed it from the core but failed to invoke
the actual device driver shutdown callback. This kept the timer
active, which prevented power savings and caused pointless noise in
virtualization.
- Encapsulate the open coded access to the hrtimer clock base, which is
a private implementation detail, so that the implementation can be
changed without breaking a lot of usage sites.
- Enhance the debug output of the clocksource watchdog to provide
better information for analysis.
- The usual set of cleanups and enhancements all over the place
* tag 'timers-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Fix spelling mistakes in comments
clocksource: Print durations for sync check unconditionally
LoongArch: Remove clockevents shutdown call on offlining
tick: Do not set device to detached state in tick_shutdown()
hrtimer: Reorder branches in hrtimer_clockid_to_base()
hrtimer: Remove hrtimer_clock_base:: Get_time
hrtimer: Use hrtimer_cb_get_time() helper
media: pwm-ir-tx: Avoid direct access to hrtimer clockbase
ALSA: hrtimer: Avoid direct access to hrtimer clockbase
lib: test_objpool: Avoid direct access to hrtimer clockbase
sched/core: Avoid direct access to hrtimer clockbase
timers/itimer: Avoid direct access to hrtimer clockbase
posix-timers: Avoid direct access to hrtimer clockbase
jiffies: Remove obsolete SHIFTED_HZ comment
|
|
|
|
6c7340a7a8 |
Scheduler updates for v6.18:
Core scheduler changes:
- Make migrate_{en,dis}able() inline, to improve performance
(Menglong Dong)
- Move STDL_INIT() functions out-of-line (Peter Zijlstra)
- Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra)
Fair scheduling:
- Defer throttling when tasks exit to user-space, to reduce the
chance & impact of throttle-preemption with held locks and
other resources. (Aaron Lu, Valentin Schneider)
- Get rid of sched_domains_curr_level hack for tl->cpumask(),
as the warning was getting triggered on certain topologies.
(Peter Zijlstra)
Misc cleanups & fixes:
- Header cleanups (Menglong Dong)
- Fix race in push_dl_task() (Harshit Agarwal)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmjWn0IRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1i9ng/+KJYEsNilPA6nGzZLurU3HXm6S5NrNBPc
xJU43eo3QdFUymKqYaCoXCwaMah3m8fFW+DC8ZoEvWC5wHnlIw6+u/ZEtsWQ4R8N
8+sjQddQeb9Gez+nkC7pXX8h1nqlhHyGWU88+9hOyEEMk21KgH7tJ5gOuGQdPhiA
v24D8dkS1sT6CAeqZAycYIkos+kfNNV+wAEQCXAxfvSSslpzJVZcj9kdQInxF4O4
9+WrvFZFVYJWdJgmgfbpBMw7N8a4Gpv+Lr6U0ZVXHNeLjNdn60I+5Me+9BW9GRcO
pPL50RfGgGgVIqyEHvBhhz8p0tlKUJo9NkRJ9hmNB5SKT3P442SU789ppTYgI1il
5P4g3TiuomqPVuo99/mYHYOpT6NkYC1fkwMP9Hfk4NRUX0EA6lKXelVnMXaC3Z7R
U+0xVYtK4BBLRFBo4HIEM1HChSIcOnutuufFX4lPPt+ewnAmJwSt619w+lBF0v+n
cpiLIlQ74LrYlrULw3bznnwzh5dV8XHm4CT3XAShm6qB97/SaLr0rI5W7njdSvte
ym9z4yFWidawowvLWjFX1CykSnX/T5MV+uzuIH64MEIA39B4VKJWCAC3l3AMJn/5
Ckhf93B5uG0q+wUtE66x/JrN7wtbz9PMpwh01fXNWs/eWc1VKkVrvq+PmHODngmz
drSUoI1P570=
=8ZTZ
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Core scheduler changes:
- Make migrate_{en,dis}able() inline, to improve performance
(Menglong Dong)
- Move STDL_INIT() functions out-of-line (Peter Zijlstra)
- Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra)
Fair scheduling:
- Defer throttling to when tasks exit to user-space, to reduce the
chance & impact of throttle-preemption with held locks and other
resources (Aaron Lu, Valentin Schneider)
- Get rid of sched_domains_curr_level hack for tl->cpumask(), as the
warning was getting triggered on certain topologies (Peter
Zijlstra)
Misc cleanups & fixes:
- Header cleanups (Menglong Dong)
- Fix race in push_dl_task() (Harshit Agarwal)"
* tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix some typos in include/linux/preempt.h
sched: Make migrate_{en,dis}able() inline
rcu: Replace preempt.h with sched.h in include/linux/rcupdate.h
arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c
sched/fair: Do not balance task to a throttled cfs_rq
sched/fair: Do not special case tasks in throttled hierarchy
sched/fair: update_cfs_group() for throttled cfs_rqs
sched/fair: Propagate load for throttled cfs_rq
sched/fair: Get rid of throttled_lb_pair()
sched/fair: Task based throttle time accounting
sched/fair: Switch to task based throttle model
sched/fair: Implement throttle task work and related helpers
sched/fair: Add related data structure for task based throttle
sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig
sched: Move STDL_INIT() functions out-of-line
sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
sched/deadline: Fix race in push_dl_task()
|
|
|
|
a23cd25bae |
sched_ext: Changes for v6.18
- Code organization cleanup. Separate internal types and accessors to ext_internal.h to reduce the size of ext.c and improve maintainability. - Prepare for cgroup sub-scheduler support by adding @sch parameter to various functions and helpers, reorganizing scheduler instance handling, and dropping obsolete helpers like scx_kf_exit() and kf_cpu_valid(). - Add new scx_bpf_cpu_curr() and scx_bpf_locked_rq() BPF helpers to provide safer access patterns with proper RCU protection. scx_bpf_cpu_rq() is deprecated with warnings due to potential race conditions. - Improve debugging with migration-disabled counter in error state dumps, SCX_EFLAG_INITIALIZED flag, bitfields for warning flags, and other enhancements to help diagnose issues. - Use cgroup_lock/unlock() for cgroup synchronization instead of scx_cgroup_rwsem based synchronization. This is simpler and allows enable/disable paths to synchronize against cgroup changes independent of the CPU controller. - rhashtable_lookup() replacement to avoid redundant RCU locking was reverted due to RCU usage warnings. Will be redone once rhashtable is updated to use rcu_dereference_all(). - Other misc updates and fixes including bypass handling improvements, scx_task_iter_relock() improvements, tools/sched_ext updates, and compatibility helpers. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaNbpHQ4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGTzWAP9cOG1O2DE0nCJJlDfGoJd1suz9quK+OrVdXhO+ 6nnXoAEAonDP0jXtwoW9GVQFyj8WZ0mfV4QJk0OHM1CoOV5XkQY= =Ev85 -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Code organization cleanup. Separate internal types and accessors to ext_internal.h to reduce the size of ext.c and improve maintainability. - Prepare for cgroup sub-scheduler support by adding @sch parameter to various functions and helpers, reorganizing scheduler instance handling, and dropping obsolete helpers like scx_kf_exit() and kf_cpu_valid(). - Add new scx_bpf_cpu_curr() and scx_bpf_locked_rq() BPF helpers to provide safer access patterns with proper RCU protection. scx_bpf_cpu_rq() is deprecated with warnings due to potential race conditions. - Improve debugging with migration-disabled counter in error state dumps, SCX_EFLAG_INITIALIZED flag, bitfields for warning flags, and other enhancements to help diagnose issues. - Use cgroup_lock/unlock() for cgroup synchronization instead of scx_cgroup_rwsem based synchronization. This is simpler and allows enable/disable paths to synchronize against cgroup changes independent of the CPU controller. - rhashtable_lookup() replacement to avoid redundant RCU locking was reverted due to RCU usage warnings. Will be redone once rhashtable is updated to use rcu_dereference_all(). - Other misc updates and fixes including bypass handling improvements, scx_task_iter_relock() improvements, tools/sched_ext updates, and compatibility helpers. * tag 'sched_ext-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (28 commits) Revert "sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()" sched_ext: Misc updates around scx_sched instance pointer sched_ext: Drop scx_kf_exit() and scx_kf_error() sched_ext: Add the @sch parameter to scx_dsq_insert_preamble/commit() sched_ext: Drop kf_cpu_valid() sched_ext: Add the @sch parameter to ext_idle helpers sched_ext: Add the @sch parameter to __bstr_format() sched_ext: Separate out scx_kick_cpu() and add @sch to it tools/sched_ext: scx_qmap: Make debug output quieter by default sched_ext: Make qmap dump operation non-destructive sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init() sched_ext: Use bitfields for boolean warning flags sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq() sched_ext: Improve SCX_KF_DISPATCH comment sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() sched_ext: Verify RCU protection in scx_bpf_cpu_curr() sched_ext: Add migration-disabled counter to error state dump sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warning tools/sched_ext: Add compat helper for scx_bpf_cpu_curr() sched_ext: deprecation warn for scx_bpf_cpu_rq() ... |
|
|
|
722df25ddf |
kernel-6.18-rc1.clone3
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZgMQAKCRCRxhvAZXjc ornXAP954dZjz+OJw6lJLCf0j9TXJOczGHvK3oW5ZD9KnqtTdwEA7p1A6WMOKJyl 8VtTgCS0yNt8QlznUnsSDfVm0jXVGAY= =tUXG -----END PGP SIGNATURE----- Merge tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull copy_process updates from Christian Brauner: "This contains the changes to enable support for clone3() on nios2 which apparently is still a thing. The more exciting part of this is that it cleans up the inconsistency in how the 64-bit flag argument is passed from copy_process() into the various other copy_*() helpers" [ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ] * tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nios2: implement architecture-specific portion of sys_clone3 arch: copy_thread: pass clone_flags as u64 copy_process: pass clone_flags as u64 across calltree copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64) |
|
|
|
378b770819 |
sched: Make migrate_{en,dis}able() inline
For now, migrate_enable and migrate_disable are global, which makes them
become hotspots in some case. Take BPF for example, the function calling
to migrate_enable and migrate_disable in BPF trampoline can introduce
significant overhead, and following is the 'perf top' of FENTRY's
benchmark (./tools/testing/selftests/bpf/bench trig-fentry):
54.63% bpf_prog_2dcccf652aac1793_bench_trigger_fentry [k]
bpf_prog_2dcccf652aac1793_bench_trigger_fentry
10.43% [kernel] [k] migrate_enable
10.07% bpf_trampoline_6442517037 [k] bpf_trampoline_6442517037
8.06% [kernel] [k] __bpf_prog_exit_recur
4.11% libc.so.6 [.] syscall
2.15% [kernel] [k] entry_SYSCALL_64
1.48% [kernel] [k] memchr_inv
1.32% [kernel] [k] fput
1.16% [kernel] [k] _copy_to_user
0.73% [kernel] [k] bpf_prog_test_run_raw_tp
So in this commit, we make migrate_enable/migrate_disable inline to obtain
better performance. The struct rq is defined internally in
kernel/sched/sched.h, and the field "nr_pinned" is accessed in
migrate_enable/migrate_disable, which makes it hard to make them inline.
Alexei Starovoitov suggests to generate the offset of "nr_pinned" in [1],
so we can define the migrate_enable/migrate_disable in
include/linux/sched.h and access "this_rq()->nr_pinned" with
"(void *)this_rq() + RQ_nr_pinned".
The offset of "nr_pinned" is generated in include/generated/rq-offsets.h
by kernel/sched/rq-offsets.c.
Generally speaking, we move the definition of migrate_enable and
migrate_disable to include/linux/sched.h from kernel/sched/core.c. The
calling to __set_cpus_allowed_ptr() is leaved in ___migrate_enable().
The "struct rq" is not available in include/linux/sched.h, so we can't
access the "runqueues" with this_cpu_ptr(), as the compilation will fail
in this_cpu_ptr() -> raw_cpu_ptr() -> __verify_pcpu_ptr():
typeof((ptr) + 0)
So we introduce the this_rq_raw() and access the runqueues with
arch_raw_cpu_ptr/PERCPU_PTR directly.
The variable "runqueues" is not visible in the kernel modules, and export
it is not a good idea. As Peter Zijlstra advised in [2], we define and
export migrate_enable/migrate_disable in kernel/sched/core.c too, and use
them for the modules.
Before this patch, the performance of BPF FENTRY is:
fentry : 113.030 ± 0.149M/s
fentry : 112.501 ± 0.187M/s
fentry : 112.828 ± 0.267M/s
fentry : 115.287 ± 0.241M/s
After this patch, the performance of BPF FENTRY increases to:
fentry : 143.644 ± 0.670M/s
fentry : 149.764 ± 0.362M/s
fentry : 149.642 ± 0.156M/s
fentry : 145.263 ± 0.221M/s
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/CAADnVQ+5sEDKHdsJY5ZsfGDO_1SEhhQWHrt2SMBG5SYyQ+jt7w@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/20250819123214.GH4067720@noisy.programming.kicks-ass.net/ [2]
|
|
|
|
ebfd5226ec |
sched_ext: Merge branch 'for-6.17-fixes' into for-6.18
Pull sched_ext/for-6.17-fixes to receive: |
|
|
|
a1eab4d813 |
sched_ext, sched/core: Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED
While collecting SCX related fields in struct task_group into struct scx_task_group, |
|
|
|
b68b7f3e9b |
sched/core: Avoid direct access to hrtimer clockbase
The field timer->base->get_time is a private implementation detail and should not be accessed outside of the hrtimer core. Switch to the equivalent helper. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-3-3ae822e5bfbd@linutronix.de |
|
|
|
a5bd6ba30b |
sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations
SCX hooks into CPU cgroup controller operations and read-locks scx_cgroup_rwsem to exclude them while enabling and disable schedulers. While this works, it's unnecessarily complicated given that cgroup_[un]lock() are available and thus the cgroup operations can be locked out that way. Drop scx_cgroup_rwsem locking from the tg on/offline and cgroup [can_]attach operations. Instead, grab cgroup_lock() from scx_cgroup_lock(). Drop scx_cgroup_finish_attach() which is no longer necessary. Drop the now unnecessary rcu locking and css ref bumping in scx_cgroup_init() and scx_cgroup_exit(). As scx_cgroup_set_weight/bandwidth() paths aren't protected by cgroup_lock(), rename scx_cgroup_rwsem to scx_cgroup_ops_rwsem and retain the locking there. This is overall simpler and will also allow enable/disable paths to synchronize against cgroup changes independent of the CPU controller. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Andrea Righi <arighi@nvidia.com> |
|
|
|
2cd571245b |
sched/fair: Add related data structure for task based throttle
Add related data structures for this new throttle functionality. Tesed-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Link: https://lore.kernel.org/r/20250829081120.806-2-ziqianlu@bytedance.com |
|
|
|
edd3cb05c0 |
copy_process: pass clone_flags as u64 across calltree
With the introduction of clone3 in commit
|
|
|
|
6a68cec16b |
sched_ext: Changes for v6.17
- Add support for cgroup "cpu.max" interface. - Code organization cleanup so that ext_idle.c doesn't depend on the source-file-inclusion build method of sched/. - Drop UP paths in accordance with sched core changes. - Documentation and other misc changes. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaIqnxg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGUh5AQC6YM7ggRPYRmy28m5B0nubpKtCHqPOAHSd/QbY MCiThgD+JuE9ewg3wYO/jvJx3NyIRB1McMnAaG59hf6R0Plh5Qo= =TeLF -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Add support for cgroup "cpu.max" interface - Code organization cleanup so that ext_idle.c doesn't depend on the source-file-inclusion build method of sched/ - Drop UP paths in accordance with sched core changes - Documentation and other misc changes * tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix scx_bpf_reenqueue_local() reference sched_ext: Drop kfuncs marked for removal in 6.15 sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panic kernel/sched/ext.c: fix typo "occured" -> "occurred" in comments sched_ext: Add support for cgroup bandwidth control interface sched_ext, sched/core: Factor out struct scx_task_group sched_ext: Return NULL in llc_span sched_ext: Always use SMP versions in kernel/sched/ext_idle.h sched_ext: Always use SMP versions in kernel/sched/ext_idle.c sched_ext: Always use SMP versions in kernel/sched/ext.h sched_ext: Always use SMP versions in kernel/sched/ext.c sched_ext: Documentation: Clarify time slice handling in task lifecycle sched_ext: Make scx_locked_rq() inline sched_ext: Make scx_rq_bypassing() inline sched_ext: idle: Make local functions static in ext_idle.c sched_ext: idle: Remove unnecessary ifdef in scx_bpf_cpu_node() |
|
|
|
4ff261e725 |
Runtime verification changes for 6.17
- Added Linear temporal logic monitors for RT application
Real-time applications may have design flaws causing them to have
unexpected latency. For example, the applications may raise page faults, or
may be blocked trying to take a mutex without priority inheritance.
However, while attempting to implement DA monitors for these real-time
rules, deterministic automaton is found to be inappropriate as the
specification language. The automaton is complicated, hard to understand,
and error-prone.
For these cases, linear temporal logic is found to be more suitable. The
LTL is more concise and intuitive.
- Make printk_deferred() public
The new monitors needed access to printk_deferred(). Make them visible for
the entire kernel.
- Add a vpanic() to allow for va_list to be passed to panic.
- Add rtapp container monitor.
A collection of monitors that check for common problems with real-time
applications that cause unexpected latency.
- Add page fault tracepoints to risc-v
These tracepoints are necessary to for the RV monitor to run on risc-v.
- Fix the behaviour of the rv tool with -s and idle tasks.
- Allow the rv tool to gracefully terminate with SIGTERM
- Adjusts dot2c not to create lines over 100 columns
- Properly order nested monitors in the RV Kconfig file
- Return the registration error in all DA monitor instead of 0
- Update and add new sched collection monitors
Replace tss and sncid monitors with more complete sts:
Not only prove that switches occur in scheduling context and scheduling
needs interrupt disabled but also that each call to the scheduler
disables interrupts to (optionally) switch.
New monitor: nrp
Preemption requires need resched which is cleared by any switch
(includes a non optimal workaround for /nested/ preemptions)
New monitor: sssw
suspension requires setting the task to sleepable and, after the
switch occurs, the task requires a wakeup to come back to runnable
New monitor: opid
waking and need-resched operations occur with interrupts and
preemption disabled or in IRQ without explicitly disabling preemption
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaIk8cBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qi3DAQCFu6DM7uPSh94oggWlH2LukOYVGk2b
CvGrqMFuefae7QD/aK9nCMfzaBehixMOMQHLHELEh527Hd+RwQCrlnLALQU=
=r5HZ
-----END PGP SIGNATURE-----
Merge tag 'trace-rv-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull runtime verification updates from Steven Rostedt:
- Added Linear temporal logic monitors for RT application
Real-time applications may have design flaws causing them to have
unexpected latency. For example, the applications may raise page
faults, or may be blocked trying to take a mutex without priority
inheritance.
However, while attempting to implement DA monitors for these
real-time rules, deterministic automaton is found to be inappropriate
as the specification language. The automaton is complicated, hard to
understand, and error-prone.
For these cases, linear temporal logic is found to be more suitable.
The LTL is more concise and intuitive.
- Make printk_deferred() public
The new monitors needed access to printk_deferred(). Make them
visible for the entire kernel.
- Add a vpanic() to allow for va_list to be passed to panic.
- Add rtapp container monitor.
A collection of monitors that check for common problems with
real-time applications that cause unexpected latency.
- Add page fault tracepoints to risc-v
These tracepoints are necessary to for the RV monitor to run on
risc-v.
- Fix the behaviour of the rv tool with -s and idle tasks.
- Allow the rv tool to gracefully terminate with SIGTERM
- Adjusts dot2c not to create lines over 100 columns
- Properly order nested monitors in the RV Kconfig file
- Return the registration error in all DA monitor instead of 0
- Update and add new sched collection monitors
Replace tss and sncid monitors with more complete sts:
Not only prove that switches occur in scheduling context and scheduling
needs interrupt disabled but also that each call to the scheduler
disables interrupts to (optionally) switch.
New monitor: nrp
Preemption requires need resched which is cleared by any switch
(includes a non optimal workaround for /nested/ preemptions)
New monitor: sssw
suspension requires setting the task to sleepable and, after the
switch occurs, the task requires a wakeup to come back to runnable
New monitor: opid
waking and need-resched operations occur with interrupts and
preemption disabled or in IRQ without explicitly disabling
preemption"
* tag 'trace-rv-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (48 commits)
rv: Add opid per-cpu monitor
rv: Add nrp and sssw per-task monitors
rv: Replace tss and sncid monitors with more complete sts
sched: Adapt sched tracepoints for RV task model
rv: Retry when da monitor detects race conditions
rv: Adjust monitor dependencies
rv: Use strings in da monitors tracepoints
rv: Remove trailing whitespace from tracepoint string
rv: Add da_handle_start_run_event_ to per-task monitors
rv: Fix wrong type cast in reactors_show() and monitor_reactor_show()
rv: Fix wrong type cast in monitors_show()
rv: Remove struct rv_monitor::reacting
rv: Remove rv_reactor's reference counter
rv: Merge struct rv_reactor_def into struct rv_reactor
rv: Merge struct rv_monitor_def into struct rv_monitor
rv: Remove unused field in struct rv_monitor_def
rv: Return init error when registering monitors
verification/rvgen: Organise Kconfig entries for nested monitors
tools/dot2c: Fix generated files going over 100 column limit
tools/rv: Stop gracefully also on SIGTERM
...
|
|
|
|
bf76f23aa1 |
Scheduler updates for v6.17:
Core scheduler changes:
- Better tracking of maximum lag of tasks in presence of different
slices duration, for better handling of lag in the fair
scheduler. (Vincent Guittot)
- Clean up and standardize #if/#else/#endif markers throughout
the entire scheduler code base (Ingo Molnar)
- Make SMP unconditional: build the SMP scheduler's
data structures and logic on UP kernel too, even though
they are not used, to simplify the scheduler and remove
around 200 #ifdef/[#else]/#endif blocks from the
scheduler. (Ingo Molnar)
- Reorganize cgroup bandwidth control interface handling
for better interfacing with sched_ext (Tejun Heo)
Balancing:
- Bump sd->max_newidle_lb_cost when newidle balance fails (Chris Mason)
- Remove sched_domain_topology_level::flags to simplify the code (Prateek Nayak)
- Simplify and clean up build_sched_topology() (Li Chen)
- Optimize build_sched_topology() on large machines (Li Chen)
Real-time scheduling:
- Add initial version of proxy execution: a mechanism for mutex-owning
tasks to inherit the scheduling context of higher priority waiters.
Currently limited to a single runqueue and conditional on CONFIG_EXPERT,
and other limitations. (John Stultz, Peter Zijlstra, Valentin Schneider)
- Deadline scheduler (Juri Lelli):
- Fix dl_servers initialization order (Juri Lelli)
- Fix DL scheduler's root domain reinitialization logic (Juri Lelli)
- Fix accounting bugs after global limits change (Juri Lelli)
- Fix scalability regression by implementing less agressive dl_server handling
(Peter Zijlstra)
PSI:
- Improve scalability by optimizing psi_group_change() cpu_clock() usage
(Peter Zijlstra)
Rust changes:
- Make Task, CondVar and PollCondVar methods inline to avoid unnecessary
function calls (Kunwu Chan, Panagiotis Foliadis)
- Add might_sleep() support for Rust code: Rust's "#[track_caller]"
mechanism is used so that Rust's might_sleep() doesn't need to be
defined as a macro (Fujita Tomonori)
- Introduce file_from_location() (Boqun Feng)
Debugging & instrumentation:
- Make clangd usable with scheduler source code files again (Peter Zijlstra)
- tools: Add root_domains_dump.py which dumps root domains info (Juri Lelli)
- tools: Add dl_bw_dump.py for printing bandwidth accounting info (Juri Lelli)
Misc cleanups & fixes:
- Remove play_idle() (Feng Lee)
- Fix check_preemption_disabled() (Sebastian Andrzej Siewior)
- Do not call __put_task_struct() on RT if pi_blocked_on is set
(Luis Claudio R. Goncalves)
- Correct the comment in place_entity() (wang wei)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmiHHNIRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g7DhAAg9aMW33PuC24A4hCS1XQay6j3rgmR5qC
AOqDofj/CY4Q374HQtOl4m5CYZB/G5csRv6TZliWQKhAy9vr6VWddoyOMJYOAlAx
XRurl1Z3MriOMD6DPgNvtHd5PrR5Un8ygALgT+32d0PRz27KNXORW5TyvEf2Bv4r
BX4/GazlOlK0PdGUdZl0q/3dtkU4Wr5IifQzT8KbarOSBbNwZwVcg+83hLW5gJMx
LgMGLaAATmiN7VuvJWNDATDfEOmOvQOu8veoS8TuP1AOVeJPfPT2JVh9Jen5V1/5
3w1RUOkUI2mQX+cujWDW3koniSxjsA1OegXfHnFkF5BXp4q5e54k6D5sSh1xPFDX
iDhkU5jsbKkkJS2ulD6Vi4bIAct3apMl4IrbJn/OYOLcUVI8WuunHs4UPPEuESAS
TuQExKSdj4Ntrzo3pWEy8kX3/Z9VGa+WDzwsPUuBSvllB5Ir/jjKgvkxPA6zGsiY
rbkmZT8qyI01IZ/GXqfI2AQYCGvgp+SOvFPi755ZlELTQS6sUkGZH2/2M5XnKA9t
Z1wB2iwttoS1VQInx0HgiiAGrXrFkr7IzSIN2T+CfWIqilnL7+nTxzwlJtC206P4
DB97bF6azDtJ6yh1LetRZ1ZMX/Gr56Cy0Z6USNoOu+a12PLqlPk9+fPBBpkuGcdy
BRk8KgysEuk=
=8T0v
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2025-07-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Core scheduler changes:
- Better tracking of maximum lag of tasks in presence of different
slices duration, for better handling of lag in the fair scheduler
(Vincent Guittot)
- Clean up and standardize #if/#else/#endif markers throughout the
entire scheduler code base (Ingo Molnar)
- Make SMP unconditional: build the SMP scheduler's data structures
and logic on UP kernel too, even though they are not used, to
simplify the scheduler and remove around 200 #ifdef/[#else]/#endif
blocks from the scheduler (Ingo Molnar)
- Reorganize cgroup bandwidth control interface handling for better
interfacing with sched_ext (Tejun Heo)
Balancing:
- Bump sd->max_newidle_lb_cost when newidle balance fails (Chris
Mason)
- Remove sched_domain_topology_level::flags to simplify the code
(Prateek Nayak)
- Simplify and clean up build_sched_topology() (Li Chen)
- Optimize build_sched_topology() on large machines (Li Chen)
Real-time scheduling:
- Add initial version of proxy execution: a mechanism for
mutex-owning tasks to inherit the scheduling context of higher
priority waiters.
Currently limited to a single runqueue and conditional on
CONFIG_EXPERT, and other limitations (John Stultz, Peter Zijlstra,
Valentin Schneider)
- Deadline scheduler (Juri Lelli):
- Fix dl_servers initialization order (Juri Lelli)
- Fix DL scheduler's root domain reinitialization logic (Juri
Lelli)
- Fix accounting bugs after global limits change (Juri Lelli)
- Fix scalability regression by implementing less agressive
dl_server handling (Peter Zijlstra)
PSI:
- Improve scalability by optimizing psi_group_change() cpu_clock()
usage (Peter Zijlstra)
Rust changes:
- Make Task, CondVar and PollCondVar methods inline to avoid
unnecessary function calls (Kunwu Chan, Panagiotis Foliadis)
- Add might_sleep() support for Rust code: Rust's "#[track_caller]"
mechanism is used so that Rust's might_sleep() doesn't need to be
defined as a macro (Fujita Tomonori)
- Introduce file_from_location() (Boqun Feng)
Debugging & instrumentation:
- Make clangd usable with scheduler source code files again (Peter
Zijlstra)
- tools: Add root_domains_dump.py which dumps root domains info (Juri
Lelli)
- tools: Add dl_bw_dump.py for printing bandwidth accounting info
(Juri Lelli)
Misc cleanups & fixes:
- Remove play_idle() (Feng Lee)
- Fix check_preemption_disabled() (Sebastian Andrzej Siewior)
- Do not call __put_task_struct() on RT if pi_blocked_on is set (Luis
Claudio R. Goncalves)
- Correct the comment in place_entity() (wang wei)"
* tag 'sched-core-2025-07-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (84 commits)
sched/idle: Remove play_idle()
sched: Do not call __put_task_struct() on rt if pi_blocked_on is set
sched: Start blocked_on chain processing in find_proxy_task()
sched: Fix proxy/current (push,pull)ability
sched: Add an initial sketch of the find_proxy_task() function
sched: Fix runtime accounting w/ split exec & sched contexts
sched: Move update_curr_task logic into update_curr_se
locking/mutex: Add p->blocked_on wrappers for correctness checks
locking/mutex: Rework task_struct::blocked_on
sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
sched/topology: Remove sched_domain_topology_level::flags
x86/smpboot: avoid SMT domain attach/destroy if SMT is not enabled
x86/smpboot: moves x86_topology to static initialize and truncate
x86/smpboot: remove redundant CONFIG_SCHED_SMT
smpboot: introduce SDTL_INIT() helper to tidy sched topology setup
tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info
tools/sched: Add root_domains_dump.py which dumps root domains info
sched/deadline: Fix accounting after global limits change
sched/deadline: Reset extra_bw to max_bw when clearing root domains
sched/deadline: Initialize dl_servers after SMP
...
|
|
|
|
78bb43e51b |
Updates for the generic entry code:
- Split the code into syscall and exception/interrupt parts to ease the
conversion of ARM[64] to the generic entry infrastructure
- Extend syscall user dispatching to support a single intercepted range
instead of the default single non-intercepted range. That allows
monitoring/analysis of a specific executable range, e.g. a library, and
also provides flexibility for sandboxing scenarios.
- Cleanup and extend the user dispatch selftest
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmiIg6UTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWn7EACTvQpu7tGd1rN9hCjiB1W5po7nvlCd
gKghjS9Kp0KttTDQPLVcmnH06BhDHWNNn1HXZ1ORea4bpLywiKHtVgqUAsJDsBsv
ETeTHYNphk0sktvAqp3XusA6HF4T0s1KXJQj3W1ACrYZWRkK/VystCLYwBRGpc3r
cj7jAFmJyNpU236R5XYJ7ooHfPYpzZ8VAHBO8ykK7muHDfyBRXEIlmkGep++ctSv
v0uZXAy6LONljKg87YJTien0UA7ze9lFgPTuV1y/qfaLbYNekUaJSDjfuhOpZZUw
TzSh9OYoIvKpd0ylHwB1qMLd5CaXNicaeLfTW3xbX06KaXa7WNAonS35sK0EjhtZ
0bBA9g6bRhphyh0tzR4saF9bczNvJydNCn7/QFo9dKbQUEL/FRXtJiIeusVx/0fJ
+ZqWRTcEdDw2Rsyv52hKgyEJi7F3nL9ovabUN9P1/0aPcTdM3WekMpSOJm1U6wVF
e6oSyeoeNdjcdxgWbQrgRNbmq5CPEV3ig5J+G418r5DTF3ifqZX+WscijUtKTu5K
V5GpLc0PL9eoigQ37LmGkwK/4xoB9SAPTQuzUs9qgh9NidwT0cCfoNxpeGh6GeHX
GLHPGU61vZaefxpwuAuv+SQSgxXSKk2/H/ijPzSjrX/PkUp7MoX9XoOQAh4FxZjO
ok5YEUGXzSJfXQ==
=yaCQ
-----END PGP SIGNATURE-----
Merge tag 'core-entry-2025-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull generic entry code updates from Thomas Gleixner:
- Split the code into syscall and exception/interrupt parts to ease the
conversion of ARM[64] to the generic entry infrastructure
- Extend syscall user dispatching to support a single intercepted range
instead of the default single non-intercepted range. That allows
monitoring/analysis of a specific executable range, e.g. a library,
and also provides flexibility for sandboxing scenarios
- Cleanup and extend the user dispatch selftest
* tag 'core-entry-2025-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Split generic entry into generic exception and syscall entry
selftests: Add tests for PR_SYS_DISPATCH_INCLUSIVE_ON
syscall_user_dispatch: Add PR_SYS_DISPATCH_INCLUSIVE_ON
selftests: Fix errno checking in syscall_user_dispatch test
|
|
|
|
adcc3bfa88 |
sched: Adapt sched tracepoints for RV task model
Add the following tracepoint:
* sched_set_need_resched(tsk, cpu, tif)
Called when a task is set the need resched [lazy] flag
Remove the unused ip parameter from sched_entry and sched_exit and alter
sched_entry to have a value of preempt consistent with the one used in
sched_switch.
Also adapt all monitors using sched_{entry,exit} to avoid breaking build.
These tracepoints are useful to describe the Linux task model and are
adapted from the patches by Daniel Bristot de Oliveira
(https://bristot.me/linux-task-model/).
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Nam Cao <namcao@linutronix.de>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250728135022.255578-7-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
|
|
7de9d4f946 |
sched: Start blocked_on chain processing in find_proxy_task()
Start to flesh out the real find_proxy_task() implementation, but avoid the migration cases for now, in those cases just deactivate the donor task and pick again. To ensure the donor task or other blocked tasks in the chain aren't migrated away while we're running the proxy, also tweak the fair class logic to avoid migrating donor or mutex blocked tasks. [jstultz: This change was split out from the larger proxy patch] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-9-jstultz@google.com |
|
|
|
be39617e38 |
sched: Fix proxy/current (push,pull)ability
Proxy execution forms atomic pairs of tasks: The waiting donor task (scheduling context) and a proxy (execution context). The donor task, along with the rest of the blocked chain, follows the proxy wrt CPU placement. They can be the same task, in which case push/pull doesn't need any modification. When they are different, however, FIFO1 & FIFO42: ,-> RT42 | | blocked-on | v blocked_donor | mutex | | owner | v `-- RT1 RT1 RT42 CPU0 CPU1 ^ ^ | | overloaded !overloaded rq prio = 42 rq prio = 0 RT1 is eligible to be pushed to CPU1, but should that happen it will "carry" RT42 along. Clearly here neither RT1 nor RT42 must be seen as push/pullable. Unfortunately, only the donor task is usually dequeued from the rq, and the proxy'ed execution context (rq->curr) remains on the rq. This can cause RT1 to be selected for migration from logic like the rt pushable_list. Thus, adda a dequeue/enqueue cycle on the proxy task before __schedule returns, which allows the sched class logic to avoid adding the now current task to the pushable_list. Furthermore, tasks becoming blocked on a mutex don't need an explicit dequeue/enqueue cycle to be made (push/pull)able: they have to be running to block on a mutex, thus they will eventually hit put_prev_task(). Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-8-jstultz@google.com |
|
|
|
be41bde4c3 |
sched: Add an initial sketch of the find_proxy_task() function
Add a find_proxy_task() function which doesn't do much. When we select a blocked task to run, we will just deactivate it and pick again. The exception being if it has become unblocked after find_proxy_task() was called. This allows us to validate keeping blocked tasks on the runqueue and later deactivating them is working ok, stressing the failure cases for when a proxy isn't found. Greatly simplified from patch by: Peter Zijlstra (Intel) <peterz@infradead.org> Juri Lelli <juri.lelli@redhat.com> Valentin Schneider <valentin.schneider@arm.com> Connor O'Brien <connoro@google.com> [jstultz: Split out from larger proxy patch and simplified for review and testing.] Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-7-jstultz@google.com |
|
|
|
25c411fce7 |
sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument sched_proxy_exec= that can be used to disable the feature at boot time if CONFIG_SCHED_PROXY_EXEC was enabled. Also uses this option to allow the rq->donor to be different from rq->curr. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-2-jstultz@google.com |
|
|
|
8f2146159b |
Merge branch 'tip/sched/urgent'
Avoid merge conflicts Signed-off-by: Peter Zijlstra <peterz@infradead.org> |
|
|
|
9f239df555 |
sched/deadline: Initialize dl_servers after SMP
dl-servers are currently initialized too early at boot when CPUs are not
fully up (only boot CPU is). This results in miscalculation of per
runqueue DEADLINE variables like extra_bw (which needs a stable CPU
count).
Move initialization of dl-servers later on after SMP has been
initialized and CPUs are all online, so that CPU count is stable and
DEADLINE variables can be computed correctly.
Fixes:
|
|
|
|
db6cc3f4ac |
Revert "sched/numa: add statistics of numa balance task"
This reverts commit |
|
|
|
009836b4fa |
sched/core: Fix migrate_swap() vs. hotplug
On Mon, Jun 02, 2025 at 03:22:13PM +0800, Kuyo Chang wrote:
> So, the potential race scenario is:
>
> CPU0 CPU1
> // doing migrate_swap(cpu0/cpu1)
> stop_two_cpus()
> ...
> // doing _cpu_down()
> sched_cpu_deactivate()
> set_cpu_active(cpu, false);
> balance_push_set(cpu, true);
> cpu_stop_queue_two_works
> __cpu_stop_queue_work(stopper1,...);
> __cpu_stop_queue_work(stopper2,..);
> stop_cpus_in_progress -> true
> preempt_enable();
> ...
> 1st balance_push
> stop_one_cpu_nowait
> cpu_stop_queue_work
> __cpu_stop_queue_work
> list_add_tail -> 1st add push_work
> wake_up_q(&wakeq); -> "wakeq is empty.
> This implies that the stopper is at wakeq@migrate_swap."
> preempt_disable
> wake_up_q(&wakeq);
> wake_up_process // wakeup migrate/0
> try_to_wake_up
> ttwu_queue
> ttwu_queue_cond ->meet below case
> if (cpu == smp_processor_id())
> return false;
> ttwu_do_activate
> //migrate/0 wakeup done
> wake_up_process // wakeup migrate/1
> try_to_wake_up
> ttwu_queue
> ttwu_queue_cond
> ttwu_queue_wakelist
> __ttwu_queue_wakelist
> __smp_call_single_queue
> preempt_enable();
>
> 2nd balance_push
> stop_one_cpu_nowait
> cpu_stop_queue_work
> __cpu_stop_queue_work
> list_add_tail -> 2nd add push_work, so the double list add is detected
> ...
> ...
> cpu1 get ipi, do sched_ttwu_pending, wakeup migrate/1
>
So this balance_push() is part of schedule(), and schedule() is supposed
to switch to stopper task, but because of this race condition, stopper
task is stuck in WAKING state and not actually visible to be picked.
Therefore CPU1 can do another schedule() and end up doing another
balance_push() even though the last one hasn't been done yet.
This is a confluence of fail, where both wake_q and ttwu_wakelist can
cause crucial wakeups to be delayed, resulting in the malfunction of
balance_push.
Since there is only a single stopper thread to be woken, the wake_q
doesn't really add anything here, and can be removed in favour of
direct wakeups of the stopper thread.
Then add a clause to ttwu_queue_cond() to ensure the stopper threads
are never queued / delayed.
Of all 3 moving parts, the last addition was the balance_push()
machinery, so pick that as the point the bug was introduced.
Fixes:
|
|
|
|
3ebb1b6522 |
sched: Fix preemption string of preempt_dynamic_none
Zero is a valid value for "preempt_dynamic_mode", namely
"preempt_dynamic_none".
Fix the off-by-one in preempt_model_str(), so that "preempty_dynamic_none"
is correctly formatted as PREEMPT(none) instead of PREEMPT(undef).
Fixes:
|
|
|
|
a70e9f647f |
entry: Split generic entry into generic exception and syscall entry
Currently CONFIG_GENERIC_ENTRY enables both the generic exception entry logic and the generic syscall entry logic, which are otherwise loosely coupled. Introduce separate config options for these so that architectures can select the two independently. This will make it easier for architectures to migrate to generic entry code. Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/20250213130007.1418890-2-ruanjinjie@huawei.com Link: https://lore.kernel.org/all/20250624-generic-entry-split-v1-1-53d5ef4f94df@linaro.org [Linus Walleij: rebase onto v6.16-rc1] |
|
|
|
ddceadce63 |
sched_ext: Add support for cgroup bandwidth control interface
From 077814f57f8acce13f91dc34bbd2b7e4911fbf25 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Fri, 13 Jun 2025 15:06:47 -1000 - Add CONFIG_GROUP_SCHED_BANDWIDTH which is selected by both CONFIG_CFS_BANDWIDTH and EXT_GROUP_SCHED. - Put bandwidth control interface files for both cgroup v1 and v2 under CONFIG_GROUP_SCHED_BANDWIDTH. - Update tg_bandwidth() to fetch configuration parameters from fair if CONFIG_CFS_BANDWIDTH, SCX otherwise. - Update tg_set_bandwidth() to update the parameters for both fair and SCX. - Add bandwidth control parameters to struct scx_cgroup_init_args. - Add sched_ext_ops.cgroup_set_bandwidth() which is invoked on bandwidth control parameter updates. - Update scx_qmap and maximal selftest to test the new feature. Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
e4e149dd2f |
sched_ext: Merge branch 'for-6.16-fixes' into for-6.17
Pull sched_ext/for-6.16-fixes to receive: |
|
|
|
5bc34be478 |
sched/core: Reorganize cgroup bandwidth control interface file writes
- Move input parameter validation from tg_set_cfs_bandwidth() to the new
outer function tg_set_bandwidth(). The outer function handles parameters
in usecs, validates them and calls tg_set_cfs_bandwidth() which converts
them into nsecs. This matches tg_bandwidth() on the read side.
- max/min_cfs_* consts are now used by tg_set_bandwidth(). Relocate, convert
into usecs and drop "cfs" from the names.
- Reimplement cpu_cfs_{period|quote|burst}_write_*() using tg_bandwidth()
and tg_set_bandwidth() and replace "cfs" in the names with "bw".
- Update cpu_max_write() to use tg_set_bandiwdth(). cpu_period_quota_parse()
is updated to drop nsec conversion accordingly. This aligns the behavior
with cfs_period_quota_print().
- Drop now unused tg_set_cfs_{period|quota|burst}().
- While at it, for consistency, rename default_cfs_period() to
default_bw_period_us() and make it return usecs.
This is to prepare for adding bandwidth control support to sched_ext.
tg_set_bandwidth() will be used as the muxing point. No functional changes
intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250614012346.2358261-5-tj@kernel.org
|
|
|
|
43e33f53e2 |
sched/core: Reorganize cgroup bandwidth control interface file reads
- Update tg_get_cfs_*() to return u64 values. These are now used as the low level accessors to the fair's bandwidth configuration parameters. Translation to usecs takes place in these functions. - Add tg_bandwidth() which reads all three bandwidth parameters using tg_get_cfs_*(). - Reimplement cgroup interface read functions using tg_bandwidth(). Drop cfs from the function names. This is to prepare for adding bandwidth control support to sched_ext. tg_bandwidth() will be used as the muxing point similar to tg_weight(). No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250614012346.2358261-4-tj@kernel.org |
|
|
|
de4c80c696 |
sched/core: Relocate tg_get_cfs_*() and cpu_cfs_*_read_*()
Collect the getters, relocate the trivial interface file wrappers, and put all of them in period, quota, burst order to prepare for future changes. Pure reordering. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250614012346.2358261-3-tj@kernel.org |
|
|
|
33796b9187 |
sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group()
During task_group creation, sched_create_group() calls
scx_group_set_weight() with CGROUP_WEIGHT_DFL to initialize the sched_ext
portion. This is premature and ends up calling ops.cgroup_set_weight() with
an incorrect @cgrp before ops.cgroup_init() is called.
sched_create_group() should just initialize SCX related fields in the new
task_group. Fix it by factoring out scx_tg_init() from sched_init() and
making sched_create_group() call that function instead of
scx_group_set_weight().
v2: Retain CONFIG_EXT_GROUP_SCHED ifdef in sched_init() as removing it leads
to build failures on !CONFIG_GROUP_SCHED configs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes:
|
|
|
|
74063c1755 |
sched/smp: Use the SMP version of idle_thread_set_boot_cpu()
Simplify the scheduler by making the CONFIG_SMP=y version of idle_thread_set_boot_cpu() unconditional. Note that idle_thread_set_boot_cpu() is already conditional on CONFIG_GENERIC_SMP_IDLE_THREAD, which most architectures select unconditionally on both UP and SMP kernels. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250528080924.2273858-28-mingo@kernel.org |
|
|
|
1f25730e5a |
sched/smp: Use the SMP version of sched_exec()
Simplify the scheduler making CONFIG_SMP=y sched_exec() code unconditional. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250528080924.2273858-27-mingo@kernel.org |
|
|
|
588467616c |
sched/smp: Use the SMP version of wake_up_new_task()
Simplify the scheduler by making CONFIG_SMP=y code in wake_up_new_task() unconditional. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250528080924.2273858-26-mingo@kernel.org |
|
|
|
8039addbe5 |
sched/smp: Use the SMP version of __task_needs_rq_lock()
Simplify the scheduler by making CONFIG_SMP=y code in __task_needs_rq_lock() unconditional. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250528080924.2273858-25-mingo@kernel.org |
|
|
|
d0a0a055a5 |
sched/smp: Use the SMP version of try_to_wake_up()
Simplify the scheduler by making CONFIG_SMP=y logic within try_to_wake_up() unconditional. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250528080924.2273858-24-mingo@kernel.org |
|
|
|
a1416303d1 |
sched/smp: Always define rq->hrtick_csd
Simplify the scheduler by making CONFIG_SMP=y data structure of rq->hrtick_csd unconditional. Adjust hrtick_start() accordingly, which was split due to the ::hrtick_csd asymmetry and use the SMP version there too. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250528080924.2273858-23-mingo@kernel.org |
|
|
|
cac5cefbad |
sched/smp: Make SMP unconditional
Simplify the scheduler by making CONFIG_SMP=y primitives and data structures unconditional. Introduce transitory wrappers for functionality not yet converted to SMP. Note that this patch is pretty large, because there's no clear separation between various aspects of the SMP scheduler, it's basically a huge block of #ifdef CONFIG_SMP. A fair amount of it has to be switched on for it to boot and work on UP systems. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250528080924.2273858-21-mingo@kernel.org |
|
|
|
b7ebb75856 |
sched: Clean up and standardize #if/#else/#endif markers in sched/core.c
- Use the standard #ifdef marker format for larger blocks, where appropriate: #if CONFIG_FOO ... #else /* !CONFIG_FOO: */ ... #endif /* !CONFIG_FOO */ - Apply this simplification: -#if defined(CONFIG_FOO) +#ifdef CONFIG_FOO - Fix whitespace noise. - Use vertical alignment to better visualize nested #ifdef blocks, where appropriate: #ifdef CONFIG_FOO # ifdef CONFIG_BAR ... # endif #endif Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250528080924.2273858-4-mingo@kernel.org |
|
|
|
fd1f847350 |
- The 2 patch series "zram: support algorithm-specific parameters" from
Sergey Senozhatsky adds infrastructure for passing algorithm-specific parameters into zram. A single parameter `winbits' is implemented at this time. - The 5 patch series "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg charging nmi-safe, which is required by BFP, which can operate in NMI context. - The 5 patch series "Some random fixes and cleanup to shmem" from Kemeng Shi implements small fixes and cleanups in the shmem code. - The 2 patch series "Skip mm selftests instead when kernel features are not present" from Zi Yan fixes some issues in the MM selftest code. - The 2 patch series "mm/damon: build-enable essential DAMON components by default" from SeongJae Park reworks DAMON Kconfig to make it easier to enable CONFIG_DAMON. - The 2 patch series "sched/numa: add statistics of numa balance task migration" from Libo Chen adds more info into sysfs and procfs files to improve visibility into the NUMA balancer's task migration activity. - The 4 patch series "selftests/mm: cow and gup_longterm cleanups" from Mark Brown provides various updates to some of the MM selftests to make them play better with the overall containing framework. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDzA9wAKCRDdBJ7gKXxA js8sAP9V3COg+vzTmimzP3ocTkkbbIJzDfM6nXpE2EQ4BR3ejwD+NsIT2ZLtTF6O LqAZpgO7ju6wMjR/lM30ebCq5qFbZAw= =oruw -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "zram: support algorithm-specific parameters" from Sergey Senozhatsky adds infrastructure for passing algorithm-specific parameters into zram. A single parameter `winbits' is implemented at this time. - "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg charging nmi-safe, which is required by BFP, which can operate in NMI context. - "Some random fixes and cleanup to shmem" from Kemeng Shi implements small fixes and cleanups in the shmem code. - "Skip mm selftests instead when kernel features are not present" from Zi Yan fixes some issues in the MM selftest code. - "mm/damon: build-enable essential DAMON components by default" from SeongJae Park reworks DAMON Kconfig to make it easier to enable CONFIG_DAMON. - "sched/numa: add statistics of numa balance task migration" from Libo Chen adds more info into sysfs and procfs files to improve visibility into the NUMA balancer's task migration activity. - "selftests/mm: cow and gup_longterm cleanups" from Mark Brown provides various updates to some of the MM selftests to make them play better with the overall containing framework. * tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits) mm/khugepaged: clean up refcount check using folio_expected_ref_count() selftests/mm: fix test result reporting in gup_longterm selftests/mm: report unique test names for each cow test selftests/mm: add helper for logging test start and results selftests/mm: use standard ksft_finished() in cow and gup_longterm selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled sched/numa: add statistics of numa balance task sched/numa: fix task swap by skipping kernel threads tools/testing: check correct variable in open_procmap() tools/testing/vma: add missing function stub mm/gup: update comment explaining why gup_fast() disables IRQs selftests/mm: two fixes for the pfnmap test mm/khugepaged: fix race with folio split/free using temporary reference mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order mmu_notifiers: remove leftover stub macros selftests/mm: deduplicate test names in madv_populate kcov: rust: add flags for KCOV with Rust mm: rust: make CONFIG_MMU ifdefs more narrow mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables() mm/damon/Kconfig: enable CONFIG_DAMON by default ... |
|
|
|
ad6b26b6a0 |
sched/numa: add statistics of numa balance task
On systems with NUMA balancing enabled, it has been found that tracking
task activities resulting from NUMA balancing is beneficial. NUMA
balancing employs two mechanisms for task migration: one is to migrate
a task to an idle CPU within its preferred node, and the other is to
swap tasks located on different nodes when they are on each other's
preferred nodes.
The kernel already provides NUMA page migration statistics in
/sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However, it
lacks statistics regarding task migration and swapping. Therefore,
relevant counts for task migration and swapping should be added.
The following two new fields:
numa_task_migrated
numa_task_swapped
will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched
and /proc/vmstat.
Introducing both per-task and per-memory cgroup (memcg) NUMA balancing
statistics facilitates a rapid evaluation of the performance and
resource utilization of the target workload. For instance, users can
first identify the container with high NUMA balancing activity and then
further pinpoint a specific task within that group, and subsequently
adjust the memory policy for that task. In short, although it is
possible to iterate through /proc/$pid/sched to locate the problematic
task, the introduction of aggregated NUMA balancing activity for tasks
within each memcg can assist users in identifying the task more
efficiently through a divide-and-conquer approach.
As Libo Chen pointed out, the memcg event relies on the text names in
vmstat_text, and /proc/vmstat generates corresponding items based on
vmstat_text. Thus, the relevant task migration and swapping events
introduced in vmstat_text also need to be populated by
count_vm_numa_event(), otherwise these values are zero in /proc/vmstat.
In theory, task migration and swap events are part of the scheduler's
activities. The reason for exposing them through the
memory.stat/vmstat interface is that we already have NUMA balancing
statistics in memory.stat/vmstat, and these events are closely related
to each other. Following Shakeel's suggestion, we describe the
end-to-end flow/story of all these events occurring on a timeline for
future reference:
The goal of NUMA balancing is to co-locate a task and its memory pages
on the same NUMA node. There are two strategies: migrate the pages to
the task's node, or migrate the task to the node where its pages
reside.
Suppose a task p1 is running on Node 0, but its pages are located on
Node 1. NUMA page fault statistics for p1 reveal its "page footprint"
across nodes. If NUMA balancing detects that most of p1's pages are on
Node 1:
1.Page Migration Attempt:
The Numa balance first tries to migrate p1's pages to Node 0.
The numa_page_migrate counter increments.
2.Task Migration Strategies:
After the page migration finishes, Numa balance checks every
1 second to see if p1 can be migrated to Node 1.
Case 2.1: Idle CPU Available
If Node 1 has an idle CPU, p1 is directly scheduled there. This
event is logged as numa_task_migrated.
Case 2.2: No Idle CPU (Task Swap)
If all CPUs on Node1 are busy, direct migration could cause CPU
contention or load imbalance. Instead: The Numa balance selects a
candidate task p2 on Node 1 that prefers Node 0 (e.g., due to its own
page footprint). p1 and p2 are swapped. This cross-node swap is
recorded as numa_task_swapped.
Link: https://lkml.kernel.org/r/d00edb12ba0f0de3c5222f61487e65f2ac58f5b1.1748493462.git.yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/7ef90a88602ed536be46eba7152ed0d33bad5790.1748002400.git.yu.c.chen@intel.com
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ayush Jain <Ayush.jain3@amd.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Libo Chen <libo.chen@oracle.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
eaed94d1f6 |
Scheduler updates for v6.16:
Core & fair scheduler changes:
- Tweak wait_task_inactive() to force dequeue sched_delayed tasks
(John Stultz)
- Adhere to place_entity() constraints (Peter Zijlstra)
- Allow decaying util_est when util_avg > CPU capacity (Pierre Gondois)
- Fix up wake_up_sync() vs DELAYED_DEQUEUE (Xuewen Yan)
Energy management:
- Introduce sched_update_asym_prefer_cpu() (K Prateek Nayak)
- cpufreq/amd-pstate: Update asym_prefer_cpu when core rankings change
(K Prateek Nayak)
- Align uclamp and util_est and call before freq update (Xuewen Yan)
CPU isolation:
- Make use of more than one housekeeping CPU (Phil Auld)
RT scheduler:
- Fix race in push_rt_task() (Harshit Agarwal)
- Add kernel cmdline option for rt_group_sched (Michal Koutný)
Scheduler topology support:
- Improve topology_span_sane speed (Steve Wahl)
Scheduler debugging:
- Move and extend the sched_process_exit() tracepoint (Andrii Nakryiko)
- Add RT_GROUP WARN checks for non-root task_groups (Michal Koutný)
- Fix trace_sched_switch(.prev_state) (Peter Zijlstra)
- Untangle cond_resched() and live-patching (Peter Zijlstra)
Fixes and cleanups:
- Misc fixes and cleanups (K Prateek Nayak, Michal Koutný,
Peter Zijlstra, Xuewen Yan)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgy50ARHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jFQQ/+KXl2XDg1V/VVmMG8GmtDlR29V3M3ricy
D7/2s0D1Y1ErHb+pRMBG31EubT9/bXjUshWIuuf51DciSLBmpELHxY5J+AevRa0L
/pHFwSvP6H5pDakI/xZ01FlYt7PxZGs+1m1o2615Mbwq6J2bjZTan54CYzrdpLOy
Nqb3OT4tSqU1+7SV7hVForBpZp9u3CvVBRt/wE6vcHltW/I486bM8OCOd2XrUlnb
QoIRliGI9KHpqCpbAeKPRSKXpf9tZv/AijZ+0WUu2yY8iwSN4p3RbbbwdCipjVQj
w5I5oqKI6cylFfl2dEFWXVO+tLBihs06w8KSQrhYmQ9DUu4RGBVM9ORINGDBPejL
bvoQh1mAkqvIL+oodujdbMDIqLupvOEtVSvwzR7SJn8BJSB00js88ngCWLjo/CcU
imLbWy9FSBLvOswLBzQthgAJEj+ejCkOIbcvM2lINWhX/zNsMFaaqYcO1wRunGGR
SavTI1s+ZksCQY6vCwRkwPrOZjyg91TA/q4FK102fHL1IcthH6xubE4yi4lTIUYs
L56HuGm8e7Shc8M2Y5rAYsVG3GoIHFLXnptOn2HnCRWaAAJYsBaLUlzoBy9MxCfw
I2YVDCylkQxevosSi2XxXo3tbM6auISU9SelAT/dAz32V1rsjWQojRJXeGYKIbu7
KBuN/dLItW0=
=s/ra
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Core & fair scheduler changes:
- Tweak wait_task_inactive() to force dequeue sched_delayed tasks
(John Stultz)
- Adhere to place_entity() constraints (Peter Zijlstra)
- Allow decaying util_est when util_avg > CPU capacity (Pierre
Gondois)
- Fix up wake_up_sync() vs DELAYED_DEQUEUE (Xuewen Yan)
Energy management:
- Introduce sched_update_asym_prefer_cpu() (K Prateek Nayak)
- cpufreq/amd-pstate: Update asym_prefer_cpu when core rankings
change (K Prateek Nayak)
- Align uclamp and util_est and call before freq update (Xuewen Yan)
CPU isolation:
- Make use of more than one housekeeping CPU (Phil Auld)
RT scheduler:
- Fix race in push_rt_task() (Harshit Agarwal)
- Add kernel cmdline option for rt_group_sched (Michal Koutný)
Scheduler topology support:
- Improve topology_span_sane speed (Steve Wahl)
Scheduler debugging:
- Move and extend the sched_process_exit() tracepoint (Andrii
Nakryiko)
- Add RT_GROUP WARN checks for non-root task_groups (Michal Koutný)
- Fix trace_sched_switch(.prev_state) (Peter Zijlstra)
- Untangle cond_resched() and live-patching (Peter Zijlstra)
Fixes and cleanups:
- Misc fixes and cleanups (K Prateek Nayak, Michal Koutný, Peter
Zijlstra, Xuewen Yan)"
* tag 'sched-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
sched/uclamp: Align uclamp and util_est and call before freq update
sched/util_est: Simplify condition for util_est_{en,de}queue()
sched/fair: Fixup wake_up_sync() vs DELAYED_DEQUEUE
sched,livepatch: Untangle cond_resched() and live-patching
sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks
sched/fair: Adhere to place_entity() constraints
sched/debug: Print the local group's asym_prefer_cpu
cpufreq/amd-pstate: Update asym_prefer_cpu when core rankings change
sched/topology: Introduce sched_update_asym_prefer_cpu()
sched/fair: Use READ_ONCE() to read sg->asym_prefer_cpu
sched/isolation: Make use of more than one housekeeping cpu
sched/rt: Fix race in push_rt_task
sched: Add annotations to RT_GROUP_SCHED fields
sched: Add RT_GROUP WARN checks for non-root task_groups
sched: Do not construct nor expose RT_GROUP_SCHED structures if disabled
sched: Bypass bandwitdh checks with runtime disabled RT_GROUP_SCHED
sched: Skip non-root task_groups with disabled RT_GROUP_SCHED
sched: Add commadline option for RT_GROUP_SCHED toggling
sched: Always initialize rt_rq's task_group
sched: Remove unneeed macro wrap
...
|