mirror of https://github.com/torvalds/linux.git
4671 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
038730dc12 |
sched_ext: Change the event type from u64 to s64
The event count could be negative in the future, so change the event type from u64 to s64. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
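A hedged sketch of what the type change amounts to, assuming the counters live in `struct scx_event_stats` as described in the event-infrastructure entry further down this log; the member names follow the event names listed there, not an exact diff.

```c
struct scx_event_stats {
	s64	SCX_EV_SELECT_CPU_FALLBACK;		/* was u64 */
	s64	SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE;	/* was u64 */
	s64	SCX_EV_ENQ_SLICE_DFL;			/* was u64 */
	/* ... remaining event counters, all widened to signed 64-bit ... */
};
```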
|
|
|
8a9b1585e2 |
sched_ext: Merge branch 'for-6.14-fixes' into for-6.15
Pull for-6.14-fixes to receive: |
|
|
|
9360dfe4cb |
sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids
range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel
crash.
To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and
trigger an scx error if an invalid CPU is specified.
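A minimal sketch of the added check, assuming the usual kfunc signature; only the nr_cpu_ids bound comes from the description above, and the error-reporting helper name is an assumption.

```c
__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
				       u64 wake_flags, bool *is_idle)
{
	/* Reject an out-of-range prev_cpu instead of indexing per-CPU data with it. */
	if (unlikely(prev_cpu < 0 || prev_cpu >= nr_cpu_ids)) {
		scx_ops_error("invalid prev_cpu %d", prev_cpu);	/* helper name assumed */
		*is_idle = false;
		return prev_cpu;
	}

	/* ... default idle-CPU selection as before ... */
}
```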
Fixes:
|
|
|
|
d203484f25 |
Merge tag 'sched-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Prevent cond_resched() based preemption when interrupts are disabled,
on PREEMPT_NONE and PREEMPT_VOLUNTARY kernels"
* tag 'sched-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Prevent rescheduling when interrupts are disabled |
|
|
|
82c387ef75 |
sched/core: Prevent rescheduling when interrupts are disabled
David reported a warning observed while loop testing kexec jump:
  Interrupts enabled after irqrouter_resume+0x0/0x50
  WARNING: CPU: 0 PID: 560 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220
   kernel_kexec+0xf6/0x180
   __do_sys_reboot+0x206/0x250
   do_syscall_64+0x95/0x180
The corresponding interrupt flag trace:
  hardirqs last enabled at (15573): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90
  hardirqs last disabled at (15580): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90
That means __up_console_sem() was invoked with interrupts enabled.
Further instrumentation revealed that in the interrupt disabled section of kexec jump one of the syscore_suspend() callbacks woke up a task, which set the NEED_RESCHED flag. A later callback in the resume path invoked cond_resched() which in turn led to the invocation of the scheduler:
  __cond_resched+0x21/0x60
  down_timeout+0x18/0x60
  acpi_os_wait_semaphore+0x4c/0x80
  acpi_ut_acquire_mutex+0x3d/0x100
  acpi_ns_get_node+0x27/0x60
  acpi_ns_evaluate+0x1cb/0x2d0
  acpi_rs_set_srs_method_data+0x156/0x190
  acpi_pci_link_set+0x11c/0x290
  irqrouter_resume+0x54/0x60
  syscore_resume+0x6a/0x200
  kernel_kexec+0x145/0x1c0
  __do_sys_reboot+0xeb/0x240
  do_syscall_64+0x95/0x180
This is a long standing problem, which probably got more visible with the recent printk changes. Something does a task wakeup and the scheduler sets the NEED_RESCHED flag. cond_resched() sees it set and invokes schedule() from a completely bogus context. The scheduler enables interrupts after context switching, which causes the above warning at the end.
Quite some of the code paths in syscore_suspend()/resume() can result in triggering a wakeup with the exactly same consequences. They might not have done so yet, but as they share a lot of code with normal operations it's just a question of time.
The problem only affects the PREEMPT_NONE and PREEMPT_VOLUNTARY scheduling models. Full preemption is not affected as cond_resched() is disabled and the preemption check preemptible() takes the interrupt disabled flag into account.
Cure the problem by adding a corresponding check into cond_resched().
Reported-by: David Woodhouse <dwmw@amazon.co.uk>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Closes: https://lore.kernel.org/all/7717fe2ac0ce5f0a2c43fdab8b11f4483d54a2a4.camel@infradead.org |
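A minimal sketch of the added guard, assuming it lands in __cond_resched(); the surrounding RCU handling and the dynamic-preemption variants are elided.

```c
int __sched __cond_resched(void)
{
	/* Never call into the scheduler while interrupts are disabled. */
	if (should_resched(0) && !irqs_disabled()) {
		preempt_schedule_common();
		return 1;
	}
	return 0;
}
```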
|
|
|
e6d3c4e535 |
sched_ext: A fix for v6.14-rc4
Merge tag 'sched_ext-for-6.14-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fix from Tejun Heo:
"pick_task_scx() has a workaround to avoid stalling when the fair class's balance() says yes but pick_task() says no. The workaround was incorrectly deciding to keep the prev task running if the task is on SCX even when the task is in a sleeping state, which can lead to several confusing failure modes. Fix it by testing whether the prev task is currently queued on SCX instead"
* tag 'sched_ext-for-6.14-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance() |
|
|
|
fde7d64766 |
sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behavior
When %SCX_PICK_IDLE_IN_NODE is specified, scx_bpf_pick_any_cpu_node()
should always return a CPU from the specified node, regardless of its
idle state.
Also clarify this logic in the function documentation.
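A hedged sketch of the intended semantics; the idle-pick helper is hypothetical, while the cpumask accessors are standard kernel API.

```c
s32 scx_bpf_pick_any_cpu_node(const struct cpumask *cpus_allowed, int node,
			      u64 flags)
{
	s32 cpu;

	/* Try an idle CPU first (helper name is illustrative). */
	cpu = pick_idle_cpu_in_node(cpus_allowed, node, flags);
	if (cpu >= 0)
		return cpu;

	/* With SCX_PICK_IDLE_IN_NODE, stay in @node even if nothing is idle. */
	if (flags & SCX_PICK_IDLE_IN_NODE)
		return cpumask_any_and_distribute(cpumask_of_node(node),
						  cpus_allowed);

	return cpumask_any_distribute(cpus_allowed);
}
```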
Fixes:
|
|
|
|
8fef0a3b17 |
sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()
|
|
|
|
0e9b4c10e8 |
sched_ext: idle: Introduce scx_bpf_nr_node_ids()
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the system. BPF schedulers can use this information together with the new node-aware kfuncs, for example to create per-node DSQs, validate node IDs, etc. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
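For example, a BPF scheduler might use the new kfunc at load time to create one DSQ per NUMA node; the DSQ-ID scheme and the scheduler name below are assumptions for illustration.

```c
#include <scx/common.bpf.h>

s32 BPF_STRUCT_OPS_SLEEPABLE(mysched_init)
{
	u32 node, nr_nodes = scx_bpf_nr_node_ids();

	/* One DSQ per NUMA node, allocated on that node. */
	for (node = 0; node < nr_nodes; node++) {
		s32 ret = scx_bpf_create_dsq(node, node);

		if (ret)
			return ret;
	}
	return 0;
}
```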
|
|
|
1a5d3492f8 |
sched: Add unlikely branch hints to several system calls
Adding an unlikely() hint on early error return paths improves the run-time performance of several sched related system calls.
Benchmarking on an i9-12900 shows the following per-system-call performance improvements:
                       before     after    improvement
  sched_getattr       182.4ns   170.6ns    ~6.5%
  sched_setattr       284.3ns   267.6ns    ~5.9%
  sched_getparam      161.6ns   148.1ns    ~8.4%
  sched_setparam     1265.4ns  1227.6ns    ~3.0%
  sched_getscheduler  129.4ns   118.2ns    ~8.7%
  sched_setscheduler 1237.3ns  1216.7ns    ~1.7%
Results are based on running 20 tests with turbo disabled (to reduce clock freq turbo changes), with a 10 second run per test based on the number of system calls per second. The % standard deviation of the measurements for the 20 tests was 0.05% to 0.40%, so the results are reliable.
Tested on a kernel build with gcc 14.2.1.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250219142423.45516-1-colin.i.king@gmail.com |
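As a rough illustration of the pattern (not the actual diff), the early parameter checks in a syscall such as sched_getparam() gain unlikely() annotations; the function body shown is abridged.

```c
SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param)
{
	/* Early error path, now hinted as the cold branch. */
	if (unlikely(!param || pid < 0))
		return -EINVAL;

	/* ... task lookup and copy-out unchanged ... */
	return 0;
}
```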
|
|
|
b796ea8489 |
sched/core: Remove duplicate included header file stats.h
The header file stats.h is included twice. Remove the redundant include and the following make includecheck warning: stats.h is included more than once Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250219111756.3070-2-thorsten.blum@linux.dev |
|
|
|
01059219b0 |
sched_ext: idle: Introduce node-aware idle cpu kfunc helpers
Introduce a new kfunc to retrieve the node associated to a CPU: int scx_bpf_cpu_node(s32 cpu) Add the following kfuncs to provide BPF schedulers direct access to per-node idle cpumasks information: const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags) s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags) Moreover, trigger an scx error when any of the non-node aware idle CPU kfuncs are used when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled. Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
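A sketch of how a BPF scheduler might combine these kfuncs in ops.select_cpu(); the scheduler name and the trivial fallback are illustrative, and the idle-selection flags argument is left at 0.

```c
#include <scx/common.bpf.h>

s32 BPF_STRUCT_OPS(mysched_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	int node = scx_bpf_cpu_node(prev_cpu);
	s32 cpu;

	/* Prefer an idle CPU on the task's previous NUMA node. */
	cpu = scx_bpf_pick_idle_cpu_node(p->cpus_ptr, node, 0);
	if (cpu >= 0)
		return cpu;

	return prev_cpu;
}
```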
|
|
|
ee13da875b |
sched: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/a55e849cba3c41b4c5708be6ea6be6f337d1a8fb.1738746821.git.namcao@linutronix.de |
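As a before/after sketch (using the rq hrtick timer as one plausible conversion site; the series touches several sched timers):

```c
/* before: two-step initialization */
hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
rq->hrtick_timer.function = hrtick;

/* after: callback passed directly to the setup helper */
hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC,
	      HRTIMER_MODE_REL_HARD);
```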
|
|
|
02d954c0fd |
sched: Compact RSEQ concurrency IDs with reduced threads and affinity
When a process reduces its number of threads or clears bits in its CPU affinity mask, the mm_cid allocation should eventually converge towards smaller values. However, the change introduced by: commit |
|
|
|
ff3b373ecc |
Merge tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
- Clarify what happens when a task is woken up from the wake queue and
  make clear its removal from that queue is atomic
* tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Clarify wake_up_q()'s write to task->wake_q.next |
|
|
|
48849271e6 |
sched_ext: idle: Per-node idle cpumasks
Using a single global idle mask can lead to inefficiencies and a lot of
stress on the cache coherency protocol on large systems with multiple
NUMA nodes, since all the CPUs can create a really intense read/write
activity on the single global cpumask.
Therefore, split the global cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on large systems.
The concept is that each cpumask will track only the idle CPUs within
its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
In this way concurrent access to the idle cpumask will be restricted
within each NUMA node.
The split of multiple per-node idle cpumasks can be controlled using the
SCX_OPS_BUILTIN_IDLE_PER_NODE flag.
By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global
host-wide idle cpumask is used, maintaining the previous behavior.
NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via
SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will
trigger an scx error, since there are no system-wide cpumasks.
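A rough sketch of the data-structure split this describes; the names approximate the idea (one idle cpumask pair per NUMA node) rather than the exact layout in ext_idle.c.

```c
struct scx_idle_cpus {
	cpumask_var_t cpu;	/* idle CPUs within this NUMA node */
	cpumask_var_t smt;	/* fully idle SMT cores within this NUMA node */
};

/*
 * One entry per NUMA node when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled,
 * otherwise a single host-wide entry is used (previous behavior).
 */
static struct scx_idle_cpus **scx_idle_node_masks;
```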
= Test =
Hardware:
- System: DGX B200
- CPUs: 224 SMT threads (112 physical cores)
- Processor: INTEL(R) XEON(R) PLATINUM 8570
- 2 NUMA nodes
Scheduler:
- scx_simple [1] (so that we can focus on the built-in idle selection
  policy rather than on the scheduling policy itself)
Test:
- Run a parallel kernel build `make -j $(nproc)` and measure the average
elapsed time over 10 runs:
avg time | stdev
---------+------
before: 52.431s | 2.895
after: 50.342s | 2.895
= Conclusion =
Splitting the global cpumask into multiple per-NUMA cpumasks helped to
achieve a speedup of approximately +4% with this particular architecture
and test case.
The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @
2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.
On smaller systems, I haven't noticed any measurable regressions or
improvements with the same test (parallel kernel build) and scheduler
(scx_simple).
Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs
I observed an additional +2-2.5% performance improvement with the same
test.
[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c
Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
|
|
0aaaf89df8 |
sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows BPF schedulers to select between using a global flat idle cpumask or multiple per-node cpumasks. This only introduces the flag and the mechanism to enable/disable this feature without affecting any scheduling behavior. Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
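A minimal sketch of how a BPF scheduler would opt in, assuming the usual SCX_OPS_DEFINE() pattern from the scx tooling; the scheduler and callback names are placeholders.

```c
#include <scx/common.bpf.h>

SCX_OPS_DEFINE(mysched_ops,
	       .select_cpu	= (void *)mysched_select_cpu,
	       .enqueue		= (void *)mysched_enqueue,
	       .init		= (void *)mysched_init,
	       .flags		= SCX_OPS_BUILTIN_IDLE_PER_NODE,
	       .name		= "mysched");
```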
|
|
|
d73249f887 |
sched_ext: idle: Make idle static keys private
Make all the static keys used by the idle CPU selection policy private to ext_idle.c. This avoids unnecessary exposure in headers and improves code encapsulation. Cc: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
04f41cbf03 |
sched_ext: Fixes for v6.14-rc2
Merge tag 'sched_ext-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix lock imbalance in a corner case of dispatch_to_local_dsq()
- Migration disabled tasks were confusing some BPF schedulers and their handling had a bug. Fix it and simplify the default behavior by dispatching them automatically
- ops.tick(), ops.disable() and ops.exit_task() were incorrectly disallowing kfuncs that require the task argument to be the task the rq operation is currently operating on and is thus rq-locked. Allow them.
- Fix an autogroup migration handling bug which was occasionally triggering a warning in the cgroup migration path
- tools/sched_ext, selftest and other misc updates
* tag 'sched_ext-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx
  sched_ext: Fix the incorrect bpf_list kfunc API in common.bpf.h.
  sched_ext: selftests: Fix grammar in tests description
  sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()
  sched_ext: Fix migration disabled handling in targeted dispatches
  sched_ext: Implement auto local dispatching of migration disabled tasks
  sched_ext: Fix incorrect time delta calculation in time_delta()
  sched_ext: Fix lock imbalance in dispatch_to_local_dsq()
  sched_ext: selftests/dsp_local_on: Fix selftest on UP systems
  tools/sched_ext: Add helper to check task migration state
  sched_ext: Fix incorrect autogroup migration detection
  sched_ext: selftests/dsp_local_on: Fix sporadic failures
  selftests/sched_ext: Fix enum resolution
  sched_ext: Include task weight in the error state dump
  sched_ext: Fixes typos in comments |
|
|
|
ad3b301aa0 |
sched_ext: Provides a sysfs 'events' to expose core event counters
Add a sysfs entry at /sys/kernel/sched_ext/root/events to expose core event counters through the files system interface. Each line of the file shows the event name and its counter value. In addition, the format of scx_dump_event() is adjusted as the event name gets longer. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
d34e798094 |
sched/fair: Refactor can_migrate_task() to eliminate looping
can_migrate_task() uses for_each_cpu_and() with an if statement inside to find the destination CPU. This is the same logic as finding the first set bit of the bitwise-AND of "env->dst_grpmask", "env->cpus" and "p->cpus_ptr".
Refactor it by using cpumask_first_and_and() to perform the bitwise-AND of "env->dst_grpmask", "env->cpus" and "p->cpus_ptr" and pick the first CPU within the intersection as the destination CPU, eliminating the need for the loop and its extra branches.
After the refactoring this part of the code speeds up from ~115ns to ~54ns, according to the test below. The test was run 5 times and the results are shown in the following table (the test script is pasted in the next section).
  -------------------------------------------------------
  |Old method| 130| 118| 115| 109| 106| avg ~115ns|
  -------------------------------------------------------
  |New method|  58|  55|  54|  48|  55| avg ~54ns |
  -------------------------------------------------------
Signed-off-by: I Hsin Cheng <richard120310@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250210103019.283824-1-richard120310@gmail.com |
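A condensed before/after sketch of the shape of the change (not the exact diff):

```c
/* before: loop plus branch to find the first usable destination CPU */
for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
	if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
		env->flags |= LBF_DST_PINNED;
		env->new_dst_cpu = cpu;
		break;
	}
}

/* after: a single triple-AND lookup */
cpu = cpumask_first_and_and(env->dst_grpmask, env->cpus, p->cpus_ptr);
if (cpu < nr_cpu_ids) {
	env->flags |= LBF_DST_PINNED;
	env->new_dst_cpu = cpu;
}
```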
|
|
|
563bc2161b |
sched/eevdf: Force propagating min_slice of cfs_rq when {en,de}queue tasks
When a task is enqueued and its parent cgroup se is already on_rq, the
parent cgroup se will not be enqueued again, and hence root->min_slice
remains unchanged. The same issue happens when a task is dequeued and its
parent cgroup se has other runnable entities: the parent cgroup se
will not be dequeued.
Force propagating min_slice when the se doesn't need to be enqueued or
dequeued, ensuring the se hierarchy always gets the latest min_slice.
Fixes:
|
|
|
|
b9f2b29b94 |
sched: Don't define sched_clock_irqtime as static key
The sched_clock_irqtime was defined as a static key in commit |
|
|
|
2ae891b826 |
sched: Reduce the default slice to avoid tasks getting an extra tick
The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which
means that we have a default slice of:
0.75 for 1 cpu
1.50 up to 3 cpus
2.25 up to 7 cpus
3.00 for 8 cpus and above.
For HZ=250 and HZ=100, because of the tick accuracy, the runtime of
tasks is far higher than their slice.
For HZ=1000 with 8 cpus or more, the accuracy of tick is already
satisfactory, but there is still an issue that tasks will get an extra
tick because the tick often arrives a little faster than expected. In
this case, the task can only wait until the next tick to consider that it
has reached its deadline, and will run 1ms longer.
vruntime + sysctl_sched_base_slice = deadline
|-----------|-----------|-----------|-----------|
1ms 1ms 1ms 1ms
^ ^ ^ ^
tick1 tick2 tick3 tick4(nearly 4ms)
There are two reasons for tick error: clockevent precision and
CONFIG_IRQ_TIME_ACCOUNTING/CONFIG_PARAVIRT_TIME_ACCOUNTING. With
CONFIG_IRQ_TIME_ACCOUNTING every tick will be less than 1ms, but even
without it, because of clockevent precision, the tick is still often less
than 1ms.
In order to make scheduling more precise, we changed 0.75 to 0.70.
Using 0.70 instead of 0.75 should not change much for other configs
and would fix this issue:
0.70 for 1 cpu
1.40 up to 3 cpus
2.10 up to 7 cpus
2.8 for 8 cpus and above.
This does not guarantee that tasks can run the slice time accurately
every time, but occasionally running an extra tick has little impact.
Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20250208075322.13139-1-15645113830zzh@gmail.com
|
|
|
|
f553741ac8 |
sched: Cancel the slice protection of the idle entity
A woken non-idle entity should preempt an idle entity at any time,
but because of the slice protection of the idle entity, the non-idle
entity has to wait, so just cancel that protection.
This patch is aimed at minimizing the impact of SCHED_IDLE on
SCHED_NORMAL. For example, with a SCHED_IDLE task that sleeps for 1s
and then runs for 3 ms, cyclictest on the same CPU sees a maximum
latency of 3 ms, which is caused by the slice protection of the idle
entity. That is unreasonable. With this patch, the cyclictest latency
under the same conditions is basically the same on a CPU with idle
processes as on an empty CPU.
[peterz: add helpers]
Fixes:
|
|
|
|
3539c6411a |
sched_ext: Implement SCX_OPS_ALLOW_QUEUED_WAKEUP
A task wakeup can be either processed on the waker's CPU or bounced to the wakee's previous CPU using an IPI (ttwu_queue). Bouncing to the wakee's CPU avoids the waker's CPU locking and accessing the wakee's rq, which can be expensive across cache and node boundaries.
When the ttwu_queue path is taken, select_task_rq() and thus ops.select_cpu() may be skipped in some cases (racing against the wakee switching out). Because this confused some BPF schedulers, because there wasn't a good way for a BPF scheduler to tell whether idle CPU selection had been skipped, because ops.enqueue() couldn't insert tasks into foreign local DSQs, and because the performance difference on machines with simple topologies was minimal, sched_ext disabled ttwu_queue.
However, this optimization makes a noticeable difference on more complex topologies, and a BPF scheduler now has an easy way to tell whether ops.select_cpu() was skipped since |
|
|
|
f5717c93a1 |
sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx
Now when we use scx_bpf_task_cgroup() in ops.tick() to get the cgroup of the current task, the following error will occur: scx_foo[3795244] triggered exit kind 1024: runtime error (called on a task not being operated on) The reason is that we are using SCX_CALL_OP() instead of SCX_CALL_OP_TASK() when calling ops.tick(), which triggers the error during the subsequent scx_kf_allowed_on_arg_tasks() check. SCX_CALL_OP_TASK() was first introduced in commit |
|
|
|
78e4690de4 |
Merge branch 'for-6.14-fixes' into for-6.15
Pull to receive |
|
|
|
f3f08c3acf |
sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()
While fixing migration disabled task handling, |
|
|
|
1a4e0d8682 |
sched_ext: Take NUMA node into account when allocating per-CPU cpumasks
per-CPU cpumasks are dominantly accessed from their own local CPUs, so allocate them node-local to improve performance. Signed-off-by: Li RongQing <lirongqing@baidu.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
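A small sketch of the allocation pattern being described; the per-CPU variable and helper names are hypothetical, the point is placing each mask on its CPU's node via cpu_to_node().

```c
static DEFINE_PER_CPU(cpumask_var_t, local_mask);	/* hypothetical per-CPU mask */

static int __init alloc_local_masks(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		/* Allocate each CPU's mask on that CPU's own NUMA node. */
		if (!alloc_cpumask_var_node(&per_cpu(local_mask, cpu),
					    GFP_KERNEL, cpu_to_node(cpu)))
			return -ENOMEM;
	}
	return 0;
}
```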
|
|
|
eace54dff0 |
sched_ext: Add SCX_EV_ENQ_SKIP_MIGRATION_DISABLED
Count the number of times a migration disabled task is automatically dispatched to its local DSQ. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> |
|
|
|
26176116d9 |
sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in the right spot
SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE wasn't quite right in two aspects: - It counted both migration disabled and offline events. - It didn't count events from the scx_bpf_dsq_move() path. Fix it by moving the counting into task_can_run_on_remote_rq() which is shared by both paths and can distinguish the different rejection conditions. The argument @trigger_error is renamed to @enforce as it now does more than just trigger errors. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> |
|
|
|
29ef4a2fcf |
Merge branch 'for-6.14-fixes' into for-6.15
Pull to receive: - |
|
|
|
3296682157 |
sched_ext: Fix migration disabled handling in targeted dispatches
A dispatch operation that can target a specific local DSQ - scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the task can be migrated to the target CPU using task_can_run_on_remote_rq(). If the task can't be migrated to the targeted CPU, it is bounced through a global DSQ.
task_can_run_on_remote_rq() assumes that the task is on a CPU that's different from the targeted CPU, but the callers don't uphold the assumption and may call the function when the task is already on the target CPU. When such a task has migration disabled, task_can_run_on_remote_rq() ends up returning %false incorrectly, unnecessarily bouncing the task to a global DSQ.
Fix it by updating the callers to only call task_can_run_on_remote_rq() when the task is on a different CPU than the target CPU. As this is a bit subtle, for clarity and documentation:
- Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task is on the same CPU as the target CPU.
- The is_migration_disabled() test in task_can_run_on_remote_rq() cannot trigger if the task is on a different CPU than the target CPU, as the preceding task_allowed_on_cpu() test should fail beforehand. Convert the test into SCHED_WARN_ON().
Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: |
|
|
|
2fa0fbeb69 |
sched_ext: Implement auto local dispatching of migration disabled tasks
Migration disabled tasks are special and pinned to their previous CPUs. They tripped up some unsuspecting BPF schedulers as their ->nr_cpus_allowed may not agree with the bits set in ->cpus_ptr. Make it easier for BPF schedulers by automatically dispatching them to the pinned local DSQs by default. If a BPF scheduler wants to handle migration disabled tasks explicitly, it can set SCX_OPS_ENQ_MIGRATION_DISABLED. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> |
|
|
|
c7b92e8969 |
Merge tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Fix a cfs_rq->h_nr_runnable accounting bug that trips up a defensive
SCHED_WARN_ON() on certain workloads. The bug is believed to be
(accidentally) self-correcting, hence no behavioral side effects are
expected.
Also print se.slice in debug output, since this value can now be set
via the syscall ABI and can be useful to track"
* tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Provide slice length for fair tasks
  sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue |
|
|
|
bcc6244e13 |
sched: Clarify wake_up_q()'s write to task->wake_q.next
Clarify that wake_up_q() does an atomic write to task->wake_q.next, after which a concurrent __wake_q_add() can immediately overwrite task->wake_q.next again. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250129-sched-wakeup-prettier-v1-1-2f51f5f663fa@google.com |
|
|
|
6d3f8fb4b2 |
sched_ext: Add an event, SCX_EV_ENQ_SLICE_DFL
Add a core event, SCX_EV_ENQ_SLICE_DFL, which represents how many tasks have been enqueued (or pick_task-ed or select_cpu-ed) with a default time slice (SCX_SLICE_DFL). Scheduling a task with SCX_SLICE_DFL unintentionally would be a source of latency spikes because SCX_SLICE_DFL is relatively long (20 msec). Thus, a soaring SCX_EV_ENQ_SLICE_DFL value would be a sign of BPF scheduler bugs causing latency spikes, especially when ops.select_cpu() is provided. __scx_add_event() is used since the caller holds an rq lock or p->pi_lock, so preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
2c00e1199c |
sched: update __cond_resched comment about RCU quiescent states
Update the comment in __cond_resched() clarifying how urgently needed quiescent states are provided. Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> |
|
|
|
d46457c31c |
sched_ext: Add an event, SCX_EV_BYPASS_DURATION
Add a core event, SCX_EV_BYPASS_DURATION, which represents the total duration of bypass modes in nanoseconds. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
5c605cd33c |
sched_ext: Add an event, SCX_EV_BYPASS_DISPATCH
Add a core event, SCX_EV_BYPASS_DISPATCH, which represents how many tasks have been dispatched in the bypass mode. __scx_add_event() is used since the caller holds an rq lock or p->pi_lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
4f7a38c7c9 |
sched_ext: Add an event, SCX_EV_BYPASS_ACTIVATE
Add a core event, SCX_EV_BYPASS_ACTIVATE, which represents how many times the bypass mode has been triggered. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
824d4f2dce |
sched_ext: Add an event, SCX_EV_ENQ_SKIP_EXITING
Add a core event, SCX_EV_ENQ_SKIP_EXITING, which represents how many times a task is enqueued to a local DSQ when exiting if SCX_OPS_ENQ_EXITING is not set. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
7125660bc1 |
sched_ext: Add an event, SCX_EV_DISPATCH_KEEP_LAST
Add a core event, SCX_EV_DISPATCH_KEEP_LAST, which represents how many times a task is continued to run without ops.enqueue() when SCX_OPS_ENQ_LAST is not set. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
9be0a1b0c8 |
sched_ext: Add an event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE
Add a core event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which represents how many times a BPF scheduler tries to dispatch to an offlined local DSQ. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
f7f6142107 |
sched_ext: Add an event, SCX_EV_SELECT_CPU_FALLBACK
Add a core event, SCX_EV_SELECT_CPU_FALLBACK, which represents how many times ops.select_cpu() returns a CPU that the task can't use. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
17103b8504 |
sched_ext: Implement event counter infrastructure
Collect the statistics of specific types of behavior in the sched_ext core which are not easily visible but still interesting to an scx scheduler. An event type is defined in 'struct scx_event_stats.' When an event occurs, its counter is accumulated using 'scx_add_event()' and '__scx_add_event()' into a per-CPU 'struct scx_event_stats' for efficiency. 'scx_bpf_events()' aggregates all the per-CPU counters and exposes the system-wide counters. For convenience and readability of the code, 'scx_agg_event()' and 'scx_dump_event()' are provided. The collected events can be observed after a BPF scheduler is unloaded and before a new BPF scheduler is loaded, at which point the per-CPU 'struct scx_event_stats' are reset. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
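A hedged sketch of the add-then-aggregate idea described here; the per-CPU variable name, the macro body, and the aggregation helper name are assumptions for illustration.

```c
static DEFINE_PER_CPU(struct scx_event_stats, event_stats_cpu);	/* name assumed */

/* Caller already has preemption disabled (rq lock or p->pi_lock held). */
#define __scx_add_event(name, cnt) do {					\
	this_cpu_add(event_stats_cpu.name, (cnt));			\
} while (0)

/* Aggregation of the kind scx_bpf_events() performs: sum every CPU's counters. */
static void scx_read_events(struct scx_event_stats *events)	/* illustrative helper */
{
	int cpu;

	memset(events, 0, sizeof(*events));
	for_each_possible_cpu(cpu) {
		struct scx_event_stats *e = per_cpu_ptr(&event_stats_cpu, cpu);

		events->SCX_EV_ENQ_SLICE_DFL += e->SCX_EV_ENQ_SLICE_DFL;
		/* ... likewise for the other counters ... */
	}
}
```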
|
|
|
9065ce6975 |
sched/debug: Provide slice length for fair tasks
Since commit:
|
|
|
|
f55b0671e3 |
More power management updates for 6.14-rc1
Merge tag 'pm-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These are mostly fixes on top of the previously merged power
management material with the addition of some teo cpuidle governor
updates, some of which may also be regarded as fixes:
- Add missing error handling for syscore_suspend() to the hibernation
core code (Wentao Liang)
- Revert a commit that added unused macros (Andy Shevchenko)
- Synchronize the runtime PM status of devices that were runtime-
suspended before a system-wide suspend and need to be resumed
during the subsequent system-wide resume transition (Rafael
Wysocki)
- Clean up the teo cpuidle governor and make the handling of short
idle intervals in it consistent regardless of the properties of
idle states supplied by the cpuidle driver (Rafael Wysocki)
- Fix some boost-related issues in cpufreq (Lifeng Zheng)
- Fix build issues in the s3c64xx and airoha cpufreq drivers (Viresh
Kumar)
- Remove unconditional binding of schedutil governor kthreads to the
affected CPUs if the cpufreq driver indicates that updates can
happen from any CPU (Christian Loehle)"
* tag 'pm-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: core: Synchronize runtime PM status of parents and children
cpufreq: airoha: Depends on OF
PM: Revert "Add EXPORT macros for exporting PM functions"
PM: hibernate: Add error handling for syscore_suspend()
cpufreq/schedutil: Only bind threads if needed
cpufreq: ACPI: Remove set_boost in acpi_cpufreq_cpu_init()
cpufreq: CPPC: Fix wrong max_freq in policy initialization
cpufreq: Introduce a more generic way to set default per-policy boost flag
cpufreq: Fix re-boost issue after hotplugging a CPU
cpufreq: s3c64xx: Fix compilation warning
cpuidle: teo: Skip sleep length computation for low latency constraints
cpuidle: teo: Replace time_span_ns with a flag
cpuidle: teo: Simplify handling of total events count
cpuidle: teo: Skip getting the sleep length if wakeups are very frequent
cpuidle: teo: Simplify counting events used for tick management
cpuidle: teo: Clarify two code comments
cpuidle: teo: Drop local variable prev_intercept_idx
cpuidle: teo: Combine candidate state index checks against 0
cpuidle: teo: Reorder candidate state index checks
cpuidle: teo: Rearrange idle state lookup code
|
|
|
|
1751f872cc |
treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/infiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit
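A minimal sketch of the constification pattern, with a made-up table; the point is that the array can now be const and live in .rodata.

```c
static int example_value;	/* illustrative knob */

static const struct ctl_table example_sysctl_table[] = {
	{
		.procname	= "example_value",
		.data		= &example_value,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
};

static int __init example_sysctl_init(void)
{
	register_sysctl("kernel", example_sysctl_table);
	return 0;
}
```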
|
|
|
|
337d1b354a |
sched_ext: Move built-in idle CPU selection policy to a separate file
As ext.c is becoming quite large, move the idle CPU selection policy to separate files (ext_idle.c / ext_idle.h) for better code readability. Moreover, group together all the idle CPU selection kfunc's to the same btf_kfunc_id_set block. No functional changes, this is purely code reorganization. Suggested-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
1626e5ef0b |
sched_ext: Fix lock imbalance in dispatch_to_local_dsq()
While performing the rq locking dance in dispatch_to_local_dsq(), we may
trigger the following lock imbalance condition, in particular when
multiple tasks are rapidly changing CPU affinity (i.e., running a
`stress-ng --race-sched 0`):
[ 13.413579] =====================================
[ 13.413660] WARNING: bad unlock balance detected!
[ 13.413729] 6.13.0-virtme #15 Not tainted
[ 13.413792] -------------------------------------
[ 13.413859] kworker/1:1/80 is trying to release lock (&rq->__lock) at:
[ 13.413954] [<ffffffff873c6c48>] dispatch_to_local_dsq+0x108/0x1a0
[ 13.414111] but there are no more locks to release!
[ 13.414176]
[ 13.414176] other info that might help us debug this:
[ 13.414258] 1 lock held by kworker/1:1/80:
[ 13.414318] #0: ffff8b66feb41698 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[ 13.414612]
[ 13.414612] stack backtrace:
[ 13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15
[ 13.415505] Workqueue: 0x0 (events)
[ 13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms
[ 13.415570] Call Trace:
[ 13.415700] <TASK>
[ 13.415744] dump_stack_lvl+0x78/0xe0
[ 13.415806] ? dispatch_to_local_dsq+0x108/0x1a0
[ 13.415884] print_unlock_imbalance_bug+0x11b/0x130
[ 13.415965] ? dispatch_to_local_dsq+0x108/0x1a0
[ 13.416226] lock_release+0x231/0x2c0
[ 13.416326] _raw_spin_unlock+0x1b/0x40
[ 13.416422] dispatch_to_local_dsq+0x108/0x1a0
[ 13.416554] flush_dispatch_buf+0x199/0x1d0
[ 13.416652] balance_one+0x194/0x370
[ 13.416751] balance_scx+0x61/0x1e0
[ 13.416848] prev_balance+0x43/0xb0
[ 13.416947] __pick_next_task+0x6b/0x1b0
[ 13.417052] __schedule+0x20d/0x1740
This happens because dispatch_to_local_dsq() is racing with
dispatch_dequeue() and, when the latter wins, we incorrectly assume that
the task has been moved to dst_rq.
Fix by properly tracking the currently locked rq.
Fixes:
|
|
|
|
d6f3e7d564 |
sched_ext: Fix incorrect autogroup migration detection
scx_move_task() is called from sched_move_task() and tells the BPF scheduler
that cgroup migration is being committed. sched_move_task() is used by both
cgroup and autogroup migrations and scx_move_task() tried to filter out
autogroup migrations by testing the destination cgroup and PF_EXITING but
this is not enough. In fact, without explicitly tagging the thread which is
doing the cgroup migration, there is no good way to tell apart
scx_move_task() invocations for racing migration to the root cgroup and an
autogroup migration.
This led to scx_move_task() incorrectly ignoring a migration from non-root
cgroup to an autogroup of the root cgroup triggering the following warning:
WARNING: CPU: 7 PID: 1 at kernel/sched/ext.c:3725 scx_cgroup_can_attach+0x196/0x340
...
Call Trace:
<TASK>
cgroup_migrate_execute+0x5b1/0x700
cgroup_attach_task+0x296/0x400
__cgroup_procs_write+0x128/0x140
cgroup_procs_write+0x17/0x30
kernfs_fop_write_iter+0x141/0x1f0
vfs_write+0x31d/0x4a0
__x64_sys_write+0x72/0xf0
do_syscall_64+0x82/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix it by adding an argument to sched_move_task() that indicates whether the
moving is for a cgroup or autogroup migration. After the change,
scx_move_task() is called only for cgroup migrations and renamed to
scx_cgroup_move_task().
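A condensed sketch of the resulting interface; the function body is abridged and the call-site comments are illustrative.

```c
void sched_move_task(struct task_struct *tsk, bool for_autogroup)
{
	/* ... dequeue, switch task group, requeue ... */

	/* Only genuine cgroup migrations are reported to the BPF scheduler. */
	if (!for_autogroup)
		scx_cgroup_move_task(tsk);
}

/*
 * Callers then state their intent explicitly, e.g.:
 *   sched_move_task(tsk, false);   // cgroup attach path
 *   sched_move_task(p, true);      // autogroup path
 */
```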
Link: https://github.com/sched-ext/scx/issues/370
Fixes:
|
|
|
|
9c5968db9e |
Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"The various patchsets are summarized below. Plus of course many individual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages.
So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. 
Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ... |
|
|
|
c159dfbdd4 |
Mainly individually changelogged singleton patches. The patch series in
this pull are:
- "lib min_heap: Improve min_heap safety, testing, and documentation"
from Kuan-Wei Chiu provides various tightenings to the min_heap library
code.
- "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some
cleanup and Rust preparation in the xarray library code.
- "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes
pathnames in some code comments.
- "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the
new secs_to_jiffies() in various places where that is appropriate.
- "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
switches two filesystems to the new mount API.
- "Convert ocfs2 to use folios" from Matthew Wilcox does that.
- "Remove get_task_comm() and print task comm directly" from Yafang Shao
removes now-unneeded calls to get_task_comm() in various places.
- "squashfs: reduce memory usage and update docs" from Phillip Lougher
implements some memory savings in squashfs and performs some
maintainability work.
- "lib: clarify comparison function requirements" from Kuan-Wei Chiu
tightens the sort code's behaviour and adds some maintenance work.
- "nilfs2: protect busy buffer heads from being force-cleared" from
Ryusuke Konishi fixes an issue in nilfs when the fs is presented with a
corrupted image.
- "nilfs2: fix kernel-doc comments for function return values" from
Ryusuke Konishi fixes some nilfs kerneldoc.
- "nilfs2: fix issues with rename operations" from Ryusuke Konishi
addresses some nilfs BUG_ONs which syzbot was able to trigger.
- "minmax.h: Cleanups and minor optimisations" from David Laight
does some maintenance work on the min/max library code.
- "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work
on the xarray library code.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5SP5QAKCRDdBJ7gKXxA
jqN7AQChvwXGG43n4d5SDiA/rH7ddvowQcDqhC9cAMJ1ReR7qwEA8/LIWDE4PdMX
mJnaZ1/ibpEpearrChCViApQtcyEGQI=
=ti4E
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Mainly individually changelogged singleton patches. The patch series
in this pull are:
- "lib min_heap: Improve min_heap safety, testing, and documentation"
from Kuan-Wei Chiu provides various tightenings to the min_heap
library code
- "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms
some cleanup and Rust preparation in the xarray library code
- "Update reference to include/asm-<arch>" from Geert Uytterhoeven
fixes pathnames in some code comments
- "Converge on using secs_to_jiffies()" from Easwar Hariharan uses
the new secs_to_jiffies() in various places where that is
appropriate
- "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
switches two filesystems to the new mount API
- "Convert ocfs2 to use folios" from Matthew Wilcox does that
- "Remove get_task_comm() and print task comm directly" from Yafang
Shao removes now-unneeded calls to get_task_comm() in various
places
- "squashfs: reduce memory usage and update docs" from Phillip
Lougher implements some memory savings in squashfs and performs
some maintainability work
- "lib: clarify comparison function requirements" from Kuan-Wei Chiu
tightens the sort code's behaviour and adds some maintenance work
- "nilfs2: protect busy buffer heads from being force-cleared" from
Ryusuke Konishi fixes an issue in nilfs when the fs is presented
with a corrupted image
- "nilfs2: fix kernel-doc comments for function return values" from
Ryusuke Konishi fixes some nilfs kerneldoc
- "nilfs2: fix issues with rename operations" from Ryusuke Konishi
addresses some nilfs BUG_ONs which syzbot was able to trigger
- "minmax.h: Cleanups and minor optimisations" from David Laight does
some maintenance work on the min/max library code
- "Fixes and cleanups to xarray" from Kemeng Shi does maintenance
work on the xarray library code"
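For illustration, the secs_to_jiffies() conversions listed above generally take the following shape; this is a hypothetical call site (my_timer is an assumed name), not a hunk from the series:

```c
#include <linux/jiffies.h>
#include <linux/timer.h>

/* Before: a 10 second timeout expressed indirectly via milliseconds */
mod_timer(&my_timer, jiffies + msecs_to_jiffies(10 * 1000));

/* After: the intent is explicit and the manual multiplication goes away */
mod_timer(&my_timer, jiffies + secs_to_jiffies(10));
```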
* tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits)
ocfs2: use str_yes_no() and str_no_yes() helper functions
include/linux/lz4.h: add some missing macros
Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent
Xarray: remove repeat check in xas_squash_marks()
Xarray: distinguish large entries correctly in xas_split_alloc()
Xarray: move forward index correctly in xas_pause()
Xarray: do not return sibling entries from xas_find_marked()
ipc/util.c: complete the kernel-doc function descriptions
gcov: clang: use correct function param names
latencytop: use correct kernel-doc format for func params
minmax.h: remove some #defines that are only expanded once
minmax.h: simplify the variants of clamp()
minmax.h: move all the clamp() definitions after the min/max() ones
minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()
minmax.h: reduce the #define expansion of min(), max() and clamp()
minmax.h: update some comments
minmax.h: add whitespace around operators and after commas
nilfs2: do not update mtime of renamed directory that is not moved
nilfs2: handle errors that nilfs_prepare_chunk() may return
CREDITS: fix spelling mistake
...
|
|
|
|
65ef17aa07 |
hung_task: add task->flags, blocked by coredump to log
Resending this patch as I haven't received feedback on my initial submission https://lore.kernel.org/all/20241204182953.10854-1-oxana@cloudflare.com/ For the processes which are terminated abnormally the kernel can provide a coredump if enabled. When the coredump is performed, the process and all its threads are put into the D state (TASK_UNINTERRUPTIBLE | TASK_FREEZABLE). On the other hand, we have kernel thread khungtaskd which monitors the processes in the D state. If the task stuck in the D state more than kernel.hung_task_timeout_secs, the hung_task alert appears in the kernel log. The higher memory usage of a process, the longer it takes to create coredump, the longer tasks are in the D state. We have hung_task alerts for the processes with memory usage above 10Gb. Although, our kernel.hung_task_timeout_secs is 10 sec when the default is 120 sec. Adding additional information to the log that the task is blocked by coredump will help with monitoring. Another approach might be to completely filter out alerts for such tasks, but in that case we would lose transparency about what is putting pressure on some system resources, e.g. we saw an increase in I/O when coredump occurs due its writing to disk. Additionally, it would be helpful to have task_struct->flags in the log from the function sched_show_task(). Currently it prints task_struct->thread_info->flags, this seems misleading as the line starts with "task:xxxx". [akpm@linux-foundation.org: fix printk control string] Link: https://lkml.kernel.org/r/20250110160328.64947-1-oxana@cloudflare.com Signed-off-by: Oxana Kharitonova <oxana@cloudflare.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ben Segall <bsegall@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
2279563e3a |
sched_ext: Include task weight in the error state dump
Report the task weight when dumping the task state during an error exit. Moreover, adjust the output format to display dsq_vtime, slice, and weight on the same line. This can help identify whether certain tasks were excessively prioritized or de-prioritized due to large niceness gaps. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
be8ee18152 |
sched_ext: Fixes typos in comments
Fixes some spelling errors in the comments. Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
bc8198dc7e |
sched_ext: Changes for v6.14
- scx_bpf_now() added so that BPF scheduler can access the cached timestamp in struct rq to avoid reading TSC multiple times within a locked scheduling operation. - Minor updates to the built-in idle CPU selection logic. - tool/sched_ext updates and other misc changes. Pulling sched_ext/for-6.14 into master causes a merge conflict between the following two commits (first commit in master, second in for-6.14): |
|
|
|
93940fbdc4 |
cpufreq/schedutil: Only bind threads if needed
Remove the unconditional binding of sugov kthreads to the affected CPUs if the cpufreq driver indicates that updates can happen from any CPU. This allows userspace to set affinities to either save power (waking up bigger CPUs on HMP can be expensive) or increasing performance (by letting the utilized CPUs run without preemption of the sugov kthread). Signed-off-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/5a8deed4-7764-4729-a9d4-9520c25fa7e8@arm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
|
|
|
f4b9d3bf44 |
Power management updates for 6.14-rc1
- Use str_enable_disable()-like helpers in cpufreq (Krzysztof
Kozlowski).
- Extend the Apple cpufreq driver to support more SoCs (Hector Martin,
Nick Chan).
- Add new cpufreq driver for Airoha SoCs (Christian Marangi).
- Fix using cpufreq-dt as module (Andreas Kemnade).
- Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter
Edwards, Sibi Sankar, Manivannan Sadhasivam).
- Fix the maximum supported frequency computation in the ACPI cpufreq
driver to avoid relying on unfounded assumptions (Gautham Shenoy).
- Fix an amd-pstate driver regression with preferred core rankings not
being used (Mario Limonciello).
- Fix a precision issue with frequency calculation in the amd-pstate
driver (Naresh Solanki).
- Add ftrace event to the amd-pstate driver for active mode (Mario
Limonciello).
- Set default EPP policy on Ryzen processors in amd-pstate (Mario
Limonciello).
- Clean up the amd-pstate cpufreq driver and optimize it to increase
code reuse (Mario Limonciello, Dhananjay Ugwekar).
- Use CPPC to get scaling factors between HWP performance levels and
frequency in the intel_pstate driver and make it stop using a
built-in scaling factor for Arrow Lake processors (Rafael Wysocki).
- Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN for
consistency with CPU offline (Christian Loehle).
- Fix superfluous updates caused by need_freq_update in the schedutil
cpufreq governor (Sultan Alsawaf).
- Allow configuring the system suspend-resume (DPM) watchdog to warn
earlier than panic (Douglas Anderson).
- Implement devm_device_init_wakeup() helper and introduce a device-
managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan).
- Remove direct inclusions of 'pm_wakeup.h' which should be only
included via 'device.h' (Wolfram Sang).
- Clean up two comments in the core system-wide PM code (Rafael
Wysocki, Randy Dunlap).
- Add Clearwater Forest processor support to the intel_idle cpuidle
driver (Artem Bityutskiy).
- Clean up the Exynos devfreq driver and devfreq core (Markus Elfring,
Jeongjun Park).
- Minor cleanups and fixes for OPP (Dan Carpenter, Neil Armstrong, Joe
Hattori).
- Implement dev_pm_opp_get_bw() (Neil Armstrong).
- Expose OPP reference counting helpers for Rust (Viresh Kumar).
- Fix TSC MHz calculation in cpupower (He Rongguang).
- Add install and uninstall options to bindings Makefile and add header
changes for cpufreq.h to SWIG bindings in cpupower (John B. Wyatt IV).
- Add missing residency header changes in cpuidle.h to SWIG bindings in
cpupower (John B. Wyatt IV).
- Add output files to .gitignore and clean them up in "make clean" in
selftests/cpufreq (Li Zhijian).
- Fix cross-compilation in cpupower Makefile (Peng Fan).
- Revise the is_valid flag handling for idle_monitor in the cpupower
utility (wangfushuai).
- Extend and clean up AMD processors support in cpupower (Mario
Limonciello).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmeOthsSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxqQsP/ivDt8nqDnxdKB7cKFQIsEK+tl0RnFVD
o5regvYeRcGWpUXuMaqBtTmCMjsB8bUkcj2yLquM54ubjHAGF6zJuw9ZytMPHVcC
b2xk3RCFlXSBFXVK8eOh3XRviA9nGhuY97ZnPsQOlvoECrxT2xyeL+mWo7s+t+q9
2NUH+yfRoi5FM+nqqDhsm0xXxJuPaNg6eAjIASuMjXap48rNk3L5kW6W/6nw7i0I
xQWd/pKLHaI5e7DRF/QdMKu8+Fm4BbN0jMqLblKPOmTe9KggvBkck5q1Um20sYkJ
vdKMAT02ClGavIC7DtY092Xik84NZfID4ZUchS6e2hJIQ3Uaw/eDvAo/jlT8gIzq
fnXPdApRIzQGDvMxFaAsKaGlwxiVlAGHPDSTH6MVWzsp+1DSkbloSwVPAfeYIn44
Jhov+6Ydux3597sSjo+YmD58acimXl7urVuk8P6m3U5+gb8/jlgbxpIn+vbxH3Ka
o44Vt7axD63gezOQY134sj5gic5JL0GuZovOlvzrF6+FsjvVqcax6FZ4n3uIXu7P
C1nwai+Wdzo7wvuz7RfO0g15Y15wYLQLYsRq/osRlf+sOmGVv7nA9tSzZ0LUdD5D
Pp6PxppF6anM0Kjen8Ppuu+Bcr11JfVvhnVTJqhs6u71XdAy4TnG1JjL4lPWYJ4D
Gfz2hyPNjiQX
=AoMC
-----END PGP SIGNATURE-----
Merge tag 'pm-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"The majority of changes here are cpufreq updates which are dominated
by amd-pstate driver changes, like in the previous cycle. Moreover,
changes related to amd-pstate are also the majority of cpupower
utility updates.
Included are some pieces of new hardware support, like the addition of
Clearwater Forest processors support to intel_idle, new cpufreq driver
for Airoha SoCs, and Apple cpufreq driver extensions to support more
SoCs. The intel_pstate driver is also extended to be able to support
new platforms by using ACPI CPPC to compute scaling factors between
HWP performance states and frequency.
The rest is mostly fixes and cleanups in assorted pieces of power
management code.
Specifics:
- Use str_enable_disable()-like helpers in cpufreq (Krzysztof
Kozlowski)
- Extend the Apple cpufreq driver to support more SoCs (Hector
Martin, Nick Chan)
- Add new cpufreq driver for Airoha SoCs (Christian Marangi)
- Fix using cpufreq-dt as module (Andreas Kemnade)
- Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter
Edwards, Sibi Sankar, Manivannan Sadhasivam)
- Fix the maximum supported frequency computation in the ACPI cpufreq
driver to avoid relying on unfounded assumptions (Gautham Shenoy)
- Fix an amd-pstate driver regression with preferred core rankings
not being used (Mario Limonciello)
- Fix a precision issue with frequency calculation in the amd-pstate
driver (Naresh Solanki)
- Add ftrace event to the amd-pstate driver for active mode (Mario
Limonciello)
- Set default EPP policy on Ryzen processors in amd-pstate (Mario
Limonciello)
- Clean up the amd-pstate cpufreq driver and optimize it to increase
code reuse (Mario Limonciello, Dhananjay Ugwekar)
- Use CPPC to get scaling factors between HWP performance levels and
frequency in the intel_pstate driver and make it stop using a
built-in scaling factor for Arrow Lake processors (Rafael Wysocki)
- Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN
for consistency with CPU offline (Christian Loehle)
- Fix superfluous updates caused by need_freq_update in the schedutil
cpufreq governor (Sultan Alsawaf)
- Allow configuring the system suspend-resume (DPM) watchdog to warn
earlier than panic (Douglas Anderson)
- Implement devm_device_init_wakeup() helper and introduce a device-
managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan)
- Remove direct inclusions of 'pm_wakeup.h' which should be only
included via 'device.h' (Wolfram Sang)
- Clean up two comments in the core system-wide PM code (Rafael
Wysocki, Randy Dunlap)
- Add Clearwater Forest processor support to the intel_idle cpuidle
driver (Artem Bityutskiy)
- Clean up the Exynos devfreq driver and devfreq core (Markus
Elfring, Jeongjun Park)
- Minor cleanups and fixes for OPP (Dan Carpenter, Neil Armstrong,
Joe Hattori)
- Implement dev_pm_opp_get_bw() (Neil Armstrong)
- Expose OPP reference counting helpers for Rust (Viresh Kumar)
- Fix TSC MHz calculation in cpupower (He Rongguang)
- Add install and uninstall options to bindings Makefile and add
header changes for cpufreq.h to SWIG bindings in cpupower (John B.
Wyatt IV)
- Add missing residency header changes in cpuidle.h to SWIG bindings
in cpupower (John B. Wyatt IV)
- Add output files to .gitignore and clean them up in "make clean" in
selftests/cpufreq (Li Zhijian)
- Fix cross-compilation in cpupower Makefile (Peng Fan)
- Revise the is_valid flag handling for idle_monitor in the cpupower
utility (wangfushuai)
- Extend and clean up AMD processors support in cpupower (Mario
Limonciello)"
* tag 'pm-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (67 commits)
PM / OPP: Add reference counting helpers for Rust implementation
PM: sleep: wakeirq: Introduce device-managed variant of dev_pm_set_wake_irq()
cpufreq: Use str_enable_disable()-like helpers
cpufreq: airoha: Add EN7581 CPUFreq SMCCC driver
PM: sleep: Allow configuring the DPM watchdog to warn earlier than panic
PM: sleep: convert comment from kernel-doc to plain comment
cpufreq: ACPI: Fix max-frequency computation
pm: cpupower: Add missing residency header changes in cpuidle.h to SWIG
PM / devfreq: exynos: remove unused function parameter
OPP: OF: Fix an OF node leak in _opp_add_static_v2()
cpufreq/amd-pstate: Refactor max frequency calculation
cpufreq/amd-pstate: Fix prefcore rankings
pm: cpupower: Add header changes for cpufreq.h to SWIG bindings
cpufreq: sparc: change kzalloc to kcalloc
cpufreq: qcom: Implement clk_ops::determine_rate() for qcom_cpufreq* clocks
cpufreq: qcom: Fix qcom_cpufreq_hw_recalc_rate() to query LUT if LMh IRQ is not available
cpufreq: apple-soc: Add Apple A7-A8X SoC cpufreq support
cpufreq: apple-soc: Set fallback transition latency to APPLE_DVFS_TRANSITION_TIMEOUT
cpufreq: apple-soc: Increase cluster switch timeout to 400us
cpufreq: apple-soc: Use 32-bit read for status register
...
|
|
|
|
1d6d399223 |
Kthread affinity follows one of 4 existing patterns:
1) Per-CPU kthreads must stay affine to a single CPU and never execute
relevant code on any other CPU. This is currently handled by smpboot
code which takes care of CPU-hotplug operations. Affinity here is
a correctness constraint.
2) Some kthreads _have_ to be affine to a specific set of CPUs and can't
run anywhere else. The affinity is set through kthread_bind_mask()
and the subsystem takes care by itself to handle CPU-hotplug
operations. Affinity here is assumed to be a correctness constraint.
3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This
is not a correctness constraint but merely a preference in terms of
memory locality. kswapd and kcompactd both fall into this category.
The affinity is set manually like for any other task and CPU-hotplug
is supposed to be handled by the relevant subsystem so that the task
is properly reaffined whenever a given CPU from the node comes up.
Also care should be taken so that the node affinity doesn't cross
isolated (nohz_full) cpumask boundaries.
4) Similar to the previous point except kthreads have a _preferred_
affinity different than a node. Both RCU boost kthreads and RCU
exp kworkers fall into this category as they refer to "RCU nodes"
from a distinctly distributed tree.
Currently the preferred affinity patterns (3 and 4) have at least 4
identified users, with more or less success when it comes to handling
CPU-hotplug operations and CPU isolation. Each of them does it in its own
ad-hoc way.
This is an infrastructure proposal to handle this with the following API
changes:
- kthread_create_on_node() automatically affines the created kthread to
its target node unless it has been set as per-cpu or bound with
kthread_bind[_mask]() before the first wake-up.
- kthread_affine_preferred() is a new function that can be called right
after kthread_create_on_node() to specify a preferred affinity
different than the specified node.
When the preferred affinity can't be applied because the possible
targets are offline or isolated (nohz_full), the kthread is affine
to the housekeeping CPUs (which means to all online CPUs most of the
time or only the non-nohz_full CPUs when nohz_full= is set).
kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
converted, along with a few old drivers.
Summary of the changes:
* Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu()
* Introduce task_cpu_fallback_mask() that defines the default last
resort affinity of a task to become nohz_full aware
* Add some correctness checks to ensure kthread_bind() is always called
before the first kthread wake-up.
* Default affine kthread to its preferred node.
* Convert kswapd / kcompactd and remove their halfway working ad-hoc
affinity implementation
* Implement kthreads preferred affinity
* Unify kthread worker and kthread API's style
* Convert RCU kthreads to the new API and remove the ad-hoc affinity
implementation.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmeNf8gACgkQhSRUR1CO
jHedQQ/+IxTjjqQiItzrq41TES2S0desHDq8lNJFb7rsR/DtKFyLx3s67cOYV+cM
Yx54QHg2m/Fz4nXMQ7Po5ygOtJGCKBc5C5QQy7y0lVKeTQK+daDfEtBSa3oG7j3C
u+E3tTY6qxkbCzymUyaKkHN4/ay2vLvjFS50luV7KMyI3x47Aji+t7VdCX4LCPP2
eAwOALWD0+7qLJ/VF6gsmQLKA4Qx7PQAzBa3KSBmUN9UcN8Gk1bQHCTIQKDHP9LQ
v8BXrNZtYX1o2+snNYpX2z6/ECjxkdwriOgqqZY5306hd9RAQ1u46Dx3byrIqjGn
ULG/XQ2istPyhTqb/h+RbrobdOcwEUIeqk8hRRbBXE8bPpqUz9EMuaCMxWDbQjgH
NTuKG4ifKJ/IqstkkuDkdOiByE/ysMmwqrTXgSnu2ITNL9yY3BEgFbvA95hgo42s
f7QCxEfZb1MHcNEMENSMwM3xw5lLMGMpxVZcMQ3gLwyotMBRrhFZm1qZJG7TITYW
IDIeCbH4JOMdQwLs3CcWTXio0N5/85NhRNFV+IDn96OrgxObgnMtV8QwNgjXBAJ5
wGeJWt8s34W1Zo3qS9gEuVzEhW4XaxISQQMkHe8faKkK6iHmIB/VjSQikDwwUNQ/
AspYj82RyWBCDZsqhiYh71kpxjvS6Xp0bj39Ce1sNsOnuksxKkQ=
=g8In
-----END PGP SIGNATURE-----
Merge tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthread updates from Frederic Weisbecker:
"Kthreads affinity follow either of 4 existing different patterns:
1) Per-CPU kthreads must stay affine to a single CPU and never
execute relevant code on any other CPU. This is currently handled
by smpboot code which takes care of CPU-hotplug operations.
Affinity here is a correctness constraint.
2) Some kthreads _have_ to be affine to a specific set of CPUs and
can't run anywhere else. The affinity is set through
kthread_bind_mask() and the subsystem takes care by itself to
handle CPU-hotplug operations. Affinity here is assumed to be a
correctness constraint.
3) Per-node kthreads _prefer_ to be affine to a specific NUMA node.
This is not a correctness constraint but merely a preference in
terms of memory locality. kswapd and kcompactd both fall into this
category. The affinity is set manually like for any other task and
CPU-hotplug is supposed to be handled by the relevant subsystem so
that the task is properly reaffined whenever a given CPU from the
node comes up. Also care should be taken so that the node affinity
doesn't cross isolated (nohz_full) cpumask boundaries.
4) Similar to the previous point except kthreads have a _preferred_
affinity different than a node. Both RCU boost kthreads and RCU
exp kworkers fall into this category as they refer to "RCU nodes"
from a distinctly distributed tree.
Currently the preferred affinity patterns (3 and 4) have at least 4
identified users, with more or less success when it comes to handling
CPU-hotplug operations and CPU isolation. Each of them does it in its
own ad-hoc way.
This is an infrastructure proposal to handle this with the following
API changes:
- kthread_create_on_node() automatically affines the created kthread
to its target node unless it has been set as per-cpu or bound with
kthread_bind[_mask]() before the first wake-up.
- kthread_affine_preferred() is a new function that can be called
right after kthread_create_on_node() to specify a preferred
affinity different than the specified node.
When the preferred affinity can't be applied because the possible
targets are offline or isolated (nohz_full), the kthread is affine to
the housekeeping CPUs (which means to all online CPUs most of the time
or only the non-nohz_full CPUs when nohz_full= is set).
kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
converted, along with a few old drivers.
Summary of the changes:
- Consolidate a bunch of ad-hoc implementations of
kthread_run_on_cpu()
- Introduce task_cpu_fallback_mask() that defines the default last
resort affinity of a task to become nohz_full aware
- Add some correctness checks to ensure kthread_bind() is always
called before the first kthread wake-up.
- Default affine kthread to its preferred node.
- Convert kswapd / kcompactd and remove their halfway working ad-hoc
affinity implementation
- Implement kthreads preferred affinity
- Unify kthread worker and kthread API's style
- Convert RCU kthreads to the new API and remove the ad-hoc affinity
implementation"
* tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
kthread: modify kernel-doc function name to match code
rcu: Use kthread preferred affinity for RCU exp kworkers
treewide: Introduce kthread_run_worker[_on_cpu]()
kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format
rcu: Use kthread preferred affinity for RCU boost
kthread: Implement preferred affinity
mm: Create/affine kswapd to its preferred node
mm: Create/affine kcompactd to its preferred node
kthread: Default affine kthread to its preferred NUMA node
kthread: Make sure kthread hasn't started while binding it
sched,arm64: Handle CPU isolation on last resort fallback rq selection
arm64: Exclude nohz_full CPUs from 32bits el0 support
lib: test_objpool: Use kthread_run_on_cpu()
kallsyms: Use kthread_run_on_cpu()
soc/qman: test: Use kthread_run_on_cpu()
arm/bL_switcher: Use kthread_run_on_cpu()
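To make the new interface concrete, here is a minimal sketch of how a caller might combine kthread_create_on_node() with kthread_affine_preferred() as described in the pull message above; my_thread_fn and start_worker are illustrative names, not code from the series:

```c
#include <linux/cpumask.h>
#include <linux/err.h>
#include <linux/jiffies.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static int my_thread_fn(void *data)
{
	while (!kthread_should_stop())
		schedule_timeout_interruptible(HZ);
	return 0;
}

static struct task_struct *start_worker(int node, const struct cpumask *pref)
{
	struct task_struct *tsk;

	/* The created kthread is now affined to @node by default... */
	tsk = kthread_create_on_node(my_thread_fn, NULL, node, "my_worker");
	if (IS_ERR(tsk))
		return tsk;

	/*
	 * ...unless a preferred affinity is requested before the first
	 * wake-up; offline/isolated targets fall back to housekeeping CPUs.
	 */
	kthread_affine_preferred(tsk, pref);

	wake_up_process(tsk);
	return tsk;
}
```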
|
|
|
|
62de6e1685 |
Scheduler enhancements for v6.14:
- Fair scheduler (SCHED_FAIR) enhancements:
- Behavioral improvements:
- Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)
- Delayed-dequeue enhancements & fixes: (Vincent Guittot)
- Rename h_nr_running into h_nr_queued
- Add new cfs_rq.h_nr_runnable
- Use the new cfs_rq.h_nr_runnable
- Remove unused cfs_rq.h_nr_delayed
- Rename cfs_rq.idle_h_nr_running into h_nr_idle
- Remove unused cfs_rq.idle_nr_running
- Rename cfs_rq.nr_running into nr_queued
- Do not try to migrate delayed dequeue task
- Fix variable declaration position
- Encapsulate set custom slice in a __setparam_fair() function
- Fixes:
- Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
- Fix CPU bandwidth limit bypass during CPU hotplug (Vishal Chourasia)
- Cleanups:
- Clean up in migrate_degrades_locality() to improve
readability (Peter Zijlstra)
- Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
- Update comments after sched_tick() rename (Sebastian Andrzej Siewior)
- Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
(Valentin Schneider)
- Deadline scheduler (SCHED_DL) enhancements:
- Restore dl_server bandwidth on non-destructive root domain
changes (Juri Lelli)
- Correctly account for allocated bandwidth during
hotplug (Juri Lelli)
- Check bandwidth overflow earlier for hotplug (Juri Lelli)
- Clean up goto label in pick_earliest_pushable_dl_task()
(John Stultz)
- Consolidate timer cancellation (Wander Lairson Costa)
- Load-balancer enhancements:
- Improve performance by prioritizing migrating eligible
tasks in sched_balance_rq() (Hao Jia)
- Do not compute NUMA Balancing stats unnecessarily during
load-balancing (K Prateek Nayak)
- Do not compute overloaded status unnecessarily during
load-balancing (K Prateek Nayak)
- Generic scheduling code enhancements:
- Use READ_ONCE() in task_on_rq_queued(), to consistently use
the WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)
- Isolated CPUs support enhancements: (Waiman Long)
- Make "isolcpus=nohz" equivalent to "nohz_full"
- Consolidate housekeeping cpumasks that are always identical
- Remove HK_TYPE_SCHED
- Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
- RSEQ enhancements:
- Validate read-only fields under DEBUG_RSEQ config
(Mathieu Desnoyers)
- PSI enhancements:
- Fix race when task wakes up before psi_sched_switch()
adjusts flags (Chengming Zhou)
- IRQ time accounting performance enhancements: (Yafang Shao)
- Define sched_clock_irqtime as static key
- Don't account irq time if sched_clock_irqtime is disabled
- Virtual machine scheduling enhancements:
- Don't try to catch up excess steal time (Suleiman Souhlal)
- Heterogeneous x86 CPU scheduling enhancements: (K Prateek Nayak)
- Convert "sysctl_sched_itmt_enabled" to boolean
- Use guard() for itmt_update_mutex
- Move the "sched_itmt_enabled" sysctl to debugfs
- Remove x86_smt_flags and use cpu_smt_flags directly
- Use x86_sched_itmt_flags for PKG domain unconditionally
- Debugging code & instrumentation enhancements:
- Change need_resched warnings to pr_err() (David Rientjes)
- Print domain name in /proc/schedstat (K Prateek Nayak)
- Fix value reported by hot tasks pulled in /proc/schedstat (Peter Zijlstra)
- Report the different kinds of imbalances in /proc/schedstat (Swapnil Sapkal)
- Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
- Update Schedstat version to 17 (Swapnil Sapkal)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmePSRcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hrdBAAjYiLl5Q8SHM0xnl+kbvuUkCTgEB/gSgA
mfrZtHRUgRZuA89NZ9NljlCkQSlsLTOjnpNuaeFzs529GMg9iemc99dbnz3BP5F3
V5qpYvWe7yIkJ3hd0TOGLmYEPMNQaAW57YBOrxcPjWNLJ4cr9iMdccVA1OQtcmqD
ZUh3nibv81QI8HDmT2G+figxEIqH3yBV1+SmEIxbrdkQpIJ5702Ng6+0KQK5TShN
xwjFELWZUl2TfkoCc4nkIpkImV6cI1DvXSw1xK6gbb1xEVOrsmFW3TYFw4trKHBu
2RBG4wtmzNjh+12GmSdIBJHogPNcay+JIJW9EG/unT7jirqzkkeP1X2eJEbh+X1L
CMa7GsD9Vy72jCzeJDMuiy7bKfG/MiKUtDXrAZQDo2atbw7H88QOzMuTE5a5WSV+
tRxXGI/dgFVOk+JQUfctfJbYeXjmG8GAflawvXtGDAfDZsja6M+65fH8p0AOgW1E
HHmXUzAe2E2xQBiSok/DYHPQeCDBAjoJvU93YhGiXv8UScb2UaD4BAfzfmc8P+Zs
Eox6444ah5U0jiXmZ3HU707n1zO+Ql4qKoyyMJzSyP+oYHE/Do7NYTElw2QovVdN
FX/9Uae8T4ttA/5lFe7FNoXgKvSxXDKYyKLZcysjVrWJF866Ui/TWtmxA6w8Osn7
sfucuLawLPM=
=5ZNW
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduler (SCHED_FAIR) enhancements:
- Behavioral improvements:
- Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)
- Delayed-dequeue enhancements & fixes: (Vincent Guittot)
- Rename h_nr_running into h_nr_queued
- Add new cfs_rq.h_nr_runnable
- Use the new cfs_rq.h_nr_runnable
- Remove unused cfs_rq.h_nr_delayed
- Rename cfs_rq.idle_h_nr_running into h_nr_idle
- Remove unused cfs_rq.idle_nr_running
- Rename cfs_rq.nr_running into nr_queued
- Do not try to migrate delayed dequeue task
- Fix variable declaration position
- Encapsulate set custom slice in a __setparam_fair() function
- Fixes:
- Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
- Fix CPU bandwidth limit bypass during CPU hotplug (Vishal
Chourasia)
- Cleanups:
- Clean up in migrate_degrades_locality() to improve readability
(Peter Zijlstra)
- Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
- Update comments after sched_tick() rename (Sebastian Andrzej
Siewior)
- Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
(Valentin Schneider)
Deadline scheduler (SCHED_DL) enhancements:
- Restore dl_server bandwidth on non-destructive root domain changes
(Juri Lelli)
- Correctly account for allocated bandwidth during hotplug (Juri
Lelli)
- Check bandwidth overflow earlier for hotplug (Juri Lelli)
- Clean up goto label in pick_earliest_pushable_dl_task() (John
Stultz)
- Consolidate timer cancellation (Wander Lairson Costa)
Load-balancer enhancements:
- Improve performance by prioritizing migrating eligible tasks in
sched_balance_rq() (Hao Jia)
- Do not compute NUMA Balancing stats unnecessarily during
load-balancing (K Prateek Nayak)
- Do not compute overloaded status unnecessarily during
load-balancing (K Prateek Nayak)
Generic scheduling code enhancements:
- Use READ_ONCE() in task_on_rq_queued(), to consistently use the
WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)
Isolated CPUs support enhancements: (Waiman Long)
- Make "isolcpus=nohz" equivalent to "nohz_full"
- Consolidate housekeeping cpumasks that are always identical
- Remove HK_TYPE_SCHED
- Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
RSEQ enhancements:
- Validate read-only fields under DEBUG_RSEQ config (Mathieu
Desnoyers)
PSI enhancements:
- Fix race when task wakes up before psi_sched_switch() adjusts flags
(Chengming Zhou)
IRQ time accounting performance enhancements: (Yafang Shao)
- Define sched_clock_irqtime as static key
- Don't account irq time if sched_clock_irqtime is disabled
Virtual machine scheduling enhancements:
- Don't try to catch up excess steal time (Suleiman Souhlal)
Heterogeneous x86 CPU scheduling enhancements: (K Prateek Nayak)
- Convert "sysctl_sched_itmt_enabled" to boolean
- Use guard() for itmt_update_mutex
- Move the "sched_itmt_enabled" sysctl to debugfs
- Remove x86_smt_flags and use cpu_smt_flags directly
- Use x86_sched_itmt_flags for PKG domain unconditionally
Debugging code & instrumentation enhancements:
- Change need_resched warnings to pr_err() (David Rientjes)
- Print domain name in /proc/schedstat (K Prateek Nayak)
- Fix value reported by hot tasks pulled in /proc/schedstat (Peter
Zijlstra)
- Report the different kinds of imbalances in /proc/schedstat
(Swapnil Sapkal)
- Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
- Update Schedstat version to 17 (Swapnil Sapkal)"
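The READ_ONCE() item above refers to an accessor of roughly this shape (a sketch based on the description; the real helper lives in kernel/sched/sched.h):

```c
static inline int task_on_rq_queued(struct task_struct *p)
{
	/*
	 * Pair with the WRITE_ONCE() updates of ->on_rq so that lockless
	 * readers observe a single, consistent load of the field.
	 */
	return READ_ONCE(p->on_rq) == TASK_ON_RQ_QUEUED;
}
```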
* tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
rseq: Fix rseq unregistration regression
psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
sched, psi: Don't account irq time if sched_clock_irqtime is disabled
sched: Don't account irq time if sched_clock_irqtime is disabled
sched: Define sched_clock_irqtime as static key
sched/fair: Do not compute overloaded status unnecessarily during lb
sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
x86/itmt: Use guard() for itmt_update_mutex
x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
sched/debug: Change need_resched warnings to pr_err
sched/fair: Encapsulate set custom slice in a __setparam_fair() function
sched: Fix race between yield_to() and try_to_wake_up()
docs: Update Schedstat version to 17
sched/stats: Print domain name in /proc/schedstat
sched: Move sched domain name out of CONFIG_SCHED_DEBUG
sched: Report the different kinds of imbalances in /proc/schedstat
...
|
|
|
|
3429dd57f0 |
sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
entity is delayed irrespective of whether the entity corresponds to a
task or a cfs_rq.
Consider the following scenario:
root
/ \
A B (*) delayed since B is no longer eligible on root
| |
Task0 Task1 <--- dequeue_task_fair() - task blocks
When Task1 blocks (dequeue_entity() for task's se returns true),
dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
hierarchy of Task1. However, when the sched_entity corresponding to
cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the
hierarchy too leading to both dequeue_entity() and set_delayed()
decrementing h_nr_runnable for the dequeue of the same task.
A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
dequeue_entities() like below:
cfs_rq->h_nr_runnable -= h_nr_runnable;
SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
is consistently tripped when running wakeup intensive workloads like
hackbench in a cgroup.
This error is self-correcting since cfs_rqs are per-CPU and cannot
migrate. The entity is either picked for full dequeue or is requeued
when a task wakes up below it. Both of those paths call clear_delayed(),
which again increments h_nr_runnable of the hierarchy without
considering whether the entity corresponds to a task or not.
h_nr_runnable will eventually reflect the correct value however in the
interim, the incorrect values can still influence PELT calculation which
uses se->runnable_weight or cfs_rq->h_nr_runnable.
Since only delayed tasks take the early return path in
dequeue_entities() and enqueue_task_fair(), adjust the
h_nr_runnable in {set,clear}_delayed() only when a task is delayed as
this path skips the h_nr_* update loops and returns early.
For entities corresponding to cfs_rq, the h_nr_* update loop in the
caller will do the right thing.
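A sketch of the shape of that fix, paraphrased from the description above rather than quoted from the diff (clear_delayed() gets the mirror-image change):

```c
static void set_delayed(struct sched_entity *se)
{
	se->sched_delayed = 1;

	/*
	 * Only delayed *tasks* take the early-return path in
	 * dequeue_entities()/enqueue_task_fair() and skip the h_nr_*
	 * update loops; account for them here. For cfs_rq entities the
	 * caller's loop already adjusts h_nr_runnable, so bail out.
	 */
	if (!entity_is_task(se))
		return;

	for_each_sched_entity(se) {
		struct cfs_rq *cfs_rq = cfs_rq_of(se);

		cfs_rq->h_nr_runnable--;
		if (cfs_rq_throttled(cfs_rq))
			break;
	}
}
```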
Fixes:
|
|
|
|
a5c16f29a8 |
Merge branch 'pm-cpufreq'
Merge cpufreq updates for 6.14: - Use str_enable_disable()-like helpers in cpufreq (Krzysztof Kozlowski). - Extend the Apple cpufreq driver to support more SoCs (Hector Martin, Nick Chan). - Add new cpufreq driver for Airoha SoCs (Christian Marangi). - Fix using cpufreq-dt as module (Andreas Kemnade). - Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter Edwards, Sibi Sankar, Manivannan Sadhasivam). - Fix the maximum supported frequency computation in the ACPI cpufreq driver to avoid relying on unfounded assumptions (Gautham Shenoy). - Fix an amd-pstate driver regression with preferred core rankings not being used (Mario Limonciello). - Fix a precision issue with frequency calculation in the amd-pstate driver (Naresh Solanki). - Add ftrace event to the amd-pstate driver for active mode (Mario Limonciello). - Set default EPP policy on Ryzen processors in amd-pstate (Mario Limonciello). - Clean up the amd-pstate cpufreq driver and optimize it to increase code reuse (Mario Limonciello, Dhananjay Ugwekar). - Use CPPC to get scaling factors between HWP performance levels and frequency in the intel_pstate driver and make it stop using a built -in scaling factor for the Arrow Lake processor (Rafael Wysocki). - Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN for consistency with CPU offline (Christian Loehle). - Fix superfluous updates caused by need_freq_update in the schedutil cpufreq governor (Sultan Alsawaf). * pm-cpufreq: (40 commits) cpufreq: Use str_enable_disable()-like helpers cpufreq: airoha: Add EN7581 CPUFreq SMCCC driver cpufreq: ACPI: Fix max-frequency computation cpufreq/amd-pstate: Refactor max frequency calculation cpufreq/amd-pstate: Fix prefcore rankings cpufreq: sparc: change kzalloc to kcalloc cpufreq: qcom: Implement clk_ops::determine_rate() for qcom_cpufreq* clocks cpufreq: qcom: Fix qcom_cpufreq_hw_recalc_rate() to query LUT if LMh IRQ is not available cpufreq: apple-soc: Add Apple A7-A8X SoC cpufreq support cpufreq: apple-soc: Set fallback transition latency to APPLE_DVFS_TRANSITION_TIMEOUT cpufreq: apple-soc: Increase cluster switch timeout to 400us cpufreq: apple-soc: Use 32-bit read for status register cpufreq: apple-soc: Allow per-SoC configuration of APPLE_DVFS_CMD_PS1 cpufreq: apple-soc: Drop setting the PS2 field on M2+ dt-bindings: cpufreq: apple,cluster-cpufreq: Add A7-A11, T2 compatibles dt-bindings: cpufreq: Document support for Airoha EN7581 CPUFreq cpufreq: fix using cpufreq-dt as module cpufreq: scmi: Register for limit change notifications cpufreq: schedutil: Fix superfluous updates caused by need_freq_update cpufreq: intel_pstate: Use CPUFREQ_POLICY_UNKNOWN ... |
|
|
|
1225bb42b8 |
Merge branches 'pm-sleep', 'pm-cpuidle' and 'pm-em'
Merge updates related to system sleep, a cpuidle update and an Energy Model handling code update for 6.14-rc1: - Allow configuring the system suspend-resume (DPM) watchdog to warn earlier than panic (Douglas Anderson). - Implement devm_device_init_wakeup() helper and introduce a device- managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan). - Remove direct inclusions of 'pm_wakeup.h' which should be only included via 'device.h' (Wolfram Sang). - Clean up two comments in the core system-wide PM code (Rafael Wysocki, Randy Dunlap). - Add Clearwater Forest processor support to the intel_idle cpuidle driver (Artem Bityutskiy). - Move sched domains rebuild function from the schedutil cpufreq governor to the Energy Model handling code (Rafael Wysocki). * pm-sleep: PM: sleep: wakeirq: Introduce device-managed variant of dev_pm_set_wake_irq() PM: sleep: Allow configuring the DPM watchdog to warn earlier than panic PM: sleep: convert comment from kernel-doc to plain comment PM: wakeup: implement devm_device_init_wakeup() helper PM: sleep: sysfs: don't include 'pm_wakeup.h' directly PM: sleep: autosleep: don't include 'pm_wakeup.h' directly PM: sleep: Update stale comment in device_resume() * pm-cpuidle: intel_idle: add Clearwater Forest SoC support * pm-em: PM: EM: Move sched domains rebuild function from schedutil to EM |
|
|
|
8ff6d472ab |
- Do not adjust the weight of empty group entities and avoid scheduling
artifacts - Avoid scheduling lag by computing lag properly and thus address an EEVDF entity placement issue -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmeM2+8ACgkQEsHwGGHe VUrqHA/+PyPsILNfDFnaXD/Shv7kCgk3hH2YVrCnyIQn8fhTCL+K0luXRRmAPhl1 pEbRyEtVRXZM9WyHJ1r7VmnIHVanUKWcuZzAx8HT9b7Tkrqy/xSriH/jAV7FYG9h gIhWA4vXT34HG2MQHTecJ/4gvkeI5tV93nkZ6vwEwwN4pGxSvijPR3yr3yl286Nt BlHP9cXBWWsxTq4CYb2j5q9nmymKuiDmhD8jc1fVjDKxjmrNHEwX5mtvtRLGI1dR O4pjUEqYQ+TWIkX8ZIRBPzLd9b6750ncO/9yb24cY7Z3RMKov4wz8b819Dt77cTp XF+bT+8fsaKNRjkzPEseZU4OL16ZO+33Kcn+JoNPWvbONgFKusZ4qkCCwvWpiQlV 0Eddh1XjKBoXA1tR6VrREcUKQ2zWoXrn8tlHUTMfPuPq1jlbQNtzN7OWcR+WBy7L FiClWzWLv0fCeTXaAEcO9+/MN5z2lDQsJRiLtAlGh0JJU6/1H6IVX2tSZllkpR/R 5pyHCpYsh5cucx6cQPk+rp9O6PDYu2frMuJ8itWlQLHeSzjzOoeE6EwwZcMQG4xb UG/azMbmqWrtmZqCgNCBtq7RAkVRe0IuxWuCtV1VkatU+2RXdtV85tVKn/JG2KLW 05c8GTdQZgnoY65/TZHwudhkzytclynKrrvhfKFllAWnb8D8gyQ= =fZWm -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Do not adjust the weight of empty group entities and avoid scheduling artifacts - Avoid scheduling lag by computing lag properly and thus address an EEVDF entity placement issue * tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE sched/fair: Fix EEVDF entity placement bug causing scheduling lag |
|
|
|
423124ab97 | Merge back earlier cpufreq material for 6.14 | |
|
|
21641bd9a7 |
lazy tlb: fix hotplug exit race with MMU_LAZY_TLB_SHOOTDOWN
CPU unplug first calls __cpu_disable(), and that's where powerpc calls cleanup_cpu_mmu_context(), which clears this CPU from mm_cpumask() of all mms in the system. However this CPU may still be using a lazy tlb mm, and its mm_cpumask bit will be cleared from it. The CPU does not switch away from the lazy tlb mm until arch_cpu_idle_dead() calls idle_task_exit(). If that user mm exits in this window, it will not be subject to the lazy tlb mm shootdown and may be freed while in use as a lazy mm by the CPU that is being unplugged. cleanup_cpu_mmu_context() could be moved later, but it looks better to move the lazy tlb mm switching earlier. The problem with doing the lazy mm switching in idle_task_exit() is explained in commit |
|
|
|
d40797d672 |
kasan: make kasan_record_aux_stack_noalloc() the default behaviour
kasan_record_aux_stack_noalloc() was introduced to record a stack trace
without allocating memory in the process. It has been added to callers
which were invoked while a raw_spinlock_t was held. More and more callers
were identified and changed over time. Is it a good thing to have this
while functions try their best to do a lockless setup? The only
downside of having kasan_record_aux_stack() not allocate any memory is
that we end up without a stacktrace if stackdepot runs out of memory and
the same stacktrace was not recorded before. To quote Marco Elver from
https://lore.kernel.org/all/CANpmjNPmQYJ7pv1N3cuU8cP18u7PP_uoZD8YxwZd4jtbof9nVQ@mail.gmail.com/
| I'd be in favor, it simplifies things. And stack depot should be
| able to replenish its pool sufficiently in the "non-aux" cases
| i.e. regular allocations. Worst case we fail to record some
| aux stacks, but I think that's only really bad if there's a bug
| around one of these allocations. In general the probabilities
| of this being a regression are extremely small [...]
Make the kasan_record_aux_stack_noalloc() behaviour default as
kasan_record_aux_stack().
[bigeasy@linutronix.de: dressed the diff as patch]
Link: https://lkml.kernel.org/r/20241122155451.Mb2pmeyJ@linutronix.de
Fixes:
|
|
|
|
987ce79b52 |
sched_ext: fix kernel-doc warnings
Use the correct function parameter names and function names. Use the correct kernel-doc comment format for struct sched_ext_ops to eliminate a bunch of warnings. ext.c:1418: warning: Excess function parameter 'include_dead' description in 'scx_task_iter_next_locked' ext.c:7261: warning: expecting prototype for scx_bpf_dump(). Prototype was for scx_bpf_dump_bstr() instead ext.c:7352: warning: Excess function parameter 'flags' description in 'scx_bpf_cpuperf_set' ext.c:3150: warning: Function parameter or struct member 'in_fi' not described in 'scx_prio_less' ext.c:4711: warning: Function parameter or struct member 'dur_s' not described in 'scx_softlockup' ext.c:4775: warning: Function parameter or struct member 'bypass' not described in 'scx_ops_bypass' ext.c:7453: warning: Function parameter or struct member 'idle_mask' not described in 'scx_bpf_put_idle_cpumask' ext.c:209: warning: Incorrect use of kernel-doc format: * select_cpu - Pick the target CPU for a task which is being woken up ext.c:236: warning: Incorrect use of kernel-doc format: * enqueue - Enqueue a task on the BPF scheduler ext.c:251: warning: Incorrect use of kernel-doc format: * dequeue - Remove a task from the BPF scheduler ext.c:267: warning: Incorrect use of kernel-doc format: * dispatch - Dispatch tasks from the BPF scheduler and/or user DSQs ext.c:290: warning: Incorrect use of kernel-doc format: * tick - Periodic tick ext.c:300: warning: Incorrect use of kernel-doc format: * runnable - A task is becoming runnable on its associated CPU ext.c:327: warning: Incorrect use of kernel-doc format: * running - A task is starting to run on its associated CPU ext.c:335: warning: Incorrect use of kernel-doc format: * stopping - A task is stopping execution ext.c:346: warning: Incorrect use of kernel-doc format: * quiescent - A task is becoming not runnable on its associated CPU ext.c:366: warning: Incorrect use of kernel-doc format: * yield - Yield CPU ext.c:381: warning: Incorrect use of kernel-doc format: * core_sched_before - Task ordering for core-sched ext.c:399: warning: Incorrect use of kernel-doc format: * set_weight - Set task weight ext.c:408: warning: Incorrect use of kernel-doc format: * set_cpumask - Set CPU affinity ext.c:418: warning: Incorrect use of kernel-doc format: * update_idle - Update the idle state of a CPU ext.c:439: warning: Incorrect use of kernel-doc format: * cpu_acquire - A CPU is becoming available to the BPF scheduler ext.c:449: warning: Incorrect use of kernel-doc format: * cpu_release - A CPU is taken away from the BPF scheduler ext.c:461: warning: Incorrect use of kernel-doc format: * init_task - Initialize a task to run in a BPF scheduler ext.c:476: warning: Incorrect use of kernel-doc format: * exit_task - Exit a previously-running task from the system ext.c:485: warning: Incorrect use of kernel-doc format: * enable - Enable BPF scheduling for a task ext.c:494: warning: Incorrect use of kernel-doc format: * disable - Disable BPF scheduling for a task ext.c:504: warning: Incorrect use of kernel-doc format: * dump - Dump BPF scheduler state on error ext.c:512: warning: Incorrect use of kernel-doc format: * dump_cpu - Dump BPF scheduler state for a CPU on error ext.c:524: warning: Incorrect use of kernel-doc format: * dump_task - Dump BPF scheduler state for a runnable task on error ext.c:535: warning: Incorrect use of kernel-doc format: * cgroup_init - Initialize a cgroup ext.c:550: warning: Incorrect use of kernel-doc format: * cgroup_exit - Exit a cgroup ext.c:559: warning: Incorrect use of 
kernel-doc format: * cgroup_prep_move - Prepare a task to be moved to a different cgroup ext.c:574: warning: Incorrect use of kernel-doc format: * cgroup_move - Commit cgroup move ext.c:585: warning: Incorrect use of kernel-doc format: * cgroup_cancel_move - Cancel cgroup move ext.c:597: warning: Incorrect use of kernel-doc format: * cgroup_set_weight - A cgroup's weight is being changed ext.c:611: warning: Incorrect use of kernel-doc format: * cpu_online - A CPU became online ext.c:620: warning: Incorrect use of kernel-doc format: * cpu_offline - A CPU is going offline ext.c:633: warning: Incorrect use of kernel-doc format: * init - Initialize the BPF scheduler ext.c:638: warning: Incorrect use of kernel-doc format: * exit - Clean up after the BPF scheduler ext.c:648: warning: Incorrect use of kernel-doc format: * dispatch_max_batch - Max nr of tasks that dispatch() can dispatch ext.c:653: warning: Incorrect use of kernel-doc format: * flags - %SCX_OPS_* flags ext.c:658: warning: Incorrect use of kernel-doc format: * timeout_ms - The maximum amount of time, in milliseconds, that a ext.c:667: warning: Incorrect use of kernel-doc format: * exit_dump_len - scx_exit_info.dump buffer length. If 0, the default ext.c:673: warning: Incorrect use of kernel-doc format: * hotplug_seq - A sequence number that may be set by the scheduler to ext.c:682: warning: Incorrect use of kernel-doc format: * name - BPF scheduler's name ext.c:689: warning: Function parameter or struct member 'select_cpu' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'enqueue' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dequeue' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dispatch' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'tick' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'runnable' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'running' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'stopping' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'quiescent' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'yield' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'core_sched_before' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'set_weight' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'set_cpumask' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'update_idle' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_acquire' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_release' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'init_task' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'exit_task' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'enable' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'disable' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dump' not described in 'sched_ext_ops' 
ext.c:689: warning: Function parameter or struct member 'dump_cpu' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dump_task' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_init' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_exit' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_prep_move' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_move' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_cancel_move' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_set_weight' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_online' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_offline' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'init' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'exit' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dispatch_max_batch' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'flags' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'timeout_ms' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'exit_dump_len' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'hotplug_seq' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'name' not described in 'sched_ext_ops' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Changwoo Min <changwoo@igalia.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
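For reference, kernel-doc expects struct members to be documented as "@member:" lines inside a single comment block on the struct itself; the warnings above come from per-member comment blocks written in kernel-doc style instead. A minimal sketch of the expected form, using member names taken from the warnings (the member list and types are abridged and illustrative; the full definition lives in kernel/sched/ext.c):

```c
/**
 * struct sched_ext_ops - Operation table for BPF scheduler implementation
 * @select_cpu: Pick the target CPU for a task which is being woken up
 * @enqueue: Enqueue a task on the BPF scheduler
 * @dispatch_max_batch: Max nr of tasks that dispatch() can dispatch
 * @name: BPF scheduler's name
 */
struct sched_ext_ops {
	s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags);
	void (*enqueue)(struct task_struct *p, u64 enq_flags);
	u32 dispatch_max_batch;
	char name[128];		/* abridged for the sketch */
};
```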
|
|
|
7d9da04057 |
psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
When running hackbench in a cgroup with bandwidth throttling enabled,
the following PSI splat was observed:
psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4
When investigating the series of events leading up to the splat,
the following sequence was observed:
[008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
...
[008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
[008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
# CPU8 goes into newidle balance and releases the rq lock
...
# CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
[015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
[015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
[008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...
psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
for the blocked entity; however, with the introduction of DELAY_DEQUEUE,
the blocked task can wake up when newidle balance drops the runqueue lock
during __schedule().
If a task wakes before psi_sched_switch() adjusts the PSI flags, skip
any modifications in psi_enqueue() which would still see the flags of a
running task and not a blocked one. Instead, rely on psi_sched_switch()
to do the right thing.
Since the status returned by try_to_block_task() may no longer be true
by the time schedule reaches psi_sched_switch(), check if the task is
blocked or not using a combination of task_on_rq_queued() and
p->se.sched_delayed checks.
[ prateek: Commit message, testing, early bailout in psi_enqueue() ]
Fixes:
|
|
|
|
a6fd16148f |
sched, psi: Don't account irq time if sched_clock_irqtime is disabled
sched_clock_irqtime may be disabled due to the clock source. When disabled, irq_time_read() won't change over time, so there is nothing to account. We can save iterating the whole hierarchy on every tick and context switch. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20250103022409.2544-4-laoar.shao@gmail.com |
|
|
|
763a744e24 |
sched: Don't account irq time if sched_clock_irqtime is disabled
sched_clock_irqtime may be disabled due to the clock source, in which case IRQ time should not be accounted. Let's add a conditional check to avoid unnecessary logic. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250103022409.2544-3-laoar.shao@gmail.com |
|
|
|
8722903cbb |
sched: Define sched_clock_irqtime as static key
Since CPU time accounting is a performance-critical path, let's define sched_clock_irqtime as a static key to minimize potential overhead. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250103022409.2544-2-laoar.shao@gmail.com |
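A minimal sketch of the static-key pattern the commit describes; the enable/disable helper names are assumptions used for illustration.
```c
#include <linux/jump_label.h>

/* Sketch: the disabled case becomes a patched NOP instead of a load+test. */
DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime);

void enable_sched_clock_irqtime(void)          /* assumed helper name */
{
        static_branch_enable(&sched_clock_irqtime);
}

void disable_sched_clock_irqtime(void)         /* assumed helper name */
{
        static_branch_disable(&sched_clock_irqtime);
}

static inline bool irqtime_enabled(void)
{
        return static_branch_likely(&sched_clock_irqtime);
}
```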
|
|
|
3229adbe78 |
sched/fair: Do not compute overloaded status unnecessarily during lb
Only set sg_overloaded when computing sg_lb_stats() at the highest sched domain since rd->overloaded status is updated only when load balancing at the highest domain. While at it, move setting of sg_overloaded below idle_cpu() check since an idle CPU can never be overloaded. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20241223043407.1611-8-kprateek.nayak@amd.com |
|
|
|
0ac1ee9ebf |
sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
Aggregate nr_numa_running and nr_preferred_running when load balancing at NUMA domains only. While at it, also move the aggregation below the idle_cpu() check since an idle CPU cannot have any preferred tasks. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20241223043407.1611-7-kprateek.nayak@amd.com |
|
|
|
873199d27b |
sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
When the PLACE_LAG scheduling feature is enabled and
dst_cfs_rq->nr_queued is greater than 1, if a task is
ineligible (lag < 0) on the source cpu runqueue, it will
also be ineligible when it is migrated to the destination
cpu runqueue, because place_entity() preserves the task's
original equivalent lag. So if the task was ineligible
before, it will still be ineligible after migration.
So in sched_balance_rq(), we prioritize migrating eligible
tasks, and we soft-limit ineligible tasks, allowing them
to migrate only when nr_balance_failed is non-zero to
avoid load-balancing trying very hard to balance the load.
Below are some benchmark test results; from my testing,
this patch shows a slight improvement on hackbench.
Benchmark
=========
All of the benchmarks are done inside a normal cpu cgroup in a
clean environment with cpu turbo disabled. The test machine is a
single-NUMA 13th Gen Intel(R) Core(TM) i7-13700, 12 cores/24 HT.
Based on master
|
|
|
|
8061b9f5e1 |
sched/debug: Change need_resched warnings to pr_err
need_resched warnings, if enabled, are treated as WARNINGs. If kernel.panic_on_warn is enabled, then this causes a kernel panic. It's highly unlikely that a panic is desired for these warnings, only a stack trace is normally required to debug and resolve. Thus, switch need_resched warnings to simply be a printk with an associated stack trace so they are no longer in scope for panic_on_warn. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Acked-by: Josh Don <joshdon@google.com> Link: https://lkml.kernel.org/r/e8d52023-5291-26bd-5299-8bb9eb604929@google.com |
|
|
|
2cf9ac4007 |
sched/fair: Encapsulate set custom slice in a __setparam_fair() function
Similarly to dl, create a __setparam_fair() function to set parameters related to fair class and move it in the fair.c file. No functional changes expected Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20250110144656.484601-1-vincent.guittot@linaro.org |
|
|
|
5d808c78d9 |
sched: Fix race between yield_to() and try_to_wake_up()
We met a SCHED_WARN in set_next_buddy():
__warn_printk
set_next_buddy
yield_to_task_fair
yield_to
kvm_vcpu_yield_to [kvm]
...
After a short dig, we found that the rq lock held by yield_to() may not
be the lock of the rq that the target task belongs to. There is a race
window against try_to_wake_up().
CPU0 (in yield_to()): locks rq0 & rq1, double checks task_rq == p_rq, ok.
Meanwhile the target task, blocked on CPU1, is woken to CPU2: its pi_lock
and rq2 are taken and task_rq becomes rq2.
CPU0 then calls yield_to_task_fair() without holding the rq2 lock.
In this race window, yield_to() is operating on the task without holding
the correct lock. Fix this by taking the task's pi_lock first.
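A simplified sketch of the locking order described above (not the literal patch); the retry and cleanup details of the real yield_to() are omitted.
```c
/* Hold the target's pi_lock across the double rq lock so a concurrent
 * try_to_wake_up() cannot migrate @p to another rq in between. */
unsigned long flags;
struct rq *rq, *p_rq;

raw_spin_lock_irqsave(&p->pi_lock, flags);
rq = this_rq();
p_rq = task_rq(p);
double_rq_lock(rq, p_rq);

/* task_rq(p) can no longer change underneath us here ... */
/* ... curr->sched_class->yield_to_task(rq, p) ... */

double_rq_unlock(rq, p_rq);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
```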
Fixes:
|
|
|
|
66951e4860 |
sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE
Normally dequeue_entities() will continue to dequeue an empty group entity;
except DELAY_DEQUEUE changes things -- it retains empty entities such that they
might continue to compete and burn off some lag.
However, doing this results in update_cfs_group() re-computing the cgroup
weight 'slice' for an empty group, which it (rightly) figures isn't much at
all. This in turn means that the delayed entity is not competing at the
expected weight. Worse, the very low weight causes its lag to be inflated,
which combined with avg_vruntime() using scale_load_down(), leads to artifacts.
As such, don't adjust the weight for empty group entities and let them compete
at their original weight.
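A hedged sketch of the bail-out described above; the exact condition and the surrounding code of update_cfs_group() are simplified.
```c
static void update_cfs_group(struct sched_entity *se)
{
        struct cfs_rq *gcfs_rq = group_cfs_rq(se);

        /* Empty (but delayed) group: keep the original weight so the
         * delayed entity keeps competing and burning off its lag. */
        if (!gcfs_rq || !gcfs_rq->load.weight)
                return;

        /* ... otherwise recompute and apply the group shares as before ... */
}
```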
Fixes:
|
|
|
|
f65c64f311 |
delayacct: add delay min to record delay peak
Delay accounting can now calculate the average delay of processes, detect
the overall system load, and also record the 'delay max' to identify
potential abnormal delays. However, 'delay min' can help us identify
another useful delay peak. By comparing the difference between 'delay
max' and 'delay min', we can understand the optimization space for
latency, providing a reference for the optimization of latency
performance.
Use case
=========
bash-4.4# ./getdelays -d -t 242
print delayacct stats ON
TGID 242
CPU count real total virtual total delay total delay average delay max delay min
39 156000000 156576579 2111069 0.054ms 0.212296ms 0.031307ms
IO count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
SWAP count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
RECLAIM count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
THRASHING count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
COMPACT count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
WPCOPY count delay total delay average delay max delay min
156 11215873 0.072ms 0.207403ms 0.033913ms
IRQ count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
Link: https://lkml.kernel.org/r/20241220173105906EOdsPhzjMLYNJJBqgz1ga@zte.com.cn
Co-developed-by: Wang Yong <wang.yong12@zte.com.cn>
Signed-off-by: Wang Yong <wang.yong12@zte.com.cn>
Co-developed-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Co-developed-by: Kun Jiang <jiang.kun2@zte.com.cn>
Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Peilin He <he.peilin@zte.com.cn>
Cc: tuqiang <tu.qiang35@zte.com.cn>
Cc: ye xingchen <ye.xingchen@zte.com.cn>
Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
658eb5ab91 |
delayacct: add delay max to record delay peak
Introduce the use cases of delay max, which can help quickly detect
potential abnormal delays in the system and record the types and specific
details of delay spikes.
Problem
========
Delay accounting can track the average delay of processes to show
system workload. However, when a process experiences a significant
delay, perhaps a delay spike that adversely affects performance,
getdelays can only display the average system delay over a period
of time. Yet the average delay is unhelpful for diagnosing a delay peak.
It is not even possible to determine which type of delay has spiked,
as this information might be masked by the average delay.
Solution
=========
The 'delay max' can display the delay peak since the system's startup,
which records potential abnormal delays over time, including
the type of delay and the maximum delay. This is helpful for
quickly identifying crashes caused by delay.
Use case
=========
bash# ./getdelays -d -p 244
print delayacct stats ON
PID 244
CPU count real total virtual total delay total delay average delay max
68 192000000 213676651 705643 0.010ms 0.306381ms
IO count delay total delay average delay max
0 0 0.000ms 0.000000ms
SWAP count delay total delay average delay max
0 0 0.000ms 0.000000ms
RECLAIM count delay total delay average delay max
0 0 0.000ms 0.000000ms
THRASHING count delay total delay average delay max
0 0 0.000ms 0.000000ms
COMPACT count delay total delay average delay max
0 0 0.000ms 0.000000ms
WPCOPY count delay total delay average delay max
235 15648284 0.067ms 0.263842ms
IRQ count delay total delay average delay max
0 0 0.000ms 0.000000ms
[wang.yaxin@zte.com.cn: update docs and fix some spelling errors]
Link: https://lkml.kernel.org/r/20241213192700771XKZ8H30OtHSeziGqRVMs0@zte.com.cn
Link: https://lkml.kernel.org/r/20241203164848805CS62CQPQWG9GLdQj2_BxS@zte.com.cn
Co-developed-by: Wang Yong <wang.yong12@zte.com.cn>
Signed-off-by: Wang Yong <wang.yong12@zte.com.cn>
Co-developed-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Co-developed-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Peilin He <he.peilin@zte.com.cn>
Cc: tuqiang <tu.qiang35@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: ye xingchen <ye.xingchen@zte.com.cn>
Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
2e3f3090bd |
sched_ext: Fixes for v6.13-rc6
- Fix corner case bug where ops.dispatch() couldn't extend the execution of the current task if SCX_OPS_ENQ_LAST is set. - Fix ops.cpu_release() not being called when a SCX task is preempted by a higher priority sched class task. - Fix buitin idle mask being incorrectly left as busy after an idle CPU is picked and kicked. - scx_ops_bypass() was unnecessarily using rq_lock() which comes with rq pinning related sanity checks which could trigger spuriously. Switch to raw_spin_rq_lock(). -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ4Gmpw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGVntAP0b4i4PEIkupj9+i8ZzlwqvYX3gFJ7E4v3wmjDp 1VYdrAD/ZetrhrM+9RyyKpMIDFnN+xE6YbslBSlAzGzgfdsbXA0= =zGXi -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix corner case bug where ops.dispatch() couldn't extend the execution of the current task if SCX_OPS_ENQ_LAST is set. - Fix ops.cpu_release() not being called when a SCX task is preempted by a higher priority sched class task. - Fix buitin idle mask being incorrectly left as busy after an idle CPU is picked and kicked. - scx_ops_bypass() was unnecessarily using rq_lock() which comes with rq pinning related sanity checks which could trigger spuriously. Switch to raw_spin_rq_lock(). * tag 'sched_ext-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: idle: Refresh idle masks during idle-to-idle transitions sched_ext: switch class when preempted by higher priority scheduler sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass() sched_ext: keep running prev when prev->scx.slice != 0 |
|
|
|
a2a3374c47 |
sched_ext: idle: Refresh idle masks during idle-to-idle transitions
With the consolidation of put_prev_task/set_next_task(), see commit |
|
|
|
3a9910b590 |
sched_ext: Implement scx_bpf_now()
Returns a high-performance monotonically non-decreasing clock for the current CPU. The clock returned is in nanoseconds. It provides the following properties: 1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently to account for execution time and track tasks' runtime properties. Unfortunately, in some hardware platforms, bpf_ktime_get_ns() -- which eventually reads a hardware timestamp counter -- is neither performant nor scalable. scx_bpf_now() aims to provide a high-performance clock by using the rq clock in the scheduler core whenever possible. 2) High enough resolution for the BPF scheduler use cases: In most BPF scheduler use cases, the required clock resolution is lower than the most accurate hardware clock (e.g., rdtsc in x86). scx_bpf_now() basically uses the rq clock in the scheduler core whenever it is valid. It considers that the rq clock is valid from the time the rq clock is updated (update_rq_clock) until the rq is unlocked (rq_unpin_lock). 3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now() guarantees the clock never goes backward when comparing them in the same CPU. On the other hand, when comparing clocks in different CPUs, there is no such guarantee -- the clock can go backward. It provides a monotonically *non-decreasing* clock so that it would provide the same clock values in two different scx_bpf_now() calls in the same CPU during the same period of when the rq clock is valid. An rq clock becomes valid when it is updated using update_rq_clock() and invalidated when the rq is unlocked using rq_unpin_lock(). Let's suppose the following timeline in the scheduler core: T1. rq_lock(rq) T2. update_rq_clock(rq) T3. a sched_ext BPF operation T4. rq_unlock(rq) T5. a sched_ext BPF operation T6. rq_lock(rq) T7. update_rq_clock(rq) For [T2, T4), we consider that rq clock is valid (SCX_RQ_CLK_VALID is set), so scx_bpf_now() calls during [T2, T4) (including T3) will return the rq clock updated at T2. For duration [T4, T7), when a BPF scheduler can still call scx_bpf_now() (T5), we consider the rq clock is invalid (SCX_RQ_CLK_VALID is unset at T4). So when calling scx_bpf_now() at T5, we will return a fresh clock value by calling sched_clock_cpu() internally. Also, to prevent getting outdated rq clocks from a previous scx scheduler, invalidate all the rq clocks when unloading a BPF scheduler. One example of calling scx_bpf_now(), when the rq clock is invalid (like T5), is in scx_central [1]. The scx_central scheduler uses a BPF timer for preemptive scheduling. In every msec, the timer callback checks if the currently running tasks exceed their timeslice. At the beginning of the BPF timer callback (central_timerfn in scx_central.bpf.c), scx_central gets the current time. When the BPF timer callback runs, the rq clock could be invalid, the same as T5. In this case, scx_bpf_now() returns a fresh clock value rather than returning the old one (T2). [1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_central.bpf.c Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
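A behavioral sketch of the clock selection described above; the scx.clock and scx.prev_clock field names are assumptions used only for illustration, not the exact implementation.
```c
u64 scx_bpf_now(void)
{
        struct rq *rq = this_rq();
        u64 clk;

        if (rq->scx.flags & SCX_RQ_CLK_VALID)
                clk = rq->scx.clock;                    /* rq clock cached at update_rq_clock() */
        else
                clk = sched_clock_cpu(cpu_of(rq));      /* rq unlocked: take a fresh reading */

        /* never go backwards on the same CPU */
        if (clk < rq->scx.prev_clock)
                clk = rq->scx.prev_clock;
        rq->scx.prev_clock = clk;

        return clk;
}
```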
|
|
|
ea9b262627 |
sched_ext: Relocate scx_enabled() related code
scx_enabled() will be used in scx_rq_clock_update/invalidate() in the following patch, so relocate the scx_enabled() related code to the proper location. Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
6d71a9c616 |
sched/fair: Fix EEVDF entity placement bug causing scheduling lag
I noticed this in my traces today:
turbostat-1222 [006] d..2. 311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
{ weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } ->
{ weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 }
turbostat-1222 [006] d..2. 311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
{ weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } ->
{ weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 }
Which is a weight transition: 1048576 -> 2 -> 1048576.
One would expect the lag to shoot out *AND* come back, notably:
-1123*1048576/2 = -588775424
-588775424*2/1048576 = -1123
Except the trace shows it is all off. Worse, subsequent cycles shoot it
out further and further.
This made me have a very hard look at reweight_entity(), and
specifically the ->on_rq case, which is more prominent with
DELAY_DEQUEUE.
And indeed, it is all sorts of broken. While the computation of the new
lag is correct, the computation for the new vruntime, using the new lag
is broken for it does not consider the logic set out in place_entity().
With the below patch, I now see things like:
migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
{ weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
{ weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
{ weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
{ weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }
Which isn't perfect yet, but much closer.
Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes:
|
|
|
|
b04e317b52 |
treewide: Introduce kthread_run_worker[_on_cpu]()
kthread_create() creates a kthread without running it yet. kthread_run() creates a kthread and runs it. On the other hand, kthread_create_worker() creates a kthread worker and runs it. This difference in behaviours is confusing. Also there is no way to create a kthread worker and affine it using kthread_bind_mask() or kthread_affine_preferred() before starting it. Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]() that behaves just like kthread_run(). kthread_create_worker[_on_cpu]() will now only create a kthread worker without starting it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> |
|
|
|
3a5446612a |
sched,arm64: Handle CPU isolation on last resort fallback rq selection
When a kthread or any other task has an affinity mask that is fully offline or unallowed, the scheduler reaffines the task to all possible CPUs as a last resort. This default decision doesn't mix up very well with nohz_full CPUs that are part of the possible cpumask but don't want to be disturbed by unbound kthreads or even detached pinned user tasks. Make the fallback affinity setting aware of nohz_full. Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> |
|
|
|
68e449d849 |
sched_ext: switch class when preempted by higher priority scheduler
The ops.cpu_release() function, if defined, must be invoked when an SCX task is preempted by a higher priority scheduler class task. This scenario was skipped in commit |
|
|
|
6268d5bc10 |
sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass()
scx_ops_bypass() iterates all CPUs to re-enqueue all the scx tasks.
For each CPU, it acquires a lock using rq_lock() regardless of whether
a CPU is offline or the CPU is currently running a task in a higher
scheduler class (e.g., deadline). The rq_lock() is supposed to be used
for online CPUs, and the use of rq_lock() may trigger an unnecessary
warning in rq_pin_lock(). Therefore, replace rq_lock() with
raw_spin_rq_lock() in scx_ops_bypass().
Without this change, we observe the following warning:
===== START =====
[ 6.615205] rq->balance_callback && rq->balance_callback != &balance_push_callback
[ 6.615208] WARNING: CPU: 2 PID: 0 at kernel/sched/sched.h:1730 __schedule+0x1130/0x1c90
===== END =====
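A minimal sketch of the replacement described above; the body of the per-CPU loop is elided.
```c
/* The raw rq lock skips the rq-pinning sanity checks, which is what we
 * want here: the CPU may be offline or running a non-SCX class task. */
raw_spin_rq_lock_irqsave(rq, flags);
/* ... re-enqueue the scx tasks queued on this rq ... */
raw_spin_rq_unlock_irqrestore(rq, flags);
```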
Fixes:
|
|
|
|
30dd3b13f9 |
sched_ext: keep running prev when prev->scx.slice != 0
When %SCX_OPS_ENQ_LAST is set and prev->scx.slice != 0, @prev will be dispatched into the local DSQ in put_prev_task_scx(). However, pick_task_scx() is executed before put_prev_task_scx(), so it will not pick @prev. Set %SCX_RQ_BAL_KEEP in balance_one() to ensure that pick_task_scx() can pick @prev. Signed-off-by: Henry Huang <henry.hj@antgroup.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
382d7efc14 |
sched_ext: Include remaining task time slice in error state dump
Report the remaining time slice when dumping task information during an error exit. This information can be useful for tracking incorrect or excessively long time slices in schedulers that implement dynamic time slice logic. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
e4975ac535 |
sched_ext: update scx_bpf_dsq_insert() doc for SCX_DSQ_LOCAL_ON
With commit |
|
|
|
d9071ecb31 |
sched_ext: idle: small CPU iteration refactoring
Replace the loop to check if all SMT CPUs are idle with cpumask_subset(). This simplifies the code and slightly improves efficiency, while preserving the original behavior. Note that idle_masks.smt handling remains racy, which is acceptable as it serves as an optimization and is self-correcting. Suggested-and-reviewed-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
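A sketch of the check described above, assuming the idle_masks.cpu/idle_masks.smt naming used in the surrounding sched_ext idle code.
```c
const struct cpumask *smt = cpu_smt_mask(cpu);

/* The whole core is idle iff every SMT sibling is marked idle. */
if (cpumask_subset(smt, idle_masks.cpu))
        cpumask_or(idle_masks.smt, idle_masks.smt, smt);
```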
|
|
|
63676eefb7 |
sched_ext: Fixes for v6.13-rc5
- Fix the bug where bpf_iter_scx_dsq_new() was not initializing the iterator's flags and could inadvertently enable e.g. reverse iteration. - Fix the bug where scx_ops_bypass() could call irq_restore twice. - Add Andrea and Changwoo as maintainers for better review coverage. - selftests and tools/sched_ext build and other fixes. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ3hpXg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGS/lAQDOZDfcJtO1VEsLoPY9NhFHPuBDTfoJyjSi/4mh GsjgDAD/Sx0rN6C9S/+ToUjmq3FA+ft0m2+97VqgLwkzwA9YxwI= =jaZ6 -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix a bug where bpf_iter_scx_dsq_new() was not initializing the iterator's flags and could inadvertently enable e.g. reverse iteration - Fix a bug where scx_ops_bypass() could call irq_restore twice - Add Andrea and Changwoo as maintainers for better review coverage - selftests and tools/sched_ext build and other fixes * tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix dsq_local_on selftest sched_ext: initialize kit->cursor.flags sched_ext: Fix invalid irq restore in scx_ops_bypass() MAINTAINERS: add me as reviewer for sched_ext MAINTAINERS: add self as reviewer for sched_ext scx: Fix maximal BPF selftest prog sched_ext: fix application of sizeof to pointer selftests/sched_ext: fix build after renames in sched_ext API sched_ext: Add __weak to fix the build errors |
|
|
|
c0cf353009 |
sched_ext: idle: introduce check_builtin_idle_enabled() helper
Minor refactoring to add a helper function for checking if the built-in idle CPU selection policy is enabled. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
02f034dcbf |
sched_ext: idle: clarify comments
Add comments to clarify the usage of cpumask_intersects(). Moreover, update the scx_select_cpu_dfl() description, clarifying that the final step of the idle selection logic involves searching for any idle CPU in the system that the task can use. Reviewed-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
9cf9aceed2 |
sched_ext: idle: use assign_cpu() to update the idle cpumask
Use the assign_cpu() helper to set or clear the CPU in the idle mask, based on the idle condition. Acked-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
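A one-line sketch of the helper usage described above.
```c
/* Set or clear @cpu in the idle mask in one call, instead of an
 * open-coded cpumask_set_cpu()/cpumask_clear_cpu() pair. */
assign_cpu(cpu, idle_masks.cpu, idle);
```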
|
|
|
35bf430e08 |
sched_ext: initialize kit->cursor.flags
struct bpf_iter_scx_dsq *it may not be initialized.
If we don't call scx_bpf_dsq_move_set_vtime and scx_bpf_dsq_move_set_slice
before scx_bpf_dsq_move, it can cause unexpected behaviors:
1. Assign a huge slice into p->scx.slice
2. Assign an invalid vtime into p->scx.dsq_vtime
Signed-off-by: Henry Huang <henry.hj@antgroup.com>
Fixes:
|
|
|
|
bc3a116a44 |
sched_ext: Use str_enabled_disabled() helper in update_selcpu_topology()
Remove hard-coded strings by using the str_enabled_disabled() helper function. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org> |
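A hedged usage sketch; the message text below is illustrative, not the actual string printed by update_selcpu_topology().
```c
#include <linux/string_choices.h>

pr_debug("sched_ext: SMT-aware idle selection %s\n",
         str_enabled_disabled(sched_smt_active()));
```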
|
|
|
851daf833e | Merge back earlier cpufreq material for 6.14 | |
|
|
7c8cd569ff |
docs: Update Schedstat version to 17
Update the Schedstat version to 17 as more fields are added to report different kinds of imbalances in the sched domain. Also domain field started printing corresponding domain name. Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-7-swapnil.sapkal@amd.com |
|
|
|
011b3a14dc |
sched/stats: Print domain name in /proc/schedstat
Currently, there does not exist a straightforward way to extract the
names of the sched domains and match them to the per-cpu domain entry in
/proc/schedstat other than looking at the debugfs files which are only
visible after enabling "verbose" debug after commit
|
|
|
|
1c055a0f5d |
sched: Move sched domain name out of CONFIG_SCHED_DEBUG
The /proc/schedstat file shows cpu and sched domain level scheduler statistics. It does not show the domain name, only the domain level. It would be very useful for tools like `perf sched stats`[1] to aggregate domain level stats if domain names are shown in /proc/schedstat. But the sched domain name is guarded by CONFIG_SCHED_DEBUG. As per the discussion[2], move the sched domain name out of CONFIG_SCHED_DEBUG. [1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/ [2] https://lore.kernel.org/lkml/fcefeb4d-3acb-462d-9c9b-3df8d927e522@amd.com/ Suggested-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-5-swapnil.sapkal@amd.com |
|
|
|
3b2a793ea7 |
sched: Report the different kinds of imbalances in /proc/schedstat
In /proc/schedstat, lb_imbalance reports the sum of imbalances
discovered in sched domains with each call to sched_balance_rq(), which is
not very useful because lb_imbalance does not mention whether the imbalance
is due to load, utilization, nr_tasks or misfit_tasks. Remove this field
from /proc/schedstat.
Currently there is no field in /proc/schedstat to report different types
of imbalances. Introduce new fields in /proc/schedstat to report the
total imbalances in load, utilization, nr_tasks or misfit_tasks.
Added fields to /proc/schedstat:
- lb_imbalance_load: Total imbalance due to load.
- lb_imbalance_util: Total imbalance due to utilization.
- lb_imbalance_task: Total imbalance due to number of tasks.
- lb_imbalance_misfit: Total imbalance due to misfit tasks.
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20241220063224.17767-4-swapnil.sapkal@amd.com
|
|
|
|
c3856c9ce6 |
sched/fair: Cleanup in migrate_degrades_locality() to improve readability
migrate_degrades_locality() would return {1, 0, -1} respectively to
indicate that migration would degrade locality, would improve
locality, or would be ambivalent to locality improvements.
This patch improves readability by changing the return value to mean:
* Any positive value degrades locality
* 0 migration doesn't affect locality
* Any negative value improves locality
[Swapnil: Fixed comments around code and wrote commit log]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241220063224.17767-3-swapnil.sapkal@amd.com
|
|
|
|
a430d99e34 |
sched/fair: Fix value reported by hot tasks pulled in /proc/schedstat
In /proc/schedstat, lb_hot_gained reports the number of hot tasks pulled
during load balance. This value is incremented in can_migrate_task()
if the task is migratable and hot. After incrementing the value, the
load balancer can still decide not to migrate this task, leading to wrong
accounting. Fix this by incrementing the stats when hot tasks are detached.
This issue only exists in detach_tasks() where we can decide not to
migrate a hot task even if it is migratable. However, in detach_one_task(),
we migrate it unconditionally.
[Swapnil: Handled the case where nr_failed_migrations_hot was not accounted properly and wrote commit log]
Fixes:
|
|
|
|
ee8118c1f1 |
sched/fair: Update comments after sched_tick() rename.
scheduler_tick() was renamed to sched_tick() in
|
|
|
|
ebeeee390b |
PM: EM: Move sched domains rebuild function from schedutil to EM
Function sugov_eas_rebuild_sd() defined in the schedutil cpufreq governor implements generic functionality that may be useful in other places. In particular, there is a plan to use it in the intel_pstate driver in the future. For this reason, move it from schedutil to the energy model code and rename it to em_rebuild_sched_domains(). This also helps to get rid of some #ifdeffery in schedutil which is a plus. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> |
|
|
|
8e461a1cb4 |
cpufreq: schedutil: Fix superfluous updates caused by need_freq_update
A redundant frequency update is only truly needed when there is a policy
limits change with a driver that specifies CPUFREQ_NEED_UPDATE_LIMITS.
In spite of that, drivers specifying CPUFREQ_NEED_UPDATE_LIMITS receive a
frequency update _all the time_, not just for a policy limits change,
because need_freq_update is never cleared.
Furthermore, ignore_dl_rate_limit()'s usage of need_freq_update also leads
to a redundant frequency update, regardless of whether or not the driver
specifies CPUFREQ_NEED_UPDATE_LIMITS, when the next chosen frequency is the
same as the current one.
Fix the superfluous updates by only honoring CPUFREQ_NEED_UPDATE_LIMITS
when there's a policy limits change, and clearing need_freq_update when a
requisite redundant update occurs.
This is neatly achieved by moving up the CPUFREQ_NEED_UPDATE_LIMITS test
and instead setting need_freq_update to false in sugov_update_next_freq().
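A hedged sketch of the decision flow described above inside sugov_update_next_freq(); this is not the literal hunk.
```c
if (sg_policy->need_freq_update) {
        sg_policy->need_freq_update = false;
        /* Limits changed: only force a redundant update if the driver
         * explicitly asks for one via CPUFREQ_NEED_UPDATE_LIMITS. */
        if (next_freq == sg_policy->next_freq &&
            !cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS))
                return false;
} else if (next_freq == sg_policy->next_freq) {
        return false;   /* no limits change, same frequency: nothing to do */
}

sg_policy->next_freq = next_freq;
```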
Fixes:
|
|
|
|
af98d8a36a |
sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug
CPU controller limits are not properly enforced during CPU hotplug operations, particularly during CPU offline. When a CPU goes offline, throttled processes are unintentionally being unthrottled across all CPUs in the system, allowing them to exceed their assigned quota limits. Consider below for an example, Assigning 6.25% bandwidth limit to a cgroup in a 8 CPU system, where, workload is running 8 threads for 20 seconds at 100% CPU utilization, expected (user+sys) time = 10 seconds. $ cat /sys/fs/cgroup/test/cpu.max 50000 100000 $ ./ebizzy -t 8 -S 20 // non-hotplug case real 20.00 s user 10.81 s // intended behaviour sys 0.00 s $ ./ebizzy -t 8 -S 20 // hotplug case real 20.00 s user 14.43 s // Workload is able to run for 14 secs sys 0.00 s // when it should have only run for 10 secs During CPU hotplug, scheduler domains are rebuilt and cpu_attach_domain is called for every active CPU to update the root domain. That ends up calling rq_offline_fair which un-throttles any throttled hierarchies. Unthrottling should only occur for the CPU being hotplugged to allow its throttled processes to become runnable and get migrated to other CPUs. With current patch applied, $ ./ebizzy -t 8 -S 20 // hotplug case real 21.00 s user 10.16 s // intended behaviour sys 0.00 s This also has another symptom, when a CPU goes offline, and if the cfs_rq is not in throttled state and the runtime_remaining still had plenty remaining, it gets reset to 1 here, causing the runtime_remaining of cfs_rq to be quickly depleted. Note: hotplug operation (online, offline) was performed in while(1) loop v3: https://lore.kernel.org/all/20241210102346.228663-2-vishalc@linux.ibm.com v2: https://lore.kernel.org/all/20241207052730.1746380-2-vishalc@linux.ibm.com v1: https://lore.kernel.org/all/20241126064812.809903-2-vishalc@linux.ibm.com Suggested-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Tested-by: Samir Mulani <samir@linux.ibm.com> Link: https://lore.kernel.org/r/20241212043102.584863-2-vishalc@linux.ibm.com |
|
|
|
acd855a949 |
- Prevent incorrect dequeueing of the deadline dlserver helper task and fix
its time accounting - Properly track the CFS runqueue runnable stats - Check the total number of all queued tasks in a sched fair's runqueue hierarchy before deciding to stop the tick - Fix the scheduling of the task that got woken last (NEXT_BUDDY) by preventing those from being delayed -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmdexEsACgkQEsHwGGHe VUpFqA//SIIbNJEIQEwGkFrYpGwVpSISm94L4ENsrkWbJWQlALwQEBJF9Me/DOZH vHaX3o+cMxt26W7o0NKyPcvYtulnOr33HZA/uxK35MDaUinSA3Spt3jXHfR3n0mL ljNQQraWHGaJh7dzKMZoxP6DR78/Z0yotXjt33xeBFMSJuzGsklrbIiSJ6c4m/3u Y1lrQT8LncsxJMYIPAKtBAc9hvJfGFV6IOTaTfxP0oTuDo/2qTNVHm7to40wk3NW kb0lf2kjVtE6mwMfEm49rtjE3h0VnPJKGKoEkLi9IQoPbQq9Uf4i9VSmRe3zqPAz yBxV8BAu2koscMZzqw1CTnd9c/V+/A9qOOHfDo72I5MriJ1qVWCEsqB1y3u2yT6n XjwFDbPiVKI8H9YlsZpWERocCRypshevPNlYOF93PlK+YTXoMWaXMQhec5NDzLLw Se1K2sCi3U8BMdln0dH6nhk0unzNKQ8UKzrMFncSjnpWhpJ69uxyUZ/jL//6bvfi Z+7G4U54mUhGyOAaUSGH/20TnZRWJ7NJC542omFgg9v0VLxx+wnZyX4zJIV0jvRr 6voYmYDCO8zn/hO67VBJuei97ayIzxDNP1tVl15LzcvRcIGWNUPOwp5jijv8vDJG lJhQrMF6w4fgPItC20FvptlDvpP9cItSzyyOeg074HjDS53QN2Y= =jOb3 -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Prevent incorrect dequeueing of the deadline dlserver helper task and fix its time accounting - Properly track the CFS runqueue runnable stats - Check the total number of all queued tasks in a sched fair's runqueue hierarchy before deciding to stop the tick - Fix the scheduling of the task that got woken last (NEXT_BUDDY) by preventing those from being delayed * tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/dlserver: Fix dlserver time accounting sched/dlserver: Fix dlserver double enqueue sched/eevdf: More PELT vs DELAYED_DEQUEUE sched/fair: Fix sched_can_stop_tick() for fair tasks sched/fair: Fix NEXT_BUDDY |
|
|
|
e197f5ec3a |
sched_ext: Use sizeof_field for key_len in dsq_hash_params
Update the `dsq_hash_params` initialization to use `sizeof_field` for the `key_len` field instead of a hardcoded value. This improves code readability and ensures the key length dynamically matches the size of the `id` field in the `scx_dispatch_q` structure. Signed-off-by: Liang Jie <liangjie@lixiang.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
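A sketch of the initializer described above; the key_offset/head_offset members shown are assumptions added to make the example complete.
```c
static const struct rhashtable_params dsq_hash_params = {
        .key_len        = sizeof_field(struct scx_dispatch_q, id), /* was hard-coded */
        .key_offset     = offsetof(struct scx_dispatch_q, id),
        .head_offset    = offsetof(struct scx_dispatch_q, hash_node),
};
```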
|
|
|
c7f7e9c731 |
sched/dlserver: Fix dlserver time accounting
dlserver time is accounted when:
- dlserver is active and the dlserver proxies the cfs task.
- dlserver is active but deferred and cfs task runs after being picked
through the normal fair class pick.
dl_server_update is called in two places to make sure that both the
above times are accounted for. But it doesn't check if dlserver is
active or not. Now that we have this dl_server_active flag, we can
consolidate dl_server_update into one place and all we need to check is
whether dlserver is active or not. When dlserver is active there are only
two possible conditions:
- dlserver is deferred.
- cfs task is running on behalf of dlserver.
Fixes:
|
|
|
|
b53127db1d |
sched/dlserver: Fix dlserver double enqueue
dlserver can get dequeued during a dlserver pick_task due to the delayed
dequeue feature, and this can lead to issues with dlserver logic as it
still thinks that dlserver is on the runqueue. The dlserver throttling
and replenish logic gets confused and can lead to a double enqueue of
dlserver.
Double enqueue of dlserver could happen due to a couple of reasons:
Case 1
------
Delayed dequeue feature[1] can cause dlserver being stopped during a
pick initiated by dlserver:
__pick_next_task
pick_task_dl -> server_pick_task
pick_task_fair
pick_next_entity (if (sched_delayed))
dequeue_entities
dl_server_stop
server_pick_task goes ahead with update_curr_dl_se without knowing that
dlserver is dequeued and this confuses the logic and may lead to
unintended enqueue while the server is stopped.
Case 2
------
A race condition between a task dequeue on one cpu and the same task's
enqueue on this cpu by a remote cpu while the lock is released can cause
a dlserver double enqueue.
One cpu would be in the schedule() and releasing RQ-lock:
current->state = TASK_INTERRUPTIBLE();
schedule();
deactivate_task()
dl_stop_server();
pick_next_task()
pick_next_task_fair()
sched_balance_newidle()
rq_unlock(this_rq)
at which point another CPU can take our RQ-lock and do:
try_to_wake_up()
ttwu_queue()
rq_lock()
...
activate_task()
dl_server_start() --> first enqueue
wakeup_preempt() := check_preempt_wakeup_fair()
update_curr()
update_curr_task()
if (current->dl_server)
dl_server_update()
enqueue_dl_entity() --> second enqueue
This bug was not apparent as the enqueue in dl_server_start doesn't
usually happen because of the defer logic. But as a side effect of the
first case (dequeue during dlserver pick), dl_throttled and dl_yield will
be set, which causes the time accounting of dlserver to mess up and
then leads to an enqueue in dl_server_start.
Have an explicit flag representing the status of dlserver to avoid the
confusion. This is set in dl_server_start() and reset in dl_server_stop().
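A hedged sketch of the flag handling described above (and relied on by the time-accounting fix two entries up); the surrounding function bodies are elided.
```c
void dl_server_start(struct sched_dl_entity *dl_se)
{
        /* ... enqueue the (possibly deferred) server ... */
        dl_se->dl_server_active = 1;
}

void dl_server_stop(struct sched_dl_entity *dl_se)
{
        /* ... dequeue the server and cancel its timers ... */
        dl_se->dl_server_active = 0;
}

/* Callers such as dl_server_update() can then simply bail out early
 * when !dl_se->dl_server_active. */
```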
Fixes:
|
|
|
|
18b2093f45 |
sched_ext: Fix invalid irq restore in scx_ops_bypass()
While adding outer irqsave/restore locking, |
|
|
|
7675361ff9 |
sched: deadline: Cleanup goto label in pick_earliest_pushable_dl_task
Commit
|
|
|
|
df9e2102de |
- Remove wrong enqueueing of a task for a later wakeup when a task blocks on
a RT mutex - Do not setup a new deadline entity on a boosted task as that has happened already - Update preempt= kernel command line param - Prevent needless softirqd wakeups in the idle task's context - Detect the case where the idle load balancer CPU becomes busy and avoid unnecessary load balancing invocation - Remove an unnecessary load balancing need_resched() call in nohz_csd_func() - Allow for raising of SCHED_SOFTIRQ softirq type on RT but retain the warning to catch any other cases - Remove a wrong warning when a cpuset update makes the task affinity no longer a subset of the cpuset -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmdWvHYACgkQEsHwGGHe VUrJ6g//eEwmHpa9+at3UvXrVlcYQmZsQpgL2ksjVE0n4KXFeUavwCR4h6SJzvcD RDF9AyDuPAoCqy5DhL5wTXPG/4AnnISqAEkoP7h7YO76P7ks6+HD7t31pCF/uqCH yqS4vc1RJ6yW8otcCpR7rOPEQ49Klqc1KTFTNAFLc6MNEb/SVH5Ih+wFL5Mj/W3I UkBEtUy1oR2Q4QPhJ+0sr0LAI1AwjykdbkWzOhs6D1kPaRqdV4Atgc2fwioLIvhO s++lev9BmGx02dmrRWRmIBL9S9ycSLT1qx28sbzlS+PZMGYqOnImVOW5+EPr+ovK fILc0m8sLD6GyZHIPgeIT2+DqSvDTQOGQwXyUYmoarI+BWGGSz6iZGn4RrZHMRQo cpqYV9z7F2t3X1hPfhrH+40BXJeMMX+wd4ahXNA44QD6Bf7I+zPUfsrfnrR4BwV7 qpXhBzXOuZrgOKolIwJmHIxyLtd79idYccGvjIME5rwj8eBg0J7zmjzoVewqUXsb F9ualvq6twxUIdD4XiClpi+E16Z2Ot3PplNIohosVrUDRDUgvTBbTuDZnUuOkXbb wV26XKuYKQYfx5UfJBSYL3DCfCttkKCVrPX2oiqw6PKNXw9BM8BQIux+XQH2jvIg wOPqZWZf2VIoQJU2N+twc/BAIRAF7CNr/ioTJlXQ1hsOIlTp3kk= =XLf1 -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Remove wrong enqueueing of a task for a later wakeup when a task blocks on a RT mutex - Do not setup a new deadline entity on a boosted task as that has happened already - Update preempt= kernel command line param - Prevent needless softirqd wakeups in the idle task's context - Detect the case where the idle load balancer CPU becomes busy and avoid unnecessary load balancing invocation - Remove an unnecessary load balancing need_resched() call in nohz_csd_func() - Allow for raising of SCHED_SOFTIRQ softirq type on RT but retain the warning to catch any other cases - Remove a wrong warning when a cpuset update makes the task affinity no longer a subset of the cpuset * tag 'sched_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking: rtmutex: Fix wake_q logic in task_blocks_on_rt_mutex sched/deadline: Fix warning in migrate_enable for boosted tasks sched/core: Update kernel boot parameters for LAZY preempt. sched/core: Prevent wakeup of ksoftirqd during idle load balance sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel sched: fix warning in sched_setaffinity sched/deadline: Fix replenish_dl_new_period dl_server condition |
|
|
|
2a77e4be12 |
sched/fair: Untangle NEXT_BUDDY and pick_next_task()
There are 3 sites using set_next_buddy() and only one is conditional
on NEXT_BUDDY, the other two sites are unconditional; to note:
- yield_to_task()
- cgroup dequeue / pick optimization
However, having NEXT_BUDDY control both the wakeup-preemption and the
picking side of things means it's nearly useless.
Fixes:
|
|
|
|
95d9fed3a2 |
sched/fair: Mark m*_vruntime() with __maybe_unused
When max_vruntime() is unused, it prevents kernel builds with clang,
`make W=1` and CONFIG_WERROR=y:
kernel/sched/fair.c:526:19: error: unused function 'max_vruntime' [-Werror,-Wunused-function]
526 | static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime)
| ^~~~~~~~~~~~
Fix this by marking them with __maybe_unused (all cases for the sake of
symmetry).
See also commit
|
|
|
|
0429489e09 |
sched/fair: Fix variable declaration position
Move the variable declarations to the beginning of the function. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-12-vincent.guittot@linaro.org |
|
|
|
61b82dfb6b |
sched/fair: Do not try to migrate delayed dequeue task
Migrating a delayed dequeued task doesn't help in balancing the number of runnable tasks in the system. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-11-vincent.guittot@linaro.org |
|
|
|
736c55a02c |
sched/fair: Rename cfs_rq.nr_running into nr_queued
Rename cfs_rq.nr_running into cfs_rq.nr_queued which better reflects the reality as the value includes both the ready to run tasks and the delayed dequeue tasks. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-10-vincent.guittot@linaro.org |
|
|
|
43eef7c3a4 |
sched/fair: Remove unused cfs_rq.idle_nr_running
The cfs_rq.idle_nr_running field is not used anywhere, so we can remove the
useless associated computation. The last user went away in commit
|
|
|
|
31898e7b87 |
sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle
Use same naming convention as others starting with h_nr_* and rename idle_h_nr_running into h_nr_idle. The "running" is not correct anymore as it includes delayed dequeue tasks as well. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-8-vincent.guittot@linaro.org |
|
|
|
9216582b0b |
sched/fair: Removed unused cfs_rq.h_nr_delayed
h_nr_delayed is not used anymore. We now have: - h_nr_runnable which tracks tasks ready to run - h_nr_queued which tracks enqueued tasks either ready to run or delayed dequeue Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-7-vincent.guittot@linaro.org |
|
|
|
1a49104496 |
sched/fair: Use the new cfs_rq.h_nr_runnable
Use the new h_nr_runnable that tracks only queued and runnable tasks in the statistics that are used to balance the system: - PELT runnable_avg - deciding if a group is overloaded or has spare capacity - numa stats - reduced capacity management - load balance - nohz kick It should be noticed that the rq->nr_running still counts the delayed dequeued tasks as delayed dequeue is a fair feature that is meaningless at core level. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-6-vincent.guittot@linaro.org |
|
|
|
c2a295bffe |
sched/fair: Add new cfs_rq.h_nr_runnable
With delayed dequeued feature, a sleeping sched_entity remains queued in the rq until its lag has elapsed. As a result, it stays also visible in the statistics that are used to balance the system and in particular the field cfs.h_nr_queued when the sched_entity is associated to a task. Create a new h_nr_runnable that tracks only queued and runnable tasks. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-5-vincent.guittot@linaro.org |
|
|
|
7b8a702d94 |
sched/fair: Rename h_nr_running into h_nr_queued
With delayed dequeued feature, a sleeping sched_entity remains queued in the rq until its lag has elapsed but can't run. Rename h_nr_running into h_nr_queued to reflect this new behavior. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-4-vincent.guittot@linaro.org |
|
|
|
40c3b94fbb |
Merge branch 'sched/urgent'
Sync with urgent bits as a base for further work. Signed-off-by: Peter Zijlstra <peterz@infradead.org> |
|
|
|
76f2f78329 |
sched/eevdf: More PELT vs DELAYED_DEQUEUE
Vincent and Dietmar noted that while commit |
|
|
|
c1f43c342e |
sched/fair: Fix sched_can_stop_tick() for fair tasks
We can't stop the tick of a rq if there are at least 2 tasks enqueued in
the whole hierarchy and not only at the root cfs rq.
rq->cfs.nr_running tracks the number of sched_entity at one level
whereas rq->cfs.h_nr_running tracks all queued tasks in the
hierarchy.
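A minimal sketch of the hierarchy-wide check described above, not the literal hunk.
```c
/* More than one fair task queued anywhere in the hierarchy means the
 * tick is still needed for involuntary preemption. */
if (rq->cfs.h_nr_running > 1)
        return false;
```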
Fixes:
|
|
|
|
493afbd187 |
sched/fair: Fix NEXT_BUDDY
Adam reports that enabling NEXT_BUDDY insta triggers a WARN in
pick_next_entity().
Moving clear_buddies() up before the delayed dequeue bits ensures
no ->next buddy becomes delayed. Further ensure no new ->next buddy
ever starts as delayed.
Fixes:
|
|
|
|
5f1b64e9a9 |
sched/numa: fix memory leak due to the overwritten vma->numab_state
[Problem Description]
When running the hackbench program of LTP, the following memory leak is
reported by kmemleak.
# /opt/ltp/testcases/bin/hackbench 20 thread 1000
Running with 20*40 (== 800) tasks.
# dmesg | grep kmemleak
...
kmemleak: 480 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 665 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff888cd8ca2c40 (size 64):
comm "hackbench", pid 17142, jiffies 4299780315
hex dump (first 32 bytes):
ac 74 49 00 01 00 00 00 4c 84 49 00 01 00 00 00 .tI.....L.I.....
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc bff18fd4):
[<ffffffff81419a89>] __kmalloc_cache_noprof+0x2f9/0x3f0
[<ffffffff8113f715>] task_numa_work+0x725/0xa00
[<ffffffff8110f878>] task_work_run+0x58/0x90
[<ffffffff81ddd9f8>] syscall_exit_to_user_mode+0x1c8/0x1e0
[<ffffffff81dd78d5>] do_syscall_64+0x85/0x150
[<ffffffff81e0012b>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
...
This issue can be consistently reproduced on three different servers:
* a 448-core server
* a 256-core server
* a 192-core server
[Root Cause]
Since multiple threads are created by the hackbench program (along with
the command argument 'thread'), a shared vma might be accessed by two or
more cores simultaneously. When two or more cores observe that
vma->numab_state is NULL at the same time, vma->numab_state will be
overwritten.
Although the current code ensures that only one thread scans the VMAs in a
single 'numa_scan_period', there is a chance for another thread to enter
in the next 'numa_scan_period' before the first thread has reached the
numab_state allocation [1].
Note that the command `/opt/ltp/testcases/bin/hackbench 50 process 1000`
cannot reproduce the issue. This was verified with 200+ test runs.
[Solution]
Use the cmpxchg atomic operation to ensure that only one thread executes
the vma->numab_state assignment.
[1] https://lore.kernel.org/lkml/1794be3c-358c-4cdc-a43d-a1f841d91ef7@amd.com/
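A sketch of the cmpxchg publication described in the solution; error handling and the surrounding VMA-scan loop are elided.
```c
struct vma_numab_state *ptr = kzalloc(sizeof(*ptr), GFP_KERNEL);

if (!ptr)
        return;

/* Only the winner installs its allocation; a loser frees its copy
 * instead of overwriting (and leaking) the winner's. */
if (cmpxchg(&vma->numab_state, NULL, ptr))
        kfree(ptr);
```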
Link: https://lkml.kernel.org/r/20241113102146.2384-1-ahuang12@lenovo.com
Fixes:
|
|
|
|
4572541892 |
sched_ext: Use the NUMA scheduling domain for NUMA optimizations
Rely on the NUMA scheduling domain topology, instead of accessing NUMA
topology information directly.
There is basically no functional change, but in this way we ensure
consistent use of the same topology information determined by the
scheduling subsystem.
Fixes:
|
|
|
|
c907cd44a1 |
sched: Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
As all the non-domain and non-managed_irq housekeeping types have been unified to HK_TYPE_KERNEL_NOISE, replace all these references in the scheduler to use HK_TYPE_KERNEL_NOISE. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20241030175253.125248-5-longman@redhat.com |
|
|
|
6010d245dd |
sched/isolation: Consolidate housekeeping cpumasks that are always identical
The housekeeping cpumasks are only set by two boot commandline parameters: "nohz_full" and "isolcpus". When there is more than one of "nohz_full" or "isolcpus", the extra ones must have the same CPU list or the setup will fail partially. The HK_TYPE_DOMAIN and HK_TYPE_MANAGED_IRQ types are settable by "isolcpus" only and their settings can be independent of the other types. The other housekeeping types are all set by "nohz_full" or "isolcpus=nohz" without a way to set them individually. So they all have identical cpumasks. There is actually no point in having different cpumasks for these "nohz_full" only housekeeping types. Consolidate these types to use the same cpumask by aliasing them to the same value. If there is a need to set any of them independently in the future, we can break them out to their own cpumasks again. With this change, the number of cpumasks in the housekeeping structure drops from 9 to 3. Other than that, there should be no other functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20241030175253.125248-4-longman@redhat.com |
|
|
|
1174b9344b |
sched/isolation: Make "isolcpus=nohz" equivalent to "nohz_full"
The "isolcpus=nohz" boot parameter and flag were used to disable tick when running a single task. Nowsdays, this "nohz" flag is seldomly used as it is included as part of the "nohz_full" parameter. Extend this flag to cover other kernel noises disabled by the "nohz_full" parameter to make them equivalent. This also eliminates the need to use both the "isolcpus" and the "nohz_full" parameters to fully isolated a given set of CPUs. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20241030175253.125248-3-longman@redhat.com |
|
|
|
ae5c677729 |
sched/core: Remove HK_TYPE_SCHED
The HK_TYPE_SCHED housekeeping type is defined but not set anywhere. So any code that tries to use HK_TYPE_SCHED is essentially dead code. So remove HK_TYPE_SCHED and any code that uses it. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20241030175253.125248-2-longman@redhat.com |
|
|
|
a76328d44c |
sched/fair: Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
Andy reported that clang gets upset with CONFIG_CFS_BANDWIDTH=n: kernel/sched/fair.c:6580:20: error: unused function 'cfs_bandwidth_used' [-Werror,-Wunused-function] 6580 | static inline bool cfs_bandwidth_used(void) | ^~~~~~~~~~~~~~~~~~ Indeed, cfs_bandwidth_used() is only used within functions defined under CONFIG_CFS_BANDWIDTH=y. Remove its CONFIG_CFS_BANDWIDTH=n declaration & definition. Reported-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lore.kernel.org/r/20241127165501.160004-1-vschneid@redhat.com |
|
|
|
3a181f20fb |
sched/deadline: Consolidate Timer Cancellation
After commit
|
|
|
|
53916d5fd3 |
sched/deadline: Check bandwidth overflow earlier for hotplug
Currently we check for bandwidth overflow potentially due to hotplug operations at the end of sched_cpu_deactivate(), after the cpu going offline has already been removed from scheduling, active_mask, etc. This can create issues for DEADLINE tasks, as there is a substantial race window between the start of sched_cpu_deactivate() and the moment we possibly decide to roll-back the operation if dl_bw_deactivate() returns failure in cpuset_cpu_inactive(). An example is a throttled task that sees its replenishment timer firing while the cpu it was previously running on is considered offline, but before dl_bw_deactivate() had a chance to say no and roll-back happened. Fix this by directly calling dl_bw_deactivate() first thing in sched_cpu_deactivate() and do the required calculation in the former function considering the cpu passed as an argument as offline already. By doing so we also simplify sched_cpu_deactivate(), as there is no need anymore for any kind of roll-back if we fail early. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Tested-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/Zzc1DfPhbvqDDIJR@jlelli-thinkpadt14gen4.remote.csb |
|
|
|
d4742f6ed7 |
sched/deadline: Correctly account for allocated bandwidth during hotplug
For hotplug operations, DEADLINE needs to check that there is still enough bandwidth left after removing the CPU that is going offline. We however fail to do so currently. Restore the correct behavior by restructuring dl_bw_manage() a bit, so that overflow conditions (not enough bandwidth left) are properly checked. Also account for dl_server bandwidth, i.e. discount such bandwidth in the calculation since NORMAL tasks will be anyway moved away from the CPU as a result of the hotplug operation. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Tested-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20241114142810.794657-3-juri.lelli@redhat.com |
|
|
|
41d4200b71 |
sched/deadline: Restore dl_server bandwidth on non-destructive root domain changes
When root domain non-destructive changes (e.g., only modifying one of the existing root domains while the rest is not touched) happen we still need to clear DEADLINE bandwidth accounting so that it's then properly restored, taking into account DEADLINE tasks associated to each cpuset (associated to each root domain). After the introduction of dl_servers, we fail to restore such servers contribution after non-destructive changes (as they are only considered on destructive changes when runqueues are attached to the new domains). Fix this by making sure we iterate over the dl_servers attached to domains that have not been destroyed and add their bandwidth contribution back correctly. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Tested-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20241114142810.794657-2-juri.lelli@redhat.com |
|
|
|
59297e2093 |
sched: add READ_ONCE to task_on_rq_queued
task_on_rq_queued() reads p->on_rq without READ_ONCE(), even though p->on_rq is
set with WRITE_ONCE() in {activate|deactivate}_task() and smp_store_release()
in __block_task(), and is also read with READ_ONCE() in task_on_rq_migrating().
Make all of these accesses pair up by adding READ_ONCE() to
task_on_rq_queued().
Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20241114210812.1836587-1-jon@nutanix.com
|
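The resulting helper shape, a close sketch of the sched.h accessors after this change:

```c
/* Pairs with the WRITE_ONCE()/smp_store_release() writers listed above. */
static inline int task_on_rq_queued(struct task_struct *p)
{
	return READ_ONCE(p->on_rq) == TASK_ON_RQ_QUEUED;
}

static inline int task_on_rq_migrating(struct task_struct *p)
{
	return READ_ONCE(p->on_rq) == TASK_ON_RQ_MIGRATING;
}
```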
|
|
|
108ad09990 |
sched: Don't try to catch up excess steal time.
When steal time exceeds the measured delta when updating clock_task, we currently try to catch up the excess in future updates. However, this results in inaccurate run times for the future things using clock_task, in some situations, as they end up getting additional steal time that did not actually happen. This is because there is a window between reading the elapsed time in update_rq_clock() and sampling the steal time in update_rq_clock_task(). If the VCPU gets preempted between those two points, any additional steal time is accounted to the outgoing task even though the calculated delta did not actually contain any of that "stolen" time. When this race happens, we can end up with steal time that exceeds the calculated delta, and the previous code would try to catch up that excess steal time in future clock updates, which is given to the next, incoming task, even though it did not actually have any time stolen. This behavior is particularly bad when steal time can be very long, which we've seen when trying to extend steal time to contain the duration that the host was suspended [0]. When this happens, clock_task stays frozen, during which the running task stays running for the whole duration, since its run time doesn't increase. However the race can happen even under normal operation. Ideally we would read the elapsed cpu time and the steal time atomically, to prevent this race from happening in the first place, but doing so is non-trivial. Since the time between those two points isn't otherwise accounted anywhere, neither to the outgoing task nor the incoming task (because the "end of outgoing task" and "start of incoming task" timestamps are the same), I would argue that the right thing to do is to simply drop any excess steal time, in order to prevent these issues. [0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/ Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com |
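A standalone sketch of the resulting policy, kept in plain C rather than the actual update_rq_clock_task() code: clamp the sampled steal time to the measured delta and drop the excess instead of carrying it into future updates.

```c
#include <stdint.h>

/* Returns the nanoseconds credited to clock_task for this update. */
static uint64_t apply_steal_time(uint64_t delta_ns, uint64_t steal_ns)
{
	if (steal_ns > delta_ns)
		steal_ns = delta_ns;	/* drop the excess, no catch-up later */

	return delta_ns - steal_ns;
}
```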
|
|
|
0664e2c311 |
sched/deadline: Fix warning in migrate_enable for boosted tasks
When running the following command:
while true; do
stress-ng --cyclic 30 --timeout 30s --minimize --quiet
done
a warning is eventually triggered:
WARNING: CPU: 43 PID: 2848 at kernel/sched/deadline.c:794
setup_new_dl_entity+0x13e/0x180
...
Call Trace:
<TASK>
? show_trace_log_lvl+0x1c4/0x2df
? enqueue_dl_entity+0x631/0x6e0
? setup_new_dl_entity+0x13e/0x180
? __warn+0x7e/0xd0
? report_bug+0x11a/0x1a0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x14/0x70
? asm_exc_invalid_op+0x16/0x20
enqueue_dl_entity+0x631/0x6e0
enqueue_task_dl+0x7d/0x120
__do_set_cpus_allowed+0xe3/0x280
__set_cpus_allowed_ptr_locked+0x140/0x1d0
__set_cpus_allowed_ptr+0x54/0xa0
migrate_enable+0x7e/0x150
rt_spin_unlock+0x1c/0x90
group_send_sig_info+0xf7/0x1a0
? kill_pid_info+0x1f/0x1d0
kill_pid_info+0x78/0x1d0
kill_proc_info+0x5b/0x110
__x64_sys_kill+0x93/0xc0
do_syscall_64+0x5c/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f0dab31f92b
This warning occurs because set_cpus_allowed dequeues and enqueues tasks
with the ENQUEUE_RESTORE flag set. If the task is boosted, the warning
is triggered. A boosted task already had its parameters set by
rt_mutex_setprio, and a new call to setup_new_dl_entity is unnecessary,
hence the WARN_ON call.
Check if we are requeueing a boosted task and avoid calling
setup_new_dl_entity if that's the case.
Fixes:
|
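An illustrative sketch of the idea; the wrapper and the exact condition are assumptions, not the real enqueue path hunk. When a task is re-enqueued with ENQUEUE_RESTORE and the entity is PI-boosted, its parameters already came from rt_mutex_setprio(), so the "new entity" setup and its WARN_ON must be skipped:

```c
static void maybe_setup_new_dl_entity(struct sched_dl_entity *dl_se, int flags)
{
	/* Boosted entities were already configured via the PI path. */
	if ((flags & ENQUEUE_RESTORE) && is_dl_boosted(dl_se))
		return;

	setup_new_dl_entity(dl_se);
}
```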
|
|
|
e932c4ab38 |
sched/core: Prevent wakeup of ksoftirqd during idle load balance
The scheduler raises SCHED_SOFTIRQ to trigger a load balancing event
from the IPI handler on the idle CPU. If the SMP function is invoked
from an idle CPU via flush_smp_call_function_queue() then the HARD-IRQ
flag is not set and raise_softirq_irqoff() needlessly wakes ksoftirqd,
because the soft interrupt is handled before ksoftirqd ever gets on the CPU.
Adding a trace_printk() in nohz_csd_func() at the spot of raising
SCHED_SOFTIRQ and enabling trace events for sched_switch, sched_wakeup,
and softirq_entry (for SCHED_SOFTIRQ vector alone) helps observing the
current behavior:
<idle>-0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ from nohz_csd_func
<idle>-0 [000] dN.4.: sched_wakeup: comm=ksoftirqd/0 pid=16 prio=120 target_cpu=000
<idle>-0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED]
<idle>-0 [000] .Ns1.: softirq_exit: vec=7 [action=SCHED]
<idle>-0 [000] d..2.: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=16 next_prio=120
ksoftirqd/0-16 [000] d..2.: sched_switch: prev_comm=ksoftirqd/0 prev_pid=16 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
...
Use __raise_softirq_irqoff() to raise the softirq. The SMP function call
is always invoked on the requested CPU in an interrupt handler. It is
guaranteed that soft interrupts are handled at the end.
Following are the observations with the changes when enabling the same
set of events:
<idle>-0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ for nohz_idle_balance
<idle>-0 [000] dN.1.: softirq_raise: vec=7 [action=SCHED]
<idle>-0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED]
No unnecessary ksoftirqd wakeups are seen from idle task's context to
service the softirq.
Fixes:
|
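A sketch of the one-line switch described above (not a verbatim hunk): inside the SMP-function/IPI path the pending softirq is guaranteed to be processed on interrupt exit, so only the pending bit needs to be set.

```c
static void nohz_csd_func(void *info)
{
	/* ... validate the NOHZ kick flags, mark idle balance pending ... */

	/* was: raise_softirq_irqoff(SCHED_SOFTIRQ), which may wake ksoftirqd */
	__raise_softirq_irqoff(SCHED_SOFTIRQ);
}
```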
|
|
|
ff47a0acfc |
sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy
Commit |
|
|
|
ea9cffc0a1 |
sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
The need_resched() check currently in nohz_csd_func() can be traced back to its addition in scheduler_ipi() in 2011 via commit |
|
|
|
70ee7947a2 |
sched: fix warning in sched_setaffinity
Commit |
|
|
|
22368fe1f9 |
sched/deadline: Fix replenish_dl_new_period dl_server condition
The condition in replenish_dl_new_period() that checks if a reservation
(dl_server) is deferred and is not handling a starvation case is
obviously wrong.
Fix it.
Fixes:
|
|
|
|
8f7c8b88bd |
sched_ext: Change for v6.13
- Improve the default select_cpu() implementation making it topology aware
and handle WAKE_SYNC better.
- set_arg_maybe_null() was used to inform the verifier which ops args could
be NULL in a rather hackish way. Use the new __nullable CFI stub tags
instead.
- On Sapphire Rapids multi-socket systems, a BPF scheduler, by hammering on
the same queue across sockets, could live-lock the system to the point
where the system couldn't make reasonable forward progress. This could
lead to soft-lockup triggered resets or stalling out bypass mode switch
and thus BPF scheduler ejection for tens of minutes if not hours. After
trying a number of mitigations, the following set worked reliably:
- Injecting artificial cpu_relax() loops in two places while sched_ext is
trying to turn on the bypass mode.
- Triggering scheduler ejection when soft-lockup detection is imminent (a
quarter of threshold left).
While not the prettiest, the impact both in terms of code complexity and
overhead is minimal.
- A common complaint on the API is the overuse of the word "dispatch" and
the confusion around "consume". This is due to how the dispatch queues
became more generic over time. Rename the affected kfuncs for clarity.
Thanks to BPF's compatibility features, this change can be made in a way
that's both forward and backward compatible. The compatibility code will
be dropped in a few releases.
- Pull sched_ext/for-6.12-fixes to receive a prerequisite change. Other misc
changes.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZztuXA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGePUAP4nFTDaUDngVlxGv5hpYz8/Gcv1bPsWEydRRmH/
3F+pNgEAmGIGAEwFYfc9Zn8Kbjf0eJAduf2RhGRatQO6F/+GSwo=
=AcyC
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:
- Improve the default select_cpu() implementation making it topology
aware and handle WAKE_SYNC better.
- set_arg_maybe_null() was used to inform the verifier which ops args
could be NULL in a rather hackish way. Use the new __nullable CFI
stub tags instead.
- On Sapphire Rapids multi-socket systems, a BPF scheduler, by
hammering on the same queue across sockets, could live-lock the
system to the point where the system couldn't make reasonable forward
progress.
This could lead to soft-lockup triggered resets or stalling out
bypass mode switch and thus BPF scheduler ejection for tens of
minutes if not hours. After trying a number of mitigations, the
following set worked reliably:
- Injecting artificial cpu_relax() loops in two places while
sched_ext is trying to turn on the bypass mode.
- Triggering scheduler ejection when soft-lockup detection is
imminent (a quarter of threshold left).
While not the prettiest, the impact both in terms of code complexity
and overhead is minimal.
- A common complaint on the API is the overuse of the word "dispatch"
and the confusion around "consume". This is due to how the dispatch
queues became more generic over time. Rename the affected kfuncs for
clarity. Thanks to BPF's compatibility features, this change can be
made in a way that's both forward and backward compatible. The
compatibility code will be dropped in a few releases.
- Other misc changes
* tag 'sched_ext-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (21 commits)
sched_ext: Replace scx_next_task_picked() with switch_class() in comment
sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()
sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()
sched_ext: scx_bpf_dispatch_from_dsq_set_*() are allowed from unlocked context
sched_ext: add a missing rcu_read_lock/unlock pair at scx_select_cpu_dfl()
sched_ext: Clarify sched_ext_ops table for userland scheduler
sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
sched_ext: Avoid live-locking bypass mode switching
sched_ext: Fix incorrect use of bitwise AND
sched_ext: Do not enable LLC/NUMA optimizations when domains overlap
sched_ext: Introduce NUMA awareness to the default idle selection policy
sched_ext: Replace set_arg_maybe_null() with __nullable CFI stub tags
sched_ext: Rename CFI stubs to names that are recognized by BPF
sched_ext: Introduce LLC awareness to the default idle selection policy
sched_ext: Clarify ops.select_cpu() for single-CPU tasks
sched_ext: improve WAKE_SYNC behavior for default idle CPU selection
sched_ext: Use btf_ids to resolve task_struct
sched/ext: Use tg_cgroup() to elieminate duplicate code
sched/ext: Fix unmatch trailing comment of CONFIG_EXT_GROUP_SCHED
...
|
|
|
|
bf9aa14fc5 |
A rather large update for timekeeping and timers:
- The final step to get rid of auto-rearming posix-timers
posix-timers are currently auto-rearmed by the kernel when the signal
of the timer is ignored so that the timer signal can be delivered once
the corresponding signal is unignored.
This requires to throttle the timer to prevent a DoS by small intervals
and keeps the system pointlessly out of low power states for no value.
This is a long standing non-trivial problem due to the lock order of
posix-timer lock and the sighand lock along with life time issues as
the timer and the sigqueue have different life time rules.
Cure this by:
* Embedding the sigqueue into the timer struct to have the same life
time rules. Aside of that this also avoids the lookup of the timer
in the signal delivery and rearm path as it's just an always valid
container_of() now.
* Queuing ignored timer signals onto a separate ignored list.
* Moving queued timer signals onto the ignored list when the signal is
switched to SIG_IGN before it could be delivered.
* Walking the ignored list when SIG_IGN is lifted and requeue the
signals to the actual signal lists. This allows the signal delivery
code to rearm the timer.
This also required to consolidate the signal delivery rules so they are
consistent across all situations. With that all self test scenarios
finally succeed.
- Core infrastructure for VFS multigrain timestamping
This is required to allow the kernel to use coarse grained time stamps
by default and switch to fine grained time stamps when inode attributes
are actively observed via getattr().
These changes have been provided to the VFS tree as well, so that the
VFS specific infrastructure could be built on top.
- Cleanup and consolidation of the sleep() infrastructure
* Move all sleep and timeout functions into one file
* Rework udelay() and ndelay() into proper documented inline functions
and replace the hardcoded magic numbers by proper defines.
* Rework the fsleep() implementation to take the reality of the timer
wheel granularity on different HZ values into account. Right now the
boundaries are hard coded time ranges which fail to provide the
requested accuracy on different HZ settings.
* Update documentation for all sleep/timeout related functions and fix
up stale documentation links all over the place
* Fixup a few usage sites
- Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks
A system can have multiple PTP clocks which are participating in
separate and independent PTP clock domains. So far the kernel only
considers the PTP clock which is based on CLOCK TAI relevant as that's
the clock which drives the timekeeping adjustments via the various user
space daemons through adjtimex(2).
The non TAI based clock domains are accessible via the file descriptor
based posix clocks, but their usability is very limited. They can't be
accessed fast as they always go all the way out to the hardware and
they cannot be utilized in the kernel itself.
As Time Sensitive Networking (TSN) gains traction it is required to
provide fast user and kernel space access to these clocks.
The approach taken is to utilize the timekeeping and adjtimex(2)
infrastructure to provide this access in a similar way how the kernel
provides access to clock MONOTONIC, REALTIME etc.
Instead of creating a duplicated infrastructure this rework converts
timekeeping and adjtimex(2) into generic functionality which operates
on pointers to data structures instead of using static variables.
This allows to provide time accessors and adjtimex(2) functionality for
the independent PTP clocks in a subsequent step.
- Consolidate hrtimer initialization
hrtimers are set up by initializing the data structure and then
separately setting the callback function for historical reasons.
That's an extra unnecessary step and makes Rust support less straight
forward than it should be.
Provide a new set of hrtimer_setup*() functions and convert the core
code and a few usage sites of the less frequently used interfaces over.
The bulk of the hrtimer_init() to hrtimer_setup() conversion is already
prepared and scheduled for the next merge window.
- Drivers:
* Ensure that the global timekeeping clocksource is utilizing the
cluster 0 timer on MIPS multi-cluster systems.
Otherwise CPUs on different clusters use their cluster specific
clocksource which is not guaranteed to be synchronized with other
clusters.
* Mostly boring cleanups, fixes, improvements and code movement
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7kPITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZKkD/9OUL6fOJrDUmOYBa4QVeMyfTef4EaL
tvwIMM/29XQFeiq3xxCIn+EMnHjXn2lvIhYGQ7GKsbKYwvJ7ZBDpQb+UMhZ2nKI9
6D6BP6WomZohKeH2fZbJQAdqOi3KRYdvQdIsVZUexkqiaVPphRvOH9wOr45gHtZM
EyMRSotPlQTDqcrbUejDMEO94GyjDCYXRsyATLxjmTzL/N4xD4NRIiotjM2vL/a9
8MuCgIhrKUEyYlFoOxxeokBsF3kk3/ez2jlG9b/N8VLH3SYIc2zgL58FBgWxlmgG
bY71nVG3nUgEjxBd2dcXAVVqvb+5widk8p6O7xxOAQKTLMcJ4H0tQDkMnzBtUzvB
DGAJDHAmAr0g+ja9O35Pkhunkh4HYFIbq0Il4d1HMKObhJV0JumcKuQVxrXycdm3
UZfq3seqHsZJQbPgCAhlFU0/2WWScocbee9bNebGT33KVwSp5FoVv89C/6Vjb+vV
Gusc3thqrQuMAZW5zV8g4UcBAA/xH4PB0I+vHib+9XPZ4UQ7/6xKl2jE0kd5hX7n
AAUeZvFNFqIsY+B6vz+Jx/yzyM7u5cuXq87pof5EHVFzv56lyTp4ToGcOGYRgKH5
JXeYV1OxGziSDrd5vbf9CzdWMzqMvTefXrHbWrjkjhNOe8E1A8O88RZ5uRKZhmSw
hZZ4hdM9+3T7cg==
=2VC6
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"A rather large update for timekeeping and timers:
- The final step to get rid of auto-rearming posix-timers
posix-timers are currently auto-rearmed by the kernel when the
signal of the timer is ignored so that the timer signal can be
delivered once the corresponding signal is unignored.
This requires to throttle the timer to prevent a DoS by small
intervals and keeps the system pointlessly out of low power states
for no value. This is a long standing non-trivial problem due to
the lock order of posix-timer lock and the sighand lock along with
life time issues as the timer and the sigqueue have different life
time rules.
Cure this by:
- Embedding the sigqueue into the timer struct to have the same
life time rules. Aside of that this also avoids the lookup of
the timer in the signal delivery and rearm path as it's just an
always valid container_of() now.
- Queuing ignored timer signals onto a separate ignored list.
- Moving queued timer signals onto the ignored list when the
signal is switched to SIG_IGN before it could be delivered.
- Walking the ignored list when SIG_IGN is lifted and requeue the
signals to the actual signal lists. This allows the signal
delivery code to rearm the timer.
This also required to consolidate the signal delivery rules so they
are consistent across all situations. With that all self test
scenarios finally succeed.
- Core infrastructure for VFS multigrain timestamping
This is required to allow the kernel to use coarse grained time
stamps by default and switch to fine grained time stamps when inode
attributes are actively observed via getattr().
These changes have been provided to the VFS tree as well, so that
the VFS specific infrastructure could be built on top.
- Cleanup and consolidation of the sleep() infrastructure
- Move all sleep and timeout functions into one file
- Rework udelay() and ndelay() into proper documented inline
functions and replace the hardcoded magic numbers by proper
defines.
- Rework the fsleep() implementation to take the reality of the
timer wheel granularity on different HZ values into account.
Right now the boundaries are hard coded time ranges which fail
to provide the requested accuracy on different HZ settings.
- Update documentation for all sleep/timeout related functions
and fix up stale documentation links all over the place
- Fixup a few usage sites
- Rework of timekeeping and adjtimex(2) to prepare for multiple PTP
clocks
A system can have multiple PTP clocks which are participating in
separate and independent PTP clock domains. So far the kernel only
considers the PTP clock which is based on CLOCK TAI relevant as
that's the clock which drives the timekeeping adjustments via the
various user space daemons through adjtimex(2).
The non TAI based clock domains are accessible via the file
descriptor based posix clocks, but their usability is very limited.
They can't be accessed fast as they always go all the way out to
the hardware and they cannot be utilized in the kernel itself.
As Time Sensitive Networking (TSN) gains traction it is required to
provide fast user and kernel space access to these clocks.
The approach taken is to utilize the timekeeping and adjtimex(2)
infrastructure to provide this access in a similar way how the
kernel provides access to clock MONOTONIC, REALTIME etc.
Instead of creating a duplicated infrastructure this rework
converts timekeeping and adjtimex(2) into generic functionality
which operates on pointers to data structures instead of using
static variables.
This allows to provide time accessors and adjtimex(2) functionality
for the independent PTP clocks in a subsequent step.
- Consolidate hrtimer initialization
hrtimers are set up by initializing the data structure and then
separately setting the callback function for historical reasons.
That's an extra unnecessary step and makes Rust support less
straight forward than it should be.
Provide a new set of hrtimer_setup*() functions and convert the
core code and a few usage sites of the less frequently used
interfaces over.
The bulk of the hrtimer_init() to hrtimer_setup() conversion is
already prepared and scheduled for the next merge window.
- Drivers:
- Ensure that the global timekeeping clocksource is utilizing the
cluster 0 timer on MIPS multi-cluster systems.
Otherwise CPUs on different clusters use their cluster specific
clocksource which is not guaranteed to be synchronized with
other clusters.
- Mostly boring cleanups, fixes, improvements and code movement"
* tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
posix-timers: Fix spurious warning on double enqueue versus do_exit()
clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties
clocksource/drivers/gpx: Remove redundant casts
clocksource/drivers/timer-ti-dm: Fix child node refcount handling
dt-bindings: timer: actions,owl-timer: convert to YAML
clocksource/drivers/ralink: Add Ralink System Tick Counter driver
clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource
clocksource/drivers/timer-ti-dm: Don't fail probe if int not found
clocksource/drivers:sp804: Make user selectable
clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions
hrtimers: Delete hrtimer_init_on_stack()
alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
io_uring: Switch to use hrtimer_setup_on_stack()
sched/idle: Switch to use hrtimer_setup_on_stack()
hrtimers: Delete hrtimer_init_sleeper_on_stack()
wait: Switch to use hrtimer_setup_sleeper_on_stack()
timers: Switch to use hrtimer_setup_sleeper_on_stack()
net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack()
futex: Switch to use hrtimer_setup_sleeper_on_stack()
fs/aio: Switch to use hrtimer_setup_sleeper_on_stack()
...
|
|
|
|
3f020399e4 |
Scheduler changes for v6.13:
- Core facilities:
- Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which optimizes
fair-class preemption by delaying preemption requests to the
tick boundary, while working as full preemption for RR/FIFO/DEADLINE
classes. (Peter Zijlstra)
- x86: Enable Lazy preemption (Peter Zijlstra)
- riscv: Enable Lazy preemption (Jisheng Zhang)
- Initialize idle tasks only once (Thomas Gleixner)
- sched/ext: Remove sched_fork() hack (Thomas Gleixner)
- Fair scheduler:
- Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)
- Idle loop:
Optimize the generic idle loop by removing unnecessary
memory barrier (Zhongqiu Han)
- RSEQ:
- Improve cache locality of RSEQ concurrency IDs for
intermittent workloads (Mathieu Desnoyers)
- Waitqueues:
- Make wake_up_{bit,var} less fragile (Neil Brown)
- PSI:
- Pass enqueue/dequeue flags to psi callbacks directly (Johannes Weiner)
- Preparatory patches for proxy execution:
- core: Add move_queued_task_locked helper (Connor O'Brien)
- core: Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)
- core: Split out __schedule() deactivate task logic into a helper (John Stultz)
- core: Split scheduler and execution contexts (Peter Zijlstra)
- locking/mutex: Make mutex::wait_lock irq safe (Juri Lelli)
- locking/mutex: Expose __mutex_owner() (Juri Lelli)
- locking/mutex: Remove wakeups from under mutex::wait_lock (Peter Zijlstra)
- Misc fixes and cleanups:
- core: Remove unused __HAVE_THREAD_FUNCTIONS hook support (David Disseldorp)
- core: Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej Siewior)
- wait: Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)
- fair: remove the DOUBLE_TICK feature (Huang Shijie)
- fair: fix the comment for PREEMPT_SHORT (Huang Shijie)
- uclamp: Fix unnused variable warning (Christian Loehle)
- rt: No PREEMPT_RT=y for all{yes,mod}config
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmc7fnQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hZTBAAozVdWA2m51aNa67HvAZta/olmrIagVbW
inwbTgqa8b+UfeWEuKOfrZr5khjEh6pLgR3dBTib1uH6xxYj/Okds+qbPWSBPVLh
yzavlm/zJZM1U1XtxE3eyVfqWik4GrY7DoIMDQQr+YH7rNXonJeJkll38OI2E5MC
q3Q01qyMo8RJJX8qkf3f8ObOoP/51NsVniTw0Zb2fzEhXz8FjezLlxk6cMfgSkJG
lg9gfIwUZ7Xg5neRo4kJcc3Ht31KYOhWSiupBJzRD1hss/N/AybvMcTX/Cm8d07w
HIAdDDAn84o46miFo/a0V/hsJZ72idWbqxVJUCtaezrpOUiFkG+uInRvG/ynr0lF
5dEI9f+6PUw8Nc7L72IyHkobjPqS2IefSaxYYCBKmxMX2qrenfTor/pKiWzzhBIl
rX3MZSuUJ8NjV4rNGD/qXRM1IsMJrsDwxDyv+sRec3XdH33x286ds6aAUEPDQ6N7
96VS0sOKcNUJN8776ErNjlIxRl8HTlpkaO3nZlQIfXgTlXUpRvOuKbEWqP+606lo
oANgJTKgUhgJPWZnvmdRxDjSiOp93QcImjus9i1tN81FGiEDleONsJUxu2Di1E5+
s1nCiytjq+cdvzCqFyiOZUh+g6kSZ4yXxNgLg2UvbXzX1zOeUQT3WtyKUhMPXhU8
esh1TgbUbpE=
=Zcqj
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Core facilities:
- Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which
optimizes fair-class preemption by delaying preemption requests to
the tick boundary, while working as full preemption for
RR/FIFO/DEADLINE classes. (Peter Zijlstra)
- x86: Enable Lazy preemption (Peter Zijlstra)
- riscv: Enable Lazy preemption (Jisheng Zhang)
- Initialize idle tasks only once (Thomas Gleixner)
- sched/ext: Remove sched_fork() hack (Thomas Gleixner)
Fair scheduler:
- Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)
Idle loop:
- Optimize the generic idle loop by removing unnecessary memory
barrier (Zhongqiu Han)
RSEQ:
- Improve cache locality of RSEQ concurrency IDs for intermittent
workloads (Mathieu Desnoyers)
Waitqueues:
- Make wake_up_{bit,var} less fragile (Neil Brown)
PSI:
- Pass enqueue/dequeue flags to psi callbacks directly (Johannes
Weiner)
Preparatory patches for proxy execution:
- Add move_queued_task_locked helper (Connor O'Brien)
- Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)
- Split out __schedule() deactivate task logic into a helper (John
Stultz)
- Split scheduler and execution contexts (Peter Zijlstra)
- Make mutex::wait_lock irq safe (Juri Lelli)
- Expose __mutex_owner() (Juri Lelli)
- Remove wakeups from under mutex::wait_lock (Peter Zijlstra)
Misc fixes and cleanups:
- Remove unused __HAVE_THREAD_FUNCTIONS hook support (David
Disseldorp)
- Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej
Siewior)
- Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)
- remove the DOUBLE_TICK feature (Huang Shijie)
- fix the comment for PREEMPT_SHORT (Huang Shijie)
- Fix unnused variable warning (Christian Loehle)
- No PREEMPT_RT=y for all{yes,mod}config"
* tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
sched, x86: Update the comment for TIF_NEED_RESCHED_LAZY.
sched: No PREEMPT_RT=y for all{yes,mod}config
riscv: add PREEMPT_LAZY support
sched, x86: Enable Lazy preemption
sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
sched: Add Lazy preemption model
sched: Add TIF_NEED_RESCHED_LAZY infrastructure
sched/ext: Remove sched_fork() hack
sched: Initialize idle tasks only once
sched: psi: pass enqueue/dequeue flags to psi callbacks directly
sched/uclamp: Fix unnused variable warning
sched: Split scheduler and execution contexts
sched: Split out __schedule() deactivate task logic into a helper
sched: Consolidate pick_*_task to task_is_pushable helper
sched: Add move_queued_task_locked helper
locking/mutex: Expose __mutex_owner()
locking/mutex: Make mutex::wait_lock irq safe
locking/mutex: Remove wakeups from under mutex::wait_lock
sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
sched: idle: Optimize the generic idle loop by removing needless memory barrier
...
|
|
|
|
ad52c55e1d |
Power management updates for 6.13-rc1
- Update the amd-pstate driver to set the initial scaling frequency
policy lower bound to be the lowest non-linear frequency (Dhananjay
Ugwekar).
- Enable amd-pstate by default on servers starting with newer AMD Epyc
processors (Swapnil Sapkal).
- Align more codepaths between shared memory and MSR designs in
amd-pstate (Dhananjay Ugwekar).
- Clean up amd-pstate code to rename functions and remove redundant
calls (Dhananjay Ugwekar, Mario Limonciello).
- Do other assorted fixes and cleanups in amd-pstate (Dhananjay Ugwekar
and Mario Limonciello).
- Change the Balance-performance EPP value for Granite Rapids in the
intel_pstate driver to a more performance-biased one (Srinivas
Pandruvada).
- Simplify MSR read on the boot CPU in the ACPI cpufreq driver (Chang
S. Bae).
- Ensure sugov_eas_rebuild_sd() is always called when sugov_init()
succeeds to always enforce sched domains rebuild in case EAS needs
to be enabled (Christian Loehle).
- Switch cpufreq back to platform_driver::remove() (Uwe Kleine-König).
- Use proper frequency unit names in cpufreq (Marcin Juszkiewicz).
- Add a built-in idle states table for Granite Rapids Xeon D to the
intel_idle driver (Artem Bityutskiy).
- Fix some typos in comments in the cpuidle core and drivers (Shen
Lichuan).
- Remove iowait influence from the menu cpuidle governor (Christian
Loehle).
- Add min/max available performance state limits to the Energy Model
management code (Lukasz Luba).
- Update pm-graph to v5.13 (Todd Brandt).
- Add documentation for some recently introduced cpupower utility
options (Tor Vic).
- Make cpupower inform users where cpufreq-bench.conf should be located
when opening it fails (Peng Fan).
- Allow overriding cross-compiling env params in cpupower (Peng Fan).
- Add compile_commands.json to .gitignore in cpupower (John B. Wyatt
IV).
- Improve disable c_state block in cpupower bindings and add a test to
confirm that CPU state is disabled to it (John B. Wyatt IV).
- Add Chinese Simplified translation to cpupower (Kieran Moy).
- Add checks for xgettext and msgfmt to cpupower (Siddharth Menon).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmc3r6sSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxQMUQALNEbh/Ko1d+avq0sfvyPw18BZjEiQw7
M+L0GydLW6tXLYOrD+ZTASksdDhHbK0iuFr1Gca2cZi0Dl+1XF9sy70ITTqzCDIA
8qj1JrPmRYI0KXCfiSSke0W9fU18IdxVX3I7XezVqBl0ICzsroN5wliCkmEnVOU9
LQkw0fyYr7gev4GFEGSJ7WzfPxci0d6J9pYnafFlDEE28WpKz/cyOzYuSghX5lmG
ISHIVNIM6lqNgXyQirConvhrlg60XAyw5k5jqAYZbe78T+dqhH7lr9sDi7c4XxkG
syeiOOyjpiBMZv1rSjIUapi8AfJHyqH7B6KyTgiulIy31x8Dji62925B63CSahkM
AminAq0lYkqbhIcqEr4sW0JQ/oW3iX4cZ3TJXTUL+vFByR0ZF81tgQcXufhrcvBs
ViNugcX0q1vDX3lZsm9L6UHXN2yhUb36sgreUvbGfwnE79tuR/eUnAukTWBfXau/
TWnyDiQn1CjZcfHB+YAPYZNyUHHqjoIJwzfJLwnsaHgFA80YcSwfSC9kcogCawK1
NCyfs29lAccWsrOul5iARJu8pLw1X//UfDEmVNrBD+1hveKYMrjjiQXnPoVVnNhc
J5T2q5S1QeO05+wf8WaZ7MbRNzHLj0A3gYHSVPWNclxFwsQjqCHHZS2qz8MTX+f6
W6/eZuvmMbG7
=w8QT
-----END PGP SIGNATURE-----
Merge tag 'pm-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"The amd-pstate cpufreq driver gets the majority of changes this time.
They are mostly fixes and cleanups, but one of them causes it to
become the default cpufreq driver on some AMD server platforms.
Apart from that, the menu cpuidle governor is modified to not use
iowait any more, the intel_idle gets a custom C-states table for
Granite Rapids Xeon D, and the intel_pstate driver will use a more
aggressive Balance-performance default EPP value on Granite Rapids
now.
There are also some fixes, cleanups and tooling updates.
Specifics:
- Update the amd-pstate driver to set the initial scaling frequency
policy lower bound to be the lowest non-linear frequency (Dhananjay
Ugwekar)
- Enable amd-pstate by default on servers starting with newer AMD
Epyc processors (Swapnil Sapkal)
- Align more codepaths between shared memory and MSR designs in
amd-pstate (Dhananjay Ugwekar)
- Clean up amd-pstate code to rename functions and remove redundant
calls (Dhananjay Ugwekar, Mario Limonciello)
- Do other assorted fixes and cleanups in amd-pstate (Dhananjay
Ugwekar and Mario Limonciello)
- Change the Balance-performance EPP value for Granite Rapids in the
intel_pstate driver to a more performance-biased one (Srinivas
Pandruvada)
- Simplify MSR read on the boot CPU in the ACPI cpufreq driver (Chang
S. Bae)
- Ensure sugov_eas_rebuild_sd() is always called when sugov_init()
succeeds to always enforce sched domains rebuild in case EAS needs
to be enabled (Christian Loehle)
- Switch cpufreq back to platform_driver::remove() (Uwe Kleine-König)
- Use proper frequency unit names in cpufreq (Marcin Juszkiewicz)
- Add a built-in idle states table for Granite Rapids Xeon D to the
intel_idle driver (Artem Bityutskiy)
- Fix some typos in comments in the cpuidle core and drivers (Shen
Lichuan)
- Remove iowait influence from the menu cpuidle governor (Christian
Loehle)
- Add min/max available performance state limits to the Energy Model
management code (Lukasz Luba)
- Update pm-graph to v5.13 (Todd Brandt)
- Add documentation for some recently introduced cpupower utility
options (Tor Vic)
- Make cpupower inform users where cpufreq-bench.conf should be
located when opening it fails (Peng Fan)
- Allow overriding cross-compiling env params in cpupower (Peng Fan)
- Add compile_commands.json to .gitignore in cpupower (John B. Wyatt
IV)
- Improve disable c_state block in cpupower bindings and add a test
to confirm that CPU state is disabled to it (John B. Wyatt IV)
- Add Chinese Simplified translation to cpupower (Kieran Moy)
- Add checks for xgettext and msgfmt to cpupower (Siddharth Menon)"
* tag 'pm-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (38 commits)
cpufreq: intel_pstate: Update Balance-performance EPP for Granite Rapids
cpufreq: ACPI: Simplify MSR read on the boot CPU
sched/cpufreq: Ensure sd is rebuilt for EAS check
intel_idle: add Granite Rapids Xeon D support
PM: EM: Add min/max available performance state limits
cpufreq/amd-pstate: Move registration after static function call update
cpufreq/amd-pstate: Push adjust_perf vfunc init into cpu_init
cpufreq/amd-pstate: Align offline flow of shared memory and MSR based systems
cpufreq/amd-pstate: Call cppc_set_epp_perf in the reenable function
cpufreq/amd-pstate: Do not attempt to clear MSR_AMD_CPPC_ENABLE
cpufreq/amd-pstate: Rename functions that enable CPPC
cpufreq/amd-pstate-ut: Add fix for min freq unit test
amd-pstate: Switch to amd-pstate by default on some Server platforms
amd-pstate: Set min_perf to nominal_perf for active mode performance gov
cpufreq/amd-pstate: Remove the redundant amd_pstate_set_driver() call
cpufreq/amd-pstate: Remove the switch case in amd_pstate_init()
cpufreq/amd-pstate: Call amd_pstate_set_driver() in amd_pstate_register_driver()
cpufreq/amd-pstate: Call amd_pstate_register() in amd_pstate_init()
cpufreq/amd-pstate: Set the initial min_freq to lowest_nonlinear_freq
cpufreq/amd-pstate: Remove the redundant verify() function
...
|
|
|
|
a5ca574796 |
vfs-6.13.usercopy
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzchMwAKCRCRxhvAZXjc okICAP4h6tDl7dgTv8GkL0tgaHi/36m+ilctXbEtIe9fbkc/fQD8D5t6jYaz47gu zVY7qOrtQOQ/diNavzxyky99Uh3dKgo= =lwkw -----END PGP SIGNATURE----- Merge tag 'vfs-6.13.usercopy' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull copy_struct_to_user helper from Christian Brauner: "This adds a copy_struct_to_user() helper which is a companion helper to the already widely used copy_struct_from_user(). It copies a struct from kernel space to userspace, in a way that guarantees backwards-compatibility for struct syscall arguments as long as future struct extensions are made such that all new fields are appended to the old struct, and zeroed-out new fields have the same meaning as the old struct. The first user is sched_getattr() system call but the new extensible pidfs ioctl will be ported to it as well" * tag 'vfs-6.13.usercopy' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: sched_getattr: port to copy_struct_to_user uaccess: add copy_struct_to_user helper |
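A hand-rolled sketch of the semantics such a helper provides, using only copy_to_user()/clear_user(); the real copy_struct_to_user() signature and its handling of trailing kernel-only fields are not reproduced here. @ksize is the kernel's struct size, @usize the size userspace passed in.

```c
#include <linux/minmax.h>
#include <linux/uaccess.h>

static int sketch_copy_struct_to_user(void __user *dst, size_t usize,
				      const void *src, size_t ksize)
{
	size_t size = min(ksize, usize);

	/* Copy the part both sides agree on... */
	if (copy_to_user(dst, src, size))
		return -EFAULT;

	/*
	 * ...and zero the tail of a newer/larger userspace struct so that
	 * "zeroed new fields mean old behaviour" keeps holding.
	 */
	if (usize > ksize && clear_user(dst + ksize, usize - ksize))
		return -EFAULT;

	return 0;
}
```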
|
|
|
d79944b094 |
sched_ext: One more fix for v6.12-rc7
ops.cpu_acquire() was being invoked with the wrong kfunc mask allowing the operation to call kfuncs which shouldn't be allowed. Fix it by using SCX_KF_REST instead, which is trivial and low risk. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZzamXw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGRReAP4/JQ1mKkJv+9nTZkW9OcFFHGVVhrprOUEEFk5j pmHwPAD8DTBMMS/BCQOoXDdiB9uU7ut6M8VdsIj1jmJkMja+eQI= =942J -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.12-rc7-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fix from Tejun Heo: "One more fix for v6.12-rc7 ops.cpu_acquire() was being invoked with the wrong kfunc mask allowing the operation to call kfuncs which shouldn't be allowed. Fix it by using SCX_KF_REST instead, which is trivial and low risk" * tag 'sched_ext-for-6.12-rc7-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: ops.cpu_acquire() should be called with SCX_KF_REST |
|
|
|
6b8950ef99 |
sched_ext: Replace scx_next_task_picked() with switch_class() in comment
scx_next_task_picked() has been replaced with switch_class(), but a comment still references the old name, so update it. Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
a4af89cc50 |
sched_ext: ops.cpu_acquire() should be called with SCX_KF_REST
ops.cpu_acquire() is currently called with a kf_mask of 0, which is interpreted as
SCX_KF_UNLOCKED and allows all unlocked kfuncs. However, ops.cpu_acquire() is
called from balance_one() under the rq lock and should only be allowed to call
kfuncs that are safe under the rq lock. Update it to use SCX_KF_REST.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Zhao Mengmeng <zhaomzhao@126.com>
Link: http://lkml.kernel.org/r/ZzYvf2L3rlmjuKzh@slm.duckdns.org
Fixes:
|
|
|
|
70d8b6485b |
sched/cpufreq: Ensure sd is rebuilt for EAS check
Ensure sugov_eas_rebuild_sd() is always called when sugov_init()
succeeds. The early 'out' goto path initialized sugov without forcing the rebuild.
Previously the missing call to sugov_eas_rebuild_sd() could lead to EAS
not being enabled on boot when it should have been, because it requires
all policies to be controlled by schedutil while they might not have
been initialized yet.
Fixes:
|
|
|
|
3022e9d00e |
sched_ext: Fixes for v6.12-rc7
- The fair sched class currently has a bug where its balance() returns true telling the sched core that it has tasks to run but then NULL from pick_task(). This makes sched core call sched_ext's pick_task() without preceding balance() which can lead to stalls in partial mode. For now, work around by detecting the condition and forcing the CPU to go through another scheduling cycle. - Add a missing newline to an error message and fix drgn introspection tool which went out of sync. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZzI8sw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGb5KAP40b/o6TyAFDG+Hn6GxyxQT7rcAUMXsdB2bcEpg /IjmzQEAwbHU5KP5vQXV6XHv+2V7Rs7u6ZqFtDnL88N0A9hf3wk= =7hL8 -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - The fair sched class currently has a bug where its balance() returns true telling the sched core that it has tasks to run but then NULL from pick_task(). This makes sched core call sched_ext's pick_task() without preceding balance() which can lead to stalls in partial mode. For now, work around by detecting the condition and forcing the CPU to go through another scheduling cycle. - Add a missing newline to an error message and fix drgn introspection tool which went out of sync. * tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx() sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type sched_ext: Add a missing newline at the end of an error message |
|
|
|
5cbb302880 |
sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "consume": - ops.dispatch() - scx_bpf_dispatch[_vtime]() - scx_bpf_consume() - scx_bpf_dispatch[_vtime]_from_dsq*() This overloading of the term is historical. Originally, there were only built-in DSQs and moving a task into a DSQ always dispatched it for execution. Using the verb "dispatch" for the kfuncs to move tasks into these DSQs made sense. Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was from a non-local DSQ to a local DSQ and this operation was named "consume". This was already confusing as a task could be dispatched to a user DSQ from ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch(). Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse as "dispatch" in this context meant moving a task to an arbitrary DSQ from a user DSQ. Clean up the API with the following renames: 1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]() 2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local() 3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*() This patch performs the third set of renames. Compatibility is maintained by: - The previous kfunc names are still provided by the kernel so that old binaries can run. Kernel generates a warning when the old names are used. - compat.bpf.h provides wrappers for the new names which automatically fall back to the old names when running on older kernels. They also trigger build error if old names are used for new builds. - scx_bpf_dispatch[_vtime]_from_dsq*() were already wrapped in __COMPAT macros as they were introduced during v6.12 cycle. Wrap new API in __COMPAT macros too and trigger build errors on both __COMPAT prefixed and naked usages of the old names. The compat features will be dropped after v6.15. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Johannes Bechberger <me@mostlynerdless.de> Acked-by: Giovanni Gherdovich <ggherdovich@suse.com> Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Ming Yang <yougmark94@gmail.com> |
|
|
|
5209c03c8e |
sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()
In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "consume": - ops.dispatch() - scx_bpf_dispatch[_vtime]() - scx_bpf_consume() - scx_bpf_dispatch[_vtime]_from_dsq*() This overloading of the term is historical. Originally, there were only built-in DSQs and moving a task into a DSQ always dispatched it for execution. Using the verb "dispatch" for the kfuncs to move tasks into these DSQs made sense. Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was from a non-local DSQ to a local DSQ and this operation was named "consume". This was already confusing as a task could be dispatched to a user DSQ from ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch(). Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse as "dispatch" in this context meant moving a task to an arbitrary DSQ from a user DSQ. Clean up the API with the following renames: 1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]() 2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local() 3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*() This patch performs the second rename. Compatibility is maintained by: - The previous kfunc names are still provided by the kernel so that old binaries can run. Kernel generates a warning when the old names are used. - compat.bpf.h provides wrappers for the new names which automatically fall back to the old names when running on older kernels. They also trigger build error if old names are used for new builds. The compat features will be dropped after v6.15. v2: Comment and documentation updates. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Johannes Bechberger <me@mostlynerdless.de> Acked-by: Giovanni Gherdovich <ggherdovich@suse.com> Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Ming Yang <yougmark94@gmail.com> |
|
|
|
cc26abb1a1 |
sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()
In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "consume": - ops.dispatch() - scx_bpf_dispatch[_vtime]() - scx_bpf_consume() - scx_bpf_dispatch[_vtime]_from_dsq*() This overloading of the term is historical. Originally, there were only built-in DSQs and moving a task into a DSQ always dispatched it for execution. Using the verb "dispatch" for the kfuncs to move tasks into these DSQs made sense. Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was from a non-local DSQ to a local DSQ and this operation was named "consume". This was already confusing as a task could be dispatched to a user DSQ from ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch(). Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse as "dispatch" in this context meant moving a task to an arbitrary DSQ from a user DSQ. Clean up the API with the following renames: 1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]() 2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local() 3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*() This patch performs the first set of renames. Compatibility is maintained by: - The previous kfunc names are still provided by the kernel so that old binaries can run. Kernel generates a warning when the old names are used. - compat.bpf.h provides wrappers for the new names which automatically fall back to the old names when running on older kernels. They also trigger build error if old names are used for new builds. The compat features will be dropped after v6.15. v2: Documentation updates. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Johannes Bechberger <me@mostlynerdless.de> Acked-by: Giovanni Gherdovich <ggherdovich@suse.com> Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Ming Yang <yougmark94@gmail.com> |
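A minimal BPF-side sketch using the new names, with the old ones in comments. It assumes the usual scx tooling headers, and that the shared DSQ was created in ops.init() via scx_bpf_create_dsq(); the ops struct definition and license boilerplate are omitted.

```c
#include <scx/common.bpf.h>

#define SHARED_DSQ 0	/* created in ops.init() with scx_bpf_create_dsq() */

void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* was: scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags); */
	scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(sketch_dispatch, s32 cpu, struct task_struct *prev)
{
	/* was: scx_bpf_consume(SHARED_DSQ); */
	scx_bpf_dsq_move_to_local(SHARED_DSQ);
}
```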
|
|
|
a6250aa251 |
sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
sched_ext dispatches tasks from the BPF scheduler from balance_scx() and thus every pick_task_scx() call must be preceded by balance_scx(). While this usually holds, due to a bug, there are cases where the fair class's balance() returns true indicating that it has tasks to run on the CPU and thus terminating balance() calls but fails to actually find the next task to run when pick_task() is called. In such cases, pick_task_scx() can be called without preceding balance_scx(). Detect this condition using SCX_RQ_BAL_PENDING flags. If detected, keep running the previous task if possible and avoid stalling from entering idle without balancing. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/Ztj_h5c2LYsdXYbA@slm.duckdns.org |
|
|
|
72b85bf6a7 |
sched_ext: scx_bpf_dispatch_from_dsq_set_*() are allowed from unlocked context
|
|
|
|
f39489fea6 |
sched_ext: add a missing rcu_read_lock/unlock pair at scx_select_cpu_dfl()
When getting an LLC CPU mask in the default CPU selection policy, scx_select_cpu_dfl(), a pointer to the sched_domain is dereferenced using rcu_read_lock() without holding rcu_read_lock(). Such an unprotected dereference often causes the following warning and can cause an invalid memory access in the worst case. Therefore, protect dereference of a sched_domain pointer using a pair of rcu_read_lock() and unlock(). [ 20.996135] ============================= [ 20.996345] WARNING: suspicious RCU usage [ 20.996563] 6.11.0-virtme #17 Tainted: G W [ 20.996576] ----------------------------- [ 20.996576] kernel/sched/ext.c:3323 suspicious rcu_dereference_check() usage! [ 20.996576] [ 20.996576] other info that might help us debug this: [ 20.996576] [ 20.996576] [ 20.996576] rcu_scheduler_active = 2, debug_locks = 1 [ 20.996576] 4 locks held by kworker/8:1/140: [ 20.996576] #0: ffff8b18c00dd348 ((wq_completion)pm){+.+.}-{0:0}, at: process_one_work+0x4a0/0x590 [ 20.996576] #1: ffffb3da01f67e58 ((work_completion)(&dev->power.work)){+.+.}-{0:0}, at: process_one_work+0x1ba/0x590 [ 20.996576] #2: ffffffffa316f9f0 (&rcu_state.gp_wq){..-.}-{2:2}, at: swake_up_one+0x15/0x60 [ 20.996576] #3: ffff8b1880398a60 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x59/0x7d0 [ 20.996576] [ 20.996576] stack backtrace: [ 20.996576] CPU: 8 UID: 0 PID: 140 Comm: kworker/8:1 Tainted: G W 6.11.0-virtme #17 [ 20.996576] Tainted: [W]=WARN [ 20.996576] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 20.996576] Workqueue: pm pm_runtime_work [ 20.996576] Sched_ext: simple (disabling+all), task: runnable_at=-6ms [ 20.996576] Call Trace: [ 20.996576] <IRQ> [ 20.996576] dump_stack_lvl+0x6f/0xb0 [ 20.996576] lockdep_rcu_suspicious.cold+0x4e/0x96 [ 20.996576] scx_select_cpu_dfl+0x234/0x260 [ 20.996576] select_task_rq_scx+0xfb/0x190 [ 20.996576] select_task_rq+0x47/0x110 [ 20.996576] try_to_wake_up+0x110/0x7d0 [ 20.996576] swake_up_one+0x39/0x60 [ 20.996576] rcu_core+0xb08/0xe50 [ 20.996576] ? srso_alias_return_thunk+0x5/0xfbef5 [ 20.996576] ? mark_held_locks+0x40/0x70 [ 20.996576] handle_softirqs+0xd3/0x410 [ 20.996576] irq_exit_rcu+0x78/0xa0 [ 20.996576] sysvec_apic_timer_interrupt+0x73/0x80 [ 20.996576] </IRQ> [ 20.996576] <TASK> [ 20.996576] asm_sysvec_apic_timer_interrupt+0x1a/0x20 [ 20.996576] RIP: 0010:_raw_spin_unlock_irqrestore+0x36/0x70 [ 20.996576] Code: f5 53 48 8b 74 24 10 48 89 fb 48 83 c7 18 e8 11 b4 36 ff 48 89 df e8 99 0d 37 ff f7 c5 00 02 00 00 75 17 9c 58 f6 c4 02 75 2b <65> ff 0d 5b 55 3c 5e 74 16 5b 5d e9 95 8e 28 00 e8 a5 ee 44 ff 9c [ 20.996576] RSP: 0018:ffffb3da01f67d20 EFLAGS: 00000246 [ 20.996576] RAX: 0000000000000002 RBX: ffffffffa4640220 RCX: 0000000000000040 [ 20.996576] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa1c7b27b [ 20.996576] RBP: 0000000000000246 R08: 0000000000000001 R09: 0000000000000000 [ 20.996576] R10: 0000000000000001 R11: 000000000000021c R12: 0000000000000246 [ 20.996576] R13: ffff8b1881363958 R14: 0000000000000000 R15: ffff8b1881363800 [ 20.996576] ? _raw_spin_unlock_irqrestore+0x4b/0x70 [ 20.996576] serial_port_runtime_resume+0xd4/0x1a0 [ 20.996576] ? __pfx_serial_port_runtime_resume+0x10/0x10 [ 20.996576] __rpm_callback+0x44/0x170 [ 20.996576] ? __pfx_serial_port_runtime_resume+0x10/0x10 [ 20.996576] rpm_callback+0x55/0x60 [ 20.996576] ? 
__pfx_serial_port_runtime_resume+0x10/0x10 [ 20.996576] rpm_resume+0x582/0x7b0 [ 20.996576] pm_runtime_work+0x7c/0xb0 [ 20.996576] process_one_work+0x1fb/0x590 [ 20.996576] worker_thread+0x18e/0x350 [ 20.996576] ? __pfx_worker_thread+0x10/0x10 [ 20.996576] kthread+0xe2/0x110 [ 20.996576] ? __pfx_kthread+0x10/0x10 [ 20.996576] ret_from_fork+0x34/0x50 [ 20.996576] ? __pfx_kthread+0x10/0x10 [ 20.996576] ret_from_fork_asm+0x1a/0x30 [ 20.996576] </TASK> [ 21.056592] sched_ext: BPF scheduler "simple" disabled (unregistered from user space) Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
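A generic sketch of the pattern being fixed, using a hypothetical helper and the scheduler's per-CPU sd_llc pointer rather than the exact ext.c hunk: the rcu_dereference() of a sched_domain and every use of the result must sit inside an RCU read-side critical section.

```c
static bool cpus_share_llc_sketch(int cpu, int other)
{
	struct sched_domain *sd;
	bool ret = false;

	rcu_read_lock();
	sd = rcu_dereference(per_cpu(sd_llc, cpu));
	if (sd)
		ret = cpumask_test_cpu(other, sched_domain_span(sd));
	rcu_read_unlock();

	return ret;
}
```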
|
|
|
153591f703 |
sched_ext: Clarify sched_ext_ops table for userland scheduler
Update the comments in sched_ext_ops to clarify this table is for a BPF scheduler and a userland scheduler should also rely on the sched_ext_ops table through the BPF scheduler. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
e32c260195 |
sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
On 2 x Intel Sapphire Rapids machines with 224 logical CPUs, a poorly behaving BPF scheduler can live-lock the system by making multiple CPUs bang on the same DSQ to the point where soft-lockup detection triggers before SCX's own watchdog can take action. It also seems possible that the machine can be live-locked enough to prevent scx_ops_helper, which is an RT task, from running in a timely manner. Implement scx_softlockup() which is called when three quarters of soft-lockup threshold has passed. The function immediately enables the ops breather and triggers an ops error to initiate ejection of the BPF scheduler. The previous and this patch combined enable the kernel to reliably recover the system from live-lock conditions that can be triggered by a poorly behaving BPF scheduler on Intel dual socket systems. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Douglas Anderson <dianders@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
62dcbab8b0 |
sched_ext: Avoid live-locking bypass mode switching
A poorly behaving BPF scheduler can live-lock the system by e.g. incessantly banging on the same DSQ on a large NUMA system to the point where switching to the bypass mode can take a long time. Turning on the bypass mode requires dequeueing and re-enqueueing currently runnable tasks; if the DSQs that they are on are live-locked, this can take tens of seconds cascading into other failures. This was observed on 2 x Intel Sapphire Rapids machines with 224 logical CPUs. Inject artificial delays while the bypass mode is switching to guarantee timely completion. While at it, move __scx_ops_bypass_lock into scx_ops_bypass() and rename it to bypass_lock. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Valentin Andrei <vandrei@meta.com> Reported-by: Patrick Lu <patlu@meta.com> |
|
|
|
f07b806ad8 |
Merge branch 'for-6.12-fixes' into for-6.13
Pull sched_ext/for-6.12-fixes to receive
|
|
|
|
6d594af5bf |
sched_ext: Fix incorrect use of bitwise AND
There is no reason to use a bitwise AND when checking the conditions to
enable NUMA optimization for the built-in CPU idle selection policy, so
use a logical AND instead.
Fixes:
|
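A toy, buildable illustration of the difference (the names are made up): with function-call operands, `&` evaluates both sides unconditionally while `&&` short-circuits, which is what a plain condition check wants.

```c
#include <stdbool.h>
#include <stdio.h>

static bool llc_useful(void)  { puts("checked LLC");  return false; }
static bool numa_useful(void) { puts("checked NUMA"); return true;  }

int main(void)
{
	bool bitwise = llc_useful() & numa_useful();	/* evaluates both */
	bool logical = llc_useful() && numa_useful();	/* stops after LLC */

	printf("%d %d\n", bitwise, logical);
	return 0;
}
```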
|
|
|
f6ce6b9493 |
sched_ext: Do not enable LLC/NUMA optimizations when domains overlap
When the LLC and NUMA domains fully overlap, enabling both optimizations
in the built-in idle CPU selection policy is redundant, as it leads to
searching for an idle CPU within the same domain twice.
Likewise, if all online CPUs are within a single LLC domain, LLC
optimization is unnecessary.
Therefore, detect overlapping domains and enable topology optimizations
only when necessary.
Moreover, rely on the online CPUs for this detection logic, instead of
using the possible CPUs.
Fixes:
|
|
|
|
46d076af6d |
sched/idle: Switch to use hrtimer_setup_on_stack()
hrtimer_setup_on_stack() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init_on_stack() and the open coded initialization of hrtimer::function with the new setup mechanism. The conversion was done with Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/17f9421fed6061df4ad26a4cc91873d2c078cb0f.1730386209.git.namcao@linutronix.de |
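A before/after sketch of the conversion; the surrounding wait code is illustrative, and the argument order assumes hrtimer_setup_on_stack() takes the callback alongside the clock id and mode as described above.

```c
#include <linux/hrtimer.h>

static enum hrtimer_restart sketch_timeout(struct hrtimer *t)
{
	return HRTIMER_NORESTART;
}

static void sketch_wait(ktime_t timeout)
{
	struct hrtimer timer;

	/*
	 * Old two-step pattern:
	 *   hrtimer_init_on_stack(&timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	 *   timer.function = sketch_timeout;
	 */
	hrtimer_setup_on_stack(&timer, sketch_timeout, CLOCK_MONOTONIC,
			       HRTIMER_MODE_REL);

	hrtimer_start(&timer, timeout, HRTIMER_MODE_REL);
	/* ... wait for the callback / completion ... */
	hrtimer_cancel(&timer);
	destroy_hrtimer_on_stack(&timer);
}
```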
|
|
|
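A before/after sketch of the conversion pattern described above, assuming hrtimer_setup_on_stack() takes the timer, the callback, the clock id and the mode in that order; the timer and callback names are placeholders, and this is kernel-style illustration rather than standalone code.

```c
/* Conversion pattern (placeholders: my_timer, my_timer_fn):
 *
 *   - hrtimer_init_on_stack(&my_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 *   - my_timer.function = my_timer_fn;
 *   + hrtimer_setup_on_stack(&my_timer, my_timer_fn,
 *   +                        CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 */
```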
f7d1b585e1 |
sched_ext: Add a missing newline at the end of an error message
Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
35772d627b |
sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
In order to enable PREEMPT_DYNAMIC for PREEMPT_RT, remove PREEMPT_RT from the 'Preemption Model' choice. Strictly speaking, PREEMPT_RT is not a change in how preemption works, but rather it makes a ton more code preemptible. Notably, take away the NONE and VOLUNTARY options for PREEMPT_RT; they make no sense (but are technically possible). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20241007075055.441622332@infradead.org |
|
|
|
7c70cb94d2 |
sched: Add Lazy preemption model
Change fair to use resched_curr_lazy(), which, when the lazy preemption model is selected, will set TIF_NEED_RESCHED_LAZY. This LAZY bit will be promoted to the full NEED_RESCHED bit on tick. As such, the average delay between setting LAZY and actually rescheduling will be TICK_NSEC/2. In short, Lazy preemption will delay preemption for fair class but will function as Full preemption for all the other classes, most notably the realtime (RR/FIFO/DEADLINE) classes. The goal is to bridge the performance gap with Voluntary, such that we might eventually remove that option entirely. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20241007075055.331243614@infradead.org |
|
|
|
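A toy model (not kernel code) of the lazy scheme described above: the fair path sets a LAZY bit, and the periodic tick promotes it to the full NEED_RESCHED bit, so preemption of a fair task happens within roughly one tick. Flag values and function names are illustrative.

```c
#include <stdio.h>

#define TIF_NEED_RESCHED	(1u << 0)	/* drives IRQ preemption   */
#define TIF_NEED_RESCHED_LAZY	(1u << 1)	/* only acted on later     */

static unsigned int thread_flags;

static void resched_curr_lazy(void) { thread_flags |= TIF_NEED_RESCHED_LAZY; }

static void scheduler_tick(void)
{
	/* Promote LAZY to the real bit; the next preemption point acts on it. */
	if (thread_flags & TIF_NEED_RESCHED_LAZY)
		thread_flags |= TIF_NEED_RESCHED;
}

int main(void)
{
	resched_curr_lazy();			/* a fair task became eligible */
	printf("before tick: %#x\n", thread_flags);
	scheduler_tick();			/* at most about one tick later */
	printf("after tick:  %#x\n", thread_flags);
	return 0;
}
```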
26baa1f1c4 |
sched: Add TIF_NEED_RESCHED_LAZY infrastructure
Add the basic infrastructure to split the TIF_NEED_RESCHED bit in two. Either bit will cause a resched on return-to-user, but only TIF_NEED_RESCHED will drive IRQ preemption. No behavioural change intended. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20241007075055.219540785@infradead.org |
|
|
|
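A small toy model of the two-bit split described above: either bit triggers a reschedule on return to user space, but only the full bit drives preemption from IRQ exit. Again, the values and helpers are illustrative.

```c
#include <stdbool.h>
#include <stdio.h>

#define TIF_NEED_RESCHED	(1u << 0)
#define TIF_NEED_RESCHED_LAZY	(1u << 1)

static bool should_resched_on_user_return(unsigned int flags)
{
	return flags & (TIF_NEED_RESCHED | TIF_NEED_RESCHED_LAZY);
}

static bool should_preempt_from_irq(unsigned int flags)
{
	return flags & TIF_NEED_RESCHED;	/* LAZY alone is not enough */
}

int main(void)
{
	unsigned int flags = TIF_NEED_RESCHED_LAZY;
	printf("user return: %d, irq preempt: %d\n",
	       should_resched_on_user_return(flags),
	       should_preempt_from_irq(flags));
	return 0;
}
```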
0f0d1b8e50 |
sched/ext: Remove sched_fork() hack
Instead of solving the underlying problem of the double invocation of
__sched_fork() for idle tasks, sched-ext decided to hack around the issue
by partially clearing out the entity struct to preserve the already
enqueued node. The provided analysis and solution have been ignored for four
months.
Now that someone else has taken care of cleaning it up, remove the
disgusting hack and clear out the full structure. Remove the comment in the
structure declaration as well, as there is no requirement for @node being
the last element anymore.
Fixes:
|
|
|
|
b23decf8ac |
sched: Initialize idle tasks only once
Idle tasks are initialized via __sched_fork() twice:
fork_idle()
  copy_process()
    sched_fork()
      __sched_fork()
  init_idle()
    __sched_fork()
Instead of cleaning this up, sched_ext hacked around it. Even when analysis
and solution were provided in a discussion, nobody cared to clean this up.
init_idle() is also invoked from sched_init() to initialize the boot CPU's
idle task, which requires the __sched_fork() invocation. But this can be
trivially solved by invoking __sched_fork() before init_idle() in
sched_init() and removing the __sched_fork() invocation from init_idle().
Do so and clean up the comments explaining this historical leftover.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241028103142.359584747@linutronix.de
|
|
|
|
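A toy sketch (placeholder functions, not kernel code) of the reordering described above: init_idle() no longer calls __sched_fork(); the boot-CPU caller invokes __sched_fork() explicitly before init_idle(), while fork_idle() already gets it through copy_process() -> sched_fork().

```c
#include <stdio.h>

struct task { int initialized; };

static void __sched_fork(struct task *p) { p->initialized++; }

static void init_idle(struct task *p)
{
	/* no __sched_fork() here anymore; the caller is responsible */
	printf("init_idle: task initialized %d time(s)\n", p->initialized);
}

static void sched_init_boot_cpu(struct task *boot_idle)
{
	__sched_fork(boot_idle);	/* moved in front of init_idle() */
	init_idle(boot_idle);
}

int main(void)
{
	struct task boot_idle = { 0 };
	sched_init_boot_cpu(&boot_idle);	/* prints "1 time(s)", not 2 */
	return 0;
}
```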
33e83ffe4c |
A few scheduler fixes:
- Plug a race between pick_next_task_fair() and try_to_wake_up() where
both try to write to the same task, even though both paths hold a
runqueue lock, but obviously from different runqueues.
The problem is that the store to task::on_rq in __block_task() is
visible to try_to_wake_up() which assumes that the task is not queued.
Both sides then operate on the same task.
Cure it by rearranging __block_task() so that the store to task::on_rq is
the last operation on the task.
- Prevent a potential NULL pointer dereference in task_numa_work()
task_numa_work() iterates the VMAs of a process. A concurrent unmap of
the address space can result in a NULL pointer return from vma_next()
which is unchecked.
Add the missing NULL pointer check to prevent this.
- Operate on the correct scheduler policy in task_should_scx()
task_should_scx() returns true when a task should be handled by sched
EXT. It checks the task's scheduling policy.
This fails when the check is done before a policy has been set.
Cure it by handing the policy into task_should_scx() so it operates
on the requested value.
- Add the missing handling of sched EXT in the delayed dequeue
mechanism. This was simply forgotten.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmcnTqATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoX/aD/4yvskeG9i7wAj2NOdDTAs1K0gLURt+
nHDb1YkoIOXOfanaG7ZdBWb4sYnsnLX/KIhVsDQiXACFr6G0IjQ1zaN1iRtEkH79
5BfVi98gAXdFU3y+EGqyaqiAp7MFOBTmsfJi5095fX0L+2aViSAjDEvHzvvC/hXD
tmq47vFQEgIZPSxljEaKPaNmyDM+geusv5lX/lABH5MG0fYsT85VV6BQ2T1LsN1O
WFBLD/uPEOSXumyZW8nV8yE2PioLDJz8W+uSnr38/HCH99mtJApqZyskaagKtr0g
vLhOfoaYVR/j5ODUk6LExZ8zy140zDzUWzC5+RNnyb8jQf/Lx88fTNZY8/Wsm5m9
oKtoiGzkL0LG/c05Cjh/vqReK26qILK4+ynDGaowDmTlUTS2jeNZL1ABlIwWkaLP
5TDegJPkoUA1Z4YegxtRFROGHp1J+lfbqz537bghMaqdJXMaG84qjSszsPz9NbS9
F7K63JKjfXAF6N8bhKvZk4jAbD97EYf3B0o8E69TjoZxaiuKf00xK7HGWmuQD3u3
lOHkfIZzf5b7ELNgcketCYsbJvxbI4oQrp/9V425ORSr1Ih2GxCT51/x/NlFHoEH
ujIjAe2YQyLhb26M0RG8Xao3BPT7RGMR058C8lwxtPLuPNIwB8MqCsXmU9xlEypg
iexGnsj6zXTddg==
=4mie
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
- Plug a race between pick_next_task_fair() and try_to_wake_up() where
both try to write to the same task, even though both paths hold a
runqueue lock, but obviously from different runqueues.
The problem is that the store to task::on_rq in __block_task() is
visible to try_to_wake_up() which assumes that the task is not
queued. Both sides then operate on the same task.
Cure it by rearranging __block_task() so that the store to task::on_rq
is the last operation on the task.
- Prevent a potential NULL pointer dereference in task_numa_work()
task_numa_work() iterates the VMAs of a process. A concurrent unmap
of the address space can result in a NULL pointer return from
vma_next() which is unchecked.
Add the missing NULL pointer check to prevent this.
- Operate on the correct scheduler policy in task_should_scx()
task_should_scx() returns true when a task should be handled by sched
EXT. It checks the task's scheduling policy.
This fails when the check is done before a policy has been set.
Cure it by handing the policy into task_should_scx() so it operates
on the requested value.
- Add the missing handling of sched EXT in the delayed dequeue
mechanism. This was simply forgotten.
* tag 'sched-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/ext: Fix scx vs sched_delayed
sched: Pass correct scheduling policy to __setscheduler_class
sched/numa: Fix the potential null pointer dereference in task_numa_work()
sched: Fix pick_next_task_fair() vs try_to_wake_up() race
|
|
|
|
69d5e722be |
sched/ext: Fix scx vs sched_delayed
Commit |
|
|
|
daa9f66fe1 |
sched_ext: Fixes for v6.12-rc5
- Instances of scx_ops_bypass() could race each other leading to misbehavior. Fix by protecting the operation with a spinlock. - selftest and userspace header fixes. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZyF/5Q4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGRi+AP4+jGUz+O1LS0bCNj44Xlr0v6kci5dfJR7TlBv5 hwROcgEA84i7nRq6oJ1IkK7ItLbZYwgZyxqdn0Pgsq+oMWhgAwE= =R766 -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Instances of scx_ops_bypass() could race each other leading to misbehavior. Fix by protecting the operation with a spinlock. - selftest and userspace header fixes * tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix enq_last_no_enq_fails selftest sched_ext: Make cast_mask() inline scx: Fix raciness in scx_ops_bypass() scx: Fix exit selftest to use custom DSQ sched_ext: Fix function pointer type mismatches in BPF selftests selftests/sched_ext: add order-only dependency of runner.o on BPFOBJ |
|
|
|
860a45219b |
sched_ext: Introduce NUMA awareness to the default idle selection policy
Similarly to commit
|
|
|
|
5db91545ef |
sched: Pass correct scheduling policy to __setscheduler_class
Commit |
|
|
|
1a6151017e |
sched: psi: pass enqueue/dequeue flags to psi callbacks directly
What psi needs to do on each enqueue and dequeue has gotten more subtle, and the generic sched code trying to distill this into a bool for the callbacks is awkward. Pass the flags directly and let psi parse them. For that to work, the #include "stats.h" (which has the psi callback implementations) needs to be below the flag definitions in "sched.h". Move that section further down, next to some of the other accounting stuff. This also puts the ENQUEUE_SAVE/RESTORE branch behind the psi jump label, slightly reducing overhead when PSI=y but runtime disabled. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241014144358.GB1021@cmpxchg.org |
|
|
|
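A toy sketch of the interface change above: pass the enqueue flags to the accounting callback and let it interpret them, instead of collapsing them into a single bool at the call site. The flag names mirror real scheduler flags but the callback body is purely illustrative.

```c
#include <stdio.h>

#define ENQUEUE_WAKEUP	(1u << 0)
#define ENQUEUE_RESTORE	(1u << 1)

/* Before: psi_enqueue(bool wakeup) forced the caller to pre-digest the
 * situation.  After: the callback parses the flags itself. */
static void psi_enqueue(unsigned int flags)
{
	if (flags & ENQUEUE_RESTORE)
		return;			/* save/restore pairs are not state changes */
	if (flags & ENQUEUE_WAKEUP)
		printf("account a wakeup\n");
	else
		printf("account a migration/requeue\n");
}

int main(void)
{
	psi_enqueue(ENQUEUE_WAKEUP);
	psi_enqueue(ENQUEUE_RESTORE);
	return 0;
}
```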
9c70b2a33c |
sched/numa: Fix the potential null pointer dereference in task_numa_work()
When running stress-ng-vm-segv test, we found a null pointer dereference
error in task_numa_work(). Here is the backtrace:
[323676.066985] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
......
[323676.067108] CPU: 35 PID: 2694524 Comm: stress-ng-vm-se
......
[323676.067113] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
[323676.067115] pc : vma_migratable+0x1c/0xd0
[323676.067122] lr : task_numa_work+0x1ec/0x4e0
[323676.067127] sp : ffff8000ada73d20
[323676.067128] x29: ffff8000ada73d20 x28: 0000000000000000 x27: 000000003e89f010
[323676.067130] x26: 0000000000080000 x25: ffff800081b5c0d8 x24: ffff800081b27000
[323676.067133] x23: 0000000000010000 x22: 0000000104d18cc0 x21: ffff0009f7158000
[323676.067135] x20: 0000000000000000 x19: 0000000000000000 x18: ffff8000ada73db8
[323676.067138] x17: 0001400000000000 x16: ffff800080df40b0 x15: 0000000000000035
[323676.067140] x14: ffff8000ada73cc8 x13: 1fffe0017cc72001 x12: ffff8000ada73cc8
[323676.067142] x11: ffff80008001160c x10: ffff000be639000c x9 : ffff8000800f4ba4
[323676.067145] x8 : ffff000810375000 x7 : ffff8000ada73974 x6 : 0000000000000001
[323676.067147] x5 : 0068000b33e26707 x4 : 0000000000000001 x3 : ffff0009f7158000
[323676.067149] x2 : 0000000000000041 x1 : 0000000000004400 x0 : 0000000000000000
[323676.067152] Call trace:
[323676.067153] vma_migratable+0x1c/0xd0
[323676.067155] task_numa_work+0x1ec/0x4e0
[323676.067157] task_work_run+0x78/0xd8
[323676.067161] do_notify_resume+0x1ec/0x290
[323676.067163] el0_svc+0x150/0x160
[323676.067167] el0t_64_sync_handler+0xf8/0x128
[323676.067170] el0t_64_sync+0x17c/0x180
[323676.067173] Code: d2888001 910003fd f9000bf3 aa0003f3 (f9401000)
[323676.067177] SMP: stopping secondary CPUs
[323676.070184] Starting crashdump kernel...
stress-ng-vm-segv in stress-ng is used to stress test the SIGSEGV error
handling function of the system, which tries to cause a SIGSEGV error on
return from unmapping the whole address space of the child process.
Normally this program will not cause kernel crashes. But before the
munmap system call returns to user mode, a potential task_numa_work()
for numa balancing could be added and executed. In this scenario, since the
child process has no vma after munmap, the vma_next() in task_numa_work()
will return a null pointer even if the vma iterator restarts from 0.
Recheck the vma pointer before dereferencing it in task_numa_work().
Fixes:
|
|
|
|
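A standalone illustration of the pattern behind the fix above: after restarting an iterator, re-check the returned pointer before dereferencing it, because a concurrent unmap can leave the address space with no VMAs at all. The structs here are simplified stand-ins, not the kernel's maple-tree iterator.

```c
#include <stdio.h>

struct vma { unsigned long start; struct vma *next; };
struct vma_iter { struct vma *first, *cur; };

static struct vma *vma_next(struct vma_iter *it)
{
	struct vma *v = it->cur;
	if (v)
		it->cur = v->next;
	return v;		/* NULL when nothing (more) is mapped */
}

static void scan(struct vma_iter *it)
{
	struct vma *vma = vma_next(it);
	if (!vma) {
		it->cur = it->first;	/* restart the walk from address 0 */
		vma = vma_next(it);
		if (!vma) {		/* the recheck that was missing */
			printf("address space is empty, nothing to scan\n");
			return;
		}
	}
	printf("scanning vma at %#lx\n", vma->start);
}

int main(void)
{
	struct vma_iter empty = { NULL, NULL };	/* everything was unmapped */
	scan(&empty);
	return 0;
}
```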
23f1178ad7 |
sched/uclamp: Fix unused variable warning
uclamp_mutex is only used for CONFIG_SYSCTL or CONFIG_UCLAMP_TASK_GROUP so declare it __maybe_unused. Closes: https://lore.kernel.org/oe-kbuild-all/202410060258.bPl2ZoUo-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202410250459.EJe6PJI5-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/a1e9c342-01c9-44f0-a789-2c908e57942b@arm.com |
|
|
|
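A minimal illustration of the annotation used above: __maybe_unused silences the "defined but not used" warning for a variable that is only referenced under certain build options. The macro definition and the DEMO_SYSCTL switch are local stand-ins for the kernel's versions.

```c
#include <stdio.h>

#define __maybe_unused __attribute__((__unused__))

/* Only referenced when DEMO_SYSCTL is defined; without the annotation,
 * building without -DDEMO_SYSCTL would warn about an unused variable. */
static int uclamp_demo_mutex __maybe_unused;

int main(void)
{
#ifdef DEMO_SYSCTL
	uclamp_demo_mutex = 1;
#endif
	printf("builds cleanly with or without -DDEMO_SYSCTL\n");
	return 0;
}
```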
0e7ffff1b8 |
scx: Fix raciness in scx_ops_bypass()
scx_ops_bypass() can currently race on the ops enable / disable path as
follows:
1. scx_ops_bypass(true) called on enable path, bypass depth is set to 1
2. An op on the init path exits, which schedules scx_ops_disable_workfn()
3. scx_ops_bypass(false) is called on the disable path, and bypass depth
is decremented to 0
4. kthread is scheduled to execute scx_ops_disable_workfn()
5. scx_ops_bypass(true) called, bypass depth set to 1
6. scx_ops_bypass() races when iterating over CPUs
While it's not safe to take any blocking locks on the bypass path, it is
safe to take a raw spinlock which cannot be preempted. This patch therefore
updates scx_ops_bypass() to use a raw spinlock to synchronize, and changes
scx_ops_bypass_depth to be a regular int.
Without this change, we observe the following warnings when running the
'exit' sched_ext selftest (sometimes requires a couple of runs):
.[root@virtme-ng sched_ext]# ./runner -t exit
===== START =====
TEST: exit
...
[ 14.935078] WARNING: CPU: 2 PID: 360 at kernel/sched/ext.c:4332 scx_ops_bypass+0x1ca/0x280
[ 14.935126] Modules linked in:
[ 14.935150] CPU: 2 UID: 0 PID: 360 Comm: sched_ext_ops_h Not tainted 6.11.0-virtme #24
[ 14.935192] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 14.935242] Sched_ext: exit (enabling+all)
[ 14.935244] RIP: 0010:scx_ops_bypass+0x1ca/0x280
[ 14.935300] Code: ff ff ff e8 48 96 10 00 fb e9 08 ff ff ff c6 05 7b 34 e8 01 01 90 48 c7 c7 89 86 88 87 e8 be 1d f8 ff 90 0f 0b 90 90 eb 95 90 <0f> 0b 90 41 8b 84 24 24 0a 00 00 eb 97 90 0f 0b 90 41 8b 84 24 24
[ 14.935394] RSP: 0018:ffffb706c0957ce0 EFLAGS: 00010002
[ 14.935424] RAX: 0000000000000009 RBX: 0000000000000001 RCX: 00000000e3fb8b2a
[ 14.935465] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff88a4c080
[ 14.935512] RBP: 0000000000009b56 R08: 0000000000000004 R09: 00000003f12e520a
[ 14.935555] R10: ffffffff863a9795 R11: 0000000000000000 R12: ffff8fc5fec31300
[ 14.935598] R13: ffff8fc5fec31318 R14: 0000000000000286 R15: 0000000000000018
[ 14.935642] FS: 0000000000000000(0000) GS:ffff8fc5fe680000(0000) knlGS:0000000000000000
[ 14.935684] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.935721] CR2: 0000557d92890b88 CR3: 000000002464a000 CR4: 0000000000750ef0
[ 14.935765] PKRU: 55555554
[ 14.935782] Call Trace:
[ 14.935802] <TASK>
[ 14.935823] ? __warn+0xce/0x220
[ 14.935850] ? scx_ops_bypass+0x1ca/0x280
[ 14.935881] ? report_bug+0xc1/0x160
[ 14.935909] ? handle_bug+0x61/0x90
[ 14.935934] ? exc_invalid_op+0x1a/0x50
[ 14.935959] ? asm_exc_invalid_op+0x1a/0x20
[ 14.935984] ? raw_spin_rq_lock_nested+0x15/0x30
[ 14.936019] ? scx_ops_bypass+0x1ca/0x280
[ 14.936046] ? srso_alias_return_thunk+0x5/0xfbef5
[ 14.936081] ? __pfx_scx_ops_disable_workfn+0x10/0x10
[ 14.936111] scx_ops_disable_workfn+0x146/0xac0
[ 14.936142] ? finish_task_switch+0xa9/0x2c0
[ 14.936172] ? srso_alias_return_thunk+0x5/0xfbef5
[ 14.936211] ? __pfx_scx_ops_disable_workfn+0x10/0x10
[ 14.936244] kthread_worker_fn+0x101/0x2c0
[ 14.936268] ? __pfx_kthread_worker_fn+0x10/0x10
[ 14.936299] kthread+0xec/0x110
[ 14.936327] ? __pfx_kthread+0x10/0x10
[ 14.936351] ret_from_fork+0x37/0x50
[ 14.936374] ? __pfx_kthread+0x10/0x10
[ 14.936400] ret_from_fork_asm+0x1a/0x30
[ 14.936427] </TASK>
[ 14.936443] irq event stamp: 21002
[ 14.936467] hardirqs last enabled at (21001): [<ffffffff863aa35f>] resched_cpu+0x9f/0xd0
[ 14.936521] hardirqs last disabled at (21002): [<ffffffff863dd0ba>] scx_ops_bypass+0x11a/0x280
[ 14.936571] softirqs last enabled at (20642): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0
[ 14.936622] softirqs last disabled at (20637): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0
[ 14.936672] ---[ end trace 0000000000000000 ]---
[ 14.953282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 14.953352] ------------[ cut here ]------------
[ 14.953383] WARNING: CPU: 2 PID: 360 at kernel/sched/ext.c:4335 scx_ops_bypass+0x1d8/0x280
[ 14.953428] Modules linked in:
[ 14.953453] CPU: 2 UID: 0 PID: 360 Comm: sched_ext_ops_h Tainted: G W 6.11.0-virtme #24
[ 14.953505] Tainted: [W]=WARN
[ 14.953527] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 14.953574] RIP: 0010:scx_ops_bypass+0x1d8/0x280
[ 14.953603] Code: c6 05 7b 34 e8 01 01 90 48 c7 c7 89 86 88 87 e8 be 1d f8 ff 90 0f 0b 90 90 eb 95 90 0f 0b 90 41 8b 84 24 24 0a 00 00 eb 97 90 <0f> 0b 90 41 8b 84 24 24 0a 00 00 eb 92 f3 0f 1e fa 49 8d 84 24 f0
[ 14.953693] RSP: 0018:ffffb706c0957ce0 EFLAGS: 00010046
[ 14.953722] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
[ 14.953763] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fc5fec31318
[ 14.953804] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[ 14.953845] R10: ffffffff863a9795 R11: 0000000000000000 R12: ffff8fc5fec31300
[ 14.953888] R13: ffff8fc5fec31318 R14: 0000000000000286 R15: 0000000000000018
[ 14.953934] FS: 0000000000000000(0000) GS:ffff8fc5fe680000(0000) knlGS:0000000000000000
[ 14.953974] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.954009] CR2: 0000557d92890b88 CR3: 000000002464a000 CR4: 0000000000750ef0
[ 14.954052] PKRU: 55555554
[ 14.954068] Call Trace:
[ 14.954085] <TASK>
[ 14.954102] ? __warn+0xce/0x220
[ 14.954126] ? scx_ops_bypass+0x1d8/0x280
[ 14.954150] ? report_bug+0xc1/0x160
[ 14.954178] ? handle_bug+0x61/0x90
[ 14.954203] ? exc_invalid_op+0x1a/0x50
[ 14.954226] ? asm_exc_invalid_op+0x1a/0x20
[ 14.954250] ? raw_spin_rq_lock_nested+0x15/0x30
[ 14.954285] ? scx_ops_bypass+0x1d8/0x280
[ 14.954311] ? __mutex_unlock_slowpath+0x3a/0x260
[ 14.954343] scx_ops_disable_workfn+0xa3e/0xac0
[ 14.954381] ? __pfx_scx_ops_disable_workfn+0x10/0x10
[ 14.954413] kthread_worker_fn+0x101/0x2c0
[ 14.954442] ? __pfx_kthread_worker_fn+0x10/0x10
[ 14.954479] kthread+0xec/0x110
[ 14.954507] ? __pfx_kthread+0x10/0x10
[ 14.954530] ret_from_fork+0x37/0x50
[ 14.954553] ? __pfx_kthread+0x10/0x10
[ 14.954576] ret_from_fork_asm+0x1a/0x30
[ 14.954603] </TASK>
[ 14.954621] irq event stamp: 21002
[ 14.954644] hardirqs last enabled at (21001): [<ffffffff863aa35f>] resched_cpu+0x9f/0xd0
[ 14.954686] hardirqs last disabled at (21002): [<ffffffff863dd0ba>] scx_ops_bypass+0x11a/0x280
[ 14.954735] softirqs last enabled at (20642): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0
[ 14.954782] softirqs last disabled at (20637): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0
[ 14.954829] ---[ end trace 0000000000000000 ]---
[ 15.022283] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 15.092282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 15.149282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
ok 1 exit #
===== END =====
And with it, the test passes without issue after 1000s of runs:
.[root@virtme-ng sched_ext]# ./runner -t exit
===== START =====
TEST: exit
DESCRIPTION: Verify we can cleanly exit a scheduler in multiple places
OUTPUT:
[ 7.412856] sched_ext: BPF scheduler "exit" enabled
[ 7.427924] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 7.466677] sched_ext: BPF scheduler "exit" enabled
[ 7.475923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 7.512803] sched_ext: BPF scheduler "exit" enabled
[ 7.532924] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 7.586809] sched_ext: BPF scheduler "exit" enabled
[ 7.595926] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 7.661923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 7.723923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
ok 1 exit #
===== END =====
=============================
RESULTS:
PASSED: 1
SKIPPED: 0
FAILED: 0
Fixes:
|
|
|
|
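A user-space analogue of the synchronization fix above: keep the bypass depth counter and the per-CPU walk under one spinlock so concurrent enable/disable callers cannot interleave. A pthread spinlock stands in for the kernel's raw spinlock, and the names are illustrative.

```c
#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t bypass_lock;
static int bypass_depth;

static void scx_bypass(int bypass)
{
	pthread_spin_lock(&bypass_lock);
	bypass_depth += bypass ? 1 : -1;
	if ((bypass && bypass_depth == 1) || (!bypass && bypass_depth == 0)) {
		/* First enable / last disable: walk the CPUs while still
		 * holding the lock, so a racing caller sees either the
		 * fully-on or fully-off state, never a partial switch. */
		printf("%s bypass for all CPUs\n",
		       bypass ? "enabling" : "disabling");
	}
	pthread_spin_unlock(&bypass_lock);
}

int main(void)
{
	pthread_spin_init(&bypass_lock, PTHREAD_PROCESS_PRIVATE);
	scx_bypass(1);
	scx_bypass(0);
	pthread_spin_destroy(&bypass_lock);
	return 0;
}
```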
b7d0bbcf0c |
sched_ext: Replace set_arg_maybe_null() with __nullable CFI stub tags
ops.dispatch() and ops.yield() may be fed a NULL task_struct pointer. set_arg_maybe_null() is used to tell the verifier that they should be NULL checked before being dereferenced. BPF now has a much prettier way to express this: tagging arguments in CFI stubs with __nullable. Replace set_arg_maybe_null() with __nullable CFI stub tags. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
cf583264d0 |
sched_ext: Rename CFI stubs to names that are recognized by BPF
CFI stubs can be used to tag arguments with __nullable (and possibly other tags in the future) but for that to work the CFI stubs must have names that are recognized by BPF. Rename them. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
dfa4ed29b1 |
sched_ext: Introduce LLC awareness to the default idle selection policy
Rely on the scheduler topology information to implement basic LLC awareness in the sched_ext built-in idle selection policy. This allows schedulers using the built-in policy to make more informed decisions when selecting an idle CPU in systems with multiple LLCs, such as NUMA systems or chiplet-based architectures, and it helps keep tasks within the same LLC domain, thereby improving cache locality. For efficiency, LLC awareness is applied only to tasks that can run on all the CPUs in the system for now. If a task's affinity is modified from user space, it's the responsibility of user space to choose the appropriate optimized scheduling domain. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
b452ae4d20 |
sched_ext: Clarify ops.select_cpu() for single-CPU tasks
Update ops.select_cpu() documentation to clarify that this method is not called for tasks that are restricted to run on a single CPU, as these tasks do not have the option to select a different CPU. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
b55945c500 |
sched: Fix pick_next_task_fair() vs try_to_wake_up() race
Syzkaller robot reported KCSAN tripping over the
ASSERT_EXCLUSIVE_WRITER(p->on_rq) in __block_task().
The report noted that both pick_next_task_fair() and try_to_wake_up()
were concurrently trying to write to the same p->on_rq, violating the
assertion -- even though both paths hold rq->__lock.
The logical consequence is that both code paths end up holding a
different rq->__lock. And looking through ttwu(), this is possible
when the __block_task() 'p->on_rq = 0' store is visible to the ttwu()
'p->on_rq' load, which then assumes the task is not queued and
continues to migrate it.
Rearrange things such that __block_task() releases @p with the store
and no code thereafter will use @p again.
Fixes:
|
|
|
|
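A toy sketch of the ordering idea in the fix above: make the on_rq-clearing store the very last access to the task on the blocking side and publish it with release semantics, so a waker that observes on_rq == 0 can safely take the task over. C11 atomics stand in for the kernel's primitives, and the struct is a simplified stand-in for task_struct.

```c
#include <stdatomic.h>
#include <stdio.h>

struct task {
	atomic_int on_rq;
	int sched_state;	/* only touched before the final store */
};

static void block_task(struct task *p)
{
	p->sched_state = 0;				/* all other updates first */
	atomic_store_explicit(&p->on_rq, 0,
			      memory_order_release);	/* last access to @p      */
	/* no further dereferences of p after this point */
}

static void try_to_wake_up(struct task *p)
{
	if (atomic_load_explicit(&p->on_rq, memory_order_acquire) == 0)
		printf("waker owns the task now, safe to migrate\n");
}

int main(void)
{
	struct task t = { .on_rq = 1, .sched_state = 1 };
	block_task(&t);
	try_to_wake_up(&t);
	return 0;
}
```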
112cca098a
|
sched_getattr: port to copy_struct_to_user
sched_getattr(2) doesn't care about trailing non-zero bytes in the (ksize > usize) case, so just use copy_struct_to_user() without checking ignored_trailing. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20241010-extensible-structs-check_fields-v3-2-d2833dfe6edd@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
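A user-space model of the helper's semantics as described above: copy min(ksize, usize) bytes and zero-fill any extra space the caller provided, ignoring trailing kernel fields that an older userspace struct does not know about. This mirrors, but is not, the kernel's copy_struct_to_user(); the function and struct names are invented for the example.

```c
#include <stdio.h>
#include <string.h>

static int copy_struct_out(void *dst, size_t usize, const void *src, size_t ksize)
{
	size_t n = usize < ksize ? usize : ksize;

	memcpy(dst, src, n);
	if (usize > ksize)		/* newer userspace, older kernel */
		memset((char *)dst + ksize, 0, usize - ksize);
	/* usize < ksize: trailing kernel fields are simply not copied */
	return 0;
}

int main(void)
{
	struct { int a, b, c; } kernel_attr = { 1, 2, 3 };
	struct { int a, b; } user_attr;		/* older, smaller layout */

	copy_struct_out(&user_attr, sizeof(user_attr),
			&kernel_attr, sizeof(kernel_attr));
	printf("a=%d b=%d\n", user_attr.a, user_attr.b);
	return 0;
}
```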
d1fb8a78b2 |
Linux 6.12-rc4
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmcVgfoeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGhCYH/0Sdfp3cIq3JWLRv HCkWhPkPbEvR5XQlYQsAvTPVrEc0ZG9PKlXCaYaa8Tvt8xQ7WT/VDTjKgaWEhr8s qa6bNTx1zggiNBTP/3jYsNliOyAYfw5qjxA7fpEmueAeuT5y1XKZFKPHEXE/1qbR 8zeISKTkE0qwUmLqCdXe2qBWFnCC5i+78RcI6IN7uErnuNWk7ssapldgU4DB+dEl DDRxi1FTvARGPQGl8T+jPkfJiugv87ksG7l4WsqcYgoW+045K76C7I6vQjkDOrsd wqtPIow/yPmGQbbdRhWLxNU+wDmselYQ6xp7aMxppNF45HoHtzNm+X+T2ZU3bPoP iT2Mkbg= =+GXK -----END PGP SIGNATURE----- Merge tag 'v6.12-rc4' into sched/core, to resolve conflict Overlapping fixes solving the same bug slightly differently: |