mirror of https://github.com/torvalds/linux.git
4492 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
a22ea738f4 |
Fix a sleeping-while-atomic bug caused by a recent optimization
utilizing static keys that didn't consider that the static_key_disable() call could be triggered in atomic context. Revert the optimization. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfT8m4RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1h8IRAAlvqGw8HmdN5VGdco95ejRK8zdPLWQFvI m7FvjSZu5ciZrxz72QVmpIoprfNSb7WiYFhrVi7x6qIHgc/g9frIWxRP/OnXsMSg hWAHystWPVrLirD6lN46nOKvVaqaACa6NyndNnwGtLEwTyf3Rc87XU/gRjiCwCBG hFt7HAEbgSi0YnKcPm79Le0Es33ZfH1Z47LSOSLlEyZbp+dcoliOzveJmbxhcfTM 3yUO9HtoFRKsUbDZCYHE/ZWgDJbaGNPokwHf9XMSFrWbTXPHxvJvKgZKMfaN+rtr NKXzDvUW7D3Au3KHrBk39TcBZMkAKLnnze5RV8xAurKzMZBRWD7ybKetzeMt8gzg os/Ge9M5AxldiuPxK8s8S+X+L5tQ/uNwzbQEaC0w4ouH3ZBjgoHTWNJM6Lpwpl42 3c+zBC+2TdC2dDj0elL2Z34T6X6FiHMoBHOIKaTvX7Lh5G4lTLC9+GIBAt/vSAFv vqpo3Bc61qq2jgpnmWRVVfvHKrvA7dvfvId0oE8JQfGB446glLlEKJ7Dyn65NJp6 9rc2wPAAs+oaaxi47p94CKwBvYzyKkMXSnyB5WXQJOk4O6QD7uv9qGXHzy+8gAom QAFTJGpTr7mnXcu/ZOdGGBqoq5Kcbvf15SnnkYqduE0f/omoW+BqgEyGxvbP+qsV /7D7ZCXbyk0= =fXc6 -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a sleeping-while-atomic bug caused by a recent optimization utilizing static keys that didn't consider that the static_key_disable() call could be triggered in atomic context. Revert the optimization" * tag 'sched-urgent-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/clock: Don't define sched_clock_irqtime as static key |
|
|
|
b7f94fcf55 |
sched_ext: A fix for v6.14-rc6
BPF schedulers could trigger a crash by passing in an invalid CPU to the helper scx_bpf_select_cpu_dfl(). Fix it by verifying input validity. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ9H4tw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGY2PAP95rqNEACvl5uz/HQM+T0WpwGDaIJ3fmKYd3GZY 3XJjhwD/YmKMLmth0xeDLkAtVUNsMp4EjpssKdzi0CJq+Nl4nQw= =OxEw -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.14-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fix from Tejun Heo: "BPF schedulers could trigger a crash by passing in an invalid CPU to the scx_bpf_select_cpu_dfl() helper. Fix it by verifying input validity" * tag 'sched_ext-for-6.14-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl() |
|
|
|
f3fa0e40df |
sched/clock: Don't define sched_clock_irqtime as static key
The sched_clock_irqtime was defined as a static key in: |
|
|
|
14672f059d |
sched/deadline: Use online cpus for validating runtime
The ftrace selftest reported a failure because writing -1 to sched_rt_runtime_us returns -EBUSY. This happens when the possible CPUs are different from active CPUs. Active CPUs are part of one root domain, while remaining CPUs are part of def_root_domain. Since active cpumask is being used, this results in cpus=0 when a non active CPUs is used in the loop. Fix it by looping over the online CPUs instead for validating the bandwidth calculations. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20250306052954.452005-2-sshegde@linux.ibm.com |
|
|
|
3b4035ddbf |
sched/fair: Fix potential memory corruption in child_cfs_rq_on_list
child_cfs_rq_on_list attempts to convert a 'prev' pointer to a cfs_rq.
This 'prev' pointer can originate from struct rq's leaf_cfs_rq_list,
making the conversion invalid and potentially leading to memory
corruption. Depending on the relative positions of leaf_cfs_rq_list and
the task group (tg) pointer within the struct, this can cause a memory
fault or access garbage data.
The issue arises in list_add_leaf_cfs_rq, where both
cfs_rq->leaf_cfs_rq_list and rq->leaf_cfs_rq_list are added to the same
leaf list. Also, rq->tmp_alone_branch can be set to rq->leaf_cfs_rq_list.
This adds a check `if (prev == &rq->leaf_cfs_rq_list)` after the main
conditional in child_cfs_rq_on_list. This ensures that the container_of
operation will convert a correct cfs_rq struct.
This check is sufficient because only cfs_rqs on the same CPU are added
to the list, so verifying the 'prev' pointer against the current rq's list
head is enough.
Fixes a potential memory corruption issue that due to current struct
layout might not be manifesting as a crash but could lead to unpredictable
behavior when the layout changes.
Fixes:
|
|
|
|
9360dfe4cb |
sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids
range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel
crash.
To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and
trigger an scx error if an invalid CPU is specified.
Fixes:
|
|
|
|
d203484f25 |
Prevent cond_resched() based preemption when interrupts are disabled,
on PREEMPT_NONE and PREEMPT_VOLUNTARY kernels. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfCDDMRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gf+RAAvFNXelLgrNbILZ6ckp/ikWnjCbf2QOIk aCm6JMQm7WrFvgo1u6CM4vQQYZdEqf8+KiEjJJnoq2P4jvYzhO1/1pLfEDNaHeiH GneosmKAwSMR8lgDlw5DXxhXsfeuYYhG5VMe2ia+kyiIA83TUF6hl9jpawWB3dsw +xB6CAg3JLoR2v44E/Mf1PdGaGrF90fYxp+X5RNSqxVXcN54cgVx2G9lHeTIWcnp SjIiWo5mply50de+dxD5dNUB9mj/k+yLQaiuPfUDGo/ZOjFyBnsP5VlD+ySbhkIa Rwdw6olLqXLcX5D5RsPIuePm/XdmAQXr6GXxJjdhtV1oWTP3Bejev3upQ/kxHQ50 DQa+aSTqNx9bNlwphUafCmVo1OZap4mViOSWP7r96HhFwehLGGmkjEaU9eFuUl0P kG+qGq28U+Nnz0r6/pEkwic1B6wbq2x1XRbtJqxXnBcQvMxMgDWNrTIj1ytDcSBb 3Qo0shRrtjH7DN1ly8IBllLQ0wXXI5O6GwjI7absEyEjpdoxFyMsHpaFONlTWRdi NgR2+5MWTxExeWaDRPAJM+THzwucfWVTeZVXJFMRfQnNIBj7TpO3X3Y4xzP9Vl/Y 2HEz8voSDZUVN6Ejxx/am7kb68WpWw46xmj59wWT7nf9SVEEm+R4Pfe3O9+0yvQV V4l6tN4yfEU= =RknP -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Prevent cond_resched() based preemption when interrupts are disabled, on PREEMPT_NONE and PREEMPT_VOLUNTARY kernels" * tag 'sched-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Prevent rescheduling when interrupts are disabled |
|
|
|
82c387ef75 |
sched/core: Prevent rescheduling when interrupts are disabled
David reported a warning observed while loop testing kexec jump: Interrupts enabled after irqrouter_resume+0x0/0x50 WARNING: CPU: 0 PID: 560 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220 kernel_kexec+0xf6/0x180 __do_sys_reboot+0x206/0x250 do_syscall_64+0x95/0x180 The corresponding interrupt flag trace: hardirqs last enabled at (15573): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90 hardirqs last disabled at (15580): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90 That means __up_console_sem() was invoked with interrupts enabled. Further instrumentation revealed that in the interrupt disabled section of kexec jump one of the syscore_suspend() callbacks woke up a task, which set the NEED_RESCHED flag. A later callback in the resume path invoked cond_resched() which in turn led to the invocation of the scheduler: __cond_resched+0x21/0x60 down_timeout+0x18/0x60 acpi_os_wait_semaphore+0x4c/0x80 acpi_ut_acquire_mutex+0x3d/0x100 acpi_ns_get_node+0x27/0x60 acpi_ns_evaluate+0x1cb/0x2d0 acpi_rs_set_srs_method_data+0x156/0x190 acpi_pci_link_set+0x11c/0x290 irqrouter_resume+0x54/0x60 syscore_resume+0x6a/0x200 kernel_kexec+0x145/0x1c0 __do_sys_reboot+0xeb/0x240 do_syscall_64+0x95/0x180 This is a long standing problem, which probably got more visible with the recent printk changes. Something does a task wakeup and the scheduler sets the NEED_RESCHED flag. cond_resched() sees it set and invokes schedule() from a completely bogus context. The scheduler enables interrupts after context switching, which causes the above warning at the end. Quite some of the code paths in syscore_suspend()/resume() can result in triggering a wakeup with the exactly same consequences. They might not have done so yet, but as they share a lot of code with normal operations it's just a question of time. The problem only affects the PREEMPT_NONE and PREEMPT_VOLUNTARY scheduling models. Full preemption is not affected as cond_resched() is disabled and the preemption check preemptible() takes the interrupt disabled flag into account. Cure the problem by adding a corresponding check into cond_resched(). Reported-by: David Woodhouse <dwmw@amazon.co.uk> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Closes: https://lore.kernel.org/all/7717fe2ac0ce5f0a2c43fdab8b11f4483d54a2a4.camel@infradead.org |
|
|
|
e6d3c4e535 |
sched_ext: A fix for v6.14-rc4
pick_task_scx() has a workaround to avoid stalling when the fair class's balance() says yes but pick_task() says no. The workaround was incorrectly deciding to keep the prev taks running if the task is on SCX even when the task is in a sleeping state, which can lead to several confusing failure modes. Fix it by testing the prev task is currently queued on SCX instead. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ79/Ww4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGdhUAQDM1AcK7pJUHzayuQCecCxNspGty8nR9T4KeVly 51pA2gEA4sbs6Fj4doVKVyaCunsvFoZ8Tb/utCX716fVnpjMMgY= =yDMV -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.14-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fix from Tejun Heo: "pick_task_scx() has a workaround to avoid stalling when the fair class's balance() says yes but pick_task() says no. The workaround was incorrectly deciding to keep the prev taks running if the task is on SCX even when the task is in a sleeping state, which can lead to several confusing failure modes. Fix it by testing the prev task is currently queued on SCX instead" * tag 'sched_ext-for-6.14-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance() |
|
|
|
8fef0a3b17 |
sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()
|
|
|
|
02d954c0fd |
sched: Compact RSEQ concurrency IDs with reduced threads and affinity
When a process reduces its number of threads or clears bits in its CPU affinity mask, the mm_cid allocation should eventually converge towards smaller values. However, the change introduced by: commit |
|
|
|
ff3b373ecc |
- Clarify what happens when a task is woken up from the wake queue and make
clear its removal from that queue is atomic -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmexq1MACgkQEsHwGGHe VUq6ghAAoD872KGmQ3YioDZs+FKLpLWvo+6lC2rY+GFQ3oCu4TJfmlsiTGLCyzjv aYwL52diIORD9/Yfn6Sq/ZWkfncoVmwnht+tgVjXeRr3Pb4EnttgWxPRx/xYQizr jgRASpNsRUTr6zNzqEeeQYodIJaInOF5+r26oqYArcN5V9XB9Qaj0+f14UiyB6u+ 53qpEQqQopeLPyG4t59iUfefsaWm2ZIW3EnoWeyI8sRuaapGY/0LHhUAn6vcA++N kuUkliVsk+f/uTNZeJ4zv2uy8DpBXO4kTjmwPVwFz46sJ8RL8P7MOLax7e7fNssw tylwHt4qoLEoB2vg1yMvlUNFMeH85gj8hTyJMsgGtFTbwCH0kLFEoXUz0lfKMS7U A271E1Bumu3OT7FrAnxQahViv02YWG5fcg6R3OidQdSmgoQBIMJwDA2pyKLiq9FL 7mWoNfEqqBWn4O/1qBlf3jCvfFlzXRSSgVzEruoNB93cgzTaQaN5yVgeekMzwJEj NDowmIZRxEN8+lJyMxIGSOGa44aTXu0/+dtehEDeSpsDOXULFc5fpYt4SSa0Jt/F LlgnPGkM1vF0ddG5vDJpGw6B9Dhb82i1oYy2IVwOCkLOXSp2kvLDEI3sqgk5mmlH zFtAV21Zm4jqBke/aF7r+RCYRvDyQkuW0d1+H9okGWLSk7jKauM= =J43o -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: - Clarify what happens when a task is woken up from the wake queue and make clear its removal from that queue is atomic * tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Clarify wake_up_q()'s write to task->wake_q.next |
|
|
|
04f41cbf03 |
sched_ext: Fixes for v6.14-rc2
- Fix lock imbalance in a corner case of dispatch_to_local_dsq(). - Migration disabled tasks were confusing some BPF schedulers and its handling had a bug. Fix it and simplify the default behavior by dispatching them automatically. - ops.tick(), ops.disable() and ops.exit_task() were incorrectly disallowing kfuncs that require the task argument to be the rq operation is currently operating on and thus is rq-locked. Allow them. - Fix autogroup migration handling bug which was occasionally triggering a warning in the cgroup migration path. - tools/sched_ext, selftest and other misc updates. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ695uA4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGeCfAQDmUixMNJCIrRphYsWcYUzlGLZyyRpQEEYFtRMO UC266gD+PUV2UvuO5sAVH8HVnGdOqkXaE/IRG+TC7fQH3ruPlgI= =LFd1 -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix lock imbalance in a corner case of dispatch_to_local_dsq() - Migration disabled tasks were confusing some BPF schedulers and its handling had a bug. Fix it and simplify the default behavior by dispatching them automatically - ops.tick(), ops.disable() and ops.exit_task() were incorrectly disallowing kfuncs that require the task argument to be the rq operation is currently operating on and thus is rq-locked. Allow them. - Fix autogroup migration handling bug which was occasionally triggering a warning in the cgroup migration path - tools/sched_ext, selftest and other misc updates * tag 'sched_ext-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx sched_ext: Fix the incorrect bpf_list kfunc API in common.bpf.h. sched_ext: selftests: Fix grammar in tests description sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq() sched_ext: Fix migration disabled handling in targeted dispatches sched_ext: Implement auto local dispatching of migration disabled tasks sched_ext: Fix incorrect time delta calculation in time_delta() sched_ext: Fix lock imbalance in dispatch_to_local_dsq() sched_ext: selftests/dsp_local_on: Fix selftest on UP systems tools/sched_ext: Add helper to check task migration state sched_ext: Fix incorrect autogroup migration detection sched_ext: selftests/dsp_local_on: Fix sporadic failures selftests/sched_ext: Fix enum resolution sched_ext: Include task weight in the error state dump sched_ext: Fixes typos in comments |
|
|
|
f5717c93a1 |
sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx
Now when we use scx_bpf_task_cgroup() in ops.tick() to get the cgroup of the current task, the following error will occur: scx_foo[3795244] triggered exit kind 1024: runtime error (called on a task not being operated on) The reason is that we are using SCX_CALL_OP() instead of SCX_CALL_OP_TASK() when calling ops.tick(), which triggers the error during the subsequent scx_kf_allowed_on_arg_tasks() check. SCX_CALL_OP_TASK() was first introduced in commit |
|
|
|
f3f08c3acf |
sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()
While fixing migration disabled task handling, |
|
|
|
3296682157 |
sched_ext: Fix migration disabled handling in targeted dispatches
A dispatch operation that can target a specific local DSQ - scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the task can be migrated to the target CPU using task_can_run_on_remote_rq(). If the task can't be migrated to the targeted CPU, it is bounced through a global DSQ. task_can_run_on_remote_rq() assumes that the task is on a CPU that's different from the targeted CPU but the callers doesn't uphold the assumption and may call the function when the task is already on the target CPU. When such task has migration disabled, task_can_run_on_remote_rq() ends up returning %false incorrectly unnecessarily bouncing the task to a global DSQ. Fix it by updating the callers to only call task_can_run_on_remote_rq() when the task is on a different CPU than the target CPU. As this is a bit subtle, for clarity and documentation: - Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task is on the same CPU as the target CPU. - is_migration_disabled() test in task_can_run_on_remote_rq() cannot trigger if the task is on a different CPU than the target CPU as the preceding task_allowed_on_cpu() test should fail beforehand. Convert the test into SCHED_WARN_ON(). Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: |
|
|
|
2fa0fbeb69 |
sched_ext: Implement auto local dispatching of migration disabled tasks
Migration disabled tasks are special and pinned to their previous CPUs. They tripped up some unsuspecting BPF schedulers as their ->nr_cpus_allowed may not agree with the bits set in ->cpus_ptr. Make it easier for BPF schedulers by automatically dispatching them to the pinned local DSQs by default. If a BPF scheduler wants to handle migration disabled tasks explicitly, it can set SCX_OPS_ENQ_MIGRATION_DISABLED. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> |
|
|
|
c7b92e8969 |
Fix a cfs_rq->h_nr_runnable accounting bug that trips up a
defensive SCHED_WARN_ON() on certain workloads. The bug is believed to be (accidentally) self-correcting, hence no behavioral side effects are expected. Also print se.slice in debug output, since this value can now be set via the syscall ABI and can be useful to track. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmenIcsRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iyZA//ZpzGiiZ8coXBk4PQ77c0+BOgSGdkbWeR zicKvqWd+j9skOTVrIk8MPs7D3C5uNuDuVAeNWYalBHRO7ndLfCg36pHqR8tQ3xN 2GwziJzKpi1r4WXBSCtFe/abKrMhIsMYiqKDQz3Ry2Zfn+TuY4t4bzEYgfm481mO FkLcXs97RI5sFsCBI0uhRu76A9jrNRZKW2zczHTb2MS4zjwGq1yL3vkx2M11d0Su BqrcPjz6mZFuzCrA6pW0QeSP4Mn5xL8c2seoMYnvFrPYzq47Thxm4nOWPURCN9uX 6MmqNXisJjTe00noswhbcUia1n2cr37uu5d21tXhn7Rf2XTO4+WRE7zjzvxLKQCr DtcH1XqvnChPGPFTsSlso4vbygfEZIk14DWZ71mXQJto6t06A/yjFiAR82ZIKzLA ZlK6hnX2rQ8WST3XSixiRPU/5xDu3ptJ0aDRV/dScCMzb7c0Y2YSeqA4xBMcZDPq A1TrNUsnsvH0TJXXqeslDSxxEFUKHNiOI2itzMCxAteEZXPysa8Rvcl4zFOZ5A6F nb8hDN94dydNnPpf4g+ZIVEHs5ES3Yeexh1dGwCjQSp72qAJK3wkHQ6KfPN/T3Gb PbDVw93cYJI/fhB2BX9DcnN+uNLTrqr1kfY7uFBE1feMdnQNBHILdjbByILfQAQ+ 0r7+aYwn0F0= =5mot -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Fix a cfs_rq->h_nr_runnable accounting bug that trips up a defensive SCHED_WARN_ON() on certain workloads. The bug is believed to be (accidentally) self-correcting, hence no behavioral side effects are expected. Also print se.slice in debug output, since this value can now be set via the syscall ABI and can be useful to track" * tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/debug: Provide slice length for fair tasks sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue |
|
|
|
bcc6244e13 |
sched: Clarify wake_up_q()'s write to task->wake_q.next
Clarify that wake_up_q() does an atomic write to task->wake_q.next, after which a concurrent __wake_q_add() can immediately overwrite task->wake_q.next again. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250129-sched-wakeup-prettier-v1-1-2f51f5f663fa@google.com |
|
|
|
9065ce6975 |
sched/debug: Provide slice length for fair tasks
Since commit:
|
|
|
|
f55b0671e3 |
More power management updates for 6.14-rc1
- Add missing error handling for syscore_suspend() to the hibernation
core code (Wentao Liang).
- Revert a commit that added unused macros (Andy Shevchenko).
- Synchronize the runtime PM status of devices that were runtime-
suspended before a system-wide suspend and need to be resumed during
the subsequent system-wide resume transition (Rafael Wysocki).
- Clean up the teo cpuidle governor and make the handling of short idle
intervals in it consistent regardless of the properties of idle
states supplied by the cpuidle driver (Rafael Wysocki).
- Fix some boost-related issues in cpufreq (Lifeng Zheng).
- Fix build issues in the s3c64xx and airoha cpufreq drivers (Viresh
Kumar).
- Remove unconditional binding of schedutil governor kthreads to the
affected CPUs if the cpufreq driver indicates that updates can happen
from any CPU (Christian Loehle).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmeb5xYSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxcFsP/2FIoEI2G6J7pk8zChWT225qkkaieh5P
tHIkcFINlgzyjLnqmyWELUdt+sB7re6/dMmoLor+abudHimvBvUfAj6Oiz1F3p2F
utE9TpfhOkXi1ci5zBl9h6+iDj2Z5op3Qe/qw/W3DTlcManAD+6r60A2tOEy0jhi
GTbp2SEEU28+LU/2J59IfxEuRTTH4pbQGXi+iKv/k9bmtLvQofa1saXyQCBSZrvO
z3MBdqnAxLeZCg/qILmEGsBvbv1wpugvp3yoMLVwGNyul12Augcs8PreQz7e5tFq
spEuCfpBJwyJLAGlOnjOYgsPbJBXWRkIBeLH7JealfZr9TX0y4LZSAHi/xe0Asd3
BBZLLDojxhYMLzmqSkuafHlQd5J7jKl++RS1A9Qm6aqglKjeSOC9Ca9fmrc9Ub9P
Jpf1SVJ3kJsv1Z7wuUcaj6oLxD8wlAgCo5pigNgWTP2HllhP2bmc22M8JWxpAz+m
nMeW8nr8bAu4XViZeb74YKGgUDngO/uKRwBpthSkE2fqM7q0Wr5E7G0u9M91mzG6
nd/XDwta5TeznMQpSy339NgT61i4HHfyc/SDIdpkBxI0C5l6jknNawq79i9gy7/E
In4MyooOlls/iX/JR0uxx2hXEByltF0IHqwsRYeJ6dmIajYgASXR1Hhh5Iy9fPJ/
JTJ7vR5oZPB/
=EgsM
-----END PGP SIGNATURE-----
Merge tag 'pm-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These are mostly fixes on top of the previously merged power
management material with the addition of some teo cpuidle governor
updates, some of which may also be regarded as fixes:
- Add missing error handling for syscore_suspend() to the hibernation
core code (Wentao Liang)
- Revert a commit that added unused macros (Andy Shevchenko)
- Synchronize the runtime PM status of devices that were runtime-
suspended before a system-wide suspend and need to be resumed
during the subsequent system-wide resume transition (Rafael
Wysocki)
- Clean up the teo cpuidle governor and make the handling of short
idle intervals in it consistent regardless of the properties of
idle states supplied by the cpuidle driver (Rafael Wysocki)
- Fix some boost-related issues in cpufreq (Lifeng Zheng)
- Fix build issues in the s3c64xx and airoha cpufreq drivers (Viresh
Kumar)
- Remove unconditional binding of schedutil governor kthreads to the
affected CPUs if the cpufreq driver indicates that updates can
happen from any CPU (Christian Loehle)"
* tag 'pm-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: core: Synchronize runtime PM status of parents and children
cpufreq: airoha: Depends on OF
PM: Revert "Add EXPORT macros for exporting PM functions"
PM: hibernate: Add error handling for syscore_suspend()
cpufreq/schedutil: Only bind threads if needed
cpufreq: ACPI: Remove set_boost in acpi_cpufreq_cpu_init()
cpufreq: CPPC: Fix wrong max_freq in policy initialization
cpufreq: Introduce a more generic way to set default per-policy boost flag
cpufreq: Fix re-boost issue after hotplugging a CPU
cpufreq: s3c64xx: Fix compilation warning
cpuidle: teo: Skip sleep length computation for low latency constraints
cpuidle: teo: Replace time_span_ns with a flag
cpuidle: teo: Simplify handling of total events count
cpuidle: teo: Skip getting the sleep length if wakeups are very frequent
cpuidle: teo: Simplify counting events used for tick management
cpuidle: teo: Clarify two code comments
cpuidle: teo: Drop local variable prev_intercept_idx
cpuidle: teo: Combine candidate state index checks against 0
cpuidle: teo: Reorder candidate state index checks
cpuidle: teo: Rearrange idle state lookup code
|
|
|
|
1751f872cc |
treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit
|
|
|
|
1626e5ef0b |
sched_ext: Fix lock imbalance in dispatch_to_local_dsq()
While performing the rq locking dance in dispatch_to_local_dsq(), we may
trigger the following lock imbalance condition, in particular when
multiple tasks are rapidly changing CPU affinity (i.e., running a
`stress-ng --race-sched 0`):
[ 13.413579] =====================================
[ 13.413660] WARNING: bad unlock balance detected!
[ 13.413729] 6.13.0-virtme #15 Not tainted
[ 13.413792] -------------------------------------
[ 13.413859] kworker/1:1/80 is trying to release lock (&rq->__lock) at:
[ 13.413954] [<ffffffff873c6c48>] dispatch_to_local_dsq+0x108/0x1a0
[ 13.414111] but there are no more locks to release!
[ 13.414176]
[ 13.414176] other info that might help us debug this:
[ 13.414258] 1 lock held by kworker/1:1/80:
[ 13.414318] #0: ffff8b66feb41698 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[ 13.414612]
[ 13.414612] stack backtrace:
[ 13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15
[ 13.415505] Workqueue: 0x0 (events)
[ 13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms
[ 13.415570] Call Trace:
[ 13.415700] <TASK>
[ 13.415744] dump_stack_lvl+0x78/0xe0
[ 13.415806] ? dispatch_to_local_dsq+0x108/0x1a0
[ 13.415884] print_unlock_imbalance_bug+0x11b/0x130
[ 13.415965] ? dispatch_to_local_dsq+0x108/0x1a0
[ 13.416226] lock_release+0x231/0x2c0
[ 13.416326] _raw_spin_unlock+0x1b/0x40
[ 13.416422] dispatch_to_local_dsq+0x108/0x1a0
[ 13.416554] flush_dispatch_buf+0x199/0x1d0
[ 13.416652] balance_one+0x194/0x370
[ 13.416751] balance_scx+0x61/0x1e0
[ 13.416848] prev_balance+0x43/0xb0
[ 13.416947] __pick_next_task+0x6b/0x1b0
[ 13.417052] __schedule+0x20d/0x1740
This happens because dispatch_to_local_dsq() is racing with
dispatch_dequeue() and, when the latter wins, we incorrectly assume that
the task has been moved to dst_rq.
Fix by properly tracking the currently locked rq.
Fixes:
|
|
|
|
d6f3e7d564 |
sched_ext: Fix incorrect autogroup migration detection
scx_move_task() is called from sched_move_task() and tells the BPF scheduler
that cgroup migration is being committed. sched_move_task() is used by both
cgroup and autogroup migrations and scx_move_task() tried to filter out
autogroup migrations by testing the destination cgroup and PF_EXITING but
this is not enough. In fact, without explicitly tagging the thread which is
doing the cgroup migration, there is no good way to tell apart
scx_move_task() invocations for racing migration to the root cgroup and an
autogroup migration.
This led to scx_move_task() incorrectly ignoring a migration from non-root
cgroup to an autogroup of the root cgroup triggering the following warning:
WARNING: CPU: 7 PID: 1 at kernel/sched/ext.c:3725 scx_cgroup_can_attach+0x196/0x340
...
Call Trace:
<TASK>
cgroup_migrate_execute+0x5b1/0x700
cgroup_attach_task+0x296/0x400
__cgroup_procs_write+0x128/0x140
cgroup_procs_write+0x17/0x30
kernfs_fop_write_iter+0x141/0x1f0
vfs_write+0x31d/0x4a0
__x64_sys_write+0x72/0xf0
do_syscall_64+0x82/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix it by adding an argument to sched_move_task() that indicates whether the
moving is for a cgroup or autogroup migration. After the change,
scx_move_task() is called only for cgroup migrations and renamed to
scx_cgroup_move_task().
Link: https://github.com/sched-ext/scx/issues/370
Fixes:
|
|
|
|
9c5968db9e |
The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec. - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones. - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest. - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code. - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups. - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code. - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c. - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator. - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading. - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/). Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED). - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled. - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL. - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests. - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size. - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic. - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated. - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated. - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed. - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic. - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy. - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic. - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions. - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed. - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting. - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface. - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior. - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi "introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors." - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram. - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal. - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance. - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation. - "Cleanup for memfd_create()" from Isaac Manjarres does that thing. - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration. - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices. - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE= =Ib2s -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ... |
|
|
|
c159dfbdd4 |
Mainly individually changelogged singleton patches. The patch series in
this pull are:
- "lib min_heap: Improve min_heap safety, testing, and documentation"
from Kuan-Wei Chiu provides various tightenings to the min_heap library
code.
- "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some
cleanup and Rust preparation in the xarray library code.
- "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes
pathnames in some code comments.
- "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the
new secs_to_jiffies() in various places where that is appropriate.
- "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
switches two filesystems to the new mount API.
- "Convert ocfs2 to use folios" from Matthew Wilcox does that.
- "Remove get_task_comm() and print task comm directly" from Yafang Shao
removes now-unneeded calls to get_task_comm() in various places.
- "squashfs: reduce memory usage and update docs" from Phillip Lougher
implements some memory savings in squashfs and performs some
maintainability work.
- "lib: clarify comparison function requirements" from Kuan-Wei Chiu
tightens the sort code's behaviour and adds some maintenance work.
- "nilfs2: protect busy buffer heads from being force-cleared" from
Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a
corrupted image.
- "nilfs2: fix kernel-doc comments for function return values" from
Ryusuke Konishi fixes some nilfs kerneldoc.
- "nilfs2: fix issues with rename operations" from Ryusuke Konishi
addresses some nilfs BUG_ONs which syzbot was able to trigger.
- "minmax.h: Cleanups and minor optimisations" from David Laight
does some maintenance work on the min/max library code.
- "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work
on the xarray library code.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5SP5QAKCRDdBJ7gKXxA
jqN7AQChvwXGG43n4d5SDiA/rH7ddvowQcDqhC9cAMJ1ReR7qwEA8/LIWDE4PdMX
mJnaZ1/ibpEpearrChCViApQtcyEGQI=
=ti4E
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Mainly individually changelogged singleton patches. The patch series
in this pull are:
- "lib min_heap: Improve min_heap safety, testing, and documentation"
from Kuan-Wei Chiu provides various tightenings to the min_heap
library code
- "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms
some cleanup and Rust preparation in the xarray library code
- "Update reference to include/asm-<arch>" from Geert Uytterhoeven
fixes pathnames in some code comments
- "Converge on using secs_to_jiffies()" from Easwar Hariharan uses
the new secs_to_jiffies() in various places where that is
appropriate
- "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
switches two filesystems to the new mount API
- "Convert ocfs2 to use folios" from Matthew Wilcox does that
- "Remove get_task_comm() and print task comm directly" from Yafang
Shao removes now-unneeded calls to get_task_comm() in various
places
- "squashfs: reduce memory usage and update docs" from Phillip
Lougher implements some memory savings in squashfs and performs
some maintainability work
- "lib: clarify comparison function requirements" from Kuan-Wei Chiu
tightens the sort code's behaviour and adds some maintenance work
- "nilfs2: protect busy buffer heads from being force-cleared" from
Ryusuke Konishi fixes an issues in nlifs when the fs is presented
with a corrupted image
- "nilfs2: fix kernel-doc comments for function return values" from
Ryusuke Konishi fixes some nilfs kerneldoc
- "nilfs2: fix issues with rename operations" from Ryusuke Konishi
addresses some nilfs BUG_ONs which syzbot was able to trigger
- "minmax.h: Cleanups and minor optimisations" from David Laight does
some maintenance work on the min/max library code
- "Fixes and cleanups to xarray" from Kemeng Shi does maintenance
work on the xarray library code"
* tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits)
ocfs2: use str_yes_no() and str_no_yes() helper functions
include/linux/lz4.h: add some missing macros
Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent
Xarray: remove repeat check in xas_squash_marks()
Xarray: distinguish large entries correctly in xas_split_alloc()
Xarray: move forward index correctly in xas_pause()
Xarray: do not return sibling entries from xas_find_marked()
ipc/util.c: complete the kernel-doc function descriptions
gcov: clang: use correct function param names
latencytop: use correct kernel-doc format for func params
minmax.h: remove some #defines that are only expanded once
minmax.h: simplify the variants of clamp()
minmax.h: move all the clamp() definitions after the min/max() ones
minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()
minmax.h: reduce the #define expansion of min(), max() and clamp()
minmax.h: update some comments
minmax.h: add whitespace around operators and after commas
nilfs2: do not update mtime of renamed directory that is not moved
nilfs2: handle errors that nilfs_prepare_chunk() may return
CREDITS: fix spelling mistake
...
|
|
|
|
65ef17aa07 |
hung_task: add task->flags, blocked by coredump to log
Resending this patch as I haven't received feedback on my initial submission https://lore.kernel.org/all/20241204182953.10854-1-oxana@cloudflare.com/ For the processes which are terminated abnormally the kernel can provide a coredump if enabled. When the coredump is performed, the process and all its threads are put into the D state (TASK_UNINTERRUPTIBLE | TASK_FREEZABLE). On the other hand, we have kernel thread khungtaskd which monitors the processes in the D state. If the task stuck in the D state more than kernel.hung_task_timeout_secs, the hung_task alert appears in the kernel log. The higher memory usage of a process, the longer it takes to create coredump, the longer tasks are in the D state. We have hung_task alerts for the processes with memory usage above 10Gb. Although, our kernel.hung_task_timeout_secs is 10 sec when the default is 120 sec. Adding additional information to the log that the task is blocked by coredump will help with monitoring. Another approach might be to completely filter out alerts for such tasks, but in that case we would lose transparency about what is putting pressure on some system resources, e.g. we saw an increase in I/O when coredump occurs due its writing to disk. Additionally, it would be helpful to have task_struct->flags in the log from the function sched_show_task(). Currently it prints task_struct->thread_info->flags, this seems misleading as the line starts with "task:xxxx". [akpm@linux-foundation.org: fix printk control string] Link: https://lkml.kernel.org/r/20250110160328.64947-1-oxana@cloudflare.com Signed-off-by: Oxana Kharitonova <oxana@cloudflare.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ben Segall <bsegall@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
2279563e3a |
sched_ext: Include task weight in the error state dump
Report the task weight when dumping the task state during an error exit. Moreover, adjust the output format to display dsq_vtime, slice, and weight on the same line. This can help identify whether certain tasks were excessively prioritized or de-prioritized due to large niceness gaps. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
be8ee18152 |
sched_ext: Fixes typos in comments
Fixes some spelling errors in the comments. Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
bc8198dc7e |
sched_ext: Changes for v6.14
- scx_bpf_now() added so that BPF scheduler can access the cached timestamp in struct rq to avoid reading TSC multiple times within a locked scheduling operation. - Minor updates to the built-in idle CPU selection logic. - tool/sched_ext updates and other misc changes. Pulling sched_ext/for-6.14 into master causes a merge conflict between the following two commits (first commit in master, second in for-6.14): |
|
|
|
93940fbdc4 |
cpufreq/schedutil: Only bind threads if needed
Remove the unconditional binding of sugov kthreads to the affected CPUs if the cpufreq driver indicates that updates can happen from any CPU. This allows userspace to set affinities to either save power (waking up bigger CPUs on HMP can be expensive) or increasing performance (by letting the utilized CPUs run without preemption of the sugov kthread). Signed-off-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/5a8deed4-7764-4729-a9d4-9520c25fa7e8@arm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
|
|
|
f4b9d3bf44 |
Power management updates for 6.14-rc1
- Use str_enable_disable()-like helpers in cpufreq (Krzysztof
Kozlowski).
- Extend the Apple cpufreq driver to support more SoCs (Hector Martin,
Nick Chan).
- Add new cpufreq driver for Airoha SoCs (Christian Marangi).
- Fix using cpufreq-dt as module (Andreas Kemnade).
- Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter
Edwards, Sibi Sankar, Manivannan Sadhasivam).
- Fix the maximum supported frequency computation in the ACPI cpufreq
driver to avoid relying on unfounded assumptions (Gautham Shenoy).
- Fix an amd-pstate driver regression with preferred core rankings not
being used (Mario Limonciello).
- Fix a precision issue with frequency calculation in the amd-pstate
driver (Naresh Solanki).
- Add ftrace event to the amd-pstate driver for active mode (Mario
Limonciello).
- Set default EPP policy on Ryzen processors in amd-pstate (Mario
Limonciello).
- Clean up the amd-pstate cpufreq driver and optimize it to increase
code reuse (Mario Limonciello, Dhananjay Ugwekar).
- Use CPPC to get scaling factors between HWP performance levels and
frequency in the intel_pstate driver and make it stop using a built
-in scaling factor for Arrow Lake processors (Rafael Wysocki).
- Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN for
consistency with CPU offline (Christian Loehle).
- Fix superfluous updates caused by need_freq_update in the schedutil
cpufreq governor (Sultan Alsawaf).
- Allow configuring the system suspend-resume (DPM) watchdog to warn
earlier than panic (Douglas Anderson).
- Implement devm_device_init_wakeup() helper and introduce a device-
managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan).
- Remove direct inclusions of 'pm_wakeup.h' which should be only
included via 'device.h' (Wolfram Sang).
- Clean up two comments in the core system-wide PM code (Rafael
Wysocki, Randy Dunlap).
- Add Clearwater Forest processor support to the intel_idle cpuidle
driver (Artem Bityutskiy).
- Clean up the Exynos devfreq driver and devfreq core (Markus Elfring,
Jeongjun Park).
- Minor cleanups and fixes for OPP (Dan Carpenter, Neil Armstrong, Joe
Hattori).
- Implement dev_pm_opp_get_bw() (Neil Armstrong).
- Expose OPP reference counting helpers for Rust (Viresh Kumar).
- Fix TSC MHz calculation in cpupower (He Rongguang).
- Add install and uninstall options to bindings Makefile and add header
changes for cpufreq.h to SWIG bindings in cpupower (John B. Wyatt IV).
- Add missing residency header changes in cpuidle.h to SWIG bindings in
cpupower (John B. Wyatt IV).
- Add output files to .gitignore and clean them up in "make clean" in
selftests/cpufreq (Li Zhijian).
- Fix cross-compilation in cpupower Makefile (Peng Fan).
- Revise the is_valid flag handling for idle_monitor in the cpupower
utility (wangfushuai).
- Extend and clean up AMD processors support in cpupower (Mario
Limonciello).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmeOthsSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxqQsP/ivDt8nqDnxdKB7cKFQIsEK+tl0RnFVD
o5regvYeRcGWpUXuMaqBtTmCMjsB8bUkcj2yLquM54ubjHAGF6zJuw9ZytMPHVcC
b2xk3RCFlXSBFXVK8eOh3XRviA9nGhuY97ZnPsQOlvoECrxT2xyeL+mWo7s+t+q9
2NUH+yfRoi5FM+nqqDhsm0xXxJuPaNg6eAjIASuMjXap48rNk3L5kW6W/6nw7i0I
xQWd/pKLHaI5e7DRF/QdMKu8+Fm4BbN0jMqLblKPOmTe9KggvBkck5q1Um20sYkJ
vdKMAT02ClGavIC7DtY092Xik84NZfID4ZUchS6e2hJIQ3Uaw/eDvAo/jlT8gIzq
fnXPdApRIzQGDvMxFaAsKaGlwxiVlAGHPDSTH6MVWzsp+1DSkbloSwVPAfeYIn44
Jhov+6Ydux3597sSjo+YmD58acimXl7urVuk8P6m3U5+gb8/jlgbxpIn+vbxH3Ka
o44Vt7axD63gezOQY134sj5gic5JL0GuZovOlvzrF6+FsjvVqcax6FZ4n3uIXu7P
C1nwai+Wdzo7wvuz7RfO0g15Y15wYLQLYsRq/osRlf+sOmGVv7nA9tSzZ0LUdD5D
Pp6PxppF6anM0Kjen8Ppuu+Bcr11JfVvhnVTJqhs6u71XdAy4TnG1JjL4lPWYJ4D
Gfz2hyPNjiQX
=AoMC
-----END PGP SIGNATURE-----
Merge tag 'pm-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"The majority of changes here are cpufreq updates which are dominated
by amd-pstate driver changes, like in the previous cycle. Moreover,
changes related to amd-pstate are also the majority of cpupower
utility updates.
Included are some pieces of new hardware support, like the addition of
Clearwater Forest processors support to intel_idle, new cpufreq driver
for Airoha SoCs, and Apple cpufreq driver extensions to support more
SoCs. The intel_pstate driver is also extended to be able to support
new platforms by using ACPI CPPC to compute scaling factors between
HWP performance states and frequency.
The rest is mostly fixes and cleanups in assorted pieces of power
management code.
Specifics:
- Use str_enable_disable()-like helpers in cpufreq (Krzysztof
Kozlowski)
- Extend the Apple cpufreq driver to support more SoCs (Hector
Martin, Nick Chan)
- Add new cpufreq driver for Airoha SoCs (Christian Marangi)
- Fix using cpufreq-dt as module (Andreas Kemnade)
- Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter
Edwards, Sibi Sankar, Manivannan Sadhasivam)
- Fix the maximum supported frequency computation in the ACPI cpufreq
driver to avoid relying on unfounded assumptions (Gautham Shenoy)
- Fix an amd-pstate driver regression with preferred core rankings
not being used (Mario Limonciello)
- Fix a precision issue with frequency calculation in the amd-pstate
driver (Naresh Solanki)
- Add ftrace event to the amd-pstate driver for active mode (Mario
Limonciello)
- Set default EPP policy on Ryzen processors in amd-pstate (Mario
Limonciello)
- Clean up the amd-pstate cpufreq driver and optimize it to increase
code reuse (Mario Limonciello, Dhananjay Ugwekar)
- Use CPPC to get scaling factors between HWP performance levels and
frequency in the intel_pstate driver and make it stop using a
built-in scaling factor for Arrow Lake processors (Rafael Wysocki)
- Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN
for consistency with CPU offline (Christian Loehle)
- Fix superfluous updates caused by need_freq_update in the schedutil
cpufreq governor (Sultan Alsawaf)
- Allow configuring the system suspend-resume (DPM) watchdog to warn
earlier than panic (Douglas Anderson)
- Implement devm_device_init_wakeup() helper and introduce a device-
managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan)
- Remove direct inclusions of 'pm_wakeup.h' which should be only
included via 'device.h' (Wolfram Sang)
- Clean up two comments in the core system-wide PM code (Rafael
Wysocki, Randy Dunlap)
- Add Clearwater Forest processor support to the intel_idle cpuidle
driver (Artem Bityutskiy)
- Clean up the Exynos devfreq driver and devfreq core (Markus
Elfring, Jeongjun Park)
- Minor cleanups and fixes for OPP (Dan Carpenter, Neil Armstrong,
Joe Hattori)
- Implement dev_pm_opp_get_bw() (Neil Armstrong)
- Expose OPP reference counting helpers for Rust (Viresh Kumar)
- Fix TSC MHz calculation in cpupower (He Rongguang)
- Add install and uninstall options to bindings Makefile and add
header changes for cpufreq.h to SWIG bindings in cpupower (John B.
Wyatt IV)
- Add missing residency header changes in cpuidle.h to SWIG bindings
in cpupower (John B. Wyatt IV)
- Add output files to .gitignore and clean them up in "make clean" in
selftests/cpufreq (Li Zhijian)
- Fix cross-compilation in cpupower Makefile (Peng Fan)
- Revise the is_valid flag handling for idle_monitor in the cpupower
utility (wangfushuai)
- Extend and clean up AMD processors support in cpupower (Mario
Limonciello)"
* tag 'pm-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (67 commits)
PM / OPP: Add reference counting helpers for Rust implementation
PM: sleep: wakeirq: Introduce device-managed variant of dev_pm_set_wake_irq()
cpufreq: Use str_enable_disable()-like helpers
cpufreq: airoha: Add EN7581 CPUFreq SMCCC driver
PM: sleep: Allow configuring the DPM watchdog to warn earlier than panic
PM: sleep: convert comment from kernel-doc to plain comment
cpufreq: ACPI: Fix max-frequency computation
pm: cpupower: Add missing residency header changes in cpuidle.h to SWIG
PM / devfreq: exynos: remove unused function parameter
OPP: OF: Fix an OF node leak in _opp_add_static_v2()
cpufreq/amd-pstate: Refactor max frequency calculation
cpufreq/amd-pstate: Fix prefcore rankings
pm: cpupower: Add header changes for cpufreq.h to SWIG bindings
cpufreq: sparc: change kzalloc to kcalloc
cpufreq: qcom: Implement clk_ops::determine_rate() for qcom_cpufreq* clocks
cpufreq: qcom: Fix qcom_cpufreq_hw_recalc_rate() to query LUT if LMh IRQ is not available
cpufreq: apple-soc: Add Apple A7-A8X SoC cpufreq support
cpufreq: apple-soc: Set fallback transition latency to APPLE_DVFS_TRANSITION_TIMEOUT
cpufreq: apple-soc: Increase cluster switch timeout to 400us
cpufreq: apple-soc: Use 32-bit read for status register
...
|
|
|
|
1d6d399223 |
Kthreads affinity follow either of 4 existing different patterns:
1) Per-CPU kthreads must stay affine to a single CPU and never execute
relevant code on any other CPU. This is currently handled by smpboot
code which takes care of CPU-hotplug operations. Affinity here is
a correctness constraint.
2) Some kthreads _have_ to be affine to a specific set of CPUs and can't
run anywhere else. The affinity is set through kthread_bind_mask()
and the subsystem takes care by itself to handle CPU-hotplug
operations. Affinity here is assumed to be a correctness constraint.
3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This
is not a correctness constraint but merely a preference in terms of
memory locality. kswapd and kcompactd both fall into this category.
The affinity is set manually like for any other task and CPU-hotplug
is supposed to be handled by the relevant subsystem so that the task
is properly reaffined whenever a given CPU from the node comes up.
Also care should be taken so that the node affinity doesn't cross
isolated (nohz_full) cpumask boundaries.
4) Similar to the previous point except kthreads have a _preferred_
affinity different than a node. Both RCU boost kthreads and RCU
exp kworkers fall into this category as they refer to "RCU nodes"
from a distinctly distributed tree.
Currently the preferred affinity patterns (3 and 4) have at least 4
identified users, with more or less success when it comes to handle
CPU-hotplug operations and CPU isolation. Each of which do it in its own
ad-hoc way.
This is an infrastructure proposal to handle this with the following API
changes:
_ kthread_create_on_node() automatically affines the created kthread to
its target node unless it has been set as per-cpu or bound with
kthread_bind[_mask]() before the first wake-up.
- kthread_affine_preferred() is a new function that can be called right
after kthread_create_on_node() to specify a preferred affinity
different than the specified node.
When the preferred affinity can't be applied because the possible
targets are offline or isolated (nohz_full), the kthread is affine
to the housekeeping CPUs (which means to all online CPUs most of the
time or only the non-nohz_full CPUs when nohz_full= is set).
kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
converted, along with a few old drivers.
Summary of the changes:
* Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu()
* Introduce task_cpu_fallback_mask() that defines the default last
resort affinity of a task to become nohz_full aware
* Add some correctness check to ensure kthread_bind() is always called
before the first kthread wake up.
* Default affine kthread to its preferred node.
* Convert kswapd / kcompactd and remove their halfway working ad-hoc
affinity implementation
* Implement kthreads preferred affinity
* Unify kthread worker and kthread API's style
* Convert RCU kthreads to the new API and remove the ad-hoc affinity
implementation.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmeNf8gACgkQhSRUR1CO
jHedQQ/+IxTjjqQiItzrq41TES2S0desHDq8lNJFb7rsR/DtKFyLx3s67cOYV+cM
Yx54QHg2m/Fz4nXMQ7Po5ygOtJGCKBc5C5QQy7y0lVKeTQK+daDfEtBSa3oG7j3C
u+E3tTY6qxkbCzymUyaKkHN4/ay2vLvjFS50luV7KMyI3x47Aji+t7VdCX4LCPP2
eAwOALWD0+7qLJ/VF6gsmQLKA4Qx7PQAzBa3KSBmUN9UcN8Gk1bQHCTIQKDHP9LQ
v8BXrNZtYX1o2+snNYpX2z6/ECjxkdwriOgqqZY5306hd9RAQ1u46Dx3byrIqjGn
ULG/XQ2istPyhTqb/h+RbrobdOcwEUIeqk8hRRbBXE8bPpqUz9EMuaCMxWDbQjgH
NTuKG4ifKJ/IqstkkuDkdOiByE/ysMmwqrTXgSnu2ITNL9yY3BEgFbvA95hgo42s
f7QCxEfZb1MHcNEMENSMwM3xw5lLMGMpxVZcMQ3gLwyotMBRrhFZm1qZJG7TITYW
IDIeCbH4JOMdQwLs3CcWTXio0N5/85NhRNFV+IDn96OrgxObgnMtV8QwNgjXBAJ5
wGeJWt8s34W1Zo3qS9gEuVzEhW4XaxISQQMkHe8faKkK6iHmIB/VjSQikDwwUNQ/
AspYj82RyWBCDZsqhiYh71kpxjvS6Xp0bj39Ce1sNsOnuksxKkQ=
=g8In
-----END PGP SIGNATURE-----
Merge tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthread updates from Frederic Weisbecker:
"Kthreads affinity follow either of 4 existing different patterns:
1) Per-CPU kthreads must stay affine to a single CPU and never
execute relevant code on any other CPU. This is currently handled
by smpboot code which takes care of CPU-hotplug operations.
Affinity here is a correctness constraint.
2) Some kthreads _have_ to be affine to a specific set of CPUs and
can't run anywhere else. The affinity is set through
kthread_bind_mask() and the subsystem takes care by itself to
handle CPU-hotplug operations. Affinity here is assumed to be a
correctness constraint.
3) Per-node kthreads _prefer_ to be affine to a specific NUMA node.
This is not a correctness constraint but merely a preference in
terms of memory locality. kswapd and kcompactd both fall into this
category. The affinity is set manually like for any other task and
CPU-hotplug is supposed to be handled by the relevant subsystem so
that the task is properly reaffined whenever a given CPU from the
node comes up. Also care should be taken so that the node affinity
doesn't cross isolated (nohz_full) cpumask boundaries.
4) Similar to the previous point except kthreads have a _preferred_
affinity different than a node. Both RCU boost kthreads and RCU
exp kworkers fall into this category as they refer to "RCU nodes"
from a distinctly distributed tree.
Currently the preferred affinity patterns (3 and 4) have at least 4
identified users, with more or less success when it comes to handle
CPU-hotplug operations and CPU isolation. Each of which do it in its
own ad-hoc way.
This is an infrastructure proposal to handle this with the following
API changes:
- kthread_create_on_node() automatically affines the created kthread
to its target node unless it has been set as per-cpu or bound with
kthread_bind[_mask]() before the first wake-up.
- kthread_affine_preferred() is a new function that can be called
right after kthread_create_on_node() to specify a preferred
affinity different than the specified node.
When the preferred affinity can't be applied because the possible
targets are offline or isolated (nohz_full), the kthread is affine to
the housekeeping CPUs (which means to all online CPUs most of the time
or only the non-nohz_full CPUs when nohz_full= is set).
kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
converted, along with a few old drivers.
Summary of the changes:
- Consolidate a bunch of ad-hoc implementations of
kthread_run_on_cpu()
- Introduce task_cpu_fallback_mask() that defines the default last
resort affinity of a task to become nohz_full aware
- Add some correctness check to ensure kthread_bind() is always
called before the first kthread wake up.
- Default affine kthread to its preferred node.
- Convert kswapd / kcompactd and remove their halfway working ad-hoc
affinity implementation
- Implement kthreads preferred affinity
- Unify kthread worker and kthread API's style
- Convert RCU kthreads to the new API and remove the ad-hoc affinity
implementation"
* tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
kthread: modify kernel-doc function name to match code
rcu: Use kthread preferred affinity for RCU exp kworkers
treewide: Introduce kthread_run_worker[_on_cpu]()
kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format
rcu: Use kthread preferred affinity for RCU boost
kthread: Implement preferred affinity
mm: Create/affine kswapd to its preferred node
mm: Create/affine kcompactd to its preferred node
kthread: Default affine kthread to its preferred NUMA node
kthread: Make sure kthread hasn't started while binding it
sched,arm64: Handle CPU isolation on last resort fallback rq selection
arm64: Exclude nohz_full CPUs from 32bits el0 support
lib: test_objpool: Use kthread_run_on_cpu()
kallsyms: Use kthread_run_on_cpu()
soc/qman: test: Use kthread_run_on_cpu()
arm/bL_switcher: Use kthread_run_on_cpu()
|
|
|
|
62de6e1685 |
Scheduler enhancements for v6.14:
- Fair scheduler (SCHED_FAIR) enhancements:
- Behavioral improvements:
- Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)
- Delayed-dequeue enhancements & fixes: (Vincent Guittot)
- Rename h_nr_running into h_nr_queued
- Add new cfs_rq.h_nr_runnable
- Use the new cfs_rq.h_nr_runnable
- Removed unsued cfs_rq.h_nr_delayed
- Rename cfs_rq.idle_h_nr_running into h_nr_idle
- Remove unused cfs_rq.idle_nr_running
- Rename cfs_rq.nr_running into nr_queued
- Do not try to migrate delayed dequeue task
- Fix variable declaration position
- Encapsulate set custom slice in a __setparam_fair() function
- Fixes:
- Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
- Fix CPU bandwidth limit bypass during CPU hotplug (Vishal Chourasia)
- Cleanups:
- Clean up in migrate_degrades_locality() to improve
readability (Peter Zijlstra)
- Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
- Update comments after sched_tick() rename (Sebastian Andrzej Siewior)
- Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
(Valentin Schneider)
- Deadline scheduler (SCHED_DL) enhancements:
- Restore dl_server bandwidth on non-destructive root domain
changes (Juri Lelli)
- Correctly account for allocated bandwidth during
hotplug (Juri Lelli)
- Check bandwidth overflow earlier for hotplug (Juri Lelli)
- Clean up goto label in pick_earliest_pushable_dl_task()
(John Stultz)
- Consolidate timer cancellation (Wander Lairson Costa)
- Load-balancer enhancements:
- Improve performance by prioritizing migrating eligible
tasks in sched_balance_rq() (Hao Jia)
- Do not compute NUMA Balancing stats unnecessarily during
load-balancing (K Prateek Nayak)
- Do not compute overloaded status unnecessarily during
load-balancing (K Prateek Nayak)
- Generic scheduling code enhancements:
- Use READ_ONCE() in task_on_rq_queued(), to consistently use
the WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)
- Isolated CPUs support enhancements: (Waiman Long)
- Make "isolcpus=nohz" equivalent to "nohz_full"
- Consolidate housekeeping cpumasks that are always identical
- Remove HK_TYPE_SCHED
- Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
- RSEQ enhancements:
- Validate read-only fields under DEBUG_RSEQ config
(Mathieu Desnoyers)
- PSI enhancements:
- Fix race when task wakes up before psi_sched_switch()
adjusts flags (Chengming Zhou)
- IRQ time accounting performance enhancements: (Yafang Shao)
- Define sched_clock_irqtime as static key
- Don't account irq time if sched_clock_irqtime is disabled
- Virtual machine scheduling enhancements:
- Don't try to catch up excess steal time (Suleiman Souhlal)
- Heterogenous x86 CPU scheduling enhancements: (K Prateek Nayak)
- Convert "sysctl_sched_itmt_enabled" to boolean
- Use guard() for itmt_update_mutex
- Move the "sched_itmt_enabled" sysctl to debugfs
- Remove x86_smt_flags and use cpu_smt_flags directly
- Use x86_sched_itmt_flags for PKG domain unconditionally
- Debugging code & instrumentation enhancements:
- Change need_resched warnings to pr_err() (David Rientjes)
- Print domain name in /proc/schedstat (K Prateek Nayak)
- Fix value reported by hot tasks pulled in /proc/schedstat (Peter Zijlstra)
- Report the different kinds of imbalances in /proc/schedstat (Swapnil Sapkal)
- Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
- Update Schedstat version to 17 (Swapnil Sapkal)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmePSRcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hrdBAAjYiLl5Q8SHM0xnl+kbvuUkCTgEB/gSgA
mfrZtHRUgRZuA89NZ9NljlCkQSlsLTOjnpNuaeFzs529GMg9iemc99dbnz3BP5F3
V5qpYvWe7yIkJ3hd0TOGLmYEPMNQaAW57YBOrxcPjWNLJ4cr9iMdccVA1OQtcmqD
ZUh3nibv81QI8HDmT2G+figxEIqH3yBV1+SmEIxbrdkQpIJ5702Ng6+0KQK5TShN
xwjFELWZUl2TfkoCc4nkIpkImV6cI1DvXSw1xK6gbb1xEVOrsmFW3TYFw4trKHBu
2RBG4wtmzNjh+12GmSdIBJHogPNcay+JIJW9EG/unT7jirqzkkeP1X2eJEbh+X1L
CMa7GsD9Vy72jCzeJDMuiy7bKfG/MiKUtDXrAZQDo2atbw7H88QOzMuTE5a5WSV+
tRxXGI/dgFVOk+JQUfctfJbYeXjmG8GAflawvXtGDAfDZsja6M+65fH8p0AOgW1E
HHmXUzAe2E2xQBiSok/DYHPQeCDBAjoJvU93YhGiXv8UScb2UaD4BAfzfmc8P+Zs
Eox6444ah5U0jiXmZ3HU707n1zO+Ql4qKoyyMJzSyP+oYHE/Do7NYTElw2QovVdN
FX/9Uae8T4ttA/5lFe7FNoXgKvSxXDKYyKLZcysjVrWJF866Ui/TWtmxA6w8Osn7
sfucuLawLPM=
=5ZNW
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduler (SCHED_FAIR) enhancements:
- Behavioral improvements:
- Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)
- Delayed-dequeue enhancements & fixes: (Vincent Guittot)
- Rename h_nr_running into h_nr_queued
- Add new cfs_rq.h_nr_runnable
- Use the new cfs_rq.h_nr_runnable
- Removed unsued cfs_rq.h_nr_delayed
- Rename cfs_rq.idle_h_nr_running into h_nr_idle
- Remove unused cfs_rq.idle_nr_running
- Rename cfs_rq.nr_running into nr_queued
- Do not try to migrate delayed dequeue task
- Fix variable declaration position
- Encapsulate set custom slice in a __setparam_fair() function
- Fixes:
- Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
- Fix CPU bandwidth limit bypass during CPU hotplug (Vishal
Chourasia)
- Cleanups:
- Clean up in migrate_degrades_locality() to improve readability
(Peter Zijlstra)
- Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
- Update comments after sched_tick() rename (Sebastian Andrzej
Siewior)
- Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
(Valentin Schneider)
Deadline scheduler (SCHED_DL) enhancements:
- Restore dl_server bandwidth on non-destructive root domain changes
(Juri Lelli)
- Correctly account for allocated bandwidth during hotplug (Juri
Lelli)
- Check bandwidth overflow earlier for hotplug (Juri Lelli)
- Clean up goto label in pick_earliest_pushable_dl_task() (John
Stultz)
- Consolidate timer cancellation (Wander Lairson Costa)
Load-balancer enhancements:
- Improve performance by prioritizing migrating eligible tasks in
sched_balance_rq() (Hao Jia)
- Do not compute NUMA Balancing stats unnecessarily during
load-balancing (K Prateek Nayak)
- Do not compute overloaded status unnecessarily during
load-balancing (K Prateek Nayak)
Generic scheduling code enhancements:
- Use READ_ONCE() in task_on_rq_queued(), to consistently use the
WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)
Isolated CPUs support enhancements: (Waiman Long)
- Make "isolcpus=nohz" equivalent to "nohz_full"
- Consolidate housekeeping cpumasks that are always identical
- Remove HK_TYPE_SCHED
- Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
RSEQ enhancements:
- Validate read-only fields under DEBUG_RSEQ config (Mathieu
Desnoyers)
PSI enhancements:
- Fix race when task wakes up before psi_sched_switch() adjusts flags
(Chengming Zhou)
IRQ time accounting performance enhancements: (Yafang Shao)
- Define sched_clock_irqtime as static key
- Don't account irq time if sched_clock_irqtime is disabled
Virtual machine scheduling enhancements:
- Don't try to catch up excess steal time (Suleiman Souhlal)
Heterogenous x86 CPU scheduling enhancements: (K Prateek Nayak)
- Convert "sysctl_sched_itmt_enabled" to boolean
- Use guard() for itmt_update_mutex
- Move the "sched_itmt_enabled" sysctl to debugfs
- Remove x86_smt_flags and use cpu_smt_flags directly
- Use x86_sched_itmt_flags for PKG domain unconditionally
Debugging code & instrumentation enhancements:
- Change need_resched warnings to pr_err() (David Rientjes)
- Print domain name in /proc/schedstat (K Prateek Nayak)
- Fix value reported by hot tasks pulled in /proc/schedstat (Peter
Zijlstra)
- Report the different kinds of imbalances in /proc/schedstat
(Swapnil Sapkal)
- Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
- Update Schedstat version to 17 (Swapnil Sapkal)"
* tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
rseq: Fix rseq unregistration regression
psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
sched, psi: Don't account irq time if sched_clock_irqtime is disabled
sched: Don't account irq time if sched_clock_irqtime is disabled
sched: Define sched_clock_irqtime as static key
sched/fair: Do not compute overloaded status unnecessarily during lb
sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
x86/itmt: Use guard() for itmt_update_mutex
x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
sched/debug: Change need_resched warnings to pr_err
sched/fair: Encapsulate set custom slice in a __setparam_fair() function
sched: Fix race between yield_to() and try_to_wake_up()
docs: Update Schedstat version to 17
sched/stats: Print domain name in /proc/schedstat
sched: Move sched domain name out of CONFIG_SCHED_DEBUG
sched: Report the different kinds of imbalances in /proc/schedstat
...
|
|
|
|
3429dd57f0 |
sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
entity is delayed irrespective of whether the entity corresponds to a
task or a cfs_rq.
Consider the following scenario:
root
/ \
A B (*) delayed since B is no longer eligible on root
| |
Task0 Task1 <--- dequeue_task_fair() - task blocks
When Task1 blocks (dequeue_entity() for task's se returns true),
dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
hierarchy of Task1. However, when the sched_entity corresponding to
cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the
hierarchy too leading to both dequeue_entity() and set_delayed()
decrementing h_nr_runnable for the dequeue of the same task.
A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
dequeue_entities() like below:
cfs_rq->h_nr_runnable -= h_nr_runnable;
SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
is consistently tripped when running wakeup intensive workloads like
hackbench in a cgroup.
This error is self correcting since cfs_rq are per-cpu and cannot
migrate. The entitiy is either picked for full dequeue or is requeued
when a task wakes up below it. Both those paths call clear_delayed()
which again increments h_nr_runnable of the hierarchy without
considering if the entity corresponds to a task or not.
h_nr_runnable will eventually reflect the correct value however in the
interim, the incorrect values can still influence PELT calculation which
uses se->runnable_weight or cfs_rq->h_nr_runnable.
Since only delayed tasks take the early return path in
dequeue_entities() and enqueue_task_fair(), adjust the
h_nr_runnable in {set,clear}_delayed() only when a task is delayed as
this path skips the h_nr_* update loops and returns early.
For entities corresponding to cfs_rq, the h_nr_* update loop in the
caller will do the right thing.
Fixes:
|
|
|
|
a5c16f29a8 |
Merge branch 'pm-cpufreq'
Merge cpufreq updates for 6.14: - Use str_enable_disable()-like helpers in cpufreq (Krzysztof Kozlowski). - Extend the Apple cpufreq driver to support more SoCs (Hector Martin, Nick Chan). - Add new cpufreq driver for Airoha SoCs (Christian Marangi). - Fix using cpufreq-dt as module (Andreas Kemnade). - Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter Edwards, Sibi Sankar, Manivannan Sadhasivam). - Fix the maximum supported frequency computation in the ACPI cpufreq driver to avoid relying on unfounded assumptions (Gautham Shenoy). - Fix an amd-pstate driver regression with preferred core rankings not being used (Mario Limonciello). - Fix a precision issue with frequency calculation in the amd-pstate driver (Naresh Solanki). - Add ftrace event to the amd-pstate driver for active mode (Mario Limonciello). - Set default EPP policy on Ryzen processors in amd-pstate (Mario Limonciello). - Clean up the amd-pstate cpufreq driver and optimize it to increase code reuse (Mario Limonciello, Dhananjay Ugwekar). - Use CPPC to get scaling factors between HWP performance levels and frequency in the intel_pstate driver and make it stop using a built -in scaling factor for the Arrow Lake processor (Rafael Wysocki). - Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN for consistency with CPU offline (Christian Loehle). - Fix superfluous updates caused by need_freq_update in the schedutil cpufreq governor (Sultan Alsawaf). * pm-cpufreq: (40 commits) cpufreq: Use str_enable_disable()-like helpers cpufreq: airoha: Add EN7581 CPUFreq SMCCC driver cpufreq: ACPI: Fix max-frequency computation cpufreq/amd-pstate: Refactor max frequency calculation cpufreq/amd-pstate: Fix prefcore rankings cpufreq: sparc: change kzalloc to kcalloc cpufreq: qcom: Implement clk_ops::determine_rate() for qcom_cpufreq* clocks cpufreq: qcom: Fix qcom_cpufreq_hw_recalc_rate() to query LUT if LMh IRQ is not available cpufreq: apple-soc: Add Apple A7-A8X SoC cpufreq support cpufreq: apple-soc: Set fallback transition latency to APPLE_DVFS_TRANSITION_TIMEOUT cpufreq: apple-soc: Increase cluster switch timeout to 400us cpufreq: apple-soc: Use 32-bit read for status register cpufreq: apple-soc: Allow per-SoC configuration of APPLE_DVFS_CMD_PS1 cpufreq: apple-soc: Drop setting the PS2 field on M2+ dt-bindings: cpufreq: apple,cluster-cpufreq: Add A7-A11, T2 compatibles dt-bindings: cpufreq: Document support for Airoha EN7581 CPUFreq cpufreq: fix using cpufreq-dt as module cpufreq: scmi: Register for limit change notifications cpufreq: schedutil: Fix superfluous updates caused by need_freq_update cpufreq: intel_pstate: Use CPUFREQ_POLICY_UNKNOWN ... |
|
|
|
1225bb42b8 |
Merge branches 'pm-sleep', 'pm-cpuidle' and 'pm-em'
Merge updates related to system sleep, a cpuidle update and an Energy Model handling code update for 6.14-rc1: - Allow configuring the system suspend-resume (DPM) watchdog to warn earlier than panic (Douglas Anderson). - Implement devm_device_init_wakeup() helper and introduce a device- managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan). - Remove direct inclusions of 'pm_wakeup.h' which should be only included via 'device.h' (Wolfram Sang). - Clean up two comments in the core system-wide PM code (Rafael Wysocki, Randy Dunlap). - Add Clearwater Forest processor support to the intel_idle cpuidle driver (Artem Bityutskiy). - Move sched domains rebuild function from the schedutil cpufreq governor to the Energy Model handling code (Rafael Wysocki). * pm-sleep: PM: sleep: wakeirq: Introduce device-managed variant of dev_pm_set_wake_irq() PM: sleep: Allow configuring the DPM watchdog to warn earlier than panic PM: sleep: convert comment from kernel-doc to plain comment PM: wakeup: implement devm_device_init_wakeup() helper PM: sleep: sysfs: don't include 'pm_wakeup.h' directly PM: sleep: autosleep: don't include 'pm_wakeup.h' directly PM: sleep: Update stale comment in device_resume() * pm-cpuidle: intel_idle: add Clearwater Forest SoC support * pm-em: PM: EM: Move sched domains rebuild function from schedutil to EM |
|
|
|
8ff6d472ab |
- Do not adjust the weight of empty group entities and avoid scheduling
artifacts - Avoid scheduling lag by computing lag properly and thus address an EEVDF entity placement issue -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmeM2+8ACgkQEsHwGGHe VUrqHA/+PyPsILNfDFnaXD/Shv7kCgk3hH2YVrCnyIQn8fhTCL+K0luXRRmAPhl1 pEbRyEtVRXZM9WyHJ1r7VmnIHVanUKWcuZzAx8HT9b7Tkrqy/xSriH/jAV7FYG9h gIhWA4vXT34HG2MQHTecJ/4gvkeI5tV93nkZ6vwEwwN4pGxSvijPR3yr3yl286Nt BlHP9cXBWWsxTq4CYb2j5q9nmymKuiDmhD8jc1fVjDKxjmrNHEwX5mtvtRLGI1dR O4pjUEqYQ+TWIkX8ZIRBPzLd9b6750ncO/9yb24cY7Z3RMKov4wz8b819Dt77cTp XF+bT+8fsaKNRjkzPEseZU4OL16ZO+33Kcn+JoNPWvbONgFKusZ4qkCCwvWpiQlV 0Eddh1XjKBoXA1tR6VrREcUKQ2zWoXrn8tlHUTMfPuPq1jlbQNtzN7OWcR+WBy7L FiClWzWLv0fCeTXaAEcO9+/MN5z2lDQsJRiLtAlGh0JJU6/1H6IVX2tSZllkpR/R 5pyHCpYsh5cucx6cQPk+rp9O6PDYu2frMuJ8itWlQLHeSzjzOoeE6EwwZcMQG4xb UG/azMbmqWrtmZqCgNCBtq7RAkVRe0IuxWuCtV1VkatU+2RXdtV85tVKn/JG2KLW 05c8GTdQZgnoY65/TZHwudhkzytclynKrrvhfKFllAWnb8D8gyQ= =fZWm -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Do not adjust the weight of empty group entities and avoid scheduling artifacts - Avoid scheduling lag by computing lag properly and thus address an EEVDF entity placement issue * tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE sched/fair: Fix EEVDF entity placement bug causing scheduling lag |
|
|
|
423124ab97 | Merge back earlier cpufreq material for 6.14 | |
|
|
21641bd9a7 |
lazy tlb: fix hotplug exit race with MMU_LAZY_TLB_SHOOTDOWN
CPU unplug first calls __cpu_disable(), and that's where powerpc calls cleanup_cpu_mmu_context(), which clears this CPU from mm_cpumask() of all mms in the system. However this CPU may still be using a lazy tlb mm, and its mm_cpumask bit will be cleared from it. The CPU does not switch away from the lazy tlb mm until arch_cpu_idle_dead() calls idle_task_exit(). If that user mm exits in this window, it will not be subject to the lazy tlb mm shootdown and may be freed while in use as a lazy mm by the CPU that is being unplugged. cleanup_cpu_mmu_context() could be moved later, but it looks better to move the lazy tlb mm switching earlier. The problem with doing the lazy mm switching in idle_task_exit() is explained in commit |
|
|
|
d40797d672 |
kasan: make kasan_record_aux_stack_noalloc() the default behaviour
kasan_record_aux_stack_noalloc() was introduced to record a stack trace
without allocating memory in the process. It has been added to callers
which were invoked while a raw_spinlock_t was held. More and more callers
were identified and changed over time. Is it a good thing to have this
while functions try their best to do a locklessly setup? The only
downside of having kasan_record_aux_stack() not allocate any memory is
that we end up without a stacktrace if stackdepot runs out of memory and
at the same stacktrace was not recorded before To quote Marco Elver from
https://lore.kernel.org/all/CANpmjNPmQYJ7pv1N3cuU8cP18u7PP_uoZD8YxwZd4jtbof9nVQ@mail.gmail.com/
| I'd be in favor, it simplifies things. And stack depot should be
| able to replenish its pool sufficiently in the "non-aux" cases
| i.e. regular allocations. Worst case we fail to record some
| aux stacks, but I think that's only really bad if there's a bug
| around one of these allocations. In general the probabilities
| of this being a regression are extremely small [...]
Make the kasan_record_aux_stack_noalloc() behaviour default as
kasan_record_aux_stack().
[bigeasy@linutronix.de: dressed the diff as patch]
Link: https://lkml.kernel.org/r/20241122155451.Mb2pmeyJ@linutronix.de
Fixes:
|
|
|
|
987ce79b52 |
sched_ext: fix kernel-doc warnings
Use the correct function parameter names and function names. Use the correct kernel-doc comment format for struct sched_ext_ops to eliminate a bunch of warnings. ext.c:1418: warning: Excess function parameter 'include_dead' description in 'scx_task_iter_next_locked' ext.c:7261: warning: expecting prototype for scx_bpf_dump(). Prototype was for scx_bpf_dump_bstr() instead ext.c:7352: warning: Excess function parameter 'flags' description in 'scx_bpf_cpuperf_set' ext.c:3150: warning: Function parameter or struct member 'in_fi' not described in 'scx_prio_less' ext.c:4711: warning: Function parameter or struct member 'dur_s' not described in 'scx_softlockup' ext.c:4775: warning: Function parameter or struct member 'bypass' not described in 'scx_ops_bypass' ext.c:7453: warning: Function parameter or struct member 'idle_mask' not described in 'scx_bpf_put_idle_cpumask' ext.c:209: warning: Incorrect use of kernel-doc format: * select_cpu - Pick the target CPU for a task which is being woken up ext.c:236: warning: Incorrect use of kernel-doc format: * enqueue - Enqueue a task on the BPF scheduler ext.c:251: warning: Incorrect use of kernel-doc format: * dequeue - Remove a task from the BPF scheduler ext.c:267: warning: Incorrect use of kernel-doc format: * dispatch - Dispatch tasks from the BPF scheduler and/or user DSQs ext.c:290: warning: Incorrect use of kernel-doc format: * tick - Periodic tick ext.c:300: warning: Incorrect use of kernel-doc format: * runnable - A task is becoming runnable on its associated CPU ext.c:327: warning: Incorrect use of kernel-doc format: * running - A task is starting to run on its associated CPU ext.c:335: warning: Incorrect use of kernel-doc format: * stopping - A task is stopping execution ext.c:346: warning: Incorrect use of kernel-doc format: * quiescent - A task is becoming not runnable on its associated CPU ext.c:366: warning: Incorrect use of kernel-doc format: * yield - Yield CPU ext.c:381: warning: Incorrect use of kernel-doc format: * core_sched_before - Task ordering for core-sched ext.c:399: warning: Incorrect use of kernel-doc format: * set_weight - Set task weight ext.c:408: warning: Incorrect use of kernel-doc format: * set_cpumask - Set CPU affinity ext.c:418: warning: Incorrect use of kernel-doc format: * update_idle - Update the idle state of a CPU ext.c:439: warning: Incorrect use of kernel-doc format: * cpu_acquire - A CPU is becoming available to the BPF scheduler ext.c:449: warning: Incorrect use of kernel-doc format: * cpu_release - A CPU is taken away from the BPF scheduler ext.c:461: warning: Incorrect use of kernel-doc format: * init_task - Initialize a task to run in a BPF scheduler ext.c:476: warning: Incorrect use of kernel-doc format: * exit_task - Exit a previously-running task from the system ext.c:485: warning: Incorrect use of kernel-doc format: * enable - Enable BPF scheduling for a task ext.c:494: warning: Incorrect use of kernel-doc format: * disable - Disable BPF scheduling for a task ext.c:504: warning: Incorrect use of kernel-doc format: * dump - Dump BPF scheduler state on error ext.c:512: warning: Incorrect use of kernel-doc format: * dump_cpu - Dump BPF scheduler state for a CPU on error ext.c:524: warning: Incorrect use of kernel-doc format: * dump_task - Dump BPF scheduler state for a runnable task on error ext.c:535: warning: Incorrect use of kernel-doc format: * cgroup_init - Initialize a cgroup ext.c:550: warning: Incorrect use of kernel-doc format: * cgroup_exit - Exit a cgroup ext.c:559: warning: Incorrect use of kernel-doc format: * cgroup_prep_move - Prepare a task to be moved to a different cgroup ext.c:574: warning: Incorrect use of kernel-doc format: * cgroup_move - Commit cgroup move ext.c:585: warning: Incorrect use of kernel-doc format: * cgroup_cancel_move - Cancel cgroup move ext.c:597: warning: Incorrect use of kernel-doc format: * cgroup_set_weight - A cgroup's weight is being changed ext.c:611: warning: Incorrect use of kernel-doc format: * cpu_online - A CPU became online ext.c:620: warning: Incorrect use of kernel-doc format: * cpu_offline - A CPU is going offline ext.c:633: warning: Incorrect use of kernel-doc format: * init - Initialize the BPF scheduler ext.c:638: warning: Incorrect use of kernel-doc format: * exit - Clean up after the BPF scheduler ext.c:648: warning: Incorrect use of kernel-doc format: * dispatch_max_batch - Max nr of tasks that dispatch() can dispatch ext.c:653: warning: Incorrect use of kernel-doc format: * flags - %SCX_OPS_* flags ext.c:658: warning: Incorrect use of kernel-doc format: * timeout_ms - The maximum amount of time, in milliseconds, that a ext.c:667: warning: Incorrect use of kernel-doc format: * exit_dump_len - scx_exit_info.dump buffer length. If 0, the default ext.c:673: warning: Incorrect use of kernel-doc format: * hotplug_seq - A sequence number that may be set by the scheduler to ext.c:682: warning: Incorrect use of kernel-doc format: * name - BPF scheduler's name ext.c:689: warning: Function parameter or struct member 'select_cpu' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'enqueue' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dequeue' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dispatch' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'tick' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'runnable' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'running' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'stopping' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'quiescent' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'yield' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'core_sched_before' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'set_weight' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'set_cpumask' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'update_idle' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_acquire' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_release' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'init_task' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'exit_task' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'enable' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'disable' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dump' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dump_cpu' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dump_task' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_init' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_exit' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_prep_move' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_move' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_cancel_move' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_set_weight' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_online' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_offline' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'init' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'exit' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dispatch_max_batch' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'flags' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'timeout_ms' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'exit_dump_len' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'hotplug_seq' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'name' not described in 'sched_ext_ops' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Changwoo Min <changwoo@igalia.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
7d9da04057 |
psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
When running hackbench in a cgroup with bandwidth throttling enabled,
following PSI splat was observed:
psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4
When investigating the series of events leading up to the splat,
following sequence was observed:
[008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
...
[008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
[008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
# CPU8 goes into newidle balance and releases the rq lock
...
# CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
[015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
[015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
[008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...
psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
for the blocked entity, however, with the introduction of DELAY_DEQUEUE,
the block task can wakeup when newidle balance drops the runqueue lock
during __schedule().
If a task wakes before psi_sched_switch() adjusts the PSI flags, skip
any modifications in psi_enqueue() which would still see the flags of a
running task and not a blocked one. Instead, rely on psi_sched_switch()
to do the right thing.
Since the status returned by try_to_block_task() may no longer be true
by the time schedule reaches psi_sched_switch(), check if the task is
blocked or not using a combination of task_on_rq_queued() and
p->se.sched_delayed checks.
[ prateek: Commit message, testing, early bailout in psi_enqueue() ]
Fixes:
|
|
|
|
a6fd16148f |
sched, psi: Don't account irq time if sched_clock_irqtime is disabled
sched_clock_irqtime may be disabled due to the clock source. When disabled, irq_time_read() won't change over time, so there is nothing to account. We can save iterating the whole hierarchy on every tick and context switch. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20250103022409.2544-4-laoar.shao@gmail.com |
|
|
|
763a744e24 |
sched: Don't account irq time if sched_clock_irqtime is disabled
sched_clock_irqtime may be disabled due to the clock source, in which case IRQ time should not be accounted. Let's add a conditional check to avoid unnecessary logic. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250103022409.2544-3-laoar.shao@gmail.com |
|
|
|
8722903cbb |
sched: Define sched_clock_irqtime as static key
Since CPU time accounting is a performance-critical path, let's define sched_clock_irqtime as a static key to minimize potential overhead. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250103022409.2544-2-laoar.shao@gmail.com |
|
|
|
3229adbe78 |
sched/fair: Do not compute overloaded status unnecessarily during lb
Only set sg_overloaded when computing sg_lb_stats() at the highest sched domain since rd->overloaded status is updated only when load balancing at the highest domain. While at it, move setting of sg_overloaded below idle_cpu() check since an idle CPU can never be overloaded. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20241223043407.1611-8-kprateek.nayak@amd.com |
|
|
|
0ac1ee9ebf |
sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
Aggregate nr_numa_running and nr_preferred_running when load balancing at NUMA domains only. While at it, also move the aggregation below the idle_cpu() check since an idle CPU cannot have any preferred tasks. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20241223043407.1611-7-kprateek.nayak@amd.com |
|
|
|
873199d27b |
sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
When the PLACE_LAG scheduling feature is enabled and
dst_cfs_rq->nr_queued is greater than 1, if a task is
ineligible (lag < 0) on the source cpu runqueue, it will
also be ineligible when it is migrated to the destination
cpu runqueue. Because we will keep the original equivalent
lag of the task in place_entity(). So if the task was
ineligible before, it will still be ineligible after
migration.
So in sched_balance_rq(), we prioritize migrating eligible
tasks, and we soft-limit ineligible tasks, allowing them
to migrate only when nr_balance_failed is non-zero to
avoid load-balancing trying very hard to balance the load.
Below are some benchmark test results. From my test results,
this patch shows a slight improvement on hackbench.
Benchmark
=========
All of the benchmarks are done inside a normal cpu cgroup in a
clean environment with cpu turbo disabled, and test machine is:
Single NUMA machine model is 13th Gen Intel(R) Core(TM)
i7-13700, 12 Core/24 HT.
Based on master
|
|
|
|
8061b9f5e1 |
sched/debug: Change need_resched warnings to pr_err
need_resched warnings, if enabled, are treated as WARNINGs. If kernel.panic_on_warn is enabled, then this causes a kernel panic. It's highly unlikely that a panic is desired for these warnings, only a stack trace is normally required to debug and resolve. Thus, switch need_resched warnings to simply be a printk with an associated stack trace so they are no longer in scope for panic_on_warn. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Acked-by: Josh Don <joshdon@google.com> Link: https://lkml.kernel.org/r/e8d52023-5291-26bd-5299-8bb9eb604929@google.com |