Commit Graph

1823 Commits

Author SHA1 Message Date
Peter Zijlstra 33cf66d883 sched/fair: Proportional newidle balance
Add a randomized algorithm that runs newidle balancing proportional to
its success rate.

This improves schbench significantly:

 6.18-rc4:			2.22 Mrps/s
 6.18-rc4+revert:		2.04 Mrps/s
 6.18-rc4+revert+random:	2.18 Mrps/S

Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%:

 6.17:			-6%
 6.17+revert:		 0%
 6.17+revert+random:	-1%

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com
Link: https://patch.msgid.link/20251107161739.770122091@infradead.org
2025-11-17 17:13:16 +01:00
Phil Auld aaab6bb54a sched: Increase sched_tick_remote timeout
Increase the sched_tick_remote WARN_ON timeout to remove false
positives due to temporarily busy HK cpus. The suggestion
was 30 seconds to catch really stuck remote tick processing
but not trigger it too easily.

Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/20250911161300.437944-1-pauld@redhat.com
2025-11-17 17:13:15 +01:00
Hao Jia e40cea333e sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()
Since commit d4c64207b8 ("sched: Cleanup the sched_change NOCLOCK usage"),
update_rq_clock() is called in do_set_cpus_allowed() -> sched_change_begin()
to update the rq clock. This results in a duplicate call update_rq_clock()
in __set_cpus_allowed_ptr_locked().

While holding the rq lock and before calling do_set_cpus_allowed(),
there is nothing that depends on an updated rq_clock.

Therefore, remove the redundant update_rq_clock() in
__set_cpus_allowed_ptr_locked() to avoid the warning about double
rq clock updates.

Fixes: d4c64207b8 ("sched: Cleanup the sched_change NOCLOCK usage")
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20251029093655.31252-1-jiahao.kernel@gmail.com
2025-11-11 12:33:38 +01:00
Peter Zijlstra 977b9a0054 Merge branch 'linus/master' into sched/core, to resolve conflict
Conflicts:
	kernel/sched/ext.c

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-10-29 08:42:28 +01:00
Peter Zijlstra af13e5e437 sched: Fix the do_set_cpus_allowed() locking fix
Commit abfc01077d ("sched: Fix do_set_cpus_allowed() locking")
overlooked that __balance_push_cpu_stop() calls select_fallback_rq()
with rq->lock held. This makes that set_cpus_allowed_force() will
recursively take rq->lock and the machine locks up.

Run select_fallback_rq() earlier, without holding rq->lock. This opens
up a race window where a task could get migrated out from under us, but
that is harmless, we want the task migrated.

select_fallback_rq() itself will not be subject to concurrency as it
will be fully serialized by p->pi_lock, so there is no chance of
set_cpus_allowed_force() getting called with different arguments and
selecting different fallback CPUs for one task.

Fixes: abfc01077d ("sched: Fix do_set_cpus_allowed() locking")
Reported-by: Jan Polensky <japo@linux.ibm.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Jan Polensky <japo@linux.ibm.com>
Closes: https://lore.kernel.org/oe-lkp/202510271206.24495a68-lkp@intel.com
Link: https://patch.msgid.link/20251027110133.GI3245006@noisy.programming.kicks-ass.net
2025-10-28 15:00:48 +01:00
Peter Zijlstra 4c95380701 sched/ext: Fold balance_scx() into pick_task_scx()
With pick_task() having an rf argument, it is possible to do the
lock-break there, get rid of the weird balance/pick_task hack.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:55 +02:00
Joel Fernandes 50653216e4 sched: Add support to pick functions to take rf
Some pick functions like the internal pick_next_task_fair() already take
rf but some others dont. We need this for scx's server pick function.
Prepare for this by having pick functions accept it.

[peterz: - added RETRY_TASK handling
         - removed pick_next_task_fair indirection]
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:55 +02:00
Peter Zijlstra 1e900f415c sched: Detect per-class runqueue changes
Have enqueue/dequeue set a per-class bit in rq->queue_mask. This then
enables easy tracking of which runqueues are modified over a
lock-break.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:55 +02:00
Peter Zijlstra 73ec89a1ce sched: Mandate shared flags for sched_change
Shrikanth noted that sched_change pattern relies on using shared
flags.

Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:54 +02:00
Peter Zijlstra d4c64207b8 sched: Cleanup the sched_change NOCLOCK usage
Teach the sched_change pattern how to do update_rq_clock(); this
allows for some simplifications / cleanups.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:54 +02:00
Peter Zijlstra 5892cbd85d sched: Match __task_rq_{,un}lock()
In preparation to adding more rules to __task_rq_lock(), such that
__task_rq_unlock() will no longer be equivalent to rq_unlock(),
make sure every __task_rq_lock() is matched by a __task_rq_unlock()
and vice-versa.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:54 +02:00
Peter Zijlstra 46a177fb01 sched: Add locking comments to sched_class methods
'Document' the locking context the various sched_class methods are
called under.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:53 +02:00
Peter Zijlstra 650952d3fb sched: Make __do_set_cpus_allowed() use the sched_change pattern
Now that do_set_cpus_allowed() holds all the regular locks, convert it
to use the sched_change pattern helper.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:53 +02:00
Peter Zijlstra b079d93796 sched: Rename do_set_cpus_allowed()
Hopefully saner naming.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:53 +02:00
Peter Zijlstra abfc01077d sched: Fix do_set_cpus_allowed() locking
All callers of do_set_cpus_allowed() only take p->pi_lock, which is
not sufficient to actually change the cpumask. Again, this is mostly
ok in these cases, but it results in unnecessarily complicated
reasoning.

Furthermore, there is no reason what so ever to not just take all the
required locks, so do just that.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:52 +02:00
Peter Zijlstra 942b8db965 sched: Fix migrate_disable_switch() locking
For some reason migrate_disable_switch() was more complicated than it
needs to be, resulting in mind bending locking of dubious quality.

Recognise that migrate_disable_switch() must be called before a
context switch, but any place before that switch is equally good.
Since the current place results in troubled locking, simply move the
thing before taking rq->lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:52 +02:00
Peter Zijlstra 6455ad5346 sched: Move sched_class::prio_changed() into the change pattern
Move sched_class::prio_changed() into the change pattern.

And while there, extend it with sched_class::get_prio() in order to
fix the deadline sitation.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:52 +02:00
Peter Zijlstra 1ae5f5dfe5 sched: Cleanup sched_delayed handling for class switches
Use the new sched_class::switching_from() method to dequeue delayed
tasks before switching to another class.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:51 +02:00
Peter Zijlstra 637b068282 sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern
Add {DE,EN}QUEUE_CLASS and fold the sched_class::switch* methods into
the change pattern. This completes and makes the pattern more
symmetric.

This changes the order of callbacks slightly:

  OLD                              NEW
				|
				|  switching_from()
  dequeue_task();		|  dequeue_task()
  put_prev_task();		|  put_prev_task()
				|  switched_from()
				|
  ... change task ...		|  ... change task ...
				|
  switching_to();		|  switching_to()
  enqueue_task();		|  enqueue_task()
  set_next_task();		|  set_next_task()
  prev_class->switched_from()	|
  switched_to()			|  switched_to()
				|

Notably, it moves the switched_from() callback right after the
dequeue/put. Existing implementations don't appear to be affected by
this change in location -- specifically the task isn't enqueued on the
class in question in either location.

Make (CLASS)^(SAVE|MOVE), because there is nothing to save-restore
when changing scheduling classes.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:51 +02:00
Peter Zijlstra e9139f765a sched: Employ sched_change guards
As proposed a long while ago -- and half done by scx -- wrap the
scheduler's 'change' pattern in a guard helper.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:50 +02:00
Peter Zijlstra (Intel) ee6e44dfe6 sched/deadline: Stop dl_server before CPU goes offline
IBM CI tool reported kernel warning[1] when running a CPU removal
operation through drmgr[2]. i.e "drmgr -c cpu -r -q 1"

WARNING: CPU: 0 PID: 0 at kernel/sched/cpudeadline.c:219 cpudl_set+0x58/0x170
NIP [c0000000002b6ed8] cpudl_set+0x58/0x170
LR [c0000000002b7cb8] dl_server_timer+0x168/0x2a0
Call Trace:
[c000000002c2f8c0] init_stack+0x78c0/0x8000 (unreliable)
[c0000000002b7cb8] dl_server_timer+0x168/0x2a0
[c00000000034df84] __hrtimer_run_queues+0x1a4/0x390
[c00000000034f624] hrtimer_interrupt+0x124/0x300
[c00000000002a230] timer_interrupt+0x140/0x320

Git bisects to: commit 4ae8d9aa9f ("sched/deadline: Fix dl_server getting stuck")

This happens since:
- dl_server hrtimer gets enqueued close to cpu offline, when
  kthread_park enqueues a fair task.
- CPU goes offline and drmgr removes it from cpu_present_mask.
- hrtimer fires and warning is hit.

Fix it by stopping the dl_server before CPU is marked dead.

[1]: https://lore.kernel.org/all/8218e149-7718-4432-9312-f97297c352b9@linux.ibm.com/
[2]: https://github.com/ibm-power-utilities/powerpc-utils/tree/next/src/drmgr

[sshegde: wrote the changelog and tested it]
Fixes: 4ae8d9aa9f ("sched/deadline: Fix dl_server getting stuck")
Closes: https://lore.kernel.org/all/8218e149-7718-4432-9312-f97297c352b9@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
2025-10-14 13:43:08 +02:00
Linus Torvalds c5448d46b3 Updates for the time(rs) core subsystem:
- Address the inconsistent shutdown sequence of per CPU clockevents on
     CPU hotplug, which onoly removed it from the core but failed to invoke
     the actual device driver shutdown callback. This keeps the timer
     active, which prevents power savings and causes pointless noise in
     virtualization.
 
   - Encapsulate the open coded access to the hrtimer clock base, which is a
     private implementation detail, so that the implementation can be
     changed without breaking a lot of usage sites.
 
   - Enhance the debug output of the clocksource watchdog to provide better
     information for analysis.
 
   - The usual set of cleanups and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmjaT+ETHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoT8HD/47VRqHbFKazGcwZ4+OLKfFTq0PKzIU
 kIhimNMSKVqiPlmWhud/4idGX5JFqMP3dS9ZaDlp7xCtu1ScCqzH72kbCrab0l4g
 wjRgneGWsHhv06PY50Ty9FusFSetkjA8XSYavFV9HZNgCRIvho958akdckpazMcF
 i4V5WTIVkwhGynhcGwqBXcmHASUa7x6gEjZYBbX6wspPl2Wk5z+vn/zAVIHAozwS
 9GPw8HlUeKqM72U12mWGkt4KLjy2gzoupvx9vD4giXRvqkt09rfHtRqvEdkew+/G
 ZANhmTJqfA1AiihYP30fHgY5lQbNQG+9O10UKhlUyrlBKPKHKu6dIuJQRSBy0j59
 Bqef4UubPBlMP4xRgTbt11SFrjuqDo68bTIDGmQNQvb1BXSXhGA08PDPbIKkFdJh
 8cXwUHzjLkDk0tL6kVXRWlsZcCmjz87TVS7PufGE3EIZkh0J2Ob+g7nFS3PsORKW
 65ROSRoYmh4CRh/l9CTNbPWFvz3jahIEnxlrSc70o0A4DUeeWTwT/M+MoNOtSJuV
 D4qSD35JsFaEbKFfSJwF/0CybG3Nx3UPI8nvEEYQtS4D+BCJHy7INg1P/NtWw+vb
 W7puAFd+WbjSN14uQfYkYWC53vQ22isW0nU8W6G5NbMRD5K4WIld4UhDksJIGCLe
 2LZ0VHxv4gaXkg==
 =A9Tu
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer core updates from Thomas Gleixner:

 - Address the inconsistent shutdown sequence of per CPU clockevents on
   CPU hotplug, which only removed it from the core but failed to invoke
   the actual device driver shutdown callback. This kept the timer
   active, which prevented power savings and caused pointless noise in
   virtualization.

 - Encapsulate the open coded access to the hrtimer clock base, which is
   a private implementation detail, so that the implementation can be
   changed without breaking a lot of usage sites.

 - Enhance the debug output of the clocksource watchdog to provide
   better information for analysis.

 - The usual set of cleanups and enhancements all over the place

* tag 'timers-core-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Fix spelling mistakes in comments
  clocksource: Print durations for sync check unconditionally
  LoongArch: Remove clockevents shutdown call on offlining
  tick: Do not set device to detached state in tick_shutdown()
  hrtimer: Reorder branches in hrtimer_clockid_to_base()
  hrtimer: Remove hrtimer_clock_base:: Get_time
  hrtimer: Use hrtimer_cb_get_time() helper
  media: pwm-ir-tx: Avoid direct access to hrtimer clockbase
  ALSA: hrtimer: Avoid direct access to hrtimer clockbase
  lib: test_objpool: Avoid direct access to hrtimer clockbase
  sched/core: Avoid direct access to hrtimer clockbase
  timers/itimer: Avoid direct access to hrtimer clockbase
  posix-timers: Avoid direct access to hrtimer clockbase
  jiffies: Remove obsolete SHIFTED_HZ comment
2025-09-30 16:09:27 -07:00
Linus Torvalds 6c7340a7a8 Scheduler updates for v6.18:
Core scheduler changes:
 
  - Make migrate_{en,dis}able() inline, to improve performance
    (Menglong Dong)
 
  - Move STDL_INIT() functions out-of-line (Peter Zijlstra)
 
  - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra)
 
 Fair scheduling:
 
  - Defer throttling when tasks exit to user-space, to reduce the
    chance & impact of throttle-preemption with held locks and
    other resources. (Aaron Lu, Valentin Schneider)
 
  - Get rid of sched_domains_curr_level hack for tl->cpumask(),
    as the warning was getting triggered on certain topologies.
    (Peter Zijlstra)
 
 Misc cleanups & fixes:
 
  - Header cleanups (Menglong Dong)
 
  - Fix race in push_dl_task() (Harshit Agarwal)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmjWn0IRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i9ng/+KJYEsNilPA6nGzZLurU3HXm6S5NrNBPc
 xJU43eo3QdFUymKqYaCoXCwaMah3m8fFW+DC8ZoEvWC5wHnlIw6+u/ZEtsWQ4R8N
 8+sjQddQeb9Gez+nkC7pXX8h1nqlhHyGWU88+9hOyEEMk21KgH7tJ5gOuGQdPhiA
 v24D8dkS1sT6CAeqZAycYIkos+kfNNV+wAEQCXAxfvSSslpzJVZcj9kdQInxF4O4
 9+WrvFZFVYJWdJgmgfbpBMw7N8a4Gpv+Lr6U0ZVXHNeLjNdn60I+5Me+9BW9GRcO
 pPL50RfGgGgVIqyEHvBhhz8p0tlKUJo9NkRJ9hmNB5SKT3P442SU789ppTYgI1il
 5P4g3TiuomqPVuo99/mYHYOpT6NkYC1fkwMP9Hfk4NRUX0EA6lKXelVnMXaC3Z7R
 U+0xVYtK4BBLRFBo4HIEM1HChSIcOnutuufFX4lPPt+ewnAmJwSt619w+lBF0v+n
 cpiLIlQ74LrYlrULw3bznnwzh5dV8XHm4CT3XAShm6qB97/SaLr0rI5W7njdSvte
 ym9z4yFWidawowvLWjFX1CykSnX/T5MV+uzuIH64MEIA39B4VKJWCAC3l3AMJn/5
 Ckhf93B5uG0q+wUtE66x/JrN7wtbz9PMpwh01fXNWs/eWc1VKkVrvq+PmHODngmz
 drSUoI1P570=
 =8ZTZ
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core scheduler changes:

   - Make migrate_{en,dis}able() inline, to improve performance
     (Menglong Dong)

   - Move STDL_INIT() functions out-of-line (Peter Zijlstra)

   - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra)

  Fair scheduling:

   - Defer throttling to when tasks exit to user-space, to reduce the
     chance & impact of throttle-preemption with held locks and other
     resources (Aaron Lu, Valentin Schneider)

   - Get rid of sched_domains_curr_level hack for tl->cpumask(), as the
     warning was getting triggered on certain topologies (Peter
     Zijlstra)

  Misc cleanups & fixes:

   - Header cleanups (Menglong Dong)

   - Fix race in push_dl_task() (Harshit Agarwal)"

* tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix some typos in include/linux/preempt.h
  sched: Make migrate_{en,dis}able() inline
  rcu: Replace preempt.h with sched.h in include/linux/rcupdate.h
  arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c
  sched/fair: Do not balance task to a throttled cfs_rq
  sched/fair: Do not special case tasks in throttled hierarchy
  sched/fair: update_cfs_group() for throttled cfs_rqs
  sched/fair: Propagate load for throttled cfs_rq
  sched/fair: Get rid of throttled_lb_pair()
  sched/fair: Task based throttle time accounting
  sched/fair: Switch to task based throttle model
  sched/fair: Implement throttle task work and related helpers
  sched/fair: Add related data structure for task based throttle
  sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig
  sched: Move STDL_INIT() functions out-of-line
  sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
  sched/deadline: Fix race in push_dl_task()
2025-09-30 10:35:11 -07:00
Linus Torvalds a23cd25bae sched_ext: Changes for v6.18
- Code organization cleanup. Separate internal types and accessors to
   ext_internal.h to reduce the size of ext.c and improve maintainability.
 
 - Prepare for cgroup sub-scheduler support by adding @sch parameter to
   various functions and helpers, reorganizing scheduler instance handling,
   and dropping obsolete helpers like scx_kf_exit() and kf_cpu_valid().
 
 - Add new scx_bpf_cpu_curr() and scx_bpf_locked_rq() BPF helpers to provide
   safer access patterns with proper RCU protection. scx_bpf_cpu_rq() is
   deprecated with warnings due to potential race conditions.
 
 - Improve debugging with migration-disabled counter in error state dumps,
   SCX_EFLAG_INITIALIZED flag, bitfields for warning flags, and other
   enhancements to help diagnose issues.
 
 - Use cgroup_lock/unlock() for cgroup synchronization instead of
   scx_cgroup_rwsem based synchronization. This is simpler and allows
   enable/disable paths to synchronize against cgroup changes independent of
   the CPU controller.
 
 - rhashtable_lookup() replacement to avoid redundant RCU locking was
   reverted due to RCU usage warnings. Will be redone once rhashtable is
   updated to use rcu_dereference_all().
 
 - Other misc updates and fixes including bypass handling improvements,
   scx_task_iter_relock() improvements, tools/sched_ext updates, and
   compatibility helpers.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaNbpHQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGTzWAP9cOG1O2DE0nCJJlDfGoJd1suz9quK+OrVdXhO+
 6nnXoAEAonDP0jXtwoW9GVQFyj8WZ0mfV4QJk0OHM1CoOV5XkQY=
 =Ev85
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

 - Code organization cleanup. Separate internal types and accessors to
   ext_internal.h to reduce the size of ext.c and improve
   maintainability.

 - Prepare for cgroup sub-scheduler support by adding @sch parameter to
   various functions and helpers, reorganizing scheduler instance
   handling, and dropping obsolete helpers like scx_kf_exit() and
   kf_cpu_valid().

 - Add new scx_bpf_cpu_curr() and scx_bpf_locked_rq() BPF helpers to
   provide safer access patterns with proper RCU protection.
   scx_bpf_cpu_rq() is deprecated with warnings due to potential race
   conditions.

 - Improve debugging with migration-disabled counter in error state
   dumps, SCX_EFLAG_INITIALIZED flag, bitfields for warning flags, and
   other enhancements to help diagnose issues.

 - Use cgroup_lock/unlock() for cgroup synchronization instead of
   scx_cgroup_rwsem based synchronization. This is simpler and allows
   enable/disable paths to synchronize against cgroup changes
   independent of the CPU controller.

 - rhashtable_lookup() replacement to avoid redundant RCU locking was
   reverted due to RCU usage warnings. Will be redone once rhashtable is
   updated to use rcu_dereference_all().

 - Other misc updates and fixes including bypass handling improvements,
   scx_task_iter_relock() improvements, tools/sched_ext updates, and
   compatibility helpers.

* tag 'sched_ext-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (28 commits)
  Revert "sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()"
  sched_ext: Misc updates around scx_sched instance pointer
  sched_ext: Drop scx_kf_exit() and scx_kf_error()
  sched_ext: Add the @sch parameter to scx_dsq_insert_preamble/commit()
  sched_ext: Drop kf_cpu_valid()
  sched_ext: Add the @sch parameter to ext_idle helpers
  sched_ext: Add the @sch parameter to __bstr_format()
  sched_ext: Separate out scx_kick_cpu() and add @sch to it
  tools/sched_ext: scx_qmap: Make debug output quieter by default
  sched_ext: Make qmap dump operation non-destructive
  sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()
  sched_ext: Use bitfields for boolean warning flags
  sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq()
  sched_ext: Improve SCX_KF_DISPATCH comment
  sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
  sched_ext: Verify RCU protection in scx_bpf_cpu_curr()
  sched_ext: Add migration-disabled counter to error state dump
  sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warning
  tools/sched_ext: Add compat helper for scx_bpf_cpu_curr()
  sched_ext: deprecation warn for scx_bpf_cpu_rq()
  ...
2025-09-30 09:05:07 -07:00
Linus Torvalds 722df25ddf kernel-6.18-rc1.clone3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZgMQAKCRCRxhvAZXjc
 ornXAP954dZjz+OJw6lJLCf0j9TXJOczGHvK3oW5ZD9KnqtTdwEA7p1A6WMOKJyl
 8VtTgCS0yNt8QlznUnsSDfVm0jXVGAY=
 =tUXG
 -----END PGP SIGNATURE-----

Merge tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull copy_process updates from Christian Brauner:
 "This contains the changes to enable support for clone3() on nios2
  which apparently is still a thing.

  The more exciting part of this is that it cleans up the inconsistency
  in how the 64-bit flag argument is passed from copy_process() into the
  various other copy_*() helpers"

[ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ]

* tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  nios2: implement architecture-specific portion of sys_clone3
  arch: copy_thread: pass clone_flags as u64
  copy_process: pass clone_flags as u64 across calltree
  copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
2025-09-29 10:36:50 -07:00
Menglong Dong 378b770819 sched: Make migrate_{en,dis}able() inline
For now, migrate_enable and migrate_disable are global, which makes them
become hotspots in some case. Take BPF for example, the function calling
to migrate_enable and migrate_disable in BPF trampoline can introduce
significant overhead, and following is the 'perf top' of FENTRY's
benchmark (./tools/testing/selftests/bpf/bench trig-fentry):

  54.63% bpf_prog_2dcccf652aac1793_bench_trigger_fentry [k]
                 bpf_prog_2dcccf652aac1793_bench_trigger_fentry
  10.43% [kernel] [k] migrate_enable
  10.07% bpf_trampoline_6442517037 [k] bpf_trampoline_6442517037
  8.06% [kernel] [k] __bpf_prog_exit_recur
  4.11% libc.so.6 [.] syscall
  2.15% [kernel] [k] entry_SYSCALL_64
  1.48% [kernel] [k] memchr_inv
  1.32% [kernel] [k] fput
  1.16% [kernel] [k] _copy_to_user
  0.73% [kernel] [k] bpf_prog_test_run_raw_tp

So in this commit, we make migrate_enable/migrate_disable inline to obtain
better performance. The struct rq is defined internally in
kernel/sched/sched.h, and the field "nr_pinned" is accessed in
migrate_enable/migrate_disable, which makes it hard to make them inline.

Alexei Starovoitov suggests to generate the offset of "nr_pinned" in [1],
so we can define the migrate_enable/migrate_disable in
include/linux/sched.h and access "this_rq()->nr_pinned" with
"(void *)this_rq() + RQ_nr_pinned".

The offset of "nr_pinned" is generated in include/generated/rq-offsets.h
by kernel/sched/rq-offsets.c.

Generally speaking, we move the definition of migrate_enable and
migrate_disable to include/linux/sched.h from kernel/sched/core.c. The
calling to __set_cpus_allowed_ptr() is leaved in ___migrate_enable().

The "struct rq" is not available in include/linux/sched.h, so we can't
access the "runqueues" with this_cpu_ptr(), as the compilation will fail
in this_cpu_ptr() -> raw_cpu_ptr() -> __verify_pcpu_ptr():
  typeof((ptr) + 0)

So we introduce the this_rq_raw() and access the runqueues with
arch_raw_cpu_ptr/PERCPU_PTR directly.

The variable "runqueues" is not visible in the kernel modules, and export
it is not a good idea. As Peter Zijlstra advised in [2], we define and
export migrate_enable/migrate_disable in kernel/sched/core.c too, and use
them for the modules.

Before this patch, the performance of BPF FENTRY is:

  fentry         :  113.030 ± 0.149M/s
  fentry         :  112.501 ± 0.187M/s
  fentry         :  112.828 ± 0.267M/s
  fentry         :  115.287 ± 0.241M/s

After this patch, the performance of BPF FENTRY increases to:

  fentry         :  143.644 ± 0.670M/s
  fentry         :  149.764 ± 0.362M/s
  fentry         :  149.642 ± 0.156M/s
  fentry         :  145.263 ± 0.221M/s

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/CAADnVQ+5sEDKHdsJY5ZsfGDO_1SEhhQWHrt2SMBG5SYyQ+jt7w@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/20250819123214.GH4067720@noisy.programming.kicks-ass.net/ [2]
2025-09-25 09:57:16 +02:00
Tejun Heo ebfd5226ec sched_ext: Merge branch 'for-6.17-fixes' into for-6.18
Pull sched_ext/for-6.17-fixes to receive:

 55ed11b181 ("sched_ext: idle: Handle migration-disabled tasks in BPF code")

which conflicts with the following commit in for-6.18:

 2407bae23d ("sched_ext: Add the @sch parameter to ext_idle helpers")

The conflict is a simple context conflict which can be resolved by taking
the updated parts from both commits.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 09:10:20 -10:00
Tejun Heo a1eab4d813 sched_ext, sched/core: Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED
While collecting SCX related fields in struct task_group into struct
scx_task_group, 6e6558a6bc ("sched_ext, sched/core: Factor out struct
scx_task_group") forgot update tg->scx_weight usage in tg_weight(), which
leads to build failure when CONFIG_FAIR_GROUP_SCHED is disabled but
CONFIG_EXT_GROUP_SCHED is enabled. Fix it.

Fixes: 6e6558a6bc ("sched_ext, sched/core: Factor out struct scx_task_group")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202509170230.MwZsJSWa-lkp@intel.com/
Tested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16 23:07:27 -10:00
Thomas Weißschuh b68b7f3e9b sched/core: Avoid direct access to hrtimer clockbase
The field timer->base->get_time is a private implementation detail and
should not be accessed outside of the hrtimer core.

Switch to the equivalent helper.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-3-3ae822e5bfbd@linutronix.de
2025-09-09 12:27:18 +02:00
Tejun Heo a5bd6ba30b sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations
SCX hooks into CPU cgroup controller operations and read-locks
scx_cgroup_rwsem to exclude them while enabling and disable schedulers.
While this works, it's unnecessarily complicated given that
cgroup_[un]lock() are available and thus the cgroup operations can be locked
out that way.

Drop scx_cgroup_rwsem locking from the tg on/offline and cgroup [can_]attach
operations. Instead, grab cgroup_lock() from scx_cgroup_lock(). Drop
scx_cgroup_finish_attach() which is no longer necessary. Drop the now
unnecessary rcu locking and css ref bumping in scx_cgroup_init() and
scx_cgroup_exit().

As scx_cgroup_set_weight/bandwidth() paths aren't protected by
cgroup_lock(), rename scx_cgroup_rwsem to scx_cgroup_ops_rwsem and retain
the locking there.

This is overall simpler and will also allow enable/disable paths to
synchronize against cgroup changes independent of the CPU controller.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:36:07 -10:00
Valentin Schneider 2cd571245b sched/fair: Add related data structure for task based throttle
Add related data structures for this new throttle functionality.

Tesed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Link: https://lore.kernel.org/r/20250829081120.806-2-ziqianlu@bytedance.com
2025-09-03 10:03:13 +02:00
Simon Schuster edd3cb05c0 copy_process: pass clone_flags as u64 across calltree
With the introduction of clone3 in commit 7f192e3cd3 ("fork: add
clone3") the effective bit width of clone_flags on all architectures was
increased from 32-bit to 64-bit, with a new type of u64 for the flags.
However, for most consumers of clone_flags the interface was not
changed from the previous type of unsigned long.

While this works fine as long as none of the new 64-bit flag bits
(CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still
undesirable in terms of the principle of least surprise.

Thus, this commit fixes all relevant interfaces of callees to
sys_clone3/copy_process (excluding the architecture-specific
copy_thread) to consistently pass clone_flags as u64, so that
no truncation to 32-bit integers occurs on 32-bit architectures.

Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com>
Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01 15:31:34 +02:00
Linus Torvalds 6a68cec16b sched_ext: Changes for v6.17
- Add support for cgroup "cpu.max" interface.
 
 - Code organization cleanup so that ext_idle.c doesn't depend on the
   source-file-inclusion build method of sched/.
 
 - Drop UP paths in accordance with sched core changes.
 
 - Documentation and other misc changes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaIqnxg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGUh5AQC6YM7ggRPYRmy28m5B0nubpKtCHqPOAHSd/QbY
 MCiThgD+JuE9ewg3wYO/jvJx3NyIRB1McMnAaG59hf6R0Plh5Qo=
 =TeLF
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

 - Add support for cgroup "cpu.max" interface

 - Code organization cleanup so that ext_idle.c doesn't depend on the
   source-file-inclusion build method of sched/

 - Drop UP paths in accordance with sched core changes

 - Documentation and other misc changes

* tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix scx_bpf_reenqueue_local() reference
  sched_ext: Drop kfuncs marked for removal in 6.15
  sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panic
  kernel/sched/ext.c: fix typo "occured" -> "occurred" in comments
  sched_ext: Add support for cgroup bandwidth control interface
  sched_ext, sched/core: Factor out struct scx_task_group
  sched_ext: Return NULL in llc_span
  sched_ext: Always use SMP versions in kernel/sched/ext_idle.h
  sched_ext: Always use SMP versions in kernel/sched/ext_idle.c
  sched_ext: Always use SMP versions in kernel/sched/ext.h
  sched_ext: Always use SMP versions in kernel/sched/ext.c
  sched_ext: Documentation: Clarify time slice handling in task lifecycle
  sched_ext: Make scx_locked_rq() inline
  sched_ext: Make scx_rq_bypassing() inline
  sched_ext: idle: Make local functions static in ext_idle.c
  sched_ext: idle: Remove unnecessary ifdef in scx_bpf_cpu_node()
2025-07-31 16:29:46 -07:00
Linus Torvalds 4ff261e725 Runtime verification changes for 6.17
- Added Linear temporal logic monitors for RT application
 
   Real-time applications may have design flaws causing them to have
   unexpected latency. For example, the applications may raise page faults, or
   may be blocked trying to take a mutex without priority inheritance.
 
   However, while attempting to implement DA monitors for these real-time
   rules, deterministic automaton is found to be inappropriate as the
   specification language. The automaton is complicated, hard to understand,
   and error-prone.
 
   For these cases, linear temporal logic is found to be more suitable. The
   LTL is more concise and intuitive.
 
 - Make printk_deferred() public
 
   The new monitors needed access to printk_deferred(). Make them visible for
   the entire kernel.
 
 - Add a vpanic() to allow for va_list to be passed to panic.
 
 - Add rtapp container monitor.
 
   A collection of monitors that check for common problems with real-time
   applications that cause unexpected latency.
 
 - Add page fault tracepoints to risc-v
 
   These tracepoints are necessary to for the RV monitor to run on risc-v.
 
 - Fix the behaviour of the rv tool with -s and idle tasks.
 
 - Allow the rv tool to gracefully terminate with SIGTERM
 
 - Adjusts dot2c not to create lines over 100 columns
 
 - Properly order nested monitors in the RV Kconfig file
 
 - Return the registration error in all DA monitor instead of 0
 
 - Update and add new sched collection monitors
 
   Replace tss and sncid monitors with more complete sts:
   Not only prove that switches occur in scheduling context and scheduling
   needs interrupt disabled but also that each call to the scheduler
   disables interrupts to (optionally) switch.
 
   New monitor: nrp
    Preemption requires need resched which is cleared by any switch
    (includes a non optimal workaround for /nested/ preemptions)
 
   New monitor: sssw
    suspension requires setting the task to sleepable and, after the
    switch occurs, the task requires a wakeup to come back to runnable
 
   New monitor: opid
    waking and need-resched operations occur with interrupts and
    preemption disabled or in IRQ without explicitly disabling preemption
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaIk8cBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qi3DAQCFu6DM7uPSh94oggWlH2LukOYVGk2b
 CvGrqMFuefae7QD/aK9nCMfzaBehixMOMQHLHELEh527Hd+RwQCrlnLALQU=
 =r5HZ
 -----END PGP SIGNATURE-----

Merge tag 'trace-rv-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verification updates from Steven Rostedt:

 - Added Linear temporal logic monitors for RT application

   Real-time applications may have design flaws causing them to have
   unexpected latency. For example, the applications may raise page
   faults, or may be blocked trying to take a mutex without priority
   inheritance.

   However, while attempting to implement DA monitors for these
   real-time rules, deterministic automaton is found to be inappropriate
   as the specification language. The automaton is complicated, hard to
   understand, and error-prone.

   For these cases, linear temporal logic is found to be more suitable.
   The LTL is more concise and intuitive.

 - Make printk_deferred() public

   The new monitors needed access to printk_deferred(). Make them
   visible for the entire kernel.

 - Add a vpanic() to allow for va_list to be passed to panic.

 - Add rtapp container monitor.

   A collection of monitors that check for common problems with
   real-time applications that cause unexpected latency.

 - Add page fault tracepoints to risc-v

   These tracepoints are necessary to for the RV monitor to run on
   risc-v.

 - Fix the behaviour of the rv tool with -s and idle tasks.

 - Allow the rv tool to gracefully terminate with SIGTERM

 - Adjusts dot2c not to create lines over 100 columns

 - Properly order nested monitors in the RV Kconfig file

 - Return the registration error in all DA monitor instead of 0

 - Update and add new sched collection monitors

   Replace tss and sncid monitors with more complete sts:

   Not only prove that switches occur in scheduling context and scheduling
   needs interrupt disabled but also that each call to the scheduler
   disables interrupts to (optionally) switch.

   New monitor: nrp
     Preemption requires need resched which is cleared by any switch
     (includes a non optimal workaround for /nested/ preemptions)

   New monitor: sssw
     suspension requires setting the task to sleepable and, after the
     switch occurs, the task requires a wakeup to come back to runnable

   New monitor: opid
      waking and need-resched operations occur with interrupts and
      preemption disabled or in IRQ without explicitly disabling
      preemption"

* tag 'trace-rv-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (48 commits)
  rv: Add opid per-cpu monitor
  rv: Add nrp and sssw per-task monitors
  rv: Replace tss and sncid monitors with more complete sts
  sched: Adapt sched tracepoints for RV task model
  rv: Retry when da monitor detects race conditions
  rv: Adjust monitor dependencies
  rv: Use strings in da monitors tracepoints
  rv: Remove trailing whitespace from tracepoint string
  rv: Add da_handle_start_run_event_ to per-task monitors
  rv: Fix wrong type cast in reactors_show() and monitor_reactor_show()
  rv: Fix wrong type cast in monitors_show()
  rv: Remove struct rv_monitor::reacting
  rv: Remove rv_reactor's reference counter
  rv: Merge struct rv_reactor_def into struct rv_reactor
  rv: Merge struct rv_monitor_def into struct rv_monitor
  rv: Remove unused field in struct rv_monitor_def
  rv: Return init error when registering monitors
  verification/rvgen: Organise Kconfig entries for nested monitors
  tools/dot2c: Fix generated files going over 100 column limit
  tools/rv: Stop gracefully also on SIGTERM
  ...
2025-07-30 16:23:12 -07:00
Linus Torvalds bf76f23aa1 Scheduler updates for v6.17:
Core scheduler changes:
 
  - Better tracking of maximum lag of tasks in presence of different
    slices duration, for better handling of lag in the fair
    scheduler. (Vincent Guittot)
 
  - Clean up and standardize #if/#else/#endif markers throughout
    the entire scheduler code base (Ingo Molnar)
 
  - Make SMP unconditional: build the SMP scheduler's
    data structures and logic on UP kernel too, even though
    they are not used, to simplify the scheduler and remove
    around 200 #ifdef/[#else]/#endif blocks from the
    scheduler. (Ingo Molnar)
 
  - Reorganize cgroup bandwidth control interface handling
    for better interfacing with sched_ext (Tejun Heo)
 
 Balancing:
 
  - Bump sd->max_newidle_lb_cost when newidle balance fails (Chris Mason)
  - Remove sched_domain_topology_level::flags to simplify the code (Prateek Nayak)
  - Simplify and clean up build_sched_topology() (Li Chen)
  - Optimize build_sched_topology() on large machines (Li Chen)
 
 Real-time scheduling:
 
  - Add initial version of proxy execution: a mechanism for mutex-owning
    tasks to inherit the scheduling context of higher priority waiters.
    Currently limited to a single runqueue and conditional on CONFIG_EXPERT,
    and other limitations. (John Stultz, Peter Zijlstra, Valentin Schneider)
 
  - Deadline scheduler (Juri Lelli):
 
    - Fix dl_servers initialization order (Juri Lelli)
    - Fix DL scheduler's root domain reinitialization logic (Juri Lelli)
    - Fix accounting bugs after global limits change (Juri Lelli)
    - Fix scalability regression by implementing less agressive dl_server handling
      (Peter Zijlstra)
 
 PSI:
 
  - Improve scalability by optimizing psi_group_change() cpu_clock() usage
    (Peter Zijlstra)
 
 Rust changes:
 
  - Make Task, CondVar and PollCondVar methods inline to avoid unnecessary
    function calls (Kunwu Chan, Panagiotis Foliadis)
 
  - Add might_sleep() support for Rust code: Rust's "#[track_caller]"
    mechanism is used so that Rust's might_sleep() doesn't need to be
    defined as a macro (Fujita Tomonori)
 
  - Introduce file_from_location() (Boqun Feng)
 
 Debugging & instrumentation:
 
  - Make clangd usable with scheduler source code files again (Peter Zijlstra)
 
  - tools: Add root_domains_dump.py which dumps root domains info (Juri Lelli)
 
  - tools: Add dl_bw_dump.py for printing bandwidth accounting info (Juri Lelli)
 
 Misc cleanups & fixes:
 
  - Remove play_idle() (Feng Lee)
 
  - Fix check_preemption_disabled() (Sebastian Andrzej Siewior)
 
  - Do not call __put_task_struct() on RT if pi_blocked_on is set
    (Luis Claudio R. Goncalves)
 
  - Correct the comment in place_entity() (wang wei)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmiHHNIRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g7DhAAg9aMW33PuC24A4hCS1XQay6j3rgmR5qC
 AOqDofj/CY4Q374HQtOl4m5CYZB/G5csRv6TZliWQKhAy9vr6VWddoyOMJYOAlAx
 XRurl1Z3MriOMD6DPgNvtHd5PrR5Un8ygALgT+32d0PRz27KNXORW5TyvEf2Bv4r
 BX4/GazlOlK0PdGUdZl0q/3dtkU4Wr5IifQzT8KbarOSBbNwZwVcg+83hLW5gJMx
 LgMGLaAATmiN7VuvJWNDATDfEOmOvQOu8veoS8TuP1AOVeJPfPT2JVh9Jen5V1/5
 3w1RUOkUI2mQX+cujWDW3koniSxjsA1OegXfHnFkF5BXp4q5e54k6D5sSh1xPFDX
 iDhkU5jsbKkkJS2ulD6Vi4bIAct3apMl4IrbJn/OYOLcUVI8WuunHs4UPPEuESAS
 TuQExKSdj4Ntrzo3pWEy8kX3/Z9VGa+WDzwsPUuBSvllB5Ir/jjKgvkxPA6zGsiY
 rbkmZT8qyI01IZ/GXqfI2AQYCGvgp+SOvFPi755ZlELTQS6sUkGZH2/2M5XnKA9t
 Z1wB2iwttoS1VQInx0HgiiAGrXrFkr7IzSIN2T+CfWIqilnL7+nTxzwlJtC206P4
 DB97bF6azDtJ6yh1LetRZ1ZMX/Gr56Cy0Z6USNoOu+a12PLqlPk9+fPBBpkuGcdy
 BRk8KgysEuk=
 =8T0v
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2025-07-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core scheduler changes:

   - Better tracking of maximum lag of tasks in presence of different
     slices duration, for better handling of lag in the fair scheduler
     (Vincent Guittot)

   - Clean up and standardize #if/#else/#endif markers throughout the
     entire scheduler code base (Ingo Molnar)

   - Make SMP unconditional: build the SMP scheduler's data structures
     and logic on UP kernel too, even though they are not used, to
     simplify the scheduler and remove around 200 #ifdef/[#else]/#endif
     blocks from the scheduler (Ingo Molnar)

   - Reorganize cgroup bandwidth control interface handling for better
     interfacing with sched_ext (Tejun Heo)

  Balancing:

   - Bump sd->max_newidle_lb_cost when newidle balance fails (Chris
     Mason)

   - Remove sched_domain_topology_level::flags to simplify the code
     (Prateek Nayak)

   - Simplify and clean up build_sched_topology() (Li Chen)

   - Optimize build_sched_topology() on large machines (Li Chen)

  Real-time scheduling:

   - Add initial version of proxy execution: a mechanism for
     mutex-owning tasks to inherit the scheduling context of higher
     priority waiters.

     Currently limited to a single runqueue and conditional on
     CONFIG_EXPERT, and other limitations (John Stultz, Peter Zijlstra,
     Valentin Schneider)

   - Deadline scheduler (Juri Lelli):
      - Fix dl_servers initialization order (Juri Lelli)
      - Fix DL scheduler's root domain reinitialization logic (Juri
        Lelli)
      - Fix accounting bugs after global limits change (Juri Lelli)
      - Fix scalability regression by implementing less agressive
        dl_server handling (Peter Zijlstra)

  PSI:

   - Improve scalability by optimizing psi_group_change() cpu_clock()
     usage (Peter Zijlstra)

  Rust changes:

   - Make Task, CondVar and PollCondVar methods inline to avoid
     unnecessary function calls (Kunwu Chan, Panagiotis Foliadis)

   - Add might_sleep() support for Rust code: Rust's "#[track_caller]"
     mechanism is used so that Rust's might_sleep() doesn't need to be
     defined as a macro (Fujita Tomonori)

   - Introduce file_from_location() (Boqun Feng)

  Debugging & instrumentation:

   - Make clangd usable with scheduler source code files again (Peter
     Zijlstra)

   - tools: Add root_domains_dump.py which dumps root domains info (Juri
     Lelli)

   - tools: Add dl_bw_dump.py for printing bandwidth accounting info
     (Juri Lelli)

  Misc cleanups & fixes:

   - Remove play_idle() (Feng Lee)

   - Fix check_preemption_disabled() (Sebastian Andrzej Siewior)

   - Do not call __put_task_struct() on RT if pi_blocked_on is set (Luis
     Claudio R. Goncalves)

   - Correct the comment in place_entity() (wang wei)"

* tag 'sched-core-2025-07-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (84 commits)
  sched/idle: Remove play_idle()
  sched: Do not call __put_task_struct() on rt if pi_blocked_on is set
  sched: Start blocked_on chain processing in find_proxy_task()
  sched: Fix proxy/current (push,pull)ability
  sched: Add an initial sketch of the find_proxy_task() function
  sched: Fix runtime accounting w/ split exec & sched contexts
  sched: Move update_curr_task logic into update_curr_se
  locking/mutex: Add p->blocked_on wrappers for correctness checks
  locking/mutex: Rework task_struct::blocked_on
  sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
  sched/topology: Remove sched_domain_topology_level::flags
  x86/smpboot: avoid SMT domain attach/destroy if SMT is not enabled
  x86/smpboot: moves x86_topology to static initialize and truncate
  x86/smpboot: remove redundant CONFIG_SCHED_SMT
  smpboot: introduce SDTL_INIT() helper to tidy sched topology setup
  tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info
  tools/sched: Add root_domains_dump.py which dumps root domains info
  sched/deadline: Fix accounting after global limits change
  sched/deadline: Reset extra_bw to max_bw when clearing root domains
  sched/deadline: Initialize dl_servers after SMP
  ...
2025-07-29 17:42:52 -07:00
Linus Torvalds 78bb43e51b Updates for the generic entry code:
- Split the code into syscall and exception/interrupt parts to ease the
     conversion of ARM[64] to the generic entry infrastructure
 
   - Extend syscall user dispatching to support a single intercepted range
     instead of the default single non-intercepted range. That allows
     monitoring/analysis of a specific executable range, e.g. a library, and
     also provides flexibility for sandboxing scenarios.
 
   - Cleanup and extend the user dispatch selftest
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmiIg6UTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWn7EACTvQpu7tGd1rN9hCjiB1W5po7nvlCd
 gKghjS9Kp0KttTDQPLVcmnH06BhDHWNNn1HXZ1ORea4bpLywiKHtVgqUAsJDsBsv
 ETeTHYNphk0sktvAqp3XusA6HF4T0s1KXJQj3W1ACrYZWRkK/VystCLYwBRGpc3r
 cj7jAFmJyNpU236R5XYJ7ooHfPYpzZ8VAHBO8ykK7muHDfyBRXEIlmkGep++ctSv
 v0uZXAy6LONljKg87YJTien0UA7ze9lFgPTuV1y/qfaLbYNekUaJSDjfuhOpZZUw
 TzSh9OYoIvKpd0ylHwB1qMLd5CaXNicaeLfTW3xbX06KaXa7WNAonS35sK0EjhtZ
 0bBA9g6bRhphyh0tzR4saF9bczNvJydNCn7/QFo9dKbQUEL/FRXtJiIeusVx/0fJ
 +ZqWRTcEdDw2Rsyv52hKgyEJi7F3nL9ovabUN9P1/0aPcTdM3WekMpSOJm1U6wVF
 e6oSyeoeNdjcdxgWbQrgRNbmq5CPEV3ig5J+G418r5DTF3ifqZX+WscijUtKTu5K
 V5GpLc0PL9eoigQ37LmGkwK/4xoB9SAPTQuzUs9qgh9NidwT0cCfoNxpeGh6GeHX
 GLHPGU61vZaefxpwuAuv+SQSgxXSKk2/H/ijPzSjrX/PkUp7MoX9XoOQAh4FxZjO
 ok5YEUGXzSJfXQ==
 =yaCQ
 -----END PGP SIGNATURE-----

Merge tag 'core-entry-2025-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull generic entry code updates from Thomas Gleixner:

 - Split the code into syscall and exception/interrupt parts to ease the
   conversion of ARM[64] to the generic entry infrastructure

 - Extend syscall user dispatching to support a single intercepted range
   instead of the default single non-intercepted range. That allows
   monitoring/analysis of a specific executable range, e.g. a library,
   and also provides flexibility for sandboxing scenarios

 - Cleanup and extend the user dispatch selftest

* tag 'core-entry-2025-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Split generic entry into generic exception and syscall entry
  selftests: Add tests for PR_SYS_DISPATCH_INCLUSIVE_ON
  syscall_user_dispatch: Add PR_SYS_DISPATCH_INCLUSIVE_ON
  selftests: Fix errno checking in syscall_user_dispatch test
2025-07-29 15:14:29 -07:00
Gabriele Monaco adcc3bfa88 sched: Adapt sched tracepoints for RV task model
Add the following tracepoint:
* sched_set_need_resched(tsk, cpu, tif)
    Called when a task is set the need resched [lazy] flag

Remove the unused ip parameter from sched_entry and sched_exit and alter
sched_entry to have a value of preempt consistent with the one used in
sched_switch.

Also adapt all monitors using sched_{entry,exit} to avoid breaking build.

These tracepoints are useful to describe the Linux task model and are
adapted from the patches by Daniel Bristot de Oliveira
(https://bristot.me/linux-task-model/).

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Nam Cao <namcao@linutronix.de>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Juri Lelli <jlelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250728135022.255578-7-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28 16:47:34 -04:00
Peter Zijlstra 7de9d4f946 sched: Start blocked_on chain processing in find_proxy_task()
Start to flesh out the real find_proxy_task() implementation,
but avoid the migration cases for now, in those cases just
deactivate the donor task and pick again.

To ensure the donor task or other blocked tasks in the chain
aren't migrated away while we're running the proxy, also tweak
the fair class logic to avoid migrating donor or mutex blocked
tasks.

[jstultz: This change was split out from the larger proxy patch]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20250712033407.2383110-9-jstultz@google.com
2025-07-14 17:16:33 +02:00
Valentin Schneider be39617e38 sched: Fix proxy/current (push,pull)ability
Proxy execution forms atomic pairs of tasks: The waiting donor
task (scheduling context) and a proxy (execution context). The
donor task, along with the rest of the blocked chain, follows
the proxy wrt CPU placement.

They can be the same task, in which case push/pull doesn't need any
modification. When they are different, however,
FIFO1 & FIFO42:

	      ,->  RT42
	      |     | blocked-on
	      |     v
blocked_donor |   mutex
	      |     | owner
	      |     v
	      `--  RT1

   RT1
   RT42

  CPU0            CPU1
   ^                ^
   |                |
  overloaded    !overloaded
  rq prio = 42  rq prio = 0

RT1 is eligible to be pushed to CPU1, but should that happen it will
"carry" RT42 along. Clearly here neither RT1 nor RT42 must be seen as
push/pullable.

Unfortunately, only the donor task is usually dequeued from the rq,
and the proxy'ed execution context (rq->curr) remains on the rq.
This can cause RT1 to be selected for migration from logic like the
rt pushable_list.

Thus, adda a dequeue/enqueue cycle on the proxy task before __schedule
returns, which allows the sched class logic to avoid adding the now
current task to the pushable_list.

Furthermore, tasks becoming blocked on a mutex don't need an explicit
dequeue/enqueue cycle to be made (push/pull)able: they have to be running
to block on a mutex, thus they will eventually hit put_prev_task().

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20250712033407.2383110-8-jstultz@google.com
2025-07-14 17:16:33 +02:00
John Stultz be41bde4c3 sched: Add an initial sketch of the find_proxy_task() function
Add a find_proxy_task() function which doesn't do much.

When we select a blocked task to run, we will just deactivate it
and pick again. The exception being if it has become unblocked
after find_proxy_task() was called.

This allows us to validate keeping blocked tasks on the runqueue
and later deactivating them is working ok, stressing the failure
cases for when a proxy isn't found.

Greatly simplified from patch by:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

[jstultz: Split out from larger proxy patch and simplified
 for review and testing.]
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20250712033407.2383110-7-jstultz@google.com
2025-07-14 17:16:32 +02:00
John Stultz 25c411fce7 sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument
sched_proxy_exec= that can be used to disable the feature at boot
time if CONFIG_SCHED_PROXY_EXEC was enabled.

Also uses this option to allow the rq->donor to be different from
rq->curr.

Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20250712033407.2383110-2-jstultz@google.com
2025-07-14 17:16:31 +02:00
Peter Zijlstra 8f2146159b Merge branch 'tip/sched/urgent'
Avoid merge conflicts

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2025-07-14 17:16:28 +02:00
Juri Lelli 9f239df555 sched/deadline: Initialize dl_servers after SMP
dl-servers are currently initialized too early at boot when CPUs are not
fully up (only boot CPU is). This results in miscalculation of per
runqueue DEADLINE variables like extra_bw (which needs a stable CPU
count).

Move initialization of dl-servers later on after SMP has been
initialized and CPUs are all online, so that CPU count is stable and
DEADLINE variables can be computed correctly.

Fixes: d741f297bc ("sched/fair: Fair server interface")
Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b
Link: https://lore.kernel.org/r/20250627115118.438797-2-juri.lelli@redhat.com
2025-07-14 10:59:32 +02:00
Chen Yu db6cc3f4ac Revert "sched/numa: add statistics of numa balance task"
This reverts commit ad6b26b6a0.

This commit introduces per-memcg/task NUMA balance statistics, but
unfortunately it introduced a NULL pointer exception due to the following
race condition: After a swap task candidate was chosen, its mm_struct
pointer was set to NULL due to task exit.  Later, when performing the
actual task swapping, the p->mm caused the problem.

CPU0                                   CPU1
:
...
task_numa_migrate
     task_numa_find_cpu
      task_numa_compare
        # a normal task p is chosen
        env->best_task = p

                                          # p exit:
                                          exit_signals(p);
                                             p->flags |= PF_EXITING
                                          exit_mm
                                             p->mm = NULL;

      migrate_swap_stop
        __migrate_swap_task((arg->src_task, arg->dst_cpu)
         count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL

task_lock() should be held and the PF_EXITING flag needs to be checked to
prevent this from happening.  After discussion, the conclusion was that
adding a lock is not worthwhile for some statistics calculations.  Revert
the change and rely on the tracepoint for this purpose.

Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org
Fixes: ad6b26b6a0 ("sched/numa: add statistics of numa balance task")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/
Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com>
Reported-by: Suneeth D <Suneeth.D@amd.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Hladky <jhladky@redhat.com>
Cc: Libo Chen <libo.chen@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 21:07:56 -07:00
Peter Zijlstra 009836b4fa sched/core: Fix migrate_swap() vs. hotplug
On Mon, Jun 02, 2025 at 03:22:13PM +0800, Kuyo Chang wrote:

> So, the potential race scenario is:
>
> 	CPU0							CPU1
> 	// doing migrate_swap(cpu0/cpu1)
> 	stop_two_cpus()
> 							  ...
> 							 // doing _cpu_down()
> 							      sched_cpu_deactivate()
> 								set_cpu_active(cpu, false);
> 								balance_push_set(cpu, true);
> 	cpu_stop_queue_two_works
> 	    __cpu_stop_queue_work(stopper1,...);
> 	    __cpu_stop_queue_work(stopper2,..);
> 	stop_cpus_in_progress -> true
> 		preempt_enable();
> 								...
> 							1st balance_push
> 							stop_one_cpu_nowait
> 							cpu_stop_queue_work
> 							__cpu_stop_queue_work
> 							list_add_tail  -> 1st add push_work
> 							wake_up_q(&wakeq);  -> "wakeq is empty.
> 										This implies that the stopper is at wakeq@migrate_swap."
> 	preempt_disable
> 	wake_up_q(&wakeq);
> 	        wake_up_process // wakeup migrate/0
> 		    try_to_wake_up
> 		        ttwu_queue
> 		            ttwu_queue_cond ->meet below case
> 		                if (cpu == smp_processor_id())
> 			         return false;
> 			ttwu_do_activate
> 			//migrate/0 wakeup done
> 		wake_up_process // wakeup migrate/1
> 	           try_to_wake_up
> 		    ttwu_queue
> 			ttwu_queue_cond
> 		        ttwu_queue_wakelist
> 			__ttwu_queue_wakelist
> 			__smp_call_single_queue
> 	preempt_enable();
>
> 							2nd balance_push
> 							stop_one_cpu_nowait
> 							cpu_stop_queue_work
> 							__cpu_stop_queue_work
> 							list_add_tail  -> 2nd add push_work, so the double list add is detected
> 							...
> 							...
> 							cpu1 get ipi, do sched_ttwu_pending, wakeup migrate/1
>

So this balance_push() is part of schedule(), and schedule() is supposed
to switch to stopper task, but because of this race condition, stopper
task is stuck in WAKING state and not actually visible to be picked.

Therefore CPU1 can do another schedule() and end up doing another
balance_push() even though the last one hasn't been done yet.

This is a confluence of fail, where both wake_q and ttwu_wakelist can
cause crucial wakeups to be delayed, resulting in the malfunction of
balance_push.

Since there is only a single stopper thread to be woken, the wake_q
doesn't really add anything here, and can be removed in favour of
direct wakeups of the stopper thread.

Then add a clause to ttwu_queue_cond() to ensure the stopper threads
are never queued / delayed.

Of all 3 moving parts, the last addition was the balance_push()
machinery, so pick that as the point the bug was introduced.

Fixes: 2558aacff8 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")
Reported-by: Kuyo Chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kuyo Chang <kuyo.chang@mediatek.com>
Link: https://lkml.kernel.org/r/20250605100009.GO39944@noisy.programming.kicks-ass.net
2025-07-01 15:02:03 +02:00
Thomas Weißschuh 3ebb1b6522 sched: Fix preemption string of preempt_dynamic_none
Zero is a valid value for "preempt_dynamic_mode", namely
"preempt_dynamic_none".

Fix the off-by-one in preempt_model_str(), so that "preempty_dynamic_none"
is correctly formatted as PREEMPT(none) instead of PREEMPT(undef).

Fixes: 8bdc5daaa0 ("sched: Add a generic function to return the preemption string")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20250626-preempt-str-none-v2-1-526213b70a89@linutronix.de
2025-07-01 15:02:02 +02:00
Jinjie Ruan a70e9f647f entry: Split generic entry into generic exception and syscall entry
Currently CONFIG_GENERIC_ENTRY enables both the generic exception
entry logic and the generic syscall entry logic, which are otherwise
loosely coupled.

Introduce separate config options for these so that architectures can
select the two independently. This will make it easier for
architectures to migrate to generic entry code.

Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/20250213130007.1418890-2-ruanjinjie@huawei.com
Link: https://lore.kernel.org/all/20250624-generic-entry-split-v1-1-53d5ef4f94df@linaro.org

[Linus Walleij: rebase onto v6.16-rc1]
2025-06-30 19:52:55 +02:00
Tejun Heo ddceadce63 sched_ext: Add support for cgroup bandwidth control interface
From 077814f57f8acce13f91dc34bbd2b7e4911fbf25 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 13 Jun 2025 15:06:47 -1000

- Add CONFIG_GROUP_SCHED_BANDWIDTH which is selected by both
  CONFIG_CFS_BANDWIDTH and EXT_GROUP_SCHED.

- Put bandwidth control interface files for both cgroup v1 and v2 under
  CONFIG_GROUP_SCHED_BANDWIDTH.

- Update tg_bandwidth() to fetch configuration parameters from fair if
  CONFIG_CFS_BANDWIDTH, SCX otherwise.

- Update tg_set_bandwidth() to update the parameters for both fair and SCX.

- Add bandwidth control parameters to struct scx_cgroup_init_args.

- Add sched_ext_ops.cgroup_set_bandwidth() which is invoked on bandwidth
  control parameter updates.

- Update scx_qmap and maximal selftest to test the new feature.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20 17:03:51 -10:00
Tejun Heo e4e149dd2f sched_ext: Merge branch 'for-6.16-fixes' into for-6.17
Pull sched_ext/for-6.16-fixes to receive:

 c50784e99f ("sched_ext: Make scx_group_set_weight() always update tg->scx.weight")
 33796b9187 ("sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group()")

which are needed to implement CPU bandwidth control interface support.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20 17:01:21 -10:00
Tejun Heo 5bc34be478 sched/core: Reorganize cgroup bandwidth control interface file writes
- Move input parameter validation from tg_set_cfs_bandwidth() to the new
  outer function tg_set_bandwidth(). The outer function handles parameters
  in usecs, validates them and calls tg_set_cfs_bandwidth() which converts
  them into nsecs. This matches tg_bandwidth() on the read side.

- max/min_cfs_* consts are now used by tg_set_bandwidth(). Relocate, convert
  into usecs and drop "cfs" from the names.

- Reimplement cpu_cfs_{period|quote|burst}_write_*() using tg_bandwidth()
  and tg_set_bandwidth() and replace "cfs" in the names with "bw".

- Update cpu_max_write() to use tg_set_bandiwdth(). cpu_period_quota_parse()
  is updated to drop nsec conversion accordingly. This aligns the behavior
  with cfs_period_quota_print().

- Drop now unused tg_set_cfs_{period|quota|burst}().

- While at it, for consistency, rename default_cfs_period() to
  default_bw_period_us() and make it return usecs.

This is to prepare for adding bandwidth control support to sched_ext.
tg_set_bandwidth() will be used as the muxing point. No functional changes
intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250614012346.2358261-5-tj@kernel.org
2025-06-18 13:59:57 +02:00