mirror of https://github.com/torvalds/linux.git
682 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
f0cc749254 |
cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex in freezer_css_{online,offline}()
syzbot is again reporting circular locking dependency between cpu_hotplug_lock and freezer_mutex. Do like what we did with commit |
|
|
|
0a2dc6ac33 |
cgroup: remove cgroup_rstat_flush_atomic()
Previous patches removed the only caller of cgroup_rstat_flush_atomic(). Remove the function and simplify the code. Link: https://lkml.kernel.org/r/20230421174020.2994750-6-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
0dad9b072b |
cgroup: make cgroup_is_threaded() and cgroup_is_thread_root() static
They're only called inside cgroup.c. Make them static. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
c20d4d889f |
rdmacg: fix kernel-doc warnings in rdmacg
Fix all kernel-doc warnings in rdmacg: kernel/cgroup/rdma.c:209: warning: Function parameter or member 'cg' not described in 'rdmacg_uncharge_hierarchy' kernel/cgroup/rdma.c:230: warning: Function parameter or member 'cg' not described in 'rdmacg_uncharge' Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
7bf11e90a3 |
cgroup: Replace the css_set call with cgroup_get
We will release the refcnt of cgroup via cgroup_put, for example in the cgroup_lock_and_drain_offline function, so replace css_get with the cgroup_get function for better readability. Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
a49a11dc64 |
cgroup: remove unused macro for_each_e_css()
for_each_e_css() is unused now. Remove it. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
659db0789c |
cgroup: Update out-of-date comment in cgroup_migrate()
Commit
|
|
|
|
2bd1103392 |
cgroup: always put cset in cgroup_css_set_put_fork
A successful call to cgroup_css_set_fork() will always have taken
a ref on kargs->cset (regardless of CLONE_INTO_CGROUP), so always
do a corresponding put in cgroup_css_set_put_fork().
Without this, a cset and its contained css structures will be
leaked for some fork failures. The following script reproduces
the leak for a fork failure due to exceeding pids.max in the
pids controller. A similar thing can happen if we jump to the
bad_fork_cancel_cgroup label in copy_process().
[ -z "$1" ] && echo "Usage $0 pids-root" && exit 1
PID_ROOT=$1
CGROUP=$PID_ROOT/foo
[ -e $CGROUP ] && rmdir -f $CGROUP
mkdir $CGROUP
echo 5 > $CGROUP/pids.max
echo $$ > $CGROUP/cgroup.procs
fork_bomb()
{
set -e
for i in $(seq 10); do
/bin/sleep 3600 &
done
}
(fork_bomb) &
wait
echo $$ > $PID_ROOT/cgroup.procs
kill $(cat $CGROUP/cgroup.procs)
rmdir $CGROUP
Fixes:
|
|
|
|
c33080cdc0 |
cgroup: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
ab1de7ead8 |
cgroup: fix missing cpus_read_{lock,unlock}() in cgroup_transfer_tasks()
The commit |
|
|
|
6588880509 |
cgroup/cpuset: remove unneeded header files
Remove some unnecessary header files. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
2ef269ef1a |
cgroup/cpuset: Free DL BW in case can_attach() fails
cpuset_can_attach() can fail. Postpone DL BW allocation until all tasks have been checked. DL BW is not allocated per-task but as a sum over all DL tasks migrating. If multiple controllers are attached to the cgroup next to the cpuset controller a non-cpuset can_attach() can fail. In this case free DL BW in cpuset_cancel_attach(). Finally, update cpuset DL task count (nr_deadline_tasks) only in cpuset_attach(). Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
c0f78fd5ed |
cgroup/cpuset: Iterate only if DEADLINE tasks are present
update_tasks_root_domain currently iterates over all tasks even if no DEADLINE task is present on the cpuset/root domain for which bandwidth accounting is being rebuilt. This has been reported to introduce 10+ ms delays on suspend-resume operations. Skip the costly iteration for cpusets that don't contain DEADLINE tasks. Reported-by: Qais Yousef <qyousef@layalina.io> Link: https://lore.kernel.org/lkml/20230206221428.2125324-1-qyousef@layalina.io/ Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
6c24849f55 |
sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets
Qais reported that iterating over all tasks when rebuilding root domains for finding out which ones are DEADLINE and need their bandwidth correctly restored on such root domains can be a costly operation (10+ ms delays on suspend-resume). To fix the problem keep track of the number of DEADLINE tasks belonging to each cpuset and then use this information (followup patch) to only perform the above iteration if DEADLINE tasks are actually present in the cpuset for which a corresponding root domain is being rebuilt. Reported-by: Qais Yousef <qyousef@layalina.io> Link: https://lore.kernel.org/lkml/20230206221428.2125324-1-qyousef@layalina.io/ Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
111cd11bbc |
sched/cpuset: Bring back cpuset_mutex
Turns out percpu_cpuset_rwsem - commit
|
|
|
|
ad3a557daf |
cgroup/cpuset: Rename functions dealing with DEADLINE accounting
rebuild_root_domains() and update_tasks_root_domain() have neutral names, but actually deal with DEADLINE bandwidth accounting. Rename them to use 'dl_' prefix so that intent is more clear. No functional change. Suggested-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
519fabc7aa |
psi: remove 500ms min window size limitation for triggers
Current 500ms min window size for psi triggers limits polling interval to 50ms to prevent polling threads from using too much cpu bandwidth by polling too frequently. However the number of cgroups with triggers is unlimited, so this protection can be defeated by creating multiple cgroups with psi triggers (triggers in each cgroup are served by a single "psimon" kernel thread). Instead of limiting min polling period, which also limits the latency of psi events, it's better to limit psi trigger creation to authorized users only, like we do for system-wide psi triggers (/proc/pressure/* files can be written only by processes with CAP_SYS_RESOURCE capability). This also makes access rules for cgroup psi files consistent with system-wide ones. Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and remove the psi window min size limitation. Suggested-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/all/cover.1676067791.git.quic_sudaraja@quicinc.com/ |
|
|
|
86e98ed15b |
cgroup changes for v6.4-rc1
* cpuset changes including the fix for an incorrect interaction with CPU hotplug and an optimization. * Other doc and cosmetic changes. -----BEGIN PGP SIGNATURE----- iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZErfng4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGVVtAQCDycK4VSgc4nsFPG1vh1Oy1A723ciEUwAbKmV/ F1n7xwEA68FiDvE29LpMJJuYP9HnX0A5zRMyNnb52kN9jmgcEQI= =ALol -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - cpuset changes including the fix for an incorrect interaction with CPU hotplug and an optimization - Other doc and cosmetic changes * tag 'cgroup-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs: cgroup-v1/cpusets: update libcgroup project link cgroup/cpuset: Minor updates to test_cpuset_prs.sh cgroup/cpuset: Include offline CPUs when tasks' cpumasks in top_cpuset are updated cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset cpuset: Clean up cpuset_node_allowed cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappers |
|
|
|
586b222d74 |
Scheduler changes for v6.4:
- Allow unprivileged PSI poll()ing
- Fix performance regression introduced by mm_cid
- Improve livepatch stalls by adding livepatch task switching to cond_resched(),
this resolves livepatching busy-loop stalls with certain CPU-bound kthreads.
- Improve sched_move_task() performance on autogroup configs.
- On core-scheduling CPUs, avoid selecting throttled tasks to run
- Misc cleanups, fixes and improvements.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRK39cRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hXPhAAk2WqOV2cW4BjSCHjWWE05IfTb0HMn8si
mFGBAnr1GIkJRvICAusAwDU3FcmP5mWyXA+LK110d3x4fKJP15vCD5ru5lHnBfX7
fSD+Ml8uM4Xlp8iUoQspilbQwmWkQSwhudbDs3Nj7XGUzJCvNgm1sM3xPRDlqSJ5
6zumfVOPTfzSGcZY3a8sMuJnCepZHLRR6NkLzo/DuI1NMy2Jw1dK43dh77AO1mBF
M53PF2IQgm6Wu/67p2k5eDq4c0AKL4PyIb4dRTGOPyljWMf41n28jwMv1tjlvu+Y
uT0JD8MJSrFiylyT41x7Asr7orAGXj3cPhShK5R0vrutx/SbqBiaaE1MO9U3aC3B
7xVXEORHWD6KIDqTvzmWGrMBkIdyWB6CLk6EJKr3MqM9hUtP2ift7bkAgIad9h+4
G9DdVePGoCyh/TQtJ9EPIULAYeu9mmDZe8rTQ8C5MCSg//05/CTMgBbb0NiFWhnd
0JQl1B0nNUA87whVUxK8Hfu4DLh7m9jrzgQr9Ww8/FwQ6tQHBOKWgDdbv45ckkaG
cJIQt/+vLilddazc8u8E+BGaD5w2uIYF0uL7kvG6Q5oARX06AZ5dj1m06vhZe/Ym
laOVZEpJsbQnxviY6jwj1n+CSB9aK7feiQfDePBPbpJGGUHyZoKrnLN6wmW2se+H
VCHtdgsEl5I=
=Hgci
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Allow unprivileged PSI poll()ing
- Fix performance regression introduced by mm_cid
- Improve livepatch stalls by adding livepatch task switching to
cond_resched(). This resolves livepatching busy-loop stalls with
certain CPU-bound kthreads
- Improve sched_move_task() performance on autogroup configs
- On core-scheduling CPUs, avoid selecting throttled tasks to run
- Misc cleanups, fixes and improvements
* tag 'sched-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/clock: Fix local_clock() before sched_clock_init()
sched/rt: Fix bad task migration for rt tasks
sched: Fix performance regression introduced by mm_cid
sched/core: Make sched_dynamic_mutex static
sched/psi: Allow unprivileged polling of N*2s period
sched/psi: Extract update_triggers side effect
sched/psi: Rename existing poll members in preparation
sched/psi: Rearrange polling code in preparation
sched/fair: Fix inaccurate tally of ttwu_move_affine
vhost: Fix livepatch timeouts in vhost_worker()
livepatch,sched: Add livepatch task switching to cond_resched()
livepatch: Skip task_call_func() for current task
livepatch: Convert stack entries array to percpu
sched: Interleave cfs bandwidth timers for improved single thread performance at low utilization
sched/core: Reduce cost of sched_move_task when config autogroup
sched/core: Avoid selecting the task that is throttled to run when core-sched enable
sched/topology: Make sched_energy_mutex,update static
|
|
|
|
7fa8a8ee94 |
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page().
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
than exclusive, and has fixed a bunch of errors which were caused by its
unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
=J+Dj
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
|
|
|
|
6e98b09da9 |
Networking changes for 6.4.
Core
----
- Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
default value allows for better BIG TCP performances.
- Reduce compound page head access for zero-copy data transfers.
- RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible.
- Threaded NAPI improvements, adding defer skb free support and unneeded
softirq avoidance.
- Address dst_entry reference count scalability issues, via false
sharing avoidance and optimize refcount tracking.
- Add lockless accesses annotation to sk_err[_soft].
- Optimize again the skb struct layout.
- Extends the skb drop reasons to make it usable by multiple
subsystems.
- Better const qualifier awareness for socket casts.
BPF
---
- Add skb and XDP typed dynptrs which allow BPF programs for more
ergonomic and less brittle iteration through data and variable-sized
accesses.
- Add a new BPF netfilter program type and minimal support to hook
BPF programs to netfilter hooks such as prerouting or forward.
- Add more precise memory usage reporting for all BPF map types.
- Adds support for using {FOU,GUE} encap with an ipip device operating
in collect_md mode and add a set of BPF kfuncs for controlling encap
params.
- Allow BPF programs to detect at load time whether a particular kfunc
exists or not, and also add support for this in light skeleton.
- Bigger batch of BPF verifier improvements to prepare for upcoming BPF
open-coded iterators allowing for less restrictive looping capabilities.
- Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF
programs to NULL-check before passing such pointers into kfunc.
- Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in
local storage maps.
- Enable RCU semantics for task BPF kptrs and allow referenced kptr
tasks to be stored in BPF maps.
- Add support for refcounted local kptrs to the verifier for allowing
shared ownership, useful for adding a node to both the BPF list and
rbtree.
- Add BPF verifier support for ST instructions in convert_ctx_access()
which will help new -mcpu=v4 clang flag to start emitting them.
- Add ARM32 USDT support to libbpf.
- Improve bpftool's visual program dump which produces the control
flow graph in a DOT format by adding C source inline annotations.
Protocols
---------
- IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
indicates the provenance of the IP address.
- IPv6: optimize route lookup, dropping unneeded R/W lock acquisition.
- Add the handshake upcall mechanism, allowing the user-space
to implement generic TLS handshake on kernel's behalf.
- Bridge: support per-{Port, VLAN} neighbor suppression, increasing
resilience to nodes failures.
- SCTP: add support for Fair Capacity and Weighted Fair Queueing
schedulers.
- MPTCP: delay first subflow allocation up to its first usage. This
will allow for later better LSM interaction.
- xfrm: Remove inner/outer modes from input/output path. These are
not needed anymore.
- WiFi:
- reduced neighbor report (RNR) handling for AP mode
- HW timestamping support
- support for randomized auth/deauth TA for PASN privacy
- per-link debugfs for multi-link
- TC offload support for mac80211 drivers
- mac80211 mesh fast-xmit and fast-rx support
- enable Wi-Fi 7 (EHT) mesh support
Netfilter
---------
- Add nf_tables 'brouting' support, to force a packet to be routed
instead of being bridged.
- Update bridge netfilter and ovs conntrack helpers to handle
IPv6 Jumbo packets properly, i.e. fetch the packet length
from hop-by-hop extension header. This is needed for BIT TCP
support.
- The iptables 32bit compat interface isn't compiled in by default
anymore.
- Move ip(6)tables builtin icmp matches to the udptcp one.
This has the advantage that icmp/icmpv6 match doesn't load the
iptables/ip6tables modules anymore when iptables-nft is used.
- Extended netlink error report for netdevice in flowtables and
netdev/chains. Allow for incrementally add/delete devices to netdev
basechain. Allow to create netdev chain without device.
Driver API
----------
- Remove redundant Device Control Error Reporting Enable, as PCI core
has already error reporting enabled at enumeration time.
- Move Multicast DB netlink handlers to core, allowing devices other
then bridge to use them.
- Allow the page_pool to directly recycle the pages from safely
localized NAPI.
- Implement lockless TX queue stop/wake combo macros, allowing for
further code de-duplication and sanitization.
- Add YNL support for user headers and struct attrs.
- Add partial YNL specification for devlink.
- Add partial YNL specification for ethtool.
- Add tc-mqprio and tc-taprio support for preemptible traffic classes.
- Add tx push buf len param to ethtool, specifies the maximum number
of bytes of a transmitted packet a driver can push directly to the
underlying device.
- Add basic LED support for switch/phy.
- Add NAPI documentation, stop relaying on external links.
- Convert dsa_master_ioctl() to netdev notifier. This is a preparatory
work to make the hardware timestamping layer selectable by user
space.
- Add transceiver support and improve the error messages for CAN-FD
controllers.
New hardware / drivers
----------------------
- Ethernet:
- AMD/Pensando core device support
- MediaTek MT7981 SoC
- MediaTek MT7988 SoC
- Broadcom BCM53134 embedded switch
- Texas Instruments CPSW9G ethernet switch
- Qualcomm EMAC3 DWMAC ethernet
- StarFive JH7110 SoC
- NXP CBTX ethernet PHY
- WiFi:
- Apple M1 Pro/Max devices
- RealTek rtl8710bu/rtl8188gu
- RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
- Bluetooth:
- Realtek RTL8821CS, RTL8851B, RTL8852BS
- Mediatek MT7663, MT7922
- NXP w8997
- Actions Semi ATS2851
- QTI WCN6855
- Marvell 88W8997
- Can:
- STMicroelectronics bxcan stm32f429
Drivers
-------
- Ethernet NICs:
- Intel (1G, icg):
- add tracking and reporting of QBV config errors.
- add support for configuring max SDU for each Tx queue.
- Intel (100G, ice):
- refactor mailbox overflow detection to support Scalable IOV
- GNSS interface optimization
- Intel (i40e):
- support XDP multi-buffer
- nVidia/Mellanox:
- add the support for linux bridge multicast offload
- enable TC offload for egress and engress MACVLAN over bond
- add support for VxLAN GBP encap/decap flows offload
- extend packet offload to fully support libreswan
- support tunnel mode in mlx5 IPsec packet offload
- extend XDP multi-buffer support
- support MACsec VLAN offload
- add support for dynamic msix vectors allocation
- drop RX page_cache and fully use page_pool
- implement thermal zone to report NIC temperature
- Netronome/Corigine:
- add support for multi-zone conntrack offload
- Solarflare/Xilinx:
- support offloading TC VLAN push/pop actions to the MAE
- support TC decap rules
- support unicast PTP
- Other NICs:
- Broadcom (bnxt): enforce software based freq adjustments only
on shared PHC NIC
- RealTek (r8169): refactor to addess ASPM issues during NAPI poll.
- Micrel (lan8841): add support for PTP_PF_PEROUT
- Cadence (macb): enable PTP unicast
- Engleder (tsnep): add XDP socket zero-copy support
- virtio-net: implement exact header length guest feature
- veth: add page_pool support for page recycling
- vxlan: add MDB data path support
- gve: add XDP support for GQI-QPL format
- geneve: accept every ethertype
- macvlan: allow some packets to bypass broadcast queue
- mana: add support for jumbo frame
- Ethernet high-speed switches:
- Microchip (sparx5): Add support for TC flower templates.
- Ethernet embedded switches:
- Broadcom (b54):
- configure 6318 and 63268 RGMII ports
- Marvell (mv88e6xxx):
- faster C45 bus scan
- Microchip:
- lan966x:
- add support for IS1 VCAP
- better TX/RX from/to CPU performances
- ksz9477: add ETS Qdisc support
- ksz8: enhance static MAC table operations and error handling
- sama7g5: add PTP capability
- NXP (ocelot):
- add support for external ports
- add support for preemptible traffic classes
- Texas Instruments:
- add CPSWxG SGMII support for J7200 and J721E
- Intel WiFi (iwlwifi):
- preparation for Wi-Fi 7 EHT and multi-link support
- EHT (Wi-Fi 7) sniffer support
- hardware timestamping support for some devices/firwmares
- TX beacon protection on newer hardware
- Qualcomm 802.11ax WiFi (ath11k):
- MU-MIMO parameters support
- ack signal support for management packets
- RealTek WiFi (rtw88):
- SDIO bus support
- better support for some SDIO devices
(e.g. MAC address from efuse)
- RealTek WiFi (rtw89):
- HW scan support for 8852b
- better support for 6 GHz scanning
- support for various newer firmware APIs
- framework firmware backwards compatibility
- MediaTek WiFi (mt76):
- P2P support
- mesh A-MSDU support
- EHT (Wi-Fi 7) support
- coredump support
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmRI/mUSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkgO0QAJGxpuN67YgYV0BIM+/atWKEEexJYG7B
9MMpU4jMO3EW/pUS5t7VRsBLUybLYVPmqCZoHodObDfnu59jiPOegb6SikJv/ZwJ
Zw62PVk5MvDnQjlu4e6kDcGwkplteN08TlgI+a49BUTedpdFitrxHAYGW8f2fRO6
cK2XSld+ZucMoym5vRwf8yWS1BwdxnslPMxDJ+/8ZbWBZv44qAnG2vMB/kIx7ObC
Vel/4m6MzTwVsLYBsRvcwMVbNNlZ9GuhztlTzEbfGA4ZhTadIAMgb5VTWXB84Ws7
Aic5wTdli+q+x6/2cxhbyeoVuB9HHObYmLBAciGg4GNljP5rnQBY3X3+KVZ/x9TI
HQB7CmhxmAZVrO9pLARFV+ECrMTH2/dy3NyrZ7uYQ3WPOXJi8hJZjOTO/eeEGL7C
eTjdz0dZBWIBK2gON/6s4nExXVQUTEF2ZsPi52jTTClKjfe5pz/ddeFQIWaY1DTm
pInEiWPAvd28JyiFmhFNHsuIBCjX/Zqe2JuMfMBeBibDAC09o/OGdKJYUI15AiRf
F46Pdb7use/puqfrYW44kSAfaPYoBiE+hj1RdeQfen35xD9HVE4vdnLNeuhRlFF9
aQfyIRHYQofkumRDr5f8JEY66cl9NiKQ4IVW1xxQfYDNdC6wQqREPG1md7rJVMrJ
vP7ugFnttneg
=ITVa
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
default value allows for better BIG TCP performances
- Reduce compound page head access for zero-copy data transfers
- RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when
possible
- Threaded NAPI improvements, adding defer skb free support and
unneeded softirq avoidance
- Address dst_entry reference count scalability issues, via false
sharing avoidance and optimize refcount tracking
- Add lockless accesses annotation to sk_err[_soft]
- Optimize again the skb struct layout
- Extends the skb drop reasons to make it usable by multiple
subsystems
- Better const qualifier awareness for socket casts
BPF:
- Add skb and XDP typed dynptrs which allow BPF programs for more
ergonomic and less brittle iteration through data and
variable-sized accesses
- Add a new BPF netfilter program type and minimal support to hook
BPF programs to netfilter hooks such as prerouting or forward
- Add more precise memory usage reporting for all BPF map types
- Adds support for using {FOU,GUE} encap with an ipip device
operating in collect_md mode and add a set of BPF kfuncs for
controlling encap params
- Allow BPF programs to detect at load time whether a particular
kfunc exists or not, and also add support for this in light
skeleton
- Bigger batch of BPF verifier improvements to prepare for upcoming
BPF open-coded iterators allowing for less restrictive looping
capabilities
- Rework RCU enforcement in the verifier, add kptr_rcu and enforce
BPF programs to NULL-check before passing such pointers into kfunc
- Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and
in local storage maps
- Enable RCU semantics for task BPF kptrs and allow referenced kptr
tasks to be stored in BPF maps
- Add support for refcounted local kptrs to the verifier for allowing
shared ownership, useful for adding a node to both the BPF list and
rbtree
- Add BPF verifier support for ST instructions in
convert_ctx_access() which will help new -mcpu=v4 clang flag to
start emitting them
- Add ARM32 USDT support to libbpf
- Improve bpftool's visual program dump which produces the control
flow graph in a DOT format by adding C source inline annotations
Protocols:
- IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
indicates the provenance of the IP address
- IPv6: optimize route lookup, dropping unneeded R/W lock acquisition
- Add the handshake upcall mechanism, allowing the user-space to
implement generic TLS handshake on kernel's behalf
- Bridge: support per-{Port, VLAN} neighbor suppression, increasing
resilience to nodes failures
- SCTP: add support for Fair Capacity and Weighted Fair Queueing
schedulers
- MPTCP: delay first subflow allocation up to its first usage. This
will allow for later better LSM interaction
- xfrm: Remove inner/outer modes from input/output path. These are
not needed anymore
- WiFi:
- reduced neighbor report (RNR) handling for AP mode
- HW timestamping support
- support for randomized auth/deauth TA for PASN privacy
- per-link debugfs for multi-link
- TC offload support for mac80211 drivers
- mac80211 mesh fast-xmit and fast-rx support
- enable Wi-Fi 7 (EHT) mesh support
Netfilter:
- Add nf_tables 'brouting' support, to force a packet to be routed
instead of being bridged
- Update bridge netfilter and ovs conntrack helpers to handle IPv6
Jumbo packets properly, i.e. fetch the packet length from
hop-by-hop extension header. This is needed for BIT TCP support
- The iptables 32bit compat interface isn't compiled in by default
anymore
- Move ip(6)tables builtin icmp matches to the udptcp one. This has
the advantage that icmp/icmpv6 match doesn't load the
iptables/ip6tables modules anymore when iptables-nft is used
- Extended netlink error report for netdevice in flowtables and
netdev/chains. Allow for incrementally add/delete devices to netdev
basechain. Allow to create netdev chain without device
Driver API:
- Remove redundant Device Control Error Reporting Enable, as PCI core
has already error reporting enabled at enumeration time
- Move Multicast DB netlink handlers to core, allowing devices other
then bridge to use them
- Allow the page_pool to directly recycle the pages from safely
localized NAPI
- Implement lockless TX queue stop/wake combo macros, allowing for
further code de-duplication and sanitization
- Add YNL support for user headers and struct attrs
- Add partial YNL specification for devlink
- Add partial YNL specification for ethtool
- Add tc-mqprio and tc-taprio support for preemptible traffic classes
- Add tx push buf len param to ethtool, specifies the maximum number
of bytes of a transmitted packet a driver can push directly to the
underlying device
- Add basic LED support for switch/phy
- Add NAPI documentation, stop relaying on external links
- Convert dsa_master_ioctl() to netdev notifier. This is a
preparatory work to make the hardware timestamping layer selectable
by user space
- Add transceiver support and improve the error messages for CAN-FD
controllers
New hardware / drivers:
- Ethernet:
- AMD/Pensando core device support
- MediaTek MT7981 SoC
- MediaTek MT7988 SoC
- Broadcom BCM53134 embedded switch
- Texas Instruments CPSW9G ethernet switch
- Qualcomm EMAC3 DWMAC ethernet
- StarFive JH7110 SoC
- NXP CBTX ethernet PHY
- WiFi:
- Apple M1 Pro/Max devices
- RealTek rtl8710bu/rtl8188gu
- RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
- Bluetooth:
- Realtek RTL8821CS, RTL8851B, RTL8852BS
- Mediatek MT7663, MT7922
- NXP w8997
- Actions Semi ATS2851
- QTI WCN6855
- Marvell 88W8997
- Can:
- STMicroelectronics bxcan stm32f429
Drivers:
- Ethernet NICs:
- Intel (1G, icg):
- add tracking and reporting of QBV config errors
- add support for configuring max SDU for each Tx queue
- Intel (100G, ice):
- refactor mailbox overflow detection to support Scalable IOV
- GNSS interface optimization
- Intel (i40e):
- support XDP multi-buffer
- nVidia/Mellanox:
- add the support for linux bridge multicast offload
- enable TC offload for egress and engress MACVLAN over bond
- add support for VxLAN GBP encap/decap flows offload
- extend packet offload to fully support libreswan
- support tunnel mode in mlx5 IPsec packet offload
- extend XDP multi-buffer support
- support MACsec VLAN offload
- add support for dynamic msix vectors allocation
- drop RX page_cache and fully use page_pool
- implement thermal zone to report NIC temperature
- Netronome/Corigine:
- add support for multi-zone conntrack offload
- Solarflare/Xilinx:
- support offloading TC VLAN push/pop actions to the MAE
- support TC decap rules
- support unicast PTP
- Other NICs:
- Broadcom (bnxt): enforce software based freq adjustments only on
shared PHC NIC
- RealTek (r8169): refactor to addess ASPM issues during NAPI poll
- Micrel (lan8841): add support for PTP_PF_PEROUT
- Cadence (macb): enable PTP unicast
- Engleder (tsnep): add XDP socket zero-copy support
- virtio-net: implement exact header length guest feature
- veth: add page_pool support for page recycling
- vxlan: add MDB data path support
- gve: add XDP support for GQI-QPL format
- geneve: accept every ethertype
- macvlan: allow some packets to bypass broadcast queue
- mana: add support for jumbo frame
- Ethernet high-speed switches:
- Microchip (sparx5): Add support for TC flower templates
- Ethernet embedded switches:
- Broadcom (b54):
- configure 6318 and 63268 RGMII ports
- Marvell (mv88e6xxx):
- faster C45 bus scan
- Microchip:
- lan966x:
- add support for IS1 VCAP
- better TX/RX from/to CPU performances
- ksz9477: add ETS Qdisc support
- ksz8: enhance static MAC table operations and error handling
- sama7g5: add PTP capability
- NXP (ocelot):
- add support for external ports
- add support for preemptible traffic classes
- Texas Instruments:
- add CPSWxG SGMII support for J7200 and J721E
- Intel WiFi (iwlwifi):
- preparation for Wi-Fi 7 EHT and multi-link support
- EHT (Wi-Fi 7) sniffer support
- hardware timestamping support for some devices/firwmares
- TX beacon protection on newer hardware
- Qualcomm 802.11ax WiFi (ath11k):
- MU-MIMO parameters support
- ack signal support for management packets
- RealTek WiFi (rtw88):
- SDIO bus support
- better support for some SDIO devices (e.g. MAC address from
efuse)
- RealTek WiFi (rtw89):
- HW scan support for 8852b
- better support for 6 GHz scanning
- support for various newer firmware APIs
- framework firmware backwards compatibility
- MediaTek WiFi (mt76):
- P2P support
- mesh A-MSDU support
- EHT (Wi-Fi 7) support
- coredump support"
* tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits)
net: phy: hide the PHYLIB_LEDS knob
net: phy: marvell-88x2222: remove unnecessary (void*) conversions
tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
net: amd: Fix link leak when verifying config failed
net: phy: marvell: Fix inconsistent indenting in led_blink_set
lan966x: Don't use xdp_frame when action is XDP_TX
tsnep: Add XDP socket zero-copy TX support
tsnep: Add XDP socket zero-copy RX support
tsnep: Move skb receive action to separate function
tsnep: Add functions for queue enable/disable
tsnep: Rework TX/RX queue initialization
tsnep: Replace modulo operation with mask
net: phy: dp83867: Add led_brightness_set support
net: phy: Fix reading LED reg property
drivers: nfc: nfcsim: remove return value check of `dev_dir`
net: phy: dp83867: Remove unnecessary (void*) conversions
net: ethtool: coalesce: try to make user settings stick twice
net: mana: Check if netdev/napi_alloc_frag returns single page
net: mana: Rename mana_refill_rxoob and remove some empty lines
net: veth: add page_pool stats
...
|
|
|
|
ef36b9afc2 |
fget() to fdget() conversions
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYCQAAKCRBZ7Krx/gZQ 64FdAQDZ2hTDyZEWPt486dWYPYpiKyaGFXSXDGo7wgP0fiwxXQEA/mROKb6JqYw6 27mZ9A7qluT8r3AfTTQ0D+Yse/dr4AM= =GA9W -----END PGP SIGNATURE----- Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fget updates from Al Viro: "fget() to fdget() conversions" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fuse_dev_ioctl(): switch to fdget() cgroup_get_from_fd(): switch to fdget_raw() bpf: switch to fdget_raw() build_mount_idmapped(): switch to fdget() kill the last remaining user of proc_ns_fget() SVM-SEV: convert the rest of fget() uses to fdget() in there convert sgx_set_attribute() to fdget()/fdput() convert setns(2) to fdget()/fdput() |
|
|
|
5a4d3b38ed |
Linux 6.3-rc7
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmQ8dXkeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG24sH/1ADVMOmTCAsYODk
QeFm1PvcpDenmGnGN69r5iJ2AbPAhfo3TOAMnQLSxz7ES4VFs6CeArqMGKYYLa0k
2no/gra2jpVLx/qBMIjxkxUBS4uKNOjm9PR+vaamJ2yZOXTWTJFUThzMjVZI8anm
TFewF4Nb/A91+a4unPtYWROSjozr27g0aqUAu80/V73xmxSk74pvDJLLA+NMB7vZ
cQWkABqW9wpSPr1vkxVNgf6N5DmSmKZWnePG5GjfN5P+BU+eQLyERrOx8ttOvmAR
Z62R7S49Zc5BeR2CuBNxWwDober1UIb2Q0PUbvLo6HbN+LidJh90WtAlRYpref2f
NRB49N4=
=70dU
-----END PGP SIGNATURE-----
Merge branch 'v6.3-rc7'
Sync with the urgent patches; in particular:
|
|
|
|
2f31fa029d |
cgroup_get_from_fd(): switch to fdget_raw()
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
681c5b51dc |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Adjacent changes: net/mptcp/protocol.h |
|
|
|
8bff9a04ca |
cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"
Patch series "memcg: avoid flushing stats atomically where possible", v3. rstat flushing is an expensive operation that scales with the number of cpus and the number of cgroups in the system. The purpose of this series is to minimize the contexts where we flush stats atomically. Patches 1 and 2 are cleanups requested during reviews of prior versions of this series. Patch 3 makes sure we never try to flush from within an irq context. Patches 4 to 7 introduce separate variants of mem_cgroup_flush_stats() for atomic and non-atomic flushing, and make sure we only flush the stats atomically when necessary. Patch 8 is a slightly tangential optimization that limits the work done by rstat flushing in some scenarios. This patch (of 8): cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as "irqs are disabled throughout", which is what the current implementation does (currently under discussion [1]), but is not the intention. The intention is that this function is safe to call from atomic contexts. Name it as such. Link: https://lkml.kernel.org/r/20230330191801.1967435-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20230330191801.1967435-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Averin <vasily.averin@linux.dev> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
7e27cb6ad4 |
cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for top_cpuset
It is found that attaching a task to the top_cpuset does not currently ignore CPUs allocated to subpartitions in cpuset_attach_task(). So the code is changed to fix that. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
eee8785379 |
cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods
In the case of CLONE_INTO_CGROUP, not all cpusets are ready to accept
new tasks. It is too late to check that in cpuset_fork(). So we need
to add the cpuset_can_fork() and cpuset_cancel_fork() methods to
pre-check it before we can allow attachment to a different cpuset.
We also need to set the attach_in_progress flag to alert other code
that a new task is going to be added to the cpuset.
Fixes:
|
|
|
|
42a11bf5c5 |
cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly
By default, the clone(2) syscall spawn a child process into the same cgroup as its parent. With the use of the CLONE_INTO_CGROUP flag introduced by commit |
|
|
|
ba9182a896 |
cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach()
After a successful cpuset_can_attach() call which increments the
attach_in_progress flag, either cpuset_cancel_attach() or cpuset_attach()
will be called later. In cpuset_attach(), tasks in cpuset_attach_wq,
if present, will be woken up at the end. That is not the case in
cpuset_cancel_attach(). So missed wakeup is possible if the attach
operation is somehow cancelled. Fix that by doing the wakeup in
cpuset_cancel_attach() as well.
Fixes:
|
|
|
|
57dcd64c7e |
cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex
syzbot is reporting circular locking dependency between cpu_hotplug_lock and freezer_mutex, for commit |
|
|
|
d82caa2735 |
sched/psi: Allow unprivileged polling of N*2s period
PSI offers 2 mechanisms to get information about a specific resource pressure. One is reading from /proc/pressure/<resource>, which gives average pressures aggregated every 2s. The other is creating a pollable fd for a specific resource and cgroup. The trigger creation requires CAP_SYS_RESOURCE, and gives the possibility to pick specific time window and threshold, spawing an RT thread to aggregate the data. Systemd would like to provide containers the option to monitor pressure on their own cgroup and sub-cgroups. For example, if systemd launches a container that itself then launches services, the container should have the ability to poll() for pressure in individual services. But neither the container nor the services are privileged. This patch implements a mechanism to allow unprivileged users to create pressure triggers. The difference with privileged triggers creation is that unprivileged ones must have a time window that's a multiple of 2s. This is so that we can avoid unrestricted spawning of rt threads, and use instead the same aggregation mechanism done for the averages, which runs independently of any triggers. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20230330105418.77061-5-cerasuolodomenico@gmail.com |
|
|
|
6667439f51 |
cgroup/cpuset: Include offline CPUs when tasks' cpumasks in top_cpuset are updated
Similar to commit
|
|
|
|
df59b72cd8 |
cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset
If a hotplug event doesn't affect the current cpuset, there is no point to call hotplug_update_tasks() or hotplug_update_tasks_legacy(). So just skip it. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
5dcad50636 |
Merge branch 'for-6.3-fixes' into for-6.4
To receive
|
|
|
|
292fd843de |
cgroup/cpuset: Fix partition root's cpuset.cpus update bug
It was found that commit |
|
|
|
8e4645226b |
cpuset: Clean up cpuset_node_allowed
Commit
|
|
|
|
fcdb1eda53 |
cgroup: fix display of forceidle time at root
We need to reset forceidle_sum to 0 when reading from root, since the
bstat we accumulate into is stack allocated.
To make this more robust, just replace the existing cputime reset with a
memset of the overall bstat.
Signed-off-by: Josh Don <joshdon@google.com>
Fixes:
|
|
|
|
4cdb91b0de |
cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappers
Replace mutex_[un]lock() with cgroup_[un]lock() wrappers to stay consistent across cgroup core and other subsystem code, while operating on the cgroup_mutex. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
b8a2e3f93d |
cgroup: Make current_cgns_cgroup_dfl() safe to call after exit_task_namespace()
The commit |
|
|
|
5b7c4cabbb |
Networking changes for 6.3.
Core
----
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used
to describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols
---------
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP
path manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF
---
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key
to better support decap on GRE tunnel devices not operating
in collect metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk
and bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols
by livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter
---------
- Remove the CLUSTERIP target. It has been marked as obsolete
for years, and we still have WARN splats wrt. races of
the out-of-band /proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to
the existing 'delete' commands, but do not return an error if
the referenced object (set, chain, rule...) did not exist.
Driver API
----------
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into multiple
files, drop some of the unnecessarily granular locks and factor out
common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless Extensions
for Wi-Fi 7 devices at all. Everyone should switch to using nl80211
interface instead.
- Improve the CAN bit timing configuration. Use extack to return error
messages directly to user space, update the SJW handling, including
the definition of a new default value that will benefit CAN-FD
controllers, by increasing their oscillator tolerance.
New hardware / drivers
----------------------
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers
-------
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- enetc: support XDP_REDIRECT for XDP non-linear buffers
- enetc: improve reconfig, avoid link flap and waiting for idle
- enetc: support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmP1VIYACgkQMUZtbf5S
IrvsChAApz0rNL/sPKxXTEfxZ1tN7D3sYxYKQPomxvl5BV+MvicrLddJy3KmzEFK
nnJNO3nuRNuH422JQ/ylZ4mGX1opa6+5QJb0UINImXUI7Fm8HHBIuPGkv7d5CheZ
7JexFqjPJXUy9nPyh1Rra+IA9AcRd2U7jeGEZR38wb99bHJQj5Bzdk20WArEB0el
n44aqg49LXH71bSeXRz77x5SjkwVtYiccQxLcnmTbjLU2xVraLvI2J+wAhHnVXWW
9lrU1+V4Ex2Xcd1xR0L0cHeK+meP1TrPRAeF+JDpVI3a/zJiE7cZjfHdG/jH5xWl
leZJqghVozrZQNtewWWO7XhUFhMDgFu3W/1vNLjSHPZEqaz1JpM67J1+ql6s63l4
LMWoXbcYZz+SL9ZRCoPkbGue/5fKSHv8/Jl9Sh58+eTS+c/zgN8uFGRNFXLX1+EP
n8uvt985PxMd6x1+dHumhOUzxnY4Sfi1vjitSunTsNFQ3Cmp4SO0IfBVJWfLUCuC
xz5hbJGJJbSpvUsO+HWyCg83E5OWghRE/Onpt2jsQSZCrO9HDg4FRTEf3WAMgaqc
edb5KfbRZPTJQM08gWdluXzSk1nw3FNP2tXW4XlgUrEbjb+fOk0V9dQg2gyYTxQ1
Nhvn8ZQPi6/GMMELHAIPGmmW1allyOGiAzGlQsv8EmL+OFM6WDI=
=xXhC
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
|
|
|
|
8ca8d89b43 |
cgroup changes for v6.3-rc1
All the commits are trivial. Doc updates and a trivial code cleanup. -----BEGIN PGP SIGNATURE----- iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCY/Kvjw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGfcZAQDoyqEm807R2iaELEfee0xq/NPwJIaDRnXDP6ZL rIEtqAD8CzpvvtKauekMIsCHVq7SuxDae8Hi0a/RImN8e1sy6wY= =pUSo -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "All the changes are trivial: documentation updates and a trivial code cleanup" * tag 'cgroup-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: fix a few kernel-doc warnings & coding style docs: cgroup-v1: use numbered lists for user interface setup docs: cgroup-v1: add internal cross-references docs: cgroup-v1: make swap extension subsections subsections docs: cgroup-v1: use bullet lists for list of stat file tables docs: cgroup-v1: move hierarchy of accounting caption docs: cgroup-v1: fix footnotes docs: cgroup-v1: use code block for locking order schema docs: cgroup-v1: wrap remaining admonitions in admonition blocks docs: cgroup-v1: replace custom note constructs with appropriate admonition blocks cgroup/cpuset: no need to explicitly init a global static variable |
|
|
|
05e6295f7b |
fs.idmapped.v6.3
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5NlQAKCRCRxhvAZXjc
orOaAP9i2h3OJy95nO2Fpde0Bt2UT+oulKCCcGlvXJ8/+TQpyQD/ZQq47gFQ0EAz
Br5NxeyGeecAb0lHpFz+CpLGsxMrMwQ=
=+BG5
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull vfs idmapping updates from Christian Brauner:
- Last cycle we introduced the dedicated struct mnt_idmap type for
mount idmapping and the required infrastucture in
|
|
|
|
de42873367 |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY+bZrwAKCRDbK58LschI gzi4AP4+TYo0jnSwwkrOoN9l4f5VO9X8osmj3CXfHBv7BGWVxAD/WnvA3TDZyaUd agIZTkRs6BHF9He8oROypARZxTeMLwM= =nO1C -----END PGP SIGNATURE----- Daniel Borkmann says: ==================== pull-request: bpf-next 2023-02-11 We've added 96 non-merge commits during the last 14 day(s) which contain a total of 152 files changed, 4884 insertions(+), 962 deletions(-). There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c between commit |
|
|
|
7a2127e66a |
cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task
set_cpus_allowed_ptr() will fail with -EINVAL if the requested
affinity mask is not a subset of the task_cpu_possible_mask() for the
task being updated. Consequently, on a heterogeneous system with cpusets
spanning the different CPU types, updates to the cgroup hierarchy can
silently fail to update task affinities when the effective affinity
mask for the cpuset is expanded.
For example, consider an arm64 system with 4 CPUs, where CPUs 2-3 are
the only cores capable of executing 32-bit tasks. Attaching a 32-bit
task to a cpuset containing CPUs 0-2 will correctly affine the task to
CPU 2. Extending the cpuset to CPUs 0-3, however, will fail to extend
the affinity mask of the 32-bit task because update_tasks_cpumask() will
pass the full 0-3 mask to set_cpus_allowed_ptr().
Extend update_tasks_cpumask() to take a temporary 'cpumask' paramater
and use it to mask the 'effective_cpus' mask with the possible mask for
each task being updated.
Fixes:
|
|
|
|
3fb906e7fa |
cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks
Since commit |
|
|
|
400031e05a |
bpf: Add __bpf_kfunc tag to all kfuncs
Now that we have the __bpf_kfunc tag, we should use add it to all existing kfuncs to ensure that they'll never be elided in LTO builds. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230201173016.342758-4-void@manifault.com |
|
|
|
e5ae880384 |
cgroup/cpuset: Fix wrong check in update_parent_subparts_cpumask()
It was found that the check to see if a partition could use up all
the cpus from the parent cpuset in update_parent_subparts_cpumask()
was incorrect. As a result, it is possible to leave parent with no
effective cpu left even if there are tasks in the parent cpuset. This
can lead to system panic as reported in [1].
Fix this probem by updating the check to fail the enabling the partition
if parent's effective_cpus is a subset of the child's cpus_allowed.
Also record the error code when an error happens in update_prstate()
and add a test case where parent partition and child have the same cpu
list and parent has task. Enabling partition in the child will fail in
this case.
[1] https://www.spinics.net/lists/cgroups/msg36254.html
Fixes:
|
|
|
|
4609e1f18e
|
fs: port ->permission() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
|
|
|
32a47817d0 |
cgroup/cpuset: fix a few kernel-doc warnings & coding style
Fix kernel-doc notation warnings: kernel/cgroup/cpuset.c:1309: warning: Excess function parameter 'cpuset' description in 'update_parent_subparts_cpumask' kernel/cgroup/cpuset.c:3909: warning: expecting prototype for cpuset_mem_spread_node(). Prototype was for cpuset_spread_node() instead Also drop a blank line before EXPORT_SYMBOL_GPL() to be consistent with kernel coding style. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Waiman Long <longman@redhat.com> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: cgroups@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
21786e5cb3 |
cgroup/cpuset: no need to explicitly init a global static variable
cpuset_rwsem is a static variable defined with DEFINE_STATIC_PERCPU_RWSEM(). It's initialized at build time and so there's no need for explicit runtime init leaking one percpu int. Signed-off-by: Daniel Vacek <neelx@redhat.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Acked-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
e2ca6ba6ba |
MM patches for 6.2-rc1.
- More userfaultfs work from Peter Xu. - Several convert-to-folios series from Sidhartha Kumar and Huang Ying. - Some filemap cleanups from Vishal Moola. - David Hildenbrand added the ability to selftest anon memory COW handling. - Some cpuset simplifications from Liu Shixin. - Addition of vmalloc tracing support by Uladzislau Rezki. - Some pagecache folioifications and simplifications from Matthew Wilcox. - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it. - Miguel Ojeda contributed some cleanups for our use of the __no_sanitize_thread__ gcc keyword. This series shold have been in the non-MM tree, my bad. - Naoya Horiguchi improved the interaction between memory poisoning and memory section removal for huge pages. - DAMON cleanups and tuneups from SeongJae Park - Tony Luck fixed the handling of COW faults against poisoned pages. - Peter Xu utilized the PTE marker code for handling swapin errors. - Hugh Dickins reworked compound page mapcount handling, simplifying it and making it more efficient. - Removal of the autonuma savedwrite infrastructure from Nadav Amit and David Hildenbrand. - zram support for multiple compression streams from Sergey Senozhatsky. - David Hildenbrand reworked the GUP code's R/O long-term pinning so that drivers no longer need to use the FOLL_FORCE workaround which didn't work very well anyway. - Mel Gorman altered the page allocator so that local IRQs can remnain enabled during per-cpu page allocations. - Vishal Moola removed the try_to_release_page() wrapper. - Stefan Roesch added some per-BDI sysfs tunables which are used to prevent network block devices from dirtying excessive amounts of pagecache. - David Hildenbrand did some cleanup and repair work on KSM COW breaking. - Nhat Pham and Johannes Weiner have implemented writeback in zswap's zsmalloc backend. - Brian Foster has fixed a longstanding corner-case oddity in file[map]_write_and_wait_range(). - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen. - Shiyang Ruan has done some work on fsdax, to make its reflink mode work better under xfstests. Better, but still not perfect. - Christoph Hellwig has removed the .writepage() method from several filesystems. They only need .writepages(). - Yosry Ahmed wrote a series which fixes the memcg reclaim target beancounting. - David Hildenbrand has fixed some of our MM selftests for 32-bit machines. - Many singleton patches, as usual. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5j6ZwAKCRDdBJ7gKXxA jkDYAP9qNeVqp9iuHjZNTqzMXkfmJPsw2kmy2P+VdzYVuQRcJgEAgoV9d7oMq4ml CodAgiA51qwzId3GRytIo/tfWZSezgA= =d19R -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - More userfaultfs work from Peter Xu - Several convert-to-folios series from Sidhartha Kumar and Huang Ying - Some filemap cleanups from Vishal Moola - David Hildenbrand added the ability to selftest anon memory COW handling - Some cpuset simplifications from Liu Shixin - Addition of vmalloc tracing support by Uladzislau Rezki - Some pagecache folioifications and simplifications from Matthew Wilcox - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it - Miguel Ojeda contributed some cleanups for our use of the __no_sanitize_thread__ gcc keyword. This series should have been in the non-MM tree, my bad - Naoya Horiguchi improved the interaction between memory poisoning and memory section removal for huge pages - DAMON cleanups and tuneups from SeongJae Park - Tony Luck fixed the handling of COW faults against poisoned pages - Peter Xu utilized the PTE marker code for handling swapin errors - Hugh Dickins reworked compound page mapcount handling, simplifying it and making it more efficient - Removal of the autonuma savedwrite infrastructure from Nadav Amit and David Hildenbrand - zram support for multiple compression streams from Sergey Senozhatsky - David Hildenbrand reworked the GUP code's R/O long-term pinning so that drivers no longer need to use the FOLL_FORCE workaround which didn't work very well anyway - Mel Gorman altered the page allocator so that local IRQs can remnain enabled during per-cpu page allocations - Vishal Moola removed the try_to_release_page() wrapper - Stefan Roesch added some per-BDI sysfs tunables which are used to prevent network block devices from dirtying excessive amounts of pagecache - David Hildenbrand did some cleanup and repair work on KSM COW breaking - Nhat Pham and Johannes Weiner have implemented writeback in zswap's zsmalloc backend - Brian Foster has fixed a longstanding corner-case oddity in file[map]_write_and_wait_range() - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen - Shiyang Ruan has done some work on fsdax, to make its reflink mode work better under xfstests. Better, but still not perfect - Christoph Hellwig has removed the .writepage() method from several filesystems. They only need .writepages() - Yosry Ahmed wrote a series which fixes the memcg reclaim target beancounting - David Hildenbrand has fixed some of our MM selftests for 32-bit machines - Many singleton patches, as usual * tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits) mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio mm: mmu_gather: allow more than one batch of delayed rmaps mm: fix typo in struct pglist_data code comment kmsan: fix memcpy tests mm: add cond_resched() in swapin_walk_pmd_entry() mm: do not show fs mm pc for VM_LOCKONFAULT pages selftests/vm: ksm_functional_tests: fixes for 32bit selftests/vm: cow: fix compile warning on 32bit selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem mm,thp,rmap: fix races between updates of subpages_mapcount mm: memcg: fix swapcached stat accounting mm: add nodes= arg to memory.reclaim mm: disable top-tier fallback to reclaim on proactive reclaim selftests: cgroup: make sure reclaim target memcg is unprotected selftests: cgroup: refactor proactive reclaim code to reclaim_until() mm: memcg: fix stale protection of reclaim target memcg mm/mmap: properly unaccount memory on mas_preallocate() failure omfs: remove ->writepage jfs: remove ->writepage ... |
|
|
|
7e68dd7d07 |
Networking changes for 6.2.
Core
----
- Allow live renaming when an interface is up
- Add retpoline wrappers for tc, improving considerably the
performances of complex queue discipline configurations.
- Add inet drop monitor support.
- A few GRO performance improvements.
- Add infrastructure for atomic dev stats, addressing long standing
data races.
- De-duplicate common code between OVS and conntrack offloading
infrastructure.
- A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements.
- Netfilter: introduce packet parser for tunneled packets
- Replace IPVS timer-based estimators with kthreads to scale up
the workload with the number of available CPUs.
- Add the helper support for connection-tracking OVS offload.
BPF
---
- Support for user defined BPF objects: the use case is to allocate
own objects, build own object hierarchies and use the building
blocks to build own data structures flexibly, for example, linked
lists in BPF.
- Make cgroup local storage available to non-cgroup attached BPF
programs.
- Avoid unnecessary deadlock detection and failures wrt BPF task
storage helpers.
- A relevant bunch of BPF verifier fixes and improvements.
- Veristat tool improvements to support custom filtering, sorting,
and replay of results.
- Add LLVM disassembler as default library for dumping JITed code.
- Lots of new BPF documentation for various BPF maps.
- Add bpf_rcu_read_{,un}lock() support for sleepable programs.
- Add RCU grace period chaining to BPF to wait for the completion
of access from both sleepable and non-sleepable BPF programs.
- Add support storing struct task_struct objects as kptrs in maps.
- Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
values.
- Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions.
Protocols
---------
- TCP: implement Protective Load Balancing across switch links.
- TCP: allow dynamically disabling TCP-MD5 static key, reverting
back to fast[er]-path.
- UDP: Introduce optional per-netns hash lookup table.
- IPv6: simplify and cleanup sockets disposal.
- Netlink: support different type policies for each generic
netlink operation.
- MPTCP: add MSG_FASTOPEN and FastOpen listener side support.
- MPTCP: add netlink notification support for listener sockets
events.
- SCTP: add VRF support, allowing sctp sockets binding to VRF
devices.
- Add bridging MAC Authentication Bypass (MAB) support.
- Extensions for Ethernet VPN bridging implementation to better
support multicast scenarios.
- More work for Wi-Fi 7 support, comprising conversion of all
the existing drivers to internal TX queue usage.
- IPSec: introduce a new offload type (packet offload) allowing
complete header processing and crypto offloading.
- IPSec: extended ack support for more descriptive XFRM error
reporting.
- RXRPC: increase SACK table size and move processing into a
per-local endpoint kernel thread, reducing considerably the
required locking.
- IEEE 802154: synchronous send frame and extended filtering
support, initial support for scanning available 15.4 networks.
- Tun: bump the link speed from 10Mbps to 10Gbps.
- Tun/VirtioNet: implement UDP segmentation offload support.
Driver API
----------
- PHY/SFP: improve power level switching between standard
level 1 and the higher power levels.
- New API for netdev <-> devlink_port linkage.
- PTP: convert existing drivers to new frequency adjustment
implementation.
- DSA: add support for rx offloading.
- Autoload DSA tagging driver when dynamically changing protocol.
- Add new PCP and APPTRUST attributes to Data Center Bridging.
- Add configuration support for 800Gbps link speed.
- Add devlink port function attribute to enable/disable RoCE and
migratable.
- Extend devlink-rate to support strict prioriry and weighted fair
queuing.
- Add devlink support to directly reading from region memory.
- New device tree helper to fetch MAC address from nvmem.
- New big TCP helper to simplify temporary header stripping.
New hardware / drivers
----------------------
- Ethernet:
- Marvel Octeon CNF95N and CN10KB Ethernet Switches.
- Marvel Prestera AC5X Ethernet Switch.
- WangXun 10 Gigabit NIC.
- Motorcomm yt8521 Gigabit Ethernet.
- Microchip ksz9563 Gigabit Ethernet Switch.
- Microsoft Azure Network Adapter.
- Linux Automation 10Base-T1L adapter.
- PHY:
- Aquantia AQR112 and AQR412.
- Motorcomm YT8531S.
- PTP:
- Orolia ART-CARD.
- WiFi:
- MediaTek Wi-Fi 7 (802.11be) devices.
- RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
devices.
- Bluetooth:
- Broadcom BCM4377/4378/4387 Bluetooth chipsets.
- Realtek RTL8852BE and RTL8723DS.
- Cypress.CYW4373A0 WiFi + Bluetooth combo device.
Drivers
-------
- CAN:
- gs_usb: bus error reporting support.
- kvaser_usb: listen only and bus error reporting support.
- Ethernet NICs:
- Intel (100G):
- extend action skbedit to RX queue mapping.
- implement devlink-rate support.
- support direct read from memory.
- nVidia/Mellanox (mlx5):
- SW steering improvements, increasing rules update rate.
- Support for enhanced events compression.
- extend H/W offload packet manipulation capabilities.
- implement IPSec packet offload mode.
- nVidia/Mellanox (mlx4):
- better big TCP support.
- Netronome Ethernet NICs (nfp):
- IPsec offload support.
- add support for multicast filter.
- Broadcom:
- RSS and PTP support improvements.
- AMD/SolarFlare:
- netlink extened ack improvements.
- add basic flower matches to offload, and related stats.
- Virtual NICs:
- ibmvnic: introduce affinity hint support.
- small / embedded:
- FreeScale fec: add initial XDP support.
- Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood.
- TI am65-cpsw: add suspend/resume support.
- Mediatek MT7986: add RX wireless wthernet dispatch support.
- Realtek 8169: enable GRO software interrupt coalescing per
default.
- Ethernet high-speed switches:
- Microchip (sparx5):
- add support for Sparx5 TC/flower H/W offload via VCAP.
- Mellanox mlxsw:
- add 802.1X and MAC Authentication Bypass offload support.
- add ip6gre support.
- Embedded Ethernet switches:
- Mediatek (mtk_eth_soc):
- improve PCS implementation, add DSA untag support.
- enable flow offload support.
- Renesas:
- add rswitch R-Car Gen4 gPTP support.
- Microchip (lan966x):
- add full XDP support.
- add TC H/W offload via VCAP.
- enable PTP on bridge interfaces.
- Microchip (ksz8):
- add MTU support for KSZ8 series.
- Qualcomm 802.11ax WiFi (ath11k):
- support configuring channel dwell time during scan.
- MediaTek WiFi (mt76):
- enable Wireless Ethernet Dispatch (WED) offload support.
- add ack signal support.
- enable coredump support.
- remain_on_channel support.
- Intel WiFi (iwlwifi):
- enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities.
- 320 MHz channels support.
- RealTek WiFi (rtw89):
- new dynamic header firmware format support.
- wake-over-WLAN support.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmOYXUcSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOk8zQP/R7BZtbJMTPiWkRnSoKHnAyupDVwrz5U
ktukLkwPsCyJuEbAjgxrxf4EEEQ9uq2FFlxNSYuKiiQMqIpFxV6KED7LCUygn4Tc
kxtkp0Q+5XiqisWlQmtfExf2OjuuPqcjV9tWCDBI6GebKUbfNwY/eI44RcMu4BSv
DzIlW5GkX/kZAPqnnuqaLsN3FudDTJHGEAD7NbA++7wJ076RWYSLXlFv0Z+SCSPS
H8/PEG0/ZK/65rIWMAFRClJ9BNIDwGVgp0GrsIvs1gqbRUOlA1hl1rDM21TqtNFf
5QPQT7sIfTcCE/nerxKJD5JE3JyP+XRlRn96PaRw3rt4MgI6I/EOj/HOKQ5tMCNc
oPiqb7N70+hkLZyr42qX+vN9eDPjp2koEQm7EO2Zs+/534/zWDs24Zfk/Aa1ps0I
Fa82oGjAgkBhGe/FZ6i5cYoLcyxqRqZV1Ws9XQMl72qRC7/BwvNbIW6beLpCRyeM
yYIU+0e9dEm+wHQEdh2niJuVtR63hy8tvmPx56lyh+6u0+pondkwbfSiC5aD3kAC
ikKsN5DyEsdXyiBAlytCEBxnaOjQy4RAz+3YXSiS0eBNacXp03UUrNGx4Pzpu/D0
QLFJhBnMFFCgy5to8/DvKnrTPgZdSURwqbIUcZdvU21f1HLR8tUTpaQnYffc/Whm
V8gnt1EL+0cc
=CbJC
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Allow live renaming when an interface is up
- Add retpoline wrappers for tc, improving considerably the
performances of complex queue discipline configurations
- Add inet drop monitor support
- A few GRO performance improvements
- Add infrastructure for atomic dev stats, addressing long standing
data races
- De-duplicate common code between OVS and conntrack offloading
infrastructure
- A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements
- Netfilter: introduce packet parser for tunneled packets
- Replace IPVS timer-based estimators with kthreads to scale up the
workload with the number of available CPUs
- Add the helper support for connection-tracking OVS offload
BPF:
- Support for user defined BPF objects: the use case is to allocate
own objects, build own object hierarchies and use the building
blocks to build own data structures flexibly, for example, linked
lists in BPF
- Make cgroup local storage available to non-cgroup attached BPF
programs
- Avoid unnecessary deadlock detection and failures wrt BPF task
storage helpers
- A relevant bunch of BPF verifier fixes and improvements
- Veristat tool improvements to support custom filtering, sorting,
and replay of results
- Add LLVM disassembler as default library for dumping JITed code
- Lots of new BPF documentation for various BPF maps
- Add bpf_rcu_read_{,un}lock() support for sleepable programs
- Add RCU grace period chaining to BPF to wait for the completion of
access from both sleepable and non-sleepable BPF programs
- Add support storing struct task_struct objects as kptrs in maps
- Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
values
- Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions
Protocols:
- TCP: implement Protective Load Balancing across switch links
- TCP: allow dynamically disabling TCP-MD5 static key, reverting back
to fast[er]-path
- UDP: Introduce optional per-netns hash lookup table
- IPv6: simplify and cleanup sockets disposal
- Netlink: support different type policies for each generic netlink
operation
- MPTCP: add MSG_FASTOPEN and FastOpen listener side support
- MPTCP: add netlink notification support for listener sockets events
- SCTP: add VRF support, allowing sctp sockets binding to VRF devices
- Add bridging MAC Authentication Bypass (MAB) support
- Extensions for Ethernet VPN bridging implementation to better
support multicast scenarios
- More work for Wi-Fi 7 support, comprising conversion of all the
existing drivers to internal TX queue usage
- IPSec: introduce a new offload type (packet offload) allowing
complete header processing and crypto offloading
- IPSec: extended ack support for more descriptive XFRM error
reporting
- RXRPC: increase SACK table size and move processing into a
per-local endpoint kernel thread, reducing considerably the
required locking
- IEEE 802154: synchronous send frame and extended filtering support,
initial support for scanning available 15.4 networks
- Tun: bump the link speed from 10Mbps to 10Gbps
- Tun/VirtioNet: implement UDP segmentation offload support
Driver API:
- PHY/SFP: improve power level switching between standard level 1 and
the higher power levels
- New API for netdev <-> devlink_port linkage
- PTP: convert existing drivers to new frequency adjustment
implementation
- DSA: add support for rx offloading
- Autoload DSA tagging driver when dynamically changing protocol
- Add new PCP and APPTRUST attributes to Data Center Bridging
- Add configuration support for 800Gbps link speed
- Add devlink port function attribute to enable/disable RoCE and
migratable
- Extend devlink-rate to support strict prioriry and weighted fair
queuing
- Add devlink support to directly reading from region memory
- New device tree helper to fetch MAC address from nvmem
- New big TCP helper to simplify temporary header stripping
New hardware / drivers:
- Ethernet:
- Marvel Octeon CNF95N and CN10KB Ethernet Switches
- Marvel Prestera AC5X Ethernet Switch
- WangXun 10 Gigabit NIC
- Motorcomm yt8521 Gigabit Ethernet
- Microchip ksz9563 Gigabit Ethernet Switch
- Microsoft Azure Network Adapter
- Linux Automation 10Base-T1L adapter
- PHY:
- Aquantia AQR112 and AQR412
- Motorcomm YT8531S
- PTP:
- Orolia ART-CARD
- WiFi:
- MediaTek Wi-Fi 7 (802.11be) devices
- RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
devices
- Bluetooth:
- Broadcom BCM4377/4378/4387 Bluetooth chipsets
- Realtek RTL8852BE and RTL8723DS
- Cypress.CYW4373A0 WiFi + Bluetooth combo device
Drivers:
- CAN:
- gs_usb: bus error reporting support
- kvaser_usb: listen only and bus error reporting support
- Ethernet NICs:
- Intel (100G):
- extend action skbedit to RX queue mapping
- implement devlink-rate support
- support direct read from memory
- nVidia/Mellanox (mlx5):
- SW steering improvements, increasing rules update rate
- Support for enhanced events compression
- extend H/W offload packet manipulation capabilities
- implement IPSec packet offload mode
- nVidia/Mellanox (mlx4):
- better big TCP support
- Netronome Ethernet NICs (nfp):
- IPsec offload support
- add support for multicast filter
- Broadcom:
- RSS and PTP support improvements
- AMD/SolarFlare:
- netlink extened ack improvements
- add basic flower matches to offload, and related stats
- Virtual NICs:
- ibmvnic: introduce affinity hint support
- small / embedded:
- FreeScale fec: add initial XDP support
- Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood
- TI am65-cpsw: add suspend/resume support
- Mediatek MT7986: add RX wireless wthernet dispatch support
- Realtek 8169: enable GRO software interrupt coalescing per
default
- Ethernet high-speed switches:
- Microchip (sparx5):
- add support for Sparx5 TC/flower H/W offload via VCAP
- Mellanox mlxsw:
- add 802.1X and MAC Authentication Bypass offload support
- add ip6gre support
- Embedded Ethernet switches:
- Mediatek (mtk_eth_soc):
- improve PCS implementation, add DSA untag support
- enable flow offload support
- Renesas:
- add rswitch R-Car Gen4 gPTP support
- Microchip (lan966x):
- add full XDP support
- add TC H/W offload via VCAP
- enable PTP on bridge interfaces
- Microchip (ksz8):
- add MTU support for KSZ8 series
- Qualcomm 802.11ax WiFi (ath11k):
- support configuring channel dwell time during scan
- MediaTek WiFi (mt76):
- enable Wireless Ethernet Dispatch (WED) offload support
- add ack signal support
- enable coredump support
- remain_on_channel support
- Intel WiFi (iwlwifi):
- enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
- 320 MHz channels support
- RealTek WiFi (rtw89):
- new dynamic header firmware format support
- wake-over-WLAN support"
* tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits)
ipvs: fix type warning in do_div() on 32 bit
net: lan966x: Remove a useless test in lan966x_ptp_add_trap()
net: ipa: add IPA v4.7 support
dt-bindings: net: qcom,ipa: Add SM6350 compatible
bnxt: Use generic HBH removal helper in tx path
IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver
selftests: forwarding: Add bridge MDB test
selftests: forwarding: Rename bridge_mdb test
bridge: mcast: Support replacement of MDB port group entries
bridge: mcast: Allow user space to specify MDB entry routing protocol
bridge: mcast: Allow user space to add (*, G) with a source list and filter mode
bridge: mcast: Add support for (*, G) with a source list and filter mode
bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source
bridge: mcast: Add a flag for user installed source entries
bridge: mcast: Expose __br_multicast_del_group_src()
bridge: mcast: Expose br_multicast_new_group_src()
bridge: mcast: Add a centralized error path
bridge: mcast: Place netlink policy before validation functions
bridge: mcast: Split (*, G) and (S, G) addition into different functions
bridge: mcast: Do not derive entry type from its filter mode
...
|
|
|
|
a312a8cc3c |
cgroup changes for v6.2-rc1
Nothing too interesting.
* Add CONFIG_DEBUG_GROUP_REF which makes cgroup refcnt operations kprobable.
* A couple cpuset optimizations.
* Other misc changes including doc and test updates.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCY5bHvg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGcYrAQCfrlzrbWw6gTQ7fmr0Avxjy5FxLjsdzEGPcmGY
ByEMhgD/VdUf3zI/Khr91Gsi5JXQxQf7a5caD369xupRWUWjqA8=
=Nf+E
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"Nothing too interesting:
- Add CONFIG_DEBUG_GROUP_REF which makes cgroup refcnt operations
kprobable
- A couple cpuset optimizations
- Other misc changes including doc and test updates"
* tag 'cgroup-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: remove rcu_read_lock()/rcu_read_unlock() in critical section of spin_lock_irq()
cgroup/cpuset: Improve cpuset_css_alloc() description
kselftest/cgroup: Add cleanup() to test_cpuset_prs.sh
cgroup/cpuset: Optimize cpuset_attach() on v2
cgroup/cpuset: Skip spread flags update on v2
kselftest/cgroup: Fix gathering number of CPUs
cgroup: cgroup refcnt functions should be exported when CONFIG_DEBUG_CGROUP_REF
cgroup: Implement DEBUG_CGROUP_REF
|
|
|
|
3b91010500 | Merge branch 'mm-hotfixes-stable' into mm-stable | |
|
|
4a7ba45b1a |
memcg: fix possible use-after-free in memcg_write_event_control()
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to |
|
|
|
837e8ac871 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
fbf8321238 |
memcg: Fix possible use-after-free in memcg_write_event_control()
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to |
|
|
|
674b745e22 |
cgroup: remove rcu_read_lock()/rcu_read_unlock() in critical section of spin_lock_irq()
spin_lock_irq() already disable preempt, so remove rcu_read_lock(). Signed-off-by: Ran Tian <tianran_trtr@163.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
0a2cafe6c7 |
cgroup/cpuset: Improve cpuset_css_alloc() description
Change the function argument in the description of cpuset_css_alloc()
from 'struct cgroup' -> 'struct cgroup_subsys_state'. The change to the
argument type was introduced by commit
|
|
|
|
7fd4da9c15 |
cgroup/cpuset: Optimize cpuset_attach() on v2
It was found that with the default hierarchy, enabling cpuset in the child cgroups can trigger a cpuset_attach() call in each of the child cgroups that have tasks with no change in effective cpus and mems. If there are many processes in those child cgroups, it will burn quite a lot of cpu cycles iterating all the tasks without doing useful work. Optimizing this case by comparing between the old and new cpusets and skip useless update if there is no change in effective cpus and mems. Also mems_allowed are less likely to be changed than cpus_allowed. So skip changing mm if there is no change in effective_mems and CS_MEMORY_MIGRATE is not set. By inserting some instrumentation code and running a simple command in a container 200 times in a cgroup v2 system, it was found that all the cpuset_attach() calls are skipped (401 times in total) as there was no change in effective cpus and mems. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
18f9a4d475 |
cgroup/cpuset: Skip spread flags update on v2
Cpuset v2 has no spread flags to set. So we can skip spread flags update if cpuset v2 is being used. Also change the name to cpuset_update_task_spread_flags() to indicate that there are multiple spread flags. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
1eeaa4fd39 |
memory: move hotplug memory notifier priority to same file for easy sorting
The priority of hotplug memory callback is defined in a different file. And there are some callers using numbers directly. Collect them together into include/linux/memory.h for easy reading. This allows us to sort their priorities more intuitively without additional comments. Link: https://lkml.kernel.org/r/20220923033347.3935160-9-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: zefan li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
f9e60becee |
cgroup/cpuset: use hotplug_memory_notifier() directly
Patch series "mm: Use hotplug_memory_notifier() instead of register_hotmemory_notifier()", v4. Commit |
|
|
|
b54a0d4094 |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY2GuKgAKCRDbK58LschI gy32AP9PI0e/bUGDExKJ8g97PeeEtnpj4TTI6g+XKILtYnyXlgD/Rk4j2D/f3IBF Ha9TmqYvAUim+U/g50vUrNuoNLNJ5w8= =OKC1 -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== bpf-next 2022-11-02 We've added 70 non-merge commits during the last 14 day(s) which contain a total of 96 files changed, 3203 insertions(+), 640 deletions(-). The main changes are: 1) Make cgroup local storage available to non-cgroup attached BPF programs such as tc BPF ones, from Yonghong Song. 2) Avoid unnecessary deadlock detection and failures wrt BPF task storage helpers, from Martin KaFai Lau. 3) Add LLVM disassembler as default library for dumping JITed code in bpftool, from Quentin Monnet. 4) Various kprobe_multi_link fixes related to kernel modules, from Jiri Olsa. 5) Optimize x86-64 JIT with emitting BMI2-based shift instructions, from Jie Meng. 6) Improve BPF verifier's memory type compatibility for map key/value arguments, from Dave Marchevsky. 7) Only create mmap-able data section maps in libbpf when data is exposed via skeletons, from Andrii Nakryiko. 8) Add an autoattach option for bpftool to load all object assets, from Wang Yufen. 9) Various memory handling fixes for libbpf and BPF selftests, from Xu Kuohai. 10) Initial support for BPF selftest's vmtest.sh on arm64, from Manu Bretelle. 11) Improve libbpf's BTF handling to dedup identical structs, from Alan Maguire. 12) Add BPF CI and denylist documentation for BPF selftests, from Daniel Müller. 13) Check BPF cpumap max_entries before doing allocation work, from Florian Lehner. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (70 commits) samples/bpf: Fix typo in README bpf: Remove the obsolte u64_stats_fetch_*_irq() users. bpf: check max_entries before allocating memory bpf: Fix a typo in comment for DFS algorithm bpftool: Fix spelling mistake "disasembler" -> "disassembler" selftests/bpf: Fix bpftool synctypes checking failure selftests/bpf: Panic on hard/soft lockup docs/bpf: Add documentation for new cgroup local storage selftests/bpf: Add test cgrp_local_storage to DENYLIST.s390x selftests/bpf: Add selftests for new cgroup local storage selftests/bpf: Fix test test_libbpf_str/bpf_map_type_str bpftool: Support new cgroup local storage libbpf: Support new cgroup local storage bpf: Implement cgroup storage available to non-cgroup-attached bpf progs bpf: Refactor some inode/task/sk storage functions for reuse bpf: Make struct cgroup btf id global selftests/bpf: Tracing prog can still do lookup under busy lock selftests/bpf: Ensure no task storage failure for bpf_lsm.s prog due to deadlock detection bpf: Add new bpf_task_storage_delete proto with no deadlock detection bpf: bpf_task_storage_delete_recur does lookup first before the deadlock check ... ==================== Link: https://lore.kernel.org/r/20221102062120.5724-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
79a7f41f7f |
cgroup: cgroup refcnt functions should be exported when CONFIG_DEBUG_CGROUP_REF
|
|
|
|
6ab428604f |
cgroup: Implement DEBUG_CGROUP_REF
It's really difficult to debug when cgroup or css refs leak. Let's add a debug option to force the refcnt function to not be inlined so that they can be kprobed for debugging. Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
c4bcfb38a9 |
bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
Similar to sk/inode/task storage, implement similar cgroup local storage.
There already exists a local storage implementation for cgroup-attached
bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
bpf_get_local_storage(). But there are use cases such that non-cgroup
attached bpf progs wants to access cgroup local storage data. For example,
tc egress prog has access to sk and cgroup. It is possible to use
sk local storage to emulate cgroup local storage by storing data in socket.
But this is a waste as it could be lots of sockets belonging to a particular
cgroup. Alternatively, a separate map can be created with cgroup id as the key.
But this will introduce additional overhead to manipulate the new map.
A cgroup local storage, similar to existing sk/inode/task storage,
should help for this use case.
The life-cycle of storage is managed with the life-cycle of the
cgroup struct. i.e. the storage is destroyed along with the owning cgroup
with a call to bpf_cgrp_storage_free() when cgroup itself
is deleted.
The userspace map operations can be done by using a cgroup fd as a key
passed to the lookup, update and delete operations.
Typically, the following code is used to get the current cgroup:
struct task_struct *task = bpf_get_current_task_btf();
... task->cgroups->dfl_cgrp ...
and in structure task_struct definition:
struct task_struct {
....
struct css_set __rcu *cgroups;
....
}
With sleepable program, accessing task->cgroups is not protected by rcu_read_lock.
So the current implementation only supports non-sleepable program and supporting
sleepable program will be the next step together with adding rcu_read_lock
protection for rcu tagged structures.
Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used
for cgroup storage available to non-cgroup-attached bpf programs. The old
cgroup storage supports bpf_get_local_storage() helper to get the cgroup data.
The new cgroup storage helper bpf_cgrp_storage_get() can provide similar
functionality. While old cgroup storage pre-allocates storage memory, the new
mechanism can also pre-allocate with a user space bpf_map_update_elem() call
to avoid potential run-time memory allocation failure.
Therefore, the new cgroup storage can provide all functionality w.r.t.
the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is alias to
BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate the old cgroup storage can
be deprecated since the new one can provide the same functionality.
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221026042850.673791-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
bb1a114646 |
cgroup fixes for v6.1-rc1
* Fix a recent regression where a sleeping kernfs function is called with css_set_lock (spinlock) held. * Revert the commit to enable cgroup1 support for cgroup_get_from_fd/file(). Multiple users assume that the lookup only works for cgroup2 and breaks when fed a cgroup1 file. Instead, introduce a separate set of functions to lookup both v1 and v2 and use them where the user explicitly wants to support both versions. * Compat update for tools/perf/util/bpf_skel/bperf_cgroup.bpf.c. * Add Josef Bacik as a blkcg maintainer. -----BEGIN PGP SIGNATURE----- iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCY03MlA4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGTkUAQD7fNcSPuc2m/BvW+gySKQkp9PZMA6E6yOIqirc QKmIgwEAwWECW7RR1alhOGD50RtYkuYVsLD1+6Ka4eMHe+EhwA4= =kGLI -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix a recent regression where a sleeping kernfs function is called with css_set_lock (spinlock) held - Revert the commit to enable cgroup1 support for cgroup_get_from_fd/file() Multiple users assume that the lookup only works for cgroup2 and breaks when fed a cgroup1 file. Instead, introduce a separate set of functions to lookup both v1 and v2 and use them where the user explicitly wants to support both versions. - Compat update for tools/perf/util/bpf_skel/bperf_cgroup.bpf.c. - Add Josef Bacik as a blkcg maintainer. * tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: blkcg: Update MAINTAINERS entry mm: cgroup: fix comments for get from fd/file helpers perf stat: Support old kernels for bperf cgroup counting bpf: cgroup_iter: support cgroup1 using cgroup fd cgroup: add cgroup_v1v2_get_from_[fd/file]() Revert "cgroup: enable cgroup_get_from_file() on cgroup1" cgroup: Reorganize css_set_lock and kernfs path processing |
|
|
|
bd9a3dba18 |
PSI updates for v6.1:
- Various performance optimizations, resulting in a 4%-9% speedup
in the mmtests/config-scheduler-perfpipe micro-benchmark.
- New interface to turn PSI on/off on a per cgroup level.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmNJKPsRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iPmg//aovCitAQX2lLoHJDIgdQibU40oaEpKTX
wM549EGz3Dr6qmwF8+qT1U2Ge6af/hHQc5G/ZqDpKbuTjUIc3RmBkqX80dNKFLuH
uyi9UtfsSriw+ks8fWuDdjr+S4oppwW9ZoIXvK8v4bisd3F31DNGvKPTayNxt73m
lExfzJiD1oJixDxGX8MGO9QpcoywmjWjzjrB2P+J8hnTpArouHx/HOKdQOpG6wXq
ZRr9kZvju6ucDpXCTa1HJrfVRxNAh35tx/b4cDtXbBFifVAeKaPOrHapMTVsqfel
Z7T+2DymhidNYK0hrRJoGUwa/vkz+2Sm1ZLG9LlgUCXVco/9S1zw1ZuQakVvzPen
wriuxRaAkR+szCP0L8js5+/DAkGa43MjKsvQHmDVnetQtlsAD4eYnn+alQ837SXv
MP3jwFqF+e4mcWdoQcfh0OWUgGec5XZzdgRYrFkBKyTWGLB2iPivcAMNf0X/h82Q
xxv4DQJIIJ017GOQ/ho2saq+GbtFCvX8YnGYas9T47Bjjluhjo7jgTVtPTo+mhtN
RfwMdG718Ap/gvnAX7wMe/t+L/4AP8AIgDRi5L35dTRqETwOjH+LAvOYjleQFYgu
kMVtLMyzU+TGwHscuzPFRh7TnvSJ4sD48Ll1BPnyZsh3SS9u0gAs1bml7Cu7JbmW
SIZD/S/hzdI=
=91tB
-----END PGP SIGNATURE-----
Merge tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull PSI updates from Ingo Molnar:
- Various performance optimizations, resulting in a 4%-9% speedup in
the mmtests/config-scheduler-perfpipe micro-benchmark.
- New interface to turn PSI on/off on a per cgroup level.
* tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
sched/psi: Cache parent psi_group to speed up group iteration
sched/psi: Consolidate cgroup_psi()
sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
sched/psi: Remove NR_ONCPU task accounting
sched/psi: Optimize task switch inside shared cgroups again
sched/psi: Move private helpers to sched/stats.h
sched/psi: Save percpu memory when !psi_cgroups_enabled
sched/psi: Don't create cgroup PSI files when psi_disabled
sched/psi: Fix periodic aggregation shut off
|
|
|
|
b675d4bdfe |
mm: cgroup: fix comments for get from fd/file helpers
Fix the documentation comments for cgroup_[v1v2_]get_from_[fd/file](). Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
a6d1ce5951 |
cgroup: add cgroup_v1v2_get_from_[fd/file]()
Add cgroup_v1v2_get_from_fd() and cgroup_v1v2_get_from_file() that support both cgroup1 and cgroup2. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
27bc50fc90 |
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam R. Howlett. An overlapping range-based tree for vmas. It it apparently slight more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com). This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU= =xfWx -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ... |
|
|
|
03db771615 |
Revert "cgroup: enable cgroup_get_from_file() on cgroup1"
This reverts commit
|
|
|
|
46307fd6e2 |
cgroup: Reorganize css_set_lock and kernfs path processing
The commit |
|
|
|
adf4bfc4a9 |
cgroup changes for v6.1-rc1.
* cpuset now support isolated cpus.partition type, which will enable dynamic
CPU isolation.
* pids.peak added to remember the max number of pids used.
* Holes in cgroup namespace plugged.
* Internal cleanups.
Note that for-6.1-fixes was pulled into for-6.1 twice. Both were for
follow-up cleanups and each merge commit has details.
Also,
|
|
|
|
30c999937f |
Scheduler changes for v6.1:
- Debuggability:
- Change most occurances of BUG_ON() to WARN_ON_ONCE()
- Reorganize & fix TASK_ state comparisons, turn it into a bitmap
- Update/fix misc scheduler debugging facilities
- Load-balancing & regular scheduling:
- Improve the behavior of the scheduler in presence of lot of
SCHED_IDLE tasks - in particular they should not impact other
scheduling classes.
- Optimize task load tracking, cleanups & fixes
- Clean up & simplify misc load-balancing code
- Freezer:
- Rewrite the core freezer to behave better wrt thawing and be simpler
in general, by replacing PF_FROZEN with TASK_FROZEN & fixing/adjusting
all the fallout.
- Deadline scheduler:
- Fix the DL capacity-aware code
- Factor out dl_task_is_earliest_deadline() & replenish_dl_new_period()
- Relax/optimize locking in task_non_contending()
- Cleanups:
- Factor out the update_current_exec_runtime() helper
- Various cleanups, simplifications
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmM/01cRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1geZA/+PB4KC1T9aVxzaTHI36R03YgJYZmIdtxw
wTf02MixePmz+gQCbepJbempGOh5ST28aOcI0xhdYOql5B63MaUBBMlB0HvGUyDG
IU3zETqLMRtAbnSTdQFv8m++ECUtZYp8/x1FCel4WO7ya4ETkRu1NRfCoUepEhpZ
aVAlae9LH3NBaF9t7s0PT2lTjf3pIzMFRkddJ0ywJhbFR3VnWat05fAK+J6fGY8+
LS54coefNlJD4oDh5TY8uniL1j5SmWmmwbk9Cdj7bLU5P3dFSS0/+5FJNHJPVGDE
srGT7wstRUcDrN0CnZo48VIUBiApJCCDqTfJYi9wNYd0NAHvwY6MIJJgEIY8mKsI
L/qH26H81Wt+ezSZ/5JIlGlZ/LIeNaa6OO/fbWEYABBQogvvx3nxsRNUYKSQzumH
CnSBasBjLnjWyLlK4qARM9cI7NFSEK6NUigrEx/7h8JFu/8T4DlSy6LsF1HUyKgq
4+FJLAqG6cL0tcwB/fHYd0oRESN8dStnQhGxSojgufwLc7dlFULvCYF5JM/dX+/V
IKwbOfIOeOn6ViMtSOXAEGdII+IQ2/ZFPwr+8Z5JC7NzvTVL6xlu/3JXkLZR3L7o
yaXTSaz06h1vil7Z+GRf7RHc+wUeGkEpXh5vnarGZKXivhFdWsBdROIJANK+xR0i
TeSLCxQxXlU=
=KjMD
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Debuggability:
- Change most occurances of BUG_ON() to WARN_ON_ONCE()
- Reorganize & fix TASK_ state comparisons, turn it into a bitmap
- Update/fix misc scheduler debugging facilities
Load-balancing & regular scheduling:
- Improve the behavior of the scheduler in presence of lot of
SCHED_IDLE tasks - in particular they should not impact other
scheduling classes.
- Optimize task load tracking, cleanups & fixes
- Clean up & simplify misc load-balancing code
Freezer:
- Rewrite the core freezer to behave better wrt thawing and be
simpler in general, by replacing PF_FROZEN with TASK_FROZEN &
fixing/adjusting all the fallout.
Deadline scheduler:
- Fix the DL capacity-aware code
- Factor out dl_task_is_earliest_deadline() &
replenish_dl_new_period()
- Relax/optimize locking in task_non_contending()
Cleanups:
- Factor out the update_current_exec_runtime() helper
- Various cleanups, simplifications"
* tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
sched: Fix more TASK_state comparisons
sched: Fix TASK_state comparisons
sched/fair: Move call to list_last_entry() in detach_tasks
sched/fair: Cleanup loop_max and loop_break
sched/fair: Make sure to try to detach at least one movable task
sched: Show PF_flag holes
freezer,sched: Rewrite core freezer logic
sched: Widen TAKS_state literals
sched/wait: Add wait_event_state()
sched/completion: Add wait_for_completion_state()
sched: Add TASK_ANY for wait_task_inactive()
sched: Change wait_task_inactive()s match_state
freezer,umh: Clean up freezer/initrd interaction
freezer: Have {,un}lock_system_sleep() save/restore flags
sched: Rename task_running() to task_on_cpu()
sched/fair: Cleanup for SIS_PROP
sched/fair: Default to false in test_idle_cores()
sched/fair: Remove useless check in select_idle_core()
sched/fair: Avoid double search on same cpu
sched/fair: Remove redundant check in select_idle_smt()
...
|
|
|
|
e8bc52cb8d |
Driver core changes for 6.1-rc1
Here is the big set of driver core and debug printk changes for 6.1-rc1.
Included in here is:
- dynamic debug updates for the core and the drm subsystem. The
drm changes have all been acked by the relevant maintainers.
- kernfs fixes for syzbot reported problems
- kernfs refactors and updates for cgroup requirements
- magic number cleanups and removals from the kernel tree (they
were not being used and they really did not actually do
anything.)
- other tiny cleanups
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY0BYUA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylozwCdFRlcghaf7XBUyNgRZRwMC+oQI8EAn1G/nEDE
6aFd2er41uK0IGQnSmYO
=OK0k
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the big set of driver core and debug printk changes for
6.1-rc1. Included in here is:
- dynamic debug updates for the core and the drm subsystem. The drm
changes have all been acked by the relevant maintainers
- kernfs fixes for syzbot reported problems
- kernfs refactors and updates for cgroup requirements
- magic number cleanups and removals from the kernel tree (they were
not being used and they really did not actually do anything)
- other tiny cleanups
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (74 commits)
docs: filesystems: sysfs: Make text and code for ->show() consistent
Documentation: NBD_REQUEST_MAGIC isn't a magic number
a.out: restore CMAGIC
device property: Add const qualifier to device_get_match_data() parameter
drm_print: add _ddebug descriptor to drm_*dbg prototypes
drm_print: prefer bare printk KERN_DEBUG on generic fn
drm_print: optimize drm_debug_enabled for jump-label
drm-print: add drm_dbg_driver to improve namespace symmetry
drm-print.h: include dyndbg header
drm_print: wrap drm_*_dbg in dyndbg descriptor factory macro
drm_print: interpose drm_*dbg with forwarding macros
drm: POC drm on dyndbg - use in core, 2 helpers, 3 drivers.
drm_print: condense enum drm_debug_category
debugfs: use DEFINE_SHOW_ATTRIBUTE to define debugfs_regset32_fops
driver core: use IS_ERR_OR_NULL() helper in device_create_groups_vargs()
Documentation: ENI155_MAGIC isn't a magic number
Documentation: NBD_REPLY_MAGIC isn't a magic number
nbd: remove define-only NBD_MAGIC, previously magic number
Documentation: FW_HEADER_MAGIC isn't a magic number
Documentation: EEPROM_MAGIC_VALUE isn't a magic number
...
|
|
|
|
accc3b4a57 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
354ed59744 |
mm: multi-gen LRU: kill switch
Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
0x0001: the multi-gen LRU core
0x0002: walking page table, when arch_has_hw_pte_young() returns
true
0x0004: clearing the accessed bit in non-leaf PMD entries, when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
[yYnN]: apply to all the components above
E.g.,
echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0007
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005
NB: the page table walks happen on the scale of seconds under heavy memory
pressure, in which case the mmap_lock contention is a lesser concern,
compared with the LRU lock contention and the I/O congestion. So far the
only well-known case of the mmap_lock contention happens on Android, due
to Scudo [1] which allocates several thousand VMAs for merely a few
hundred MBs. The SPF and the Maple Tree also have provided their own
assessments [2][3]. However, if walking page tables does worsen the
mmap_lock contention, the kill switch can be used to disable it. In this
case the multi-gen LRU will suffer a minor performance degradation, as
shown previously.
Clearing the accessed bit in non-leaf PMD entries can also be disabled,
since this behavior was not tested on x86 varieties other than Intel and
AMD.
[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/
Link: https://lkml.kernel.org/r/20220918080010.2920238-11-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
8619e94d3b |
cgroup: use strscpy() is more robust and safer
The implementation of strscpy() is more robust and safer. That's now the recommended way to copy NUL terminated strings. Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
61c41711b1 |
cgroup: simplify code in cgroup_apply_control
It could directly return 'cgroup_update_dfl_csses' to simplify code. Signed-off-by: William Dean <williamsukatube@163.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
7e1eb5437d |
cgroup: Make cgroup_get_from_id() prettier
After merging 836ac87d ("cgroup: fix cgroup_get_from_id") into for-6.1, its
combination with two commits in for-6.1 -
|
|
|
|
026e14a276 |
Merge branch 'for-6.0-fixes' into for-6.1
for-6.0 has the following fix for cgroup_get_from_id().
836ac87d ("cgroup: fix cgroup_get_from_id")
which conflicts with the following two commits in for-6.1.
|
|
|
|
df02452f3d |
cgroup: cgroup_get_from_id() must check the looked-up kn is a directory
cgroup has to be one kernfs dir, otherwise kernel panic is caused,
especially cgroup id is provide from userspace.
Reported-by: Marco Patalano <mpatalan@redhat.com>
Fixes:
|
|
|
|
a791dc1353 |
Linux 6.0-rc5
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmMeQ2keHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGYRMH+gLNHiGirGZlm2GQ tKaZQUy7MiXuIP0hGDonDIIIAmIVhnjm9MDG8KT4W8AvEd7ukncyYqJfwWeWQPhP 4mZcf6l3Z8Ke+qiaFpXpMPCxTyWcln1ox0EoNx2g9gdPxZntaRuuaTQVljUfTiey aVPHxve8ip3G7jDoJnuLSxESOqWxkb8v/SshBP1E5bF5BZ+cgZRqq7FNigFqxjbk wF29K09BVOPjdgkSvY/b0/SnL5KlSdMAv+FrPcJNGivcdIPgf/qJks5cI2HRUo7o CpKgbcLorCVyD+d+zLonJBwIy3arbmKD8JqYnfdTSIqVOUqHXWUDfeydsH32u1Gu lPSI2Hw= =7LTL -----END PGP SIGNATURE----- Merge 6.0-rc5 into driver-core-next We need the driver core and debugfs changes in this branch. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
34f26a1561 |
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit
|
|
|
|
57899a6610 |
sched/psi: Consolidate cgroup_psi()
cgroup_psi() can't return psi_group for root cgroup, so we have many open code "psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi". This patch move cgroup_psi() definition to <linux/psi.h>, in which we can return psi_system for root cgroup, so can handle all cgroups. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-9-zhouchengming@bytedance.com |
|
|
|
52b1364ba0 |
sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
Now PSI already tracked workload pressure stall information for CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have obvious impact on some workload productivity, such as web service workload. When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time from update_rq_clock_task(), in which we can record that delta to CPU curr task's cgroups as PSI_IRQ_FULL status. Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in the current task on the CPU, make nothing productive could run even if it were runnable, so we only use PSI_IRQ_FULL. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com |
|
|
|
58d8c2586c |
sched/psi: Don't create cgroup PSI files when psi_disabled
commit
|
|
|
|
0fb7b6f9d3 |
Merge branch 'driver-core/driver-core-next'
Pull in dependent cgroup patches Signed-off-by: Peter Zijlstra <peterz@infradead.org> |
|
|
|
f5d39b0208 |
freezer,sched: Rewrite core freezer logic
Rewrite the core freezer to behave better wrt thawing and be simpler
in general.
By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
ensured frozen tasks stay frozen until thawed and don't randomly wake
up early, as is currently possible.
As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
two PF_flags (yay!).
Specifically; the current scheme works a little like:
freezer_do_not_count();
schedule();
freezer_count();
And either the task is blocked, or it lands in try_to_freezer()
through freezer_count(). Now, when it is blocked, the freezer
considers it frozen and continues.
However, on thawing, once pm_freezing is cleared, freezer_count()
stops working, and any random/spurious wakeup will let a task run
before its time.
That is, thawing tries to thaw things in explicit order; kernel
threads and workqueues before doing bringing SMP back before userspace
etc.. However due to the above mentioned races it is entirely possible
for userspace tasks to thaw (by accident) before SMP is back.
This can be a fatal problem in asymmetric ISA architectures (eg ARMv9)
where the userspace task requires a special CPU to run.
As said; replace this with a special task state TASK_FROZEN and add
the following state transitions:
TASK_FREEZABLE -> TASK_FROZEN
__TASK_STOPPED -> TASK_FROZEN
__TASK_TRACED -> TASK_FROZEN
The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL
(IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
is already required to deal with spurious wakeups and the freezer
causes one such when thawing the task (since the original state is
lost).
The special __TASK_{STOPPED,TRACED} states *can* be restored since
their canonical state is in ->jobctl.
With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
free of undue (early / spurious) wakeups.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
|
|
|
|
c478bd8836 |
cgroup/cpuset: remove unreachable code
The function sched_partition_show cannot execute seq_puts, delete the invalid code. kernel/cgroup/cpuset.c:2849 sched_partition_show() warn: ignoring unreachable code. Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2087 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
2786bcff28 |
Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-09-05 The following pull-request contains BPF updates for your *net-next* tree. We've added 106 non-merge commits during the last 18 day(s) which contain a total of 159 files changed, 5225 insertions(+), 1358 deletions(-). There are two small merge conflicts, resolve them as follows: 1) tools/testing/selftests/bpf/DENYLIST.s390x Commit |
|
|
|
8a693f7766 |
cgroup: Remove CFTYPE_PRESSURE
CFTYPE_PRESSURE is used to flag PSI related files so that they are not created if PSI is disabled during boot. It's a bit weird to use a generic flag to mark a specific file type. Let's instead move the PSI files into its own cftypes array and add/rm them conditionally. This is a bit more code but cleaner. No userland visible changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> |
|
|
|
0083d27b21 |
cgroup: Improve cftype add/rm error handling
Let's track whether a cftype is currently added or not using a new flag __CFTYPE_ADDED so that duplicate operations can be failed safely and consistently allow using empty cftypes. Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
d7c8142d5a |
cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule
Currently, changes in "cpust.cpus" of a partition root is not allowed if it violates the sibling cpu exclusivity rule when the check is done in the validate_change() function. That is inconsistent with the other cpuset changes that are always allowed but may make a partition invalid. Update the cpuset code to allow cpumask change even if it violates the sibling cpu exclusivity rule, but invalidate the partition instead just like the other changes. However, other sibling partitions with conflicting cpumask will also be invalidated in order to not violating the exclusivity rule. This behavior is specific to this partition rule violation. Note that a previous commit has made sibling cpu exclusivity rule check the last check of validate_change(). So if -EINVAL is returned, we can be sure that sibling cpu exclusivity rule violation is the only rule that is broken. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
74027a6535 |
cgroup/cpuset: Relocate a code block in validate_change()
This patch moves down the exclusive cpu and memory check in validate_change(). There is no functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
7476a636d3 |
cgroup/cpuset: Show invalid partition reason string
There are a number of different reasons which can cause a partition to become invalid. A user seeing an invalid partition may not know exactly why. To help user to get a better understanding of the underlying reason, The cpuset.cpus.partition control file, when read, will now report the reason why a partition become invalid. When a partition does become invalid, reading the control file will show "root invalid (<reason>)" where <reason> is a string that describes why the partition is invalid. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
f28e22441f |
cgroup/cpuset: Add a new isolated cpus.partition type
Cpuset v1 uses the sched_load_balance control file to determine if load balancing should be enabled. Cpuset v2 gets rid of sched_load_balance as its use may require disabling load balancing at cgroup root. For workloads that require very low latency like DPDK, the latency jitters caused by periodic load balancing may exceed the desired latency limit. When cpuset v2 is in use, the only way to avoid this latency cost is to use the "isolcpus=" kernel boot option to isolate a set of CPUs. After the kernel boot, however, there is no way to add or remove CPUs from this isolated set. For workloads that are more dynamic in nature, that means users have to provision enough CPUs for the worst case situation resulting in excess idle CPUs. To address this issue for cpuset v2, a new cpuset.cpus.partition type "isolated" is added which allows the creation of a cpuset partition without load balancing. This will allow system administrators to dynamically adjust the size of isolated partition to the current need of the workload without rebooting the system. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
f0af1bfc27 |
cgroup/cpuset: Relax constraints to partition & cpus changes
Currently, enabling a partition root is only allowed if all the
constraints of a valid partition are satisfied. Even changes to
"cpuset.cpus" may not be allowed in some cases. Moreover, there are
limits to changes made to a parent cpuset if it is a valid partition
root. This is contrary to the general cgroup v2 philosophy.
This patch relaxes the constraints of changing the state of "cpuset.cpus"
and "cpuset.cpus.partition". Now all valid changes ("member" or "root")
to "cpuset.cpus.partition" are allowed even if there are child cpusets
underneath it.
Trying to make a cpuset a partition root, however, will cause its state
to become invalid if the following constraints of a valid partition
root are not satisfied.
1) The "cpuset.cpus" is non-empty and exclusive.
2) The parent cpuset is a valid partition root.
3) The "cpuset.cpus" overlaps parent's "cpuset.cpus".
Similarly, almost all changes to "cpuset.cpus" are allowed with the
exception that if the underlying CS_CPU_EXCLUSIVE flag is set, the
exclusivity rule will still apply.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
|
|
e2d59900d9 |
cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective
Currently, a partition root cannot have empty "cpuset.cpus.effective". As a result, a parent partition root cannot distribute out all its CPUs to child partitions with no CPUs left. However in most cases, there shouldn't be any tasks associated with intermediate nodes of the default hierarchy. So the current rule is too restrictive and can waste valuable CPU resource. To address this issue, we are now allowing a partition to have empty "cpuset.cpus.effective" as long as it has no task. Since cpuset is threaded, no-internal-process rule does not apply. So it is possible to have tasks in a partition root with child sub-partitions even though that should be uncommon. A parent partition with no task can now have all its CPUs distributed out to its child partitions. The top cpuset always have some house-keeping tasks running and so its list of effective cpu can't be empty. Once a partition with empty "cpuset.cpus.effective" is formed, no new task can be moved into it until "cpuset.cpus.effective" becomes non-empty. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
18065ebe9b |
cgroup/cpuset: Miscellaneous cleanups & add helper functions
The partition root state (PRS) macro names do not currently match the external names. Change them to match the external names and add helper functions to read or change the state. Shorten the cpuset argument of update_parent_subparts_cpumask() to cs to match other cpuset functions. Remove the new_prs argument from notify_partition_change() as the cs->partition_root_state has already been set to new_prs before it is called. There is no functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
ec5fbdfb99 |
cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset
Previously, update_tasks_cpumask() is not supposed to be called with
top cpuset. With cpuset partition that takes CPUs away from the top
cpuset, adjusting the cpus_mask of the tasks in the top cpuset is
necessary. Percpu kthreads, however, are ignored.
Fixes:
|
|
|
|
5251c6c436 |
cgroup: add pids.peak interface for pids controller
pids.peak tracks the high watermark of usage for number of pids. This helps give a better baseline on which to set pids.max. Polling pids.current isn't really feasible, since it would potentially miss short-lived spikes. This interface is analogous to memory.peak. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
dc79ec1b23 |
cgroup: Remove data-race around cgrp_dfl_visible
There's a seemingly harmless data-race around cgrp_dfl_visible detected by kernel concurrency sanitizer. Let's remove it by throwing WRITE/READ_ONCE at it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Abhishek Shah <abhishek.shah@columbia.edu> Cc: Gabriel Ryan <gabe@cs.columbia.edu> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/netdev/20220819072256.fn7ctciefy4fc4cu@wittgenstein/ |
|
|
|
e2691f6b44 |
cgroup: Implement cgroup_file_show()
Add cgroup_file_show() which allows toggling visibility of a cgroup file using the new kernfs_show(). This will be used to hide psi interface files on cgroups where it's disabled. Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220828050440.734579-10-tj@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
075b593f54 |
cgroup: Use cgroup_attach_{lock,unlock}() from cgroup_attach_task_all()
No behavior changes; preparing for potential locking changes in future. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by:Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
265efc941f |
Merge branch 'for-6.0-fixes' into for-6.1
Pulling to receive
|
|
|
|
fa7e439cf9 |
cgroup: Homogenize cgroup_get_from_id() return value
Cgroup id is user provided datum hence extend its return domain to include possible error reason (similar to cgroup_get_from_fd()). This change also fixes commit |
|
|
|
4534dee941 |
cgroup: cgroup: Honor caller's cgroup NS when resolving cgroup id
Cgroup ids are resolved in the global scope. That may be needed sometime
(in future) but currently it violates virtual view provided through
cgroup namespaces.
There are currently following users of the resolution:
- fc_appid_store
- bpf_iter_attach_cgroup
- mem_cgroup_get_from_ino
None of the is a called on behalf of kernel but the resolution is made
with proper userspace context, hence the default to current->nsproxy
makes sens. (This doesn't rule out cgroup_get_from_id with cgroup NS
parameter in the future.)
Since cgroup ids are defined on v2 hierarchy only, we simply check
existence in the cgroup namespace by looking at ancestry on the default
hierarchy.
Fixes:
|
|
|
|
74e4b956eb |
cgroup: Honor caller's cgroup NS when resolving path
cgroup_get_from_path() is not widely used function. Its callers presume
the path is resolved under cgroup namespace. (There is one caller
currently and resolving in init NS won't make harm (netfilter). However,
future users may be subject to different effects when resolving
globally.)
Since, there's currently no use for the global resolution, modify the
existing function to take cgroup NS into account.
Fixes:
|
|
|
|
880b0dd94f |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/mellanox/mlx5/core/en_fs.c |
|
|
|
a319185be9 |
cgroup: bpf: enable bpf programs to integrate with rstat
Enable bpf programs to make use of rstat to collect cgroup hierarchical stats efficiently: - Add cgroup_rstat_updated() kfunc, for bpf progs that collect stats. - Add cgroup_rstat_flush() sleepable kfunc, for bpf progs that read stats. - Add an empty bpf_rstat_flush() hook that is called during rstat flushing, for bpf progs that flush stats to attach to. Attaching a bpf prog to this hook effectively registers it as a flush callback. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220824233117.1312810-4-haoluo@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
43626dade3 |
cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()
syzbot is hitting percpu_rwsem_assert_held(&cpu_hotplug_lock) warning at cpuset_attach() [1], for commit |
|
|
|
763f4fb76e |
cgroup: Fix race condition at rebind_subsystems()
Root cause:
The rebind_subsystems() is no lock held when move css object from A
list to B list,then let B's head be treated as css node at
list_for_each_entry_rcu().
Solution:
Add grace period before invalidating the removed rstat_css_node.
Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c011283c41.camel@mediatek.com/T/
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Fixes:
|
|
|
|
3f5f728a72 |
Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Andrii Nakryiko says:
====================
bpf-next 2022-08-17
We've added 45 non-merge commits during the last 14 day(s) which contain
a total of 61 files changed, 986 insertions(+), 372 deletions(-).
The main changes are:
1) New bpf_ktime_get_tai_ns() BPF helper to access CLOCK_TAI, from Kurt
Kanzenbach and Jesper Dangaard Brouer.
2) Few clean ups and improvements for libbpf 1.0, from Andrii Nakryiko.
3) Expose crash_kexec() as kfunc for BPF programs, from Artem Savkov.
4) Add ability to define sleepable-only kfuncs, from Benjamin Tissoires.
5) Teach libbpf's bpf_prog_load() and bpf_map_create() to gracefully handle
unsupported names on old kernels, from Hangbin Liu.
6) Allow opting out from auto-attaching BPF programs by libbpf's BPF skeleton,
from Hao Luo.
7) Relax libbpf's requirement for shared libs to be marked executable, from
Henqgi Chen.
8) Improve bpf_iter internals handling of error returns, from Hao Luo.
9) Few accommodations in libbpf to support GCC-BPF quirks, from James Hilliard.
10) Fix BPF verifier logic around tracking dynptr ref_obj_id, from Joanne Koong.
11) bpftool improvements to handle full BPF program names better, from Manu
Bretelle.
12) bpftool fixes around libcap use, from Quentin Monnet.
13) BPF map internals clean ups and improvements around memory allocations,
from Yafang Shao.
14) Allow to use cgroup_get_from_file() on cgroupv1, allowing BPF cgroup
iterator to work on cgroupv1, from Yosry Ahmed.
15) BPF verifier internal clean ups, from Dave Marchevsky and Joanne Koong.
16) Various fixes and clean ups for selftests/bpf and vmtest.sh, from Daniel
Xu, Artem Savkov, Joanne Koong, Andrii Nakryiko, Shibin Koikkara Reeny.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
selftests/bpf: Few fixes for selftests/bpf built in release mode
libbpf: Clean up deprecated and legacy aliases
libbpf: Streamline bpf_attr and perf_event_attr initialization
libbpf: Fix potential NULL dereference when parsing ELF
selftests/bpf: Tests libbpf autoattach APIs
libbpf: Allows disabling auto attach
selftests/bpf: Fix attach point for non-x86 arches in test_progs/lsm
libbpf: Making bpf_prog_load() ignore name if kernel doesn't support
selftests/bpf: Update CI kconfig
selftests/bpf: Add connmark read test
selftests/bpf: Add existing connection bpf_*_ct_lookup() test
bpftool: Clear errno after libcap's checks
bpf: Clear up confusion in bpf_skb_adjust_room()'s documentation
bpftool: Fix a typo in a comment
libbpf: Add names for auxiliary maps
bpf: Use bpf_map_area_alloc consistently on bpf map creation
bpf: Make __GFP_NOWARN consistent in bpf map creation
bpf: Use bpf_map_area_free instread of kvfree
bpf: Remove unneeded memset in queue_stack_map creation
libbpf: preserve errno across pr_warn/pr_info/pr_debug
...
====================
Link: https://lore.kernel.org/r/20220817215656.1180215-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
|
|
4f7e723643 |
cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock
Bringing up a CPU may involve creating and destroying tasks which requires
read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside
cpus_read_lock(). However, cpuset's ->attach(), which may be called with
thredagroup_rwsem write-locked, also wants to disable CPU hotplug and
acquires cpus_read_lock(), leading to a deadlock.
Fix it by guaranteeing that ->attach() is always called with CPU hotplug
disabled and removing cpus_read_lock() call from cpuset_attach().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-and-tested-by: Imran Khan <imran.f.khan@oracle.com>
Reported-and-tested-by: Xuewen Yan <xuewen.yan@unisoc.com>
Fixes:
|
|
|
|
76b079ef4c |
sched/psi: Remove unused parameter nbytes of psi_trigger_create()
psi_trigger_create()'s 'nbytes' parameter is not used, so we can remove it. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
7f203bc89e |
cgroup: Replace cgroup->ancestor_ids[] with ->ancestors[]
Every cgroup knows all its ancestors through its ->ancestor_ids[]. There's
no advantage to remembering the IDs instead of the pointers directly and
this makes the array useless for finding an actual ancestor cgroup forcing
cgroup_ancestor() to iteratively walk up the hierarchy instead. Let's
replace cgroup->ancestor_ids[] with ->ancestors[] and remove the walking-up
from cgroup_ancestor().
While at it, improve comments around cgroup_root->cgrp_ancestor_storage.
This patch shouldn't cause user-visible behavior differences.
v2: Update cgroup_ancestor() to use ->ancestors[].
v3: cgroup_root->cgrp_ancestor_storage's type is updated to match
cgroup->ancestors[]. Better comments.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
|
|
|
|
f3a2aebdd6 |
cgroup: enable cgroup_get_from_file() on cgroup1
cgroup_get_from_file() currently fails with -EBADF if called on cgroup v1. However, the current implementation works on cgroup v1 as well, so the restriction is unnecessary. This enabled cgroup_get_from_fd() to work on cgroup v1, which would be the only thing stopping bpf cgroup_iter from supporting cgroup v1. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220805214821.1058337-3-haoluo@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
cac03ac368 |
Various fixes: a deadline scheduler fix, a migration fix, a Sparse fix and a comment fix.
Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLuvmwRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gONQ/+KkkPTeKgGDvrahTfeYZlmRyvcI1R78r9 yooa8v+DtifznBW2eXDBc8WTruzqr78VyUY+1YSjfKS6FRQWYMficJ3qk3hxgBru 998KZbvl3jXBBlRkqgGeFlF5Ty2KaryEZgX97a7IF/0xWDgpm972jFkJ/KCo/YTY WSQrzutz2FKe71EjK4cAplYxPZIiy/zo2hSGTbsso4M7bO5VLc1Y4qMtFGcCZ7JB s9JYkj2Rfz+AS5wioDRcGuec4A4SrroxKszZA6QDDBuhMJukqexO02xs/fxZ2W4Z DF4U5MFOrtz9AWSGsf1P6XXbgJO8qTgQXZchFsEcJwypV13w8U0IViXQfD/Pvx2X y+WHdnZVIO2sDwOJ15ew7IuoJZ2LsVygrBNFJJaIFOtIz3RzprI0BJN7LeWFALOa IPmbtiY8hVwhKmjRgMHWDwJhMEHLuhGx3idiD89w1pknzTUnKDiwLyEUtyynxeGd ft9uCvPefrYQVx9AiH7wf0W+fg334FCccC+0f8LyduyftUyQCfZIZY6LUSKuKded Odm7k0ngLDPbdZwAHs0Nf/ilRwd91Z7b6hGt5U3ptx+8BPMKB+/k1VoKog7OISPc zGaP7DrtuC4sEdX4X6bqX+mEQhpkLcQw15gVGxhKoHqygWNSZrV634aSSXwfVXJx eT5m/K9a7L0= =CYl5 -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Various fixes: a deadline scheduler fix, a migration fix, a Sparse fix and a comment fix" * tag 'sched-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Do not requeue task on CPU excluded from cpus_mask sched/rt: Fix Sparse warnings due to undefined rt.c declarations exit: Fix typo in comment: s/sub-theads/sub-threads sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed |
|
|
|
b6bb70f9ab |
Several core optimizations:
* threadgroup_rwsem write locking is skipped when configuring controllers in
empty subtrees. Combined with CLONE_INTO_CGROUP, this allows the common
static usage pattern to not grab threadgroup_rwsem at all (glibc still
doesn't seem ready for CLONE_INTO_CGROUP unfortunately).
* threadgroup_rwsem used to be put into non-percpu mode by default due to
latency concerns in specific use cases. There's no reason for everyone
else to pay for it. Make the behavior optional.
* psi no longer allocates memory when disabled.
along with some code cleanups.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYugHIQ4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGd+oAP9lfD3fTRdNo4qWV2VsZsYzoOxzNIuJSwN/dnYx
IEbQOwD/cd2YMfeo6zcb427U/VfTFqjJjFK04OeljYtJU8fFywo=
=sucy
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"Several core optimizations:
- threadgroup_rwsem write locking is skipped when configuring
controllers in empty subtrees.
Combined with CLONE_INTO_CGROUP, this allows the common static
usage pattern to not grab threadgroup_rwsem at all (glibc still
doesn't seem ready for CLONE_INTO_CGROUP unfortunately).
- threadgroup_rwsem used to be put into non-percpu mode by default
due to latency concerns in specific use cases. There's no reason
for everyone else to pay for it. Make the behavior optional.
- psi no longer allocates memory when disabled.
... along with some code cleanups"
* tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Skip subtree root in cgroup_update_dfl_csses()
cgroup: remove "no" prefixed mount options
cgroup: Make !percpu threadgroup_rwsem operations optional
cgroup: Add "no" prefixed mount options
cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree
cgroup.c: remove redundant check for mixable cgroup in cgroup_migrate_vet_dst
cgroup.c: add helper __cset_cgroup_from_root to cleanup duplicated codes
psi: dont alloc memory for psi by default
|
|
|
|
b6e8d40d43 |
sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed
With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating
that the cpuset will just use the effective CPUs of its parent. So
cpuset_can_attach() can call task_can_attach() with an empty mask.
This can lead to cpumask_any_and() returns nr_cpu_ids causing the call
to dl_bw_of() to crash due to percpu value access of an out of bound
CPU value. For example:
[80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
:
[80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
:
[80468.207946] Call Trace:
[80468.208947] cpuset_can_attach+0xa0/0x140
[80468.209953] cgroup_migrate_execute+0x8c/0x490
[80468.210931] cgroup_update_dfl_csses+0x254/0x270
[80468.211898] cgroup_subtree_control_write+0x322/0x400
[80468.212854] kernfs_fop_write_iter+0x11c/0x1b0
[80468.213777] new_sync_write+0x11f/0x1b0
[80468.214689] vfs_write+0x1eb/0x280
[80468.215592] ksys_write+0x5f/0xe0
[80468.216463] do_syscall_64+0x5c/0x80
[80468.224287] entry_SYSCALL_64_after_hwframe+0x44/0xae
Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
to be used by tasks within the cpuset anyway.
Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
reflect the change. In addition, a check is added to task_can_attach()
to guard against the possibility that cpumask_any_and() may return a
value >= nr_cpu_ids.
Fixes:
|
|
|
|
b167fdffe9 |
This cycle's scheduler updates for v6.0 are:
Load-balancing improvements:
============================
- Improve NUMA balancing on AMD Zen systems for affine workloads.
- Improve the handling of reduced-capacity CPUs in load-balancing.
- Energy Model improvements: fix & refine all the energy fairness metrics (PELT),
and remove the conservative threshold requiring 6% energy savings to
migrate a task. Doing this improves power efficiency for most workloads,
and also increases the reliability of energy-efficiency scheduling.
- Optimize/tweak select_idle_cpu() to spend (much) less time searching
for an idle CPU on overloaded systems. There's reports of several
milliseconds spent there on large systems with large workloads ...
[ Since the search logic changed, there might be behavioral side effects. ]
- Improve NUMA imbalance behavior. On certain systems
with spare capacity, initial placement of tasks is non-deterministic,
and such an artificial placement imbalance can persist for a long time,
hurting (and sometimes helping) performance.
The fix is to make fork-time task placement consistent with runtime
NUMA balancing placement.
Note that some performance regressions were reported against this,
caused by workloads that are not memory bandwith limited, which benefit
from the artificial locality of the placement bug(s). Mel Gorman's
conclusion, with which we concur, was that consistency is better than
random workload benefits from non-deterministic bugs:
"Given there is no crystal ball and it's a tradeoff, I think it's
better to be consistent and use similar logic at both fork time
and runtime even if it doesn't have universal benefit."
- Improve core scheduling by fixing a bug in sched_core_update_cookie() that
caused unnecessary forced idling.
- Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs for newly
woken tasks.
- Fix a newidle balancing bug that introduced unnecessary wakeup latencies.
ABI improvements/fixes:
=======================
- Do not check capabilities and do not issue capability check denial messages
when a scheduler syscall doesn't require privileges. (Such as increasing niceness.)
- Add forced-idle accounting to cgroups too.
- Fix/improve the RSEQ ABI to not just silently accept unknown flags.
(No existing tooling is known to have learned to rely on the previous behavior.)
- Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags.
Optimizations:
==============
- Optimize & simplify leaf_cfs_rq_list()
- Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg().
Misc fixes & cleanups:
======================
- Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems.
- Fix a full-NOHZ bug that can in some cases result in the tick not being
re-enabled when the last SCHED_RT task is gone from a runqueue but there's
still SCHED_OTHER tasks around.
- Various PREEMPT_RT related fixes.
- Misc cleanups & smaller fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLn2ywRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iNfxAAhPJMwM4tYCpIM6PhmxKiHl6kkiT2tt42
HhEmiJVLjczLybWaWwmGA2dSFkv1f4+hG7nqdZTm9QYn0Pqat2UTSRcwoKQc+gpB
x85Hwt2IUmnUman52fRl5r1miH9LTdCI6agWaFLQae5ds1XmOugFo52t2ahax+Gn
dB8LxS2fa/GrKj229EhkJSPWAK4Y94asoTProwpKLuKEeXhDkqUNrOWbKhz+wEnA
pVZySpA9uEOdNLVSr1s0VB6mZoh5/z6yQefj5YSNntsG71XWo9jxKCIm5buVdk2U
wjdn6UzoTThOy/5Ygm64eYRexMHG71UamF1JYUdmvDeUJZ5fhG6RD0FECUQNVcJB
Msu2fce6u1AV0giZGYtiooLGSawB/+e6MoDkjTl8guFHi/peve9CezKX1ZgDWPfE
eGn+EbYkUS9RMafXCKuEUBAC1UUqAavGN9sGGN1ufyR4za6ogZplOqAFKtTRTGnT
/Ne3fHTtvv73DLGW9ohO5vSS2Rp7zhAhB6FunhibhxCWlt7W6hA4Ze2vU9hf78Yn
SJDLAJjOEilLaKUkRG/d9uM3FjKJM1tqxuT76+sUbM0MNxdyiKcviQlP1b8oq5Um
xE1KNZUevnr/WXqOTGDKHH/HNPFgwxbwavMiP7dNFn8h/hEk4t9dkf5siDmVHtn4
nzDVOob1LgE=
=xr2b
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Load-balancing improvements:
- Improve NUMA balancing on AMD Zen systems for affine workloads.
- Improve the handling of reduced-capacity CPUs in load-balancing.
- Energy Model improvements: fix & refine all the energy fairness
metrics (PELT), and remove the conservative threshold requiring 6%
energy savings to migrate a task. Doing this improves power
efficiency for most workloads, and also increases the reliability
of energy-efficiency scheduling.
- Optimize/tweak select_idle_cpu() to spend (much) less time
searching for an idle CPU on overloaded systems. There's reports of
several milliseconds spent there on large systems with large
workloads ...
[ Since the search logic changed, there might be behavioral side
effects. ]
- Improve NUMA imbalance behavior. On certain systems with spare
capacity, initial placement of tasks is non-deterministic, and such
an artificial placement imbalance can persist for a long time,
hurting (and sometimes helping) performance.
The fix is to make fork-time task placement consistent with runtime
NUMA balancing placement.
Note that some performance regressions were reported against this,
caused by workloads that are not memory bandwith limited, which
benefit from the artificial locality of the placement bug(s). Mel
Gorman's conclusion, with which we concur, was that consistency is
better than random workload benefits from non-deterministic bugs:
"Given there is no crystal ball and it's a tradeoff, I think
it's better to be consistent and use similar logic at both fork
time and runtime even if it doesn't have universal benefit."
- Improve core scheduling by fixing a bug in
sched_core_update_cookie() that caused unnecessary forced idling.
- Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs
for newly woken tasks.
- Fix a newidle balancing bug that introduced unnecessary wakeup
latencies.
ABI improvements/fixes:
- Do not check capabilities and do not issue capability check denial
messages when a scheduler syscall doesn't require privileges. (Such
as increasing niceness.)
- Add forced-idle accounting to cgroups too.
- Fix/improve the RSEQ ABI to not just silently accept unknown flags.
(No existing tooling is known to have learned to rely on the
previous behavior.)
- Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags.
Optimizations:
- Optimize & simplify leaf_cfs_rq_list()
- Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg().
Misc fixes & cleanups:
- Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems.
- Fix a full-NOHZ bug that can in some cases result in the tick not
being re-enabled when the last SCHED_RT task is gone from a
runqueue but there's still SCHED_OTHER tasks around.
- Various PREEMPT_RT related fixes.
- Misc cleanups & smaller fixes"
* tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
rseq: Kill process when unknown flags are encountered in ABI structures
rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags
sched/core: Fix the bug that task won't enqueue into core tree when update cookie
nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()
sched/core: Always flush pending blk_plug
sched/fair: fix case with reduced capacity CPU
sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling
sched/core: add forced idle accounting for cgroups
sched/fair: Remove the energy margin in feec()
sched/fair: Remove task_util from effective utilization in feec()
sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
sched/fair: Rename select_idle_mask to select_rq_mask
sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
sched/fair: Decay task PELT values during wakeup migration
sched/fair: Provide u64 read for 32-bits arch helper
sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
sched: only perform capability check on privileged operation
sched: Remove unused function group_first_cpu()
sched/fair: Remove redundant word " *"
selftests/rseq: check if libc rseq support is registered
...
|
|
|
|
265792d0de |
cgroup: Skip subtree root in cgroup_update_dfl_csses()
The cgroup_update_dfl_csses() function updates css associations when a cgroup's subtree_control file is modified. Any changes made to a cgroup's subtree_control file, however, will only affect its descendants but not the cgroup itself. So there is no point in migrating csses associated with that cgroup. We can skip them instead. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
c808f46323 |
cgroup: remove "no" prefixed mount options
|
|
|
|
6a010a49b6 |
cgroup: Make !percpu threadgroup_rwsem operations optional
|
|
|
|
30312730bd |
cgroup: Add "no" prefixed mount options
We allow modifying these mount options via remount. Let's add "no" prefixed variants so that they can be turned off too. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> |
|
|
|
671c11f061 |
cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree
cgroup_update_dfl_csses() write-lock the threadgroup_rwsem as updating the csses can trigger process migrations. However, if the subtree doesn't contain any tasks, there aren't gonna be any cgroup migrations. This condition can be trivially detected by testing whether mgctx.preloaded_src_csets is empty. Elide write-locking threadgroup_rwsem if the subtree is empty. After this optimization, the usage pattern of creating a cgroup, enabling the necessary controllers, and then seeding it with CLONE_INTO_CGROUP and then removing the cgroup after it becomes empty doesn't need to write-lock threadgroup_rwsem at all. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> |
|
|
|
1fcf54deb7 |
sched/core: add forced idle accounting for cgroups
|
|
|
|
d75cd55ae2 |
cgroup.c: remove redundant check for mixable cgroup in cgroup_migrate_vet_dst
We have:
int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp)
{
...
/* mixables don't care */
if (cgroup_is_mixable(dst_cgrp))
return 0;
/*
* If @dst_cgrp is already or can become a thread root or is
* threaded, it doesn't matter.
*/
if (cgroup_can_be_thread_root(dst_cgrp) || cgroup_is_threaded(dst_cgrp))
return 0;
...
}
but in fact the entry of cgroup_can_be_thread_root() covers case that
checking cgroup_is_mixable() as following:
static bool cgroup_can_be_thread_root(struct cgroup *cgrp)
{
/* mixables don't care */
if (cgroup_is_mixable(cgrp))
return true;
...
}
so explicitly checking in cgroup_migrate_vet_dst is unnecessary.
Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
|
|
07fd5b6cdf |
cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. and then it notices A wants to migrate to A as it's an identity
migration, it culls it by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes:
|
|
|
|
e210a89f5b |
cgroup.c: add helper __cset_cgroup_from_root to cleanup duplicated codes
No funtionality change, but save us some lines. Signed-off-by: Lin Feng <linf@wangsu.com> Acked-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
5f69a6577b |
psi: dont alloc memory for psi by default
Memory about struct psi_group is allocated by default for each cgroup even if psi_disabled is true, in this case, these allocated memory is waste, so alloc memory for struct psi_group only when psi_disabled is false. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
8b49c4b1b6 |
Merge branch 'for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo: "Nothing too interesting. This adds cpu controller selftests and there are a couple code cleanup patches" * 'for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: remove the superfluous judgment cgroup: Make cgroup_debug static kseltest/cgroup: Make test_stress.sh work if run interactively kselftest/cgroup: fix test_stress.sh to use OUTPUT dir cgroup: Add config file to cgroup selftest suite cgroup: Add test_cpucg_max_nested() testcase cgroup: Add test_cpucg_max() testcase cgroup: Add test_cpucg_nested_weight_underprovisioned() testcase cgroup: Adding test_cpucg_nested_weight_overprovisioned() testcase cgroup: Add test_cpucg_weight_underprovisioned() testcase cgroup: Add test_cpucg_weight_overprovisioned() testcase cgroup: Add test_cpucg_stats() testcase to cgroup cpu selftests cgroup: Add new test_cpu.c test suite in cgroup selftests |
|
|
|
b154a017c9 |
cgroup: remove the superfluous judgment
Remove the superfluous judgment since the function is never called for a root cgroup, as suggested by Tejun. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
29ed17389c |
cgroup: Make cgroup_debug static
Make cgroup_debug static since it's only used in cgroup.c Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
2685027fca |
cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
There are 3 places where the cpu and node masks of the top cpuset can be initialized in the order they are executed: 1) start_kernel -> cpuset_init() 2) start_kernel -> cgroup_init() -> cpuset_bind() 3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp() The first cpuset_init() call just sets all the bits in the masks. The second cpuset_bind() call sets cpus_allowed and mems_allowed to the default v2 values. The third cpuset_init_smp() call sets them back to v1 values. For systems with cgroup v2 setup, cpuset_bind() is called once. As a result, cpu and memory node hot add may fail to update the cpu and node masks of the top cpuset to include the newly added cpu or node in a cgroup v2 environment. For systems with cgroup v1 setup, cpuset_bind() is called again by rebind_subsystem() when the v1 cpuset filesystem is mounted as shown in the dmesg log below with an instrumented kernel. [ 2.609781] cpuset_bind() called - v2 = 1 [ 3.079473] cpuset_init_smp() called [ 7.103710] cpuset_bind() called - v2 = 0 smp_init() is called after the first two init functions. So we don't have a complete list of active cpus and memory nodes until later in cpuset_init_smp() which is the right time to set up effective_cpus and effective_mems. To fix this cgroup v2 mask setup problem, the potentially incorrect cpus_allowed & mems_allowed setting in cpuset_init_smp() are removed. For cgroup v2 systems, the initial cpuset_bind() call will set the masks correctly. For cgroup v1 systems, the second call to cpuset_bind() will do the right setup. cc: stable@vger.kernel.org Signed-off-by: Waiman Long <longman@redhat.com> Tested-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
266d17a8c0 |
Driver core changes for 5.18-rc1
Here is the set of driver core changes for 5.18-rc1.
Not much here, primarily it was a bunch of cleanups and small updates:
- kobj_type cleanups for default_groups
- documentation updates
- firmware loader minor changes
- component common helper added and take advantage of it in many
drivers (the largest part of this pull request).
There will be a merge conflict in drivers/power/supply/ab8500_chargalg.c
with your tree, the merge conflict should be easy (take all the
changes).
All of these have been in linux-next for a while with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYkG6PA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylMFwCfSIyAU4oLEgj+/Rfmx4o45cAVIWMAnit3zbdU
wUUCGqKcOnTJEcW6dMPh
=1VVi
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of driver core changes for 5.18-rc1.
Not much here, primarily it was a bunch of cleanups and small updates:
- kobj_type cleanups for default_groups
- documentation updates
- firmware loader minor changes
- component common helper added and take advantage of it in many
drivers (the largest part of this pull request).
All of these have been in linux-next for a while with no reported
problems"
* tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (54 commits)
Documentation: update stable review cycle documentation
drivers/base/dd.c : Remove the initial value of the global variable
Documentation: update stable tree link
Documentation: add link to stable release candidate tree
devres: fix typos in comments
Documentation: add note block surrounding security patch note
samples/kobject: Use sysfs_emit instead of sprintf
base: soc: Make soc_device_match() simpler and easier to read
driver core: dd: fix return value of __setup handler
driver core: Refactor sysfs and drv/bus remove hooks
driver core: Refactor multiple copies of device cleanup
scripts: get_abi.pl: Fix typo in help message
kernfs: fix typos in comments
kernfs: remove unneeded #if 0 guard
ALSA: hda/realtek: Make use of the helper component_compare_dev_name
video: omapfb: dss: Make use of the helper component_compare_dev
power: supply: ab8500: Make use of the helper component_compare_dev
ASoC: codecs: wcd938x: Make use of the helper component_compare/release_of
iommu/mediatek: Make use of the helper component_compare/release_of
drm: of: Make use of the helper component_release_of
...
|
|
|
|
52deda9551 |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton: "Various misc subsystems, before getting into the post-linux-next material. 41 patches. Subsystems affected by this patch series: procfs, misc, core-kernel, lib, checkpatch, init, pipe, minix, fat, cgroups, kexec, kdump, taskstats, panic, kcov, resource, and ubsan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits) Revert "ubsan, kcsan: Don't combine sanitizer with kcov on clang" kernel/resource: fix kfree() of bootmem memory again kcov: properly handle subsequent mmap calls kcov: split ioctl handling into locked and unlocked parts panic: move panic_print before kmsg dumpers panic: add option to dump all CPUs backtraces in panic_print docs: sysctl/kernel: add missing bit to panic_print taskstats: remove unneeded dead assignment kasan: no need to unset panic_on_warn in end_report() ubsan: no need to unset panic_on_warn in ubsan_epilogue() panic: unset panic_on_warn inside panic() docs: kdump: add scp example to write out the dump file docs: kdump: update description about sysfs file system support arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef x86/setup: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef riscv: mm: init: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef kexec: make crashk_res, crashk_low_res and crash_notes symbols always visible cgroup: use irqsave in cgroup_rstat_flush_locked(). fat: use pointer to simple type in put_user() minix: fix bug when opening a file with O_DIRECT ... |
|
|
|
b1e2c8df0f |
cgroup: use irqsave in cgroup_rstat_flush_locked().
All callers of cgroup_rstat_flush_locked() acquire cgroup_rstat_lock either with spin_lock_irq() or spin_lock_irqsave(). cgroup_rstat_flush_locked() itself acquires cgroup_rstat_cpu_lock which is a raw_spin_lock. This lock is also acquired in cgroup_rstat_updated() in IRQ context and therefore requires _irqsave() locking suffix in cgroup_rstat_flush_locked(). Since there is no difference between spin_lock_t and raw_spin_lock_t on !RT lockdep does not complain here. On RT lockdep complains because the interrupts were not disabled here and a deadlock is possible. Acquire the raw_spin_lock_t with disabled interrupts. Link: https://lkml.kernel.org/r/20220301122143.1521823-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zefan Li <lizefan.x@bytedance.com> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Subject: cgroup: add a comment to cgroup_rstat_flush_locked(). Add a comment why spin_lock_irq() -> raw_spin_lock_irqsave() is needed. Link: https://lkml.kernel.org/r/Yh+DOK73hfVV5ThX@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
2fce7ea0e0 |
Merge branch 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo: "All trivial cleanups without meaningful behavior changes" * 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: cleanup comments cgroup: Fix cgroup_can_fork() and cgroup_post_fork() kernel-doc comment cgroup: rstat: retrieve current bstat to delta directly cgroup: rstat: use same convention to assign cgroup_base_stat |
|
|
|
ccdbf33c23 |
Linux 5.17-rc8
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmIuUskeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGCFkH/2n3mpGXuITp0ZXE TNrpbdZOof5SgLw+w7THswXuo6m5yRGNKQs9fvIvDD8Vf7/OdQQfPOmF1cIE5+nk wcz6aHKbdrok8Jql2qjJqWXZ5xbGj6qywg3zZrwOUsCKFP5p+AjBJcmZOsvQHjSp ASODy1moOlK+nO52TrMaJw74a8xQPmQiNa+T2P+FedEYjlcRH/c7hLJ7GEnL6+cC /R4bATZq3tiInbTBlkC0hR0iVNgRXwXNyv9PEXrYYYHnekh8G1mgSNf06iejLcsG aAYsW9NyPxu8zPhhHNx79K9o8BMtxGD4YQpsfdfIEnf9Q3euqAKe2evRWqHHlDms RuSCtsc= =M9Nc -----END PGP SIGNATURE----- Merge tag 'v5.17-rc8' into sched/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
f9da322e86 |
cgroup: cleanup comments
for spdx, add a space before // replacements judgement to judgment transofrmed to transformed partitition to partition histrical to historical migratecd to migrated Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
4a248f85b3 |
Merge 5.17-rc6 into driver-core-next
We need the driver core fix in here as well for future changes. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
f2eb478f2f |
kernfs: move struct kernfs_root out of the public view.
There is no need to have struct kernfs_root be part of kernfs.h for the whole kernel to see and poke around it. Move it internal to kernfs code and provide a helper function, kernfs_root_to_node(), to handle the one field that kernfs users were directly accessing from the structure. Cc: Imran Khan <imran.f.khan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220222070713.3517679-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
5c1ee56966 |
Merge branch 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo: - Fix for a subtle bug in the recent release_agent permission check update - Fix for a long-standing race condition between cpuset and cpu hotplug - Comment updates * 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: Fix kernel-doc cgroup-v1: Correct privileges check in release_agent writes cgroup: clarify cgroup_css_set_fork() cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug |
|
|
|
c70cd039f1 |
cpuset: Fix kernel-doc
Fix the following W=1 kernel warnings: kernel/cgroup/cpuset.c:3718: warning: expecting prototype for cpuset_memory_pressure_bump(). Prototype was for __cpuset_memory_pressure_bump() instead. kernel/cgroup/cpuset.c:3568: warning: expecting prototype for cpuset_node_allowed(). Prototype was for __cpuset_node_allowed() instead. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
|
|
|
467a726b75 |
cgroup-v1: Correct privileges check in release_agent writes
The idea is to check: a) the owning user_ns of cgroup_ns, b) capabilities in init_user_ns. The commit |