mirror of https://github.com/torvalds/linux.git
4054 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
53ebef53a6 |
bpf: Use proper type to calculate bpf_raw_tp_null_args.mask index
The calculation of the index used to access the mask field in 'struct bpf_raw_tp_null_args' is done with 'int' type, which could overflow when the tracepoint being attached has more than 8 arguments. While none of the tracepoints mentioned in raw_tp_null_args[] currently have more than 8 arguments, there do exist tracepoints that had more than 8 arguments (e.g. iocost_iocg_forgive_debt), so use the correct type for calculation and avoid Smatch static checker warning. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20250418074946.35569-1-shung-hsi.yu@suse.com Closes: https://lore.kernel.org/r/843a3b94-d53d-42db-93d4-be10a4090146@stanley.mountain/ |
|
|
|
07e32237ed |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCaAFFvgAKCRAraS+Z+3y5 E/1OAP9SGmTMgHuHLlF8en+MaYdtwgcHy6uurXgbSQAAV/RwwQEAh2oXZE1D9I7a EtxsaJYqbbhD09RPwWa2Rd8iJrJYXQk= =qcoU -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-04-17 We've added 12 non-merge commits during the last 9 day(s) which contain a total of 18 files changed, 1748 insertions(+), 19 deletions(-). The main changes are: 1) bpf qdisc support, from Amery Hung. A qdisc can be implemented in bpf struct_ops programs and can be used the same as other existing qdiscs in the "tc qdisc" command. 2) Add xsk tail adjustment tests, from Tushar Vyavahare. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: Test attaching bpf qdisc to mq and non root selftests/bpf: Add a bpf fq qdisc to selftest selftests/bpf: Add a basic fifo qdisc test libbpf: Support creating and destroying qdisc bpf: net_sched: Disable attaching bpf qdisc to non root bpf: net_sched: Support updating bstats bpf: net_sched: Add a qdisc watchdog timer bpf: net_sched: Add basic bpf qdisc kfuncs bpf: net_sched: Support implementation of Qdisc_ops in bpf bpf: Prepare to reuse get_ctx_arg_idx selftests/xsk: Add tail adjustment tests and support check selftests/xsk: Add packet stream replacement function ==================== Link: https://patch.msgid.link/20250417184338.3152168-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
5709be4c35 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc3
Cross-merge bpf and other fixes after downstream PRs. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
240ce924d2 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc3). No conflicts. Adjacent changes: tools/net/ynl/pyynl/ynl_gen_c.py |
|
|
|
a1b669ea16 |
bpf: Prepare to reuse get_ctx_arg_idx
Rename get_ctx_arg_idx to bpf_ctx_arg_idx, and allow others to call it. No functional change. Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409214606.2000194-2-ameryhung@gmail.com |
|
|
|
a650d38915 |
bpf: Convert ringbuf map to rqspinlock
Convert the raw spinlock used by BPF ringbuf to rqspinlock. Currently, we have an open syzbot report of a potential deadlock. In addition, the ringbuf can fail to reserve spuriously under contention from NMI context. It is potentially attractive to enable unconstrained usage (incl. NMIs) while ensuring no deadlocks manifest at runtime, perform the conversion to rqspinlock to achieve this. This change was benchmarked for BPF ringbuf's multi-producer contention case on an Intel Sapphire Rapids server, with hyperthreading disabled and performance governor turned on. 5 warm up runs were done for each case before obtaining the results. Before (raw_spinlock_t): Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.440 ± 0.019M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 2.706 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 3.130 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 2.472 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.352 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.813 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.988 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 2.245 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.148 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.190 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.490 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.180 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.201 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.226 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.164 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.874 ± 0.001M/s (drops 0.000 ± 0.000M/s) After (rqspinlock_t): Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.078 ± 0.019M/s (drops 0.000 ± 0.000M/s) (-3.16%) rb-libbpf nr_prod 2 2.801 ± 0.014M/s (drops 0.000 ± 0.000M/s) (3.51%) rb-libbpf nr_prod 3 3.454 ± 0.005M/s (drops 0.000 ± 0.000M/s) (10.35%) rb-libbpf nr_prod 4 2.567 ± 0.002M/s (drops 0.000 ± 0.000M/s) (3.84%) rb-libbpf nr_prod 8 2.468 ± 0.001M/s (drops 0.000 ± 0.000M/s) (4.93%) rb-libbpf nr_prod 12 2.510 ± 0.001M/s (drops 0.000 ± 0.000M/s) (-10.77%) rb-libbpf nr_prod 16 2.075 ± 0.001M/s (drops 0.000 ± 0.000M/s) (4.38%) rb-libbpf nr_prod 20 2.640 ± 0.001M/s (drops 0.000 ± 0.000M/s) (17.59%) rb-libbpf nr_prod 24 2.092 ± 0.001M/s (drops 0.000 ± 0.000M/s) (-2.61%) rb-libbpf nr_prod 28 2.426 ± 0.005M/s (drops 0.000 ± 0.000M/s) (10.78%) rb-libbpf nr_prod 32 2.331 ± 0.004M/s (drops 0.000 ± 0.000M/s) (-6.39%) rb-libbpf nr_prod 36 2.306 ± 0.003M/s (drops 0.000 ± 0.000M/s) (5.78%) rb-libbpf nr_prod 40 2.178 ± 0.002M/s (drops 0.000 ± 0.000M/s) (-1.04%) rb-libbpf nr_prod 44 2.293 ± 0.001M/s (drops 0.000 ± 0.000M/s) (3.01%) rb-libbpf nr_prod 48 2.022 ± 0.001M/s (drops 0.000 ± 0.000M/s) (-6.56%) rb-libbpf nr_prod 52 1.809 ± 0.001M/s (drops 0.000 ± 0.000M/s) (-3.47%) There's a fair amount of noise in the benchmark, with numbers on reruns going up and down by 10%, so all changes are in the range of this disturbance, and we see no major regressions. Reported-by: syzbot+850aaf14624dc0c6d366@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000004aa700061379547e@google.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250411101759.4061366-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
0f08335ade |
trace: tcp: Add tracepoint for tcp_sendmsg_locked()
Add a tracepoint to monitor TCP send operations, enabling detailed visibility into TCP message transmission. Create a new tracepoint within the tcp_sendmsg_locked function, capturing traditional fields along with size_goal, which indicates the optimal data size for a single TCP segment. Additionally, a reference to the struct sock sk is passed, allowing direct access for BPF programs. The implementation is largely based on David's patch[1] and suggestions. Link: https://lore.kernel.org/all/70168c8f-bf52-4279-b4c4-be64527aa1ac@kernel.org/ [1] Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250408-tcpsendmsg-v3-2-208b87064c28@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
2f41503d64 |
bpf: Convert queue_stack map to rqspinlock
Replace all usage of raw_spinlock_t in queue_stack_maps.c with rqspinlock. This is a map type with a set of open syzbot reports reproducing possible deadlocks. Prior attempt to fix the issues was at [0], but was dropped in favor of this approach. Make sure we return the -EBUSY error in case of possible deadlocks or timeouts, just to make sure user space or BPF programs relying on the error code to detect problems do not break. With these changes, the map should be safe to access in any context, including NMIs. [0]: https://lore.kernel.org/all/20240429165658.1305969-1-sidchintamaneni@gmail.com Reported-by: syzbot+8bdfc2c53fb2b63e1871@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000004c3fc90615f37756@google.com Reported-by: syzbot+252bc5c744d0bba917e1@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000c80abd0616517df9@google.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250410153142.2064340-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
92b90f780d |
bpf: Use architecture provided res_smp_cond_load_acquire
In v2 of rqspinlock [0], we fixed potential problems with WFE usage in arm64 to fallback to a version copied from Ankur's series [1]. This logic was moved into arch-specific headers in v3 [2]. However, we missed using the arch-provided res_smp_cond_load_acquire in commit |
|
|
|
6704b1e8cf |
bpf: Don't allocate per-cpu extra_elems for fd htab
The update of element in fd htab is in-place now, therefore, there is no need to allocate per-cpu extra_elems, just remove it. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
e8a65856c7 |
bpf: Add is_fd_htab() helper
Add is_fd_htab() helper to check whether the map is htab of maps. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
2c304172e0 |
bpf: Support atomic update for htab of maps
As reported by Cody Haas [1], when there is concurrent map lookup and map update operation in an existing element for htab of maps, the map lookup procedure may return -ENOENT unexpectedly. The root cause is twofold: 1) the update of existing element involves two separated list operation In htab_map_update_elem(), it first inserts the new element at the head of list, then it deletes the old element. Therefore, it is possible a lookup operation has already iterated to the middle of the list when a concurrent update operation begins, and the lookup operation will fail to find the target element. 2) the immediate reuse of htab element. It is more subtle. Even through the lookup operation finds the old element, it is possible that the target element has been removed by a concurrent update operation, and the element has been reused immediately by other update operation which runs on the same CPU as the previous update operation, and the element is inserted into the same bucket list. After these steps above, when the lookup operation tries to compare the key in the old element with the expected key, the match will fail because the key in the old element have been overwritten by other update operation. The two-step update process is relatively straightforward to address. The more challenging aspect is the immediate reuse. As Alexei pointed out: So since 2022 both prealloc and no_prealloc reuse elements. We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP that will use _rcu() flavor of freeing into bpf_ma, but it has to have a strong reason. Given that htab of maps doesn't support special field in value and directly stores the inner map pointer in htab_element, just do in-place update for htab of maps instead of attempting to address the immediate reuse issue. [1]: https://lore.kernel.org/xdp-newbies/CAH7f-ULFTwKdoH_t2SFc5rWCVYLEg-14d1fBYWH2eekudsnTRg@mail.gmail.com/ Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
5771e306b6 |
bpf: Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_place
Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_place, and add a new percpu argument for the helper to support in-place update for both per-cpu htab and htab of maps. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
ba2b31b0f3 |
bpf: Factor out htab_elem_value helper()
All hash maps store map key and map value together. The relative offset of the map value compared to the map key is round_up(key_size, 8). Therefore, factor out a common helper htab_elem_value() to calculate the address of the map value instead of duplicating the logic. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
fa6fe07d15
|
VFS: rename lookup_one_len family to lookup_noperm and remove permission check
The lookup_one_len family of functions is (now) only used internally by a filesystem on itself either - in a context where permission checking is irrelevant such as by a virtual filesystem populating itself, or xfs accessing its ORPHANAGE or dquota accessing the quota file; or - in a context where a permission check (MAY_EXEC on the parent) has just been performed such as a network filesystem finding in "silly-rename" file in the same directory. This is also the context after the _parentat() functions where currently lookup_one_qstr_excl() is used. So the permission check is pointless. The name "one_len" is unhelpful in understanding the purpose of these functions and should be changed. Most of the callers pass the len as "strlen()" so using a qstr and QSTR() can simplify the code. This patch renames these functions (include lookup_positive_unlocked() which is part of the family despite the name) to have a name based on "lookup_noperm". They are changed to receive a 'struct qstr' instead of separate name and len. In a few cases the use of QSTR() results in a new call to strlen(). try_lookup_noperm() takes a pointer to a qstr instead of the whole qstr. This is consistent with d_hash_and_lookup() (which is nearly identical) and useful for lookup_noperm_unlocked(). The new lookup_noperm_common() doesn't take a qstr yet. That will be tidied up in a subsequent patch. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/r/20250319031545.2999807-5-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
aa918db707 |
bpf_try_alloc_pages
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmfkHCQACgkQ6rmadz2v
bTrzWhAAnDcJsGgSJ9EbElpTfgBWE7aijXo/MsPxxRhORc0uR6MnhPx1iADP4KYj
lTGEIBgRuDG3qaM4EXpPd32rUJJHv8hot7z9zfvUgSuFNLZEHWXJtz/i4ileOxin
08zV+zA5WL2fqamAmMRFMI37DeSWy3xU0/qlbWgNnURjPjRri6CF4rVFUWq+QMY+
XP8ITD/6nOLUR6Bq2M18aHnk2VJWkxVP9Oi+vz1VHbOjKaJC7ATa1+Q4qMqWyTb1
8IAYWiZR1ZPc214ITaspVzLoLb/wxHxy3QMrdAWAL6sjp0B4J8YxIq1qsBuR1FN7
TxTRQND/+LjqrAgs5AmFqz3ndKmahjGQWnQEh/rDYJtx+sLJk9hfsMIDF8Wmxuwl
RftdV0g9bPljR5Qgc9i8DNtEjoAbNjoP8xLjt9HfQakVl8V9jPe0bxZ5tJDf+T0M
n/VgEjaRzdXqFOLal6Z5wl/jkIn1l1kWQuCMI2z5Z0Ls+PlYX56xdZxfK2Rh3m+e
3W89vqj9ytJ3rZKG8DRsxukuHwnJ+Gia3XI2h/5cc8kEM5ss1Ase8oIkmrwaLd9x
+zVXNoDCCPRQgTStwItW+2YdFmE9uijhEZh9yPwT1/rtFuKd0oSebVIpjih/bGqH
mMN9gYO4+ArSbqku9X2lP3VjMOf6M6SZGm+PzG25PAMGzjqGqwk=
=AHTr
-----END PGP SIGNATURE-----
Merge tag 'bpf_try_alloc_pages' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf try_alloc_pages() support from Alexei Starovoitov:
"The pull includes work from Sebastian, Vlastimil and myself with a lot
of help from Michal and Shakeel.
This is a first step towards making kmalloc reentrant to get rid of
slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc. These patches
make page allocator safe from any context.
Vlastimil kicked off this effort at LSFMM 2024:
https://lwn.net/Articles/974138/
and we continued at LSFMM 2025:
https://lore.kernel.org/all/CAADnVQKfkGxudNUkcPJgwe3nTZ=xohnRshx9kLZBTmR_E1DFEg@mail.gmail.com/
Why:
SLAB wrappers bind memory to a particular subsystem making it
unavailable to the rest of the kernel. Some BPF maps in production
consume Gbytes of preallocated memory. Top 5 in Meta: 1.5G, 1.2G,
1.1G, 300M, 200M. Once we have kmalloc that works in any context BPF
map preallocation won't be necessary.
How:
Synchronous kmalloc/page alloc stack has multiple stages going from
fast to slow: cmpxchg16 -> slab_alloc -> new_slab -> alloc_pages ->
rmqueue_pcplist -> __rmqueue, where rmqueue_pcplist was already
relying on trylock.
This set changes rmqueue_bulk/rmqueue_buddy to attempt a trylock and
return ENOMEM if alloc_flags & ALLOC_TRYLOCK. It then wraps this
functionality into try_alloc_pages() helper. We make sure that the
logic is sane in PREEMPT_RT.
End result: try_alloc_pages()/free_pages_nolock() are safe to call
from any context.
try_kmalloc() for any context with similar trylock approach will
follow. It will use try_alloc_pages() when slab needs a new page.
Though such try_kmalloc/page_alloc() is an opportunistic allocator,
this design ensures that the probability of successful allocation of
small objects (up to one page in size) is high.
Even before we have try_kmalloc(), we already use try_alloc_pages() in
BPF arena implementation and it's going to be used more extensively in
BPF"
* tag 'bpf_try_alloc_pages' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
mm: Fix the flipped condition in gfpflags_allow_spinning()
bpf: Use try_alloc_pages() to allocate pages for bpf needs.
mm, bpf: Use memcg in try_alloc_pages().
memcg: Use trylock to access memcg stock_lock.
mm, bpf: Introduce free_pages_nolock()
mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
locking/local_lock: Introduce localtry_lock_t
|
|
|
|
494e7fe591 |
bpf_res_spin_lock
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmfcq3kACgkQ6rmadz2v
bToxkw/8DHIqjVnzU2O9hbRM1anYo6yM8e34IxCt0ajHTSEVJ93+C161QDWo/6Dk
+RNlaeGekaBUk+QOLb4u+rzZ2eR/pWSm37xuDRAiBCQ+3MgR60gGRaSljpS3IUem
0FvS6C1HObBCEUXMU2rNv/5cJB5/qrQYa9FEEjRvBTLqgQkdS7yaW/KKuZaNb+Ts
KiEeWvPrPSZXStfRGy8Wr4eS2rYhxPAikUR+xde9CM+HtMWwKTCTSp8qXrqA92Dj
Cz9ix01scznuf78QCRDZp09im3lZys8ZQprmPgMxyEscN+CDL7n68wAhmTJq0uo3
3NqIv7zBQ8wMChj0f0HjwZ0Wrj7BJAveY2Q0RterxdzT4vMKdtNkThX46ISaCoX/
XQAAhZHemK6MvBJk+LKkqqMgrD+3FAzvY7O+SCyUBAMs4FK1myRJQihdLXHGfiBU
DMDZE1jsE8qBaeUbz4LIuCy8fx2LhtVwVNwbNIBUZHdyfjxIXnQT/8Cnrgklwy2i
tnYekhAsHDQY+QDkrvJpc4E1vUtiXwSDI5ErcnWdSzctEOyVeUg7OuuGD4riCd1c
emdJmtASM1z9Ajqa1dytDxVaF6wjKlbhQgnKamuex5JLGCK6makk8ZoB+DBfKYHD
VoWummTu8ldf+Dp4ehBh7AbeF2vn4kLqcF1PLRsBO6ytJs4HIt8=
=5O7h
-----END PGP SIGNATURE-----
Merge tag 'bpf_res_spin_lock' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf relisient spinlock support from Alexei Starovoitov:
"This patch set introduces Resilient Queued Spin Lock (or rqspinlock
with res_spin_lock() and res_spin_unlock() APIs).
This is a qspinlock variant which recovers the kernel from a stalled
state when the lock acquisition path cannot make forward progress.
This can occur when a lock acquisition attempt enters a deadlock
situation (e.g. AA, or ABBA), or more generally, when the owner of the
lock (which we’re trying to acquire) isn’t making forward progress.
Deadlock detection is the main mechanism used to provide instant
recovery, with the timeout mechanism acting as a final line of
defense. Detection is triggered immediately when beginning the waiting
loop of a lock slow path.
Additionally, BPF programs attached to different parts of the kernel
can introduce new control flow into the kernel, which increases the
likelihood of deadlocks in code not written to handle reentrancy.
There have been multiple syzbot reports surfacing deadlocks in
internal kernel code due to the diverse ways in which BPF programs can
be attached to different parts of the kernel. By switching the BPF
subsystem’s lock usage to rqspinlock, all of these issues are
mitigated at runtime.
This spin lock implementation allows BPF maps to become safer and
remove mechanisms that have fallen short in assuring safety when
nesting programs in arbitrary ways in the same context or across
different contexts.
We run benchmarks that stress locking scalability and perform
comparison against the baseline (qspinlock). For the rqspinlock case,
we replace the default qspinlock with it in the kernel, such that all
spin locks in the kernel use the rqspinlock slow path. As such,
benchmarks that stress kernel spin locks end up exercising rqspinlock.
More details in the cover letter in commit
|
|
|
|
fa593d0f96 |
bpf-next-6.15
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmfi6ZAACgkQ6rmadz2v
bTpLOg/+J7xUddPMhlpFAUlifQEadE5hmw6v1tXpM3zyKHzUWJiv/qsx3j8/ckgD
D+d4P8bqIbI9SSuIS4oZ0+D9pr/g7GYztnoYZmPiYJ7v2AijPuof5dsagFQE8E2y
rhfbt9KHTMzzkdkTvaAZaITS/HWAoJ2YVRB6gfLex2ghcXYHcgmtKRZniQrbBiFZ
MIXBN8Rg6HP+pUdIVllSXFcQCb3XIgjPONRAos4hr5tIm+3Ku7Jvkgk2H/9vUcoF
bdXAcg8xygyH7eY+1l3e7nEPQlG0jUZEsL+tq+vpdoLRLqlIpAUYmwUvqcmq4dPS
QGFjiUcpDbXlxsUFpzjXHIFto7fXCfND7HEICQPwAncdflIIfYaATSQUfkEexn0a
wBCFlAChrEzAmg2vFl4EeEr0fdSe/3jswrgKx0m6ctKieMjgloBUeeH4fXOpfkhS
9tvhuduVFuronlebM8ew4w9T/mBgbyxkE5KkvP4hNeB3ni3N0K6Mary5/u2HyN1e
lqTlnZxRA4p6lrvxce/mDrR4VSwlKLcSeQVjxAL1afD5KRkuZJnUv7bUhS361vkG
IjNrQX30EisDAz+X7tMn3ndBf9vVatwFT4+c3yaxlQRor1WofhDfT88HPiyB4QqQ
Kdx2EHgbQxJp4vkzhp4/OXlTfkihsMEn8egzZuphdPEQ9Y+Jdwg=
=aN/V
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
"For this merge window we're splitting BPF pull request into three for
higher visibility: main changes, res_spin_lock, try_alloc_pages.
These are the main BPF changes:
- Add DFA-based live registers analysis to improve verification of
programs with loops (Eduard Zingerman)
- Introduce load_acquire and store_release BPF instructions and add
x86, arm64 JIT support (Peilin Ye)
- Fix loop detection logic in the verifier (Eduard Zingerman)
- Drop unnecesary lock in bpf_map_inc_not_zero() (Eric Dumazet)
- Add kfunc for populating cpumask bits (Emil Tsalapatis)
- Convert various shell based tests to selftests/bpf/test_progs
format (Bastien Curutchet)
- Allow passing referenced kptrs into struct_ops callbacks (Amery
Hung)
- Add a flag to LSM bpf hook to facilitate bpf program signing
(Blaise Boscaccy)
- Track arena arguments in kfuncs (Ihor Solodrai)
- Add copy_remote_vm_str() helper for reading strings from remote VM
and bpf_copy_from_user_task_str() kfunc (Jordan Rome)
- Add support for timed may_goto instruction (Kumar Kartikeya
Dwivedi)
- Allow bpf_get_netns_cookie() int cgroup_skb programs (Mahe Tardy)
- Reduce bpf_cgrp_storage_busy false positives when accessing cgroup
local storage (Martin KaFai Lau)
- Introduce bpf_dynptr_copy() kfunc (Mykyta Yatsenko)
- Allow retrieving BTF data with BTF token (Mykyta Yatsenko)
- Add BPF kfuncs to set and get xattrs with 'security.bpf.' prefix
(Song Liu)
- Reject attaching programs to noreturn functions (Yafang Shao)
- Introduce pre-order traversal of cgroup bpf programs (Yonghong
Song)"
* tag 'bpf-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (186 commits)
selftests/bpf: Add selftests for load-acquire/store-release when register number is invalid
bpf: Fix out-of-bounds read in check_atomic_load/store()
libbpf: Add namespace for errstr making it libbpf_errstr
bpf: Add struct_ops context information to struct bpf_prog_aux
selftests/bpf: Sanitize pointer prior fclose()
selftests/bpf: Migrate test_xdp_vlan.sh into test_progs
selftests/bpf: test_xdp_vlan: Rename BPF sections
bpf: clarify a misleading verifier error message
selftests/bpf: Add selftest for attaching fexit to __noreturn functions
bpf: Reject attaching fexit/fmod_ret to __noreturn functions
bpf: Only fails the busy counter check in bpf_cgrp_storage_get if it creates storage
bpf: Make perf_event_read_output accessible in all program types.
bpftool: Using the right format specifiers
bpftool: Add -Wformat-signedness flag to detect format errors
selftests/bpf: Test freplace from user namespace
libbpf: Pass BPF token from find_prog_btf_id to BPF_BTF_GET_FD_BY_ID
bpf: Return prog btf_id without capable check
bpf: BPF token support for BPF_BTF_GET_FD_BY_ID
bpf, x86: Fix objtool warning for timed may_goto
bpf: Check map->record at the beginning of check_and_free_fields()
...
|
|
|
|
1a9239bb42 |
Networking changes for 6.15.
Core & protocols
----------------
- Continue Netlink conversions to per-namespace RTNL lock
(IPv4 routing, routing rules, routing next hops, ARP ioctls).
- Continue extending the use of netdev instance locks. As a driver
opt-in protect queue operations and (in due course) ethtool
operations with the instance lock and not RTNL lock.
- Support collecting TCP timestamps (data submitted, sent, acked)
in BPF, allowing for transparent (to the application) and lower
overhead tracking of TCP RPC performance.
- Tweak existing networking Rx zero-copy infra to support zero-copy
Rx via io_uring.
- Optimize MPTCP performance in single subflow mode by 29%.
- Enable GRO on packets which went thru XDP CPU redirect (were queued
for processing on a different CPU). Improving TCP stream performance
up to 2x.
- Improve performance of contended connect() by 200% by searching
for an available 4-tuple under RCU rather than a spin lock.
Bring an additional 229% improvement by tweaking hash distribution.
- Avoid unconditionally touching sk_tsflags on RX, improving
performance under UDP flood by as much as 10%.
- Avoid skb_clone() dance in ping_rcv() to improve performance under
ping flood.
- Avoid FIB lookup in netfilter if socket is available, 20% perf win.
- Rework network device creation (in-kernel) API to more clearly
identify network namespaces and their roles.
There are up to 4 namespace roles but we used to have just 2 netns
pointer arguments, interpreted differently based on context.
- Use sysfs_break_active_protection() instead of trylock to avoid
deadlocks between unregistering objects and sysfs access.
- Add a new sysctl and sockopt for capping max retransmit timeout
in TCP.
- Support masking port and DSCP in routing rule matches.
- Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.
- Support specifying at what time packet should be sent on AF_XDP
sockets.
- Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin users.
- Add Netlink YAML spec for WiFi (nl80211) and conntrack.
- Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
which only need to be exported when IPv6 support is built as a module.
- Age FDB entries based on Rx not Tx traffic in VxLAN, similar
to normal bridging.
- Allow users to specify source port range for GENEVE tunnels.
- netconsole: allow attaching kernel release, CPU ID and task name
to messages as metadata
Driver API
----------
- Continue rework / fixing of Energy Efficient Ethernet (EEE) across
the SW layers. Delegate the responsibilities to phylink where possible.
Improve its handling in phylib.
- Support symmetric OR-XOR RSS hashing algorithm.
- Support tracking and preserving IRQ affinity by NAPI itself.
- Support loopback mode speed selection for interface selftests.
Device drivers
--------------
- Remove the IBM LCS driver for s390.
- Remove the sb1000 cable modem driver.
- Add support for SFP module access over SMBus.
- Add MCTP transport driver for MCTP-over-USB.
- Enable XDP metadata support in multiple drivers.
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- add PCIe TLP Processing Hints (TPH) support for new AMD platforms
- support dumping RoCE queue state for debug
- opt into instance locking
- Intel (100G, ice, idpf):
- ice: rework MSI-X IRQ management and distribution
- ice: support for E830 devices
- iavf: add support for Rx timestamping
- iavf: opt into instance locking
- nVidia/Mellanox:
- mlx4: use page pool memory allocator for Rx
- mlx5: support for one PTP device per hardware clock
- mlx5: support for 200Gbps per-lane link modes
- mlx5: move IPSec policy check after decryption
- AMD/Solarflare:
- support FW flashing via devlink
- Cisco (enic):
- use page pool memory allocator for Rx
- enable 32, 64 byte CQEs
- get max rx/tx ring size from the device
- Meta (fbnic):
- support flow steering and RSS configuration
- report queue stats
- support TCP segmentation
- support IRQ coalescing
- support ring size configuration
- Marvell/Cavium:
- support AF_XDP
- Wangxun:
- support for PTP clock and timestamping
- Huawei (hibmcge):
- checksum offload
- add more statistics
- Ethernet virtual:
- VirtIO net:
- aggressively suppress Tx completions, improve perf by 96% with
1 CPU and 55% with 2 CPUs
- expose NAPI to IRQ mapping and persist NAPI settings
- Google (gve):
- support XDP in DQO RDA Queue Format
- opt into instance locking
- Microsoft vNIC:
- support BIG TCP
- Ethernet NICs consumer, and embedded:
- Synopsys (stmmac):
- cleanup Tx and Tx clock setting and other link-focused cleanups
- enable SGMII and 2500BASEX mode switching for Intel platforms
- support Sophgo SG2044
- Broadcom switches (b53):
- support for BCM53101
- TI:
- iep: add perout configuration support
- icssg: support XDP
- Cadence (macb):
- implement BQL
- Xilinx (axinet):
- support dynamic IRQ moderation and changing coalescing at runtime
- implement BQL
- report standard stats
- MediaTek:
- support phylink managed EEE
- Intel:
- igc: don't restart the interface on every XDP program change
- RealTek (r8169):
- support reading registers of internal PHYs directly
- increase max jumbo packet size on RTL8125/RTL8126
- Airoha:
- support for RISC-V NPU packet processing unit
- enable scatter-gather and support MTU up to 9kB
- Tehuti (tn40xx):
- support cards with TN4010 MAC and an Aquantia AQR105 PHY
- Ethernet PHYs:
- support for TJA1102S, TJA1121
- dp83tg720: add randomized polling intervals for link detection
- dp83822: support changing the transmit amplitude voltage
- support for LEDs on 88q2xxx
- CAN:
- canxl: support Remote Request Substitution bit access
- flexcan: add S32G2/S32G3 SoC
- WiFi:
- remove cooked monitor support
- strict mode for better AP testing
- basic EPCS support
- OMI RX bandwidth reduction support
- batman-adv: add support for jumbo frames
- WiFi drivers:
- RealTek (rtw88):
- support RTL8814AE and RTL8814AU
- RealTek (rtw89):
- switch using wiphy_lock and wiphy_work
- add BB context to manipulate two PHY as preparation of MLO
- improve BT-coexistence mechanism to play A2DP smoothly
- Intel (iwlwifi):
- add new iwlmld sub-driver for latest HW/FW combinations
- MediaTek (mt76):
- preparation for mt7996 Multi-Link Operation (MLO) support
- Qualcomm/Atheros (ath12k):
- continued work on MLO
- Silabs (wfx):
- Wake-on-WLAN support
- Bluetooth:
- add support for skb TX SND/COMPLETION timestamping
- hci_core: enable buffer flow control for SCO/eSCO
- coredump: log devcd dumps into the monitor
- Bluetooth drivers:
- intel: add support to configure TX power
- nxp: handle bootloader error during cmd5 and cmd7
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmfkLC8ACgkQMUZtbf5S
Irsb5g/+L7oKOf0ALbaV9kxFsoz8AymZfAW9i/27F07omGJGpks8oX6j6rQLgIRO
OQOFcp7XEdDh1+jh82gHVuPrw2/6lchLtW8ARtzdiQKFr5DRjrsbtua6GRc8iBqA
DIRCBFoV2HuMkF39Vr09HMa9AZAT7QR2RLsRGpSq8E8Z8xxKz0X7oujs10PFpMTE
IVKhTrVrk+NDot/IU2hzVpnpup+0ld+T2/ZaBklJGcU8uDffImsqNepHRyCG5UC3
xz74Ju23MAj24Gct+og0yFUooF+lUltKyVm0FYCDCY3bASTwgY01NR3kEH/0NQvM
cywLzd/ngHm/SMD2ggVAHkjZUieiIVHdaZ53dgjDeBOQoVP6p0dgUK7EumXX8Mx4
8ReR2UiGoYRPaq9c4o+IjG4K027MwVK2p+mF1a6MLa+20XcyMbev8FIRbbHtC/V4
z5/FsOAxcuICWkA1hU9bODrrGzIqemmdRgKG8sGuTJCt/kYGAn72/TCATGNSaCJ0
00n2jN1aepa7wtywHJ5MhVzxN9iQX7+geUHXz0BI+lK4e1Pmk+vjGksymb9ai2fk
eQAUV9ekub6q68/J16scD7XeOUM37bTLiMBQeIF8UtZBOJscKiS71zn9QP9Twwxv
P2pm01RDZUI+z5ZX3hc12Pm1vjRHaAh9S1JpAw/pTOVlQ+mAJEM=
=XY0S
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Continue Netlink conversions to per-namespace RTNL lock
(IPv4 routing, routing rules, routing next hops, ARP ioctls)
- Continue extending the use of netdev instance locks. As a driver
opt-in protect queue operations and (in due course) ethtool
operations with the instance lock and not RTNL lock.
- Support collecting TCP timestamps (data submitted, sent, acked) in
BPF, allowing for transparent (to the application) and lower
overhead tracking of TCP RPC performance.
- Tweak existing networking Rx zero-copy infra to support zero-copy
Rx via io_uring.
- Optimize MPTCP performance in single subflow mode by 29%.
- Enable GRO on packets which went thru XDP CPU redirect (were queued
for processing on a different CPU). Improving TCP stream
performance up to 2x.
- Improve performance of contended connect() by 200% by searching for
an available 4-tuple under RCU rather than a spin lock. Bring an
additional 229% improvement by tweaking hash distribution.
- Avoid unconditionally touching sk_tsflags on RX, improving
performance under UDP flood by as much as 10%.
- Avoid skb_clone() dance in ping_rcv() to improve performance under
ping flood.
- Avoid FIB lookup in netfilter if socket is available, 20% perf win.
- Rework network device creation (in-kernel) API to more clearly
identify network namespaces and their roles. There are up to 4
namespace roles but we used to have just 2 netns pointer arguments,
interpreted differently based on context.
- Use sysfs_break_active_protection() instead of trylock to avoid
deadlocks between unregistering objects and sysfs access.
- Add a new sysctl and sockopt for capping max retransmit timeout in
TCP.
- Support masking port and DSCP in routing rule matches.
- Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.
- Support specifying at what time packet should be sent on AF_XDP
sockets.
- Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin
users.
- Add Netlink YAML spec for WiFi (nl80211) and conntrack.
- Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
which only need to be exported when IPv6 support is built as a
module.
- Age FDB entries based on Rx not Tx traffic in VxLAN, similar to
normal bridging.
- Allow users to specify source port range for GENEVE tunnels.
- netconsole: allow attaching kernel release, CPU ID and task name to
messages as metadata
Driver API:
- Continue rework / fixing of Energy Efficient Ethernet (EEE) across
the SW layers. Delegate the responsibilities to phylink where
possible. Improve its handling in phylib.
- Support symmetric OR-XOR RSS hashing algorithm.
- Support tracking and preserving IRQ affinity by NAPI itself.
- Support loopback mode speed selection for interface selftests.
Device drivers:
- Remove the IBM LCS driver for s390
- Remove the sb1000 cable modem driver
- Add support for SFP module access over SMBus
- Add MCTP transport driver for MCTP-over-USB
- Enable XDP metadata support in multiple drivers
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- add PCIe TLP Processing Hints (TPH) support for new AMD
platforms
- support dumping RoCE queue state for debug
- opt into instance locking
- Intel (100G, ice, idpf):
- ice: rework MSI-X IRQ management and distribution
- ice: support for E830 devices
- iavf: add support for Rx timestamping
- iavf: opt into instance locking
- nVidia/Mellanox:
- mlx4: use page pool memory allocator for Rx
- mlx5: support for one PTP device per hardware clock
- mlx5: support for 200Gbps per-lane link modes
- mlx5: move IPSec policy check after decryption
- AMD/Solarflare:
- support FW flashing via devlink
- Cisco (enic):
- use page pool memory allocator for Rx
- enable 32, 64 byte CQEs
- get max rx/tx ring size from the device
- Meta (fbnic):
- support flow steering and RSS configuration
- report queue stats
- support TCP segmentation
- support IRQ coalescing
- support ring size configuration
- Marvell/Cavium:
- support AF_XDP
- Wangxun:
- support for PTP clock and timestamping
- Huawei (hibmcge):
- checksum offload
- add more statistics
- Ethernet virtual:
- VirtIO net:
- aggressively suppress Tx completions, improve perf by 96%
with 1 CPU and 55% with 2 CPUs
- expose NAPI to IRQ mapping and persist NAPI settings
- Google (gve):
- support XDP in DQO RDA Queue Format
- opt into instance locking
- Microsoft vNIC:
- support BIG TCP
- Ethernet NICs consumer, and embedded:
- Synopsys (stmmac):
- cleanup Tx and Tx clock setting and other link-focused
cleanups
- enable SGMII and 2500BASEX mode switching for Intel platforms
- support Sophgo SG2044
- Broadcom switches (b53):
- support for BCM53101
- TI:
- iep: add perout configuration support
- icssg: support XDP
- Cadence (macb):
- implement BQL
- Xilinx (axinet):
- support dynamic IRQ moderation and changing coalescing at
runtime
- implement BQL
- report standard stats
- MediaTek:
- support phylink managed EEE
- Intel:
- igc: don't restart the interface on every XDP program change
- RealTek (r8169):
- support reading registers of internal PHYs directly
- increase max jumbo packet size on RTL8125/RTL8126
- Airoha:
- support for RISC-V NPU packet processing unit
- enable scatter-gather and support MTU up to 9kB
- Tehuti (tn40xx):
- support cards with TN4010 MAC and an Aquantia AQR105 PHY
- Ethernet PHYs:
- support for TJA1102S, TJA1121
- dp83tg720: add randomized polling intervals for link detection
- dp83822: support changing the transmit amplitude voltage
- support for LEDs on 88q2xxx
- CAN:
- canxl: support Remote Request Substitution bit access
- flexcan: add S32G2/S32G3 SoC
- WiFi:
- remove cooked monitor support
- strict mode for better AP testing
- basic EPCS support
- OMI RX bandwidth reduction support
- batman-adv: add support for jumbo frames
- WiFi drivers:
- RealTek (rtw88):
- support RTL8814AE and RTL8814AU
- RealTek (rtw89):
- switch using wiphy_lock and wiphy_work
- add BB context to manipulate two PHY as preparation of MLO
- improve BT-coexistence mechanism to play A2DP smoothly
- Intel (iwlwifi):
- add new iwlmld sub-driver for latest HW/FW combinations
- MediaTek (mt76):
- preparation for mt7996 Multi-Link Operation (MLO) support
- Qualcomm/Atheros (ath12k):
- continued work on MLO
- Silabs (wfx):
- Wake-on-WLAN support
- Bluetooth:
- add support for skb TX SND/COMPLETION timestamping
- hci_core: enable buffer flow control for SCO/eSCO
- coredump: log devcd dumps into the monitor
- Bluetooth drivers:
- intel: add support to configure TX power
- nxp: handle bootloader error during cmd5 and cmd7"
* tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1681 commits)
unix: fix up for "apparmor: add fine grained af_unix mediation"
mctp: Fix incorrect tx flow invalidation condition in mctp-i2c
net: usb: asix: ax88772: Increase phy_name size
net: phy: Introduce PHY_ID_SIZE — minimum size for PHY ID string
net: libwx: fix Tx L4 checksum
net: libwx: fix Tx descriptor content for some tunnel packets
atm: Fix NULL pointer dereference
net: tn40xx: add pci-id of the aqr105-based Tehuti TN4010 cards
net: tn40xx: prepare tn40xx driver to find phy of the TN9510 card
net: tn40xx: create swnode for mdio and aqr105 phy and add to mdiobus
net: phy: aquantia: add essential functions to aqr105 driver
net: phy: aquantia: search for firmware-name in fwnode
net: phy: aquantia: add probe function to aqr105 for firmware loading
net: phy: Add swnode support to mdiobus_scan
gve: add XDP DROP and PASS support for DQ
gve: update XDP allocation path support RX buffer posting
gve: merge packet buffer size fields
gve: update GQ RX to use buf_size
gve: introduce config-based allocation for XDP
gve: remove xdp_xsk_done and xdp_xsk_wakeup statistics
...
|
|
|
|
a50b4fe095 |
A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0 iJbvLJeisXr3GA== =bYAX -----END PGP SIGNATURE----- Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member" * tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits) wifi: rt2x00: Switch to use hrtimer_update_function() io_uring: Use helper function hrtimer_update_function() serial: xilinx_uartps: Use helper function hrtimer_update_function() ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup() RDMA: Switch to use hrtimer_setup() virtio: mem: Switch to use hrtimer_setup() drm/vmwgfx: Switch to use hrtimer_setup() drm/xe/oa: Switch to use hrtimer_setup() drm/vkms: Switch to use hrtimer_setup() drm/msm: Switch to use hrtimer_setup() drm/i915/request: Switch to use hrtimer_setup() drm/i915/uncore: Switch to use hrtimer_setup() drm/i915/pmu: Switch to use hrtimer_setup() drm/i915/perf: Switch to use hrtimer_setup() drm/i915/gvt: Switch to use hrtimer_setup() drm/i915/huc: Switch to use hrtimer_setup() drm/amdgpu: Switch to use hrtimer_setup() stm class: heartbeat: Switch to use hrtimer_setup() i2c: Switch to use hrtimer_setup() iio: Switch to use hrtimer_setup() ... |
|
|
|
e34c38057a |
[ Merge note: this pull request depends on you having merged
two locking commits in the locking tree,
part of the locking-core-2025-03-22 pull request. ]
x86 CPU features support:
- Generate the <asm/cpufeaturemasks.h> header based on build config
(H. Peter Anvin, Xin Li)
- x86 CPUID parsing updates and fixes (Ahmed S. Darwish)
- Introduce the 'setcpuid=' boot parameter (Brendan Jackman)
- Enable modifying CPU bug flags with '{clear,set}puid='
(Brendan Jackman)
- Utilize CPU-type for CPU matching (Pawan Gupta)
- Warn about unmet CPU feature dependencies (Sohil Mehta)
- Prepare for new Intel Family numbers (Sohil Mehta)
Percpu code:
- Standardize & reorganize the x86 percpu layout and
related cleanups (Brian Gerst)
- Convert the stackprotector canary to a regular percpu
variable (Brian Gerst)
- Add a percpu subsection for cache hot data (Brian Gerst)
- Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak)
- Construct __percpu_seg_override from __percpu_seg (Uros Bizjak)
MM:
- Add support for broadcast TLB invalidation using AMD's INVLPGB instruction
(Rik van Riel)
- Rework ROX cache to avoid writable copy (Mike Rapoport)
- PAT: restore large ROX pages after fragmentation
(Kirill A. Shutemov, Mike Rapoport)
- Make memremap(MEMREMAP_WB) map memory as encrypted by default
(Kirill A. Shutemov)
- Robustify page table initialization (Kirill A. Shutemov)
- Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn)
- Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
(Matthew Wilcox)
KASLR:
- x86/kaslr: Reduce KASLR entropy on most x86 systems,
to support PCI BAR space beyond the 10TiB region
(CONFIG_PCI_P2PDMA=y) (Balbir Singh)
CPU bugs:
- Implement FineIBT-BHI mitigation (Peter Zijlstra)
- speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta)
- speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan Gupta)
- RFDS: Exclude P-only parts from the RFDS affected list (Pawan Gupta)
System calls:
- Break up entry/common.c (Brian Gerst)
- Move sysctls into arch/x86 (Joel Granados)
Intel LAM support updates: (Maciej Wieczor-Retman)
- selftests/lam: Move cpu_has_la57() to use cpuinfo flag
- selftests/lam: Skip test if LAM is disabled
- selftests/lam: Test get_user() LAM pointer handling
AMD SMN access updates:
- Add SMN offsets to exclusive region access (Mario Limonciello)
- Add support for debugfs access to SMN registers (Mario Limonciello)
- Have HSMP use SMN through AMD_NODE (Yazen Ghannam)
Power management updates: (Patryk Wlazlyn)
- Allow calling mwait_play_dead with an arbitrary hint
- ACPI/processor_idle: Add FFH state handling
- intel_idle: Provide the default enter_dead() handler
- Eliminate mwait_play_dead_cpuid_hint()
Bootup:
Build system:
- Raise the minimum GCC version to 8.1 (Brian Gerst)
- Raise the minimum LLVM version to 15.0.0
(Nathan Chancellor)
Kconfig: (Arnd Bergmann)
- Add cmpxchg8b support back to Geode CPUs
- Drop 32-bit "bigsmp" machine support
- Rework CONFIG_GENERIC_CPU compiler flags
- Drop configuration options for early 64-bit CPUs
- Remove CONFIG_HIGHMEM64G support
- Drop CONFIG_SWIOTLB for PAE
- Drop support for CONFIG_HIGHPTE
- Document CONFIG_X86_INTEL_MID as 64-bit-only
- Remove old STA2x11 support
- Only allow CONFIG_EISA for 32-bit
Headers:
- Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI headers
(Thomas Huth)
Assembly code & machine code patching:
- x86/alternatives: Simplify alternative_call() interface (Josh Poimboeuf)
- x86/alternatives: Simplify callthunk patching (Peter Zijlstra)
- KVM: VMX: Use named operands in inline asm (Josh Poimboeuf)
- x86/hyperv: Use named operands in inline asm (Josh Poimboeuf)
- x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra)
- x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h>
(Uros Bizjak)
- Use named operands in inline asm (Uros Bizjak)
- Improve performance by using asm_inline() for atomic locking instructions
(Uros Bizjak)
Earlyprintk:
- Harden early_serial (Peter Zijlstra)
NMI handler:
- Add an emergency handler in nmi_desc & use it in nmi_shootdown_cpus()
(Waiman Long)
Miscellaneous fixes and cleanups:
- by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel,
Artem Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst,
Dan Carpenter, Dr. David Alan Gilbert, H. Peter Anvin,
Ingo Molnar, Josh Poimboeuf, Kevin Brodsky, Mike Rapoport,
Lukas Bulwahn, Maciej Wieczor-Retman, Max Grobecker,
Patryk Wlazlyn, Pawan Gupta, Peter Zijlstra,
Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner,
Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak,
Vitaly Kuznetsov, Xin Li, liuye.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfenkQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g1FRAAi6OFTSn/5aeLMI0IMNBxJ6ddQiFc3imd
7+C/vU5nul4CyDs8mKyj/+f/DDrbkG9lKz3VG631Yl237lXHjD8XWcVMeC/1z/q0
3zInDIloE9/nBHRPkF6F7fARBLBZ0LFgaBsGrCo7mwpGybiQdqGcqcxllvTbtXaw
OHta4q6ok+lBDNlfc0v6H4cRnzhmmlKu6Ng0j6UI3V7uFhi3vtxas32ltDQtzorq
2+jbV6/+kbrrv+xPC+jlzOFhTEKRupNPQXmvyQteoQg6G3kqAKMDvBthGXd1rHuX
Qa+BoDIifE/2NiVeRwNrhoqYH/pHCzUzDREW5IW8+ca+4XNKuzAC6EuC8CeCzyK1
q8ZjZjooQW4zEeVFeJYllHONzJYfxfSH5CLsnbcuhq99yfGlrQhF1qL72/Omn1w/
DfPJM8Zt5zyKvLqUg3Md+fkVCO2wyDNhB61QPzRgHF+yD+rvuDpoqvUWir+w7cSn
fwEDVZGXlFx6dumtSrqRaTd1nvFt80s8yP2ll09DMvGQ8D/yruS7hndGAmmJVCSW
NAfd8pSjq5v2+ux2UR92/Cc3VF3SjaUqHBOp/Nq9rESya18ZVa3cJpHhVYYtPIVf
THW0h07RIkGVKs1uq+5ekLCr/8uAZg58UPIqmhTuW0ttymRHCNfohR45FQZzy+0M
tJj1oc2TIZw=
=Dcb3
-----END PGP SIGNATURE-----
Merge tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core x86 updates from Ingo Molnar:
"x86 CPU features support:
- Generate the <asm/cpufeaturemasks.h> header based on build config
(H. Peter Anvin, Xin Li)
- x86 CPUID parsing updates and fixes (Ahmed S. Darwish)
- Introduce the 'setcpuid=' boot parameter (Brendan Jackman)
- Enable modifying CPU bug flags with '{clear,set}puid=' (Brendan
Jackman)
- Utilize CPU-type for CPU matching (Pawan Gupta)
- Warn about unmet CPU feature dependencies (Sohil Mehta)
- Prepare for new Intel Family numbers (Sohil Mehta)
Percpu code:
- Standardize & reorganize the x86 percpu layout and related cleanups
(Brian Gerst)
- Convert the stackprotector canary to a regular percpu variable
(Brian Gerst)
- Add a percpu subsection for cache hot data (Brian Gerst)
- Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak)
- Construct __percpu_seg_override from __percpu_seg (Uros Bizjak)
MM:
- Add support for broadcast TLB invalidation using AMD's INVLPGB
instruction (Rik van Riel)
- Rework ROX cache to avoid writable copy (Mike Rapoport)
- PAT: restore large ROX pages after fragmentation (Kirill A.
Shutemov, Mike Rapoport)
- Make memremap(MEMREMAP_WB) map memory as encrypted by default
(Kirill A. Shutemov)
- Robustify page table initialization (Kirill A. Shutemov)
- Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn)
- Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
(Matthew Wilcox)
KASLR:
- x86/kaslr: Reduce KASLR entropy on most x86 systems, to support PCI
BAR space beyond the 10TiB region (CONFIG_PCI_P2PDMA=y) (Balbir
Singh)
CPU bugs:
- Implement FineIBT-BHI mitigation (Peter Zijlstra)
- speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta)
- speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan
Gupta)
- RFDS: Exclude P-only parts from the RFDS affected list (Pawan
Gupta)
System calls:
- Break up entry/common.c (Brian Gerst)
- Move sysctls into arch/x86 (Joel Granados)
Intel LAM support updates: (Maciej Wieczor-Retman)
- selftests/lam: Move cpu_has_la57() to use cpuinfo flag
- selftests/lam: Skip test if LAM is disabled
- selftests/lam: Test get_user() LAM pointer handling
AMD SMN access updates:
- Add SMN offsets to exclusive region access (Mario Limonciello)
- Add support for debugfs access to SMN registers (Mario Limonciello)
- Have HSMP use SMN through AMD_NODE (Yazen Ghannam)
Power management updates: (Patryk Wlazlyn)
- Allow calling mwait_play_dead with an arbitrary hint
- ACPI/processor_idle: Add FFH state handling
- intel_idle: Provide the default enter_dead() handler
- Eliminate mwait_play_dead_cpuid_hint()
Build system:
- Raise the minimum GCC version to 8.1 (Brian Gerst)
- Raise the minimum LLVM version to 15.0.0 (Nathan Chancellor)
Kconfig: (Arnd Bergmann)
- Add cmpxchg8b support back to Geode CPUs
- Drop 32-bit "bigsmp" machine support
- Rework CONFIG_GENERIC_CPU compiler flags
- Drop configuration options for early 64-bit CPUs
- Remove CONFIG_HIGHMEM64G support
- Drop CONFIG_SWIOTLB for PAE
- Drop support for CONFIG_HIGHPTE
- Document CONFIG_X86_INTEL_MID as 64-bit-only
- Remove old STA2x11 support
- Only allow CONFIG_EISA for 32-bit
Headers:
- Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI
headers (Thomas Huth)
Assembly code & machine code patching:
- x86/alternatives: Simplify alternative_call() interface (Josh
Poimboeuf)
- x86/alternatives: Simplify callthunk patching (Peter Zijlstra)
- KVM: VMX: Use named operands in inline asm (Josh Poimboeuf)
- x86/hyperv: Use named operands in inline asm (Josh Poimboeuf)
- x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra)
- x86/kexec: Merge x86_32 and x86_64 code using macros from
<asm/asm.h> (Uros Bizjak)
- Use named operands in inline asm (Uros Bizjak)
- Improve performance by using asm_inline() for atomic locking
instructions (Uros Bizjak)
Earlyprintk:
- Harden early_serial (Peter Zijlstra)
NMI handler:
- Add an emergency handler in nmi_desc & use it in
nmi_shootdown_cpus() (Waiman Long)
Miscellaneous fixes and cleanups:
- by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel, Artem
Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst, Dan
Carpenter, Dr. David Alan Gilbert, H. Peter Anvin, Ingo Molnar,
Josh Poimboeuf, Kevin Brodsky, Mike Rapoport, Lukas Bulwahn, Maciej
Wieczor-Retman, Max Grobecker, Patryk Wlazlyn, Pawan Gupta, Peter
Zijlstra, Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner,
Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak, Vitaly
Kuznetsov, Xin Li, liuye"
* tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (211 commits)
zstd: Increase DYNAMIC_BMI2 GCC version cutoff from 4.8 to 11.0 to work around compiler segfault
x86/asm: Make asm export of __ref_stack_chk_guard unconditional
x86/mm: Only do broadcast flush from reclaim if pages were unmapped
perf/x86/intel, x86/cpu: Replace Pentium 4 model checks with VFM ones
perf/x86/intel, x86/cpu: Simplify Intel PMU initialization
x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-UAPI headers
x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI headers
x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions
x86/asm: Use asm_inline() instead of asm() in clwb()
x86/asm: Use CLFLUSHOPT and CLWB mnemonics in <asm/special_insns.h>
x86/hweight: Use asm_inline() instead of asm()
x86/hweight: Use ASM_CALL_CONSTRAINT in inline asm()
x86/hweight: Use named operands in inline asm()
x86/stackprotector/64: Only export __ref_stack_chk_guard on CONFIG_SMP
x86/head/64: Avoid Clang < 17 stack protector in startup code
x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h>
x86/runtime-const: Add the RUNTIME_CONST_PTR assembly macro
x86/cpu/intel: Limit the non-architectural constant_tsc model checks
x86/mm/pat: Replace Intel x86_model checks with VFM ones
x86/cpu/intel: Fix fast string initialization for extended Families
...
|
|
|
|
c03bb2fa32 |
bpf: Fix out-of-bounds read in check_atomic_load/store()
syzbot reported the following splat [0]. In check_atomic_load/store(), register validity is not checked before atomic_ptr_type_ok(). This causes the out-of-bounds read in is_ctx_reg() called from atomic_ptr_type_ok() when the register number is MAX_BPF_REG or greater. Call check_load_mem()/check_store_reg() before atomic_ptr_type_ok() to avoid the OOB read. However, some tests introduced by commit |
|
|
|
51d65049cd |
bpf: Add struct_ops context information to struct bpf_prog_aux
This patch adds struct_ops context information to struct bpf_prog_aux. This context information will be used in the kfunc filter. Currently the added context information includes struct_ops member offset and a pointer to struct bpf_struct_ops. Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://patch.msgid.link/20250319215358.2287371-2-ameryhung@gmail.com |
|
|
|
ea21771c07 |
bpf: Maintain FIFO property for rqspinlock unlock
Since out-of-order unlocks are unsupported for rqspinlock, and irqsave variants enforce strict FIFO ordering anyway, make the same change for normal non-irqsave variants, such that FIFO ordering is enforced. Two new verifier state fields (active_lock_id, active_lock_ptr) are used to denote the top of the stack, and prev_id and prev_ptr are ascertained whenever popping the topmost entry through an unlock. Take special care to make these fields part of the state comparison in refsafe. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-25-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
0de2046137 |
bpf: Implement verifier support for rqspinlock
Introduce verifier-side support for rqspinlock kfuncs. The first step is allowing bpf_res_spin_lock type to be defined in map values and allocated objects, so BTF-side is updated with a new BPF_RES_SPIN_LOCK field to recognize and validate. Any object cannot have both bpf_spin_lock and bpf_res_spin_lock, only one of them (and at most one of them per-object, like before) must be present. The bpf_res_spin_lock can also be used to protect objects that require lock protection for their kfuncs, like BPF rbtree and linked list. The verifier plumbing to simulate success and failure cases when calling the kfuncs is done by pushing a new verifier state to the verifier state stack which will verify the failure case upon calling the kfunc. The path where success is indicated creates all lock reference state and IRQ state (if necessary for irqsave variants). In the case of failure, the state clears the registers r0-r5, sets the return value, and skips kfunc processing, proceeding to the next instruction. When marking the return value for success case, the value is marked as 0, and for the failure case as [-MAX_ERRNO, -1]. Then, in the program, whenever user checks the return value as 'if (ret)' or 'if (ret < 0)' the verifier never traverses such branches for success cases, and would be aware that the lock is not held in such cases. We push the kfunc state in check_kfunc_call whenever rqspinlock kfuncs are invoked. We introduce a kfunc_class state to avoid mixing lock irqrestore kfuncs with IRQ state created by bpf_local_irq_save. With all this infrastructure, these kfuncs become usable in programs while satisfying all safety properties required by the kernel. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-24-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
97eb35f3ad |
bpf: Introduce rqspinlock kfuncs
Introduce four new kfuncs, bpf_res_spin_lock, and bpf_res_spin_unlock,
and their irqsave/irqrestore variants, which wrap the rqspinlock APIs.
bpf_res_spin_lock returns a conditional result, depending on whether the
lock was acquired (NULL is returned when lock acquisition succeeds,
non-NULL upon failure). The memory pointed to by the returned pointer
upon failure can be dereferenced after the NULL check to obtain the
error code.
Instead of using the old bpf_spin_lock type, introduce a new type with
the same layout, and the same alignment, but a different name to avoid
type confusion.
Preemption is disabled upon successful lock acquisition, however IRQs
are not. Special kfuncs can be introduced later to allow disabling IRQs
when taking a spin lock. Resilient locks are safe against AA deadlocks,
hence not disabling IRQs currently does not allow violation of kernel
safety.
__irq_flag annotation is used to accept IRQ flags for the IRQ-variants,
with the same semantics as existing bpf_local_irq_{save, restore}.
These kfuncs will require additional verifier-side support in subsequent
commits, to allow programs to hold multiple locks at the same time.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-23-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
47979314c0 |
bpf: Convert lpm_trie.c to rqspinlock
Convert all LPM trie usage of raw_spinlock to rqspinlock. Note that rcu_dereference_protected in trie_delete_elem is switched over to plain rcu_dereference, the RCU read lock should be held from BPF program side or eBPF syscall path, and the trie->lock is just acquired before the dereference. It is not clear the reason the protected variant was used from the commit history, but the above reasoning makes sense so switch over. Closes: https://lore.kernel.org/lkml/000000000000adb08b061413919e@google.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-22-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
f2ac0e5d1c |
bpf: Convert percpu_freelist.c to rqspinlock
Convert the percpu_freelist.c code to use rqspinlock, and remove the extralist fallback and trylock-based acquisitions to avoid deadlocks. Key thing to note is the retained while (true) loop to search through other CPUs when failing to push a node due to locking errors. This retains the behavior of the old code, where it would keep trying until it would be able to successfully push the node back into the freelist of a CPU. Technically, we should start iteration for this loop from raw_smp_processor_id() + 1, but to avoid hitting the edge of nr_cpus, we skip execution in the loop body instead. Closes: https://lore.kernel.org/bpf/CAPPBnEa1_pZ6W24+WwtcNFvTUHTHO7KUmzEbOcMqxp+m2o15qQ@mail.gmail.com Closes: https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-21-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
4fa8d68aa5 |
bpf: Convert hashtab.c to rqspinlock
Convert hashtab.c from raw_spinlock to rqspinlock, and drop the hashed per-cpu counter crud from the code base which is no longer necessary. Closes: https://lore.kernel.org/bpf/675302fd.050a0220.2477f.0004.GAE@google.com Closes: https://lore.kernel.org/bpf/000000000000b3e63e061eed3f6b@google.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-20-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
e2082e32fd |
rqspinlock: Add entry to Makefile, MAINTAINERS
Ensure that the rqspinlock code is only built when the BPF subsystem is compiled in. Depending on queued spinlock support, we may or may not end up building the queued spinlock slowpath, and instead fallback to the test-and-set implementation. Also add entries to MAINTAINERS file. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-18-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
ecbd804752 |
rqspinlock: Add basic support for CONFIG_PARAVIRT
We ripped out PV and virtualization related bits from rqspinlock in an earlier commit, however, a fair lock performs poorly within a virtual machine when the lock holder is preempted. As such, retain the virt_spin_lock fallback to test and set lock, but with timeout and deadlock detection. We can do this by simply depending on the resilient_tas_spin_lock implementation from the previous patch. We don't integrate support for CONFIG_PARAVIRT_SPINLOCKS yet, as that requires more involved algorithmic changes and introduces more complexity. It can be done when the need arises in the future. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-15-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
c9102a68c0 |
rqspinlock: Add a test-and-set fallback
Include a test-and-set fallback when queued spinlock support is not available. Introduce a rqspinlock type to act as a fallback when qspinlock support is absent. Include ifdef guards to ensure the slow path in this file is only compiled when CONFIG_QUEUED_SPINLOCKS=y. Subsequent patches will add further logic to ensure fallback to the test-and-set implementation when queued spinlock support is unavailable on an architecture. Unlike other waiting loops in rqspinlock code, the one for test-and-set has no theoretical upper bound under contention, therefore we need a longer timeout than usual. Bump it up to a second in this case. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-14-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
31158ad02d |
rqspinlock: Add deadlock detection and recovery
While the timeout logic provides guarantees for the waiter's forward progress, the time until a stalling waiter unblocks can still be long. The default timeout of 1/4 sec can be excessively long for some use cases. Additionally, custom timeouts may exacerbate recovery time. Introduce logic to detect common cases of deadlocks and perform quicker recovery. This is done by dividing the time from entry into the locking slow path until the timeout into intervals of 1 ms. Then, after each interval elapses, deadlock detection is performed, while also polling the lock word to ensure we can quickly break out of the detection logic and proceed with lock acquisition. A 'held_locks' table is maintained per-CPU where the entry at the bottom denotes a lock being waited for or already taken. Entries coming before it denote locks that are already held. The current CPU's table can thus be looked at to detect AA deadlocks. The tables from other CPUs can be looked at to discover ABBA situations. Finally, when a matching entry for the lock being taken on the current CPU is found on some other CPU, a deadlock situation is detected. This function can take a long time, therefore the lock word is constantly polled in each loop iteration to ensure we can preempt detection and proceed with lock acquisition, using the is_lock_released check. We set 'spin' member of rqspinlock_timeout struct to 0 to trigger deadlock checks immediately to perform faster recovery. Note: Extending lock word size by 4 bytes to record owner CPU can allow faster detection for ABBA. It is typically the owner which participates in a ABBA situation. However, to keep compatibility with existing lock words in the kernel (struct qspinlock), and given deadlocks are a rare event triggered by bugs, we choose to favor compatibility over faster detection. The release_held_lock_entry function requires an smp_wmb, while the release store on unlock will provide the necessary ordering for us. Add comments to document the subtleties of why this is correct. It is possible for stores to be reordered still, but in the context of the deadlock detection algorithm, a release barrier is sufficient and needn't be stronger for unlock's case. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-13-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
3bb159366a |
rqspinlock: Protect waiters in trylock fallback from stalls
When we run out of maximum rqnodes, the original queued spin lock slow path falls back to a try lock. In such a case, we are again susceptible to stalls in case the lock owner fails to make progress. We use the timeout as a fallback to break out of this loop and return to the caller. This is a fallback for an extreme edge case, when on the same CPU we run out of all 4 qnodes. When could this happen? We are in slow path in task context, we get interrupted by an IRQ, which while in the slow path gets interrupted by an NMI, whcih in the slow path gets another nested NMI, which enters the slow path. All of the interruptions happen after node->count++. We use RES_DEF_TIMEOUT as our spinning duration, but in the case of this fallback, no fairness is guaranteed, so the duration may be too small for contended cases, as the waiting time is not bounded. Since this is an extreme corner case, let's just prefer timing out instead of attempting to spin for longer. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-12-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
164c246571 |
rqspinlock: Protect waiters in queue from stalls
Implement the wait queue cleanup algorithm for rqspinlock. There are three forms of waiters in the original queued spin lock algorithm. The first is the waiter which acquires the pending bit and spins on the lock word without forming a wait queue. The second is the head waiter that is the first waiter heading the wait queue. The third form is of all the non-head waiters queued behind the head, waiting to be signalled through their MCS node to overtake the responsibility of the head. In this commit, we are concerned with the second and third kind. First, we augment the waiting loop of the head of the wait queue with a timeout. When this timeout happens, all waiters part of the wait queue will abort their lock acquisition attempts. This happens in three steps. First, the head breaks out of its loop waiting for pending and locked bits to turn to 0, and non-head waiters break out of their MCS node spin (more on that later). Next, every waiter (head or non-head) attempts to check whether they are also the tail waiter, in such a case they attempt to zero out the tail word and allow a new queue to be built up for this lock. If they succeed, they have no one to signal next in the queue to stop spinning. Otherwise, they signal the MCS node of the next waiter to break out of its spin and try resetting the tail word back to 0. This goes on until the tail waiter is found. In case of races, the new tail will be responsible for performing the same task, as the old tail will then fail to reset the tail word and wait for its next pointer to be updated before it signals the new tail to do the same. We terminate the whole wait queue because of two main reasons. Firstly, we eschew per-waiter timeouts with one applied at the head of the wait queue. This allows everyone to break out faster once we've seen the owner / pending waiter not responding for the timeout duration from the head. Secondly, it avoids complicated synchronization, because when not leaving in FIFO order, prev's next pointer needs to be fixed up etc. Lastly, all of these waiters release the rqnode and return to the caller. This patch underscores the point that rqspinlock's timeout does not apply to each waiter individually, and cannot be relied upon as an upper bound. It is possible for the rqspinlock waiters to return early from a failed lock acquisition attempt as soon as stalls are detected. The head waiter cannot directly WRITE_ONCE the tail to zero, as it may race with a concurrent xchg and a non-head waiter linking its MCS node to the head's MCS node through 'prev->next' assignment. One notable thing is that we must use RES_DEF_TIMEOUT * 2 as our maximum duration for the waiting loop (for the wait queue head), since we may have both the owner and pending bit waiter ahead of us, and in the worst case, need to span their maximum permitted critical section lengths. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-11-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
337ffea51a |
rqspinlock: Protect pending bit owners from stalls
The pending bit is used to avoid queueing in case the lock is uncontended, and has demonstrated benefits for the 2 contender scenario, esp. on x86. In case the pending bit is acquired and we wait for the locked bit to disappear, we may get stuck due to the lock owner not making progress. Hence, this waiting loop must be protected with a timeout check. To perform a graceful recovery once we decide to abort our lock acquisition attempt in this case, we must unset the pending bit since we own it. All waiters undoing their changes and exiting gracefully allows the lock word to be restored to the unlocked state once all participants (owner, waiters) have been recovered, and the lock remains usable. Hence, set the pending bit back to zero before returning to the caller. Introduce a lockevent (rqspinlock_lock_timeout) to capture timeout event statistics. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-10-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
ebababcd03 |
rqspinlock: Hardcode cond_acquire loops for arm64
Currently, for rqspinlock usage, the implementation of smp_cond_load_acquire (and thus, atomic_cond_read_acquire) are susceptible to stalls on arm64, because they do not guarantee that the conditional expression will be repeatedly invoked if the address being loaded from is not written to by other CPUs. When support for event-streams is absent (which unblocks stuck WFE-based loops every ~100us), we may end up being stuck forever. This causes a problem for us, as we need to repeatedly invoke the RES_CHECK_TIMEOUT in the spin loop to break out when the timeout expires. Let us import the smp_cond_load_acquire_timewait implementation Ankur is proposing in [0], and then fallback to it once it is merged. While we rely on the implementation to amortize the cost of sampling check_timeout for us, it will not happen when event stream support is unavailable. This is not the common case, and it would be difficult to fit our logic in the time_expr_ns >= time_limit_ns comparison, hence just let it be. [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com Cc: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-9-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
14c48ee814 |
rqspinlock: Add support for timeouts
Introduce policy macro RES_CHECK_TIMEOUT which can be used to detect when the timeout has expired for the slow path to return an error. It depends on being passed two variables initialized to 0: ts, ret. The 'ts' parameter is of type rqspinlock_timeout. This macro resolves to the (ret) expression so that it can be used in statements like smp_cond_load_acquire to break the waiting loop condition. The 'spin' member is used to amortize the cost of checking time by dispatching to the implementation every 64k iterations. The 'timeout_end' member is used to keep track of the timestamp that denotes the end of the waiting period. The 'ret' parameter denotes the status of the timeout, and can be checked in the slow path to detect timeouts after waiting loops. The 'duration' member is used to store the timeout duration for each waiting loop. The default timeout value defined in the header (RES_DEF_TIMEOUT) is 0.25 seconds. This macro will be used as a condition for waiting loops in the slow path. Since each waiting loop applies a fresh timeout using the same rqspinlock_timeout, we add a new RES_RESET_TIMEOUT as well to ensure the values can be easily reinitialized to the default state. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
a926d09922 |
rqspinlock: Drop PV and virtualization support
Changes to rqspinlock in subsequent commits will be algorithmic modifications, which won't remain in agreement with the implementations of paravirt spin lock and virt_spin_lock support. These future changes include measures for terminating waiting loops in slow path after a certain point. While using a fair lock like qspinlock directly inside virtual machines leads to suboptimal performance under certain conditions, we cannot use the existing virtualization support before we make it resilient as well. Therefore, drop it for now. Note that we need to drop qspinlock_stat.h, as it's only relevant in case of CONFIG_PARAVIRT_SPINLOCKS=y, but we need to keep lock_events.h in the includes, which was indirectly pulled in before. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
30ff133277 |
rqspinlock: Add rqspinlock.h header
This header contains the public declarations usable in the rest of the kernel for rqspinlock. Let's also type alias qspinlock to rqspinlock_t to ensure consistent use of the new lock type. We want to remove dependence on the qspinlock type in later patches as we need to provide a test-and-set fallback, hence begin abstracting away from now onwards. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
a8fcf2a39b |
locking: Copy out qspinlock.c to kernel/bpf/rqspinlock.c
In preparation for introducing a new lock implementation, Resilient Queued Spin Lock, or rqspinlock, we first begin our modifications by using the existing qspinlock.c code as the base. Simply copy the code to a new file and rename functions and variables from 'queued' to 'resilient_queued'. Since we place the file in kernel/bpf, include needs to be relative. This helps each subsequent commit in clearly showing how and where the code is being changed. The only change after a literal copy in this commit is renaming the functions where necessary, and rename qnodes to rqnodes. Let's also use EXPORT_SYMBOL_GPL for rqspinlock slowpath. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
a2598045ea |
bpf: clarify a misleading verifier error message
The current verifier error message states that tail_calls are not allowed in non-JITed programs with BPF-to-BPF calls. While this is accurate, it is not the only scenario where this restriction applies. Some architectures do not support this feature combination even when programs are JITed. This update improves the error message to better reflect these limitations. Suggested-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrea Terzolo <andreaterzolo3@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20250318083551.8192-1-andreaterzolo3@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
cfe816d469 |
bpf: Reject attaching fexit/fmod_ret to __noreturn functions
If we attach fexit/fmod_ret to __noreturn functions, it will cause an
issue that the bpf trampoline image will be left over even if the bpf
link has been destroyed. Take attaching do_exit() with fexit for example.
The fexit works as follows,
bpf_trampoline
+ __bpf_tramp_enter
+ percpu_ref_get(&tr->pcref);
+ call do_exit()
+ __bpf_tramp_exit
+ percpu_ref_put(&tr->pcref);
Since do_exit() never returns, the refcnt of the trampoline image is
never decremented, preventing it from being freed. That can be verified
with as follows,
$ bpftool link show <<<< nothing output
$ grep "bpf_trampoline_[0-9]" /proc/kallsyms
ffffffffc04cb000 t bpf_trampoline_6442526459 [bpf] <<<< leftover
In this patch, all functions annotated with __noreturn are rejected, except
for the following cases:
- Functions that result in a system reboot, such as panic,
machine_real_restart and rust_begin_unwind
- Functions that are never executed by tasks, such as rest_init and
cpu_startup_entry
- Functions implemented in assembly, such as rewind_stack_and_make_dead and
xen_cpu_bringup_again, lack an associated BTF ID.
With this change, attaching fexit probes to functions like do_exit() will
be rejected.
$ ./fexit
libbpf: prog 'fexit': BPF program load failed: -EINVAL
libbpf: prog 'fexit': -- BEGIN PROG LOAD LOG --
Attaching fexit/fmod_ret to __noreturn functions is rejected.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20250318114447.75484-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
f4edc66e48 |
bpf: Only fails the busy counter check in bpf_cgrp_storage_get if it creates storage
The current cgrp storage has a percpu counter, bpf_cgrp_storage_busy, to detect potential deadlock at a spin_lock that the local storage acquires during new storage creation. There are false positives. It turns out to be too noisy in production. For example, a bpf prog may be doing a bpf_cgrp_storage_get on map_a. An IRQ comes in and triggers another bpf_cgrp_storage_get on a different map_b. It will then trigger the false positive deadlock check in the percpu counter. On top of that, both are doing lookup only and no need to create new storage, so practically it does not need to acquire the spin_lock. The bpf_task_storage_get already has a strategy to minimize this false positive by only failing if the bpf_task_storage_get needs to create a new storage and the percpu counter is busy. Creating a new storage is the only time it must acquire the spin_lock. This patch borrows the same idea. Unlike task storage that has a separate variant for tracing (_recur) and non-tracing, this patch stays with one bpf_cgrp_storage_get helper to keep it simple for now in light of the upcoming res_spin_lock. The variable could potentially use a better name noTbusy instead of nobusy. This patch follows the same naming in bpf_task_storage_get for now. I have tested it by temporarily adding noinline to the cgroup_storage_lookup(), traced it by fentry, and the fentry program succeeded in calling bpf_cgrp_storage_get(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250318182759.3676094-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
ae0a457f5d |
bpf: Make perf_event_read_output accessible in all program types.
The perf_event_read_event_output helper is currently only available to tracing protrams, but is useful for other BPF programs like sched_ext schedulers. When the helper is available, provide its bpf_func_proto directly from the bpf base_proto. Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20250318030753.10949-1-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
3775be3417 |
bpftool: Using the right format specifiers
Fixed some formatting specifiers errors, such as using %d for int and %u for unsigned int, as well as other byte-length types. Perform type cast using the type derived from the data type itself, for example, if it's originally an int, it will be cast to unsigned int if forced to unsigned. Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250311112809.81901-3-jiayuan.chen@linux.dev |
|
|
|
07651ccda9 |
bpf: Return prog btf_id without capable check
Return prog's btf_id from bpf_prog_get_info_by_fd regardless of capable check. This patch enables scenario, when freplace program, running from user namespace, requires to query target prog's btf. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20250317174039.161275-3-mykyta.yatsenko5@gmail.com |
|
|
|
0de445d18e |
bpf: BPF token support for BPF_BTF_GET_FD_BY_ID
Currently BPF_BTF_GET_FD_BY_ID requires CAP_SYS_ADMIN, which does not allow running it from user namespace. This creates a problem when freplace program running from user namespace needs to query target program BTF. This patch relaxes capable check from CAP_SYS_ADMIN to CAP_BPF and adds support for BPF token that can be passed in attributes to syscall. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250317174039.161275-2-mykyta.yatsenko5@gmail.com |
|
|
|
bb2243f432 |
bpf: Check map->record at the beginning of check_and_free_fields()
When there are no special fields in the map value, there is no need to invoke bpf_obj_free_fields(). Therefore, checking the validity of map->record in advance. After the change, the benchmark result of the per-cpu update case in map_perf_test increased by 40% under a 16-CPU VM. Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250315150930.1511727-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
38c6104e0b |
bpf: preload: Add MODULE_DESCRIPTION
Modpost complains when extra warnings are enabled: WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/bpf/preload/bpf_preload.o Add a description from the Kconfig help text. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250310134920.4123633-1-arnd@kernel.org ---- Not sure if that description actually fits what the module does. If not, please add a different description instead. Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
082f1db02c |
security: Propagate caller information in bpf hooks
Certain bpf syscall subcommands are available for usage from both userspace and the kernel. LSM modules or eBPF gatekeeper programs may need to take a different course of action depending on whether or not a BPF syscall originated from the kernel or userspace. Additionally, some of the bpf_attr struct fields contain pointers to arbitrary memory. Currently the functionality to determine whether or not a pointer refers to kernel memory or userspace memory is exposed to the bpf verifier, but that information is missing from various LSM hooks. Here we augment the LSM hooks to provide this data, by simply passing a boolean flag indicating whether or not the call originated in the kernel, in any hook that contains a bpf_attr struct that corresponds to a subcommand that may be called from the kernel. Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/20250310221737.821889-2-bboscaccy@linux.microsoft.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
014eb5c2d6 |
bpf: fix missing kdoc string fields in cpumask.c
Some bpf_cpumask-related kfuncs have kdoc strings that are missing return values. Add a the missing descriptions for the return values. Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250309230427.26603-4-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
950ad93df2 |
bpf: add kfunc for populating cpumask bits
Add a helper kfunc that sets the bitmap of a bpf_cpumask from BPF memory. Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Acked-by: Hou Tao <houtao1@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250309230427.26603-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
871ef8d50e |
bpf: correct use/def for may_goto instruction
may_goto instruction does not use any registers,
but in compute_insn_live_regs() it was treated as a regular
conditional jump of kind BPF_K with r0 as source register.
Thus unnecessarily marking r0 as used.
Fixes:
|
|
|
|
0fb3cf6110 |
bpf: use register liveness information for func_states_equal
Liveness analysis DFA computes a set of registers live before each instruction. Leverage this information to skip comparison of dead registers in func_states_equal(). This helps with convergance of iterator processing loops, as bpf_reg_state->live marks can't be used when loops are processed. This has certain performance impact for selftests, here is a veristat listing using `-f "insns_pct>5" -f "!insns<200"` selftests: File Program States (A) States (B) States (DIFF) -------------------- ----------------------------- ---------- ---------- -------------- arena_htab.bpf.o arena_htab_llvm 37 35 -2 (-5.41%) arena_htab_asm.bpf.o arena_htab_asm 37 33 -4 (-10.81%) arena_list.bpf.o arena_list_add 37 22 -15 (-40.54%) dynptr_success.bpf.o test_dynptr_copy 22 16 -6 (-27.27%) dynptr_success.bpf.o test_dynptr_copy_xdp 68 58 -10 (-14.71%) iters.bpf.o checkpoint_states_deletion 918 40 -878 (-95.64%) iters.bpf.o clean_live_states 136 66 -70 (-51.47%) iters.bpf.o iter_nested_deeply_iters 43 37 -6 (-13.95%) iters.bpf.o iter_nested_iters 72 62 -10 (-13.89%) iters.bpf.o iter_pass_iter_ptr_to_subprog 30 26 -4 (-13.33%) iters.bpf.o iter_subprog_iters 68 59 -9 (-13.24%) iters.bpf.o loop_state_deps2 35 32 -3 (-8.57%) iters_css.bpf.o iter_css_for_each 32 29 -3 (-9.38%) pyperf600_iter.bpf.o on_event 286 192 -94 (-32.87%) Total progs: 3578 Old success: 2061 New success: 2061 States diff min: -95.64% States diff max: 0.00% -100 .. -90 %: 1 -55 .. -45 %: 3 -45 .. -35 %: 2 -35 .. -25 %: 5 -20 .. -10 %: 12 -10 .. 0 %: 6 sched_ext: File Program States (A) States (B) States (DIFF) ----------------- ---------------------- ---------- ---------- --------------- bpf.bpf.o lavd_dispatch 8950 7065 -1885 (-21.06%) bpf.bpf.o lavd_init 516 480 -36 (-6.98%) bpf.bpf.o layered_dispatch 662 501 -161 (-24.32%) bpf.bpf.o layered_dump 298 237 -61 (-20.47%) bpf.bpf.o layered_init 523 423 -100 (-19.12%) bpf.bpf.o layered_init_task 24 22 -2 (-8.33%) bpf.bpf.o layered_runnable 151 125 -26 (-17.22%) bpf.bpf.o p2dq_dispatch 66 53 -13 (-19.70%) bpf.bpf.o p2dq_init 170 142 -28 (-16.47%) bpf.bpf.o refresh_layer_cpumasks 120 78 -42 (-35.00%) bpf.bpf.o rustland_init 37 34 -3 (-8.11%) bpf.bpf.o rustland_init 37 34 -3 (-8.11%) bpf.bpf.o rusty_select_cpu 125 108 -17 (-13.60%) scx_central.bpf.o central_dispatch 59 43 -16 (-27.12%) scx_central.bpf.o central_init 39 28 -11 (-28.21%) scx_nest.bpf.o nest_init 58 51 -7 (-12.07%) scx_pair.bpf.o pair_dispatch 142 111 -31 (-21.83%) scx_qmap.bpf.o qmap_dispatch 174 141 -33 (-18.97%) scx_qmap.bpf.o qmap_init 768 654 -114 (-14.84%) Total progs: 216 Old success: 186 New success: 186 States diff min: -35.00% States diff max: 0.00% -35 .. -25 %: 3 -25 .. -20 %: 6 -20 .. -15 %: 6 -15 .. -5 %: 7 -5 .. 0 %: 6 Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250304195024.2478889-5-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
14c8552db6 |
bpf: simple DFA-based live registers analysis
Compute may-live registers before each instruction in the program.
The register is live before the instruction I if it is read by I or
some instruction S following I during program execution and is not
overwritten between I and S.
This information would be used in the next patch as a hint in
func_states_equal().
Use a simple algorithm described in [1] to compute this information:
- define the following:
- I.use : a set of all registers read by instruction I;
- I.def : a set of all registers written by instruction I;
- I.in : a set of all registers that may be alive before I execution;
- I.out : a set of all registers that may be alive after I execution;
- I.successors : a set of instructions S that might immediately
follow I for some program execution;
- associate separate empty sets 'I.in' and 'I.out' with each instruction;
- visit each instruction in a postorder and update corresponding
'I.in' and 'I.out' sets as follows:
I.out = U [S.in for S in I.successors]
I.in = (I.out / I.def) U I.use
(where U stands for set union, / stands for set difference)
- repeat the computation while I.{in,out} changes for any instruction.
On implementation side keep things as simple, as possible:
- check_cfg() already marks instructions EXPLORED in post-order,
modify it to save the index of each EXPLORED instruction in a vector;
- represent I.{in,out,use,def} as bitmasks;
- don't split the program into basic blocks and don't maintain the
work queue, instead:
- do fixed-point computation by visiting each instruction;
- maintain a simple 'changed' flag if I.{in,out} for any instruction
change;
Measurements show that even such simplistic implementation does not
add measurable verification time overhead (for selftests, at-least).
Note on check_cfg() ex_insn_beg/ex_done change:
To avoid out of bounds access to env->cfg.insn_postorder array,
it should be guaranteed that instruction transitions to EXPLORED state
only once. Previously this was not the fact for incorrect programs
with direct calls to exception callbacks.
The 'align' selftest needs adjustment to skip computed insn/live
registers printout. Otherwise it matches lines from the live registers
printout.
[1] https://en.wikipedia.org/wiki/Live-variable_analysis
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250304195024.2478889-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
22f8397495 |
bpf: get_call_summary() utility function
Refactor mark_fastcall_pattern_for_call() to extract a utility
function get_call_summary(). For a helper or kfunc call this function
fills the following information: {num_params, is_void, fastcall}.
This function would be used in the next patch in order to get number
of parameters of a helper or kfunc call.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250304195024.2478889-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
80ca3f1d77 |
bpf: jmp_offset() and verbose_insn() utility functions
Extract two utility functions: - One BPF jump instruction uses .imm field to encode jump offset, while the rest use .off. Encapsulate this detail as jmp_offset() function. - Avoid duplicating instruction printing callback definitions by defining a verbose_insn() function, which disassembles an instruction into the verifier log while hiding this detail. These functions will be used in the next patch. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250304195024.2478889-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
880442305a |
bpf: Introduce load-acquire and store-release instructions
Introduce BPF instructions with load-acquire and store-release
semantics, as discussed in [1]. Define 2 new flags:
#define BPF_LOAD_ACQ 0x100
#define BPF_STORE_REL 0x110
A "load-acquire" is a BPF_STX | BPF_ATOMIC instruction with the 'imm'
field set to BPF_LOAD_ACQ (0x100).
Similarly, a "store-release" is a BPF_STX | BPF_ATOMIC instruction with
the 'imm' field set to BPF_STORE_REL (0x110).
Unlike existing atomic read-modify-write operations that only support
BPF_W (32-bit) and BPF_DW (64-bit) size modifiers, load-acquires and
store-releases also support BPF_B (8-bit) and BPF_H (16-bit). As an
exception, however, 64-bit load-acquires/store-releases are not
supported on 32-bit architectures (to fix a build error reported by the
kernel test robot).
An 8- or 16-bit load-acquire zero-extends the value before writing it to
a 32-bit register, just like ARM64 instruction LDARH and friends.
Similar to existing atomic read-modify-write operations, misaligned
load-acquires/store-releases are not allowed (even if
BPF_F_ANY_ALIGNMENT is set).
As an example, consider the following 64-bit load-acquire BPF
instruction (assuming little-endian):
db 10 00 00 00 01 00 00 r0 = load_acquire((u64 *)(r1 + 0x0))
opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
imm (0x00000100): BPF_LOAD_ACQ
Similarly, a 16-bit BPF store-release:
cb 21 00 00 10 01 00 00 store_release((u16 *)(r1 + 0x0), w2)
opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX
imm (0x00000110): BPF_STORE_REL
In arch/{arm64,s390,x86}/net/bpf_jit_comp.c, have
bpf_jit_supports_insn(..., /*in_arena=*/true) return false for the new
instructions, until the corresponding JIT compiler supports them in
arena.
[1] https://lore.kernel.org/all/20240729183246.4110549-1-yepeilin@google.com/
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/a217f46f0e445fbd573a1a024be5c6bf1d5fe716.1741049567.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
e723608bf4 |
bpf: Add verifier support for timed may_goto
Implement support in the verifier for replacing may_goto implementation from a counter-based approach to one which samples time on the local CPU to have a bigger loop bound. We implement it by maintaining 16-bytes per-stack frame, and using 8 bytes for maintaining the count for amortizing time sampling, and 8 bytes for the starting timestamp. To minimize overhead, we need to avoid spilling and filling of registers around this sequence, so we push this cost into the time sampling function 'arch_bpf_timed_may_goto'. This is a JIT-specific wrapper around bpf_check_timed_may_goto which returns us the count to store into the stack through BPF_REG_AX. All caller-saved registers (r0-r5) are guaranteed to remain untouched. The loop can be broken by returning count as 0, otherwise we dispatch into the function when the count drops to 0, and the runtime chooses to refresh it (by returning count as BPF_MAX_TIMED_LOOPS) or returning 0 and aborting the loop on next iteration. Since the check for 0 is done right after loading the count from the stack, all subsequent cond_break sequences should immediately break as well, of the same loop or subsequent loops in the program. We pass in the stack_depth of the count (and thus the timestamp, by adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be passed in to bpf_check_timed_may_goto as an argument after r1 is saved, by adding the offset to r10/fp. This adjustment will be arch specific, and the next patch will introduce support for x86. Note that depending on loop complexity, time spent in the loop can be more than the current limit (250 ms), but imposing an upper bound on program runtime is an orthogonal problem which will be addressed when program cancellations are supported. The current time afforded by cond_break may not be enough for cases where BPF programs want to implement locking algorithms inline, and use cond_break as a promise to the verifier that they will eventually terminate. Below are some benchmarking numbers on the time taken per-iteration for an empty loop that counts the number of iterations until cond_break fires. For comparison, we compare it against bpf_for/bpf_repeat which is another way to achieve the same number of spins (BPF_MAX_LOOPS). The hardware used for benchmarking was a Sapphire Rapids Intel server with performance governor enabled, mitigations were enabled. +-----------------------------+--------------+--------------+------------------+ | Loop type | Iterations | Time (ms) | Time/iter (ns) | +-----------------------------|--------------+--------------+------------------+ | may_goto | 8388608 | 3 | 0.36 | | timed_may_goto (count=65535)| 589674932 | 250 | 0.42 | | bpf_for | 8388608 | 10 | 1.19 | +-----------------------------+--------------+--------------+------------------+ This gives a good approximation at low overhead while staying close to the current implementation. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250304003239.2390751-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
a752ba4332 |
bpf: Factor out check_load_mem() and check_store_reg()
Extract BPF_LDX and most non-ATOMIC BPF_STX instruction handling logic in do_check() into helper functions to be used later. While we are here, make that comment about "reserved fields" more specific. Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/8b39c94eac2bb7389ff12392ca666f939124ec4f.1740978603.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
2626ffe9f3 |
bpf: Factor out check_atomic_rmw()
Currently, check_atomic() only handles atomic read-modify-write (RMW) instructions. Since we are planning to introduce other types of atomic instructions (i.e., atomic load/store), extract the existing RMW handling logic into its own function named check_atomic_rmw(). Remove the @insn_idx parameter as it is not really necessary. Use 'env->insn_idx' instead, as in other places in verifier.c. Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/6323ac8e73a10a1c8ee547c77ed68cf8eb6b90e1.1740978603.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
66faaea94e |
bpf: Factor out atomic_ptr_type_ok()
Factor out atomic_ptr_type_ok() as a helper function to be used later. Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/e5ef8b3116f3fffce78117a14060ddce05eba52a.1740978603.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
f8ac5a4e1a |
bpf: no longer acquire map_idr_lock in bpf_map_inc_not_zero()
bpf_sk_storage_clone() is the only caller of bpf_map_inc_not_zero() and is holding rcu_read_lock(). map_idr_lock does not add any protection, just remove the cost for passive TCP flows. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kui-Feng Lee <kuifeng@meta.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/r/20250301191315.1532629-1-edumazet@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
e2d8f560d1 |
bpf: Summarize sleepable global subprogs
The verifier currently does not permit global subprog calls when a lock
is held, preemption is disabled, or when IRQs are disabled. This is
because we don't know whether the global subprog calls sleepable
functions or not.
In case of locks, there's an additional reason: functions called by the
global subprog may hold additional locks etc. The verifier won't know
while verifying the global subprog whether it was called in context
where a spin lock is already held by the program.
Perform summarization of the sleepable nature of a global subprog just
like changes_pkt_data and then allow calls to global subprogs for
non-sleepable ones from atomic context.
While making this change, I noticed that RCU read sections had no
protection against sleepable global subprog calls, include it in the
checks and fix this while we're at it.
Care needs to be taken to not allow global subprog calls when regular
bpf_spin_lock is held. When resilient spin locks is held, we want to
potentially have this check relaxed, but not for now.
Also make sure extensions freplacing global functions cannot do so
in case the target is non-sleepable, but the extension is. The other
combination is ok.
Tests are included in the next patch to handle all special conditions.
Fixes:
|
|
|
|
4b82b181a2 |
bpf: Allow pre-ordering for bpf cgroup progs
Currently for bpf progs in a cgroup hierarchy, the effective prog array
is computed from bottom cgroup to upper cgroups (post-ordering). For
example, the following cgroup hierarchy
root cgroup: p1, p2
subcgroup: p3, p4
have BPF_F_ALLOW_MULTI for both cgroup levels.
The effective cgroup array ordering looks like
p3 p4 p1 p2
and at run time, progs will execute based on that order.
But in some cases, it is desirable to have root prog executes earlier than
children progs (pre-ordering). For example,
- prog p1 intends to collect original pkt dest addresses.
- prog p3 will modify original pkt dest addresses to a proxy address for
security reason.
The end result is that prog p1 gets proxy address which is not what it
wants. Putting p1 to every child cgroup is not desirable either as it
will duplicate itself in many child cgroups. And this is exactly a use case
we are encountering in Meta.
To fix this issue, let us introduce a flag BPF_F_PREORDER. If the flag
is specified at attachment time, the prog has higher priority and the
ordering with that flag will be from top to bottom (pre-ordering).
For example, in the above example,
root cgroup: p1, p2
subcgroup: p3, p4
Let us say p2 and p4 are marked with BPF_F_PREORDER. The final
effective array ordering will be
p2 p4 p3 p1
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250224230116.283071-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
daec295a70 |
bpf/helpers: Introduce bpf_dynptr_copy kfunc
Introducing bpf_dynptr_copy kfunc allowing copying data from one dynptr to another. This functionality is useful in scenarios such as capturing XDP data to a ring buffer. The implementation consists of 4 branches: * A fast branch for contiguous buffer capacity in both source and destination dynptrs * 3 branches utilizing __bpf_dynptr_read and __bpf_dynptr_write to copy data to/from non-contiguous buffer Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250226183201.332713-3-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
09206af69c |
bpf/helpers: Refactor bpf_dynptr_read and bpf_dynptr_write
Refactor bpf_dynptr_read and bpf_dynptr_write helpers: extract code into the static functions namely __bpf_dynptr_read and __bpf_dynptr_write, this allows calling these without compiler warnings. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250226183201.332713-2-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
8ef890df40 |
net: move misc netdev_lock flavors to a separate header
Move the more esoteric helpers for netdev instance lock to a dedicated header. This avoids growing netdevice.h to infinity and makes rebuilding the kernel much faster (after touching the header with the helpers). The main netdev_lock() / netdev_unlock() functions are used in static inlines in netdevice.h and will probably be used most commonly, so keep them in netdevice.h. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
0a5c8b2c8c |
bpf: fix a possible NULL deref in bpf_map_offload_map_alloc()
Call bpf_dev_offload_check() before netdev_lock_ops().
This is needed if attr->map_ifindex is not valid.
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000197: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000cb8-0x0000000000000cbf]
RIP: 0010:netdev_need_ops_lock include/linux/netdevice.h:2792 [inline]
RIP: 0010:netdev_lock_ops include/linux/netdevice.h:2803 [inline]
RIP: 0010:bpf_map_offload_map_alloc+0x19a/0x910 kernel/bpf/offload.c:533
Call Trace:
<TASK>
map_create+0x946/0x11c0 kernel/bpf/syscall.c:1455
__sys_bpf+0x6d3/0x820 kernel/bpf/syscall.c:5777
__do_sys_bpf kernel/bpf/syscall.c:5902 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5900 [inline]
__x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5900
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
Fixes:
|
|
|
|
97246d6d21 |
net: hold netdev instance lock during ndo_bpf
Cover the paths that come via bpf system call and XSK bind. Cc: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-10-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
01c7bc5198 |
x86/smp: Move cpu number to percpu hot section
No functional change. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Uros Bizjak <ubizjak@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250303165246.2175811-5-brgerst@gmail.com |
|
|
|
cfdaa618de |
Merge branch 'x86/cpu' into x86/asm, to pick up dependent commits
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
18cdd90aba |
x86/bpf: Fix BPF percpu accesses
Due to this recent commit in the x86 tree: |
|
|
|
88d5baf690
|
Change inode_operations.mkdir to return struct dentry *
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g. on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns. For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.
This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes. Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.
This lookup-after-mkdir requires that the directory remains locked to
avoid races. Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.
To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
NULL - the directory was created and no other dentry was used
ERR_PTR() - an error occurred
non-NULL - this other dentry was spliced in
This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations. Subsequent patches will make
further changes to some file-systems to return a correct dentry.
Not all filesystems reliably result in a positive hashed dentry:
- NFS, cifs, hostfs will sometimes need to perform a lookup of
the name to get inode information. Races could result in this
returning something different. Note that this lookup is
non-atomic which is what we are trying to avoid. Placing the
lookup in filesystem code means it only happens when the filesystem
has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
operation ensures that lookup will be called to correctly populate
the dentry. This could be fixed but I don't think it is important
to any of the users of vfs_mkdir() which look at the dentry.
The recommendation to use
d_drop();d_splice_alias()
is ugly but fits with current practice. A planned future patch will
change this.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
|
|
357660d759 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc5). Conflicts: drivers/net/ethernet/cadence/macb_main.c |
|
|
|
c9eb8102e2 |
bpf: Use try_alloc_pages() to allocate pages for bpf needs.
Use try_alloc_pages() and free_pages_nolock() for BPF needs when context doesn't allow using normal alloc_pages. This is a prerequisite for further work. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20250222024427.30294-7-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
ed16b8a4d1 |
bpf: cpumap: switch to napi_skb_cache_get_bulk()
Now that cpumap uses GRO, which drops unused skb heads to the NAPI
cache, use napi_skb_cache_get_bulk() to try to reuse cached entries
and lower MM layer pressure. Always disable the BH before checking and
running the cpumap-pinned XDP prog and don't re-enable it in between
that and allocating an skb bulk, as we can access the NAPI caches only
from the BH context.
The better GRO aggregates packets, the less new skbs will be allocated.
If an aggregated skb contains 16 frags, this means 15 skbs were returned
to the cache, so next 15 skbs will be built without allocating anything.
The same trafficgen UDP GRO test now shows:
GRO off GRO on
threaded GRO 2.3 4 Mpps
thr bulk GRO 2.4 4.7 Mpps
diff +4 +17 %
Comparing to the baseline cpumap:
baseline 2.7 N/A Mpps
thr bulk GRO 2.4 4.7 Mpps
diff -11 +74 %
Tested-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
|
|
57efe762cd |
bpf: cpumap: reuse skb array instead of a linked list to chain skbs
cpumap still uses linked lists to store a list of skbs to pass to the stack. Now that we don't use listified Rx in favor of napi_gro_receive(), linked list is now an unneeded overhead. Inside the polling loop, we already have an array of skbs. Let's reuse it for skbs passed to cpumap (generic XDP) and keep there in case of XDP_PASS when a program is installed to the map itself. Don't list regular xdp_frames after converting them to skbs as well; store them in the mentioned array (but *before* generic skbs as the latters have lower priority) and call gro_receive_skb() for each array element after they're done. Tested-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
4f8ab26a03 |
bpf: cpumap: switch to GRO from netif_receive_skb_list()
cpumap has its own BH context based on kthread. It has a sane batch size of 8 frames per one cycle. GRO can be used here on its own. Adjust cpumap calls to the upper stack to use GRO API instead of netif_receive_skb_list() which processes skbs by batches, but doesn't involve GRO layer at all. In plenty of tests, GRO performs better than listed receiving even given that it has to calculate full frame checksums on the CPU. As GRO passes the skbs to the upper stack in the batches of @gro_normal_batch, i.e. 8 by default, and skb->dev points to the device where the frame comes from, it is enough to disable GRO netdev feature on it to completely restore the original behaviour: untouched frames will be being bulked and passed to the upper stack by 8, as it was with netif_receive_skb_list(). Tested-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
4e4136c644 |
selftests/bpf: Test gen_pro/epilogue that generate kfuncs
Test gen_prologue and gen_epilogue that generate kfuncs that have not
been seen in the main program.
The main bpf program and return value checks are identical to
pro_epilogue.c introduced in commit
|
|
|
|
d519594ee2 |
bpf: Search and add kfuncs in struct_ops prologue and epilogue
Currently, add_kfunc_call() is only invoked once before the main verification loop. Therefore, the verifier could not find the bpf_kfunc_btf_tab of a new kfunc call which is not seen in user defined struct_ops operators but introduced in gen_prologue or gen_epilogue during do_misc_fixup(). Fix this by searching kfuncs in the patching instruction buffer and add them to prog->aux->kfunc_tab. Signed-off-by: Amery Hung <amery.hung@bytedance.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250225233545.285481-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
f3c2d243a3 |
bpf: abort verification if env->cur_state->loop_entry != NULL
In addition to warning abort verification with -EFAULT.
If env->cur_state->loop_entry != NULL something is irrecoverably
buggy.
Fixes:
|
|
|
|
11ba7ce076 |
bpf: Fix kmemleak warning for percpu hashmap
Vlad Poenaru reported the following kmemleak issue:
unreferenced object 0x606fd7c44ac8 (size 32):
backtrace (crc 0):
pcpu_alloc_noprof+0x730/0xeb0
bpf_map_alloc_percpu+0x69/0xc0
prealloc_init+0x9d/0x1b0
htab_map_alloc+0x363/0x510
map_create+0x215/0x3a0
__sys_bpf+0x16b/0x3e0
__x64_sys_bpf+0x18/0x20
do_syscall_64+0x7b/0x150
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Further investigation shows the reason is due to not 8-byte aligned
store of percpu pointer in htab_elem_set_ptr():
*(void __percpu **)(l->key + key_size) = pptr;
Note that the whole htab_elem alignment is 8 (for x86_64). If the key_size
is 4, that means pptr is stored in a location which is 4 byte aligned but
not 8 byte aligned. In mm/kmemleak.c, scan_block() scans the memory based
on 8 byte stride, so it won't detect above pptr, hence reporting the memory
leak.
In htab_map_alloc(), we already have
htab->elem_size = sizeof(struct htab_elem) +
round_up(htab->map.key_size, 8);
if (percpu)
htab->elem_size += sizeof(void *);
else
htab->elem_size += round_up(htab->map.value_size, 8);
So storing pptr with 8-byte alignment won't cause any problem and can fix
kmemleak too.
The issue can be reproduced with bpf selftest as well:
1. Enable CONFIG_DEBUG_KMEMLEAK config
2. Add a getchar() before skel destroy in test_hash_map() in prog_tests/for_each.c.
The purpose is to keep map available so kmemleak can be detected.
3. run './test_progs -t for_each/hash_map &' and a kmemleak should be reported.
Reported-by: Vlad Poenaru <thevlad@meta.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250224175514.2207227-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
201b62ccc8 |
bpf: Refactor check_ctx_access()
Reduce the variable passing madness surrounding check_ctx_access(). Currently, check_mem_access() passes many pointers to local variables to check_ctx_access(). They are used to initialize "struct bpf_insn_access_aux info" in check_ctx_access() and then passed to is_valid_access(). Then, check_ctx_access() takes the data our from info and write them back the pointers to pass them back. This can be simpilified by moving info up to check_mem_access(). No functional change. Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20250221175644.1822383-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
38f1e66abd |
bpf: Do not allow tail call in strcut_ops program with __ref argument
Reject struct_ops programs with refcounted kptr arguments (arguments tagged with __ref suffix) that tail call. Once a refcounted kptr is passed to a struct_ops program from the kernel, it can be freed or xchged into maps. As there is no guarantee a callee can get the same valid refcounted kptr in the ctx, we cannot allow such usage. Signed-off-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250220221532.1079331-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
bd4319b6c2 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf bpf-6.14-rc4
Cross-merge bpf fixes after downstream PR (bpf-6.14-rc4). Minor conflict: kernel/bpf/btf.c Adjacent changes: kernel/bpf/arena.c kernel/bpf/btf.c kernel/bpf/syscall.c kernel/bpf/verifier.c mm/memory.c Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
319fc77f8f |
BPF fixes:
- Fix a soft-lockup in BPF arena_map_free on 64k page size kernels (Alan Maguire) - Fix a missing allocation failure check in BPF verifier's acquire_lock_state (Kumar Kartikeya Dwivedi) - Fix a NULL-pointer dereference in trace_kfree_skb by adding kfree_skb to the raw_tp_null_args set (Kuniyuki Iwashima) - Fix a deadlock when freeing BPF cgroup storage (Abel Wu) - Fix a syzbot-reported deadlock when holding BPF map's freeze_mutex (Andrii Nakryiko) - Fix a use-after-free issue in bpf_test_init when eth_skb_pkt_type is accessing skb data not containing an Ethernet header (Shigeru Yoshida) - Fix skipping non-existing keys in generic_map_lookup_batch (Yan Zhai) - Several BPF sockmap fixes to address incorrect TCP copied_seq calculations, which prevented correct data reads from recv(2) in user space (Jiayuan Chen) - Two fixes for BPF map lookup nullness elision (Daniel Xu) - Fix a NULL-pointer dereference from vmlinux BTF lookup in bpf_sk_storage_tracing_allowed (Jared Kangas) Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> -----BEGIN PGP SIGNATURE----- iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ7evlRUcZGFuaWVsQGlv Z2VhcmJveC5uZXQACgkQ2yufC7HISIPwHgD/dTvM00Ck4Q73fPivyT7tcqxeXJlD D6ggzWl/SG9LAbwA/2/cSgAM9Jm1g7ddvn/S9QaDYOs5GmFl6urq6krs+tYD =FCs9 -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull BPF fixes from Daniel Borkmann: - Fix a soft-lockup in BPF arena_map_free on 64k page size kernels (Alan Maguire) - Fix a missing allocation failure check in BPF verifier's acquire_lock_state (Kumar Kartikeya Dwivedi) - Fix a NULL-pointer dereference in trace_kfree_skb by adding kfree_skb to the raw_tp_null_args set (Kuniyuki Iwashima) - Fix a deadlock when freeing BPF cgroup storage (Abel Wu) - Fix a syzbot-reported deadlock when holding BPF map's freeze_mutex (Andrii Nakryiko) - Fix a use-after-free issue in bpf_test_init when eth_skb_pkt_type is accessing skb data not containing an Ethernet header (Shigeru Yoshida) - Fix skipping non-existing keys in generic_map_lookup_batch (Yan Zhai) - Several BPF sockmap fixes to address incorrect TCP copied_seq calculations, which prevented correct data reads from recv(2) in user space (Jiayuan Chen) - Two fixes for BPF map lookup nullness elision (Daniel Xu) - Fix a NULL-pointer dereference from vmlinux BTF lookup in bpf_sk_storage_tracing_allowed (Jared Kangas) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests: bpf: test batch lookup on array of maps with holes bpf: skip non exist keys in generic_map_lookup_batch bpf: Handle allocation failure in acquire_lock_state bpf: verifier: Disambiguate get_constant_map_key() errors bpf: selftests: Test constant key extraction on irrelevant maps bpf: verifier: Do not extract constant map keys for irrelevant maps bpf: Fix softlockup in arena_map_free on 64k page kernel net: Add rx_skb of kfree_skb to raw_tp_null_args[]. bpf: Fix deadlock when freeing cgroup storage selftests/bpf: Add strparser test for bpf selftests/bpf: Fix invalid flag of recv() bpf: Disable non stream socket for strparser bpf: Fix wrong copied_seq calculation strparser: Add read_sock callback bpf: avoid holding freeze_mutex during mmap operation bpf: unify VM_WRITE vs VM_MAYWRITE use in BPF map mmaping logic selftests/bpf: Adjust data size to have ETH_HLEN bpf, test_run: Fix use-after-free issue in eth_skb_pkt_type() bpf: Remove unnecessary BTF lookups in bpf_sk_storage_tracing_allowed |
|
|
|
5942246426 |
bpf: Support selective sampling for bpf timestamping
Add the bpf_sock_ops_enable_tx_tstamp kfunc to allow BPF programs to selectively enable TX timestamping on a skb during tcp_sendmsg(). For example, BPF program will limit tracking X numbers of packets and then will stop there instead of tracing all the sendmsgs of matched flow all along. It would be helpful for users who cannot afford to calculate latencies from every sendmsg call probably due to the performance or storage space consideration. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250220072940.99994-12-kerneljasonxing@gmail.com |
|
|
|
f0f8a5b58f |
bpf: Add bpf_copy_from_user_task_str() kfunc
This new kfunc will be able to copy a zero-terminated C strings from another task's address space. This is similar to `bpf_copy_from_user_str()` but reads memory of specified task. Signed-off-by: Jordan Rome <linux@jordanrome.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250213152125.1837400-2-linux@jordanrome.com |
|
|
|
574078b001 |
bpf: fix env->peak_states computation
Compute env->peak_states as a maximum value of sum of env->explored_states and env->free_list size. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250215110411.3236773-11-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
408fcf946b |
bpf: free verifier states when they are no longer referenced
When fixes from patches 1 and 3 are applied, Patrick Somaru reported
an increase in memory consumption for sched_ext iterator-based
programs hitting 1M instructions limit. For example, 2Gb VMs ran out
of memory while verifying a program. Similar behaviour could be
reproduced on current bpf-next master.
Here is an example of such program:
/* verification completes if given 16G or RAM,
* final env->free_list size is 369,960 entries.
*/
SEC("raw_tp")
__flag(BPF_F_TEST_STATE_FREQ)
__success
int free_list_bomb(const void *ctx)
{
volatile char buf[48] = {};
unsigned i, j;
j = 0;
bpf_for(i, 0, 10) {
/* this forks verifier state:
* - verification of current path continues and
* creates a checkpoint after 'if';
* - verification of forked path hits the
* checkpoint and marks it as loop_entry.
*/
if (bpf_get_prandom_u32())
asm volatile ("");
/* this marks 'j' as precise, thus any checkpoint
* created on current iteration would not be matched
* on the next iteration.
*/
buf[j++] = 42;
j %= ARRAY_SIZE(buf);
}
asm volatile (""::"r"(buf));
return 0;
}
Memory consumption increased due to more states being marked as loop
entries and eventually added to env->free_list.
This commit introduces logic to free states from env->free_list during
verification. A state in env->free_list can be freed if:
- it has no child states;
- it is not used as a loop_entry.
This commit:
- updates bpf_verifier_state->used_as_loop_entry to be a counter
that tracks how many states use this one as a loop entry;
- adds a function maybe_free_verifier_state(), which:
- frees a state if its ->branches and ->used_as_loop_entry counters
are both zero;
- if the state is freed, state->loop_entry->used_as_loop_entry is
decremented, and an attempt is made to free state->loop_entry.
In the example above, this approach reduces the maximum number of
states in the free list from 369,960 to 16,223.
However, this approach has its limitations. If the buf size in the
example above is modified to 64, state caching overflows: the state
for j=0 is evicted from the cache before it can be used to stop
traversal. As a result, states in the free list accumulate because
their branch counters do not reach zero.
The effect of this patch on the selftests looks as follows:
File Program Max free list (A) Max free list (B) Max free list (DIFF)
-------------------------------- ------------------------------------ ----------------- ----------------- --------------------
arena_list.bpf.o arena_list_add 17 3 -14 (-82.35%)
bpf_iter_task_stack.bpf.o dump_task_stack 39 9 -30 (-76.92%)
iters.bpf.o checkpoint_states_deletion 265 89 -176 (-66.42%)
iters.bpf.o clean_live_states 19 0 -19 (-100.00%)
profiler2.bpf.o tracepoint__syscalls__sys_enter_kill 102 1 -101 (-99.02%)
profiler3.bpf.o tracepoint__syscalls__sys_enter_kill 144 0 -144 (-100.00%)
pyperf600_iter.bpf.o on_event 15 0 -15 (-100.00%)
pyperf600_nounroll.bpf.o on_event 1170 1158 -12 (-1.03%)
setget_sockopt.bpf.o skops_sockopt 18 0 -18 (-100.00%)
strobemeta_nounroll1.bpf.o on_event 147 83 -64 (-43.54%)
strobemeta_nounroll2.bpf.o on_event 312 209 -103 (-33.01%)
strobemeta_subprogs.bpf.o on_event 124 86 -38 (-30.65%)
test_cls_redirect_subprogs.bpf.o cls_redirect 15 0 -15 (-100.00%)
timer.bpf.o test1 30 15 -15 (-50.00%)
Measured using "do-not-submit" patches from here:
https://github.com/eddyz87/bpf/tree/get-loop-entry-hungup
Reported-by: Patrick Somaru <patsomaru@meta.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250215110411.3236773-10-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
5564ee3abb |
bpf: use list_head to track explored states and free list
The next patch in the set needs the ability to remove individual states from env->free_list while only holding a pointer to the state. Which requires env->free_list to be a doubly linked list. This patch converts env->free_list and struct bpf_verifier_state_list to use struct list_head for this purpose. The change to env->explored_states is collateral. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250215110411.3236773-9-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
590eee4268 |
bpf: do not update state->loop_entry in get_loop_entry()
The patch 9 is simpler if less places modify loop_entry field. The loop deleted by this patch does not affect correctness, but is a performance optimization. However, measurements on selftests and sched_ext programs show that this optimization is unnecessary: - at most 2 steps are done in get_loop_entry(); - most of the time 0 or 1 steps are done in get_loop_entry(). Measured using "do-not-submit" patches from here: https://github.com/eddyz87/bpf/tree/get-loop-entry-hungup Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250215110411.3236773-8-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
bb7abf3049 |
bpf: make state->dfs_depth < state->loop_entry->dfs_depth an invariant
For a generic loop detection algorithm a graph node can be a loop header for itself. However, state loop entries are computed for use in is_state_visited(), where get_loop_entry(state)->branches is checked. is_state_visited() also checks state->branches, thus the case when state == state->loop_entry is not interesting for is_state_visited(). This change does not affect correctness, but simplifies get_loop_entry() a bit and also simplifies change to update_loop_entry() in patch 9. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250215110411.3236773-7-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
c1ce66357f |
bpf: detect infinite loop in get_loop_entry()
Tejun Heo reported an infinite loop in get_loop_entry(), when verifying a sched_ext program layered_dispatch in [1]. After some investigation I'm sure that root cause is fixed by patches 1,3 in this patch-set. To err on the safe side, this commit modifies get_loop_entry() to detect infinite loops and abort verification in such cases. The number of steps get_loop_entry(S) can make while moving along the bpf_verifier_state->loop_entry chain is bounded by the DFS depth of state S. This fact is exploited to implement the check. To avoid dealing with the potential error code returned from get_loop_entry() in update_loop_entry(), remove the get_loop_entry() calls there: - This change does not affect correctness. Loop entries would still be updated during the backward DFS move in update_branch_counts(). - This change does not affect performance. Measurements show that get_loop_entry() performs at most 1 step on selftests and at most 2 steps on sched_ext programs (1 step in 17 cases, 2 steps in 3 cases, measured using "do-not-submit" patches from [2]). [1] https://github.com/sched-ext/scx/ commit f0b27038ea10 ("XXX - kernel stall") [2] https://github.com/eddyz87/bpf/tree/get-loop-entry-hungup Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250215110411.3236773-6-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
9e63fdb0cb |
bpf: don't do clean_live_states when state->loop_entry->branches > 0
verifier.c:is_state_visited() uses RANGE_WITHIN states comparison rules for cached states that have loop_entry with non-zero branches count (meaning that loop_entry's verification is not yet done). The RANGE_WITHIN rules in regsafe()/stacksafe() require register and stack objects types to be identical in current and old states. verifier.c:clean_live_states() replaces registers and stack spills with NOT_INIT/STACK_INVALID marks, if these registers/stack spills are not read in any child state. This means that clean_live_states() works against loop convergence logic under some conditions. See selftest in the next patch for a specific example. Mitigate this by prohibiting clean_verifier_state() when state->loop_entry->branches > 0. This undoes negative verification performance impact of the copy_verifier_state() fix from the previous patch. Below is comparison between master and current patch. selftests: File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) ---------------------------------- ---------------------------- --------- --------- --------------- ---------- ---------- -------------- arena_htab.bpf.o arena_htab_llvm 717 423 -294 (-41.00%) 57 37 -20 (-35.09%) arena_htab_asm.bpf.o arena_htab_asm 597 445 -152 (-25.46%) 47 37 -10 (-21.28%) arena_list.bpf.o arena_list_add 1493 1822 +329 (+22.04%) 30 37 +7 (+23.33%) arena_list.bpf.o arena_list_del 309 261 -48 (-15.53%) 23 15 -8 (-34.78%) iters.bpf.o checkpoint_states_deletion 18125 22154 +4029 (+22.23%) 818 918 +100 (+12.22%) iters.bpf.o iter_nested_deeply_iters 593 367 -226 (-38.11%) 67 43 -24 (-35.82%) iters.bpf.o iter_nested_iters 813 772 -41 (-5.04%) 79 72 -7 (-8.86%) iters.bpf.o iter_subprog_check_stacksafe 155 135 -20 (-12.90%) 15 14 -1 (-6.67%) iters.bpf.o iter_subprog_iters 1094 808 -286 (-26.14%) 88 68 -20 (-22.73%) iters.bpf.o loop_state_deps2 479 356 -123 (-25.68%) 46 35 -11 (-23.91%) iters.bpf.o triple_continue 35 31 -4 (-11.43%) 3 3 +0 (+0.00%) kmem_cache_iter.bpf.o open_coded_iter 63 59 -4 (-6.35%) 7 6 -1 (-14.29%) mptcp_subflow.bpf.o _getsockopt_subflow 501 446 -55 (-10.98%) 25 23 -2 (-8.00%) pyperf600_iter.bpf.o on_event 12339 6379 -5960 (-48.30%) 441 286 -155 (-35.15%) verifier_bits_iter.bpf.o max_words 92 84 -8 (-8.70%) 8 7 -1 (-12.50%) verifier_iterating_callbacks.bpf.o cond_break2 113 192 +79 (+69.91%) 12 21 +9 (+75.00%) sched_ext: File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) ----------------- ---------------------- --------- --------- ----------------- ---------- ---------- ---------------- bpf.bpf.o layered_dispatch 11485 9039 -2446 (-21.30%) 848 662 -186 (-21.93%) bpf.bpf.o layered_dump 7422 5022 -2400 (-32.34%) 681 298 -383 (-56.24%) bpf.bpf.o layered_enqueue 16854 13753 -3101 (-18.40%) 1611 1308 -303 (-18.81%) bpf.bpf.o layered_init 1000001 5549 -994452 (-99.45%) 84672 523 -84149 (-99.38%) bpf.bpf.o layered_runnable 3149 1899 -1250 (-39.70%) 288 151 -137 (-47.57%) bpf.bpf.o p2dq_init 2343 1936 -407 (-17.37%) 201 170 -31 (-15.42%) bpf.bpf.o refresh_layer_cpumasks 16487 1285 -15202 (-92.21%) 1770 120 -1650 (-93.22%) bpf.bpf.o rusty_select_cpu 1937 1386 -551 (-28.45%) 177 125 -52 (-29.38%) scx_central.bpf.o central_dispatch 636 600 -36 (-5.66%) 63 59 -4 (-6.35%) scx_central.bpf.o central_init 913 632 -281 (-30.78%) 48 39 -9 (-18.75%) scx_nest.bpf.o nest_init 636 601 -35 (-5.50%) 60 58 -2 (-3.33%) scx_pair.bpf.o pair_dispatch 1000001 1914 -998087 (-99.81%) 58169 142 -58027 (-99.76%) scx_qmap.bpf.o qmap_dispatch 2393 2187 -206 (-8.61%) 196 174 -22 (-11.22%) scx_qmap.bpf.o qmap_init 16367 22777 +6410 (+39.16%) 603 768 +165 (+27.36%) 'layered_init' and 'pair_dispatch' hit 1M on master, but are verified ok with this patch. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250215110411.3236773-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
bbbc02b744 |
bpf: copy_verifier_state() should copy 'loop_entry' field
The bpf_verifier_state.loop_entry state should be copied by copy_verifier_state(). Otherwise, .loop_entry values from unrelated states would poison env->cur_state. Additionally, env->stack should not contain any states with .loop_entry != NULL. The states in env->stack are yet to be verified, while .loop_entry is set for states that reached an equivalent state. This means that env->cur_state->loop_entry should always be NULL after pop_stack(). See the selftest in the next commit for an example of the program that is not safe yet is accepted by verifier w/o this fix. This change has some verification performance impact for selftests: File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) ---------------------------------- ---------------------------- --------- --------- -------------- ---------- ---------- ------------- arena_htab.bpf.o arena_htab_llvm 717 426 -291 (-40.59%) 57 37 -20 (-35.09%) arena_htab_asm.bpf.o arena_htab_asm 597 445 -152 (-25.46%) 47 37 -10 (-21.28%) arena_list.bpf.o arena_list_del 309 279 -30 (-9.71%) 23 14 -9 (-39.13%) iters.bpf.o iter_subprog_check_stacksafe 155 141 -14 (-9.03%) 15 14 -1 (-6.67%) iters.bpf.o iter_subprog_iters 1094 1003 -91 (-8.32%) 88 83 -5 (-5.68%) iters.bpf.o loop_state_deps2 479 725 +246 (+51.36%) 46 63 +17 (+36.96%) kmem_cache_iter.bpf.o open_coded_iter 63 59 -4 (-6.35%) 7 6 -1 (-14.29%) verifier_bits_iter.bpf.o max_words 92 84 -8 (-8.70%) 8 7 -1 (-12.50%) verifier_iterating_callbacks.bpf.o cond_break2 113 107 -6 (-5.31%) 12 12 +0 (+0.00%) And significant negative impact for sched_ext: File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) ----------------- ---------------------- --------- --------- -------------------- ---------- ---------- ------------------ bpf.bpf.o lavd_init 7039 14723 +7684 (+109.16%) 490 1139 +649 (+132.45%) bpf.bpf.o layered_dispatch 11485 10548 -937 (-8.16%) 848 762 -86 (-10.14%) bpf.bpf.o layered_dump 7422 1000001 +992579 (+13373.47%) 681 31178 +30497 (+4478.27%) bpf.bpf.o layered_enqueue 16854 71127 +54273 (+322.02%) 1611 6450 +4839 (+300.37%) bpf.bpf.o p2dq_dispatch 665 791 +126 (+18.95%) 68 78 +10 (+14.71%) bpf.bpf.o p2dq_init 2343 2980 +637 (+27.19%) 201 237 +36 (+17.91%) bpf.bpf.o refresh_layer_cpumasks 16487 674760 +658273 (+3992.68%) 1770 65370 +63600 (+3593.22%) bpf.bpf.o rusty_select_cpu 1937 40872 +38935 (+2010.07%) 177 3210 +3033 (+1713.56%) scx_central.bpf.o central_dispatch 636 2687 +2051 (+322.48%) 63 227 +164 (+260.32%) scx_nest.bpf.o nest_init 636 815 +179 (+28.14%) 60 73 +13 (+21.67%) scx_qmap.bpf.o qmap_dispatch 2393 3580 +1187 (+49.60%) 196 253 +57 (+29.08%) scx_qmap.bpf.o qmap_dump 233 318 +85 (+36.48%) 22 30 +8 (+36.36%) scx_qmap.bpf.o qmap_init 16367 17436 +1069 (+6.53%) 603 669 +66 (+10.95%) Note 'layered_dump' program, which now hits 1M instructions limit. This impact would be mitigated in the next patch. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250215110411.3236773-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
5644c6b50f |
bpf: skip non exist keys in generic_map_lookup_batch
The generic_map_lookup_batch currently returns EINTR if it fails with
ENOENT and retries several times on bpf_map_copy_value. The next batch
would start from the same location, presuming it's a transient issue.
This is incorrect if a map can actually have "holes", i.e.
"get_next_key" can return a key that does not point to a valid value. At
least the array of maps type may contain such holes legitly. Right now
these holes show up, generic batch lookup cannot proceed any more. It
will always fail with EINTR errors.
Rather, do not retry in generic_map_lookup_batch. If it finds a non
existing element, skip to the next key. This simple solution comes with
a price that transient errors may not be recovered, and the iteration
might cycle back to the first key under parallel deletion. For example,
Hou Tao <houtao@huaweicloud.com> pointed out a following scenario:
For LPM trie map:
(1) ->map_get_next_key(map, prev_key, key) returns a valid key
(2) bpf_map_copy_value() return -ENOMENT
It means the key must be deleted concurrently.
(3) goto next_key
It swaps the prev_key and key
(4) ->map_get_next_key(map, prev_key, key) again
prev_key points to a non-existing key, for LPM trie it will treat just
like prev_key=NULL case, the returned key will be duplicated.
With the retry logic, the iteration can continue to the key next to the
deleted one. But if we directly skip to the next key, the iteration loop
would restart from the first key for the lpm_trie type.
However, not all races may be recovered. For example, if current key is
deleted after instead of before bpf_map_copy_value, or if the prev_key
also gets deleted, then the loop will still restart from the first key
for lpm_tire anyway. For generic lookup it might be better to stay
simple, i.e. just skip to the next key. To guarantee that the output
keys are not duplicated, it is better to implement map type specific
batch operations, which can properly lock the trie and synchronize with
concurrent mutators.
Fixes:
|
|
|
|
deacdc871b |
bpf: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/e4be2486f02a8e0ef5aa42624f1708d23e88ad57.1738746821.git.namcao@linutronix.de |
|
|
|
8d9f547f74 |
bpf: Allow struct_ops prog to return referenced kptr
Allow a struct_ops program to return a referenced kptr if the struct_ops operator's return type is a struct pointer. To make sure the returned pointer continues to be valid in the kernel, several constraints are required: 1) The type of the pointer must matches the return type 2) The pointer originally comes from the kernel (not locally allocated) 3) The pointer is in its unmodified form Implementation wise, a referenced kptr first needs to be allowed to _leak_ in check_reference_leak() if it is in the return register. Then, in check_return_code(), constraints 1-3 are checked. During struct_ops registration, a check is also added to warn about operators with non-struct pointer return. In addition, since the first user, Qdisc_ops::dequeue, allows a NULL pointer to be returned when there is no skb to be dequeued, we will allow a scalar value with value equals to NULL to be returned. In the future when there is a struct_ops user that always expects a valid pointer to be returned from an operator, we may extend tagging to the return value. We can tell the verifier to only allow NULL pointer return if the return value is tagged with MAY_BE_NULL. Signed-off-by: Amery Hung <amery.hung@bytedance.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250217190640.1748177-5-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
a687df2008 |
bpf: Support getting referenced kptr from struct_ops argument
Allows struct_ops programs to acqurie referenced kptrs from arguments by directly reading the argument. The verifier will acquire a reference for struct_ops a argument tagged with "__ref" in the stub function in the beginning of the main program. The user will be able to access the referenced kptr directly by reading the context as long as it has not been released by the program. This new mechanism to acquire referenced kptr (compared to the existing "kfunc with KF_ACQUIRE") is introduced for ergonomic and semantic reasons. In the first use case, Qdisc_ops, an skb is passed to .enqueue in the first argument. This mechanism provides a natural way for users to get a referenced kptr in the .enqueue struct_ops programs and makes sure that a qdisc will always enqueue or drop the skb. Signed-off-by: Amery Hung <amery.hung@bytedance.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250217190640.1748177-3-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
432051806f |
bpf: Make every prog keep a copy of ctx_arg_info
Currently, ctx_arg_info is read-only in the view of the verifier since it is shared among programs of the same attach type. Make each program have their own copy of ctx_arg_info so that we can use it to store program specific information. In the next patch where we support acquiring a referenced kptr through a struct_ops argument tagged with "__ref", ctx_arg_info->ref_obj_id will be used to store the unique reference object id of the argument. This avoids creating a requirement in the verifier that "__ref" tagged arguments must be the first set of references acquired [0]. [0] https://lore.kernel.org/bpf/20241220195619.2022866-2-amery.hung@gmail.com/ Signed-off-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250217190640.1748177-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
6ebc5030e0 |
bpf: Fix array bounds error with may_goto
may_goto uses an additional 8 bytes on the stack, which causes the
interpreters[] array to go out of bounds when calculating index by
stack_size.
1. If a BPF program is rewritten, re-evaluate the stack size. For non-JIT
cases, reject loading directly.
2. For non-JIT cases, calculating interpreters[idx] may still cause
out-of-bounds array access, and just warn about it.
3. For jit_requested cases, the execution of bpf_func also needs to be
warned. So move the definition of function __bpf_prog_ret0_warn out of
the macro definition CONFIG_BPF_JIT_ALWAYS_ON.
Reported-by: syzbot+d2a2c639d03ac200a4f1@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/0000000000000f823606139faa5d@google.com/
Fixes:
|
|
|
|
5646729279 |
bpf: fs/xattr: Add BPF kfuncs to set and remove xattrs
Add the following kfuncs to set and remove xattrs from BPF programs: bpf_set_dentry_xattr bpf_remove_dentry_xattr bpf_set_dentry_xattr_locked bpf_remove_dentry_xattr_locked The _locked version of these kfuncs are called from hooks where dentry->d_inode is already locked. Instead of requiring the user to know which version of the kfuncs to use, the verifier will pick the proper kfunc based on the calling hook. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Matt Bobrowski <mattbobrowski@google.com> Link: https://lore.kernel.org/r/20250130213549.3353349-5-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
7587d735b1 |
bpf: lsm: Add two more sleepable hooks
Add bpf_lsm_inode_removexattr and bpf_lsm_inode_post_removexattr to list sleepable_lsm_hooks. These two hooks are always called from sleepable context. Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Matt Bobrowski <mattbobrowski@google.com> Link: https://lore.kernel.org/r/20250130213549.3353349-4-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
c83e2d970b |
bpf: Add tracepoints with null-able arguments
Some of the tracepoints slipped when we did the first scan, adding them now.
Fixes:
|
|
|
|
ea145d530a |
bpf: define KF_ARENA_* flags for bpf_arena kfuncs
bpf_arena_alloc_pages() and bpf_arena_free_pages() work with the
bpf_arena pointers [1], which is indicated by the __arena macro in the
kernel source code:
#define __arena __attribute__((address_space(1)))
However currently this information is absent from the debug data in
the vmlinux binary. As a consequence, bpf_arena_* kfuncs declarations
in vmlinux.h (produced by bpftool) do not match prototypes expected by
the BPF programs attempting to use these functions.
Introduce a set of kfunc flags to mark relevant types as bpf_arena
pointers. The flags then can be detected by pahole when generating BTF
from vmlinux's DWARF, allowing it to emit corresponding BTF type tags
for the marked kfuncs.
With recently proposed BTF extension [2], these type tags will be
processed by bpftool when dumping vmlinux.h, and corresponding
compiler attributes will be added to the declarations.
[1] https://lwn.net/Articles/961594/
[2] https://lore.kernel.org/bpf/20250130201239.1429648-1-ihor.solodrai@linux.dev/
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20250206003148.2308659-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
8784714d7f |
bpf: Handle allocation failure in acquire_lock_state
The acquire_lock_state function needs to handle possible NULL values
returned by acquire_reference_state, and return -ENOMEM.
Fixes:
|
|
|
|
7968c65815 |
bpf: verifier: Disambiguate get_constant_map_key() errors
Refactor get_constant_map_key() to disambiguate the constant key value from potential error values. In the case that the key is negative, it could be confused for an error. It's not currently an issue, as the verifier seems to track s32 spills as u32. So even if the program wrongly uses a negative value for an arraymap key, the verifier just thinks it's an impossibly high value which gets correctly discarded. Refactor anyways to make things cleaner and prevent potential future issues. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/dfe144259ae7cfc98aa63e1b388a14869a10632a.1738689872.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
884c3a18da |
bpf: verifier: Do not extract constant map keys for irrelevant maps
Previously, we were trying to extract constant map keys for all
bpf_map_lookup_elem(), regardless of map type. This is an issue if the
map has a u64 key and the value is very high, as it can be interpreted
as a negative signed value. This in turn is treated as an error value by
check_func_arg() which causes a valid program to be incorrectly
rejected.
Fix by only extracting constant map keys for relevant maps. This fix
works because nullness elision is only allowed for {PERCPU_}ARRAY maps,
and keys for these are within u32 range. See next commit for an example
via selftest.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/aa868b642b026ff87ba6105ea151bc8693b35932.1738689872.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
517e8a7835 |
bpf: Fix softlockup in arena_map_free on 64k page kernel
On an aarch64 kernel with CONFIG_PAGE_SIZE_64KB=y,
arena_htab tests cause a segmentation fault and soft lockup.
The same failure is not observed with 4k pages on aarch64.
It turns out arena_map_free() is calling
apply_to_existing_page_range() with the address returned by
bpf_arena_get_kern_vm_start(). If this address is not page-aligned
the code ends up calling apply_to_pte_range() with that unaligned
address causing soft lockup.
Fix it by round up GUARD_SZ to PAGE_SIZE << 1 so that the
division by 2 in bpf_arena_get_kern_vm_start() returns
a page-aligned value.
Fixes:
|
|
|
|
5da7e15fb5 |
net: Add rx_skb of kfree_skb to raw_tp_null_args[].
Yan Zhai reported a BPF prog could trigger a null-ptr-deref [0] in trace_kfree_skb if the prog does not check if rx_sk is NULL. Commit |
|
|
|
53ee0d66d7 |
bpf: Allow kind_flag for BTF type and decl tags
BTF type tags and decl tags now may have info->kflag set to 1,
changing the semantics of the tag.
Change BTF verification to permit BTF that makes use of this feature:
* remove kflag check in btf_decl_tag_check_meta(), as both values
are valid
* allow kflag to be set for BTF_KIND_TYPE_TAG type in
btf_ref_type_check_meta()
Make sure kind_flag is NOT set when checking for specific BTF tags,
such as "kptr", "user" etc.
Modify a selftest checking for kflag in decl_tag accordingly.
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250130201239.1429648-6-ihor.solodrai@linux.dev
|
|
|
|
12fdd29d5d |
bpf: Use kallsyms to find the function name of a struct_ops's stub function
In commit
|
|
|
|
c78f4afbd9 |
bpf: Fix deadlock when freeing cgroup storage
The following commit |
|
|
|
af13ff1c33 |
Summary:
All ctl_table declared outside of functions and that remain unmodified after initialization are const qualified. This prevents unintended modifications to proc_handler function pointers by placing them in the .rodata section. This is a continuation of the tree-wide effort started a few releases ago with the constification of the ctl_table struct arguments in the sysctl API done in |
|
|
|
bc27c52eea |
bpf: avoid holding freeze_mutex during mmap operation
We use map->freeze_mutex to prevent races between map_freeze() and
memory mapping BPF map contents with writable permissions. The way we
naively do this means we'll hold freeze_mutex for entire duration of all
the mm and VMA manipulations, which is completely unnecessary. This can
potentially also lead to deadlocks, as reported by syzbot in [0].
So, instead, hold freeze_mutex only during writeability checks, bump
(proactively) "write active" count for the map, unlock the mutex and
proceed with mmap logic. And only if something went wrong during mmap
logic, then undo that "write active" counter increment.
[0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@google.com/
Fixes:
|
|
|
|
98671a0fd1 |
bpf: unify VM_WRITE vs VM_MAYWRITE use in BPF map mmaping logic
For all BPF maps we ensure that VM_MAYWRITE is cleared when
memory-mapping BPF map contents as initially read-only VMA. This is
because in some cases BPF verifier relies on the underlying data to not
be modified afterwards by user space, so once something is mapped
read-only, it shouldn't be re-mmap'ed as read-write.
As such, it's not necessary to check VM_MAYWRITE in bpf_map_mmap() and
map->ops->map_mmap() callbacks: VM_WRITE should be consistently set for
read-write mappings, and if VM_WRITE is not set, there is no way for
user space to upgrade read-only mapping to read-write one.
This patch cleans up this VM_WRITE vs VM_MAYWRITE handling within
bpf_map_mmap(), which is an entry point for any BPF map mmap()-ing
logic. We also drop unnecessary sanitization of VM_MAYWRITE in BPF
ringbuf's map_mmap() callback implementation, as it is already performed
by common code in bpf_map_mmap().
Note, though, that in bpf_map_mmap_{open,close}() callbacks we can't
drop VM_MAYWRITE use, because it's possible (and is outside of
subsystem's control) to have initially read-write memory mapping, which
is subsequently dropped to read-only by user space through mprotect().
In such case, from BPF verifier POV it's read-write data throughout the
lifetime of BPF map, and is counted as "active writer".
But its VMAs will start out as VM_WRITE|VM_MAYWRITE, then mprotect() can
change it to just VM_MAYWRITE (and no VM_WRITE), so when its finally
munmap()'ed and bpf_map_mmap_close() is called, vm_flags will be just
VM_MAYWRITE, but we still need to decrement active writer count with
bpf_map_write_active_dec() as it's still considered to be a read-write
mapping by the rest of BPF subsystem.
Similar reasoning applies to bpf_map_mmap_open(), which is called
whenever mmap(), munmap(), and/or mprotect() forces mm subsystem to
split original VMA into multiple discontiguous VMAs.
Memory-mapping handling is a bit tricky, yes.
Cc: Jann Horn <jannh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250129012246.1515826-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
2ab002c755 |
Driver core and debugfs updates
Here is the big set of driver core and debugfs updates for 6.14-rc1.
It's coming late in the merge cycle as there are a number of merge
conflicts with your tree now, and I wanted to make sure they were
working properly. To resolve them, look in linux-next, and I will send
the "fixup" patch as a response to the pull request.
Included in here is a bunch of driver core, PCI, OF, and platform rust
bindings (all acked by the different subsystem maintainers), hence the
merge conflict with the rust tree, and some driver core api updates to
mark things as const, which will also require some fixups due to new
stuff coming in through other trees in this merge window.
There are also a bunch of debugfs updates from Al, and there is at least
one user that does have a regression with these, but Al is working on
tracking down the fix for it. In my use (and everyone else's linux-next
use), it does not seem like a big issue at the moment.
Here's a short list of the things in here:
- driver core bindings for PCI, platform, OF, and some i/o functions.
We are almost at the "write a real driver in rust" stage now,
depending on what you want to do.
- misc device rust bindings and a sample driver to show how to use
them
- debugfs cleanups in the fs as well as the users of the fs api for
places where drivers got it wrong or were unnecessarily doing things
in complex ways.
- driver core const work, making more of the api take const * for
different parameters to make the rust bindings easier overall.
- other small fixes and updates
All of these have been in linux-next with all of the aforementioned
merge conflicts, and the one debugfs issue, which looks to be resolved
"soon".
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ5koPA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymFHACfT5acDKf2Bov2Lc/5u3vBW/R6ChsAnj+LmgVI
hcDSPodj4szR40RRnzBd
=u5Ey
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core and debugfs updates from Greg KH:
"Here is the big set of driver core and debugfs updates for 6.14-rc1.
Included in here is a bunch of driver core, PCI, OF, and platform rust
bindings (all acked by the different subsystem maintainers), hence the
merge conflict with the rust tree, and some driver core api updates to
mark things as const, which will also require some fixups due to new
stuff coming in through other trees in this merge window.
There are also a bunch of debugfs updates from Al, and there is at
least one user that does have a regression with these, but Al is
working on tracking down the fix for it. In my use (and everyone
else's linux-next use), it does not seem like a big issue at the
moment.
Here's a short list of the things in here:
- driver core rust bindings for PCI, platform, OF, and some i/o
functions.
We are almost at the "write a real driver in rust" stage now,
depending on what you want to do.
- misc device rust bindings and a sample driver to show how to use
them
- debugfs cleanups in the fs as well as the users of the fs api for
places where drivers got it wrong or were unnecessarily doing
things in complex ways.
- driver core const work, making more of the api take const * for
different parameters to make the rust bindings easier overall.
- other small fixes and updates
All of these have been in linux-next with all of the aforementioned
merge conflicts, and the one debugfs issue, which looks to be resolved
"soon""
* tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits)
rust: device: Use as_char_ptr() to avoid explicit cast
rust: device: Replace CString with CStr in property_present()
devcoredump: Constify 'struct bin_attribute'
devcoredump: Define 'struct bin_attribute' through macro
rust: device: Add property_present()
saner replacement for debugfs_rename()
orangefs-debugfs: don't mess with ->d_name
octeontx2: don't mess with ->d_parent or ->d_parent->d_name
arm_scmi: don't mess with ->d_parent->d_name
slub: don't mess with ->d_name
sof-client-ipc-flood-test: don't mess with ->d_name
qat: don't mess with ->d_name
xhci: don't mess with ->d_iname
mtu3: don't mess wiht ->d_iname
greybus/camera - stop messing with ->d_iname
mediatek: stop messing with ->d_iname
netdevsim: don't embed file_operations into your structs
b43legacy: make use of debugfs_get_aux()
b43: stop embedding struct file_operations into their objects
carl9170: stop embedding file_operations into their objects
...
|
|
|
|
1751f872cc |
treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit
|
|
|
|
9c5968db9e |
The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec. - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones. - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest. - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code. - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups. - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code. - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c. - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator. - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading. - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/). Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED). - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled. - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL. - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests. - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size. - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic. - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated. - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated. - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed. - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic. - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy. - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic. - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions. - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed. - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting. - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface. - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior. - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi "introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors." - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram. - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal. - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance. - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation. - "Cleanup for memfd_create()" from Isaac Manjarres does that thing. - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration. - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices. - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE= =Ib2s -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ... |
|
|
|
6bf9b5b40a |
mm: alloc_pages_bulk: rename API
The previous commit removed the page_list argument from alloc_pages_bulk_noprof() along with the alloc_pages_bulk_list() function. Now that only the *_array() flavour of the API remains, we can do the following renaming (along with the _noprof() ones): alloc_pages_bulk_array -> alloc_pages_bulk alloc_pages_bulk_array_mempolicy -> alloc_pages_bulk_mempolicy alloc_pages_bulk_array_node -> alloc_pages_bulk_node Link: https://lkml.kernel.org/r/275a3bbc0be20fbe9002297d60045e67ab3d4ada.1734991165.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
d0d106a2bd |
bpf-next-6.14
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmeOu1YACgkQ6rmadz2v
bTrrHxAAn6eqEsluWnDlzhI0OGsPjvgS00sf+MOeqiXYeS2eJ8yJuKifp38+nIQZ
lIplsWU2ReUY20eizPqLPnQ7TXZGvLgp08E8yHUoZ0siWanqr9iDRfbZCCNrDMNm
lMqeR1SLapMws2R/UX9JbvPn2ajIJ6Lb4wxenTfdlW6q+0hAGM6Dt0k/jBod+quq
/oo+xwG3L0q4APBovJfiAFN2z6IYN03b+zLiOrpIJtMACGewEXnl3m4mkL8ZM/FV
nZGPIxIUPXCpKTGEkNqxfkrnHN2wZQ4ZSKEJ6lhEEp4jrgCVITaGZ/E7jlx6fZoj
bbd4YMonIPo9Nhim8p1dt8yYBhKKiE5IXIq0GqlMv5+MvAN8ylrlydpsouW1fu66
hZ1W1BxbxmrgyF0Bwo9JPOMhBHwMrmD6iH9LgiMpZf0ASeF+q9cJpoSOU5j5E9XB
LpLIRf5jYTd4wZjhDmrQREReLo+Bng9DlCBu+jjh2+YTz6l6Qed+ETpENcd7lL5i
IHZVbgD2RVPNJoUfdrd763HfYfDTk+50MF5FIMEyfKHz11if0E/LhBMzto22hm6b
2f8ruj/8yvg8s2dxEP3ySQgcnynlwEnGxLenUVv7uEOYKeWri1rq+fvTK5ne1OLK
oHnTlkViwQb74c0r8cFW+nkyfUYTfhhBAql14rl/fMjGDO2KZ10=
=f2CA
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
"A smaller than usual release cycle.
The main changes are:
- Prepare selftest to run with GCC-BPF backend (Ihor Solodrai)
In addition to LLVM-BPF runs the BPF CI now runs GCC-BPF in compile
only mode. Half of the tests are failing, since support for
btf_decl_tag is still WIP, but this is a great milestone.
- Convert various samples/bpf to selftests/bpf/test_progs format
(Alexis Lothoré and Bastien Curutchet)
- Teach verifier to recognize that array lookup with constant
in-range index will always succeed (Daniel Xu)
- Cleanup migrate disable scope in BPF maps (Hou Tao)
- Fix bpf_timer destroy path in PREEMPT_RT (Hou Tao)
- Always use bpf_mem_alloc in bpf_local_storage in PREEMPT_RT (Martin
KaFai Lau)
- Refactor verifier lock support (Kumar Kartikeya Dwivedi)
This is a prerequisite for upcoming resilient spin lock.
- Remove excessive 'may_goto +0' instructions in the verifier that
LLVM leaves when unrolls the loops (Yonghong Song)
- Remove unhelpful bpf_probe_write_user() warning message (Marco
Elver)
- Add fd_array_cnt attribute for prog_load command (Anton Protopopov)
This is a prerequisite for upcoming support for static_branch"
* tag 'bpf-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (125 commits)
selftests/bpf: Add some tests related to 'may_goto 0' insns
bpf: Remove 'may_goto 0' instruction in opt_remove_nops()
bpf: Allow 'may_goto 0' instruction in verifier
selftests/bpf: Add test case for the freeing of bpf_timer
bpf: Cancel the running bpf_timer through kworker for PREEMPT_RT
bpf: Free element after unlock in __htab_map_lookup_and_delete_elem()
bpf: Bail out early in __htab_map_lookup_and_delete_elem()
bpf: Free special fields after unlock in htab_lru_map_delete_node()
tools: Sync if_xdp.h uapi tooling header
libbpf: Work around kernel inconsistently stripping '.llvm.' suffix
bpf: selftests: verifier: Add nullness elision tests
bpf: verifier: Support eliding map lookup nullness
bpf: verifier: Refactor helper access type tracking
bpf: tcp: Mark bpf_load_hdr_opt() arg2 as read-write
bpf: verifier: Add missing newline on verbose() call
selftests/bpf: Add distilled BTF test about marking BTF_IS_EMBEDDED
libbpf: Fix incorrect traversal end type ID when marking BTF_IS_EMBEDDED
libbpf: Fix return zero when elf_begin failed
selftests/bpf: Fix btf leak on new btf alloc failure in btf_distill test
veristat: Load struct_ops programs only once
...
|
|
|
|
0c35ca252a |
bpf: Remove 'may_goto 0' instruction in opt_remove_nops()
Since 'may_goto 0' insns are actually no-op, let us remove them. Otherwise, verifier will generate code like /* r10 - 8 stores the implicit loop count */ r11 = *(u64 *)(r10 -8) if r11 == 0x0 goto pc+2 r11 -= 1 *(u64 *)(r10 -8) = r11 which is the pure overhead. The following code patterns (from the previous commit) are also handled: may_goto 2 may_goto 1 may_goto 0 With this commit, the above three 'may_goto' insns are all eliminated. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250118192029.2124584-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
aefaa4313b |
bpf: Allow 'may_goto 0' instruction in verifier
Commit
|
|
|
|
58f038e6d2 |
bpf: Cancel the running bpf_timer through kworker for PREEMPT_RT
During the update procedure, when overwrite element in a pre-allocated
htab, the freeing of old_element is protected by the bucket lock. The
reason why the bucket lock is necessary is that the old_element has
already been stashed in htab->extra_elems after alloc_htab_elem()
returns. If freeing the old_element after the bucket lock is unlocked,
the stashed element may be reused by concurrent update procedure and the
freeing of old_element will run concurrently with the reuse of the
old_element. However, the invocation of check_and_free_fields() may
acquire a spin-lock which violates the lockdep rule because its caller
has already held a raw-spin-lock (bucket lock). The following warning
will be reported when such race happens:
BUG: scheduling while atomic: test_progs/676/0x00000003
3 locks held by test_progs/676:
#0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830
#1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500
#2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0
Modules linked in: bpf_testmod(O)
Preemption disabled at:
[<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500
CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11
Tainted: [W]=WARN, [O]=OOT_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)...
Call Trace:
<TASK>
dump_stack_lvl+0x57/0x70
dump_stack+0x10/0x20
__schedule_bug+0x120/0x170
__schedule+0x300c/0x4800
schedule_rtlock+0x37/0x60
rtlock_slowlock_locked+0x6d9/0x54c0
rt_spin_lock+0x168/0x230
hrtimer_cancel_wait_running+0xe9/0x1b0
hrtimer_cancel+0x24/0x30
bpf_timer_delete_work+0x1d/0x40
bpf_timer_cancel_and_free+0x5e/0x80
bpf_obj_free_fields+0x262/0x4a0
check_and_free_fields+0x1d0/0x280
htab_map_update_elem+0x7fc/0x1500
bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43
bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e
bpf_prog_test_run_syscall+0x322/0x830
__sys_bpf+0x135d/0x3ca0
__x64_sys_bpf+0x75/0xb0
x64_sys_call+0x1b5/0xa10
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
...
</TASK>
It seems feasible to break the reuse and refill of per-cpu extra_elems
into two independent parts: reuse the per-cpu extra_elems with bucket
lock being held and refill the old_element as per-cpu extra_elems after
the bucket lock is unlocked. However, it will make the concurrent
overwrite procedures on the same CPU return unexpected -E2BIG error when
the map is full.
Therefore, the patch fixes the lock problem by breaking the cancelling
of bpf_timer into two steps for PREEMPT_RT:
1) use hrtimer_try_to_cancel() and check its return value
2) if the timer is running, use hrtimer_cancel() through a kworker to
cancel it again
Considering that the current implementation of hrtimer_cancel() will try
to acquire a being held softirq_expiry_lock when the current timer is
running, these steps above are reasonable. However, it also has
downside. When the timer is running, the cancelling of the timer is
delayed when releasing the last map uref. The delay is also fixable
(e.g., break the cancelling of bpf timer into two parts: one part in
locked scope, another one in unlocked scope), it can be revised later if
necessary.
It is a bit hard to decide the right fix tag. One reason is that the
problem depends on PREEMPT_RT which is enabled in v6.12. Considering the
softirq_expiry_lock lock exists since v5.4 and bpf_timer is introduced
in v5.15, the bpf_timer commit is used in the fixes tag and an extra
depends-on tag is added to state the dependency on PREEMPT_RT.
Fixes:
|
|
|
|
47363f1553 |
bpf: Free element after unlock in __htab_map_lookup_and_delete_elem()
The freeing of special fields in map value may acquire a spin-lock (e.g., the freeing of bpf_timer), however, the lookup_and_delete_elem procedure has already held a raw-spin-lock, which violates the lockdep rule. The running context of __htab_map_lookup_and_delete_elem() has already disabled the migration. Therefore, it is OK to invoke free_htab_elem() after unlocking the bucket lock. Fix the potential problem by freeing element after unlocking bucket lock in __htab_map_lookup_and_delete_elem(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250117101816.2101857-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
588c6ead32 |
bpf: Bail out early in __htab_map_lookup_and_delete_elem()
Use goto statement to bail out early when the target element is not found, instead of using a large else branch to handle the more likely case. This change doesn't affect functionality and simply make the code cleaner. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Link: https://lore.kernel.org/r/20250117101816.2101857-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
45dc92c32a |
bpf: Free special fields after unlock in htab_lru_map_delete_node()
When bpf_timer is used in LRU hash map, calling check_and_free_fields() in htab_lru_map_delete_node() will invoke bpf_timer_cancel_and_free() to free the bpf_timer. If the timer is running on other CPUs, hrtimer_cancel() will invoke hrtimer_cancel_wait_running() to spin on current CPU to wait for the completion of the hrtimer callback. Considering that the deletion has already acquired a raw-spin-lock (bucket lock). To reduce the time holding the bucket lock, move the invocation of check_and_free_fields() out of bucket lock. However, because htab_lru_map_delete_node() is invoked with LRU raw spin lock being held, the freeing of special fields still happens in a locked scope. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Link: https://lore.kernel.org/r/20250117101816.2101857-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
d2102f2f5d |
bpf: verifier: Support eliding map lookup nullness
This commit allows progs to elide a null check on statically known map lookup keys. In other words, if the verifier can statically prove that the lookup will be in-bounds, allow the prog to drop the null check. This is useful for two reasons: 1. Large numbers of nullness checks (especially when they cannot fail) unnecessarily pushes prog towards BPF_COMPLEXITY_LIMIT_JMP_SEQ. 2. It forms a tighter contract between programmer and verifier. For (1), bpftrace is starting to make heavier use of percpu scratch maps. As a result, for user scripts with large number of unrolled loops, we are starting to hit jump complexity verification errors. These percpu lookups cannot fail anyways, as we only use static key values. Eliding nullness probably results in less work for verifier as well. For (2), percpu scratch maps are often used as a larger stack, as the currrent stack is limited to 512 bytes. In these situations, it is desirable for the programmer to express: "this lookup should never fail, and if it does, it means I messed up the code". By omitting the null check, the programmer can "ask" the verifier to double check the logic. Tests also have to be updated in sync with these changes, as the verifier is more efficient with this change. Notable, iters.c tests had to be changed to use a map type that still requires null checks, as it's exercising verifier tracking logic w.r.t iterators. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/68f3ea96ff3809a87e502a11a4bd30177fc5823e.1736886479.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
37cce22dbd |
bpf: verifier: Refactor helper access type tracking
Previously, the verifier was treating all PTR_TO_STACK registers passed to a helper call as potentially written to by the helper. However, all calls to check_stack_range_initialized() already have precise access type information available. Rather than treat ACCESS_HELPER as a proxy for BPF_WRITE, pass enum bpf_access_type to check_stack_range_initialized() to more precisely track helper arguments. One benefit from this precision is that registers tracked as valid spills and passed as a read-only helper argument remain tracked after the call. Rather than being marked STACK_MISC afterwards. An additional benefit is the verifier logs are also more precise. For this particular error, users will enjoy a slightly clearer message. See included selftest updates for examples. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/ff885c0e5859e0cd12077c3148ff0754cad4f7ed.1736886479.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
b8a81b5dd6 |
bpf: verifier: Add missing newline on verbose() call
The print was missing a newline. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/59cbe18367b159cd470dc6d5c652524c1dc2b984.1736886479.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
dd19f4116e |
Merge 6.13-rc7 into driver-core-next
We need the debugfs / driver-core fixes in here as well for testing and to build on top of. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
18032c6bc0 |
btf: Switch module BTF attribute to sysfs_bin_attr_simple_read()
The generic function from the sysfs core can replace the custom one. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241228-sysfs-const-bin_attr-simple-v2-3-7c6f3f1767a3@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
42369b9a1e |
btf: Switch vmlinux BTF attribute to sysfs_bin_attr_simple_read()
The generic function from the sysfs core can replace the custom one. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241228-sysfs-const-bin_attr-simple-v2-2-7c6f3f1767a3@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
d86088e2c3 |
bpf: Remove migrate_{disable|enable} from bpf_selem_free()
bpf_selem_free() has the following three callers:
(1) bpf_local_storage_update
It will be invoked through ->map_update_elem syscall or helpers for
storage map. Migration has already been disabled in these running
contexts.
(2) bpf_sk_storage_clone
It has already disabled migration before invoking bpf_selem_free().
(3) bpf_selem_free_list
bpf_selem_free_list() has three callers: bpf_selem_unlink_storage(),
bpf_local_storage_update() and bpf_local_storage_destroy().
The callers of bpf_selem_unlink_storage() includes: storage map
->map_delete_elem syscall, storage map delete helpers and
bpf_local_storage_map_free(). These contexts have already disabled
migration when invoking bpf_selem_unlink() which invokes
bpf_selem_unlink_storage() and bpf_selem_free_list() correspondingly.
bpf_local_storage_update() has been analyzed as the first caller above.
bpf_local_storage_destroy() is invoked when freeing the local storage
for the kernel object. Now cgroup, task, inode and sock storage have
already disabled migration before invoking bpf_local_storage_destroy().
After the analyses above, it is safe to remove migrate_{disable|enable}
from bpf_selem_free().
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-17-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
7b984359e0 |
bpf: Remove migrate_{disable|enable} from bpf_local_storage_free()
bpf_local_storage_free() has three callers:
1) bpf_local_storage_alloc()
Its caller must have disabled migration.
2) bpf_local_storage_destroy()
Its four callers (bpf_{cgrp|inode|task|sk}_storage_free()) have already
invoked migrate_disable() before invoking bpf_local_storage_destroy().
3) bpf_selem_unlink()
Its callers include: cgrp/inode/task/sk storage ->map_delete_elem
callbacks, bpf_{cgrp|inode|task|sk}_storage_delete() helpers and
bpf_local_storage_map_free(). All of these callers have already disabled
migration before invoking bpf_selem_unlink().
Therefore, it is OK to remove migrate_{disable|enable} pair from
bpf_local_storage_free().
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-16-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
4855a75ebf |
bpf: Remove migrate_{disable|enable} from bpf_local_storage_alloc()
These two callers of bpf_local_storage_alloc() are the same as
bpf_selem_alloc(): bpf_sk_storage_clone() and
bpf_local_storage_update(). The running contexts of these two callers
have already disabled migration, therefore, there is no need to add
extra migrate_{disable|enable} pair in bpf_local_storage_alloc().
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-15-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
2269b32ab0 |
bpf: Remove migrate_{disable|enable} from bpf_selem_alloc()
bpf_selem_alloc() has two callers:
(1) bpf_sk_storage_clone_elem()
bpf_sk_storage_clone() has already disabled migration before invoking
bpf_sk_storage_clone_elem().
(2) bpf_local_storage_update()
Its callers include: cgrp/task/inode/sock storage ->map_update_elem()
callbacks and bpf_{cgrp|task|inode|sk}_storage_get() helpers. These
running contexts have already disabled migration
Therefore, there is no need to add extra migrate_{disable|enable} pair
in bpf_selem_alloc().
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-14-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
6a52b965ab |
bpf: Remove migrate_{disable,enable} in bpf_cpumask_release()
When BPF program invokes bpf_cpumask_release(), the migration must have
been disabled. When bpf_cpumask_release_dtor() invokes
bpf_cpumask_release(), the caller bpf_obj_free_fields() also has
disabled migration, therefore, it is OK to remove the unnecessary
migrate_{disable|enable} pair in bpf_cpumask_release().
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-13-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
1d2dbe7120 |
bpf: Remove migrate_{disable|enable} in bpf_obj_free_fields()
The callers of bpf_obj_free_fields() have already guaranteed that the
migration is disabled, therefore, there is no need to invoke
migrate_{disable,enable} pair in bpf_obj_free_fields()'s underly
implementation.
This patch removes unnecessary migrate_{disable|enable} pairs from
bpf_obj_free_fields() and its callees: bpf_list_head_free() and
bpf_rb_root_free().
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-12-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
4b7e7cd1c1 |
bpf: Disable migration before calling ops->map_free()
The freeing of all map elements may invoke bpf_obj_free_fields() to free
the special fields in the map value. Since these special fields may be
allocated from bpf memory allocator, migrate_{disable|enable} pairs are
necessary for the freeing of these special fields.
To simplify reasoning about when migrate_disable() is needed for the
freeing of these special fields, let the caller to guarantee migration
is disabled before invoking bpf_obj_free_fields(). Therefore, disabling
migration before calling ops->map_free() to simplify the freeing of map
values or special fields allocated from bpf memory allocator.
After disabling migration in bpf_map_free(), there is no need for
additional migration_{disable|enable} pairs in these ->map_free()
callbacks. Remove these redundant invocations.
The migrate_{disable|enable} pairs in the underlying implementation of
bpf_obj_free_fields() will be removed by the following patch.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-11-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
090d7f2e64 |
bpf: Disable migration in bpf_selem_free_rcu
bpf_selem_free_rcu() calls bpf_obj_free_fields() to free the special
fields in map value (e.g., kptr). Since kptrs may be allocated from bpf
memory allocator, migrate_{disable|enable} pairs are necessary for the
freeing of these kptrs.
To simplify reasoning about when migrate_disable() is needed for the
freeing of these dynamically-allocated kptrs, let the caller to
guarantee migration is disabled before invoking bpf_obj_free_fields().
Therefore, the patch adds migrate_{disable|enable} pair in
bpf_selem_free_rcu(). The migrate_{disable|enable} pairs in the
underlying implementation of bpf_obj_free_fields() will be removed by
the following patch.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-10-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
e319cdc895 |
bpf: Disable migration when destroying inode storage
When destroying inode storage, it invokes bpf_local_storage_destroy() to
remove all storage elements saved in the inode storage. The destroy
procedure will call bpf_selem_free() to free the element, and
bpf_selem_free() calls bpf_obj_free_fields() to free the special fields
in map value (e.g., kptr). Since kptrs may be allocated from bpf memory
allocator, migrate_{disable|enable} pairs are necessary for the freeing
of these kptrs.
To simplify reasoning about when migrate_disable() is needed for the
freeing of these dynamically-allocated kptrs, let the caller to
guarantee migration is disabled before invoking bpf_obj_free_fields().
Therefore, the patch adds migrate_{disable|enable} pair in
bpf_inode_storage_free(). The migrate_{disable|enable} pairs in the
underlying implementation of bpf_obj_free_fields() will be removed by
the following patch.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-7-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
9e6c958b54 |
bpf: Remove migrate_{disable|enable} from bpf_task_storage_lock helpers
Three callers of bpf_task_storage_lock() are ->map_lookup_elem,
->map_update_elem, ->map_delete_elem from bpf syscall. BPF syscall for
these three operations of task storage has already disabled migration.
Another two callers are bpf_task_storage_get() and
bpf_task_storage_delete() helpers which will be used by BPF program.
Two callers of bpf_task_storage_trylock() are bpf_task_storage_get() and
bpf_task_storage_delete() helpers. The running contexts of these helpers
have already disabled migration.
Therefore, it is safe to remove migrate_{disable|enable} from task
storage lock helpers for these call sites. However,
bpf_task_storage_free() also invokes bpf_task_storage_lock() and its
running context doesn't disable migration, therefore, add the missed
migrate_{disable|enable} in bpf_task_storage_free().
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
25dc65f75b |
bpf: Remove migrate_{disable|enable} from bpf_cgrp_storage_lock helpers
Three callers of bpf_cgrp_storage_lock() are ->map_lookup_elem,
->map_update_elem, ->map_delete_elem from bpf syscall. BPF syscall for
these three operations of cgrp storage has already disabled migration.
Two call sites of bpf_cgrp_storage_trylock() are bpf_cgrp_storage_get(),
and bpf_cgrp_storage_delete() helpers. The running contexts of these
helpers have already disabled migration.
Therefore, it is safe to remove migrate_disable() for these callers.
However, bpf_cgrp_storage_free() also invokes bpf_cgrp_storage_lock()
and its running context doesn't disable migration. Therefore, also add
the missed migrate_{disabled|enable} in bpf_cgrp_storage_free().
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
53f2ba0b1c |
bpf: Remove migrate_{disable|enable} in htab_elem_free
htab_elem_free() has two call-sites: delete_all_elements() has already
disabled migration, free_htab_elem() is invoked by other 4 functions:
__htab_map_lookup_and_delete_elem, __htab_map_lookup_and_delete_batch,
htab_map_update_elem and htab_map_delete_elem.
BPF syscall has already disabled migration before invoking
->map_update_elem, ->map_delete_elem, and ->map_lookup_and_delete_elem
callbacks for hash map. __htab_map_lookup_and_delete_batch() also
disables migration before invoking free_htab_elem(). ->map_update_elem()
and ->map_delete_elem() of hash map may be invoked by BPF program and
the running context of BPF program has already disabled migration.
Therefore, it is safe to remove the migration_{disable|enable} pair in
htab_elem_free()
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
ea5b229630 |
bpf: Remove migrate_{disable|enable} in ->map_for_each_callback
BPF program may call bpf_for_each_map_elem(), and it will call
the ->map_for_each_callback callback of related bpf map. Considering the
running context of bpf program has already disabled migration, remove
the unnecessary migrate_{disable|enable} pair in the implementations of
->map_for_each_callback. To ensure the guarantee will not be voilated
later, also add cant_migrate() check in the implementations.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
1b1a01db17 |
bpf: Remove migrate_{disable|enable} from LPM trie
Both bpf program and bpf syscall may invoke ->update or ->delete
operation for LPM trie. For bpf program, its running context has already
disabled migration explicitly through (migrate_disable()) or implicitly
through (preempt_disable() or disable irq). For bpf syscall, the
migration is disabled through the use of bpf_disable_instrumentation()
before invoking the corresponding map operation callback.
Therefore, it is safe to remove the migrate_{disable|enable){} pair from
LPM trie.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
b8b1e30016 |
bpf: Fix range_tree_set() error handling
range_tree_set() might fail and return -ENOMEM, causing subsequent `bpf_arena_alloc_pages` to fail. Add the error handling. Signed-off-by: Soma Nakata <soma.nakata@somane.sakura.ne.jp> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250106231536.52856-1-soma.nakata@somane.sakura.ne.jp Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
512816403e |
bpf: Allow bpf_for/bpf_repeat calls while holding a spinlock
Add the bpf_iter_num_* kfuncs called by bpf_for in special_kfunc_list, and allow the calls even while holding a spin lock. Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250104202528.882482-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
385f186aba |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc6). No conflicts. Adjacent changes: include/linux/if_vlan.h |
|
|
|
96ea081ed5 |
bpf: Reject struct_ops registration that uses module ptr and the module btf_id is missing
There is a UAF report in the bpf_struct_ops when CONFIG_MODULES=n.
In particular, the report is on tcp_congestion_ops that has
a "struct module *owner" member.
For struct_ops that has a "struct module *owner" member,
it can be extended either by the regular kernel module or
by the bpf_struct_ops. bpf_try_module_get() will be used
to do the refcounting and different refcount is done
based on the owner pointer. When CONFIG_MODULES=n,
the btf_id of the "struct module" is missing:
WARN: resolve_btfids: unresolved symbol module
Thus, the bpf_try_module_get() cannot do the correct refcounting.
Not all subsystem's struct_ops requires the "struct module *owner" member.
e.g. the recent sched_ext_ops.
This patch is to disable bpf_struct_ops registration if
the struct_ops has the "struct module *" member and the
"struct module" btf_id is missing. The btf_type_is_fwd() helper
is moved to the btf.h header file for this test.
This has happened since the beginning of bpf_struct_ops which has gone
through many changes. The Fixes tag is set to a recent commit that this
patch can apply cleanly. Considering CONFIG_MODULES=n is not
common and the age of the issue, targeting for bpf-next also.
Fixes:
|
|
|
|
dfa94ce54f |
bpf: Use refcount_t instead of atomic_t for mmap_count
Use an API that resembles more the actual use of mmap_count.
Found by cocci:
kernel/bpf/arena.c:245:6-25: WARNING: atomic_dec_and_test variation before object free at line 249.
Fixes:
|
|
|
|
654a3381e3 |
bpf: Remove unused MT_ENTRY define
The range tree introduction removed the need for maple tree usage but missed removing the MT_ENTRY defined value that was used to mark maple tree allocated entries. Remove the MT_ENTRY define. Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Link: https://lore.kernel.org/r/20241223115901.14207-1-lpieralisi@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
4a24035964 |
bpf: Fix holes in special_kfunc_list if !CONFIG_NET
If the function is not available its entry has to be replaced with
BTF_ID_UNUSED instead of skipped.
Otherwise the list doesn't work correctly.
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Closes: https://lore.kernel.org/lkml/CAADnVQJQpVziHzrPCCpGE5=8uzw2OkxP8gqe1FkJ6_XVVyVbNw@mail.gmail.com/
Fixes:
|
|
|
|
9aa0ebde00 |
bpf, verifier: Improve precision of BPF_MUL
This patch improves (or maintains) the precision of register value tracking
in BPF_MUL across all possible inputs. It also simplifies
scalar32_min_max_mul() and scalar_min_max_mul().
As it stands, BPF_MUL is composed of three functions:
case BPF_MUL:
tnum_mul();
scalar32_min_max_mul();
scalar_min_max_mul();
The current implementation of scalar_min_max_mul() restricts the u64 input
ranges of dst_reg and src_reg to be within [0, U32_MAX]:
/* Both values are positive, so we can work with unsigned and
* copy the result to signed (unless it exceeds S64_MAX).
*/
if (umax_val > U32_MAX || dst_reg->umax_value > U32_MAX) {
/* Potential overflow, we know nothing */
__mark_reg64_unbounded(dst_reg);
return;
}
This restriction is done to avoid unsigned overflow, which could otherwise
wrap the result around 0, and leave an unsound output where umin > umax. We
also observe that limiting these u64 input ranges to [0, U32_MAX] leads to
a loss of precision. Consider the case where the u64 bounds of dst_reg are
[0, 2^34] and the u64 bounds of src_reg are [0, 2^2]. While the
multiplication of these two bounds doesn't overflow and is sound [0, 2^36],
the current scalar_min_max_mul() would set the entire register state to
unbounded.
Importantly, we update BPF_MUL to allow signed bound multiplication
(i.e. multiplying negative bounds) as well as allow u64 inputs to take on
values from [0, U64_MAX]. We perform signed multiplication on two bounds
[a,b] and [c,d] by multiplying every combination of the bounds
(i.e. a*c, a*d, b*c, and b*d) and checking for overflow of each product. If
there is an overflow, we mark the signed bounds unbounded [S64_MIN, S64_MAX].
In the case of no overflow, we take the minimum of these products to
be the resulting smin, and the maximum to be the resulting smax.
The key idea here is that if there’s no possibility of overflow, either
when multiplying signed bounds or unsigned bounds, we can safely multiply the
respective bounds; otherwise, we set the bounds that exhibit overflow
(during multiplication) to unbounded.
if (check_mul_overflow(*dst_umax, src_reg->umax_value, dst_umax) ||
(check_mul_overflow(*dst_umin, src_reg->umin_value, dst_umin))) {
/* Overflow possible, we know nothing */
*dst_umin = 0;
*dst_umax = U64_MAX;
}
...
Below, we provide an example BPF program (below) that exhibits the
imprecision in the current BPF_MUL, where the outputs are all unbounded. In
contrast, the updated BPF_MUL produces a bounded register state:
BPF_LD_IMM64(BPF_REG_1, 11),
BPF_LD_IMM64(BPF_REG_2, 4503599627370624),
BPF_ALU64_IMM(BPF_NEG, BPF_REG_2, 0),
BPF_ALU64_IMM(BPF_NEG, BPF_REG_2, 0),
BPF_ALU64_REG(BPF_AND, BPF_REG_1, BPF_REG_2),
BPF_LD_IMM64(BPF_REG_3, 809591906117232263),
BPF_ALU64_REG(BPF_MUL, BPF_REG_3, BPF_REG_1),
BPF_MOV64_IMM(BPF_REG_0, 1),
BPF_EXIT_INSN(),
Verifier log using the old BPF_MUL:
func#0 @0
0: R1=ctx() R10=fp0
0: (18) r1 = 0xb ; R1_w=11
2: (18) r2 = 0x10000000000080 ; R2_w=0x10000000000080
4: (87) r2 = -r2 ; R2_w=scalar()
5: (87) r2 = -r2 ; R2_w=scalar()
6: (5f) r1 &= r2 ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R2_w=scalar()
7: (18) r3 = 0xb3c3f8c99262687 ; R3_w=0xb3c3f8c99262687
9: (2f) r3 *= r1 ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R3_w=scalar()
...
Verifier using the new updated BPF_MUL (more precise bounds at label 9)
func#0 @0
0: R1=ctx() R10=fp0
0: (18) r1 = 0xb ; R1_w=11
2: (18) r2 = 0x10000000000080 ; R2_w=0x10000000000080
4: (87) r2 = -r2 ; R2_w=scalar()
5: (87) r2 = -r2 ; R2_w=scalar()
6: (5f) r1 &= r2 ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R2_w=scalar()
7: (18) r3 = 0xb3c3f8c99262687 ; R3_w=0xb3c3f8c99262687
9: (2f) r3 *= r1 ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R3_w=scalar(smin=0,smax=umax=0x7b96bb0a94a3a7cd,var_off=(0x0; 0x7fffffffffffffff))
...
Finally, we proved the soundness of the new scalar_min_max_mul() and
scalar32_min_max_mul() functions. Typically, multiplication operations are
expensive to check with bitvector-based solvers. We were able to prove the
soundness of these functions using Non-Linear Integer Arithmetic (NIA)
theory. Additionally, using Agni [2,3], we obtained the encodings for
scalar32_min_max_mul() and scalar_min_max_mul() in bitvector theory, and
were able to prove their soundness using 8-bit bitvectors (instead of
64-bit bitvectors that the functions actually use).
In conclusion, with this patch,
1. We were able to show that we can improve the overall precision of
BPF_MUL. We proved (using an SMT solver) that this new version of
BPF_MUL is at least as precise as the current version for all inputs
and more precise for some inputs.
2. We are able to prove the soundness of the new scalar_min_max_mul() and
scalar32_min_max_mul(). By leveraging the existing proof of tnum_mul
[1], we can say that the composition of these three functions within
BPF_MUL is sound.
[1] https://ieeexplore.ieee.org/abstract/document/9741267
[2] https://link.springer.com/chapter/10.1007/978-3-031-37709-9_12
[3] https://people.cs.rutgers.edu/~sn349/papers/sas24-preprint.pdf
Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Matan Shachnai <m.shachnai@gmail.com>
Link: https://lore.kernel.org/r/20241218032337.12214-2-m.shachnai@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
07e5c4eb94 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc4). No conflicts. Adjacent changes: drivers/net/ethernet/renesas/rswitch.h |
|
|
|
8eef6ac4d7 |
bpf: bpf_local_storage: Always use bpf_mem_alloc in PREEMPT_RT
In PREEMPT_RT, kmalloc(GFP_ATOMIC) is still not safe in non preemptible
context. bpf_mem_alloc must be used in PREEMPT_RT. This patch is
to enforce bpf_mem_alloc in the bpf_local_storage when CONFIG_PREEMPT_RT
is enabled.
[ 35.118559] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[ 35.118566] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1832, name: test_progs
[ 35.118569] preempt_count: 1, expected: 0
[ 35.118571] RCU nest depth: 1, expected: 1
[ 35.118577] INFO: lockdep is turned off.
...
[ 35.118647] __might_resched+0x433/0x5b0
[ 35.118677] rt_spin_lock+0xc3/0x290
[ 35.118700] ___slab_alloc+0x72/0xc40
[ 35.118723] __kmalloc_noprof+0x13f/0x4e0
[ 35.118732] bpf_map_kzalloc+0xe5/0x220
[ 35.118740] bpf_selem_alloc+0x1d2/0x7b0
[ 35.118755] bpf_local_storage_update+0x2fa/0x8b0
[ 35.118784] bpf_sk_storage_get_tracing+0x15a/0x1d0
[ 35.118791] bpf_prog_9a118d86fca78ebb_trace_inet_sock_set_state+0x44/0x66
[ 35.118795] bpf_trace_run3+0x222/0x400
[ 35.118820] __bpf_trace_inet_sock_set_state+0x11/0x20
[ 35.118824] trace_inet_sock_set_state+0x112/0x130
[ 35.118830] inet_sk_state_store+0x41/0x90
[ 35.118836] tcp_set_state+0x3b3/0x640
There is no need to adjust the gfp_flags passing to the
bpf_mem_cache_alloc_flags() which only honors the GFP_KERNEL.
The verifier has ensured GFP_KERNEL is passed only in sleepable context.
It has been an old issue since the first introduction of the
bpf_local_storage ~5 years ago, so this patch targets the bpf-next.
bpf_mem_alloc is needed to solve it, so the Fixes tag is set
to the commit when bpf_mem_alloc was first used in the bpf_local_storage.
Fixes:
|
|
|
|
23579010cf |
bpf: Fix bpf_get_smp_processor_id() on !CONFIG_SMP
On x86-64 calling bpf_get_smp_processor_id() in a kernel with CONFIG_SMP
disabled can trigger the following bug, as pcpu_hot is unavailable:
[ 8.471774] BUG: unable to handle page fault for address: 00000000936a290c
[ 8.471849] #PF: supervisor read access in kernel mode
[ 8.471881] #PF: error_code(0x0000) - not-present page
Fix by inlining a return 0 in the !CONFIG_SMP case.
Fixes:
|
|
|
|
06103dccbb |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR. No conflicts. Adjacent changes in: Auto-merging include/linux/bpf.h Auto-merging include/linux/bpf_verifier.h Auto-merging kernel/bpf/btf.c Auto-merging kernel/bpf/verifier.c Auto-merging kernel/trace/bpf_trace.c Auto-merging tools/testing/selftests/bpf/progs/test_tp_btf_nullable.c Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
c83508da56 |
bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs
BPF program types like kprobe and fentry can cause deadlocks in certain situations. If a function takes a lock and one of these bpf programs is hooked to some point in the function's critical section, and if the bpf program tries to call the same function and take the same lock it will lead to deadlock. These situations have been reported in the following bug reports. In percpu_freelist - Link: https://lore.kernel.org/bpf/CAADnVQLAHwsa+2C6j9+UC6ScrDaN9Fjqv1WjB1pP9AzJLhKuLQ@mail.gmail.com/T/ Link: https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com/T/ In bpf_lru_list - Link: https://lore.kernel.org/bpf/CAPPBnEajj+DMfiR_WRWU5=6A7KKULdB5Rob_NJopFLWF+i9gCA@mail.gmail.com/T/ Link: https://lore.kernel.org/bpf/CAPPBnEZQDVN6VqnQXvVqGoB+ukOtHGZ9b9U0OLJJYvRoSsMY_g@mail.gmail.com/T/ Link: https://lore.kernel.org/bpf/CAPPBnEaCB1rFAYU7Wf8UxqcqOWKmRPU1Nuzk3_oLk6qXR7LBOA@mail.gmail.com/T/ Similar bugs have been reported by syzbot. In queue_stack_maps - Link: https://lore.kernel.org/lkml/0000000000004c3fc90615f37756@google.com/ Link: https://lore.kernel.org/all/20240418230932.2689-1-hdanton@sina.com/T/ In lpm_trie - Link: https://lore.kernel.org/linux-kernel/00000000000035168a061a47fa38@google.com/T/ In ringbuf - Link: https://lore.kernel.org/bpf/20240313121345.2292-1-hdanton@sina.com/T/ Prevent kprobe and fentry bpf programs from attaching to these critical sections by removing CC_FLAGS_FTRACE for percpu_freelist.o, bpf_lru_list.o, queue_stack_maps.o, lpm_trie.o, ringbuf.o files. The bugs reported by syzbot are due to tracepoint bpf programs being called in the critical sections. This patch does not aim to fix deadlocks caused by tracepoint programs. However, it does prevent deadlocks from occurring in similar situations due to kprobe and fentry programs. Signed-off-by: Priya Bala Govindasamy <pgovind2@uci.edu> Link: https://lore.kernel.org/r/CAPPBnEZpjGnsuA26Mf9kYibSaGLm=oF6=12L21X1GEQdqjLnzQ@mail.gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
838a10bd2e |
bpf: Augment raw_tp arguments with PTR_MAYBE_NULL
Arguments to a raw tracepoint are tagged as trusted, which carries the semantics that the pointer will be non-NULL. However, in certain cases, a raw tracepoint argument may end up being NULL. More context about this issue is available in [0]. Thus, there is a discrepancy between the reality, that raw_tp arguments can actually be NULL, and the verifier's knowledge, that they are never NULL, causing explicit NULL check branch to be dead code eliminated. A previous attempt [1], i.e. the second fixed commit, was made to simulate symbolic execution as if in most accesses, the argument is a non-NULL raw_tp, except for conditional jumps. This tried to suppress branch prediction while preserving compatibility, but surfaced issues with production programs that were difficult to solve without increasing verifier complexity. A more complete discussion of issues and fixes is available at [2]. Fix this by maintaining an explicit list of tracepoints where the arguments are known to be NULL, and mark the positional arguments as PTR_MAYBE_NULL. Additionally, capture the tracepoints where arguments are known to be ERR_PTR, and mark these arguments as scalar values to prevent potential dereference. Each hex digit is used to encode NULL-ness (0x1) or ERR_PTR-ness (0x2), shifted by the zero-indexed argument number x 4. This can be represented as follows: 1st arg: 0x1 2nd arg: 0x10 3rd arg: 0x100 ... and so on (likewise for ERR_PTR case). In the future, an automated pass will be used to produce such a list, or insert __nullable annotations automatically for tracepoints. Each compilation unit will be analyzed and results will be collated to find whether a tracepoint pointer is definitely not null, maybe null, or an unknown state where verifier conservatively marks it PTR_MAYBE_NULL. A proof of concept of this tool from Eduard is available at [3]. Note that in case we don't find a specification in the raw_tp_null_args array and the tracepoint belongs to a kernel module, we will conservatively mark the arguments as PTR_MAYBE_NULL. This is because unlike for in-tree modules, out-of-tree module tracepoints may pass NULL freely to the tracepoint. We don't protect against such tracepoints passing ERR_PTR (which is uncommon anyway), lest we mark all such arguments as SCALAR_VALUE. While we are it, let's adjust the test raw_tp_null to not perform dereference of the skb->mark, as that won't be allowed anymore, and make it more robust by using inline assembly to test the dead code elimination behavior, which should still stay the same. [0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb [1]: https://lore.kernel.org/all/20241104171959.2938862-1-memxor@gmail.com [2]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com [3]: https://github.com/eddyz87/llvm-project/tree/nullness-for-tracepoint-params Reported-by: Juri Lelli <juri.lelli@redhat.com> # original bug Reported-by: Manu Bretelle <chantra@meta.com> # bugs in masking fix Fixes: |
|
|
|
c00d738e16 |
bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"
This patch reverts commit |
|
|
|
00a5acdbf3 |
bpf: Fix configuration-dependent BTF function references
These BTF functions are not available unconditionally, only reference them when they are available. Avoid the following build warnings: BTF .tmp_vmlinux1.btf.o btf_encoder__tag_kfunc: failed to find kfunc 'bpf_send_signal_task' in BTF btf_encoder__tag_kfuncs: failed to tag kfunc 'bpf_send_signal_task' NM .tmp_vmlinux1.syms KSYMS .tmp_vmlinux1.kallsyms.S AS .tmp_vmlinux1.kallsyms.o LD .tmp_vmlinux2 NM .tmp_vmlinux2.syms KSYMS .tmp_vmlinux2.kallsyms.S AS .tmp_vmlinux2.kallsyms.o LD vmlinux BTFIDS vmlinux WARN: resolve_btfids: unresolved symbol prog_test_ref_kfunc WARN: resolve_btfids: unresolved symbol bpf_crypto_ctx WARN: resolve_btfids: unresolved symbol bpf_send_signal_task WARN: resolve_btfids: unresolved symbol bpf_modify_return_test_tp WARN: resolve_btfids: unresolved symbol bpf_dynptr_from_xdp WARN: resolve_btfids: unresolved symbol bpf_dynptr_from_skb Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241213-bpf-cond-ids-v1-1-881849997219@weissschuh.net |
|
|
|
4d3ae294f9 |
bpf: Add fd_array_cnt attribute for prog_load
The fd_array attribute of the BPF_PROG_LOAD syscall may contain a set of file descriptors: maps or btfs. This field was introduced as a sparse array. Introduce a new attribute, fd_array_cnt, which, if present, indicates that the fd_array is a continuous array of the corresponding length. If fd_array_cnt is non-zero, then every map in the fd_array will be bound to the program, as if it was used by the program. This functionality is similar to the BPF_PROG_BIND_MAP syscall, but such maps can be used by the verifier during the program load. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241213130934.1087929-5-aspsk@isovalent.com |
|
|
|
76145f7255 |
bpf: Refactor check_pseudo_btf_id
Introduce a helper to add btfs to the env->used_maps array. Use it to simplify the check_pseudo_btf_id() function. This new helper will also be re-used in a consequent patch. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241213130934.1087929-4-aspsk@isovalent.com |
|
|
|
928f3221cb |
bpf: Move map/prog compatibility checks
Move some inlined map/prog compatibility checks from the resolve_pseudo_ldimm64() function to the dedicated check_map_prog_compatibility() function. Call the latter function from the add_used_map_from_fd() function directly. This simplifies code and optimizes logic a bit, as before these changes the check_map_prog_compatibility() function was executed on every map usage, which doesn't make sense, as it doesn't include any per-instruction checks, only map type vs. prog type. (This patch also simplifies a consequent patch which will call the add_used_map_from_fd() function from another code path.) Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241213130934.1087929-3-aspsk@isovalent.com |
|
|
|
4e885fab71 |
bpf: Add a __btf_get_by_fd helper
Add a new helper to get a pointer to a struct btf from a file descriptor. This helper doesn't increase a refcnt. Add a comment explaining this and pointing to a corresponding function which does take a reference. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241213130934.1087929-2-aspsk@isovalent.com |
|
|
|
56d95b0adf |
xdp: get rid of xdp_frame::mem.id
Initially, xdp_frame::mem.id was used to search for the corresponding &page_pool to return the page correctly. However, after that struct page was extended to have a direct pointer to its PP (netmem has it as well), further keeping of this field makes no sense. xdp_return_frame_bulk() still used it to do a lookup, and this leftover is now removed. Remove xdp_frame::mem and replace it with ::mem_type, as only memory type still matters and we need to know it to be able to free the frame correctly. As a cute side effect, we can now make every scalar field in &xdp_frame of 4 byte width, speeding up accesses to them. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20241211172649.761483-3-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
5098462fba |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
659b9ba7cb |
bpf: Check size for BTF-based ctx access of pointer members
Robert Morris reported the following program type which passes the
verifier in [0]:
SEC("struct_ops/bpf_cubic_init")
void BPF_PROG(bpf_cubic_init, struct sock *sk)
{
asm volatile("r2 = *(u16*)(r1 + 0)"); // verifier should demand u64
asm volatile("*(u32 *)(r2 +1504) = 0"); // 1280 in some configs
}
The second line may or may not work, but the first instruction shouldn't
pass, as it's a narrow load into the context structure of the struct ops
callback. The code falls back to btf_ctx_access to ensure correctness
and obtaining the types of pointers. Ensure that the size of the access
is correctly checked to be 8 bytes, otherwise the verifier thinks the
narrow load obtained a trusted BTF pointer and will permit loads/stores
as it sees fit.
Perform the check on size after we've verified that the load is for a
pointer field, as for scalar values narrow loads are fine. Access to
structs passed as arguments to a BPF program are also treated as
scalars, therefore no adjustment is needed in their case.
Existing verifier selftests are broken by this change, but because they
were incorrect. Verifier tests for d_path were performing narrow load
into context to obtain path pointer, had this program actually run it
would cause a crash. The same holds for verifier_btf_ctx_access tests.
[0]: https://lore.kernel.org/bpf/51338.1732985814@localhost
Fixes:
|
|
|
|
ac6542ad92 |
bpf: fix null dereference when computing changes_pkt_data of prog w/o subprogs
bpf_prog_aux->func field might be NULL if program does not have
subprograms except for main sub-program. The fixed commit does
bpf_prog_aux->func access unconditionally, which might lead to null
pointer dereference.
The bug could be triggered by replacing the following BPF program:
SEC("tc")
int main_changes(struct __sk_buff *sk)
{
bpf_skb_pull_data(sk, 0);
return 0;
}
With the following BPF program:
SEC("freplace")
long changes_pkt_data(struct __sk_buff *sk)
{
return bpf_skb_pull_data(sk, 0);
}
bpf_prog_aux instance itself represents the main sub-program,
use this property to fix the bug.
Fixes:
|
|
|
|
c4441ca86a |
bpf: fix potential error return
The bpf_remove_insns() function returns WARN_ON_ONCE(error), where error is a result of bpf_adj_branches(), and thus should be always 0 However, if for any reason it is not 0, then it will be converted to boolean by WARN_ON_ONCE and returned to user space as 1, not an actual error value. Fix this by returning the original err after the WARN check. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241210114245.836164-1-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
81f6d0530b |
bpf: check changes_pkt_data property for extension programs
When processing calls to global sub-programs, verifier decides whether
to invalidate all packet pointers in current state depending on the
changes_pkt_data property of the global sub-program.
Because of this, an extension program replacing a global sub-program
must be compatible with changes_pkt_data property of the sub-program
being replaced.
This commit:
- adds changes_pkt_data flag to struct bpf_prog_aux:
- this flag is set in check_cfg() for main sub-program;
- in jit_subprogs() for other sub-programs;
- modifies bpf_check_attach_btf_id() to check changes_pkt_data flag;
- moves call to check_attach_btf_id() after the call to check_cfg(),
because it needs changes_pkt_data flag to be set:
bpf_check:
... ...
- check_attach_btf_id resolve_pseudo_ldimm64
resolve_pseudo_ldimm64 --> bpf_prog_is_offloaded
bpf_prog_is_offloaded check_cfg
check_cfg + check_attach_btf_id
... ...
The following fields are set by check_attach_btf_id():
- env->ops
- prog->aux->attach_btf_trace
- prog->aux->attach_func_name
- prog->aux->attach_func_proto
- prog->aux->dst_trampoline
- prog->aux->mod
- prog->aux->saved_dst_attach_type
- prog->aux->saved_dst_prog_type
- prog->expected_attach_type
Neither of these fields are used by resolve_pseudo_ldimm64() or
bpf_prog_offload_verifier_prep() (for netronome and netdevsim
drivers), so the reordering is safe.
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
51081a3f25 |
bpf: track changes_pkt_data property for global functions
When processing calls to certain helpers, verifier invalidates all
packet pointers in a current state. For example, consider the
following program:
__attribute__((__noinline__))
long skb_pull_data(struct __sk_buff *sk, __u32 len)
{
return bpf_skb_pull_data(sk, len);
}
SEC("tc")
int test_invalidate_checks(struct __sk_buff *sk)
{
int *p = (void *)(long)sk->data;
if ((void *)(p + 1) > (void *)(long)sk->data_end) return TCX_DROP;
skb_pull_data(sk, 0);
*p = 42;
return TCX_PASS;
}
After a call to bpf_skb_pull_data() the pointer 'p' can't be used
safely. See function filter.c:bpf_helper_changes_pkt_data() for a list
of such helpers.
At the moment verifier invalidates packet pointers when processing
helper function calls, and does not traverse global sub-programs when
processing calls to global sub-programs. This means that calls to
helpers done from global sub-programs do not invalidate pointers in
the caller state. E.g. the program above is unsafe, but is not
rejected by verifier.
This commit fixes the omission by computing field
bpf_subprog_info->changes_pkt_data for each sub-program before main
verification pass.
changes_pkt_data should be set if:
- subprogram calls helper for which bpf_helper_changes_pkt_data
returns true;
- subprogram calls a global function,
for which bpf_subprog_info->changes_pkt_data should be set.
The verifier.c:check_cfg() pass is modified to compute this
information. The commit relies on depth first instruction traversal
done by check_cfg() and absence of recursive function calls:
- check_cfg() would eventually visit every call to subprogram S in a
state when S is fully explored;
- when S is fully explored:
- every direct helper call within S is explored
(and thus changes_pkt_data is set if needed);
- every call to subprogram S1 called by S was visited with S1 fully
explored (and thus S inherits changes_pkt_data from S1).
The downside of such approach is that dead code elimination is not
taken into account: if a helper call inside global function is dead
because of current configuration, verifier would conservatively assume
that the call occurs for the purpose of the changes_pkt_data
computation.
Reported-by: Nick Zavaritsky <mejedi@gmail.com>
Closes: https://lore.kernel.org/bpf/0498CA22-5779-4767-9C0C-A9515CEA711F@gmail.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
b238e187b4 |
bpf: refactor bpf_helper_changes_pkt_data to use helper number
Use BPF helper number instead of function pointer in bpf_helper_changes_pkt_data(). This would simplify usage of this function in verifier.c:check_cfg() (in a follow-up patch), where only helper number is easily available and there is no real need to lookup helper proto. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241210041100.1898468-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
27e88bc4df |
bpf: add find_containing_subprog() utility function
Add a utility function, looking for a subprogram containing a given instruction index, rewrite find_subprog() to use this function. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241210041100.1898468-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
442bc81bd3 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR. Trivial conflict: tools/testing/selftests/bpf/prog_tests/verifier.c Adjacent changes in: Auto-merging kernel/bpf/verifier.c Auto-merging samples/bpf/Makefile Auto-merging tools/testing/selftests/bpf/.gitignore Auto-merging tools/testing/selftests/bpf/Makefile Auto-merging tools/testing/selftests/bpf/prog_tests/verifier.c Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
6a5c63d43c |
bpf: Use raw_spinlock_t for LPM trie
After switching from kmalloc() to the bpf memory allocator, there will be no blocking operation during the update of LPM trie. Therefore, change trie->lock from spinlock_t to raw_spinlock_t to make LPM trie usable in atomic context, even on RT kernels. The max value of prefixlen is 2048. Therefore, update or deletion operations will find the target after at most 2048 comparisons. Constructing a test case which updates an element after 2048 comparisons under a 8 CPU VM, and the average time and the maximal time for such update operation is about 210us and 900us. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-8-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
3d8dc43eb2 |
bpf: Switch to bpf mem allocator for LPM trie
Multiple syzbot warnings have been reported. These warnings are mainly about the lock order between trie->lock and kmalloc()'s internal lock. See report [1] as an example: ====================================================== WARNING: possible circular locking dependency detected 6.10.0-rc7-syzkaller-00003-g4376e966ecb7 #0 Not tainted ------------------------------------------------------ syz.3.2069/15008 is trying to acquire lock: ffff88801544e6d8 (&n->list_lock){-.-.}-{2:2}, at: get_partial_node ... but task is already holding lock: ffff88802dcc89f8 (&trie->lock){-.-.}-{2:2}, at: trie_update_elem ... which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&trie->lock){-.-.}-{2:2}: __raw_spin_lock_irqsave _raw_spin_lock_irqsave+0x3a/0x60 trie_delete_elem+0xb0/0x820 ___bpf_prog_run+0x3e51/0xabd0 __bpf_prog_run32+0xc1/0x100 bpf_dispatcher_nop_func ...... bpf_trace_run2+0x231/0x590 __bpf_trace_contention_end+0xca/0x110 trace_contention_end.constprop.0+0xea/0x170 __pv_queued_spin_lock_slowpath+0x28e/0xcc0 pv_queued_spin_lock_slowpath queued_spin_lock_slowpath queued_spin_lock do_raw_spin_lock+0x210/0x2c0 __raw_spin_lock_irqsave _raw_spin_lock_irqsave+0x42/0x60 __put_partials+0xc3/0x170 qlink_free qlist_free_all+0x4e/0x140 kasan_quarantine_reduce+0x192/0x1e0 __kasan_slab_alloc+0x69/0x90 kasan_slab_alloc slab_post_alloc_hook slab_alloc_node kmem_cache_alloc_node_noprof+0x153/0x310 __alloc_skb+0x2b1/0x380 ...... -> #0 (&n->list_lock){-.-.}-{2:2}: check_prev_add check_prevs_add validate_chain __lock_acquire+0x2478/0x3b30 lock_acquire lock_acquire+0x1b1/0x560 __raw_spin_lock_irqsave _raw_spin_lock_irqsave+0x3a/0x60 get_partial_node.part.0+0x20/0x350 get_partial_node get_partial ___slab_alloc+0x65b/0x1870 __slab_alloc.constprop.0+0x56/0xb0 __slab_alloc_node slab_alloc_node __do_kmalloc_node __kmalloc_node_noprof+0x35c/0x440 kmalloc_node_noprof bpf_map_kmalloc_node+0x98/0x4a0 lpm_trie_node_alloc trie_update_elem+0x1ef/0xe00 bpf_map_update_value+0x2c1/0x6c0 map_update_elem+0x623/0x910 __sys_bpf+0x90c/0x49a0 ... other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&trie->lock); lock(&n->list_lock); lock(&trie->lock); lock(&n->list_lock); *** DEADLOCK *** [1]: https://syzkaller.appspot.com/bug?extid=9045c0a3d5a7f1b119f7 A bpf program attached to trace_contention_end() triggers after acquiring &n->list_lock. The program invokes trie_delete_elem(), which then acquires trie->lock. However, it is possible that another process is invoking trie_update_elem(). trie_update_elem() will acquire trie->lock first, then invoke kmalloc_node(). kmalloc_node() may invoke get_partial_node() and try to acquire &n->list_lock (not necessarily the same lock object). Therefore, lockdep warns about the circular locking dependency. Invoking kmalloc() before acquiring trie->lock could fix the warning. However, since BPF programs call be invoked from any context (e.g., through kprobe/tracepoint/fentry), there may still be lock ordering problems for internal locks in kmalloc() or trie->lock itself. To eliminate these potential lock ordering problems with kmalloc()'s internal locks, replacing kmalloc()/kfree()/kfree_rcu() with equivalent BPF memory allocator APIs that can be invoked in any context. The lock ordering problems with trie->lock (e.g., reentrance) will be handled separately. Three aspects of this change require explanation: 1. Intermediate and leaf nodes are allocated from the same allocator. Since the value size of LPM trie is usually small, using a single alocator reduces the memory overhead of the BPF memory allocator. 2. Leaf nodes are allocated before disabling IRQs. This handles cases where leaf_size is large (e.g., > 4KB - 8) and updates require intermediate node allocation. If leaf nodes were allocated in IRQ-disabled region, the free objects in BPF memory allocator would not be refilled timely and the intermediate node allocation may fail. 3. Paired migrate_{disable|enable}() calls for node alloc and free. The BPF memory allocator uses per-CPU struct internally, these paired calls are necessary to guarantee correctness. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-7-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
27abc7b3fa |
bpf: Fix exact match conditions in trie_get_next_key()
trie_get_next_key() uses node->prefixlen == key->prefixlen to identify
an exact match, However, it is incorrect because when the target key
doesn't fully match the found node (e.g., node->prefixlen != matchlen),
these two nodes may also have the same prefixlen. It will return
expected result when the passed key exist in the trie. However when a
recently-deleted key or nonexistent key is passed to
trie_get_next_key(), it may skip keys and return incorrect result.
Fix it by using node->prefixlen == matchlen to identify exact matches.
When the condition is true after the search, it also implies
node->prefixlen equals key->prefixlen, otherwise, the search would
return NULL instead.
Fixes:
|
|
|
|
532d6b36b2 |
bpf: Handle in-place update for full LPM trie correctly
When a LPM trie is full, in-place updates of existing elements
incorrectly return -ENOSPC.
Fix this by deferring the check of trie->n_entries. For new insertions,
n_entries must not exceed max_entries. However, in-place updates are
allowed even when the trie is full.
Fixes:
|
|
|
|
eae6a075e9 |
bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
Add the currently missing handling for the BPF_EXIST and BPF_NOEXIST
flags. These flags can be specified by users and are relevant since LPM
trie supports exact matches during update.
Fixes:
|
|
|
|
3d5611b4d7 |
bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
There is no need to call kfree(im_node) when updating element fails, because im_node must be NULL. Remove the unnecessary kfree() for im_node. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
156c977c53 |
bpf: Remove unnecessary check when updating LPM trie
When "node->prefixlen == matchlen" is true, it means that the node is fully matched. If "node->prefixlen == key->prefixlen" is false, it means the prefix length of key is greater than the prefix length of node, otherwise, matchlen will not be equal with node->prefixlen. However, it also implies that the prefix length of node must be less than max_prefixlen. Therefore, "node->prefixlen == trie->max_prefixlen" will always be false when the check of "node->prefixlen == key->prefixlen" returns false. Remove this unnecessary comparison. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
7cd1107f48 |
bpf, xdp: constify some bpf_prog * function arguments
In lots of places, bpf_prog pointer is used only for tracing or other stuff that doesn't modify the structure itself. Same for net_device. Address at least some of them and add `const` attributes there. The object code didn't change, but that may prevent unwanted data modifications and also allow more helpers to have const arguments. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
b0e66977dc |
bpf: Fix narrow scalar spill onto 64-bit spilled scalar slots
When CAP_PERFMON and CAP_SYS_ADMIN (allow_ptr_leaks) are disabled, the
verifier aims to reject partial overwrite on an 8-byte stack slot that
contains a spilled pointer.
However, in such a scenario, it rejects all partial stack overwrites as
long as the targeted stack slot is a spilled register, because it does
not check if the stack slot is a spilled pointer.
Incomplete checks will result in the rejection of valid programs, which
spill narrower scalar values onto scalar slots, as shown below.
0: R1=ctx() R10=fp0
; asm volatile ( @ repro.bpf.c:679
0: (7a) *(u64 *)(r10 -8) = 1 ; R10=fp0 fp-8_w=1
1: (62) *(u32 *)(r10 -8) = 1
attempt to corrupt spilled pointer on stack
processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0.
Fix this by expanding the check to not consider spilled scalar registers
when rejecting the write into the stack.
Previous discussion on this patch is at link [0].
[0]: https://lore.kernel.org/bpf/20240403202409.2615469-1-tao.lyu@epfl.ch
Fixes:
|
|
|
|
69772f509e |
bpf: Don't mark STACK_INVALID as STACK_MISC in mark_stack_slot_misc
Inside mark_stack_slot_misc, we should not upgrade STACK_INVALID to
STACK_MISC when allow_ptr_leaks is false, since invalid contents
shouldn't be read unless the program has the relevant capabilities.
The relaxation only makes sense when env->allow_ptr_leaks is true.
However, such conversion in privileged mode becomes unnecessary, as
invalid slots can be read without being upgraded to STACK_MISC.
Currently, the condition is inverted (i.e. checking for true instead of
false), simply remove it to restore correct behavior.
Fixes:
|
|
|
|
cbd8730aea |
bpf: Improve verifier log for resource leak on exit
The verifier log when leaking resources on BPF_EXIT may be a bit confusing, as it's a problem only when finally existing from the main prog, not from any of the subprogs. Hence, update the verifier error string and the corresponding selftests matching on it. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204030400.208005-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
c8e2ee1f3d |
bpf: Introduce support for bpf_local_irq_{save,restore}
Teach the verifier about IRQ-disabled sections through the introduction of two new kfuncs, bpf_local_irq_save, to save IRQ state and disable them, and bpf_local_irq_restore, to restore IRQ state and enable them back again. For the purposes of tracking the saved IRQ state, the verifier is taught about a new special object on the stack of type STACK_IRQ_FLAG. This is a 8 byte value which saves the IRQ flags which are to be passed back to the IRQ restore kfunc. Renumber the enums for REF_TYPE_* to simplify the check in find_lock_state, filtering out non-lock types as they grow will become cumbersome and is unecessary. To track a dynamic number of IRQ-disabled regions and their associated saved states, a new resource type RES_TYPE_IRQ is introduced, which its state management functions: acquire_irq_state and release_irq_state, taking advantage of the refactoring and clean ups made in earlier commits. One notable requirement of the kernel's IRQ save and restore API is that they cannot happen out of order. For this purpose, when releasing reference we keep track of the prev_id we saw with REF_TYPE_IRQ. Since reference states are inserted in increasing order of the index, this is used to remember the ordering of acquisitions of IRQ saved states, so that we maintain a logical stack in acquisition order of resource identities, and can enforce LIFO ordering when restoring IRQ state. The top of the stack is maintained using bpf_verifier_state's active_irq_id. To maintain the stack property when releasing reference states, we need to modify release_reference_state to instead shift the remaining array left using memmove instead of swapping deleted element with last that might break the ordering. A selftest to test this subtle behavior is added in late patches. The logic to detect initialized and unitialized irq flag slots, marking and unmarking is similar to how it's done for iterators. No additional checks are needed in refsafe for REF_TYPE_IRQ, apart from the usual check_id satisfiability check on the ref[i].id. We have to perform the same check_ids check on state->active_irq_id as well. To ensure we don't get assigned REF_TYPE_PTR by default after acquire_reference_state, if someone forgets to assign the type, let's also renumber the enum ref_state_type. This way any unassigned types get caught by refsafe's default switch statement, don't assume REF_TYPE_PTR by default. The kfuncs themselves are plain wrappers over local_irq_save and local_irq_restore macros. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204030400.208005-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
b79f5f54e1 |
bpf: Refactor mark_{dynptr,iter}_read
There is possibility of sharing code between mark_dynptr_read and mark_iter_read for updating liveness information of their stack slots. Consolidate common logic into mark_stack_slot_obj_read function in preparation for the next patch which needs the same logic for its own stack slots. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204030400.208005-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
769b0f1c82 |
bpf: Refactor {acquire,release}_reference_state
In preparation for introducing support for more reference types which have to add and remove reference state, refactor the acquire_reference_state and release_reference_state functions to share common logic. The acquire_reference_state function simply handles growing the acquired refs and returning the pointer to the new uninitialized element, which can be filled in by the caller. The release_reference_state function simply erases a reference state entry in the acquired_refs array and shrinks it. The callers are responsible for finding the suitable element by matching on various fields of the reference state and requesting deletion through this function. It is not supposed to be called directly. Existing callers of release_reference_state were using it to find and remove state for a given ref_obj_id without scrubbing the associated registers in the verifier state. Introduce release_reference_nomark to provide this functionality and convert callers. We now use this new release_reference_nomark function within release_reference as well. It needs to operate on a verifier state instead of taking verifier env as mark_ptr_or_null_regs requires operating on verifier state of the two branches of a NULL condition check, therefore env->cur_state cannot be used directly. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204030400.208005-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
1995edc5f9 |
bpf: Consolidate locks and reference state in verifier state
Currently, state for RCU read locks and preemption is in bpf_verifier_state, while locks and pointer reference state remains in bpf_func_state. There is no particular reason to keep the latter in bpf_func_state. Additionally, it is copied into a new frame's state and copied back to the caller frame's state everytime the verifier processes a pseudo call instruction. This is a bit wasteful, given this state is global for a given verification state / path. Move all resource and reference related state in bpf_verifier_state structure in this patch, in preparation for introducing new reference state types in the future. Since we switch print_verifier_state and friends to print using vstate, we now need to explicitly pass in the verifier state from the caller along with the bpf_func_state, so modify the prototype and callers to do so. To ensure func state matches the verifier state when we're printing data, take in frame number instead of bpf_func_state pointer instead and avoid inconsistencies induced by the caller. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204030400.208005-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
bd74e238ae |
bpf: Zero index arg error string for dynptr and iter
Andrii spotted that process_dynptr_func's rejection of incorrect argument register type will print an error string where argument numbers are not zero-indexed, unlike elsewhere in the verifier. Fix this by subtracting 1 from regno. The same scenario exists for iterator messages. Fix selftest error strings that match on the exact argument number while we're at it to ensure clean bisection. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241203002235.3776418-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
12659d2861 |
bpf: Ensure reg is PTR_TO_STACK in process_iter_arg
Currently, KF_ARG_PTR_TO_ITER handling missed checking the reg->type and
ensuring it is PTR_TO_STACK. Instead of enforcing this in the caller of
process_iter_arg, move the check into it instead so that all callers
will gain the check by default. This is similar to process_dynptr_func.
An existing selftest in verifier_bits_iter.c fails due to this change,
but it's because it was passing a NULL pointer into iter_next helper and
getting an error further down the checks, but probably meant to pass an
uninitialized iterator on the stack (as is done in the subsequent test
below it). We will gain coverage for non-PTR_TO_STACK arguments in later
patches hence just change the declaration to zero-ed stack object.
Fixes:
|
|
|
|
ab244dd7cf |
bpf: fix OOB devmap writes when deleting elements
Jordy reported issue against XSKMAP which also applies to DEVMAP - the
index used for accessing map entry, due to being a signed integer,
causes the OOB writes. Fix is simple as changing the type from int to
u32, however, when compared to XSKMAP case, one more thing needs to be
addressed.
When map is released from system via dev_map_free(), we iterate through
all of the entries and an iterator variable is also an int, which
implies OOB accesses. Again, change it to be u32.
Example splat below:
[ 160.724676] BUG: unable to handle page fault for address: ffffc8fc2c001000
[ 160.731662] #PF: supervisor read access in kernel mode
[ 160.736876] #PF: error_code(0x0000) - not-present page
[ 160.742095] PGD 0 P4D 0
[ 160.744678] Oops: Oops: 0000 [#1] PREEMPT SMP
[ 160.749106] CPU: 1 UID: 0 PID: 520 Comm: kworker/u145:12 Not tainted 6.12.0-rc1+ #487
[ 160.757050] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
[ 160.767642] Workqueue: events_unbound bpf_map_free_deferred
[ 160.773308] RIP: 0010:dev_map_free+0x77/0x170
[ 160.777735] Code: 00 e8 fd 91 ed ff e8 b8 73 ed ff 41 83 7d 18 19 74 6e 41 8b 45 24 49 8b bd f8 00 00 00 31 db 85 c0 74 48 48 63 c3 48 8d 04 c7 <48> 8b 28 48 85 ed 74 30 48 8b 7d 18 48 85 ff 74 05 e8 b3 52 fa ff
[ 160.796777] RSP: 0018:ffffc9000ee1fe38 EFLAGS: 00010202
[ 160.802086] RAX: ffffc8fc2c001000 RBX: 0000000080000000 RCX: 0000000000000024
[ 160.809331] RDX: 0000000000000000 RSI: 0000000000000024 RDI: ffffc9002c001000
[ 160.816576] RBP: 0000000000000000 R08: 0000000000000023 R09: 0000000000000001
[ 160.823823] R10: 0000000000000001 R11: 00000000000ee6b2 R12: dead000000000122
[ 160.831066] R13: ffff88810c928e00 R14: ffff8881002df405 R15: 0000000000000000
[ 160.838310] FS: 0000000000000000(0000) GS:ffff8897e0c40000(0000) knlGS:0000000000000000
[ 160.846528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 160.852357] CR2: ffffc8fc2c001000 CR3: 0000000005c32006 CR4: 00000000007726f0
[ 160.859604] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 160.866847] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 160.874092] PKRU: 55555554
[ 160.876847] Call Trace:
[ 160.879338] <TASK>
[ 160.881477] ? __die+0x20/0x60
[ 160.884586] ? page_fault_oops+0x15a/0x450
[ 160.888746] ? search_extable+0x22/0x30
[ 160.892647] ? search_bpf_extables+0x5f/0x80
[ 160.896988] ? exc_page_fault+0xa9/0x140
[ 160.900973] ? asm_exc_page_fault+0x22/0x30
[ 160.905232] ? dev_map_free+0x77/0x170
[ 160.909043] ? dev_map_free+0x58/0x170
[ 160.912857] bpf_map_free_deferred+0x51/0x90
[ 160.917196] process_one_work+0x142/0x370
[ 160.921272] worker_thread+0x29e/0x3b0
[ 160.925082] ? rescuer_thread+0x4b0/0x4b0
[ 160.929157] kthread+0xd4/0x110
[ 160.932355] ? kthread_park+0x80/0x80
[ 160.936079] ret_from_fork+0x2d/0x50
[ 160.943396] ? kthread_park+0x80/0x80
[ 160.950803] ret_from_fork_asm+0x11/0x20
[ 160.958482] </TASK>
Fixes:
|
|
|
|
8618f5ffba |
bpf, lsm: Remove getlsmprop hooks BTF IDs
These hooks are not useful for BPF LSM currently.
Furthermore a recent renaming introduced build warnings:
BTFIDS vmlinux
WARN: resolve_btfids: unresolved symbol bpf_lsm_task_getsecid_obj
WARN: resolve_btfids: unresolved symbol bpf_lsm_current_getsecid_subj
Link: https://lore.kernel.org/lkml/20241123-bpf_lsm_task_getsecid_obj-v1-1-0d0f94649e05@weissschuh.net/
Fixes:
|
|
|
|
980f8f8fd4 |
Summary
* sysctl ctl_table constification
Constifying ctl_table structs prevents the modification of proc_handler
function pointers. All ctl_table struct arguments are const qualified in the
sysctl API in such a way that the ctl_table arrays being defined elsewhere
and passed through sysctl can be constified one-by-one. We kick the
constification off by qualifying user_table in kernel/ucount.c and expect all
the ctl_tables to be constified in the coming releases.
* Misc fixes
Adjust comments in two places to better reflect the code. Remove superfluous
dput calls. Remove Luis from sysctl maintainership. Replace comments about
holding a lock with calls to lockdep_assert_held.
* Testing
All these went through 0-day and they have all been in linux-next for at
least 1 month (since Oct-24). I also rand these through the sysctl selftest
for x86_64.
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmdAXMsACgkQupfNUreW
QU/KfQv8Daq9sew98ohmS/lkdoE1dfpI72motzEn1993CbLjN2h3CZauaHjBPFnr
rpr8qPrphdWTyDbDMgx63oxcNxM07g7a9H0y/K3IwdUsx7fGINgHF5kfWeVn09ov
X8I3NuL/+xSHAZRsLQeBykbY6BD5e0uuxL6ayGzkejrgRd+80dmC3MzXqX207v1z
rlrUFXEXwqKYgxP/H+pxmvmVWKAeFsQt/E49GOkg2qSg9mVFhtKpxHwMJVqS2a8u
qAKHgcZhB5T8TQSb1eKnyCzXLDLpzqUBj9ejqJSsQm16fweawv221Ji6a1k53QYG
chreoB9R8qCZ/jGoWI3ZKGRZ/Vl37l+GF/82X/sDrMbKwVlxvaERpb1KXrnh/D1v
qNze1Eea0eYv22weGGEa3J5N2tKfgX6NcRFioDNe9VEXX6zDcAtJKTKZtbMB3gXX
CzQicH5yXApyAk3aNCq0S3s+WRQR0syGAYCmtxhaRgXRnSu9qifKZ1XhZQyhgKIG
Flt9MsU2
=bOJ0
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:
"sysctl ctl_table constification:
- Constifying ctl_table structs prevents the modification of
proc_handler function pointers. All ctl_table struct arguments are
const qualified in the sysctl API in such a way that the ctl_table
arrays being defined elsewhere and passed through sysctl can be
constified one-by-one.
We kick the constification off by qualifying user_table in
kernel/ucount.c and expect all the ctl_tables to be constified in
the coming releases.
Misc fixes:
- Adjust comments in two places to better reflect the code
- Remove superfluous dput calls
- Remove Luis from sysctl maintainership
- Replace comments about holding a lock with calls to
lockdep_assert_held"
* tag 'sysctl-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: Reduce dput(child) calls in proc_sys_fill_cache()
sysctl: Reorganize kerneldoc parameter names
ucounts: constify sysctl table user_table
sysctl: update comments to new registration APIs
MAINTAINERS: remove me from sysctl
sysctl: Convert locking comments to lockdep assertions
const_structs.checkpatch: add ctl_table
sysctl: make internal ctl_tables const
sysctl: allow registration of const struct ctl_table
sysctl: move internal interfaces to const struct ctl_table
bpf: Constify ctl_table argument of filter function
|