linux

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	7d7a103d29	vfs-6.16-rc1.pidfs -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc ov4zAP4yfqKBAz6eMt9CzDgHCdVQJ9Nuur1EiRdot3maPzHTcQEA2hVkJrvVo1Y/ jCVAf7gmGX1Uu6nCUF6Vjy35g8i20gA= =nzYS -----END PGP SIGNATURE----- Merge tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: "Features: - Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD socket option SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for the process that called connect() or listen(). This is heavily used to safely authenticate clients in userspace avoiding security bugs due to pid recycling races (dbus, polkit, systemd, etc.) SO_PEERPIDFD currently doesn't support handing out pidfds if the sk->sk_peer_pid thread-group leader has already been reaped. In this case it currently returns EINVAL. Userspace still wants to get a pidfd for a reaped process to have a stable handle it can pass on. This is especially useful now that it is possible to retrieve exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s PIDFD_INFO_EXIT flag Another summary has been provided by David Rheinsberg: > A pidfd can outlive the task it refers to, and thus user-space > must already be prepared that the task underlying a pidfd is > gone at the time they get their hands on the pidfd. For > instance, resolving the pidfd to a PID via the fdinfo must be > prepared to read `-1`. > > Despite user-space knowing that a pidfd might be stale, several > kernel APIs currently add another layer that checks for this. In > particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was > already reaped, but returns a stale pidfd if the task is reaped > immediately after the respective alive-check. > > This has the unfortunate effect that user-space now has two ways > to check for the exact same scenario: A syscall might return > EINVAL/ESRCH/... or the pidfd might be stale, even though > there is no particular reason to distinguish both cases. This > also propagates through user-space APIs, which pass on pidfds. > They must be prepared to pass on `-1` or the pidfd, because > there is no guaranteed way to get a stale pidfd from the kernel. > > Userspace must already deal with a pidfd referring to a reaped > task as the task may exit and get reaped at any time will there > are still many pidfds referring to it In order to allow handing out reaped pidfd SO_PEERPIDFD needs to ensure that PIDFD_INFO_EXIT information is available whenever a pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi promises that reaped pidfds are only handed out if it is guaranteed that the caller sees the exit information: TEST_F(pidfd_info, success_reaped) { struct pidfd_info info = { .mask = PIDFD_INFO_CGROUPID \| PIDFD_INFO_EXIT, }; /* * Process has already been reaped and PIDFD_INFO_EXIT been set. * Verify that we can retrieve the exit status of the process. / ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0); ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS)); ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT)); ASSERT_TRUE(WIFEXITED(info.exit_code)); ASSERT_EQ(WEXITSTATUS(info.exit_code), 0); } To hand out pidfds for reaped processes we thus allocate a pidfs entry for the relevant sk->sk_peer_pid at the time the sk->sk_peer_pid is stashed and drop it when the socket is destroyed. This guarantees that exit information will always be recorded for the sk->sk_peer_pid task and we can hand out pidfds for reaped processes - Hand a pidfd to the coredump usermode helper process Give userspace a way to instruct the kernel to install a pidfd for the crashing process into the process started as a usermode helper. There's still tricky race-windows that cannot be easily or sometimes not closed at all by userspace. There's various ways like looking at the start time of a process to make sure that the usermode helper process is started after the crashing process but it's all very very brittle and fraught with peril The crashed-but-not-reaped process can be killed by userspace before coredump processing programs like systemd-coredump have had time to manually open a PIDFD from the PID the kernel provides them, which means they can be tricked into reading from an arbitrary process, and they run with full privileges as they are usermode helper processes Even if that specific race-window wouldn't exist it's still the safest and cleanest way to let the kernel provide the pidfd directly instead of requiring userspace to do it manually. In parallel with this commit we already have systemd adding support for this in [1] When the usermode helper process is forked we install a pidfd file descriptor three into the usermode helper's file descriptor table so it's available to the exec'd program Since usermode helpers are either children of the system_unbound_wq workqueue or kthreadd we know that the file descriptor table is empty and can thus always use three as the file descriptor number Note, that we'll install a pidfd for the thread-group leader even if a subthread is calling do_coredump(). We know that task linkage hasn't been removed yet and even if this @current isn't the actual thread-group leader we know that the thread-group leader cannot be reaped until @current has exited - Allow telling when a task has not been found from finding the wrong task when creating a pidfd We currently report EINVAL whenever a struct pid has no tasked attached anymore thereby conflating two concepts: (1) The task has already been reaped (2) The caller requested a pidfd for a thread-group leader but the pid actually references a struct pid that isn't used as a thread-group leader This is causing issues for non-threaded workloads as in where they expect ESRCH to be reported, not EINVAL So allow userspace to reliably distinguish between (1) and (2) - Make it possible to detect when a pidfs entry would outlive the struct pid it pinned - Add a range of new selftests Cleanups: - Remove unneeded NULL check from pidfd_prepare() for passed struct pid - Avoid pointless reference count bump during release_task() Fixes: - Various fixes to the pidfd and coredump selftests - Fix error handling for replace_fd() when spawning coredump usermode helper" tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pidfs: detect refcount bugs coredump: hand a pidfd to the usermode coredump helper coredump: fix error handling for replace_fd() pidfs: move O_RDWR into pidfs_alloc_file() selftests: coredump: Raise timeout to 2 minutes selftests: coredump: Fix test failure for slow machines selftests: coredump: Properly initialize pointer net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid pidfs: get rid of __pidfd_prepare() net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid pidfs: register pid in pidfs net, pidfd: report EINVAL for ESRCH release_task: kill the no longer needed get/put_pid(thread_pid) pidfs: ensure consistent ENOENT/ESRCH reporting exit: move wake_up_all() pidfd waiters into __unhash_process() selftest/pidfd: add test for thread-group leader pidfd open for thread pidfd: improve uapi when task isn't found pidfd: remove unneeded NULL check from pidfd_prepare() selftests/pidfd: adapt to recent changes	2025-05-26 10:30:02 -07:00
Qiu Yutan	e45b7196df	net: neigh: use kfree_skb_reason() in neigh_resolve_output() and neigh_connected_output() Replace kfree_skb() used in neigh_resolve_output() and neigh_connected_output() with kfree_skb_reason(). Following new skb drop reason is added: /* failed to fill the device hard header */ SKB_DROP_REASON_NEIGH_HH_FILLFAIL Signed-off-by: Qiu Yutan <qiu.yutan@zte.com.cn> Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Xu Xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2025-05-26 10:03:13 +01:00
Stanislav Fomichev	384492c48e	net: devmem: support single IOV with sendmsg sendmsg() with a single iov becomes ITER_UBUF, sendmsg() with multiple iovs becomes ITER_IOVEC. iter_iov_len does not return correct value for UBUF, so teach to treat UBUF differently. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Mina Almasry <almasrymina@google.com> Fixes: `bd61848900` ("net: devmem: Implement TX path") Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Acked-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2025-05-26 10:00:48 +01:00
Kuniyuki Iwashima	77cbe1a6d8	af_unix: Introduce SO_PASSRIGHTS. As long as recvmsg() or recvmmsg() is used with cmsg, it is not possible to avoid receiving file descriptors via SCM_RIGHTS. This behaviour has occasionally been flagged as problematic, as it can be (ab)used to trigger DoS during close(), for example, by passing a FUSE-controlled fd or a hung NFS fd. For instance, as noted on the uAPI Group page [0], an untrusted peer could send a file descriptor pointing to a hung NFS mount and then close it. Once the receiver calls recvmsg() with msg_control, the descriptor is automatically installed, and then the responsibility for the final close() now falls on the receiver, which may result in blocking the process for a long time. Regarding this, systemd calls cmsg_close_all() [1] after each recvmsg() to close() unwanted file descriptors sent via SCM_RIGHTS. However, this cannot work around the issue at all, because the final fput() may still occur on the receiver's side once sendmsg() with SCM_RIGHTS succeeds. Also, even filtering by LSM at recvmsg() does not work for the same reason. Thus, we need a better way to refuse SCM_RIGHTS at sendmsg(). Let's introduce SO_PASSRIGHTS to disable SCM_RIGHTS. Note that this option is enabled by default for backward compatibility. Link: https://uapi-group.org/kernel-features/#disabling-reception-of-scm_rights-for-af_unix-sockets #[0] Link: https://github.com/systemd/systemd/blob/v257.5/src/basic/fd-util.c#L612-L628 #[1] Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2025-05-23 10:24:18 +01:00
Kuniyuki Iwashima	0e81cfd971	af_unix: Move SOCK_PASS{CRED,PIDFD,SEC} to struct sock. As explained in the next patch, SO_PASSRIGHTS would have a problem if we assigned a corresponding bit to socket->flags, so it must be managed in struct sock. Mixing socket->flags and sk->sk_flags for similar options will look confusing, and sk->sk_flags does not have enough space on 32bit system. Also, as mentioned in commit `16e5726269` ("af_unix: dont send SCM_CREDENTIALS by default"), SOCK_PASSCRED and SOCK_PASSPID handling is known to be slow, and managing the flags in struct socket cannot avoid that for embryo sockets. Let's move SOCK_PASS{CRED,PIDFD,SEC} to struct sock. While at it, other SOCK_XXX flags in net.h are grouped as enum. Note that assign_bit() was atomic, so the writer side is moved down after lock_sock() in setsockopt(), but the bit is only read once in sendmsg() and recvmsg(), so lock_sock() is not needed there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2025-05-23 10:24:18 +01:00
Kuniyuki Iwashima	7d8d93fdde	net: Restrict SO_PASS{CRED,PIDFD,SEC} to AF_{UNIX,NETLINK,BLUETOOTH}. SCM_CREDENTIALS and SCM_SECURITY can be recv()ed by calling scm_recv() or scm_recv_unix(), and SCM_PIDFD is only used by scm_recv_unix(). scm_recv() is called from AF_NETLINK and AF_BLUETOOTH. scm_recv_unix() is literally called from AF_UNIX. Let's restrict SO_PASSCRED and SO_PASSSEC to such sockets and SO_PASSPIDFD to AF_UNIX only. Later, SOCK_PASS{CRED,PIDFD,SEC} will be moved to struct sock and united with another field. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2025-05-23 10:24:18 +01:00
Kuniyuki Iwashima	ae4f2f59e1	tcp: Restrict SO_TXREHASH to TCP socket. sk->sk_txrehash is only used for TCP. Let's restrict SO_TXREHASH to TCP to reflect this. Later, we will make sk_txrehash a part of the union for other protocol families. Note that we need to modify BPF selftest not to get/set SO_TEREHASH for non-TCP sockets. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2025-05-23 10:24:18 +01:00
Kuniyuki Iwashima	38b95d588f	scm: Move scm_recv() from scm.h to scm.c. scm_recv() has been placed in scm.h since the pre-git era for no particular reason (I think), which makes the file really fragile. For example, when you move SOCK_PASSCRED from include/linux/net.h to enum sock_flags in include/net/sock.h, you will see weird build failure due to terrible dependency. To avoid the build failure in the future, let's move scm_recv(_unix())? and its callees to scm.c. Note that only scm_recv() needs to be exported for Bluetooth. scm_send() should be moved to scm.c too, but I'll revisit later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2025-05-23 10:24:18 +01:00
Jiayuan Chen	8259eb0e06	bpf, sockmap: Avoid using sk_socket after free when sending The sk->sk_socket is not locked or referenced in backlog thread, and during the call to skb_send_sock(), there is a race condition with the release of sk_socket. All types of sockets(tcp/udp/unix/vsock) will be affected. Race conditions: ''' CPU0 CPU1 backlog::skb_send_sock sendmsg_unlocked sock_sendmsg sock_sendmsg_nosec close(fd): ... ops->release() -> sock_map_close() sk_socket->ops = NULL free(socket) sock->ops->sendmsg ^ panic here ''' The ref of psock become 0 after sock_map_close() executed. ''' void sock_map_close() { ... if (likely(psock)) { ... // !! here we remove psock and the ref of psock become 0 sock_map_remove_links(sk, psock) psock = sk_psock_get(sk); if (unlikely(!psock)) goto no_psock; <=== Control jumps here via goto ... cancel_delayed_work_sync(&psock->work); <=== not executed sk_psock_put(sk, psock); ... } ''' Based on the fact that we already wait for the workqueue to finish in sock_map_close() if psock is held, we simply increase the psock reference count to avoid race conditions. With this patch, if the backlog thread is running, sock_map_close() will wait for the backlog thread to complete and cancel all pending work. If no backlog running, any pending work that hasn't started by then will fail when invoked by sk_psock_get(), as the psock reference count have been zeroed, and sk_psock_drop() will cancel all jobs via cancel_delayed_work_sync(). In summary, we require synchronization to coordinate the backlog thread and close() thread. The panic I catched: ''' Workqueue: events sk_psock_backlog RIP: 0010:sock_sendmsg+0x21d/0x440 RAX: 0000000000000000 RBX: ffffc9000521fad8 RCX: 0000000000000001 ... Call Trace: <TASK> ? die_addr+0x40/0xa0 ? exc_general_protection+0x14c/0x230 ? asm_exc_general_protection+0x26/0x30 ? sock_sendmsg+0x21d/0x440 ? sock_sendmsg+0x3e0/0x440 ? __pfx_sock_sendmsg+0x10/0x10 __skb_send_sock+0x543/0xb70 sk_psock_backlog+0x247/0xb80 ... ''' Fixes: `4b4647add7` ("sock_map: avoid race between sock_map_close and sk_psock_put") Reported-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20250516141713.291150-1-jiayuan.chen@linux.dev	2025-05-22 16:16:37 -07:00
Eric Dumazet	945301db34	net: add debug checks in ____napi_schedule() and napi_poll() While tracking an IDPF bug, I found that idpf_vport_splitq_napi_poll() was not following NAPI rules. It can indeed return @budget after napi_complete() has been called. Add two debug conditions in networking core to hopefully catch this kind of bugs sooner. IDPF bug will be fixed in a separate patch. [ 72.441242] repoll requested for device eth1 idpf_vport_splitq_napi_poll [idpf] but napi is not scheduled. [ 72.446291] list_del corruption. next->prev should be ff31783d93b14040, but was ff31783d93b10080. (next=ff31783d93b10080) [ 72.446659] kernel BUG at lib/list_debug.c:67! [ 72.446816] Oops: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC NOPTI [ 72.447031] CPU: 156 UID: 0 PID: 16258 Comm: ip Tainted: G W 6.15.0-dbg-DEV #1944 NONE [ 72.447340] Tainted: [W]=WARN [ 72.447702] RIP: 0010:__list_del_entry_valid_or_report (lib/list_debug.c:65) [ 72.450630] Call Trace: [ 72.450720] <IRQ> [ 72.450797] net_rx_action (include/linux/list.h:215 include/linux/list.h:287 net/core/dev.c:7385 net/core/dev.c:7516) [ 72.450928] ? lock_release (kernel/locking/lockdep.c:?) [ 72.451059] ? clockevents_program_event (kernel/time/clockevents.c:?) [ 72.451222] handle_softirqs (kernel/softirq.c:579) [ 72.451356] ? do_softirq (kernel/softirq.c:480) [ 72.451480] ? idpf_vc_xn_exec (drivers/net/ethernet/intel/idpf/idpf_virtchnl.c:462) idpf [ 72.451635] do_softirq (kernel/softirq.c:480) [ 72.451750] </IRQ> [ 72.451828] <TASK> [ 72.451905] __local_bh_enable_ip (kernel/softirq.c:?) [ 72.452051] idpf_vc_xn_exec (drivers/net/ethernet/intel/idpf/idpf_virtchnl.c:462) idpf [ 72.452210] idpf_send_delete_queues_msg (drivers/net/ethernet/intel/idpf/idpf_virtchnl.c:2083) idpf [ 72.452390] idpf_vport_stop (drivers/net/ethernet/intel/idpf/idpf_lib.c:837 drivers/net/ethernet/intel/idpf/idpf_lib.c:868) idpf [ 72.452541] ? idpf_vport_stop (include/linux/bottom_half.h:? include/linux/netdevice.h:4762 drivers/net/ethernet/intel/idpf/idpf_lib.c:855) idpf [ 72.452695] idpf_initiate_soft_reset (drivers/net/ethernet/intel/idpf/idpf_lib.c:?) idpf [ 72.452867] idpf_change_mtu (drivers/net/ethernet/intel/idpf/idpf_lib.c:2189) idpf [ 72.453015] netif_set_mtu_ext (net/core/dev.c:9437) [ 72.453157] ? packet_notifier (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/packet/af_packet.c:4240) [ 72.453292] netif_set_mtu (net/core/dev.c:9515) [ 72.453416] dev_set_mtu (net/core/dev_api.c:?) [ 72.453534] bond_change_mtu (drivers/net/bonding/bond_main.c:4833) [ 72.453666] netif_set_mtu_ext (net/core/dev.c:9437) [ 72.453803] do_setlink (net/core/rtnetlink.c:3116) [ 72.453925] ? rtnl_newlink (net/core/rtnetlink.c:3901) [ 72.454055] ? rtnl_newlink (net/core/rtnetlink.c:3901) [ 72.454185] ? rtnl_newlink (net/core/rtnetlink.c:3901) [ 72.454314] ? trace_contention_end (include/trace/events/lock.h:122) [ 72.454467] ? __mutex_lock (arch/x86/include/asm/preempt.h:85 kernel/locking/mutex.c:611 kernel/locking/mutex.c:746) [ 72.454597] ? cap_capable (include/trace/events/capability.h:26) [ 72.454721] ? security_capable (security/security.c:?) [ 72.454857] rtnl_newlink (net/core/rtnetlink.c:?) [ 72.454982] ? lock_is_held_type (kernel/locking/lockdep.c:5599 kernel/locking/lockdep.c:5938) [ 72.455121] ? __lock_acquire (kernel/locking/lockdep.c:?) [ 72.455256] ? __change_page_attr_set_clr (arch/x86/mm/pat/set_memory.c:685) [ 72.455438] ? __lock_acquire (kernel/locking/lockdep.c:?) [ 72.455582] ? rtnetlink_rcv_msg (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/core/rtnetlink.c:6885) [ 72.455721] ? lock_acquire (kernel/locking/lockdep.c:5866) [ 72.455848] ? rtnetlink_rcv_msg (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/core/rtnetlink.c:6885) [ 72.455987] ? lock_release (kernel/locking/lockdep.c:?) [ 72.456117] ? rcu_read_unlock (include/linux/rcupdate.h:341 include/linux/rcupdate.h:871) [ 72.456249] ? __pfx_rtnl_newlink (net/core/rtnetlink.c:3956) [ 72.456388] rtnetlink_rcv_msg (net/core/rtnetlink.c:6955) [ 72.456526] ? rtnetlink_rcv_msg (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/core/rtnetlink.c:6885) [ 72.456671] ? lock_acquire (kernel/locking/lockdep.c:5866) [ 72.456802] ? net_generic (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 include/net/netns/generic.h:45) [ 72.456929] ? __pfx_rtnetlink_rcv_msg (net/core/rtnetlink.c:6858) [ 72.457082] netlink_rcv_skb (net/netlink/af_netlink.c:2534) [ 72.457212] netlink_unicast (net/netlink/af_netlink.c:1313) [ 72.457344] netlink_sendmsg (net/netlink/af_netlink.c:1883) [ 72.457476] __sock_sendmsg (net/socket.c:712) [ 72.457602] ____sys_sendmsg (net/socket.c:?) [ 72.457735] ? _copy_from_user (arch/x86/include/asm/uaccess_64.h:126 arch/x86/include/asm/uaccess_64.h:134 arch/x86/include/asm/uaccess_64.h:141 include/linux/uaccess.h:178 lib/usercopy.c:18) [ 72.457875] ___sys_sendmsg (net/socket.c:2620) [ 72.458042] ? __call_rcu_common (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 arch/x86/include/asm/irqflags.h:159 kernel/rcu/tree.c:3107) [ 72.458185] ? mntput_no_expire (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 fs/namespace.c:1457) [ 72.458324] ? lock_acquire (kernel/locking/lockdep.c:5866) [ 72.458451] ? mntput_no_expire (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 fs/namespace.c:1457) [ 72.458588] ? lock_release (kernel/locking/lockdep.c:?) [ 72.458718] ? mntput_no_expire (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 fs/namespace.c:1457) [ 72.458856] __x64_sys_sendmsg (net/socket.c:2652) [ 72.458997] ? do_syscall_64 (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 include/linux/entry-common.h:198 arch/x86/entry/syscall_64.c:90) [ 72.459136] do_syscall_64 (arch/x86/entry/syscall_64.c:?) [ 72.459259] ? exc_page_fault (arch/x86/mm/fault.c:1542) [ 72.459387] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) [ 72.459555] RIP: 0033:0x7fd15f17cbd0 Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250520121908.1805732-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-21 20:44:11 -07:00
Eric Biggers	c93f75b2d7	net: remove skb_copy_and_hash_datagram_iter() Now that skb_copy_and_hash_datagram_iter() is no longer used, remove it. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://patch.msgid.link/20250519175012.36581-11-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-21 15:40:17 -07:00
Eric Biggers	ea6342d989	net: add skb_copy_and_crc32c_datagram_iter() Since skb_copy_and_hash_datagram_iter() is used only with CRC32C, the crypto_ahash abstraction provides no value. Add skb_copy_and_crc32c_datagram_iter() which just calls crc32c() directly. This is faster and simpler. It also doesn't have the weird dependency issue where skb_copy_and_hash_datagram_iter() depends on CONFIG_CRYPTO_HASH=y without that being expressed explicitly in the kconfig (presumably because it was too heavyweight for NET to select). The new function is conditional on the hidden boolean symbol NET_CRC32C, which selects CRC32. So it gets compiled only when something that actually needs CRC32C packet checksums is enabled, it has no implicit dependency, and it doesn't depend on the heavyweight crypto layer. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://patch.msgid.link/20250519175012.36581-9-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-21 15:40:17 -07:00
Eric Biggers	70c96c7cb9	net: fold __skb_checksum() into skb_checksum() Now that the only remaining caller of __skb_checksum() is skb_checksum(), fold __skb_checksum() into skb_checksum(). This makes struct skb_checksum_ops unnecessary, so remove that too and simply do the "regular" net checksum. It also makes the wrapper functions csum_partial_ext() and csum_block_add_ext() unnecessary, so remove those too and just use the underlying functions. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://patch.msgid.link/20250519175012.36581-7-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-21 15:40:16 -07:00
Eric Biggers	86edc94da1	net: use skb_crc32c() in skb_crc32c_csum_help() Instead of calling __skb_checksum() with a skb_checksum_ops struct that does CRC32C, just call the new function skb_crc32c(). This is faster and simpler. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://patch.msgid.link/20250519175012.36581-4-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-21 15:39:59 -07:00
Eric Biggers	a5bd029c73	net: add skb_crc32c() Add skb_crc32c(), which calculates the CRC32C of a sk_buff. It will replace __skb_checksum(), which unnecessarily supports arbitrary checksums. Compared to __skb_checksum(), skb_crc32c(): - Uses the correct type for CRC32C values (u32, not __wsum). - Does not require the caller to provide a skb_checksum_ops struct. - Is faster because it does not use indirect calls and does not use the very slow crc32c_combine(). According to commit `2817a336d4` ("net: skb_checksum: allow custom update/combine for walking skb") which added __skb_checksum(), the original motivation for the abstraction layer was to avoid code duplication for CRC32C and other checksums in the future. However: - No additional checksums showed up after CRC32C. __skb_checksum() is only used with the "regular" net checksum and CRC32C. - Indirect calls are expensive. Commit `2544af0344` ("net: avoid indirect calls in L4 checksum calculation") worked around this using the INDIRECT_CALL_1 macro. But that only avoided the indirect call for the net checksum, and at the cost of an extra branch. - The checksums use different types (__wsum and u32), causing casts to be needed. - It made the checksums of fragments be combined (rather than chained) for both checksums, despite this being highly counterproductive for CRC32C due to how slow crc32c_combine() is. This can clearly be seen in commit `4c2f245496` ("sctp: linearize early if it's not GSO") which tried to work around this performance bug. With a dedicated function for each checksum, we can instead just use the proper strategy for each checksum. As shown by the following tables, the new function skb_crc32c() is faster than __skb_checksum(), with the improvement varying greatly from 5% to 2500% depending on the case. The largest improvements come from fragmented packets, mainly due to eliminating the inefficient crc32c_combine(). But linear packets are improved too, especially shorter ones, mainly due to eliminating indirect calls. These benchmarks were done on AMD Zen 5. On that CPU, Linux uses IBRS instead of retpoline; an even greater improvement might be seen with retpoline: Linear sk_buffs Length in bytes __skb_checksum cycles skb_crc32c cycles =============== ===================== ================= 64 43 18 256 94 77 1420 204 161 16384 1735 1642 Nonlinear sk_buffs (even split between head and one fragment) Length in bytes __skb_checksum cycles skb_crc32c cycles =============== ===================== ================= 64 579 22 256 829 77 1420 1506 194 16384 4365 1682 Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://patch.msgid.link/20250519175012.36581-3-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-21 15:39:58 -07:00
Eric Biggers	55d22ee035	net: introduce CONFIG_NET_CRC32C Add a hidden kconfig symbol NET_CRC32C that will group together the functions that calculate CRC32C checksums of packets, so that these don't have to be built into NET-enabled kernels that don't need them. Make skb_crc32c_csum_help() (which is called only when IP_SCTP is enabled) conditional on this symbol, and make IP_SCTP select it. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://patch.msgid.link/20250519175012.36581-2-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-21 15:39:58 -07:00
Kuniyuki Iwashima	f0a56c17e6	inet: Remove rtnl_is_held arg of lwtunnel_valid_encap_type(_attr)?(). Commit `f130a0cc1b` ("inet: fix lwtunnel_valid_encap_type() lock imbalance") added the rtnl_is_held argument as a temporary fix while I'm converting nexthop and IPv6 routing table to per-netns RTNL or RCU. Now all callers of lwtunnel_valid_encap_type() do not hold RTNL. Let's remove the argument. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-20 19:18:24 -07:00
Stanislav Fomichev	f792709e0b	selftests: net: validate team flags propagation Cover three recent cases: 1. missing ops locking for the lowers during netdev_sync_lower_features 2. missing locking for dev_set_promiscuity (plus netdev_ops_assert_locked with a comment on why/when it's needed) 3. rcu lock during team_change_rx_flags Verified that each one triggers when the respective fix is reverted. Not sure about the placement, but since it all relies on teaming, added to the teaming directory. One ugly bit is that I add NETIF_F_LRO to netdevsim; there is no way to trigger netdev_sync_lower_features without it. Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Link: https://patch.msgid.link/20250516232205.539266-1-stfomichev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-20 18:12:58 -07:00
Jakub Kicinski	bebd7b2626	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.15-rc7). Conflicts: tools/testing/selftests/drivers/net/hw/ncdevmem.c `97c4e094a4` ("tests/ncdevmem: Fix double-free of queue array") `2f1a805f32` ("selftests: ncdevmem: Implement devmem TCP TX") https://lore.kernel.org/20250514122900.1e77d62d@canb.auug.org.au Adjacent changes: net/core/devmem.c net/core/devmem.h `0afc44d8cd` ("net: devmem: fix kernel panic when netlink socket close after module unload") `bd61848900` ("net: devmem: Implement TX path") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-15 11:28:30 -07:00
Taehee Yoo	0afc44d8cd	net: devmem: fix kernel panic when netlink socket close after module unload Kernel panic occurs when a devmem TCP socket is closed after NIC module is unloaded. This is Devmem TCP unregistration scenarios. number is an order. (a)netlink socket close (b)pp destroy (c)uninstall result 1 2 3 OK 1 3 2 (d)Impossible 2 1 3 OK 3 1 2 (e)Kernel panic 2 3 1 (d)Impossible 3 2 1 (d)Impossible (a) netdev_nl_sock_priv_destroy() is called when devmem TCP socket is closed. (b) page_pool_destroy() is called when the interface is down. (c) mp_ops->uninstall() is called when an interface is unregistered. (d) There is no scenario in mp_ops->uninstall() is called before page_pool_destroy(). Because unregister_netdevice_many_notify() closes interfaces first and then calls mp_ops->uninstall(). (e) netdev_nl_sock_priv_destroy() accesses struct net_device to acquire netdev_lock(). But if the interface module has already been removed, net_device pointer is invalid, so it causes kernel panic. In summary, there are only 3 possible scenarios. A. sk close -> pp destroy -> uninstall. B. pp destroy -> sk close -> uninstall. C. pp destroy -> uninstall -> sk close. Case C is a kernel panic scenario. In order to fix this problem, It makes mp_dmabuf_devmem_uninstall() set binding->dev to NULL. It indicates an bound net_device was unregistered. It makes netdev_nl_sock_priv_destroy() do not acquire netdev_lock() if binding->dev is NULL. A new binding->lock is added to protect a dev of a binding. So, lock ordering is like below. priv->lock netdev_lock(dev) binding->lock Tests: Scenario A: ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \ -v 7 -t 1 -q 1 & pid=$! sleep 10 kill $pid ip link set $interface down modprobe -rv $module Scenario B: ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \ -v 7 -t 1 -q 1 & pid=$! sleep 10 ip link set $interface down kill $pid modprobe -rv $module Scenario C: ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \ -v 7 -t 1 -q 1 & pid=$! sleep 10 modprobe -rv $module sleep 5 kill $pid Splat looks like: Oops: general protection fault, probably for non-canonical address 0xdffffc001fffa9f7: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI KASAN: probably user-memory-access in range [0x00000000fffd4fb8-0x00000000fffd4fbf] CPU: 0 UID: 0 PID: 2041 Comm: ncdevmem Tainted: G B W 6.15.0-rc1+ #2 PREEMPT(undef) 0947ec89efa0fd68838b78e36aa1617e97ff5d7f Tainted: [B]=BAD_PAGE, [W]=WARN RIP: 0010:__mutex_lock (./include/linux/sched.h:2244 kernel/locking/mutex.c:400 kernel/locking/mutex.c:443 kernel/locking/mutex.c:605 kernel/locking/mutex.c:746) Code: ea 03 80 3c 02 00 0f 85 4f 13 00 00 49 8b 1e 48 83 e3 f8 74 6a 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 34 48 89 fa 48 c1 ea 03 <0f> b6 f RSP: 0018:ffff88826f7ef730 EFLAGS: 00010203 RAX: dffffc0000000000 RBX: 00000000fffd4f88 RCX: ffffffffaa9bc811 RDX: 000000001fffa9f7 RSI: 0000000000000008 RDI: 00000000fffd4fbc RBP: ffff88826f7ef8b0 R08: 0000000000000000 R09: ffffed103e6aa1a4 R10: 0000000000000007 R11: ffff88826f7ef442 R12: fffffbfff669f65e R13: ffff88812a830040 R14: ffff8881f3550d20 R15: 00000000fffd4f88 FS: 0000000000000000(0000) GS:ffff888866c05000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000563bed0cb288 CR3: 00000001a7c98000 CR4: 00000000007506f0 PKRU: 55555554 Call Trace: <TASK> ... netdev_nl_sock_priv_destroy (net/core/netdev-genl.c:953 (discriminator 3)) genl_release (net/netlink/genetlink.c:653 net/netlink/genetlink.c:694 net/netlink/genetlink.c:705) ... netlink_release (net/netlink/af_netlink.c:737) ... __sock_release (net/socket.c:647) sock_close (net/socket.c:1393) Fixes: `1d22d3060b` ("net: drop rtnl_lock for queue_mgmt operations") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250514154028.1062909-1-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-15 08:05:32 -07:00
Sebastian Andrzej Siewior	b9eef3391d	xdp: Use nested-BH locking for system_page_pool system_page_pool is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Make a struct with a page_pool member (original system_page_pool) and a local_lock_t and use local_lock_nested_bh() for locking. This change adds only lockdep coverage and does not alter the functional behaviour for !PREEMPT_RT. Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jesper Dangaard Brouer <hawk@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20250512092736.229935-6-bigeasy@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-05-15 15:23:31 +02:00
Sebastian Andrzej Siewior	c99dac52ff	net: dst_cache: Use nested-BH locking for dst_cache::cache dst_cache::cache is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Add a local_lock_t to the data structure and use local_lock_nested_bh() for locking. This change adds only lockdep coverage and does not alter the functional behaviour for !PREEMPT_RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20250512092736.229935-3-bigeasy@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-05-15 15:23:30 +02:00
Sebastian Andrzej Siewior	32471b2f48	net: page_pool: Don't recycle into cache on PREEMPT_RT With preemptible softirq and no per-CPU locking in local_bh_disable() on PREEMPT_RT the consumer can be preempted while a skb is returned. Avoid the race by disabling the recycle into the cache on PREEMPT_RT. Cc: Jesper Dangaard Brouer <hawk@kernel.org> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20250512092736.229935-2-bigeasy@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-05-15 15:23:30 +02:00
Mina Almasry	ae28cb1147	net: check for driver support in netmem TX We should not enable netmem TX for drivers that don't declare support. Check for driver netmem TX support during devmem TX binding and fail if the driver does not have the functionality. Check for driver support in validate_xmit_skb as well. Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250508004830.4100853-9-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-05-13 11:12:49 +02:00
Mina Almasry	bd61848900	net: devmem: Implement TX path Augment dmabuf binding to be able to handle TX. Additional to all the RX binding, we also create tx_vec needed for the TX path. Provide API for sendmsg to be able to send dmabufs bound to this device: - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from. - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf. Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY implementation, while disabling instances where MSG_ZEROCOPY falls back to copying. We additionally pipe the binding down to the new zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems instead of the traditional page netmems. We also special case skb_frag_dma_map to return the dma-address of these dmabuf net_iovs instead of attempting to map pages. The TX path may release the dmabuf in a context where we cannot wait. This happens when the user unbinds a TX dmabuf while there are still references to its netmems in the TX path. In that case, the netmems will be put_netmem'd from a context where we can't unmap the dmabuf, Resolve this by making __net_devmem_dmabuf_binding_free schedule_work'd. Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat of the implementation came from devmem TCP RFC v1[1], which included the TX path, but Stan did all the rebasing on top of netmem/net_iov. Cc: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250508004830.4100853-5-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-05-13 11:12:48 +02:00
Stanislav Fomichev	8802087d20	net: devmem: TCP tx netlink api Add bind-tx netlink call to attach dmabuf for TX; queue is not required, only ifindex and dmabuf fd for attachment. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250508004830.4100853-4-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-05-13 11:12:48 +02:00
Mina Almasry	e9f3d61db5	net: add get_netmem/put_netmem support Currently net_iovs support only pp ref counts, and do not support a page ref equivalent. This is fine for the RX path as net_iovs are used exclusively with the pp and only pp refcounting is needed there. The TX path however does not use pp ref counts, thus, support for get_page/put_page equivalent is needed for netmem. Support get_netmem/put_netmem. Check the type of the netmem before passing it to page or net_iov specific code to obtain a page ref equivalent. For dmabuf net_iovs, we obtain a ref on the underlying binding. This ensures the entire binding doesn't disappear until all the net_iovs have been put_netmem'ed. We do not need to track the refcount of individual dmabuf net_iovs as we don't allocate/free them from a pool similar to what the buddy allocator does for pages. This code is written to be extensible by other net_iov implementers. get_netmem/put_netmem will check the type of the netmem and route it to the correct helper: pages -> [get\|put]_page() dmabuf net_iovs -> net_devmem_[get\|put]_net_iov() new net_iovs -> new helpers Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250508004830.4100853-3-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-05-13 11:12:48 +02:00
Mina Almasry	03e96b8c11	netmem: add niov->type attribute to distinguish different net_iov types Later patches in the series adds TX net_iovs where there is no pp associated, so we can't rely on niov->pp->mp_ops to tell what is the type of the net_iov. Add a type enum to the net_iov which tells us the net_iov type. Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250508004830.4100853-2-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-05-13 11:12:48 +02:00
Cosmin Ratiu	af5f54b0ef	net: Lock lower level devices when updating features __netdev_update_features() expects the netdevice to be ops-locked, but it gets called recursively on the lower level netdevices to sync their features, and nothing locks those. This commit fixes that, with the assumption that it shouldn't be possible for both higher-level and lover-level netdevices to require the instance lock, because that would lead to lock dependency warnings. Without this, playing with higher level (e.g. vxlan) netdevices on top of netdevices with instance locking enabled can run into issues: WARNING: CPU: 59 PID: 206496 at ./include/net/netdev_lock.h:17 netif_napi_add_weight_locked+0x753/0xa60 [...] Call Trace: <TASK> mlx5e_open_channel+0xc09/0x3740 [mlx5_core] mlx5e_open_channels+0x1f0/0x770 [mlx5_core] mlx5e_safe_switch_params+0x1b5/0x2e0 [mlx5_core] set_feature_lro+0x1c2/0x330 [mlx5_core] mlx5e_handle_feature+0xc8/0x140 [mlx5_core] mlx5e_set_features+0x233/0x2e0 [mlx5_core] __netdev_update_features+0x5be/0x1670 __netdev_update_features+0x71f/0x1670 dev_ethtool+0x21c5/0x4aa0 dev_ioctl+0x438/0xae0 sock_ioctl+0x2ba/0x690 __x64_sys_ioctl+0xa78/0x1700 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Fixes: `7e4d784f58` ("net: hold netdev instance lock during rtnetlink operations") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250509072850.2002821-1-cratiu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-12 18:44:33 -07:00
Feng Yang	ee971630f2	bpf: Allow some trace helpers for all prog types if it works under NMI and doesn't use any context-dependent things, should be fine for any program type. The detailed discussion is in [1]. [1] https://lore.kernel.org/all/CAEf4Bza6gK3dsrTosk6k3oZgtHesNDSrDd8sdeQ-GiS6oJixQg@mail.gmail.com/ Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20250506061434.94277-2-yangfeng59949@163.com	2025-05-09 10:37:10 -07:00
Jakub Kicinski	6b02fd7799	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.15-rc6). No conflicts. Adjacent changes: net/core/dev.c: `08e9f2d584` ("net: Lock netdevices during dev_shutdown") `a82dc19db1` ("net: avoid potential race between netdev_get_by_index_lock() and netns switch") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-08 08:59:02 -07:00
Jakub Kicinski	23fa6a23d9	net: export a helper for adding up queue stats Older drivers and drivers with lower queue counts often have a static array of queues, rather than allocating structs for each queue on demand. Add a helper for adding up qstats from a queue range. Expectation is that driver will pass a queue range [netdev->real_num_x_queues, MAX). It was tempting to always use num_x_queues as the end, but virtio seems to clamp its queue count after allocating the netdev. And this way we can trivaly reuse the helper for [0, real_..). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250507003221.823267-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-05-08 11:56:12 +02:00
Paul Chaignon	c432722994	bpf: Scrub packet on bpf_redirect_peer When bpf_redirect_peer is used to redirect packets to a device in another network namespace, the skb isn't scrubbed. That can lead skb information from one namespace to be "misused" in another namespace. As one example, this is causing Cilium to drop traffic when using bpf_redirect_peer to redirect packets that just went through IPsec decryption to a container namespace. The following pwru trace shows (1) the packet path from the host's XFRM layer to the container's XFRM layer where it's dropped and (2) the number of active skb extensions at each function. NETNS MARK IFACE TUPLE FUNC 4026533547 d00 eth0 10.244.3.124:35473->10.244.2.158:53 xfrm_rcv_cb .active_extensions = (__u8)2, 4026533547 d00 eth0 10.244.3.124:35473->10.244.2.158:53 xfrm4_rcv_cb .active_extensions = (__u8)2, 4026533547 d00 eth0 10.244.3.124:35473->10.244.2.158:53 gro_cells_receive .active_extensions = (__u8)2, [...] 4026533547 0 eth0 10.244.3.124:35473->10.244.2.158:53 skb_do_redirect .active_extensions = (__u8)2, 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 ip_rcv .active_extensions = (__u8)2, 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 ip_rcv_core .active_extensions = (__u8)2, [...] 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 udp_queue_rcv_one_skb .active_extensions = (__u8)2, 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 __xfrm_policy_check .active_extensions = (__u8)2, 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 __xfrm_decode_session .active_extensions = (__u8)2, 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 security_xfrm_decode_session .active_extensions = (__u8)2, 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 kfree_skb_reason(SKB_DROP_REASON_XFRM_POLICY) .active_extensions = (__u8)2, In this case, there are no XFRM policies in the container's network namespace so the drop is unexpected. When we decrypt the IPsec packet, the XFRM state used for decryption is set in the skb extensions. This information is preserved across the netns switch. When we reach the XFRM policy check in the container's netns, __xfrm_policy_check drops the packet with LINUX_MIB_XFRMINNOPOLS because a (container-side) XFRM policy can't be found that matches the (host-side) XFRM state used for decryption. This patch fixes this by scrubbing the packet when using bpf_redirect_peer, as is done on typical netns switches via veth devices except skb->mark and skb->tstamp are not zeroed. Fixes: `9aa1206e8f` ("bpf: Add redirect_peer helper") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/1728ead5e0fe45e7a6542c36bd4e3ca07a73b7d6.1746460653.git.paul.chaignon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-07 18:16:33 -07:00
Stanislav Fomichev	78cd408356	net: add missing instance lock to dev_set_promiscuity Accidentally spotted while trying to understand what else needs to be renamed to netif_ prefix. Most of the calls to dev_set_promiscuity are adjacent to dev_set_allmulti or dev_disable_lro so it should be safe to add the lock. Note that new netif_set_promiscuity is currently unused, the locked paths call __dev_set_promiscuity directly. Fixes: `ad7c7b2172` ("net: hold netdev instance lock during sysfs operations") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250506011919.2882313-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-06 18:52:39 -07:00
Cosmin Ratiu	08e9f2d584	net: Lock netdevices during dev_shutdown __qdisc_destroy() calls into various qdiscs .destroy() op, which in turn can call .ndo_setup_tc(), which requires the netdev instance lock. This commit extends the critical section in unregister_netdevice_many_notify() to cover dev_shutdown() (and dev_tcx_uninstall() as a side-effect) and acquires the netdev instance lock in __dev_change_net_namespace() for the other dev_shutdown() call. This should now guarantee that for all qdisc ops, the netdev instance lock is held during .ndo_setup_tc(). Fixes: `a0527ee2df` ("net: hold netdev instance lock during qdisc ndo_setup_tc") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250505194713.1723399-1-cratiu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-06 18:31:32 -07:00
Bui Quang Minh	7ead4405e0	xsk: convert xdp_copy_frags_from_zc() to use page_pool_dev_alloc() This commit makes xdp_copy_frags_from_zc() use page allocation API page_pool_dev_alloc() instead of page_pool_dev_alloc_netmem() to avoid possible confusion of the returned value. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Link: https://patch.msgid.link/20250426081220.40689-3-minhquangbui99@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-29 14:26:26 -07:00
Bui Quang Minh	ebaebc5eaf	xsk: respect the offsets when copying frags In commit `560d958c6c` ("xsk: add generic XSk &xdp_buff -> skb conversion"), we introduce a helper to convert zerocopy xdp_buff to skb. However, in the frag copy, we mistakenly ignore the frag's offset. This commit adds the missing offset when copying frags in xdp_copy_frags_from_zc(). This function is not used anywhere so no backport is needed. Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Link: https://patch.msgid.link/20250426081220.40689-2-minhquangbui99@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-29 14:26:26 -07:00
Alexei Starovoitov	224ee86639	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc4 Cross-merge bpf and other fixes after downstream PRs. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-28 08:40:45 -07:00
Christian Brauner	20b70e5896	net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid Now that all preconditions are met, allow handing out pidfs for reaped sk->sk_peer_pids. Thanks to Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> for pointing out that we need to limit this to AF_UNIX sockets for now. Link: https://lore.kernel.org/20250425-work-pidfs-net-v2-4-450a19461e75@kernel.org Reviewed-by: David Rheinsberg <david@readahead.eu> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-04-28 11:04:43 +02:00
Jakub Kicinski	5565acd1e6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.15-rc4). This pull includes wireless and a fix to vxlan which isn't in Linus's tree just yet. The latter creates with a silent conflict / build breakage, so merging it now to avoid causing problems. drivers/net/vxlan/vxlan_vnifilter.c `094adad913` ("vxlan: Use a single lock to protect the FDB table") `087a9eb9e5` ("vxlan: vnifilter: Fix unlocked deletion of default FDB entry") https://lore.kernel.org/20250423145131.513029-1-idosch@nvidia.com No "normal" conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-24 11:20:52 -07:00
Kuniyuki Iwashima	f0cc3777b2	net: Fix wild-memory-access in __register_pernet_operations() when CONFIG_NET_NS=n. kernel test robot reported the splat below. [0] Before commit `fed176bf31` ("net: Add ops_undo_single for module load/unload."), if CONFIG_NET_NS=n, ops was linked to pernet_list only when init_net had not been initialised, and ops was unlinked from pernet_list only under the same condition. Let's say an ops is loaded before the init_net setup but unloaded after that. Then, the ops remains in pernet_list, which seems odd. The cited commit added ops_undo_single(), which calls list_add() for ops to link it to a temporary list, so a minor change was added to __register_pernet_operations() and __unregister_pernet_operations() under CONFIG_NET_NS=n to avoid the pernet_list corruption. However, the corruption must have been left as is. When CONFIG_NET_NS=n, pernet_list was used to keep ops registered before the init_net setup, and after that, pernet_list was not used at all. This was because some ops annotated with __net_initdata are cleared out of memory at some point during boot. Then, such ops is initialised by POISON_FREE_INITMEM (0xcc), resulting in that ops->list.{next,prev} suddenly switches from a valid pointer to a weird value, 0xcccccccccccccccc. To avoid such wild memory access, let's allow the pernet_list corruption for CONFIG_NET_NS=n. [0]: Oops: general protection fault, probably for non-canonical address 0xf999959999999999: 0000 [#1] SMP KASAN NOPTI KASAN: maybe wild-memory-access in range [0xccccccccccccccc8-0xcccccccccccccccf] CPU: 2 UID: 0 PID: 346 Comm: modprobe Not tainted 6.15.0-rc1-00294-ga4cba7e98e35 #85 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__list_add_valid_or_report (lib/list_debug.c:32) Code: 48 c1 ea 03 80 3c 02 00 0f 85 5a 01 00 00 49 39 74 24 08 0f 85 83 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 1f 01 00 00 4c 39 26 0f 85 ab 00 00 00 4c 39 ee RSP: 0018:ff11000135b87830 EFLAGS: 00010a07 RAX: dffffc0000000000 RBX: ffffffffc02223c0 RCX: ffffffff8406fcc2 RDX: 1999999999999999 RSI: cccccccccccccccc RDI: ffffffffc02223c0 RBP: ffffffff86064e40 R08: 0000000000000001 R09: fffffbfff0a9f5b5 R10: ffffffff854fadaf R11: 676552203a54454e R12: ffffffff86064e40 R13: ffffffffc02223c0 R14: ffffffff86064e48 R15: 0000000000000021 FS: 00007f6fb0d9e1c0(0000) GS:ff11000858ea0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6fb0eda580 CR3: 0000000122fec005 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> register_pernet_operations (./include/linux/list.h:150 (discriminator 5) ./include/linux/list.h:183 (discriminator 5) net/core/net_namespace.c:1315 (discriminator 5) net/core/net_namespace.c:1359 (discriminator 5)) register_pernet_subsys (net/core/net_namespace.c:1401) inet6_init (net/ipv6/af_inet6.c:535) ipv6 do_one_initcall (init/main.c:1257) do_init_module (kernel/module/main.c:2942) load_module (kernel/module/main.c:3409) init_module_from_file (kernel/module/main.c:3599) idempotent_init_module (kernel/module/main.c:3611) __x64_sys_finit_module (./include/linux/file.h:62 ./include/linux/file.h:83 kernel/module/main.c:3634 kernel/module/main.c:3621 kernel/module/main.c:3621) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) RIP: 0033:0x7f6fb0df7e5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007fffdc6a8968 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 RAX: ffffffffffffffda RBX: 000055b535721b70 RCX: 00007f6fb0df7e5d RDX: 0000000000000000 RSI: 000055b51e44aa2a RDI: 0000000000000004 RBP: 0000000000040000 R08: 0000000000000000 R09: 000055b535721b30 R10: 0000000000000004 R11: 0000000000000246 R12: 000055b51e44aa2a R13: 000055b535721bf0 R14: 000055b5357220b0 R15: 0000000000000000 </TASK> Modules linked in: ipv6(+) crc_ccitt Fixes: `fed176bf31` ("net: Add ops_undo_single for module load/unload.") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202504181511.1c3f23e4-lkp@intel.com Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418215025.87871-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-23 16:24:56 -07:00
Joshua Washington	0e0a7e3719	xdp: create locked/unlocked instances of xdp redirect target setters Commit `03df156dd3` ("xdp: double protect netdev->xdp_flags with netdev->lock") introduces the netdev lock to xdp_set_features_flag(). The change includes a _locked version of the method, as it is possible for a driver to have already acquired the netdev lock before calling this helper. However, the same applies to xdp_features_(set\|clear)_redirect_flags(), which ends up calling the unlocked version of xdp_set_features_flags() leading to deadlocks in GVE, which grabs the netdev lock as part of its suspend, reset, and shutdown processes: [ 833.265543] WARNING: possible recursive locking detected [ 833.270949] 6.15.0-rc1 #6 Tainted: G E [ 833.276271] -------------------------------------------- [ 833.281681] systemd-shutdow/1 is trying to acquire lock: [ 833.287090] ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: xdp_set_features_flag+0x29/0x90 [ 833.295470] [ 833.295470] but task is already holding lock: [ 833.301400] ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: gve_shutdown+0x44/0x90 [gve] [ 833.309508] [ 833.309508] other info that might help us debug this: [ 833.316130] Possible unsafe locking scenario: [ 833.316130] [ 833.322142] CPU0 [ 833.324681] ---- [ 833.327220] lock(&dev->lock); [ 833.330455] lock(&dev->lock); [ 833.333689] [ 833.333689] * DEADLOCK * [ 833.333689] [ 833.339701] May be due to missing lock nesting notation [ 833.339701] [ 833.346582] 5 locks held by systemd-shutdow/1: [ 833.351205] #0: ffffffffa9c89130 (system_transition_mutex){+.+.}-{4:4}, at: __se_sys_reboot+0xe6/0x210 [ 833.360695] #1: ffff93b399e5c1b8 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xb4/0x1f0 [ 833.369144] #2: ffff949d19a471b8 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xc2/0x1f0 [ 833.377603] #3: ffffffffa9eca050 (rtnl_mutex){+.+.}-{4:4}, at: gve_shutdown+0x33/0x90 [gve] [ 833.386138] #4: ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: gve_shutdown+0x44/0x90 [gve] Introduce xdp_features_(set\|clear)_redirect_target_locked() versions which assume that the netdev lock has already been acquired before setting the XDP feature flag and update GVE to use the locked version. Fixes: `03df156dd3` ("xdp: double protect netdev->xdp_flags with netdev->lock") Tested-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250422011643.3509287-1-joshwash@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-22 19:57:56 -07:00
Kuniyuki Iwashima	434efd3d0c	net: Drop hold_rtnl arg from ops_undo_list(). ops_undo_list() first iterates over ops_list for ->pre_exit(). Let's check if any of the ops has ->exit_rtnl() there and drop the hold_rtnl argument. Note that nexthop uses ->exit_rtnl() and is built-in, so hold_rtnl is always true for setup_net() and cleanup_net() for now. Suggested-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/netdev/20250414170148.21f3523c@kernel.org/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418003259.48017-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-22 19:07:41 -07:00
Justin Iurman	c03a49f309	net: lwtunnel: disable BHs when required In lwtunnel_{output\|xmit}(), dev_xmit_recursion() may be called in preemptible scope for PREEMPT kernels. This patch disables BHs before calling dev_xmit_recursion(). BHs are re-enabled only at the end, since we must ensure the same CPU is used for both dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other recursion levels in some cases) in order to maintain valid per-cpu counters. Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Closes: https://lore.kernel.org/netdev/CAADnVQJFWn3dBFJtY+ci6oN1pDFL=TzCmNbRgey7MdYxt_AP2g@mail.gmail.com/ Reported-by: Eduard Zingerman <eddyz87@gmail.com> Closes: https://lore.kernel.org/netdev/m2h62qwf34.fsf@gmail.com/ Fixes: `986ffb3a57` ("net: lwtunnel: fix recursion loops") Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250416160716.8823-1-justin.iurman@uliege.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-04-22 15:37:01 +02:00
Oleksij Rempel	9e8d1013b0	net: selftests: initialize TCP header and skb payload with zero Zero-initialize TCP header via memset() to avoid garbage values that may affect checksum or behavior during test transmission. Also zero-fill allocated payload and padding regions using memset() after skb_put(), ensuring deterministic content for all outgoing test packets. Fixes: `3e1e58d64c` ("net: add generic selftest support") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Cc: stable@vger.kernel.org Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250416160125.2914724-1-o.rempel@pengutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-04-22 15:30:35 +02:00
Breno Leitao	a451930180	net: Use nlmsg_payload in rtnetlink file Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250417-nlmsg_v3-v1-2-9b09d9d7e61d@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-21 18:38:00 -07:00
Breno Leitao	9929ba1942	net: Use nlmsg_payload in neighbour file Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250417-nlmsg_v3-v1-1-9b09d9d7e61d@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-21 18:38:00 -07:00
Jakub Kicinski	d3153c3b42	net: fix the missing unlock for detached devices The combined condition was left as is when we converted from __dev_get_by_index() to netdev_get_by_index_lock(). There was no need to undo anything with the former, for the latter we need an unlock. Fixes: `1d22d3060b` ("net: drop rtnl_lock for queue_mgmt operations") Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250418015317.1954107-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-21 17:10:49 -07:00
Alexei Starovoitov	5709be4c35	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc3 Cross-merge bpf and other fixes after downstream PRs. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-21 08:04:38 -07:00
Jakub Kicinski	22cbc1ee26	netdev: fix the locking for netdev notifications Kuniyuki reports that the assert for netdev lock fires when there are netdev event listeners (otherwise we skip the netlink event generation). Correct the locking when coming from the notifier. The NETDEV_XDP_FEAT_CHANGE notifier is already fully locked, it's the documentation that's incorrect. Fixes: `99e44f39a8` ("netdev: depend on netdev->lock for xdp features") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/20250410171019.62128-1-kuniyu@amazon.com Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250416030447.1077551-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-17 18:55:14 -07:00
Jakub Kicinski	240ce924d2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.15-rc3). No conflicts. Adjacent changes: tools/net/ynl/pyynl/ynl_gen_c.py `4d07bbf2d4` ("tools: ynl-gen: don't declare loop iterator in place") `7e8ba0c7de` ("tools: ynl: don't use genlmsghdr in classic netlink") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-17 12:26:50 -07:00
Linus Torvalds	b5c6891b2c	Including fixes from Bluetooth, CAN and Netfilter. Current release - regressions: - 2 fixes for the netdev per-instance locking - batman-adv: fix double-hold of meshif when getting enabled Current release - new code bugs: - Bluetooth: increment TX timestamping tskey always for stream sockets - wifi: static analysis and build fixes for the new Intel sub-driver Previous releases - regressions: - net: fib_rules: fix iif / oif matching on L3 master (VRF) device - ipv6: add exception routes to GC list in rt6_insert_exception() - netfilter: conntrack: fix erroneous removal of offload bit - Bluetooth: - fix sending MGMT_EV_DEVICE_FOUND for invalid address - l2cap: process valid commands in too long frame - btnxpuart: Revert baudrate change in nxp_shutdown Previous releases - always broken: - ethtool: fix memory corruption during SFP FW flashing - eth: hibmcge: fixes for link and MTU handling, pause frames etc. - eth: igc: fixes for PTM (PCIe timestamping) - dsa: b53: enable BPDU reception for management port Misc: - fixes for Netlink protocol schemas Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmgBMWAACgkQMUZtbf5S Irv3kw//enBPRBMhTdXop2g9Rlw8zQEwuMinq9ytgiNbWxshd7wXHNDYKICaX8qK KkN4InxF4ieESHGNUy9/UvZLAEHYV8BqugdUGurwCnEg9QJlThsDmRCYTxI37Sdx 8uUYoSYWvPcHUsw3s/qBzunAZ8R/5hbP99L1M8W+3DqnJUm6qjUYjWCmm+8sn6FW abZ8JiazJhGKfTtIGxxN3oOJBzKcdeTK+z29ujQBWegnau8uj92QKkuoTZTtHaYL MU5pRyDQS5vo/wLBizTGQQ7PrcIpKgM1mi5Oayl5WbvTUF1pJ8pvT1iT7jnGOt4c lWPImtdQwsNwAwFNU7/S+YkBvrYBsCUPDErQ82HtxRsNXcCzKvuM4SqkvzYS5SZd Qz0ZuKqXbxxSgRgasvikWP6qAI8vz4zT8pjXu/a3eE5xnsSqhwXVbEwez8KqRF79 Cxm/iYz/m/NhYW59YSobssNHjGM4QgOu/dFr37EPCgf8BK//RLFckwQVA1qBogO/ HEwkWhQFYNJY0BDXuzO8rCPFQbs/NBiuNU0xRbKqPGkGcqKRwTd/KLOeTe/NJvtE mQx5R+VNVEyNdlQtXh5muEJHhN2nLYzQUTM5gkt6jwIzcLt7HVE+ZF3nkAKf1EQB 1nYN8gxCO1convTNT84Eldpp2At8zazz+ltnC9gWIEZhdTct46Q= =EObk -----END PGP SIGNATURE----- Merge tag 'net-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from Bluetooth, CAN and Netfilter. Current release - regressions: - two fixes for the netdev per-instance locking - batman-adv: fix double-hold of meshif when getting enabled Current release - new code bugs: - Bluetooth: increment TX timestamping tskey always for stream sockets - wifi: static analysis and build fixes for the new Intel sub-driver Previous releases - regressions: - net: fib_rules: fix iif / oif matching on L3 master (VRF) device - ipv6: add exception routes to GC list in rt6_insert_exception() - netfilter: conntrack: fix erroneous removal of offload bit - Bluetooth: - fix sending MGMT_EV_DEVICE_FOUND for invalid address - l2cap: process valid commands in too long frame - btnxpuart: Revert baudrate change in nxp_shutdown Previous releases - always broken: - ethtool: fix memory corruption during SFP FW flashing - eth: - hibmcge: fixes for link and MTU handling, pause frames etc - igc: fixes for PTM (PCIe timestamping) - dsa: b53: enable BPDU reception for management port Misc: - fixes for Netlink protocol schemas" * tag 'net-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits) net: ethernet: mtk_eth_soc: revise QDMA packet scheduler settings net: ethernet: mtk_eth_soc: correct the max weight of the queue limit for 100Mbps net: ethernet: mtk_eth_soc: reapply mdc divider on reset net: ti: icss-iep: Fix possible NULL pointer dereference for perout request net: ti: icssg-prueth: Fix possible NULL pointer dereference inside emac_xmit_xdp_frame() net: ti: icssg-prueth: Fix kernel warning while bringing down network interface netfilter: conntrack: fix erronous removal of offload bit net: don't try to ops lock uninitialized devs ptp: ocp: fix start time alignment in ptp_ocp_signal_set net: dsa: avoid refcount warnings when ds->ops->tag_8021q_vlan_del() fails net: dsa: free routing table on probe failure net: dsa: clean up FDB, MDB, VLAN entries on unbind net: dsa: mv88e6xxx: fix -ENOENT when deleting VLANs and MST is unsupported net: dsa: mv88e6xxx: avoid unregistering devlink regions which were never registered net: txgbe: fix memory leak in txgbe_probe() error path net: bridge: switchdev: do not notify new brentries as changed net: b53: enable BPDU reception for management port netlink: specs: rt-neigh: prefix struct nfmsg members with ndm netlink: specs: rt-link: adjust mctp attribute naming netlink: specs: rtnetlink: attribute naming corrections ...	2025-04-17 11:45:30 -07:00
Peter Seiderer	422cf22aa3	net: pktgen: fix code style (WARNING: Prefer strscpy over strcpy) Fix checkpatch code style warnings: WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #1423: FILE: net/core/pktgen.c:1423: + strcpy(pkt_dev->dst_min, buf); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #1444: FILE: net/core/pktgen.c:1444: + strcpy(pkt_dev->dst_max, buf); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #1554: FILE: net/core/pktgen.c:1554: + strcpy(pkt_dev->src_min, buf); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #1575: FILE: net/core/pktgen.c:1575: + strcpy(pkt_dev->src_max, buf); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #3231: FILE: net/core/pktgen.c:3231: + strcpy(pkt_dev->result, "Starting"); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #3235: FILE: net/core/pktgen.c:3235: + strcpy(pkt_dev->result, "Error starting"); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #3849: FILE: net/core/pktgen.c:3849: + strcpy(pkt_dev->odevname, ifname); While at it squash memset/strcpy pattern into single strscpy_pad call. Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250415112916.113455-4-ps.report@gmx.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-04-17 13:02:41 +02:00
Peter Seiderer	65f5b9cb54	net: pktgen: fix code style (WARNING: please, no space before tabs) Fix checkpatch code style warnings: WARNING: please, no space before tabs #230: FILE: net/core/pktgen.c:230: +#define M_NETIF_RECEIVE ^I1^I/* Inject packets into stack */$ Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250415112916.113455-3-ps.report@gmx.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-04-17 13:02:41 +02:00
Peter Seiderer	3bc1ca7e17	net: pktgen: fix code style (ERROR: else should follow close brace '}') Fix checkpatch code style errors: ERROR: else should follow close brace '}' #1317: FILE: net/core/pktgen.c:1317: + } + else And checkpatch follow up code style check: CHECK: Unbalanced braces around else statement #1316: FILE: net/core/pktgen.c:1316: + } else Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250415112916.113455-2-ps.report@gmx.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-04-17 13:02:41 +02:00
Antonio Quartulli	17240749f2	skb: implement skb_send_sock_locked_with_flags() When sending an skb over a socket using skb_send_sock_locked(), it is currently not possible to specify any flag to be set in msghdr->msg_flags. However, we may want to pass flags the user may have specified, like MSG_NOSIGNAL. Extend __skb_send_sock() with a new argument 'flags' and add a new interface named skb_send_sock_locked_with_flags(). Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Link: https://patch.msgid.link/20250415-b4-ovpn-v26-12-577f6097b964@openvpn.net Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-04-17 12:30:03 +02:00
Christian Brauner	b590c928cc	net, pidfd: report EINVAL for ESRCH dbus-broker relies -EINVAL being returned to indicate ESRCH in [1]. This causes issues for some workloads as reported in [2]. Paper over it until this is fixed in userspace. Link: https://lore.kernel.org/20250416-gegriffen-tiefbau-70cfecb80ac8@brauner Link: `5d34d91b13/src/util/sockopt.c (L241)` [1] Link: https://lore.kernel.org/20250415223454.GA1852104@ax162 [2] Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-04-17 09:43:26 +02:00
Jakub Kicinski	4798cfa209	net: don't try to ops lock uninitialized devs We need to be careful when operating on dev while in rtnl_create_link(). Some devices (vxlan) initialize netdev_ops in ->newlink, so later on. Avoid using netdev_lock_ops(), the device isn't registered so we cannot legally call its ops or generate any notifications for it. netdev_ops_assert_locked_or_invisible() is safe to use, it checks registration status first. Reported-by: syzbot+de1c7d68a10e3f123bdd@syzkaller.appspotmail.com Fixes: `04efcee6ef` ("net: hold instance lock during NETDEV_CHANGE") Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250415151552.768373-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-16 18:28:11 -07:00
Ido Schimmel	2d300ce0b7	net: fib_rules: Fix iif / oif matching on L3 master device Before commit `40867d74c3` ("net: Add l3mdev index to flow struct and avoid oif reset for port devices") it was possible to use FIB rules to match on a L3 domain. This was done by having a FIB rule match on iif / oif being a L3 master device. It worked because prior to the FIB rule lookup the iif / oif fields in the flow structure were reset to the index of the L3 master device to which the input / output device was enslaved to. The above scheme made it impossible to match on the original input / output device. Therefore, cited commit stopped overwriting the iif / oif fields in the flow structure and instead stored the index of the enslaving L3 master device in a new field ('flowi_l3mdev') in the flow structure. While the change enabled new use cases, it broke the original use case of matching on a L3 domain. Fix this by interpreting the iif / oif matching on a L3 master device as a match against the L3 domain. In other words, if the iif / oif in the FIB rule points to a L3 master device, compare the provided index against 'flowi_l3mdev' rather than 'flowi_{i,o}if'. Before cited commit, a FIB rule that matched on 'iif vrf1' would only match incoming traffic from devices enslaved to 'vrf1'. With the proposed change (i.e., comparing against 'flowi_l3mdev'), the rule would also match traffic originating from a socket bound to 'vrf1'. Avoid that by adding a new flow flag ('FLOWI_FLAG_L3MDEV_OIF') that indicates if the L3 domain was derived from the output interface or the input interface (when not set) and take this flag into account when evaluating the FIB rule against the flow structure. Avoid unnecessary checks in the data path by detecting that a rule matches on a L3 master device when the rule is installed and marking it as such. Tested using the following script [1]. Output before `40867d74c3` (v5.4.291): default dev dummy1 table 100 scope link default dev dummy1 table 200 scope link Output after 40867d74c374: default dev dummy1 table 300 scope link default dev dummy1 table 300 scope link Output with this patch: default dev dummy1 table 100 scope link default dev dummy1 table 200 scope link [1] #!/bin/bash ip link add name vrf1 up type vrf table 10 ip link add name dummy1 up master vrf1 type dummy sysctl -wq net.ipv4.conf.all.forwarding=1 sysctl -wq net.ipv4.conf.all.rp_filter=0 ip route add table 100 default dev dummy1 ip route add table 200 default dev dummy1 ip route add table 300 default dev dummy1 ip rule add prio 0 oif vrf1 table 100 ip rule add prio 1 iif vrf1 table 200 ip rule add prio 2 table 300 ip route get 192.0.2.1 oif dummy1 fibmatch ip route get 192.0.2.1 iif dummy1 from 198.51.100.1 fibmatch Fixes: `40867d74c3` ("net: Add l3mdev index to flow struct and avoid oif reset for port devices") Reported-by: hanhuihui <hanhuihui5@huawei.com> Closes: https://lore.kernel.org/netdev/ec671c4f821a4d63904d0da15d604b75@huawei.com/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250414172022.242991-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-15 17:54:56 -07:00
Breno Leitao	8ff9530361	net: fib_rules: Use nlmsg_payload in fib_{new,del}rule() Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250414-nlmsg-v2-10-3d90cb42c6af@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-15 08:28:55 -07:00
Breno Leitao	4c113c803f	net: fib_rules: Use nlmsg_payload in fib_valid_dumprule_req Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250414-nlmsg-v2-9-3d90cb42c6af@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-15 08:28:55 -07:00
Breno Leitao	77d0229036	rtnetlink: Use nlmsg_payload in valid_fdb_dump_strict Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250414-nlmsg-v2-4-3d90cb42c6af@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-15 08:28:54 -07:00
Breno Leitao	2d1f827f06	neighbour: Use nlmsg_payload in neigh_valid_get_req Update neigh_valid_get_req function to utilize the new nlmsg_payload() helper function. This change improves code clarity and safety by ensuring that the Netlink message payload is properly validated before accessing its data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250414-nlmsg-v2-3-3d90cb42c6af@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-15 08:28:54 -07:00
Breno Leitao	7527efe8a4	neighbour: Use nlmsg_payload in neightbl_valid_dump_info Update neightbl_valid_dump_info function to utilize the new nlmsg_payload() helper function. This change improves code clarity and safety by ensuring that the Netlink message payload is properly validated before accessing its data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250414-nlmsg-v2-2-3d90cb42c6af@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-15 08:28:54 -07:00
Kuniyuki Iwashima	c57a9c5035	net: Remove ->exit_batch_rtnl(). There are no ->exit_batch_rtnl() users remaining. Let's remove the hook. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250411205258.63164-15-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 17:08:45 -07:00
Kuniyuki Iwashima	7a60d91c69	net: Add ->exit_rtnl() hook to struct pernet_operations. struct pernet_operations provides two batching hooks; ->exit_batch() and ->exit_batch_rtnl(). The batching variant is beneficial if ->exit() meets any of the following conditions: 1) ->exit() repeatedly acquires a global lock for each netns 2) ->exit() has a time-consuming operation that can be factored out (e.g. synchronize_rcu(), smp_mb(), etc) 3) ->exit() does not need to repeat the same iterations for each netns (e.g. inet_twsk_purge()) Currently, none of the ->exit_batch_rtnl() functions satisfy any of the above conditions because RTNL is factored out and held by the caller and all of these functions iterate over the dying netns list. Also, we want to hold per-netns RTNL there but avoid spreading __rtnl_net_lock() across multiple locations. Let's add ->exit_rtnl() hook and run it under __rtnl_net_lock(). The following patches will convert all ->exit_batch_rtnl() users to ->exit_rtnl(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250411205258.63164-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 17:08:41 -07:00
Kuniyuki Iwashima	fed176bf31	net: Add ops_undo_single for module load/unload. If ops_init() fails while loading a module or we unload the module, free_exit_list() rolls back the changes. The rollback sequence is the same as ops_undo_list(). The ops is already removed from pernet_list before calling free_exit_list(). If we link the ops to a temporary list, we can reuse ops_undo_list(). Let's add a wrapper of ops_undo_list() and use it instead of free_exit_list(). Now, we have the central place to roll back ops_init(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250411205258.63164-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 17:08:40 -07:00
Kuniyuki Iwashima	e333b1c3cf	net: Factorise setup_net() and cleanup_net(). When we roll back the changes made by struct pernet_operations.init(), we execute mostly identical sequences in three places. * setup_net() * cleanup_net() * free_exit_list() The only difference between the first two is which list and RCU helpers to use. In setup_net(), an ops could fail on the way, so we need to perform a reverse walk from its previous ops in pernet_list. OTOH, in cleanup_net(), we iterate the full list from tail to head. The former passes the failed ops to list_for_each_entry_continue_reverse(). It's tricky, but we can reuse it for the latter if we pass list_entry() of the head node. Also, synchronize_rcu() and synchronize_rcu_expedited() can be easily switched by an argument. Let's factorise the rollback part in setup_net() and cleanup_net(). In the next patch, ops_undo_list() will be reused for free_exit_list(), and then two arguments (ops_list and hold_rtnl) will differ. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250411205258.63164-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 17:08:40 -07:00
Toke Høiland-Jørgensen	ee62ce7a1d	page_pool: Track DMA-mapped pages and unmap them when destroying the pool When enabling DMA mapping in page_pool, pages are kept DMA mapped until they are released from the pool, to avoid the overhead of re-mapping the pages every time they are used. This causes resource leaks and/or crashes when there are pages still outstanding while the device is torn down, because page_pool will attempt an unmap through a non-existent DMA device on the subsequent page return. To fix this, implement a simple tracking of outstanding DMA-mapped pages in page pool using an xarray. This was first suggested by Mina[0], and turns out to be fairly straight forward: We simply store pointers to pages directly in the xarray with xa_alloc() when they are first DMA mapped, and remove them from the array on unmap. Then, when a page pool is torn down, it can simply walk the xarray and unmap all pages still present there before returning, which also allows us to get rid of the get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional synchronisation is needed, as a page will only ever be unmapped once. To avoid having to walk the entire xarray on unmap to find the page reference, we stash the ID assigned by xa_alloc() into the page structure itself, using the upper bits of the pp_magic field. This requires a couple of defines to avoid conflicting with the POINTER_POISON_DELTA define, but this is all evaluated at compile-time, so does not affect run-time performance. The bitmap calculations in this patch gives the following number of bits for different architectures: - 23 bits on 32-bit architectures - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE) - 32 bits on other 64-bit architectures Stashing a value into the unused bits of pp_magic does have the effect that it can make the value stored there lie outside the unmappable range (as governed by the mmap_min_addr sysctl), for architectures that don't define ILLEGAL_POINTER_VALUE. This means that if one of the pointers that is aliased to the pp_magic field (such as page->lru.next) is dereferenced while the page is owned by page_pool, that could lead to a dereference into userspace, which is a security concern. The risk of this is mitigated by the fact that (a) we always clear pp_magic before releasing a page from page_pool, and (b) this would need a use-after-free bug for struct page, which can have many other risks since page->lru.next is used as a generic list pointer in multiple places in the kernel. As such, with this patch we take the position that this risk is negligible in practice. For more discussion, see[1]. Since all the tracking added in this patch is performed on DMA map/unmap, no additional code is needed in the fast path, meaning the performance overhead of this tracking is negligible there. A micro-benchmark shows that the total overhead of the tracking itself is about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]). Since this cost is only paid on DMA map and unmap, it seems like an acceptable cost to fix the late unmap issue. Further optimisation can narrow the cases where this cost is paid (for instance by eliding the tracking when DMA map/unmap is a no-op). The extra memory needed to track the pages is neatly encapsulated inside xarray, which uses the 'struct xa_node' structure to track items. This structure is 576 bytes long, with slots for 64 items, meaning that a full node occurs only 9 bytes of overhead per slot it tracks (in practice, it probably won't be this efficient, but in any case it should be an acceptable overhead). [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/ [1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com [2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com Reported-by: Yonglong Liu <liuyonglong@huawei.com> Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com Fixes: `ff7d6b27f8` ("page_pool: refurbish version of page_pool code") Suggested-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Tested-by: Jesper Dangaard Brouer <hawk@kernel.org> Tested-by: Qiuling Ren <qren@redhat.com> Tested-by: Yuying Ma <yuma@redhat.com> Tested-by: Yonglong Liu <liuyonglong@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-2-6a9ef2e0cba8@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 16:30:29 -07:00
Toke Høiland-Jørgensen	cd3c93167d	page_pool: Move pp_magic check into helper functions Since we are about to stash some more information into the pp_magic field, let's move the magic signature checks into a pair of helper functions so it can be changed in one place. Reviewed-by: Mina Almasry <almasrymina@google.com> Tested-by: Yonglong Liu <liuyonglong@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-1-6a9ef2e0cba8@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 16:30:29 -07:00
Peter Seiderer	08fcb1f242	net: pktgen: fix code style (WARNING: quoted string split across lines) Fix checkpatch code style warnings: WARNING: quoted string split across lines #480: FILE: net/core/pktgen.c:480: + "Packet Generator for packet performance testing. " + "Version: " VERSION "\n"; WARNING: quoted string split across lines #632: FILE: net/core/pktgen.c:632: + " udp_src_min: %d udp_src_max: %d" + " udp_dst_min: %d udp_dst_max: %d\n", Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 14:51:32 -07:00
Peter Seiderer	dceae3e82f	net: pktgen: fix code style (WARNING: macros should not use a trailing semicolon) Fix checkpatch code style warnings: WARNING: macros should not use a trailing semicolon #180: FILE: net/core/pktgen.c:180: +#define func_enter() pr_debug("entering %s\n", __func__); WARNING: macros should not use a trailing semicolon #234: FILE: net/core/pktgen.c:234: +#define if_lock(t) mutex_lock(&(t->if_lock)); CHECK: Unnecessary parentheses around t->if_lock #235: FILE: net/core/pktgen.c:235: +#define if_unlock(t) mutex_unlock(&(t->if_lock)); Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 14:51:32 -07:00
Peter Seiderer	ca8ee66521	net: pktgen: fix code style (WARNING: Missing a blank line after declarations) Fix checkpatch code style warnings: WARNING: Missing a blank line after declarations #761: FILE: net/core/pktgen.c:761: + char c; + if (get_user(c, &user_buffer[i])) WARNING: Missing a blank line after declarations #780: FILE: net/core/pktgen.c:780: + char c; + if (get_user(c, &user_buffer[i])) WARNING: Missing a blank line after declarations #806: FILE: net/core/pktgen.c:806: + char c; + if (get_user(c, &user_buffer[i])) WARNING: Missing a blank line after declarations #823: FILE: net/core/pktgen.c:823: + char c; + if (get_user(c, &user_buffer[i])) WARNING: Missing a blank line after declarations #1968: FILE: net/core/pktgen.c:1968: + char f[32]; + memset(f, 0, 32); WARNING: Missing a blank line after declarations #2410: FILE: net/core/pktgen.c:2410: + struct pktgen_net pn = net_generic(dev_net(pkt_dev->odev), pg_net_id); + if (!x) { WARNING: Missing a blank line after declarations #2442: FILE: net/core/pktgen.c:2442: + __u16 t; + if (pkt_dev->flags & F_QUEUE_MAP_RND) { WARNING: Missing a blank line after declarations #2523: FILE: net/core/pktgen.c:2523: + unsigned int i; + for (i = 0; i < pkt_dev->nr_labels; i++) WARNING: Missing a blank line after declarations #2567: FILE: net/core/pktgen.c:2567: + __u32 t; + if (pkt_dev->flags & F_IPSRC_RND) WARNING: Missing a blank line after declarations #2587: FILE: net/core/pktgen.c:2587: + __be32 s; + if (pkt_dev->flags & F_IPDST_RND) { WARNING: Missing a blank line after declarations #2634: FILE: net/core/pktgen.c:2634: + __u32 t; + if (pkt_dev->flags & F_TXSIZE_RND) { WARNING: Missing a blank line after declarations #2736: FILE: net/core/pktgen.c:2736: + int i; + for (i = 0; i < pkt_dev->cflows; i++) { WARNING: Missing a blank line after declarations #2738: FILE: net/core/pktgen.c:2738: + struct xfrm_state x = pkt_dev->flows[i].x; + if (x) { WARNING: Missing a blank line after declarations #2752: FILE: net/core/pktgen.c:2752: + int nhead = 0; + if (x) { WARNING: Missing a blank line after declarations #2795: FILE: net/core/pktgen.c:2795: + unsigned int i; + for (i = 0; i < pkt_dev->nr_labels; i++) WARNING: Missing a blank line after declarations #3480: FILE: net/core/pktgen.c:3480: + ktime_t idle_start = ktime_get(); + schedule(); Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 14:51:32 -07:00
Peter Seiderer	870b856cb4	net: pktgen: fix code style (WARNING: Block comments) Fix checkpatch code style warnings: WARNING: Block comments use a trailing / on a separate line + removal by worker thread / WARNING: Block comments use on subsequent lines + __u8 tos; /* six MSB of (former) IPv4 TOS + are for dscp codepoint / WARNING: Block comments use a trailing / on a separate line + are for dscp codepoint / WARNING: Block comments use on subsequent lines + __u8 traffic_class; /* ditto for the (former) Traffic Class in IPv6 + (see RFC 3260, sec. 4) / WARNING: Block comments use a trailing / on a separate line + (see RFC 3260, sec. 4) / WARNING: Block comments use on subsequent lines + /* = { + 0x00, 0x80, 0xC8, 0x79, 0xB3, 0xCB, WARNING: Block comments use * on subsequent lines + /* Field for thread to receive "posted" events terminate, + stop ifs etc. / WARNING: Block comments use a trailing / on a separate line + stop ifs etc. / WARNING: Block comments should align the on each line + * we go look for it ... +/ WARNING: Block comments use a trailing / on a separate line + * we resolve the dst issue / WARNING: Block comments use a trailing / on a separate line + * with proc_create_data() */ Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 14:51:31 -07:00
Peter Seiderer	1d8f07bf4a	net: pktgen: fix code style (WARNING: suspect code indent for conditional statements) Fix checkpatch code style warnings: WARNING: suspect code indent for conditional statements (8, 17) #2901: FILE: net/core/pktgen.c:2901: + } else { + skb = __netdev_alloc_skb(dev, size, GFP_NOWAIT); Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 14:51:31 -07:00
Peter Seiderer	eb1fd49ef6	net: pktgen: fix code style (ERROR: space prohibited after that '&') Fix checkpatch code style errors/checks: CHECK: No space is necessary after a cast #2984: FILE: net/core/pktgen.c:2984: + (__be16 ) & eth[12] = protocol; ERROR: space prohibited after that '&' (ctx:WxW) #2984: FILE: net/core/pktgen.c:2984: + (__be16 ) & eth[12] = protocol; Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 14:47:58 -07:00
Peter Seiderer	81e92f4fb8	net: pktgen: fix code style (ERROR: "foo * bar" should be "foo bar") Fix checkpatch code style errors: ERROR: "foo bar" should be "foo bar" #977: FILE: net/core/pktgen.c:977: + const char __user user_buffer, size_t count, ERROR: "foo * bar" should be "foo bar" #978: FILE: net/core/pktgen.c:978: + loff_t offset) ERROR: "foo * bar" should be "foo bar" #1912: FILE: net/core/pktgen.c:1912: + const char __user user_buffer, ERROR: "foo * bar" should be "foo bar" #1913: FILE: net/core/pktgen.c:1913: + size_t count, loff_t offset) Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 14:47:58 -07:00
Jakub Kicinski	097f171f98	net: convert dev->rtnl_link_state to a bool netdevice reg_state was split into two 16 bit enums back in 2010 in commit `a2835763e1` ("rtnetlink: handle rtnl_link netlink notifications manually"). Since the split the fields have been moved apart, and last year we converted reg_state to a normal u8 in commit `4d42b37def` ("net: convert dev->reg_state to u8"). rtnl_link_state being a 16 bitfield makes no sense. Convert it to a single bool, it seems very unlikely after 15 years that we'll need more values in it. We could drop dev->rtnl_link_ops from the conditions but feels like having it there more clearly points at the reason for this hack. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250410014246.780885-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 14:29:54 -07:00
Jakub Kicinski	f0433eea46	net: don't mix device locking in dev_close_many() calls Lockdep found the following dependency: &dev_instance_lock_key#3 --> &rdev->wiphy.mtx --> &net->xdp.lock --> &xs->mutex --> &dev_instance_lock_key#3 The first dependency is the problem. wiphy mutex should be outside the instance locks. The problem happens in notifiers (as always) for CLOSE. We only hold the instance lock for ops locked devices during CLOSE, and WiFi netdevs are not ops locked. Unfortunately, when we dev_close_many() during netns dismantle we may be holding the instance lock of _another_ netdev when issuing a CLOSE for a WiFi device. Lockdep's "Possible unsafe locking scenario" only prints 3 locks and we have 4, plus I think we'd need 3 CPUs, like this: CPU0 CPU1 CPU2 ---- ---- ---- lock(&xs->mutex); lock(&dev_instance_lock_key#3); lock(&rdev->wiphy.mtx); lock(&net->xdp.lock); lock(&xs->mutex); lock(&rdev->wiphy.mtx); lock(&dev_instance_lock_key#3); Tho, I don't think that's possible as CPU1 and CPU2 would be under rtnl_lock. Even if we have per-netns rtnl_lock and wiphy can span network namespaces - CPU0 and CPU1 must be in the same netns to see dev_instance_lock, so CPU0 can't be installing a socket as CPU1 is tearing the netns down. Regardless, our expected lock ordering is that wiphy lock is taken before instance locks, so let's fix this. Go over the ops locked and non-locked devices separately. Note that calling dev_close_many() on an empty list is perfectly fine. All processing (including RCU syncs) are conditional on the list not being empty, already. Fixes: `7e4d784f58` ("net: hold netdev instance lock during rtnetlink operations") Reported-by: syzbot+6f588c78bf765b62b450@syzkaller.appspotmail.com Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250412233011.309762-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-14 12:48:44 -07:00
Linus Torvalds	b676ac484f	bpf-fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmf6sD8ACgkQ6rmadz2v bTq86w//bbg2S1ZhSXXQvgRSbxfecvJ0r6XGDOaMsKxPXcqpbaMoSCYx2D8puO+b xm0vc+5qXlzuTHq9I8flDKrWdA+/sHxLQhXjcBA796vaY6IgJEnapf3kENyzZ3Vp agpNPlZe9FLaANDRivTFPVgzVjr07/3eL7VKItASksb/3yjBSa+vrIJVfGF1krQT slxTMzVMzB+p0MdKVjmeGn5EodWXp8TdVzQBPb8vnCn7U1h1HULSh4j1+nZ/Z1yr zC4/pVPmdDJe1H8ghBGm4f0nY+EwXPtZiVbXnYS2FhgjvthRKFYIyxN9F6kg7AD7 NG0T6xw/QYNfPTR40PSiV/WHhH5qa2zRVtlepVU7tqqmsyRXi+0Eq/MfJyiuNzgN WWmJec0O/Ax4r2Xs/QgX3mFlRnLNi5gmc7fuOARmayAlqElZ9QdB2x6ebW5Fk4Qx 9oyQACpcu6/oUKgeMSo52MDa82wUPPxpC6qdsefmQYaAcOKM5MD4SNd+eEnfX03E RAaItTW9az57a2BL9C/ejJO/SwY4Er+O8B3PO7GaKiURMSZa5nVlY+2QB2fJy6TA 7IvSYjFD5E4risMbZgPFCqWkQ0yHbY7zEn/tbcNC5AFZoKv70jELPQTLPXq7UPLe BuKoL9VJyeXF7E1MQqQH33q3tfcwlIL++piCNHvTQoPadEba2dM= =Mezb -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Followup fixes for resilient spinlock (Kumar Kartikeya Dwivedi): - Make res_spin_lock test less verbose, since it was spamming BPF CI on failure, and make the check for AA deadlock stronger - Fix rebasing mistake and use architecture provided res_smp_cond_load_acquire - Convert BPF maps (queue_stack and ringbuf) to resilient spinlock to address long standing syzbot reports - Make sure that classic BPF load instruction from SKF_[NET\|LL]_OFF offsets works when skb is fragmeneted (Willem de Bruijn) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Convert ringbuf map to rqspinlock bpf: Convert queue_stack map to rqspinlock bpf: Use architecture provided res_smp_cond_load_acquire selftests/bpf: Make res_spin_lock AA test condition stronger selftests/net: test sk_filter support for SKF_NET_OFF on frags bpf: support SKF_NET_OFF and SKF_LL_OFF on skb frags selftests/bpf: Make res_spin_lock test less verbose	2025-04-12 12:48:10 -07:00
Kuniyuki Iwashima	22d6c9eebf	net: Unexport shared functions for DCCP. DCCP was removed, so many inet functions no longer need to be exported. Let's unexport or use EXPORT_IPV6_MOD() for such functions. sk_free_unlock_clone() is inlined in sk_clone_lock() as it's the only caller. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250410023921.11307-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-11 18:58:11 -07:00
Kuniyuki Iwashima	2a63dd0edf	net: Retire DCCP socket. DCCP was orphaned in 2021 by commit `054c4610bd` ("MAINTAINERS: dccp: move Gerrit Renker to CREDITS"), which noted that the last maintainer had been inactive for five years. In recent years, it has become a playground for syzbot, and most changes to DCCP have been odd bug fixes triggered by syzbot. Apart from that, the only changes have been driven by treewide or networking API updates or adjustments related to TCP. Thus, in 2023, we announced we would remove DCCP in 2025 via commit `b144fcaf46` ("dccp: Print deprecation notice."). Since then, only one individual has contacted the netdev mailing list. [0] There is ongoing research for Multipath DCCP. The repository is hosted on GitHub [1], and development is not taking place through the upstream community. While the repository is published under the GPLv2 license, the scheduling part remains proprietary, with a LICENSE file [2] stating: "This is not Open Source software." The researcher mentioned a plan to address the licensing issue, upstream the patches, and step up as a maintainer, but there has been no further communication since then. Maintaining DCCP for a decade without any real users has become a burden. Therefore, it's time to remove it. Removing DCCP will also provide significant benefits to TCP. It allows us to freely reorganize the layout of struct inet_connection_sock, which is currently shared with DCCP, and optimize it to reduce the number of cachelines accessed in the TCP fast path. Note that we keep DCCP netfilter modules as requested. [3] Link: https://lore.kernel.org/netdev/20230710182253.81446-1-kuniyu@amazon.com/T/#u #[0] Link: https://github.com/telekom/mp-dccp #[1] Link: https://github.com/telekom/mp-dccp/blob/mpdccp_v03_k5.10/net/dccp/non_gpl_scheduler/LICENSE #[2] Link: https://lore.kernel.org/netdev/Z_VQ0KlCRkqYWXa-@calendula/ #[3] Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paul Moore <paul@paul-moore.com> (LSM and SELinux) Acked-by: Casey Schaufler <casey@schaufler-ca.com> Link: https://patch.msgid.link/20250410023921.11307-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-11 18:58:10 -07:00
Zijun Hu	faeefc173b	sock: Correct error checking condition for (assign\|release)_proto_idx() (assign\|release)_proto_idx() wrongly check find_first_zero_bit() failure by condition '(prot->inuse_idx == PROTO_INUSE_NR - 1)' obviously. Fix by correcting the condition to '(prot->inuse_idx == PROTO_INUSE_NR)' Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250410-fix_net-v2-1-d69e7c5739a4@quicinc.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-11 16:32:40 -07:00
Jakub Kicinski	cb7103298d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.15-rc2). Conflict: Documentation/networking/netdevices.rst net/core/lock_debug.c `04efcee6ef` ("net: hold instance lock during NETDEV_CHANGE") `03df156dd3` ("xdp: double protect netdev->xdp_flags with netdev->lock") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-10 16:51:07 -07:00
Linus Torvalds	ab59a86056	Including fixes from netfilter. Current release - regressions: - core: hold instance lock during NETDEV_CHANGE - rtnetlink: fix bad unlock balance in do_setlink(). - ipv6: - fix null-ptr-deref in addrconf_add_ifaddr(). - align behavior across nexthops during path selection Previous releases - regressions: - sctp: prevent transport UaF in sendmsg - mptcp: only inc MPJoinAckHMacFailure for HMAC failures Previous releases - always broken: - sched: - make ->qlen_notify() idempotent - ensure sufficient space when sending filter netlink notifications - sch_sfq: really don't allow 1 packet limit - netfilter: fix incorrect avx2 match of 5th field octet - tls: explicitly disallow disconnect - eth: octeontx2-pf: fix VF root node parent queue priority Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmf3xusSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkud0P/iWOQB0oj0nvxl2ionPzgJEPduxuF0V6 YPyDBUzLC7Gq6NmTcdDlNJt8fE6UmKUIneghUm9Ss7LRpKv0/TPvorKMSK44Zt53 a5q49JeoI0TvnnhJesdHjiF31hrInqZmcX8OjSH8Q/SCKuy7rsgzao0vjvhd7lxm wA6LlWnJO1Pf991nNpbjUSoAZ7CMNlEIewGkdq0+6UADC7D9VagKTgIkFKw1BvRw 2Eb2pzvdO9Pj02+l/mjdRhUzMZlr+FG+WBqXk5oKR0YZ2t3CS4O9/UUBoAn775tM gCfzepNuAUXGX0I6h+DANCNuswWuG/IvYTdhy+hRWblYeCkILU60E8eVMlh7tpII fUd5GSRhX1NpGNHUlDG/4b6IcjMO3ebtce2cm2Y9t2CUe7EqB0HZyvTczNroTxip KXrXcCBuEkzxXCZhaN/CrBu8Piu8vJk/rMH5ha1khce9CkmYY+m9ruvsYjZmPI+/ P/SFkRdb/yV/SIOmay8FCJsy60t4FOtLnlDDrnygq4Q/9a7VwafebVpKS1fbTELG ZTiELN3/PN2GUnfREf0DVLPfn9sdqrMZaclLOJpp/Zi1/RZpo52WHceXJShiu9pe A8B+3SuPgOaLfhwyqiHlWm5moc9kNF26vlrWfFjK1GrJdxMisYwQoWD5eHfQFhDX UaxlAmndwZa9 =wkGz -----END PGP SIGNATURE----- Merge tag 'net-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter. Current release - regressions: - core: hold instance lock during NETDEV_CHANGE - rtnetlink: fix bad unlock balance in do_setlink() - ipv6: - fix null-ptr-deref in addrconf_add_ifaddr() - align behavior across nexthops during path selection Previous releases - regressions: - sctp: prevent transport UaF in sendmsg - mptcp: only inc MPJoinAckHMacFailure for HMAC failures Previous releases - always broken: - sched: - make ->qlen_notify() idempotent - ensure sufficient space when sending filter netlink notifications - sch_sfq: really don't allow 1 packet limit - netfilter: fix incorrect avx2 match of 5th field octet - tls: explicitly disallow disconnect - eth: octeontx2-pf: fix VF root node parent queue priority" * tag 'net-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits) ethtool: cmis_cdb: Fix incorrect read / write length extension selftests: netfilter: add test case for recent mismatch bug nft_set_pipapo: fix incorrect avx2 match of 5th field octet net: ppp: Add bound checking for skb data on ppp_sync_txmung net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod. ipv6: Align behavior across nexthops during path selection net: phy: allow MDIO bus PM ops to start/stop state machine for phylink-controlled PHY net: phy: move phy_link_change() prior to mdio_bus_phy_may_suspend() selftests/tc-testing: sfq: check that a derived limit of 1 is rejected net_sched: sch_sfq: move the limit validation net_sched: sch_sfq: use a temporary work area for validating configuration net: libwx: handle page_pool_dev_alloc_pages error selftests: mptcp: validate MPJoin HMacFailure counters mptcp: only inc MPJoinAckHMacFailure for HMAC failures rtnetlink: Fix bad unlock balance in do_setlink(). net: ethtool: Don't call .cleanup_data when prepare_data fails tc: Ensure we have enough buffer space when sending filter netlink notifications net: libwx: Fix the wrong Rx descriptor field octeontx2-pf: qos: fix VF root node parent queue index selftests: tls: check that disconnect does nothing ...	2025-04-10 08:52:18 -07:00
Willem de Bruijn	d4bac0288a	bpf: support SKF_NET_OFF and SKF_LL_OFF on skb frags Classic BPF socket filters with SKB_NET_OFF and SKB_LL_OFF fail to read when these offsets extend into frags. This has been observed with iwlwifi and reproduced with tun with IFF_NAPI_FRAGS. The below straightforward socket filter on UDP port, applied to a RAW socket, will silently miss matching packets. const int offset_proto = offsetof(struct ip6_hdr, ip6_nxt); const int offset_dport = sizeof(struct ip6_hdr) + offsetof(struct udphdr, dest); struct sock_filter filter_code[] = { BPF_STMT(BPF_LD + BPF_B + BPF_ABS, SKF_AD_OFF + SKF_AD_PKTTYPE), BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, PACKET_HOST, 0, 4), BPF_STMT(BPF_LD + BPF_B + BPF_ABS, SKF_NET_OFF + offset_proto), BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_UDP, 0, 2), BPF_STMT(BPF_LD + BPF_H + BPF_ABS, SKF_NET_OFF + offset_dport), This is unexpected behavior. Socket filter programs should be consistent regardless of environment. Silent misses are particularly concerning as hard to detect. Use skb_copy_bits for offsets outside linear, same as done for non-SKF_(LL\|NET) offsets. Offset is always positive after subtracting the reference threshold SKB_(LL\|NET)_OFF, so is always >= skb_(mac\|network)_offset. The sum of the two is an offset against skb->data, and may be negative, but it cannot point before skb->head, as skb_(mac\|network)_offset would too. This appears to go back to when frag support was introduced to sk_run_filter in linux-2.4.4, before the introduction of git. The amount of code change and 8/16/32 bit duplication are unfortunate. But any attempt I made to be smarter saved very few LoC while complicating the code. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/netdev/20250122200402.3461154-1-maze@google.com/ Link: https://elixir.bootlin.com/linux/2.4.4/source/net/core/filter.c#L244 Reported-by: Matt Moeller <moeller.matt@gmail.com> Co-developed-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/r/20250408132833.195491-2-willemdebruijn.kernel@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 20:02:51 -07:00
Jiayuan Chen	5ca2e29f68	bpf, sockmap: Fix panic when calling skb_linearize The panic can be reproduced by executing the command: ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --rx-strp 100000 Then a kernel panic was captured: ''' [ 657.460555] kernel BUG at net/core/skbuff.c:2178! [ 657.462680] Tainted: [W]=WARN [ 657.463287] Workqueue: events sk_psock_backlog ... [ 657.469610] <TASK> [ 657.469738] ? die+0x36/0x90 [ 657.469916] ? do_trap+0x1d0/0x270 [ 657.470118] ? pskb_expand_head+0x612/0xf40 [ 657.470376] ? pskb_expand_head+0x612/0xf40 [ 657.470620] ? do_error_trap+0xa3/0x170 [ 657.470846] ? pskb_expand_head+0x612/0xf40 [ 657.471092] ? handle_invalid_op+0x2c/0x40 [ 657.471335] ? pskb_expand_head+0x612/0xf40 [ 657.471579] ? exc_invalid_op+0x2d/0x40 [ 657.471805] ? asm_exc_invalid_op+0x1a/0x20 [ 657.472052] ? pskb_expand_head+0xd1/0xf40 [ 657.472292] ? pskb_expand_head+0x612/0xf40 [ 657.472540] ? lock_acquire+0x18f/0x4e0 [ 657.472766] ? find_held_lock+0x2d/0x110 [ 657.472999] ? __pfx_pskb_expand_head+0x10/0x10 [ 657.473263] ? __kmalloc_cache_noprof+0x5b/0x470 [ 657.473537] ? __pfx___lock_release.isra.0+0x10/0x10 [ 657.473826] __pskb_pull_tail+0xfd/0x1d20 [ 657.474062] ? __kasan_slab_alloc+0x4e/0x90 [ 657.474707] sk_psock_skb_ingress_enqueue+0x3bf/0x510 [ 657.475392] ? __kasan_kmalloc+0xaa/0xb0 [ 657.476010] sk_psock_backlog+0x5cf/0xd70 [ 657.476637] process_one_work+0x858/0x1a20 ''' The panic originates from the assertion BUG_ON(skb_shared(skb)) in skb_linearize(). A previous commit(see Fixes tag) introduced skb_get() to avoid race conditions between skb operations in the backlog and skb release in the recvmsg path. However, this caused the panic to always occur when skb_linearize is executed. The "--rx-strp 100000" parameter forces the RX path to use the strparser module which aggregates data until it reaches 100KB before calling sockmap logic. The 100KB payload exceeds MAX_MSG_FRAGS, triggering skb_linearize. To fix this issue, just move skb_get into sk_psock_skb_ingress_enqueue. ''' sk_psock_backlog: sk_psock_handle_skb skb_get(skb) <== we move it into 'sk_psock_skb_ingress_enqueue' sk_psock_skb_ingress____________ ↓ \| \| → sk_psock_skb_ingress_self \| sk_psock_skb_ingress_enqueue sk_psock_verdict_apply_________________↑ skb_linearize ''' Note that for verdict_apply path, the skb_get operation is unnecessary so we add 'take_ref' param to control it's behavior. Fixes: `a454d84ee2` ("bpf, sockmap: Fix skb refcnt race after locking changes") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20250407142234.47591-4-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 19:59:00 -07:00
Jiayuan Chen	3b4f14b794	bpf, sockmap: fix duplicated data transmission In the !ingress path under sk_psock_handle_skb(), when sending data to the remote under snd_buf limitations, partial skb data might be transmitted. Although we preserved the partial transmission state (offset/length), the state wasn't properly consumed during retries. This caused the retry path to resend the entire skb data instead of continuing from the previous offset, resulting in data overlap at the receiver side. Fixes: `405df89dd5` ("bpf, sockmap: Improved check for empty queue") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20250407142234.47591-3-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 19:58:59 -07:00
Jiayuan Chen	7683167196	bpf, sockmap: Fix data lost during EAGAIN retries We call skb_bpf_redirect_clear() to clean _sk_redir before handling skb in backlog, but when sk_psock_handle_skb() return EAGAIN due to sk_rcvbuf limit, the redirect info in _sk_redir is not recovered. Fix skb redir loss during EAGAIN retries by restoring _sk_redir information using skb_bpf_set_redir(). Before this patch: ''' ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress Setting up benchmark 'sockmap'... create socket fd c1:13 p1:14 c2:15 p2:16 Benchmark 'sockmap' started. Send Speed 1343.172 MB/s, BPF Speed 1343.238 MB/s, Rcv Speed 65.271 MB/s Send Speed 1352.022 MB/s, BPF Speed 1352.088 MB/s, Rcv Speed 0 MB/s Send Speed 1354.105 MB/s, BPF Speed 1354.105 MB/s, Rcv Speed 0 MB/s Send Speed 1355.018 MB/s, BPF Speed 1354.887 MB/s, Rcv Speed 0 MB/s ''' Due to the high send rate, the RX processing path may frequently hit the sk_rcvbuf limit. Once triggered, incorrect _sk_redir will cause the flow to mistakenly enter the "!ingress" path, leading to send failures. (The Rcv speed depends on tcp_rmem). After this patch: ''' ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress Setting up benchmark 'sockmap'... create socket fd c1:13 p1:14 c2:15 p2:16 Benchmark 'sockmap' started. Send Speed 1347.236 MB/s, BPF Speed 1347.367 MB/s, Rcv Speed 65.402 MB/s Send Speed 1353.320 MB/s, BPF Speed 1353.320 MB/s, Rcv Speed 65.536 MB/s Send Speed 1353.186 MB/s, BPF Speed 1353.121 MB/s, Rcv Speed 65.536 MB/s ''' Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20250407142234.47591-2-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-04-09 19:58:59 -07:00
Kuniyuki Iwashima	0bb2f7a1ad	net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod. When I ran the repro [0] and waited a few seconds, I observed two LOCKDEP splats: a warning immediately followed by a null-ptr-deref. [1] Reproduction Steps: 1) Mount CIFS 2) Add an iptables rule to drop incoming FIN packets for CIFS 3) Unmount CIFS 4) Unload the CIFS module 5) Remove the iptables rule At step 3), the CIFS module calls sock_release() for the underlying TCP socket, and it returns quickly. However, the socket remains in FIN_WAIT_1 because incoming FIN packets are dropped. At this point, the module's refcnt is 0 while the socket is still alive, so the following rmmod command succeeds. # ss -tan State Recv-Q Send-Q Local Address:Port Peer Address:Port FIN-WAIT-1 0 477 10.0.2.15:51062 10.0.0.137:445 # lsmod \| grep cifs cifs 1159168 0 This highlights a discrepancy between the lifetime of the CIFS module and the underlying TCP socket. Even after CIFS calls sock_release() and it returns, the TCP socket does not die immediately in order to close the connection gracefully. While this is generally fine, it causes an issue with LOCKDEP because CIFS assigns a different lock class to the TCP socket's sk->sk_lock using sock_lock_init_class_and_name(). Once an incoming packet is processed for the socket or a timer fires, sk->sk_lock is acquired. Then, LOCKDEP checks the lock context in check_wait_context(), where hlock_class() is called to retrieve the lock class. However, since the module has already been unloaded, hlock_class() logs a warning and returns NULL, triggering the null-ptr-deref. If LOCKDEP is enabled, we must ensure that a module calling sock_lock_init_class_and_name() (CIFS, NFS, etc) cannot be unloaded while such a socket is still alive to prevent this issue. Let's hold the module reference in sock_lock_init_class_and_name() and release it when the socket is freed in sk_prot_free(). Note that sock_lock_init() clears sk->sk_owner for svc_create_socket() that calls sock_lock_init_class_and_name() for a listening socket, which clones a socket by sk_clone_lock() without GFP_ZERO. [0]: CIFS_SERVER="10.0.0.137" CIFS_PATH="//${CIFS_SERVER}/Users/Administrator/Desktop/CIFS_TEST" DEV="enp0s3" CRED="/root/WindowsCredential.txt" MNT=$(mktemp -d /tmp/XXXXXX) mount -t cifs ${CIFS_PATH} ${MNT} -o vers=3.0,credentials=${CRED},cache=none,echo_interval=1 iptables -A INPUT -s ${CIFS_SERVER} -j DROP for i in $(seq 10); do umount ${MNT} rmmod cifs sleep 1 done rm -r ${MNT} iptables -D INPUT -s ${CIFS_SERVER} -j DROP [1]: DEBUG_LOCKS_WARN_ON(1) WARNING: CPU: 10 PID: 0 at kernel/locking/lockdep.c:234 hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223) Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.14.0 #36 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223) ... Call Trace: <IRQ> __lock_acquire (kernel/locking/lockdep.c:4853 kernel/locking/lockdep.c:5178) lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816) _raw_spin_lock_nested (kernel/locking/spinlock.c:379) tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350) ... BUG: kernel NULL pointer dereference, address: 00000000000000c4 PF: supervisor read access in kernel mode PF: error_code(0x0000) - not-present page PGD 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Tainted: G W 6.14.0 #36 Tainted: [W]=WARN Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__lock_acquire (kernel/locking/lockdep.c:4852 kernel/locking/lockdep.c:5178) Code: 15 41 09 c7 41 8b 44 24 20 25 ff 1f 00 00 41 09 c7 8b 84 24 a0 00 00 00 45 89 7c 24 20 41 89 44 24 24 e8 e1 bc ff ff 4c 89 e7 <44> 0f b6 b8 c4 00 00 00 e8 d1 bc ff ff 0f b6 80 c5 00 00 00 88 44 RSP: 0018:ffa0000000468a10 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ff1100010091cc38 RCX: 0000000000000027 RDX: ff1100081f09ca48 RSI: 0000000000000001 RDI: ff1100010091cc88 RBP: ff1100010091c200 R08: ff1100083fe6e228 R09: 00000000ffffbfff R10: ff1100081eca0000 R11: ff1100083fe10dc0 R12: ff1100010091cc88 R13: 0000000000000001 R14: 0000000000000000 R15: 00000000000424b1 FS: 0000000000000000(0000) GS:ff1100081f080000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000c4 CR3: 0000000002c4a003 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816) _raw_spin_lock_nested (kernel/locking/spinlock.c:379) tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350) ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205 (discriminator 1)) ip_local_deliver_finish (./include/linux/rcupdate.h:878 net/ipv4/ip_input.c:234) ip_sublist_rcv_finish (net/ipv4/ip_input.c:576) ip_list_rcv_finish (net/ipv4/ip_input.c:628) ip_list_rcv (net/ipv4/ip_input.c:670) __netif_receive_skb_list_core (net/core/dev.c:5939 net/core/dev.c:5986) netif_receive_skb_list_internal (net/core/dev.c:6040 net/core/dev.c:6129) napi_complete_done (./include/linux/list.h:37 ./include/net/gro.h:519 ./include/net/gro.h:514 net/core/dev.c:6496) e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3815) __napi_poll.constprop.0 (net/core/dev.c:7191) net_rx_action (net/core/dev.c:7262 net/core/dev.c:7382) handle_softirqs (kernel/softirq.c:561) __irq_exit_rcu (kernel/softirq.c:596 kernel/softirq.c:435 kernel/softirq.c:662) irq_exit_rcu (kernel/softirq.c:680) common_interrupt (arch/x86/kernel/irq.c:280 (discriminator 14)) </IRQ> <TASK> asm_common_interrupt (./arch/x86/include/asm/idtentry.h:693) RIP: 0010:default_idle (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:92 arch/x86/kernel/process.c:744) Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d c3 2b 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 RSP: 0018:ffa00000000ffee8 EFLAGS: 00000202 RAX: 000000000000640b RBX: ff1100010091c200 RCX: 0000000000061aa4 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff812f30c5 RBP: 000000000000000a R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ? do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325) default_idle_call (./include/linux/cpuidle.h:143 kernel/sched/idle.c:118) do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325) cpu_startup_entry (kernel/sched/idle.c:422 (discriminator 1)) start_secondary (arch/x86/kernel/smpboot.c:315) common_startup_64 (arch/x86/kernel/head_64.S:421) </TASK> Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs] CR2: 00000000000000c4 Fixes: `ed07536ed6` ("[PATCH] lockdep: annotate nfs/nfsd in-kernel sockets") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20250407163313.22682-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-09 19:11:55 -07:00
Jakub Kicinski	ce7b149474	netdev: depend on netdev->lock for qstats in ops locked drivers We mostly needed rtnl_lock in qstat to make sure the queue count is stable while we work. For "ops locked" drivers the instance lock protects the queue count, so we don't have to take rtnl_lock. For currently ops-locked drivers: netdevsim and bnxt need the protection from netdev going down while we dump, which instance lock provides. gve doesn't care. Reviewed-by: Joe Damato <jdamato@fastly.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250408195956.412733-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-09 17:01:52 -07:00
Jakub Kicinski	99e44f39a8	netdev: depend on netdev->lock for xdp features Writes to XDP features are now protected by netdev->lock. Other things we report are based on ops which don't change once device has been registered. It is safe to stop taking rtnl_lock, and depend on netdev->lock instead. Reviewed-by: Joe Damato <jdamato@fastly.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250408195956.412733-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-09 17:01:52 -07:00
Jakub Kicinski	03df156dd3	xdp: double protect netdev->xdp_flags with netdev->lock Protect xdp_features with netdev->lock. This way pure readers no longer have to take rtnl_lock to access the field. This includes calling NETDEV_XDP_FEAT_CHANGE under the lock. Looks like that's fine for bonding, the only "real" listener, it's the same as ethtool feature change. In terms of normal drivers - only GVE need special consideration (other drivers don't use instance lock or don't support XDP). It calls xdp_set_features_flag() helper from gve_init_priv() which in turn is called from gve_reset_recovery() (locked), or prior to netdev registration. So switch to _locked. Reviewed-by: Joe Damato <jdamato@fastly.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Acked-by: Harshitha Ramamurthy <hramamurthy@google.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250408195956.412733-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-09 17:01:52 -07:00
Jakub Kicinski	d02e3b3882	netdev: don't hold rtnl_lock over nl queue info get when possible Netdev queue dump accesses: NAPI, memory providers, XSk pointers. All three are "ops protected" now, switch to the op compat locking. rtnl lock does not have to be taken for "ops locked" devices. Reviewed-by: Joe Damato <jdamato@fastly.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250408195956.412733-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-09 17:01:52 -07:00
Jakub Kicinski	4ec9031cbe	netdev: add "ops compat locking" helpers Add helpers to "lock a netdev in a backward-compatible way", which for ops-locked netdevs will mean take the instance lock. For drivers which haven't opted into the ops locking we'll take rtnl_lock. The scoped foreach is dropping and re-taking the lock for each device, even if prev and next are both under rtnl_lock. I hope that's fine since we expect that netdev nl to be mostly supported by modern drivers, and modern drivers should also opt into the instance locking. Note that these helpers are mostly needed for queue related state, because drivers modify queue config in their ops in a non-atomic way. Or differently put, queue changes don't have a clear-cut API like NAPI configuration. Any state that can should just use the instance lock directly, not the "compat" hacks. Reviewed-by: Joe Damato <jdamato@fastly.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250408195956.412733-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-09 17:01:51 -07:00
Jakub Kicinski	a82dc19db1	net: avoid potential race between netdev_get_by_index_lock() and netns switch netdev_get_by_index_lock() performs following steps: rcu_lock(); dev = lookup(netns, ifindex); dev_get(dev); rcu_unlock(); [... lock & validate the dev ...] return dev Validation right now only checks if the device is registered but since the lookup is netns-aware we must also protect against the device switching netns right after we dropped the RCU lock. Otherwise the caller in netns1 may get a pointer to a device which has just switched to netns2. We can't hold the lock for the entire netns change process (because of the NETDEV_UNREGISTER notifier), and there's no existing marking to indicate that the netns is unlisted because of netns move, so add one. AFAIU none of the existing netdev_get_by_index_lock() callers can suffer from this problem (NAPI code double checks the netns membership and other callers are either under rtnl_lock or not ns-sensitive), so this patch does not have to be treated as a fix. Reviewed-by: Joe Damato <jdamato@fastly.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250408195956.412733-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-09 17:01:51 -07:00
Michal Luczaj	420aabef3a	net: Drop unused @sk of __skb_try_recv_from_queue() __skb_try_recv_from_queue() deals with a queue, @sk is not used since commit `e427cad6ee` ("net: datagram: drop 'destructor' argument from several helpers"). Remove sk from function parameters, adapt callers. No functional change intended. Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250407-cleanup-drop-param-sk-v1-1-cd076979afac@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-08 18:23:51 -07:00
Kuniyuki Iwashima	445e99bdf6	rtnetlink: Fix bad unlock balance in do_setlink(). When validate_linkmsg() fails in do_setlink(), we jump to the errout label and calls netdev_unlock_ops() even though we have not called netdev_lock_ops() as reported by syzbot. [0] Let's return an error directly in such a case. [0] WARNING: bad unlock balance detected! 6.14.0-syzkaller-12504-g8bc251e5d874 #0 Not tainted syz-executor814/5834 is trying to release lock (&dev_instance_lock_key) at: [<ffffffff89f41f56>] netdev_unlock include/linux/netdevice.h:2756 [inline] [<ffffffff89f41f56>] netdev_unlock_ops include/net/netdev_lock.h:48 [inline] [<ffffffff89f41f56>] do_setlink+0xc26/0x43a0 net/core/rtnetlink.c:3406 but there are no more locks to release! other info that might help us debug this: 1 lock held by syz-executor814/5834: #0: ffffffff900fc408 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock net/core/rtnetlink.c:80 [inline] #0: ffffffff900fc408 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_nets_lock net/core/rtnetlink.c:341 [inline] #0: ffffffff900fc408 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0xd68/0x1fe0 net/core/rtnetlink.c:4064 stack backtrace: CPU: 0 UID: 0 PID: 5834 Comm: syz-executor814 Not tainted 6.14.0-syzkaller-12504-g8bc251e5d874 #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_unlock_imbalance_bug+0x185/0x1a0 kernel/locking/lockdep.c:5296 __lock_release kernel/locking/lockdep.c:5535 [inline] lock_release+0x1ed/0x3e0 kernel/locking/lockdep.c:5887 __mutex_unlock_slowpath+0xee/0x800 kernel/locking/mutex.c:907 netdev_unlock include/linux/netdevice.h:2756 [inline] netdev_unlock_ops include/net/netdev_lock.h:48 [inline] do_setlink+0xc26/0x43a0 net/core/rtnetlink.c:3406 rtnl_group_changelink net/core/rtnetlink.c:3783 [inline] __rtnl_newlink net/core/rtnetlink.c:3937 [inline] rtnl_newlink+0x1619/0x1fe0 net/core/rtnetlink.c:4065 rtnetlink_rcv_msg+0x80f/0xd70 net/core/rtnetlink.c:6955 netlink_rcv_skb+0x208/0x480 net/netlink/af_netlink.c:2534 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline] netlink_unicast+0x7f8/0x9a0 net/netlink/af_netlink.c:1339 netlink_sendmsg+0x8c3/0xcd0 net/netlink/af_netlink.c:1883 sock_sendmsg_nosec net/socket.c:712 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:727 ____sys_sendmsg+0x523/0x860 net/socket.c:2566 ___sys_sendmsg net/socket.c:2620 [inline] __sys_sendmsg+0x271/0x360 net/socket.c:2652 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f8427b614a9 Code: 48 83 c4 28 c3 e8 37 17 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fff9b59f3a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007fff9b59f578 RCX: 00007f8427b614a9 RDX: 0000000000000000 RSI: 0000200000000300 RDI: 0000000000000004 RBP: 00007f8427bd4610 R08: 000000000000000c R09: 00007fff9b59f578 R10: 000000000000001b R11: 0000000000000246 R12: 0000000000000001 R13: Fixes: `4c975fd700` ("net: hold instance lock during NETDEV_REGISTER/UP") Reported-by: syzbot+45016fe295243a7882d3@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=45016fe295243a7882d3 Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250407164229.24414-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-08 12:33:22 -07:00
Eric Dumazet	0a7de4a8f8	net: rps: remove kfree_rcu_mightsleep() use Add an rcu_head to sd_flow_limit and rps_sock_flow_table structs to use the more conventional and predictable k[v]free_rcu(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250407163602.170356-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-08 12:30:55 -07:00
Eric Dumazet	22d046a778	net: add data-race annotations in softnet_seq_show() softnet_seq_show() reads several fields that might be updated concurrently. Add READ_ONCE() and WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250407163602.170356-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-08 12:30:55 -07:00
Eric Dumazet	7b6f0a852d	net: rps: annotate data-races around (struct sd_flow_limit)->count softnet_seq_show() can read fl->count while another cpu updates this field from skb_flow_limit(). Make this field an 'unsigned int', as its only consumer only deals with 32 bit. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250407163602.170356-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-08 12:30:55 -07:00
Eric Dumazet	c3025e94da	net: rps: change skb_flow_limit() hash function As explained in commit `f3483c8e1d` ("net: rfs: hash function change"), masking low order bits of skb_get_hash(skb) has low entropy. A NIC with 32 RX queues uses the 5 low order bits of rss key to select a queue. This means all packets landing to a given queue share the same 5 low order bits. Switch to hash_32() to reduce hash collisions. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250407163602.170356-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-08 12:30:55 -07:00
Stanislav Fomichev	04efcee6ef	net: hold instance lock during NETDEV_CHANGE Cosmin reports an issue with ipv6_add_dev being called from NETDEV_CHANGE notifier: [ 3455.008776] ? ipv6_add_dev+0x370/0x620 [ 3455.010097] ipv6_find_idev+0x96/0xe0 [ 3455.010725] addrconf_add_dev+0x1e/0xa0 [ 3455.011382] addrconf_init_auto_addrs+0xb0/0x720 [ 3455.013537] addrconf_notify+0x35f/0x8d0 [ 3455.014214] notifier_call_chain+0x38/0xf0 [ 3455.014903] netdev_state_change+0x65/0x90 [ 3455.015586] linkwatch_do_dev+0x5a/0x70 [ 3455.016238] rtnl_getlink+0x241/0x3e0 [ 3455.019046] rtnetlink_rcv_msg+0x177/0x5e0 Similarly, linkwatch might get to ipv6_add_dev without ops lock: [ 3456.656261] ? ipv6_add_dev+0x370/0x620 [ 3456.660039] ipv6_find_idev+0x96/0xe0 [ 3456.660445] addrconf_add_dev+0x1e/0xa0 [ 3456.660861] addrconf_init_auto_addrs+0xb0/0x720 [ 3456.661803] addrconf_notify+0x35f/0x8d0 [ 3456.662236] notifier_call_chain+0x38/0xf0 [ 3456.662676] netdev_state_change+0x65/0x90 [ 3456.663112] linkwatch_do_dev+0x5a/0x70 [ 3456.663529] __linkwatch_run_queue+0xeb/0x200 [ 3456.663990] linkwatch_event+0x21/0x30 [ 3456.664399] process_one_work+0x211/0x610 [ 3456.664828] worker_thread+0x1cc/0x380 [ 3456.665691] kthread+0xf4/0x210 Reclassify NETDEV_CHANGE as a notifier that consistently runs under the instance lock. Link: https://lore.kernel.org/netdev/aac073de8beec3e531c86c101b274d434741c28e.camel@nvidia.com/ Reported-by: Cosmin Ratiu <cratiu@nvidia.com> Tested-by: Cosmin Ratiu <cratiu@nvidia.com> Fixes: `ad7c7b2172` ("net: hold netdev instance lock during sysfs operations") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250404161122.3907628-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-07 11:13:39 -07:00
Thomas Gleixner	8fa7292fee	treewide: Switch/rename to timer_delete[_sync]() timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2025-04-05 10:30:12 +02:00
Jakub Kicinski	34f71de3f5	net: avoid false positive warnings in __net_mp_close_rxq() Commit under Fixes solved the problem of spurious warnings when we uninstall an MP from a device while its down. The __net_mp_close_rxq() which is used by io_uring was not fixed. Move the fix over and reuse __net_mp_close_rxq() in the devmem path. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Fixes: `a70f891e0f` ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()") Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250403013405.2827250-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-04 07:35:38 -07:00
Jakub Kicinski	ec304b70d4	net: move mp dev config validation to __net_mp_open_rxq() devmem code performs a number of safety checks to avoid having to reimplement all of them in the drivers. Move those to __net_mp_open_rxq() and reuse that function for binding to make sure that io_uring ZC also benefits from them. While at it rename the queue ID variable to rxq_idx in __net_mp_open_rxq(), we touch most of the relevant lines. The XArray insertion is reordered after the netdev_rx_queue_restart() call, otherwise we'd need to duplicate the queue index check or risk inserting an invalid pointer. The XArray allocation failures should be extremely rare. Reviewed-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Fixes: `6e18ed929d` ("net: add helpers for setting a memory provider on an rx queue") Link: https://patch.msgid.link/20250403013405.2827250-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-04 07:35:38 -07:00
Stanislav Fomichev	1901066aab	netdevsim: add dummy device notifiers In order to exercise and verify notifiers' locking assumptions, register dummy notifiers (via register_netdevice_notifier_dev_net). Share notifier event handler that enforces the assumptions with lock_debug.c (rename and export rtnl_net_debug_event as netdev_debug_event). Add ops lock asserts to netdev_debug_event. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250401163452.622454-6-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-03 15:32:08 -07:00
Stanislav Fomichev	b912d599d3	net: rename rtnl_net_debug to lock_debug And make it selected by CONFIG_DEBUG_NET. Don't rename any of the structs/functions. Next patch will use rtnl_net_debug_event in netdevsim. Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250401163452.622454-5-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-03 15:32:08 -07:00
Stanislav Fomichev	8965c160b8	net: use netif_disable_lro in ipv6_add_dev ipv6_add_dev might call dev_disable_lro which unconditionally grabs instance lock, so it will deadlock during NETDEV_REGISTER. Switch to netif_disable_lro. Make sure all callers hold the instance lock as well. Cc: Cosmin Ratiu <cratiu@nvidia.com> Fixes: `ad7c7b2172` ("net: hold netdev instance lock during sysfs operations") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250401163452.622454-4-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-03 15:32:08 -07:00
Stanislav Fomichev	4c975fd700	net: hold instance lock during NETDEV_REGISTER/UP Callers of inetdev_init can come from several places with inconsistent expectation about netdev instance lock. Grab instance lock during REGISTER (plus UP). Also solve the inconsistency with UNREGISTER where it was locked only during move netns path. WARNING: CPU: 10 PID: 1479 at ./include/net/netdev_lock.h:54 __netdev_update_features+0x65f/0xca0 __warn+0x81/0x180 __netdev_update_features+0x65f/0xca0 report_bug+0x156/0x180 handle_bug+0x4f/0x90 exc_invalid_op+0x13/0x60 asm_exc_invalid_op+0x16/0x20 __netdev_update_features+0x65f/0xca0 netif_disable_lro+0x30/0x1d0 inetdev_init+0x12f/0x1f0 inetdev_event+0x48b/0x870 notifier_call_chain+0x38/0xf0 register_netdevice+0x741/0x8b0 register_netdev+0x1f/0x40 mlx5e_probe+0x4e3/0x8e0 [mlx5_core] auxiliary_bus_probe+0x3f/0x90 really_probe+0xc3/0x3a0 __driver_probe_device+0x80/0x150 driver_probe_device+0x1f/0x90 __device_attach_driver+0x7d/0x100 bus_for_each_drv+0x80/0xd0 __device_attach+0xb4/0x1c0 bus_probe_device+0x91/0xa0 device_add+0x657/0x870 Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reported-by: Cosmin Ratiu <cratiu@nvidia.com> Fixes: `ad7c7b2172` ("net: hold netdev instance lock during sysfs operations") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250401163452.622454-3-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-03 15:32:08 -07:00
Antoine Tenart	3a0a3ff659	net: decrease cached dst counters in dst_release Upstream fix `ac888d5886` ("net: do not delay dst_entries_add() in dst_release()") moved decrementing the dst count from dst_destroy to dst_release to avoid accessing already freed data in case of netns dismantle. However in case CONFIG_DST_CACHE is enabled and OvS+tunnels are used, this fix is incomplete as the same issue will be seen for cached dsts: Unable to handle kernel paging request at virtual address ffff5aabf6b5c000 Call trace: percpu_counter_add_batch+0x3c/0x160 (P) dst_release+0xec/0x108 dst_cache_destroy+0x68/0xd8 dst_destroy+0x13c/0x168 dst_destroy_rcu+0x1c/0xb0 rcu_do_batch+0x18c/0x7d0 rcu_core+0x174/0x378 rcu_core_si+0x18/0x30 Fix this by invalidating the cache, and thus decrementing cached dst counters, in dst_release too. Fixes: `d71785ffc7` ("net: add dst_cache to ovs vxlan lwtunnel") Signed-off-by: Antoine Tenart <atenart@kernel.org> Link: https://patch.msgid.link/20250326173634.31096-1-atenart@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-04-03 13:05:07 +02:00
Kuniyuki Iwashima	1b7fdc702c	rtnetlink: Use register_pernet_subsys() in rtnl_net_debug_init(). rtnl_net_debug_init() registers rtnl_net_debug_net_ops by register_pernet_device() but calls unregister_pernet_subsys() in case register_netdevice_notifier() fails. It corrupts pernet_list because first_device is updated in register_pernet_device() but not unregister_pernet_subsys(). Let's fix it by calling register_pernet_subsys() instead. The _subsys() one fits better for the use case because it keeps the notifier alive until default_device_exit_net(), giving it more chance to test NETDEV_UNREGISTER. Fixes: `03fa534856` ("rtnetlink: Add ASSERT_RTNL_NET() placeholder for netdev notifier.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250401190716.70437-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-02 17:17:22 -07:00
Stanislav Fomichev	d996e412b2	bpf: add missing ops lock around dev_xdp_attach_link Syzkaller points out that create_link path doesn't grab ops lock, add it. Reported-by: syzbot+08936936fe8132f91f1a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/67e6b3e8.050a0220.2f068f.0079.GAE@google.com/ Fixes: `97246d6d21` ("net: hold netdev instance lock during ndo_bpf") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250331142814.1887506-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-04-02 16:06:15 -07:00
Linus Torvalds	acc4d5ff0b	Rather tiny PR, mostly so that we can get into our trees your fix to the x86 Makefile. Current release - regressions: - Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc", error queue accounting was missed Current release - new code bugs: - 5 fixes for the netdevice instance locking work Previous releases - regressions: - usbnet: restore usb%d name exception for local mac addresses Previous releases - always broken: - rtnetlink: allocate vfinfo size for VF GUIDs when supported, avoid spurious GET_LINK failures - eth: mana: Switch to page pool for jumbo frames - phy: broadcom: Correct BCM5221 PHY model detection Misc: - selftests: drv-net: replace helpers for referring to other files Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmfrLiQACgkQMUZtbf5S IrvDsg//VH5VmkMau/TUF1WpwA03Wb/X0UqtM72lufVCMGYCogqjHZzs9popfhuu CCi4ZiBofEJHfN9WYJoumUL8aOdux0Q875o7h5gvfqKCokkUbJDk3W32QiLj1RqZ L14TorNOyS9Tg9m60pLa/6H3WS0kduhUGLuLzgTmOtT/YVJrOGmBtk/tRXFAKclH f3CPucvZGS5Fr4HfUc3yWXiocjibDJnv+jE+xW6S6YHpghvsK7anUvqIGTqQV8Vp s6AuvFULGO00968hFHcO0N8f8MnaCCJr2bcHCOXyjssdEbdvDOqzhFN4KhxWEPbK kCl3rLkPdkXo+ekOC7gIxXXaKVz3IVm28pegtEws8fda/iLuqZTp+BKo7kdrf3Iy br0rP/iK3eFN0M1XpVUIbmEuJ6VamztCzK88uvDKdI+Ol3GLAfy9v5NmkbzJI+aE cw+SyE6NgbeDeHBxvOu2F7G2sWMBkTEGaHMNXCv7I/VAvQsbk48onTVnpA+GlFD5 vFqMxiZHLBUfFOfUcHxmw8KAkZ44pc1xEkpw/4s8GZOfq+1oWz2LrSQ8M8Hjs2VN NTuE44OOsBDPOJ54iDAIOr5jyF+ZDeWEuPbvWbHGri80gII+2iF2788N2ToyAkyR R0t4VtJwXF+8D6IxTG41OIvAuq9Zc8AqM6O7VnOyFhZtKqezBTI= =BDoM -----END PGP SIGNATURE----- Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Rather tiny pull request, mostly so that we can get into our trees your fix to the x86 Makefile. Current release - regressions: - Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc", error queue accounting was missed Current release - new code bugs: - 5 fixes for the netdevice instance locking work Previous releases - regressions: - usbnet: restore usb%d name exception for local mac addresses Previous releases - always broken: - rtnetlink: allocate vfinfo size for VF GUIDs when supported, avoid spurious GET_LINK failures - eth: mana: Switch to page pool for jumbo frames - phy: broadcom: Correct BCM5221 PHY model detection Misc: - selftests: drv-net: replace helpers for referring to other files" * tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (22 commits) Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc" bnxt_en: bring back rtnl lock in bnxt_shutdown eth: gve: add missing netdev locks on reset and shutdown paths selftests: mptcp: ignore mptcp_diag binary selftests: mptcp: close fd_in before returning in main_loop selftests: mptcp: fix incorrect fd checks in main_loop mptcp: fix NULL pointer in can_accept_new_subflow octeontx2-af: Free NIX_AF_INT_VEC_GEN irq octeontx2-af: Fix mbox INTR handler when num VFs > 64 net: fix use-after-free in the netdev_nl_sock_priv_destroy() selftests: net: use Path helpers in ping selftests: net: use the dummy bpf from net/lib selftests: drv-net: replace the rpath helper with Path objects net: lapbether: use netdev_lockdep_set_classes() helper net: phy: broadcom: Correct BCM5221 PHY model detection net: usb: usbnet: restore usb%d name exception for local mac addresses net/mlx5e: SHAMPO, Make reserved size independent of page size net: mana: Switch to page pool for jumbo frames MAINTAINERS: Add dedicated entries for phy_link_topology net: move replay logic to tc_modify_qdisc ...	2025-04-01 20:00:51 -07:00
Taehee Yoo	42f3423878	net: fix use-after-free in the netdev_nl_sock_priv_destroy() In the netdev_nl_sock_priv_destroy(), an instance lock is acquired before calling net_devmem_unbind_dmabuf(), then releasing an instance lock(netdev_unlock(binding->dev)). However, a binding is freed in the net_devmem_unbind_dmabuf(). So using a binding after net_devmem_unbind_dmabuf() occurs UAF. To fix this UAF, it needs to use temporary variable. Fixes: `ba6f418fbf` ("net: bubble up taking netdev instance lock to callers of net_devmem_unbind_dmabuf()") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250328062237.3746875-1-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-31 16:44:49 -07:00
Linus Torvalds	fa593d0f96	bpf-next-6.15 -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmfi6ZAACgkQ6rmadz2v bTpLOg/+J7xUddPMhlpFAUlifQEadE5hmw6v1tXpM3zyKHzUWJiv/qsx3j8/ckgD D+d4P8bqIbI9SSuIS4oZ0+D9pr/g7GYztnoYZmPiYJ7v2AijPuof5dsagFQE8E2y rhfbt9KHTMzzkdkTvaAZaITS/HWAoJ2YVRB6gfLex2ghcXYHcgmtKRZniQrbBiFZ MIXBN8Rg6HP+pUdIVllSXFcQCb3XIgjPONRAos4hr5tIm+3Ku7Jvkgk2H/9vUcoF bdXAcg8xygyH7eY+1l3e7nEPQlG0jUZEsL+tq+vpdoLRLqlIpAUYmwUvqcmq4dPS QGFjiUcpDbXlxsUFpzjXHIFto7fXCfND7HEICQPwAncdflIIfYaATSQUfkEexn0a wBCFlAChrEzAmg2vFl4EeEr0fdSe/3jswrgKx0m6ctKieMjgloBUeeH4fXOpfkhS 9tvhuduVFuronlebM8ew4w9T/mBgbyxkE5KkvP4hNeB3ni3N0K6Mary5/u2HyN1e lqTlnZxRA4p6lrvxce/mDrR4VSwlKLcSeQVjxAL1afD5KRkuZJnUv7bUhS361vkG IjNrQX30EisDAz+X7tMn3ndBf9vVatwFT4+c3yaxlQRor1WofhDfT88HPiyB4QqQ Kdx2EHgbQxJp4vkzhp4/OXlTfkihsMEn8egzZuphdPEQ9Y+Jdwg= =aN/V -----END PGP SIGNATURE----- Merge tag 'bpf-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: "For this merge window we're splitting BPF pull request into three for higher visibility: main changes, res_spin_lock, try_alloc_pages. These are the main BPF changes: - Add DFA-based live registers analysis to improve verification of programs with loops (Eduard Zingerman) - Introduce load_acquire and store_release BPF instructions and add x86, arm64 JIT support (Peilin Ye) - Fix loop detection logic in the verifier (Eduard Zingerman) - Drop unnecesary lock in bpf_map_inc_not_zero() (Eric Dumazet) - Add kfunc for populating cpumask bits (Emil Tsalapatis) - Convert various shell based tests to selftests/bpf/test_progs format (Bastien Curutchet) - Allow passing referenced kptrs into struct_ops callbacks (Amery Hung) - Add a flag to LSM bpf hook to facilitate bpf program signing (Blaise Boscaccy) - Track arena arguments in kfuncs (Ihor Solodrai) - Add copy_remote_vm_str() helper for reading strings from remote VM and bpf_copy_from_user_task_str() kfunc (Jordan Rome) - Add support for timed may_goto instruction (Kumar Kartikeya Dwivedi) - Allow bpf_get_netns_cookie() int cgroup_skb programs (Mahe Tardy) - Reduce bpf_cgrp_storage_busy false positives when accessing cgroup local storage (Martin KaFai Lau) - Introduce bpf_dynptr_copy() kfunc (Mykyta Yatsenko) - Allow retrieving BTF data with BTF token (Mykyta Yatsenko) - Add BPF kfuncs to set and get xattrs with 'security.bpf.' prefix (Song Liu) - Reject attaching programs to noreturn functions (Yafang Shao) - Introduce pre-order traversal of cgroup bpf programs (Yonghong Song)" * tag 'bpf-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (186 commits) selftests/bpf: Add selftests for load-acquire/store-release when register number is invalid bpf: Fix out-of-bounds read in check_atomic_load/store() libbpf: Add namespace for errstr making it libbpf_errstr bpf: Add struct_ops context information to struct bpf_prog_aux selftests/bpf: Sanitize pointer prior fclose() selftests/bpf: Migrate test_xdp_vlan.sh into test_progs selftests/bpf: test_xdp_vlan: Rename BPF sections bpf: clarify a misleading verifier error message selftests/bpf: Add selftest for attaching fexit to __noreturn functions bpf: Reject attaching fexit/fmod_ret to __noreturn functions bpf: Only fails the busy counter check in bpf_cgrp_storage_get if it creates storage bpf: Make perf_event_read_output accessible in all program types. bpftool: Using the right format specifiers bpftool: Add -Wformat-signedness flag to detect format errors selftests/bpf: Test freplace from user namespace libbpf: Pass BPF token from find_prog_btf_id to BPF_BTF_GET_FD_BY_ID bpf: Return prog btf_id without capable check bpf: BPF token support for BPF_BTF_GET_FD_BY_ID bpf, x86: Fix objtool warning for timed may_goto bpf: Check map->record at the beginning of check_and_free_fields() ...	2025-03-30 12:43:03 -07:00
Mark Zhang	23f0080761	rtnetlink: Allocate vfinfo size for VF GUIDs when supported Commit `30aad41721` ("net/core: Add support for getting VF GUIDs") added support for getting VF port and node GUIDs in netlink ifinfo messages, but their size was not taken into consideration in the function that allocates the netlink message, causing the following warning when a netlink message is filled with many VF port and node GUIDs: # echo 64 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_numvfs # ip link show dev ib0 RTNETLINK answers: Message too long Cannot send link get request: Message too long Kernel warning: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1930 at net/core/rtnetlink.c:4151 rtnl_getlink+0x586/0x5a0 Modules linked in: xt_conntrack xt_MASQUERADE nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter overlay mlx5_ib macsec mlx5_core tls rpcrdma rdma_ucm ib_uverbs ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm iw_cm ib_ipoib fuse ib_cm ib_core CPU: 2 UID: 0 PID: 1930 Comm: ip Not tainted 6.14.0-rc2+ #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:rtnl_getlink+0x586/0x5a0 Code: cb 82 e8 3d af 0a 00 4d 85 ff 0f 84 08 ff ff ff 4c 89 ff 41 be ea ff ff ff e8 66 63 5b ff 49 c7 07 80 4f cb 82 e9 36 fc ff ff <0f> 0b e9 16 fe ff ff e8 de a0 56 00 66 66 2e 0f 1f 84 00 00 00 00 RSP: 0018:ffff888113557348 EFLAGS: 00010246 RAX: 00000000ffffffa6 RBX: ffff88817e87aa34 RCX: dffffc0000000000 RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff88817e87afb8 RBP: 0000000000000009 R08: ffffffff821f44aa R09: 0000000000000000 R10: ffff8881260f79a8 R11: ffff88817e87af00 R12: ffff88817e87aa00 R13: ffffffff8563d300 R14: 00000000ffffffa6 R15: 00000000ffffffff FS: 00007f63a5dbf280(0000) GS:ffff88881ee00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f63a5ba4493 CR3: 00000001700fe002 CR4: 0000000000772eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __warn+0xa5/0x230 ? rtnl_getlink+0x586/0x5a0 ? report_bug+0x22d/0x240 ? handle_bug+0x53/0xa0 ? exc_invalid_op+0x14/0x50 ? asm_exc_invalid_op+0x16/0x20 ? skb_trim+0x6a/0x80 ? rtnl_getlink+0x586/0x5a0 ? __pfx_rtnl_getlink+0x10/0x10 ? rtnetlink_rcv_msg+0x1e5/0x860 ? __pfx___mutex_lock+0x10/0x10 ? rcu_is_watching+0x34/0x60 ? __pfx_lock_acquire+0x10/0x10 ? stack_trace_save+0x90/0xd0 ? filter_irq_stacks+0x1d/0x70 ? kasan_save_stack+0x30/0x40 ? kasan_save_stack+0x20/0x40 ? kasan_save_track+0x10/0x30 rtnetlink_rcv_msg+0x21c/0x860 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e ? __pfx_rtnetlink_rcv_msg+0x10/0x10 ? arch_stack_walk+0x9e/0xf0 ? rcu_is_watching+0x34/0x60 ? lock_acquire+0xd5/0x410 ? rcu_is_watching+0x34/0x60 netlink_rcv_skb+0xe0/0x210 ? __pfx_rtnetlink_rcv_msg+0x10/0x10 ? __pfx_netlink_rcv_skb+0x10/0x10 ? rcu_is_watching+0x34/0x60 ? __pfx___netlink_lookup+0x10/0x10 ? lock_release+0x62/0x200 ? netlink_deliver_tap+0xfd/0x290 ? rcu_is_watching+0x34/0x60 ? lock_release+0x62/0x200 ? netlink_deliver_tap+0x95/0x290 netlink_unicast+0x31f/0x480 ? __pfx_netlink_unicast+0x10/0x10 ? rcu_is_watching+0x34/0x60 ? lock_acquire+0xd5/0x410 netlink_sendmsg+0x369/0x660 ? lock_release+0x62/0x200 ? __pfx_netlink_sendmsg+0x10/0x10 ? import_ubuf+0xb9/0xf0 ? __import_iovec+0x254/0x2b0 ? lock_release+0x62/0x200 ? __pfx_netlink_sendmsg+0x10/0x10 ____sys_sendmsg+0x559/0x5a0 ? __pfx_____sys_sendmsg+0x10/0x10 ? __pfx_copy_msghdr_from_user+0x10/0x10 ? rcu_is_watching+0x34/0x60 ? do_read_fault+0x213/0x4a0 ? rcu_is_watching+0x34/0x60 ___sys_sendmsg+0xe4/0x150 ? __pfx____sys_sendmsg+0x10/0x10 ? do_fault+0x2cc/0x6f0 ? handle_pte_fault+0x2e3/0x3d0 ? __pfx_handle_pte_fault+0x10/0x10 ? preempt_count_sub+0x14/0xc0 ? __down_read_trylock+0x150/0x270 ? __handle_mm_fault+0x404/0x8e0 ? __pfx___handle_mm_fault+0x10/0x10 ? lock_release+0x62/0x200 ? __rcu_read_unlock+0x65/0x90 ? rcu_is_watching+0x34/0x60 __sys_sendmsg+0xd5/0x150 ? __pfx___sys_sendmsg+0x10/0x10 ? __up_read+0x192/0x480 ? lock_release+0x62/0x200 ? __rcu_read_unlock+0x65/0x90 ? rcu_is_watching+0x34/0x60 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f63a5b13367 Code: 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10 RSP: 002b:00007fff8c726bc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000067b687c2 RCX: 00007f63a5b13367 RDX: 0000000000000000 RSI: 00007fff8c726c30 RDI: 0000000000000004 RBP: 00007fff8c726cb8 R08: 0000000000000000 R09: 0000000000000034 R10: 00007fff8c726c7c R11: 0000000000000246 R12: 0000000000000001 R13: 0000000000000000 R14: 00007fff8c726cd0 R15: 00007fff8c726cd0 </TASK> irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff813f9e58>] copy_process+0xd08/0x2830 softirqs last enabled at (0): [<ffffffff813f9e58>] copy_process+0xd08/0x2830 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace 0000000000000000 ]--- Thus, when calculating ifinfo message size, take VF GUIDs sizes into account when supported. Fixes: `30aad41721` ("net/core: Add support for getting VF GUIDs") Signed-off-by: Mark Zhang <markzhang@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250325090226.749730-1-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-27 10:11:30 -07:00
Linus Torvalds	1a9239bb42	Networking changes for 6.15. Core & protocols ---------------- - Continue Netlink conversions to per-namespace RTNL lock (IPv4 routing, routing rules, routing next hops, ARP ioctls). - Continue extending the use of netdev instance locks. As a driver opt-in protect queue operations and (in due course) ethtool operations with the instance lock and not RTNL lock. - Support collecting TCP timestamps (data submitted, sent, acked) in BPF, allowing for transparent (to the application) and lower overhead tracking of TCP RPC performance. - Tweak existing networking Rx zero-copy infra to support zero-copy Rx via io_uring. - Optimize MPTCP performance in single subflow mode by 29%. - Enable GRO on packets which went thru XDP CPU redirect (were queued for processing on a different CPU). Improving TCP stream performance up to 2x. - Improve performance of contended connect() by 200% by searching for an available 4-tuple under RCU rather than a spin lock. Bring an additional 229% improvement by tweaking hash distribution. - Avoid unconditionally touching sk_tsflags on RX, improving performance under UDP flood by as much as 10%. - Avoid skb_clone() dance in ping_rcv() to improve performance under ping flood. - Avoid FIB lookup in netfilter if socket is available, 20% perf win. - Rework network device creation (in-kernel) API to more clearly identify network namespaces and their roles. There are up to 4 namespace roles but we used to have just 2 netns pointer arguments, interpreted differently based on context. - Use sysfs_break_active_protection() instead of trylock to avoid deadlocks between unregistering objects and sysfs access. - Add a new sysctl and sockopt for capping max retransmit timeout in TCP. - Support masking port and DSCP in routing rule matches. - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST. - Support specifying at what time packet should be sent on AF_XDP sockets. - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin users. - Add Netlink YAML spec for WiFi (nl80211) and conntrack. - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols which only need to be exported when IPv6 support is built as a module. - Age FDB entries based on Rx not Tx traffic in VxLAN, similar to normal bridging. - Allow users to specify source port range for GENEVE tunnels. - netconsole: allow attaching kernel release, CPU ID and task name to messages as metadata Driver API ---------- - Continue rework / fixing of Energy Efficient Ethernet (EEE) across the SW layers. Delegate the responsibilities to phylink where possible. Improve its handling in phylib. - Support symmetric OR-XOR RSS hashing algorithm. - Support tracking and preserving IRQ affinity by NAPI itself. - Support loopback mode speed selection for interface selftests. Device drivers -------------- - Remove the IBM LCS driver for s390. - Remove the sb1000 cable modem driver. - Add support for SFP module access over SMBus. - Add MCTP transport driver for MCTP-over-USB. - Enable XDP metadata support in multiple drivers. - Ethernet high-speed NICs: - Broadcom (bnxt): - add PCIe TLP Processing Hints (TPH) support for new AMD platforms - support dumping RoCE queue state for debug - opt into instance locking - Intel (100G, ice, idpf): - ice: rework MSI-X IRQ management and distribution - ice: support for E830 devices - iavf: add support for Rx timestamping - iavf: opt into instance locking - nVidia/Mellanox: - mlx4: use page pool memory allocator for Rx - mlx5: support for one PTP device per hardware clock - mlx5: support for 200Gbps per-lane link modes - mlx5: move IPSec policy check after decryption - AMD/Solarflare: - support FW flashing via devlink - Cisco (enic): - use page pool memory allocator for Rx - enable 32, 64 byte CQEs - get max rx/tx ring size from the device - Meta (fbnic): - support flow steering and RSS configuration - report queue stats - support TCP segmentation - support IRQ coalescing - support ring size configuration - Marvell/Cavium: - support AF_XDP - Wangxun: - support for PTP clock and timestamping - Huawei (hibmcge): - checksum offload - add more statistics - Ethernet virtual: - VirtIO net: - aggressively suppress Tx completions, improve perf by 96% with 1 CPU and 55% with 2 CPUs - expose NAPI to IRQ mapping and persist NAPI settings - Google (gve): - support XDP in DQO RDA Queue Format - opt into instance locking - Microsoft vNIC: - support BIG TCP - Ethernet NICs consumer, and embedded: - Synopsys (stmmac): - cleanup Tx and Tx clock setting and other link-focused cleanups - enable SGMII and 2500BASEX mode switching for Intel platforms - support Sophgo SG2044 - Broadcom switches (b53): - support for BCM53101 - TI: - iep: add perout configuration support - icssg: support XDP - Cadence (macb): - implement BQL - Xilinx (axinet): - support dynamic IRQ moderation and changing coalescing at runtime - implement BQL - report standard stats - MediaTek: - support phylink managed EEE - Intel: - igc: don't restart the interface on every XDP program change - RealTek (r8169): - support reading registers of internal PHYs directly - increase max jumbo packet size on RTL8125/RTL8126 - Airoha: - support for RISC-V NPU packet processing unit - enable scatter-gather and support MTU up to 9kB - Tehuti (tn40xx): - support cards with TN4010 MAC and an Aquantia AQR105 PHY - Ethernet PHYs: - support for TJA1102S, TJA1121 - dp83tg720: add randomized polling intervals for link detection - dp83822: support changing the transmit amplitude voltage - support for LEDs on 88q2xxx - CAN: - canxl: support Remote Request Substitution bit access - flexcan: add S32G2/S32G3 SoC - WiFi: - remove cooked monitor support - strict mode for better AP testing - basic EPCS support - OMI RX bandwidth reduction support - batman-adv: add support for jumbo frames - WiFi drivers: - RealTek (rtw88): - support RTL8814AE and RTL8814AU - RealTek (rtw89): - switch using wiphy_lock and wiphy_work - add BB context to manipulate two PHY as preparation of MLO - improve BT-coexistence mechanism to play A2DP smoothly - Intel (iwlwifi): - add new iwlmld sub-driver for latest HW/FW combinations - MediaTek (mt76): - preparation for mt7996 Multi-Link Operation (MLO) support - Qualcomm/Atheros (ath12k): - continued work on MLO - Silabs (wfx): - Wake-on-WLAN support - Bluetooth: - add support for skb TX SND/COMPLETION timestamping - hci_core: enable buffer flow control for SCO/eSCO - coredump: log devcd dumps into the monitor - Bluetooth drivers: - intel: add support to configure TX power - nxp: handle bootloader error during cmd5 and cmd7 Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmfkLC8ACgkQMUZtbf5S Irsb5g/+L7oKOf0ALbaV9kxFsoz8AymZfAW9i/27F07omGJGpks8oX6j6rQLgIRO OQOFcp7XEdDh1+jh82gHVuPrw2/6lchLtW8ARtzdiQKFr5DRjrsbtua6GRc8iBqA DIRCBFoV2HuMkF39Vr09HMa9AZAT7QR2RLsRGpSq8E8Z8xxKz0X7oujs10PFpMTE IVKhTrVrk+NDot/IU2hzVpnpup+0ld+T2/ZaBklJGcU8uDffImsqNepHRyCG5UC3 xz74Ju23MAj24Gct+og0yFUooF+lUltKyVm0FYCDCY3bASTwgY01NR3kEH/0NQvM cywLzd/ngHm/SMD2ggVAHkjZUieiIVHdaZ53dgjDeBOQoVP6p0dgUK7EumXX8Mx4 8ReR2UiGoYRPaq9c4o+IjG4K027MwVK2p+mF1a6MLa+20XcyMbev8FIRbbHtC/V4 z5/FsOAxcuICWkA1hU9bODrrGzIqemmdRgKG8sGuTJCt/kYGAn72/TCATGNSaCJ0 00n2jN1aepa7wtywHJ5MhVzxN9iQX7+geUHXz0BI+lK4e1Pmk+vjGksymb9ai2fk eQAUV9ekub6q68/J16scD7XeOUM37bTLiMBQeIF8UtZBOJscKiS71zn9QP9Twwxv P2pm01RDZUI+z5ZX3hc12Pm1vjRHaAh9S1JpAw/pTOVlQ+mAJEM= =XY0S -----END PGP SIGNATURE----- Merge tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Continue Netlink conversions to per-namespace RTNL lock (IPv4 routing, routing rules, routing next hops, ARP ioctls) - Continue extending the use of netdev instance locks. As a driver opt-in protect queue operations and (in due course) ethtool operations with the instance lock and not RTNL lock. - Support collecting TCP timestamps (data submitted, sent, acked) in BPF, allowing for transparent (to the application) and lower overhead tracking of TCP RPC performance. - Tweak existing networking Rx zero-copy infra to support zero-copy Rx via io_uring. - Optimize MPTCP performance in single subflow mode by 29%. - Enable GRO on packets which went thru XDP CPU redirect (were queued for processing on a different CPU). Improving TCP stream performance up to 2x. - Improve performance of contended connect() by 200% by searching for an available 4-tuple under RCU rather than a spin lock. Bring an additional 229% improvement by tweaking hash distribution. - Avoid unconditionally touching sk_tsflags on RX, improving performance under UDP flood by as much as 10%. - Avoid skb_clone() dance in ping_rcv() to improve performance under ping flood. - Avoid FIB lookup in netfilter if socket is available, 20% perf win. - Rework network device creation (in-kernel) API to more clearly identify network namespaces and their roles. There are up to 4 namespace roles but we used to have just 2 netns pointer arguments, interpreted differently based on context. - Use sysfs_break_active_protection() instead of trylock to avoid deadlocks between unregistering objects and sysfs access. - Add a new sysctl and sockopt for capping max retransmit timeout in TCP. - Support masking port and DSCP in routing rule matches. - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST. - Support specifying at what time packet should be sent on AF_XDP sockets. - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin users. - Add Netlink YAML spec for WiFi (nl80211) and conntrack. - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols which only need to be exported when IPv6 support is built as a module. - Age FDB entries based on Rx not Tx traffic in VxLAN, similar to normal bridging. - Allow users to specify source port range for GENEVE tunnels. - netconsole: allow attaching kernel release, CPU ID and task name to messages as metadata Driver API: - Continue rework / fixing of Energy Efficient Ethernet (EEE) across the SW layers. Delegate the responsibilities to phylink where possible. Improve its handling in phylib. - Support symmetric OR-XOR RSS hashing algorithm. - Support tracking and preserving IRQ affinity by NAPI itself. - Support loopback mode speed selection for interface selftests. Device drivers: - Remove the IBM LCS driver for s390 - Remove the sb1000 cable modem driver - Add support for SFP module access over SMBus - Add MCTP transport driver for MCTP-over-USB - Enable XDP metadata support in multiple drivers - Ethernet high-speed NICs: - Broadcom (bnxt): - add PCIe TLP Processing Hints (TPH) support for new AMD platforms - support dumping RoCE queue state for debug - opt into instance locking - Intel (100G, ice, idpf): - ice: rework MSI-X IRQ management and distribution - ice: support for E830 devices - iavf: add support for Rx timestamping - iavf: opt into instance locking - nVidia/Mellanox: - mlx4: use page pool memory allocator for Rx - mlx5: support for one PTP device per hardware clock - mlx5: support for 200Gbps per-lane link modes - mlx5: move IPSec policy check after decryption - AMD/Solarflare: - support FW flashing via devlink - Cisco (enic): - use page pool memory allocator for Rx - enable 32, 64 byte CQEs - get max rx/tx ring size from the device - Meta (fbnic): - support flow steering and RSS configuration - report queue stats - support TCP segmentation - support IRQ coalescing - support ring size configuration - Marvell/Cavium: - support AF_XDP - Wangxun: - support for PTP clock and timestamping - Huawei (hibmcge): - checksum offload - add more statistics - Ethernet virtual: - VirtIO net: - aggressively suppress Tx completions, improve perf by 96% with 1 CPU and 55% with 2 CPUs - expose NAPI to IRQ mapping and persist NAPI settings - Google (gve): - support XDP in DQO RDA Queue Format - opt into instance locking - Microsoft vNIC: - support BIG TCP - Ethernet NICs consumer, and embedded: - Synopsys (stmmac): - cleanup Tx and Tx clock setting and other link-focused cleanups - enable SGMII and 2500BASEX mode switching for Intel platforms - support Sophgo SG2044 - Broadcom switches (b53): - support for BCM53101 - TI: - iep: add perout configuration support - icssg: support XDP - Cadence (macb): - implement BQL - Xilinx (axinet): - support dynamic IRQ moderation and changing coalescing at runtime - implement BQL - report standard stats - MediaTek: - support phylink managed EEE - Intel: - igc: don't restart the interface on every XDP program change - RealTek (r8169): - support reading registers of internal PHYs directly - increase max jumbo packet size on RTL8125/RTL8126 - Airoha: - support for RISC-V NPU packet processing unit - enable scatter-gather and support MTU up to 9kB - Tehuti (tn40xx): - support cards with TN4010 MAC and an Aquantia AQR105 PHY - Ethernet PHYs: - support for TJA1102S, TJA1121 - dp83tg720: add randomized polling intervals for link detection - dp83822: support changing the transmit amplitude voltage - support for LEDs on 88q2xxx - CAN: - canxl: support Remote Request Substitution bit access - flexcan: add S32G2/S32G3 SoC - WiFi: - remove cooked monitor support - strict mode for better AP testing - basic EPCS support - OMI RX bandwidth reduction support - batman-adv: add support for jumbo frames - WiFi drivers: - RealTek (rtw88): - support RTL8814AE and RTL8814AU - RealTek (rtw89): - switch using wiphy_lock and wiphy_work - add BB context to manipulate two PHY as preparation of MLO - improve BT-coexistence mechanism to play A2DP smoothly - Intel (iwlwifi): - add new iwlmld sub-driver for latest HW/FW combinations - MediaTek (mt76): - preparation for mt7996 Multi-Link Operation (MLO) support - Qualcomm/Atheros (ath12k): - continued work on MLO - Silabs (wfx): - Wake-on-WLAN support - Bluetooth: - add support for skb TX SND/COMPLETION timestamping - hci_core: enable buffer flow control for SCO/eSCO - coredump: log devcd dumps into the monitor - Bluetooth drivers: - intel: add support to configure TX power - nxp: handle bootloader error during cmd5 and cmd7" * tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1681 commits) unix: fix up for "apparmor: add fine grained af_unix mediation" mctp: Fix incorrect tx flow invalidation condition in mctp-i2c net: usb: asix: ax88772: Increase phy_name size net: phy: Introduce PHY_ID_SIZE — minimum size for PHY ID string net: libwx: fix Tx L4 checksum net: libwx: fix Tx descriptor content for some tunnel packets atm: Fix NULL pointer dereference net: tn40xx: add pci-id of the aqr105-based Tehuti TN4010 cards net: tn40xx: prepare tn40xx driver to find phy of the TN9510 card net: tn40xx: create swnode for mdio and aqr105 phy and add to mdiobus net: phy: aquantia: add essential functions to aqr105 driver net: phy: aquantia: search for firmware-name in fwnode net: phy: aquantia: add probe function to aqr105 for firmware loading net: phy: Add swnode support to mdiobus_scan gve: add XDP DROP and PASS support for DQ gve: update XDP allocation path support RX buffer posting gve: merge packet buffer size fields gve: update GQ RX to use buf_size gve: introduce config-based allocation for XDP gve: remove xdp_xsk_done and xdp_xsk_wakeup statistics ...	2025-03-26 21:48:21 -07:00
Jakub Kicinski	023b1e9d26	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Merge in late fixes to prepare for the 6.15 net-next PR. No conflicts, adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c `919f9f497d` ("eth: bnxt: fix out-of-range access of vnic_info array") `fe96d717d3` ("bnxt_en: Extend queue stop/start for TX rings") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-26 09:32:10 -07:00
Jakub Kicinski	4f74a45c6b	bluetooth-next pull request for net-next: core: - Add support for skb TX SND/COMPLETION timestamping - hci_core: Enable buffer flow control for SCO/eSCO - coredump: Log devcd dumps into the monitor drivers: - btusb: Add 2 HWIDs for MT7922 - btusb: Fix regression in the initialization of fake Bluetooth controllers - btusb: Add 14 USB device IDs for Qualcomm WCN785x - btintel: Add support for Intel Scorpius Peak - btintel: Add support to configure TX power - btintel: Add DSBR support for ScP - btintel_pcie: Add device id of Whale Peak - btintel_pcie: Setup buffers for firmware traces - btintel_pcie: Read hardware exception data - btintel_pcie: Add support for device coredump - btintel_pcie: Trigger device coredump on hardware exception - btnxpuart: Support for controller wakeup gpio config - btnxpuart: Add support to set BD address - btnxpuart: Add correct bootloader error codes - btnxpuart: Handle bootloader error during cmd5 and cmd7 - btnxpuart: Fix kernel panic during FW release - qca: add WCN3950 support - hci_qca: use the power sequencer for wcn6750 - btmtksdio: Prevent enabling interrupts after IRQ handler removal -----BEGIN PGP SIGNATURE----- iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmfjA3AZHGx1aXoudm9u LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKeWWD/0QZYBrNP9QSTyYTeNlYhCC Lw/n7n3+LhxOqIu+tOGS7UplTqR3p4WQGu2g+XV9wSu4dvLplZGn/40XtiXJA0+r VsXivQ4IR/Vjd8sNLLixfmuH4g4CbMSblQvECD3/5wFTeSH8T6/gyts/WV/LDYOJ jp5kG6HCFHMv7RaiaHdZ0Pe4c1xJ/Ek8TnrW4G/kPBaLzm+lhjRkfzxCx8cO0k0H mpoheUohLtUSgfLf49826t7rp3HDuX9db2hiGXQfKSrL2milwufKNaMFTmbuU2Uq IyAwR1CEdSsKlcpbnVNF05r94sjf8NjuBD3YWxB9OfVXq9aymJYZdGh54XT5nF3f ccD1WNnOoTsXDbEAnuVx+EYrJAI70e7vE4m8XbpxJcxhFoSIQCmtDoXUY199LiEG El7DlwouNFESmOIj6gSK7ogUEhmcQA0AJxBVpdxnGBAcQMn1hrGSb75VGg7rul8K HcBtf5j4gxZum2fzrufeY1+eBxfSRjNFIsnutAr53gEtidPZYlmiRyAuqQJMo2sO 0hlpC6gQGQJ5HiTR5Jbo9u+OC1oeHHXNN5IpNrd6J65n0pqCRldDsoXCerEJsOou cqWzb9lfQCrjRnWRxUWOLukdYRgBH15wuORU+Z7/qVhqOr49RvARnzq7AjGSenTS YKb2N+NGlqpxG+vQscFJUA== =XCdn -----END PGP SIGNATURE----- Merge tag 'for-net-next-2025-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: core: - Add support for skb TX SND/COMPLETION timestamping - hci_core: Enable buffer flow control for SCO/eSCO - coredump: Log devcd dumps into the monitor drivers: - btusb: Add 2 HWIDs for MT7922 - btusb: Fix regression in the initialization of fake Bluetooth controllers - btusb: Add 14 USB device IDs for Qualcomm WCN785x - btintel: Add support for Intel Scorpius Peak - btintel: Add support to configure TX power - btintel: Add DSBR support for ScP - btintel_pcie: Add device id of Whale Peak - btintel_pcie: Setup buffers for firmware traces - btintel_pcie: Read hardware exception data - btintel_pcie: Add support for device coredump - btintel_pcie: Trigger device coredump on hardware exception - btnxpuart: Support for controller wakeup gpio config - btnxpuart: Add support to set BD address - btnxpuart: Add correct bootloader error codes - btnxpuart: Handle bootloader error during cmd5 and cmd7 - btnxpuart: Fix kernel panic during FW release - qca: add WCN3950 support - hci_qca: use the power sequencer for wcn6750 - btmtksdio: Prevent enabling interrupts after IRQ handler removal * tag 'for-net-next-2025-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (53 commits) Bluetooth: MGMT: Add LL Privacy Setting Bluetooth: hci_event: Fix handling of HCI_EV_LE_DIRECT_ADV_REPORT Bluetooth: btnxpuart: Fix kernel panic during FW release Bluetooth: btnxpuart: Handle bootloader error during cmd5 and cmd7 Bluetooth: btnxpuart: Add correct bootloader error codes t blameBluetooth: btintel: Fix leading white space Bluetooth: btintel: Add support to configure TX power Bluetooth: btmtksdio: Prevent enabling interrupts after IRQ handler removal Bluetooth: btmtk: Remove the resetting step before downloading the fw Bluetooth: SCO: add TX timestamping Bluetooth: L2CAP: add TX timestamping Bluetooth: ISO: add TX timestamping Bluetooth: add support for skb TX SND/COMPLETION timestamping net-timestamp: COMPLETION timestamp on packet tx completion HCI: coredump: Log devcd dumps into the monitor Bluetooth: HCI: Add definition of hci_rp_remote_name_req_cancel Bluetooth: hci_vhci: Mark Sync Flow Control as supported Bluetooth: hci_core: Enable buffer flow control for SCO/eSCO Bluetooth: btintel_pci: Fix build warning Bluetooth: btintel_pcie: Trigger device coredump on hardware exception ... ==================== Link: https://patch.msgid.link/20250325192925.2497890-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-25 14:00:48 -07:00
Linus Torvalds	a50b4fe095	A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0 iJbvLJeisXr3GA== =bYAX -----END PGP SIGNATURE----- Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member" * tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits) wifi: rt2x00: Switch to use hrtimer_update_function() io_uring: Use helper function hrtimer_update_function() serial: xilinx_uartps: Use helper function hrtimer_update_function() ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup() RDMA: Switch to use hrtimer_setup() virtio: mem: Switch to use hrtimer_setup() drm/vmwgfx: Switch to use hrtimer_setup() drm/xe/oa: Switch to use hrtimer_setup() drm/vkms: Switch to use hrtimer_setup() drm/msm: Switch to use hrtimer_setup() drm/i915/request: Switch to use hrtimer_setup() drm/i915/uncore: Switch to use hrtimer_setup() drm/i915/pmu: Switch to use hrtimer_setup() drm/i915/perf: Switch to use hrtimer_setup() drm/i915/gvt: Switch to use hrtimer_setup() drm/i915/huc: Switch to use hrtimer_setup() drm/amdgpu: Switch to use hrtimer_setup() stm class: heartbeat: Switch to use hrtimer_setup() i2c: Switch to use hrtimer_setup() iio: Switch to use hrtimer_setup() ...	2025-03-25 10:54:15 -07:00
Jakub Kicinski	b52458652e	net: protect rxq->mp_params with the instance lock Ensure that all accesses to mp_params are under the netdev instance lock. The only change we need is to move dev_memory_provider_uninstall() under the lock. Appropriately swap the asserts. Reviewed-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250324224537.248800-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-25 10:06:49 -07:00
Jakub Kicinski	310ae9eb26	net: designate queue -> napi linking as "ops protected" netdev netlink is the only reader of netdev_{,rx_}queue->napi, and it already holds netdev->lock. Switch protection of the writes to netdev->lock to "ops protected". The expectation will be now that accessing queue->napi will require netdev->lock for "ops locked" drivers, and rtnl_lock for all other drivers. Current "ops locked" drivers don't require any changes. gve and netdevsim use _locked() helpers right next to netif_queue_set_napi() so they must be holding the instance lock. iavf doesn't call it. bnxt is a bit messy but all paths seem locked. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250324224537.248800-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-25 10:06:49 -07:00
Jakub Kicinski	0a65dcf624	net: designate queue counts as "double ops protected" by instance lock Drivers which opt into instance lock protection of ops should only call set_real_num_*_queues() under the instance lock. This means that queue counts are double protected (writes are under both rtnl_lock and instance lock, readers under either). Some readers may still be under the rtnl_lock, however, so for now we need double protection of writers. OTOH queue API paths are only under the protection of the instance lock, so we need to validate that the instance is actually locking ops, otherwise the input checks we do against queue count are racy. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250324224537.248800-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-25 10:06:49 -07:00
Jakub Kicinski	bae2da8261	net: remove netif_set_real_num_rx_queues() helper for when SYSFS=n Since commit `a953be53ce` ("net-sysfs: add support for device-specific rx queue sysfs attributes"), so for at least a decade now it is safe to call net_rx_queue_update_kobjects() when SYSFS=n. That function does its own ifdef-inery and will return 0. Remove the unnecessary stub for netif_set_real_num_rx_queues(). Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250324224537.248800-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-25 10:04:49 -07:00
Jakub Kicinski	ba6f418fbf	net: bubble up taking netdev instance lock to callers of net_devmem_unbind_dmabuf() A recent commit added taking the netdev instance lock in netdev_nl_bind_rx_doit(), but didn't remove it in net_devmem_unbind_dmabuf() which it calls from an error path. Always expect the callers of net_devmem_unbind_dmabuf() to hold the lock. This is consistent with net_devmem_bind_dmabuf(). (Not so) coincidentally this also protects mp_param with the instance lock, which the rest of this series needs. Fixes: `1d22d3060b` ("net: drop rtnl_lock for queue_mgmt operations") Reviewed-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250324224537.248800-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-25 10:04:49 -07:00
Pauli Virtanen	983e0e4e87	net-timestamp: COMPLETION timestamp on packet tx completion Add SOF_TIMESTAMPING_TX_COMPLETION, for requesting a software timestamp when hardware reports a packet completed. Completion tstamp is useful for Bluetooth, as hardware timestamps do not exist in the HCI specification except for ISO packets, and the hardware has a queue where packets may wait. In this case the software SND timestamp only reflects the kernel-side part of the total latency (usually small) and queue length (usually 0 unless HW buffers congested), whereas the completion report time is more informative of the true latency. It may also be useful in other cases where HW TX timestamps cannot be obtained and user wants to estimate an upper bound to when the TX probably happened. Signed-off-by: Pauli Virtanen <pav@iki.fi> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>	2025-03-25 12:48:05 -04:00
Eric Dumazet	f3483c8e1d	net: rfs: hash function change RFS is using two kinds of hash tables. First one is controlled by /proc/sys/net/core/rps_sock_flow_entries = 2^N and using the N low order bits of the l4 hash is good enough. Then each RX queue has its own hash table, controlled by /sys/class/net/eth1/queues/rx-$q/rps_flow_cnt = 2^X Current hash function, using the X low order bits is suboptimal, because RSS is usually using Func(hash) = (hash % power_of_two); For example, with 32 RX queues, 6 low order bits have no entropy for a given queue. Switch this hash function to hash_32(hash, log) to increase chances to use all possible slots and reduce collisions. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250321171309.634100-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-25 08:24:13 -07:00
Paolo Abeni	c353e8983e	net: introduce per netns packet chains Currently network taps unbound to any interface are linked in the global ptype_all list, affecting the performance in all the network namespaces. Add per netns ptypes chains, so that in the mentioned case only the netns owning the packet socket(s) is affected. While at that drop the global ptype_all list: no in kernel user registers a tap on "any" type without specifying either the target device or the target namespace (and IMHO doing that would not make any sense). Note that this adds a conditional in the fast path (to check for per netns ptype_specific list) and increases the dataset size by a cacheline (owing the per netns lists). Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumaze@google.com> Link: https://patch.msgid.link/ae405f98875ee87f8150c460ad162de7e466f8a7.1742494826.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-24 13:58:22 -07:00
Breno Leitao	f1fce08e63	netpoll: Eliminate redundant assignment The assignment of zero to udph->check is unnecessary as it is immediately overwritten in the subsequent line. Remove the redundant assignment. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250319-netpoll_nit-v1-1-a7faac5cbd92@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-24 13:42:53 -07:00
Peter Seiderer	7151062c29	net: pktgen: add strict buffer parsing index check Add strict buffer parsing index check to avoid the following Smatch warning: net/core/pktgen.c:877 get_imix_entries() warn: check that incremented offset 'i' is capped Checking the buffer index i after every get_user/i++ step and returning with error code immediately avoids the current indirect (but correct) error handling. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/netdev/36cf3ee2-38b1-47e5-a42a-363efeb0ace3@stanley.mountain/ Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250317090401.1240704-1-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-24 12:01:37 -07:00
Kuniyuki Iwashima	ed3ba9b6e2	net: Remove RTNL dance for SIOCBRADDIF and SIOCBRDELIF. SIOCBRDELIF is passed to dev_ioctl() first and later forwarded to br_ioctl_call(), which causes unnecessary RTNL dance and the splat below [0] under RTNL pressure. Let's say Thread A is trying to detach a device from a bridge and Thread B is trying to remove the bridge. In dev_ioctl(), Thread A bumps the bridge device's refcnt by netdev_hold() and releases RTNL because the following br_ioctl_call() also re-acquires RTNL. In the race window, Thread B could acquire RTNL and try to remove the bridge device. Then, rtnl_unlock() by Thread B will release RTNL and wait for netdev_put() by Thread A. Thread A, however, must hold RTNL after the unlock in dev_ifsioc(), which may take long under RTNL pressure, resulting in the splat by Thread B. Thread A (SIOCBRDELIF) Thread B (SIOCBRDELBR) ---------------------- ---------------------- sock_ioctl sock_ioctl `- sock_do_ioctl `- br_ioctl_call `- dev_ioctl `- br_ioctl_stub \|- rtnl_lock \| \|- dev_ifsioc ' ' \|- dev = __dev_get_by_name(...) \|- netdev_hold(dev, ...) . / \|- rtnl_unlock ------. \| \| \|- br_ioctl_call `---> \|- rtnl_lock Race \| \| `- br_ioctl_stub \|- br_del_bridge Window \| \| \| \|- dev = __dev_get_by_name(...) \| \| \| May take long \| `- br_dev_delete(dev, ...) \| \| \| under RTNL pressure \| `- unregister_netdevice_queue(dev, ...) \| \| \| \| `- rtnl_unlock \ \| \|- rtnl_lock <-' `- netdev_run_todo \| \|- ... `- netdev_run_todo \| `- rtnl_unlock \|- __rtnl_unlock \| \|- netdev_wait_allrefs_any \|- netdev_put(dev, ...) <----------------' Wait refcnt decrement and log splat below To avoid blocking SIOCBRDELBR unnecessarily, let's not call dev_ioctl() for SIOCBRADDIF and SIOCBRDELIF. In the dev_ioctl() path, we do the following: 1. Copy struct ifreq by get_user_ifreq in sock_do_ioctl() 2. Check CAP_NET_ADMIN in dev_ioctl() 3. Call dev_load() in dev_ioctl() 4. Fetch the master dev from ifr.ifr_name in dev_ifsioc() 3. can be done by request_module() in br_ioctl_call(), so we move 1., 2., and 4. to br_ioctl_stub(). Note that 2. is also checked later in add_del_if(), but it's better performed before RTNL. SIOCBRADDIF and SIOCBRDELIF have been processed in dev_ioctl() since the pre-git era, and there seems to be no specific reason to process them there. [0]: unregister_netdevice: waiting for wpan3 to become free. Usage count = 2 ref_tracker: wpan3@ffff8880662d8608 has 1/1 users at __netdev_tracker_alloc include/linux/netdevice.h:4282 [inline] netdev_hold include/linux/netdevice.h:4311 [inline] dev_ifsioc+0xc6a/0x1160 net/core/dev_ioctl.c:624 dev_ioctl+0x255/0x10c0 net/core/dev_ioctl.c:826 sock_do_ioctl+0x1ca/0x260 net/socket.c:1213 sock_ioctl+0x23a/0x6c0 net/socket.c:1318 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl fs/ioctl.c:892 [inline] __x64_sys_ioctl+0x1a4/0x210 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcb/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: `893b195875` ("net: bridge: fix ioctl locking") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: yan kang <kangyan91@outlook.com> Reported-by: yue sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/netdev/SY8P300MB0421225D54EB92762AE8F0F2A1D32@SY8P300MB0421.AUSP300.PROD.OUTLOOK.COM/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250316192851.19781-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-21 22:10:06 +01:00
Paolo Abeni	6f13bec53a	bpf-next-for-netdev -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCZ9NWrwAKCRAraS+Z+3y5 E2VIAP4iIdMobFDEm04JuXyrOK6HasnkOFkuEgLNNhGRbCCTTwD/X/2A8baSXnjg eqYN2DJfrs7OZ1ZUku/ic1VGtJg7Sgk= =jTyW -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-03-13 The following pull-request contains BPF updates for your net-next tree. We've added 4 non-merge commits during the last 3 day(s) which contain a total of 2 files changed, 35 insertions(+), 12 deletions(-). The main changes are: 1) bpf_getsockopt support for TCP_BPF_RTO_MIN and TCP_BPF_DELACK_MAX, from Jason Xing bpf-next-for-netdev * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: Add bpf_getsockopt() for TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN tcp: bpf: Support bpf_getsockopt for TCP_BPF_DELACK_MAX tcp: bpf: Support bpf_getsockopt for TCP_BPF_RTO_MIN tcp: bpf: Introduce bpf_sol_tcp_getsockopt to support TCP_BPF flags ==================== Link: https://patch.msgid.link/20250313221620.2512684-1-martin.lau@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-20 21:48:14 +01:00
Paolo Abeni	f491593394	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.14-rc8). Conflict: tools/testing/selftests/net/Makefile `03544faad7` ("selftest: net: add proc_net_pktgen") `3ed61b8938` ("selftests: net: test for lwtunnel dst ref loops") tools/testing/selftests/net/config: `85cb3711ac` ("selftests: net: Add test cases for link and peer netns") `3ed61b8938` ("selftests: net: test for lwtunnel dst ref loops") Adjacent commits: tools/testing/selftests/net/Makefile `c935af429e` ("selftests: net: add support for testing SO_RCVMARK and SO_RCVPRIORITY") `355d940f4d` ("Revert "selftests: Add IPv6 link-local address generation tests for GRE devices."") Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-20 21:38:01 +01:00
Lin Ma	90a7138619	net/neighbor: add missing policy for NDTPA_QUEUE_LENBYTES Previous commit `8b5c171bb3` ("neigh: new unresolved queue limits") introduces new netlink attribute NDTPA_QUEUE_LENBYTES to represent approximative value for deprecated QUEUE_LEN. However, it forgot to add the associated nla_policy in nl_ntbl_parm_policy array. Fix it with one simple NLA_U32 type policy. Fixes: `8b5c171bb3` ("neigh: new unresolved queue limits") Signed-off-by: Lin Ma <linma@zju.edu.cn> Link: https://patch.msgid.link/20250315165113.37600-1-linma@zju.edu.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-20 15:23:29 +01:00
Justin Iurman	986ffb3a57	net: lwtunnel: fix recursion loops This patch acts as a parachute, catch all solution, by detecting recursion loops in lwtunnel users and taking care of them (e.g., a loop between routes, a loop within the same route, etc). In general, such loops are the consequence of pathological configurations. Each lwtunnel user is still free to catch such loops early and do whatever they want with them. It will be the case in a separate patch for, e.g., seg6 and seg6_local, in order to provide drop reasons and update statistics. Another example of a lwtunnel user taking care of loops is ioam6, which has valid use cases that include loops (e.g., inline mode), and which is addressed by the next patch in this series. Overall, this patch acts as a last resort to catch loops and drop packets, since we don't want to leak something unintentionally because of a pathological configuration in lwtunnels. The solution in this patch reuses dev_xmit_recursion(), dev_xmit_recursion_inc(), and dev_xmit_recursion_dec(), which seems fine considering the context. Closes: https://lore.kernel.org/netdev/2bc9e2079e864a9290561894d2a602d6@akamai.com/ Closes: https://lore.kernel.org/netdev/Z7NKYMY7fJT5cYWu@shredder/ Fixes: `ffce41962e` ("lwtunnel: support dst output redirect function") Fixes: `2536862311` ("lwt: Add support to redirect dst.input") Fixes: `14972cbd34` ("net: lwtunnel: Handle fragmentation") Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Link: https://patch.msgid.link/20250314120048.12569-2-justin.iurman@uliege.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-20 11:25:52 +01:00
Gerhard Engleder	0d60fd5032	net: phy: Support speed selection for PHY loopback phy_loopback() leaves it to the PHY driver to select the speed of the loopback mode. Thus, the speed of the loopback mode depends on the PHY driver in use. Add support for speed selection to phy_loopback() to enable loopback with defined speeds. Ensure that link up is signaled if speed changes as speed is not allowed to change during link up. Link down and up is necessary for a new speed. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Link: https://patch.msgid.link/20250312203010.47429-3-gerhard@engleder-embedded.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-20 08:45:08 +01:00
Uday Shankar	f8a10bed32	netconsole: allow selection of egress interface via MAC address Currently, netconsole has two methods of configuration - module parameter and configfs. The former interface allows for netconsole activation earlier during boot (by specifying the module parameter on the kernel command line), so it is preferred for debugging issues which arise before userspace is up/the configfs interface can be used. The module parameter syntax requires specifying the egress interface name. This requirement makes it hard to use for a couple reasons: - The egress interface name can be hard or impossible to predict. For example, installing a new network card in a system can change the interface names assigned by the kernel. - When constructing the module parameter, one may have trouble determining the original (kernel-assigned) name of the interface (which is the name that should be given to netconsole) if some stable interface naming scheme is in effect. A human can usually look at kernel logs to determine the original name, but this is very painful if automation is constructing the parameter. For these reasons, allow selection of the egress interface via MAC address when configuring netconsole using the module parameter. Update the netconsole documentation with an example of the new syntax. Selection of egress interface by MAC address via configfs is far less interesting (since when this interface can be used, one should be able to easily convert between MAC address and interface name), so it is left unimplemented. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Breno Leitao <leitao@debian.org> Tested-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250312-netconsole-v6-2-3437933e79b8@purestorage.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-19 19:17:58 +01:00
Stanislav Fomichev	6dd132516f	net: reorder dev_addr_sem lock Lockdep complains about circular lock in 1 -> 2 -> 3 (see below). Change the lock ordering to be: - rtnl_lock - dev_addr_sem - netdev_ops (only for lower devices!) - team_lock (or other per-upper device lock) 1. rtnl_lock -> netdev_ops -> dev_addr_sem rtnl_setlink rtnl_lock do_setlink IFLA_ADDRESS on lower netdev_ops dev_addr_sem 2. rtnl_lock -> team_lock -> netdev_ops rtnl_newlink rtnl_lock do_setlink IFLA_MASTER on lower do_set_master team_add_slave team_lock team_port_add dev_set_mtu netdev_ops 3. rtnl_lock -> dev_addr_sem -> team_lock rtnl_newlink rtnl_lock do_setlink IFLA_ADDRESS on upper dev_addr_sem netif_set_mac_address team_set_mac_address team_lock 4. rtnl_lock -> netdev_ops -> dev_addr_sem rtnl_lock dev_ifsioc dev_set_mac_address_user __tun_chr_ioctl rtnl_lock dev_set_mac_address_user tap_ioctl rtnl_lock dev_set_mac_address_user dev_set_mac_address_user netdev_lock_ops netif_set_mac_address_user dev_addr_sem v2: - move lock reorder to happen after kmalloc (Kuniyuki) Cc: Kohei Enju <enjuk@amazon.com> Fixes: `df43d8bf10` ("net: replace dev_addr_sem with netdev instance lock") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250312190513.1252045-3-sdf@fomichev.me Tested-by: Lei Yang <leiyang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-19 18:52:00 +01:00
Stanislav Fomichev	8033d2aef5	Revert "net: replace dev_addr_sem with netdev instance lock" This reverts commit `df43d8bf10`. Cc: Kohei Enju <enjuk@amazon.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Fixes: `df43d8bf10` ("net: replace dev_addr_sem with netdev instance lock") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250312190513.1252045-2-sdf@fomichev.me Tested-by: Lei Yang <leiyang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-19 18:52:00 +01:00
Eric Dumazet	15492700ac	tcp: cache RTAX_QUICKACK metric in a hot cache line tcp_in_quickack_mode() is called from input path for small packets. It calls __sk_dst_get() which reads sk->sk_dst_cache which has been put in sock_read_tx group (for good reasons). Then dst_metric(dst, RTAX_QUICKACK) also needs extra cache line misses. Cache RTAX_QUICKACK in icsk->icsk_ack.dst_quick_ack to no longer pull these cache lines for the cases a delayed ACK is scheduled. After this patch TCP receive path does not longer access sock_read_tx group. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250312083907.1931644-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-18 13:44:59 +01:00
Breno Leitao	19856a5247	net: filter: Avoid shadowing variable in bpf_convert_ctx_access() Rename the local variable 'off' to 'offset' to avoid shadowing the existing 'off' variable that is declared as an `int` in the outer scope of bpf_convert_ctx_access(). This fixes a compiler warning: net/core/filter.c:9679:8: warning: declaration shadows a local variable [-Wshadow] Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://patch.msgid.link/20250228-fix_filter-v1-1-ce13eae66fe9@debian.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-03-15 11:48:27 -07:00
Paolo Abeni	941defcea7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.14-rc6). Conflicts: tools/testing/selftests/drivers/net/ping.py `75cc19c8ff` ("selftests: drv-net: add xdp cases for ping.py") `de94e86974` ("selftests: drv-net: store addresses in dict indexed by ipver") https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/ net/core/devmem.c `a70f891e0f` ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()") `1d22d3060b` ("net: drop rtnl_lock for queue_mgmt operations") https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/ Adjacent changes: tools/testing/selftests/net/Makefile `6f50175cca` ("selftests: Add IPv6 link-local address generation tests for GRE devices.") `2e5584e0f9` ("selftests/net: expand cmsg_ipv6.sh with ipv4") drivers/net/ethernet/broadcom/bnxt/bnxt.c `661958552e` ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic") `fe96d717d3` ("bnxt_en: Extend queue stop/start for TX rings") Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-13 23:08:11 +01:00
Jason Xing	d22b8b04b8	tcp: bpf: Support bpf_getsockopt for TCP_BPF_DELACK_MAX Support bpf_getsockopt if application tries to know what the delayed ack max time is. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250312153523.9860-4-kerneljasonxing@gmail.com	2025-03-13 14:30:39 -07:00
Jason Xing	5584cd7e0d	tcp: bpf: Support bpf_getsockopt for TCP_BPF_RTO_MIN Support bpf_getsockopt if application tries to know what the RTO MIN of this socket is. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250312153523.9860-3-kerneljasonxing@gmail.com	2025-03-13 14:30:39 -07:00
Jason Xing	49f6713cb6	tcp: bpf: Introduce bpf_sol_tcp_getsockopt to support TCP_BPF flags The patch refactors a bit on supporting getsockopt for TCP BPF flags. For now, only TCP_BPF_SOCK_OPS_CB_FLAGS. Later, more flags will be added into this function. No functional changes here. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250312153523.9860-2-kerneljasonxing@gmail.com	2025-03-13 14:30:38 -07:00
Stanislav Fomichev	1d22d3060b	net: drop rtnl_lock for queue_mgmt operations All drivers that use queue API are already converted to use netdev instance lock. Move netdev instance lock management to the netlink layer and drop rtnl_lock. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Mina Almasry. <almasrymina@google.com> Link: https://patch.msgid.link/20250311144026.4154277-4-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-12 13:32:35 -07:00
Stanislav Fomichev	10eef096be	net: add granular lock for the netdev netlink socket As we move away from rtnl_lock for queue ops, introduce per-netdev_nl_sock lock. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250311144026.4154277-3-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-12 13:32:35 -07:00
Stanislav Fomichev	b6b67141d6	net: create netdev_nl_sock to wrap bindings list No functional changes. Next patches will add more granular locking to netdev_nl_sock. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250311144026.4154277-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-12 13:32:35 -07:00
Stanislav Fomichev	110eff172d	eth: bnxt: switch to netif_close All (error) paths that call dev_close are already holding instance lock, so switch to netif_close to avoid the deadlock. v2: - add missing EXPORT_MODULE for netif_close Fixes: `004b500801` ("eth: bnxt: remove most dependencies on RTNL") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250309215851.2003708-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-12 13:19:15 -07:00

1 2 3 4 5 ...

10127 Commits