Commit Graph

9767 Commits

Author SHA1 Message Date
Eric Dumazet 5ced52fa8f net: add skb_set_owner_edemux() helper
This can be used to attach a socket to an skb,
taking a reference on sk->sk_refcnt.

This helper might be a NOP if sk->sk_refcnt is zero.

Use it from tcp_make_synack().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Link: https://patch.msgid.link/20241010174817.1543642-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14 17:39:36 -07:00
Eric Dumazet 78e2baf3d9 net: add TIME_WAIT logic to sk_to_full_sk()
TCP will soon attach TIME_WAIT sockets to some ACK and RST.

Make sure sk_to_full_sk() detects this and does not return
a non full socket.

v3: also changed sk_const_to_full_sk()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Link: https://patch.msgid.link/20241010174817.1543642-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14 17:39:36 -07:00
Eric Dumazet 2698acd6ea net: do not acquire rtnl in fib_seq_sum()
After we made sure no fib_seq_read() handlers needs RTNL anymore,
we can remove RTNL from fib_seq_sum().

Note that after RTNL was dropped, fib_seq_sum() result was possibly
outdated anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241009184405.3752829-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-11 15:35:05 -07:00
Eric Dumazet a716ff52be fib: rules: use READ_ONCE()/WRITE_ONCE() on ops->fib_rules_seq
Using RTNL to protect ops->fib_rules_seq reads seems a big hammer.

Writes are protected by RTNL.
We can use READ_ONCE() on readers.

Constify 'struct net' argument of fib_rules_seq_read()
and lookup_rules_ops().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241009184405.3752829-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-11 15:35:05 -07:00
Eric Dumazet d677aebd66 tcp: move sysctl_tcp_l3mdev_accept to netns_ipv4_read_rx
sysctl_tcp_l3mdev_accept is read from TCP receive fast path from
tcp_v6_early_demux(),
 __inet6_lookup_established,
  inet_request_bound_dev_if().

Move it to netns_ipv4_read_rx.

Remove the '#ifdef CONFIG_NET_L3_MASTER_DEV' that was guarding
its definition.

Note this adds a hole of three bytes that could be filled later.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Cc: Wei Wang <weiwan@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Link: https://patch.msgid.link/20241010034100.320832-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-11 08:45:24 -07:00
Jakub Kicinski 9c0fc36ec4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.12-rc3).

No conflicts and no adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-10 13:13:33 -07:00
Paolo Abeni ff7d4deb1f net-shapers: implement shaper cleanup on queue deletion
hook into netif_set_real_num_tx_queues() to cleanup any shaper
configured on top of the to-be-destroyed TX queues.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/6da4ee03cae2b2a757d7b59e88baf09cc94c5ef1.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-10 08:30:22 -07:00
Paolo Abeni 4b623f9f0f net-shapers: implement NL get operation
Introduce the basic infrastructure to implement the net-shaper
core functionality. Each network devices carries a net-shaper cache,
the NL get() operation fetches the data from such cache.

The cache is initially empty, will be fill by the set()/group()
operation implemented later and is destroyed at device cleanup time.

The net_shaper_fill_handle(), net_shaper_ctx_init(), and
net_shaper_generic_pre() implementations handle generic index type
attributes, despite the current caller always pass a constant value
to avoid more noise in later patches using them with different
attributes.

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/ddd10fd645a9367803ad02fca4a5664ea5ace170.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-10 08:30:22 -07:00
Paolo Abeni 13d68a1643 genetlink: extend info user-storage to match NL cb ctx
This allows a more uniform implementation of non-dump and dump
operations, and will be used later in the series to avoid some
per-operation allocation.

Additionally rename the NL_ASSERT_DUMP_CTX_FITS macro, to
fit a more extended usage.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/1130cc2896626b84587a2a5f96a5c6829638f4da.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-10 08:30:21 -07:00
Kuniyuki Iwashima 07cc7b0b94 rtnetlink: Add bulk registration helpers for rtnetlink message handlers.
Before commit addf9b90de ("net: rtnetlink: use rcu to free rtnl message
handlers"), once rtnl_msg_handlers[protocol] was allocated, the following
rtnl_register_module() for the same protocol never failed.

However, after the commit, rtnl_msg_handler[protocol][msgtype] needs to
be allocated in each rtnl_register_module(), so each call could fail.

Many callers of rtnl_register_module() do not handle the returned error,
and we need to add many error handlings.

To handle that easily, let's add wrapper functions for bulk registration
of rtnetlink message handlers.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-10 15:39:35 +02:00
Eric Dumazet ac888d5886 net: do not delay dst_entries_add() in dst_release()
dst_entries_add() uses per-cpu data that might be freed at netns
dismantle from ip6_route_net_exit() calling dst_entries_destroy()

Before ip6_route_net_exit() can be called, we release all
the dsts associated with this netns, via calls to dst_release(),
which waits an rcu grace period before calling dst_destroy()

dst_entries_add() use in dst_destroy() is racy, because
dst_entries_destroy() could have been called already.

Decrementing the number of dsts must happen sooner.

Notes:

1) in CONFIG_XFRM case, dst_destroy() can call
   dst_release_immediate(child), this might also cause UAF
   if the child does not have DST_NOCOUNT set.
   IPSEC maintainers might take a look and see how to address this.

2) There is also discussion about removing this count of dst,
   which might happen in future kernels.

Fixes: f886497212 ("ipv4: fix dst race in sk_dst_get()")
Closes: https://lore.kernel.org/lkml/CANn89iLCCGsP7SFn9HKpvnKu96Td4KD08xf7aGtiYgZnkjaL=w@mail.gmail.com/T/
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20241008143110.1064899-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-10 11:28:17 +02:00
Jason Xing da5e06dee5 net-timestamp: namespacify the sysctl_tstamp_allow_data
Let it be tuned in per netns by admins.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20241005222609.94980-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08 15:33:11 -07:00
Mahe Tardy eb62f49de7 bpf: add get_netns_cookie helper to tc programs
This is needed in the context of Cilium and Tetragon to retrieve netns
cookie from hostns when traffic leaves Pod, so that we can correlate
skb->sk's netns cookie.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
Link: https://lore.kernel.org/r/20241007095958.97442-1-mahe.tardy@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-10-08 12:06:43 -07:00
Kuniyuki Iwashima 03fa534856 rtnetlink: Add ASSERT_RTNL_NET() placeholder for netdev notifier.
The global and per-netns netdev notifier depend on RTNL, and its
dependency is not so clear due to nested calls.

Let's add a placeholder to place ASSERT_RTNL_NET() for each event.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-08 15:16:59 +02:00
Kuniyuki Iwashima 844e5e7e65 rtnetlink: Add assertion helpers for per-netns RTNL.
Once an RTNL scope is converted with rtnl_net_lock(), we will replace
RTNL helper functions inside the scope with the following per-netns
alternatives:

  ASSERT_RTNL()           -> ASSERT_RTNL_NET(net)
  rcu_dereference_rtnl(p) -> rcu_dereference_rtnl_net(net, p)

Note that the per-netns helpers are equivalent to the conventional
helpers unless CONFIG_DEBUG_NET_SMALL_RTNL is enabled.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-08 15:16:59 +02:00
Kuniyuki Iwashima 76aed95319 rtnetlink: Add per-netns RTNL.
The goal is to break RTNL down into per-netns mutex.

This patch adds per-netns mutex and its helper functions, rtnl_net_lock()
and rtnl_net_unlock().

rtnl_net_lock() acquires the global RTNL and per-netns RTNL mutex, and
rtnl_net_unlock() releases them.

We will replace 800+ rtnl_lock() with rtnl_net_lock() and finally removes
rtnl_lock() in rtnl_net_lock().

When we need to nest per-netns RTNL mutex, we will use __rtnl_net_lock(),
and its locking order is defined by rtnl_net_lock_cmp_fn() as follows:

  1. init_net is first
  2. netns address ascending order

Note that the conversion will be done under CONFIG_DEBUG_NET_SMALL_RTNL
with LOCKDEP so that we can carefully add the extra mutex without slowing
down RTNL operations during conversion.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-08 15:16:59 +02:00
Eric Dumazet f858cc9eed net: add IFLA_MAX_PACING_OFFLOAD_HORIZON device attribute
Some network devices have the ability to offload EDT (Earliest
Departure Time) which is the model used for TCP pacing and FQ
packet scheduler.

Some of them implement the timing wheel mechanism described in
https://saeed.github.io/files/carousel-sigcomm17.pdf
with an associated 'timing wheel horizon'.

This patch adds dev->max_pacing_offload_horizon expressing
this timing wheel horizon in nsec units.

This is a read-only attribute.

Unless a driver sets it, dev->max_pacing_offload_horizon
is zero.

v2: addressed Jakub feedback ( https://lore.kernel.org/netdev/20240930152304.472767-2-edumazet@google.com/T/#mf6294d714c41cc459962154cc2580ce3c9693663 )
v3: added yaml doc (also per Jakub feedback)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20241003121219.2396589-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-04 15:37:53 -07:00
Vadim Fedorenko 4aecca4c76 net_tstamp: add SCM_TS_OPT_ID to provide OPT_ID in control message
SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX
timestamps and packets sent via socket. Unfortunately, there is no way
to reliably predict socket timestamp ID value in case of error returned
by sendmsg. For UDP sockets it's impossible because of lockless
nature of UDP transmit, several threads may send packets in parallel. In
case of RAW sockets MSG_MORE option makes things complicated. More
details are in the conversation [1].
This patch adds new control message type to give user-space
software an opportunity to control the mapping between packets and
values by providing ID with each sendmsg for UDP sockets.
The documentation is also added in this patch.

[1] https://lore.kernel.org/netdev/CALCETrU0jB+kg0mhV6A8mrHfTE1D1pr1SD_B9Eaa9aDPfgHdtA@mail.gmail.com/

Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://patch.msgid.link/20241001125716.2832769-2-vadfed@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-04 11:52:19 -07:00
Guillaume Nault 66fb6386d3 ipv4: Convert ip_route_input_noref() to dscp_t.
Pass a dscp_t variable to ip_route_input_noref(), instead of a plain
u8, to prevent accidental setting of ECN bits in ->flowi4_tos.

Callers of ip_route_input_noref() to consider are:

  * arp_process() in net/ipv4/arp.c. This function sets the tos
    parameter to 0, which is already a valid dscp_t value, so it
    doesn't need to be adjusted for the new prototype.

  * ip_route_input(), which already has a dscp_t variable to pass as
    parameter. We just need to remove the inet_dscp_to_dsfield()
    conversion.

  * ipvlan_l3_rcv(), bpf_lwt_input_reroute(), ip_expire(),
    ip_rcv_finish_core(), xfrm4_rcv_encap_finish() and
    xfrm4_rcv_encap(), which get the DSCP directly from IPv4 headers
    and can simply use the ip4h_dscp() helper.

While there, declare the IPv4 header pointers as const in
ipvlan_l3_rcv() and bpf_lwt_input_reroute().
Also, modify the declaration of ip_route_input_noref() in
include/net/route.h so that it matches the prototype of its
implementation in net/ipv4/route.c.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/a8a747bed452519c4d0cc06af32c7e7795d7b627.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-03 16:21:21 -07:00
Linus Torvalds 8c245fe7dd Including fixes from ieee802154, bluetooth and netfilter.
Current release - regressions:
 
   - eth: mlx5: fix wrong reserved field in hca_cap_2 in mlx5_ifc
 
   - eth: am65-cpsw: fix forever loop in cleanup code
 
 Current release - new code bugs:
 
   - eth: mlx5: HWS, fixed double-free in error flow of creating SQ
 
 Previous releases - regressions:
 
   - core: avoid potential underflow in qdisc_pkt_len_init() with UFO
 
   - core: test for not too small csum_start in virtio_net_hdr_to_skb()
 
   - vrf: revert "vrf: remove unnecessary RCU-bh critical section"
 
   - bluetooth:
     - fix uaf in l2cap_connect
     - fix possible crash on mgmt_index_removed
 
   - dsa: improve shutdown sequence
 
   - eth: mlx5e: SHAMPO, fix overflow of hd_per_wq
 
   - eth: ip_gre: fix drops of small packets in ipgre_xmit
 
 Previous releases - always broken:
 
   - core: fix gso_features_check to check for both dev->gso_{ipv4_,}max_size
 
   - core: fix tcp fraglist segmentation after pull from frag_list
 
   - netfilter: nf_tables: prevent nf_skb_duplicated corruption
 
   - sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start
 
   - mac802154: fix potential RCU dereference issue in mac802154_scan_worker
 
   - eth: fec: restart PPS after link state change
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmb+giESHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkDowP/25YsDA8uaH5yelI85vUgp1T50MWgFxJ
 ARm58Pzxr8byX6eIup95xSsjLvMbLaWj5LIA2Y49AV0fWVgGn0U8yx4mPy0Czhdg
 J1oxtyoV1pR2V/okWzD4yhZV2on7OGsS73I6J1s6BAowezr19A+aa5Un57dW/103
 ccwBuBOYlSIOIHmarOxuFhWMYcwXreNBHa9K7J6JtDFn9F56fUn+ZoIUJ7x27cSO
 eWhh9bIkeEb+xYeUXAjNP3pBvJ1xpwIyZv+JMTp40jNsAXPjSpI3Jwd1YlAAMuT9
 J2dW0Zs8uwm5LzBPFvI9iM0WHEmVy6+b32NjnKVwPn2+XGGWQss52bmRElNcJkrw
 4NeG6/6CPIE0xuczBECuMa0X68NDKIZsjy3Q3OahV82ef2cwhRk6FexyIg5oiMPx
 KmMi5B+UQw6ZY3ZF/ME/0jJx/H5ayOC01yNBaTUPrLJr8gjquWEMjZXEqJsdyixJ
 5OoZeKG5oN6HkN7g/IxoFjg/W/g93OULO3qH+IzLQG4NlVs6Zp4ykL7dT+Py2zzc
 Ru3n5+HA4PqDn2u7gmP1mu2g/lmKUIZEEvR+msP81Cywlz5qtWIH1a6oIeVC7bjt
 JNhgBgzKGGMGdgmhYNzXw213WCEbz0+as2SNlvlbiqMP5FKQPLzzBVuJoz4AtJVn
 cyVy7D66HuMW
 =cq2I
 -----END PGP SIGNATURE-----

Merge tag 'net-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from ieee802154, bluetooth and netfilter.

  Current release - regressions:

   - eth: mlx5: fix wrong reserved field in hca_cap_2 in mlx5_ifc

   - eth: am65-cpsw: fix forever loop in cleanup code

  Current release - new code bugs:

   - eth: mlx5: HWS, fixed double-free in error flow of creating SQ

  Previous releases - regressions:

   - core: avoid potential underflow in qdisc_pkt_len_init() with UFO

   - core: test for not too small csum_start in virtio_net_hdr_to_skb()

   - vrf: revert "vrf: remove unnecessary RCU-bh critical section"

   - bluetooth:
       - fix uaf in l2cap_connect
       - fix possible crash on mgmt_index_removed

   - dsa: improve shutdown sequence

   - eth: mlx5e: SHAMPO, fix overflow of hd_per_wq

   - eth: ip_gre: fix drops of small packets in ipgre_xmit

  Previous releases - always broken:

   - core: fix gso_features_check to check for both
     dev->gso_{ipv4_,}max_size

   - core: fix tcp fraglist segmentation after pull from frag_list

   - netfilter: nf_tables: prevent nf_skb_duplicated corruption

   - sctp: set sk_state back to CLOSED if autobind fails in
     sctp_listen_start

   - mac802154: fix potential RCU dereference issue in
     mac802154_scan_worker

   - eth: fec: restart PPS after link state change"

* tag 'net-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (48 commits)
  sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start
  dt-bindings: net: xlnx,axi-ethernet: Add missing reg minItems
  doc: net: napi: Update documentation for napi_schedule_irqoff
  net/ncsi: Disable the ncsi work before freeing the associated structure
  net: phy: qt2025: Fix warning: unused import DeviceId
  gso: fix udp gso fraglist segmentation after pull from frag_list
  bridge: mcast: Fail MDB get request on empty entry
  vrf: revert "vrf: Remove unnecessary RCU-bh critical section"
  net: ethernet: ti: am65-cpsw: Fix forever loop in cleanup code
  net: phy: realtek: Check the index value in led_hw_control_get
  ppp: do not assume bh is held in ppp_channel_bridge_input()
  selftests: rds: move include.sh to TEST_FILES
  net: test for not too small csum_start in virtio_net_hdr_to_skb()
  net: gso: fix tcp fraglist segmentation after pull from frag_list
  ipv4: ip_gre: Fix drops of small packets in ipgre_xmit
  net: stmmac: dwmac4: extend timeout for VLAN Tag register busy bit check
  net: add more sanity checks to qdisc_pkt_len_init()
  net: avoid potential underflow in qdisc_pkt_len_init() with UFO
  net: ethernet: ti: cpsw_ale: Fix warning on some platforms
  net: microchip: Make FDMA config symbol invisible
  ...
2024-10-03 09:44:00 -07:00
Al Viro 5f60d5f6bb move asm/unaligned.h to linux/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.

auto-generated by the following:

for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-02 17:23:23 -04:00
Maciej Fijalkowski 8f5b408d76 bpf: Remove unused macro
Commit 7aebfa1b38 ("bpf: Support narrow loads from bpf_sock_addr.user_port")
removed one and only SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD callsite but kept
the macro. Remove it to clean up the code base. Found while getting lost in
the BPF code.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20241001200605.249526-1-maciej.fijalkowski@intel.com
2024-10-02 11:51:54 +02:00
Toke Høiland-Jørgensen 09d88791c7 bpf: Make sure internal and UAPI bpf_redirect flags don't overlap
The bpf_redirect_info is shared between the SKB and XDP redirect paths,
and the two paths use the same numeric flag values in the ri->flags
field (specifically, BPF_F_BROADCAST == BPF_F_NEXTHOP). This means that
if skb bpf_redirect_neigh() is used with a non-NULL params argument and,
subsequently, an XDP redirect is performed using the same
bpf_redirect_info struct, the XDP path will get confused and end up
crashing, which syzbot managed to trigger.

With the stack-allocated bpf_redirect_info, the structure is no longer
shared between the SKB and XDP paths, so the crash doesn't happen
anymore. However, different code paths using identically-numbered flag
values in the same struct field still seems like a bit of a mess, so
this patch cleans that up by moving the flag definitions together and
redefining the three flags in BPF_F_REDIRECT_INTERNAL to not overlap
with the flags used for XDP. It also adds a BUILD_BUG_ON() check to make
sure the overlap is not re-introduced by mistake.

Fixes: e624d4ed4a ("xdp: Extend xdp_redirect_map with broadcast support")
Reported-by: syzbot+cca39e6e84a367a7e6f6@syzkaller.appspotmail.com
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Closes: https://syzkaller.appspot.com/bug?extid=cca39e6e84a367a7e6f6
Link: https://lore.kernel.org/bpf/20240920125625.59465-1-toke@redhat.com
2024-10-01 21:40:12 +02:00
Eric Dumazet ab9a9a9e96 net: add more sanity checks to qdisc_pkt_len_init()
One path takes care of SKB_GSO_DODGY, assuming
skb->len is bigger than hdr_len.

virtio_net_hdr_to_skb() does not fully dissect TCP headers,
it only make sure it is at least 20 bytes.

It is possible for an user to provide a malicious 'GSO' packet,
total length of 80 bytes.

- 20 bytes of IPv4 header
- 60 bytes TCP header
- a small gso_size like 8

virtio_net_hdr_to_skb() would declare this packet as a normal
GSO packet, because it would see 40 bytes of payload,
bigger than gso_size.

We need to make detect this case to not underflow
qdisc_skb_cb(skb)->pkt_len.

Fixes: 1def9238d4 ("net_sched: more precise pkt_len computation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-01 11:47:05 +02:00
Eric Dumazet c20029db28 net: avoid potential underflow in qdisc_pkt_len_init() with UFO
After commit 7c6d2ecbda ("net: be more gentle about silly gso
requests coming from user") virtio_net_hdr_to_skb() had sanity check
to detect malicious attempts from user space to cook a bad GSO packet.

Then commit cf9acc90c8 ("net: virtio_net_hdr_to_skb: count
transport header in UFO") while fixing one issue, allowed user space
to cook a GSO packet with the following characteristic :

IPv4 SKB_GSO_UDP, gso_size=3, skb->len = 28.

When this packet arrives in qdisc_pkt_len_init(), we end up
with hdr_len = 28 (IPv4 header + UDP header), matching skb->len

Then the following sets gso_segs to 0 :

gso_segs = DIV_ROUND_UP(skb->len - hdr_len,
                        shinfo->gso_size);

Then later we set qdisc_skb_cb(skb)->pkt_len to back to zero :/

qdisc_skb_cb(skb)->pkt_len += (gso_segs - 1) * hdr_len;

This leads to the following crash in fq_codel [1]

qdisc_pkt_len_init() is best effort, we only want an estimation
of the bytes sent on the wire, not crashing the kernel.

This patch is fixing this particular issue, a following one
adds more sanity checks for another potential bug.

[1]
[   70.724101] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   70.724561] #PF: supervisor read access in kernel mode
[   70.724561] #PF: error_code(0x0000) - not-present page
[   70.724561] PGD 10ac61067 P4D 10ac61067 PUD 107ee2067 PMD 0
[   70.724561] Oops: Oops: 0000 [#1] SMP NOPTI
[   70.724561] CPU: 11 UID: 0 PID: 2163 Comm: b358537762 Not tainted 6.11.0-virtme #991
[   70.724561] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   70.724561] RIP: 0010:fq_codel_enqueue (net/sched/sch_fq_codel.c:120 net/sched/sch_fq_codel.c:168 net/sched/sch_fq_codel.c:230) sch_fq_codel
[ 70.724561] Code: 24 08 49 c1 e1 06 44 89 7c 24 18 45 31 ed 45 31 c0 31 ff 89 44 24 14 4c 03 8b 90 01 00 00 eb 04 39 ca 73 37 4d 8b 39 83 c7 01 <49> 8b 17 49 89 11 41 8b 57 28 45 8b 5f 34 49 c7 07 00 00 00 00 49
All code
========
   0:	24 08                	and    $0x8,%al
   2:	49 c1 e1 06          	shl    $0x6,%r9
   6:	44 89 7c 24 18       	mov    %r15d,0x18(%rsp)
   b:	45 31 ed             	xor    %r13d,%r13d
   e:	45 31 c0             	xor    %r8d,%r8d
  11:	31 ff                	xor    %edi,%edi
  13:	89 44 24 14          	mov    %eax,0x14(%rsp)
  17:	4c 03 8b 90 01 00 00 	add    0x190(%rbx),%r9
  1e:	eb 04                	jmp    0x24
  20:	39 ca                	cmp    %ecx,%edx
  22:	73 37                	jae    0x5b
  24:	4d 8b 39             	mov    (%r9),%r15
  27:	83 c7 01             	add    $0x1,%edi
  2a:*	49 8b 17             	mov    (%r15),%rdx		<-- trapping instruction
  2d:	49 89 11             	mov    %rdx,(%r9)
  30:	41 8b 57 28          	mov    0x28(%r15),%edx
  34:	45 8b 5f 34          	mov    0x34(%r15),%r11d
  38:	49 c7 07 00 00 00 00 	movq   $0x0,(%r15)
  3f:	49                   	rex.WB

Code starting with the faulting instruction
===========================================
   0:	49 8b 17             	mov    (%r15),%rdx
   3:	49 89 11             	mov    %rdx,(%r9)
   6:	41 8b 57 28          	mov    0x28(%r15),%edx
   a:	45 8b 5f 34          	mov    0x34(%r15),%r11d
   e:	49 c7 07 00 00 00 00 	movq   $0x0,(%r15)
  15:	49                   	rex.WB
[   70.724561] RSP: 0018:ffff95ae85e6fb90 EFLAGS: 00000202
[   70.724561] RAX: 0000000002000000 RBX: ffff95ae841de000 RCX: 0000000000000000
[   70.724561] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001
[   70.724561] RBP: ffff95ae85e6fbf8 R08: 0000000000000000 R09: ffff95b710a30000
[   70.724561] R10: 0000000000000000 R11: bdf289445ce31881 R12: ffff95ae85e6fc58
[   70.724561] R13: 0000000000000000 R14: 0000000000000040 R15: 0000000000000000
[   70.724561] FS:  000000002c5c1380(0000) GS:ffff95bd7fcc0000(0000) knlGS:0000000000000000
[   70.724561] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.724561] CR2: 0000000000000000 CR3: 000000010c568000 CR4: 00000000000006f0
[   70.724561] Call Trace:
[   70.724561]  <TASK>
[   70.724561] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
[   70.724561] ? page_fault_oops (arch/x86/mm/fault.c:715)
[   70.724561] ? exc_page_fault (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:87 ./arch/x86/include/asm/irqflags.h:147 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539)
[   70.724561] ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
[   70.724561] ? fq_codel_enqueue (net/sched/sch_fq_codel.c:120 net/sched/sch_fq_codel.c:168 net/sched/sch_fq_codel.c:230) sch_fq_codel
[   70.724561] dev_qdisc_enqueue (net/core/dev.c:3784)
[   70.724561] __dev_queue_xmit (net/core/dev.c:3880 (discriminator 2) net/core/dev.c:4390 (discriminator 2))
[   70.724561] ? irqentry_enter (kernel/entry/common.c:237)
[   70.724561] ? sysvec_apic_timer_interrupt (./arch/x86/include/asm/hardirq.h:74 (discriminator 2) arch/x86/kernel/apic/apic.c:1043 (discriminator 2) arch/x86/kernel/apic/apic.c:1043 (discriminator 2))
[   70.724561] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:58 (discriminator 4))
[   70.724561] ? asm_sysvec_apic_timer_interrupt (./arch/x86/include/asm/idtentry.h:702)
[   70.724561] ? virtio_net_hdr_to_skb.constprop.0 (./include/linux/virtio_net.h:129 (discriminator 1))
[   70.724561] packet_sendmsg (net/packet/af_packet.c:3145 (discriminator 1) net/packet/af_packet.c:3177 (discriminator 1))
[   70.724561] ? _raw_spin_lock_bh (./arch/x86/include/asm/atomic.h:107 (discriminator 4) ./include/linux/atomic/atomic-arch-fallback.h:2170 (discriminator 4) ./include/linux/atomic/atomic-instrumented.h:1302 (discriminator 4) ./include/asm-generic/qspinlock.h:111 (discriminator 4) ./include/linux/spinlock.h:187 (discriminator 4) ./include/linux/spinlock_api_smp.h:127 (discriminator 4) kernel/locking/spinlock.c:178 (discriminator 4))
[   70.724561] ? netdev_name_node_lookup_rcu (net/core/dev.c:325 (discriminator 1))
[   70.724561] __sys_sendto (net/socket.c:730 (discriminator 1) net/socket.c:745 (discriminator 1) net/socket.c:2210 (discriminator 1))
[   70.724561] ? __sys_setsockopt (./include/linux/file.h:34 net/socket.c:2355)
[   70.724561] __x64_sys_sendto (net/socket.c:2222 (discriminator 1) net/socket.c:2218 (discriminator 1) net/socket.c:2218 (discriminator 1))
[   70.724561] do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1))
[   70.724561] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[   70.724561] RIP: 0033:0x41ae09

Fixes: cf9acc90c8 ("net: virtio_net_hdr_to_skb: count transport header in UFO")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jonathan Davies <jonathan.davies@nutanix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jonathan Davies <jonathan.davies@nutanix.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-01 11:47:05 +02:00
Daniel Borkmann e609c959a9 net: Fix gso_features_check to check for both dev->gso_{ipv4_,}max_size
Commit 24ab059d2e ("net: check dev->gso_max_size in gso_features_check()")
added a dev->gso_max_size test to gso_features_check() in order to fall
back to GSO when needed.

This was added as it was noticed that some drivers could misbehave if TSO
packets get too big. However, the check doesn't respect dev->gso_ipv4_max_size
limit. For instance, a device could be configured with BIG TCP for IPv4,
but not IPv6.

Therefore, add a netif_get_gso_max_size() equivalent to netif_get_gro_max_size()
and use the helper to respect both limits before falling back to GSO engine.

Fixes: 24ab059d2e ("net: check dev->gso_max_size in gso_features_check()")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240923212242.15669-2-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-01 10:48:52 +02:00
Daniel Borkmann e8d4d34df7 net: Add netif_get_gro_max_size helper for GRO
Add a small netif_get_gro_max_size() helper which returns the maximum IPv4
or IPv6 GRO size of the netdevice.

We later add a netif_get_gso_max_size() equivalent as well for GSO, so that
these helpers can be used consistently instead of open-coded checks.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240923212242.15669-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-01 10:48:51 +02:00
Linus Torvalds fa8380a06b bpf-next-6.12-struct-fd
-----BEGIN PGP SIGNATURE-----
 
 iQIyBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmbyniwACgkQ6rmadz2v
 bTqE0w/2J8TJWfR+1Z0Bf2Nzt3kFd/wLNn6FpWsq+z0/pzoP5AzborvmLzNiZmeh
 0vJFieOL7pV4+NcaIHBPqfW1eMsXu+BlrtkHGLLYiCPJUr8o5jU9SrVKfF3arMZS
 a6+zcX6EivX0MYWobZ2F7/8XF0nRQADxzInLazFmtJmLmOAyIch417KOg9ylwr3m
 WVqhtCImUFyVz83XMFgbf2jXrvL9xD08iHN62GzcAioRF5LeJSPX0U/N15gWDqF7
 V68F0PnvUf6/hkFvYVynhpMivE8u+8VXCHX+heZ8yUyf4ExV/+KSZrImupJ0WLeO
 iX/qJ/9XP+g6ad9Olqpu6hmPi/6c6epQgbSOchpG04FGBGmJv1j9w4wnlHCgQDdB
 i2oKHRtMKdqNZc0sOSfvw/KyxZXJuD1VQ9YgGVpZbHUbSZDoj7T40zWziUp8VgyR
 nNtOmfJLDbtYlPV7/cQY5Ui4ccMJm6GzxxLBcqcMWxBu/90Ng0wTSubLbg3RHmWu
 d9cCL6IprjJnliEUqC4k4gqZy6RJlHvQ8+NDllaW+4iPnz7B2WaUbwRX/oZ5yiYK
 bLjWCWo+SzntVPAzTsmAYs2G47vWoALxo2NpNXLfmhJiWwfakJaQu7fwrDxsY11M
 OgByiOzcbAcvkJzeVIDhfLVq5z49KF6k4D8Qu0uvXHDeC8Mraw==
 =zzmh
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.12-struct-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf 'struct fd' updates from Alexei Starovoitov:
 "This includes struct_fd BPF changes from Al and Andrii"

* tag 'bpf-next-6.12-struct-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  bpf: convert bpf_token_create() to CLASS(fd, ...)
  security,bpf: constify struct path in bpf_token_create() LSM hook
  bpf: more trivial fdget() conversions
  bpf: trivial conversions for fdget()
  bpf: switch maps to CLASS(fd, ...)
  bpf: factor out fetching bpf_map from FD and adding it to used_maps list
  bpf: switch fdget_raw() uses to CLASS(fd_raw, ...)
  bpf: convert __bpf_prog_get() to CLASS(fd, ...)
2024-09-24 14:54:26 -07:00
Linus Torvalds f8ffbc365f struct fd layout change (and conversion to accessor helpers)
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZvDNmgAKCRBZ7Krx/gZQ
 63zrAP9vI0rf55v27twiabe9LnI7aSx5ckoqXxFIFxyT3dOYpQD/bPmoApnWDD3d
 592+iDgLsema/H/0/CqfqlaNtDNY8Q0=
 =HUl5
 -----END PGP SIGNATURE-----

Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull 'struct fd' updates from Al Viro:
 "Just the 'struct fd' layout change, with conversion to accessor
  helpers"

* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  add struct fd constructors, get rid of __to_fd()
  struct fd: representation change
  introduce fd_file(), convert all accessors to it.
2024-09-23 09:35:36 -07:00
Linus Torvalds 440b652328 bpf-next-6.12
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmbk/nIACgkQ6rmadz2v
 bTqxuBAAnqW81Rr0nORIxeJMbyo4EiFuYHGk6u5BYP9NPzqHroUPCLVmSP7Hp/Ta
 CJjsiZeivZsGa6Qlc3BCa4hHNpqP5WE1C/73svSDn7/99EfxdSBtirpMVFUPsUtn
 DDb5chNpvnxKNS8Mw5Ty8wBrdbXHMlSx+IfaFHpv0Yn6EAcuF4UdoEUq2l3PqhfD
 Il9Zm127eViPGAP+o+TBZFfW+rRw8d0ngqeRq2GvJ8ibNEDWss+GmBI1Dod7d+fC
 dUDg96Ipdm1a5Xz7dnH80eXz9JHdpu6qhQrQMKKArnlpJElrKiOf9b17ZcJoPQOR
 ZnstEnUyVnrWROZxUuKY72+2tx3TuSf+L9uZqFHNx3Ix5FIoS+tFbHf4b8SxtsOb
 hb2X7SigdGqhQDxUT+IPeO5hsJlIvG1/VYxMXxgc++rh9DjL06hDLUSH1WBSU0fC
 kFQ7HrcpAlVHtWmGbwwUyVjD+KC/qmZBTAnkcYT4C62WZVytSCnihIuSFAvV1tpZ
 SSIhVPyQ599UoZIiQYihp0S4qP74FotCtErWSrThneh2Cl8kDsRq//lV1nj/PTV8
 CpTvz4VCFDFTgthCfd62fP95EwW5K+aE3NjGTPW/9Hx/0+J/1tT+yqWsrToGaruf
 TbrqtzQhpclz9UEqA+696cVAXNj9uRU4AoD3YIg72kVnRlkgYd0=
 =MDwh
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Introduce '__attribute__((bpf_fastcall))' for helpers and kfuncs with
   corresponding support in LLVM.

   It is similar to existing 'no_caller_saved_registers' attribute in
   GCC/LLVM with a provision for backward compatibility. It allows
   compilers generate more efficient BPF code assuming the verifier or
   JITs will inline or partially inline a helper/kfunc with such
   attribute. bpf_cast_to_kern_ctx, bpf_rdonly_cast,
   bpf_get_smp_processor_id are the first set of such helpers.

 - Harden and extend ELF build ID parsing logic.

   When called from sleepable context the relevants parts of ELF file
   will be read to find and fetch .note.gnu.build-id information. Also
   harden the logic to avoid TOCTOU, overflow, out-of-bounds problems.

 - Improvements and fixes for sched-ext:
    - Allow passing BPF iterators as kfunc arguments
    - Make the pointer returned from iter_next method trusted
    - Fix x86 JIT convergence issue due to growing/shrinking conditional
      jumps in variable length encoding

 - BPF_LSM related:
    - Introduce few VFS kfuncs and consolidate them in
      fs/bpf_fs_kfuncs.c
    - Enforce correct range of return values from certain LSM hooks
    - Disallow attaching to other LSM hooks

 - Prerequisite work for upcoming Qdisc in BPF:
    - Allow kptrs in program provided structs
    - Support for gen_epilogue in verifier_ops

 - Important fixes:
    - Fix uprobe multi pid filter check
    - Fix bpf_strtol and bpf_strtoul helpers
    - Track equal scalars history on per-instruction level
    - Fix tailcall hierarchy on x86 and arm64
    - Fix signed division overflow to prevent INT_MIN/-1 trap on x86
    - Fix get kernel stack in BPF progs attached to tracepoint:syscall

 - Selftests:
    - Add uprobe bench/stress tool
    - Generate file dependencies to drastically improve re-build time
    - Match JIT-ed and BPF asm with __xlated/__jited keywords
    - Convert older tests to test_progs framework
    - Add support for RISC-V
    - Few fixes when BPF programs are compiled with GCC-BPF backend
      (support for GCC-BPF in BPF CI is ongoing in parallel)
    - Add traffic monitor
    - Enable cross compile and musl libc

* tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (260 commits)
  btf: require pahole 1.21+ for DEBUG_INFO_BTF with default DWARF version
  btf: move pahole check in scripts/link-vmlinux.sh to lib/Kconfig.debug
  btf: remove redundant CONFIG_BPF test in scripts/link-vmlinux.sh
  bpf: Call the missed kfree() when there is no special field in btf
  bpf: Call the missed btf_record_free() when map creation fails
  selftests/bpf: Add a test case to write mtu result into .rodata
  selftests/bpf: Add a test case to write strtol result into .rodata
  selftests/bpf: Rename ARG_PTR_TO_LONG test description
  selftests/bpf: Fix ARG_PTR_TO_LONG {half-,}uninitialized test
  bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error
  bpf: Improve check_raw_mode_ok test for MEM_UNINIT-tagged types
  bpf: Fix helper writes to read-only maps
  bpf: Remove truncation test in bpf_strtol and bpf_strtoul helpers
  bpf: Fix bpf_strtol and bpf_strtoul helpers for 32bit
  selftests/bpf: Add tests for sdiv/smod overflow cases
  bpf: Fix a sdiv overflow issue
  libbpf: Add bpf_object__token_fd accessor
  docs/bpf: Add missing BPF program types to docs
  docs/bpf: Add constant values for linkages
  bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
  ...
2024-09-21 09:27:50 -07:00
Linus Torvalds cb69d86550 Updates for the interrupt subsystem:
- Core:
 	- Remove a global lock in the affinity setting code
 
 	  The lock protects a cpumask for intermediate results and the lock
 	  causes a bottleneck on simultaneous start of multiple virtual
 	  machines. Replace the lock and the static cpumask with a per CPU
 	  cpumask which is nicely serialized by raw spinlock held when
 	  executing this code.
 
 	- Provide support for giving a suffix to interrupt domain names.
 
 	  That's required to support devices with subfunctions so that the
 	  domain names are distinct even if they originate from the same
 	  device node.
 
 	- The usual set of cleanups and enhancements all over the place
 
   - Drivers:
 
 	- Support for longarch AVEC interrupt chip
 
 	- Refurbishment of the Armada driver so it can be extended for new
           variants.
 
 	- The usual set of cleanups and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn5p8THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRFtD/43eB3h5usY2OPW0JmDqrE6qnzsvjPZ
 1H52BcmMcOuI6yCfTnbi/fBB52mwSEGq9Dmt1GXradyq9/CJDIqZ1ajI1rA2jzW2
 YdbeTDpKm1rS2ddzfp2LT2BryrNt+7etrRO7qHn4EKSuOcNuV2f58WPbIIqasvaK
 uPbUDVDPrvXxLNcjoab6SqaKrEoAaHSyKpd0MvDd80wHrtcSC/QouW7JDSUXv699
 RwvLebN1OF6mQ2J8Z3DLeCQpcbAs+UT8UvID7kYUJi1g71J/ZY+xpMLoX/gHiDNr
 isBtsuEAiZeNaFpksc7A6Jgu5ljZf2/aLCqbPLlHaduHFNmo94x9KUbIF2cpEMN+
 rsf5Ff7AVh1otz3cUwLLsm+cFLWRRoZdLuncn7rrgB4Yg0gll7qzyLO6YGvQHr8U
 Ocj1RXtvvWsMk4XzhgCt1AH/42cO6go+bhA4HspeYykNpsIldIUl1MeFbO8sWiDJ
 kybuwiwHp3oaMLjEK4Lpq65u7Ll8Lju2zRde65YUJN2nbNmJFORrOLmeC1qsr6ri
 dpend6n2qD9UD1oAt32ej/uXnG160nm7UKescyxiZNeTm1+ez8GW31hY128ifTY3
 4R3urGS38p3gazXBsfw6eqkeKx0kEoDNoQqrO5gBvb8kowYTvoZtkwMGAN9OADwj
 w6vvU0i+NIyVMA==
 =JlJ2
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "Core:

   - Remove a global lock in the affinity setting code

     The lock protects a cpumask for intermediate results and the lock
     causes a bottleneck on simultaneous start of multiple virtual
     machines. Replace the lock and the static cpumask with a per CPU
     cpumask which is nicely serialized by raw spinlock held when
     executing this code.

   - Provide support for giving a suffix to interrupt domain names.

     That's required to support devices with subfunctions so that the
     domain names are distinct even if they originate from the same
     device node.

   - The usual set of cleanups and enhancements all over the place

  Drivers:

   - Support for longarch AVEC interrupt chip

   - Refurbishment of the Armada driver so it can be extended for new
     variants.

   - The usual set of cleanups and enhancements all over the place"

* tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits)
  genirq: Use cpumask_intersects()
  genirq/cpuhotplug: Use cpumask_intersects()
  irqchip/apple-aic: Only access system registers on SoCs which provide them
  irqchip/apple-aic: Add a new "Global fast IPIs only" feature level
  irqchip/apple-aic: Skip unnecessary enabling of use_fast_ipi
  dt-bindings: apple,aic: Document A7-A11 compatibles
  irqdomain: Use IS_ERR_OR_NULL() in irq_domain_trim_hierarchy()
  genirq/msi: Use kmemdup_array() instead of kmemdup()
  genirq/proc: Change the return value for set affinity permission error
  genirq/proc: Use irq_move_pending() in show_irq_affinity()
  genirq/proc: Correctly set file permissions for affinity control files
  genirq: Get rid of global lock in irq_do_set_affinity()
  genirq: Fix typo in struct comment
  irqchip/loongarch-avec: Add AVEC irqchip support
  irqchip/loongson-pch-msi: Prepare get_pch_msi_handle() for AVECINTC
  irqchip/loongson-eiointc: Rename CPUHP_AP_IRQ_LOONGARCH_STARTING
  LoongArch: Architectural preparation for AVEC irqchip
  LoongArch: Move irqchip function prototypes to irq-loongson.h
  irqchip/loongson-pch-msi: Switch to MSI parent domains
  softirq: Remove unused 'action' parameter from action callback
  ...
2024-09-17 07:09:17 +02:00
Linus Torvalds 3352633ce6 vfs-6.12.file
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEwAAKCRCRxhvAZXjc
 osS0AQCgIpvey9oW5DMyMw6Bv0hFMRv95gbNQZfHy09iK+NMNAD9GALhb/4cMIVB
 7YrZGXEz454lpgcs8AnrOVjVNfctOQg=
 =e9s9
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs file updates from Christian Brauner:
 "This is the work to cleanup and shrink struct file significantly.

  Right now, (focusing on x86) struct file is 232 bytes. After this
  series struct file will be 184 bytes aka 3 cacheline and a spare 8
  bytes for future extensions at the end of the struct.

  With struct file being as ubiquitous as it is this should make a
  difference for file heavy workloads and allow further optimizations in
  the future.

   - struct fown_struct was embedded into struct file letting it take up
     32 bytes in total when really it shouldn't even be embedded in
     struct file in the first place. Instead, actual users of struct
     fown_struct now allocate the struct on demand. This frees up 24
     bytes.

   - Move struct file_ra_state into the union containg the cleanup hooks
     and move f_iocb_flags out of the union. This closes a 4 byte hole
     we created earlier and brings struct file to 192 bytes. Which means
     struct file is 3 cachelines and we managed to shrink it by 40
     bytes.

   - Reorder struct file so that nothing crosses a cacheline.

     I suspect that in the future we will end up reordering some members
     to mitigate false sharing issues or just because someone does
     actually provide really good perf data.

   - Shrinking struct file to 192 bytes is only part of the work.

     Files use a slab that is SLAB_TYPESAFE_BY_RCU and when a kmem cache
     is created with SLAB_TYPESAFE_BY_RCU the free pointer must be
     located outside of the object because the cache doesn't know what
     part of the memory can safely be overwritten as it may be needed to
     prevent object recycling.

     That has the consequence that SLAB_TYPESAFE_BY_RCU may end up
     adding a new cacheline.

     So this also contains work to add a new kmem_cache_create_rcu()
     function that allows the caller to specify an offset where the
     freelist pointer is supposed to be placed. Thus avoiding the
     implicit addition of a fourth cacheline.

   - And finally this removes the f_version member in struct file.

     The f_version member isn't particularly well-defined. It is mainly
     used as a cookie to detect concurrent seeks when iterating
     directories. But it is also abused by some subsystems for
     completely unrelated things.

     It is mostly a directory and filesystem specific thing that doesn't
     really need to live in struct file and with its wonky semantics it
     really lacks a specific function.

     For pipes, f_version is (ab)used to defer poll notifications until
     a write has happened. And struct pipe_inode_info is used by
     multiple struct files in their ->private_data so there's no chance
     of pushing that down into file->private_data without introducing
     another pointer indirection.

     But pipes don't rely on f_pos_lock so this adds a union into struct
     file encompassing f_pos_lock and a pipe specific f_pipe member that
     pipes can use. This union of course can be extended to other file
     types and is similar to what we do in struct inode already"

* tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
  fs: remove f_version
  pipe: use f_pipe
  fs: add f_pipe
  ubifs: store cookie in private data
  ufs: store cookie in private data
  udf: store cookie in private data
  proc: store cookie in private data
  ocfs2: store cookie in private data
  input: remove f_version abuse
  ext4: store cookie in private data
  ext2: store cookie in private data
  affs: store cookie in private data
  fs: add generic_llseek_cookie()
  fs: use must_set_pos()
  fs: add must_set_pos()
  fs: add vfs_setpos_cookie()
  s390: remove unused f_version
  ceph: remove unused f_version
  adi: remove unused f_version
  mm: Removed @freeptr_offset to prevent doc warning
  ...
2024-09-16 09:14:02 +02:00
Ido Schimmel 4b041d286e net: fib_rules: Enable DSCP selector usage
Now that both IPv4 and IPv6 support the new DSCP selector, enable user
space to configure FIB rules that make use of it by changing the policy
of the new DSCP attribute so that it accepts values in the range of [0,
63].

Use NLA_U8 rather than NLA_UINT as the field is of fixed size.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240911093748.3662015-5-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-13 21:15:45 -07:00
Ido Schimmel c951a29f6b net: fib_rules: Add DSCP selector attribute
The FIB rule TOS selector is implemented differently between IPv4 and
IPv6. In IPv4 it is used to match on the three "Type of Services" bits
specified in RFC 791, while in IPv6 is it is used to match on the six
DSCP bits specified in RFC 2474.

Add a new FIB rule attribute to allow matching on DSCP. The attribute
will be used to implement a 'dscp' selector in ip-rule with a consistent
behavior between IPv4 and IPv6.

For now, set the type of the attribute to 'NLA_REJECT' so that user
space will not be able to configure it. This restriction will be lifted
once both IPv4 and IPv6 support the new attribute.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240911093748.3662015-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-13 21:15:44 -07:00
Daniel Borkmann 4b3786a6c5 bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error
For all non-tracing helpers which formerly had ARG_PTR_TO_{LONG,INT} as input
arguments, zero the value for the case of an error as otherwise it could leak
memory. For tracing, it is not needed given CAP_PERFMON can already read all
kernel memory anyway hence bpf_get_func_arg() and bpf_get_func_ret() is skipped
in here.

Also, the MTU helpers mtu_len pointer value is being written but also read.
Technically, the MEM_UNINIT should not be there in order to always force init.
Removing MEM_UNINIT needs more verifier rework though: MEM_UNINIT right now
implies two things actually: i) write into memory, ii) memory does not have
to be initialized. If we lift MEM_UNINIT, it then becomes: i) read into memory,
ii) memory must be initialized. This means that for bpf_*_check_mtu() we're
readding the issue we're trying to fix, that is, it would then be able to
write back into things like .rodata BPF maps. Follow-up work will rework the
MEM_UNINIT semantics such that the intent can be better expressed. For now
just clear the *mtu_len on error path which can be lifted later again.

Fixes: 8a67f2de9b ("bpf: expose bpf_strtol and bpf_strtoul to all program types")
Fixes: d7a4cb9b67 ("bpf: Introduce bpf_strtol and bpf_strtoul helpers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/e5edd241-59e7-5e39-0ee5-a51e31b6840a@iogearbox.net
Link: https://lore.kernel.org/r/20240913191754.13290-5-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13 13:17:56 -07:00
Daniel Borkmann 32556ce93b bpf: Fix helper writes to read-only maps
Lonial found an issue that despite user- and BPF-side frozen BPF map
(like in case of .rodata), it was still possible to write into it from
a BPF program side through specific helpers having ARG_PTR_TO_{LONG,INT}
as arguments.

In check_func_arg() when the argument is as mentioned, the meta->raw_mode
is never set. Later, check_helper_mem_access(), under the case of
PTR_TO_MAP_VALUE as register base type, it assumes BPF_READ for the
subsequent call to check_map_access_type() and given the BPF map is
read-only it succeeds.

The helpers really need to be annotated as ARG_PTR_TO_{LONG,INT} | MEM_UNINIT
when results are written into them as opposed to read out of them. The
latter indicates that it's okay to pass a pointer to uninitialized memory
as the memory is written to anyway.

However, ARG_PTR_TO_{LONG,INT} is a special case of ARG_PTR_TO_FIXED_SIZE_MEM
just with additional alignment requirement. So it is better to just get
rid of the ARG_PTR_TO_{LONG,INT} special cases altogether and reuse the
fixed size memory types. For this, add MEM_ALIGNED to additionally ensure
alignment given these helpers write directly into the args via *<ptr> = val.
The .arg*_size has been initialized reflecting the actual sizeof(*<ptr>).

MEM_ALIGNED can only be used in combination with MEM_FIXED_SIZE annotated
argument types, since in !MEM_FIXED_SIZE cases the verifier does not know
the buffer size a priori and therefore cannot blindly write *<ptr> = val.

Fixes: 57c3bb725a ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types")
Reported-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20240913191754.13290-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13 13:17:55 -07:00
Jakub Kicinski 3b7dc7000e bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZuH9UQAKCRDbK58LschI
 g0/zAP99WOcCBp1M/jSTUOba230+eiol7l5RirDEA6wu7TqY2QEAuvMG0KfCCpTI
 I0WqStrK1QMbhwKPodJC1k+17jArKgw=
 =jfMU
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-09-11

We've added 12 non-merge commits during the last 16 day(s) which contain
a total of 20 files changed, 228 insertions(+), 30 deletions(-).

There's a minor merge conflict in drivers/net/netkit.c:
  00d066a4d4 ("netdev_features: convert NETIF_F_LLTX to dev->lltx")
  d966087948 ("netkit: Disable netpoll support")

The main changes are:

1) Enable bpf_dynptr_from_skb for tp_btf such that this can be used
   to easily parse skbs in BPF programs attached to tracepoints,
   from Philo Lu.

2) Add a cond_resched() point in BPF's sock_hash_free() as there have
   been several syzbot soft lockup reports recently, from Eric Dumazet.

3) Fix xsk_buff_can_alloc() to account for queue_empty_descs which
   got noticed when zero copy ice driver started to use it,
   from Maciej Fijalkowski.

4) Move the xdp:xdp_cpumap_kthread tracepoint before cpumap pushes skbs
   up via netif_receive_skb_list() to better measure latencies,
   from Daniel Xu.

5) Follow-up to disable netpoll support from netkit, from Daniel Borkmann.

6) Improve xsk selftests to not assume a fixed MAX_SKB_FRAGS of 17 but
   instead gather the actual value via /proc/sys/net/core/max_skb_frags,
   also from Maciej Fijalkowski.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  sock_map: Add a cond_resched() in sock_hash_free()
  selftests/bpf: Expand skb dynptr selftests for tp_btf
  bpf: Allow bpf_dynptr_from_skb() for tp_btf
  tcp: Use skb__nullable in trace_tcp_send_reset
  selftests/bpf: Add test for __nullable suffix in tp_btf
  bpf: Support __nullable argument suffix for tp_btf
  bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcv
  selftests/xsk: Read current MAX_SKB_FRAGS from sysctl knob
  xsk: Bump xsk_queue::queue_empty_descs in xp_can_alloc()
  tcp_bpf: Remove an unused parameter for bpf_tcp_ingress()
  bpf, sockmap: Correct spelling skmsg.c
  netkit: Disable netpoll support

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================

Link: https://patch.msgid.link/20240911211525.13834-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-12 20:22:44 -07:00
Mina Almasry d0caf9876a netdev: add dmabuf introspection
Add dmabuf information to page_pool stats:

$ ./cli.py --spec ../netlink/specs/netdev.yaml --dump page-pool-get
...
 {'dmabuf': 10,
  'id': 456,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},
 {'dmabuf': 10,
  'id': 455,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},
 {'dmabuf': 10,
  'id': 454,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},
 {'dmabuf': 10,
  'id': 453,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},
 {'dmabuf': 10,
  'id': 452,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},
 {'dmabuf': 10,
  'id': 451,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},
 {'dmabuf': 10,
  'id': 450,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},
 {'dmabuf': 10,
  'id': 449,
  'ifindex': 3,
  'inflight': 1023,
  'inflight-mem': 4190208},

And queue stats:

$ ./cli.py --spec ../netlink/specs/netdev.yaml --dump queue-get
...
{'dmabuf': 10, 'id': 8, 'ifindex': 3, 'type': 'rx'},
{'dmabuf': 10, 'id': 9, 'ifindex': 3, 'type': 'rx'},
{'dmabuf': 10, 'id': 10, 'ifindex': 3, 'type': 'rx'},
{'dmabuf': 10, 'id': 11, 'ifindex': 3, 'type': 'rx'},
{'dmabuf': 10, 'id': 12, 'ifindex': 3, 'type': 'rx'},
{'dmabuf': 10, 'id': 13, 'ifindex': 3, 'type': 'rx'},
{'dmabuf': 10, 'id': 14, 'ifindex': 3, 'type': 'rx'},
{'dmabuf': 10, 'id': 15, 'ifindex': 3, 'type': 'rx'},

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-14-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11 20:44:32 -07:00
Mina Almasry 678f6e28b5 net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags
Add an interface for the user to notify the kernel that it is done
reading the devmem dmabuf frags returned as cmsg. The kernel will
drop the reference on the frags to make them available for reuse.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240910171458.219195-11-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11 20:44:32 -07:00
Mina Almasry 8f0b3cc9a4 tcp: RX path for devmem TCP
In tcp_recvmsg_locked(), detect if the skb being received by the user
is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
flag - pass it to tcp_recvmsg_devmem() for custom handling.

tcp_recvmsg_devmem() copies any data in the skb header to the linear
buffer, and returns a cmsg to the user indicating the number of bytes
returned in the linear buffer.

tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags,
and returns to the user a cmsg_devmem indicating the location of the
data in the dmabuf device memory. cmsg_devmem contains this information:

1. the offset into the dmabuf where the payload starts. 'frag_offset'.
2. the size of the frag. 'frag_size'.
3. an opaque token 'frag_token' to return to the kernel when the buffer
is to be released.

The pages awaiting freeing are stored in the newly added
sk->sk_user_frags, and each page passed to userspace is get_page()'d.
This reference is dropped once the userspace indicates that it is
done reading this page.  All pages are released when the socket is
destroyed.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240910171458.219195-10-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11 20:44:32 -07:00
Mina Almasry 65249feb6b net: add support for skbs with unreadable frags
For device memory TCP, we expect the skb headers to be available in host
memory for access, and we expect the skb frags to be in device memory
and unaccessible to the host. We expect there to be no mixing and
matching of device memory frags (unaccessible) with host memory frags
(accessible) in the same skb.

Add a skb->devmem flag which indicates whether the frags in this skb
are device memory frags or not.

__skb_fill_netmem_desc() now checks frags added to skbs for net_iov,
and marks the skb as skb->devmem accordingly.

Add checks through the network stack to avoid accessing the frags of
devmem skbs and avoid coalescing devmem skbs with non devmem skbs.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-9-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11 20:44:31 -07:00
Mina Almasry 9f6b619edf net: support non paged skb frags
Make skb_frag_page() fail in the case where the frag is not backed
by a page, and fix its relevant callers to handle this case.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-8-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11 20:44:31 -07:00
Mina Almasry 0f92140468 memory-provider: dmabuf devmem memory provider
Implement a memory provider that allocates dmabuf devmem in the form of
net_iov.

The provider receives a reference to the struct netdev_dmabuf_binding
via the pool->mp_priv pointer. The driver needs to set this pointer for
the provider in the net_iov.

The provider obtains a reference on the netdev_dmabuf_binding which
guarantees the binding and the underlying mapping remains alive until
the provider is destroyed.

Usage of PP_FLAG_DMA_MAP is required for this memory provide such that
the page_pool can provide the driver with the dma-addrs of the devmem.

Support for PP_FLAG_DMA_SYNC_DEV is omitted for simplicity & p.order !=
0.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-7-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11 20:44:31 -07:00
Mina Almasry 8ab79ed50c page_pool: devmem support
Convert netmem to be a union of struct page and struct netmem. Overload
the LSB of struct netmem* to indicate that it's a net_iov, otherwise
it's a page.

Currently these entries in struct page are rented by the page_pool and
used exclusively by the net stack:

struct {
	unsigned long pp_magic;
	struct page_pool *pp;
	unsigned long _pp_mapping_pad;
	unsigned long dma_addr;
	atomic_long_t pp_ref_count;
};

Mirror these (and only these) entries into struct net_iov and implement
netmem helpers that can access these common fields regardless of
whether the underlying type is page or net_iov.

Implement checks for net_iov in netmem helpers which delegate to mm
APIs, to ensure net_iov are never passed to the mm stack.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-6-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11 20:44:31 -07:00
Mina Almasry 28c5c74eea netdev: netdevice devmem allocator
Implement netdev devmem allocator. The allocator takes a given struct
netdev_dmabuf_binding as input and allocates net_iov from that
binding.

The allocation simply delegates to the binding's genpool for the
allocation logic and wraps the returned memory region in a net_iov
struct.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-5-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11 20:44:31 -07:00
Mina Almasry 170aafe35c netdev: support binding dma-buf to netdevice
Add a netdev_dmabuf_binding struct which represents the
dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to
rx queues on the netdevice. On the binding, the dma_buf_attach
& dma_buf_map_attachment will occur. The entries in the sg_table from
mapping will be inserted into a genpool to make it ready
for allocation.

The chunks in the genpool are owned by a dmabuf_chunk_owner struct which
holds the dma-buf offset of the base of the chunk and the dma_addr of
the chunk. Both are needed to use allocations that come from this chunk.

We create a new type that represents an allocation from the genpool:
net_iov. We setup the net_iov allocation size in the
genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally
allocated by the page pool and given to the drivers.

The user can unbind the dmabuf from the netdevice by closing the netlink
socket that established the binding. We do this so that the binding is
automatically unbound even if the userspace process crashes.

The binding and unbinding leaves an indicator in struct netdev_rx_queue
that the given queue is bound, and the binding is actuated by resetting
the rx queue using the queue API.

The netdev_dmabuf_binding struct is refcounted, and releases its
resources only when all the refs are released.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # excluding netlink
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-4-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11 20:44:31 -07:00
Mina Almasry 3efd7ab46d net: netdev netlink api to bind dma-buf to a net device
API takes the dma-buf fd as input, and binds it to the netdevice. The
user can specify the rx queues to bind the dma-buf to.

Suggested-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-3-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11 20:44:31 -07:00
Mina Almasry 7c88f86576 netdev: add netdev_rx_queue_restart()
Add netdev_rx_queue_restart(), which resets an rx queue using the
queue API recently merged[1].

The queue API was merged to enable the core net stack to reset individual
rx queues to actuate changes in the rx queue's configuration. In later
patches in this series, we will use netdev_rx_queue_restart() to reset
rx queues after binding or unbinding dmabuf configuration, which will
cause reallocation of the page_pool to repopulate its memory using the
new configuration.

[1] https://lore.kernel.org/netdev/20240430231420.699177-1-shailend@google.com/T/

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11 20:44:30 -07:00
Eric Dumazet b1339be951 sock_map: Add a cond_resched() in sock_hash_free()
Several syzbot soft lockup reports all have in common sock_hash_free()

If a map with a large number of buckets is destroyed, we need to yield
the cpu when needed.

Fixes: 75e68e5bf2 ("bpf, sockhash: Synchronize delete from bucket list on map free")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240906154449.3742932-1-edumazet@google.com
2024-09-11 22:16:04 +02:00
Philo Lu ffc83860d8 bpf: Allow bpf_dynptr_from_skb() for tp_btf
Making tp_btf able to use bpf_dynptr_from_skb(), which is useful for skb
parsing, especially for non-linear paged skb data. This is achieved by
adding KF_TRUSTED_ARGS flag to bpf_dynptr_from_skb and registering it
for TRACING progs. With KF_TRUSTED_ARGS, args from fentry/fexit are
excluded, so that unsafe progs like fexit/__kfree_skb are not allowed.

We also need the skb dynptr to be read-only in tp_btf. Because
may_access_direct_pkt_data() returns false by default when checking
bpf_dynptr_from_skb, there is no need to add BPF_PROG_TYPE_TRACING to it
explicitly.

Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240911033719.91468-5-lulie@linux.alibaba.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-09-11 08:56:42 -07:00
Jakub Kicinski ea403549da ipsec-next-2024-09-10
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmbf6xAACgkQrB3Eaf9P
 W7eZQA/9HuHTWBg0V43QDT1rjNnKult+uBKYpKrh045outqMs+cU8bsww5ZuIAKx
 ktN66OCE67d7XeFttb9UAJUPqQ98RjwjVUOpjRJ5iRDtj2bmn/5VGSYuH7zx5so0
 msFs5gkomo2ZZNjcMOSrDVGUoCdlHh1og5L2KN/FgztSA1smDdUBQOWNm1peezbI
 eJFt2Q6KCNfzwPthmQte0dmDnK5gWPducereSx03tMuSyUmPML1zrzOFXBXSg09e
 dAlDTxbAXZDrXS4Ii0y/FEM2Ugkjg9FXbE1kvM0i05GIc/SGnEBGEcdW5YbmRhOL
 4JlLnpiLTmKTaIZ0GdpADv7XZMga6R01AalSPsJz+H7aNAHTKkK+SzQY4YXRucZy
 SsASM39oRLzo9Bm4ZZ773Nw83cxBgO/ZixK4KVvCZI/1ftD+9zn72eqk+CeveSeE
 ChaXGuWpRdfAOsgozFJNFx/ffK5qzxFKkIeN9KN0QYV/XJuZJ7nD6eQkH9ydgvTI
 4cexY+cs4wgfdi9dDkVHPVhCR7mRlfi5r/VL8rtWWnWzR07okKF4rW6dgvx33m60
 9MmF1/EdD2uh3CLcBMjNg6qXdC07VeDpFLqWs+utJvSHMuI43uE4FkRQui/J6T9N
 RX7zzkFBsPvPpm5GHLx2u/wvnzX1co1Rk9xzbC+J6FEPlm2/0vI=
 =ErGl
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-next-2024-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2024-09-10

1) Remove an unneeded WARN_ON on packet offload.
   From Patrisious Haddad.

2) Add a copy from skb_seq_state to buffer function.
   This is needed for the upcomming IPTFS patchset.
   From Christian Hopps.

3) Spelling fix in xfrm.h.
   From Simon Horman.

4) Speed up xfrm policy insertions.
   From Florian Westphal.

5) Add and revert a patch to support xfrm interfaces
   for packet offload. This patch was just half cooked.

6) Extend usage of the new xfrm_policy_is_dead_or_sk helper.
   From Florian Westphal.

7) Update comments on sdb and xfrm_policy.
   From Florian Westphal.

8) Fix a null pointer dereference in the new policy insertion
   code From Florian Westphal.

9) Fix an uninitialized variable in the new policy insertion
   code. From Nathan Chancellor.

* tag 'ipsec-next-2024-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
  xfrm: policy: Restore dir assignments in xfrm_hash_rebuild()
  xfrm: policy: fix null dereference
  Revert "xfrm: add SA information to the offloaded packet"
  xfrm: minor update to sdb and xfrm_policy comments
  xfrm: policy: use recently added helper in more places
  xfrm: add SA information to the offloaded packet
  xfrm: policy: remove remaining use of inexact list
  xfrm: switch migrate to xfrm_policy_lookup_bytype
  xfrm: policy: don't iterate inexact policies twice at insert time
  selftests: add xfrm policy insertion speed test script
  xfrm: Correct spelling in xfrm.h
  net: add copy from skb_seq_state to buffer function
  xfrm: Remove documentation WARN_ON to limit return values for offloaded SA
====================

Link: https://patch.msgid.link/20240910065507.2436394-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-10 19:00:47 -07:00
Jakub Kicinski 17245a195d net: remove dev_pick_tx_cpu_id()
dev_pick_tx_cpu_id() has been introduced with two users by
commit a4ea8a3dac ("net: Add generic ndo_select_queue functions").
The use in AF_PACKET has been removed in 2019 by
commit b71b5837f8 ("packet: rework packet_pick_tx_queue() to use common code selection")
The other user was a Netlogic XLP driver, removed in 2021 by
commit 47ac6f567c ("staging: Remove Netlogic XLP network driver").

It's relatively unlikely that any modern driver will need an
.ndo_select_queue implementation which picks purely based on CPU ID
and skips XPS, delete dev_pick_tx_cpu_id()

Found by code inspection.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240906161059.715546-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09 16:53:38 -07:00
Ido Schimmel b3899830aa bpf: lwtunnel: Unmask upper DSCP bits in bpf_lwt_xmit_reroute()
Unmask the upper DSCP bits when calling ip_route_output_key() so that in
the future it could perform the FIB lookup according to the full DSCP
value.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-09-09 14:14:52 +01:00
Zijun Hu 8f08854199 net: sysfs: Fix weird usage of class's namespace relevant fields
Device class has two namespace relevant fields which are associated by
the following usage:

struct class {
	...
	const struct kobj_ns_type_operations *ns_type;
	const void *(*namespace)(const struct device *dev);
	...
}
if (dev->class && dev->class->ns_type)
	dev->class->namespace(dev);

The usage looks weird since it checks @ns_type but calls namespace()
it is found for all existing class definitions that the other filed is
also assigned once one is assigned in current kernel tree, so fix this
weird usage by checking @namespace to call namespace().

Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-09-09 10:30:52 +01:00
Eric Dumazet 9a95eedc81 netpoll: remove netpoll_srcu
netpoll_srcu is currently used from netpoll_poll_disable() and
__netpoll_cleanup()

Both functions run under RTNL, using netpoll_srcu adds confusion
and no additional protection.

Moreover the synchronize_srcu() call in __netpoll_cleanup() is
performed before clearing np->dev->npinfo, which violates RCU rules.

After this patch, netpoll_poll_disable() and netpoll_poll_enable()
simply use rtnl_dereference().

This saves a big chunk of memory (more than 192KB on platforms
with 512 cpus)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20240905084909.2082486-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-06 18:24:59 -07:00
Hongbo Li 17f0139190 net/core: make use of the helper macro LIST_HEAD()
list_head can be initialized automatically with LIST_HEAD()
instead of calling INIT_LIST_HEAD(). Here we can simplify
the code.

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://patch.msgid.link/20240904093243.3345012-6-lihongbo22@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-06 18:10:22 -07:00
Jakub Kicinski 502cc061de Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/phy/phy_device.c
  2560db6ede ("net: phy: Fix missing of_node_put() for leds")
  1dce520abd ("net: phy: Use for_each_available_child_of_node_scoped()")
https://lore.kernel.org/20240904115823.74333648@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/xilinx/xilinx_axienet.h
drivers/net/ethernet/xilinx/xilinx_axienet_main.c
  858430db28 ("net: xilinx: axienet: Fix race in axienet_stop")
  76abb5d675 ("net: xilinx: axienet: Add statistics support")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-05 20:37:20 -07:00
Joe Damato 08062af0a5 net: napi: Prevent overflow of napi_defer_hard_irqs
In commit 6f8b12d661 ("net: napi: add hard irqs deferral feature")
napi_defer_irqs was added to net_device and napi_defer_irqs_count was
added to napi_struct, both as type int.

This value never goes below zero, so there is not reason for it to be a
signed int. Change the type for both from int to u32, and add an
overflow check to sysfs to limit the value to S32_MAX.

The limit of S32_MAX was chosen because the practical limit before this
patch was S32_MAX (anything larger was an overflow) and thus there are
no behavioral changes introduced. If the extra bit is needed in the
future, the limit can be raised.

Before this patch:

$ sudo bash -c 'echo 2147483649 > /sys/class/net/eth4/napi_defer_hard_irqs'
$ cat /sys/class/net/eth4/napi_defer_hard_irqs
-2147483647

After this patch:

$ sudo bash -c 'echo 2147483649 > /sys/class/net/eth4/napi_defer_hard_irqs'
bash: line 0: echo: write error: Numerical result out of range

Similarly, /sys/class/net/XXXXX/tx_queue_len is defined as unsigned:

include/linux/netdevice.h:      unsigned int            tx_queue_len;

And has an overflow check:

dev_change_tx_queue_len(..., unsigned long new_len):

  if (new_len != (unsigned int)new_len)
          return -ERANGE;

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240904153431.307932-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-05 18:42:56 -07:00
Breno Leitao 77461c1081 net: dqs: Do not use extern for unused dql_group
When CONFIG_DQL is not enabled, dql_group should be treated as a dead
declaration. However, its current extern declaration assumes the linker
will ignore it, which is generally true across most compiler and
architecture combinations.

But in certain cases, the linker still attempts to resolve the extern
struct, even when the associated code is dead, resulting in a linking
error. For instance the following error in loongarch64:

>> loongarch64-linux-ld: net-sysfs.c:(.text+0x589c): undefined reference to `dql_group'

Modify the declaration of the dead object to be an empty declaration
instead of an extern. This change will prevent the linker from
attempting to resolve an undefined reference.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409012047.eCaOdfQJ-lkp@intel.com/
Fixes: 74293ea1c4 ("net: sysfs: Do not create sysfs for non BQL device")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Simon Horman <horms@kernel.org> # build-tested
Link: https://patch.msgid.link/20240902101734.3260455-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-03 12:01:38 -07:00
Alexander Lobakin 05c1280a2b netdev_features: convert NETIF_F_NETNS_LOCAL to dev->netns_local
"Interface can't change network namespaces" is rather an attribute,
not a feature, and it can't be changed via Ethtool.
Make it a "cold" private flag instead of a netdev_feature and free
one more bit.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-03 11:36:43 +02:00
Alexander Lobakin 00d066a4d4 netdev_features: convert NETIF_F_LLTX to dev->lltx
NETIF_F_LLTX can't be changed via Ethtool and is not a feature,
rather an attribute, very similar to IFF_NO_QUEUE (and hot).
Free one netdev_features_t bit and make it a "hot" private flag.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-03 11:36:43 +02:00
Alexander Lobakin beb5a9bea8 netdevice: convert private flags > BIT(31) to bitfields
Make dev->priv_flags `u32` back and define bits higher than 31 as
bitfield booleans as per Jakub's suggestion. This simplifies code
which accesses these bits with no optimization loss (testb both
before/after), allows to not extend &netdev_priv_flags each time,
but also scales better as bits > 63 in the future would only add
a new u64 to the structure with no complications, comparing to
that extending ::priv_flags would require converting it to a bitmap.
Note that I picked `unsigned long :1` to not lose any potential
optimizations comparing to `bool :1` etc.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-03 11:36:43 +02:00
Joe Damato 4e3a024b43 netdev-genl: Set extack and fix error on napi-get
In commit 27f91aaf49 ("netdev-genl: Add netlink framework functions
for napi"), when an invalid NAPI ID is specified the return value
-EINVAL is used and no extack is set.

Change the return value to -ENOENT and set the extack.

Before this commit:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                          --do napi-get --json='{"id": 451}'
Netlink error: Invalid argument
nl_len = 36 (20) nl_flags = 0x100 nl_type = 2
	error: -22

After this commit:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --do napi-get --json='{"id": 451}'
Netlink error: No such file or directory
nl_len = 44 (28) nl_flags = 0x300 nl_type = 2
	error: -2
	extack: {'bad-attr': '.id'}

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20240831121707.17562-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-02 18:27:15 -07:00
Simon Horman 731733c623 bpf, sockmap: Correct spelling skmsg.c
Correct spelling in skmsg.c. As reported by codespell.

Signed-off-by: Simon Horman <horms@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://lore.kernel.org/bpf/20240829-sockmap-spell-v1-1-a614d76564cc@kernel.org
2024-09-02 18:43:58 +02:00
Ido Schimmel 50033400fc bpf: Unmask upper DSCP bits in __bpf_redirect_neigh_v4()
Unmask the upper DSCP bits when calling ip_route_output_flow() so that
in the future it could perform the FIB lookup according to the full DSCP
value.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-31 17:44:51 +01:00
Hongbo Li 68016b9972 net: prefer strscpy over strcpy
The deprecated helper strcpy() performs no bounds checking on the
destination buffer. This could result in linear overflows beyond
the end of the buffer, leading to all kinds of misbehaviors.
The safe replacement is strscpy() [1].

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy [1]

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-29 12:33:07 -07:00
Jakub Kicinski 3cbd2090d3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/faraday/ftgmac100.c
  4186c8d9e6 ("net: ftgmac100: Ensure tx descriptor updates are visible")
  e24a6c8746 ("net: ftgmac100: Get link speed and duplex for NC-SI")
https://lore.kernel.org/0b851ec5-f91d-4dd3-99da-e81b98c9ed28@kernel.org

net/ipv4/tcp.c
  bac76cf898 ("tcp: fix forever orphan socket caused by tcp_abort")
  edefba66d9 ("tcp: rstreason: introduce SK_RST_REASON_TCP_STATE for active reset")
https://lore.kernel.org/20240828112207.5c199d41@canb.auug.org.au

No adjacent changes.

Link: https://patch.msgid.link/20240829130829.39148-1-pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-29 11:49:10 -07:00
Christian Brauner 1934b21261 file: reclaim 24 bytes from f_owner
We do embedd struct fown_struct into struct file letting it take up 32
bytes in total. We could tweak struct fown_struct to be more compact but
really it shouldn't even be embedded in struct file in the first place.

Instead, actual users of struct fown_struct should allocate the struct
on demand. This frees up 24 bytes in struct file.

That will have some potentially user-visible changes for the ownership
fcntl()s. Some of them can now fail due to allocation failures.
Practically, that probably will almost never happen as the allocations
are small and they only happen once per file.

The fown_struct is used during kill_fasync() which is used by e.g.,
pipes to generate a SIGIO signal. Sending of such signals is conditional
on userspace having set an owner for the file using one of the F_OWNER
fcntl()s. Such users will be unaffected if struct fown_struct is
allocated during the fcntl() call.

There are a few subsystems that call __f_setown() expecting
file->f_owner to be allocated:

(1) tun devices
    file->f_op->fasync::tun_chr_fasync()
    -> __f_setown()

    There are no callers of tun_chr_fasync().

(2) tty devices

    file->f_op->fasync::tty_fasync()
    -> __tty_fasync()
       -> __f_setown()

    tty_fasync() has no additional callers but __tty_fasync() has. Note
    that __tty_fasync() only calls __f_setown() if the @on argument is
    true. It's called from:

    file->f_op->release::tty_release()
    -> tty_release()
       -> __tty_fasync()
          -> __f_setown()

    tty_release() calls __tty_fasync() with @on false
    => __f_setown() is never called from tty_release().
       => All callers of tty_release() are safe as well.

    file->f_op->release::tty_open()
    -> tty_release()
       -> __tty_fasync()
          -> __f_setown()

    __tty_hangup() calls __tty_fasync() with @on false
    => __f_setown() is never called from tty_release().
       => All callers of __tty_hangup() are safe as well.

From the callchains it's obvious that (1) and (2) end up getting called
via file->f_op->fasync(). That can happen either through the F_SETFL
fcntl() with the FASYNC flag raised or via the FIOASYNC ioctl(). If
FASYNC is requested and the file isn't already FASYNC then
file->f_op->fasync() is called with @on true which ends up causing both
(1) and (2) to call __f_setown().

(1) and (2) are the only subsystems that call __f_setown() from the
file->f_op->fasync() handler. So both (1) and (2) have been updated to
allocate a struct fown_struct prior to calling fasync_helper() to
register with the fasync infrastructure. That's safe as they both call
fasync_helper() which also does allocations if @on is true.

The other interesting case are file leases:

(3) file leases
    lease_manager_ops->lm_setup::lease_setup()
    -> __f_setown()

    Which in turn is called from:

    generic_add_lease()
    -> lease_manager_ops->lm_setup::lease_setup()
       -> __f_setown()

So here again we can simply make generic_add_lease() allocate struct
fown_struct prior to the lease_manager_ops->lm_setup::lease_setup()
which happens under a spinlock.

With that the two remaining subsystems that call __f_setown() are:

(4) dnotify
(5) sockets

Both have their own custom ioctls to set struct fown_struct and both
have been converted to allocate a struct fown_struct on demand from
their respective ioctls.

Interactions with O_PATH are fine as well e.g., when opening a /dev/tty
as O_PATH then no file->f_op->open() happens thus no file->f_owner is
allocated. That's fine as no file operation will be set for those and
the device has never been opened. fcntl()s called on such things will
just allocate a ->f_owner on demand. Although I have zero idea why'd you
care about f_owner on an O_PATH fd.

Link: https://lore.kernel.org/r/20240813-work-f_owner-v2-1-4e9343a79f9f@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28 13:05:39 +02:00
Boris Sukholitko 9388637270 tc: adjust network header after 2nd vlan push
<tldr>
skb network header of the single-tagged vlan packet continues to point the
vlan payload (e.g. IP) after second vlan tag is pushed by tc act_vlan. This
causes problem at the dissector which expects double-tagged packet network
header to point to the inner vlan.

The fix is to adjust network header in tcf_act_vlan.c but requires
refactoring of skb_vlan_push function.
</tldr>

Consider the following shell script snippet configuring TC rules on the
veth interface:

ip link add veth0 type veth peer veth1
ip link set veth0 up
ip link set veth1 up

tc qdisc add dev veth0 clsact

tc filter add dev veth0 ingress pref 10 chain 0 flower \
	num_of_vlans 2 cvlan_ethtype 0x800 action goto chain 5
tc filter add dev veth0 ingress pref 20 chain 0 flower \
	num_of_vlans 1 action vlan push id 100 \
	protocol 0x8100 action goto chain 5
tc filter add dev veth0 ingress pref 30 chain 5 flower \
	num_of_vlans 2 cvlan_ethtype 0x800 action simple sdata "success"

Sending double-tagged vlan packet with the IP payload inside:

cat <<ENDS | text2pcap - - | tcpreplay -i veth1 -
0000  00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 64   ..........."...d
0010  81 00 00 14 08 00 45 04 00 26 04 d2 00 00 7f 11   ......E..&......
0020  18 ef 0a 00 00 01 14 00 00 02 00 00 00 00 00 12   ................
0030  e1 c7 00 00 00 00 00 00 00 00 00 00               ............
ENDS

will match rule 10, goto rule 30 in chain 5 and correctly emit "success" to
the dmesg.

OTOH, sending single-tagged vlan packet:

cat <<ENDS | text2pcap - - | tcpreplay -i veth1 -
0000  00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 14   ..........."....
0010  08 00 45 04 00 2a 04 d2 00 00 7f 11 18 eb 0a 00   ..E..*..........
0020  00 01 14 00 00 02 00 00 00 00 00 16 e1 bf 00 00   ................
0030  00 00 00 00 00 00 00 00 00 00 00 00               ............
ENDS

will match rule 20, will push the second vlan tag but will *not* match
rule 30. IOW, the match at rule 30 fails if the second vlan was freshly
pushed by the kernel.

Lets look at  __skb_flow_dissect working on the double-tagged vlan packet.
Here is the relevant code from around net/core/flow_dissector.c:1277
copy-pasted here for convenience:

	if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX &&
	    skb && skb_vlan_tag_present(skb)) {
		proto = skb->protocol;
	} else {
		vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan),
					    data, hlen, &_vlan);
		if (!vlan) {
			fdret = FLOW_DISSECT_RET_OUT_BAD;
			break;
		}

		proto = vlan->h_vlan_encapsulated_proto;
		nhoff += sizeof(*vlan);
	}

The "else" clause above gets the protocol of the encapsulated packet from
the skb data at the network header location. printk debugging has showed
that in the good double-tagged packet case proto is
htons(0x800 == ETH_P_IP) as expected. However in the single-tagged packet
case proto is garbage leading to the failure to match tc filter 30.

proto is being set from the skb header pointed by nhoff parameter which is
defined at the beginning of __skb_flow_dissect
(net/core/flow_dissector.c:1055 in the current version):

		nhoff = skb_network_offset(skb);

Therefore the culprit seems to be that the skb network offset is different
between double-tagged packet received from the interface and single-tagged
packet having its vlan tag pushed by TC.

Lets look at the interesting points of the lifetime of the single/double
tagged packets as they traverse our packet flow.

Both of them will start at __netif_receive_skb_core where the first vlan
tag will be stripped:

	if (eth_type_vlan(skb->protocol)) {
		skb = skb_vlan_untag(skb);
		if (unlikely(!skb))
			goto out;
	}

At this stage in double-tagged case skb->data points to the second vlan tag
while in single-tagged case skb->data points to the network (eg. IP)
header.

Looking at TC vlan push action (net/sched/act_vlan.c) we have the following
code at tcf_vlan_act (interesting points are in square brackets):

	if (skb_at_tc_ingress(skb))
[1]		skb_push_rcsum(skb, skb->mac_len);

	....

	case TCA_VLAN_ACT_PUSH:
		err = skb_vlan_push(skb, p->tcfv_push_proto, p->tcfv_push_vid |
				    (p->tcfv_push_prio << VLAN_PRIO_SHIFT),
				    0);
		if (err)
			goto drop;
		break;

	....

out:
	if (skb_at_tc_ingress(skb))
[3]		skb_pull_rcsum(skb, skb->mac_len);

And skb_vlan_push (net/core/skbuff.c:6204) function does:

		err = __vlan_insert_tag(skb, skb->vlan_proto,
					skb_vlan_tag_get(skb));
		if (err)
			return err;

		skb->protocol = skb->vlan_proto;
[2]		skb->mac_len += VLAN_HLEN;

in the case of pushing the second tag. Lets look at what happens with
skb->data of the single-tagged packet at each of the above points:

1. As a result of the skb_push_rcsum, skb->data is moved back to the start
   of the packet.

2. First VLAN tag is moved from the skb into packet buffer, skb->mac_len is
   incremented, skb->data still points to the start of the packet.

3. As a result of the skb_pull_rcsum, skb->data is moved forward by the
   modified skb->mac_len, thus pointing to the network header again.

Then __skb_flow_dissect will get confused by having double-tagged vlan
packet with the skb->data at the network header.

The solution for the bug is to preserve "skb->data at second vlan header"
semantics in the skb_vlan_push function. We do this by manipulating
skb->network_header rather than skb->mac_len. skb_vlan_push callers are
updated to do skb_reset_mac_len.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-27 11:37:42 +02:00
Jamie Bainbridge a699781c79 ethtool: check device is present when getting link settings
A sysfs reader can race with a device reset or removal, attempting to
read device state when the device is not actually present. eg:

     [exception RIP: qed_get_current_link+17]
  #8 [ffffb9e4f2907c48] qede_get_link_ksettings at ffffffffc07a994a [qede]
  #9 [ffffb9e4f2907cd8] __rh_call_get_link_ksettings at ffffffff992b01a3
 #10 [ffffb9e4f2907d38] __ethtool_get_link_ksettings at ffffffff992b04e4
 #11 [ffffb9e4f2907d90] duplex_show at ffffffff99260300
 #12 [ffffb9e4f2907e38] dev_attr_show at ffffffff9905a01c
 #13 [ffffb9e4f2907e50] sysfs_kf_seq_show at ffffffff98e0145b
 #14 [ffffb9e4f2907e68] seq_read at ffffffff98d902e3
 #15 [ffffb9e4f2907ec8] vfs_read at ffffffff98d657d1
 #16 [ffffb9e4f2907f00] ksys_read at ffffffff98d65c3f
 #17 [ffffb9e4f2907f38] do_syscall_64 at ffffffff98a052fb

 crash> struct net_device.state ffff9a9d21336000
    state = 5,

state 5 is __LINK_STATE_START (0b1) and __LINK_STATE_NOCARRIER (0b100).
The device is not present, note lack of __LINK_STATE_PRESENT (0b10).

This is the same sort of panic as observed in commit 4224cfd7fb
("net-sysfs: add check for netdevice being present to speed_show").

There are many other callers of __ethtool_get_link_ksettings() which
don't have a device presence check.

Move this check into ethtool to protect all callers.

Fixes: d519e17e2d ("net: export device speed and duplex via sysfs")
Fixes: 4224cfd7fb ("net-sysfs: add check for netdevice being present to speed_show")
Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Link: https://patch.msgid.link/8bae218864beaa44ed01628140475b9bf641c5b0.1724393671.git.jamie.bainbridge@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-26 14:03:02 -07:00
Simon Horman a8c924e987 net: Correct spelling in net/core
Correct spelling in net/core.
As reported by codespell.

Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240822-net-spell-v1-13-3a98971ce2d2@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-26 09:37:23 -07:00
Breno Leitao ae5a0456e0 netpoll: Ensure clean state on setup failures
Modify netpoll_setup() and __netpoll_setup() to ensure that the netpoll
structure (np) is left in a clean state if setup fails for any reason.
This prevents carrying over misconfigured fields in case of partial
setup success.

Key changes:
- np->dev is now set only after successful setup, ensuring it's always
  NULL if netpoll is not configured or if netpoll_setup() fails.
- np->local_ip is zeroed if netpoll setup doesn't complete successfully.
- Added DEBUG_NET_WARN_ON_ONCE() checks to catch unexpected states.
- Reordered some operations in __netpoll_setup() for better logical flow.

These changes improve the reliability of netpoll configuration, since it
assures that the structure is fully initialized or totally unset.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20240822111051.179850-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-26 09:25:44 -07:00
Jakub Kicinski e540e3bcf2 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZsiMrQAKCRDbK58LschI
 g1mtAP9wBoNO9sNRrJ2OUg69R5uSTT2//v7icN01xwVtx9ir/AD+PJ+v/WG1QVlM
 6GNsPoGtQ53ptuiJFfXEkuVELGqKywY=
 =I/T4
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-08-23

We've added 10 non-merge commits during the last 15 day(s) which contain
a total of 10 files changed, 222 insertions(+), 190 deletions(-).

The main changes are:

1) Add TCP_BPF_SOCK_OPS_CB_FLAGS to bpf_*sockopt() to address the case
   when long-lived sockets miss a chance to set additional callbacks
   if a sockops program was not attached early in their lifetime,
   from Alan Maguire.

2) Add a batch of BPF selftest improvements which fix a few bugs and add
   missing features to improve the test coverage of sockmap/sockhash,
   from Michal Luczaj.

3) Fix a false-positive Smatch-reported off-by-one in tcp_validate_cookie()
   which is part of the test_tcp_custom_syncookie BPF selftest,
   from Kuniyuki Iwashima.

4) Fix the flow_dissector BPF selftest which had a bug in IP header's
   tot_len calculation doing subtraction after htons() instead of inside
   htons(), from Asbjørn Sloth Tønnesen.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftest: bpf: Remove mssind boundary check in test_tcp_custom_syncookie.c.
  selftests/bpf: Introduce __attribute__((cleanup)) in create_pair()
  selftests/bpf: Exercise SOCK_STREAM unix_inet_redir_to_connected()
  selftests/bpf: Honour the sotype of af_unix redir tests
  selftests/bpf: Simplify inet_socketpair() and vsock_socketpair_connectible()
  selftests/bpf: Socket pair creation, cleanups
  selftests/bpf: Support more socket types in create_pair()
  selftests/bpf: Avoid subtraction after htons() in ipip tests
  selftests/bpf: add sockopt tests for TCP_BPF_SOCK_OPS_CB_FLAGS
  bpf/bpf_get,set_sockopt: add option to set TCP-BPF sock ops flags
====================

Link: https://patch.msgid.link/20240823134959.1091-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-26 08:50:29 -07:00
Jakub Kicinski b2ede25b7e netfilter pull request 24-08-23
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmbHtrsACgkQ1V2XiooU
 IOTZ8g/+OW8W468NmBHA2zrTWei19irA1iCBLvXMPakice0+ADU51eqVp6uUrCeP
 iBUZGMtCq4WzFBrAEBePK3UNxxMHquWvAsA0kO/XW95KVM++s9ykF62q89jugMb3
 CADEv/TxgJrkzpLWclxHNTCWMKpURijlkjT+kCMR4fKbeQnB6e/jI+2sdl7l5iRG
 tHHm8ieewNNKE+jlSUJUrPEIM3tXRaZh9+JmbClfsF6wUw7qLmT/7P92aHBX4Owp
 tpqi/Xc5/2k+Ud96a8u1NYrLLG+L70uz3SaeE7PvhaRavFuYftk2XLB4L2umtEfb
 ZCZO/lCadH23XrVAUs5EtCDk4Tu3rZdTDsKYm2qS66uBsh/e6hg+j/cIPSO8jsNq
 5Zbs/XzPFJ1PUpXVy8Sfs9vxH+cDuiqhy9nfKrbQotsqtoW+z52UoFH4WAjfmpqb
 XMI+yeSTXYl1KIo2LV408VFRFRGcstBvXE7bOn7ufSrltRZcFdx7wqQBgVbh1zvA
 1NTzIguZ+Lf2hPcNLPQd/f2vghKRTI7gUwzlDRw6so6NOWUM6/yV5KyoiZKekHjC
 S6+M8cdiyMH8DmSsvAb46YKtDxYuIHqLVxuVqjfBHrMo1hLIo5smMCeCRA1Vabd5
 /E4DTwpN5tVWX+HZl1wcAtQpXhcTktWM0qGSPnlRS11gwbAGWvk=
 =O8tu
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains Netfilter updates for net-next:

Patch #1 fix checksum calculation in nfnetlink_queue with SCTP,
	 segment GSO packet since skb_zerocopy() does not support
	 GSO_BY_FRAGS, from Antonio Ojea.

Patch #2 extend nfnetlink_queue coverage to handle SCTP packets,
	 from Antonio Ojea.

Patch #3 uses consume_skb() instead of kfree_skb() in nfnetlink,
         from Donald Hunter.

Patch #4 adds a dedicate commit list for sets to speed up
	 intra-transaction lookups, from Florian Westphal.

Patch #5 skips removal of element from abort path for the pipapo
         backend, ditching the shadow copy of this datastructure
	 is sufficient.

Patch #6 moves nf_ct_netns_get() out of nf_conncount_init() to
	 let users of conncoiunt decide when to enable conntrack,
	 this is needed by openvswitch, from Xin Long.

Patch #7 pass context to all nft_parse_register_load() in
	 preparation for the next patch.

Patches #8 and #9 reject loads from uninitialized registers from
	 control plane to remove register initialization from
	 datapath. From Florian Westphal.

* tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: don't initialize registers in nft_do_chain()
  netfilter: nf_tables: allow loads only when register is initialized
  netfilter: nf_tables: pass context structure to nft_parse_register_load
  netfilter: move nf_ct_netns_get out of nf_conncount_init
  netfilter: nf_tables: do not remove elements if set backend implements .abort
  netfilter: nf_tables: store new sets in dedicated list
  netfilter: nfnetlink: convert kfree_skb to consume_skb
  selftests: netfilter: nft_queue.sh: sctp coverage
  netfilter: nfnetlink_queue: unbreak SCTP traffic
====================

Link: https://patch.msgid.link/20240822221939.157858-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-26 08:42:55 -07:00
Mina Almasry 7d3aed652d net: refactor ->ndo_bpf calls into dev_xdp_propagate
When net devices propagate xdp configurations to slave devices,
we will need to perform a memory provider check to ensure we're
not binding xdp to a device using unreadable netmem.

Currently the ->ndo_bpf calls in a few places. Adding checks to all
these places would not be ideal.

Refactor all the ->ndo_bpf calls into one place where we can add this
check in the future.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-24 15:27:22 +01:00
Li Zetao 2d522384fb rtnetlink: delete redundant judgment statements
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-23 14:27:45 +01:00
Li Zetao c25bdd2ac8 neighbour: delete redundant judgment statements
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-23 14:27:45 +01:00
Li Zetao 41aa426392 fib: rules: delete redundant judgment statements
The initial value of err is -ENOMEM, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-23 14:27:44 +01:00
Maxime Chevallier 3849687869 net: phy: Introduce ethernet link topology representation
Link topologies containing multiple network PHYs attached to the same
net_device can be found when using a PHY as a media converter for use
with an SFP connector, on which an SFP transceiver containing a PHY can
be used.

With the current model, the transceiver's PHY can't be used for
operations such as cable testing, timestamping, macsec offload, etc.

The reason being that most of the logic for these configuration, coming
from either ethtool netlink or ioctls tend to use netdev->phydev, which
in multi-phy systems will reference the PHY closest to the MAC.

Introduce a numbering scheme allowing to enumerate PHY devices that
belong to any netdev, which can in turn allow userspace to take more
precise decisions with regard to each PHY's configuration.

The numbering is maintained per-netdev, in a phy_device_list.
The numbering works similarly to a netdevice's ifindex, with
identifiers that are only recycled once INT_MAX has been reached.

This prevents races that could occur between PHY listing and SFP
transceiver removal/insertion.

The identifiers are assigned at phy_attach time, as the numbering
depends on the netdevice the phy is attached to. The PHY index can be
re-used for PHYs that are persistent.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-23 13:04:34 +01:00
Eric Dumazet 979b581e4c pktgen: use cpus_read_lock() in pg_net_init()
I have seen the WARN_ON(smp_processor_id() != cpu) firing
in pktgen_thread_worker() during tests.

We must use cpus_read_lock()/cpus_read_unlock()
around the for_each_online_cpu(cpu) loop.

While we are at it use WARN_ON_ONCE() to avoid a possible syslog flood.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240821175339.1191779-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22 17:14:03 -07:00
Jakub Kicinski 761d527d5d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/broadcom/bnxt/bnxt.h
  c948c0973d ("bnxt_en: Don't clear ntuple filters and rss contexts during ethtool ops")
  f2878cdeb7 ("bnxt_en: Add support to call FW to update a VNIC")

Link: https://patch.msgid.link/20240822210125.1542769-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22 17:06:18 -07:00
Ido Schimmel ef434fae72 bpf: Unmask upper DSCP bits in bpf_fib_lookup() helper
The helper performs a FIB lookup according to the parameters in the
'params' argument, one of which is 'tos'. According to the test in
test_tc_neigh_fib.c, it seems that BPF programs are expected to
initialize the 'tos' field to the full 8 bit DS field from the IPv4
header.

Unmask the upper DSCP bits before invoking the IPv4 FIB lookup APIs so
that in the future the lookup could be performed according to the full
DSCP value.

No functional changes intended since the upper DSCP bits are masked when
comparing against the TOS selectors in FIB rules and routes.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240821125251.1571445-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22 16:59:56 -07:00
Alexei Starovoitov 50c374c6d1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR including
important fixes (from bpf-next point of view):
commit 41c24102af ("selftests/bpf: Filter out _GNU_SOURCE when compiling test_cpp")
commit fdad456cbc ("bpf: Fix updating attached freplace prog in prog_array map")

No conflicts.

Adjacent changes in:
include/linux/bpf_verifier.h
kernel/bpf/verifier.c
tools/testing/selftests/bpf/Makefile

Link: https://lore.kernel.org/bpf/20240813234307.82773-1-alexei.starovoitov@gmail.com/
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-22 09:48:44 -07:00
Yu Jiaoliang b6ab509027 bpf: Use kmemdup_array instead of kmemdup for multiple allocation
Let the kmemdup_array() take care about multiplication and possible
overflows.

Signed-off-by: Yu Jiaoliang <yujiaoliang@vivo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240821073709.4067177-1-yujiaoliang@vivo.com
2024-08-22 14:28:24 +02:00
Eric Dumazet 007d4271a5 netpoll: do not export netpoll_poll_[disable|enable]()
netpoll_poll_disable() and netpoll_poll_enable() are only used
from core networking code, there is no need to export them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240820162053.3870927-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-21 17:31:28 -07:00
Caleb Sander Mateos e68ac2b488 softirq: Remove unused 'action' parameter from action callback
When soft interrupt actions are called, they are passed a pointer to the
struct softirq action which contains the action's function pointer.

This pointer isn't useful, as the action callback already knows what
function it is. And since each callback handles a specific soft interrupt,
the callback also knows which soft interrupt number is running.

No soft interrupt action callback actually uses this parameter, so remove
it from the function pointer signature. This clarifies that soft interrupt
actions are global routines and makes it slightly cheaper to call them.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/all/20240815171549.3260003-1-csander@purestorage.com
2024-08-20 17:13:40 +02:00
Christian Hopps 6ad8bc92a4 net: add copy from skb_seq_state to buffer function
Add an skb helper function to copy a range of bytes from within
an existing skb_seq_state.

Signed-off-by: Christian Hopps <chopps@labn.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-08-20 08:11:48 +02:00
Zhang Changzhong dca9d62a0d net: remove redundant check in skb_shift()
The check for '!to' is redundant here, since skb_can_coalesce() already
contains this check.

Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1723730983-22912-1-git-send-email-zhangchangzhong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-19 18:20:18 -07:00
Antonio Ojea 26a77d0289 netfilter: nfnetlink_queue: unbreak SCTP traffic
when packet is enqueued with nfqueue and GSO is enabled, checksum
calculation has to take into account the protocol, as SCTP uses a
32 bits CRC checksum.

Enter skb_gso_segment() path in case of SCTP GSO packets because
skb_zerocopy() does not support for GSO_BY_FRAGS.

Joint work with Pablo.

Signed-off-by: Antonio Ojea <aojea@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-19 18:44:50 +02:00
Uros Bizjak d440af37ba netdev: Add missing __percpu qualifier to a cast
Add missing __percpu qualifier to a (void *) cast to fix

dev.c:10863:45: warning: cast removes address space '__percpu' of expression

sparse warning. Also remove now unneeded __force sparse directives.

Found by GCC's named address space checks.

There were no changes in the resulting object file.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://patch.msgid.link/20240814070748.943671-1-ubizjak@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15 19:10:01 -07:00
Jakub Kicinski 4d3d3559fc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

Documentation/devicetree/bindings/net/fsl,qoriq-mc-dpmac.yaml
  c25504a0ba ("dt-bindings: net: fsl,qoriq-mc-dpmac: add missed property phys")
  be034ee6c3 ("dt-bindings: net: fsl,qoriq-mc-dpmac: using unevaluatedProperties")
https://lore.kernel.org/20240815110934.56ae623a@canb.auug.org.au

drivers/net/dsa/vitesse-vsc73xx-core.c
  5b9eebc2c7 ("net: dsa: vsc73xx: pass value in phy_write operation")
  fa63c6434b ("net: dsa: vsc73xx: check busy flag in MDIO operations")
  2524d6c28b ("net: dsa: vsc73xx: use defined values in phy operations")
https://lore.kernel.org/20240813104039.429b9fe6@canb.auug.org.au
Resolve by using FIELD_PREP(), Stephen's resolution is simpler.

Adjacent changes:

net/vmw_vsock/af_vsock.c
  69139d2919 ("vsock: fix recursive ->recvmsg calls")
  744500d81f ("vsock: add support for SIOCOUTQ ioctl")

Link: https://patch.msgid.link/20240815141149.33862-1-pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15 17:18:52 -07:00
Al Viro 55f325958c bpf: switch maps to CLASS(fd, ...)
Calling conventions for __bpf_map_get() would be more convenient
if it left fpdut() on failure to callers.  Makes for simpler logics
in the callers.

	Among other things, the proof of memory safety no longer has to
rely upon file->private_data never being ERR_PTR(...) for bpffs files.
Original calling conventions made it impossible for the caller to tell
whether __bpf_map_get() has returned ERR_PTR(-EINVAL) because it has found
the file not be a bpf map one (in which case it would've done fdput())
or because it found that ERR_PTR(-EINVAL) in file->private_data of a
bpf map file (in which case fdput() would _not_ have been done).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-08-13 15:58:17 -07:00
Andrii Nakryiko 50470d3899 Merge remote-tracking branch 'vfs/stable-struct_fd'
Merge Al Viro's struct fd refactorings.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-08-13 13:52:30 -07:00
Breno Leitao 1ef33652d2 net: netpoll: extract core of netpoll_cleanup
Extract the core part of netpoll_cleanup(), so, it could be called from
a caller that has the rtnl lock already.

Netconsole uses this in a weird way right now:

	__netpoll_cleanup(&nt->np);
	spin_lock_irqsave(&target_list_lock, flags);
	netdev_put(nt->np.dev, &nt->np.dev_tracker);
	nt->np.dev = NULL;
	nt->enabled = false;

This will be replaced by do_netpoll_cleanup() as the locking situation
is overhauled.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-13 10:58:58 +02:00
Al Viro 1da91ea87a introduce fd_file(), convert all accessors to it.
For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
	Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
	This commit converts (almost) all of f.file to
fd_file(f).  It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).

	NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).

[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-12 22:00:43 -04:00
Jakub Sitnicki 2b2bc3bab1 net: Make USO depend on CSUM offload
UDP segmentation offload inherently depends on checksum offload. It should
not be possible to disable checksum offload while leaving USO enabled.
Enforce this dependency in code.

There is a single tx-udp-segmentation feature flag to indicate support for
both IPv4/6, hence the devices wishing to support USO must offer checksum
offload for both IP versions.

Fixes: 10154dbded ("udp: Allow GSO transmit from devices with no checksum offload")
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240808-udp-gso-egress-from-tunnel-v4-1-f5c5b4149ab9@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09 21:58:08 -07:00
Alan Maguire 3882dccf48 bpf/bpf_get,set_sockopt: add option to set TCP-BPF sock ops flags
Currently the only opportunity to set sock ops flags dictating
which callbacks fire for a socket is from within a TCP-BPF sockops
program.  This is problematic if the connection is already set up
as there is no further chance to specify callbacks for that socket.
Add TCP_BPF_SOCK_OPS_CB_FLAGS to bpf_setsockopt() and bpf_getsockopt()
to allow users to specify callbacks later, either via an iterator
over sockets or via a socket-specific program triggered by a
setsockopt() on the socket.

Previous discussion on this here [1].

[1] https://lore.kernel.org/bpf/f42f157b-6e52-dd4d-3d97-9b86c84c0b00@oracle.com/

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/r/20240808150558.1035626-2-alan.maguire@oracle.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-08-08 16:52:43 -07:00
Jakub Kicinski e47fd9beb1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Link: https://patch.msgid.link/20240808170148.3629934-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-08 14:04:17 -07:00
Eric Dumazet 3e7917c0cd net: linkwatch: use system_unbound_wq
linkwatch_event() grabs possibly very contended RTNL mutex.

system_wq is not suitable for such work.

Inspired by many noisy syzbot reports.

3 locks held by kworker/0:7/5266:
 #0: ffff888015480948 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3206 [inline]
 #0: ffff888015480948 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x90a/0x1830 kernel/workqueue.c:3312
 #1: ffffc90003f6fd00 ((linkwatch_work).work){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3207 [inline]
 , at: process_scheduled_works+0x945/0x1830 kernel/workqueue.c:3312
 #2: ffffffff8fa6f208 (rtnl_mutex){+.+.}-{3:3}, at: linkwatch_event+0xe/0x60 net/core/link_watch.c:276

Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240805085821.1616528-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-06 12:12:53 -07:00
Jakub Kicinski c89cca307b net: skbuff: sprinkle more __GFP_NOWARN on ingress allocs
build_skb() and frag allocations done with GFP_ATOMIC will
fail in real life, when system is under memory pressure,
and there's nothing we can do about that. So no point
printing warnings.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-05 12:28:11 +01:00
Kuniyuki Iwashima 8eaf71f77c net: Initialise net.core sysctl defaults in preinit_net().
Commit 7c3f1875c6 ("net: move somaxconn init from sysctl code")
introduced net_defaults_ops to make sure that net.core sysctl knobs
are always initialised even if CONFIG_SYSCTL is disabled.

Such operations better fit preinit_net() added for a similar purpose
by commit 6e77a5a4af ("net: initialize net->notrefcnt_tracker earlier").

Let's initialise the sysctl defaults in preinit_net().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-03 22:38:45 +01:00
Kuniyuki Iwashima 05be801259 net: Slim down setup_net().
Most initialisations in setup_net() do not require pernet_ops_rwsem
and can be moved to preinit_net().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-03 22:38:45 +01:00
Kuniyuki Iwashima 9302994918 net: Call preinit_net() without pernet_ops_rwsem.
When initialising the root netns, we call preinit_net() under
pernet_ops_rwsem.

However, the operations in preinit_net() do not require pernet_ops_rwsem.

Also, we don't hold it for preinit_net() when initialising non-root netns.

To be consistent, let's call preinit_net() without pernet_ops_rwsem in
net_ns_init().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-03 22:38:44 +01:00
Kuniyuki Iwashima 2b5afc1d5d net: Initialise net->passive once in preinit_net().
When initialising the root netns, we set net->passive in setup_net().

However, we do it twice for non-root netns in copy_net_ns() and
setup_net().

This is because we could bypass setup_net() in copy_net_ns() if
down_read_killable() fails.

preinit_net() is a better place to put such an operation.

Let's initialise net->passive in preinit_net().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-03 22:38:44 +01:00
Kuniyuki Iwashima 768e4bb6a7 net: Don't register pernet_operations if only one of id or size is specified.
We can allocate per-netns memory for struct pernet_operations by specifying
id and size.

register_pernet_operations() assigns an id to pernet_operations and later
ops_init() allocates the specified size of memory as net->gen->ptr[id].

If id is missing, no memory is allocated.  If size is not specified,
pernet_operations just wastes an entry of net->gen->ptr[] for every netns.

net_generic is available only when both id and size are specified, so let's
ensure that.

While we are at it, we add const to both fields.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-03 22:38:44 +01:00
Dmitry Antipov f94074687d net: core: annotate socks of struct sock_reuseport with __counted_by
According to '__reuseport_alloc()', annotate flexible array member
'sock' of 'struct sock_reuseport' with '__counted_by()' and use
convenient 'struct_size()' to simplify the math used in 'kzalloc()'.

Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240801142311.42837-1-dmantipov@yandex.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-02 17:16:59 -07:00
Yonghong Song 92de36080c bpf: Fail verification for sign-extension of packet data/data_end/data_meta
syzbot reported a kernel crash due to
  commit 1f1e864b65 ("bpf: Handle sign-extenstin ctx member accesses").
The reason is due to sign-extension of 32-bit load for
packet data/data_end/data_meta uapi field.

The original code looks like:
        r2 = *(s32 *)(r1 + 76) /* load __sk_buff->data */
        r3 = *(u32 *)(r1 + 80) /* load __sk_buff->data_end */
        r0 = r2
        r0 += 8
        if r3 > r0 goto +1
        ...
Note that __sk_buff->data load has 32-bit sign extension.

After verification and convert_ctx_accesses(), the final asm code looks like:
        r2 = *(u64 *)(r1 +208)
        r2 = (s32)r2
        r3 = *(u64 *)(r1 +80)
        r0 = r2
        r0 += 8
        if r3 > r0 goto pc+1
        ...
Note that 'r2 = (s32)r2' may make the kernel __sk_buff->data address invalid
which may cause runtime failure.

Currently, in C code, typically we have
        void *data = (void *)(long)skb->data;
        void *data_end = (void *)(long)skb->data_end;
        ...
and it will generate
        r2 = *(u64 *)(r1 +208)
        r3 = *(u64 *)(r1 +80)
        r0 = r2
        r0 += 8
        if r3 > r0 goto pc+1

If we allow sign-extension,
        void *data = (void *)(long)(int)skb->data;
        void *data_end = (void *)(long)skb->data_end;
        ...
the generated code looks like
        r2 = *(u64 *)(r1 +208)
        r2 <<= 32
        r2 s>>= 32
        r3 = *(u64 *)(r1 +80)
        r0 = r2
        r0 += 8
        if r3 > r0 goto pc+1
and this will cause verification failure since "r2 <<= 32" is not allowed
as "r2" is a packet pointer.

To fix this issue for case
  r2 = *(s32 *)(r1 + 76) /* load __sk_buff->data */
this patch added additional checking in is_valid_access() callback
function for packet data/data_end/data_meta access. If those accesses
are with sign-extenstion, the verification will fail.

  [1] https://lore.kernel.org/bpf/000000000000c90eee061d236d37@google.com/

Reported-by: syzbot+ad9ec60c8eaf69e6f99c@syzkaller.appspotmail.com
Fixes: 1f1e864b65 ("bpf: Handle sign-extenstin ctx member accesses")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240723153439.2429035-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-07-29 15:05:05 -07:00
Kuniyuki Iwashima 9415d375d8 rtnetlink: Don't ignore IFLA_TARGET_NETNSID when ifname is specified in rtnl_dellink().
The cited commit accidentally replaced tgt_net with net in rtnl_dellink().

As a result, IFLA_TARGET_NETNSID is ignored if the interface is specified
with IFLA_IFNAME or IFLA_ALT_IFNAME.

Let's pass tgt_net to rtnl_dev_get().

Fixes: cc6090e985 ("net: rtnetlink: introduce helper to get net_device instance by ifname")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-29 11:36:48 +01:00
Jeongjun Park 9da49aa80d tun: Add missing bpf_net_ctx_clear() in do_xdp_generic()
There are cases where do_xdp_generic returns bpf_net_context without
clearing it. This causes various memory corruptions, so the missing
bpf_net_ctx_clear must be added.

Reported-by: syzbot+44623300f057a28baf1e@syzkaller.appspotmail.com
Fixes: fecef4cd42 ("tun: Assign missing bpf_net_context.")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reported-by: syzbot+3c2b6d5d4bec3b904933@syzkaller.appspotmail.com
Reported-by: syzbot+707d98c8649695eaf329@syzkaller.appspotmail.com
Reported-by: syzbot+c226757eb784a9da3e8b@syzkaller.appspotmail.com
Reported-by: syzbot+61a1cfc2b6632363d319@syzkaller.appspotmail.com
Reported-by: syzbot+709e4c85c904bcd62735@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-29 10:55:15 +01:00
Linus Torvalds 1722389b0d A lot of networking people were at a conference last week, busy
catching COVID, so relatively short PR. Including fixes from bpf
 and netfilter.
 
 Current release - regressions:
 
  - tcp: process the 3rd ACK with sk_socket for TFO and MPTCP
 
 Current release - new code bugs:
 
  - l2tp: protect session IDR and tunnel session list with one lock,
    make sure the state is coherent to avoid a warning
 
  - eth: bnxt_en: update xdp_rxq_info in queue restart logic
 
  - eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field
 
 Previous releases - regressions:
 
  - xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len,
    the field reuses previously un-validated pad
 
 Previous releases - always broken:
 
  - tap/tun: drop short frames to prevent crashes later in the stack
 
  - eth: ice: add a per-VF limit on number of FDIR filters
 
  - af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmaibxAACgkQMUZtbf5S
 IruuIRAAu96TiN/urPwmKznyb/Sk8x7p8iUzn6OvPS/TUlFUkURQtOh6M9uvbpN4
 x/L//EWkMR0hY4SkBegoiXfb1GS0PjBdWTWUiROm5X9nVHqp5KRZAxWXhjFiS1BO
 BIYOT+JfCl7mQiPs90Mys/cEtYOggMBsCZQVIGw/iYoJLFREqxFSONwa0dG+tGMX
 jn9WNu4yCVDhJ/jtl2MaTsCNtYUaBUgYrKHJBfNGfJ2Lz/7rH9yFui2WSMlmOd/U
 QGeCb1DWURlShlCqY37wNinbFsxWkI5JN00ukTtwFAXLIaqc+zgHcIjrDjTJwK43
 F4tKbJT3+bmehMU/h3Uo3c7DhXl7n9zDGiDtbCxnkykp0sFGJpjhDrWydo51c+YB
 qW5HaNrII2LiDicOVN8L29ylvKp7AEkClxgivEhZVGGk2f/szJRXfp9u3WBn5kAx
 3paH55YN0DEsKbYbb1ZENEI1Vnc/4ff4PxZJCUNKwzcS8wCn1awqwcriK9TjS/cp
 fjilNFT4J3/uFrodHWTkx0jJT6UJFT0aF03qPLUH/J5kG+EVukOf1jBPInNdf1si
 1j47SpblHUe86HiHphFMt32KZ210lJzWxh8uGma57Y2sB9makdLiK4etrFjkiMJJ
 Z8A3kGp3KpFjbuK4tHY25rp+5oxLNNOBNpay29lQrWtCL/NDcaQ=
 =9OsH
 -----END PGP SIGNATURE-----

Merge tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and netfilter.

  A lot of networking people were at a conference last week, busy
  catching COVID, so relatively short PR.

  Current release - regressions:

   - tcp: process the 3rd ACK with sk_socket for TFO and MPTCP

  Current release - new code bugs:

   - l2tp: protect session IDR and tunnel session list with one lock,
     make sure the state is coherent to avoid a warning

   - eth: bnxt_en: update xdp_rxq_info in queue restart logic

   - eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field

  Previous releases - regressions:

   - xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len,
     the field reuses previously un-validated pad

  Previous releases - always broken:

   - tap/tun: drop short frames to prevent crashes later in the stack

   - eth: ice: add a per-VF limit on number of FDIR filters

   - af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash"

* tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
  tun: add missing verification for short frame
  tap: add missing verification for short frame
  mISDN: Fix a use after free in hfcmulti_tx()
  gve: Fix an edge case for TSO skb validity check
  bnxt_en: update xdp_rxq_info in queue restart logic
  tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
  selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
  xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
  bpf: Fix a segment issue when downgrading gso_size
  net: mediatek: Fix potential NULL pointer dereference in dummy net_device handling
  MAINTAINERS: make Breno the netconsole maintainer
  MAINTAINERS: Update bonding entry
  net: nexthop: Initialize all fields in dumped nexthops
  net: stmmac: Correct byte order of perfect_match
  selftests: forwarding: skip if kernel not support setting bridge fdb learning limit
  tipc: Return non-zero value from tipc_udp_addr2str() on error
  netfilter: nft_set_pipapo_avx2: disable softinterrupts
  ice: Fix recipe read procedure
  ice: Add a per-VF limit on number of FDIR filters
  net: bonding: correctly annotate RCU in bond_should_notify_peers()
  ...
2024-07-25 13:32:25 -07:00
Jakub Kicinski f7578df913 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZqIl1AAKCRDbK58LschI
 g/MdAP9oyZV9/IZ6Y6Z1fWfio0SB+yJGugcwbFjWcEtNrzsqJQEAwipQnemAI4NC
 HBMfK2a/w7vhAFMXrP/SbkB/gUJJ7QE=
 =vovf
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-07-25

We've added 14 non-merge commits during the last 8 day(s) which contain
a total of 19 files changed, 177 insertions(+), 70 deletions(-).

The main changes are:

1) Fix af_unix to disable MSG_OOB handling for sockets in BPF sockmap and
   BPF sockhash. Also add test coverage for this case, from Michal Luczaj.

2) Fix a segmentation issue when downgrading gso_size in the BPF helper
   bpf_skb_adjust_room(), from Fred Li.

3) Fix a compiler warning in resolve_btfids due to a missing type cast,
   from Liwei Song.

4) Fix stack allocation for arm64 to align the stack pointer at a 16 byte
   boundary in the fexit_sleep BPF selftest, from Puranjay Mohan.

5) Fix a xsk regression to require a flag when actuating tx_metadata_len,
   from Stanislav Fomichev.

6) Fix function prototype BTF dumping in libbpf for prototypes that have
   no input arguments, from Andrii Nakryiko.

7) Fix stacktrace symbol resolution in perf script for BPF programs
   containing subprograms, from Hou Tao.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
  xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
  bpf: Fix a segment issue when downgrading gso_size
  tools/resolve_btfids: Fix comparison of distinct pointer types warning in resolve_btfids
  bpf, events: Use prog to emit ksymbol event for main program
  selftests/bpf: Test sockmap redirect for AF_UNIX MSG_OOB
  selftests/bpf: Parametrize AF_UNIX redir functions to accept send() flags
  selftests/bpf: Support SOCK_STREAM in unix_inet_redir_to_connected()
  af_unix: Disable MSG_OOB handling for sockets in sockmap/sockhash
  bpftool: Fix typo in usage help
  libbpf: Fix no-args func prototype BTF dumping syntax
  MAINTAINERS: Update powerpc BPF JIT maintainers
  MAINTAINERS: Update email address of Naveen
  selftests/bpf: fexit_sleep: Fix stack allocation for arm64
====================

Link: https://patch.msgid.link/20240725114312.32197-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-25 07:40:25 -07:00
Fred Li fa5ef65561 bpf: Fix a segment issue when downgrading gso_size
Linearize the skb when downgrading gso_size because it may trigger a
BUG_ON() later when the skb is segmented as described in [1,2].

Fixes: 2be7e212d5 ("bpf: add bpf_skb_adjust_room helper")
Signed-off-by: Fred Li <dracodingfly@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/all/20240626065555.35460-2-dracodingfly@gmail.com [1]
Link: https://lore.kernel.org/all/668d5cf1ec330_1c18c32947@willemb.c.googlers.com.notmuch [2]
Link: https://lore.kernel.org/bpf/20240719024653.77006-1-dracodingfly@gmail.com
2024-07-25 11:50:14 +02:00
Joel Granados 78eb4ea25c sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.

This patch has been generated by the following coccinelle script:

```
  virtual patch

  @r1@
  identifier ctl, write, buffer, lenp, ppos;
  identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

  @r2@
  identifier func, ctl, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos)
  { ... }

  @r3@
  identifier func;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int , void *, size_t *, loff_t *);

  @r4@
  identifier func, ctl;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int , void *, size_t *, loff_t *);

  @r5@
  identifier func, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

```

* Code formatting was adjusted in xfs_sysctl.c to comply with code
  conventions. The xfs_stats_clear_proc_handler,
  xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
  adjusted.

* The ctl_table argument in proc_watchdog_common was const qualified.
  This is called from a proc_handler itself and is calling back into
  another proc_handler, making it necessary to change it as part of the
  proc_handler migration.

Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-07-24 20:59:29 +02:00
Pablo Neira Ayuso 120f1c857a net: flow_dissector: use DEBUG_NET_WARN_ON_ONCE
The following splat is easy to reproduce upstream as well as in -stable
kernels. Florian Westphal provided the following commit:

  d1dab4f71d ("net: add and use __skb_get_hash_symmetric_net")

but this complementary fix has been also suggested by Willem de Bruijn
and it can be easily backported to -stable kernel which consists in
using DEBUG_NET_WARN_ON_ONCE instead to silence the following splat
given __skb_get_hash() is used by the nftables tracing infrastructure to
to identify packets in traces.

[69133.561393] ------------[ cut here ]------------
[69133.561404] WARNING: CPU: 0 PID: 43576 at net/core/flow_dissector.c:1104 __skb_flow_dissect+0x134f/
[...]
[69133.561944] CPU: 0 PID: 43576 Comm: socat Not tainted 6.10.0-rc7+ #379
[69133.561959] RIP: 0010:__skb_flow_dissect+0x134f/0x2ad0
[69133.561970] Code: 83 f9 04 0f 84 b3 00 00 00 45 85 c9 0f 84 aa 00 00 00 41 83 f9 02 0f 84 81 fc ff
ff 44 0f b7 b4 24 80 00 00 00 e9 8b f9 ff ff <0f> 0b e9 20 f3 ff ff 41 f6 c6 20 0f 84 e4 ef ff ff 48 8d 7b 12 e8
[69133.561979] RSP: 0018:ffffc90000006fc0 EFLAGS: 00010246
[69133.561988] RAX: 0000000000000000 RBX: ffffffff82f33e20 RCX: ffffffff81ab7e19
[69133.561994] RDX: dffffc0000000000 RSI: ffffc90000007388 RDI: ffff888103a1b418
[69133.562001] RBP: ffffc90000007310 R08: 0000000000000000 R09: 0000000000000000
[69133.562007] R10: ffffc90000007388 R11: ffffffff810cface R12: ffff888103a1b400
[69133.562013] R13: 0000000000000000 R14: ffffffff82f33e2a R15: ffffffff82f33e28
[69133.562020] FS:  00007f40f7131740(0000) GS:ffff888390800000(0000) knlGS:0000000000000000
[69133.562027] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[69133.562033] CR2: 00007f40f7346ee0 CR3: 000000015d200001 CR4: 00000000001706f0
[69133.562040] Call Trace:
[69133.562044]  <IRQ>
[69133.562049]  ? __warn+0x9f/0x1a0
[ 1211.841384]  ? __skb_flow_dissect+0x107e/0x2860
[...]
[ 1211.841496]  ? bpf_flow_dissect+0x160/0x160
[ 1211.841753]  __skb_get_hash+0x97/0x280
[ 1211.841765]  ? __skb_get_hash_symmetric+0x230/0x230
[ 1211.841776]  ? mod_find+0xbf/0xe0
[ 1211.841786]  ? get_stack_info_noinstr+0x12/0xe0
[ 1211.841798]  ? bpf_ksym_find+0x56/0xe0
[ 1211.841807]  ? __rcu_read_unlock+0x2a/0x70
[ 1211.841819]  nft_trace_init+0x1b9/0x1c0 [nf_tables]
[ 1211.841895]  ? nft_trace_notify+0x830/0x830 [nf_tables]
[ 1211.841964]  ? get_stack_info+0x2b/0x80
[ 1211.841975]  ? nft_do_chain_arp+0x80/0x80 [nf_tables]
[ 1211.842044]  nft_do_chain+0x79c/0x850 [nf_tables]

Fixes: 9b52e3f267 ("flow_dissector: handle no-skb use case")
Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240715141442.43775-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-18 10:52:17 +02:00
Jakub Kicinski 51b35d4f9d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in late fixes to prepare for the 6.11 net-next PR.

Conflicts:
  93c3a96c30 ("net: pse-pd: Do not return EOPNOSUPP if config is null")
  4cddb0f15e ("net: ethtool: pse-pd: Fix possible null-deref")
  30d7b67277 ("net: ethtool: Add new power limit get and set features")
https://lore.kernel.org/20240715123204.623520bb@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15 13:19:17 -07:00
Asbjørn Sloth Tønnesen 706bf4f44c flow_dissector: set encapsulation control flags for non-IP
Make sure to set encapsulated control flags also for non-IP
packets, such that it's possible to allow matching on e.g.
TUNNEL_OAM on a geneve packet carrying a non-IP packet.

Suggested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Tested-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Davide Caratti <dcaratti@redhat.com>
Link: https://patch.msgid.link/20240713021911.1631517-13-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15 09:14:39 -07:00
Asbjørn Sloth Tønnesen db5271d50e flow_dissector: cleanup FLOW_DISSECTOR_KEY_ENC_FLAGS
Now that TCA_FLOWER_KEY_ENC_FLAGS is unused, as it's
former data is stored behind TCA_FLOWER_KEY_ENC_CONTROL,
then remove the last bits of FLOW_DISSECTOR_KEY_ENC_FLAGS.

FLOW_DISSECTOR_KEY_ENC_FLAGS is unreleased, and have been
in net-next since 2024-06-04.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Tested-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Davide Caratti <dcaratti@redhat.com>
Link: https://patch.msgid.link/20240713021911.1631517-12-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15 09:14:39 -07:00
Asbjørn Sloth Tønnesen 03afeb613b flow_dissector: set encapsulated control flags from tun_flags
Set the new FLOW_DIS_F_TUNNEL_* encapsulated control flags, based
on if their counter-part is set in tun_flags.

These flags are not userspace visible yet, as the code to dump
encapsulated control flags will first be added, and later activated
in the following patches.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Tested-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Davide Caratti <dcaratti@redhat.com>
Link: https://patch.msgid.link/20240713021911.1631517-8-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15 09:14:38 -07:00
Asbjørn Sloth Tønnesen 4d0aed380f flow_dissector: prepare for encapsulated control flags
Rename skb_flow_dissect_set_enc_addr_type() to
skb_flow_dissect_set_enc_control(), and make it set both
addr_type and flags in FLOW_DISSECTOR_KEY_ENC_CONTROL.

Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Tested-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Davide Caratti <dcaratti@redhat.com>
Link: https://patch.msgid.link/20240713021911.1631517-7-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15 09:14:38 -07:00
Jakub Kicinski 30b3560050 Merge branch 'net-make-timestamping-selectable'
First part of "net: Make timestamping selectable" from Kory Maincent.
Change the driver-facing type already to lower rebasing pain.

Link: https://lore.kernel.org/20240709-feature_ptp_netnext-v17-0-b5317f50df2a@bootlin.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15 08:02:30 -07:00
Kory Maincent 2dd3560059 net: Change the API of PHY default timestamp to MAC
Change the API to select MAC default time stamping instead of the PHY.
Indeed the PHY is closer to the wire therefore theoretically it has less
delay than the MAC timestamping but the reality is different. Due to lower
time stamping clock frequency, latency in the MDIO bus and no PHC hardware
synchronization between different PHY, the PHY PTP is often less precise
than the MAC. The exception is for PHY designed specially for PTP case but
these devices are not very widespread. For not breaking the compatibility
default_timestamp flag has been introduced in phy_device that is set by
the phy driver to know we are using the old API behavior.

Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20240709-feature_ptp_netnext-v17-4-b5317f50df2a@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15 08:02:26 -07:00
Taehee Yoo 59a931c5b7 xdp: fix invalid wait context of page_pool_destroy()
If the driver uses a page pool, it creates a page pool with
page_pool_create().
The reference count of page pool is 1 as default.
A page pool will be destroyed only when a reference count reaches 0.
page_pool_destroy() is used to destroy page pool, it decreases a
reference count.
When a page pool is destroyed, ->disconnect() is called, which is
mem_allocator_disconnect().
This function internally acquires mutex_lock().

If the driver uses XDP, it registers a memory model with
xdp_rxq_info_reg_mem_model().
The xdp_rxq_info_reg_mem_model() internally increases a page pool
reference count if a memory model is a page pool.
Now the reference count is 2.

To destroy a page pool, the driver should call both page_pool_destroy()
and xdp_unreg_mem_model().
The xdp_unreg_mem_model() internally calls page_pool_destroy().
Only page_pool_destroy() decreases a reference count.

If a driver calls page_pool_destroy() then xdp_unreg_mem_model(), we
will face an invalid wait context warning.
Because xdp_unreg_mem_model() calls page_pool_destroy() with
rcu_read_lock().
The page_pool_destroy() internally acquires mutex_lock().

Splat looks like:
=============================
[ BUG: Invalid wait context ]
6.10.0-rc6+ #4 Tainted: G W
-----------------------------
ethtool/1806 is trying to lock:
ffffffff90387b90 (mem_id_lock){+.+.}-{4:4}, at: mem_allocator_disconnect+0x73/0x150
other info that might help us debug this:
context-{5:5}
3 locks held by ethtool/1806:
stack backtrace:
CPU: 0 PID: 1806 Comm: ethtool Tainted: G W 6.10.0-rc6+ #4 f916f41f172891c800f2fed
Hardware name: ASUS System Product Name/PRIME Z690-P D4, BIOS 0603 11/01/2021
Call Trace:
<TASK>
dump_stack_lvl+0x7e/0xc0
__lock_acquire+0x1681/0x4de0
? _printk+0x64/0xe0
? __pfx_mark_lock.part.0+0x10/0x10
? __pfx___lock_acquire+0x10/0x10
lock_acquire+0x1b3/0x580
? mem_allocator_disconnect+0x73/0x150
? __wake_up_klogd.part.0+0x16/0xc0
? __pfx_lock_acquire+0x10/0x10
? dump_stack_lvl+0x91/0xc0
__mutex_lock+0x15c/0x1690
? mem_allocator_disconnect+0x73/0x150
? __pfx_prb_read_valid+0x10/0x10
? mem_allocator_disconnect+0x73/0x150
? __pfx_llist_add_batch+0x10/0x10
? console_unlock+0x193/0x1b0
? lockdep_hardirqs_on+0xbe/0x140
? __pfx___mutex_lock+0x10/0x10
? tick_nohz_tick_stopped+0x16/0x90
? __irq_work_queue_local+0x1e5/0x330
? irq_work_queue+0x39/0x50
? __wake_up_klogd.part.0+0x79/0xc0
? mem_allocator_disconnect+0x73/0x150
mem_allocator_disconnect+0x73/0x150
? __pfx_mem_allocator_disconnect+0x10/0x10
? mark_held_locks+0xa5/0xf0
? rcu_is_watching+0x11/0xb0
page_pool_release+0x36e/0x6d0
page_pool_destroy+0xd7/0x440
xdp_unreg_mem_model+0x1a7/0x2a0
? __pfx_xdp_unreg_mem_model+0x10/0x10
? kfree+0x125/0x370
? bnxt_free_ring.isra.0+0x2eb/0x500
? bnxt_free_mem+0x5ac/0x2500
xdp_rxq_info_unreg+0x4a/0xd0
bnxt_free_mem+0x1356/0x2500
bnxt_close_nic+0xf0/0x3b0
? __pfx_bnxt_close_nic+0x10/0x10
? ethnl_parse_bit+0x2c6/0x6d0
? __pfx___nla_validate_parse+0x10/0x10
? __pfx_ethnl_parse_bit+0x10/0x10
bnxt_set_features+0x2a8/0x3e0
__netdev_update_features+0x4dc/0x1370
? ethnl_parse_bitset+0x4ff/0x750
? __pfx_ethnl_parse_bitset+0x10/0x10
? __pfx___netdev_update_features+0x10/0x10
? mark_held_locks+0xa5/0xf0
? _raw_spin_unlock_irqrestore+0x42/0x70
? __pm_runtime_resume+0x7d/0x110
ethnl_set_features+0x32d/0xa20

To fix this problem, it uses rhashtable_lookup_fast() instead of
rhashtable_lookup() with rcu_read_lock().
Using xa without rcu_read_lock() here is safe.
xa is freed by __xdp_mem_allocator_rcu_free() and this is called by
call_rcu() of mem_xa_remove().
The mem_xa_remove() is called by page_pool_destroy() if a reference
count reaches 0.
The xa is already protected by the reference count mechanism well in the
control plane.
So removing rcu_read_lock() for page_pool_destroy() is safe.

Fixes: c3f812cea0 ("page_pool: do not release pool until inflight == 0.")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240712095116.3801586-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-14 20:40:21 -07:00
Jakub Kicinski 69cf87304d Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
idpf: XDP chapter I: convert Rx to libeth

Alexander Lobakin says:

XDP for idpf is currently 5 chapters:
* convert Rx to libeth (this);
* convert Tx and stats to libeth;
* generic XDP and XSk code changes, libeth_xdp;
* actual XDP for idpf via libeth_xdp;
* XSk for idpf (^).

Part I does the following:
* splits &idpf_queue into 4 (RQ, SQ, FQ, CQ) and puts them on a diet;
* ensures optimal cacheline placement, strictly asserts CL sizes;
* moves currently unused/dead singleq mode out of line;
* reuses libeth's Rx ptype definitions and helpers;
* uses libeth's Rx buffer management for both header and payload;
* eliminates memcpy()s and coherent DMA uses on hotpath, uses
  napi_build_skb() instead of in-place short skb allocation.

Most idpf patches, except for the queue split, removes more lines
than adds.

Expect far better memory utilization and +5-8% on Rx depending on
the case (+17% on skb XDP_DROP :>).

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  idpf: use libeth Rx buffer management for payload buffer
  idpf: convert header split mode to libeth + napi_build_skb()
  libeth: support different types of buffers for Rx
  idpf: remove legacy Page Pool Ethtool stats
  idpf: reuse libeth's definitions of parsed ptype structures
  idpf: compile singleq code only under default-n CONFIG_IDPF_SINGLEQ
  idpf: merge singleq and splitq &net_device_ops
  idpf: strictly assert cachelines of queue and queue vector structures
  idpf: avoid bloating &idpf_q_vector with big %NR_CPUS
  idpf: split &idpf_queue into 4 strictly-typed queue structures
  idpf: stop using macros for accessing queue descriptors
  libeth: add cacheline / struct layout assertion helpers
  page_pool: use __cacheline_group_{begin, end}_aligned()
  cache: add __cacheline_group_{begin, end}_aligned() (+ couple more)
====================

Link: https://patch.msgid.link/20240710203031.188081-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-12 22:27:26 -07:00
Jakub Kicinski 26f453176a bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZpGVmAAKCRDbK58LschI
 gxB4AQCgquQis63yqTI36j4iXBT+TuxHEBNoQBSLyzYdrLS1dgD/S5DRJDA+3LD+
 394hn/VtB1qvX5vaqjsov4UIwSMyxA0=
 =OhSn
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-07-12

We've added 23 non-merge commits during the last 3 day(s) which contain
a total of 18 files changed, 234 insertions(+), 243 deletions(-).

The main changes are:

1) Improve BPF verifier by utilizing overflow.h helpers to check
   for overflows, from Shung-Hsi Yu.

2) Fix NULL pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT
   when attr->attach_prog_fd was not specified, from Tengda Wu.

3) Fix arm64 BPF JIT when generating code for BPF trampolines with
   BPF_TRAMP_F_CALL_ORIG which corrupted upper address bits,
   from Puranjay Mohan.

4) Remove test_run callback from lwt_seg6local_prog_ops which never worked
   in the first place and caused syzbot reports,
   from Sebastian Andrzej Siewior.

5) Relax BPF verifier to accept non-zero offset on KF_TRUSTED_ARGS/
   /KF_RCU-typed BPF kfuncs, from Matt Bobrowski.

6) Fix a long standing bug in libbpf with regards to handling of BPF
   skeleton's forward and backward compatibility, from Andrii Nakryiko.

7) Annotate btf_{seq,snprintf}_show functions with __printf,
   from Alan Maguire.

8) BPF selftest improvements to reuse common network helpers in sk_lookup
   test and dropping the open-coded inetaddr_len() and make_socket() ones,
   from Geliang Tang.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits)
  selftests/bpf: Test for null-pointer-deref bugfix in resolve_prog_type()
  bpf: Fix null pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT
  selftests/bpf: DENYLIST.aarch64: Skip fexit_sleep again
  bpf: use check_sub_overflow() to check for subtraction overflows
  bpf: use check_add_overflow() to check for addition overflows
  bpf: fix overflow check in adjust_jmp_off()
  bpf: Eliminate remaining "make W=1" warnings in kernel/bpf/btf.o
  bpf: annotate BTF show functions with __printf
  bpf, arm64: Fix trampoline for BPF_TRAMP_F_CALL_ORIG
  selftests/bpf: Close obj in error path in xdp_adjust_tail
  selftests/bpf: Null checks for links in bpf_tcp_ca
  selftests/bpf: Use connect_fd_to_fd in sk_lookup
  selftests/bpf: Use start_server_addr in sk_lookup
  selftests/bpf: Use start_server_str in sk_lookup
  selftests/bpf: Close fd in error path in drop_on_reuseport
  selftests/bpf: Add ASSERT_OK_FD macro
  selftests/bpf: Add backlog for network_helper_opts
  selftests/bpf: fix compilation failure when CONFIG_NF_FLOW_TABLE=m
  bpf: Remove tst_run from lwt_seg6local_prog_ops.
  bpf: relax zero fixed offset constraint on KF_TRUSTED_ARGS/KF_RCU
  ...
====================

Link: https://patch.msgid.link/20240712212448.5378-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-12 22:25:54 -07:00
Alexander Lobakin 13cabc47f8 netdevice: define and allocate &net_device _properly_
In fact, this structure contains a flexible array at the end, but
historically its size, alignment etc., is calculated manually.
There are several instances of the structure embedded into other
structures, but also there's ongoing effort to remove them and we
could in the meantime declare &net_device properly.
Declare the array explicitly, use struct_size() and store the array
size inside the structure, so that __counted_by() can be applied.
Don't use PTR_ALIGN(), as SLUB itself tries its best to ensure the
allocated buffer is aligned to what the user expects.
Also, change its alignment from %NETDEV_ALIGN to the cacheline size
as per several suggestions on the netdev ML.

bloat-o-meter for vmlinux:

free_netdev                                  445     440      -5
netdev_freemem                                24       -     -24
alloc_netdev_mqs                            1481    1450     -31

On x86_64 with several NICs of different vendors, I was never able to
get a &net_device pointer not aligned to the cacheline size after the
change.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20240710113036.2125584-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11 18:11:31 -07:00
Eric Dumazet cef4902b0f net: reduce rtnetlink_rcv_msg() stack usage
IFLA_MAX is increasing slowly but surely.

Some compilers use more than 512 bytes of stack in rtnetlink_rcv_msg()
because it calls rtnl_calcit() for RTM_GETLINK message.

Use noinline_for_stack attribute to not inline rtnl_calcit(),
and directly use nla_for_each_attr_type() (Jakub suggestion)
because we only care about IFLA_EXT_MASK at this stage.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240710151653.3786604-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11 17:13:58 -07:00
Jakub Kicinski 7c8267275d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/sched/act_ct.c
  26488172b0 ("net/sched: Fix UAF when resolving a clash")
  3abbd7ed8b ("act_ct: prepare for stolen verdict coming from conntrack and nat engine")

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11 12:58:13 -07:00
Alexander Lobakin 39daa09d34 page_pool: use __cacheline_group_{begin, end}_aligned()
Instead of doing __cacheline_group_begin() __aligned(), use the new
__cacheline_group_{begin,end}_aligned(), so that it will take care
of the group alignment itself.
Also replace open-coded `4 * sizeof(long)` in two places with
a definition.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-07-10 10:28:23 -07:00
Sebastian Andrzej Siewior c13fda93ac bpf: Remove tst_run from lwt_seg6local_prog_ops.
The syzbot reported that the lwt_seg6 related BPF ops can be invoked
via bpf_test_run() without without entering input_action_end_bpf()
first.

Martin KaFai Lau said that self test for BPF_PROG_TYPE_LWT_SEG6LOCAL
probably didn't work since it was introduced in commit 04d4b274e2a
("ipv6: sr: Add seg6local action End.BPF"). The reason is that the
per-CPU variable seg6_bpf_srh_states::srh is never assigned in the self
test case but each BPF function expects it.

Remove test_run for BPF_PROG_TYPE_LWT_SEG6LOCAL.

Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Reported-by: syzbot+608a2acde8c5a101d07d@syzkaller.appspotmail.com
Fixes: d1542d4ae4 ("seg6: Use nested-BH locking for seg6_bpf_srh_states.")
Fixes: 004d4b274e ("ipv6: sr: Add seg6local action End.BPF")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20240710141631.FbmHcQaX@linutronix.de
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-07-10 09:58:52 -07:00
Hugh Dickins f153831097 net: fix rc7's __skb_datagram_iter()
X would not start in my old 32-bit partition (and the "n"-handling looks
just as wrong on 64-bit, but for whatever reason did not show up there):
"n" must be accumulated over all pages before it's added to "offset" and
compared with "copy", immediately after the skb_frag_foreach_page() loop.

Fixes: d2d30a376d ("net: allow skb_datagram_iter to be called from any context")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://patch.msgid.link/fef352e8-b89a-da51-f8ce-04bc39ee6481@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-09 11:24:54 -07:00
Paolo Abeni 7b769adc26 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZoxN0AAKCRDbK58LschI
 g0c5AQDa3ZV9gfbN42y1zSDoM1uOgO60fb+ydxyOYh8l3+OiQQD/fLfpTY3gBFSY
 9yi/pZhw/QdNzQskHNIBrHFGtJbMxgs=
 =p1Zz
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-07-08

The following pull-request contains BPF updates for your *net-next* tree.

We've added 102 non-merge commits during the last 28 day(s) which contain
a total of 127 files changed, 4606 insertions(+), 980 deletions(-).

The main changes are:

1) Support resilient split BTF which cuts down on duplication and makes BTF
   as compact as possible wrt BTF from modules, from Alan Maguire & Eduard Zingerman.

2) Add support for dumping kfunc prototypes from BTF which enables both detecting
   as well as dumping compilable prototypes for kfuncs, from Daniel Xu.

3) Batch of s390x BPF JIT improvements to add support for BPF arena and to implement
   support for BPF exceptions, from Ilya Leoshkevich.

4) Batch of riscv64 BPF JIT improvements in particular to add 12-argument support
   for BPF trampolines and to utilize bpf_prog_pack for the latter, from Pu Lehui.

5) Extend BPF test infrastructure to add a CHECKSUM_COMPLETE validation option
   for skbs and add coverage along with it, from Vadim Fedorenko.

6) Inline bpf_get_current_task/_btf() helpers in the arm64 BPF JIT which gives
   a small 1% performance improvement in micro-benchmarks, from Puranjay Mohan.

7) Extend the BPF verifier to track the delta between linked registers in order
   to better deal with recent LLVM code optimizations, from Alexei Starovoitov.

8) Fix bpf_wq_set_callback_impl() kfunc signature where the third argument should
   have been a pointer to the map value, from Benjamin Tissoires.

9) Extend BPF selftests to add regular expression support for test output matching
   and adjust some of the selftest when compiled under gcc, from Cupertino Miranda.

10) Simplify task_file_seq_get_next() and remove an unnecessary loop which always
    iterates exactly once anyway, from Dan Carpenter.

11) Add the capability to offload the netfilter flowtable in XDP layer through
    kfuncs, from Florian Westphal & Lorenzo Bianconi.

12) Various cleanups in networking helpers in BPF selftests to shave off a few
    lines of open-coded functions on client/server handling, from Geliang Tang.

13) Properly propagate prog->aux->tail_call_reachable out of BPF verifier, so
    that x86 JIT does not need to implement detection, from Leon Hwang.

14) Fix BPF verifier to add a missing check_func_arg_reg_off() to prevent an
    out-of-bounds memory access for dynpointers, from Matt Bobrowski.

15) Fix bpf_session_cookie() kfunc to return __u64 instead of long pointer as
    it might lead to problems on 32-bit archs, from Jiri Olsa.

16) Enhance traffic validation and dynamic batch size support in xsk selftests,
    from Tushar Vyavahare.

bpf-next-for-netdev

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (102 commits)
  selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep
  selftests/bpf: amend for wrong bpf_wq_set_callback_impl signature
  bpf: helpers: fix bpf_wq_set_callback_impl signature
  libbpf: Add NULL checks to bpf_object__{prev_map,next_map}
  selftests/bpf: Remove exceptions tests from DENYLIST.s390x
  s390/bpf: Implement exceptions
  s390/bpf: Change seen_reg to a mask
  bpf: Remove unnecessary loop in task_file_seq_get_next()
  riscv, bpf: Optimize stack usage of trampoline
  bpf, devmap: Add .map_alloc_check
  selftests/bpf: Remove arena tests from DENYLIST.s390x
  selftests/bpf: Add UAF tests for arena atomics
  selftests/bpf: Introduce __arena_global
  s390/bpf: Support arena atomics
  s390/bpf: Enable arena
  s390/bpf: Support address space cast instruction
  s390/bpf: Support BPF_PROBE_MEM32
  s390/bpf: Land on the next JITed instruction after exception
  s390/bpf: Introduce pre- and post- probe functions
  s390/bpf: Get rid of get_probe_mem_regno()
  ...
====================

Link: https://patch.msgid.link/20240708221438.10974-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-09 17:01:46 +02:00
Geliang Tang f0c1802569 skmsg: Skip zero length skb in sk_msg_recvmsg
When running BPF selftests (./test_progs -t sockmap_basic) on a Loongarch
platform, the following kernel panic occurs:

  [...]
  Oops[#1]:
  CPU: 22 PID: 2824 Comm: test_progs Tainted: G           OE  6.10.0-rc2+ #18
  Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018
     ... ...
     ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560
    ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0
   CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
   PRMD: 0000000c (PPLV0 +PIE +PWE)
   EUEN: 00000007 (+FPE +SXE +ASXE -BTE)
   ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
  ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0)
   BADV: 0000000000000040
   PRID: 0014c011 (Loongson-64bit, Loongson-3C5000)
  Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack
  Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...)
  Stack : ...
  Call Trace:
  [<9000000004162774>] copy_page_to_iter+0x74/0x1c0
  [<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560
  [<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0
  [<90000000049aae34>] inet_recvmsg+0x54/0x100
  [<900000000481ad5c>] sock_recvmsg+0x7c/0xe0
  [<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0
  [<900000000481e27c>] sys_recvfrom+0x1c/0x40
  [<9000000004c076ec>] do_syscall+0x8c/0xc0
  [<9000000003731da4>] handle_syscall+0xc4/0x160
  Code: ...
  ---[ end trace 0000000000000000 ]---
  Kernel panic - not syncing: Fatal exception
  Kernel relocated by 0x3510000
   .text @ 0x9000000003710000
   .data @ 0x9000000004d70000
   .bss  @ 0x9000000006469400
  ---[ end Kernel panic - not syncing: Fatal exception ]---
  [...]

This crash happens every time when running sockmap_skb_verdict_shutdown
subtest in sockmap_basic.

This crash is because a NULL pointer is passed to page_address() in the
sk_msg_recvmsg(). Due to the different implementations depending on the
architecture, page_address(NULL) will trigger a panic on Loongarch
platform but not on x86 platform. So this bug was hidden on x86 platform
for a while, but now it is exposed on Loongarch platform. The root cause
is that a zero length skb (skb->len == 0) was put on the queue.

This zero length skb is a TCP FIN packet, which was sent by shutdown(),
invoked in test_sockmap_skb_verdict_shutdown():

	shutdown(p1, SHUT_WR);

In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no
page is put to this sge (see sg_set_page in sg_set_page), but this empty
sge is queued into ingress_msg list.

And in sk_msg_recvmsg(), this empty sge is used, and a NULL page is got by
sg_page(sge). Pass this NULL page to copy_page_to_iter(), which passes it
to kmap_local_page() and to page_address(), then kernel panics.

To solve this, we should skip this zero length skb. So in sk_msg_recvmsg(),
if copy is zero, that means it's a zero length skb, skip invoking
copy_page_to_iter(). We are using the EFAULT return triggered by
copy_page_to_iter to check for is_fin in tcp_bpf.c.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Suggested-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/e3a16eacdc6740658ee02a33489b1b9d4912f378.1719992715.git.tanggeliang@kylinos.cn
2024-07-09 10:24:50 +02:00
Johannes Berg 946b6c48cc net: page_pool: fix warning code
WARN_ON_ONCE("string") doesn't really do what appears to
be intended, so fix that.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fixes: 90de47f020 ("page_pool: fragment API support for 32-bit arch with 64-bit DMA")
Link: https://patch.msgid.link/20240705134221.2f4de205caa1.I28496dc0f2ced580282d1fb892048017c4491e21@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-08 20:19:18 -07:00
Sebastian Andrzej Siewior fecef4cd42 tun: Assign missing bpf_net_context.
During the introduction of struct bpf_net_context handling for
XDP-redirect, the tun driver has been missed.
Jakub also pointed out that there is another call chain to
do_xdp_generic() originating from netif_receive_skb() and drivers may
use it outside from the NAPI context.

Set the bpf_net_context before invoking BPF XDP program within the TUN
driver. Set the bpf_net_context also in do_xdp_generic() if a xdp
program is available.

Reported-by: syzbot+0b5c75599f1d872bea6f@syzkaller.appspotmail.com
Reported-by: syzbot+5ae46b237278e2369cac@syzkaller.appspotmail.com
Reported-by: syzbot+c1e04a422bbc0f0f2921@syzkaller.appspotmail.com
Fixes: 401cb7dae8 ("net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240704144815.j8xQda5r@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-05 16:59:37 -07:00
Jakub Kicinski 76ed626479 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/phy/aquantia/aquantia.h
  219343755e ("net: phy: aquantia: add missing include guards")
  61578f6793 ("net: phy: aquantia: add support for PHY LEDs")

drivers/net/ethernet/wangxun/libwx/wx_hw.c
  bd07a98178 ("net: txgbe: remove separate irq request for MSI and INTx")
  b501d261a5 ("net: txgbe: add FDIR ATR support")
https://lore.kernel.org/all/20240703112936.483c1975@canb.auug.org.au/

include/linux/mlx5/mlx5_ifc.h
  048a403648 ("net/mlx5: IFC updates for changing max EQs")
  99be56171f ("net/mlx5e: SHAMPO, Re-enable HW-GRO")
https://lore.kernel.org/all/20240701133951.6926b2e3@canb.auug.org.au/

Adjacent changes:

drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c
  4130c67cd1 ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference")
  3f3126515f ("wifi: iwlwifi: mvm: add mvm-specific guard")

include/net/mac80211.h
  816c6bec09 ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP")
  5a009b42e0 ("wifi: mac80211: track changes in AP's TPE")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-04 14:16:11 -07:00
Mina Almasry 4dec64c52e page_pool: convert to use netmem
Abstract the memory type from the page_pool so we can later add support
for new memory types. Convert the page_pool to use the new netmem type
abstraction, rather than use struct page directly.

As of this patch the netmem type is a no-op abstraction: it's always a
struct page underneath. All the page pool internals are converted to
use struct netmem instead of struct page, and the page pool now exports
2 APIs:

1. The existing struct page API.
2. The new struct netmem API.

Keeping the existing API is transitional; we do not want to refactor all
the current drivers using the page pool at once.

The netmem abstraction is currently a no-op. The page_pool uses
page_to_netmem() to convert allocated pages to netmem, and uses
netmem_to_page() to convert the netmem back to pages to pass to mm APIs,

Follow up patches to this series add non-paged netmem support to the
page_pool. This change is factored out on its own to limit the code
churn to this 1 patch, for ease of code review.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/20240628003253.1694510-6-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-02 18:59:33 -07:00
Sebastian Andrzej Siewior d839a73179 net: Optimize xdp_do_flush() with bpf_net_context infos.
Every NIC driver utilizing XDP should invoke xdp_do_flush() after
processing all packages. With the introduction of the bpf_net_context
logic the flush lists (for dev, CPU-map and xsk) are lazy initialized
only if used. However xdp_do_flush() tries to flush all three of them so
all three lists are always initialized and the likely empty lists are
"iterated".
Without the usage of XDP but with CONFIG_DEBUG_NET the lists are also
initialized due to xdp_do_check_flushed().

Jakub suggest to utilize the hints in bpf_net_context and avoid invoking
the flush function. This will also avoiding initializing the lists which
are otherwise unused.

Introduce bpf_net_ctx_get_all_used_flush_lists() to return the
individual list if not-empty. Use the logic in xdp_do_flush() and
xdp_do_check_flushed(). Remove the not needed .*_check_flush().

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-02 15:26:57 +02:00
David Wei d7f39aee79 page_pool: export page_pool_disable_direct_recycling()
56ef27e3 unexported page_pool_unlink_napi() and renamed it to
page_pool_disable_direct_recycling(). This is because there was no
in-tree user of page_pool_unlink_napi().

Since then Rx queue API and an implementation in bnxt got merged. In the
bnxt implementation, it broadly follows the following steps: allocate
new queue memory + page pool, stop old rx queue, swap, then destroy old
queue memory + page pool.

The existing NAPI instance is re-used so when the old page pool that is
no longer used but still linked to this shared NAPI instance is
destroyed, it will trigger warnings.

In my initial patches I unlinked a page pool from a NAPI instance
directly. Instead, export page_pool_disable_direct_recycling() and call
that instead to avoid having a driver touch a core struct.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-02 15:00:11 +02:00
Sagi Grimberg d2d30a376d net: allow skb_datagram_iter to be called from any context
We only use the mapping in a single context, so kmap_local is sufficient
and cheaper. Make sure to use skb_frag_foreach_page as skb frags may
contain compound pages and we need to map page by page.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202406161539.b5ff7b20-oliver.sang@intel.com
Fixes: 950fcaecd5 ("datagram: consolidate datagram copy to iter helpers")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://patch.msgid.link/20240626100008.831849-1-sagi@grimberg.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-02 14:55:15 +02:00
Pavel Begunkov 2ca58ed21c net: limit scope of a skb_zerocopy_iter_stream var
skb_zerocopy_iter_stream() only uses @orig_uarg in the !link_skb path,
and we can move the local variable in the appropriate block.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-02 12:06:50 +02:00
Pavel Begunkov 060f4ba6e4 io_uring/net: move charging socket out of zc io_uring
Currently, io_uring's io_sg_from_iter() duplicates the part of
__zerocopy_sg_from_iter() charging pages to the socket. It'd be too easy
to miss while changing it in net/, the chunk is not the most
straightforward for outside users and full of internal implementation
details. io_uring is not a good place to keep it, deduplicate it by
moving out of the callback into __zerocopy_sg_from_iter().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-02 12:06:50 +02:00
Pavel Begunkov aeb320fc05 net: batch zerocopy_fill_skb_from_iter accounting
Instead of accounting every page range against the socket separately, do
it in batch based on the change in skb->truesize. It's also moved into
__zerocopy_sg_from_iter(), so that zerocopy_fill_skb_from_iter() is
simpler and responsible for setting frags but not the accounting.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-02 12:06:50 +02:00
Pavel Begunkov 7fb05423fe net: split __zerocopy_sg_from_iter()
Split a function out of __zerocopy_sg_from_iter() that only cares about
the traditional path with refcounted pages and doesn't need to know
about ->sg_from_iter. A preparation patch, we'll improve on the function
later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-02 12:06:50 +02:00
Pavel Begunkov 9e2db9d399 net: always try to set ubuf in skb_zerocopy_iter_stream
skb_zcopy_set() does nothing if there is already a ubuf_info associated
with an skb, and since ->link_skb should have set it several lines above
the check here essentially does nothing and can be removed. It's also
safer this way, because even if the callback is faulty we'll
have it set.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-02 12:06:50 +02:00
Edward Cree 8792515119 net: ethtool: add a mutex protecting RSS contexts
While this is not needed to serialise the ethtool entry points (which
 are all under RTNL), drivers may have cause to asynchronously access
 dev->ethtool->rss_ctx; taking dev->ethtool->rss_lock allows them to
 do this safely without needing to take the RTNL.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/7f9c15eb7525bf87af62c275dde3a8570ee8bf0a.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-28 18:53:21 -07:00
Edward Cree 30a32cdf6b net: ethtool: add an extack parameter to new rxfh_context APIs
Currently passed as NULL, but will allow drivers to report back errors
 when ethnl support for these ops is added.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/6e0012347d175fdd1280363d7bfa76a2f2777e17.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-28 18:53:21 -07:00
Edward Cree 847a8ab186 net: ethtool: let the core choose RSS context IDs
Add a new API to create/modify/remove RSS contexts, that passes in the
 newly-chosen context ID (not as a pointer) rather than leaving the
 driver to choose it on create.  Also pass in the ctx, allowing drivers
 to easily use its private data area to store their hardware-specific
 state.
Keep the existing .set_rxfh API for now as a fallback, but deprecate it
 for custom contexts (rss_context != 0).

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/45f1fe61df2163c091ec394c9f52000c8b16cc3b.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-28 18:53:21 -07:00
Edward Cree 6ad2962f8a net: ethtool: attach an XArray of custom RSS contexts to a netdevice
Each context stores the RXFH settings (indir, key, and hfunc) as well
 as optionally some driver private data.
Delete any still-existing contexts at netdev unregister time.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/cbd1c402cec38f2e03124f2ab65b4ae4e08bd90d.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-28 18:53:21 -07:00
Edward Cree 3ebbd9f6de net: move ethtool-related netdev state into its own struct
net_dev->ethtool is a pointer to new struct ethtool_netdev_state, which
 currently contains only the wol_enabled field.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/293a562278371de7534ed1eb17531838ca090633.1719502239.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-28 18:53:17 -07:00
Sagi Grimberg 2d5f6801db Revert "net: micro-optimize skb_datagram_iter"
This reverts commit 934c29999b.
This triggered a usercopy BUG() in systems with HIGHMEM, reported
by the test robot in:
 https://lore.kernel.org/oe-lkp/202406161539.b5ff7b20-oliver.sang@intel.com

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://patch.msgid.link/20240626070153.759257-1-sagi@grimberg.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-27 16:44:32 -07:00
Jakub Kicinski 193b9b2002 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:
  e3f02f32a0 ("ionic: fix kernel panic due to multi-buffer handling")
  d9c0420999 ("ionic: Mark error paths in the data path as unlikely")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-27 12:14:11 -07:00
Jakub Kicinski 482000cf7f bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZnlmXgAKCRDbK58LschI
 g2ovAP9iynwwFEjMSxHjQVXSq1J1PMqF4966vmy30RCKJMMN/QD/SRsRRKcfsPis
 BzKOdsOVbWlDl2CUqvBrPZGT6laKoQc=
 =6/0V
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-06-24

We've added 12 non-merge commits during the last 10 day(s) which contain
a total of 10 files changed, 412 insertions(+), 16 deletions(-).

The main changes are:

1) Fix a BPF verifier issue validating may_goto with a negative offset,
   from Alexei Starovoitov.

2) Fix a BPF verifier validation bug with may_goto combined with jump to
   the first instruction, also from Alexei Starovoitov.

3) Fix a bug with overrunning reservations in BPF ring buffer,
   from Daniel Borkmann.

4) Fix a bug in BPF verifier due to missing proper var_off setting related
   to movsx instruction, from Yonghong Song.

5) Silence unnecessary syzkaller-triggered warning in __xdp_reg_mem_model(),
   from Daniil Dulov.

* tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  xdp: Remove WARN() from __xdp_reg_mem_model()
  selftests/bpf: Add tests for may_goto with negative offset.
  bpf: Fix may_goto with negative offset.
  selftests/bpf: Add more ring buffer test coverage
  bpf: Fix overrunning reservations in ringbuf
  selftests/bpf: Tests with may_goto and jumps to the 1st insn
  bpf: Fix the corner case with may_goto and jump to the 1st insn.
  bpf: Update BPF LSM maintainer list
  bpf: Fix remap of arena.
  selftests/bpf: Add a few tests to cover
  bpf: Add missed var_off setting in coerce_subreg_to_size_sx()
  bpf: Add missed var_off setting in set_sext32_default_val()
====================

Link: https://patch.msgid.link/20240624124330.8401-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 18:15:22 -07:00
Sebastian Andrzej Siewior 401cb7dae8 net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.
The XDP redirect process is two staged:
- bpf_prog_run_xdp() is invoked to run a eBPF program which inspects the
  packet and makes decisions. While doing that, the per-CPU variable
  bpf_redirect_info is used.

- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
  and it may also access other per-CPU variables like xskmap_flush_list.

At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.

The per-CPU variables are only used in the NAPI callback hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protections reasons.
Without locking in local_bh_disable() on PREEMPT_RT this data structure
requires explicit locking.

PREEMPT_RT has forced-threaded interrupts enabled and every
NAPI-callback runs in a thread. If each thread has its own data
structure then locking can be avoided.

Create a struct bpf_net_context which contains struct bpf_redirect_info.
Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
it, bpf_net_ctx_clear() removes it again.
The bpf_net_ctx_set() may nest. For instance a function can be used from
within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
NET_TX_SOFTIRQ which does not. Therefore only the first invocations
updates the pointer.
Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info. The returned data structure is zero initialized to
ensure nothing is leaked from stack. This is done on first usage of the
struct. bpf_net_ctx_set() sets bpf_redirect_info::kern_flags to 0 to
note that initialisation is required. First invocation of
bpf_net_ctx_get_ri() will memset() the data structure and update
bpf_redirect_info::kern_flags.
bpf_redirect_info::nh is excluded from memset because it is only used
once BPF_F_NEIGH is set which also sets the nh member. The kern_flags is
moved past nh to exclude it from memset.

The pointer to bpf_net_context is saved task's task_struct. Using
always the bpf_net_context approach has the advantage that there is
almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds.

Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-15-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:24 -07:00
Sebastian Andrzej Siewior 78f520b7bb net: Use nested-BH locking for bpf_scratchpad.
bpf_scratchpad is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-14-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior d1542d4ae4 seg6: Use nested-BH locking for seg6_bpf_srh_states.
The access to seg6_bpf_srh_states is protected by disabling preemption.
Based on the code, the entry point is input_action_end_bpf() and
every other function (the bpf helper functions bpf_lwt_seg6_*()), that
is accessing seg6_bpf_srh_states, should be called from within
input_action_end_bpf().

input_action_end_bpf() accesses seg6_bpf_srh_states first at the top of
the function and then disables preemption. This looks wrong because if
preemption needs to be disabled as part of the locking mechanism then
the variable shouldn't be accessed beforehand.

Looking at how it is used via test_lwt_seg6local.sh then
input_action_end_bpf() is always invoked from softirq context. If this
is always the case then the preempt_disable() statement is superfluous.
If this is not always invoked from softirq then disabling only
preemption is not sufficient.

Replace the preempt_disable() statement with nested-BH locking. This is
not an equivalent replacement as it assumes that the invocation of
input_action_end_bpf() always occurs in softirq context and thus the
preempt_disable() is superfluous.
Add a local_lock_t the data structure and use local_lock_nested_bh() for
locking. Add lockdep_assert_held() to ensure the lock is held while the
per-CPU variable is referenced in the helper functions.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-13-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior 3414adbd6a lwt: Don't disable migration prio invoking BPF.
There is no need to explicitly disable migration if bottom halves are
also disabled. Disabling BH implies disabling migration.

Remove migrate_disable() and rely solely on disabling BH to remain on
the same CPU.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-12-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior b22800f9d3 dev: Use nested-BH locking for softnet_data.process_queue.
softnet_data::process_queue is a per-CPU variable and relies on disabled
BH for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.

softnet_data::input_queue_head can be updated lockless. This is fine
because this value is only update CPU local by the local backlog_napi
thread.

Add a local_lock_t to softnet_data and use local_lock_nested_bh() for locking
of process_queue. This change adds only lockdep coverage and does not
alter the functional behaviour for !PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-11-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior a8760d0d14 dev: Remove PREEMPT_RT ifdefs from backlog_lock.*().
The backlog_napi locking (previously RPS) relies on explicit locking if
either RPS or backlog NAPI is enabled. If both are disabled then locking
was achieved by disabling interrupts except on PREEMPT_RT. PREEMPT_RT
was excluded because the needed synchronisation was already provided
local_bh_disable().

Since the introduction of backlog NAPI and making it mandatory for
PREEMPT_RT the ifdef within backlog_lock.*() is obsolete and can be
removed.

Remove the ifdefs in backlog_lock.*().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-10-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior ecefbc09e8 net: softnet_data: Make xmit per task.
Softirq is preemptible on PREEMPT_RT. Without a per-CPU lock in
local_bh_disable() there is no guarantee that only one device is
transmitting at a time.
With preemption and multiple senders it is possible that the per-CPU
`recursion' counter gets incremented by different threads and exceeds
XMIT_RECURSION_LIMIT leading to a false positive recursion alert.
The `more' member is subject to similar problems if set by one thread
for one driver and wrongly used by another driver within another thread.

Instead of adding a lock to protect the per-CPU variable it is simpler
to make xmit per-task. Sending and receiving skbs happens always
in thread context anyway.

Having a lock to protected the per-CPU counter would block/ serialize two
sending threads needlessly. It would also require a recursive lock to
ensure that the owner can increment the counter further.

Make the softnet_data.xmit a task_struct member on PREEMPT_RT. Add
needed wrapper.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-9-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior bdacf3e349 net: Use nested-BH locking for napi_alloc_cache.
napi_alloc_cache is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-5-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:22 -07:00
Sebastian Andrzej Siewior 43d7ca2907 net: Use __napi_alloc_frag_align() instead of open coding it.
The else condition within __netdev_alloc_frag_align() is an open coded
__napi_alloc_frag_align().

Use __napi_alloc_frag_align() instead of open coding it.
Move fragsz assignment before page_frag_alloc_align() invocation because
__napi_alloc_frag_align() also contains this statement.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-4-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-24 16:41:22 -07:00
Daniil Dulov 7e9f794283 xdp: Remove WARN() from __xdp_reg_mem_model()
syzkaller reports a warning in __xdp_reg_mem_model().

The warning occurs only if __mem_id_init_hash_table() returns an error. It
returns the error in two cases:

  1. memory allocation fails;
  2. rhashtable_init() fails when some fields of rhashtable_params
     struct are not initialized properly.

The second case cannot happen since there is a static const rhashtable_params
struct with valid fields. So, warning is only triggered when there is a
problem with memory allocation.

Thus, there is no sense in using WARN() to handle this error and it can be
safely removed.

WARNING: CPU: 0 PID: 5065 at net/core/xdp.c:299 __xdp_reg_mem_model+0x2d9/0x650 net/core/xdp.c:299

CPU: 0 PID: 5065 Comm: syz-executor883 Not tainted 6.8.0-syzkaller-05271-gf99c5f563c17 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
RIP: 0010:__xdp_reg_mem_model+0x2d9/0x650 net/core/xdp.c:299

Call Trace:
 xdp_reg_mem_model+0x22/0x40 net/core/xdp.c:344
 xdp_test_run_setup net/bpf/test_run.c:188 [inline]
 bpf_test_run_xdp_live+0x365/0x1e90 net/bpf/test_run.c:377
 bpf_prog_test_run_xdp+0x813/0x11b0 net/bpf/test_run.c:1267
 bpf_prog_test_run+0x33a/0x3b0 kernel/bpf/syscall.c:4240
 __sys_bpf+0x48d/0x810 kernel/bpf/syscall.c:5649
 __do_sys_bpf kernel/bpf/syscall.c:5738 [inline]
 __se_sys_bpf kernel/bpf/syscall.c:5736 [inline]
 __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5736
 do_syscall_64+0xfb/0x240
 entry_SYSCALL_64_after_hwframe+0x6d/0x75

Found by Linux Verification Center (linuxtesting.org) with syzkaller.

Fixes: 8d5d885275 ("xdp: rhashtable with allocator ID to pointer mapping")
Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/all/20240617162708.492159-1-d.dulov@aladdin.ru
Link: https://lore.kernel.org/bpf/20240624080747.36858-1-d.dulov@aladdin.ru
2024-06-24 13:44:02 +02:00
Eric Dumazet 62e58ddb14 net: add softirq safety to netdev_rename_lock
syzbot reported a lockdep violation involving bridge driver [1]

Make sure netdev_rename_lock is softirq safe to fix this issue.

[1]
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
6.10.0-rc2-syzkaller-00249-gbe27b8965297 #0 Not tainted
   -----------------------------------------------------
syz-executor.2/9449 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
 ffffffff8f5de668 (netdev_rename_lock.seqcount){+.+.}-{0:0}, at: rtnl_fill_ifinfo+0x38e/0x2270 net/core/rtnetlink.c:1839

and this task is already holding:
 ffff888060c64cb8 (&br->lock){+.-.}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
 ffff888060c64cb8 (&br->lock){+.-.}-{2:2}, at: br_port_slave_changelink+0x3d/0x150 net/bridge/br_netlink.c:1212
which would create a new lock dependency:
 (&br->lock){+.-.}-{2:2} -> (netdev_rename_lock.seqcount){+.+.}-{0:0}

but this new dependency connects a SOFTIRQ-irq-safe lock:
 (&br->lock){+.-.}-{2:2}

... which became SOFTIRQ-irq-safe at:
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
   __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
   _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
   spin_lock include/linux/spinlock.h:351 [inline]
   br_forward_delay_timer_expired+0x50/0x440 net/bridge/br_stp_timer.c:86
   call_timer_fn+0x18e/0x650 kernel/time/timer.c:1792
   expire_timers kernel/time/timer.c:1843 [inline]
   __run_timers kernel/time/timer.c:2417 [inline]
   __run_timer_base+0x66a/0x8e0 kernel/time/timer.c:2428
   run_timer_base kernel/time/timer.c:2437 [inline]
   run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2447
   handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
   __do_softirq kernel/softirq.c:588 [inline]
   invoke_softirq kernel/softirq.c:428 [inline]
   __irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637
   irq_exit_rcu+0x9/0x30 kernel/softirq.c:649
   instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1043 [inline]
   sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1043
   asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
   lock_acquire+0x264/0x550 kernel/locking/lockdep.c:5758
   fs_reclaim_acquire+0xaf/0x140 mm/page_alloc.c:3800
   might_alloc include/linux/sched/mm.h:334 [inline]
   slab_pre_alloc_hook mm/slub.c:3890 [inline]
   slab_alloc_node mm/slub.c:3980 [inline]
   kmalloc_trace_noprof+0x3d/0x2c0 mm/slub.c:4147
   kmalloc_noprof include/linux/slab.h:660 [inline]
   kzalloc_noprof include/linux/slab.h:778 [inline]
   class_dir_create_and_add drivers/base/core.c:3255 [inline]
   get_device_parent+0x2a7/0x410 drivers/base/core.c:3315
   device_add+0x325/0xbf0 drivers/base/core.c:3645
   netdev_register_kobject+0x17e/0x320 net/core/net-sysfs.c:2136
   register_netdevice+0x11d5/0x19e0 net/core/dev.c:10375
   nsim_init_netdevsim drivers/net/netdevsim/netdev.c:690 [inline]
   nsim_create+0x647/0x890 drivers/net/netdevsim/netdev.c:750
   __nsim_dev_port_add+0x6c0/0xae0 drivers/net/netdevsim/dev.c:1390
   nsim_dev_port_add_all drivers/net/netdevsim/dev.c:1446 [inline]
   nsim_dev_reload_create drivers/net/netdevsim/dev.c:1498 [inline]
   nsim_dev_reload_up+0x69b/0x8e0 drivers/net/netdevsim/dev.c:985
   devlink_reload+0x478/0x870 net/devlink/dev.c:474
   devlink_nl_reload_doit+0xbd6/0xe50 net/devlink/dev.c:586
   genl_family_rcv_msg_doit net/netlink/genetlink.c:1115 [inline]
   genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
   genl_rcv_msg+0xb14/0xec0 net/netlink/genetlink.c:1210
   netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
   genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
   netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
   netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
   netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
   sock_sendmsg_nosec net/socket.c:730 [inline]
   __sock_sendmsg+0x221/0x270 net/socket.c:745
   ____sys_sendmsg+0x525/0x7d0 net/socket.c:2585
   ___sys_sendmsg net/socket.c:2639 [inline]
   __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2668
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

to a SOFTIRQ-irq-unsafe lock:
 (netdev_rename_lock.seqcount){+.+.}-{0:0}

... which became SOFTIRQ-irq-unsafe at:
...
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
   do_write_seqcount_begin_nested include/linux/seqlock.h:469 [inline]
   do_write_seqcount_begin include/linux/seqlock.h:495 [inline]
   write_seqlock include/linux/seqlock.h:823 [inline]
   dev_change_name+0x184/0x920 net/core/dev.c:1229
   do_setlink+0xa4b/0x41f0 net/core/rtnetlink.c:2880
   __rtnl_newlink net/core/rtnetlink.c:3696 [inline]
   rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743
   rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
   netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
   netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
   netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
   netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
   sock_sendmsg_nosec net/socket.c:730 [inline]
   __sock_sendmsg+0x221/0x270 net/socket.c:745
   __sys_sendto+0x3a4/0x4f0 net/socket.c:2192
   __do_sys_sendto net/socket.c:2204 [inline]
   __se_sys_sendto net/socket.c:2200 [inline]
   __x64_sys_sendto+0xde/0x100 net/socket.c:2200
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

other info that might help us debug this:

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(netdev_rename_lock.seqcount);
                               local_irq_disable();
                               lock(&br->lock);
                               lock(netdev_rename_lock.seqcount);
  <Interrupt>
    lock(&br->lock);

 *** DEADLOCK ***

3 locks held by syz-executor.2/9449:
  #0: ffffffff8f5e7448 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock net/core/rtnetlink.c:79 [inline]
  #0: ffffffff8f5e7448 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x842/0x1180 net/core/rtnetlink.c:6632
  #1: ffff888060c64cb8 (&br->lock){+.-.}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
  #1: ffff888060c64cb8 (&br->lock){+.-.}-{2:2}, at: br_port_slave_changelink+0x3d/0x150 net/bridge/br_netlink.c:1212
  #2: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:329 [inline]
  #2: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:781 [inline]
  #2: ffffffff8e333fa0 (rcu_read_lock){....}-{1:2}, at: team_change_rx_flags+0x29/0x330 drivers/net/team/team_core.c:1767

the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
-> (&br->lock){+.-.}-{2:2} {
   HARDIRQ-ON-W at:
                     lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
                     __raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline]
                     _raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178
                     spin_lock_bh include/linux/spinlock.h:356 [inline]
                     br_add_if+0xb34/0xef0 net/bridge/br_if.c:682
                     do_set_master net/core/rtnetlink.c:2701 [inline]
                     do_setlink+0xe70/0x41f0 net/core/rtnetlink.c:2907
                     __rtnl_newlink net/core/rtnetlink.c:3696 [inline]
                     rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743
                     rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
                     netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
                     netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
                     netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
                     netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
                     sock_sendmsg_nosec net/socket.c:730 [inline]
                     __sock_sendmsg+0x221/0x270 net/socket.c:745
                     __sys_sendto+0x3a4/0x4f0 net/socket.c:2192
                     __do_sys_sendto net/socket.c:2204 [inline]
                     __se_sys_sendto net/socket.c:2200 [inline]
                     __x64_sys_sendto+0xde/0x100 net/socket.c:2200
                     do_syscall_x64 arch/x86/entry/common.c:52 [inline]
                     do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
                    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   IN-SOFTIRQ-W at:
                     lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
                     __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
                     _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
                     spin_lock include/linux/spinlock.h:351 [inline]
                     br_forward_delay_timer_expired+0x50/0x440 net/bridge/br_stp_timer.c:86
                     call_timer_fn+0x18e/0x650 kernel/time/timer.c:1792
                     expire_timers kernel/time/timer.c:1843 [inline]
                     __run_timers kernel/time/timer.c:2417 [inline]
                     __run_timer_base+0x66a/0x8e0 kernel/time/timer.c:2428
                     run_timer_base kernel/time/timer.c:2437 [inline]
                     run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2447
                     handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
                     __do_softirq kernel/softirq.c:588 [inline]
                     invoke_softirq kernel/softirq.c:428 [inline]
                     __irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637
                     irq_exit_rcu+0x9/0x30 kernel/softirq.c:649
                     instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1043 [inline]
                     sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1043
                     asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
                     lock_acquire+0x264/0x550 kernel/locking/lockdep.c:5758
                     fs_reclaim_acquire+0xaf/0x140 mm/page_alloc.c:3800
                     might_alloc include/linux/sched/mm.h:334 [inline]
                     slab_pre_alloc_hook mm/slub.c:3890 [inline]
                     slab_alloc_node mm/slub.c:3980 [inline]
                     kmalloc_trace_noprof+0x3d/0x2c0 mm/slub.c:4147
                     kmalloc_noprof include/linux/slab.h:660 [inline]
                     kzalloc_noprof include/linux/slab.h:778 [inline]
                     class_dir_create_and_add drivers/base/core.c:3255 [inline]
                     get_device_parent+0x2a7/0x410 drivers/base/core.c:3315
                     device_add+0x325/0xbf0 drivers/base/core.c:3645
                     netdev_register_kobject+0x17e/0x320 net/core/net-sysfs.c:2136
                     register_netdevice+0x11d5/0x19e0 net/core/dev.c:10375
                     nsim_init_netdevsim drivers/net/netdevsim/netdev.c:690 [inline]
                     nsim_create+0x647/0x890 drivers/net/netdevsim/netdev.c:750
                     __nsim_dev_port_add+0x6c0/0xae0 drivers/net/netdevsim/dev.c:1390
                     nsim_dev_port_add_all drivers/net/netdevsim/dev.c:1446 [inline]
                     nsim_dev_reload_create drivers/net/netdevsim/dev.c:1498 [inline]
                     nsim_dev_reload_up+0x69b/0x8e0 drivers/net/netdevsim/dev.c:985
                     devlink_reload+0x478/0x870 net/devlink/dev.c:474
                     devlink_nl_reload_doit+0xbd6/0xe50 net/devlink/dev.c:586
                     genl_family_rcv_msg_doit net/netlink/genetlink.c:1115 [inline]
                     genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
                     genl_rcv_msg+0xb14/0xec0 net/netlink/genetlink.c:1210
                     netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
                     genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
                     netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
                     netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
                     netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
                     sock_sendmsg_nosec net/socket.c:730 [inline]
                     __sock_sendmsg+0x221/0x270 net/socket.c:745
                     ____sys_sendmsg+0x525/0x7d0 net/socket.c:2585
                     ___sys_sendmsg net/socket.c:2639 [inline]
                     __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2668
                     do_syscall_x64 arch/x86/entry/common.c:52 [inline]
                     do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
                    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   INITIAL USE at:
                    lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
                    __raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline]
                    _raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178
                    spin_lock_bh include/linux/spinlock.h:356 [inline]
                    br_add_if+0xb34/0xef0 net/bridge/br_if.c:682
                    do_set_master net/core/rtnetlink.c:2701 [inline]
                    do_setlink+0xe70/0x41f0 net/core/rtnetlink.c:2907
                    __rtnl_newlink net/core/rtnetlink.c:3696 [inline]
                    rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743
                    rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
                    netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
                    netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
                    netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
                    netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
                    sock_sendmsg_nosec net/socket.c:730 [inline]
                    __sock_sendmsg+0x221/0x270 net/socket.c:745
                    __sys_sendto+0x3a4/0x4f0 net/socket.c:2192
                    __do_sys_sendto net/socket.c:2204 [inline]
                    __se_sys_sendto net/socket.c:2200 [inline]
                    __x64_sys_sendto+0xde/0x100 net/socket.c:2200
                    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
                    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
                   entry_SYSCALL_64_after_hwframe+0x77/0x7f
 }
 ... key      at: [<ffffffff94b9a1a0>] br_dev_setup.__key+0x0/0x20

the dependencies between the lock to be acquired
 and SOFTIRQ-irq-unsafe lock:
-> (netdev_rename_lock.seqcount){+.+.}-{0:0} {
   HARDIRQ-ON-W at:
                     lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
                     do_write_seqcount_begin_nested include/linux/seqlock.h:469 [inline]
                     do_write_seqcount_begin include/linux/seqlock.h:495 [inline]
                     write_seqlock include/linux/seqlock.h:823 [inline]
                     dev_change_name+0x184/0x920 net/core/dev.c:1229
                     do_setlink+0xa4b/0x41f0 net/core/rtnetlink.c:2880
                     __rtnl_newlink net/core/rtnetlink.c:3696 [inline]
                     rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743
                     rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
                     netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
                     netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
                     netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
                     netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
                     sock_sendmsg_nosec net/socket.c:730 [inline]
                     __sock_sendmsg+0x221/0x270 net/socket.c:745
                     __sys_sendto+0x3a4/0x4f0 net/socket.c:2192
                     __do_sys_sendto net/socket.c:2204 [inline]
                     __se_sys_sendto net/socket.c:2200 [inline]
                     __x64_sys_sendto+0xde/0x100 net/socket.c:2200
                     do_syscall_x64 arch/x86/entry/common.c:52 [inline]
                     do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
                    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   SOFTIRQ-ON-W at:
                     lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
                     do_write_seqcount_begin_nested include/linux/seqlock.h:469 [inline]
                     do_write_seqcount_begin include/linux/seqlock.h:495 [inline]
                     write_seqlock include/linux/seqlock.h:823 [inline]
                     dev_change_name+0x184/0x920 net/core/dev.c:1229
                     do_setlink+0xa4b/0x41f0 net/core/rtnetlink.c:2880
                     __rtnl_newlink net/core/rtnetlink.c:3696 [inline]
                     rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743
                     rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
                     netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
                     netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
                     netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
                     netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
                     sock_sendmsg_nosec net/socket.c:730 [inline]
                     __sock_sendmsg+0x221/0x270 net/socket.c:745
                     __sys_sendto+0x3a4/0x4f0 net/socket.c:2192
                     __do_sys_sendto net/socket.c:2204 [inline]
                     __se_sys_sendto net/socket.c:2200 [inline]
                     __x64_sys_sendto+0xde/0x100 net/socket.c:2200
                     do_syscall_x64 arch/x86/entry/common.c:52 [inline]
                     do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
                    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   INITIAL USE at:
                    lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
                    do_write_seqcount_begin_nested include/linux/seqlock.h:469 [inline]
                    do_write_seqcount_begin include/linux/seqlock.h:495 [inline]
                    write_seqlock include/linux/seqlock.h:823 [inline]
                    dev_change_name+0x184/0x920 net/core/dev.c:1229
                    do_setlink+0xa4b/0x41f0 net/core/rtnetlink.c:2880
                    __rtnl_newlink net/core/rtnetlink.c:3696 [inline]
                    rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3743
                    rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
                    netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
                    netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
                    netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
                    netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
                    sock_sendmsg_nosec net/socket.c:730 [inline]
                    __sock_sendmsg+0x221/0x270 net/socket.c:745
                    __sys_sendto+0x3a4/0x4f0 net/socket.c:2192
                    __do_sys_sendto net/socket.c:2204 [inline]
                    __se_sys_sendto net/socket.c:2200 [inline]
                    __x64_sys_sendto+0xde/0x100 net/socket.c:2200
                    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
                    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
                   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   INITIAL READ USE at:
                         lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
                         seqcount_lockdep_reader_access include/linux/seqlock.h:72 [inline]
                         read_seqbegin include/linux/seqlock.h:772 [inline]
                         netdev_copy_name+0x168/0x2c0 net/core/dev.c:949
                         rtnl_fill_ifinfo+0x38e/0x2270 net/core/rtnetlink.c:1839
                         rtmsg_ifinfo_build_skb+0x18a/0x260 net/core/rtnetlink.c:4073
                         rtmsg_ifinfo_event net/core/rtnetlink.c:4107 [inline]
                         rtmsg_ifinfo+0x91/0x1b0 net/core/rtnetlink.c:4116
                         register_netdevice+0x1665/0x19e0 net/core/dev.c:10422
                         register_netdev+0x3b/0x50 net/core/dev.c:10512
                         loopback_net_init+0x73/0x150 drivers/net/loopback.c:217
                         ops_init+0x359/0x610 net/core/net_namespace.c:139
                         __register_pernet_operations net/core/net_namespace.c:1247 [inline]
                         register_pernet_operations+0x2cb/0x660 net/core/net_namespace.c:1320
                         register_pernet_device+0x33/0x80 net/core/net_namespace.c:1407
                         net_dev_init+0xfcd/0x10d0 net/core/dev.c:11956
                         do_one_initcall+0x248/0x880 init/main.c:1267
                         do_initcall_level+0x157/0x210 init/main.c:1329
                         do_initcalls+0x3f/0x80 init/main.c:1345
                         kernel_init_freeable+0x435/0x5d0 init/main.c:1578
                         kernel_init+0x1d/0x2b0 init/main.c:1467
                         ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
                         ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 }
 ... key      at: [<ffffffff8f5de668>] netdev_rename_lock+0x8/0xa0
 ... acquired at:
    lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
    seqcount_lockdep_reader_access include/linux/seqlock.h:72 [inline]
    read_seqbegin include/linux/seqlock.h:772 [inline]
    netdev_copy_name+0x168/0x2c0 net/core/dev.c:949
    rtnl_fill_ifinfo+0x38e/0x2270 net/core/rtnetlink.c:1839
    rtmsg_ifinfo_build_skb+0x18a/0x260 net/core/rtnetlink.c:4073
    rtmsg_ifinfo_event net/core/rtnetlink.c:4107 [inline]
    rtmsg_ifinfo+0x91/0x1b0 net/core/rtnetlink.c:4116
    __dev_notify_flags+0xf7/0x400 net/core/dev.c:8816
    __dev_set_promiscuity+0x152/0x5a0 net/core/dev.c:8588
    dev_set_promiscuity+0x51/0xe0 net/core/dev.c:8608
    team_change_rx_flags+0x203/0x330 drivers/net/team/team_core.c:1771
    dev_change_rx_flags net/core/dev.c:8541 [inline]
    __dev_set_promiscuity+0x406/0x5a0 net/core/dev.c:8585
    dev_set_promiscuity+0x51/0xe0 net/core/dev.c:8608
    br_port_clear_promisc net/bridge/br_if.c:135 [inline]
    br_manage_promisc+0x505/0x590 net/bridge/br_if.c:172
    nbp_update_port_count net/bridge/br_if.c:242 [inline]
    br_port_flags_change+0x161/0x1f0 net/bridge/br_if.c:761
    br_setport+0xcb5/0x16d0 net/bridge/br_netlink.c:1000
    br_port_slave_changelink+0x135/0x150 net/bridge/br_netlink.c:1213
    __rtnl_newlink net/core/rtnetlink.c:3689 [inline]
    rtnl_newlink+0x169f/0x20a0 net/core/rtnetlink.c:3743
    rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
    netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
    netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
    netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
    netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
    sock_sendmsg_nosec net/socket.c:730 [inline]
    __sock_sendmsg+0x221/0x270 net/socket.c:745
    ____sys_sendmsg+0x525/0x7d0 net/socket.c:2585
    ___sys_sendmsg net/socket.c:2639 [inline]
    __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2668
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

stack backtrace:
CPU: 0 PID: 9449 Comm: syz-executor.2 Not tainted 6.10.0-rc2-syzkaller-00249-gbe27b8965297 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/07/2024
Call Trace:
 <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
  print_bad_irq_dependency kernel/locking/lockdep.c:2626 [inline]
  check_irq_usage kernel/locking/lockdep.c:2865 [inline]
  check_prev_add kernel/locking/lockdep.c:3138 [inline]
  check_prevs_add kernel/locking/lockdep.c:3253 [inline]
  validate_chain+0x4de0/0x5900 kernel/locking/lockdep.c:3869
  __lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137
  lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
  seqcount_lockdep_reader_access include/linux/seqlock.h:72 [inline]
  read_seqbegin include/linux/seqlock.h:772 [inline]
  netdev_copy_name+0x168/0x2c0 net/core/dev.c:949
  rtnl_fill_ifinfo+0x38e/0x2270 net/core/rtnetlink.c:1839
  rtmsg_ifinfo_build_skb+0x18a/0x260 net/core/rtnetlink.c:4073
  rtmsg_ifinfo_event net/core/rtnetlink.c:4107 [inline]
  rtmsg_ifinfo+0x91/0x1b0 net/core/rtnetlink.c:4116
  __dev_notify_flags+0xf7/0x400 net/core/dev.c:8816
  __dev_set_promiscuity+0x152/0x5a0 net/core/dev.c:8588
  dev_set_promiscuity+0x51/0xe0 net/core/dev.c:8608
  team_change_rx_flags+0x203/0x330 drivers/net/team/team_core.c:1771
  dev_change_rx_flags net/core/dev.c:8541 [inline]
  __dev_set_promiscuity+0x406/0x5a0 net/core/dev.c:8585
  dev_set_promiscuity+0x51/0xe0 net/core/dev.c:8608
  br_port_clear_promisc net/bridge/br_if.c:135 [inline]
  br_manage_promisc+0x505/0x590 net/bridge/br_if.c:172
  nbp_update_port_count net/bridge/br_if.c:242 [inline]
  br_port_flags_change+0x161/0x1f0 net/bridge/br_if.c:761
  br_setport+0xcb5/0x16d0 net/bridge/br_netlink.c:1000
  br_port_slave_changelink+0x135/0x150 net/bridge/br_netlink.c:1213
  __rtnl_newlink net/core/rtnetlink.c:3689 [inline]
  rtnl_newlink+0x169f/0x20a0 net/core/rtnetlink.c:3743
  rtnetlink_rcv_msg+0x89b/0x1180 net/core/rtnetlink.c:6635
  netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2564
  netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
  netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
  netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
  sock_sendmsg_nosec net/socket.c:730 [inline]
  __sock_sendmsg+0x221/0x270 net/socket.c:745
  ____sys_sendmsg+0x525/0x7d0 net/socket.c:2585
  ___sys_sendmsg net/socket.c:2639 [inline]
  __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2668
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f3b3047cf29
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f3b311740c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f3b305b4050 RCX: 00007f3b3047cf29
RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000008
RBP: 00007f3b304ec074 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f3b305b4050 R15: 00007ffca2f3dc68
 </TASK>

Fixes: 0840556e5a ("net: Protect dev->name by seqlock.")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-21 12:18:34 +01:00
Jakub Kicinski a6ec08beec Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  1e7962114c ("bnxt_en: Restore PTP tx_avail count in case of skb_pad() error")
  165f87691a ("bnxt_en: add timestamping statistics support")

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-20 13:49:59 -07:00
Ignat Korchagin 6cd4a78d96 net: do not leave a dangling sk pointer, when socket creation fails
It is possible to trigger a use-after-free by:
  * attaching an fentry probe to __sock_release() and the probe calling the
    bpf_get_socket_cookie() helper
  * running traceroute -I 1.1.1.1 on a freshly booted VM

A KASAN enabled kernel will log something like below (decoded and stripped):
==================================================================
BUG: KASAN: slab-use-after-free in __sock_gen_cookie (./arch/x86/include/asm/atomic64_64.h:15 ./include/linux/atomic/atomic-arch-fallback.h:2583 ./include/linux/atomic/atomic-instrumented.h:1611 net/core/sock_diag.c:29)
Read of size 8 at addr ffff888007110dd8 by task traceroute/299

CPU: 2 PID: 299 Comm: traceroute Tainted: G            E      6.10.0-rc2+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
dump_stack_lvl (lib/dump_stack.c:117 (discriminator 1))
print_report (mm/kasan/report.c:378 mm/kasan/report.c:488)
? __sock_gen_cookie (./arch/x86/include/asm/atomic64_64.h:15 ./include/linux/atomic/atomic-arch-fallback.h:2583 ./include/linux/atomic/atomic-instrumented.h:1611 net/core/sock_diag.c:29)
kasan_report (mm/kasan/report.c:603)
? __sock_gen_cookie (./arch/x86/include/asm/atomic64_64.h:15 ./include/linux/atomic/atomic-arch-fallback.h:2583 ./include/linux/atomic/atomic-instrumented.h:1611 net/core/sock_diag.c:29)
kasan_check_range (mm/kasan/generic.c:183 mm/kasan/generic.c:189)
__sock_gen_cookie (./arch/x86/include/asm/atomic64_64.h:15 ./include/linux/atomic/atomic-arch-fallback.h:2583 ./include/linux/atomic/atomic-instrumented.h:1611 net/core/sock_diag.c:29)
bpf_get_socket_ptr_cookie (./arch/x86/include/asm/preempt.h:94 ./include/linux/sock_diag.h:42 net/core/filter.c:5094 net/core/filter.c:5092)
bpf_prog_875642cf11f1d139___sock_release+0x6e/0x8e
bpf_trampoline_6442506592+0x47/0xaf
__sock_release (net/socket.c:652)
__sock_create (net/socket.c:1601)
...
Allocated by task 299 on cpu 2 at 78.328492s:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:68)
__kasan_slab_alloc (mm/kasan/common.c:312 mm/kasan/common.c:338)
kmem_cache_alloc_noprof (mm/slub.c:3941 mm/slub.c:4000 mm/slub.c:4007)
sk_prot_alloc (net/core/sock.c:2075)
sk_alloc (net/core/sock.c:2134)
inet_create (net/ipv4/af_inet.c:327 net/ipv4/af_inet.c:252)
__sock_create (net/socket.c:1572)
__sys_socket (net/socket.c:1660 net/socket.c:1644 net/socket.c:1706)
__x64_sys_socket (net/socket.c:1718)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Freed by task 299 on cpu 2 at 78.328502s:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:68)
kasan_save_free_info (mm/kasan/generic.c:582)
poison_slab_object (mm/kasan/common.c:242)
__kasan_slab_free (mm/kasan/common.c:256)
kmem_cache_free (mm/slub.c:4437 mm/slub.c:4511)
__sk_destruct (net/core/sock.c:2117 net/core/sock.c:2208)
inet_create (net/ipv4/af_inet.c:397 net/ipv4/af_inet.c:252)
__sock_create (net/socket.c:1572)
__sys_socket (net/socket.c:1660 net/socket.c:1644 net/socket.c:1706)
__x64_sys_socket (net/socket.c:1718)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Fix this by clearing the struct socket reference in sk_common_release() to cover
all protocol families create functions, which may already attached the
reference to the sk object with sock_init_data().

Fixes: c5dbb89fc2 ("bpf: Expose bpf_get_socket_cookie to tracing programs")
Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/netdev/20240613194047.36478-1-kuniyu@amazon.com/T/
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240617210205.67311-1-ignat@cloudflare.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-20 10:43:14 +02:00
Yan Zhai ba8de796ba net: introduce sk_skb_reason_drop function
Long used destructors kfree_skb and kfree_skb_reason do not pass
receiving socket to packet drop tracepoints trace_kfree_skb.
This makes it hard to track packet drops of a certain netns (container)
or a socket (user application).

The naming of these destructors are also not consistent with most sk/skb
operating functions, i.e. functions named "sk_xxx" or "skb_xxx".
Introduce a new functions sk_skb_reason_drop as drop-in replacement for
kfree_skb_reason on local receiving path. Callers can now pass receiving
sockets to the tracepoints.

kfree_skb and kfree_skb_reason are still usable but they are now just
inline helpers that call sk_skb_reason_drop.

Note it is not feasible to do the same to consume_skb. Packets not
dropped can flow through multiple receive handlers, and have multiple
receiving sockets. Leave it untouched for now.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 12:44:22 +01:00
Yan Zhai c53795d48e net: add rx_sk to trace_kfree_skb
skb does not include enough information to find out receiving
sockets/services and netns/containers on packet drops. In theory
skb->dev tells about netns, but it can get cleared/reused, e.g. by TCP
stack for OOO packet lookup. Similarly, skb->sk often identifies a local
sender, and tells nothing about a receiver.

Allow passing an extra receiving socket to the tracepoint to improve
the visibility on receiving drops.

Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 12:44:22 +01:00
Yue Haibing ff960f9d3e netns: Make get_net_ns() handle zero refcount net
Syzkaller hit a warning:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 3 PID: 7890 at lib/refcount.c:25 refcount_warn_saturate+0xdf/0x1d0
Modules linked in:
CPU: 3 PID: 7890 Comm: tun Not tainted 6.10.0-rc3-00100-gcaa4f9578aba-dirty #310
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:refcount_warn_saturate+0xdf/0x1d0
Code: 41 49 04 31 ff 89 de e8 9f 1e cd fe 84 db 75 9c e8 76 26 cd fe c6 05 b6 41 49 04 01 90 48 c7 c7 b8 8e 25 86 e8 d2 05 b5 fe 90 <0f> 0b 90 90 e9 79 ff ff ff e8 53 26 cd fe 0f b6 1
RSP: 0018:ffff8881067b7da0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff811c72ac
RDX: ffff8881026a2140 RSI: ffffffff811c72b5 RDI: 0000000000000001
RBP: ffff8881067b7db0 R08: 0000000000000000 R09: 205b5d3730353139
R10: 0000000000000000 R11: 205d303938375420 R12: ffff8881086500c4
R13: ffff8881086500c4 R14: ffff8881086500b0 R15: ffff888108650040
FS:  00007f5b2961a4c0(0000) GS:ffff88823bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055d7ed36fd18 CR3: 00000001482f6000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? show_regs+0xa3/0xc0
 ? __warn+0xa5/0x1c0
 ? refcount_warn_saturate+0xdf/0x1d0
 ? report_bug+0x1fc/0x2d0
 ? refcount_warn_saturate+0xdf/0x1d0
 ? handle_bug+0xa1/0x110
 ? exc_invalid_op+0x3c/0xb0
 ? asm_exc_invalid_op+0x1f/0x30
 ? __warn_printk+0xcc/0x140
 ? __warn_printk+0xd5/0x140
 ? refcount_warn_saturate+0xdf/0x1d0
 get_net_ns+0xa4/0xc0
 ? __pfx_get_net_ns+0x10/0x10
 open_related_ns+0x5a/0x130
 __tun_chr_ioctl+0x1616/0x2370
 ? __sanitizer_cov_trace_switch+0x58/0xa0
 ? __sanitizer_cov_trace_const_cmp2+0x1c/0x30
 ? __pfx_tun_chr_ioctl+0x10/0x10
 tun_chr_ioctl+0x2f/0x40
 __x64_sys_ioctl+0x11b/0x160
 x64_sys_call+0x1211/0x20d0
 do_syscall_64+0x9e/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5b28f165d7
Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 8
RSP: 002b:00007ffc2b59c5e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5b28f165d7
RDX: 0000000000000000 RSI: 00000000000054e3 RDI: 0000000000000003
RBP: 00007ffc2b59c650 R08: 00007f5b291ed8c0 R09: 00007f5b2961a4c0
R10: 0000000029690010 R11: 0000000000000246 R12: 0000000000400730
R13: 00007ffc2b59cf40 R14: 0000000000000000 R15: 0000000000000000
 </TASK>
Kernel panic - not syncing: kernel: panic_on_warn set ...

This is trigger as below:
          ns0                                    ns1
tun_set_iff() //dev is tun0
   tun->dev = dev
//ip link set tun0 netns ns1
                                       put_net() //ref is 0
__tun_chr_ioctl() //TUNGETDEVNETNS
   net = dev_net(tun->dev);
   open_related_ns(&net->ns, get_net_ns); //ns1
     get_net_ns()
        get_net() //addition on 0

Use maybe_get_net() in get_net_ns in case net's ref is zero to fix this

Fixes: 0c3e0e3bb6 ("tun: Add ioctl() TUNGETDEVNETNS cmd to allow obtaining real net ns of tun device")
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20240614131302.2698509-1-yuehaibing@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-18 10:59:52 +02:00
Kory Maincent efb459303d net: Move dev_set_hwtstamp_phylib to net/core/dev.h
This declaration was added to the header to be called from ethtool.
ethtool is separated from core for code organization but it is not really
a separate entity, it controls very core things.
As ethtool is an internal stuff it is not wise to have it in netdevice.h.
Move the declaration to net/core/dev.h instead.

Remove the EXPORT_SYMBOL_GPL call as ethtool can not be built as a module.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://lore.kernel.org/r/20240612-feature_ptp_netnext-v15-2-b2a086257b63@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-17 18:25:53 -07:00
Sagi Grimberg 934c29999b net: micro-optimize skb_datagram_iter
We only use the mapping in a single context in a short and contained scope,
so kmap_local_page is sufficient and cheaper. This will also allow
skb_datagram_iter to be called from softirq context.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20240613113504.1079860-1-sagi@grimberg.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-14 19:32:48 -07:00
Jakub Kicinski 7ed352d34f netdev-genl: fix error codes when outputting XDP features
-EINVAL will interrupt the dump. The correct error to return
if we have more data to dump is -EMSGSIZE.

Discovered by doing:

  for i in `seq 80`; do ip link add type veth; done
  ./cli.py --dbg-small-recv 5300 --spec netdev.yaml --dump dev-get >> /dev/null
  [...]
     nl_len = 64 (48) nl_flags = 0x0 nl_type = 19
     nl_len = 20 (4) nl_flags = 0x2 nl_type = 3
  	error: -22

Fixes: d3d854fd6a ("netdev-genl: create a simple family for netdev stuff")
Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240613213044.3675745-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-14 18:04:29 -07:00
Alexei Starovoitov 124e8c2b1b bpf: Relax tuple len requirement for sk helpers.
__bpf_skc_lookup() safely handles incorrect values of tuple len,
hence we can allow zero to be passed as tuple len.
This patch alone doesn't make an observable verifier difference.
It's a trivial improvement that might simplify bpf programs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613013815.953-2-alexei.starovoitov@gmail.com
2024-06-14 21:52:39 +02:00
Florian Westphal 2bbe3e5a2f bpf: Avoid splat in pskb_pull_reason
syzkaller builds (CONFIG_DEBUG_NET=y) frequently trigger a debug
hint in pskb_may_pull.

We'd like to retain this debug check because it might hint at integer
overflows and other issues (kernel code should pull headers, not huge
value).

In bpf case, this splat isn't interesting at all: such (nonsensical)
bpf programs are typically generated by a fuzzer anyway.

Do what Eric suggested and suppress such warning.

For CONFIG_DEBUG_NET=n we don't need the extra check because
pskb_may_pull will do the right thing: return an error without the
WARN() backtrace.

Fixes: 219eee9c0d ("net: skbuff: add overflow debug check to pull/push helpers")
Reported-by: syzbot+0c4150bff9fff3bf023c@syzkaller.appspotmail.com
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Closes: https://syzkaller.appspot.com/bug?extid=0c4150bff9fff3bf023c
Link: https://lore.kernel.org/netdev/9f254c96-54f2-4457-b7ab-1d9f6187939c@gmail.com/
Link: https://lore.kernel.org/bpf/20240614101801.9496-1-fw@strlen.de
2024-06-14 17:20:21 +02:00
Petr Machata 4ee2a8cace net: ipv4: Add a sysctl to set multipath hash seed
When calculating hashes for the purpose of multipath forwarding, both IPv4
and IPv6 code currently fall back on flow_hash_from_keys(). That uses a
randomly-generated seed. That's a fine choice by default, but unfortunately
some deployments may need a tighter control over the seed used.

In this patch, make the seed configurable by adding a new sysctl key,
net.ipv4.fib_multipath_hash_seed to control the seed. This seed is used
specifically for multipath forwarding and not for the other concerns that
flow_hash_from_keys() is used for, such as queue selection. Expose the knob
as sysctl because other such settings, such as headers to hash, are also
handled that way. Like those, the multipath hash seed is a per-netns
variable.

Despite being placed in the net.ipv4 namespace, the multipath seed sysctl
is used for both IPv4 and IPv6, similarly to e.g. a number of TCP
variables.

The seed used by flow_hash_from_keys() is a 128-bit quantity. However it
seems that usually the seed is a much more modest value. 32 bits seem
typical (Cisco, Cumulus), some systems go even lower. For that reason, and
to decouple the user interface from implementation details, go with a
32-bit quantity, which is then quadruplicated to form the siphash key.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240607151357.421181-3-petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12 16:42:11 -07:00
Florian Westphal d1dab4f71d net: add and use __skb_get_hash_symmetric_net
Similar to previous patch: apply same logic for
__skb_get_hash_symmetric and let callers pass the netns to the dissector
core.

Existing function is turned into a wrapper to avoid adjusting all
callers, nft_hash.c uses new function.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240608221057.16070-3-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12 14:33:38 -07:00
Florian Westphal b975d3ee59 net: add and use skb_get_hash_net
Years ago flow dissector gained ability to delegate flow dissection
to a bpf program, scoped per netns.

Unfortunately, skb_get_hash() only gets an sk_buff argument instead
of both net+skb.  This means the flow dissector needs to obtain the
netns pointer from somewhere else.

The netns is derived from skb->dev, and if that is not available, from
skb->sk.  If neither is set, we hit a (benign) WARN_ON_ONCE().

Trying both dev and sk covers most cases, but not all, as recently
reported by Christoph Paasch.

In case of nf-generated tcp reset, both sk and dev are NULL:

WARNING: .. net/core/flow_dissector.c:1104
 skb_flow_dissect_flow_keys include/linux/skbuff.h:1536 [inline]
 skb_get_hash include/linux/skbuff.h:1578 [inline]
 nft_trace_init+0x7d/0x120 net/netfilter/nf_tables_trace.c:320
 nft_do_chain+0xb26/0xb90 net/netfilter/nf_tables_core.c:268
 nft_do_chain_ipv4+0x7a/0xa0 net/netfilter/nft_chain_filter.c:23
 nf_hook_slow+0x57/0x160 net/netfilter/core.c:626
 __ip_local_out+0x21d/0x260 net/ipv4/ip_output.c:118
 ip_local_out+0x26/0x1e0 net/ipv4/ip_output.c:127
 nf_send_reset+0x58c/0x700 net/ipv4/netfilter/nf_reject_ipv4.c:308
 nft_reject_ipv4_eval+0x53/0x90 net/ipv4/netfilter/nft_reject_ipv4.c:30
 [..]

syzkaller did something like this:
table inet filter {
  chain input {
    type filter hook input priority filter; policy accept;
    meta nftrace set 1
    tcp dport 42 reject with tcp reset
   }
   chain output {
    type filter hook output priority filter; policy accept;
    # empty chain is enough
   }
}

... then sends a tcp packet to port 42.

Initial attempt to simply set skb->dev from nf_reject_ipv4 doesn't cover
all cases: skbs generated via ipv4 igmp_send_report trigger similar splat.

Moreover, Pablo Neira found that nft_hash.c uses __skb_get_hash_symmetric()
which would trigger same warn splat for such skbs.

Lets allow callers to pass the current netns explicitly.
The nf_trace infrastructure is adjusted to use the new helper.

__skb_get_hash_symmetric is handled in the next patch.

Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/494
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240608221057.16070-2-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12 14:33:38 -07:00
Daniel Xu cce4c40b96 bpf: treewide: Align kfunc signatures to prog point-of-view
Previously, kfunc declarations in bpf_kfuncs.h (and others) used "user
facing" types for kfuncs prototypes while the actual kfunc definitions
used "kernel facing" types. More specifically: bpf_dynptr vs
bpf_dynptr_kern, __sk_buff vs sk_buff, and xdp_md vs xdp_buff.

It wasn't an issue before, as the verifier allows aliased types.
However, since we are now generating kfunc prototypes in vmlinux.h (in
addition to keeping bpf_kfuncs.h around), this conflict creates
compilation errors.

Fix this conflict by using "user facing" types in kfunc definitions.
This results in more casts, but otherwise has no additional runtime
cost.

Note, similar to 5b268d1ebc ("bpf: Have bpf_rdonly_cast() take a const
pointer"), we also make kfuncs take const arguments where appropriate in
order to make the kfunc more permissive.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/b58346a63a0e66bc9b7504da751b526b0b189a67.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-12 11:01:31 -07:00
Jeremy Kerr 94b601bc4f net: core: Implement dstats-type stats collections
We currently have dev_get_tstats64() for collecting per-cpu stats of
type pcpu_sw_netstats ("tstats"). However, tstats doesn't allow for
accounting tx/rx drops. We do have a stats variant that does have stats
for dropped packets: struct pcpu_dstats, but there are no core helpers
for using those stats.

The VRF driver uses dstats, by providing its own collation/fetch
functions to do so.

This change adds a common implementation for dstats-type collection,
used when pcpu_stat_type == NETDEV_PCPU_STAT_DSTAT. This is based on the
VRF driver's existing stats collator (plus the unused tx_drops stat from
there). We will switch the VRF driver to use this in the next change.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240607-dstats-v3-2-cc781fe116f7@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-11 19:24:56 -07:00
Jakub Kicinski 5380d64f8d rtnetlink: move rtnl_lock handling out of af_netlink
Now that we have an intermediate layer of code for handling
rtnl-level netlink dump quirks, we can move the rtnl_lock
taking there.

For dump handlers with RTNL_FLAG_DUMP_SPLIT_NLM_DONE we can
avoid taking rtnl_lock just to generate NLM_DONE, once again.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-10 13:15:40 +01:00
David Wei 3e61103b2f page_pool: remove WARN_ON() with OR
Having an OR in WARN_ON() makes me sad because it's impossible to tell
which condition is true when triggered.

Split a WARN_ON() with an OR in page_pool_disable_direct_recycling().

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-09 15:50:43 +01:00
Jakub Kicinski 62b5bf58b9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/pensando/ionic/ionic_txrx.c
  d9c0420999 ("ionic: Mark error paths in the data path as unlikely")
  491aee894a ("ionic: fix kernel panic in XDP_TX action")

net/ipv6/ip6_fib.c
  b4cb4a1391 ("net: use unrcu_pointer() helper")
  b01e1c0307 ("ipv6: fix possible race in __fib6_drop_pcpu_from()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-06 12:06:56 -07:00
Eric Dumazet b4cb4a1391 net: use unrcu_pointer() helper
Toke mentioned unrcu_pointer() existence, allowing
to remove some of the ugly casts we have when using
xchg() for rcu protected pointers.

Also make inet_rcv_compat const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20240604111603.45871-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 11:52:52 +02:00
Jakub Kicinski 5b4b62a169 rtnetlink: make the "split" NLM_DONE handling generic
Jaroslav reports Dell's OMSA Systems Management Data Engine
expects NLM_DONE in a separate recvmsg(), both for rtnl_dump_ifinfo()
and inet_dump_ifaddr(). We already added a similar fix previously in
commit 460b0d33cf ("inet: bring NLM_DONE out to a separate recv() again")

Instead of modifying all the dump handlers, and making them look
different than modern for_each_netdev_dump()-based dump handlers -
put the workaround in rtnetlink code. This will also help us move
the custom rtnl-locking from af_netlink in the future (in net-next).

Note that this change is not touching rtnl_dump_all(). rtnl_dump_all()
is different kettle of fish and a potential problem. We now mix families
in a single recvmsg(), but NLM_DONE is not coalesced.

Tested:

  ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_addr.yaml \
           --dump getaddr --json '{"ifa-family": 2}'

  ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_route.yaml \
           --dump getroute --json '{"rtm-family": 2}'

  ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_link.yaml \
           --dump getlink

Fixes: 3e41af9076 ("rtnetlink: use xarray iterator to implement rtnl_dump_ifinfo()")
Fixes: cdb2f80f1c ("inet: use xa_array iterator to implement inet_dump_ifaddr()")
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Link: https://lore.kernel.org/all/CAK8fFZ7MKoFSEzMBDAOjoUt+vTZRRQgLDNXEOfdCCXSoXXKE0g@mail.gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05 12:34:54 +01:00
Jakub Kicinski 99b8add01f net: skb: add compatibility warnings to skb_shift()
According to current semantics we should never try to shift data
between skbs which differ on decrypted or pp_recycle status.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04 13:23:30 +02:00
Davide Caratti 668b6a2ef8 flow_dissector: add support for tunnel control flags
Dissect [no]csum, [no]dontfrag, [no]oam, [no]crit flags from skb metadata.
This is a prerequisite for matching these control flags using TC flower.

Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04 11:16:38 +02:00
Eric Dumazet 2fe6fb36c7 net: dst_cache: add two DEBUG_NET warnings
After fixing four different bugs involving dst_cache
users, it might be worth adding a check about BH being
blocked by dst_cache callers.

DEBUG_NET_WARN_ON_ONCE(!in_softirq());

It is not fatal, if we missed valid case where no
BH deadlock is to be feared, we might change this.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240531132636.2637995-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-03 18:50:09 -07:00
Matteo Croce 19249c0724 net: make net.core.{r,w}mem_{default,max} namespaced
The following sysctl are global and can't be read from a netns:

net.core.rmem_default
net.core.rmem_max
net.core.wmem_default
net.core.wmem_max

Make the following sysctl parameters available readonly from within a
network namespace, allowing a container to read them.

Signed-off-by: Matteo Croce <teknoraver@meta.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://lore.kernel.org/r/20240530232722.45255-2-technoboy85@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01 16:03:21 -07:00
Jason Xing 8105378c0c net: rps: fix error when CONFIG_RFS_ACCEL is off
John Sperbeck reported that if we turn off CONFIG_RFS_ACCEL, the 'head'
is not defined, which will trigger compile error. So I move the 'head'
out of the CONFIG_RFS_ACCEL scope.

Fixes: 84b6823cd9 ("net: rps: protect last_qtail with rps_input_queue_tail_save() helper")
Reported-by: John Sperbeck <jsperbeck@google.com>
Closes: https://lore.kernel.org/all/20240529203421.2432481-1-jsperbeck@google.com/
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240530032717.57787-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01 16:02:08 -07:00
Abhishek Chauhan 73451e9aaa net: validate SO_TXTIME clockid coming from userspace
Currently there are no strict checks while setting SO_TXTIME
from userspace. With the recent development in skb->tstamp_type
clockid with unsupported clocks results in warn_on_once, which causes
unnecessary aborts in some systems which enables panic on warns.

Add validation in setsockopt to support only CLOCK_REALTIME,
CLOCK_MONOTONIC and CLOCK_TAI to be set from userspace.

Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/
Link: https://lore.kernel.org/lkml/6bdba7b6-fd22-4ea5-a356-12268674def1@quicinc.com/
Fixes: 1693c5db6a ("net: Add additional bit to support clockid_t timestamp type")
Reported-by: syzbot+d7b227731ec589e7f4f0@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d7b227731ec589e7f4f0
Reported-by: syzbot+30a35a2e9c5067cc43fa@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=30a35a2e9c5067cc43fa
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240529183130.1717083-1-quic_abchauha@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01 15:47:23 -07:00
Jakub Kicinski e19de2064f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/ti/icssg/icssg_classifier.c
  abd5576b9c ("net: ti: icssg-prueth: Add support for ICSSG switch firmware")
  56a5cf538c ("net: ti: icssg-prueth: Fix start counter for ft1 filter")
https://lore.kernel.org/all/20240531123822.3bb7eadf@canb.auug.org.au/

No other adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-31 14:10:28 -07:00
Thomas Weißschuh 874aa96d78 net/neighbour: constify ctl_table arguments of utility function
The sysctl core is preparing to only expose instances of
struct ctl_table as "const".
This will also affect the ctl_table argument of sysctl handlers.

As the function prototype of all sysctl handlers throughout the tree
needs to stay consistent that change will be done in one commit.

To reduce the size of that final commit, switch utility functions which
are not bound by "typedef proc_handler" to "const struct ctl_table".

No functional change.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20240527-sysctl-const-handler-net-v1-1-16523767d0b2@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-28 19:49:47 -07:00
Jakub Kicinski 4b3529edbb bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZlWtmQAKCRDbK58LschI
 g0TUAQDT76jx7Rq1DShCtZ3eqiBMNkYczK8b+GqNsSG8YGduaAEA1jn/GN+H65Rh
 atQZ/pYAfLZflMV04+XE0GyBr5q1uQg=
 =NczG
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-05-28

We've added 23 non-merge commits during the last 11 day(s) which contain
a total of 45 files changed, 696 insertions(+), 277 deletions(-).

The main changes are:

1) Rename skb's mono_delivery_time to tstamp_type for extensibility
   and add SKB_CLOCK_TAI type support to bpf_skb_set_tstamp(),
   from Abhishek Chauhan.

2) Add netfilter CT zone ID and direction to bpf_ct_opts so that arbitrary
   CT zones can be used from XDP/tc BPF netfilter CT helper functions,
   from Brad Cowie.

3) Several tweaks to the instruction-set.rst IETF doc to address
   the Last Call review comments, from Dave Thaler.

4) Small batch of riscv64 BPF JIT optimizations in order to emit more
   compressed instructions to the JITed image for better icache efficiency,
   from Xiao Wang.

5) Sort bpftool C dump output from BTF, aiming to simplify vmlinux.h
   diffing and forcing more natural type definitions ordering,
   from Mykyta Yatsenko.

6) Use DEV_STATS_INC() macro in BPF redirect helpers to silence
   a syzbot/KCSAN race report for the tx_errors counter,
   from Jiang Yunshui.

7) Un-constify bpf_func_info in bpftool to fix compilation with LLVM 17+
   which started treating const structs as constants and thus breaking
   full BTF program name resolution, from Ivan Babrou.

8) Fix up BPF program numbers in test_sockmap selftest in order to reduce
   some of the test-internal array sizes, from Geliang Tang.

9) Small cleanup in Makefile.btf script to use test-ge check for v1.25-only
   pahole, from Alan Maguire.

10) Fix bpftool's make dependencies for vmlinux.h in order to avoid needless
    rebuilds in some corner cases, from Artem Savkov.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits)
  bpf, net: Use DEV_STAT_INC()
  bpf, docs: Fix instruction.rst indentation
  bpf, docs: Clarify call local offset
  bpf, docs: Add table captions
  bpf, docs: clarify sign extension of 64-bit use of 32-bit imm
  bpf, docs: Use RFC 2119 language for ISA requirements
  bpf, docs: Move sentence about returning R0 to abi.rst
  bpf: constify member bpf_sysctl_kern:: Table
  riscv, bpf: Try RVC for reg move within BPF_CMPXCHG JIT
  riscv, bpf: Use STACK_ALIGN macro for size rounding up
  riscv, bpf: Optimize zextw insn with Zba extension
  selftests/bpf: Handle forwarding of UDP CLOCK_TAI packets
  net: Add additional bit to support clockid_t timestamp type
  net: Rename mono_delivery_time to tstamp_type for scalabilty
  selftests/bpf: Update tests for new ct zone opts for nf_conntrack kfuncs
  net: netfilter: Make ct zone opts configurable for bpf ct helpers
  selftests/bpf: Fix prog numbers in test_sockmap
  bpf: Remove unused variable "prev_state"
  bpftool: Un-const bpf_func_info to fix it for llvm 17 and newer
  bpf: Fix order of args in call to bpf_map_kvcalloc
  ...
====================

Link: https://lore.kernel.org/r/20240528105924.30905-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-28 07:27:29 -07:00
Gou Hao de31e96cf4 net/core: move the lockdep-init of sk_callback_lock to sk_init_common()
In commit cdfbabfb2f ("net: Work around lockdep limitation in
sockets that use sockets"), it introduces 'af_kern_callback_keys'
to lockdep-init of sk_callback_lock according to 'sk_kern_sock',
it modifies sock_init_data() only, and sk_clone_lock() calls
sk_init_common() to initialize sk_callback_lock too, so the
lockdep-init of sk_callback_lock should be moved to sk_init_common().

Signed-off-by: Gou Hao <gouhao@uniontech.com>
Link: https://lore.kernel.org/r/20240526145718.9542-2-gouhao@uniontech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-28 13:29:36 +02:00
Gou Hao c65b652111 net/core: remove redundant sk_callback_lock initialization
sk_callback_lock has already been initialized in sk_init_common().

Signed-off-by: Gou Hao <gouhao@uniontech.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240526145718.9542-1-gouhao@uniontech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-28 13:29:36 +02:00
Thadeu Lima de Souza Cascardo 4b4647add7 sock_map: avoid race between sock_map_close and sk_psock_put
sk_psock_get will return NULL if the refcount of psock has gone to 0, which
will happen when the last call of sk_psock_put is done. However,
sk_psock_drop may not have finished yet, so the close callback will still
point to sock_map_close despite psock being NULL.

This can be reproduced with a thread deleting an element from the sock map,
while the second one creates a socket, adds it to the map and closes it.

That will trigger the WARN_ON_ONCE:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 7220 at net/core/sock_map.c:1701 sock_map_close+0x2a2/0x2d0 net/core/sock_map.c:1701
Modules linked in:
CPU: 1 PID: 7220 Comm: syz-executor380 Not tainted 6.9.0-syzkaller-07726-g3c999d1ae3c7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
RIP: 0010:sock_map_close+0x2a2/0x2d0 net/core/sock_map.c:1701
Code: df e8 92 29 88 f8 48 8b 1b 48 89 d8 48 c1 e8 03 42 80 3c 20 00 74 08 48 89 df e8 79 29 88 f8 4c 8b 23 eb 89 e8 4f 15 23 f8 90 <0f> 0b 90 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d e9 13 26 3d 02
RSP: 0018:ffffc9000441fda8 EFLAGS: 00010293
RAX: ffffffff89731ae1 RBX: ffffffff94b87540 RCX: ffff888029470000
RDX: 0000000000000000 RSI: ffffffff8bcab5c0 RDI: ffffffff8c1faba0
RBP: 0000000000000000 R08: ffffffff92f9b61f R09: 1ffffffff25f36c3
R10: dffffc0000000000 R11: fffffbfff25f36c4 R12: ffffffff89731840
R13: ffff88804b587000 R14: ffff88804b587000 R15: ffffffff89731870
FS:  000055555e080380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000000207d4000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 unix_release+0x87/0xc0 net/unix/af_unix.c:1048
 __sock_release net/socket.c:659 [inline]
 sock_close+0xbe/0x240 net/socket.c:1421
 __fput+0x42b/0x8a0 fs/file_table.c:422
 __do_sys_close fs/open.c:1556 [inline]
 __se_sys_close fs/open.c:1541 [inline]
 __x64_sys_close+0x7f/0x110 fs/open.c:1541
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb37d618070
Code: 00 00 48 c7 c2 b8 ff ff ff f7 d8 64 89 02 b8 ff ff ff ff eb d4 e8 10 2c 00 00 80 3d 31 f0 07 00 00 74 17 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c
RSP: 002b:00007ffcd4a525d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fb37d618070
RDX: 0000000000000010 RSI: 00000000200001c0 RDI: 0000000000000004
RBP: 0000000000000000 R08: 0000000100000000 R09: 0000000100000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

Use sk_psock, which will only check that the pointer is not been set to
NULL yet, which should only happen after the callbacks are restored. If,
then, a reference can still be gotten, we may call sk_psock_stop and cancel
psock->work.

As suggested by Paolo Abeni, reorder the condition so the control flow is
less convoluted.

After that change, the reproducer does not trigger the WARN_ON_ONCE
anymore.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Reported-by: syzbot+07a2e4a1a57118ef7355@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=07a2e4a1a57118ef7355
Fixes: aadb2bb83f ("sock_map: Fix a potential use-after-free in sock_map_close()")
Fixes: 5b4a79ba65 ("bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself")
Cc: stable@vger.kernel.org
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20240524144702.1178377-1-cascardo@igalia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-28 12:05:19 +02:00
yunshui d9cbd8343b bpf, net: Use DEV_STAT_INC()
syzbot/KCSAN reported that races happen when multiple CPUs updating
dev->stats.tx_error concurrently. Adopt SMP safe DEV_STATS_INC() to
update the dev->stats fields.

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: yunshui <jiangyunshui@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240523033520.4029314-1-jiangyunshui@kylinos.cn
2024-05-28 12:04:11 +02:00
Jakub Kicinski 2786ae339e bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZlTGFAAKCRDbK58LschI
 g5NXAP0QRn8nBSxJHIswFSOwRiCyhOhR7YL2P0c+RGcRMA+ZSAD9E1cwsYXsPu3L
 ummQ52AMaMfouHg6aW+rFIoupkGSnwc=
 =QctA
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-05-27

We've added 15 non-merge commits during the last 7 day(s) which contain
a total of 18 files changed, 583 insertions(+), 55 deletions(-).

The main changes are:

1) Fix broken BPF multi-uprobe PID filtering logic which filtered by thread
   while the promise was to filter by process, from Andrii Nakryiko.

2) Fix the recent influx of syzkaller reports to sockmap which triggered
   a locking rule violation by performing a map_delete, from Jakub Sitnicki.

3) Fixes to netkit driver in particular on skb->pkt_type override upon pass
   verdict, from Daniel Borkmann.

4) Fix an integer overflow in resolve_btfids which can wrongly trigger build
   failures, from Friedrich Vock.

5) Follow-up fixes for ARC JIT reported by static analyzers,
   from Shahab Vahedi.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Cover verifier checks for mutating sockmap/sockhash
  Revert "bpf, sockmap: Prevent lock inversion deadlock in map delete elem"
  bpf: Allow delete from sockmap/sockhash only if update is allowed
  selftests/bpf: Add netkit test for pkt_type
  selftests/bpf: Add netkit tests for mac address
  netkit: Fix pkt_type override upon netkit pass verdict
  netkit: Fix setting mac address in l2 mode
  ARC, bpf: Fix issues reported by the static analyzers
  selftests/bpf: extend multi-uprobe tests with USDTs
  selftests/bpf: extend multi-uprobe tests with child thread case
  libbpf: detect broken PID filtering logic for multi-uprobe
  bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic
  bpf: fix multi-uprobe PID filtering logic
  bpf: Fix potential integer overflow in resolve_btfids
  MAINTAINERS: Add myself as reviewer of ARM64 BPF JIT
====================

Link: https://lore.kernel.org/r/20240527203551.29712-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-27 16:26:30 -07:00
Jakub Sitnicki 3b9ce0491a Revert "bpf, sockmap: Prevent lock inversion deadlock in map delete elem"
This reverts commit ff91059932.

This check is no longer needed. BPF programs attached to tracepoints are
now rejected by the verifier when they attempt to delete from a
sockmap/sockhash maps.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240527-sockmap-verify-deletes-v1-2-944b372f2101@cloudflare.com
2024-05-27 19:34:25 +02:00
Abhishek Chauhan 1693c5db6a net: Add additional bit to support clockid_t timestamp type
tstamp_type is now set based on actual clockid_t compressed
into 2 bits.

To make the design scalable for future needs this commit bring in
the change to extend the tstamp_type:1 to tstamp_type:2 to support
other clockid_t timestamp.

We now support CLOCK_TAI as part of tstamp_type as part of this
commit with existing support CLOCK_MONOTONIC and CLOCK_REALTIME.

Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240509211834.3235191-3-quic_abchauha@quicinc.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-23 14:14:36 -07:00
Abhishek Chauhan 4d25ca2d68 net: Rename mono_delivery_time to tstamp_type for scalabilty
mono_delivery_time was added to check if skb->tstamp has delivery
time in mono clock base (i.e. EDT) otherwise skb->tstamp has
timestamp in ingress and delivery_time at egress.

Renaming the bitfield from mono_delivery_time to tstamp_type is for
extensibilty for other timestamps such as userspace timestamp
(i.e. SO_TXTIME) set via sock opts.

As we are renaming the mono_delivery_time to tstamp_type, it makes
sense to start assigning tstamp_type based on enum defined
in this commit.

Earlier we used bool arg flag to check if the tstamp is mono in
function skb_set_delivery_time, Now the signature of the functions
accepts tstamp_type to distinguish between mono and real time.

Also skb_set_delivery_type_by_clockid is a new function which accepts
clockid to determine the tstamp_type.

In future tstamp_type:1 can be extended to support userspace timestamp
by increasing the bitfield.

Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240509211834.3235191-2-quic_abchauha@quicinc.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-23 14:14:23 -07:00
Linus Torvalds daa121128a dma-mapping updates for Linux 6.10
- optimize DMA sync calls when they are no-ops (Alexander Lobakin)
  - fix swiotlb padding for untrusted devices (Michael Kelley)
  - add documentation for swiotb (Michael Kelley)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmZLV+gLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPO7hAAlKuXigzwcrVEUnfRGRdaZ28xbmffyC1dPfw8HRZe
 xJqvD51aJ/VOoOCcUyt3hNLEQHwtjEk4eM0xGcAASMdwceU58doJCcDJBpbbgbDK
 CPKJgBLQBC1JfAJUpRiJkV4RsudRhAyndIzUPVgkz0WObpEgDpfO0ClHRF/0Pavy
 1sBFVFMbB1ewb/D8ffpp+DWfwrwu0oMC3A2LkYu2F5SQFWuVOpbNemrnZ6K2ckPt
 2mcLpJ308+sti8Ka/LrI2akU8JCLYMYDQnue/44v3X3Gm63cMcEx/fj5M5x6m71n
 P+cxAkjsGDHybnfjbUvR842to8msRsH4CI4Zbb69+5HDlWSadM8JhQd74oeii6o6
 RiGPrrFEk7vCxFOkUsqGFYMykEX+71wXfQ1Mpp/b4QgdqBLkxW4ozQ3Ya7ASUs2z
 TLLmQvIXtYKGnyU+RdOkvS6piHjd4wVHOhuGVdXqVT7WrbaPeovY4TNSTV2ZA1gE
 9Y5RCdrX9xeGGNjsYXKwsWGvXVsm6UTQmQVUsatQb3ic+K3S6tQR9pwzk0HmhMuM
 BscWHSAEL7T8ZZ5Ydph45Cw/6xdH7LggD+nRtLcdAuzCika12eabZHsO0DrF533n
 qXYOjZOgsMEZWICynxq6+EGQKGWY+F+GyKDMU2w2Es5OgMa9Bqb40aSF+Q887s96
 xwI=
 =Pa8W
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.10-2024-05-20' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - optimize DMA sync calls when they are no-ops (Alexander Lobakin)

 - fix swiotlb padding for untrusted devices (Michael Kelley)

 - add documentation for swiotb (Michael Kelley)

* tag 'dma-mapping-6.10-2024-05-20' of git://git.infradead.org/users/hch/dma-mapping:
  dma: fix DMA sync for drivers not calling dma_set_mask*()
  xsk: use generic DMA sync shortcut instead of a custom one
  page_pool: check for DMA sync shortcut earlier
  page_pool: don't use driver-set flags field directly
  page_pool: make sure frag API fields don't span between cachelines
  iommu/dma: avoid expensive indirect calls for sync operations
  dma: avoid redundant calls for sync operations
  dma: compile-out DMA sync op calls when not used
  iommu/dma: fix zeroing of bounce buffer padding used by untrusted devices
  swiotlb: remove alloc_size argument to swiotlb_tbl_map_single()
  Documentation/core-api: add swiotlb documentation
2024-05-20 10:23:39 -07:00
Linus Torvalds 89721e3038 net-accept-more-20240515
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmZFcFwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuP1EADKJOJRtcvO/av2cUR+HZFDC+s/jBwHIJK+
 4UY633zQlxjqc7dt4rX7zk/uk4mkhZnsGY+wS6xH08kB3VO9YksrwREVt6Ur9lP8
 UXNVPpPcZ7fFcIp41rYkZX9pTDp2N8z2qsVg7V8wcXJ7EeTXd6L4ZLjhfCiHvs2s
 i6yIEwLrW+voYuqSFV7vWBIM3mSXSRTIiO2DqRAOtT2lsj374DOthvP2lOSSb5wq
 6TF4s4z3HMGs+HF3rjP5kJ6ic6RdC6i31lzEivUMhwCiKN1AZXdp96KXaC+NVPRV
 t5//EdS+pSenQgkg6XH7d5kzFoCUFJfVZt05w0GCqMA081Q9ySjUymN1zedJbGd9
 8CDlW01N8XLqG6+F9yakJLSFY+mUFGduPuueTNiUJWP8kTkQCtYIRzZDeyjxQrE5
 c17NW5S1uWkf26Ucyi1r+gxw9N4kGkuB3+NitC6DOc7BW5CocEIoqLWi/UH7cEZe
 0v6loTakqBAdgh03RCDMUj9Rt/37pQs2KFT9/CazVpbkvkKsue4xK4K2CUFsxqOj
 qcoc/LD62at4S3AUWwhUIs3YaQ7v/6AY5hIktqAwsFHmDffUbPdRrXWY1keKIprJ
 4qS/sY0M+kvKGnp+80fPVHab9l6/fMLfabIyFuh0M3W/M4eHGt2YfKWreoGEy/1x
 xLq2iq+ehw==
 =S6Xt
 -----END PGP SIGNATURE-----

Merge tag 'net-accept-more-20240515' of git://git.kernel.dk/linux

Pull more io_uring updates from Jens Axboe:
 "This adds support for IORING_CQE_F_SOCK_NONEMPTY for io_uring accept
  requests.

  This is very similar to previous work that enabled the same hint for
  doing receives on sockets. By far the majority of the work here is
  refactoring to enable the networking side to pass back whether or not
  the socket had more pending requests after accepting the current one,
  the last patch just wires it up for io_uring.

  Not only does this enable applications to know whether there are more
  connections to accept right now, it also enables smarter logic for
  io_uring multishot accept on whether to retry immediately or wait for
  a poll trigger"

* tag 'net-accept-more-20240515' of git://git.kernel.dk/linux:
  io_uring/net: wire up IORING_CQE_F_SOCK_NONEMPTY for accept
  net: pass back whether socket was empty post accept
  net: have do_accept() take a struct proto_accept_arg argument
  net: change proto and proto_ops accept type
2024-05-18 10:32:39 -07:00
Jakub Kicinski 5c1672705a net: revert partially applied PHY topology series
The series is causing issues with PHY drivers built as modules.
Since it was only partially applied and the merge window has
opened let's revert and try again for v6.11.

Revert 6916e461e7 ("net: phy: Introduce ethernet link topology representation")
Revert 0ec5ed6c13 ("net: sfp: pass the phy_device when disconnecting an sfp module's PHY")
Revert e75e4e074c ("net: phy: add helpers to handle sfp phy connect/disconnect")
Revert fdd353965b ("net: sfp: Add helper to return the SFP bus name")
Revert 841942bc62 ("net: ethtool: Allow passing a phy index for some commands")

Link: https://lore.kernel.org/all/171242462917.4000.9759453824684907063.git-patchwork-notify@kernel.org/
Link: https://lore.kernel.org/all/20240507102822.2023826-1-maxime.chevallier@bootlin.com/
Link: https://lore.kernel.org/r/20240513154156.104281-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13 18:35:02 -07:00
Jens Axboe 92ef0fd55a net: change proto and proto_ops accept type
Rather than pass in flags, error pointer, and whether this is a kernel
invocation or not, add a struct proto_accept_arg struct as the argument.
This then holds all of these arguments, and prepares accept for being
able to pass back more information.

No functional changes in this patch.

Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-13 18:19:09 -06:00
Daniel Jurgens b56035101e netdev: Add queue stats for TX stop and wake
TX queue stop and wake are counted by some drivers.
Support reporting these via netdev-genl queue stats.

Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20240510201927.1821109-2-danielj@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13 14:58:36 -07:00
Richard Gobert 4b0ebbca3e net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
iph->id, ...) against all packets in a loop. These flush checks are used in
all merging UDP and TCP flows.

These checks need to be done only once and only against the found p skb,
since they only affect flush and not same_flow.

This patch leverages correct network header offsets from the cb for both
outer and inner network headers - allowing these checks to be done only
once, in tcp_gro_receive and udp_gro_receive_segment. As a result,
NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are
more declarative and contained in inet_gro_flush, thus removing the need
for flush_id in napi_gro_cb.

This results in less parsing code for non-loop flush tests for TCP and UDP
flows.

To make sure results are not within noise range - I've made netfilter drop
all TCP packets, and measured CPU performance in GRO (in this case GRO is
responsible for about 50% of the CPU utilization).

perf top while replaying 64 parallel IP/TCP streams merging in GRO:
(gro_receive_network_flush is compiled inline to tcp_gro_receive)
net-next:
        6.94% [kernel] [k] inet_gro_receive
        3.02% [kernel] [k] tcp_gro_receive

patch applied:
        4.27% [kernel] [k] tcp_gro_receive
        4.22% [kernel] [k] inet_gro_receive

perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same
results for any encapsulation, in this case inet_gro_receive is top
offender in net-next)
net-next:
        10.09% [kernel] [k] inet_gro_receive
        2.08% [kernel] [k] tcp_gro_receive

patch applied:
        6.97% [kernel] [k] inet_gro_receive
        3.68% [kernel] [k] tcp_gro_receive

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240509190819.2985-3-richardbgobert@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13 14:44:06 -07:00
Kuniyuki Iwashima 7172dc93d6 af_unix: Add dead flag to struct scm_fp_list.
Commit 1af2dface5 ("af_unix: Don't access successor in unix_del_edges()
during GC.") fixed use-after-free by avoid accessing edge->successor while
GC is in progress.

However, there could be a small race window where another process could
call unix_del_edges() while gc_in_progress is true and __skb_queue_purge()
is on the way.

So, we need another marker for struct scm_fp_list which indicates if the
skb is garbage-collected.

This patch adds dead flag in struct scm_fp_list and set it true before
calling __skb_queue_purge().

Fixes: 1af2dface5 ("af_unix: Don't access successor in unix_del_edges() during GC.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240508171150.50601-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-10 18:52:45 -07:00
Jakub Kicinski e7073830cc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
  35d92abfba ("net: hns3: fix kernel crash when devlink reload during initialization")
  2a1a1a7b5f ("net: hns3: add command queue trace for hns3")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-09 10:01:01 -07:00
Eric Dumazet e2d09e5a1e net: dst_cache: minor optimization in dst_cache_set_ip6()
There is no need to use this_cpu_ptr(dst_cache->cache) twice.

Compiler is unable to optimize the second call, because of
per-cpu constraints.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240507132717.627518-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08 18:50:11 -07:00
Eric Dumazet 3b09b2bd0d net: dst_cache: annotate data-races around dst_cache->reset_ts
dst_cache->reset_ts is read or written locklessly,
add READ_ONCE() and WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240507132000.614591-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08 18:49:51 -07:00
Oleksij Rempel 768cf84138 net: add IEEE 802.1q specific helpers
IEEE 802.1q specification provides recommendation and examples which can
be used as good default values for different drivers.

This patch implements mapping examples documented in IEEE 802.1Q-2022 in
Annex I "I.3 Traffic type to traffic class mapping" and IETF DSCP naming
and mapping DSCP to Traffic Type inspired by RFC8325.

This helpers will be used in followup patches for dsa/microchip DCB
implementation.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08 10:35:09 +01:00
Alexander Lobakin 4321de4497 page_pool: check for DMA sync shortcut earlier
We can save a couple more function calls in the Page Pool code if we
check for dma_need_sync() earlier, just when we test pp->p.dma_sync.
Move both these checks into an inline wrapper and call the PP wrapper
over the generic DMA sync function only when both are true.
You can't cache the result of dma_need_sync() in &page_pool, as it may
change anytime if an SWIOTLB buffer is allocated or mapped.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-08 08:51:20 +02:00
Alexander Lobakin 403f11ac9a page_pool: don't use driver-set flags field directly
page_pool::p is driver-defined params, copied directly from the
structure passed to page_pool_create(). The structure isn't meant
to be modified by the Page Pool core code and this even might look
confusing[0][1].
In order to be able to alter some flags, let's define our own, internal
fields the same way as the already existing one (::has_init_callback).
They are defined as bits in the driver-set params, leave them so here
as well, to not waste byte-per-bit or so. Almost 30 bits are still free
for future extensions.
We could've defined only new flags here or only the ones we may need
to alter, but checking some flags in one place while others in another
doesn't sound convenient or intuitive. ::flags passed by the driver can
now go to the "slow" PP params.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link[0]: https://lore.kernel.org/netdev/20230703133207.4f0c54ce@kernel.org
Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
Link[1]: https://lore.kernel.org/netdev/CAKgT0UfZCGnWgOH96E4GV3ZP6LLbROHM7SHE8NKwq+exX+Gk_Q@mail.gmail.com
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-07 13:29:54 +02:00
Alexander Lobakin 1f20a57694 page_pool: make sure frag API fields don't span between cachelines
After commit 5027ec19f1 ("net: page_pool: split the page_pool_params
into fast and slow") that made &page_pool contain only "hot" params at
the start, cacheline boundary chops frag API fields group in the middle
again.
To not bother with this each time fast params get expanded or shrunk,
let's just align them to `4 * sizeof(long)`, the closest upper pow-2 to
their actual size (2 longs + 1 int). This ensures 16-byte alignment for
the 32-bit architectures and 32-byte alignment for the 64-bit ones,
excluding unnecessary false-sharing.
::page_state_hold_cnt is used quite intensively on hotpath no matter if
frag API is used, so move it to the newly created hole in the first
cacheline.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-07 13:29:54 +02:00
Eric Dumazet 9cf621bd5f rtnetlink: allow rtnl_fill_link_netnsid() to run under RCU protection
We want to be able to run rtnl_fill_ifinfo() under RCU protection
instead of RTNL in the future.

All rtnl_link_ops->get_link_net() methods already using dev_net()
are ready. I added READ_ONCE() annotations on others.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Eric Dumazet 979aad40da rtnetlink: do not depend on RTNL in rtnl_xdp_prog_skb()
dev->xdp_prog is protected by RCU, we can lift RTNL requirement
from rtnl_xdp_prog_skb().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Eric Dumazet 6890ab31d1 rtnetlink: do not depend on RTNL in rtnl_fill_proto_down()
Change dev_change_proto_down() and dev_change_proto_down_reason()
to write once on dev->proto_down and dev->proto_down_reason.

Then rtnl_fill_proto_down() can use READ_ONCE() annotations
and run locklessly.

rtnl_proto_down_size() should assume worst case,
because readng dev->proto_down_reason multiple
times would be racy without RTNL in the future.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Eric Dumazet 6747a5d499 rtnetlink: do not depend on RTNL for many attributes
Following device fields can be read locklessly
in rtnl_fill_ifinfo() :

type, ifindex, operstate, link_mode, mtu, min_mtu, max_mtu, group,
promiscuity, allmulti, num_tx_queues, gso_max_segs, gso_max_size,
gro_max_size, gso_ipv4_max_size, gro_ipv4_max_size, tso_max_size,
tso_max_segs, num_rx_queues.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Eric Dumazet 55a2c86c8d net: write once on dev->allmulti and dev->promiscuity
In the following patch we want to read dev->allmulti
and dev->promiscuity locklessly from rtnl_fill_ifinfo()

In this patch I change __dev_set_promiscuity() and
__dev_set_allmulti() to write these fields (and dev->flags)
only if they succeed, with WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Eric Dumazet ad13b5b0d1 rtnetlink: do not depend on RTNL for IFLA_TXQLEN output
rtnl_fill_ifinfo() can read dev->tx_queue_len locklessly,
granted we add corresponding READ_ONCE()/WRITE_ONCE() annotations.

Add missing READ_ONCE(dev->tx_queue_len) in teql_enqueue()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Eric Dumazet 8a58268133 rtnetlink: do not depend on RTNL for IFLA_IFNAME output
We can use netdev_copy_name() to no longer rely on RTNL
to fetch dev->name.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Eric Dumazet 698419ffb6 rtnetlink: do not depend on RTNL for IFLA_QDISC output
dev->qdisc can be read using RCU protection.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07 11:14:50 +02:00
Thadeu Lima de Souza Cascardo a26ff37e62 net: fix out-of-bounds access in ops_init
net_alloc_generic is called by net_alloc, which is called without any
locking. It reads max_gen_ptrs, which is changed under pernet_ops_rwsem. It
is read twice, first to allocate an array, then to set s.len, which is
later used to limit the bounds of the array access.

It is possible that the array is allocated and another thread is
registering a new pernet ops, increments max_gen_ptrs, which is then used
to set s.len with a larger than allocated length for the variable array.

Fix it by reading max_gen_ptrs only once in net_alloc_generic. If
max_gen_ptrs is later incremented, it will be caught in net_assign_generic.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Fixes: 073862ba5d ("netns: fix net_alloc_generic()")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240502132006.3430840-1-cascardo@igalia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 13:38:14 +02:00
Felix Fietkau 8928756d53 net: move skb_gro_receive_list from udp to core
This helper function will be used for TCP fraglist GRO support

Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06 11:54:03 +02:00
Mina Almasry 173e7622cc Revert "net: mirror skb frag ref/unref helpers"
This reverts commit a580ea994f.

This revert is to resolve Dragos's report of page_pool leak here:
https://lore.kernel.org/lkml/20240424165646.1625690-2-dtatulea@nvidia.com/

The reverted patch interacts very badly with commit 2cc3aeb5ec ("skbuff:
Fix a potential race while recycling page_pool packets"). The reverted
commit hopes that the pp_recycle + is_pp_page variables do not change
between the skb_frag_ref and skb_frag_unref operation. If such a change
occurs, the skb_frag_ref/unref will not operate on the same reference type.
In the case of Dragos's report, the grabbed ref was a pp ref, but the unref
was a page ref, because the pp_recycle setting on the skb was changed.

Attempting to fix this issue on the fly is risky. Lets revert and I hope
to reland this with better understanding and testing to ensure we don't
regress some edge case while streamlining skb reffing.

Fixes: a580ea994f ("net: mirror skb frag ref/unref helpers")
Reported-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20240502175423.2456544-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 16:05:53 -07:00
Roded Zats 1aec77b2bb rtnetlink: Correct nested IFLA_VF_VLAN_LIST attribute validation
Each attribute inside a nested IFLA_VF_VLAN_LIST is assumed to be a
struct ifla_vf_vlan_info so the size of such attribute needs to be at least
of sizeof(struct ifla_vf_vlan_info) which is 14 bytes.
The current size validation in do_setvfinfo is against NLA_HDRLEN (4 bytes)
which is less than sizeof(struct ifla_vf_vlan_info) so this validation
is not enough and a too small attribute might be cast to a
struct ifla_vf_vlan_info, this might result in an out of bands
read access when accessing the saved (casted) entry in ivvl.

Fixes: 79aab093a0 ("net: Update API for VF vlan protocol 802.1ad support")
Signed-off-by: Roded Zats <rzats@paloaltonetworks.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240502155751.75705-1-rzats@paloaltonetworks.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:57:50 -07:00
Eric Dumazet c1742dcb6b net: no longer acquire RTNL in threaded_show()
dev->threaded can be read locklessly, if we add
corresponding READ_ONCE()/WRITE_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240502173926.2010646-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:14:01 -07:00
Eric Dumazet 0feb396f74 rtnetlink: use for_each_netdev_dump() in rtnl_stats_dump()
Switch rtnl_stats_dump() to use for_each_netdev_dump()
instead of net->dev_index_head[] hash table.

This makes the code much easier to read, and fixes
scalability issues.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240502113748.1622637-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:03:42 -07:00
Eric Dumazet 136c2a9a2a rtnetlink: change rtnl_stats_dump() return value
By returning 0 (or an error) instead of skb->len,
we allow NLMSG_DONE to be appended to the current
skb at the end of a dump, saving a couple of recvmsg()
system calls.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240502113748.1622637-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03 15:03:42 -07:00
Joel Granados ce218712b0 net: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel element from ctl_table structs.
* Remove the zeroing out of an array element (to make it look like a
  sentinel) in neigh_sysctl_register and lowpan_frags_ns_sysctl_register
  This is not longer needed and is safe after commit c899710fe7
  ("networking: Update to register_net_sysctl_sz") added the array size
  to the ctl_table registration.
* Replace the for loop stop condition in sysctl_core_net_init that tests
  for procname == NULL with one that depends on array size
* Removed the "-1" in mpls_net_init that adjusted for having an extra
  empty element when looping over ctl_table arrays
* Use a table_size variable to keep the value of ARRAY_SIZE

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03 13:29:41 +01:00
Jakub Kicinski e958da0ddb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/linux/filter.h
kernel/bpf/core.c
  66e13b615a ("bpf: verifier: prevent userspace memory access")
  d503a04f8b ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02 12:06:25 -07:00
Richard Gobert 5ef31ea5d0 net: gro: fix udp bad offset in socket lookup by adding {inner_}network_offset to napi_gro_cb
Commits a602456 ("udp: Add GRO functions to UDP socket") and 57c67ff ("udp:
additional GRO support") introduce incorrect usage of {ip,ipv6}_hdr in the
complete phase of gro. The functions always return skb->network_header,
which in the case of encapsulated packets at the gro complete phase, is
always set to the innermost L3 of the packet. That means that calling
{ip,ipv6}_hdr for skbs which completed the GRO receive phase (both in
gro_list and *_gro_complete) when parsing an encapsulated packet's _outer_
L3/L4 may return an unexpected value.

This incorrect usage leads to a bug in GRO's UDP socket lookup.
udp{4,6}_lib_lookup_skb functions use ip_hdr/ipv6_hdr respectively. These
*_hdr functions return network_header which will point to the innermost L3,
resulting in the wrong offset being used in __udp{4,6}_lib_lookup with
encapsulated packets.

This patch adds network_offset and inner_network_offset to napi_gro_cb, and
makes sure both are set correctly.

To fix the issue, network_offsets union is used inside napi_gro_cb, in
which both the outer and the inner network offsets are saved.

Reproduction example:

Endpoint configuration example (fou + local address bind)

    # ip fou add port 6666 ipproto 4
    # ip link add name tun1 type ipip remote 2.2.2.1 local 2.2.2.2 encap fou encap-dport 5555 encap-sport 6666 mode ipip
    # ip link set tun1 up
    # ip a add 1.1.1.2/24 dev tun1

Netperf TCP_STREAM result on net-next before patch is applied:

net-next main, GRO enabled:
    $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

    131072  16384  16384    5.28        2.37

net-next main, GRO disabled:
    $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

    131072  16384  16384    5.01     2745.06

patch applied, GRO enabled:
    $ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

    131072  16384  16384    5.01     2877.38

Fixes: a6024562ff ("udp: Add GRO functions to UDP socket")
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-02 11:02:48 +02:00
Kuniyuki Iwashima 0840556e5a net: Protect dev->name by seqlock.
We will convert ioctl(SIOCGARP) to RCU, and then we need to copy
dev->name which is currently protected by rtnl_lock().

This patch does the following:

  1) Add seqlock netdev_rename_lock to protect dev->name

  2) Add netdev_copy_name() that copies dev->name to buffer
     under netdev_rename_lock

  3) Use netdev_copy_name() in netdev_get_name() and drop
     devnet_rename_sem

Suggested-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/netdev/CANn89iJEWs7AYSJqGCUABeVqOCTkErponfZdT5kV-iD=-SajnQ@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01 18:37:07 -07:00
Felix Fietkau d091e579b8 net: core: reject skb_copy(_expand) for fraglist GSO skbs
SKB_GSO_FRAGLIST skbs must not be linearized, otherwise they become
invalid. Return NULL if such an skb is passed to skb_copy or
skb_copy_expand, in order to prevent a crash on a potential later
call to skb_gso_segment.

Fixes: 3a1296a38d ("net: Support GRO/GSO fraglist chaining.")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-01 11:44:10 +01:00
Breno Leitao c2e6a872bd netpoll: Fix race condition in netpoll_owner_active
KCSAN detected a race condition in netpoll:

	BUG: KCSAN: data-race in net_rx_action / netpoll_send_skb
	write (marked) to 0xffff8881164168b0 of 4 bytes by interrupt on cpu 10:
	net_rx_action (./include/linux/netpoll.h:90 net/core/dev.c:6712 net/core/dev.c:6822)
<snip>
	read to 0xffff8881164168b0 of 4 bytes by task 1 on cpu 2:
	netpoll_send_skb (net/core/netpoll.c:319 net/core/netpoll.c:345 net/core/netpoll.c:393)
	netpoll_send_udp (net/core/netpoll.c:?)
<snip>
	value changed: 0x0000000a -> 0xffffffff

This happens because netpoll_owner_active() needs to check if the
current CPU is the owner of the lock, touching napi->poll_owner
non atomically. The ->poll_owner field contains the current CPU holding
the lock.

Use an atomic read to check if the poll owner is the current CPU.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240429100437.3487432-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 19:03:47 -07:00
Eric Dumazet c204fef97e net: move sysctl_mem_pcpu_rsv to net_hotdata
sysctl_mem_pcpu_rsv is used in TCP fast path,
move it to net_hodata for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240429134025.1233626-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 18:46:52 -07:00
Eric Dumazet f3d93817fb net: add <net/proto_memory.h>
Move some proto memory definitions out of <net/sock.h>

Very few files need them, and following patch
will include <net/hotdata.h> from <net/proto_memory.h>

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240429134025.1233626-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 18:46:52 -07:00
Eric Dumazet d480dc76d9 net: move sysctl_skb_defer_max to net_hotdata
sysctl_skb_defer_max is used in TCP fast path,
move it to net_hodata.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240429134025.1233626-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 18:46:52 -07:00
Eric Dumazet a86a0661b8 net: move sysctl_max_skb_frags to net_hotdata
sysctl_max_skb_frags is used in TCP and MPTCP fast paths,
move it to net_hodata for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240429134025.1233626-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 18:46:52 -07:00
Eric Dumazet 05d6d49209 inet: introduce dst_rtable() helper
I added dst_rt6_info() in commit
e8dfd42c17 ("ipv6: introduce dst_rt6_info() helper")

This patch does a similar change for IPv4.

Instead of (struct rtable *)dst casts, we can use :

 #define dst_rtable(_ptr) \
             container_of_const(_ptr, struct rtable, dst)

Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/20240429133009.1227754-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 18:32:38 -07:00
Jakub Kicinski 12b6c3a038 net: page_pool: support error injection
Because of caching / recycling using the general page allocation
failures to induce errors in page pool allocation is very hard.
Add direct error injection support to page_pool_alloc_pages().

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240429144426.743476-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 08:15:31 -07:00
Xuan Zhuo 0cfe71f45f netdev: add queue stats
These stats are commonly. Support reporting those via netdev-genl queue
stats.

name: rx-hw-drops
name: rx-hw-drop-overruns
name: rx-csum-unnecessary
name: rx-csum-none
name: rx-csum-bad
name: rx-hw-gro-packets
name: rx-hw-gro-bytes
name: rx-hw-gro-wire-packets
name: rx-hw-gro-wire-bytes
name: rx-hw-drop-ratelimits
name: tx-hw-drops
name: tx-hw-drop-errors
name: tx-csum-none
name: tx-needs-csum
name: tx-hw-gso-packets
name: tx-hw-gso-bytes
name: tx-hw-gso-wire-packets
name: tx-hw-gso-wire-bytes
name: tx-hw-drop-ratelimits

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30 10:51:33 +02:00
Jakub Kicinski 89de2db193 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZi9+AAAKCRDbK58LschI
 g0nEAP487m7L0nLVriC2oIOWsi29tklW3etm6DO7gmGRGIHgrgEAnMyV1xBj3bGj
 v6jJwDcybCym1hLx+1x1JCZ4eoAFswE=
 =xbna
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-04-29

We've added 147 non-merge commits during the last 32 day(s) which contain
a total of 158 files changed, 9400 insertions(+), 2213 deletions(-).

The main changes are:

1) Add an internal-only BPF per-CPU instruction for resolving per-CPU
   memory addresses and implement support in x86 BPF JIT. This allows
   inlining per-CPU array and hashmap lookups
   and the bpf_get_smp_processor_id() helper, from Andrii Nakryiko.

2) Add BPF link support for sk_msg and sk_skb programs, from Yonghong Song.

3) Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
   atomics in bpf_arena which can be JITed as a single x86 instruction,
   from Alexei Starovoitov.

4) Add support for passing mark with bpf_fib_lookup helper,
   from Anton Protopopov.

5) Add a new bpf_wq API for deferring events and refactor sleepable
   bpf_timer code to keep common code where possible,
   from Benjamin Tissoires.

6) Fix BPF_PROG_TEST_RUN infra with regards to bpf_dummy_struct_ops programs
   to check when NULL is passed for non-NULLable parameters,
   from Eduard Zingerman.

7) Harden the BPF verifier's and/or/xor value tracking,
   from Harishankar Vishwanathan.

8) Introduce crypto kfuncs to make BPF programs able to utilize the kernel
   crypto subsystem, from Vadim Fedorenko.

9) Various improvements to the BPF instruction set standardization doc,
   from Dave Thaler.

10) Extend libbpf APIs to partially consume items from the BPF ringbuffer,
    from Andrea Righi.

11) Bigger batch of BPF selftests refactoring to use common network helpers
    and to drop duplicate code, from Geliang Tang.

12) Support bpf_tail_call_static() helper for BPF programs with GCC 13,
    from Jose E. Marchesi.

13) Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
    program to have code sections where preemption is disabled,
    from Kumar Kartikeya Dwivedi.

14) Allow invoking BPF kfuncs from BPF_PROG_TYPE_SYSCALL programs,
    from David Vernet.

15) Extend the BPF verifier to allow different input maps for a given
    bpf_for_each_map_elem() helper call in a BPF program, from Philo Lu.

16) Add support for PROBE_MEM32 and bpf_addr_space_cast instructions
    for riscv64 and arm64 JITs to enable BPF Arena, from Puranjay Mohan.

17) Shut up a false-positive KMSAN splat in interpreter mode by unpoison
    the stack memory, from Martin KaFai Lau.

18) Improve xsk selftest coverage with new tests on maximum and minimum
    hardware ring size configurations, from Tushar Vyavahare.

19) Various ReST man pages fixes as well as documentation and bash completion
    improvements for bpftool, from Rameez Rehman & Quentin Monnet.

20) Fix libbpf with regards to dumping subsequent char arrays,
    from Quentin Deslandes.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (147 commits)
  bpf, docs: Clarify PC use in instruction-set.rst
  bpf_helpers.h: Define bpf_tail_call_static when building with GCC
  bpf, docs: Add introduction for use in the ISA Internet Draft
  selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us
  bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args
  selftests/bpf: dummy_st_ops should reject 0 for non-nullable params
  bpf: check bpf_dummy_struct_ops program params for test runs
  selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops
  selftests/bpf: adjust dummy_st_ops_success to detect additional error
  bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable
  selftests/bpf: Add ring_buffer__consume_n test.
  bpf: Add bpf_guard_preempt() convenience macro
  selftests: bpf: crypto: add benchmark for crypto functions
  selftests: bpf: crypto skcipher algo selftests
  bpf: crypto: add skcipher to bpf crypto
  bpf: make common crypto API for TC/XDP programs
  bpf: update the comment for BTF_FIELDS_MAX
  selftests/bpf: Fix wq test.
  selftests/bpf: Use make_sockaddr in test_sock_addr
  selftests/bpf: Use connect_to_addr in test_sock_addr
  ...
====================

Link: https://lore.kernel.org/r/20240429131657.19423-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-29 13:12:19 -07:00
Eric Dumazet e8dfd42c17 ipv6: introduce dst_rt6_info() helper
Instead of (struct rt6_info *)dst casts, we can use :

 #define dst_rt6_info(_ptr) \
         container_of_const(_ptr, struct rt6_info, dst)

Some places needed missing const qualifiers :

ip6_confirm_neigh(), ipv6_anycast_destination(),
ipv6_unicast_destination(), has_gateway()

v2: added missing parts (David Ahern)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-29 13:32:01 +01:00
Eric Dumazet cd42ba1c8a net: give more chances to rcu in netdev_wait_allrefs_any()
This came while reviewing commit c4e86b4363 ("net: add two more
call_rcu_hurry()").

Paolo asked if adding one synchronize_rcu() would help.

While synchronize_rcu() does not help, making sure to call
rcu_barrier() before msleep(wait) is definitely helping
to make sure lazy call_rcu() are completed.

Instead of waiting ~100 seconds in my tests, the ref_tracker
splats occurs one time only, and netdev_wait_allrefs_any()
latency is reduced to the strict minimum.

Ideally we should audit our call_rcu() users to make sure
no refcount (or cascading call_rcu()) is held too long,
because rcu_barrier() is quite expensive.

Fixes: 0e4be9e57e ("net: use exponential backoff in netdev_wait_allrefs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/all/28bbf698-befb-42f6-b561-851c67f464aa@kernel.org/T/#m76d73ed6b03cd930778ac4d20a777f22a08d6824
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-29 09:54:12 +01:00
Jakub Kicinski b2ff42c6d3 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZiwdfQAKCRDbK58LschI
 g1oqAP9mjayeIHCfYMQZa2eevy1PmVlgdNdFdMDWZFS/pHv9cgD/ZdmGzbUDKCAQ
 Y/KiTajitZw3kxtHX45v8/Ugtlsh9Qg=
 =Ewiw
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-04-26

We've added 12 non-merge commits during the last 22 day(s) which contain
a total of 14 files changed, 168 insertions(+), 72 deletions(-).

The main changes are:

1) Fix BPF_PROBE_MEM in verifier and JIT to skip loads from vsyscall page,
   from Puranjay Mohan.

2) Fix a crash in XDP with devmap broadcast redirect when the latter map
   is in process of being torn down, from Toke Høiland-Jørgensen.

3) Fix arm64 and riscv64 BPF JITs to properly clear start time for BPF
   program runtime stats, from Xu Kuohai.

4) Fix a sockmap KCSAN-reported data race in sk_psock_skb_ingress_enqueue,
    from Jason Xing.

5) Fix BPF verifier error message in resolve_pseudo_ldimm64,
   from Anton Protopopov.

6) Fix missing DEBUG_INFO_BTF_MODULES Kconfig menu item,
   from Andrii Nakryiko.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Test PROBE_MEM of VSYSCALL_ADDR on x86-64
  bpf, x86: Fix PROBE_MEM runtime load check
  bpf: verifier: prevent userspace memory access
  xdp: use flags field to disambiguate broadcast redirect
  arm32, bpf: Reimplement sign-extension mov instruction
  riscv, bpf: Fix incorrect runtime stats
  bpf, arm64: Fix incorrect runtime stats
  bpf: Fix a verifier verbose message
  bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue
  MAINTAINERS: bpf: Add Lehui and Puranjay as riscv64 reviewers
  MAINTAINERS: Update email address for Puranjay Mohan
  bpf, kconfig: Fix DEBUG_INFO_BTF_MODULES Kconfig definition
====================

Link: https://lore.kernel.org/r/20240426224248.26197-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-26 17:36:53 -07:00
Alexander Lobakin ef9226cd56 page_pool: constify some read-only function arguments
There are several functions taking pointers to data they don't modify.
This includes statistics fetching, page and page_pool parameters, etc.
Constify the pointers, so that call sites will be able to pass const
pointers as well.
No functional changes, no visible changes in functions sizes.

Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-24 11:06:25 -07:00
Breno Leitao c661050f93 net: create a dummy net_device allocator
It is impossible to use init_dummy_netdev together with alloc_netdev()
as the 'setup' argument.

This is because alloc_netdev() initializes some fields in the net_device
structure, and later init_dummy_netdev() memzero them all. This causes
some problems as reported here:

	https://lore.kernel.org/all/20240322082336.49f110cc@kernel.org/

Split the init_dummy_netdev() function in two. Create a new function called
init_dummy_netdev_core() that does not memzero the net_device structure.
Then have init_dummy_netdev() memzero-ing and calling
init_dummy_netdev_core(), keeping the old behaviour.

init_dummy_netdev_core() is the new function that could be called as an
argument for alloc_netdev().

Also, create a helper to allocate and initialize dummy net devices,
leveraging init_dummy_netdev_core() as the setup argument. This function
basically simplify the allocation of dummy devices, by allocating and
initializing it. Freeing the device continue to be done through
free_netdev()

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-24 12:00:16 +01:00
Breno Leitao f8d05679fb net: free_netdev: exit earlier if dummy
For dummy devices, exit earlier at free_netdev() instead of executing
the whole function. This is necessary, because dummy devices are
special, and shouldn't have the second part of the function executed.

Otherwise reg_state, which is NETREG_DUMMY, will be overwritten and
there will be no way to identify that this is a dummy device. Also, this
device do not need the final put_device(), since dummy devices are not
registered (through register_netdevice()), where the device reference is
increased (at netdev_register_kobject()/device_add()).

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-24 12:00:16 +01:00
Breno Leitao c6e7f27684 net: core: Fix documentation
Fix bad grammar in description of init_dummy_netdev() function.  This
topic showed up in the review of the "allocate dummy device dynamically"
patch set.

Suggested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-24 12:00:16 +01:00
Eric Dumazet 1c04b46cbd neighbour: fix neigh_master_filtered()
If we no longer hold RTNL, we must use netdev_master_upper_dev_get_rcu()
instead of netdev_master_upper_dev_get().

Fixes: ba0f780694 ("neighbour: no longer hold RTNL in neigh_dump_info()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240421185753.1808077-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23 19:04:50 -07:00
Jakub Kicinski ce05d0f203 netdev: support dumping a single netdev in qstats
Having to filter the right ifindex in the tests is a bit tedious.
Add support for dumping qstats for a single ifindex.

Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240420023543.3300306-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23 10:09:49 -07:00
Jakub Kicinski af046fd169 Merge branch 'for-uring-ubufops' into HEAD
Pavel Begunkov says:

====================
implement io_uring notification (ubuf_info) stacking (net part)

To have per request buffer notifications each zerocopy io_uring send
request allocates a new ubuf_info. However, as an skb can carry only
one uarg, it may force the stack to create many small skbs hurting
performance in many ways.

The patchset implements notification, i.e. an io_uring's ubuf_info
extension, stacking. It attempts to link ubuf_info's into a list,
allowing to have multiple of them per skb.

liburing/examples/send-zerocopy shows up 6 times performance improvement
for TCP with 4KB bytes per send, and levels it with MSG_ZEROCOPY. Without
the patchset it requires much larger sends to utilise all potential.

bytes  | before | after (Kqps)
1200   | 195    | 1023
4000   | 193    | 1386
8000   | 154    | 1058
====================

Link: https://lore.kernel.org/all/cover.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-22 17:15:39 -07:00
Pavel Begunkov 65bada80de net: add callback for setting a ubuf_info to skb
At the moment an skb can only have one ubuf_info associated with it,
which might be a performance problem for zerocopy sends in cases like
TCP via io_uring. Add a callback for assigning ubuf_info to skb, this
way we will implement smarter assignment later like linking ubuf_info
together.

Note, it's an optional callback, which should be compatible with
skb_zcopy_set(), that's because the net stack might potentially decide
to clone an skb and take another reference to ubuf_info whenever it
wishes. Also, a correct implementation should always be able to bind to
an skb without prior ubuf_info, otherwise we could end up in a situation
when the send would not be able to progress.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/all/b7918aadffeb787c84c9e72e34c729dc04f3a45d.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-22 16:21:59 -07:00
Pavel Begunkov 7ab4f16f9e net: extend ubuf_info callback to ops structure
We'll need to associate additional callbacks with ubuf_info, introduce
a structure holding ubuf_info callbacks. Apart from a more smarter
io_uring notification management introduced in next patches, it can be
used to generalise msg_zerocopy_put_abort() and also store
->sg_from_iter, which is currently passed in struct msghdr.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-22 16:21:35 -07:00
Toke Høiland-Jørgensen 5bcf0dcbf9 xdp: use flags field to disambiguate broadcast redirect
When redirecting a packet using XDP, the bpf_redirect_map() helper will set
up the redirect destination information in struct bpf_redirect_info (using
the __bpf_xdp_redirect_map() helper function), and the xdp_do_redirect()
function will read this information after the XDP program returns and pass
the frame on to the right redirect destination.

When using the BPF_F_BROADCAST flag to do multicast redirect to a whole
map, __bpf_xdp_redirect_map() sets the 'map' pointer in struct
bpf_redirect_info to point to the destination map to be broadcast. And
xdp_do_redirect() reacts to the value of this map pointer to decide whether
it's dealing with a broadcast or a single-value redirect. However, if the
destination map is being destroyed before xdp_do_redirect() is called, the
map pointer will be cleared out (by bpf_clear_redirect_map()) without
waiting for any XDP programs to stop running. This causes xdp_do_redirect()
to think that the redirect was to a single target, but the target pointer
is also NULL (since broadcast redirects don't have a single target), so
this causes a crash when a NULL pointer is passed to dev_map_enqueue().

To fix this, change xdp_do_redirect() to react directly to the presence of
the BPF_F_BROADCAST flag in the 'flags' value in struct bpf_redirect_info
to disambiguate between a single-target and a broadcast redirect. And only
read the 'map' pointer if the broadcast flag is set, aborting if that has
been cleared out in the meantime. This prevents the crash, while keeping
the atomic (cmpxchg-based) clearing of the map pointer itself, and without
adding any more checks in the non-broadcast fast path.

Fixes: e624d4ed4a ("xdp: Extend xdp_redirect_map with broadcast support")
Reported-and-tested-by: syzbot+af9492708df9797198d6@syzkaller.appspotmail.com
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20240418071840.156411-1-toke@redhat.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-04-22 10:24:41 -07:00
Thomas Weißschuh bfa858f220 sysctl: treewide: constify ctl_table_header::ctl_table_arg
To be able to constify instances of struct ctl_tables it is necessary to
remove ways through which non-const versions are exposed from the
sysctl core.
One of these is the ctl_table_arg member of struct ctl_table_header.

Constify this reference as a prerequisite for the full constification of
struct ctl_table instances.
No functional change.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-22 08:56:31 +01:00
Eric Dumazet ba0f780694 neighbour: no longer hold RTNL in neigh_dump_info()
neigh_dump_table() is already relying on RCU protection.

pneigh_dump_table() is using its own protection (tbl->lock)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 12:39:20 +01:00
Eric Dumazet 7e4975f7e7 neighbour: fix neigh_dump_info() return value
Change neigh_dump_table() and pneigh_dump_table()
to either return 0 or -EMSGSIZE if not enough
space was available in the skb.

Then neigh_dump_info() can do the same.

This allows NLMSG_DONE to be appended to the current
skb at the end of a dump, saving a couple of recvmsg()
system calls.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 12:39:20 +01:00
Eric Dumazet f8f2eb9de6 neighbour: add RCU protection to neigh_tables[]
In order to remove RTNL protection from neightbl_dump_info()
and neigh_dump_info() later, we need to add
RCU protection to neigh_tables[].

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 12:39:20 +01:00
Jason Xing f7b60cce84 net: rps: locklessly access rflow->cpu
This is the last member in struct rps_dev_flow which should be
protected locklessly. So finish it.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:38:03 +01:00
Jason Xing f00bf5dc83 net: rps: protect filter locklessly
As we can see, rflow->filter can be written/read concurrently, so
lockless access is needed.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:38:03 +01:00
Jason Xing 84b6823cd9 net: rps: protect last_qtail with rps_input_queue_tail_save() helper
Removing one unnecessary reader protection and add another writer
protection to finish the locklessly proctection job.

Note: the removed READ_ONCE() is not needed because we only have to protect
the locklessly reader in the different context (rps_may_expire_flow()).

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19 11:38:03 +01:00
Jakub Kicinski 41e3ddb291 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/trace/events/rpcgss.h
  386f4a7379 ("trace: events: cleanup deprecated strncpy uses")
  a4833e3aba ("SUNRPC: Fix rpcgss_context trace event acceptor field")

Adjacent changes:

drivers/net/ethernet/intel/ice/ice_tc_lib.c
  2cca35f5dd ("ice: Fix checking for unsupported keys on non-tunnel device")
  784feaa65d ("ice: Add support for PFCP hardware offload in switchdev")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-18 13:12:24 -07:00
Zheng Li eabf425bc6 neighbour: guarantee the localhost connections be established successfully even the ARP table is full
Inter-process communication on localhost should be established successfully
even the ARP table is full, many processes on server machine use the
localhost to communicate such as command-line interface (CLI),
servers hope all CLI commands can be executed successfully even the arp
table is full. Right now CLI commands got timeout when the arp table is
full. Set the parameter of exempt_from_gc to be true for LOOPBACK net
device to keep localhost neigh in arp table, not removed by gc.

the steps of reproduced:
server with "gc_thresh3 = 1024" setting, ping server from more than 1024
same netmask Lan IPv4 addresses, run "ssh localhost" on console interface,
then the command will get timeout.

Signed-off-by: Zheng Li <James.Z.Li@Dell.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240416095343.540-1-lizheng043@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-18 12:01:03 +02:00
Eric Dumazet 1514b06aff netns: no longer hold RTNL in rtnl_net_dumpid()
- rtnl_net_dumpid() is already fully RCU protected,
  RTNL is not needed there.

- Fix return value at the end of a dump,
  so that NLMSG_DONE can be appended to current skb,
  saving one recvmsg() system call.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/20240416140739.967941-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17 18:30:28 -07:00
Eric Dumazet 0f022d32c3 net/sched: Fix mirred deadlock on device recursion
When the mirred action is used on a classful egress qdisc and a packet is
mirrored or redirected to self we hit a qdisc lock deadlock.
See trace below.

[..... other info removed for brevity....]
[   82.890906]
[   82.890906] ============================================
[   82.890906] WARNING: possible recursive locking detected
[   82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G        W
[   82.890906] --------------------------------------------
[   82.890906] ping/418 is trying to acquire lock:
[   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[   82.890906]
[   82.890906] but task is already holding lock:
[   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[   82.890906]
[   82.890906] other info that might help us debug this:
[   82.890906]  Possible unsafe locking scenario:
[   82.890906]
[   82.890906]        CPU0
[   82.890906]        ----
[   82.890906]   lock(&sch->q.lock);
[   82.890906]   lock(&sch->q.lock);
[   82.890906]
[   82.890906]  *** DEADLOCK ***
[   82.890906]
[..... other info removed for brevity....]

Example setup (eth0->eth0) to recreate
tc qdisc add dev eth0 root handle 1: htb default 30
tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
     action mirred egress redirect dev eth0

Another example(eth0->eth1->eth0) to recreate
tc qdisc add dev eth0 root handle 1: htb default 30
tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
     action mirred egress redirect dev eth1

tc qdisc add dev eth1 root handle 1: htb default 30
tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \
     action mirred egress redirect dev eth0

We fix this by adding an owner field (CPU id) to struct Qdisc set after
root qdisc is entered. When the softirq enters it a second time, if the
qdisc owner is the same CPU, the packet is dropped to break the loop.

Reported-by: Mingshuai Ren <renmingshuai@huawei.com>
Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmingshuai@huawei.com/
Fixes: 3bcb846ca4 ("net: get rid of spin_trylock() in net_tx_action()")
Fixes: e578d9c025 ("net: sched: use counter to break reclassify loops")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20240415210728.36949-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17 18:22:52 -07:00
Heiner Kallweit 9382b4f338 net: constify net_class
AFAICS all users of net_class take a const struct class * argument.
Therefore fully constify net_class.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-15 10:46:01 +01:00
Jason Xing 4d0470b9ad net: save some cycles when doing skb_attempt_defer_free()
Normally, we don't face these two exceptions very often meanwhile
we have some chance to meet the condition where the current cpu id
is the same as skb->alloc_cpu.

One simple test that can help us see the frequency of this statement
'cpu == raw_smp_processor_id()':
1. running iperf -s and iperf -c [ip] -P [MAX CPU]
2. using BPF to capture skb_attempt_defer_free()

I can see around 4% chance that happens to satisfy the statement.
So moving this statement at the beginning can save some cycles in
most cases.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-15 10:34:59 +01:00
Jakub Kicinski 3db3b62955 net: dev_addr_lists: move locking out of init/exit in kunit
We lock and unlock rtnl in init/exit for convenience,
but it started causing problems if the exit is handled
by a different thread. To avoid having to futz with
disabling locking assertions move the locking into
the test cases. We don't use ASSERTs so it should
be safe.

   ============= dev-addr-list-test (6 subtests) ==============
   [PASSED] dev_addr_test_basic
   [PASSED] dev_addr_test_sync_one
   [PASSED] dev_addr_test_add_del
   [PASSED] dev_addr_test_del_main
   [PASSED] dev_addr_test_add_set
   [PASSED] dev_addr_test_add_excl
   =============== [PASSED] dev-addr-list-test ================

Link: https://lore.kernel.org/all/20240403131936.787234-7-linux@roeck-us.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-15 10:26:35 +01:00
Wander Lairson Costa f1e197a665 drop_monitor: replace spin_lock by raw_spin_lock
trace_drop_common() is called with preemption disabled, and it acquires
a spin_lock. This is problematic for RT kernels because spin_locks are
sleeping locks in this configuration, which causes the following splat:

BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 449, name: rcuc/47
preempt_count: 1, expected: 0
RCU nest depth: 2, expected: 2
5 locks held by rcuc/47/449:
 #0: ff1100086ec30a60 ((softirq_ctrl.lock)){+.+.}-{2:2}, at: __local_bh_disable_ip+0x105/0x210
 #1: ffffffffb394a280 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0xbf/0x130
 #2: ffffffffb394a280 (rcu_read_lock){....}-{1:2}, at: __local_bh_disable_ip+0x11c/0x210
 #3: ffffffffb394a160 (rcu_callback){....}-{0:0}, at: rcu_do_batch+0x360/0xc70
 #4: ff1100086ee07520 (&data->lock){+.+.}-{2:2}, at: trace_drop_common.constprop.0+0xb5/0x290
irq event stamp: 139909
hardirqs last  enabled at (139908): [<ffffffffb1df2b33>] _raw_spin_unlock_irqrestore+0x63/0x80
hardirqs last disabled at (139909): [<ffffffffb19bd03d>] trace_drop_common.constprop.0+0x26d/0x290
softirqs last  enabled at (139892): [<ffffffffb07a1083>] __local_bh_enable_ip+0x103/0x170
softirqs last disabled at (139898): [<ffffffffb0909b33>] rcu_cpu_kthread+0x93/0x1f0
Preemption disabled at:
[<ffffffffb1de786b>] rt_mutex_slowunlock+0xab/0x2e0
CPU: 47 PID: 449 Comm: rcuc/47 Not tainted 6.9.0-rc2-rt1+ #7
Hardware name: Dell Inc. PowerEdge R650/0Y2G81, BIOS 1.6.5 04/15/2022
Call Trace:
 <TASK>
 dump_stack_lvl+0x8c/0xd0
 dump_stack+0x14/0x20
 __might_resched+0x21e/0x2f0
 rt_spin_lock+0x5e/0x130
 ? trace_drop_common.constprop.0+0xb5/0x290
 ? skb_queue_purge_reason.part.0+0x1bf/0x230
 trace_drop_common.constprop.0+0xb5/0x290
 ? preempt_count_sub+0x1c/0xd0
 ? _raw_spin_unlock_irqrestore+0x4a/0x80
 ? __pfx_trace_drop_common.constprop.0+0x10/0x10
 ? rt_mutex_slowunlock+0x26a/0x2e0
 ? skb_queue_purge_reason.part.0+0x1bf/0x230
 ? __pfx_rt_mutex_slowunlock+0x10/0x10
 ? skb_queue_purge_reason.part.0+0x1bf/0x230
 trace_kfree_skb_hit+0x15/0x20
 trace_kfree_skb+0xe9/0x150
 kfree_skb_reason+0x7b/0x110
 skb_queue_purge_reason.part.0+0x1bf/0x230
 ? __pfx_skb_queue_purge_reason.part.0+0x10/0x10
 ? mark_lock.part.0+0x8a/0x520
...

trace_drop_common() also disables interrupts, but this is a minor issue
because we could easily replace it with a local_lock.

Replace the spin_lock with raw_spin_lock to avoid sleeping in atomic
context.

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reported-by: Hu Chunyu <chuhu@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-15 09:54:15 +01:00
Eric Dumazet 32affa5578 fib: rules: no longer hold RTNL in fib_nl_dumprule()
- fib rules are already RCU protected, RTNL is not needed
  to get them.

- Fix return value at the end of a dump,
  so that NLMSG_DONE can be appended to current skb,
  saving one recvmsg() system call.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240411133340.1332796-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-12 19:09:31 -07:00
Mina Almasry a580ea994f net: mirror skb frag ref/unref helpers
Refactor some of the skb frag ref/unref helpers for improved clarity.

Implement napi_pp_get_page() to be the mirror counterpart of
napi_pp_put_page().

Implement skb_page_ref() to be the mirror of skb_page_unref().

Improve __skb_frag_ref() to become a mirror counterpart of
__skb_frag_unref(). Previously unref could handle pp & non-pp pages,
while the ref could only handle non-pp pages. Now both the ref & unref
helpers can correctly handle both pp & non-pp pages.

Now that __skb_frag_ref() can handle both pp & non-pp pages, remove
skb_pp_frag_ref(), and use __skb_frag_ref() instead.  This lets us
remove pp specific handling from skb_try_coalesce.

Additionally, since __skb_frag_ref() can now handle both pp & non-pp
pages, a latent issue in skb_shift() should now be fixed. Previously
this function would do a non-pp ref & pp unref on potential pp frags
(fragfrom). After this patch, skb_shift() should correctly do a pp
ref/unref on pp frags.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240410190505.1225848-3-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 19:29:23 -07:00
Mina Almasry f6d827b180 net: move skb ref helpers to new header
Add a new header, linux/skbuff_ref.h, which contains all the skb_*_ref()
helpers. Many of the consumers of skbuff.h do not actually use any of
the skb ref helpers, and we can speed up compilation a bit by minimizing
this header file.

Additionally in the later patch in the series we add page_pool support
to skb_frag_ref(), which requires some page_pool dependencies. We can
now add these dependencies to skbuff_ref.h instead of a very ubiquitous
skbuff.h

Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20240410190505.1225848-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 19:29:22 -07:00
Yonghong Song 699c23f02c bpf: Add bpf_link support for sk_msg and sk_skb progs
Add bpf_link support for sk_msg and sk_skb programs. We have an
internal request to support bpf_link for sk_msg programs so user
space can have a uniform handling with bpf_link based libbpf
APIs. Using bpf_link based libbpf API also has a benefit which
makes system robust by decoupling prog life cycle and
attachment life cycle.

Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240410043527.3737160-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-10 19:52:25 -07:00
Pavel Begunkov d8415a165c net: use SKB_CONSUMED in skb_attempt_defer_free()
skb_attempt_defer_free() is used to free already processed skbs, so pass
SKB_CONSUMED as the reason in kfree_skb_napi_cache().

Suggested-by: Jason Xing <kerneljasonxing@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/bcf5dbdda79688b074ab7ae2238535840a6d3fc2.1712711977.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-10 19:27:32 -07:00
Pavel Begunkov 7cb31c46b9 net: cache for same cpu skb_attempt_defer_free
Optimise skb_attempt_defer_free() when run by the same CPU the skb was
allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
disable softirqs and put the buffer into cpu local caches.

CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
I'd expect the win doubled with rx only benchmarks, as the optimisation
is for the receive path, but the test spends >55% of CPU doing writes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/a887463fb219d973ec5ad275e31194812571f1f5.1712711977.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-10 19:27:32 -07:00
Mina Almasry f58f3c9563 net: remove napi_frag_unref
With the changes in the last patches, napi_frag_unref() is now
reduandant. Remove it and use skb_page_unref directly.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240408153000.2152844-4-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-09 18:20:32 -07:00
Eric Dumazet 4308811ba9 net: display more skb fields in skb_dump()
Print these additional fields in skb_dump() to ease debugging.

- mac_len
- csum_start (in v2, at Willem suggestion)
- csum_offset (in v2, at Willem suggestion)
- priority
- mark
- alloc_cpu
- vlan_all
- encapsulation
- inner_protocol
- inner_mac_header
- inner_network_header
- inner_transport_header

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-08 14:06:31 +01:00
Eric Dumazet db77cdc696 net: dqs: use sysfs_emit() in favor of sprintf()
Commit 6025b9135f ("net: dqs: add NIC stall detector based on BQL")
added three sysfs files.

Use the recommended sysfs_emit() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-08 11:04:08 +01:00
Jason Xing 6648e61322 bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue
Fix NULL pointer data-races in sk_psock_skb_ingress_enqueue() which
syzbot reported [1].

[1]
BUG: KCSAN: data-race in sk_psock_drop / sk_psock_skb_ingress_enqueue

write to 0xffff88814b3278b8 of 8 bytes by task 10724 on cpu 1:
 sk_psock_stop_verdict net/core/skmsg.c:1257 [inline]
 sk_psock_drop+0x13e/0x1f0 net/core/skmsg.c:843
 sk_psock_put include/linux/skmsg.h:459 [inline]
 sock_map_close+0x1a7/0x260 net/core/sock_map.c:1648
 unix_release+0x4b/0x80 net/unix/af_unix.c:1048
 __sock_release net/socket.c:659 [inline]
 sock_close+0x68/0x150 net/socket.c:1421
 __fput+0x2c1/0x660 fs/file_table.c:422
 __fput_sync+0x44/0x60 fs/file_table.c:507
 __do_sys_close fs/open.c:1556 [inline]
 __se_sys_close+0x101/0x1b0 fs/open.c:1541
 __x64_sys_close+0x1f/0x30 fs/open.c:1541
 do_syscall_64+0xd3/0x1d0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75

read to 0xffff88814b3278b8 of 8 bytes by task 10713 on cpu 0:
 sk_psock_data_ready include/linux/skmsg.h:464 [inline]
 sk_psock_skb_ingress_enqueue+0x32d/0x390 net/core/skmsg.c:555
 sk_psock_skb_ingress_self+0x185/0x1e0 net/core/skmsg.c:606
 sk_psock_verdict_apply net/core/skmsg.c:1008 [inline]
 sk_psock_verdict_recv+0x3e4/0x4a0 net/core/skmsg.c:1202
 unix_read_skb net/unix/af_unix.c:2546 [inline]
 unix_stream_read_skb+0x9e/0xf0 net/unix/af_unix.c:2682
 sk_psock_verdict_data_ready+0x77/0x220 net/core/skmsg.c:1223
 unix_stream_sendmsg+0x527/0x860 net/unix/af_unix.c:2339
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg+0x140/0x180 net/socket.c:745
 ____sys_sendmsg+0x312/0x410 net/socket.c:2584
 ___sys_sendmsg net/socket.c:2638 [inline]
 __sys_sendmsg+0x1e9/0x280 net/socket.c:2667
 __do_sys_sendmsg net/socket.c:2676 [inline]
 __se_sys_sendmsg net/socket.c:2674 [inline]
 __x64_sys_sendmsg+0x46/0x50 net/socket.c:2674
 do_syscall_64+0xd3/0x1d0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75

value changed: 0xffffffff83d7feb0 -> 0x0000000000000000

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 10713 Comm: syz-executor.4 Tainted: G        W          6.8.0-syzkaller-08951-gfe46a7dd189e #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024

Prior to this, commit 4cd12c6065 ("bpf, sockmap: Fix NULL pointer
dereference in sk_psock_verdict_data_ready()") fixed one NULL pointer
similarly due to no protection of saved_data_ready. Here is another
different caller causing the same issue because of the same reason. So
we should protect it with sk_callback_lock read lock because the writer
side in the sk_psock_drop() uses "write_lock_bh(&sk->sk_callback_lock);".

To avoid errors that could happen in future, I move those two pairs of
lock into the sk_psock_data_ready(), which is suggested by John Fastabend.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: syzbot+aa8c8ec2538929f18f2d@syzkaller.appspotmail.com
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=aa8c8ec2538929f18f2d
Link: https://lore.kernel.org/all/20240329134037.92124-1-kerneljasonxing@gmail.com
Link: https://lore.kernel.org/bpf/20240404021001.94815-1-kerneljasonxing@gmail.com
2024-04-08 09:18:22 +02:00
Maxime Chevallier 6916e461e7 net: phy: Introduce ethernet link topology representation
Link topologies containing multiple network PHYs attached to the same
net_device can be found when using a PHY as a media converter for use
with an SFP connector, on which an SFP transceiver containing a PHY can
be used.

With the current model, the transceiver's PHY can't be used for
operations such as cable testing, timestamping, macsec offload, etc.

The reason being that most of the logic for these configuration, coming
from either ethtool netlink or ioctls tend to use netdev->phydev, which
in multi-phy systems will reference the PHY closest to the MAC.

Introduce a numbering scheme allowing to enumerate PHY devices that
belong to any netdev, which can in turn allow userspace to take more
precise decisions with regard to each PHY's configuration.

The numbering is maintained per-netdev, in a phy_device_list.
The numbering works similarly to a netdevice's ifindex, with
identifiers that are only recycled once INT_MAX has been reached.

This prevents races that could occur between PHY listing and SFP
transceiver removal/insertion.

The identifiers are assigned at phy_attach time, as the numbering
depends on the netdevice the phy is attached to. The PHY index can be
re-used for PHYs that are persistent.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-06 18:25:14 +01:00
Jakub Kicinski 9f06f87fef net: skbuff: generalize the skb->decrypted bit
The ->decrypted bit can be reused for other crypto protocols.
Remove the direct dependency on TLS, add helpers to clean up
the ifdefs leaking out everywhere.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-06 17:34:31 +01:00
Jakub Kicinski cf1ca1f66d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/ipv4/ip_gre.c
  17af420545 ("erspan: make sure erspan_base_hdr is present in skb->head")
  5832c4a77d ("ip_tunnel: convert __be16 tunnel flags to bitmaps")
https://lore.kernel.org/all/20240402103253.3b54a1cf@canb.auug.org.au/

Adjacent changes:

net/ipv6/ip6_fib.c
  d21d40605b ("ipv6: Fix infinite recursion in fib6_dump_done().")
  5fc68320c1 ("ipv6: remove RTNL protection from inet6_dump_fib()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 18:01:07 -07:00
Jakub Kicinski 1cfa2f10f4 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZg7vEAAKCRDbK58LschI
 gxAeAQD18uHqGbcCCnOVnaETdi3bQcSBpobuclThpN1WdMILcwEA+5Vz9Lxmt6BH
 qF/vbVHAOeCZt3LTOJ/jx1GRVstVWQk=
 =+QKv
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-04-04

We've added 7 non-merge commits during the last 5 day(s) which contain
a total of 9 files changed, 75 insertions(+), 24 deletions(-).

The main changes are:

1) Fix x86 BPF JIT under retbleed=stuff which causes kernel panics due to
   incorrect destination IP calculation and incorrect IP for relocations,
   from Uros Bizjak and Joan Bruguera Micó.

2) Fix BPF arena file descriptor leaks in the verifier,
   from Anton Protopopov.

3) Defer bpf_link deallocation to after RCU grace period as currently
   running multi-{kprobes,uprobes} programs might still access cookie
   information from the link, from Andrii Nakryiko.

4) Fix a BPF sockmap lock inversion deadlock in map_delete_elem reported
   by syzkaller, from Jakub Sitnicki.

5) Fix resolve_btfids build with musl libc due to missing linux/types.h
   include, from Natanael Copa.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, sockmap: Prevent lock inversion deadlock in map delete elem
  x86/bpf: Fix IP for relocating call depth accounting
  x86/bpf: Fix IP after emitting call depth accounting
  bpf: fix possible file descriptor leaks in verifier
  tools/resolve_btfids: fix build with musl libc
  bpf: support deferring bpf_link dealloc to after RCU grace period
  bpf: put uprobe link's path and task in release callback
====================

Link: https://lore.kernel.org/r/20240404183258.4401-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 11:37:39 -07:00
Marcelo Tosatti 2f3c7195a7 net: enable timestamp static key if CPU
For systems that use CPU isolation (via nohz_full), creating or destroying
a socket with SO_TIMESTAMP, SO_TIMESTAMPNS or SO_TIMESTAMPING with flag
SOF_TIMESTAMPING_RX_SOFTWARE will cause a static key to be enabled/disabled.
This in turn causes undesired IPIs to isolated CPUs.

So enable the static key unconditionally, if CPU isolation is enabled,
thus avoiding the IPIs.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/ZgrUiLLtbEUf9SFn@tpad
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:14:53 -07:00
Alexander Lobakin 39806b96c8 page_pool: try direct bulk recycling
Now that the checks for direct recycling possibility live inside the
Page Pool core, reuse them when performing bulk recycling.
page_pool_put_page_bulk() can be called from process context as well,
page_pool_napi_local() takes care of this at the very beginning.
Under high .ndo_xdp_xmit() traffic load, the win is 2-3% Pps assuming
the sending driver uses xdp_return_frame_bulk() on Tx completion.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20240329165507.3240110-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02 18:13:49 -07:00
Alexander Lobakin 4a96a4e807 page_pool: check for PP direct cache locality later
Since we have pool->p.napi (Jakub) and pool->cpuid (Lorenzo) to check
whether it's safe to use direct recycling, we can use both globally for
each page instead of relying solely on @allow_direct argument.
Let's assume that @allow_direct means "I'm sure it's local, don't waste
time rechecking this" and when it's false, try the mentioned params to
still recycle the page directly. If neither is true, we'll lose some
CPU cycles, but then it surely won't be hotpath. On the other hand,
paths where it's possible to use direct cache, but not possible to
safely set @allow_direct, will benefit from this move.
The whole propagation of @napi_safe through a dozen of skb freeing
functions can now go away, which saves us some stack space.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20240329165507.3240110-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-02 18:13:49 -07:00
Jakub Sitnicki ff91059932 bpf, sockmap: Prevent lock inversion deadlock in map delete elem
syzkaller started using corpuses where a BPF tracing program deletes
elements from a sockmap/sockhash map. Because BPF tracing programs can be
invoked from any interrupt context, locks taken during a map_delete_elem
operation must be hardirq-safe. Otherwise a deadlock due to lock inversion
is possible, as reported by lockdep:

       CPU0                    CPU1
       ----                    ----
  lock(&htab->buckets[i].lock);
                               local_irq_disable();
                               lock(&host->lock);
                               lock(&htab->buckets[i].lock);
  <Interrupt>
    lock(&host->lock);

Locks in sockmap are hardirq-unsafe by design. We expects elements to be
deleted from sockmap/sockhash only in task (normal) context with interrupts
enabled, or in softirq context.

Detect when map_delete_elem operation is invoked from a context which is
_not_ hardirq-unsafe, that is interrupts are disabled, and bail out with an
error.

Note that map updates are not affected by this issue. BPF verifier does not
allow updating sockmap/sockhash from a BPF tracing program today.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yue sun <samsun1006219@gmail.com>
Reported-by: syzbot+bc922f476bd65abbd466@syzkaller.appspotmail.com
Reported-by: syzbot+d4066896495db380182e@syzkaller.appspotmail.com
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+d4066896495db380182e@syzkaller.appspotmail.com
Acked-by: John Fastabend <john.fastabend@gmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=d4066896495db380182e
Closes: https://syzkaller.appspot.com/bug?extid=bc922f476bd65abbd466
Link: https://lore.kernel.org/bpf/20240402104621.1050319-1-jakub@cloudflare.com
2024-04-02 16:31:05 +02:00
Eric Dumazet c62fdf5b11 net: rps: add rps_input_queue_head_add() helper
process_backlog() can batch increments of sd->input_queue_head,
saving some memory bandwidth.

Also add READ_ONCE()/WRITE_ONCE() annotations around
sd->input_queue_head accesses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01 11:28:32 +01:00
Eric Dumazet 36b83ffcf2 net: rps: change input_queue_tail_incr_save()
input_queue_tail_incr_save() is incrementing the sd queue_tail
and save it in the flow last_qtail.

Two issues here :

- no lock protects the write on last_qtail, we should use appropriate
  annotations.

- We can perform this write after releasing the per-cpu backlog lock,
  to decrease this lock hold duration (move away the cache line miss)

Also move input_queue_head_incr() and rps helpers to include/net/rps.h,
while adding rps_ prefix to better reflect their role.

v2: Fixed a build issue (Jakub and kernel build bots)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01 11:28:32 +01:00
Eric Dumazet f7efd01fe2 net: enqueue_to_backlog() cleanup
We can remove a goto and a label by reversing a condition.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01 11:28:32 +01:00
Eric Dumazet a7ae7b0b2e net: make softnet_data.dropped an atomic_t
If under extreme cpu backlog pressure enqueue_to_backlog() has
to drop a packet, it could do this without dirtying a cache line
and potentially slowing down the target cpu.

Move sd->dropped into a separate cache line, and make it atomic.

In non pressure mode, this field is not touched, no need to consume
valuable space in a hot cache line.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01 11:28:32 +01:00
Eric Dumazet 95e48d862a net: enqueue_to_backlog() change vs not running device
If the device attached to the packet given to enqueue_to_backlog()
is not running, we drop the packet.

But we accidentally increase sd->dropped, giving false signals
to admins: sd->dropped should be reserved to cpu backlog pressure,
not to temporary glitches at device dismantles.

While we are at it, perform the netif_running() test before
we get the rps lock, and use REASON_DEV_READY
drop reason instead of NOT_SPECIFIED.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01 11:28:31 +01:00
Eric Dumazet 2fe50a4d72 net: move dev_xmit_recursion() helpers to net/core/dev.h
Move dev_xmit_recursion() and friends to net/core/dev.h

They are only used from net/core/dev.c and net/core/filter.c.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01 11:28:31 +01:00
Eric Dumazet b9495b564d net: move kick_defer_list_purge() to net/core/dev.h
kick_defer_list_purge() is defined in net/core/dev.c
and used from net/core/skubff.c

Because we need softnet_data, include <linux/netdevice.h>
from net/core/dev.h

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01 11:28:31 +01:00
Michal Swiatkowski 6dd514f481 pfcp: always set pfcp metadata
In PFCP receive path set metadata needed by flower code to do correct
classification based on this metadata.

Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01 10:49:28 +01:00
Alexander Lobakin 5b2be2ab76 net: net_test: add tests for IP tunnel flags conversion helpers
Now that there are helpers for converting IP tunnel flags between the
old __be16 format and the bitmap format, make sure they work as expected
by adding a couple of tests to the networking testing suite. The helpers
are all inline, so no dependencies on the related CONFIG_* (or a
standalone module) are needed.

Cover three possible cases:

1. No bits past BIT(15) are set, VTI/SIT bits are not set. This
   conversion is almost a direct assignment.
2. No bits past BIT(15) are set, but VTI/SIT bit is set. During the
   conversion, it must be transformed into BIT(16) in the bitmap,
   but still compatible with the __be16 format.
3. The bitmap has bits past BIT(15) set (not the VTI/SIT one). The
   result will be truncated.
   Note that currently __IP_TUNNEL_FLAG_NUM is 17 (incl. special),
   which means that the result of this case is currently
   semi-false-positive. When BIT(17) is finally here, it will be
   adjusted accordingly.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01 10:49:28 +01:00
Alexander Lobakin 5832c4a77d ip_tunnel: convert __be16 tunnel flags to bitmaps
Historically, tunnel flags like TUNNEL_CSUM or TUNNEL_ERSPAN_OPT
have been defined as __be16. Now all of those 16 bits are occupied
and there's no more free space for new flags.
It can't be simply switched to a bigger container with no
adjustments to the values, since it's an explicit Endian storage,
and on LE systems (__be16)0x0001 equals to
(__be64)0x0001000000000000.
We could probably define new 64-bit flags depending on the
Endianness, i.e. (__be64)0x0001 on BE and (__be64)0x00010000... on
LE, but that would introduce an Endianness dependency and spawn a
ton of Sparse warnings. To mitigate them, all of those places which
were adjusted with this change would be touched anyway, so why not
define stuff properly if there's no choice.

Define IP_TUNNEL_*_BIT counterparts as a bit number instead of the
value already coded and a fistful of <16 <-> bitmap> converters and
helpers. The two flags which have a different bit position are
SIT_ISATAP_BIT and VTI_ISVTI_BIT, as they were defined not as
__cpu_to_be16(), but as (__force __be16), i.e. had different
positions on LE and BE. Now they both have strongly defined places.
Change all __be16 fields which were used to store those flags, to
IP_TUNNEL_DECLARE_FLAGS() -> DECLARE_BITMAP(__IP_TUNNEL_FLAG_NUM) ->
unsigned long[1] for now, and replace all TUNNEL_* occurrences to
their bitmap counterparts. Use the converters in the places which talk
to the userspace, hardware (NFP) or other hosts (GRE header). The rest
must explicitly use the new flags only. This must be done at once,
otherwise there will be too many conversions throughout the code in
the intermediate commits.
Finally, disable the old __be16 flags for use in the kernel code
(except for the two 'irregular' flags mentioned above), to prevent
any accidental (mis)use of them. For the userspace, nothing is
changed, only additions were made.

Most noticeable bloat-o-meter difference (.text):

vmlinux:	307/-1 (306)
gre.ko:		62/0 (62)
ip_gre.ko:	941/-217 (724)	[*]
ip_tunnel.ko:	390/-900 (-510)	[**]
ip_vti.ko:	138/0 (138)
ip6_gre.ko:	534/-18 (516)	[*]
ip6_tunnel.ko:	118/-10 (108)

[*] gre_flags_to_tnl_flags() grew, but still is inlined
[**] ip_tunnel_find() got uninlined, hence such decrease

The average code size increase in non-extreme case is 100-200 bytes
per module, mostly due to sizeof(long) > sizeof(__be16), as
%__IP_TUNNEL_FLAG_NUM is less than %BITS_PER_LONG and the compilers
are able to expand the majority of bitmap_*() calls here into direct
operations on scalars.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01 10:49:28 +01:00
Johannes Berg e8058a49e6 netlink: introduce type-checking attribute iteration
There are, especially with multi-attr arrays, many cases
of needing to iterate all attributes of a specific type
in a netlink message or a nested attribute. Add specific
macros to support that case.

Also convert many instances using this spatch:

    @@
    iterator nla_for_each_attr;
    iterator name nla_for_each_attr_type;
    identifier nla;
    expression head, len, rem;
    expression ATTR;
    type T;
    identifier x;
    @@
    -nla_for_each_attr(nla, head, len, rem)
    +nla_for_each_attr_type(nla, ATTR, head, len, rem)
     {
    <... T x; ...>
    -if (nla_type(nla) == ATTR) {
     ...
    -}
     }

    @@
    identifier nla;
    iterator nla_for_each_nested;
    iterator name nla_for_each_nested_type;
    expression attr, rem;
    expression ATTR;
    type T;
    identifier x;
    @@
    -nla_for_each_nested(nla, attr, rem)
    +nla_for_each_nested_type(nla, ATTR, attr, rem)
     {
    <... T x; ...>
    -if (nla_type(nla) == ATTR) {
     ...
    -}
     }

    @@
    iterator nla_for_each_attr;
    iterator name nla_for_each_attr_type;
    identifier nla;
    expression head, len, rem;
    expression ATTR;
    type T;
    identifier x;
    @@
    -nla_for_each_attr(nla, head, len, rem)
    +nla_for_each_attr_type(nla, ATTR, head, len, rem)
     {
    <... T x; ...>
    -if (nla_type(nla) != ATTR) continue;
     ...
     }

    @@
    identifier nla;
    iterator nla_for_each_nested;
    iterator name nla_for_each_nested_type;
    expression attr, rem;
    expression ATTR;
    type T;
    identifier x;
    @@
    -nla_for_each_nested(nla, attr, rem)
    +nla_for_each_nested_type(nla, ATTR, attr, rem)
     {
    <... T x; ...>
    -if (nla_type(nla) != ATTR) continue;
     ...
     }

Although I had to undo one bad change this made, and
I also adjusted some other code for whitespace and to
use direct variable initialization now.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20240328203144.b5a6c895fb80.I1869b44767379f204998ff44dd239803f39c23e0@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-29 15:06:02 -07:00