Commit Graph

75222 Commits

Author SHA1 Message Date
Linus Torvalds 7352a6765c vfs-6.7.xattr
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppWAAKCRCRxhvAZXjc
 okB2AP4jjoRErJBwj245OIDJqzoj4m4UVOVd0MH2AkiSpANczwD/TToChdpusY2y
 qAYg1fQoGMbDVlb7Txaj9qI9ieCf9w0=
 =2PXg
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs xattr updates from Christian Brauner:
 "The 's_xattr' field of 'struct super_block' currently requires a
  mutable table of 'struct xattr_handler' entries (although each handler
  itself is const). However, no code in vfs actually modifies the
  tables.

  This changes the type of 's_xattr' to allow const tables, and modifies
  existing file systems to move their tables to .rodata. This is
  desirable because these tables contain entries with function pointers
  in them; moving them to .rodata makes it considerably less likely to
  be modified accidentally or maliciously at runtime"

* tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
  const_structs.checkpatch: add xattr_handler
  net: move sockfs_xattr_handlers to .rodata
  shmem: move shmem_xattr_handlers to .rodata
  overlayfs: move xattr tables to .rodata
  xfs: move xfs_xattr_handlers to .rodata
  ubifs: move ubifs_xattr_handlers to .rodata
  squashfs: move squashfs_xattr_handlers to .rodata
  smb: move cifs_xattr_handlers to .rodata
  reiserfs: move reiserfs_xattr_handlers to .rodata
  orangefs: move orangefs_xattr_handlers to .rodata
  ocfs2: move ocfs2_xattr_handlers and ocfs2_xattr_handler_map to .rodata
  ntfs3: move ntfs_xattr_handlers to .rodata
  nfs: move nfs4_xattr_handlers to .rodata
  kernfs: move kernfs_xattr_handlers to .rodata
  jfs: move jfs_xattr_handlers to .rodata
  jffs2: move jffs2_xattr_handlers to .rodata
  hfsplus: move hfsplus_xattr_handlers to .rodata
  hfs: move hfs_xattr_handlers to .rodata
  gfs2: move gfs2_xattr_handlers_max to .rodata
  fuse: move fuse_xattr_handlers to .rodata
  ...
2023-10-30 09:29:44 -10:00
Linus Torvalds df9c65b5fc vfs-6.7.iov_iter
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppQwAKCRCRxhvAZXjc
 om2kAP4u+eLsrhJHfUPUttGUEkSkZE+5/s/f1A/1GcV51usLSgEAu8urxAnP49GW
 INaDABXaFfKx8/KI/H2YFZPKGwlNEwY=
 =PDDE
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.iov_iter' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull iov_iter updates from Christian Brauner:
 "This contain's David's iov_iter cleanup work to convert the iov_iter
  iteration macros to inline functions:

   - Remove last_offset from iov_iter as it was only used by ITER_PIPE

   - Add a __user tag on copy_mc_to_user()'s dst argument on x86 to
     match that on powerpc and get rid of a sparse warning

   - Convert iter->user_backed to user_backed_iter() in the sound PCM
     driver

   - Convert iter->user_backed to user_backed_iter() in a couple of
     infiniband drivers

   - Renumber the type enum so that the ITER_* constants match the order
     in iterate_and_advance*()

   - Since the preceding patch puts UBUF and IOVEC at 0 and 1, change
     user_backed_iter() to just use the type value and get rid of the
     extra flag

   - Convert the iov_iter iteration macros to always-inline functions to
     make the code easier to follow. It uses function pointers, but they
     get optimised away

   - Move the check for ->copy_mc to _copy_from_iter() and
     copy_page_from_iter_atomic() rather than in memcpy_from_iter_mc()
     where it gets repeated for every segment. Instead, we check once
     and invoke a side function that can use iterate_bvec() rather than
     iterate_and_advance() and supply a different step function

   - Move the copy-and-csum code to net/ where it can be in proximity
     with the code that uses it

   - Fold memcpy_and_csum() in to its two users

   - Move csum_and_copy_from_iter_full() out of line and merge in
     csum_and_copy_from_iter() since the former is the only caller of
     the latter

   - Move hash_and_copy_to_iter() to net/ where it can be with its only
     caller"

* tag 'vfs-6.7.iov_iter' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
  iov_iter, net: Move hash_and_copy_to_iter() to net/
  iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together
  iov_iter, net: Fold in csum_and_memcpy()
  iov_iter, net: Move csum_and_copy_to/from_iter() to net/
  iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()
  iov_iter: Convert iterate*() to inline funcs
  iov_iter: Derive user-backedness from the iterator type
  iov_iter: Renumber ITER_* constants
  infiniband: Use user_backed_iter() to see if iterator is UBUF/IOVEC
  sound: Fix snd_pcm_readv()/writev() to use iov access functions
  iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user()
  iov_iter: Remove last_offset from iov_iter as it was for ITER_PIPE
2023-10-30 09:24:21 -10:00
Jakub Kicinski ce1afe2804 net: fill in MODULE_DESCRIPTION()s under net/802*
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-28 11:29:28 +01:00
Jakub Kicinski beb5eed32a net: fill in MODULE_DESCRIPTION()s under net/core
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().

Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-28 11:29:27 +01:00
Yonghong Song 06497763c8 net: bpf: Use sockopt_lock_sock() in ip_sock_set_tos()
With latest sync from net-next tree, bpf-next has a bpf selftest failure:
  [root@arch-fb-vm1 bpf]# ./test_progs -t setget_sockopt
  ...
  [   76.194349] ============================================
  [   76.194682] WARNING: possible recursive locking detected
  [   76.195039] 6.6.0-rc7-g37884503df08-dirty #67 Tainted: G        W  OE
  [   76.195518] --------------------------------------------
  [   76.195852] new_name/154 is trying to acquire lock:
  [   76.196159] ffff8c3e06ad8d30 (sk_lock-AF_INET){+.+.}-{0:0}, at: ip_sock_set_tos+0x19/0x30
  [   76.196669]
  [   76.196669] but task is already holding lock:
  [   76.197028] ffff8c3e06ad8d30 (sk_lock-AF_INET){+.+.}-{0:0}, at: inet_listen+0x21/0x70
  [   76.197517]
  [   76.197517] other info that might help us debug this:
  [   76.197919]  Possible unsafe locking scenario:
  [   76.197919]
  [   76.198287]        CPU0
  [   76.198444]        ----
  [   76.198600]   lock(sk_lock-AF_INET);
  [   76.198831]   lock(sk_lock-AF_INET);
  [   76.199062]
  [   76.199062]  *** DEADLOCK ***
  [   76.199062]
  [   76.199420]  May be due to missing lock nesting notation
  [   76.199420]
  [   76.199879] 2 locks held by new_name/154:
  [   76.200131]  #0: ffff8c3e06ad8d30 (sk_lock-AF_INET){+.+.}-{0:0}, at: inet_listen+0x21/0x70
  [   76.200644]  #1: ffffffff90f96a40 (rcu_read_lock){....}-{1:2}, at: __cgroup_bpf_run_filter_sock_ops+0x55/0x290
  [   76.201268]
  [   76.201268] stack backtrace:
  [   76.201538] CPU: 4 PID: 154 Comm: new_name Tainted: G        W  OE      6.6.0-rc7-g37884503df08-dirty #67
  [   76.202134] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
  [   76.202699] Call Trace:
  [   76.202858]  <TASK>
  [   76.203002]  dump_stack_lvl+0x4b/0x80
  [   76.203239]  __lock_acquire+0x740/0x1ec0
  [   76.203503]  lock_acquire+0xc1/0x2a0
  [   76.203766]  ? ip_sock_set_tos+0x19/0x30
  [   76.204050]  ? sk_stream_write_space+0x12a/0x230
  [   76.204389]  ? lock_release+0xbe/0x260
  [   76.204661]  lock_sock_nested+0x32/0x80
  [   76.204942]  ? ip_sock_set_tos+0x19/0x30
  [   76.205208]  ip_sock_set_tos+0x19/0x30
  [   76.205452]  do_ip_setsockopt+0x4b3/0x1580
  [   76.205719]  __bpf_setsockopt+0x62/0xa0
  [   76.205963]  bpf_sock_ops_setsockopt+0x11/0x20
  [   76.206247]  bpf_prog_630217292049c96e_bpf_test_sockopt_int+0xbc/0x123
  [   76.206660]  bpf_prog_493685a3bae00bbd_bpf_test_ip_sockopt+0x49/0x4b
  [   76.207055]  bpf_prog_b0bcd27f269aeea0_skops_sockopt+0x44c/0xec7
  [   76.207437]  __cgroup_bpf_run_filter_sock_ops+0xda/0x290
  [   76.207829]  __inet_listen_sk+0x108/0x1b0
  [   76.208122]  inet_listen+0x48/0x70
  [   76.208373]  __sys_listen+0x74/0xb0
  [   76.208630]  __x64_sys_listen+0x16/0x20
  [   76.208911]  do_syscall_64+0x3f/0x90
  [   76.209174]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
  ...

Both ip_sock_set_tos() and inet_listen() calls lock_sock(sk) which
caused a dead lock.

To fix the issue, use sockopt_lock_sock() in ip_sock_set_tos()
instead. sockopt_lock_sock() will avoid lock_sock() if it is in bpf
context.

Fixes: 878d951c67 ("inet: lock the socket in ip_sock_set_tos()")
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231027182424.1444845-1-yonghong.song@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27 15:41:28 -07:00
Kuniyuki Iwashima 3a04927f8d af_unix: Remove module remnants.
Since commit 97154bcf4d ("af_unix: Kconfig: make CONFIG_UNIX bool"),
af_unix.c is no longer built as module.

Let's remove unnecessary #if condition, exitcall, and module macros.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20231026212305.45545-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27 09:12:47 -07:00
Geliang Tang 14cb0e0bf3 mptcp: define more local variables sk
'(struct sock *)msk' is used several times in mptcp_nl_cmd_announce(),
mptcp_nl_cmd_remove() or mptcp_userspace_pm_set_flags() in pm_userspace.c,
it's worth adding a local variable sk to point it.

Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231025-send-net-next-20231025-v1-8-db8f25f798eb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27 08:47:30 -07:00
Geliang Tang a6c85fc61c mptcp: move sk assignment statement ahead
If we move the sk assignment statement ahead in mptcp_nl_cmd_sf_create()
or mptcp_nl_cmd_sf_destroy(), right after the msk null-check statements,
sk can be used after the create_err or destroy_err labels instead of
open-coding it again.

Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231025-send-net-next-20231025-v1-7-db8f25f798eb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27 08:47:30 -07:00
Geliang Tang a16c054b52 mptcp: use mptcp_get_ext helper
Use mptcp_get_ext() helper defined in protocol.h instead of open-coding
it in mptcp_sendmsg_frag().

Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231025-send-net-next-20231025-v1-6-db8f25f798eb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27 08:47:30 -07:00
Geliang Tang 83d580ddbe mptcp: use mptcp_check_fallback helper
Use __mptcp_check_fallback() helper defined in net/mptcp/protocol.h,
instead of open-coding it in both __mptcp_do_fallback() and
mptcp_diag_fill_info().

Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231025-send-net-next-20231025-v1-5-db8f25f798eb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27 08:47:30 -07:00
Geliang Tang 74cbb0c65b mptcp: drop useless ssk in pm_subflow_check_next
The code using 'ssk' parameter of mptcp_pm_subflow_check_next() has been
dropped in commit "95d686517884 (mptcp: fix subflow accounting on close)".
So drop this useless parameter ssk.

Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231025-send-net-next-20231025-v1-4-db8f25f798eb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27 08:47:30 -07:00
Geliang Tang 84c531f54a mptcp: userspace pm send RM_ADDR for ID 0
This patch adds the ability to send RM_ADDR for local ID 0. Check
whether id 0 address is removed, if not, put id 0 into a removing
list, pass it to mptcp_pm_remove_addr() to remove id 0 address.

There is no reason not to allow the userspace to remove the initial
address (ID 0). This special case was not taken into account not
letting the userspace to delete all addresses as announced.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/379
Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231025-send-net-next-20231025-v1-3-db8f25f798eb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27 08:47:17 -07:00
Jakub Kicinski 79fa29570b net: selftests: use ethtool_sprintf()
During a W=1 build GCC 13.2 says:

net/core/selftests.c: In function ‘net_selftest_get_strings’:
net/core/selftests.c:404:52: error: ‘%s’ directive output may be truncated writing up to 279 bytes into a region of size 28 [-Werror=format-truncation=]
  404 |                 snprintf(p, ETH_GSTRING_LEN, "%2d. %s", i + 1,
      |                                                    ^~
net/core/selftests.c:404:17: note: ‘snprintf’ output between 5 and 284 bytes into a destination of size 32
  404 |                 snprintf(p, ETH_GSTRING_LEN, "%2d. %s", i + 1,
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  405 |                          net_selftests[i].name);
      |                          ~~~~~~~~~~~~~~~~~~~~~~

avoid it by using ethtool_sprintf().

Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20231026022916.566661-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-27 08:09:26 -07:00
Nikolay Aleksandrov 6808918343 net: bridge: fill in MODULE_DESCRIPTION()
Fill in bridge's module description.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 11:16:44 +01:00
Eric Biggers 0a6bfaa0e6 net: ipv6: stop checking crypto_ahash_alignmask
Now that the alignmask for ahash and shash algorithms is always 0,
crypto_ahash_alignmask() always returns 0 and will be removed.  In
preparation for this, stop checking crypto_ahash_alignmask() in ah6.c.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-10-27 18:04:29 +08:00
Eric Biggers e77f5dd701 net: ipv4: stop checking crypto_ahash_alignmask
Now that the alignmask for ahash and shash algorithms is always 0,
crypto_ahash_alignmask() always returns 0 and will be removed.  In
preparation for this, stop checking crypto_ahash_alignmask() in ah4.c.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-10-27 18:04:29 +08:00
Lukas Wunner 201c0da4d0 treewide: Add SPDX identifier to IETF ASN.1 modules
Per section 4.c. of the IETF Trust Legal Provisions, "Code Components"
in IETF Documents are licensed on the terms of the BSD-3-Clause license:

https://trustee.ietf.org/documents/trust-legal-provisions/tlp-5/

The term "Code Components" specifically includes ASN.1 modules:

https://trustee.ietf.org/documents/trust-legal-provisions/code-components-list-3/

Add an SPDX identifier as well as a copyright notice pursuant to section
6.d. of the Trust Legal Provisions to all ASN.1 modules in the tree
which are derived from IETF Documents.

Section 4.d. of the Trust Legal Provisions requests that each Code
Component identify the RFC from which it is taken, so link that RFC
in every ASN.1 module.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-10-27 18:04:28 +08:00
Eric Biggers 69dde0a1fa libceph: stop checking crypto_shash_alignmask
Now that the shash algorithm type does not support nonzero alignmasks,
crypto_shash_alignmask() always returns 0 and will be removed.  In
preparation for this, stop checking crypto_shash_alignmask() in
net/ceph/messenger_v2.c.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-10-27 18:04:25 +08:00
Ido Schimmel ddd17a54e6 rtnetlink: Add MDB get support
Now that both the bridge and VXLAN drivers implement the MDB get net
device operation, expose the functionality to user space by registering
a handler for RTM_GETMDB messages. Derive the net device from the
ifindex specified in the ancillary header and invoke its MDB get NDO.

Note that unlike other get handlers, the allocation of the skb
containing the response is not performed in the common rtnetlink code as
the size is variable and needs to be determined by the respective
driver.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:51:42 +01:00
Ido Schimmel 68b380a395 bridge: mcast: Add MDB get support
Implement support for MDB get operation by looking up a matching MDB
entry, allocating the skb according to the entry's size and then filling
in the response. The operation is performed under the bridge multicast
lock to ensure that the entry does not change between the time the reply
size is determined and when the reply is filled in.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:51:42 +01:00
Ido Schimmel 6d0259dd6c bridge: mcast: Rename MDB entry get function
The current name is going to conflict with the upcoming net device
operation for the MDB get operation.

Rename the function to br_mdb_entry_skb_get(). No functional changes
intended.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:51:41 +01:00
Ido Schimmel 62ef9cba98 bridge: mcast: Factor out a helper for PG entry size calculation
Currently, netlink notifications are sent for individual port group
entries and not for the entire MDB entry itself.

Subsequent patches are going to add MDB get support which will require
the bridge driver to reply with an entire MDB entry.

Therefore, as a preparation, factor out an helper to calculate the size
of an individual port group entry. When determining the size of the
reply this helper will be invoked for each port group entry in the MDB
entry.

No functional changes intended.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:51:41 +01:00
Ido Schimmel 1b6d993509 bridge: mcast: Account for missing attributes
The 'MDBA_MDB' and 'MDBA_MDB_ENTRY' nest attributes are not accounted
for when calculating the size of MDB notifications. Add them along with
comments for existing attributes.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:51:41 +01:00
Ido Schimmel b9109b5b77 bridge: mcast: Dump MDB entries even when snooping is disabled
Currently, the bridge driver does not dump MDB entries when multicast
snooping is disabled although the entries are present in the kernel:

 # bridge mdb add dev br0 port swp1 grp 239.1.1.1 permanent
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ff9d:e61b temp
 # ip link set dev br0 type bridge mcast_snooping 0
 # bridge mdb show dev br0
 # ip link set dev br0 type bridge mcast_snooping 1
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ff9d:e61b temp

This behavior differs from other netlink dump interfaces that dump
entries regardless if they are used or not. For example, VLANs are
dumped even when VLAN filtering is disabled:

 # ip link set dev br0 type bridge vlan_filtering 0
 # bridge vlan show dev swp1
 port              vlan-id
 swp1              1 PVID Egress Untagged

Remove the check and always dump MDB entries:

 # bridge mdb add dev br0 port swp1 grp 239.1.1.1 permanent
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ffeb:1a4d temp
 # ip link set dev br0 type bridge mcast_snooping 0
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ffeb:1a4d temp
 # ip link set dev br0 type bridge mcast_snooping 1
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ffeb:1a4d temp

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:51:41 +01:00
Dmitry Safonov faadfaba5e net/tcp: Add TCP_AO_REPAIR
Add TCP_AO_REPAIR setsockopt(), getsockopt(). They let a user to repair
TCP-AO ISNs/SNEs. Also let the user hack around when (tp->repair) is on
and add ao_info on a socket in any supported state.
As SNEs now can be read/written at any moment, use
WRITE_ONCE()/READ_ONCE() to set/read them.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:46 +01:00
Dmitry Safonov 248411b8cb net/tcp: Wire up l3index to TCP-AO
Similarly how TCP_MD5SIG_FLAG_IFINDEX works for TCP-MD5,
TCP_AO_KEYF_IFINDEX is an AO-key flag that binds that MKT to a specified
by L3 ifinndex. Similarly, without this flag the key will work in
the default VRF l3index = 0 for connections.
To prevent AO-keys from overlapping, it's restricted to add key B for a
socket that has key A, which have the same sndid/rcvid and one of
the following is true:
- !(A.keyflags & TCP_AO_KEYF_IFINDEX) or !(B.keyflags & TCP_AO_KEYF_IFINDEX)
  so that any key is non-bound to a VRF
- A.l3index == B.l3index
  both want to work for the same VRF

Additionally, it's restricted to match TCP-MD5 keys for the same peer
the following way:
|--------------|--------------------|----------------|---------------|
|              | MD5 key without    |     MD5 key    |    MD5 key    |
|              |     l3index        |    l3index=0   |   l3index=N   |
|--------------|--------------------|----------------|---------------|
|  TCP-AO key  |                    |                |               |
|  without     |       reject       |    reject      |   reject      |
|  l3index     |                    |                |               |
|--------------|--------------------|----------------|---------------|
|  TCP-AO key  |                    |                |               |
|  l3index=0   |       reject       |    reject      |   allow       |
|--------------|--------------------|----------------|---------------|
|  TCP-AO key  |                    |                |               |
|  l3index=N   |       reject       |    allow       |   reject      |
|--------------|--------------------|----------------|---------------|

This is done with the help of tcp_md5_do_lookup_any_l3index() to reject
adding AO key without TCP_AO_KEYF_IFINDEX if there's TCP-MD5 in any VRF.
This is important for case where sysctl_tcp_l3mdev_accept = 1
Similarly, for TCP-AO lookups tcp_ao_do_lookup() may be used with
l3index < 0, so that __tcp_ao_key_cmp() will match TCP-AO key in any VRF.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:46 +01:00
Dmitry Safonov 67fa83f7c8 net/tcp: Add static_key for TCP-AO
Similarly to TCP-MD5, add a static key to TCP-AO that is patched out
when there are no keys on a machine and dynamically enabled with the
first setsockopt(TCP_AO) adds a key on any socket. The static key is as
well dynamically disabled later when the socket is destructed.

The lifetime of enabled static key here is the same as ao_info: it is
enabled on allocation, passed over from full socket to twsk and
destructed when ao_info is scheduled for destruction.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:45 +01:00
Dmitry Safonov d6732b95b6 net/tcp: Allow asynchronous delete for TCP-AO keys (MKTs)
Delete becomes very, very fast - almost free, but after setsockopt()
syscall returns, the key is still alive until next RCU grace period.
Which is fine for listen sockets as userspace needs to be aware of
setsockopt(TCP_AO) and accept() race and resolve it with verification
by getsockopt() after TCP connection was accepted.

The benchmark results (on non-loaded box, worse with more RCU work pending):
> ok 33    Worst case delete    16384 keys: min=5ms max=10ms mean=6.93904ms stddev=0.263421
> ok 34        Add a new key    16384 keys: min=1ms max=4ms mean=2.17751ms stddev=0.147564
> ok 35 Remove random-search    16384 keys: min=5ms max=10ms mean=6.50243ms stddev=0.254999
> ok 36         Remove async    16384 keys: min=0ms max=0ms mean=0.0296107ms stddev=0.0172078

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:45 +01:00
Dmitry Safonov ef84703a91 net/tcp: Add TCP-AO getsockopt()s
Introduce getsockopt(TCP_AO_GET_KEYS) that lets a user get TCP-AO keys
and their properties from a socket. The user can provide a filter
to match the specific key to be dumped or ::get_all = 1 may be
used to dump all keys in one syscall.

Add another getsockopt(TCP_AO_INFO) for providing per-socket/per-ao_info
stats: packet counters, Current_key/RNext_key and flags like
::ao_required and ::accept_icmps.

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:45 +01:00
Dmitry Safonov 7753c2f0a8 net/tcp: Add option for TCP-AO to (not) hash header
Provide setsockopt() key flag that makes TCP-AO exclude hashing TCP
header for peers that match the key. This is needed for interraction
with middleboxes that may change TCP options, see RFC5925 (9.2).

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:45 +01:00
Dmitry Safonov 953af8e3ac net/tcp: Ignore specific ICMPs for TCP-AO connections
Similarly to IPsec, RFC5925 prescribes:
  ">> A TCP-AO implementation MUST default to ignore incoming ICMPv4
  messages of Type 3 (destination unreachable), Codes 2-4 (protocol
  unreachable, port unreachable, and fragmentation needed -- ’hard
  errors’), and ICMPv6 Type 1 (destination unreachable), Code 1
  (administratively prohibited) and Code 4 (port unreachable) intended
  for connections in synchronized states (ESTABLISHED, FIN-WAIT-1, FIN-
  WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT) that match MKTs."

A selftest (later in patch series) verifies that this attack is not
possible in this TCP-AO implementation.

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:45 +01:00
Dmitry Safonov 2717b5adea net/tcp: Add tcp_hash_fail() ratelimited logs
Add a helper for logging connection-detailed messages for failed TCP
hash verification (both MD5 and AO).

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:45 +01:00
Dmitry Safonov 64382c71a5 net/tcp: Add TCP-AO SNE support
Add Sequence Number Extension (SNE) for TCP-AO.
This is needed to protect long-living TCP-AO connections from replaying
attacks after sequence number roll-over, see RFC5925 (6.2).

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:45 +01:00
Dmitry Safonov af09a341dc net/tcp: Add TCP-AO segments counters
Introduce segment counters that are useful for troubleshooting/debugging
as well as for writing tests.
Now there are global snmp counters as well as per-socket and per-key.

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:45 +01:00
Dmitry Safonov 0a3a809089 net/tcp: Verify inbound TCP-AO signed segments
Now there is a common function to verify signature on TCP segments:
tcp_inbound_hash(). It has checks for all possible cross-interactions
with MD5 signs as well as with unsigned segments.

The rules from RFC5925 are:
(1) Any TCP segment can have at max only one signature.
(2) TCP connections can't switch between using TCP-MD5 and TCP-AO.
(3) TCP-AO connections can't stop using AO, as well as unsigned
    connections can't suddenly start using AO.

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:45 +01:00
Dmitry Safonov 9427c6aa3e net/tcp: Sign SYN-ACK segments with TCP-AO
Similarly to RST segments, wire SYN-ACKs to TCP-AO.
tcp_rsk_used_ao() is handy here to check if the request socket used AO
and needs a signature on the outgoing segments.

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:45 +01:00
Dmitry Safonov 06b22ef295 net/tcp: Wire TCP-AO to request sockets
Now when the new request socket is created from the listening socket,
it's recorded what MKT was used by the peer. tcp_rsk_used_ao() is
a new helper for checking if TCP-AO option was used to create the
request socket.
tcp_ao_copy_all_matching() will copy all keys that match the peer on the
request socket, as well as preparing them for the usage (creating
traffic keys).

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:45 +01:00
Dmitry Safonov decde2586b net/tcp: Add TCP-AO sign to twsk
Add support for sockets in time-wait state.
ao_info as well as all keys are inherited on transition to time-wait
socket. The lifetime of ao_info is now protected by ref counter, so
that tcp_ao_destroy_sock() will destruct it only when the last user is
gone.

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:44 +01:00
Dmitry Safonov ba7783ad45 net/tcp: Add AO sign to RST packets
Wire up sending resets to TCP-AO hashing.

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:44 +01:00
Dmitry Safonov f7dca36fc5 net/tcp: Add tcp_parse_auth_options()
Introduce a helper that:
(1) shares the common code with TCP-MD5 header options parsing
(2) looks for hash signature only once for both TCP-MD5 and TCP-AO
(3) fails with -EEXIST if any TCP sign option is present twice, see
    RFC5925 (2.2):
    ">> A single TCP segment MUST NOT have more than one TCP-AO in its
    options sequence. When multiple TCP-AOs appear, TCP MUST discard
    the segment."

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:44 +01:00
Dmitry Safonov 1e03d32bea net/tcp: Add TCP-AO sign to outgoing packets
Using precalculated traffic keys, sign TCP segments as prescribed by
RFC5925. Per RFC, TCP header options are included in sign calculation:
"The TCP header, by default including options, and where the TCP
checksum and TCP-AO MAC fields are set to zero, all in network-
byte order." (5.1.3)

tcp_ao_hash_header() has exclude_options parameter to optionally exclude
TCP header from hash calculation, as described in RFC5925 (9.1), this is
needed for interaction with middleboxes that may change "some TCP
options". This is wired up to AO key flags and setsockopt() later.

Similarly to TCP-MD5 hash TCP segment fragments.

From this moment a user can start sending TCP-AO signed segments with
one of crypto ahash algorithms from supported by Linux kernel. It can
have a user-specified MAC length, to either save TCP option header space
or provide higher protection using a longer signature.
The inbound segments are not yet verified, TCP-AO option is ignored and
they are accepted.

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:44 +01:00
Dmitry Safonov 7c2ffaf21b net/tcp: Calculate TCP-AO traffic keys
Add traffic key calculation the way it's described in RFC5926.
Wire it up to tcp_finish_connect() and cache the new keys straight away
on already established TCP connections.

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:44 +01:00
Dmitry Safonov 0aadc73995 net/tcp: Prevent TCP-MD5 with TCP-AO being set
Be as conservative as possible: if there is TCP-MD5 key for a given peer
regardless of L3 interface - don't allow setting TCP-AO key for the same
peer. According to RFC5925, TCP-AO is supposed to replace TCP-MD5 and
there can't be any switch between both on any connected tuple.
Later it can be relaxed, if there's a use, but in the beginning restrict
any intersection.

Note: it's still should be possible to set both TCP-MD5 and TCP-AO keys
on a listening socket for *different* peers.

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:44 +01:00
Dmitry Safonov 4954f17dde net/tcp: Introduce TCP_AO setsockopt()s
Add 3 setsockopt()s:
1. TCP_AO_ADD_KEY to add a new Master Key Tuple (MKT) on a socket
2. TCP_AO_DEL_KEY to delete present MKT from a socket
3. TCP_AO_INFO to change flags, Current_key/RNext_key on a TCP-AO sk

Userspace has to introduce keys on every socket it wants to use TCP-AO
option on, similarly to TCP_MD5SIG/TCP_MD5SIG_EXT.
RFC5925 prohibits definition of MKTs that would match the same peer,
so do sanity checks on the data provided by userspace. Be as
conservative as possible, including refusal of defining MKT on
an established connection with no AO, removing the key in-use and etc.

(1) and (2) are to be used by userspace key manager to add/remove keys.
(3) main purpose is to set RNext_key, which (as prescribed by RFC5925)
is the KeyID that will be requested in TCP-AO header from the peer to
sign their segments with.

At this moment the life of ao_info ends in tcp_v4_destroy_sock().

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:44 +01:00
Dmitry Safonov c845f5f359 net/tcp: Add TCP-AO config and structures
Introduce new kernel config option and common structures as well as
helpers to be used by TCP-AO code.

Co-developed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Co-developed-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:44 +01:00
Dmitry Safonov 8c73b26315 net/tcp: Prepare tcp_md5sig_pool for TCP-AO
TCP-AO, similarly to TCP-MD5, needs to allocate tfms on a slow-path,
which is setsockopt() and use crypto ahash requests on fast paths,
which are RX/TX softirqs. Also, it needs a temporary/scratch buffer
for preparing the hash.

Rework tcp_md5sig_pool in order to support other hashing algorithms
than MD5. It will make it possible to share pre-allocated crypto_ahash
descriptors and scratch area between all TCP hash users.

Internally tcp_sigpool calls crypto_clone_ahash() API over pre-allocated
crypto ahash tfm. Kudos to Herbert, who provided this new crypto API.

I was a little concerned over GFP_ATOMIC allocations of ahash and
crypto_request in RX/TX (see tcp_sigpool_start()), so I benchmarked both
"backends" with different algorithms, using patched version of iperf3[2].
On my laptop with i7-7600U @ 2.80GHz:

                         clone-tfm                per-CPU-requests
TCP-MD5                  2.25 Gbits/sec           2.30 Gbits/sec
TCP-AO(hmac(sha1))       2.53 Gbits/sec           2.54 Gbits/sec
TCP-AO(hmac(sha512))     1.67 Gbits/sec           1.64 Gbits/sec
TCP-AO(hmac(sha384))     1.77 Gbits/sec           1.80 Gbits/sec
TCP-AO(hmac(sha224))     1.29 Gbits/sec           1.30 Gbits/sec
TCP-AO(hmac(sha3-512))    481 Mbits/sec            480 Mbits/sec
TCP-AO(hmac(md5))        2.07 Gbits/sec           2.12 Gbits/sec
TCP-AO(hmac(rmd160))     1.01 Gbits/sec            995 Mbits/sec
TCP-AO(cmac(aes128))     [not supporetd yet]      2.11 Gbits/sec

So, it seems that my concerns don't have strong grounds and per-CPU
crypto_request allocation can be dropped/removed from tcp_sigpool once
ciphers get crypto_clone_ahash() support.

[1]: https://lore.kernel.org/all/ZDefxOq6Ax0JeTRH@gondor.apana.org.au/T/#u
[2]: https://github.com/0x7f454c46/iperf/tree/tcp-md5-ao
Signed-off-by: Dmitry Safonov <dima@arista.com>
Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27 10:35:44 +01:00
Florian Westphal eefed7662f xfrm: policy: fix layer 4 flowi decoding
The commit shipped with two bugs:
 fl4->fl4_icmp_type = flkeys->icmp.type;
 fl4->fl4_icmp_type = flkeys->icmp.code;
               ~~~~ should have been "code".

But the more severe bug is that I got fooled by flowi member defines:
fl4_icmp_type, fl4_gre_key and fl4_dport share the same union/address.

Fix typo and make gre/icmp key setting depend on the l4 protocol.

Fixes: 7a0207094f ("xfrm: policy: replace session decode with flow dissector")
Reported-and-tested-by: Antony Antony <antony@phenome.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-10-27 10:12:09 +02:00
Hangyu Hua ce07087964 9p/net: fix possible memory leak in p9_check_errors()
When p9pdu_readf() is called with "s?d" attribute, it allocates a pointer
that will store a string. But when p9pdu_readf() fails while handling "d"
then this pointer will not be freed in p9_check_errors().

Fixes: 51a87c552d ("9p: rework client code to use new protocol support functions")
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Message-ID: <20231027030302.11927-1-hbh25y@gmail.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2023-10-27 12:44:13 +09:00
Dominique Martinet 39763480dd 9p/net: xen: fix false positive printf format overflow warning
Use the constant to make the compiler happy about this warning:
net/9p/trans_xen.c: In function ‘xen_9pfs_front_changed’:
net/9p/trans_xen.c:444:39: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 8 [-Wformat-overflow=]
  444 |                 sprintf(str, "ring-ref%d", i);
      |                                       ^~
In function ‘xen_9pfs_front_init’,
    inlined from ‘xen_9pfs_front_changed’ at net/9p/trans_xen.c:516:8,
    inlined from ‘xen_9pfs_front_changed’ at net/9p/trans_xen.c:504:13:
net/9p/trans_xen.c:444:30: note: directive argument in the range [-2147483644, 2147483646]
  444 |                 sprintf(str, "ring-ref%d", i);
      |                              ^~~~~~~~~~~~
net/9p/trans_xen.c:444:17: note: ‘sprintf’ output between 10 and 20 bytes into a destination of size 16
  444 |                 sprintf(str, "ring-ref%d", i);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
net/9p/trans_xen.c: In function ‘xen_9pfs_front_changed’:
net/9p/trans_xen.c:450:45: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 2 [-Wformat-overflow=]
  450 |                 sprintf(str, "event-channel-%d", i);
      |                                             ^~
In function ‘xen_9pfs_front_init’,
    inlined from ‘xen_9pfs_front_changed’ at net/9p/trans_xen.c:516:8,
    inlined from ‘xen_9pfs_front_changed’ at net/9p/trans_xen.c:504:13:
net/9p/trans_xen.c:450:30: note: directive argument in the range [-2147483644, 2147483646]
  450 |                 sprintf(str, "event-channel-%d", i);
      |                              ^~~~~~~~~~~~~~~~~~
net/9p/trans_xen.c:450:17: note: ‘sprintf’ output between 16 and 26 bytes into a destination of size 16
  450 |                 sprintf(str, "event-channel-%d", i);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is no change in logic: there only are a constant number of rings,
and there also already is a BUILD_BUG_ON that checks if that constant
goes over 9 as anything bigger would no longer fit the event-channel-%d
destination size.

In theory having that size as part of the struct means it could be
modified by another thread and makes the compiler lose track of possible
values for 'i' here, using the constant directly here makes it work.

Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Message-ID: <20231025103445.1248103-3-asmadeus@codewreck.org>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
2023-10-27 12:44:08 +09:00
Jakub Kicinski edd68156bc wireless-next patches for v6.7
The third, and most likely the last, features pull request for v6.7.
 Fixes all over and only few small new features.
 
 Major changes:
 
 iwlwifi
 
 * more Multi-Link Operation (MLO) work
 
 ath12k
 
 * QCN9274: mesh support
 
 ath11k
 
 * firmware-2.bin container file format support
 -----BEGIN PGP SIGNATURE-----
 
 iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmU6KqgRHGt2YWxvQGtl
 cm5lbC5vcmcACgkQbhckVSbrbZtyMwf7B/BqV0LCNzBxtrWl3WYtgQgULgWFmEJt
 83/Vo8pXelZzzMMERwvZtPCwEUm/L/vOO/a/k0oSz/XQbt4PTIBGnWA7JwYZGY++
 1Kc79oMyXxG4Q4RCnKG/qQMzCnyL54RHUfFQrNaa3Bkgp7vGobU+ixH4NaqHI3M9
 OFmyhCklk9AO0VTtT6vQQBM6wM3UC1adneZMVlb8xD2Wi5rkrRk4PX5msgaYrStR
 ketZE6IPnnX8DziqGZPlTz1SSuOSnwGTOramdeGLKIUUlZbPWHTSBZ8lh/xnvGUB
 561mp3/iguFtq2NvduPBqItotBzLGvnJZbLDrBPxB/v99q+7/cziSA==
 =Xf7b
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2023-10-26' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.7

The third, and most likely the last, features pull request for v6.7.
Fixes all over and only few small new features.

Major changes:

iwlwifi
 - more Multi-Link Operation (MLO) work

ath12k
 - QCN9274: mesh support

ath11k
 - firmware-2.bin container file format support

* tag 'wireless-next-2023-10-26' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (155 commits)
  wifi: ray_cs: Remove unnecessary (void*) conversions
  Revert "wifi: ath11k: call ath11k_mac_fils_discovery() without condition"
  wifi: ath12k: Introduce and use ath12k_sta_to_arsta()
  wifi: ath12k: fix htt mlo-offset event locking
  wifi: ath12k: fix dfs-radar and temperature event locking
  wifi: ath11k: fix gtk offload status event locking
  wifi: ath11k: fix htt pktlog locking
  wifi: ath11k: fix dfs radar event locking
  wifi: ath11k: fix temperature event locking
  wifi: ath12k: rename the sc naming convention to ab
  wifi: ath12k: rename the wmi_sc naming convention to wmi_ab
  wifi: ath11k: add firmware-2.bin support
  wifi: ath11k: qmi: refactor ath11k_qmi_m3_load()
  wifi: rtw89: cleanup firmware elements parsing
  wifi: rt2x00: rework MT7620 PA/LNA RF calibration
  wifi: rt2x00: rework MT7620 channel config function
  wifi: rt2x00: improve MT7620 register initialization
  MAINTAINERS: wifi: rt2x00: drop Helmut Schaa
  wifi: wlcore: main: replace deprecated strncpy with strscpy
  wifi: wlcore: boot: replace deprecated strncpy with strscpy
  ...
====================

Link: https://lore.kernel.org/r/20231026090411.B2426C433CB@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26 20:27:58 -07:00
Jakub Kicinski c6f9b7138b bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZTp12QAKCRDbK58LschI
 g8BrAQDifqp5liEEdXV8jdReBwJtqInjrL5tzy5LcyHUMQbTaAEA6Ph3Ct3B+3oA
 mFnIW/y6UJiJrby0Xz4+vV5BXI/5WQg=
 =pLCV
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-10-26

We've added 51 non-merge commits during the last 10 day(s) which contain
a total of 75 files changed, 5037 insertions(+), 200 deletions(-).

The main changes are:

1) Add open-coded task, css_task and css iterator support.
   One of the use cases is customizable OOM victim selection via BPF,
   from Chuyi Zhou.

2) Fix BPF verifier's iterator convergence logic to use exact states
   comparison for convergence checks, from Eduard Zingerman,
   Andrii Nakryiko and Alexei Starovoitov.

3) Add BPF programmable net device where bpf_mprog defines the logic
   of its xmit routine. It can operate in L3 and L2 mode,
   from Daniel Borkmann and Nikolay Aleksandrov.

4) Batch of fixes for BPF per-CPU kptr and re-enable unit_size checking
   for global per-CPU allocator, from Hou Tao.

5) Fix libbpf which eagerly assumed that SHT_GNU_verdef ELF section
   was going to be present whenever a binary has SHT_GNU_versym section,
   from Andrii Nakryiko.

6) Fix BPF ringbuf correctness to fold smp_mb__before_atomic() into
   atomic_set_release(), from Paul E. McKenney.

7) Add a warning if NAPI callback missed xdp_do_flush() under
   CONFIG_DEBUG_NET which helps checking if drivers were missing
   the former, from Sebastian Andrzej Siewior.

8) Fix missed RCU read-lock in bpf_task_under_cgroup() which was throwing
   a warning under sleepable programs, from Yafang Shao.

9) Avoid unnecessary -EBUSY from htab_lock_bucket by disabling IRQ before
   checking map_locked, from Song Liu.

10) Make BPF CI linked_list failure test more robust,
    from Kumar Kartikeya Dwivedi.

11) Enable samples/bpf to be built as PIE in Fedora, from Viktor Malik.

12) Fix xsk starving when multiple xsk sockets were associated with
    a single xsk_buff_pool, from Albert Huang.

13) Clarify the signed modulo implementation for the BPF ISA standardization
    document that it uses truncated division, from Dave Thaler.

14) Improve BPF verifier's JEQ/JNE branch taken logic to also consider
    signed bounds knowledge, from Andrii Nakryiko.

15) Add an option to XDP selftests to use multi-buffer AF_XDP
    xdp_hw_metadata and mark used XDP programs as capable to use frags,
    from Larysa Zaremba.

16) Fix bpftool's BTF dumper wrt printing a pointer value and another
    one to fix struct_ops dump in an array, from Manu Bretelle.

* tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (51 commits)
  netkit: Remove explicit active/peer ptr initialization
  selftests/bpf: Fix selftests broken by mitigations=off
  samples/bpf: Allow building with custom bpftool
  samples/bpf: Fix passing LDFLAGS to libbpf
  samples/bpf: Allow building with custom CFLAGS/LDFLAGS
  bpf: Add more WARN_ON_ONCE checks for mismatched alloc and free
  selftests/bpf: Add selftests for netkit
  selftests/bpf: Add netlink helper library
  bpftool: Extend net dump with netkit progs
  bpftool: Implement link show support for netkit
  libbpf: Add link-based API for netkit
  tools: Sync if_link uapi header
  netkit, bpf: Add bpf programmable net device
  bpf: Improve JEQ/JNE branch taken logic
  bpf: Fold smp_mb__before_atomic() into atomic_set_release()
  bpf: Fix unnecessary -EBUSY from htab_lock_bucket
  xsk: Avoid starving the xsk further down the list
  bpf: print full verifier states on infinite loop detection
  selftests/bpf: test if state loops are detected in a tricky case
  bpf: correct loop detection for iterators convergence
  ...
====================

Link: https://lore.kernel.org/r/20231026150509.2824-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26 20:02:41 -07:00
Jakub Kicinski ea23fbd2a8 netlink: make range pointers in policies const
struct nla_policy is usually constant itself, but unless
we make the ranges inside constant we won't be able to
make range structs const. The ranges are not modified
by the core.

Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231025162204.132528-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26 16:24:09 -07:00
Jakub Kicinski ec4c20ca09 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/mac80211/rx.c
  91535613b6 ("wifi: mac80211: don't drop all unprotected public action frames")
  6c02fab724 ("wifi: mac80211: split ieee80211_drop_unencrypted_mgmt() return value")

Adjacent changes:

drivers/net/ethernet/apm/xgene/xgene_enet_main.c
  61471264c0 ("net: ethernet: apm: Convert to platform remove callback returning void")
  d2ca43f306 ("net: xgene: Fix unused xgene_enet_of_match warning for !CONFIG_OF")

net/vmw_vsock/virtio_transport.c
  64c99d2d6a ("vsock/virtio: support to send non-linear skb")
  53b08c4985 ("vsock/virtio: initialize the_virtio_vsock before using VQs")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26 13:46:28 -07:00
Paolo Abeni 3967336126 netfilter pull request 23-10-25
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmU5gvgACgkQ1V2XiooU
 IOS9eA//QyIqcGRzr+tX1ZPwikkicmSb7w8vkrY7jXMNWNiye54tA5fSJsxqpIMy
 9J9k8eB+fI6AJV44tOq4K8XsYCcI4ZEst2mftumvuq8igX27ulz46uLvoVwnqhwc
 NlPO06RwSHQHR3S5tKRRwwYfUwpPjDRCW15c14pHw4EkHAaL+dItLwATrrJhPv93
 PUZNGbB+i+55QrLJdMMshvpoPAhLmo57cDvDcerhOWygaoxiaKIaR0bRQ40eM3Zl
 j9veG2oiuehv/RHqFJ5MBCiqrIYRHU8kTflqaVA+ODfgUSbijCcv/RQxaILnwnZd
 57vdLSBTgVFh2uiPTYGAxfUwv3BpA7g9uzmeBgMbl/t71HhYoPe/QhjvoVhlzNl/
 1JrCFbrQVaCdhtZbDt7f359mUtXv9yh1+9pBytEkdbxRfsDmzRVsFrUeLo0P9Ho8
 4jKpnaqPTdnzhfoQocZjL7+M22/zk6jZCu1Pcs318yqpTkhJwiTtByo4iaQ2f0F7
 xgA/auZ/33mmirFXnLzMoA4b0TbJNm+Jjye3tdbu0FlY2sKb943RW2kXg9rdfyOL
 OUvSSux7Fezyq5y55+KV4FFrKgYhrZ5tWR8mVBg1KHrYs9r9p7SmXL3KzbSPxBvn
 QrPdCQsQ187QNIkwli/ChYQcrahwSiEqFSJbdDmzIA7AFNLHVaY=
 =ZG+P
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next. Mostly
nf_tables updates with two patches for connlabel and br_netfilter.

1) Rename function name to perform on-demand GC for rbtree elements,
   and replace async GC in rbtree by sync GC. Patches from Florian Westphal.

2) Use commit_mutex for NFT_MSG_GETRULE_RESET to ensure that two
   concurrent threads invoking this command do not underrun stateful
   objects. Patches from Phil Sutter.

3) Use single hook to deal with IP and ARP packets in br_netfilter.
   Patch from Florian Westphal.

4) Use atomic_t in netns->connlabel use counter instead of using a
   spinlock, also patch from Florian.

5) Cleanups for stateful objects infrastructure in nf_tables.
   Patches from Phil Sutter.

6) Flush path uses opaque set element offered by the iterator, instead of
   calling pipapo_deactivate() which looks up for it again.

7) Set backend .flush interface always succeeds, make it return void
   instead.

8) Add struct nft_elem_priv placeholder structure and use it by replacing
   void * to pass opaque set element representation from backend to frontend
   which defeats compiler type checks.

9) Shrink memory consumption of set element transactions, by reducing
   struct nft_trans_elem object size and reducing stack memory usage.

10) Use struct nft_elem_priv also for set backend .insert operation too.

11) Carry reset flag in nft_set_dump_ctx structure, instead of passing it
    as a function argument, from Phil Sutter.

netfilter pull request 23-10-25

* tag 'nf-next-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: Carry reset boolean in nft_set_dump_ctx
  netfilter: nf_tables: set->ops->insert returns opaque set element in case of EEXIST
  netfilter: nf_tables: shrink memory consumption of set elements
  netfilter: nf_tables: expose opaque set element as struct nft_elem_priv
  netfilter: nf_tables: set backend .flush always succeeds
  netfilter: nft_set_pipapo: no need to call pipapo_deactivate() from flush
  netfilter: nf_tables: Carry reset boolean in nft_obj_dump_ctx
  netfilter: nf_tables: nft_obj_filter fits into cb->ctx
  netfilter: nf_tables: Carry s_idx in nft_obj_dump_ctx
  netfilter: nf_tables: A better name for nft_obj_filter
  netfilter: nf_tables: Unconditionally allocate nft_obj_filter
  netfilter: nf_tables: Drop pointless memset in nf_tables_dump_obj
  netfilter: conntrack: switch connlabels to atomic_t
  br_netfilter: use single forward hook for ip and arp
  netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests
  netfilter: nf_tables: Introduce nf_tables_getrule_single()
  netfilter: nf_tables: Open-code audit log call in nf_tables_getrule()
  netfilter: nft_set_rbtree: prefer sync gc to async worker
  netfilter: nft_set_rbtree: rename gc deactivate+erase function
====================

Link: https://lore.kernel.org/r/20231025212555.132775-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-26 12:20:35 +02:00
Alex Henrie 629df6701c net: ipv6/addrconf: clamp preferred_lft to the minimum required
If the preferred lifetime was less than the minimum required lifetime,
ipv6_create_tempaddr would error out without creating any new address.
On my machine and network, this error happened immediately with the
preferred lifetime set to 1 second, after a few minutes with the
preferred lifetime set to 4 seconds, and not at all with the preferred
lifetime set to 5 seconds. During my investigation, I found a Stack
Exchange post from another person who seems to have had the same
problem: They stopped getting new addresses if they lowered the
preferred lifetime below 3 seconds, and they didn't really know why.

The preferred lifetime is a preference, not a hard requirement. The
kernel does not strictly forbid new connections on a deprecated address,
nor does it guarantee that the address will be disposed of the instant
its total valid lifetime expires. So rather than disable IPv6 privacy
extensions altogether if the minimum required lifetime swells above the
preferred lifetime, it is more in keeping with the user's intent to
increase the temporary address's lifetime to the minimum necessary for
the current network conditions.

With these fixes, setting the preferred lifetime to 3 or 4 seconds "just
works" because the extra fraction of a second is practically
unnoticeable. It's even possible to reduce the time before deprecation
to 1 or 2 seconds by also disabling duplicate address detection (setting
/proc/sys/net/ipv6/conf/*/dad_transmits to 0). I realize that that is a
pretty niche use case, but I know at least one person who would gladly
sacrifice performance and convenience to be sure that they are getting
the maximum possible level of privacy.

Link: https://serverfault.com/a/1031168/310447
Signed-off-by: Alex Henrie <alexhenrie24@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231024212312.299370-3-alexhenrie24@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 18:23:06 -07:00
Alex Henrie bfbf81b310 net: ipv6/addrconf: clamp preferred_lft to the maximum allowed
Without this patch, there is nothing to stop the preferred lifetime of a
temporary address from being greater than its valid lifetime. If that
was the case, the valid lifetime was effectively ignored.

Signed-off-by: Alex Henrie <alexhenrie24@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231024212312.299370-2-alexhenrie24@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 18:23:06 -07:00
Yan Zhai 03d6c848bf ipv6: avoid atomic fragment on GSO packets
When the ipv6 stack output a GSO packet, if its gso_size is larger than
dst MTU, then all segments would be fragmented. However, it is possible
for a GSO packet to have a trailing segment with smaller actual size
than both gso_size as well as the MTU, which leads to an "atomic
fragment". Atomic fragments are considered harmful in RFC-8021. An
Existing report from APNIC also shows that atomic fragments are more
likely to be dropped even it is equivalent to a no-op [1].

Add an extra check in the GSO slow output path. For each segment from
the original over-sized packet, if it fits with the path MTU, then avoid
generating an atomic fragment.

Link: https://www.potaroo.net/presentations/2022-03-01-ipv6-frag.pdf [1]
Fixes: b210de4f8c ("net: ipv6: Validate GSO SKB before finish IPv6 processing")
Reported-by: David Wragg <dwragg@cloudflare.com>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Link: https://lore.kernel.org/r/90912e3503a242dca0bc36958b11ed03a2696e5e.1698156966.git.yan@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 18:04:29 -07:00
Yan Zhai 1f7ec1b372 ipv6: refactor ip6_finish_output for GSO handling
Separate GSO and non-GSO packets handling to make the logic cleaner. For
GSO packets, frag_max_size check can be omitted because it is only
useful for packets defragmented by netfilter hooks. Both local output
and GRO logic won't produce GSO packets when defragment is needed. This
also mirrors what IPv4 side code is doing.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/0e1d4599f858e2becff5c4fe0b5f843236bc3fe8.1698156966.git.yan@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 18:04:29 -07:00
Yan Zhai e57a344785 ipv6: drop feature RTAX_FEATURE_ALLFRAG
RTAX_FEATURE_ALLFRAG was added before the first git commit:

https://www.mail-archive.com/bk-commits-head@vger.kernel.org/msg03399.html

The feature would send packets to the fragmentation path if a box
receives a PMTU value with less than 1280 byte. However, since commit
9d289715eb ("ipv6: stop sending PTB packets for MTU < 1280"), such
message would be simply discarded. The feature flag is neither supported
in iproute2 utility. In theory one can still manipulate it with direct
netlink message, but it is not ideal because it was based on obsoleted
guidance of RFC-2460 (replaced by RFC-8200).

The feature would always test false at the moment, so remove related
code or mark them as unused.

Signed-off-by: Yan Zhai <yan@cloudflare.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/d78e44dcd9968a252143ffe78460446476a472a1.1698156966.git.yan@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 18:04:29 -07:00
Jakub Kicinski 5e5d8b94a4 netfilter pull request 23-10-25
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmU45rAACgkQ1V2XiooU
 IORtEQ//U91FHPqc1KpJi5lAnXXAXaDji6RjZ080bwk4H3oXc2moc71SiGEgggGs
 POZEnN0sNJXfUacdG23pQGLnrT1iQpG927mzV01W9HhyZEopO4g+mRt5ymt/qmvO
 Q9MKWuNOlJCD5blPyKxU7VF3LsQynaPST1IbuPI1NVKiqNUpIpAWC1G+Ofpt67QY
 Tq7KiJDX0yc+51OFT9Ahs3piSbzC5bl0yC4iynajPxziv+rUiJW5ym2GM24G2rNh
 /SD4EeJkArdFa3I4Kf15Hnj9809qQP22PDhoQ2Hzzr7XbveArmPjaI0UQ39uV5Jr
 1/lFP3iQMBsj04dI/xRLBHJHb2WZvlNa+btV/RHuaw1TEnYevdarMl3Lh0q7p5sT
 3M4JBbk0+bq7ZXWmDBT48ZQs4S5UqMscunZXKg2k0fZPn/rSlASAZ3TAXZuF0avp
 KLQGQsjeBX/zgmQqhq37/oD+YV13LCtEqC0xz4WgX9WpVvgyMR3LFcsHQcZBAVUN
 PJenvgmpdo8sbhABOXsURJPVDo0JzS4xZhrPyIKaojTo33KfQ/1Z5Ef0EOkbs75+
 6wMoUTdvcZK+Y5f6hvMQ/XOu7XNz0sVZlfBjAhFrVU/TsbprviQCN8QB1IQNHclm
 5A93VnID0WPCSAmOmaIdMlcJka4wKv4irI+Iv8vNlQXqV7dXuzQ=
 =r+0z
 -----END PGP SIGNATURE-----

Merge tag 'nf-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

This patch contains two late Netfilter's flowtable fixes for net:

1) Flowtable GC pushes back packets to classic path in every GC run,
   ie. every second. This is because NF_FLOW_HW_ESTABLISHED is only
   used by sched/act_ct (never set) and IPS_SEEN_REPLY might be unset
   by the time the flow is offloaded (this status bit is only reliable
   in the sched/act_ct datapath).

2) sched/act_ct logic to push back packets to classic path to reevaluate
   if UDP flow is unidirectional only applies if IPS_HW_OFFLOAD_BIT is
   set on and no hardware offload request is pending to be handled.
   From Vlad Buslov.

These two patches fixes two problems that were introduced in the
previous 6.5 development cycle.

* tag 'nf-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  net/sched: act_ct: additional checks for outdated flows
  netfilter: flowtable: GC pushes back packets to classic path
====================

Link: https://lore.kernel.org/r/20231025100819.2664-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 16:02:06 -07:00
Alexandru Matei 53b08c4985 vsock/virtio: initialize the_virtio_vsock before using VQs
Once VQs are filled with empty buffers and we kick the host, it can send
connection requests. If the_virtio_vsock is not initialized before,
replies are silently dropped and do not reach the host.

virtio_transport_send_pkt() can queue packets once the_virtio_vsock is
set, but they won't be processed until vsock->tx_run is set to true. We
queue vsock->send_pkt_work when initialization finishes to send those
packets queued earlier.

Fixes: 0deab087b1 ("vsock/virtio: use RCU to avoid use-after-free on the_virtio_vsock")
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20231024191742.14259-1-alexandru.matei@uipath.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 15:49:47 -07:00
Dominique Martinet 9b5c628183 9p: v9fs_listxattr: fix %s null argument warning
W=1 warns about null argument to kprintf:
In file included from fs/9p/xattr.c:12:
In function ‘v9fs_xattr_get’,
    inlined from ‘v9fs_listxattr’ at fs/9p/xattr.c:142:9:
include/net/9p/9p.h:55:2: error: ‘%s’ directive argument is null
[-Werror=format-overflow=]
   55 |  _p9_debug(level, __func__, fmt, ##__VA_ARGS__)
      |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use an empty string instead of :
 - this is ok 9p-wise because p9pdu_vwritef serializes a null string
and an empty string the same way (one '0' word for length)
 - since this degrades the print statements, add new single quotes for
xattr's name delimter (Old: "file = (null)", new: "file = ''")

Link: https://lore.kernel.org/r/20231008060138.517057-1-suhui@nfschina.com
Suggested-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Acked-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Message-ID: <20231025103445.1248103-2-asmadeus@codewreck.org>
2023-10-26 07:05:52 +09:00
Marco Elver 355f074609 9p/trans_fd: Annotate data-racy writes to file::f_flags
syzbot reported:

 | BUG: KCSAN: data-race in p9_fd_create / p9_fd_create
 |
 | read-write to 0xffff888130fb3d48 of 4 bytes by task 15599 on cpu 0:
 |  p9_fd_open net/9p/trans_fd.c:842 [inline]
 |  p9_fd_create+0x210/0x250 net/9p/trans_fd.c:1092
 |  p9_client_create+0x595/0xa70 net/9p/client.c:1010
 |  v9fs_session_init+0xf9/0xd90 fs/9p/v9fs.c:410
 |  v9fs_mount+0x69/0x630 fs/9p/vfs_super.c:123
 |  legacy_get_tree+0x74/0xd0 fs/fs_context.c:611
 |  vfs_get_tree+0x51/0x190 fs/super.c:1519
 |  do_new_mount+0x203/0x660 fs/namespace.c:3335
 |  path_mount+0x496/0xb30 fs/namespace.c:3662
 |  do_mount fs/namespace.c:3675 [inline]
 |  __do_sys_mount fs/namespace.c:3884 [inline]
 |  [...]
 |
 | read-write to 0xffff888130fb3d48 of 4 bytes by task 15563 on cpu 1:
 |  p9_fd_open net/9p/trans_fd.c:842 [inline]
 |  p9_fd_create+0x210/0x250 net/9p/trans_fd.c:1092
 |  p9_client_create+0x595/0xa70 net/9p/client.c:1010
 |  v9fs_session_init+0xf9/0xd90 fs/9p/v9fs.c:410
 |  v9fs_mount+0x69/0x630 fs/9p/vfs_super.c:123
 |  legacy_get_tree+0x74/0xd0 fs/fs_context.c:611
 |  vfs_get_tree+0x51/0x190 fs/super.c:1519
 |  do_new_mount+0x203/0x660 fs/namespace.c:3335
 |  path_mount+0x496/0xb30 fs/namespace.c:3662
 |  do_mount fs/namespace.c:3675 [inline]
 |  __do_sys_mount fs/namespace.c:3884 [inline]
 |  [...]
 |
 | value changed: 0x00008002 -> 0x00008802

Within p9_fd_open(), O_NONBLOCK is added to f_flags of the read and
write files. This may happen concurrently if e.g. mounting process
modifies the fd in another thread.

Mark the plain read-modify-writes as intentional data-races, with the
assumption that the result of executing the accesses concurrently will
always result in the same result despite the accesses themselves not
being atomic.

Reported-by: syzbot+e441aeeb422763cc5511@syzkaller.appspotmail.com
Signed-off-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/ZO38mqkS0TYUlpFp@elver.google.com
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Message-ID: <20231025103445.1248103-1-asmadeus@codewreck.org>
2023-10-26 07:05:42 +09:00
Paolo Abeni 8005184fd1 mptcp: refactor sndbuf auto-tuning
The MPTCP protocol account for the data enqueued on all the subflows
to the main socket send buffer, while the send buffer auto-tuning
algorithm set the main socket send buffer size as the max size among
the subflows.

That causes bad performances when at least one subflow is sndbuf
limited, e.g. due to very high latency, as the MPTCP scheduler can't
even fill such buffer.

Change the send-buffer auto-tuning algorithm to compute the main socket
send buffer size as the sum of all the subflows buffer size.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-9-9dc60939d371@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 12:23:35 -07:00
Paolo Abeni 9fdc779331 mptcp: ignore notsent_lowat setting at the subflow level
Any latency related tuning taking action at the subflow level does
not really affect the user-space, as only the main MPTCP socket is
relevant.

Anyway any limiting setting may foul the MPTCP scheduler, not being
able to fully use the subflow-level cwin, leading to very poor b/w
usage.

Enforce notsent_lowat to be a no-op on every subflow.

Note that TCP_NOTSENT_LOWAT is currently not supported, and properly
dealing with that will require more invasive changes.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-8-9dc60939d371@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 12:23:34 -07:00
Paolo Abeni a1ab24e5fc mptcp: consolidate sockopt synchronization
Move the socket option synchronization for active subflows
at subflow creation time. This allows removing the now unused
unlocked variant of such helper.

While at that, clean-up a bit the mptcp_subflow_create_socket()
errors path.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-7-9dc60939d371@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 12:23:34 -07:00
Paolo Abeni 0ffe8e7490 mptcp: use copy_from_iter helpers on transmit
The perf traces show an high cost for the MPTCP transmit path memcpy.

It turn out that the helper currently in use carries quite a bit
of unneeded overhead, e.g. to map/unmap the memory pages.

Moving to the 'copy_from_iter' variant removes such overhead and
additionally gains the no-cache support.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-6-9dc60939d371@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 12:23:34 -07:00
Paolo Abeni 5684ab1a0e mptcp: give rcvlowat some love
The MPTCP protocol allow setting sk_rcvlowat, but the value there
is currently ignored.

Additionally, the default subflows sk_rcvlowat basically disables per
subflow delayed ack: the MPTCP protocol move the incoming data from the
subflows into the msk socket as soon as the TCP stacks invokes the subflow
data_ready callback. Later, when __tcp_ack_snd_check() takes action,
the subflow-level copied_seq matches rcv_nxt, and that mandate for an
immediate ack.

Let the mptcp receive path be aware of such threshold, explicitly tracking
the amount of data available to be ready and checking vs sk_rcvlowat in
mptcp_poll() and before waking-up readers.

Additionally implement the set_rcvlowat() callback, to properly handle
the rcvbuf auto-tuning on sk_rcvlowat changes.

Finally to properly handle delayed ack, force the subflow level threshold
to 0 and instead explicitly ask for an immediate ack when the msk level th
is not reached.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-5-9dc60939d371@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 12:23:34 -07:00
Paolo Abeni f1f26512a9 mptcp: use plain bool instead of custom binary enum
The 'data_avail' subflow field is already used as plain boolean,
drop the custom binary enum type and switch to bool.

No functional changed intended.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-3-9dc60939d371@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 12:23:34 -07:00
Paolo Abeni bf0e96108f mptcp: properly account fastopen data
Currently the socket level counter aggregating the received data
does not take in account the data received via fastopen.

Address the issue updating the counter as required.

Fixes: 38967f424b ("mptcp: track some aggregate data counters")
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-2-9dc60939d371@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 12:23:34 -07:00
Paolo Abeni d866ae9aaa mptcp: add a new sysctl for make after break timeout
The MPTCP protocol allows sockets with no alive subflows to stay
in ESTABLISHED status for and user-defined timeout, to allow for
later subflows creation.

Currently such timeout is constant - TCP_TIMEWAIT_LEN. Let the
user-space configure them via a newly added sysctl, to better cope
with busy servers and simplify (make them faster) the relevant
pktdrill tests.

Note that the new know does not apply to orphaned MPTCP socket
waiting for the data_fin handshake completion: they always wait
TCP_TIMEWAIT_LEN.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-1-9dc60939d371@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25 12:23:33 -07:00
Deming Wang 1711435e3e net: ipv6: fix typo in comments
The word "advertize" should be replaced by "advertise".

Signed-off-by: Deming Wang <wangdeming@inspur.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-25 10:38:07 +01:00
Deming Wang 197f9fba96 net: ipv4: fix typo in comments
The word "advertize" should be replaced by "advertise".

Signed-off-by: Deming Wang <wangdeming@inspur.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-25 10:38:07 +01:00
Vlad Buslov a63b662212 net/sched: act_ct: additional checks for outdated flows
Current nf_flow_is_outdated() implementation considers any flow table flow
which state diverged from its underlying CT connection status for teardown
which can be problematic in the following cases:

- Flow has never been offloaded to hardware in the first place either
because flow table has hardware offload disabled (flag
NF_FLOWTABLE_HW_OFFLOAD is not set) or because it is still pending on 'add'
workqueue to be offloaded for the first time. The former is incorrect, the
later generates excessive deletions and additions of flows.

- Flow is already pending to be updated on the workqueue. Tearing down such
flows will also generate excessive removals from the flow table, especially
on highly loaded system where the latency to re-offload a flow via 'add'
workqueue can be quite high.

When considering a flow for teardown as outdated verify that it is both
offloaded to hardware and doesn't have any pending updates.

Fixes: 41f2c7c342 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple")
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-25 11:35:57 +02:00
Pablo Neira Ayuso 735795f68b netfilter: flowtable: GC pushes back packets to classic path
Since 41f2c7c342 ("net/sched: act_ct: Fix promotion of offloaded
unreplied tuple"), flowtable GC pushes back flows with IPS_SEEN_REPLY
back to classic path in every run, ie. every second. This is because of
a new check for NF_FLOW_HW_ESTABLISHED which is specific of sched/act_ct.

In Netfilter's flowtable case, NF_FLOW_HW_ESTABLISHED never gets set on
and IPS_SEEN_REPLY is unreliable since users decide when to offload the
flow before, such bit might be set on at a later stage.

Fix it by adding a custom .gc handler that sched/act_ct can use to
deal with its NF_FLOW_HW_ESTABLISHED bit.

Fixes: 41f2c7c342 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple")
Reported-by: Vladimir Smelhaus <vl.sm@email.cz>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-25 11:35:46 +02:00
Florian Westphal 70f06c115b sched: act_ct: switch to per-action label counting
net->ct.labels_used was meant to convey 'number of ip/nftables rules
that need the label extension allocated'.

act_ct enables this for each net namespace, which voids all attempts
to avoid ct->ext allocation when possible.

Move this increment to the control plane to request label extension
space allocation only when its needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-25 10:24:04 +01:00
Jakub Kicinski 00d67093e4 Three more fixes:
- don't drop all unprotected public action frames since
    some don't have a protected dual
  - fix pointer confusion in scanning code
  - fix warning in some connections with multiple links
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmU3naIACgkQ10qiO8sP
 aADUNA//clFAsaH6A94tD6Hgyafi9idBpsHERkYE4RaGmiID34yOYDInvkoDmsy1
 WG7wEdNjnsYDBrX0eG1x7WSrQRLhs76U0HnBP9tFYIeygnLuul2/UNRFkK6EwQfn
 OJbVJdjQdL/c8p129DUr5JKhavbc4ovY2acECLRY54n1fAYlnn6u7SWsOyCu0zl7
 wXSQ5pzYHu5lFM5LSFj6mC7U8b/aFQ5r9XsNHwGz4YVvd5cEdLYc/y5bAK6xAIxz
 jJcJLV088QikAcYmgIS7MNQuKrMudNjCEDWtqM23N9pO//QjsbOag2q02JmfxWyv
 4YJy42G/0K0/wjwCpIZig2OOE5iKDKCJ+dvNBUaCnnTn1ARQSXSDgAYkmWReCrZu
 DUpvn8Be3fgULtaUC0QQ3R1oCTVJyKYTD55Ofcy3Pj1qt+1lmhgLp1qyezemJcfJ
 p2sv5GLwyPLcOUTjeqTgP57xoJl2JV9vUVey9xvk2dMl0vS5qpfIf3FR7R0+HtlZ
 UIrveQOLMsKAaamk59RaSpfg4vJoCqaabu97f/lHRc5WdaeURSlUw0rU9xdqc/P+
 GTX7ubKoMiMEx11v25JdTE3eniFGxu28cojqScryvFo6WIlkYp4cbNxtRb4i9rOX
 ZJXQWCE6YJZ90VlR/a8lpJnTXjntQT5vBtH7vhqAneN2TJN74h8=
 =AvmQ
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2023-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Three more fixes:
 - don't drop all unprotected public action frames since
   some don't have a protected dual
 - fix pointer confusion in scanning code
 - fix warning in some connections with multiple links

* tag 'wireless-2023-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: mac80211: don't drop all unprotected public action frames
  wifi: cfg80211: fix assoc response warning on failed links
  wifi: cfg80211: pass correct pointer to rdev_inform_bss()
====================

Link: https://lore.kernel.org/r/20231024103540.19198-2-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24 13:10:53 -07:00
Florian Fainelli 87cd83714f net: dsa: Rename IFLA_DSA_MASTER to IFLA_DSA_CONDUIT
This preserves the existing IFLA_DSA_MASTER which is part of the uAPI
and creates an alias named IFLA_DSA_CONDUIT.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/20231023181729.1191071-3-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24 13:08:14 -07:00
Florian Fainelli 6ca80638b9 net: dsa: Use conduit and user terms
Use more inclusive terms throughout the DSA subsystem by moving away
from "master" which is replaced by "conduit" and "slave" which is
replaced by "user". No functional changes.

Acked-by: Rob Herring <robh@kernel.org>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/20231023181729.1191071-2-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24 13:08:14 -07:00
Jakub Kicinski ce4cfa2318 net: remove else after return in dev_prep_valid_name()
Remove unnecessary else clauses after return.
I copied this if / else construct from somewhere,
it makes the code harder to read.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231023152346.3639749-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24 13:02:59 -07:00
Jakub Kicinski 70e1b14c1b net: remove dev_valid_name() check from __dev_alloc_name()
__dev_alloc_name() is only called by dev_prep_valid_name(),
which already checks that name is valid.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231023152346.3639749-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24 13:02:58 -07:00
Jakub Kicinski 7ad17b04dc net: trust the bitmap in __dev_alloc_name()
Prior to restructuring __dev_alloc_name() handled both printf
and non-printf names. In a clever attempt at code reuse it
always prints the name into a buffer and checks if it's
a duplicate.

Trust the bitmap, and return an error if its full.

This shrinks the possible ID space by one from 32K to 32K - 1,
as previously the max value would have been tried as a valid ID.
It seems very unlikely that anyone would care as we heard
no requests to increase the max beyond 32k.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231023152346.3639749-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24 13:02:58 -07:00
Jakub Kicinski 9a81046812 net: reduce indentation of __dev_alloc_name()
All callers of __dev_valid_name() go thru dev_prep_valid_name()
which handles the non-printf case. Focus __dev_alloc_name() on
the sprintf case, remove the indentation level.

Minor functional change of returning -EINVAL if % is not found,
which should now never happen.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231023152346.3639749-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24 13:02:58 -07:00
Jakub Kicinski 556c755a4d net: make dev_alloc_name() call dev_prep_valid_name()
__dev_alloc_name() handles both the sprintf and non-sprintf
target names. This complicates the code.

dev_prep_valid_name() already handles the non-sprintf case,
before calling __dev_alloc_name(), make the only other caller
also go thru dev_prep_valid_name(). This way we can drop
the non-sprintf handling in __dev_alloc_name() in one of
the next changes.

commit 55a5ec9b77 ("Revert "net: core: dev_get_valid_name is now the same as dev_alloc_name_ns"") and
commit 029b6d1405 ("Revert "net: core: maybe return -EEXIST in __dev_alloc_name"")
tell us that we can't start returning -EEXIST from dev_alloc_name()
on name duplicates. Bite the bullet and pass the expected errno to
dev_prep_valid_name().

dev_prep_valid_name() must now propagate out the allocated id
for printf names.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231023152346.3639749-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24 13:02:58 -07:00
Jakub Kicinski bd07063dd1 net: don't use input buffer of __dev_alloc_name() as a scratch space
Callers of __dev_alloc_name() want to pass dev->name as
the output buffer. Make __dev_alloc_name() not clobber
that buffer on failure, and remove the workarounds
in callers.

dev_alloc_name_ns() is now completely unnecessary.

The extra strscpy() added here will be gone by the end
of the patch series.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231023152346.3639749-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24 13:02:58 -07:00
Davide Caratti aab4d85649 net: mptcp: use policy generated by YAML spec
generated with:

 $ ./tools/net/ynl/ynl-gen-c.py --mode kernel \
 > --spec Documentation/netlink/specs/mptcp.yaml --source \
 > -o net/mptcp/mptcp_pm_gen.c
 $ ./tools/net/ynl/ynl-gen-c.py --mode kernel \
 > --spec Documentation/netlink/specs/mptcp.yaml --header \
 > -o net/mptcp/mptcp_pm_gen.h

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/340
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-7-16b1f701f900@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24 13:00:32 -07:00
Davide Caratti 1e07938e29 net: mptcp: rename netlink handlers to mptcp_pm_nl_<blah>_{doit,dumpit}
so that they will match names generated from YAML spec.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-6-16b1f701f900@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24 13:00:32 -07:00
Davide Caratti 1d0507f468 net: mptcp: convert netlink from small_ops to ops
in the current MPTCP control plane, all operations use a netlink
attribute of the same type "MPTCP_PM_ATTR". However, add/del/get/flush
operations only parse the first element in the message _ the one that
describes MPTCP endpoints (that was named MPTCP_PM_ATTR_ADDR and
mostly used in ADD_ADDR operations _ probably the similarity of "attr",
"addr" and "add" might cause some confusion to human readers).
Convert MPTCP from 'small_ops' to 'ops', thus allowing different attributes
for each single operation, hopefully makes all this clearer to human
readers.

- use a separate attribute set for add/del/get/flush address operation,
  binary compatible with the existing one, to store the endpoint address.
  MPTCP_PM_ENDPOINT_ADDR is added to the uAPI (with the same value as
  MPTCP_PM_ATTR_ADDR) for these operations.
- convert mptcp_pm_ops[] and add policy files accordingly.

this prepares MPTCP control plane to be described as YAML spec.

Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-3-16b1f701f900@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24 13:00:31 -07:00
Phil Sutter 9cdee06347 netfilter: nf_tables: Carry reset boolean in nft_set_dump_ctx
Relieve the dump callback from having to check nlmsg_type upon each
call. Prep work for set element reset locking.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24 15:48:30 +02:00
Pablo Neira Ayuso 078996fcd6 netfilter: nf_tables: set->ops->insert returns opaque set element in case of EEXIST
Return struct nft_elem_priv instead of struct nft_set_ext for
consistency with ("netfilter: nf_tables: expose opaque set element as
struct nft_elem_priv") and to prepare the introduction of element
timeout updates from control path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24 13:37:46 +02:00
Pablo Neira Ayuso 0e1ea651c9 netfilter: nf_tables: shrink memory consumption of set elements
Instead of copying struct nft_set_elem into struct nft_trans_elem, store
the pointer to the opaque set element object in the transaction. Adapt
set backend API (and set backend implementations) to take the pointer to
opaque set element representation whenever required.

This patch deconstifies .remove() and .activate() set backend API since
these modify the set element opaque object. And it also constify
nft_set_elem_ext() this provides access to the nft_set_ext struct
without updating the object.

According to pahole on x86_64, this patch shrinks struct nft_trans_elem
size from 216 to 24 bytes.

This patch also reduces stack memory consumption by removing the
template struct nft_set_elem object, using the opaque set element object
instead such as from the set iterator API, catchall elements and the get
element command.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24 13:37:42 +02:00
Pablo Neira Ayuso 9dad402b89 netfilter: nf_tables: expose opaque set element as struct nft_elem_priv
Add placeholder structure and place it at the beginning of each struct
nft_*_elem for each existing set backend, instead of exposing elements
as void type to the frontend which defeats compiler type checks. Use
this pointer to this new type to replace void *.

This patch updates the following set backend API to use this new struct
nft_elem_priv placeholder structure:

- update
- deactivate
- flush
- get

as well as the following helper functions:

- nft_set_elem_ext()
- nft_set_elem_init()
- nft_set_elem_destroy()
- nf_tables_set_elem_destroy()

This patch adds nft_elem_priv_cast() to cast struct nft_elem_priv to
native element representation from the corresponding set backend.
BUILD_BUG_ON() makes sure this .priv placeholder is always at the top
of the opaque set element representation.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24 13:16:30 +02:00
Pablo Neira Ayuso 6509a2e410 netfilter: nf_tables: set backend .flush always succeeds
.flush is always successful since this results from iterating over the
set elements to toggle mark the element as inactive in the next
generation.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24 13:16:30 +02:00
Pablo Neira Ayuso 26cec9d414 netfilter: nft_set_pipapo: no need to call pipapo_deactivate() from flush
Use the element object that is already offered instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24 13:16:30 +02:00
Phil Sutter a552339063 netfilter: nf_tables: Carry reset boolean in nft_obj_dump_ctx
Relieve the dump callback from having to inspect nlmsg_type upon each
call, just do it once at start of the dump.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24 13:16:30 +02:00
Phil Sutter 5a893b9cdf netfilter: nf_tables: nft_obj_filter fits into cb->ctx
No need to allocate it if one may just use struct netlink_callback's
scratch area for it.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24 13:16:30 +02:00
Phil Sutter 2eda95cfa2 netfilter: nf_tables: Carry s_idx in nft_obj_dump_ctx
Prep work for moving the context into struct netlink_callback scratch
area.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24 13:16:30 +02:00
Phil Sutter ecf49cad80 netfilter: nf_tables: A better name for nft_obj_filter
Name it for what it is supposed to become, a real nft_obj_dump_ctx. No
functional change intended.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24 13:16:30 +02:00
Phil Sutter 4279cc60b3 netfilter: nf_tables: Unconditionally allocate nft_obj_filter
Prep work for moving the filter into struct netlink_callback's scratch
area.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24 13:16:30 +02:00
Phil Sutter ff16111cc1 netfilter: nf_tables: Drop pointless memset in nf_tables_dump_obj
The code does not make use of cb->args fields past the first one, no
need to zero them.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24 13:16:30 +02:00