mirror of https://github.com/torvalds/linux.git
911 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
9303c3ced1 |
net: move sk->sk_err_soft and sk->sk_sndbuf
sk->sk_sndbuf is read-mostly in tx path, so move it from sock_write_tx group to more appropriate sock_read_tx. sk->sk_err_soft was not identified previously, but is used from tcp_ack(). Move it to sock_write_tx group for better cache locality. Also change tcp_ack() to clear sk->sk_err_soft only if needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250919204856.2977245-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
17b14d235f |
net: move sk_uid and sk_protocol to sock_read_tx
sk_uid and sk_protocol are read from inet6_csk_route_socket() for each TCP transmit. Also read from udpv6_sendmsg(), udp_sendmsg() and others. Move them to sock_read_tx for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250919204856.2977245-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
9db27c8062 |
udp: add udp_drops_inc() helper
Generic sk_drops_inc() reads sk->sk_drop_counters. We know the precise location for UDP sockets. Move sk_drop_counters out of sock_read_rxtx so that sock_write_rxtx starts at a cache line boundary. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-9-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
16c610162d |
net: call cond_resched() less often in __release_sock()
While stress testing TCP I had unexpected retransmits and sack packets
when a single cpu receives data from multiple high-throughput flows.
super_netperf 4 -H srv -T,10 -l 3000 &
Tcpdump extract:
00:00:00.000007 IP6 clnt > srv: Flags [.], seq 26062848:26124288, ack 1, win 66, options [nop,nop,TS val 651460834 ecr 3100749131], length 61440
00:00:00.000006 IP6 clnt > srv: Flags [.], seq 26124288:26185728, ack 1, win 66, options [nop,nop,TS val 651460834 ecr 3100749131], length 61440
00:00:00.000005 IP6 clnt > srv: Flags [P.], seq 26185728:26243072, ack 1, win 66, options [nop,nop,TS val 651460834 ecr 3100749131], length 57344
00:00:00.000006 IP6 clnt > srv: Flags [.], seq 26243072:26304512, ack 1, win 66, options [nop,nop,TS val 651460844 ecr 3100749141], length 61440
00:00:00.000005 IP6 clnt > srv: Flags [.], seq 26304512:26365952, ack 1, win 66, options [nop,nop,TS val 651460844 ecr 3100749141], length 61440
00:00:00.000007 IP6 clnt > srv: Flags [P.], seq 26365952:26423296, ack 1, win 66, options [nop,nop,TS val 651460844 ecr 3100749141], length 57344
00:00:00.000006 IP6 clnt > srv: Flags [.], seq 26423296:26484736, ack 1, win 66, options [nop,nop,TS val 651460853 ecr 3100749150], length 61440
00:00:00.000005 IP6 clnt > srv: Flags [.], seq 26484736:26546176, ack 1, win 66, options [nop,nop,TS val 651460853 ecr 3100749150], length 61440
00:00:00.000005 IP6 clnt > srv: Flags [P.], seq 26546176:26603520, ack 1, win 66, options [nop,nop,TS val 651460853 ecr 3100749150], length 57344
00:00:00.003932 IP6 clnt > srv: Flags [P.], seq 26603520:26619904, ack 1, win 66, options [nop,nop,TS val 651464844 ecr 3100753141], length 16384
00:00:00.006602 IP6 clnt > srv: Flags [.], seq 24862720:24866816, ack 1, win 66, options [nop,nop,TS val 651471419 ecr 3100759716], length 4096
00:00:00.013000 IP6 clnt > srv: Flags [.], seq 24862720:24866816, ack 1, win 66, options [nop,nop,TS val 651484421 ecr 3100772718], length 4096
00:00:00.000416 IP6 srv > clnt: Flags [.], ack 26619904, win 1393, options [nop,nop,TS val 3100773185 ecr 651484421,nop,nop,sack 1 {24862720:24866816}], length 0
After analysis, it appears this is because of the cond_resched()
call from __release_sock().
When current thread is yielding, while still holding the TCP socket lock,
it might regain the cpu after a very long time.
Other peer TLP/RTO is firing (multiple times) and packets are retransmit,
while the initial copy is waiting in the socket backlog or receive queue.
In this patch, I call cond_resched() only once every 16 packets.
Modern TCP stack now spends less time per packet in the backlog,
especially because ACK are no longer sent (commit
|
|
|
|
5ef04a7b06 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc5). No conflicts. Adjacent changes: include/net/sock.h |
|
|
|
5d6b58c932 |
net: lockless sock_i_ino()
Followup of commit |
|
|
|
99a2ace61b |
net: use dst_dev_rcu() in sk_setup_caps()
Use RCU to protect accesses to dst->dev from sk_setup_caps()
and sk_dst_gso_max_size().
Also use dst_dev_rcu() in ip6_dst_mtu_maybe_forward(),
and ip_dst_mtu_maybe_forward().
ip4_dst_hoplimit() can use dst_dev_net_rcu().
Fixes:
|
|
|
|
c51613fa27 |
net: add sk->sk_drop_counters
Some sockets suffer from heavy false sharing on sk->sk_drops, and fields in the same cache line. Add sk->sk_drop_counters to: - move the drop counter(s) to dedicated cache lines. - Add basic NUMA awareness to these drop counter(s). Following patches will use this infrastructure for UDP and RAW sockets. sk_clone_lock() is not yet ready, it would need to properly set newsk->sk_drop_counters if we plan to use this for TCP sockets. v2: used Paolo suggestion from https://lore.kernel.org/netdev/8f09830a-d83d-43c9-b36b-88ba0a23e9b2@redhat.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-4-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
f86f42ed2c |
net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers
We want to split sk->sk_drops in the future to reduce potential contention on this field. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-2-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
a6d4f25888 |
net: set net.core.rmem_max and net.core.wmem_max to 4 MB
SO_RCVBUF and SO_SNDBUF have limited range today, unless distros or system admins change rmem_max and wmem_max. Even iproute2 uses 1 MB SO_RCVBUF which is capped by the kernel. Decouple [rw]mem_max and [rw]mem_default and increase [rw]mem_max to 4 MB. Before: $ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max net.core.rmem_default = 212992 net.core.rmem_max = 212992 net.core.wmem_default = 212992 net.core.wmem_max = 212992 After: $ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max net.core.rmem_default = 212992 net.core.rmem_max = 4194304 net.core.wmem_default = 212992 net.core.wmem_max = 4194304 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250819174030.1986278-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
bf64002c94 |
net: Define sk_memcg under CONFIG_MEMCG.
Except for sk_clone_lock(), all accesses to sk->sk_memcg is done under CONFIG_MEMCG. As a bonus, let's define sk->sk_memcg under CONFIG_MEMCG. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://patch.msgid.link/20250815201712.1745332-11-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
bb178c6bc0 |
net-memcg: Pass struct sock to mem_cgroup_sk_(un)?charge().
We will store a flag in the lowest bit of sk->sk_memcg. Then, we cannot pass the raw pointer to mem_cgroup_charge_skmem() and mem_cgroup_uncharge_skmem(). Let's pass struct sock to the functions. While at it, they are renamed to match other functions starting with mem_cgroup_sk_. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://patch.msgid.link/20250815201712.1745332-9-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
43049b0db0 |
net-memcg: Introduce mem_cgroup_sk_enabled().
The socket memcg feature is enabled by a static key and only works for non-root cgroup. We check both conditions in many places. Let's factorise it as a helper function. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://patch.msgid.link/20250815201712.1745332-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
bd4aa23373 |
net: Clean up __sk_mem_raise_allocated().
In __sk_mem_raise_allocated(), charged is initialised as true due
to the weird condition removed in the previous patch.
It makes the variable unreliable by itself, so we have to check
another variable, memcg, in advance.
Also, we will factorise the common check below for memcg later.
if (mem_cgroup_sockets_enabled && sk->sk_memcg)
As a prep, let's initialise charged as false and memcg as NULL.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://patch.msgid.link/20250815201712.1745332-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
|
|
9d85c565a7 |
net: Call trace_sock_exceed_buf_limit() for memcg failure with SK_MEM_RECV.
Initially, trace_sock_exceed_buf_limit() was invoked when
__sk_mem_raise_allocated() failed due to the memcg limit or the
global limit.
However, commit
|
|
|
|
a6f190630d |
net: track pfmemalloc drops via SKB_DROP_REASON_PFMEMALLOC
Add a new SKB drop reason (SKB_DROP_REASON_PFMEMALLOC) to track packets dropped due to memory pressure. In production environments, we've observed memory exhaustion reported by memory layer stack traces, but these drops were not properly tracked in the SKB drop reason infrastructure. While most network code paths now properly report pfmemalloc drops, some protocol-specific socket implementations still use sk_filter() without drop reason tracking: - Bluetooth L2CAP sockets - CAIF sockets - IUCV sockets - Netlink sockets - SCTP sockets - Unix domain sockets These remaining cases represent less common paths and could be converted in a follow-up patch if needed. The current implementation provides significantly improved observability into memory pressure events in the network stack, especially for key protocols like TCP and UDP, helping to diagnose problems in production environments. Reported-by: Matt Fleming <mfleming@cloudflare.com> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/175268316579.2407873.11634752355644843509.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
88fe14253e |
net: dst: add four helpers to annotate data-races around dst->dev
dst->dev is read locklessly in many contexts,
and written in dst_dev_put().
Fixing all the races is going to need many changes.
We probably will have to add full RCU protection.
Add three helpers to ease this painful process.
static inline struct net_device *dst_dev(const struct dst_entry *dst)
{
return READ_ONCE(dst->dev);
}
static inline struct net_device *skb_dst_dev(const struct sk_buff *skb)
{
return dst_dev(skb_dst(skb));
}
static inline struct net *skb_dst_dev_net(const struct sk_buff *skb)
{
return dev_net(skb_dst_dev(skb));
}
static inline struct net *skb_dst_dev_net_rcu(const struct sk_buff *skb)
{
return dev_net_rcu(skb_dst_dev(skb));
}
Fixes:
|
|
|
|
8a402bbe54 |
net: dst: annotate data-races around dst->obsolete
(dst_entry)->obsolete is read locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
935b67675a |
net: make sk->sk_rcvtimeo lockless
Followup of commit
|
|
|
|
3169e36ae1 |
net: make sk->sk_sndtimeo lockless
Followup of commit
|
|
|
|
c51da3f7a1 |
net: remove sock_i_uid()
Difference between sock_i_uid() and sk_uid() is that after sock_orphan(), sock_i_uid() returns GLOBAL_ROOT_UID while sk_uid() returns the last cached sk->sk_uid value. None of sock_i_uid() callers care about this. Use sk_uid() which is much faster and inlined. Note that diag/dump users are calling sock_i_ino() and can not see the full benefit yet. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Link: https://patch.msgid.link/20250620133001.4090592-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
561939ed44 |
net: remove unused sock_enable_timestamps
This function was introduced in commit |
|
|
|
8542d6fac2 |
Fix sock_exceed_buf_limit not being triggered in __sk_mem_raise_allocated
When a process under memory pressure is not part of any cgroup and the charged flag is false, trace_sock_exceed_buf_limit was not called as expected. This regression was introduced by commit |
|
|
|
1b98f357da |
Networking changes for 6.16.
Core
----
- Implement the Device Memory TCP transmit path, allowing zero-copy
data transmission on top of TCP from e.g. GPU memory to the wire.
- Move all the IPv6 routing tables management outside the RTNL scope,
under its own lock and RCU. The route control path is now 3x times
faster.
- Convert queue related netlink ops to instance lock, reducing
again the scope of the RTNL lock. This improves the control plane
scalability.
- Refactor the software crc32c implementation, removing unneeded
abstraction layers and improving significantly the related
micro-benchmarks.
- Optimize the GRO engine for UDP-tunneled traffic, for a 10%
performance improvement in related stream tests.
- Cover more per-CPU storage with local nested BH locking; this is a
prep work to remove the current per-CPU lock in local_bh_disable()
on PREMPT_RT.
- Introduce and use nlmsg_payload helper, combining buffer bounds
verification with accessing payload carried by netlink messages.
Netfilter
---------
- Rewrite the procfs conntrack table implementation, improving
considerably the dump performance. A lot of user-space tools
still use this interface.
- Implement support for wildcard netdevice in netdev basechain
and flowtables.
- Integrate conntrack information into nft trace infrastructure.
- Export set count and backend name to userspace, for better
introspection.
BPF
---
- BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
programs and can be controlled in similar way to traditional qdiscs
using the "tc qdisc" command.
- Refactor the UDP socket iterator, addressing long standing issues
WRT duplicate hits or missed sockets.
Protocols
---------
- Improve TCP receive buffer auto-tuning and increase the default
upper bound for the receive buffer; overall this improves the single
flow maximum thoughput on 200Gbs link by over 60%.
- Add AFS GSSAPI security class to AF_RXRPC; it provides transport
security for connections to the AFS fileserver and VL server.
- Improve TCP multipath routing, so that the sources address always
matches the nexthop device.
- Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
and thus preventing DoS caused by passing around problematic FDs.
- Retire DCCP socket. DCCP only receives updates for bugs, and major
distros disable it by default. Its removal allows for better
organisation of TCP fields to reduce the number of cache lines hit
in the fast path.
- Extend TCP drop-reason support to cover PAWS checks.
Driver API
----------
- Reorganize PTP ioctl flag support to require an explicit opt-in for
the drivers, avoiding the problem of drivers not rejecting new
unsupported flags.
- Converted several device drivers to timestamping APIs.
- Introduce per-PHY ethtool dump helpers, improving the support for
dump operations targeting PHYs.
Tests and tooling
-----------------
- Add support for classic netlink in user space C codegen, so that
ynl-c can now read, create and modify links, routes addresses and
qdisc layer configuration.
- Add ynl sub-types for binary attributes, allowing ynl-c to output
known struct instead of raw binary data, clarifying the classic
netlink output.
- Extend MPTCP selftests to improve the code-coverage.
- Add tests for XDP tail adjustment in AF_XDP.
New hardware / drivers
----------------------
- OpenVPN virtual driver: offload OpenVPN data channels processing
to the kernel-space, increasing the data transfer throughput WRT
the user-space implementation.
- Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
- Broadcom asp-v3.0 ethernet driver.
- AMD Renoir ethernet device.
- ReakTek MT9888 2.5G ethernet PHY driver.
- Aeonsemi 10G C45 PHYs driver.
Drivers
-------
- Ethernet high-speed NICs:
- nVidia/Mellanox (mlx5):
- refactor the stearing table handling to reduce significantly
the amount of memory used
- add support for complex matches in H/W flow steering
- improve flow streeing error handling
- convert to netdev instance locking
- Intel (100G, ice, igb, ixgbe, idpf):
- ice: add switchdev support for LLDP traffic over VF
- ixgbe: add firmware manipulation and regions devlink support
- igb: introduce support for frame transmission premption
- igb: adds persistent NAPI configuration
- idpf: introduce RDMA support
- idpf: add initial PTP support
- Meta (fbnic):
- extend hardware stats coverage
- add devlink dev flash support
- Broadcom (bnxt):
- add support for RX-side device memory TCP
- Wangxun (txgbe):
- implement support for udp tunnel offload
- complete PTP and SRIOV support for AML 25G/10G devices
- Ethernet NICs embedded and virtual:
- Google (gve):
- add device memory TCP TX support
- Amazon (ena):
- support persistent per-NAPI config
- Airoha:
- add H/W support for L2 traffic offload
- add per flow stats for flow offloading
- RealTek (rtl8211): add support for WoL magic packet
- Synopsys (stmmac):
- dwmac-socfpga 1000BaseX support
- add Loongson-2K3000 support
- introduce support for hardware-accelerated VLAN stripping
- Broadcom (bcmgenet):
- expose more H/W stats
- Freescale (enetc, dpaa2-eth):
- enetc: add MAC filter, VLAN filter RSS and loopback support
- dpaa2-eth: convert to H/W timestamping APIs
- vxlan: convert FDB table to rhashtable, for better scalabilty
- veth: apply qdisc backpressure on full ring to reduce TX drops
- Ethernet switches:
- Microchip (kzZ88x3): add ETS scheduler support
- Ethernet PHYs:
- RealTek (rtl8211):
- add support for WoL magic packet
- add support for PHY LEDs
- CAN:
- Adds RZ/G3E CANFD support to the rcar_canfd driver.
- Preparatory work for CAN-XL support.
- Add self-tests framework with support for CAN physical interfaces.
- WiFi:
- mac80211:
- scan improvements with multi-link operation (MLO)
- Qualcomm (ath12k):
- enable AHB support for IPQ5332
- add monitor interface support to QCN9274
- add multi-link operation support to WCN7850
- add 802.11d scan offload support to WCN7850
- monitor mode for WCN7850, better 6 GHz regulatory
- Qualcomm (ath11k):
- restore hibernation support
- MediaTek (mt76):
- WiFi-7 improvements
- implement support for mt7990
- Intel (iwlwifi):
- enhanced multi-link single-radio (EMLSR) support on 5 GHz links
- rework device configuration
- RealTek (rtw88):
- improve throughput for RTL8814AU
- RealTek (rtw89):
- add multi-link operation support
- STA/P2P concurrency improvements
- support different SAR configs by antenna
- Bluetooth:
- introduce HCI Driver protocol
- btintel_pcie: do not generate coredump for diagnostic events
- btusb: add HCI Drv commands for configuring altsetting
- btusb: add RTL8851BE device 0x0bda:0xb850
- btusb: add new VID/PID 13d3/3584 for MT7922
- btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
- btnxpuart: implement host-wakeup feature
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmg3D64SHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkcIsQAK2eEc+BxQer975wzvtMg6gF9eoex4a+
rZ7jxfDzDtNvTauoQsrpehDZp0FnySaVGCU36lHGB2OvDnhCpPc5hXzKDWQpOuqQ
SHrGG3/6FTbdTG/HfHUcbNyrUzIf53SADSObiQ3qg4gyEQ3sCpcOKtVtMcU8rvsY
/HqMnsJWFaROUMjMtCcnUSgjmeY9kBvha3sTXUqgeRugEOCvZD7z4rpqFIcQqHw7
e2Fi8dwIXEYNxqPp6MRq2qdyUTewCRruE8ZIMAFuhtfYeMElUZMPlqlMENX3AzTQ
cr0EgwcFOUxRA7oZRxhoBNBsVXavtSpQr4ZDoWplxP4aQ37n5tc1E9Q72axpB/Og
FbJRl6GvWYnCd8071BczgmfHlKaTAigPvt2Z4r6JjM5I/Bij/IZ3k+On1OTuOAj/
EqfFkdZ0a5cfKrwUMP+oSGtSAywkMVUtnIKJlZeRbjSj2432sCfe2jVAlS8ELM43
3LUgXYrAKtA87g171LlsRu5EEpI5QmqPb+i5LpPlEXe2TJEgPisyfecJ3NafF/2+
j575lm+TFNm9NTNhGGjDPEvw0djI5wSGGMe9J4gC74eWi6s5t6C4cuUf84TKWdwR
x+9H0IB7rfFncAwXHJuUUtzd+fPHaYzs5dDGbSgMQOXr1cr1wlubCK8mQ1r/Wt/a
3GjFIOQKW2Q5
=t/Tz
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Implement the Device Memory TCP transmit path, allowing zero-copy
data transmission on top of TCP from e.g. GPU memory to the wire.
- Move all the IPv6 routing tables management outside the RTNL scope,
under its own lock and RCU. The route control path is now 3x times
faster.
- Convert queue related netlink ops to instance lock, reducing again
the scope of the RTNL lock. This improves the control plane
scalability.
- Refactor the software crc32c implementation, removing unneeded
abstraction layers and improving significantly the related
micro-benchmarks.
- Optimize the GRO engine for UDP-tunneled traffic, for a 10%
performance improvement in related stream tests.
- Cover more per-CPU storage with local nested BH locking; this is a
prep work to remove the current per-CPU lock in local_bh_disable()
on PREMPT_RT.
- Introduce and use nlmsg_payload helper, combining buffer bounds
verification with accessing payload carried by netlink messages.
Netfilter:
- Rewrite the procfs conntrack table implementation, improving
considerably the dump performance. A lot of user-space tools still
use this interface.
- Implement support for wildcard netdevice in netdev basechain and
flowtables.
- Integrate conntrack information into nft trace infrastructure.
- Export set count and backend name to userspace, for better
introspection.
BPF:
- BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
programs and can be controlled in similar way to traditional qdiscs
using the "tc qdisc" command.
- Refactor the UDP socket iterator, addressing long standing issues
WRT duplicate hits or missed sockets.
Protocols:
- Improve TCP receive buffer auto-tuning and increase the default
upper bound for the receive buffer; overall this improves the
single flow maximum thoughput on 200Gbs link by over 60%.
- Add AFS GSSAPI security class to AF_RXRPC; it provides transport
security for connections to the AFS fileserver and VL server.
- Improve TCP multipath routing, so that the sources address always
matches the nexthop device.
- Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
and thus preventing DoS caused by passing around problematic FDs.
- Retire DCCP socket. DCCP only receives updates for bugs, and major
distros disable it by default. Its removal allows for better
organisation of TCP fields to reduce the number of cache lines hit
in the fast path.
- Extend TCP drop-reason support to cover PAWS checks.
Driver API:
- Reorganize PTP ioctl flag support to require an explicit opt-in for
the drivers, avoiding the problem of drivers not rejecting new
unsupported flags.
- Converted several device drivers to timestamping APIs.
- Introduce per-PHY ethtool dump helpers, improving the support for
dump operations targeting PHYs.
Tests and tooling:
- Add support for classic netlink in user space C codegen, so that
ynl-c can now read, create and modify links, routes addresses and
qdisc layer configuration.
- Add ynl sub-types for binary attributes, allowing ynl-c to output
known struct instead of raw binary data, clarifying the classic
netlink output.
- Extend MPTCP selftests to improve the code-coverage.
- Add tests for XDP tail adjustment in AF_XDP.
New hardware / drivers:
- OpenVPN virtual driver: offload OpenVPN data channels processing to
the kernel-space, increasing the data transfer throughput WRT the
user-space implementation.
- Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
- Broadcom asp-v3.0 ethernet driver.
- AMD Renoir ethernet device.
- ReakTek MT9888 2.5G ethernet PHY driver.
- Aeonsemi 10G C45 PHYs driver.
Drivers:
- Ethernet high-speed NICs:
- nVidia/Mellanox (mlx5):
- refactor the steering table handling to significantly
reduce the amount of memory used
- add support for complex matches in H/W flow steering
- improve flow streeing error handling
- convert to netdev instance locking
- Intel (100G, ice, igb, ixgbe, idpf):
- ice: add switchdev support for LLDP traffic over VF
- ixgbe: add firmware manipulation and regions devlink support
- igb: introduce support for frame transmission premption
- igb: adds persistent NAPI configuration
- idpf: introduce RDMA support
- idpf: add initial PTP support
- Meta (fbnic):
- extend hardware stats coverage
- add devlink dev flash support
- Broadcom (bnxt):
- add support for RX-side device memory TCP
- Wangxun (txgbe):
- implement support for udp tunnel offload
- complete PTP and SRIOV support for AML 25G/10G devices
- Ethernet NICs embedded and virtual:
- Google (gve):
- add device memory TCP TX support
- Amazon (ena):
- support persistent per-NAPI config
- Airoha:
- add H/W support for L2 traffic offload
- add per flow stats for flow offloading
- RealTek (rtl8211): add support for WoL magic packet
- Synopsys (stmmac):
- dwmac-socfpga 1000BaseX support
- add Loongson-2K3000 support
- introduce support for hardware-accelerated VLAN stripping
- Broadcom (bcmgenet):
- expose more H/W stats
- Freescale (enetc, dpaa2-eth):
- enetc: add MAC filter, VLAN filter RSS and loopback support
- dpaa2-eth: convert to H/W timestamping APIs
- vxlan: convert FDB table to rhashtable, for better scalabilty
- veth: apply qdisc backpressure on full ring to reduce TX drops
- Ethernet switches:
- Microchip (kzZ88x3): add ETS scheduler support
- Ethernet PHYs:
- RealTek (rtl8211):
- add support for WoL magic packet
- add support for PHY LEDs
- CAN:
- Adds RZ/G3E CANFD support to the rcar_canfd driver.
- Preparatory work for CAN-XL support.
- Add self-tests framework with support for CAN physical interfaces.
- WiFi:
- mac80211:
- scan improvements with multi-link operation (MLO)
- Qualcomm (ath12k):
- enable AHB support for IPQ5332
- add monitor interface support to QCN9274
- add multi-link operation support to WCN7850
- add 802.11d scan offload support to WCN7850
- monitor mode for WCN7850, better 6 GHz regulatory
- Qualcomm (ath11k):
- restore hibernation support
- MediaTek (mt76):
- WiFi-7 improvements
- implement support for mt7990
- Intel (iwlwifi):
- enhanced multi-link single-radio (EMLSR) support on 5 GHz links
- rework device configuration
- RealTek (rtw88):
- improve throughput for RTL8814AU
- RealTek (rtw89):
- add multi-link operation support
- STA/P2P concurrency improvements
- support different SAR configs by antenna
- Bluetooth:
- introduce HCI Driver protocol
- btintel_pcie: do not generate coredump for diagnostic events
- btusb: add HCI Drv commands for configuring altsetting
- btusb: add RTL8851BE device 0x0bda:0xb850
- btusb: add new VID/PID 13d3/3584 for MT7922
- btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
- btnxpuart: implement host-wakeup feature"
* tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits)
selftests/bpf: Fix bpf selftest build warning
selftests: netfilter: Fix skip of wildcard interface test
net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames
net: openvswitch: Fix the dead loop of MPLS parse
calipso: Don't call calipso functions for AF_INET sk.
selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem
net_sched: hfsc: Address reentrant enqueue adding class to eltree twice
octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback
octeontx2-pf: QOS: Perform cache sync on send queue teardown
net: mana: Add support for Multi Vports on Bare metal
net: devmem: ncdevmem: remove unused variable
net: devmem: ksft: upgrade rx test to send 1K data
net: devmem: ksft: add 5 tuple FS support
net: devmem: ksft: add exit_wait to make rx test pass
net: devmem: ksft: add ipv4 support
net: devmem: preserve sockc_err
page_pool: fix ugly page_pool formatting
net: devmem: move list_add to net_devmem_bind_dmabuf.
selftests: netfilter: nft_queue.sh: include file transfer duration in log message
net: phy: mscc: Fix memory leak when using one step timestamping
...
|
|
|
|
7d7a103d29 |
vfs-6.16-rc1.pidfs
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
ov4zAP4yfqKBAz6eMt9CzDgHCdVQJ9Nuur1EiRdot3maPzHTcQEA2hVkJrvVo1Y/
jCVAf7gmGX1Uu6nCUF6Vjy35g8i20gA=
=nzYS
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:
"Features:
- Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
socket option
SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for
the process that called connect() or listen(). This is heavily used
to safely authenticate clients in userspace avoiding security bugs
due to pid recycling races (dbus, polkit, systemd, etc.)
SO_PEERPIDFD currently doesn't support handing out pidfds if the
sk->sk_peer_pid thread-group leader has already been reaped. In
this case it currently returns EINVAL. Userspace still wants to get
a pidfd for a reaped process to have a stable handle it can pass
on. This is especially useful now that it is possible to retrieve
exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s
PIDFD_INFO_EXIT flag
Another summary has been provided by David Rheinsberg:
> A pidfd can outlive the task it refers to, and thus user-space
> must already be prepared that the task underlying a pidfd is
> gone at the time they get their hands on the pidfd. For
> instance, resolving the pidfd to a PID via the fdinfo must be
> prepared to read `-1`.
>
> Despite user-space knowing that a pidfd might be stale, several
> kernel APIs currently add another layer that checks for this. In
> particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was
> already reaped, but returns a stale pidfd if the task is reaped
> immediately after the respective alive-check.
>
> This has the unfortunate effect that user-space now has two ways
> to check for the exact same scenario: A syscall might return
> EINVAL/ESRCH/... *or* the pidfd might be stale, even though
> there is no particular reason to distinguish both cases. This
> also propagates through user-space APIs, which pass on pidfds.
> They must be prepared to pass on `-1` *or* the pidfd, because
> there is no guaranteed way to get a stale pidfd from the kernel.
>
> Userspace must already deal with a pidfd referring to a reaped
> task as the task may exit and get reaped at any time will there
> are still many pidfds referring to it
In order to allow handing out reaped pidfd SO_PEERPIDFD needs to
ensure that PIDFD_INFO_EXIT information is available whenever a
pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi
promises that reaped pidfds are only handed out if it is guaranteed
that the caller sees the exit information:
TEST_F(pidfd_info, success_reaped)
{
struct pidfd_info info = {
.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
};
/*
* Process has already been reaped and PIDFD_INFO_EXIT been set.
* Verify that we can retrieve the exit status of the process.
*/
ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
ASSERT_TRUE(WIFEXITED(info.exit_code));
ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
}
To hand out pidfds for reaped processes we thus allocate a pidfs
entry for the relevant sk->sk_peer_pid at the time the
sk->sk_peer_pid is stashed and drop it when the socket is
destroyed. This guarantees that exit information will always be
recorded for the sk->sk_peer_pid task and we can hand out pidfds
for reaped processes
- Hand a pidfd to the coredump usermode helper process
Give userspace a way to instruct the kernel to install a pidfd for
the crashing process into the process started as a usermode helper.
There's still tricky race-windows that cannot be easily or
sometimes not closed at all by userspace. There's various ways like
looking at the start time of a process to make sure that the
usermode helper process is started after the crashing process but
it's all very very brittle and fraught with peril
The crashed-but-not-reaped process can be killed by userspace
before coredump processing programs like systemd-coredump have had
time to manually open a PIDFD from the PID the kernel provides
them, which means they can be tricked into reading from an
arbitrary process, and they run with full privileges as they are
usermode helper processes
Even if that specific race-window wouldn't exist it's still the
safest and cleanest way to let the kernel provide the pidfd
directly instead of requiring userspace to do it manually. In
parallel with this commit we already have systemd adding support
for this in [1]
When the usermode helper process is forked we install a pidfd file
descriptor three into the usermode helper's file descriptor table
so it's available to the exec'd program
Since usermode helpers are either children of the system_unbound_wq
workqueue or kthreadd we know that the file descriptor table is
empty and can thus always use three as the file descriptor number
Note, that we'll install a pidfd for the thread-group leader even
if a subthread is calling do_coredump(). We know that task linkage
hasn't been removed yet and even if this @current isn't the actual
thread-group leader we know that the thread-group leader cannot be
reaped until
@current has exited
- Allow telling when a task has not been found from finding the wrong
task when creating a pidfd
We currently report EINVAL whenever a struct pid has no tasked
attached anymore thereby conflating two concepts:
(1) The task has already been reaped
(2) The caller requested a pidfd for a thread-group leader but the
pid actually references a struct pid that isn't used as a
thread-group leader
This is causing issues for non-threaded workloads as in where they
expect ESRCH to be reported, not EINVAL
So allow userspace to reliably distinguish between (1) and (2)
- Make it possible to detect when a pidfs entry would outlive the
struct pid it pinned
- Add a range of new selftests
Cleanups:
- Remove unneeded NULL check from pidfd_prepare() for passed struct
pid
- Avoid pointless reference count bump during release_task()
Fixes:
- Various fixes to the pidfd and coredump selftests
- Fix error handling for replace_fd() when spawning coredump usermode
helper"
* tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: detect refcount bugs
coredump: hand a pidfd to the usermode coredump helper
coredump: fix error handling for replace_fd()
pidfs: move O_RDWR into pidfs_alloc_file()
selftests: coredump: Raise timeout to 2 minutes
selftests: coredump: Fix test failure for slow machines
selftests: coredump: Properly initialize pointer
net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
pidfs: get rid of __pidfd_prepare()
net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
pidfs: register pid in pidfs
net, pidfd: report EINVAL for ESRCH
release_task: kill the no longer needed get/put_pid(thread_pid)
pidfs: ensure consistent ENOENT/ESRCH reporting
exit: move wake_up_all() pidfd waiters into __unhash_process()
selftest/pidfd: add test for thread-group leader pidfd open for thread
pidfd: improve uapi when task isn't found
pidfd: remove unneeded NULL check from pidfd_prepare()
selftests/pidfd: adapt to recent changes
|
|
|
|
77cbe1a6d8 |
af_unix: Introduce SO_PASSRIGHTS.
As long as recvmsg() or recvmmsg() is used with cmsg, it is not possible to avoid receiving file descriptors via SCM_RIGHTS. This behaviour has occasionally been flagged as problematic, as it can be (ab)used to trigger DoS during close(), for example, by passing a FUSE-controlled fd or a hung NFS fd. For instance, as noted on the uAPI Group page [0], an untrusted peer could send a file descriptor pointing to a hung NFS mount and then close it. Once the receiver calls recvmsg() with msg_control, the descriptor is automatically installed, and then the responsibility for the final close() now falls on the receiver, which may result in blocking the process for a long time. Regarding this, systemd calls cmsg_close_all() [1] after each recvmsg() to close() unwanted file descriptors sent via SCM_RIGHTS. However, this cannot work around the issue at all, because the final fput() may still occur on the receiver's side once sendmsg() with SCM_RIGHTS succeeds. Also, even filtering by LSM at recvmsg() does not work for the same reason. Thus, we need a better way to refuse SCM_RIGHTS at sendmsg(). Let's introduce SO_PASSRIGHTS to disable SCM_RIGHTS. Note that this option is enabled by default for backward compatibility. Link: https://uapi-group.org/kernel-features/#disabling-reception-of-scm_rights-for-af_unix-sockets #[0] Link: https://github.com/systemd/systemd/blob/v257.5/src/basic/fd-util.c#L612-L628 #[1] Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
0e81cfd971 |
af_unix: Move SOCK_PASS{CRED,PIDFD,SEC} to struct sock.
As explained in the next patch, SO_PASSRIGHTS would have a problem
if we assigned a corresponding bit to socket->flags, so it must be
managed in struct sock.
Mixing socket->flags and sk->sk_flags for similar options will look
confusing, and sk->sk_flags does not have enough space on 32bit system.
Also, as mentioned in commit
|
|
|
|
7d8d93fdde |
net: Restrict SO_PASS{CRED,PIDFD,SEC} to AF_{UNIX,NETLINK,BLUETOOTH}.
SCM_CREDENTIALS and SCM_SECURITY can be recv()ed by calling
scm_recv() or scm_recv_unix(), and SCM_PIDFD is only used by
scm_recv_unix().
scm_recv() is called from AF_NETLINK and AF_BLUETOOTH.
scm_recv_unix() is literally called from AF_UNIX.
Let's restrict SO_PASSCRED and SO_PASSSEC to such sockets and
SO_PASSPIDFD to AF_UNIX only.
Later, SOCK_PASS{CRED,PIDFD,SEC} will be moved to struct sock
and united with another field.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
ae4f2f59e1 |
tcp: Restrict SO_TXREHASH to TCP socket.
sk->sk_txrehash is only used for TCP. Let's restrict SO_TXREHASH to TCP to reflect this. Later, we will make sk_txrehash a part of the union for other protocol families. Note that we need to modify BPF selftest not to get/set SO_TEREHASH for non-TCP sockets. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
bd61848900 |
net: devmem: Implement TX path
Augment dmabuf binding to be able to handle TX. Additional to all the RX binding, we also create tx_vec needed for the TX path. Provide API for sendmsg to be able to send dmabufs bound to this device: - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from. - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf. Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY implementation, while disabling instances where MSG_ZEROCOPY falls back to copying. We additionally pipe the binding down to the new zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems instead of the traditional page netmems. We also special case skb_frag_dma_map to return the dma-address of these dmabuf net_iovs instead of attempting to map pages. The TX path may release the dmabuf in a context where we cannot wait. This happens when the user unbinds a TX dmabuf while there are still references to its netmems in the TX path. In that case, the netmems will be put_netmem'd from a context where we can't unmap the dmabuf, Resolve this by making __net_devmem_dmabuf_binding_free schedule_work'd. Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat of the implementation came from devmem TCP RFC v1[1], which included the TX path, but Stan did all the rebasing on top of netmem/net_iov. Cc: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250508004830.4100853-5-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
20b70e5896
|
net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
Now that all preconditions are met, allow handing out pidfs for reaped sk->sk_peer_pids. Thanks to Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> for pointing out that we need to limit this to AF_UNIX sockets for now. Link: https://lore.kernel.org/20250425-work-pidfs-net-v2-4-450a19461e75@kernel.org Reviewed-by: David Rheinsberg <david@readahead.eu> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
b590c928cc
|
net, pidfd: report EINVAL for ESRCH
dbus-broker relies -EINVAL being returned to indicate ESRCH in [1].
This causes issues for some workloads as reported in [2].
Paper over it until this is fixed in userspace.
Link: https://lore.kernel.org/20250416-gegriffen-tiefbau-70cfecb80ac8@brauner
Link:
|
|
|
|
22d6c9eebf |
net: Unexport shared functions for DCCP.
DCCP was removed, so many inet functions no longer need to be exported. Let's unexport or use EXPORT_IPV6_MOD() for such functions. sk_free_unlock_clone() is inlined in sk_clone_lock() as it's the only caller. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250410023921.11307-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
faeefc173b |
sock: Correct error checking condition for (assign|release)_proto_idx()
(assign|release)_proto_idx() wrongly check find_first_zero_bit() failure by condition '(prot->inuse_idx == PROTO_INUSE_NR - 1)' obviously. Fix by correcting the condition to '(prot->inuse_idx == PROTO_INUSE_NR)' Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250410-fix_net-v2-1-d69e7c5739a4@quicinc.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
ab59a86056 |
Including fixes from netfilter.
Current release - regressions:
- core: hold instance lock during NETDEV_CHANGE
- rtnetlink: fix bad unlock balance in do_setlink().
- ipv6:
- fix null-ptr-deref in addrconf_add_ifaddr().
- align behavior across nexthops during path selection
Previous releases - regressions:
- sctp: prevent transport UaF in sendmsg
- mptcp: only inc MPJoinAckHMacFailure for HMAC failures
Previous releases - always broken:
- sched:
- make ->qlen_notify() idempotent
- ensure sufficient space when sending filter netlink notifications
- sch_sfq: really don't allow 1 packet limit
- netfilter: fix incorrect avx2 match of 5th field octet
- tls: explicitly disallow disconnect
- eth: octeontx2-pf: fix VF root node parent queue priority
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmf3xusSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkud0P/iWOQB0oj0nvxl2ionPzgJEPduxuF0V6
YPyDBUzLC7Gq6NmTcdDlNJt8fE6UmKUIneghUm9Ss7LRpKv0/TPvorKMSK44Zt53
a5q49JeoI0TvnnhJesdHjiF31hrInqZmcX8OjSH8Q/SCKuy7rsgzao0vjvhd7lxm
wA6LlWnJO1Pf991nNpbjUSoAZ7CMNlEIewGkdq0+6UADC7D9VagKTgIkFKw1BvRw
2Eb2pzvdO9Pj02+l/mjdRhUzMZlr+FG+WBqXk5oKR0YZ2t3CS4O9/UUBoAn775tM
gCfzepNuAUXGX0I6h+DANCNuswWuG/IvYTdhy+hRWblYeCkILU60E8eVMlh7tpII
fUd5GSRhX1NpGNHUlDG/4b6IcjMO3ebtce2cm2Y9t2CUe7EqB0HZyvTczNroTxip
KXrXcCBuEkzxXCZhaN/CrBu8Piu8vJk/rMH5ha1khce9CkmYY+m9ruvsYjZmPI+/
P/SFkRdb/yV/SIOmay8FCJsy60t4FOtLnlDDrnygq4Q/9a7VwafebVpKS1fbTELG
ZTiELN3/PN2GUnfREf0DVLPfn9sdqrMZaclLOJpp/Zi1/RZpo52WHceXJShiu9pe
A8B+3SuPgOaLfhwyqiHlWm5moc9kNF26vlrWfFjK1GrJdxMisYwQoWD5eHfQFhDX
UaxlAmndwZa9
=wkGz
-----END PGP SIGNATURE-----
Merge tag 'net-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter.
Current release - regressions:
- core: hold instance lock during NETDEV_CHANGE
- rtnetlink: fix bad unlock balance in do_setlink()
- ipv6:
- fix null-ptr-deref in addrconf_add_ifaddr()
- align behavior across nexthops during path selection
Previous releases - regressions:
- sctp: prevent transport UaF in sendmsg
- mptcp: only inc MPJoinAckHMacFailure for HMAC failures
Previous releases - always broken:
- sched:
- make ->qlen_notify() idempotent
- ensure sufficient space when sending filter netlink notifications
- sch_sfq: really don't allow 1 packet limit
- netfilter: fix incorrect avx2 match of 5th field octet
- tls: explicitly disallow disconnect
- eth: octeontx2-pf: fix VF root node parent queue priority"
* tag 'net-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
ethtool: cmis_cdb: Fix incorrect read / write length extension
selftests: netfilter: add test case for recent mismatch bug
nft_set_pipapo: fix incorrect avx2 match of 5th field octet
net: ppp: Add bound checking for skb data on ppp_sync_txmung
net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.
ipv6: Align behavior across nexthops during path selection
net: phy: allow MDIO bus PM ops to start/stop state machine for phylink-controlled PHY
net: phy: move phy_link_change() prior to mdio_bus_phy_may_suspend()
selftests/tc-testing: sfq: check that a derived limit of 1 is rejected
net_sched: sch_sfq: move the limit validation
net_sched: sch_sfq: use a temporary work area for validating configuration
net: libwx: handle page_pool_dev_alloc_pages error
selftests: mptcp: validate MPJoin HMacFailure counters
mptcp: only inc MPJoinAckHMacFailure for HMAC failures
rtnetlink: Fix bad unlock balance in do_setlink().
net: ethtool: Don't call .cleanup_data when prepare_data fails
tc: Ensure we have enough buffer space when sending filter netlink notifications
net: libwx: Fix the wrong Rx descriptor field
octeontx2-pf: qos: fix VF root node parent queue index
selftests: tls: check that disconnect does nothing
...
|
|
|
|
0bb2f7a1ad |
net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.
When I ran the repro [0] and waited a few seconds, I observed two
LOCKDEP splats: a warning immediately followed by a null-ptr-deref. [1]
Reproduction Steps:
1) Mount CIFS
2) Add an iptables rule to drop incoming FIN packets for CIFS
3) Unmount CIFS
4) Unload the CIFS module
5) Remove the iptables rule
At step 3), the CIFS module calls sock_release() for the underlying
TCP socket, and it returns quickly. However, the socket remains in
FIN_WAIT_1 because incoming FIN packets are dropped.
At this point, the module's refcnt is 0 while the socket is still
alive, so the following rmmod command succeeds.
# ss -tan
State Recv-Q Send-Q Local Address:Port Peer Address:Port
FIN-WAIT-1 0 477 10.0.2.15:51062 10.0.0.137:445
# lsmod | grep cifs
cifs 1159168 0
This highlights a discrepancy between the lifetime of the CIFS module
and the underlying TCP socket. Even after CIFS calls sock_release()
and it returns, the TCP socket does not die immediately in order to
close the connection gracefully.
While this is generally fine, it causes an issue with LOCKDEP because
CIFS assigns a different lock class to the TCP socket's sk->sk_lock
using sock_lock_init_class_and_name().
Once an incoming packet is processed for the socket or a timer fires,
sk->sk_lock is acquired.
Then, LOCKDEP checks the lock context in check_wait_context(), where
hlock_class() is called to retrieve the lock class. However, since
the module has already been unloaded, hlock_class() logs a warning
and returns NULL, triggering the null-ptr-deref.
If LOCKDEP is enabled, we must ensure that a module calling
sock_lock_init_class_and_name() (CIFS, NFS, etc) cannot be unloaded
while such a socket is still alive to prevent this issue.
Let's hold the module reference in sock_lock_init_class_and_name()
and release it when the socket is freed in sk_prot_free().
Note that sock_lock_init() clears sk->sk_owner for svc_create_socket()
that calls sock_lock_init_class_and_name() for a listening socket,
which clones a socket by sk_clone_lock() without GFP_ZERO.
[0]:
CIFS_SERVER="10.0.0.137"
CIFS_PATH="//${CIFS_SERVER}/Users/Administrator/Desktop/CIFS_TEST"
DEV="enp0s3"
CRED="/root/WindowsCredential.txt"
MNT=$(mktemp -d /tmp/XXXXXX)
mount -t cifs ${CIFS_PATH} ${MNT} -o vers=3.0,credentials=${CRED},cache=none,echo_interval=1
iptables -A INPUT -s ${CIFS_SERVER} -j DROP
for i in $(seq 10);
do
umount ${MNT}
rmmod cifs
sleep 1
done
rm -r ${MNT}
iptables -D INPUT -s ${CIFS_SERVER} -j DROP
[1]:
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 10 PID: 0 at kernel/locking/lockdep.c:234 hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223)
Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.14.0 #36
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223)
...
Call Trace:
<IRQ>
__lock_acquire (kernel/locking/lockdep.c:4853 kernel/locking/lockdep.c:5178)
lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816)
_raw_spin_lock_nested (kernel/locking/spinlock.c:379)
tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350)
...
BUG: kernel NULL pointer dereference, address: 00000000000000c4
PF: supervisor read access in kernel mode
PF: error_code(0x0000) - not-present page
PGD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Tainted: G W 6.14.0 #36
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:__lock_acquire (kernel/locking/lockdep.c:4852 kernel/locking/lockdep.c:5178)
Code: 15 41 09 c7 41 8b 44 24 20 25 ff 1f 00 00 41 09 c7 8b 84 24 a0 00 00 00 45 89 7c 24 20 41 89 44 24 24 e8 e1 bc ff ff 4c 89 e7 <44> 0f b6 b8 c4 00 00 00 e8 d1 bc ff ff 0f b6 80 c5 00 00 00 88 44
RSP: 0018:ffa0000000468a10 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ff1100010091cc38 RCX: 0000000000000027
RDX: ff1100081f09ca48 RSI: 0000000000000001 RDI: ff1100010091cc88
RBP: ff1100010091c200 R08: ff1100083fe6e228 R09: 00000000ffffbfff
R10: ff1100081eca0000 R11: ff1100083fe10dc0 R12: ff1100010091cc88
R13: 0000000000000001 R14: 0000000000000000 R15: 00000000000424b1
FS: 0000000000000000(0000) GS:ff1100081f080000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000c4 CR3: 0000000002c4a003 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<IRQ>
lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816)
_raw_spin_lock_nested (kernel/locking/spinlock.c:379)
tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350)
ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205 (discriminator 1))
ip_local_deliver_finish (./include/linux/rcupdate.h:878 net/ipv4/ip_input.c:234)
ip_sublist_rcv_finish (net/ipv4/ip_input.c:576)
ip_list_rcv_finish (net/ipv4/ip_input.c:628)
ip_list_rcv (net/ipv4/ip_input.c:670)
__netif_receive_skb_list_core (net/core/dev.c:5939 net/core/dev.c:5986)
netif_receive_skb_list_internal (net/core/dev.c:6040 net/core/dev.c:6129)
napi_complete_done (./include/linux/list.h:37 ./include/net/gro.h:519 ./include/net/gro.h:514 net/core/dev.c:6496)
e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3815)
__napi_poll.constprop.0 (net/core/dev.c:7191)
net_rx_action (net/core/dev.c:7262 net/core/dev.c:7382)
handle_softirqs (kernel/softirq.c:561)
__irq_exit_rcu (kernel/softirq.c:596 kernel/softirq.c:435 kernel/softirq.c:662)
irq_exit_rcu (kernel/softirq.c:680)
common_interrupt (arch/x86/kernel/irq.c:280 (discriminator 14))
</IRQ>
<TASK>
asm_common_interrupt (./arch/x86/include/asm/idtentry.h:693)
RIP: 0010:default_idle (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:92 arch/x86/kernel/process.c:744)
Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d c3 2b 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
RSP: 0018:ffa00000000ffee8 EFLAGS: 00000202
RAX: 000000000000640b RBX: ff1100010091c200 RCX: 0000000000061aa4
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff812f30c5
RBP: 000000000000000a R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
? do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325)
default_idle_call (./include/linux/cpuidle.h:143 kernel/sched/idle.c:118)
do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325)
cpu_startup_entry (kernel/sched/idle.c:422 (discriminator 1))
start_secondary (arch/x86/kernel/smpboot.c:315)
common_startup_64 (arch/x86/kernel/head_64.S:421)
</TASK>
Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CR2: 00000000000000c4
Fixes:
|
|
|
|
8fa7292fee |
treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
15492700ac |
tcp: cache RTAX_QUICKACK metric in a hot cache line
tcp_in_quickack_mode() is called from input path for small packets. It calls __sk_dst_get() which reads sk->sk_dst_cache which has been put in sock_read_tx group (for good reasons). Then dst_metric(dst, RTAX_QUICKACK) also needs extra cache line misses. Cache RTAX_QUICKACK in icsk->icsk_ack.dst_quick_ack to no longer pull these cache lines for the cases a delayed ACK is scheduled. After this patch TCP receive path does not longer access sock_read_tx group. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250312083907.1931644-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
456cc675b6 |
sock: add sock_kmemdup helper
This patch adds the sock version of kmemdup() helper, named sock_kmemdup(), to duplicate the input "src" memory block using the socket's option memory buffer. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/f828077394c7d1f3560123497348b438c875b510.1740735165.git.tanggeliang@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
357660d759 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc5). Conflicts: drivers/net/ethernet/cadence/macb_main.c |
|
|
|
5c70eb5c59 |
net: better track kernel sockets lifetime
While kernel sockets are dismantled during pernet_operations->exit(),
their freeing can be delayed by any tx packets still held in qdisc
or device queues, due to skb_set_owner_w() prior calls.
This then trigger the following warning from ref_tracker_dir_exit() [1]
To fix this, make sure that kernel sockets own a reference on net->passive.
Add sk_net_refcnt_upgrade() helper, used whenever a kernel socket
is converted to a refcounted one.
[1]
[ 136.263918][ T35] ref_tracker: net notrefcnt@ffff8880638f01e0 has 1/2 users at
[ 136.263918][ T35] sk_alloc+0x2b3/0x370
[ 136.263918][ T35] inet6_create+0x6ce/0x10f0
[ 136.263918][ T35] __sock_create+0x4c0/0xa30
[ 136.263918][ T35] inet_ctl_sock_create+0xc2/0x250
[ 136.263918][ T35] igmp6_net_init+0x39/0x390
[ 136.263918][ T35] ops_init+0x31e/0x590
[ 136.263918][ T35] setup_net+0x287/0x9e0
[ 136.263918][ T35] copy_net_ns+0x33f/0x570
[ 136.263918][ T35] create_new_namespaces+0x425/0x7b0
[ 136.263918][ T35] unshare_nsproxy_namespaces+0x124/0x180
[ 136.263918][ T35] ksys_unshare+0x57d/0xa70
[ 136.263918][ T35] __x64_sys_unshare+0x38/0x40
[ 136.263918][ T35] do_syscall_64+0xf3/0x230
[ 136.263918][ T35] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 136.263918][ T35]
[ 136.343488][ T35] ref_tracker: net notrefcnt@ffff8880638f01e0 has 1/2 users at
[ 136.343488][ T35] sk_alloc+0x2b3/0x370
[ 136.343488][ T35] inet6_create+0x6ce/0x10f0
[ 136.343488][ T35] __sock_create+0x4c0/0xa30
[ 136.343488][ T35] inet_ctl_sock_create+0xc2/0x250
[ 136.343488][ T35] ndisc_net_init+0xa7/0x2b0
[ 136.343488][ T35] ops_init+0x31e/0x590
[ 136.343488][ T35] setup_net+0x287/0x9e0
[ 136.343488][ T35] copy_net_ns+0x33f/0x570
[ 136.343488][ T35] create_new_namespaces+0x425/0x7b0
[ 136.343488][ T35] unshare_nsproxy_namespaces+0x124/0x180
[ 136.343488][ T35] ksys_unshare+0x57d/0xa70
[ 136.343488][ T35] __x64_sys_unshare+0x38/0x40
[ 136.343488][ T35] do_syscall_64+0xf3/0x230
[ 136.343488][ T35] entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes:
|
|
|
|
e87700965a |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCZ7ffOQAKCRAraS+Z+3y5 EzVHAP9h/QkeYoOZW9gul08I8vFiZsFe/lbOSLJWxeVfxb9JhgD/cMqby3qAxQK6 lsdNQ9jYG2232Wym89ag7fvTBK15Wg4= =gkN2 -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-02-20 We've added 19 non-merge commits during the last 8 day(s) which contain a total of 35 files changed, 1126 insertions(+), 53 deletions(-). The main changes are: 1) Add TCP_RTO_MAX_MS support to bpf_set/getsockopt, from Jason Xing 2) Add network TX timestamping support to BPF sock_ops, from Jason Xing 3) Add TX metadata Launch Time support, from Song Yoong Siang * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: igc: Add launch time support to XDP ZC igc: Refactor empty frame insertion for launch time support net: stmmac: Add launch time support to XDP ZC selftests/bpf: Add launch time request to xdp_hw_metadata xsk: Add launch time hardware offload support to XDP Tx metadata selftests/bpf: Add simple bpf tests in the tx path for timestamping feature bpf: Support selective sampling for bpf timestamping bpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING bpf: Disable unsafe helpers in TX timestamping callbacks bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping bpf: Add networking timestamping support to bpf_get/setsockopt() selftests/bpf: Add rto max for bpf_setsockopt test bpf: Support TCP_RTO_MAX_MS for bpf_setsockopt ==================== Link: https://patch.msgid.link/20250221022104.386462-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
df600f3b1d |
bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping
This patch introduces a new bpf_skops_tx_timestamping() function that prepares the "struct bpf_sock_ops" ctx and then executes the sockops BPF program. The subsequent patch will utilize bpf_skops_tx_timestamping() at the existing TX timestamping kernel callbacks (__sk_tstamp_tx specifically) to call the sockops BPF program. Later, four callback points to report information to user space based on this patch will be introduced. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250220072940.99994-3-kerneljasonxing@gmail.com |
|
|
|
c8802ded46 |
net: dismiss sk_forward_alloc_get()
After the previous patch we can remove the forward_alloc_get proto callback, basically reverting commit |
|
|
|
b9d752105e |
net: use napi_id_valid helper
In commit
|
|
|
|
f0e70409b7 |
net: avoid unconditionally touching sk_tsflags on RX
After commit
|
|
|
|
385f186aba |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc6). No conflicts. Adjacent changes: include/linux/if_vlan.h |
|
|
|
5b0af621c3 |
net: restrict SO_REUSEPORT to inet sockets
After blamed commit, crypto sockets could accidentally be destroyed
from RCU call back, as spotted by zyzbot [1].
Trying to acquire a mutex in RCU callback is not allowed.
Restrict SO_REUSEPORT socket option to inet sockets.
v1 of this patch supported TCP, UDP and SCTP sockets,
but fcnal-test.sh test needed RAW and ICMP support.
[1]
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:562
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 24, name: ksoftirqd/1
preempt_count: 100, expected: 0
RCU nest depth: 0, expected: 0
1 lock held by ksoftirqd/1/24:
#0: ffffffff8e937ba0 (rcu_callback){....}-{0:0}, at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
#0: ffffffff8e937ba0 (rcu_callback){....}-{0:0}, at: rcu_do_batch kernel/rcu/tree.c:2561 [inline]
#0: ffffffff8e937ba0 (rcu_callback){....}-{0:0}, at: rcu_core+0xa37/0x17a0 kernel/rcu/tree.c:2823
Preemption disabled at:
[<ffffffff8161c8c8>] softirq_handle_begin kernel/softirq.c:402 [inline]
[<ffffffff8161c8c8>] handle_softirqs+0x128/0x9b0 kernel/softirq.c:537
CPU: 1 UID: 0 PID: 24 Comm: ksoftirqd/1 Not tainted 6.13.0-rc3-syzkaller-00174-ga024e377efed #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
__might_resched+0x5d4/0x780 kernel/sched/core.c:8758
__mutex_lock_common kernel/locking/mutex.c:562 [inline]
__mutex_lock+0x131/0xee0 kernel/locking/mutex.c:735
crypto_put_default_null_skcipher+0x18/0x70 crypto/crypto_null.c:179
aead_release+0x3d/0x50 crypto/algif_aead.c:489
alg_do_release crypto/af_alg.c:118 [inline]
alg_sock_destruct+0x86/0xc0 crypto/af_alg.c:502
__sk_destruct+0x58/0x5f0 net/core/sock.c:2260
rcu_do_batch kernel/rcu/tree.c:2567 [inline]
rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823
handle_softirqs+0x2d4/0x9b0 kernel/softirq.c:561
run_ksoftirqd+0xca/0x130 kernel/softirq.c:950
smpboot_thread_fn+0x544/0xa30 kernel/smpboot.c:164
kthread+0x2f0/0x390 kernel/kthread.c:389
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
</TASK>
Fixes:
|
|
|
|
e45469e594 |
sock: Introduce SO_RCVPRIORITY socket option
Add new socket option, SO_RCVPRIORITY, to include SO_PRIORITY in the
ancillary data returned by recvmsg().
This is analogous to the existing support for SO_RCVMARK,
as implemented in commit
|
|
|
|
a32f3e9d1e |
sock: support SO_PRIORITY cmsg
The Linux socket API currently allows setting SO_PRIORITY at the socket level, applying a uniform priority to all packets sent through that socket. The exception to this is IP_TOS, when the priority value is calculated during the handling of ancillary data, as implemented in commit |