linux

Commit Graph

Author	SHA1	Message	Date
Vladimir Oltean	4e4c00f34d	Documentation: net: dsa: mention simple HSR offload helpers Keep the documentation up to date. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20251130131657.65080-16-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-12-01 16:51:55 -08:00
Vladimir Oltean	977839161f	Documentation: net: dsa: mention availability of RedBox Since commit `5055cccfc2` ("net: hsr: Provide RedBox support (HSR-SAN)"), RedBox is available (including for offload in DSA). Update the DSA documentation that states it isn't. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20251130131657.65080-15-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-12-01 16:51:49 -08:00
Eric Dumazet	9a5e5334ad	tcp: remove icsk->icsk_retransmit_timer Now sk->sk_timer is no longer used by TCP keepalive, we can use its storage for TCP and MPTCP retransmit timers for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-25 19:28:29 -08:00
Eric Dumazet	08dfe37023	tcp: introduce icsk->icsk_keepalive_timer sk->sk_timer has been used for TCP keepalives. Keepalive timers are not in fast path, we want to use sk->sk_timer storage for retransmit timers, for better cache locality. Create icsk->icsk_keepalive_timer and change keepalive code to no longer use sk->sk_timer. Added space is reclaimed in the following patch. This includes changes to MPTCP, which was also using sk_timer. Alias icsk->mptcp_tout_timer and icsk->icsk_keepalive_timer for inet_sk_diag_fill() sake. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-25 19:28:29 -08:00
Daniel Zahka	b11d358bf8	net/mlx5: implement swp_l4_csum_mode via devlink params swp_l4_csum_mode controls how L4 transmit checksums are computed when using Software Parser (SWP) hints for header locations. Supported values: 1. default: device will choose between full_csum or l4_only. Driver will discover the device's choice during initialization. 2. full_csum: calculate L4 checksum with the pseudo-header. 3. l4_only: calculate L4 checksum without the pseudo-header. Only available when swp_l4_csum_mode_l4_only is set in mlx5_ifc_nv_sw_offload_cap_bits. Note that 'default' might be returned from the device and passed to userspace, and it might also be set during a devlink_param::reset_default() call, but attempts to set a value of default directly with param-set will be rejected. The l4_only setting is a dependency for PSP initialization in mlx5e_psp_init(). Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20251119025038.651131-5-daniel.zahka@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-20 19:01:22 -08:00
Daniel Zahka	2a367002ed	devlink: support default values for param-get and param-set Support querying and resetting to default param values. Introduce two new devlink netlink attrs: DEVLINK_ATTR_PARAM_VALUE_DEFAULT and DEVLINK_ATTR_PARAM_RESET_DEFAULT. The former is used to contain an optional parameter value inside of the param_value nested attribute. The latter is used in param-set requests from userspace to indicate that the driver should reset the param to its default value. To implement this, two new functions are added to the devlink driver api: devlink_param::get_default() and devlink_param::reset_default(). These callbacks allow drivers to implement default param actions for runtime and permanent cmodes. For driverinit params, the core latches the last value set by a driver via devl_param_driverinit_value_set(), and uses that as the default value for a param. Because default parameter values are optional, it would be impossible to discern whether or not a param of type bool has default value of false or not provided if the default value is encoded using a netlink flag type. For this reason, when a DEVLINK_PARAM_TYPE_BOOL has an associated default value, the default value is encoded using a u8 type. Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20251119025038.651131-4-daniel.zahka@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-20 19:01:22 -08:00
Eric Dumazet	ecfea98b7d	tcp: add net.ipv4.tcp_rcvbuf_low_rtt This is a follow up of commit `aa251c8463` ("tcp: fix too slow tcp_rcvbuf_grow() action") which brought again the issue that I tried to fix in commit `65c5287892` ("tcp: fix sk_rcvbuf overshoot"). We also recently increased tcp_rmem[2] to 32 MB in commit `572be9bf9d` ("tcp: increase tcp_rmem[2] to 32 MB"). Idea of this patch is to not let tcp_rcvbuf_grow() grow sk->sk_rcvbuf too fast for small RTT flows. If sk->sk_rcvbuf is too big, this can force NIC driver to not recycle pages from their page pool, and also can cause cache evictions for DDIO enabled cpus/NIC, as receivers are usually slower than senders. Add net.ipv4.tcp_rcvbuf_low_rtt sysctl, set by default to 1000 usec (1 ms). If RTT is smaller than the sysctl value, use the RTT/tcp_rcvbuf_low_rtt ratio to control sk_rcvbuf inflation. Tested: Pair of hosts with a 200Gbit IDPF NIC. Using netperf/netserver Client initiates 8 TCP bulk flows, asking netserver to use CPU #10 only. super_netperf 8 -H server -T,10 -l 30 On server, use perf -e tcp:tcp_rcvbuf_grow while test is running. 
Before: sysctl -w net.ipv4.tcp_rcvbuf_low_rtt=1 perf record -a -e tcp:tcp_rcvbuf_grow sleep 30 ; perf script\|tail -20\|cut -c30-230 1153.051201: tcp:tcp_rcvbuf_grow: time=398 rtt_us=382 copied=6905856 inq=180224 space=6115328 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil 1153.138752: tcp:tcp_rcvbuf_grow: time=446 rtt_us=413 copied=5529600 inq=180224 space=4505600 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=21286912 famil 1153.361484: tcp:tcp_rcvbuf_grow: time=415 rtt_us=380 copied=7061504 inq=204800 space=6725632 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil 1153.457642: tcp:tcp_rcvbuf_grow: time=483 rtt_us=421 copied=5885952 inq=720896 space=4407296 ooo=0 scaling_ratio=240 rcvbuf=23763511 rcv_ssthresh=22223271 window_clamp=22278291 rcv_wnd=21430272 famil 1153.466002: tcp:tcp_rcvbuf_grow: time=308 rtt_us=281 copied=3244032 inq=180224 space=2883584 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41992059 window_clamp=42050919 rcv_wnd=41713664 famil 1153.747792: tcp:tcp_rcvbuf_grow: time=394 rtt_us=332 copied=4460544 inq=585728 space=3063808 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41992059 window_clamp=42050919 rcv_wnd=41373696 famil 1154.260747: tcp:tcp_rcvbuf_grow: time=652 rtt_us=226 copied=10977280 inq=737280 space=9486336 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29197743 window_clamp=29217691 rcv_wnd=28368896 fami 1154.375019: tcp:tcp_rcvbuf_grow: time=461 rtt_us=443 copied=7573504 inq=507904 space=6856704 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25288704 famil 1154.463072: tcp:tcp_rcvbuf_grow: time=494 rtt_us=408 copied=7983104 inq=200704 space=7065600 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25579520 famil 1154.474658: tcp:tcp_rcvbuf_grow: time=507 rtt_us=459 
copied=5586944 inq=540672 space=4718592 ooo=0 scaling_ratio=240 rcvbuf=17852266 rcv_ssthresh=16692999 window_clamp=16736499 rcv_wnd=16056320 famil 1154.584657: tcp:tcp_rcvbuf_grow: time=494 rtt_us=427 copied=8126464 inq=204800 space=7782400 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil 1154.702117: tcp:tcp_rcvbuf_grow: time=480 rtt_us=406 copied=5734400 inq=180224 space=5349376 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=21286912 famil 1155.941595: tcp:tcp_rcvbuf_grow: time=717 rtt_us=670 copied=11042816 inq=3784704 space=7159808 ooo=0 scaling_ratio=240 rcvbuf=19581357 rcv_ssthresh=18333222 window_clamp=18357522 rcv_wnd=14614528 fam 1156.384735: tcp:tcp_rcvbuf_grow: time=529 rtt_us=473 copied=9011200 inq=180224 space=7258112 ooo=0 scaling_ratio=240 rcvbuf=19581357 rcv_ssthresh=18333222 window_clamp=18357522 rcv_wnd=18018304 famil 1157.821676: tcp:tcp_rcvbuf_grow: time=529 rtt_us=272 copied=8224768 inq=602112 space=6545408 ooo=0 scaling_ratio=240 rcvbuf=67000000 rcv_ssthresh=62793576 window_clamp=62812500 rcv_wnd=62115840 famil 1158.906379: tcp:tcp_rcvbuf_grow: time=710 rtt_us=445 copied=11845632 inq=540672 space=10240000 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29205935 window_clamp=29217691 rcv_wnd=28536832 fam 1164.600160: tcp:tcp_rcvbuf_grow: time=841 rtt_us=430 copied=12976128 inq=1290240 space=11304960 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29212591 window_clamp=29217691 rcv_wnd=27856896 fa 1165.163572: tcp:tcp_rcvbuf_grow: time=845 rtt_us=800 copied=12632064 inq=540672 space=7921664 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25912795 window_clamp=25937095 rcv_wnd=25260032 fami 1165.653464: tcp:tcp_rcvbuf_grow: time=388 rtt_us=309 copied=4493312 inq=180224 space=3874816 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41995899 window_clamp=42050919 rcv_wnd=41713664 famil 1166.651211: tcp:tcp_rcvbuf_grow: time=556 
rtt_us=553 copied=6328320 inq=540672 space=5554176 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=20946944 famil After: sysctl -w net.ipv4.tcp_rcvbuf_low_rtt=1000 perf record -a -e tcp:tcp_rcvbuf_grow sleep 30 ; perf script\|tail -20\|cut -c30-230 1457.053149: tcp:tcp_rcvbuf_grow: time=128 rtt_us=24 copied=1441792 inq=40960 space=1269760 ooo=0 scaling_ratio=240 rcvbuf=2960741 rcv_ssthresh=2605474 window_clamp=2775694 rcv_wnd=2568192 family=AF_I 1458.000778: tcp:tcp_rcvbuf_grow: time=128 rtt_us=31 copied=1441792 inq=24576 space=1400832 ooo=0 scaling_ratio=240 rcvbuf=3060163 rcv_ssthresh=2810042 window_clamp=2868902 rcv_wnd=2674688 family=AF_I 1458.088059: tcp:tcp_rcvbuf_grow: time=190 rtt_us=110 copied=3227648 inq=385024 space=2781184 ooo=0 scaling_ratio=240 rcvbuf=6728240 rcv_ssthresh=6252705 window_clamp=6307725 rcv_wnd=5799936 family=AF 1458.148549: tcp:tcp_rcvbuf_grow: time=232 rtt_us=129 copied=3956736 inq=237568 space=2842624 ooo=0 scaling_ratio=240 rcvbuf=6731333 rcv_ssthresh=6252705 window_clamp=6310624 rcv_wnd=5918720 family=AF 1458.466861: tcp:tcp_rcvbuf_grow: time=193 rtt_us=83 copied=2949120 inq=180224 space=2457600 ooo=0 scaling_ratio=240 rcvbuf=5751438 rcv_ssthresh=5357689 window_clamp=5391973 rcv_wnd=5054464 family=AF_ 1458.775476: tcp:tcp_rcvbuf_grow: time=257 rtt_us=127 copied=4304896 inq=352256 space=3346432 ooo=0 scaling_ratio=240 rcvbuf=8067131 rcv_ssthresh=7523275 window_clamp=7562935 rcv_wnd=7061504 family=AF 1458.776631: tcp:tcp_rcvbuf_grow: time=200 rtt_us=96 copied=3260416 inq=143360 space=2768896 ooo=0 scaling_ratio=240 rcvbuf=6397256 rcv_ssthresh=5938567 window_clamp=5997427 rcv_wnd=5828608 family=AF_ 1459.707973: tcp:tcp_rcvbuf_grow: time=215 rtt_us=96 copied=2506752 inq=163840 space=1388544 ooo=0 scaling_ratio=240 rcvbuf=3068867 rcv_ssthresh=2768282 window_clamp=2877062 rcv_wnd=2555904 family=AF_ 1460.246494: tcp:tcp_rcvbuf_grow: time=231 rtt_us=80 copied=3756032 inq=204800 space=3117056 ooo=0 
scaling_ratio=240 rcvbuf=7288091 rcv_ssthresh=6773725 window_clamp=6832585 rcv_wnd=6471680 family=AF_ 1460.714596: tcp:tcp_rcvbuf_grow: time=270 rtt_us=110 copied=4714496 inq=311296 space=3719168 ooo=0 scaling_ratio=240 rcvbuf=8957739 rcv_ssthresh=8339020 window_clamp=8397880 rcv_wnd=7933952 family=AF 1462.029977: tcp:tcp_rcvbuf_grow: time=101 rtt_us=19 copied=1105920 inq=40960 space=1036288 ooo=0 scaling_ratio=240 rcvbuf=2338970 rcv_ssthresh=2091684 window_clamp=2192784 rcv_wnd=1986560 family=AF_I 1462.802385: tcp:tcp_rcvbuf_grow: time=89 rtt_us=45 copied=1069056 inq=0 space=1064960 ooo=0 scaling_ratio=240 rcvbuf=2338970 rcv_ssthresh=2091684 window_clamp=2192784 rcv_wnd=2035712 family=AF_INET6 1462.918648: tcp:tcp_rcvbuf_grow: time=105 rtt_us=33 copied=1441792 inq=180224 space=1069056 ooo=0 scaling_ratio=240 rcvbuf=2383282 rcv_ssthresh=2091684 window_clamp=2234326 rcv_wnd=1896448 family=AF_ 1463.222533: tcp:tcp_rcvbuf_grow: time=273 rtt_us=144 copied=4603904 inq=385024 space=3469312 ooo=0 scaling_ratio=240 rcvbuf=8422564 rcv_ssthresh=7891053 window_clamp=7896153 rcv_wnd=7409664 family=AF 1466.519312: tcp:tcp_rcvbuf_grow: time=130 rtt_us=23 copied=1343488 inq=0 space=1261568 ooo=0 scaling_ratio=240 rcvbuf=2780158 rcv_ssthresh=2493778 window_clamp=2606398 rcv_wnd=2494464 family=AF_INET6 1466.681003: tcp:tcp_rcvbuf_grow: time=128 rtt_us=21 copied=1441792 inq=12288 space=1343488 ooo=0 scaling_ratio=240 rcvbuf=2932027 rcv_ssthresh=2578555 window_clamp=2748775 rcv_wnd=2568192 family=AF_I 1470.689959: tcp:tcp_rcvbuf_grow: time=255 rtt_us=122 copied=3932160 inq=204800 space=3551232 ooo=0 scaling_ratio=240 rcvbuf=8182038 rcv_ssthresh=7647384 window_clamp=7670660 rcv_wnd=7442432 family=AF 1471.754154: tcp:tcp_rcvbuf_grow: time=188 rtt_us=95 copied=2138112 inq=577536 space=1429504 ooo=0 scaling_ratio=240 rcvbuf=3113650 rcv_ssthresh=2806426 window_clamp=2919046 rcv_wnd=2248704 family=AF_ 1476.813542: tcp:tcp_rcvbuf_grow: time=269 rtt_us=99 copied=3088384 inq=180224 
space=2564096 ooo=0 scaling_ratio=240 rcvbuf=6219470 rcv_ssthresh=5771893 window_clamp=5830753 rcv_wnd=5509120 family=AF_ 1477.738309: tcp:tcp_rcvbuf_grow: time=166 rtt_us=54 copied=1777664 inq=180224 space=1417216 ooo=0 scaling_ratio=240 rcvbuf=3117118 rcv_ssthresh=2874958 window_clamp=2922298 rcv_wnd=2613248 family=AF_ We can see sk_rcvbuf values are much smaller, and that rtt_us (estimation of rtt from a receiver point of view) is kept small, instead of being bloated. No difference in throughput. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Tested-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20251119084813.3684576-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-20 17:44:23 -08:00
Eric Dumazet	6d5dea6824	tcp: tcp_moderate_rcvbuf is only used in rx path sysctl_tcp_moderate_rcvbuf is only used from tcp_rcvbuf_grow(). Move it to netns_ipv4_read_rx group. Remove various CACHELINE_ASSERT_GROUP_SIZE() from netns_ipv4_struct_check(), as they have no real benefit but cause pain for all changes. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251119084813.3684576-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-20 17:44:23 -08:00
Jakub Kicinski	6785aa9d20	ipsec-next-2025-11-18 Merge tag 'ipsec-next-2025-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2025-11-18 1) Relax a lock contention bottleneck to improve IPsec crypto offload performance. From Jianbo Liu. 2) Deprecate pfkey, the interface will be removed in 2027. 3) Update xfrm documentation and move it to ipsec maintenance. From Bagas Sanjaya. 
* tag 'ipsec-next-2025-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: MAINTAINERS: Add entry for XFRM documentation net: Move XFRM documentation into its own subdirectory Documentation: xfrm_sync: Number the fifth section Documentation: xfrm_sysctl: Trim trailing colon in section heading Documentation: xfrm_sync: Trim excess section heading characters Documentation: xfrm_sync: Properly reindent list text Documentation: xfrm_device: Separate hardware offload sublists Documentation: xfrm_device: Use numbered list for offloading steps Documentation: xfrm_device: Wrap iproute2 snippets in literal code block pfkey: Deprecate pfkey xfrm: Skip redundant replay recheck for the hardware offload path xfrm: Refactor xfrm_input lock to reduce contention with RSS ==================== Link: https://patch.msgid.link/20251118092610.2223552-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-18 17:55:41 -08:00
Eric Dumazet	ca412f25d6	tcp: reduce tcp_comp_sack_slack_ns default value to 10 usec net.ipv4.tcp_comp_sack_slack_ns current default value is too high. When a flow has many drops (1 % or more), and small RTT, adding 100 usec before sending SACK stalls the sender relying on getting SACK fast enough to keep the pipe busy. Decrease the default to 10 usec. This is orthogonal to Congestion Control heuristics to determine if drops are caused by congestion or not. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20251114135141.3810964-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-17 17:02:43 -08:00
Bagas Sanjaya	03e23b18c7	net: Move XFRM documentation into its own subdirectory XFRM docs currently reside in the Documentation/networking directory, yet they form a distinct group of their own. Move them into the xfrm subdirectory. Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2025-11-12 08:30:03 +01:00
Bagas Sanjaya	7276e7ae56	Documentation: xfrm_sync: Number the fifth section Number the fifth section ("Exception to threshold settings") to be consistent with the rest of sections. Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Suggested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2025-11-12 08:30:03 +01:00
Bagas Sanjaya	c08b786b82	Documentation: xfrm_sysctl: Trim trailing colon in section heading The sole section heading ("/proc/sys/net/core/xfrm_* Variables") has trailing colon. Trim it. Suggested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2025-11-12 08:29:36 +01:00
Bagas Sanjaya	01ad7831fb	Documentation: xfrm_sync: Trim excess section heading characters The first section "Message Structure" has excess underline, while the second and third one ("TLVS reflect the different parameters" and "Default configurations for the parameters") have trailing colon. Trim them. Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Suggested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2025-11-12 08:28:11 +01:00
Bagas Sanjaya	a397b259c1	Documentation: xfrm_sync: Properly reindent list text List texts are currently aligned at the start of column, rather than after the list marker. Reindent them. Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2025-11-12 08:28:10 +01:00
Bagas Sanjaya	840188d276	Documentation: xfrm_device: Separate hardware offload sublists Sublists of hardware offload type lists are rendered in combined paragraph due to lack of separator from their parent list. Add it. Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2025-11-12 08:28:10 +01:00
Bagas Sanjaya	340e2a7386	Documentation: xfrm_device: Use numbered list for offloading steps Format xfrm offloading steps as numbered list. Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2025-11-12 08:28:09 +01:00
Bagas Sanjaya	68ec5df1d8	Documentation: xfrm_device: Wrap iproute2 snippets in literal code block iproute2 snippets (ip x) are shown in long-running definition lists instead. Format them as literal code blocks that do the semantic job better. Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2025-11-12 08:28:09 +01:00
Saeed Mahameed	0e535824d0	devlink: Introduce switchdev_inactive eswitch mode Adds DEVLINK_ESWITCH_MODE_SWITCHDEV_INACTIVE attribute to UAPI and documentation. Before having traffic flow through an eswitch, a user may want to have the ability to block traffic towards the FDB until FDB is fully programmed and the user is ready to send traffic to it. For example: when two eswitches are present for vports in a multi-PF setup, one eswitch may take over the traffic from the other when the user chooses. Before this takeover, a user may want to first program the inactive eswitch and then once ready redirect traffic to this new eswitch. switchdev modes transition semantics: legacy->switchdev_inactive: Create switchdev mode normally, traffic not allowed to flow yet. switchdev_inactive->switchdev: Enable traffic to flow. switchdev->switchdev_inactive: Block traffic on the FDB, FDB and representors state and content are preserved. When eswitch is configured to this mode, traffic is ignored/dropped on this eswitch FDB, while current configuration is kept, e.g. FDB rules and netdev representors are kept available, FDB programming is allowed. Example: # start inactive switchdev devlink dev eswitch set pci/0000:08:00.1 mode switchdev_inactive # setup TC rules, representors etc .. # activate devlink dev eswitch set pci/0000:08:00.1 mode switchdev Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20251108070404.1551708-2-saeed@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-11-11 13:17:53 +01:00
Jakub Kicinski	a0c3aefb08	Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2025-11-06 (i40, ice, iavf) Mohammad Heib introduces a new devlink parameter, max_mac_per_vf, for controlling the maximum number of MAC address filters allowed by a VF. This allows administrators to control the VF behavior in a more nuanced manner. Aleksandr and Przemek add support for Receive Side Scaling of GTP to iAVF for VFs running on E800 series ice hardware. This improves performance and scalability for virtualized network functions in 5G and LTE deployments. * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: iavf: add RSS support for GTP protocol via ethtool ice: Extend PTYPE bitmap coverage for GTP encapsulated flows ice: improve TCAM priority handling for RSS profiles ice: implement GTP RSS context tracking and configuration ice: add virtchnl definitions and static data for GTP RSS ice: add flow parsing for GTP and new protocol field support i40e: support generic devlink param "max_mac_per_vf" devlink: Add new "max_mac_per_vf" generic device param ==================== Link: https://patch.msgid.link/20251106225321.1609605-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-07 19:15:36 -08:00
Eric Dumazet	416dd649f3	tcp: add net.ipv4.tcp_comp_sack_rtt_percent TCP SACK compression was added in 2018 in commit `5d9f4262b7` ("tcp: add SACK compression"). It is working great for WAN flows (with large RTT). Wifi in particular gets a significant boost _when_ ACK are suppressed. Add a new sysctl so that we can tune the very conservative 5 % value that has been used so far in this formula, so that small RTT flows can benefit from this feature. delay = min ( 5 % of RTT, 1 ms) This patch adds new tcp_comp_sack_rtt_percent sysctl to ease experiments and tuning. Given that we cap the delay to 1ms (tcp_comp_sack_delay_ns sysctl), set the default value to 33 %. Quoting Neal Cardwell ( https://lore.kernel.org/netdev/CADVnQymZ1tFnEA1Q=vtECs0=Db7zHQ8=+WCQtnhHFVbEOzjVnQ@mail.gmail.com/ ) The rationale for 33% is basically to try to facilitate pipelining, where there are always at least 3 ACKs and 3 GSO/TSO skbs per SRTT, so that the path can maintain a budget for 3 full-sized GSO/TSO skbs "in flight" at all times: + 1 skb in the qdisc waiting to be sent by the NIC next + 1 skb being sent by the NIC (being serialized by the NIC out onto the wire) + 1 skb being received and aggregated by the receiver machine's aggregation mechanism (some combination of LRO, GRO, and sack compression) Note that this is basically the same magic number (3) and the same rationales as: (a) tcp_tso_should_defer() ensuring that we defer sending data for no longer than cwnd/tcp_tso_win_divisor (where tcp_tso_win_divisor = 3), and (b) bbr_quantization_budget() ensuring that cwnd is at least 3 GSO/TSO skbs to maintain pipelining and full throughput at low RTTs Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20251106115236.3450026-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-07 18:41:44 -08:00
Mohammad Heib	2c031d4c77	i40e: support generic devlink param "max_mac_per_vf" Currently the i40e driver enforces its own internally calculated per-VF MAC filter limit, derived from the number of allocated VFs and available hardware resources. This limit is not configurable by the administrator, which makes it difficult to control how many MAC addresses each VF may use. This patch adds support for the new generic devlink runtime parameter "max_mac_per_vf" which provides administrators with a way to cap the number of MAC addresses a VF can use: - When the parameter is set to 0 (default), the driver continues to use its internally calculated limit. - When set to a non-zero value, the driver applies this value as a strict cap for VFs, overriding the internal calculation. Important notes: - The configured value is a theoretical maximum. Hardware limits may still prevent additional MAC addresses from being added, even if the parameter allows it. - Since MAC filters are a shared hardware resource across all VFs, setting a high value may cause resource contention and starve other VFs. - This change gives administrators predictable and flexible control over VF resource allocation, while still respecting hardware limitations. - Previous discussion about this change: https://lore.kernel.org/netdev/20250805134042.2604897-2-dhill@redhat.com https://lore.kernel.org/netdev/20250823094952.182181-1-mheib@redhat.com Signed-off-by: Mohammad Heib <mheib@redhat.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-11-06 12:57:31 -08:00
Mohammad Heib	9352d40c8b	devlink: Add new "max_mac_per_vf" generic device param Add a new generic device parameter to control the maximum number of MAC filters allowed per VF. For example, to limit a VF to 3 MAC addresses: $ devlink dev param set pci/0000:3b:00.0 name max_mac_per_vf \ value 3 \ cmode runtime Signed-off-by: Mohammad Heib <mheib@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2025-11-06 12:57:31 -08:00
Dong Yibo	ee61c10cd4	net: rnpgbe: Add build support for rnpgbe Add build options and doc for mucse. Initialize pci device access for MUCSE devices. Signed-off-by: Dong Yibo <dong100@mucse.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: MD Danish Anwar <danishanwar@ti.com> Link: https://patch.msgid.link/20251101013849.120565-2-dong100@mucse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-04 18:11:36 -08:00
Oleksij Rempel	e6e93fb013	ethtool: netlink: add ETHTOOL_MSG_MSE_GET and wire up PHY MSE access Introduce the userspace entry point for PHY MSE diagnostics via ethtool netlink. This exposes the core API added previously and returns both capability information and one or more snapshots. Userspace sends ETHTOOL_MSG_MSE_GET. The reply carries: - ETHTOOL_A_MSE_CAPABILITIES: scale limits and timing information - ETHTOOL_A_MSE_CHANNEL_* nests: one or more snapshots (per-channel if available, otherwise WORST, otherwise LINK) Link down returns -ENETDOWN. Changes: - YAML: add attribute sets (mse, mse-capabilities, mse-snapshot) and the mse-get operation - UAPI (generated): add ETHTOOL_A_MSE_* enums and message IDs, ETHTOOL_MSG_MSE_GET/REPLY - ethtool core: add net/ethtool/mse.c implementing the request, register genl op, and hook into ethnl dispatch - docs: document MSE_GET in ethtool-netlink.rst The include/uapi/linux/ethtool_netlink_generated.h is generated from Documentation/netlink/specs/ethtool.yaml. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20251027122801.982364-3-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-03 18:32:27 -08:00
Samiullah Khawaja	c18d4b190a	net: Extend NAPI threaded polling to allow kthread based busy polling Add a new state NAPI_STATE_THREADED_BUSY_POLL to the NAPI state enum to enable and disable threaded busy polling. When threaded busy polling is enabled for a NAPI, enable NAPI_STATE_THREADED also. When the threaded NAPI is scheduled, set NAPI_STATE_IN_BUSY_POLL to signal napi_complete_done not to rearm interrupts. Whenever NAPI_STATE_THREADED_BUSY_POLL is unset, the NAPI_STATE_IN_BUSY_POLL will be unset, napi_complete_done unsets the NAPI_STATE_SCHED_THREADED bit also, which in turn will make the kthread go to sleep. Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Martin Karsten <mkarsten@uwaterloo.ca> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Link: https://patch.msgid.link/20251028203007.575686-2-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-03 18:11:40 -08:00
Maxime Chevallier	209ff7af79	net: stmmac: rename devlink parameter ts_coarse into phc_coarse_adj The devlink param "ts_coarse" doesn't indicate that we get coarse timestamps, but rather that the PHC clock adjustments are coarse as the frequency won't be continuously adjusted. Adjust the devlink parameter name to reflect that. The Coarse terminology comes from the dwmac register naming; update the documentation to better explain what the parameter is about. With this change, the parameter can now be adjusted using: devlink dev param set <dev> name phc_coarse_adj value true cmode runtime Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20251030182454.182406-1-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-03 17:05:36 -08:00
Bagas Sanjaya	01cc760632	Documentation: ARCnet: Update obsolete contact info ARCnet docs state that inquiries on the subsystem should be emailed to Avery Pennarun <apenwarr@worldvisions.ca>, who has been in CREDITS since the beginning of kernel git history and whose email address is unreachable (bounce). The subsystem is now maintained by Michael Grzeschik since `c38f6ac74c` ("MAINTAINERS: add arcnet and take maintainership"). In addition, there used to be a dedicated ARCnet mailing list but its archive at epistolary.org has been shut down. ARCnet discussion nowadays takes place on the netdev list. The arcnet.com domain mentioned has become an AIoT (Artificial Intelligence of Things) related Typeform page and ARCnet info now resides on arcnet.cc (ARCnet Resource Center) instead. Update contact information. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20251028014451.10521-2-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 18:08:08 -07:00
Bagas Sanjaya	a7aca10c00	Documentation: netconsole: Separate literal code blocks for full and short netcat command name versions Both full and short (abbreviated) command name versions of netcat example are combined in single literal code block due to 'or::' paragraph being indented one more space than the preceding paragraph (before the short version example). Unindent it to separate the versions. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20251030075013.40418-1-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 16:54:22 -07:00
Jakub Kicinski	1a2352ad82	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.18-rc4). No conflicts, adjacent changes: drivers/net/ethernet/stmicro/stmmac/stmmac_main.c `ded9813d17` ("net: stmmac: Consider Tx VLAN offload tag length for maxSDU") `26ab9830be` ("net: stmmac: replace has_xxxx with core_type") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-31 06:46:03 -07:00
Halil Pasic	8f736087e5	net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Currently a -ENOMEM from smc_wr_alloc_link_mem() is handled by giving up and going the way of a TCP fallback. This was reasonable before the sizes of the allocations there were compile time constants and reasonably small. But now those are actually configurable. So instead of giving up, keep retrying with half of the requested size unless we dip below the old static sizes -- then give up! In terms of numbers that means we give up when it is certain that we at best would end up allocating less than 16 send WR buffers or less than 48 recv WR buffers. This is to avoid regressions due to having fewer buffers compared to the static values of the past. Please note that SMC-R is supposed to be an optimisation over TCP, and falling back to TCP is superior to establishing an SMC connection that is going to perform worse. If the memory allocation fails (and we propagate -ENOMEM), we fall back to TCP. Preserve (modulo truncation) the ratio of send/recv WR buffer counts. Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20251027224856.2970019-3-pasic@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-30 13:31:43 +01:00
Halil Pasic	aef3cdb47b	net/smc: make wr buffer count configurable Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those get replaced with lgr->max_send_wr and lgr->max_recv_wr respectively. Please note that although qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr is maintained with the default sysctl values, it can not be assumed to be generally true any more. I see no downside to that, but my confidence level is rather modest. Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20251027224856.2970019-2-pasic@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-30 13:31:43 +01:00
Ido Schimmel	d12d04d221	ipv6: icmp: Add RFC 5837 support Add the ability to append the incoming IP interface information to ICMPv6 error messages in accordance with RFC 5837 and RFC 4884. This is required for more meaningful traceroute results in unnumbered networks. The feature is disabled by default and controlled via a new sysctl ("net.ipv6.icmp.errors_extension_mask") which accepts a bitmask of ICMP extensions to append to ICMP error messages. Currently, only a single value is supported, but the interface and the implementation should be able to support more extensions, if needed. Clone the skb and copy the relevant data portions before modifying the skb as the caller of icmp6_send() still owns the skb after the function returns. This should be fine since by default ICMP error messages are rate limited to 1000 per second and no more than 1 per second per specific host. Trim or pad the packet to 128 bytes before appending the ICMP extension structure in order to be compatible with legacy applications that assume that the ICMP extension structure always starts at this offset (the minimum length specified by RFC 4884). Since commit `20e1954fe2` ("ipv6: RFC 4884 partial support for SIT/GRE tunnels") it is possible for icmp6_send() to be called with an skb that already contains ICMP extensions. This can happen when we receive an ICMPv4 message with extensions from a tunnel and translate it to an ICMPv6 message towards an IPv6 host in the overlay network. I could not find an RFC that supports this behavior, but it makes sense to not overwrite the original extensions that were appended to the packet. Therefore, avoid appending extensions if the length field in the provided ICMPv6 header is already filled. Export netdev_copy_name() using EXPORT_IPV6_MOD_GPL() to make it available to IPv6 when it is built as a module. 
Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251027082232.232571-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:28:30 -07:00
Ido Schimmel	f0e7036fc9	ipv4: icmp: Add RFC 5837 support Add the ability to append the incoming IP interface information to ICMPv4 error messages in accordance with RFC 5837 and RFC 4884. This is required for more meaningful traceroute results in unnumbered networks. The feature is disabled by default and controlled via a new sysctl ("net.ipv4.icmp_errors_extension_mask") which accepts a bitmask of ICMP extensions to append to ICMP error messages. Currently, only a single value is supported, but the interface and the implementation should be able to support more extensions, if needed. Clone the skb and copy the relevant data portions before modifying the skb as the caller of __icmp_send() still owns the skb after the function returns. This should be fine since by default ICMP error messages are rate limited to 1000 per second and no more than 1 per second per specific host. Trim or pad the packet to 128 bytes before appending the ICMP extension structure in order to be compatible with legacy applications that assume that the ICMP extension structure always starts at this offset (the minimum length specified by RFC 4884). Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251027082232.232571-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 18:28:29 -07:00
Bagas Sanjaya	a433038098	Documentation: netconsole: Remove obsolete contact people Breno Leitao has been listed in MAINTAINERS as netconsole maintainer since `7c938e438c` ("MAINTAINERS: make Breno the netconsole maintainer"), but the documentation says otherwise that bug reports should be sent to original netconsole authors. Remove obsolete contact info. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20251028132027.48102-1-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-29 17:40:19 -07:00
Maxime Chevallier	6920fa0c76	net: stmmac: Add a devlink attribute to control timestamping mode The DWMAC1000 supports 2 timestamping configurations to configure how frequency adjustments are made to the ptp_clock, as well as the reported timestamp values. There was a previous attempt at upstreaming support for configuring this mode by Olivier Dautricourt and Julien Beraud a few years back [1]. In a nutshell, the timestamping can be either set in fine mode or in coarse mode. In fine mode, which is the default, we use the overflow of an accumulator to trigger frequency adjustments, but by doing so we lose precision on the timestamps that are produced by the timestamping unit. The main drawback is that the sub-second increment value, used to generate timestamps, can't be set to lower than (2 / ptp_clock_freq). The "fine" qualification comes from the frequent frequency adjustments we are able to do, which is perfect for a PTP follower usecase. In Coarse mode, we don't do frequency adjustments based on an accumulator overflow. We can therefore have very fine subsecond increment values, allowing for better timestamping precision. However this mode works best when the ptp clock frequency is adjusted based on an external signal, such as a PPS input produced by a GPS clock. This mode is therefore perfect for a Grand-master usecase. Introduce a driver-specific devlink parameter "ts_coarse" to enable or disable coarse mode, keeping the "fine" mode as a default. This can then be changed with: devlink dev param set <dev> name ts_coarse value true cmode runtime The associated documentation is also added. 
[1] : https://lore.kernel.org/netdev/20200514102808.31163-1-olivier.dautricourt@orolia.com/ Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20251024070720.71174-3-maxime.chevallier@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-28 15:34:35 +01:00
Wilfred Mallawa	82cb5be6ad	net/tls: support setting the maximum payload size During a handshake, an endpoint may specify a maximum record size limit. Currently, the kernel defaults to TLS_MAX_PAYLOAD_SIZE (16KB) for the maximum record size. Meaning that, the outgoing records from the kernel can exceed a lower size negotiated during the handshake. In such a case, the TLS endpoint must send a fatal "record_overflow" alert [1], and thus the record is discarded. Upcoming Western Digital NVMe-TCP hardware controllers implement TLS support. For these devices, supporting TLS record size negotiation is necessary because the maximum TLS record size supported by the controller is less than the default 16KB currently used by the kernel. Currently, there is no way to inform the kernel of such a limit. This patch adds support to a new setsockopt() option `TLS_TX_MAX_PAYLOAD_LEN` that allows for setting the maximum plaintext fragment size. Once set, outgoing records are no larger than the size specified. This option can be used to specify the record size limit. [1] https://www.rfc-editor.org/rfc/rfc8449 Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20251022001937.20155-1-wilfred.opensource@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-27 16:13:42 -07:00
Bagas Sanjaya	9ff8609265	net: rmnet: Use section heading markup for packet format subsections Format subsections of "Packet format" section as reST subsections. Link: https://lore.kernel.org/linux-doc/aO_MefPIlQQrCU3j@horms.kernel.org/ Suggested-by: Simon Horman <horms@kernel.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20251022025456.19004-2-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-23 17:28:44 -07:00
Jakub Kicinski	2b7553db91	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.18-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-23 10:53:08 -07:00
Randy Dunlap	86c48f50ba	Documentation: networking: ax25: update the mailing list info. Update the mailing list subscription information for the linux-hams mailing list. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251020052716.3136773-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-21 17:41:57 -07:00
Bagas Sanjaya	97aa8ecb57	net: 6pack: Demote "How to turn on 6pack support" section heading "How to turn on 6pack support" is a subsection of "Building and installing the 6pack driver". Yet, the former is in the same heading level as the latter as sections, making it listed in networking docs toctree. Demote it to subsection. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://patch.msgid.link/20251017064525.28836-4-bagasdotme@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-21 10:57:45 +02:00
Bagas Sanjaya	122d696c17	net: nfc: Format userspace interface subsection headings Subsection headings of "Userspace interface" are written as normal paragraphs, all-capped. Properly format them as reST section headings. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://patch.msgid.link/20251017064525.28836-3-bagasdotme@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-10-21 10:57:45 +02:00
Jesse Brandeburg	98c2f0b42e	net: docs: add missing features that can have stats While trying to figure out ethtool -I | --include-statistics, I noticed some docs got missed when implementing commit `0e9c127729` ("ethtool: add interface to read Tx hardware timestamping statistics"). Fix up the docs to match the kernel code, and while there, sort them in alphabetical order. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20251016-jk-iwl-next-2025-10-15-v2-8-ff3a390d9fc6@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-20 18:31:25 -07:00
Bagas Sanjaya	cb74f8c952	Documentation: net: net_failover: Separate cloud-ifupdown-helper and reattach-vf.sh code blocks marker cloud-ifupdown-helper patch and reattach-vf.sh script are rendered in htmldocs output as normal paragraphs instead of literal code blocks due to missing separator from respective code block marker. Add it. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251016093936.29442-2-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-17 16:28:29 -07:00
Kuniyuki Iwashima	1c17f4373d	ipv6: Move ipv6_fl_list from ipv6_pinfo to inet_sock. In {tcp6,udp6,raw6}_sock, struct ipv6_pinfo is always placed at the beginning of a new cache line because 1. __alignof__(struct tcp_sock) is 64 due to ____cacheline_aligned of __cacheline_group_begin(tcp_sock_write_tx) 2. __alignof__(struct udp_sock) is 64 due to ____cacheline_aligned of struct numa_drop_counters 3. in raw6_sock, struct numa_drop_counters is placed before struct ipv6_pinfo. struct ipv6_pinfo is 136 bytes, but the last cache line is only used by ipv6_fl_list: $ pahole -C ipv6_pinfo vmlinux struct ipv6_pinfo { ... /* --- cacheline 2 boundary (128 bytes) --- */ struct ipv6_fl_socklist * ipv6_fl_list; /* 128 8 */ /* size: 136, cachelines: 3, members: 23 */ Let's move ipv6_fl_list from struct ipv6_pinfo to struct inet_sock to save a full cache line for {tcp6,udp6,raw6}_sock. Now, struct ipv6_pinfo is 128 bytes, and {tcp6,udp6,raw6}_sock have 64 bytes less, while {tcp,udp,raw}_sock retain the same size. Before: # grep -E "^(RAW|UDP[^L\-]|TCP)" /proc/slabinfo | awk '{print $1, "\t", $4}' RAWv6 1408 UDPv6 1472 TCPv6 2560 RAW 1152 UDP 1280 TCP 2368 After: # grep -E "^(RAW|UDP[^L\-]|TCP)" /proc/slabinfo | awk '{print $1, "\t", $4}' RAWv6 1344 UDPv6 1408 TCPv6 2496 RAW 1152 UDP 1280 TCP 2368 Also, ipv6_fl_list and inet_flags (SNDFLOW bit) are placed in the same cache line. $ pahole -C inet_sock vmlinux ... /* --- cacheline 11 boundary (704 bytes) was 56 bytes ago --- */ struct ipv6_pinfo * pinet6; /* 760 8 */ /* --- cacheline 12 boundary (768 bytes) --- */ struct ipv6_fl_socklist * ipv6_fl_list; /* 768 8 */ unsigned long inet_flags; /* 776 8 */ Doc churn is due to the insufficient Type column (only 1 space short). Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251014224210.2964778-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-17 16:06:52 -07:00
Bagas Sanjaya	1b0124ad50	net: rmnet: Fix checksum offload header v5 and aggregation packet formatting Packet format for checksum offload header v5 and aggregation, and header type table for the former, are shown in normal paragraphs instead. Use appropriate markup. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251015092540.32282-2-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-16 15:50:52 -07:00
Jakub Kicinski	5e655aadda	linux-can-fixes-for-6.18-20251014 Merge tag 'linux-can-fixes-for-6.18-20251014' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2025-10-14 The first 2 patches are by Celeste Liu and target the gs_usb driver. The first patch removes the limitation to 3 CAN interfaces per USB device. The second patch adds the missing population of net_device->dev_port. The next 4 patches are by me and fix the m_can driver. They add a missing pm_runtime_disable(), fix the CAN state transition back to Error Active and fix the state after ifup and suspend/resume. Another patch by me targets the m_can driver, too and replaces Dong Aisheng's old email address. The next 2 patches are by Vincent Mailhol and update the CAN networking Documentation. Tetsuo Handa contributes the last patch that adds missing cleanup calls in the NETDEV_UNREGISTER notification handler.
* tag 'linux-can-fixes-for-6.18-20251014' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: j1939: add missing calls in NETDEV_UNREGISTER notification handler
  can: add Transmitter Delay Compensation (TDC) documentation
  can: remove false statement about 1:1 mapping between DLC and length
  can: m_can: replace Dong Aisheng's old email address
  can: m_can: fix CAN state in system PM
  can: m_can: m_can_chip_config(): bring up interface in correct state
  can: m_can: m_can_handle_state_errors(): fix CAN state transition to Error Active
  can: m_can: m_can_plat_remove(): add missing pm_runtime_disable()
  can: gs_usb: gs_make_candev(): populate net_device->dev_port
  can: gs_usb: increase max interface to U8_MAX
==================== Link: https://patch.msgid.link/20251014122140.990472-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-10-15 17:56:20 -07:00
Vincent Mailhol	b5746b3e8e	can: add Transmitter Delay Compensation (TDC) documentation Back in 2021, support for CAN TDC was added to the kernel in series [1] and in iproute2 in series [2]. However, the documentation was never updated. Add a new sub-section under CAN-FD driver support to document how to configure the TDC using the "ip tool". [1] add the netlink interface for CAN-FD Transmitter Delay Compensation (TDC) Link: https://lore.kernel.org/all/20210918095637.20108-1-mailhol.vincent@wanadoo.fr/ [2] iplink_can: cleaning, fixes and adding TDC support Link: https://lore.kernel.org/all/20211103164428.692722-1-mailhol.vincent@wanadoo.fr/ Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Link: https://patch.msgid.link/20251013-can-fd-doc-v2-2-5d53bdc8f2ad@kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2025-10-13 21:26:01 +02:00
Vincent Mailhol	c282993ccd	can: remove false statement about 1:1 mapping between DLC and length The CAN-FD section of can.rst still states that there is a 1:1 mapping between the Classical CAN DLC and its length. This is only true for the DLC values up to 8. Beyond that point, the length remains at 8. For reference, the mapping between the CAN DLC and the length is given in below table [1]:
  DLC value   CBFF and CEFF   FBFF and FEFF
  [decimal]   [byte]          [byte]
  ----------------------------------------------
  0           0               0
  1           1               1
  2           2               2
  3           3               3
  4           4               4
  5           5               5
  6           6               6
  7           7               7
  8           8               8
  9           8               12
  10          8               16
  11          8               20
  12          8               24
  13          8               32
  14          8               48
  15          8               64
Remove the erroneous statement. Instead just state that the length of a Classical CAN frame ranges from 0 to 8. [1] ISO 11898-1:2024, Table 5 -- DLC: coding of the four LSB Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Link: https://patch.msgid.link/20251013-can-fd-doc-v2-1-5d53bdc8f2ad@kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2025-10-13 21:26:01 +02:00
Nicolas Dichtel	0b4b77eff5	doc: fix seg6_flowlabel path This sysctl is not per interface; it's global per netns. Fixes: `292ecd9f5a` ("doc: move seg6_flowlabel to seg6-sysctl.rst") Reported-by: Philippe Guibert <philippe.guibert@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2025-10-12 22:51:37 +01:00


3072 Commits