Convert all the legacy code directly accessing the pp fields in net_iov
to access them through @desc in net_iov.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Update all struct proto_ops connect() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.
No binary changes expected.
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Each time some payload is consumed by user space (recvmsg() and friends),
TCP calls tcp_rcv_space_adjust() to run DRS algorithm to check
if an increase of sk->sk_rcvbuf is needed.
This function is based on time sampling, and currently calls
tcp_mstamp_refresh(tp), which is a wrapper around ktime_get_ns().
ktime_get_ns() has a high cost on some platforms.
100+ cycles for rdtscp on AMD EPYC Turin for instance.
We do not have to refresh tp->tcp_mpstamp, using the last cached value
is enough. We only need to refresh it from __tcp_cleanup_rbuf()
if an ACK must be sent (this is a rare event).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251024120707.3516550-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 'if (!key && hash_location)' check in tcp_inbound_md5_hash() implies
that hash_location might be null. However, later code in the function
dereferences hash_location anyway, without checking for null first.
Fortunately, there is no real bug, since tcp_inbound_md5_hash() is
called only with non-null values of hash_location.
Therefore, remove the unnecessary and misleading null check of
hash_location. This silences a Smatch static checker warning
(https://lore.kernel.org/netdev/aPi4b6aWBbBR52P1@stanley.mountain/)
Also fix the related comment at the beginning of the function.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://patch.msgid.link/20251022221209.19716-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCaPFVxwAKCRAraS+Z+3y5
E1yDAP98nxKLBFb9auLj8AV1zUNl4lKEpOIhH7ceXaDT7ucsagEAwvSbGAN5sun/
M0+sOpKEZwF8JBoRU8L61uYPYSu7XAk=
=2K9O
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2025-10-16
We've added 6 non-merge commits during the last 1 day(s) which contain
a total of 18 files changed, 577 insertions(+), 38 deletions(-).
The main changes are:
1) Bypass the global per-protocol memory accounting either by setting
a netns sysctl or using bpf_setsockopt in a bpf program,
from Kuniyuki Iwashima.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
selftests/bpf: Add test for sk->sk_bypass_prot_mem.
bpf: Introduce SK_BPF_BYPASS_PROT_MEM.
bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE.
net: Introduce net.core.bypass_prot_mem sysctl.
net: Allow opt-out from global protocol memory accounting.
tcp: Save lock_sock() for memcg in inet_csk_accept().
====================
Link: https://patch.msgid.link/20251016204539.773707-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Make tcp-md5 use the MD5 library API (added in 6.18) instead of the
crypto_ahash API. This is much simpler and also more efficient:
- The library API just operates on struct md5_ctx. Just allocate this
struct on the stack instead of using a pool of pre-allocated
crypto_ahash and ahash_request objects.
- The library API accepts standard pointers and doesn't require
scatterlists. So, for hashing the headers just use an on-stack buffer
instead of a pool of pre-allocated kmalloc'ed scratch buffers.
- The library API never fails. Therefore, checking for MD5 hashing
errors is no longer necessary. Update tcp_v4_md5_hash_skb(),
tcp_v6_md5_hash_skb(), tcp_v4_md5_hash_hdr(), tcp_v6_md5_hash_hdr(),
tcp_md5_hash_key(), tcp_sock_af_ops::calc_md5_hash, and
tcp_request_sock_ops::calc_md5_hash to return void instead of int.
- The library API provides direct access to the MD5 code, eliminating
unnecessary overhead such as indirect function calls and scatterlist
management. Microbenchmarks of tcp_v4_md5_hash_skb() on x86_64 show a
speedup from 7518 to 7041 cycles (6% fewer) with skb->len == 1440, or
from 1020 to 678 cycles (33% fewer) with skb->len == 140.
Since tcp_sigpool_hash_skb_data() can no longer be used, add a function
tcp_md5_hash_skb_data() which is specialized to MD5. Of course, to the
extent that this duplicates any code, it's well worth it.
To preserve the existing behavior of TCP-MD5 support being disabled when
the kernel is booted with "fips=1", make tcp_md5_do_add() check
fips_enabled itself. Previously it relied on the error from
crypto_alloc_ahash("md5") being bubbled up. I don't know for sure that
this is actually needed, but this preserves the existing behavior.
Tested with bidirectional TCP-MD5, both IPv4 and IPv6, between a kernel
that includes this commit and a kernel that doesn't include this commit.
(Side note: please don't use TCP-MD5! It's cryptographically weak. But
as long as Linux supports it, it might as well be implemented properly.)
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20251014215836.115616-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_proto->memory_allocated.
Sometimes, system processes do not want that limitation. For a similar
purpose, there is SO_RESERVE_MEM for sockets under memcg.
Also, by opting out of the per-protocol accounting, sockets under memcg
can avoid paying costs for two orthogonal memory accounting mechanisms.
A microbenchmark result is in the subsequent bpf patch.
Let's allow opt-out from the per-protocol memory accounting if
sk->sk_bypass_prot_mem is true.
sk->sk_bypass_prot_mem and sk->sk_prot are placed in the same cache
line, and sk_has_account() always fetches sk->sk_prot before accessing
sk->sk_bypass_prot_mem, so there is no extra cache miss for this patch.
The following patches will set sk->sk_bypass_prot_mem to true, and
then, the per-protocol memory accounting will be skipped.
Note that this does NOT disable memcg, but rather the per-protocol one.
Another option not to use the hole in struct sock_common is create
sk_prot variants like tcp_prot_bypass, but this would complicate
SOCKMAP logic, tcp_bpf_prots etc.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://patch.msgid.link/20251014235604.3057003-3-kuniyu@google.com
Some applications uses TCP_TX_DELAY socket option after TCP flow
is established.
Some metrics need to be updated, otherwise TCP might take time to
adapt to the new (emulated) RTT.
This patch adjusts tp->srtt_us, tp->rtt_min, icsk_rto
and sk->sk_pacing_rate.
This is best effort, and for instance icsk_rto is reset
without taking backoff into account.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251013145926.833198-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some applications (like selftests/net/tcp_mmap.c) call SO_RCVLOWAT
on their listener, before accept().
This has an unfortunate effect on wscale selection in
tcp_select_initial_window() during 3WHS.
For instance, tcp_mmap was negotiating wscale 4, regardless
of tcp_rmem[2] and sysctl_rmem_max.
Do not change tp->window_clamp if it is zero
or bigger than our computed value.
Zero value is special, it allows tcp_select_initial_window()
to enable autotuning.
Note that SO_RCVLOWAT use on listener is probably not wise,
because tp->scaling_ratio has a default value, possibly wrong.
Fixes: d1361840f8 ("tcp: fix SO_RCVLOWAT and RCVBUF autotuning")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20251003184119.2526655-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tp->tcp_clean_acked is fetched in tx path when snd_una is updated.
This field thus belongs to tcp_sock_read_tx group.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_ack() writes this field, it belongs to tcp_sock_write_txrx.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Maintaining the CACHELINE_ASSERT_GROUP_SIZE() uses
for struct tcp_sock has been painful.
This had little benefit, so remove them.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-6.17-rc7).
No conflicts.
Adjacent changes:
drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
9536fbe10c ("net/mlx5e: Add PSP steering in local NIC RX")
7601a0a462 ("net/mlx5e: Add a miss level for ipsec crypto offload")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add validation points and state propagation to support PSP key
exchange inline, on TCP connections. The expectation is that
application will use some well established mechanism like TLS
handshake to establish a secure channel over the connection and
if both endpoints are PSP-capable - exchange and install PSP keys.
Because the connection can existing in PSP-unsecured and PSP-secured
state we need to make sure that there are no race conditions or
retransmission leaks.
On Tx - mark packets with the skb->decrypted bit when PSP key
is at the enqueue time. Drivers should only encrypt packets with
this bit set. This prevents retransmissions getting encrypted when
original transmission was not. Similarly to TLS, we'll use
sk->sk_validate_xmit_skb to make sure PSP skbs can't "escape"
via a PSP-unaware device without being encrypted.
On Rx - validation is done under socket lock. This moves the validation
point later than xfrm, for example. Please see the documentation patch
for more details on the flow of securing a connection, but for
the purpose of this patch what's important is that we want to
enforce the invariant that once connection is secured any skb
in the receive queue has been encrypted with PSP.
Add GRO and coalescing checks to prevent PSP authenticated data from
being combined with cleartext data, or data with non-matching PSP
state. On Rx, check skb's with psp_skb_coalesce_diff() at points
before psp_sk_rx_policy_check(). After skb's are policy checked and on
the socket receive queue, skb_cmp_decrypted() is sufficient for
checking for coalescable PSP state. On Tx, tcp_write_collapse_fence()
should be called when transitioning a socket into PSP Tx state to
prevent data sent as cleartext from being coalesced with PSP
encapsulated data.
This change only adds the validation points, for ease of review.
Subsequent change will add the ability to install keys, and flesh
the enforcement logic out
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-5-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
AccECN option may fail in various way, handle these:
- Attempt to negotiate the use of AccECN on the 1st retransmitted SYN
- From the 2nd retransmitted SYN, stop AccECN negotiation
- Remove option from SYN/ACK rexmits to handle blackholes
- If no option arrives in SYN/ACK, assume Option is not usable
- If an option arrives later, re-enabled
- If option is zeroed, disable AccECN option processing
This patch use existing padding bits in tcp_request_sock and
holes in tcp_sock without increasing the size.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-9-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Instead of sending the option in every ACK, limit sending to
those ACKs where the option is necessary:
- Handshake
- "Change-triggered ACK" + the ACK following it. The
2nd ACK is necessary to unambiguously indicate which
of the ECN byte counters in increasing. The first
ACK has two counters increasing due to the ecnfield
edge.
- ACKs with CE to allow CEP delta validations to take
advantage of the option.
- Force option to be sent every at least once per 2^22
bytes. The check is done using the bit edges of the
byte counters (avoids need for extra variables).
- AccECN option beacon to send a few times per RTT even if
nothing in the ECN state requires that. The default is 3
times per RTT, and its period can be set via
sysctl_tcp_ecn_option_beacon.
Below are the pahole outcomes before and after this patch,
in which the group size of tcp_sock_write_tx is increased
from 89 to 97 due to the new u64 accecn_opt_tstamp member:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u64 tcp_wstamp_ns; /* 2488 8 */
struct list_head tsorted_sent_queue; /* 2496 16 */
[...]
__cacheline_group_end__tcp_sock_write_tx[0]; /* 2521 0 */
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
u8 nonagle:4; /* 2521: 0 1 */
u8 rate_app_limited:1; /* 2521: 4 1 */
/* XXX 3 bits hole, try to pack */
/* Force alignment to the next boundary: */
u8 :0;
u8 received_ce_pending:4;/* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
u8 accecn_minlen:2; /* 2523: 0 1 */
u8 est_ecnfield:2; /* 2523: 2 1 */
u8 unused3:4; /* 2523: 4 1 */
[...]
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2628 0 */
[...]
/* size: 3200, cachelines: 50, members: 171 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u64 tcp_wstamp_ns; /* 2488 8 */
u64 accecn_opt_tstamp; /* 2596 8 */
struct list_head tsorted_sent_queue; /* 2504 16 */
[...]
__cacheline_group_end__tcp_sock_write_tx[0]; /* 2529 0 */
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2529 0 */
u8 nonagle:4; /* 2529: 0 1 */
u8 rate_app_limited:1; /* 2529: 4 1 */
/* XXX 3 bits hole, try to pack */
/* Force alignment to the next boundary: */
u8 :0;
u8 received_ce_pending:4;/* 2530: 0 1 */
u8 unused2:4; /* 2530: 4 1 */
u8 accecn_minlen:2; /* 2531: 0 1 */
u8 est_ecnfield:2; /* 2531: 2 1 */
u8 accecn_opt_demand:2; /* 2531: 4 1 */
u8 prev_ecnfield:2; /* 2531: 6 1 */
[...]
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2636 0 */
[...]
/* size: 3200, cachelines: 50, members: 173 */
}
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Co-developed-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-8-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The Accurate ECN allows echoing back the sum of bytes for
each IP ECN field value in the received packets using
AccECN option. This change implements AccECN option tx & rx
side processing without option send control related features
that are added by a later change.
Based on specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
(Some features of the spec will be added in the later changes
rather than in this one).
A full-length AccECN option is always attempted but if it does
not fit, the minimum length is selected based on the counters
that have changed since the last update. The AccECN option
(with 24-bit fields) often ends in odd sizes so the option
write code tries to take advantage of some nop used to pad
the other TCP options.
The delivered_ecn_bytes pairs with received_ecn_bytes similar
to how delivered_ce pairs with received_ce. In contrast to
ACE field, however, the option is not always available to update
delivered_ecn_bytes. For ACK w/o AccECN option, the delivered
bytes calculated based on the cumulative ACK+SACK information
are assigned to one of the counters using an estimation
heuristic to select the most likely ECN byte counter. Any
estimation error is corrected when the next AccECN option
arrives. It may occur that the heuristic gets too confused
when there are enough different byte counter deltas between
ACKs with the AccECN option in which case the heuristic just
gives up on updating the counters for a while.
tcp_ecn_option sysctl can be used to select option sending
mode for AccECN: TCP_ECN_OPTION_DISABLED, TCP_ECN_OPTION_MINIMUM,
and TCP_ECN_OPTION_FULL.
This patch increases the size of tcp_info struct, as there is
no existing holes for new u32 variables. Below are the pahole
outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_info {
[...]
__u32 tcpi_total_rto_time; /* 244 4 */
/* size: 248, cachelines: 4, members: 61 */
}
[AFTER THIS PATCH]
struct tcp_info {
[...]
__u32 tcpi_total_rto_time; /* 244 4 */
__u32 tcpi_received_ce; /* 248 4 */
__u32 tcpi_delivered_e1_bytes; /* 252 4 */
__u32 tcpi_delivered_e0_bytes; /* 256 4 */
__u32 tcpi_delivered_ce_bytes; /* 260 4 */
__u32 tcpi_received_e1_bytes; /* 264 4 */
__u32 tcpi_received_e0_bytes; /* 268 4 */
__u32 tcpi_received_ce_bytes; /* 272 4 */
/* size: 280, cachelines: 5, members: 68 */
}
This patch uses the existing 1-byte holes in the tcp_sock_write_txrx
group for new u8 members, but adds a 4-byte hole in tcp_sock_write_rx
group after the new u32 delivered_ecn_bytes[3] member. Therefore, the
group size of tcp_sock_write_rx is increased from 96 to 112. Below
are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u8 received_ce_pending:4; /* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
/* XXX 1 byte hole, try to pack */
[...]
u32 rcv_rtt_last_tsecr; /* 2668 4 */
[...]
__cacheline_group_end__tcp_sock_write_rx[0]; /* 2728 0 */
[...]
/* size: 3200, cachelines: 50, members: 167 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u8 received_ce_pending:4;/* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
u8 accecn_minlen:2; /* 2523: 0 1 */
u8 est_ecnfield:2; /* 2523: 2 1 */
u8 unused3:4; /* 2523: 4 1 */
[...]
u32 rcv_rtt_last_tsecr; /* 2668 4 */
u32 delivered_ecn_bytes[3];/* 2672 12 */
/* XXX 4 bytes hole, try to pack */
[...]
__cacheline_group_end__tcp_sock_write_rx[0]; /* 2744 0 */
[...]
/* size: 3200, cachelines: 50, members: 171 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-7-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
These three byte counters track IP ECN field payload byte sums for
all arriving (acceptable) packets for ECT0, ECT1, and CE. The
AccECN option (added by a later patch in the series) echoes these
counters back to sender side; therefore, it is placed within the
group of tcp_sock_write_txrx.
Below are the pahole outcomes before and after this patch, in which
the group size of tcp_sock_write_txrx is increased from 95 + 4 to
107 + 4 and an extra 4-byte hole is created but will be exploited
in later patches:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u32 delivered_ce; /* 2576 4 */
u32 received_ce; /* 2580 4 */
u32 app_limited; /* 2584 4 */
u32 rcv_wnd; /* 2588 4 */
struct tcp_options_received rx_opt; /* 2592 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2616 0 */
[...]
/* size: 3200, cachelines: 50, members: 166 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u32 delivered_ce; /* 2576 4 */
u32 received_ce; /* 2580 4 */
u32 received_ecn_bytes[3];/* 2584 12 */
u32 app_limited; /* 2596 4 */
u32 rcv_wnd; /* 2600 4 */
struct tcp_options_received rx_opt; /* 2604 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2628 0 */
/* XXX 4 bytes hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 167 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-4-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Accurate ECN negotiation parts based on the specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
Accurate ECN is negotiated using ECE, CWR and AE flags in the
TCP header. TCP falls back into using RFC3168 ECN if one of the
ends supports only RFC3168-style ECN.
The AccECN negotiation includes reflecting IP ECN field value
seen in SYN and SYNACK back using the same bits as negotiation
to allow responding to SYN CE marks and to detect ECN field
mangling. CE marks should not occur currently because SYN=1
segments are sent with Non-ECT in IP ECN field (but proposal
exists to remove this restriction).
Reflecting SYN IP ECN field in SYNACK is relatively simple.
Reflecting SYNACK IP ECN field in the final/third ACK of
the handshake is more challenging. Linux TCP code is not well
prepared for using the final/third ACK a signalling channel
which makes things somewhat complicated here.
tcp_ecn sysctl can be used to select the highest ECN variant
(Accurate ECN, ECN, No ECN) that is attemped to be negotiated and
requested for incoming connection and outgoing connection:
TCP_ECN_IN_NOECN_OUT_NOECN, TCP_ECN_IN_ECN_OUT_ECN,
TCP_ECN_IN_ECN_OUT_NOECN, TCP_ECN_IN_ACCECN_OUT_ACCECN,
TCP_ECN_IN_ACCECN_OUT_ECN, and TCP_ECN_IN_ACCECN_OUT_NOECN.
After this patch, the size of tcp_request_sock remains unchanged
and no new holes are added. Below are the pahole outcomes before
and after this patch:
[BEFORE THIS PATCH]
struct tcp_request_sock {
[...]
u32 rcv_nxt; /* 352 4 */
u8 syn_tos; /* 356 1 */
/* size: 360, cachelines: 6, members: 16 */
}
[AFTER THIS PATCH]
struct tcp_request_sock {
[...]
u32 rcv_nxt; /* 352 4 */
u8 syn_tos; /* 356 1 */
bool accecn_ok; /* 357 1 */
u8 syn_ect_snt:2; /* 358: 0 1 */
u8 syn_ect_rcv:2; /* 358: 2 1 */
u8 accecn_fail_mode:4; /* 358: 4 1 */
/* size: 360, cachelines: 6, members: 20 */
}
After this patch, the size of tcp_sock remains unchanged and no new
holes are added. Also, 4 bits of the existing 2-byte hole are exploited.
Below are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u8 dup_ack_counter:2; /* 2761: 0 1 */
u8 tlp_retrans:1; /* 2761: 2 1 */
u8 unused:5; /* 2761: 3 1 */
u8 thin_lto:1; /* 2762: 0 1 */
u8 fastopen_connect:1; /* 2762: 1 1 */
u8 fastopen_no_cookie:1; /* 2762: 2 1 */
u8 fastopen_client_fail:2; /* 2762: 3 1 */
u8 frto:1; /* 2762: 5 1 */
/* XXX 2 bits hole, try to pack */
[...]
u8 keepalive_probes; /* 2765 1 */
/* XXX 2 bytes hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 164 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u8 dup_ack_counter:2; /* 2761: 0 1 */
u8 tlp_retrans:1; /* 2761: 2 1 */
u8 syn_ect_snt:2; /* 2761: 3 1 */
u8 syn_ect_rcv:2; /* 2761: 5 1 */
u8 thin_lto:1; /* 2761: 7 1 */
u8 fastopen_connect:1; /* 2762: 0 1 */
u8 fastopen_no_cookie:1; /* 2762: 1 1 */
u8 fastopen_client_fail:2; /* 2762: 2 1 */
u8 frto:1; /* 2762: 4 1 */
/* XXX 3 bits hole, try to pack */
[...]
u8 keepalive_probes; /* 2765 1 */
u8 accecn_fail_mode:4; /* 2766: 0 1 */
/* XXX 4 bits hole, try to pack */
/* XXX 1 byte hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 166 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-3-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This change implements Accurate ECN without negotiation and
AccECN Option (that will be added by later changes). Based on
AccECN specifications:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
Accurate ECN allows feeding back the number of CE (congestion
experienced) marks accurately to the sender in contrast to
RFC3168 ECN that can only signal one marks-seen-yes/no per RTT.
Congestion control algorithms can take advantage of the accurate
ECN information to fine-tune their congestion response to avoid
drastic rate reduction when only mild congestion is encountered.
With Accurate ECN, tp->received_ce (r.cep in AccECN spec) keeps
track of how many segments have arrived with a CE mark. Accurate
ECN uses ACE field (ECE, CWR, AE) to communicate the value back
to the sender which updates tp->delivered_ce (s.cep) based on the
feedback. This signalling channel is lossy when ACE field overflow
occurs.
Conservative strategy is selected here to deal with the ACE
overflow, however, some strategies using the AccECN option later
in the overall patchset mitigate against false overflows detected.
The ACE field values on the wire are offset by
TCP_ACCECN_CEP_INIT_OFFSET. Delivered_ce/received_ce count the
real CE marks rather than forcing all downstream users to adapt
to the wire offset.
This patch uses the first 1-byte hole and the last 4-byte hole of
the tcp_sock_write_txrx for 'received_ce_pending' and 'received_ce'.
Also, the group size of tcp_sock_write_txrx is increased from
91 + 4 to 95 + 4 due to the new u32 received_ce member. Below are
the trimmed pahole outcomes before and after this patch.
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
u8 nonagle:4; /* 2521: 0 1 */
u8 rate_app_limited:1; /* 2521: 4 1 */
/* XXX 3 bits hole, try to pack */
/* XXX 2 bytes hole, try to pack */
[...]
u32 delivered_ce; /* 2576 4 */
u32 app_limited; /* 2580 4 */
u32 rcv_wnd; /* 2684 4 */
struct tcp_options_received rx_opt; /* 2688 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2612 0 */
/* XXX 4 bytes hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 161 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
u8 nonagle:4; /* 2521: 0 1 */
u8 rate_app_limited:1; /* 2521: 4 1 */
/* XXX 3 bits hole, try to pack */
/* Force alignment to the next boundary: */
u8 :0;
u8 received_ce_pending:4;/* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
/* XXX 1 byte hole, try to pack */
[...]
u32 delivered_ce; /* 2576 4 */
u32 received_ce; /* 2580 4 */
u32 app_limited; /* 2584 4 */
u32 rcv_wnd; /* 2588 4 */
struct tcp_options_received rx_opt; /* 2592 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2616 0 */
[...]
/* size: 3200, cachelines: 50, members: 164 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-2-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Now that the destruction of info/keys is delayed until the socket
destructor, it's safe to use kfree() without an RCU callback.
The socket is in TCP_CLOSE state either because it never left it,
or it's already closed and the refcounter is zero. In any way,
no one can discover it anymore, it's safe to release memory
straight away.
Similar thing was possible for twsk already.
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Link: https://patch.msgid.link/20250909-b4-tcp-ao-md5-rst-finwait2-v5-2-9ffaaaf8b236@arista.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently there are a couple of minor issues with destroying the keys
tcp_v4_destroy_sock():
1. The socket is yet in TCP bind buckets, making it reachable for
incoming segments [on another CPU core], potentially available to send
late FIN/ACK/RST replies.
2. There is at least one code path, where tcp_done() is called before
sending RST [kudos to Bob for investigation]. This is a case of
a server, that finished sending its data and just called close().
The socket is in TCP_FIN_WAIT2 and has RCV_SHUTDOWN (set by
__tcp_close())
tcp_v4_do_rcv()/tcp_v6_do_rcv()
tcp_rcv_state_process() /* LINUX_MIB_TCPABORTONDATA */
tcp_reset()
tcp_done_with_error()
tcp_done()
inet_csk_destroy_sock() /* Destroys AO/MD5 keys */
/* tcp_rcv_state_process() returns SKB_DROP_REASON_TCP_ABORT_ON_DATA */
tcp_v4_send_reset() /* Sends an unsigned RST segment */
tcpdump:
> 22:53:15.399377 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 33929, offset 0, flags [DF], proto TCP (6), length 60)
> 1.0.0.1.34567 > 1.0.0.2.49848: Flags [F.], seq 2185658590, ack 3969644355, win 502, options [nop,nop,md5 valid], length 0
> 22:53:15.399396 00:00:01:01:00:00 > 00:00:b2:1f:00:00, ethertype IPv4 (0x0800), length 86: (tos 0x0, ttl 64, id 51951, offset 0, flags [DF], proto TCP (6), length 72)
> 1.0.0.2.49848 > 1.0.0.1.34567: Flags [.], seq 3969644375, ack 2185658591, win 128, options [nop,nop,md5 valid,nop,nop,sack 1 {2185658590:2185658591}], length 0
> 22:53:16.429588 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 60: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
> 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658590, win 0, length 0
> 22:53:16.664725 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
> 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658591, win 0, options [nop,nop,md5 valid], length 0
> 22:53:17.289832 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
> 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658591, win 0, options [nop,nop,md5 valid], length 0
Note the signed RSTs later in the dump - those are sent by the server
when the fin-wait socket gets removed from hash buckets, by
the listener socket.
Instead of destroying AO/MD5 info and their keys in inet_csk_destroy_sock(),
slightly delay it until the actual socket .sk_destruct(). As shutdown'ed
socket can yet send non-data replies, they should be signed in order for
the peer to process them. Now it also matches how AO/MD5 gets destructed
for TIME-WAIT sockets (in tcp_twsk_destructor()).
This seems optimal for TCP-MD5, while for TCP-AO it seems to have an
open problem: once RST get sent and socket gets actually destructed,
there is no information on the initial sequence numbers. So, in case
this last RST gets lost in the network, the server's listener socket
won't be able to properly sign another RST. Nothing in RFC 1122
prescribes keeping any local state after non-graceful reset.
Luckily, BGP are known to use keep alive(s).
While the issue is quite minor/cosmetic, these days monitoring network
counters is a common practice and getting invalid signed segments from
a trusted BGP peer can get customers worried.
Investigated-by: Bob Gilligan <gilligan@arista.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Link: https://patch.msgid.link/20250909-b4-tcp-ao-md5-rst-finwait2-v5-1-9ffaaaf8b236@arista.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_recvmsg_dmabuf can export the following errors:
- EFAULT when linear copy fails
- ETOOSMALL when cmsg put fails
- ENODEV if one of the frags is readable
- ENOMEM on xarray failures
But they are all ignored and replaced by EFAULT in the caller
(tcp_recvmsg_locked). Expose real error to the userspace to
add more transparency on what specifically fails.
In non-devmem case (skb_copy_datagram_msg) doing `if (!copied)
copied=-EFAULT` is ok because skb_copy_datagram_msg can return only EFAULT.
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250910162429.4127997-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Small change to use tcp_eat_recv_skb() instead
of __kfree_skb(). This can help if an application
under attack has to close many sockets with unread data.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250903084720.1168904-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If the receive queue contains payload that was already
received, __tcp_close() can send an unexpected RST.
Refine the code to take tp->copied_seq into account,
as we already do in tcp recvmsg().
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250903084720.1168904-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TCP tracks the number of orphaned (SOCK_DEAD but not yet destructed)
sockets in tcp_orphan_count.
In some code that was shared with DCCP, tcp_orphan_count is referenced
via sk->sk_prot->orphan_count.
Let's reference tcp_orphan_count directly.
inet_csk_prepare_for_destroy_sock() is moved to inet_connection_sock.c
due to header dependency.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250829215641.711664-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
setsockopt(TCP_MAXSEG) writes over a field that does not need
socket lock protection anymore.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250821141901.18839-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This field is already read locklessly for listeners,
next patch will make setsockopt(TCP_MAXSEG) lockless.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250821141901.18839-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a helper tcp_sock_set_maxseg() to directly set the TCP_MAXSEG
sockopt from kernel space.
This new helper will be used in the following patch from MPTCP.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250719-net-next-mptcp-tcp_maxseg-v2-2-8c910fbc5307@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since its introduction in commit 2e910b9532 ("net: Add a function to
splice pages into an skbuff for MSG_SPLICE_PAGES"), skb_splice_from_iter()
never used the @gfp argument. Remove it and adapt callers.
No functional change intended.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250702-splice-drop-unused-v3-2-55f68b60d2b7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzkaller reported a bug [1] where sk->sk_forward_alloc can overflow.
When we send data, if an skb exists at the tail of the write queue, the
kernel will attempt to append the new data to that skb. However, the code
that checks for available space in the skb is flawed:
'''
copy = size_goal - skb->len
'''
The types of the variables involved are:
'''
copy: ssize_t (s64 on 64-bit systems)
size_goal: int
skb->len: unsigned int
'''
Due to C's type promotion rules, the signed size_goal is converted to an
unsigned int to match skb->len before the subtraction. The result is an
unsigned int.
When this unsigned int result is then assigned to the s64 copy variable,
it is zero-extended, preserving its non-negative value. Consequently, copy
is always >= 0.
Assume we are sending 2GB of data and size_goal has been adjusted to a
value smaller than skb->len. The subtraction will result in copy holding a
very large positive integer. In the subsequent logic, this large value is
used to update sk->sk_forward_alloc, which can easily cause it to overflow.
The syzkaller reproducer uses TCP_REPAIR to reliably create this
condition. However, this can also occur in real-world scenarios. The
tcp_bound_to_half_wnd() function can also reduce size_goal to a small
value. This would cause the subsequent tcp_wmem_schedule() to set
sk->sk_forward_alloc to a value close to INT_MAX. Further memory
allocation requests would then cause sk_forward_alloc to wrap around and
become negative.
[1]: https://syzkaller.appspot.com/bug?extid=de6565462ab540f50e47
Reported-by: syzbot+de6565462ab540f50e47@syzkaller.appspotmail.com
Fixes: 270a1c3de4 ("tcp: Support MSG_SPLICE_PAGES")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/20250707054112.101081-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.
Move tcp_memory_allocated into a dedicated cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.
This patch converts TCP Small Queues implementation from tasklet to BH
workqueue.
Semantically, this is an equivalent conversion and there shouldn't be any
user-visible behavior changes. While workqueue's queueing and execution
paths are a bit heavier than tasklet's, unless the work item is being queued
every packet, the difference hopefully shouldn't matter.
My experience with the networking stack is very limited and this patch
definitely needs attention from someone who actually understands networking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/aFBeJ38AS1ZF3Dq5@slm.duckdns.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that obsolete RFC3517/RFC6675 TCP loss detection has been removed,
we can remove the somewhat complex and intrusive code to maintain its
hint state: lost_skb_hint and lost_cnt_hint.
This commit makes tcp_clear_retrans_hints_partial() empty. We will
remove tcp_clear_retrans_hints_partial() and its call sites in the
next commit.
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250615001435.2390793-3-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Preserve the error code returned by sock_cmsg_send and return that on
err.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250523230524.1107879-4-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Last change to tcp_rmem[2] happened in 2012, in commit b49960a05e
("tcp: change tcp_adv_win_scale and tcp_rmem[2]")
TCP performance on WAN is mostly limited by tcp_rmem[2] for receivers.
After this series improvements, it is time to increase the default.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Augment dmabuf binding to be able to handle TX. Additional to all the RX
binding, we also create tx_vec needed for the TX path.
Provide API for sendmsg to be able to send dmabufs bound to this device:
- Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
- MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.
Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
implementation, while disabling instances where MSG_ZEROCOPY falls back
to copying.
We additionally pipe the binding down to the new
zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
instead of the traditional page netmems.
We also special case skb_frag_dma_map to return the dma-address of these
dmabuf net_iovs instead of attempting to map pages.
The TX path may release the dmabuf in a context where we cannot wait.
This happens when the user unbinds a TX dmabuf while there are still
references to its netmems in the TX path. In that case, the netmems will
be put_netmem'd from a context where we can't unmap the dmabuf, Resolve
this by making __net_devmem_dmabuf_binding_free schedule_work'd.
Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat
of the implementation came from devmem TCP RFC v1[1], which included the
TX path, but Stan did all the rebasing on top of netmem/net_iov.
Cc: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-5-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
tcp: fastopen: pass TFO child indication through getsockopt
Note that this uses up the last bit of a field in struct tcp_info
Signed-off-by: Jeremy Harris <jgh@exim.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250423124334.4916-3-jgh@exim.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp: fastopen: note that a child socket was created
This uses up the last bit in a field of tcp_sock.
Signed-off-by: Jeremy Harris <jgh@exim.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250423124334.4916-2-jgh@exim.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a tracepoint to monitor TCP send operations, enabling detailed
visibility into TCP message transmission.
Create a new tracepoint within the tcp_sendmsg_locked function,
capturing traditional fields along with size_goal, which indicates the
optimal data size for a single TCP segment. Additionally, a reference to
the struct sock sk is passed, allowing direct access for BPF programs.
The implementation is largely based on David's patch[1] and suggestions.
Link: https://lore.kernel.org/all/70168c8f-bf52-4279-b4c4-be64527aa1ac@kernel.org/ [1]
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250408-tcpsendmsg-v3-2-208b87064c28@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 0de2a5c4b8.
I forgot that a TCP socket could receive messages in its error queue.
sock_queue_err_skb() can be called without socket lock being held,
and changes sk->sk_rmem_alloc.
The fact that skbs in error queue are limited by sk->sk_rcvbuf
means that error messages can be dropped if socket receive
queues are full, which is an orthogonal issue.
In future kernels, we could use a separate sk->sk_error_mem_alloc
counter specifically for the error queue.
Fixes: 0de2a5c4b8 ("tcp: avoid atomic operations on sk->sk_rmem_alloc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250331075946.31960-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TCP uses generic skb_set_owner_r() and sock_rfree()
for received packets, with socket lock being owned.
Switch to private versions, avoiding two atomic operations
per packet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250320121604.3342831-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Support adjusting/reading delayed ack max for socket level by using
set/getsockopt().
This option aligns with TCP_BPF_DELACK_MAX usage. Considering that bpf
option was implemented before this patch, so we need to use a standalone
new option for pure tcp set/getsockopt() use.
Add WRITE_ONCE/READ_ONCE() to prevent data-race if setsockopt()
happens to write one value to icsk_delack_max while icsk_delack_max is
being read.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250317120314.41404-3-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Support adjusting/reading RTO MIN for socket level by using set/getsockopt().
This new option has the same effect as TCP_BPF_RTO_MIN, which means it
doesn't affect RTAX_RTO_MIN usage (by using ip route...). Considering that
bpf option was implemented before this patch, so we need to use a standalone
new option for pure tcp set/getsockopt() use.
When the socket is created, its icsk_rto_min is set to the default
value that is controlled by sysctl_tcp_rto_min_us. Then if application
calls setsockopt() with TCP_RTO_MIN_US flag to pass a valid value, then
icsk_rto_min will be overridden in jiffies unit.
This patch adds WRITE_ONCE/READ_ONCE to avoid data-race around
icsk_rto_min.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250317120314.41404-2-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As a followup of my presentation in Zagreb for netdev 0x19:
icsk_clean_acked is only used by TCP when/if CONFIG_TLS_DEVICE
is enabled from tcp_ack().
Rename it to tcp_clean_acked, move it to tcp_sock structure
in the tcp_sock_read_rx for better cache locality in TCP
fast path.
Define this field only when CONFIG_TLS_DEVICE is enabled
saving 8 bytes on configs not using it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250317085313.2023214-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>