mirror of https://github.com/torvalds/linux.git
1252 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
9b6412e697 |
tcp: drop secpath at the same time as we currently drop dst
Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
running tests that boil down to:
- create a pair of netns
- run a basic TCP test over ipcomp6
- delete the pair of netns
The xfrm_state found on spi_byaddr was not deleted at the time we
delete the netns, because we still have a reference on it. This
lingering reference comes from a secpath (which holds a ref on the
xfrm_state), which is still attached to an skb. This skb is not
leaked, it ends up on sk_receive_queue and then gets defer-free'd by
skb_attempt_defer_free.
The problem happens when we defer freeing an skb (push it on one CPU's
defer_list), and don't flush that list before the netns is deleted. In
that case, we still have a reference on the xfrm_state that we don't
expect at this point.
We already drop the skb's dst in the TCP receive path when it's no
longer needed, so let's also drop the secpath. At this point,
tcp_filter has already called into the LSM hooks that may require the
secpath, so it should not be needed anymore. However, in some of those
places, the MPTCP extension has just been attached to the skb, so we
cannot simply drop all extensions.
Fixes:
|
|
|
|
f5da7c4518 |
tcp: adjust rcvq_space after updating scaling ratio
Since commit under Fixes we set the window clamp in accordance to newly measured rcvbuf scaling_ratio. If the scaling_ratio decreased significantly we may put ourselves in a situation where windows become smaller than rcvq_space, preventing tcp_rcv_space_adjust() from increasing rcvbuf. The significant decrease of scaling_ratio is far more likely since commit |
|
|
|
d16b344790 |
tcp: add LINUX_MIB_PAWS_OLD_ACK SNMP counter
Prior patch in the series added TCP_RFC7323_PAWS_ACK drop reason. This patch adds the corresponding SNMP counter, for folks using nstat instead of tracing for TCP diagnostics. nstat -az | grep PAWSOldAck Suggested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Tested-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250113135558.3180360-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
124c4c32e9 |
tcp: add TCP_RFC7323_PAWS_ACK drop reason
XPS can cause reorders because of the relaxed OOO
conditions for pure ACK packets.
For hosts not using RFS, what can happpen is that ACK
packets are sent on behalf of the cpu processing NIC
interrupts, selecting TX queue A for ACK packet P1.
Then a subsequent sendmsg() can run on another cpu.
TX queue selection uses the socket hash and can choose
another queue B for packets P2 (with payload).
If queue A is more congested than queue B,
the ACK packet P1 could be sent on the wire after
P2.
A linux receiver when processing P1 (after P2) currently increments
LINUX_MIB_PAWSESTABREJECTED (TcpExtPAWSEstab)
and use TCP_RFC7323_PAWS drop reason.
It might also send a DUPACK if not rate limited.
In order to better understand this pattern, this
patch adds a new drop_reason : TCP_RFC7323_PAWS_ACK.
For old ACKS like these, we no longer increment
LINUX_MIB_PAWSESTABREJECTED and no longer sends a DUPACK,
keeping credit for other more interesting DUPACK.
perf record -e skb:kfree_skb -a
perf script
...
swapper 0 [148] 27475.438637: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [208] 27475.438706: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [208] 27475.438908: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [148] 27475.439010: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [148] 27475.439214: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [208] 27475.439286: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250113135558.3180360-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
|
|
ea98b61bdd |
tcp: add drop_reason support to tcp_disordered_ack()
Following patch is adding a new drop_reason to tcp_validate_incoming(). Change tcp_disordered_ack() to not return a boolean anymore, but a drop reason. Change its name to tcp_disordered_ack_check() Refactor tcp_validate_incoming() to ease the code review of the following patch, and reduce indentation level. This patch is a refactor, with no functional change. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250113135558.3180360-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
4f4aa4aa28 |
net: fix memory leak in tcp_conn_request()
If inet_csk_reqsk_queue_hash_add() return false, tcp_conn_request() will
return without free the dst memory, which allocated in af_ops->route_req.
Here is the kmemleak stack:
unreferenced object 0xffff8881198631c0 (size 240):
comm "softirq", pid 0, jiffies 4299266571 (age 1802.392s)
hex dump (first 32 bytes):
00 10 9b 03 81 88 ff ff 80 98 da bc ff ff ff ff ................
81 55 18 bb ff ff ff ff 00 00 00 00 00 00 00 00 .U..............
backtrace:
[<ffffffffb93e8d4c>] kmem_cache_alloc+0x60c/0xa80
[<ffffffffba11b4c5>] dst_alloc+0x55/0x250
[<ffffffffba227bf6>] rt_dst_alloc+0x46/0x1d0
[<ffffffffba23050a>] __mkroute_output+0x29a/0xa50
[<ffffffffba23456b>] ip_route_output_key_hash+0x10b/0x240
[<ffffffffba2346bd>] ip_route_output_flow+0x1d/0x90
[<ffffffffba254855>] inet_csk_route_req+0x2c5/0x500
[<ffffffffba26b331>] tcp_conn_request+0x691/0x12c0
[<ffffffffba27bd08>] tcp_rcv_state_process+0x3c8/0x11b0
[<ffffffffba2965c6>] tcp_v4_do_rcv+0x156/0x3b0
[<ffffffffba299c98>] tcp_v4_rcv+0x1cf8/0x1d80
[<ffffffffba239656>] ip_protocol_deliver_rcu+0xf6/0x360
[<ffffffffba2399a6>] ip_local_deliver_finish+0xe6/0x1e0
[<ffffffffba239b8e>] ip_local_deliver+0xee/0x360
[<ffffffffba239ead>] ip_rcv+0xad/0x2f0
[<ffffffffba110943>] __netif_receive_skb_one_core+0x123/0x140
Call dst_release() to free the dst memory when
inet_csk_reqsk_queue_hash_add() return false in tcp_conn_request().
Fixes:
|
|
|
|
dbd5e2e79e |
net: tcp: Add noinline_for_tracing annotation for tcp_drop_reason()
We previously hooked the tcp_drop_reason() function using BPF to monitor TCP drop reasons. However, after upgrading our compiler from GCC 9 to GCC 11, tcp_drop_reason() is now inlined, preventing us from hooking into it. To address this, it would be beneficial to make noinline explicitly for tracing. Link: https://lore.kernel.org/netdev/CANn89iJuShCmidCi_ZkYABtmscwbVjhuDta1MS5LxV_4H9tKOA@mail.gmail.com/ Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Menglong Dong <menglong8.dong@gmail.com> Link: https://patch.msgid.link/20241024093742.87681-3-laoar.shao@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
27c80efcc2 |
tcp: fix TFO SYN_RECV to not zero retrans_stamp with retransmits out
Fix tcp_rcv_synrecv_state_fastopen() to not zero retrans_stamp if retransmits are outstanding. tcp_fastopen_synack_timer() sets retrans_stamp, so typically we'll need to zero retrans_stamp here to prevent spurious retransmits_timed_out(). The logic to zero retrans_stamp is from this 2019 commit: commit |
|
|
|
b41b4cbd96 |
tcp: fix tcp_enter_recovery() to zero retrans_stamp when it's safe
Fix tcp_enter_recovery() so that if there are no retransmits out then
we zero retrans_stamp when entering fast recovery. This is necessary
to fix two buggy behaviors.
Currently a non-zero retrans_stamp value can persist across multiple
back-to-back loss recovery episodes. This is because we generally only
clears retrans_stamp if we are completely done with loss recoveries,
and get to tcp_try_to_open() and find !tcp_any_retrans_done(sk). This
behavior causes two bugs:
(1) When a loss recovery episode (CA_Loss or CA_Recovery) is followed
immediately by a new CA_Recovery, the retrans_stamp value can persist
and can be a time before this new CA_Recovery episode starts. That
means that timestamp-based undo will be using the wrong retrans_stamp
(a value that is too old) when comparing incoming TS ecr values to
retrans_stamp to see if the current fast recovery episode can be
undone.
(2) If there is a roughly minutes-long sequence of back-to-back fast
recovery episodes, one after another (e.g. in a shallow-buffered or
policed bottleneck), where each fast recovery successfully makes
forward progress and recovers one window of sequence space (but leaves
at least one retransmit in flight at the end of the recovery),
followed by several RTOs, then the ETIMEDOUT check may be using the
wrong retrans_stamp (a value set at the start of the first fast
recovery in the sequence). This can cause a very premature ETIMEDOUT,
killing the connection prematurely.
This commit changes the code to zero retrans_stamp when entering fast
recovery, when this is known to be safe (no retransmits are out in the
network). That ensures that when starting a fast recovery episode, and
it is safe to do so, retrans_stamp is set when we send the fast
retransmit packet. That addresses both bug (1) and bug (2) by ensuring
that (if no retransmits are out when we start a fast recovery) we use
the initial fast retransmit of this fast recovery as the time value
for undo and ETIMEDOUT calculations.
This makes intuitive sense, since the start of a new fast recovery
episode (in a scenario where no lost packets are out in the network)
means that the connection has made forward progress since the last RTO
or fast recovery, and we should thus "restart the clock" used for both
undo and ETIMEDOUT logic.
Note that if when we start fast recovery there *are* retransmits out
in the network, there can still be undesirable (1)/(2) issues. For
example, after this patch we can still have the (1) and (2) problems
in cases like this:
+ round 1: sender sends flight 1
+ round 2: sender receives SACKs and enters fast recovery 1,
retransmits some packets in flight 1 and then sends some new data as
flight 2
+ round 3: sender receives some SACKs for flight 2, notes losses, and
retransmits some packets to fill the holes in flight 2
+ fast recovery has some lost retransmits in flight 1 and continues
for one or more rounds sending retransmits for flight 1 and flight 2
+ fast recovery 1 completes when snd_una reaches high_seq at end of
flight 1
+ there are still holes in the SACK scoreboard in flight 2, so we
enter fast recovery 2, but some retransmits in the flight 2 sequence
range are still in flight (retrans_out > 0), so we can't execute the
new retrans_stamp=0 added here to clear retrans_stamp
It's not yet clear how to fix these remaining (1)/(2) issues in an
efficient way without breaking undo behavior, given that retrans_stamp
is currently used for undo and ETIMEDOUT. Perhaps the optimal (but
expensive) strategy would be to set retrans_stamp to the timestamp of
the earliest outstanding retransmit when entering fast recovery. But
at least this commit makes things better.
Note that this does not change the semantics of retrans_stamp; it
simply makes retrans_stamp accurate in some cases where it was not
before:
(1) Some loss recovery, followed by an immediate entry into a fast
recovery, where there are no retransmits out when entering the fast
recovery.
(2) When a TFO server has a SYNACK retransmit that sets retrans_stamp,
and then the ACK that completes the 3-way handshake has SACK blocks
that trigger a fast recovery. In this case when entering fast recovery
we want to zero out the retrans_stamp from the TFO SYNACK retransmit,
and set the retrans_stamp based on the timestamp of the fast recovery.
We introduce a tcp_retrans_stamp_cleanup() helper, because this
two-line sequence already appears in 3 places and is about to appear
in 2 more as a result of this bug fix patch series. Once this bug fix
patches series in the net branch makes it into the net-next branch
we'll update the 3 other call sites to use the new helper.
This is a long-standing issue. The Fixes tag below is chosen to be the
oldest commit at which the patch will apply cleanly, which is from
Linux v3.5 in 2012.
Fixes:
|
|
|
|
e37ab73736 |
tcp: fix to allow timestamp undo if no retransmits were sent
Fix the TCP loss recovery undo logic in tcp_packet_delayed() so that it can trigger undo even if TSQ prevents a fast recovery episode from reaching tcp_retransmit_skb(). Geumhwan Yu <geumhwan.yu@samsung.com> recently reported that after this commit from 2019: commit |
|
|
|
5f60d5f6bb |
move asm/unaligned.h to linux/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header. auto-generated by the following: for i in `git grep -l -w asm/unaligned.h`; do sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i done for i in `git grep -l -w asm-generic/unaligned.h`; do sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i done git mv include/asm-generic/unaligned.h include/linux/unaligned.h git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h |
|
|
|
65249feb6b |
net: add support for skbs with unreadable frags
For device memory TCP, we expect the skb headers to be available in host memory for access, and we expect the skb frags to be in device memory and unaccessible to the host. We expect there to be no mixing and matching of device memory frags (unaccessible) with host memory frags (accessible) in the same skb. Add a skb->devmem flag which indicates whether the frags in this skb are device memory frags or not. __skb_fill_netmem_desc() now checks frags added to skbs for net_iov, and marks the skb as skb->devmem accordingly. Add checks through the network stack to avoid accessing the frags of devmem skbs and avoid coalescing devmem skbs with non devmem skbs. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240910171458.219195-9-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
a2cbb16039 |
tcp: Update window clamping condition
This patch is based on the discussions between Neal Cardwell and
Eric Dumazet in the link
https://lore.kernel.org/netdev/20240726204105.1466841-1-quic_subashab@quicinc.com/
It was correctly pointed out that tp->window_clamp would not be
updated in cases where net.ipv4.tcp_moderate_rcvbuf=0 or if
(copied <= tp->rcvq_space.space). While it is expected for most
setups to leave the sysctl enabled, the latter condition may
not end up hitting depending on the TCP receive queue size and
the pattern of arriving data.
The updated check should be hit only on initial MSS update from
TCP_MIN_MSS to measured MSS value and subsequently if there was
an update to a larger value.
Fixes:
|
|
|
|
05f76b2d63 |
tcp: Adjust clamping window for applications specifying SO_RCVBUF
tp->scaling_ratio is not updated based on skb->len/skb->truesize once SO_RCVBUF is set leading to the maximum window scaling to be 25% of rcvbuf after commit |
|
|
|
c166829268 |
tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
The 'Fixes' commit recently changed the behaviour of TCP by skipping the
processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
unnecessary ACK in case of simultaneous connect(). Unfortunately, that
had an impact on TFO and MPTCP.
I started to look at the impact on MPTCP, because the MPTCP CI found
some issues with the MPTCP Packetdrill tests [1]. Then Paolo Abeni
suggested me to look at the impact on TFO with "plain" TCP.
For MPTCP, when receiving the 3rd ACK of a request adding a new path
(MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
has been created when the MPTCP connection got established before with
the first path. The newly added 'goto' will then skip the processing of
the segment text (step 7) and not go through tcp_data_queue() where the
MPTCP options are validated, and some actions are triggered, e.g.
sending the MPJ 4th ACK [2] as demonstrated by the new errors when
running a packetdrill test [3] establishing a second subflow.
This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
delayed. Still, we don't want to have this behaviour as it delays the
switch to the fully established mode, and invalid MPTCP options in this
3rd ACK will not be caught any more. This modification also affects the
MPTCP + TFO feature as well, and being the reason why the selftests
started to be unstable the last few days [4].
For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
passing: if the 3rd ACK contains data, and the connection is accept()ed
before receiving them, these data would no longer be processed, and thus
not ACKed.
One last thing about MPTCP, in case of simultaneous connect(), a
fallback to TCP will be done, which seems fine:
`../common/defaults.sh`
0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
+0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
+0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
+0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
+0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
Simultaneous SYN-data crossing is also not supported by TFO, see [6].
Kuniyuki Iwashima suggested to restrict the processing to SYN+ACK only:
that's a more generic solution than the one initially proposed, and
also enough to fix the issues described above.
Later on, Eric Dumazet mentioned that an ACK should still be sent in
reaction to the second SYN+ACK that is received: not sending a DUPACK
here seems wrong and could hurt:
0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <mss 1460, sackOK, TS val 1000 ecr 0,nop,wscale 8>
+0 < S 0:0(0) win 1000 <mss 1000, sackOK, nop, nop>
+0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 3308134035 ecr 0,nop,wscale 8>
+0 < S. 0:0(0) ack 1 win 1000 <mss 1000, sackOK, nop, nop>
+0 > . 1:1(0) ack 1 <nop, nop, sack 0:1> // <== Here
So in this version, the 'goto consume' is dropped, to always send an ACK
when switching from TCP_SYN_RECV to TCP_ESTABLISHED. This ACK will be
seen as a DUPACK -- with DSACK if SACK has been negotiated -- in case of
simultaneous SYN crossing: that's what is expected here.
Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
Fixes:
|
|
|
|
23e89e8ee7 |
tcp: Don't drop SYN+ACK for simultaneous connect().
RFC 9293 states that in the case of simultaneous connect(), the connection
gets established when SYN+ACK is received. [0]
TCP Peer A TCP Peer B
1. CLOSED CLOSED
2. SYN-SENT --> <SEQ=100><CTL=SYN> ...
3. SYN-RECEIVED <-- <SEQ=300><CTL=SYN> <-- SYN-SENT
4. ... <SEQ=100><CTL=SYN> --> SYN-RECEIVED
5. SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ...
6. ESTABLISHED <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED
7. ... <SEQ=100><ACK=301><CTL=SYN,ACK> --> ESTABLISHED
However, since commit
|
|
|
|
7c8267275d |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: net/sched/act_ct.c |
|
|
|
0ec986ed7b |
tcp: fix incorrect undo caused by DSACK of TLP retransmit
Loss recovery undo_retrans bookkeeping had a long-standing bug where a DSACK from a spurious TLP retransmit packet could cause an erroneous undo of a fast recovery or RTO recovery that repaired a single really-lost packet (in a sequence range outside that of the TLP retransmit). Basically, because the loss recovery state machine didn't account for the fact that it sent a TLP retransmit, the DSACK for the TLP retransmit could erroneously be implicitly be interpreted as corresponding to the normal fast recovery or RTO recovery retransmit that plugged a real hole, thus resulting in an improper undo. For example, consider the following buggy scenario where there is a real packet loss but the congestion control response is improperly undone because of this bug: + send packets P1, P2, P3, P4 + P1 is really lost + send TLP retransmit of P4 + receive SACK for original P2, P3, P4 + enter fast recovery, fast-retransmit P1, increment undo_retrans to 1 + receive DSACK for TLP P4, decrement undo_retrans to 0, undo (bug!) + receive cumulative ACK for P1-P4 (fast retransmit plugged real hole) The fix: when we initialize undo machinery in tcp_init_undo(), if there is a TLP retransmit in flight, then increment tp->undo_retrans so that we make sure that we receive a DSACK corresponding to the TLP retransmit, as well as DSACKs for all later normal retransmits, before triggering a loss recovery undo. Note that we also have to move the line that clears tp->tlp_high_seq for RTO recovery, so that upon RTO we remember the tp->tlp_high_seq value until tcp_init_undo() and clear it only afterward. Also note that the bug dates back to the original 2013 TLP implementation, commit |
|
|
|
76ed626479 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/phy/aquantia/aquantia.h |
|
|
|
4b74726c01 |
tcp: Don't flag tcp_sk(sk)->rx_opt.saw_unknown for TCP AO.
When we process segments with TCP AO, we don't check it in
tcp_parse_options(). Thus, opt_rx->saw_unknown is set to 1,
which unconditionally triggers the BPF TCP option parser.
Let's avoid the unnecessary BPF invocation.
Fixes:
|
|
|
|
a6458ab7fd |
UPSTREAM: tcp: fix DSACK undo in fast recovery to call tcp_try_to_open()
In some production workloads we noticed that connections could
sometimes close extremely prematurely with ETIMEDOUT after
transmitting only 1 TLP and RTO retransmission (when we would normally
expect roughly tcp_retries2 = TCP_RETR2 = 15 RTOs before a connection
closes with ETIMEDOUT).
From tracing we determined that these workloads can suffer from a
scenario where in fast recovery, after some retransmits, a DSACK undo
can happen at a point where the scoreboard is totally clear (we have
retrans_out == sacked_out == lost_out == 0). In such cases, calling
tcp_try_keep_open() means that we do not execute any code path that
clears tp->retrans_stamp to 0. That means that tp->retrans_stamp can
remain erroneously set to the start time of the undone fast recovery,
even after the fast recovery is undone. If minutes or hours elapse,
and then a TLP/RTO/RTO sequence occurs, then the start_ts value in
retransmits_timed_out() (which is from tp->retrans_stamp) will be
erroneously ancient (left over from the fast recovery undone via
DSACKs). Thus this ancient tp->retrans_stamp value can cause the
connection to die very prematurely with ETIMEDOUT via
tcp_write_err().
The fix: we change DSACK undo in fast recovery (TCP_CA_Recovery) to
call tcp_try_to_open() instead of tcp_try_keep_open(). This ensures
that if no retransmits are in flight at the time of DSACK undo in fast
recovery then we properly zero retrans_stamp. Note that calling
tcp_try_to_open() is more consistent with other loss recovery
behavior, since normal fast recovery (CA_Recovery) and RTO recovery
(CA_Loss) both normally end when tp->snd_una meets or exceeds
tp->high_seq and then in tcp_fastretrans_alert() the "default" switch
case executes tcp_try_to_open(). Also note that by inspection this
change to call tcp_try_to_open() implies at least one other nice bug
fix, where now an ECE-marked DSACK that causes an undo will properly
invoke tcp_enter_cwr() rather than ignoring the ECE mark.
Fixes:
|
|
|
|
193b9b2002 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: |
|
|
|
5dfe9d2739 |
tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed TFO
Testing determined that the recent commit |
|
|
|
ff46e3b442 |
Fix race for duplicate reqsk on identical SYN
When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
SYN packets are received at the same time and processed on different CPUs,
it can potentially create the same sk (sock) but two different reqsk
(request_sock) in tcp_conn_request().
These two different reqsk will respond with two SYNACK packets, and since
the generation of the seq (ISN) incorporates a timestamp, the final two
SYNACK packets will have different seq values.
The consequence is that when the Client receives and replies with an ACK
to the earlier SYNACK packet, we will reset(RST) it.
========================================================================
This behavior is consistently reproducible in my local setup,
which comprises:
| NETA1 ------ NETB1 |
PC_A --- bond --- | | --- bond --- PC_B
| NETA2 ------ NETB2 |
- PC_A is the Server and has two network cards, NETA1 and NETA2. I have
bonded these two cards using BOND_MODE_BROADCAST mode and configured
them to be handled by different CPU.
- PC_B is the Client, also equipped with two network cards, NETB1 and
NETB2, which are also bonded and configured in BOND_MODE_BROADCAST mode.
If the client attempts a TCP connection to the server, it might encounter
a failure. Capturing packets from the server side reveals:
10.10.10.10.45182 > localhost: Flags [S], seq 320236027,
10.10.10.10.45182 > localhost: Flags [S], seq 320236027,
localhost > 10.10.10.10.45182: Flags [S.], seq 2967855116,
localhost > 10.10.10.10.45182: Flags [S.], seq 2967855123, <==
10.10.10.10.45182 > localhost: Flags [.], ack 4294967290,
10.10.10.10.45182 > localhost: Flags [.], ack 4294967290,
localhost > 10.10.10.10.45182: Flags [R], seq 2967855117, <==
localhost > 10.10.10.10.45182: Flags [R], seq 2967855117,
Two SYNACKs with different seq numbers are sent by localhost,
resulting in an anomaly.
========================================================================
The attempted solution is as follows:
Add a return value to inet_csk_reqsk_queue_hash_add() to confirm if the
ehash insertion is successful (Up to now, the reason for unsuccessful
insertion is that a reqsk for the same connection has already been
inserted). If the insertion fails, release the reqsk.
Due to the refcnt, Kuniyuki suggests also adding a return value check
for the DCCP module; if ehash insertion fails, indicating a successful
insertion of the same connection, simply release the reqsk as well.
Simultaneously, In the reqsk_queue_hash_req(), the start of the
req->rsk_timer is adjusted to be after successful insertion.
Fixes:
|
|
|
|
a6ec08beec |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt.c |
|
|
|
46a02aa357 |
tcp: use sk_skb_reason_drop to free rx packets
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving socket to the tracepoint. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202406011539.jhwBd7DX-lkp@intel.com/ Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
9e046bb111 |
tcp: clear tp->retrans_stamp in tcp_rcv_fastopen_synack()
Some applications were reporting ETIMEDOUT errors on apparently
good looking flows, according to packet dumps.
We were able to root cause the issue to an accidental setting
of tp->retrans_stamp in the following scenario:
- client sends TFO SYN with data.
- server has TFO disabled, ACKs only SYN but not payload.
- client receives SYNACK covering only SYN.
- tcp_ack() eats SYN and sets tp->retrans_stamp to 0.
- tcp_rcv_fastopen_synack() calls tcp_xmit_retransmit_queue()
to retransmit TFO payload w/o SYN, sets tp->retrans_stamp to "now",
but we are not in any loss recovery state.
- TFO payload is ACKed.
- we are not in any loss recovery state, and don't see any dupacks,
so we don't get to any code path that clears tp->retrans_stamp.
- tp->retrans_stamp stays non-zero for the lifetime of the connection.
- after first RTO, tcp_clamp_rto_to_user_timeout() clamps second RTO
to 1 jiffy due to bogus tp->retrans_stamp.
- on clamped RTO with non-zero icsk_retransmits, retransmits_timed_out()
sets start_ts from tp->retrans_stamp from TFO payload retransmit
hours/days ago, and computes bogus long elapsed time for loss recovery,
and suffers ETIMEDOUT early.
Fixes:
|
|
|
|
96be3dcd01 |
net/tcp: Add tcp-md5 and tcp-ao tracepoints
Instead of forcing userspace to parse dmesg (that's what currently is happening, at least in codebase of my current company), provide a better way, that can be enabled/disabled in runtime. Currently, there are already tcp events, add hashing related ones there, too. Rasdaemon currently exercises net_dev_xmit_timeout, devlink_health_report, but it'll be trivial to teach it to deal with failed hashes. Otherwise, BGP may trace/log them itself. Especially exciting for possible investigations is key rotation (RNext_key requests). Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
adbe695a97 |
tcp: move inet_reqsk_alloc() close to inet_reqsk_clone()
inet_reqsk_alloc() does not belong to tcp_input.c, move it to inet_connection_sock.c instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
0711153018 |
tcp: wrap mptcp and decrypted checks into tcp_skb_can_collapse_rx()
tcp_skb_can_collapse() checks for conditions which don't make sense on input. Because of this we ended up sprinkling a few pairs of mptcp_skb_can_collapse() and skb_cmp_decrypted() calls on the input path. Group them in a new helper. This should make it less likely that someone will check mptcp and not decrypted or vice versa when adding new code. This implicitly adds a decrypted check early in tcp_collapse(). AFAIU this will very slightly increase our ability to collapse packets under memory pressure, not a real bug. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
5e514f1cba |
tcp: add tcp_done_with_error() helper
tcp_reset() ends with a sequence that is carefuly ordered. We need to fix [e]poll bugs in the following patches, it makes sense to use a common helper. Suggested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20240528125253.1966136-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
6e62702feb |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZkGcZAAKCRDbK58LschI g6o6APwLsqhrM2w71VUN5ciCxu4H5VDtZp6wkdqtVbxxU4qNxQEApKgYgKt8ZLF3 Kily5c7m+S4ZXhMX21rb8JhSAz0dfQk= =5Dk7 -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-05-13 We've added 119 non-merge commits during the last 14 day(s) which contain a total of 134 files changed, 9462 insertions(+), 4742 deletions(-). The main changes are: 1) Add BPF JIT support for 32-bit ARCv2 processors, from Shahab Vahedi. 2) Add BPF range computation improvements to the verifier in particular around XOR and OR operators, refactoring of checks for range computation and relaxing MUL range computation so that src_reg can also be an unknown scalar, from Cupertino Miranda. 3) Add support to attach kprobe BPF programs through kprobe_multi link in a session mode, meaning, a BPF program is attached to both function entry and return, the entry program can decide if the return program gets executed and the entry program can share u64 cookie value with return program. Session mode is a common use-case for tetragon and bpftrace, from Jiri Olsa. 4) Fix a potential overflow in libbpf's ring__consume_n() and improve libbpf as well as BPF selftest's struct_ops handling, from Andrii Nakryiko. 5) Improvements to BPF selftests in context of BPF gcc backend, from Jose E. Marchesi & David Faust. 6) Migrate remaining BPF selftest tests from test_sock_addr.c to prog_test- -style in order to retire the old test, run it in BPF CI and additionally expand test coverage, from Jordan Rife. 7) Big batch for BPF selftest refactoring in order to remove duplicate code around common network helpers, from Geliang Tang. 8) Another batch of improvements to BPF selftests to retire obsolete bpf_tcp_helpers.h as everything is available vmlinux.h, from Martin KaFai Lau. 9) Fix BPF map tear-down to not walk the map twice on free when both timer and wq is used, from Benjamin Tissoires. 10) Fix BPF verifier assumptions about socket->sk that it can be non-NULL, from Alexei Starovoitov. 11) Change BTF build scripts to using --btf_features for pahole v1.26+, from Alan Maguire. 12) Small improvements to BPF reusing struct_size() and krealloc_array(), from Andy Shevchenko. 13) Fix s390 JIT to emit a barrier for BPF_FETCH instructions, from Ilya Leoshkevich. 14) Extend TCP ->cong_control() callback in order to feed in ack and flag parameters and allow write-access to tp->snd_cwnd_stamp from BPF program, from Miao Xu. 15) Add support for internal-only per-CPU instructions to inline bpf_get_smp_processor_id() helper call for arm64 and riscv64 BPF JITs, from Puranjay Mohan. 16) Follow-up to remove the redundant ethtool.h from tooling infrastructure, from Tushar Vyavahare. 17) Extend libbpf to support "module:<function>" syntax for tracing programs, from Viktor Malik. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits) bpf: make list_for_each_entry portable bpf: ignore expected GCC warning in test_global_func10.c bpf: disable strict aliasing in test_global_func9.c selftests/bpf: Free strdup memory in xdp_hw_metadata selftests/bpf: Fix a few tests for GCC related warnings. bpf: avoid gcc overflow warning in test_xdp_vlan.c tools: remove redundant ethtool.h from tooling infra selftests/bpf: Expand ATTACH_REJECT tests selftests/bpf: Expand getsockname and getpeername tests sefltests/bpf: Expand sockaddr hook deny tests selftests/bpf: Expand sockaddr program return value tests selftests/bpf: Retire test_sock_addr.(c|sh) selftests/bpf: Remove redundant sendmsg test cases selftests/bpf: Migrate ATTACH_REJECT test cases selftests/bpf: Migrate expected_attach_type tests selftests/bpf: Migrate wildcard destination rewrite test selftests/bpf: Migrate sendmsg6 v4 mapped address tests selftests/bpf: Migrate sendmsg deny test cases selftests/bpf: Migrate WILDCARD_IP test selftests/bpf: Handle SYSCALL_EPERM and SYSCALL_ENOTSUPP test cases ... ==================== Link: https://lore.kernel.org/r/20240513134114.17575-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
e7073830cc |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c |
|
|
|
94062790ae |
tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets
TCP_SYN_RECV state is really special, it is only used by
cross-syn connections, mostly used by fuzzers.
In the following crash [1], syzbot managed to trigger a divide
by zero in tcp_rcv_space_adjust()
A socket makes the following state transitions,
without ever calling tcp_init_transfer(),
meaning tcp_init_buffer_space() is also not called.
TCP_CLOSE
connect()
TCP_SYN_SENT
TCP_SYN_RECV
shutdown() -> tcp_shutdown(sk, SEND_SHUTDOWN)
TCP_FIN_WAIT1
To fix this issue, change tcp_shutdown() to not
perform a TCP_SYN_RECV -> TCP_FIN_WAIT1 transition,
which makes no sense anyway.
When tcp_rcv_state_process() later changes socket state
from TCP_SYN_RECV to TCP_ESTABLISH, then look at
sk->sk_shutdown to finally enter TCP_FIN_WAIT1 state,
and send a FIN packet from a sane socket state.
This means tcp_send_fin() can now be called from BH
context, and must use GFP_ATOMIC allocations.
[1]
divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 1 PID: 5084 Comm: syz-executor358 Not tainted 6.9.0-rc6-syzkaller-00022-g98369dccd2f8 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
RIP: 0010:tcp_rcv_space_adjust+0x2df/0x890 net/ipv4/tcp_input.c:767
Code: e3 04 4c 01 eb 48 8b 44 24 38 0f b6 04 10 84 c0 49 89 d5 0f 85 a5 03 00 00 41 8b 8e c8 09 00 00 89 e8 29 c8 48 0f af c3 31 d2 <48> f7 f1 48 8d 1c 43 49 8d 96 76 08 00 00 48 89 d0 48 c1 e8 03 48
RSP: 0018:ffffc900031ef3f0 EFLAGS: 00010246
RAX: 0c677a10441f8f42 RBX: 000000004fb95e7e RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000027d4b11f R08: ffffffff89e535a4 R09: 1ffffffff25e6ab7
R10: dffffc0000000000 R11: ffffffff8135e920 R12: ffff88802a9f8d30
R13: dffffc0000000000 R14: ffff88802a9f8d00 R15: 1ffff1100553f2da
FS: 00005555775c0380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1155bf2304 CR3: 000000002b9f2000 CR4: 0000000000350ef0
Call Trace:
<TASK>
tcp_recvmsg_locked+0x106d/0x25a0 net/ipv4/tcp.c:2513
tcp_recvmsg+0x25d/0x920 net/ipv4/tcp.c:2578
inet6_recvmsg+0x16a/0x730 net/ipv6/af_inet6.c:680
sock_recvmsg_nosec net/socket.c:1046 [inline]
sock_recvmsg+0x109/0x280 net/socket.c:1068
____sys_recvmsg+0x1db/0x470 net/socket.c:2803
___sys_recvmsg net/socket.c:2845 [inline]
do_recvmmsg+0x474/0xae0 net/socket.c:2939
__sys_recvmmsg net/socket.c:3018 [inline]
__do_sys_recvmmsg net/socket.c:3041 [inline]
__se_sys_recvmmsg net/socket.c:3034 [inline]
__x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7faeb6363db9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffcc1997168 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007faeb6363db9
RDX: 0000000000000001 RSI: 0000000020000bc0 RDI: 0000000000000005
RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000001c
R10: 0000000000000122 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
Fixes:
|
|
|
|
57bfc7605c |
tcp: Add new args for cong_control in tcp_congestion_ops
This patch adds two new arguments for cong_control of struct tcp_congestion_ops: - ack - flag These two arguments are inherited from the caller tcp_cong_control in tcp_intput.c. One use case of them is to update cwnd and pacing rate inside cong_control based on the info they provide. For example, the flag can be used to decide if it is the right time to raise or reduce a sender's cwnd. Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Miao Xu <miaxu@meta.com> Link: https://lore.kernel.org/r/20240502042318.801932-2-miaxu@meta.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> |
|
|
|
f3d93817fb |
net: add <net/proto_memory.h>
Move some proto memory definitions out of <net/sock.h> Very few files need them, and following patch will include <net/hotdata.h> from <net/proto_memory.h> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
89de2db193 |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZi9+AAAKCRDbK58LschI g0nEAP487m7L0nLVriC2oIOWsi29tklW3etm6DO7gmGRGIHgrgEAnMyV1xBj3bGj v6jJwDcybCym1hLx+1x1JCZ4eoAFswE= =xbna -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-04-29 We've added 147 non-merge commits during the last 32 day(s) which contain a total of 158 files changed, 9400 insertions(+), 2213 deletions(-). The main changes are: 1) Add an internal-only BPF per-CPU instruction for resolving per-CPU memory addresses and implement support in x86 BPF JIT. This allows inlining per-CPU array and hashmap lookups and the bpf_get_smp_processor_id() helper, from Andrii Nakryiko. 2) Add BPF link support for sk_msg and sk_skb programs, from Yonghong Song. 3) Optimize x86 BPF JIT's emit_mov_imm64, and add support for various atomics in bpf_arena which can be JITed as a single x86 instruction, from Alexei Starovoitov. 4) Add support for passing mark with bpf_fib_lookup helper, from Anton Protopopov. 5) Add a new bpf_wq API for deferring events and refactor sleepable bpf_timer code to keep common code where possible, from Benjamin Tissoires. 6) Fix BPF_PROG_TEST_RUN infra with regards to bpf_dummy_struct_ops programs to check when NULL is passed for non-NULLable parameters, from Eduard Zingerman. 7) Harden the BPF verifier's and/or/xor value tracking, from Harishankar Vishwanathan. 8) Introduce crypto kfuncs to make BPF programs able to utilize the kernel crypto subsystem, from Vadim Fedorenko. 9) Various improvements to the BPF instruction set standardization doc, from Dave Thaler. 10) Extend libbpf APIs to partially consume items from the BPF ringbuffer, from Andrea Righi. 11) Bigger batch of BPF selftests refactoring to use common network helpers and to drop duplicate code, from Geliang Tang. 12) Support bpf_tail_call_static() helper for BPF programs with GCC 13, from Jose E. Marchesi. 13) Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF program to have code sections where preemption is disabled, from Kumar Kartikeya Dwivedi. 14) Allow invoking BPF kfuncs from BPF_PROG_TYPE_SYSCALL programs, from David Vernet. 15) Extend the BPF verifier to allow different input maps for a given bpf_for_each_map_elem() helper call in a BPF program, from Philo Lu. 16) Add support for PROBE_MEM32 and bpf_addr_space_cast instructions for riscv64 and arm64 JITs to enable BPF Arena, from Puranjay Mohan. 17) Shut up a false-positive KMSAN splat in interpreter mode by unpoison the stack memory, from Martin KaFai Lau. 18) Improve xsk selftest coverage with new tests on maximum and minimum hardware ring size configurations, from Tushar Vyavahare. 19) Various ReST man pages fixes as well as documentation and bash completion improvements for bpftool, from Rameez Rehman & Quentin Monnet. 20) Fix libbpf with regards to dumping subsequent char arrays, from Quentin Deslandes. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (147 commits) bpf, docs: Clarify PC use in instruction-set.rst bpf_helpers.h: Define bpf_tail_call_static when building with GCC bpf, docs: Add introduction for use in the ISA Internet Draft selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args selftests/bpf: dummy_st_ops should reject 0 for non-nullable params bpf: check bpf_dummy_struct_ops program params for test runs selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops selftests/bpf: adjust dummy_st_ops_success to detect additional error bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable selftests/bpf: Add ring_buffer__consume_n test. bpf: Add bpf_guard_preempt() convenience macro selftests: bpf: crypto: add benchmark for crypto functions selftests: bpf: crypto skcipher algo selftests bpf: crypto: add skcipher to bpf crypto bpf: make common crypto API for TC/XDP programs bpf: update the comment for BTF_FIELDS_MAX selftests/bpf: Fix wq test. selftests/bpf: Use make_sockaddr in test_sock_addr selftests/bpf: Use connect_to_addr in test_sock_addr ... ==================== Link: https://lore.kernel.org/r/20240429131657.19423-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
48e2cd3e3d |
bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args
Two important arguments in RTT estimation, mrtt and srtt, are passed to tcp_bpf_rtt(), so that bpf programs get more information about RTT computation in BPF_SOCK_OPS_RTT_CB. The difference between bpf_sock_ops->srtt_us and the srtt here is: the former is an old rtt before update, while srtt passed by tcp_bpf_rtt() is that after update. Signed-off-by: Philo Lu <lulie@linux.alibaba.com> Link: https://lore.kernel.org/r/20240425161724.73707-2-lulie@linux.alibaba.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> |
|
|
|
2bd99aef1b |
tcp: accept bare FIN packets under memory pressure
Andrew Oates reported that some macOS hosts could repeatedly send FIN packets even if the remote peer drops them and send back DUP ACK RWIN 0 packets. <quoting Andrew> 20:27:16.968254 gif0 In IP macos > victim: Flags [SEW], seq 1950399762, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 501897188 ecr 0,sackOK,eol], length 0 20:27:16.968339 gif0 Out IP victim > macos: Flags [S.E], seq 2995489058, ack 1950399763, win 1448, options [mss 1460,sackOK,TS val 3829877593 ecr 501897188,nop,wscale 0], length 0 20:27:16.968833 gif0 In IP macos > victim: Flags [.], ack 1, win 2058, options [nop,nop,TS val 501897188 ecr 3829877593], length 0 20:27:16.968885 gif0 In IP macos > victim: Flags [P.], seq 1:1449, ack 1, win 2058, options [nop,nop,TS val 501897188 ecr 3829877593], length 1448 20:27:16.968896 gif0 Out IP victim > macos: Flags [.], ack 1449, win 0, options [nop,nop,TS val 3829877593 ecr 501897188], length 0 20:27:19.454593 gif0 In IP macos > victim: Flags [F.], seq 1449, ack 1, win 2058, options [nop,nop,TS val 501899674 ecr 3829877593], length 0 20:27:19.454675 gif0 Out IP victim > macos: Flags [.], ack 1449, win 0, options [nop,nop,TS val 3829880079 ecr 501899674], length 0 20:27:19.455116 gif0 In IP macos > victim: Flags [F.], seq 1449, ack 1, win 2058, options [nop,nop,TS val 501899674 ecr 3829880079], length 0 The retransmits/dup-ACKs then repeat in a tight loop. </quoting Andrew> RFC 9293 3.4. Sequence Numbers states : Note that when the receive window is zero no segments should be acceptable except ACK segments. Thus, it is be possible for a TCP to maintain a zero receive window while transmitting data and receiving ACKs. However, even when the receive window is zero, a TCP must process the RST and URG fields of all incoming segments. Even if we could consider a bare FIN.ACK packet to be an ACK in RFC terms, the retransmits should use exponential backoff. Accepting the FIN in linux does not add extra memory costs, because the FIN flag will simply be merged to the tail skb in the receive queue, and incoming packet is freed. Reported-by: Andrew Oates <aoates@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Christoph Paasch <cpaasch@apple.com> Cc: Vidhi Goel <vidhi_goel@apple.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
41eecbd712 |
tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field
TCP can transform a TIMEWAIT socket into a SYN_RECV one from
a SYN packet, and the ISN of the SYNACK packet is normally
generated using TIMEWAIT tw_snd_nxt :
tcp_timewait_state_process()
...
u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
if (isn == 0)
isn++;
TCP_SKB_CB(skb)->tcp_tw_isn = isn;
return TCP_TW_SYN;
This SYN packet also bypasses normal checks against listen queue
being full or not.
tcp_conn_request()
...
__u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
...
/* TW buckets are converted to open requests without
* limitations, they conserve resources and peer is
* evidently real one.
*/
if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
if (!want_cookie)
goto drop;
}
This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb.
Unfortunately this field has been accidentally cleared
after the call to tcp_timewait_state_process() returning
TCP_TW_SYN.
Using a field in TCP_SKB_CB(skb) for a temporary state
is overkill.
Switch instead to a per-cpu variable.
As a bonus, we do not have to clear tcp_tw_isn in TCP receive
fast path.
It is temporarily set then cleared only in the TCP_TW_SYN dance.
Fixes:
|
|
|
|
b9e8104058 |
tcp: propagate tcp_tw_isn via an extra parameter to ->route_req()
tcp_v6_init_req() reads TCP_SKB_CB(skb)->tcp_tw_isn to find out if the request socket is created by a SYN hitting a TIMEWAIT socket. This has been buggy for a decade, lets directly pass the information from tcp_conn_request(). This is a preparatory patch to make the following one easier to review. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
9f06f87fef |
net: skbuff: generalize the skb->decrypted bit
The ->decrypted bit can be reused for other crypto protocols. Remove the direct dependency on TLS, add helpers to clean up the ifdefs leaking out everywhere. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
f410cbea9f |
tcp: annotate data-races around tp->window_clamp
tp->window_clamp can be read locklessly, add READ_ONCE() and WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20240404114231.2195171-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
58169ec9c4 |
inet: preserve const qualifier in inet_csk()
We can change inet_csk() to propagate its argument const qualifier, thanks to container_of_const(). We have to fix few places that had mistakes, like tcp_bound_rto(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240329144931.295800-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
7d6ed9afde |
tcp: add dropreasons in tcp_rcv_state_process()
In this patch, I equipped this function with more dropreasons, but it still doesn't work yet, which I will do later. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
e615e3a24e |
tcp: add more specific possible drop reasons in tcp_rcv_synsent_state_process()
This patch does two things: 1) add two more new reasons 2) only change the return value(1) to various drop reason values for the future use For now, we still cannot trace those two reasons. We'll implement the full function in the subsequent patch in this series. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
21bd52ea38 |
tcp: Spelling s/curcuit/circuit/
Fix a misspelling of "circuit". Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
d25f32722f |
tcp: no need to use acceptable for conn_request
Since tcp_conn_request() always returns zero, there is no need to keep the dead code. Remove it then. Link: https://lore.kernel.org/netdev/CANn89iJwx9b2dUGUKFSV3PF=kN5o+kxz3A_fHZZsOS4AnXhBNw@mail.gmail.com/ Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240213131205.4309-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
795a7dfbc3 |
net: tcp: accept old ack during closing
For now, the packet with an old ack is not accepted if we are in
FIN_WAIT1 state, which can cause retransmission. Taking the following
case as an example:
Client Server
| |
FIN_WAIT1(Send FIN, seq=10) FIN_WAIT1(Send FIN, seq=20, ack=10)
| |
| Send ACK(seq=21, ack=11)
Recv ACK(seq=21, ack=11)
|
Recv FIN(seq=20, ack=10)
In the case above, simultaneous close is happening, and the FIN and ACK
packet that send from the server is out of order. Then, the FIN will be
dropped by the client, as it has an old ack. Then, the server has to
retransmit the FIN, which can cause delay if the server has set the
SO_LINGER on the socket.
Old ack is accepted in the ESTABLISHED and TIME_WAIT state, and I think
it should be better to keep the same logic.
In this commit, we accept old ack in FIN_WAIT1/FIN_WAIT2/CLOSING/LAST_ACK
states. Maybe we should limit it to FIN_WAIT1 for now?
Signed-off-by: Menglong Dong <menglong8.dong@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240126040519.1846345-1-menglong8.dong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
|
|
8f674972d6 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/intel/iavf/iavf_ethtool.c |