mirror of https://github.com/torvalds/linux.git
7668 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
e6652a8ef3 |
net: ping6: Fix ping -6 with interface name
When passing interface parameter to ping -6:
$ ping -6 ::11:141:84:9 -I eth2
Results in:
PING ::11:141:84:10(::11:141:84:10) from ::11:141:84:9 eth2: 56 data bytes
ping: sendmsg: Invalid argument
ping: sendmsg: Invalid argument
Initialize the fl6's outgoing interface (OIF) before triggering
ip6_datagram_send_ctl. Don't wipe fl6 after ip6_datagram_send_ctl() as
changes in fl6 that may happen in the function are overwritten explicitly.
Update comment accordingly.
Fixes:
|
|
|
|
3e0b8f529c |
net/ipv6: Expand and rename accept_unsolicited_na to accept_untracked_na
RFC 9131 changes default behaviour of handling RX of NA messages when the
corresponding entry is absent in the neighbour cache. The current
implementation is limited to accept just unsolicited NAs. However, the
RFC is more generic where it also accepts solicited NAs. Both types
should result in adding a STALE entry for this case.
Expand accept_untracked_na behaviour to also accept solicited NAs to
be compliant with the RFC and rename the sysctl knob to
accept_untracked_na.
Fixes:
|
|
|
|
7e062cda7d |
Networking changes for 5.19.
Core
----
- Support TCPv6 segmentation offload with super-segments larger than
64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP).
- Generalize skb freeing deferral to per-cpu lists, instead of
per-socket lists.
- Add a netdev statistic for packets dropped due to L2 address
mismatch (rx_otherhost_dropped).
- Continue work annotating skb drop reasons.
- Accept alternative netdev names (ALT_IFNAME) in more netlink
requests.
- Add VLAN support for AF_PACKET SOCK_RAW GSO.
- Allow receiving skb mark from the socket as a cmsg.
- Enable memcg accounting for veth queues, sysctl tables and IPv6.
BPF
---
- Add libbpf support for User Statically-Defined Tracing (USDTs).
- Speed up symbol resolution for kprobes multi-link attachments.
- Support storing typed pointers to referenced and unreferenced
objects in BPF maps.
- Add support for BPF link iterator.
- Introduce access to remote CPU map elements in BPF per-cpu map.
- Allow middle-of-the-road settings for the
kernel.unprivileged_bpf_disabled sysctl.
- Implement basic types of dynamic pointers e.g. to allow for
dynamically sized ringbuf reservations without extra memory copies.
Protocols
---------
- Retire port only listening_hash table, add a second bind table
hashed by port and address. Avoid linear list walk when binding
to very popular ports (e.g. 443).
- Add bridge FDB bulk flush filtering support allowing user space
to remove all FDB entries matching a condition.
- Introduce accept_unsolicited_na sysctl for IPv6 to implement
router-side changes for RFC9131.
- Support for MPTCP path manager in user space.
- Add MPTCP support for fallback to regular TCP for connections
that have never connected additional subflows or transmitted
out-of-sequence data (partial support for RFC8684 fallback).
- Avoid races in MPTCP-level window tracking, stabilize and improve
throughput.
- Support lockless operation of GRE tunnels with seq numbers enabled.
- WiFi support for host based BSS color collision detection.
- Add support for SO_TXTIME/SCM_TXTIME on CAN sockets.
- Support transmission w/o flow control in CAN ISOTP (ISO 15765-2).
- Support zero-copy Tx with TLS 1.2 crypto offload (sendfile).
- Allow matching on the number of VLAN tags via tc-flower.
- Add tracepoint for tcp_set_ca_state().
Driver API
----------
- Improve error reporting from classifier and action offload.
- Add support for listing line cards in switches (devlink).
- Add helpers for reporting page pool statistics with ethtool -S.
- Add support for reading clock cycles when using PTP virtual clocks,
instead of having the driver convert to time before reporting.
This makes it possible to report time from different vclocks.
- Support configuring low-latency Tx descriptor push via ethtool.
- Separate Clause 22 and Clause 45 MDIO accesses more explicitly.
New hardware / drivers
----------------------
- Ethernet:
- Marvell's Octeon NIC PCI Endpoint support (octeon_ep)
- Sunplus SP7021 SoC (sp7021_emac)
- Add support for Renesas RZ/V2M (in ravb)
- Add support for MediaTek mt7986 switches (in mtk_eth_soc)
- Ethernet PHYs:
- ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting)
- TI DP83TD510 PHY
- Microchip LAN8742/LAN88xx PHYs
- WiFi:
- Driver for pureLiFi X, XL, XC devices (plfxlc)
- Driver for Silicon Labs devices (wfx)
- Support for WCN6750 (in ath11k)
- Support Realtek 8852ce devices (in rtw89)
- Mobile:
- MediaTek T700 modems (Intel 5G 5000 M.2 cards)
- CAN:
- ctucanfd: add support for CTU CAN FD open-source IP core
from Czech Technical University in Prague
Drivers
-------
- Delete a number of old drivers still using virt_to_bus().
- Ethernet NICs:
- intel: support TSO on tunnels MPLS
- broadcom: support multi-buffer XDP
- nfp: support VF rate limiting
- sfc: use hardware tx timestamps for more than PTP
- mlx5: multi-port eswitch support
- hyper-v: add support for XDP_REDIRECT
- atlantic: XDP support (including multi-buffer)
- macb: improve real-time perf by deferring Tx processing to NAPI
- High-speed Ethernet switches:
- mlxsw: implement basic line card information querying
- prestera: add support for traffic policing on ingress and egress
- Embedded Ethernet switches:
- lan966x: add support for packet DMA (FDMA)
- lan966x: add support for PTP programmable pins
- ti: cpsw_new: enable bc/mc storm prevention
- Qualcomm 802.11ax WiFi (ath11k):
- Wake-on-WLAN support for QCA6390 and WCN6855
- device recovery (firmware restart) support
- support setting Specific Absorption Rate (SAR) for WCN6855
- read country code from SMBIOS for WCN6855/QCA6390
- enable keep-alive during WoWLAN suspend
- implement remain-on-channel support
- MediaTek WiFi (mt76):
- support Wireless Ethernet Dispatch offloading packet movement
between the Ethernet switch and WiFi interfaces
- non-standard VHT MCS10-11 support
- mt7921 AP mode support
- mt7921 IPv6 NS offload support
- Ethernet PHYs:
- micrel: ksz9031/ksz9131: cabletest support
- lan87xx: SQI support for T1 PHYs
- lan937x: add interrupt support for link detection
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmKNMPQACgkQMUZtbf5S
IrsRARAAuDyYs6jFYB3p+xazZdOnbF4iAgVv71+DQGvmsCl6CB9OrsNZMlvE85OL
Q3gjcRbgjrkN4lhgI8DmiGYbsUJnAvVjFdNjccz1Z/vTLYvuIM0ol54MUp5S+9WY
StncOJkOGJxxR/Gi5gzVmejPDsysU3Jik+hm/fpIcz8pybXxAsFKU5waY5qfl+/T
TZepfV0VCfqRDjqcF1qA5+jJZNU8pdodQlZ1+mh8bwu6Jk1ZkWkj6Ov8MWdwQldr
LnPeK/9hIGzkdJYHZfajxA3t8D0K5CHzSuih2bJ9ry8ZXgVBkXEThew778/R5izW
uB0YZs9COFlrIP7XHjtRTy/2xHOdYIPlj2nWhVdfuQDX8Crvt4VRN6EZ1rjko1ZJ
WanfG6WHF8NH5pXBRQbh3kIMKBnYn6OIzuCfCQSqd+niHcxFIM4vRiggeXI5C5TW
vJgEWfK6X+NfDiFVa3xyCrEmp5ieA/pNecpwd8rVkql+MtFAAw4vfsotLKOJEAru
J/XL6UE+YuLqIJV9ACZ9x1AFXXAo661jOxBunOo4VXhXVzWS9lYYz5r5ryIkgT/8
/Fr0zjANJWgfIuNdIBtYfQ4qG+LozGq038VA06RhFUAZ5tF9DzhqJs2Q2AFuWWBC
ewCePJVqo1j2Ceq2mGonXRt47OEnlePoOxTk9W+cKZb7ZWE+zEo=
=Wjii
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core
----
- Support TCPv6 segmentation offload with super-segments larger than
64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP).
- Generalize skb freeing deferral to per-cpu lists, instead of
per-socket lists.
- Add a netdev statistic for packets dropped due to L2 address
mismatch (rx_otherhost_dropped).
- Continue work annotating skb drop reasons.
- Accept alternative netdev names (ALT_IFNAME) in more netlink
requests.
- Add VLAN support for AF_PACKET SOCK_RAW GSO.
- Allow receiving skb mark from the socket as a cmsg.
- Enable memcg accounting for veth queues, sysctl tables and IPv6.
BPF
---
- Add libbpf support for User Statically-Defined Tracing (USDTs).
- Speed up symbol resolution for kprobes multi-link attachments.
- Support storing typed pointers to referenced and unreferenced
objects in BPF maps.
- Add support for BPF link iterator.
- Introduce access to remote CPU map elements in BPF per-cpu map.
- Allow middle-of-the-road settings for the
kernel.unprivileged_bpf_disabled sysctl.
- Implement basic types of dynamic pointers e.g. to allow for
dynamically sized ringbuf reservations without extra memory copies.
Protocols
---------
- Retire port only listening_hash table, add a second bind table
hashed by port and address. Avoid linear list walk when binding to
very popular ports (e.g. 443).
- Add bridge FDB bulk flush filtering support allowing user space to
remove all FDB entries matching a condition.
- Introduce accept_unsolicited_na sysctl for IPv6 to implement
router-side changes for RFC9131.
- Support for MPTCP path manager in user space.
- Add MPTCP support for fallback to regular TCP for connections that
have never connected additional subflows or transmitted
out-of-sequence data (partial support for RFC8684 fallback).
- Avoid races in MPTCP-level window tracking, stabilize and improve
throughput.
- Support lockless operation of GRE tunnels with seq numbers enabled.
- WiFi support for host based BSS color collision detection.
- Add support for SO_TXTIME/SCM_TXTIME on CAN sockets.
- Support transmission w/o flow control in CAN ISOTP (ISO 15765-2).
- Support zero-copy Tx with TLS 1.2 crypto offload (sendfile).
- Allow matching on the number of VLAN tags via tc-flower.
- Add tracepoint for tcp_set_ca_state().
Driver API
----------
- Improve error reporting from classifier and action offload.
- Add support for listing line cards in switches (devlink).
- Add helpers for reporting page pool statistics with ethtool -S.
- Add support for reading clock cycles when using PTP virtual clocks,
instead of having the driver convert to time before reporting. This
makes it possible to report time from different vclocks.
- Support configuring low-latency Tx descriptor push via ethtool.
- Separate Clause 22 and Clause 45 MDIO accesses more explicitly.
New hardware / drivers
----------------------
- Ethernet:
- Marvell's Octeon NIC PCI Endpoint support (octeon_ep)
- Sunplus SP7021 SoC (sp7021_emac)
- Add support for Renesas RZ/V2M (in ravb)
- Add support for MediaTek mt7986 switches (in mtk_eth_soc)
- Ethernet PHYs:
- ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting)
- TI DP83TD510 PHY
- Microchip LAN8742/LAN88xx PHYs
- WiFi:
- Driver for pureLiFi X, XL, XC devices (plfxlc)
- Driver for Silicon Labs devices (wfx)
- Support for WCN6750 (in ath11k)
- Support Realtek 8852ce devices (in rtw89)
- Mobile:
- MediaTek T700 modems (Intel 5G 5000 M.2 cards)
- CAN:
- ctucanfd: add support for CTU CAN FD open-source IP core from
Czech Technical University in Prague
Drivers
-------
- Delete a number of old drivers still using virt_to_bus().
- Ethernet NICs:
- intel: support TSO on tunnels MPLS
- broadcom: support multi-buffer XDP
- nfp: support VF rate limiting
- sfc: use hardware tx timestamps for more than PTP
- mlx5: multi-port eswitch support
- hyper-v: add support for XDP_REDIRECT
- atlantic: XDP support (including multi-buffer)
- macb: improve real-time perf by deferring Tx processing to NAPI
- High-speed Ethernet switches:
- mlxsw: implement basic line card information querying
- prestera: add support for traffic policing on ingress and egress
- Embedded Ethernet switches:
- lan966x: add support for packet DMA (FDMA)
- lan966x: add support for PTP programmable pins
- ti: cpsw_new: enable bc/mc storm prevention
- Qualcomm 802.11ax WiFi (ath11k):
- Wake-on-WLAN support for QCA6390 and WCN6855
- device recovery (firmware restart) support
- support setting Specific Absorption Rate (SAR) for WCN6855
- read country code from SMBIOS for WCN6855/QCA6390
- enable keep-alive during WoWLAN suspend
- implement remain-on-channel support
- MediaTek WiFi (mt76):
- support Wireless Ethernet Dispatch offloading packet movement
between the Ethernet switch and WiFi interfaces
- non-standard VHT MCS10-11 support
- mt7921 AP mode support
- mt7921 IPv6 NS offload support
- Ethernet PHYs:
- micrel: ksz9031/ksz9131: cabletest support
- lan87xx: SQI support for T1 PHYs
- lan937x: add interrupt support for link detection"
* tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1809 commits)
ptp: ocp: Add firmware header checks
ptp: ocp: fix PPS source selector debugfs reporting
ptp: ocp: add .init function for sma_op vector
ptp: ocp: vectorize the sma accessor functions
ptp: ocp: constify selectors
ptp: ocp: parameterize input/output sma selectors
ptp: ocp: revise firmware display
ptp: ocp: add Celestica timecard PCI ids
ptp: ocp: Remove #ifdefs around PCI IDs
ptp: ocp: 32-bit fixups for pci start address
Revert "net/smc: fix listen processing for SMC-Rv2"
ath6kl: Use cc-disable-warning to disable -Wdangling-pointer
selftests/bpf: Dynptr tests
bpf: Add dynptr data slices
bpf: Add bpf_dynptr_read and bpf_dynptr_write
bpf: Dynptr support for ring buffers
bpf: Add bpf_dynptr_from_mem for local dynptrs
bpf: Add verifier support for dynptrs
bpf: Suppress 'passing zero to PTR_ERR' warning
bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack
...
|
|
|
|
ac2ab99072 |
Random number generator updates for Linux 5.19-rc1.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmKKpM8ACgkQSfxwEqXe
A6726w/+OJimGd4arvpSmdn+vxepSyDLgKfwM0x5zprRVd16xg8CjJr4eMonTesq
YvtJRqpetb53MB+sMhutlvQqQzrjtf2MBkgPwF4I2gUrk7vLD45Q+AGdGhi/rUwz
wHGA7xg1FHLHia2M/9idSqi8QlZmUP4u4l5ZnMyTUHiwvRD6XOrWKfqvUSawNzyh
hCWlTUxDrjizsW5YpsJX/MkRadSC8loJEk5ByZebow6nRPfurJvqfrcOMgHyNrbY
pOZ/CGPxcetMqotL2TuuJt5wKmenqYhIWGAp3YM2SWWgU2ueBZekW8AYeMfgUcvh
LWV93RpSuAnE5wsdjIULvjFnEDJBf8ihfMnMrd9G5QjQu44tuKWfY2MghLSpYzaR
V6UFbRmhrqhqiStHQXOvk1oqxtpbHlc9zzJLmvPmDJcbvzXQ9Opk5GVXAmdtnHnj
M/ty3wGWxucY6mHqT8MkCShSSslbgEtc1pEIWHdrUgnaiSVoCVBEO+9LqLbjvOTm
XA/6YtoiCE5FasK51pir1zVb2GORQn0v8HnuAOsusD/iPAlRQ/G5jZkaXbwRQI6j
atYL1svqvSKn5POnzqAlMUXfMUr19K5xqJdp7i6qmlO1Vq6Z+tWbCQgD1JV+Wjkb
CMyvXomFCFu4aYKGRE2SBRnWLRghG3kYHqEQ15yTPMQerxbUDNg=
=SUr3
-----END PGP SIGNATURE-----
Merge tag 'random-5.19-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
"These updates continue to refine the work began in 5.17 and 5.18 of
modernizing the RNG's crypto and streamlining and documenting its
code.
New for 5.19, the updates aim to improve entropy collection methods
and make some initial decisions regarding the "premature next" problem
and our threat model. The cloc utility now reports that random.c is
931 lines of code and 466 lines of comments, not that basic metrics
like that mean all that much, but at the very least it tells you that
this is very much a manageable driver now.
Here's a summary of the various updates:
- The random_get_entropy() function now always returns something at
least minimally useful. This is the primary entropy source in most
collectors, which in the best case expands to something like RDTSC,
but prior to this change, in the worst case it would just return 0,
contributing nothing. For 5.19, additional architectures are wired
up, and architectures that are entirely missing a cycle counter now
have a generic fallback path, which uses the highest resolution
clock available from the timekeeping subsystem.
Some of those clocks can actually be quite good, despite the CPU
not having a cycle counter of its own, and going off-core for a
stamp is generally thought to increase jitter, something positive
from the perspective of entropy gathering. Done very early on in
the development cycle, this has been sitting in next getting some
testing for a while now and has relevant acks from the archs, so it
should be pretty well tested and fine, but is nonetheless the thing
I'll be keeping my eye on most closely.
- Of particular note with the random_get_entropy() improvements is
MIPS, which, on CPUs that lack the c0 count register, will now
combine the high-speed but short-cycle c0 random register with the
lower-speed but long-cycle generic fallback path.
- With random_get_entropy() now always returning something useful,
the interrupt handler now collects entropy in a consistent
construction.
- Rather than comparing two samples of random_get_entropy() for the
jitter dance, the algorithm now tests many samples, and uses the
amount of differing ones to determine whether or not jitter entropy
is usable and how laborious it must be. The problem with comparing
only two samples was that if the cycle counter was extremely slow,
but just so happened to be on the cusp of a change, the slowness
wouldn't be detected. Taking many samples fixes that to some
degree.
This, combined with the other improvements to random_get_entropy(),
should make future unification of /dev/random and /dev/urandom
maybe more possible. At the very least, were we to attempt it again
today (we're not), it wouldn't break any of Guenter's test rigs
that broke when we tried it with 5.18. So, not today, but perhaps
down the road, that's something we can revisit.
- We attempt to reseed the RNG immediately upon waking up from system
suspend or hibernation, making use of the various timestamps about
suspend time and such available, as well as the usual inputs such
as RDRAND when available.
- Batched randomness now falls back to ordinary randomness before the
RNG is initialized. This provides more consistent guarantees to the
types of random numbers being returned by the various accessors.
- The "pre-init injection" code is now gone for good. I suspect you
in particular will be happy to read that, as I recall you
expressing your distaste for it a few months ago. Instead, to avoid
a "premature first" issue, while still allowing for maximal amount
of entropy availability during system boot, the first 128 bits of
estimated entropy are used immediately as it arrives, with the next
128 bits being buffered. And, as before, after the RNG has been
fully initialized, it winds up reseeding anyway a few seconds later
in most cases. This resulted in a pretty big simplification of the
initialization code and let us remove various ad-hoc mechanisms
like the ugly crng_pre_init_inject().
- The RNG no longer pretends to handle the "premature next" security
model, something that various academics and other RNG designs have
tried to care about in the past. After an interesting mailing list
thread, these issues are thought to be a) mainly academic and not
practical at all, and b) actively harming the real security of the
RNG by delaying new entropy additions after a potential compromise,
making a potentially bad situation even worse. As well, in the
first place, our RNG never even properly handled the premature next
issue, so removing an incomplete solution to a fake problem was
particularly nice.
This allowed for numerous other simplifications in the code, which
is a lot cleaner as a consequence. If you didn't see it before,
https://lore.kernel.org/lkml/YmlMGx6+uigkGiZ0@zx2c4.com/ may be a
thread worth skimming through.
- While the interrupt handler received a separate code path years ago
that avoids locks by using per-cpu data structures and a faster
mixing algorithm, in order to reduce interrupt latency, input and
disk events that are triggered in hardirq handlers were still
hitting locks and more expensive algorithms. Those are now
redirected to use the faster per-cpu data structures.
- Rather than having the fake-crypto almost-siphash-based random32
implementation be used right and left, and in many places where
cryptographically secure randomness is desirable, the batched
entropy code is now fast enough to replace that.
- As usual, numerous code quality and documentation cleanups. For
example, the initialization state machine now uses enum symbolic
constants instead of just hard coding numbers everywhere.
- Since the RNG initializes once, and then is always initialized
thereafter, a pretty heavy amount of code used during that
initialization is never used again. It is now completely cordoned
off using static branches and it winds up in the .text.unlikely
section so that it doesn't reduce cache compactness after the RNG
is ready.
- A variety of functions meant for waiting on the RNG to be
initialized were only used by vsprintf, and in not a particularly
optimal way. Replacing that usage with a more ordinary setup made
it possible to remove those functions.
- A cleanup of how we warn userspace about the use of uninitialized
/dev/urandom and uninitialized get_random_bytes() usage.
Interestingly, with the change you merged for 5.18 that attempts to
use jitter (but does not block if it can't), the majority of users
should never see those warnings for /dev/urandom at all now, and
the one for in-kernel usage is mainly a debug thing.
- The file_operations struct for /dev/[u]random now implements
.read_iter and .write_iter instead of .read and .write, allowing it
to also implement .splice_read and .splice_write, which makes
splice(2) work again after it was broken here (and in many other
places in the tree) during the set_fs() removal. This was a bit of
a last minute arrival from Jens that hasn't had as much time to
bake, so I'll be keeping my eye on this as well, but it seems
fairly ordinary. Unfortunately, read_iter() is around 3% slower
than read() in my tests, which I'm not thrilled about. But Jens and
Al, spurred by this observation, seem to be making progress in
removing the bottlenecks on the iter paths in the VFS layer in
general, which should remove the performance gap for all drivers.
- Assorted other bug fixes, cleanups, and optimizations.
- A small SipHash cleanup"
* tag 'random-5.19-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (49 commits)
random: check for signals after page of pool writes
random: wire up fops->splice_{read,write}_iter()
random: convert to using fops->write_iter()
random: convert to using fops->read_iter()
random: unify batched entropy implementations
random: move randomize_page() into mm where it belongs
random: remove mostly unused async readiness notifier
random: remove get_random_bytes_arch() and add rng_has_arch_random()
random: move initialization functions out of hot pages
random: make consistent use of buf and len
random: use proper return types on get_random_{int,long}_wait()
random: remove extern from functions in header
random: use static branch for crng_ready()
random: credit architectural init the exact amount
random: handle latent entropy and command line from random_init()
random: use proper jiffies comparison macro
random: remove ratelimiting for in-kernel unseeded randomness
random: move initialization out of reseeding hot path
random: avoid initializing twice in credit race
random: use symbolic constants for crng_init states
...
|
|
|
|
dc7769244e |
tcp_ipv6: set the drop_reason in the right place
Looks like the IPv6 version of the patch under Fixes was
a copy/paste of the IPv4 but hit the wrong spot.
It is tcp_v6_rcv() which uses drop_reason as a boolean, and
needs to be protected against reason == 0 before calling free.
tcp_v6_do_rcv() has a pretty straightforward flow.
The resulting warning looks like this:
WARNING: CPU: 1 PID: 0 at net/core/skbuff.c:775
Call Trace:
tcp_v6_rcv (net/ipv6/tcp_ipv6.c:1767)
ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:438)
ip6_input_finish (include/linux/rcupdate.h:726)
ip6_input (include/linux/netfilter.h:307)
Fixes:
|
|
|
|
805cb5aadc |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next, misc updates and fallout fixes from recent Florian's code rewritting (from last pull request): 1) Use new flowi4_l3mdev field in ip_route_me_harder(), from Martin Willi. 2) Avoid unnecessary GC with a timestamp in conncount, from William Tu and Yifeng Sun. 3) Remove TCP conntrack debugging, from Florian Westphal. 4) Fix compilation warning in ctnetlink, from Florian. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list" netfilter: conntrack: remove pr_debug callsites from tcp tracker netfilter: nf_conncount: reduce unnecessary GC netfilter: Use l3mdev flow key when re-routing mangled packets ==================== Link: https://lore.kernel.org/r/20220519220206.722153-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
d7e6f58360 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/mellanox/mlx5/core/main.c |
|
|
|
d4150779e6 |
random32: use real rng for non-deterministic randomness
random32.c has two random number generators in it: one that is meant to be used deterministically, with some predefined seed, and one that does the same exact thing as random.c, except does it poorly. The first one has some use cases. The second one no longer does and can be replaced with calls to random.c's proper random number generator. The relatively recent siphash-based bad random32.c code was added in response to concerns that the prior random32.c was too deterministic. Out of fears that random.c was (at the time) too slow, this code was anonymously contributed. Then out of that emerged a kind of shadow entropy gathering system, with its own tentacles throughout various net code, added willy nilly. Stop👏making👏bespoke👏random👏number👏generators👏. Fortunately, recent advances in random.c mean that we can stop playing with this sketchiness, and just use get_random_u32(), which is now fast enough. In micro benchmarks using RDPMC, I'm seeing the same median cycle count between the two functions, with the mean being _slightly_ higher due to batches refilling (which we can optimize further need be). However, when doing *real* benchmarks of the net functions that actually use these random numbers, the mean cycles actually *decreased* slightly (with the median still staying the same), likely because the additional prandom code means icache misses and complexity, whereas random.c is generally already being used by something else nearby. The biggest benefit of this is that there are many users of prandom who probably should be using cryptographically secure random numbers. This makes all of those accidental cases become secure by just flipping a switch. Later on, we can do a tree-wide cleanup to remove the static inline wrapper functions that this commit adds. There are also some low-ish hanging fruits for making this even faster in the future: a get_random_u16() function for use in the networking stack will give a 2x performance boost there, using SIMD for ChaCha20 will let us compute 4 or 8 or 16 blocks of output in parallel, instead of just one, giving us large buffers for cheap, and introducing a get_random_*_bh() function that assumes irqs are already disabled will shave off a few cycles for ordinary calls. These are things we can chip away at down the road. Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> |
|
|
|
2c50fc0475 |
netfilter: Use l3mdev flow key when re-routing mangled packets
Commit
|
|
|
|
f8319dfd1b |
net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv()
The 'drop_reason' that passed to kfree_skb_reason() in tcp_v4_rcv()
and tcp_v6_rcv() can be SKB_NOT_DROPPED_YET(0), as it is used as the
return value of tcp_inbound_md5_hash().
And it can panic the kernel with NULL pointer in
net_dm_packet_report_size() if the reason is 0, as drop_reasons[0]
is NULL.
Fixes:
|
|
|
|
5d368f0328 |
ipv6: add READ_ONCE(sk->sk_bound_dev_if) in INET6_MATCH()
INET6_MATCH() runs without holding a lock on the socket.
We probably need to annotate most reads.
This patch makes INET6_MATCH() an inline function
to ease our changes.
v2: inline function only defined if IS_ENABLED(CONFIG_IPV6)
Change the name to inet6_match(), this is no longer a macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
4c971d2f35 |
net: annotate races around sk->sk_bound_dev_if
UDP sendmsg() is lockless, and reads sk->sk_bound_dev_if while this field can be changed by another thread. Adds minimal annotations to avoid KCSAN splats for UDP. Following patches will add more annotations to potential lockless readers. BUG: KCSAN: data-race in __ip6_datagram_connect / udpv6_sendmsg write to 0xffff888136d47a94 of 4 bytes by task 7681 on cpu 0: __ip6_datagram_connect+0x6e2/0x930 net/ipv6/datagram.c:221 ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272 inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576 __sys_connect_file net/socket.c:1900 [inline] __sys_connect+0x197/0x1b0 net/socket.c:1917 __do_sys_connect net/socket.c:1927 [inline] __se_sys_connect net/socket.c:1924 [inline] __x64_sys_connect+0x3d/0x50 net/socket.c:1924 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888136d47a94 of 4 bytes by task 7670 on cpu 1: udpv6_sendmsg+0xc60/0x16e0 net/ipv6/udp.c:1436 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:652 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0xffffff9b Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 7670 Comm: syz-executor.3 Tainted: G W 5.18.0-rc1-syzkaller-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 I chose to not add Fixes: tag because race has minor consequences and stable teams busy enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
80e425b613 |
ipv6: Add hop-by-hop header to jumbograms in ip6_output
Instead of simply forcing a 0 payload_len in IPv6 header, implement RFC 2675 and insert a custom extension header. Note that only TCP stack is currently potentially generating jumbograms, and that this extension header is purely local, it wont be sent on a physical link. This is needed so that packet capture (tcpdump and friends) can properly dissect these large packets. Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
81fbc81213 |
ipv6/gro: insert temporary HBH/jumbo header
Following patch will add GRO_IPV6_MAX_SIZE, allowing gro to build BIG TCP ipv6 packets (bigger than 64K). This patch changes ipv6_gro_complete() to insert a HBH/jumbo header so that resulting packet can go through IPv6/TCP stacks. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
09f3d1a3a5 |
ipv6/gso: remove temporary HBH/jumbo header
ipv6 tcp and gro stacks will soon be able to build big TCP packets,
with an added temporary Hop By Hop header.
If GSO is involved for these large packets, we need to remove
the temporary HBH header before segmentation happens.
v2: perform HBH removal from ipv6_gso_segment() instead of
skb_segment() (Alexander feedback)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
1a01a07517 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
This is v2 including deadlock fix in conntrack ecache rework
reported by Jakub Kicinski.
The following patchset contains Netfilter updates for net-next,
mostly updates to conntrack from Florian Westphal.
1) Add a dedicated list for conntrack event redelivery.
2) Include event redelivery list in conntrack dumps of dying type.
3) Remove per-cpu dying list for event redelivery, not used anymore.
4) Add netns .pre_exit to cttimeout to zap timeout objects before
synchronize_rcu() call.
5) Remove nf_ct_unconfirmed_destroy.
6) Add generation id for conntrack extensions for conntrack
timeout and helpers.
7) Detach timeout policy from conntrack on cttimeout module removal.
8) Remove __nf_ct_unconfirmed_destroy.
9) Remove unconfirmed list.
10) Remove unconditional local_bh_disable in init_conntrack().
11) Consolidate conntrack iterator nf_ct_iterate_cleanup().
12) Detect if ctnetlink listeners exist to short-circuit event
path early.
13) Un-inline nf_ct_ecache_ext_add().
14) Add nf_conntrack_events autodetect ctnetlink listener mode
and make it default.
15) Add nf_ct_ecache_exist() to check for event cache extension.
16) Extend flowtable reverse route lookup to include source, iif,
tos and mark, from Sven Auhagen.
17) Do not verify zero checksum UDP packets in nf_reject,
from Kevin Mitchell.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
2c5f153647 |
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2022-05-13 1) Cleanups for the code behind the XFRM offload API. This is a preparation for the extension of the API for policy offload. From Leon Romanovsky. * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: drop not needed flags variable in XFRM offload struct net/mlx5e: Use XFRM state direction instead of flags netdevsim: rely on XFRM state direction instead of flags ixgbe: propagate XFRM offload state direction instead of flags xfrm: store and rely on direction to construct offload flags xfrm: rename xfrm_state_offload struct to allow reuse xfrm: delete not used number of external headers xfrm: free not used XFRM_ESP_NO_TRAILER flag ==================== Link: https://lore.kernel.org/r/20220513151218.4010119-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
4f9bd53084 |
netfilter: conntrack: skip verification of zero UDP checksum
The checksum is optional for UDP packets. However nf_reject would previously require a valid checksum to elicit a response such as ICMP_DEST_UNREACH. Add some logic to nf_reject_verify_csum to determine if a UDP packet has a zero checksum and should therefore not be verified. Signed-off-by: Kevin Mitchell <kevmitch@arista.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> |
|
|
|
04c494e68a |
Revert "tcp/dccp: get rid of inet_twsk_purge()"
This reverts commits: |
|
|
|
cae3873c5b |
net: inet: Retire port only listening_hash
The listen sk is currently stored in two hash tables, listening_hash (hashed by port) and lhash2 (hashed by port and address). After commit |
|
|
|
b01a277a05 |
xfrm: free not used XFRM_ESP_NO_TRAILER flag
After removal of Innova IPsec support from mlx5 driver, the last user of this XFRM_ESP_NO_TRAILER was gone too. This means that we can safely remove it as no other hardware is capable (or need) to remove ESP trailer. Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> |
|
|
|
c8227d568d |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/testing/selftests/net/forwarding/Makefile |
|
|
|
b2d057560b |
secure_seq: use the 64 bits of the siphash for port offset calculation
SipHash replaced MD5 in secure_ipv{4,6}_port_ephemeral() via commit
|
|
|
|
425b9c7f51 |
memcg: accounting for objects allocated for new netdevice
Creating a new netdevice allocates at least ~50Kb of memory for various kernel objects, but only ~5Kb of them are accounted to memcg. As a result, creating an unlimited number of netdevice inside a memcg-limited container does not fall within memcg restrictions, consumes a significant part of the host's memory, can cause global OOM and lead to random kills of host processes. The main consumers of non-accounted memory are: ~10Kb 80+ kernfs nodes ~6Kb ipv6_add_dev() allocations 6Kb __register_sysctl_table() allocations 4Kb neigh_sysctl_register() allocations 4Kb __devinet_sysctl_register() allocations 4Kb __addrconf_sysctl_register() allocations Accounting of these objects allows to increase the share of memcg-related memory up to 60-70% (~38Kb accounted vs ~54Kb total for dummy netdevice on typical VM with default Fedora 35 kernel) and this should be enough to somehow protect the host from misuse inside container. Other related objects are quite small and may not be taken into account to minimize the expected performance degradation. It should be separately mentonied ~300 bytes of percpu allocation of struct ipstats_mib in snmp6_alloc_dev(), on huge multi-cpu nodes it can become the main consumer of memory. This patch does not enables kernfs accounting as it affects other parts of the kernel and should be discussed separately. However, even without kernfs, this patch significantly improves the current situation and allows to take into account more than half of all netdevice allocations. Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/354a0a5f-9ec3-a25c-3215-304eab2157bc@openvz.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
4c7f24f857 |
net: sysctl: introduce sysctl SYSCTL_THREE
This patch introdues the SYSCTL_THREE. KUnit: [00:10:14] ================ sysctl_test (10 subtests) ================= [00:10:14] [PASSED] sysctl_test_api_dointvec_null_tbl_data [00:10:14] [PASSED] sysctl_test_api_dointvec_table_maxlen_unset [00:10:14] [PASSED] sysctl_test_api_dointvec_table_len_is_zero [00:10:14] [PASSED] sysctl_test_api_dointvec_table_read_but_position_set [00:10:14] [PASSED] sysctl_test_dointvec_read_happy_single_positive [00:10:14] [PASSED] sysctl_test_dointvec_read_happy_single_negative [00:10:14] [PASSED] sysctl_test_dointvec_write_happy_single_positive [00:10:14] [PASSED] sysctl_test_dointvec_write_happy_single_negative [00:10:14] [PASSED] sysctl_test_api_dointvec_write_single_less_int_min [00:10:14] [PASSED] sysctl_test_api_dointvec_write_single_greater_int_max [00:10:14] =================== [PASSED] sysctl_test =================== ./run_kselftest.sh -c sysctl ... ok 1 selftests: sysctl: sysctl.sh Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: David Ahern <dsahern@kernel.org> Cc: Simon Horman <horms@verge.net.au> Cc: Julian Anastasov <ja@ssi.bg> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Lorenz Bauer <lmb@cloudflare.com> Cc: Akhmat Karakotov <hmukos@yandex-team.ru> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
bd8a53675c |
net: sysctl: use shared sysctl macro
This patch replace two, four and long_one to SYSCTL_XXX. Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: David Ahern <dsahern@kernel.org> Cc: Simon Horman <horms@verge.net.au> Cc: Julian Anastasov <ja@ssi.bg> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Lorenz Bauer <lmb@cloudflare.com> Cc: Akhmat Karakotov <hmukos@yandex-team.ru> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
b52e1cce31 |
ipv6: Don't send rs packets to the interface of ARPHRD_TUNNEL
ARPHRD_TUNNEL interface can't process rs packets and will generate TX errors ex: ip tunnel add ethn mode ipip local 192.168.1.1 remote 192.168.1.2 ifconfig ethn x.x.x.x ethn: flags=209<UP,POINTOPOINT,RUNNING,NOARP> mtu 1480 inet x.x.x.x netmask 255.255.255.255 destination x.x.x.x inet6 fe80::5efe:ac1e:3cdb prefixlen 64 scopeid 0x20<link> tunnel txqueuelen 1000 (IPIP Tunnel) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 3 dropped 0 overruns 0 carrier 0 collisions 0 Signed-off-by: jianghaoran <jianghaoran@kylinos.cn> Link: https://lore.kernel.org/r/20220429053802.246681-1-jianghaoran@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
b11ebf2ca2 |
ip6_gre: Make IP6GRE and IP6GRETAP devices always NETIF_F_LLTX
Recently we made o_seqno atomic_t. Stop special-casing TUNNEL_SEQ, and always mark IP6GRE[TAP] devices as NETIF_F_LLTX, since we no longer need the TX lock (&txq->_xmit_lock). Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
a9384a4c1d |
mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter()
Whenever RCU protected list replaces an object,
the pointer to the new object needs to be updated
_before_ the call to kfree_rcu() or call_rcu()
Also ip6_mc_msfilter() needs to update the pointer
before releasing the mc_lock mutex.
Note that linux-5.13 was supporting kfree_rcu(NULL, rcu),
so this fix does not need the conditional test I was
forced to use in the equivalent patch for IPv4.
Fixes:
|
|
|
|
58f71be58b |
ipv6: refactor ip6_finish_output2()
Throw neigh checks in ip6_finish_output2() under a single slow path if, so we don't have the overhead in the hot path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
4b143ed7dd |
ipv6: help __ip6_finish_output() inlining
There are two callers of __ip6_finish_output(), both are in ip6_finish_output(). We can combine the call sites into one and handle return code after, that will inline __ip6_finish_output(). Note, error handling under NET_XMIT_CN will only return 0 if __ip6_finish_output() succeded, and in this case it return 0. Considering that NET_XMIT_SUCCESS is 0, it'll be returning exactly the same result for it as before. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
6fd1d51cfa |
net: SO_RCVMARK socket option for SO_MARK with recvmsg()
Adding a new socket option, SO_RCVMARK, to indicate that SO_MARK should be included in the ancillary data returned by recvmsg(). Renamed the sock_recv_ts_and_drops() function to sock_recv_cmsgs(). Signed-off-by: Erin MacNeil <lnx.erin@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Link: https://lore.kernel.org/r/20220427200259.2564-1-lnx.erin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
0e55546b18 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/linux/netdevice.h net/core/dev.c |
|
|
|
68822bdf76 |
net: generalize skb freeing deferral to per-cpu lists
Logic added in commit
|
|
|
|
ba5a4fdd63 |
tcp: make sure treq->af_specific is initialized
syzbot complained about a recent change in TCP stack,
hitting a NULL pointer [1]
tcp request sockets have an af_specific pointer, which
was used before the blamed change only for SYNACK generation
in non SYNCOOKIE mode.
tcp requests sockets momentarily created when third packet
coming from client in SYNCOOKIE mode were not using
treq->af_specific.
Make sure this field is populated, in the same way normal
TCP requests sockets do in tcp_conn_request().
[1]
TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending cookies. Check SNMP counters.
general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 1 PID: 3695 Comm: syz-executor864 Not tainted 5.18.0-rc3-syzkaller-00224-g5fd1fe4807f9 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:tcp_create_openreq_child+0xe16/0x16b0 net/ipv4/tcp_minisocks.c:534
Code: 48 c1 ea 03 80 3c 02 00 0f 85 e5 07 00 00 4c 8b b3 28 01 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7e 08 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 c9 07 00 00 48 8b 3c 24 48 89 de 41 ff 56 08 48
RSP: 0018:ffffc90000de0588 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff888076490330 RCX: 0000000000000100
RDX: 0000000000000001 RSI: ffffffff87d67ff0 RDI: 0000000000000008
RBP: ffff88806ee1c7f8 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff87d67f00 R11: 0000000000000000 R12: ffff88806ee1bfc0
R13: ffff88801b0e0368 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f517fe58700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffcead76960 CR3: 000000006f97b000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
tcp_v6_syn_recv_sock+0x199/0x23b0 net/ipv6/tcp_ipv6.c:1267
tcp_get_cookie_sock+0xc9/0x850 net/ipv4/syncookies.c:207
cookie_v6_check+0x15c3/0x2340 net/ipv6/syncookies.c:258
tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1131 [inline]
tcp_v6_do_rcv+0x1148/0x13b0 net/ipv6/tcp_ipv6.c:1486
tcp_v6_rcv+0x3305/0x3840 net/ipv6/tcp_ipv6.c:1725
ip6_protocol_deliver_rcu+0x2e9/0x1900 net/ipv6/ip6_input.c:422
ip6_input_finish+0x14c/0x2c0 net/ipv6/ip6_input.c:464
NF_HOOK include/linux/netfilter.h:307 [inline]
NF_HOOK include/linux/netfilter.h:301 [inline]
ip6_input+0x9c/0xd0 net/ipv6/ip6_input.c:473
dst_input include/net/dst.h:461 [inline]
ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
NF_HOOK include/linux/netfilter.h:307 [inline]
NF_HOOK include/linux/netfilter.h:301 [inline]
ipv6_rcv+0x27f/0x3b0 net/ipv6/ip6_input.c:297
__netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5405
__netif_receive_skb+0x24/0x1b0 net/core/dev.c:5519
process_backlog+0x3a0/0x7c0 net/core/dev.c:5847
__napi_poll+0xb3/0x6e0 net/core/dev.c:6413
napi_poll net/core/dev.c:6480 [inline]
net_rx_action+0x8ec/0xc60 net/core/dev.c:6567
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
invoke_softirq kernel/softirq.c:432 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
Fixes:
|
|
|
|
31c417c948 |
ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md mode
As pointed out by Jakub Kicinski, currently using TUNNEL_SEQ in collect_md mode is racy for [IP6]GRE[TAP] devices. Consider the following sequence of events: 1. An [IP6]GRE[TAP] device is created in collect_md mode using "ip link add ... external". "ip" ignores "[o]seq" if "external" is specified, so TUNNEL_SEQ is off, and the device is marked as NETIF_F_LLTX (i.e. it uses lockless TX); 2. Someone sets TUNNEL_SEQ on outgoing skb's, using e.g. bpf_skb_set_tunnel_key() in an eBPF program attached to this device; 3. gre_fb_xmit() or __gre6_xmit() processes these skb's: gre_build_header(skb, tun_hlen, flags, protocol, tunnel_id_to_key32(tun_info->key.tun_id), (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0); ^^^^^^^^^^^^^^^^^ Since we are not using the TX lock (&txq->_xmit_lock), multiple CPUs may try to do this tunnel->o_seqno++ in parallel, which is racy. Fix it by making o_seqno atomic_t. As mentioned by Eric Dumazet in commit |
|
|
|
fde98ae91f |
ip6_gre: Make o_seqno start from 0 in native mode
For IP6GRE and IP6GRETAP devices, currently o_seqno starts from 1 in
native mode. According to RFC 2890 2.2., "The first datagram is sent
with a sequence number of 0." Fix it.
It is worth mentioning that o_seqno already starts from 0 in collect_md
mode, see the "if (tunnel->parms.collect_md)" clause in __gre6_xmit(),
where tunnel->o_seqno is passed to gre_build_header() before getting
incremented.
Fixes:
|
|
|
|
8ddffdb944 |
netfilter: Update ip6_route_me_harder to consider L3 domain
The commit referenced below fixed packet re-routing if Netfilter mangles
a routing key property of a packet and the packet is routed in a VRF L3
domain. The fix, however, addressed IPv4 re-routing, only.
This commit applies the same behavior for IPv6. While at it, untangle
the nested ternary operator to make the code more readable.
Fixes:
|
|
|
|
81ee0eb6c0 |
ipv6: Use ipv6_only_sock() helper in condition.
This patch replaces some sk_ipv6only tests with ipv6_only_sock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
89e9c72800 |
ipv6: Remove __ipv6_only_sock().
Since commit
|
|
|
|
d09d3ec03f |
net/ipv6: Enforce limits for accept_unsolicited_na sysctl
Fix mistake in the original patch where limits were specified but the handler didn't take care of the limits. Signed-off-by: Arun Ajith S <aajith@arista.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
f70925bf99 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/microchip/lan966x/lan966x_main.c |
|
|
|
f9a2fb7331 |
net/ipv6: Introduce accept_unsolicited_na knob to implement router-side changes for RFC9131
Add a new neighbour cache entry in STALE state for routers on receiving an unsolicited (gratuitous) neighbour advertisement with target link-layer-address option specified. This is similar to the arp_accept configuration for IPv4. A new sysctl endpoint is created to turn on this behaviour: /proc/sys/net/ipv6/conf/interface/accept_unsolicited_na. Signed-off-by: Arun Ajith S <aajith@arista.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
9cb7c01342 |
ipv6: make ip6_rt_gc_expire an atomic_t
Reads and Writes to ip6_rt_gc_expire always have been racy,
as syzbot reported lately [1]
There is a possible risk of under-flow, leading
to unexpected high value passed to fib6_run_gc(),
although I have not observed this in the field.
Hosts hitting ip6_dst_gc() very hard are under pretty bad
state anyway.
[1]
BUG: KCSAN: data-race in ip6_dst_gc / ip6_dst_gc
read-write to 0xffff888102110744 of 4 bytes by task 13165 on cpu 1:
ip6_dst_gc+0x1f3/0x220 net/ipv6/route.c:3311
dst_alloc+0x9b/0x160 net/core/dst.c:86
ip6_dst_alloc net/ipv6/route.c:344 [inline]
icmp6_dst_alloc+0xb2/0x360 net/ipv6/route.c:3261
mld_sendpack+0x2b9/0x580 net/ipv6/mcast.c:1807
mld_send_cr net/ipv6/mcast.c:2119 [inline]
mld_ifc_work+0x576/0x800 net/ipv6/mcast.c:2651
process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
worker_thread+0x618/0xa70 kernel/workqueue.c:2436
kthread+0x1a9/0x1e0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30
read-write to 0xffff888102110744 of 4 bytes by task 11607 on cpu 0:
ip6_dst_gc+0x1f3/0x220 net/ipv6/route.c:3311
dst_alloc+0x9b/0x160 net/core/dst.c:86
ip6_dst_alloc net/ipv6/route.c:344 [inline]
icmp6_dst_alloc+0xb2/0x360 net/ipv6/route.c:3261
mld_sendpack+0x2b9/0x580 net/ipv6/mcast.c:1807
mld_send_cr net/ipv6/mcast.c:2119 [inline]
mld_ifc_work+0x576/0x800 net/ipv6/mcast.c:2651
process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
worker_thread+0x618/0xa70 kernel/workqueue.c:2436
kthread+0x1a9/0x1e0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30
value changed: 0x00000bb3 -> 0x00000ba9
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 11607 Comm: kworker/0:21 Not tainted 5.18.0-rc1-syzkaller-00037-g42e7a03d3bad-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: mld mld_ifc_work
Fixes:
|
|
|
|
0339d25a28 |
ipv6: fix NULL deref in ip6_rcv_core()
idev can be NULL, as the surrounding code suggests.
Fixes:
|
|
|
|
ab198e1d0d |
ip6_gre: Fix skb_under_panic in __gre6_xmit()
Feng reported an skb_under_panic BUG triggered by running
test_ip6gretap() in tools/testing/selftests/bpf/test_tunnel.sh:
[ 82.492551] skbuff: skb_under_panic: text:ffffffffb268bb8e len:403 put:12 head:ffff9997c5480000 data:ffff9997c547fff8 tail:0x18b end:0x2c0 dev:ip6gretap11
<...>
[ 82.607380] Call Trace:
[ 82.609389] <TASK>
[ 82.611136] skb_push.cold.109+0x10/0x10
[ 82.614289] __gre6_xmit+0x41e/0x590
[ 82.617169] ip6gre_tunnel_xmit+0x344/0x3f0
[ 82.620526] dev_hard_start_xmit+0xf1/0x330
[ 82.623882] sch_direct_xmit+0xe4/0x250
[ 82.626961] __dev_queue_xmit+0x720/0xfe0
<...>
[ 82.633431] packet_sendmsg+0x96a/0x1cb0
[ 82.636568] sock_sendmsg+0x30/0x40
<...>
The following sequence of events caused the BUG:
1. During ip6gretap device initialization, tunnel->tun_hlen (e.g. 4) is
calculated based on old flags (see ip6gre_calc_hlen());
2. packet_snd() reserves header room for skb A, assuming
tunnel->tun_hlen is 4;
3. Later (in clsact Qdisc), the eBPF program sets a new tunnel key for
skb A using bpf_skb_set_tunnel_key() (see _ip6gretap_set_tunnel());
4. __gre6_xmit() detects the new tunnel key, and recalculates
"tun_hlen" (e.g. 12) based on new flags (e.g. TUNNEL_KEY and
TUNNEL_SEQ);
5. gre_build_header() calls skb_push() with insufficient reserved header
room, triggering the BUG.
As sugguested by Cong, fix it by moving the call to skb_cow_head() after
the recalculation of tun_hlen.
Reproducer:
OBJ=$LINUX/tools/testing/selftests/bpf/test_tunnel_kern.o
ip netns add at_ns0
ip link add veth0 type veth peer name veth1
ip link set veth0 netns at_ns0
ip netns exec at_ns0 ip addr add 172.16.1.100/24 dev veth0
ip netns exec at_ns0 ip link set dev veth0 up
ip link set dev veth1 up mtu 1500
ip addr add dev veth1 172.16.1.200/24
ip netns exec at_ns0 ip addr add ::11/96 dev veth0
ip netns exec at_ns0 ip link set dev veth0 up
ip addr add dev veth1 ::22/96
ip link set dev veth1 up
ip netns exec at_ns0 \
ip link add dev ip6gretap00 type ip6gretap seq flowlabel 0xbcdef key 2 \
local ::11 remote ::22
ip netns exec at_ns0 ip addr add dev ip6gretap00 10.1.1.100/24
ip netns exec at_ns0 ip addr add dev ip6gretap00 fc80::100/96
ip netns exec at_ns0 ip link set dev ip6gretap00 up
ip link add dev ip6gretap11 type ip6gretap external
ip addr add dev ip6gretap11 10.1.1.200/24
ip addr add dev ip6gretap11 fc80::200/24
ip link set dev ip6gretap11 up
tc qdisc add dev ip6gretap11 clsact
tc filter add dev ip6gretap11 egress bpf da obj $OBJ sec ip6gretap_set_tunnel
tc filter add dev ip6gretap11 ingress bpf da obj $OBJ sec ip6gretap_get_tunnel
ping6 -c 3 -w 10 -q ::11
Fixes:
|
|
|
|
f40c064e93 |
ip6_gre: Avoid updating tunnel->tun_hlen in __gre6_xmit()
Do not update tunnel->tun_hlen in data plane code. Use a local variable instead, just like "tunnel_hlen" in net/ipv4/ip_gre.c:gre_fb_xmit(). Co-developed-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
2cc7fb9d24 |
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says: ==================== pull request (net): ipsec 2022-04-14 1) Fix the output interface for VRF cases in xfrm_dst_lookup. From David Ahern. 2) Fix write out of bounds by doing COW on esp output when the packet size is larger than a page. From Sabrina Dubroca. ==================== Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
edf45f007a | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | |
|
|
eeab7e7ff4 |
net: ipv6: add skb drop reasons to ip6_protocol_deliver_rcu()
Replace kfree_skb() used in ip6_protocol_deliver_rcu() with kfree_skb_reason(). No new reasons are added. Some paths are ignored, as they are not common, such as encapsulation on non-final protocol. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> |