mirror of https://github.com/torvalds/linux.git
606 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
9961a78594 |
for-6.10/io_uring-20240511
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmY/YdYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnmVEADBq8QT9Oa3HTIONHwxjmGMOalr7PSrBP89 S6Inv/l+3xDlyolyLh1HIXUC84iS9Ihi2pNC3dZct4fNcpA99H0CFaHDGwZ5rVri MrFaubZAps1qSzeypqEq3zWGKVUoaYWaOKhuOjye5Ei2tKymbguhDKl1WiKibD21 E9qOYbhSUFdub/xtx9Rv4BS05QW5bHZ2Y/tTFqB8MY4JUsdb9g/deVZkyGUQYRSd 40mDallRldjQQTQ8iU4H6/ORdGIN/90aLPbmzMdFtQcymnmRyid3rOEwhwWYe4NO ljnI8m1SJQilZz1d5oHBXBB5QubVptY1JWxbk8GQCSmOU5wrCq+ARCJXUtBXwniJ K4VFsGm9MkZcc5vsIwIzvsrk8DODla6EVo/jyDy8iFceZcNWfVxdwa5NS67V/6QT macbF785XDsmA5E4UjslbZqU047w+A5N1yazcZWzMk0coJDeB8AtsA1/C2WZOm8p HVoiAzsqt81hvPItnjCyZluL/YW+BKeOTnq04QbpQKcJpZBzszO4ZLtuD+IXkE69 8ZZPGFPnPS4ZMQojKkwsBr+Yo65S18oBDkib36mr2lsdnoWTpGq47C7ScUDBbqGm iI7U8tYMnVVkQQHVVmGI4KOr5/4lxxp8398kqCaxfW3D5BQhbtUOF/OBjBHj1ZSV 9aZx87CyhA== =DwAV -----END PGP SIGNATURE----- Merge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: - Greatly improve send zerocopy performance, by enabling coalescing of sent buffers. MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the io_uring side did not. In local testing, the crossover point for send zerocopy being faster is now around 3000 byte packets, and it performs better than the sync syscall variants as well. This feature relies on a shared branch with net-next, which was pulled into both branches. - Unification of how async preparation is done across opcodes. Previously, opcodes that required extra memory for async retry would allocate that as needed, using on-stack state until that was the case. If async retry was needed, the on-stack state was adjusted appropriately for a retry and then copied to the allocated memory. This led to some fragile and ugly code, particularly for read/write handling, and made storage retries more difficult than they needed to be. Allocate the memory upfront, as it's cheap from our pools, and use that state consistently both initially and also from the retry side. - Move away from using remap_pfn_range() for mapping the rings. This is really not the right interface to use and can cause lifetime issues or leaks. Additionally, it means the ring sq/cq arrays need to be physically contigious, which can cause problems in production with larger rings when services are restarted, as memory can be very fragmented at that point. Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply the same treatment to mapped ring provided buffers. This also helps unify the code we have dealing with allocating and mapping memory. Hard to see in the diffstat as we're adding a few features as well, but this kills about ~400 lines of code from the codebase as well. - Add support for bundles for send/recv. When used with provided buffers, bundles support sending or receiving more than one buffer at the time, improving the efficiency by only needing to call into the networking stack once for multiple sends or receives. - Tweaks for our accept operations, supporting both a DONTWAIT flag for skipping poll arm and retry if we can, and a POLLFIRST flag that the application can use to skip the initial accept attempt and rely purely on poll for triggering the operation. Both of these have identical flags on the receive side already. - Make the task_work ctx locking unconditional. We had various code paths here that would do a mix of lock/trylock and set the task_work state to whether or not it was locked. All of that goes away, we lock it unconditionally and get rid of the state flag indicating whether it's locked or not. The state struct still exists as an empty type, can go away in the future. - Add support for specifying NOP completion values, allowing it to be used for error handling testing. - Use set/test bit for io-wq worker flags. Not strictly needed, but also doesn't hurt and helps silence a KCSAN warning. - Cleanups for io-wq locking and work assignments, closing a tiny race where cancelations would not be able to find the work item reliably. - Misc fixes, cleanups, and improvements * tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits) io_uring: support to inject result for NOP io_uring: fail NOP if non-zero op flags is passed in io_uring/net: add IORING_ACCEPT_POLL_FIRST flag io_uring/net: add IORING_ACCEPT_DONTWAIT flag io_uring/filetable: don't unnecessarily clear/reset bitmap io_uring/io-wq: Use set_bit() and test_bit() at worker->flags io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring io_uring: Require zeroed sqe->len on provided-buffers send io_uring/notif: disable LAZY_WAKE for linked notifs io_uring/net: fix sendzc lazy wake polling io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it io_uring/rw: reinstate thread check for retries io_uring/notif: implement notification stacking io_uring/notif: simplify io_notif_flush() net: add callback for setting a ubuf_info to skb net: extend ubuf_info callback to ops structure io_uring/net: support bundles for recv io_uring/net: support bundles for send io_uring/kbuf: add helpers for getting/peeking multiple buffers io_uring/net: add provided buffer support for IORING_OP_SEND ... |
|
|
|
7ab4f16f9e |
net: extend ubuf_info callback to ops structure
We'll need to associate additional callbacks with ubuf_info, introduce a structure holding ubuf_info callbacks. Apart from a more smarter io_uring notification management introduced in next patches, it can be used to generalise msg_zerocopy_put_abort() and also store ->sg_from_iter, which is currently passed in struct msghdr. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
f8bbc07ac5 |
tun: limit printing rate when illegal packet received by tun dev
vhost_worker will call tun call backs to receive packets. If too many
illegal packets arrives, tun_do_read will keep dumping packet contents.
When console is enabled, it will costs much more cpu time to dump
packet and soft lockup will be detected.
net_ratelimit mechanism can be used to limit the dumping rate.
PID: 33036 TASK: ffff949da6f20000 CPU: 23 COMMAND: "vhost-32980"
#0 [fffffe00003fce50] crash_nmi_callback at ffffffff89249253
#1 [fffffe00003fce58] nmi_handle at ffffffff89225fa3
#2 [fffffe00003fceb0] default_do_nmi at ffffffff8922642e
#3 [fffffe00003fced0] do_nmi at ffffffff8922660d
#4 [fffffe00003fcef0] end_repeat_nmi at ffffffff89c01663
[exception RIP: io_serial_in+20]
RIP: ffffffff89792594 RSP: ffffa655314979e8 RFLAGS: 00000002
RAX: ffffffff89792500 RBX: ffffffff8af428a0 RCX: 0000000000000000
RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff8af428a0
RBP: 0000000000002710 R8: 0000000000000004 R9: 000000000000000f
R10: 0000000000000000 R11: ffffffff8acbf64f R12: 0000000000000020
R13: ffffffff8acbf698 R14: 0000000000000058 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffffa655314979e8] io_serial_in at ffffffff89792594
#6 [ffffa655314979e8] wait_for_xmitr at ffffffff89793470
#7 [ffffa65531497a08] serial8250_console_putchar at ffffffff897934f6
#8 [ffffa65531497a20] uart_console_write at ffffffff8978b605
#9 [ffffa65531497a48] serial8250_console_write at ffffffff89796558
#10 [ffffa65531497ac8] console_unlock at ffffffff89316124
#11 [ffffa65531497b10] vprintk_emit at ffffffff89317c07
#12 [ffffa65531497b68] printk at ffffffff89318306
#13 [ffffa65531497bc8] print_hex_dump at ffffffff89650765
#14 [ffffa65531497ca8] tun_do_read at ffffffffc0b06c27 [tun]
#15 [ffffa65531497d38] tun_recvmsg at ffffffffc0b06e34 [tun]
#16 [ffffa65531497d68] handle_rx at ffffffffc0c5d682 [vhost_net]
#17 [ffffa65531497ed0] vhost_worker at ffffffffc0c644dc [vhost]
#18 [ffffa65531497f10] kthread at ffffffff892d2e72
#19 [ffffa65531497f50] ret_from_fork at ffffffff89c0022f
Fixes:
|
|
|
|
490a79faf9 |
net: introduce include/net/rps.h
Move RPS related structures and helpers from include/linux/netdevice.h and include/net/sock.h to a new include file. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
4166204d7e |
net: tap: Remove generic .ndo_get_stats64
Commit
|
|
|
|
46f480ec14 |
net: tuntap: Leverage core stats allocator
With commit
|
|
|
|
65f5dd4f02 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: net/mptcp/protocol.c |
|
|
|
2a770cdc43 |
tun: Fix xdp_rxq_info's queue_index when detaching
When a queue(tfile) is detached, we only update tfile's queue_index,
but do not update xdp_rxq_info's queue_index. This patch fixes it.
Fixes:
|
|
|
|
4d2bb0bfe8 |
xdp: rely on skb pointer reference in do_xdp_generic and netif_receive_generic_xdp
Rely on skb pointer reference instead of the skb pointer in do_xdp_generic and netif_receive_generic_xdp routine signatures. This is a preliminary patch to add multi-buff support for xdp running in generic mode where we will need to reallocate the skb to avoid linearization and we will need to make it visible to do_xdp_generic() caller. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/c09415b1f48c8620ef4d76deed35050a7bddf7c2.1707729884.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
45a96c407e |
tun: Implement ethtool's get_channels() callback
Implement the tun .get_channels functionality. This feature is necessary for some tools, such as libxdp, which need to retrieve the queue count. Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
3f3ebe5362 |
net/tun: use reciprocal_scale
Use the inline function reciprocal_scale rather than open coding the scale optimization. Also, remove unnecessary initializations. Resulting compiled code is unchanged (according to godbolt). Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20240126002550.169608-1-stephen@networkplumber.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
f1084c427f |
tun: add missing rx stats accounting in tun_xdp_act
The TUN can be used as vhost-net backend, and it is necessary to
count the packets transmitted from TUN to vhost-net/virtio-net.
However, there are some places in the receive path that were not
taken into account when using XDP. It would be beneficial to also
include new accounting for successfully received bytes using
dev_sw_netstats_rx_add.
Fixes:
|
|
|
|
5744ba05e7 |
tun: fix missing dropped counter in tun_xdp_act
The commit |
|
|
|
cbfbfe3aee |
tun: prevent negative ifindex
After commit |
|
|
|
b2f8323364 |
tun: add __exit annotations to module exit func tun_cleanup()
Add missing __exit annotations to module exit func tun_cleanup(). Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230814083000.3893589-1-william.xuanziyang@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
6231e47b6f |
tun: avoid high-order page allocation for packet header
When gso.hdr_len is zero and a packet is transmitted via write() or writev(), all payload is treated as header which requires a contiguous memory allocation. This allocation request is harder to satisfy, and may even fail if there is enough fragmentation. Note that sendmsg() code path limits the linear copy length, so this change makes write()/writev() and sendmsg() paths more consistent. Signed-off-by: Tahsin Erdogan <trdgn@amazon.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20230809164753.2247594-1-trdgn@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
4d016ae42e |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/intel/igc/igc_main.c |
|
|
|
59eeb23294 |
drivers: net: prevent tun_build_skb() to exceed the packet size limit
Using the syzkaller repro with reduced packet size it was discovered
that XDP_PACKET_HEADROOM is not checked in tun_can_build_skb(),
although pad may be incremented in tun_build_skb(). This may end up
with exceeding the PAGE_SIZE limit in tun_build_skb().
Jason Wang <jasowang@redhat.com> proposed to count XDP_PACKET_HEADROOM
always (e.g. without rcu_access_pointer(tun->xdp_prog)) in
tun_can_build_skb() since there's a window during which XDP program
might be attached between tun_can_build_skb() and tun_build_skb().
Fixes:
|
|
|
|
35b1b1fd96 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: net/dsa/port.c |
|
|
|
ce7c7fef14 |
net: tun: change tun_alloc_skb() to allow bigger paged allocations
tun_alloc_skb() is currently calling sock_alloc_send_pskb() forcing order-0 page allocations. Switch to PAGE_ALLOC_COSTLY_ORDER, to increase max allocation size by 8x. Also add logic to increase the linear part if needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tahsin Erdogan <trdgn@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20230801205254.400094-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
9bc3047374 |
net: tun_chr_open(): set sk_uid from current_fsuid()
Commit |
|
|
|
82b2bc2794 |
tun: Fix memory leak for detached NAPI queue.
syzkaller reported [0] memory leaks of sk and skb related to the TUN
device with no repro, but we can reproduce it easily with:
struct ifreq ifr = {}
int fd_tun, fd_tmp;
char buf[4] = {};
fd_tun = openat(AT_FDCWD, "/dev/net/tun", O_WRONLY, 0);
ifr.ifr_flags = IFF_TUN | IFF_NAPI | IFF_MULTI_QUEUE;
ioctl(fd_tun, TUNSETIFF, &ifr);
ifr.ifr_flags = IFF_DETACH_QUEUE;
ioctl(fd_tun, TUNSETQUEUE, &ifr);
fd_tmp = socket(AF_PACKET, SOCK_PACKET, 0);
ifr.ifr_flags = IFF_UP;
ioctl(fd_tmp, SIOCSIFFLAGS, &ifr);
write(fd_tun, buf, sizeof(buf));
close(fd_tun);
If we enable NAPI and multi-queue on a TUN device, we can put skb into
tfile->sk.sk_write_queue after the queue is detached. We should prevent
it by checking tfile->detached before queuing skb.
Note this must be done under tfile->sk.sk_write_queue.lock because write()
and ioctl(IFF_DETACH_QUEUE) can run concurrently. Otherwise, there would
be a small race window:
write() ioctl(IFF_DETACH_QUEUE)
`- tun_get_user `- __tun_detach
|- if (tfile->detached) |- tun_disable_queue
| `-> false | `- tfile->detached = tun
| `- tun_queue_purge
|- spin_lock_bh(&queue->lock)
`- __skb_queue_tail(queue, skb)
Another solution is to call tun_queue_purge() when closing and
reattaching the detached queue, but it could paper over another
problems. Also, we do the same kind of test for IFF_NAPI_FRAGS.
[0]:
unreferenced object 0xffff88801edbc800 (size 2048):
comm "syz-executor.1", pid 33269, jiffies 4295743834 (age 18.756s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............
backtrace:
[<000000008c16ea3d>] __do_kmalloc_node mm/slab_common.c:965 [inline]
[<000000008c16ea3d>] __kmalloc+0x4a/0x130 mm/slab_common.c:979
[<000000003addde56>] kmalloc include/linux/slab.h:563 [inline]
[<000000003addde56>] sk_prot_alloc+0xef/0x1b0 net/core/sock.c:2035
[<000000003e20621f>] sk_alloc+0x36/0x2f0 net/core/sock.c:2088
[<0000000028e43843>] tun_chr_open+0x3d/0x190 drivers/net/tun.c:3438
[<000000001b0f1f28>] misc_open+0x1a6/0x1f0 drivers/char/misc.c:165
[<000000004376f706>] chrdev_open+0x111/0x300 fs/char_dev.c:414
[<00000000614d379f>] do_dentry_open+0x2f9/0x750 fs/open.c:920
[<000000008eb24774>] do_open fs/namei.c:3636 [inline]
[<000000008eb24774>] path_openat+0x143f/0x1a30 fs/namei.c:3791
[<00000000955077b5>] do_filp_open+0xce/0x1c0 fs/namei.c:3818
[<00000000b78973b0>] do_sys_openat2+0xf0/0x260 fs/open.c:1356
[<00000000057be699>] do_sys_open fs/open.c:1372 [inline]
[<00000000057be699>] __do_sys_openat fs/open.c:1388 [inline]
[<00000000057be699>] __se_sys_openat fs/open.c:1383 [inline]
[<00000000057be699>] __x64_sys_openat+0x83/0xf0 fs/open.c:1383
[<00000000a7d2182d>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<00000000a7d2182d>] do_syscall_64+0x3c/0x90 arch/x86/entry/common.c:80
[<000000004cc4e8c4>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
unreferenced object 0xffff88802f671700 (size 240):
comm "syz-executor.1", pid 33269, jiffies 4295743854 (age 18.736s)
hex dump (first 32 bytes):
68 c9 db 1e 80 88 ff ff 68 c9 db 1e 80 88 ff ff h.......h.......
00 c0 7b 2f 80 88 ff ff 00 c8 db 1e 80 88 ff ff ..{/............
backtrace:
[<00000000e9d9fdb6>] __alloc_skb+0x223/0x250 net/core/skbuff.c:644
[<000000002c3e4e0b>] alloc_skb include/linux/skbuff.h:1288 [inline]
[<000000002c3e4e0b>] alloc_skb_with_frags+0x6f/0x350 net/core/skbuff.c:6378
[<00000000825f98d7>] sock_alloc_send_pskb+0x3ac/0x3e0 net/core/sock.c:2729
[<00000000e9eb3df3>] tun_alloc_skb drivers/net/tun.c:1529 [inline]
[<00000000e9eb3df3>] tun_get_user+0x5e1/0x1f90 drivers/net/tun.c:1841
[<0000000053096912>] tun_chr_write_iter+0xac/0x120 drivers/net/tun.c:2035
[<00000000b9282ae0>] call_write_iter include/linux/fs.h:1868 [inline]
[<00000000b9282ae0>] new_sync_write fs/read_write.c:491 [inline]
[<00000000b9282ae0>] vfs_write+0x40f/0x530 fs/read_write.c:584
[<00000000524566e4>] ksys_write+0xa1/0x170 fs/read_write.c:637
[<00000000a7d2182d>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<00000000a7d2182d>] do_syscall_64+0x3c/0x90 arch/x86/entry/common.c:80
[<000000004cc4e8c4>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
Fixes:
|
|
|
|
6e98b09da9 |
Networking changes for 6.4.
Core
----
- Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
default value allows for better BIG TCP performances.
- Reduce compound page head access for zero-copy data transfers.
- RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible.
- Threaded NAPI improvements, adding defer skb free support and unneeded
softirq avoidance.
- Address dst_entry reference count scalability issues, via false
sharing avoidance and optimize refcount tracking.
- Add lockless accesses annotation to sk_err[_soft].
- Optimize again the skb struct layout.
- Extends the skb drop reasons to make it usable by multiple
subsystems.
- Better const qualifier awareness for socket casts.
BPF
---
- Add skb and XDP typed dynptrs which allow BPF programs for more
ergonomic and less brittle iteration through data and variable-sized
accesses.
- Add a new BPF netfilter program type and minimal support to hook
BPF programs to netfilter hooks such as prerouting or forward.
- Add more precise memory usage reporting for all BPF map types.
- Adds support for using {FOU,GUE} encap with an ipip device operating
in collect_md mode and add a set of BPF kfuncs for controlling encap
params.
- Allow BPF programs to detect at load time whether a particular kfunc
exists or not, and also add support for this in light skeleton.
- Bigger batch of BPF verifier improvements to prepare for upcoming BPF
open-coded iterators allowing for less restrictive looping capabilities.
- Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF
programs to NULL-check before passing such pointers into kfunc.
- Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in
local storage maps.
- Enable RCU semantics for task BPF kptrs and allow referenced kptr
tasks to be stored in BPF maps.
- Add support for refcounted local kptrs to the verifier for allowing
shared ownership, useful for adding a node to both the BPF list and
rbtree.
- Add BPF verifier support for ST instructions in convert_ctx_access()
which will help new -mcpu=v4 clang flag to start emitting them.
- Add ARM32 USDT support to libbpf.
- Improve bpftool's visual program dump which produces the control
flow graph in a DOT format by adding C source inline annotations.
Protocols
---------
- IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
indicates the provenance of the IP address.
- IPv6: optimize route lookup, dropping unneeded R/W lock acquisition.
- Add the handshake upcall mechanism, allowing the user-space
to implement generic TLS handshake on kernel's behalf.
- Bridge: support per-{Port, VLAN} neighbor suppression, increasing
resilience to nodes failures.
- SCTP: add support for Fair Capacity and Weighted Fair Queueing
schedulers.
- MPTCP: delay first subflow allocation up to its first usage. This
will allow for later better LSM interaction.
- xfrm: Remove inner/outer modes from input/output path. These are
not needed anymore.
- WiFi:
- reduced neighbor report (RNR) handling for AP mode
- HW timestamping support
- support for randomized auth/deauth TA for PASN privacy
- per-link debugfs for multi-link
- TC offload support for mac80211 drivers
- mac80211 mesh fast-xmit and fast-rx support
- enable Wi-Fi 7 (EHT) mesh support
Netfilter
---------
- Add nf_tables 'brouting' support, to force a packet to be routed
instead of being bridged.
- Update bridge netfilter and ovs conntrack helpers to handle
IPv6 Jumbo packets properly, i.e. fetch the packet length
from hop-by-hop extension header. This is needed for BIT TCP
support.
- The iptables 32bit compat interface isn't compiled in by default
anymore.
- Move ip(6)tables builtin icmp matches to the udptcp one.
This has the advantage that icmp/icmpv6 match doesn't load the
iptables/ip6tables modules anymore when iptables-nft is used.
- Extended netlink error report for netdevice in flowtables and
netdev/chains. Allow for incrementally add/delete devices to netdev
basechain. Allow to create netdev chain without device.
Driver API
----------
- Remove redundant Device Control Error Reporting Enable, as PCI core
has already error reporting enabled at enumeration time.
- Move Multicast DB netlink handlers to core, allowing devices other
then bridge to use them.
- Allow the page_pool to directly recycle the pages from safely
localized NAPI.
- Implement lockless TX queue stop/wake combo macros, allowing for
further code de-duplication and sanitization.
- Add YNL support for user headers and struct attrs.
- Add partial YNL specification for devlink.
- Add partial YNL specification for ethtool.
- Add tc-mqprio and tc-taprio support for preemptible traffic classes.
- Add tx push buf len param to ethtool, specifies the maximum number
of bytes of a transmitted packet a driver can push directly to the
underlying device.
- Add basic LED support for switch/phy.
- Add NAPI documentation, stop relaying on external links.
- Convert dsa_master_ioctl() to netdev notifier. This is a preparatory
work to make the hardware timestamping layer selectable by user
space.
- Add transceiver support and improve the error messages for CAN-FD
controllers.
New hardware / drivers
----------------------
- Ethernet:
- AMD/Pensando core device support
- MediaTek MT7981 SoC
- MediaTek MT7988 SoC
- Broadcom BCM53134 embedded switch
- Texas Instruments CPSW9G ethernet switch
- Qualcomm EMAC3 DWMAC ethernet
- StarFive JH7110 SoC
- NXP CBTX ethernet PHY
- WiFi:
- Apple M1 Pro/Max devices
- RealTek rtl8710bu/rtl8188gu
- RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
- Bluetooth:
- Realtek RTL8821CS, RTL8851B, RTL8852BS
- Mediatek MT7663, MT7922
- NXP w8997
- Actions Semi ATS2851
- QTI WCN6855
- Marvell 88W8997
- Can:
- STMicroelectronics bxcan stm32f429
Drivers
-------
- Ethernet NICs:
- Intel (1G, icg):
- add tracking and reporting of QBV config errors.
- add support for configuring max SDU for each Tx queue.
- Intel (100G, ice):
- refactor mailbox overflow detection to support Scalable IOV
- GNSS interface optimization
- Intel (i40e):
- support XDP multi-buffer
- nVidia/Mellanox:
- add the support for linux bridge multicast offload
- enable TC offload for egress and engress MACVLAN over bond
- add support for VxLAN GBP encap/decap flows offload
- extend packet offload to fully support libreswan
- support tunnel mode in mlx5 IPsec packet offload
- extend XDP multi-buffer support
- support MACsec VLAN offload
- add support for dynamic msix vectors allocation
- drop RX page_cache and fully use page_pool
- implement thermal zone to report NIC temperature
- Netronome/Corigine:
- add support for multi-zone conntrack offload
- Solarflare/Xilinx:
- support offloading TC VLAN push/pop actions to the MAE
- support TC decap rules
- support unicast PTP
- Other NICs:
- Broadcom (bnxt): enforce software based freq adjustments only
on shared PHC NIC
- RealTek (r8169): refactor to addess ASPM issues during NAPI poll.
- Micrel (lan8841): add support for PTP_PF_PEROUT
- Cadence (macb): enable PTP unicast
- Engleder (tsnep): add XDP socket zero-copy support
- virtio-net: implement exact header length guest feature
- veth: add page_pool support for page recycling
- vxlan: add MDB data path support
- gve: add XDP support for GQI-QPL format
- geneve: accept every ethertype
- macvlan: allow some packets to bypass broadcast queue
- mana: add support for jumbo frame
- Ethernet high-speed switches:
- Microchip (sparx5): Add support for TC flower templates.
- Ethernet embedded switches:
- Broadcom (b54):
- configure 6318 and 63268 RGMII ports
- Marvell (mv88e6xxx):
- faster C45 bus scan
- Microchip:
- lan966x:
- add support for IS1 VCAP
- better TX/RX from/to CPU performances
- ksz9477: add ETS Qdisc support
- ksz8: enhance static MAC table operations and error handling
- sama7g5: add PTP capability
- NXP (ocelot):
- add support for external ports
- add support for preemptible traffic classes
- Texas Instruments:
- add CPSWxG SGMII support for J7200 and J721E
- Intel WiFi (iwlwifi):
- preparation for Wi-Fi 7 EHT and multi-link support
- EHT (Wi-Fi 7) sniffer support
- hardware timestamping support for some devices/firwmares
- TX beacon protection on newer hardware
- Qualcomm 802.11ax WiFi (ath11k):
- MU-MIMO parameters support
- ack signal support for management packets
- RealTek WiFi (rtw88):
- SDIO bus support
- better support for some SDIO devices
(e.g. MAC address from efuse)
- RealTek WiFi (rtw89):
- HW scan support for 8852b
- better support for 6 GHz scanning
- support for various newer firmware APIs
- framework firmware backwards compatibility
- MediaTek WiFi (mt76):
- P2P support
- mesh A-MSDU support
- EHT (Wi-Fi 7) support
- coredump support
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmRI/mUSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkgO0QAJGxpuN67YgYV0BIM+/atWKEEexJYG7B
9MMpU4jMO3EW/pUS5t7VRsBLUybLYVPmqCZoHodObDfnu59jiPOegb6SikJv/ZwJ
Zw62PVk5MvDnQjlu4e6kDcGwkplteN08TlgI+a49BUTedpdFitrxHAYGW8f2fRO6
cK2XSld+ZucMoym5vRwf8yWS1BwdxnslPMxDJ+/8ZbWBZv44qAnG2vMB/kIx7ObC
Vel/4m6MzTwVsLYBsRvcwMVbNNlZ9GuhztlTzEbfGA4ZhTadIAMgb5VTWXB84Ws7
Aic5wTdli+q+x6/2cxhbyeoVuB9HHObYmLBAciGg4GNljP5rnQBY3X3+KVZ/x9TI
HQB7CmhxmAZVrO9pLARFV+ECrMTH2/dy3NyrZ7uYQ3WPOXJi8hJZjOTO/eeEGL7C
eTjdz0dZBWIBK2gON/6s4nExXVQUTEF2ZsPi52jTTClKjfe5pz/ddeFQIWaY1DTm
pInEiWPAvd28JyiFmhFNHsuIBCjX/Zqe2JuMfMBeBibDAC09o/OGdKJYUI15AiRf
F46Pdb7use/puqfrYW44kSAfaPYoBiE+hj1RdeQfen35xD9HVE4vdnLNeuhRlFF9
aQfyIRHYQofkumRDr5f8JEY66cl9NiKQ4IVW1xxQfYDNdC6wQqREPG1md7rJVMrJ
vP7ugFnttneg
=ITVa
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
default value allows for better BIG TCP performances
- Reduce compound page head access for zero-copy data transfers
- RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when
possible
- Threaded NAPI improvements, adding defer skb free support and
unneeded softirq avoidance
- Address dst_entry reference count scalability issues, via false
sharing avoidance and optimize refcount tracking
- Add lockless accesses annotation to sk_err[_soft]
- Optimize again the skb struct layout
- Extends the skb drop reasons to make it usable by multiple
subsystems
- Better const qualifier awareness for socket casts
BPF:
- Add skb and XDP typed dynptrs which allow BPF programs for more
ergonomic and less brittle iteration through data and
variable-sized accesses
- Add a new BPF netfilter program type and minimal support to hook
BPF programs to netfilter hooks such as prerouting or forward
- Add more precise memory usage reporting for all BPF map types
- Adds support for using {FOU,GUE} encap with an ipip device
operating in collect_md mode and add a set of BPF kfuncs for
controlling encap params
- Allow BPF programs to detect at load time whether a particular
kfunc exists or not, and also add support for this in light
skeleton
- Bigger batch of BPF verifier improvements to prepare for upcoming
BPF open-coded iterators allowing for less restrictive looping
capabilities
- Rework RCU enforcement in the verifier, add kptr_rcu and enforce
BPF programs to NULL-check before passing such pointers into kfunc
- Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and
in local storage maps
- Enable RCU semantics for task BPF kptrs and allow referenced kptr
tasks to be stored in BPF maps
- Add support for refcounted local kptrs to the verifier for allowing
shared ownership, useful for adding a node to both the BPF list and
rbtree
- Add BPF verifier support for ST instructions in
convert_ctx_access() which will help new -mcpu=v4 clang flag to
start emitting them
- Add ARM32 USDT support to libbpf
- Improve bpftool's visual program dump which produces the control
flow graph in a DOT format by adding C source inline annotations
Protocols:
- IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
indicates the provenance of the IP address
- IPv6: optimize route lookup, dropping unneeded R/W lock acquisition
- Add the handshake upcall mechanism, allowing the user-space to
implement generic TLS handshake on kernel's behalf
- Bridge: support per-{Port, VLAN} neighbor suppression, increasing
resilience to nodes failures
- SCTP: add support for Fair Capacity and Weighted Fair Queueing
schedulers
- MPTCP: delay first subflow allocation up to its first usage. This
will allow for later better LSM interaction
- xfrm: Remove inner/outer modes from input/output path. These are
not needed anymore
- WiFi:
- reduced neighbor report (RNR) handling for AP mode
- HW timestamping support
- support for randomized auth/deauth TA for PASN privacy
- per-link debugfs for multi-link
- TC offload support for mac80211 drivers
- mac80211 mesh fast-xmit and fast-rx support
- enable Wi-Fi 7 (EHT) mesh support
Netfilter:
- Add nf_tables 'brouting' support, to force a packet to be routed
instead of being bridged
- Update bridge netfilter and ovs conntrack helpers to handle IPv6
Jumbo packets properly, i.e. fetch the packet length from
hop-by-hop extension header. This is needed for BIT TCP support
- The iptables 32bit compat interface isn't compiled in by default
anymore
- Move ip(6)tables builtin icmp matches to the udptcp one. This has
the advantage that icmp/icmpv6 match doesn't load the
iptables/ip6tables modules anymore when iptables-nft is used
- Extended netlink error report for netdevice in flowtables and
netdev/chains. Allow for incrementally add/delete devices to netdev
basechain. Allow to create netdev chain without device
Driver API:
- Remove redundant Device Control Error Reporting Enable, as PCI core
has already error reporting enabled at enumeration time
- Move Multicast DB netlink handlers to core, allowing devices other
then bridge to use them
- Allow the page_pool to directly recycle the pages from safely
localized NAPI
- Implement lockless TX queue stop/wake combo macros, allowing for
further code de-duplication and sanitization
- Add YNL support for user headers and struct attrs
- Add partial YNL specification for devlink
- Add partial YNL specification for ethtool
- Add tc-mqprio and tc-taprio support for preemptible traffic classes
- Add tx push buf len param to ethtool, specifies the maximum number
of bytes of a transmitted packet a driver can push directly to the
underlying device
- Add basic LED support for switch/phy
- Add NAPI documentation, stop relaying on external links
- Convert dsa_master_ioctl() to netdev notifier. This is a
preparatory work to make the hardware timestamping layer selectable
by user space
- Add transceiver support and improve the error messages for CAN-FD
controllers
New hardware / drivers:
- Ethernet:
- AMD/Pensando core device support
- MediaTek MT7981 SoC
- MediaTek MT7988 SoC
- Broadcom BCM53134 embedded switch
- Texas Instruments CPSW9G ethernet switch
- Qualcomm EMAC3 DWMAC ethernet
- StarFive JH7110 SoC
- NXP CBTX ethernet PHY
- WiFi:
- Apple M1 Pro/Max devices
- RealTek rtl8710bu/rtl8188gu
- RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
- Bluetooth:
- Realtek RTL8821CS, RTL8851B, RTL8852BS
- Mediatek MT7663, MT7922
- NXP w8997
- Actions Semi ATS2851
- QTI WCN6855
- Marvell 88W8997
- Can:
- STMicroelectronics bxcan stm32f429
Drivers:
- Ethernet NICs:
- Intel (1G, icg):
- add tracking and reporting of QBV config errors
- add support for configuring max SDU for each Tx queue
- Intel (100G, ice):
- refactor mailbox overflow detection to support Scalable IOV
- GNSS interface optimization
- Intel (i40e):
- support XDP multi-buffer
- nVidia/Mellanox:
- add the support for linux bridge multicast offload
- enable TC offload for egress and engress MACVLAN over bond
- add support for VxLAN GBP encap/decap flows offload
- extend packet offload to fully support libreswan
- support tunnel mode in mlx5 IPsec packet offload
- extend XDP multi-buffer support
- support MACsec VLAN offload
- add support for dynamic msix vectors allocation
- drop RX page_cache and fully use page_pool
- implement thermal zone to report NIC temperature
- Netronome/Corigine:
- add support for multi-zone conntrack offload
- Solarflare/Xilinx:
- support offloading TC VLAN push/pop actions to the MAE
- support TC decap rules
- support unicast PTP
- Other NICs:
- Broadcom (bnxt): enforce software based freq adjustments only on
shared PHC NIC
- RealTek (r8169): refactor to addess ASPM issues during NAPI poll
- Micrel (lan8841): add support for PTP_PF_PEROUT
- Cadence (macb): enable PTP unicast
- Engleder (tsnep): add XDP socket zero-copy support
- virtio-net: implement exact header length guest feature
- veth: add page_pool support for page recycling
- vxlan: add MDB data path support
- gve: add XDP support for GQI-QPL format
- geneve: accept every ethertype
- macvlan: allow some packets to bypass broadcast queue
- mana: add support for jumbo frame
- Ethernet high-speed switches:
- Microchip (sparx5): Add support for TC flower templates
- Ethernet embedded switches:
- Broadcom (b54):
- configure 6318 and 63268 RGMII ports
- Marvell (mv88e6xxx):
- faster C45 bus scan
- Microchip:
- lan966x:
- add support for IS1 VCAP
- better TX/RX from/to CPU performances
- ksz9477: add ETS Qdisc support
- ksz8: enhance static MAC table operations and error handling
- sama7g5: add PTP capability
- NXP (ocelot):
- add support for external ports
- add support for preemptible traffic classes
- Texas Instruments:
- add CPSWxG SGMII support for J7200 and J721E
- Intel WiFi (iwlwifi):
- preparation for Wi-Fi 7 EHT and multi-link support
- EHT (Wi-Fi 7) sniffer support
- hardware timestamping support for some devices/firwmares
- TX beacon protection on newer hardware
- Qualcomm 802.11ax WiFi (ath11k):
- MU-MIMO parameters support
- ack signal support for management packets
- RealTek WiFi (rtw88):
- SDIO bus support
- better support for some SDIO devices (e.g. MAC address from
efuse)
- RealTek WiFi (rtw89):
- HW scan support for 8852b
- better support for 6 GHz scanning
- support for various newer firmware APIs
- framework firmware backwards compatibility
- MediaTek WiFi (mt76):
- P2P support
- mesh A-MSDU support
- EHT (Wi-Fi 7) support
- coredump support"
* tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits)
net: phy: hide the PHYLIB_LEDS knob
net: phy: marvell-88x2222: remove unnecessary (void*) conversions
tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
net: amd: Fix link leak when verifying config failed
net: phy: marvell: Fix inconsistent indenting in led_blink_set
lan966x: Don't use xdp_frame when action is XDP_TX
tsnep: Add XDP socket zero-copy TX support
tsnep: Add XDP socket zero-copy RX support
tsnep: Move skb receive action to separate function
tsnep: Add functions for queue enable/disable
tsnep: Rework TX/RX queue initialization
tsnep: Replace modulo operation with mask
net: phy: dp83867: Add led_brightness_set support
net: phy: Fix reading LED reg property
drivers: nfc: nfcsim: remove return value check of `dev_dir`
net: phy: dp83867: Remove unnecessary (void*) conversions
net: ethtool: coalesce: try to make user settings stick twice
net: mana: Check if netdev/napi_alloc_frag returns single page
net: mana: Rename mana_refill_rxoob and remove some empty lines
net: veth: add page_pool stats
...
|
|
|
|
de4f5fed3f |
iov_iter: add iter_iovec() helper
This returns a pointer to the current iovec entry in the iterator. Only useful with ITER_IOVEC right now, but it prepares us to treat ITER_UBUF and ITER_IOVEC identically for the first segment. Rename struct iov_iter->iov to iov_iter->__iov to find any potentially troublesome spots, and also to prevent anyone from adding new code that accesses iter->iov directly. Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
|
|
438b406055 |
tun: flag the device as supporting FMODE_NOWAIT
tun already checks for both O_NONBLOCK and IOCB_NOWAIT in its read and write iter handlers, so it's fully ready for FMODE_NOWAIT. But for some reason it doesn't set it. Rectify that oversight. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/3f7dc1f0-79ca-d85c-4d16-8c12c5bd492d@kernel.dk Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
de42873367 |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY+bZrwAKCRDbK58LschI gzi4AP4+TYo0jnSwwkrOoN9l4f5VO9X8osmj3CXfHBv7BGWVxAD/WnvA3TDZyaUd agIZTkRs6BHF9He8oROypARZxTeMLwM= =nO1C -----END PGP SIGNATURE----- Daniel Borkmann says: ==================== pull-request: bpf-next 2023-02-11 We've added 96 non-merge commits during the last 14 day(s) which contain a total of 152 files changed, 4884 insertions(+), 962 deletions(-). There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c between commit |
|
|
|
a096ccca6e |
tun: tun_chr_open(): correctly initialize socket uid
sock_init_data() assumes that the `struct socket` passed in input is
contained in a `struct socket_alloc` allocated with sock_alloc().
However, tun_chr_open() passes a `struct socket` embedded in a `struct
tun_file` allocated with sk_alloc().
This causes a type confusion when issuing a container_of() with
SOCK_INODE() in sock_init_data() which results in assigning a wrong
sk_uid to the `struct sock` in input.
On default configuration, the type confused field overlaps with the
high 4 bytes of `struct tun_struct __rcu *tun` of `struct tun_file`,
NULL at the time of call, which makes the uid of all tun sockets 0,
i.e., the root one.
Fix the assignment by using sock_init_data_uid().
Fixes:
|
|
|
|
66c0e13ad2 |
drivers: net: turn on XDP features
A summary of the flags being set for various drivers is given below. Note that XDP_F_REDIRECT_TARGET and XDP_F_FRAG_TARGET are features that can be turned off and on at runtime. This means that these flags may be set and unset under RTNL lock protection by the driver. Hence, READ_ONCE must be used by code loading the flag value. Also, these flags are not used for synchronization against the availability of XDP resources on a device. It is merely a hint, and hence the read may race with the actual teardown of XDP resources on the device. This may change in the future, e.g. operations taking a reference on the XDP resources of the driver, and in turn inhibiting turning off this flag. However, for now, it can only be used as a hint to check whether device supports becoming a redirection target. Turn 'hw-offload' feature flag on for: - netronome (nfp) - netdevsim. Turn 'native' and 'zerocopy' features flags on for: - intel (i40e, ice, ixgbe, igc) - mellanox (mlx5). - stmmac - netronome (nfp) Turn 'native' features flags on for: - amazon (ena) - broadcom (bnxt) - freescale (dpaa, dpaa2, enetc) - funeth - intel (igb) - marvell (mvneta, mvpp2, octeontx2) - mellanox (mlx4) - mtk_eth_soc - qlogic (qede) - sfc - socionext (netsec) - ti (cpsw) - tap - tsnep - veth - xen - virtio_net. Turn 'basic' (tx, pass, aborted and drop) features flags on for: - netronome (nfp) - cavium (thunder) - hyperv. Turn 'redirect_target' feature flag on for: - amanzon (ena) - broadcom (bnxt) - freescale (dpaa, dpaa2) - intel (i40e, ice, igb, ixgbe) - ti (cpsw) - marvell (mvneta, mvpp2) - sfc - socionext (netsec) - qlogic (qede) - mellanox (mlx5) - tap - veth - virtio_net - xen Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Marek Majtyka <alardam@gmail.com> Link: https://lore.kernel.org/r/3eca9fafb308462f7edb1f58e451d59209aa07eb.1675245258.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
|
|
|
399e082764 |
driver/net/tun: Added features for USO.
Added support for USO4 and USO6. For now, to "enable" USO, it's required to set both USO4 and USO6 simultaneously. USO enables NETIF_F_GSO_UDP_L4. Signed-off-by: Andrew Melnychenko <andrew@daynix.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
f2bb566f5c |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/lib/bpf/ringbuf.c |
|
|
|
5daadc86f2 |
net: tun: Fix use-after-free in tun_detach()
syzbot reported use-after-free in tun_detach() [1]. This causes call
trace like below:
==================================================================
BUG: KASAN: use-after-free in notifier_call_chain+0x1ee/0x200 kernel/notifier.c:75
Read of size 8 at addr ffff88807324e2a8 by task syz-executor.0/3673
CPU: 0 PID: 3673 Comm: syz-executor.0 Not tainted 6.1.0-rc5-syzkaller-00044-gcc675d22e422 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd1/0x138 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:284 [inline]
print_report+0x15e/0x461 mm/kasan/report.c:395
kasan_report+0xbf/0x1f0 mm/kasan/report.c:495
notifier_call_chain+0x1ee/0x200 kernel/notifier.c:75
call_netdevice_notifiers_info+0x86/0x130 net/core/dev.c:1942
call_netdevice_notifiers_extack net/core/dev.c:1983 [inline]
call_netdevice_notifiers net/core/dev.c:1997 [inline]
netdev_wait_allrefs_any net/core/dev.c:10237 [inline]
netdev_run_todo+0xbc6/0x1100 net/core/dev.c:10351
tun_detach drivers/net/tun.c:704 [inline]
tun_chr_close+0xe4/0x190 drivers/net/tun.c:3467
__fput+0x27c/0xa90 fs/file_table.c:320
task_work_run+0x16f/0x270 kernel/task_work.c:179
exit_task_work include/linux/task_work.h:38 [inline]
do_exit+0xb3d/0x2a30 kernel/exit.c:820
do_group_exit+0xd4/0x2a0 kernel/exit.c:950
get_signal+0x21b1/0x2440 kernel/signal.c:2858
arch_do_signal_or_restart+0x86/0x2300 arch/x86/kernel/signal.c:869
exit_to_user_mode_loop kernel/entry/common.c:168 [inline]
exit_to_user_mode_prepare+0x15f/0x250 kernel/entry/common.c:203
__syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline]
syscall_exit_to_user_mode+0x1d/0x50 kernel/entry/common.c:296
do_syscall_64+0x46/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x63/0xcd
The cause of the issue is that sock_put() from __tun_detach() drops
last reference count for struct net, and then notifier_call_chain()
from netdev_state_change() accesses that struct net.
This patch fixes the issue by calling sock_put() from tun_detach()
after all necessary accesses for the struct net has done.
Fixes:
|
|
|
|
ab00af85d2 |
net: tun: rebuild error handling in tun_get_user
The error handling in tun_get_user is very scattered. This patch unifies error handling, reduces duplication of code, and makes the logic clearer. Signed-off-by: Chuang Wang <nashuiliang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
966a9b4903 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/can/pch_can.c |
|
|
|
07d120aa33 |
net: tun: call napi_schedule_prep() to ensure we own a napi
A recent patch exposed another issue in napi_get_frags()
caught by syzbot [1]
Before feeding packets to GRO, and calling napi_complete()
we must first grab NAPI_STATE_SCHED.
[1]
WARNING: CPU: 0 PID: 3612 at net/core/dev.c:6076 napi_complete_done+0x45b/0x880 net/core/dev.c:6076
Modules linked in:
CPU: 0 PID: 3612 Comm: syz-executor408 Not tainted 6.1.0-rc3-syzkaller-00175-g1118b2049d77 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
RIP: 0010:napi_complete_done+0x45b/0x880 net/core/dev.c:6076
Code: c1 ea 03 0f b6 14 02 4c 89 f0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 24 04 00 00 41 89 5d 1c e9 73 fc ff ff e8 b5 53 22 fa <0f> 0b e9 82 fe ff ff e8 a9 53 22 fa 48 8b 5c 24 08 31 ff 48 89 de
RSP: 0018:ffffc90003c4f920 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000030 RCX: 0000000000000000
RDX: ffff8880251c0000 RSI: ffffffff875a58db RDI: 0000000000000007
RBP: 0000000000000001 R08: 0000000000000007 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff888072d02628
R13: ffff888072d02618 R14: ffff888072d02634 R15: 0000000000000000
FS: 0000555555f13300(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055c44d3892b8 CR3: 00000000172d2000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
napi_complete include/linux/netdevice.h:510 [inline]
tun_get_user+0x206d/0x3a60 drivers/net/tun.c:1980
tun_chr_write_iter+0xdb/0x200 drivers/net/tun.c:2027
call_write_iter include/linux/fs.h:2191 [inline]
do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
do_iter_write+0x182/0x700 fs/read_write.c:861
vfs_writev+0x1aa/0x630 fs/read_write.c:934
do_writev+0x133/0x2f0 fs/read_write.c:977
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f37021a3c19
Fixes:
|
|
|
|
1118b2049d |
net: tun: Fix memory leaks of napi_get_frags
kmemleak reports after running test_progs:
unreferenced object 0xffff8881b1672dc0 (size 232):
comm "test_progs", pid 394388, jiffies 4354712116 (age 841.975s)
hex dump (first 32 bytes):
e0 84 d7 a8 81 88 ff ff 80 2c 67 b1 81 88 ff ff .........,g.....
00 40 c5 9b 81 88 ff ff 00 00 00 00 00 00 00 00 .@..............
backtrace:
[<00000000c8f01748>] napi_skb_cache_get+0xd4/0x150
[<0000000041c7fc09>] __napi_build_skb+0x15/0x50
[<00000000431c7079>] __napi_alloc_skb+0x26e/0x540
[<000000003ecfa30e>] napi_get_frags+0x59/0x140
[<0000000099b2199e>] tun_get_user+0x183d/0x3bb0 [tun]
[<000000008a5adef0>] tun_chr_write_iter+0xc0/0x1b1 [tun]
[<0000000049993ff4>] do_iter_readv_writev+0x19f/0x320
[<000000008f338ea2>] do_iter_write+0x135/0x630
[<000000008a3377a4>] vfs_writev+0x12e/0x440
[<00000000a6b5639a>] do_writev+0x104/0x280
[<00000000ccf065d8>] do_syscall_64+0x3b/0x90
[<00000000d776e329>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
The issue occurs in the following scenarios:
tun_get_user()
napi_gro_frags()
napi_frags_finish()
case GRO_NORMAL:
gro_normal_one()
list_add_tail(&skb->list, &napi->rx_list);
<-- While napi->rx_count < READ_ONCE(gro_normal_batch),
<-- gro_normal_list() is not called, napi->rx_list is not empty
<-- not ask to complete the gro work, will cause memory leaks in
<-- following tun_napi_del()
...
tun_napi_del()
netif_napi_del()
__netif_napi_del()
<-- &napi->rx_list is not empty, which caused memory leaks
To fix, add napi_complete() after napi_gro_frags().
Fixes:
|
|
|
|
fbeb229a66 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
598d2982b1 |
net: tun: bump the link speed from 10Mbps to 10Gbps
The 10Mbps link speed was set in 2004 when the ethtool interface was initially added to the tun driver. It might have been a good assumption 18 years ago, but CPUs and network stack came a long way since then. Other virtual ports typically report much higher speeds. For example, veth reports 10Gbps since its introduction in 2007. Some userspace applications rely on the current link speed in certain situations. For example, Open vSwitch is using link speed as an upper bound for QoS configuration if user didn't specify the maximum rate. Advertised 10Mbps doesn't match reality in a modern world, so users have to always manually override the value with something more sensible to avoid configuration issues, e.g. limiting the traffic too much. This also creates additional confusion among users. Bump the advertised speed to at least match the veth. Alternative might be to explicitly report UNKNOWN and let the user decide on a right value for them. And it is indeed "the right way" of fixing the problem. However, that may cause issues with bonding or with some userspace applications that may rely on speed value to be reported (even though they should not). Just changing the speed value should be a safer option. Users can still override the speed with ethtool, if necessary. RFC discussion is linked below. Link: https://lore.kernel.org/lkml/20221021114921.3705550-1-i.maximets@ovn.org/ Link: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-July/051958.html Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/20221031173953.614577-1-i.maximets@ovn.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
|
|
|
363a5328f4 |
net: tun: fix bugs for oversize packet when napi frags enabled
Recently, we got two syzkaller problems because of oversize packet
when napi frags enabled.
One of the problems is because the first seg size of the iov_iter
from user space is very big, it is 2147479538 which is bigger than
the threshold value for bail out early in __alloc_pages(). And
skb->pfmemalloc is true, __kmalloc_reserve() would use pfmemalloc
reserves without __GFP_NOWARN flag. Thus we got a warning as following:
========================================================
WARNING: CPU: 1 PID: 17965 at mm/page_alloc.c:5295 __alloc_pages+0x1308/0x16c4 mm/page_alloc.c:5295
...
Call trace:
__alloc_pages+0x1308/0x16c4 mm/page_alloc.c:5295
__alloc_pages_node include/linux/gfp.h:550 [inline]
alloc_pages_node include/linux/gfp.h:564 [inline]
kmalloc_large_node+0x94/0x350 mm/slub.c:4038
__kmalloc_node_track_caller+0x620/0x8e4 mm/slub.c:4545
__kmalloc_reserve.constprop.0+0x1e4/0x2b0 net/core/skbuff.c:151
pskb_expand_head+0x130/0x8b0 net/core/skbuff.c:1654
__skb_grow include/linux/skbuff.h:2779 [inline]
tun_napi_alloc_frags+0x144/0x610 drivers/net/tun.c:1477
tun_get_user+0x31c/0x2010 drivers/net/tun.c:1835
tun_chr_write_iter+0x98/0x100 drivers/net/tun.c:2036
The other problem is because odd IPv6 packets without NEXTHDR_NONE
extension header and have big packet length, it is 2127925 which is
bigger than ETH_MAX_MTU(65535). After ipv6_gso_pull_exthdrs() in
ipv6_gro_receive(), network_header offset and transport_header offset
are all bigger than U16_MAX. That would trigger skb->network_header
and skb->transport_header overflow error, because they are all '__u16'
type. Eventually, it would affect the value for __skb_push(skb, value),
and make it be a big value. After __skb_push() in ipv6_gro_receive(),
skb->data would less than skb->head, an out of bounds memory bug occurred.
That would trigger the problem as following:
==================================================================
BUG: KASAN: use-after-free in eth_type_trans+0x100/0x260
...
Call trace:
dump_backtrace+0xd8/0x130
show_stack+0x1c/0x50
dump_stack_lvl+0x64/0x7c
print_address_description.constprop.0+0xbc/0x2e8
print_report+0x100/0x1e4
kasan_report+0x80/0x120
__asan_load8+0x78/0xa0
eth_type_trans+0x100/0x260
napi_gro_frags+0x164/0x550
tun_get_user+0xda4/0x1270
tun_chr_write_iter+0x74/0x130
do_iter_readv_writev+0x130/0x1ec
do_iter_write+0xbc/0x1e0
vfs_writev+0x13c/0x26c
To fix the problems, restrict the packet size less than
(ETH_MAX_MTU - NET_SKB_PAD - NET_IP_ALIGN) which has considered reserved
skb space in napi_alloc_skb() because transport_header is an offset from
skb->head. Add len check in tun_napi_alloc_frags() simply.
Fixes:
|
|
|
|
aff3069954 |
net: tun: Convert to use sysfs_emit() APIs
Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
accc3b4a57 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
195624d9c2 |
tun: support not enabling carrier in TUNSETIFF
This change adds support for not enabling carrier during TUNSETIFF interface creation by specifying the IFF_NO_CARRIER flag. Our tests make heavy use of tun interfaces. In some scenarios, the test process creates the interface but another process brings it up after the interface is discovered via netlink notification. In that case, it is not possible to create a tun/tap interface with carrier off without it racing against the bring up. Immediately setting carrier off via TUNSETCARRIER is still too late. Signed-off-by: Patrick Rohr <prohr@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
|
|
fb3ceec187 |
net: move from strlcpy with unused retval to strscpy
Follow the advice of the below link and prefer 'strscpy' in this subsystem. Conversion is 1:1 because the return value is not used. Generated by a coccinelle script. Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/ Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for CAN Link: https://lore.kernel.org/r/20220830201457.7984-1-wsa+renesas@sang-engineering.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
ff1fa2081d |
net: tun: avoid disabling NAPI twice
Eric reports that syzbot made short work out of my speculative
fix. Indeed when queue gets detached its tfile->tun remains,
so we would try to stop NAPI twice with a detach(), close()
sequence.
Alternative fix would be to move tun_napi_disable() to
tun_detach_all() and let the NAPI run after the queue
has been detached.
Fixes:
|
|
|
|
a8fc8cb569 |
net: tun: stop NAPI when detaching queues
While looking at a syzbot report I noticed the NAPI only gets
disabled before it's deleted. I think that user can detach
the queue before destroying the device and the NAPI will never
be stopped.
Fixes:
|
|
|
|
3b9bc84d31 |
net: tun: unlink NAPI from device on destruction
Syzbot found a race between tun file and device destruction.
NAPIs live in struct tun_file which can get destroyed before
the netdev so we have to del them explicitly. The current
code is missing deleting the NAPI if the queue was detached
first.
Fixes:
|
|
|
|
16d083e28f |
net: switch to netif_napi_add_tx()
Switch net callers to the new API not requiring the NAPI_POLL_WEIGHT argument. Acked-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Acked-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://lore.kernel.org/r/20220504163725.550782-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
968a1a5d65 |
tun: annotate access to queue->trans_start
Commit |
|
|
|
625788b584 |
net: add per-cpu storage and net->core_stats
Before adding yet another possibly contended atomic_long_t,
it is time to add per-cpu storage for existing ones:
dev->tx_dropped, dev->rx_dropped, and dev->rx_nohandler
Because many devices do not have to increment such counters,
allocate the per-cpu storage on demand, so that dev_get_stats()
does not have to spend considerable time folding zero counters.
Note that some drivers have abused these counters which
were supposed to be only used by core networking stack.
v4: should use per_cpu_ptr() in dev_get_stats() (Jakub)
v3: added a READ_ONCE() in netdev_core_stats_alloc() (Paolo)
v2: add a missing include (reported by kernel test robot <lkp@intel.com>)
Change in netdev_core_stats_alloc() (Jakub)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: jeffreyji <jeffreyji@google.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220311051420.2608812-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
|
|
3d391f6518 |
tun: vxlan: Use netif_rx().
Since commit
|
|
|
|
4b4f052e2d |
net: tun: track dropped skb via kfree_skb_reason()
The TUN can be used as vhost-net backend. E.g, the tun_net_xmit() is the interface to forward the skb from TUN to vhost-net/virtio-net. However, there are many "goto drop" in the TUN driver. Therefore, the kfree_skb_reason() is involved at each "goto drop" to help userspace ftrace/ebpf to track the reason for the loss of packets. The below reasons are introduced: - SKB_DROP_REASON_DEV_READY - SKB_DROP_REASON_NOMEM - SKB_DROP_REASON_HDR_TRUNC - SKB_DROP_REASON_TAP_FILTER - SKB_DROP_REASON_TAP_TXFILTER Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joe Jin <joe.jin@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> |