linux/drivers
Daniel Borkmann 920da36341 netkit: Add xsk support for af_xdp applications
Enable support for AF_XDP applications to operate on a netkit device.
The goal is that AF_XDP applications can natively consume AF_XDP
from within network namespaces. The use case on the Cilium side is
to support Kubernetes KubeVirt VMs through QEMU's AF_XDP backend.
KubeVirt is a virtual machine management add-on for Kubernetes which
aims to provide a common ground for virtualization. KubeVirt spawns
the VMs inside Kubernetes Pods which reside in their own network
namespace just like regular Pods.

Raw QEMU AF_XDP backend example with eth0 being a physical device with
16 queues where netkit is bound to the last queue (for multi-queue, an
RSS context can be used if supported by the driver):

  # ethtool -X eth0 start 0 equal 15
  # ethtool -X eth0 start 15 equal 1 context new
  # ethtool --config-ntuple eth0 flow-type ether \
            src 00:00:00:00:00:00 \
            src-mask ff:ff:ff:ff:ff:ff \
            dst $mac dst-mask 00:00:00:00:00:00 \
            proto 0 proto-mask 0xffff action 15
  [ ... setup BPF/XDP prog on eth0 to steer into shared xsk map ... ]
  # ip netns add foo
  # ip link add numrxqueues 2 nk type netkit single
  # ./pyynl/cli.py --spec ~/netlink/specs/netdev.yaml \
                   --do queue-create \
                   --json "{\"ifindex\": $(ifindex nk), \"type\": \"rx\", \
                            \"lease\": { \"ifindex\": $(ifindex eth0), \
                                       \"queue\": { \"type\": \"rx\", \"id\": 15 } } }"
  {'id': 1}
  # ip link set nk netns foo
  # ip netns exec foo ip link set lo up
  # ip netns exec foo ip link set nk up
  # ip netns exec foo qemu-system-x86_64 \
          -kernel $kernel \
          -drive file=${image_name},index=0,media=disk,format=raw \
          -append "root=/dev/sda rw console=ttyS0" \
          -cpu host \
          -m $memory \
          -enable-kvm \
          -device virtio-net-pci,netdev=net0,mac=$mac \
          -netdev af-xdp,ifname=nk,id=net0,mode=native,queues=1,start-queue=1,inhibit=on,map-path=$dir/xsks_map \
          -nographic

We have tested the above against a dual-port Nvidia ConnectX-6 (mlx5)
100G NIC with successful network connectivity out of QEMU. An earlier
iteration of this work was presented at LSF/MM/BPF [0] and more
recently at LPC [1].
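
The queue-create step in the example calls an `ifindex` helper that is
not a standard command and is not defined in this message. A minimal
sketch of what such a helper could look like, assuming sysfs is mounted
at /sys, is shown here together with a hypothetical `lease_json`
function that assembles the same JSON payload without nested shell
quoting. Both function bodies are illustrative assumptions, not part of
the patch:

```shell
#!/bin/sh
# Hypothetical helper (assumption): resolve a netdev name to its
# ifindex by reading sysfs.
ifindex() {
    cat "/sys/class/net/$1/ifindex"
}

# Hypothetical helper (assumption): build the queue-create payload.
#   $1 = ifindex of the netkit device
#   $2 = ifindex of the physical device
#   $3 = RX queue id on the physical device to lease
lease_json() {
    printf '{"ifindex": %d, "type": "rx", "lease": {"ifindex": %d, "queue": {"type": "rx", "id": %d}}}\n' \
        "$1" "$2" "$3"
}
```

With these defined, the payload for the example could be passed as
`--json "$(lease_json "$(ifindex nk)" "$(ifindex eth0)" 15)"`.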

As a first step toward wiring this up with KubeVirt, the xsk map is
bind mounted from Cilium into the VM launcher Pod, which acts as a
regular Kubernetes Pod. While not perfect, this is not a big problem
given it is out of reach of the application sitting inside the VM
(and some of the control plane aspects are baked into the launcher
Pod already), so the isolation barrier is still the VM. Eventually
the goal is to have an XDP/XSK redirect extension where there is no
need for the xsk map, and the BPF program can just derive the target
xsk from the queue on which traffic was received.

The exposure through netkit is because Cilium should not act as a
proxy handing out xsk sockets. Existing applications expect a netdev
from the kernel side and should not need to be rewritten just to
implement a CNI's protocol. Also, the memory should not be accounted
against Cilium but rather against the application Pod itself, which
is consuming AF_XDP. Further, on up/downgrades we expect the data
plane to be completely decoupled from the control plane; if Cilium
owned the sockets, that would be disruptive. Another use case which
opens up, and which is regularly requested by users, is to have DPDK
applications on top of AF_XDP in regular Kubernetes Pods.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0]
Link: https://lpc.events/event/19/contributions/2275/ [1]
Link: https://patch.msgid.link/20260115082603.219152-13-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 11:58:50 +01:00
accel
accessibility
acpi ACPI: PCI: IRQ: Fix INTx GSIs signedness 2026-01-05 19:06:40 +01:00
amba
android rust_binder: remove spin_lock() in rust_shrink_free_page() 2025-12-29 11:34:16 +01:00
ata
atm Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2026-01-08 11:38:33 -08:00
auxdisplay
base
bcma
block block-6.19-20260109 2026-01-09 15:42:46 -10:00
bluetooth
bus
cache
cdrom
cdx
char
clk
clocksource
comedi
connector
counter counter: 104-quad-8: Fix incorrect return value in IRQ handler 2025-12-22 20:03:23 +09:00
cpufreq
cpuidle
crypto crypto: qat - fix duplicate restarting msg during AER error 2025-12-29 08:44:14 +08:00
cxl
dax
dca
devfreq
dibs
dio
dma
dma-buf
dpll dpll: zl3073x: Implement device mode setting support 2026-01-19 12:04:57 -08:00
edac
eisa
extcon
firewire firewire: nosy: Fix dma_free_coherent() size 2025-12-26 22:04:03 +09:00
firmware arm64: efi: Fix NULL pointer dereference by initializing user_ns 2025-12-24 21:32:57 +01:00
fpga
fsi
fwctl
gnss
gpib
gpio gpio: shared: fix a false-positive sharing detection with reset-gpios 2026-01-09 09:56:46 +01:00
gpu amd-drm-fixes-6.19-2026-01-06: 2026-01-08 10:34:27 +10:00
greybus
hid hid-for-linus-2026010801 2026-01-08 07:44:48 -08:00
hsi
hte
hv
hwmon
hwspinlock
hwtracing
i2c
i3c
idle
iio
infiniband bnxt_en: Update FW interface to 1.10.3.151 2026-01-10 15:19:50 -08:00
input Input updates for v6.19-rc1 2025-12-21 15:21:10 -08:00
interconnect
iommu iommupt: Make pt_feature() always_inline 2026-01-10 10:50:45 +01:00
ipack
irqchip Revert "irqchip/riscv-imsic: Embed the vector array in lpriv" 2026-01-09 16:10:05 +01:00
isdn
leds
macintosh
mailbox
mcb
md block-6.19-20260102 2026-01-02 12:15:59 -08:00
media [GIT PULL for v6.19-rc6] media fixes 2026-01-14 08:18:01 -08:00
memory
memstick
message
mfd
misc Char/Misc driver fixes for 6.19-rc5 2026-01-11 07:27:44 -10:00
mmc
most
mtd treewide: Update email address 2026-01-11 06:09:11 -10:00
mux
net netkit: Add xsk support for af_xdp applications 2026-01-20 11:58:50 +01:00
nfc
ntb
nubus
nvdimm
nvme
nvmem
of of: unittest: Fix memory leak in unittest_data_add() 2026-01-02 15:36:37 -06:00
opp
parisc
parport
pci soc: fixes for 6.19 2026-01-09 15:11:45 -10:00
pcmcia
peci
perf
phy phy: add phy_get_rx_polarity() and phy_get_tx_polarity() 2026-01-14 18:16:05 +05:30
pinctrl pinctrl: qcom: lpass-lpi: mark the GPIO controller as sleeping 2026-01-01 15:40:56 +01:00
platform platform/x86: asus-armoury: add support for G835LW 2025-12-30 12:51:46 +02:00
pmdomain pmdomain: imx: Fix reference count leak in imx_gpc_probe() 2025-12-29 11:41:09 +01:00
pnp
power
powercap
pps
ps3
ptp
pwm
rapidio
ras
regulator regulator: fp9931: fix regulator node pointer 2025-12-24 11:31:29 +00:00
remoteproc
resctrl arm_mpam: Stop using uninitialized variables in __ris_msmon_read() 2026-01-08 19:03:15 +00:00
reset
rpmsg
rtc
s390
sbus
scsi scsi: bfa: Update outdated comment 2026-01-04 15:28:08 -05:00
sh
siox
slimbus
soc
soundwire
spi spi: cadence-quadspi: Prevent indirect read 2025-12-23 15:18:22 +00:00
spmi
ssb
staging
target
tc
tee
thermal
thunderbolt
tty serial: xilinx_uartps: fix rs485 delay_rts_after_send 2025-12-23 11:55:16 +01:00
ufs scsi: ufs: host: mediatek: Make read-only array scale_us static const 2026-01-04 15:48:50 -05:00
uio treewide: Update email address 2026-01-11 06:09:11 -10:00
usb Merge patch series "usb: typec: ucsi: revert broken buffer management" 2025-12-23 15:59:03 +01:00
vdpa
vfio vfio/xe: Fix use-after-free in xe_vfio_pci_alloc_file() 2025-12-28 12:42:46 -07:00
vhost vhost/vsock: improve RCU read sections around vhost_vsock_get() 2025-12-24 08:02:57 -05:00
video
virt
virtio
w1
watchdog
xen ACPI: PCI: IRQ: Fix INTx GSIs signedness 2026-01-05 19:06:40 +01:00
zorro
Kconfig
Makefile