Merge tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:
- Unify how task_work cancelations are detected, placing it in the
task_work running state rather than needing to check the task state
- Series cleaning up and moving the cancelation code to where it
belongs, in cancel.c
- Cleanup of waitid and futex argument handling
- Add support for mixed sized SQEs. 6.18 added support for mixed sized
CQEs, improving flexibility and efficiency of workloads that need big
CQEs. This adds similar support for SQEs, where the occasional need
for a 128b SQE doesn't necessitate having all SQEs be 128b in size
- Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx
features are available, and both return the ring size information to
help with allocation size calculation for user provided rings like
IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER
- Zcrx updates for 6.19, including a bunch of small patches,
IORING_REGISTER_ZCRX_CTRL, refill queue flushing, and David's work on
sharing zcrx between multiple io_uring instances
- Series cleaning up ring initializations, notably deduplicating ring
size and offset calculations. It also moves most of the checking
before doing any allocations, making the code simpler
- Add support for getsockname and getpeername, which is mostly a
trivial hookup after a bit of refactoring on the networking side
- Various fixes and cleanups
* tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
io_uring: Introduce getsockname io_uring cmd
socket: Split out a getsockname helper for io_uring
socket: Unify getsockname and getpeername implementation
io_uring/query: drop unused io_handle_query_entry() ctx arg
io_uring/kbuf: remove obsolete buf_nr_pages and update comments
io_uring/register: use correct location for io_rings_layout
io_uring/zcrx: share an ifq between rings
io_uring/zcrx: add io_fill_zcrx_offsets()
io_uring/zcrx: export zcrx via a file
io_uring/zcrx: move io_zcrx_scrub() and dependencies up
io_uring/zcrx: count zcrx users
io_uring/zcrx: add sync refill queue flushing
io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL
io_uring/zcrx: elide passing msg flags
io_uring/zcrx: use folio_nr_pages() instead of shift operation
io_uring/zcrx: convert to use netmem_desc
io_uring/query: introduce rings info query
io_uring/query: introduce zcrx query
io_uring: move cq/sq user offset init around
io_uring: pre-calculate scq layout
...
I started to see zcrx data corruptions. That turned out to be due
to the CQ tail pointing to a stale entry which happened to be from
a zcrx request. I.e. the tail is incremented without the CQE
memory being changed.
The culprit is __io_cqring_overflow_flush() passing "cqe32=true"
to io_get_cqe_overflow() for non-mixed CQE32 setups, which only
expects it to be set for mixed 32B CQEs and not for SETUP_CQE32.
The fix is slightly hacky; long term it's better to unify mixed and
CQE32 handling.
Fixes: e26dca67fd ("io_uring: add support for IORING_SETUP_CQE_MIXED")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move user SQ/CQ offset initialisation to the end of io_prepare_config(),
where all the information needed to set it properly has already been
calculated.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move ring layout calculations into io_prepare_config(), so that more
misconfiguration checking can be done earlier, before creating a ctx.
It also deduplicates some code with ring resizing. As a bonus, it now
initialises params->sq_off.array, which places it closer to all the
other user offset init, and also applies it to ring resizing, which was
previously missing it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a structure keeping SQ/CQ sizes and offsets. For now it only records
data previously returned from rings_size and the SQ size.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There will be more information needed during ctx setup, and instead of
passing a handful of pointers around, wrap them all into a new
structure. Add a helper encapsulating all configuration checks and
preparation, which is also reused for ring resizing.
Note that it indirectly adds an io_uring_sanitise_params() check to
ring resizing, which is a good thing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use size_add / size_mul set of functions for rings_size() calculations.
It's more consistent with struct_size(), and errors are preserved across
a series of calculations, so intermediate result checks can be omitted.
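For illustration only (not the actual rings_size() code), the pattern with
the saturating helpers from <linux/overflow.h> looks like this, with
placeholder variable names and 'rings' being a struct io_rings pointer:

    size_t cq_bytes = struct_size(rings, cqes, cq_entries);
    size_t sq_array = size_mul(sizeof(u32), sq_entries);
    size_t total    = size_add(cq_bytes, sq_array);

    /* any overflow along the way saturates to SIZE_MAX */
    if (total == SIZE_MAX)
        return SIZE_MAX;
    return total;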
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A preparation patch inverting the IORING_SETUP_NO_SQARRAY check, so
that there is only one successful return path from the function, which
will be helpful later.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge 6.18-rc io_uring fixes, as certain coming changes depend on some
of these.
* io_uring-6.18:
io_uring/rsrc: don't use blk_rq_nr_phys_segments() as number of bvecs
io_uring/query: return number of available queries
io_uring/rw: ensure allocated iovec gets cleared for early failure
io_uring: fix regbuf vector size truncation
io_uring: fix types for region size calculation
io_uring/zcrx: remove sync refill uapi
io_uring: fix buffer auto-commit for multishot uring_cmd
io_uring: correct __must_hold annotation in io_install_fixed_file
io_uring zcrx: add MAINTAINERS entry
io_uring: Fix code indentation error
io_uring/sqpoll: be smarter on when to update the stime usage
io_uring/sqpoll: switch away from getrusage() for CPU accounting
io_uring: fix incorrect unlikely() usage in io_waitid_prep()
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_sanitise_params() sanitises most of the setup flag invariants;
move the IORING_SETUP_FLAGS check from io_uring_setup() into it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
mem_is_zero() does the job without hand-rolled loops; use it to verify
the reserved fields of the ring params.
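A check of this shape, for example (field name taken from the
io_uring_params uapi struct; not necessarily the exact check added here):

    if (!mem_is_zero(p->resv, sizeof(p->resv)))
        return -EINVAL;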
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no need to pass the user requested number of SQ entries
separately from the main parameter structure io_uring_params. Initialise
it at the beginning and stop passing it in favour of struct
io_uring_params::sq_entries.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IORING_SETUP_NO_MMAP rings remain user accessible even before the ctx
setup is finalised, so use WRITE_ONCE consistently when initialising
rings.
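For example, the mask/entry fields end up initialised along these lines
(a sketch using struct io_rings field names, not the full init sequence):

    WRITE_ONCE(rings->sq_ring_mask, sq_entries - 1);
    WRITE_ONCE(rings->cq_ring_mask, cq_entries - 1);
    WRITE_ONCE(rings->sq_ring_entries, sq_entries);
    WRITE_ONCE(rings->cq_ring_entries, cq_entries);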
Fixes: 03d89a2de2 ("io_uring: support for user allocated memory for rings/sqes")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a refcount to struct io_zcrx_ifq to reverse the refcounting
relationship, i.e. rings now reference ifqs instead. As a result,
remove the ctx->refs that an ifq holds on a ring via the page pool
memory provider.
This ref, ifq->refs, is held by internal users of an ifq, namely rings
and the page pool memory provider associated with an ifq. It is needed
to keep the ifq around until the page pool is destroyed.
Since ifqs no longer hold refs to the ring ctx, there is no need to
split the cleanup of ifqs into two parts: io_shutdown_zcrx_ifqs() in
io_ring_exit_work(), while waiting for ctx->refs to drop to 0, and
io_unregister_zcrx_ifqs() after. Remove io_shutdown_zcrx_ifqs().
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Refactor io_free_region() to take user_struct directly, instead of
accessing it from the ring ctx.
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's a bunch of code strictly dealing with cancelations, and that
code really belongs in cancel.c rather than in the core io_uring.c file.
Move the code there. Mostly mechanical, only real oddity here is that
struct io_defer_entry now needs to be visible across both io_uring.c
and cancel.c.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for uring_cmd implementations to implement functions
with the io_req_tw_func_t signature, introduce a wrapper struct
io_tw_req to hide the struct io_kiocb * argument. The intention is for
only the io_uring core to access the inner struct io_kiocb *. uring_cmd
implementations should instead call a helper from io_uring/cmd.h to
convert struct io_tw_req to struct io_uring_cmd *.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_fallback_req_func() calls io_should_terminate_tw() on each req's ctx.
But since the reqs all come from the ctx's fallback_llist, req->ctx will
be ctx for all of the reqs. Therefore, compute ts.cancel as
io_should_terminate_tw(ctx) just once, outside the loop.
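In sketch form, with the loop shape assumed from the description rather
than copied from the source:

    /* before: per-request, even though req->ctx == ctx for all of them */
    llist_for_each_entry_safe(req, tmp, node, io_task_work.node) {
        ts.cancel = io_should_terminate_tw(req->ctx);
        req->io_task_work.func(req, ts);
    }

    /* after: computed once, outside the loop */
    ts.cancel = io_should_terminate_tw(ctx);
    llist_for_each_entry_safe(req, tmp, node, io_task_work.node)
        req->io_task_work.func(req, ts);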
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_submit_sqes() shouldn't be stepping into its main loop when there is
nothing to submit, i.e. nr == 0. Fix the zero submission queue entry
checks, which should come after all user input truncation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix the indentation to ensure consistent code style, improve
readability, and fix the following checkpatch errors:
ERROR: code indent should use tabs where possible
+ return io_net_import_vec(req, kmsg, sr->buf, sr->len, ITER_SOURCE);$
ERROR: code indent should use tabs where possible
+^I^I^I struct io_big_cqe *big_cqe)$
Tested by running scripts/checkpatch.pl.
Signed-off-by: Ranganath V N <vnranganath.20@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Normal rings support 64b SQEs for posting submissions, while certain
features require the ring to be configured with IORING_SETUP_SQE128, as
they need to convey more information per submission. This, in turn,
makes ALL the SQEs be 128b in size. This is somewhat wasteful and
inefficient, particularly when only certain SQEs need to be of the
bigger variant.
This adds support for setting up a ring with mixed SQE sizes, using
IORING_SETUP_SQE_MIXED. When setup in this mode, SQEs posted to the ring
may be either 64b or 128b in size. If an SQE is 128b in size, then the
opcode will be set to a variant that indicates this is the case. Any
other non-128b opcode will assume the SQ's default size.
SQEs on these types of mixed rings may also utilize NOP with skip
success set. This can happen if the ring is one (small) SQE entry away
from wrapping, and an attempt is made to get a 128b SQE. As SQEs must be
contiguous in the SQ ring, a 128b SQE cannot wrap the ring. For this
case, a single NOP SQE should be inserted with the SKIP_SUCCESS flag
set. The kernel will process this as a normal NOP and without posting a
CQE.
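As a rough userspace sketch of that wrap handling (get_one_sqe() and
slots_to_wrap() are hypothetical application helpers, the caller is
assumed to fill two consecutive 64b slots for the big SQE, and the uapi
definitions come from <linux/io_uring.h>):

    /* Reserve room for a 128b SQE on an IORING_SETUP_SQE_MIXED ring.  If
     * only one 64b slot remains before the SQ wraps, pad it with a NOP
     * flagged IOSQE_CQE_SKIP_SUCCESS: the kernel consumes it without
     * posting a CQE, and the 128b SQE then starts on two contiguous slots
     * at the ring head. */
    struct io_uring_sqe *reserve_big_sqe(struct app_sq *sq)
    {
        if (slots_to_wrap(sq) == 1) {
            struct io_uring_sqe *pad = get_one_sqe(sq);

            memset(pad, 0, sizeof(*pad));
            pad->opcode = IORING_OP_NOP;
            pad->flags = IOSQE_CQE_SKIP_SUCCESS;
        }
        return get_one_sqe(sq);
    }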
Signed-off-by: Keith Busch <kbusch@kernel.org>
[axboe: {} style fix and assign sqe before opcode read]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's a good practice to validate parameters before doing any heavy stuff
like queue allocations. Do that for io_allocate_scq_urings().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A minor cleanup: precompute the SQ size first instead of branching on
array_size() in io_allocate_scq_urings().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than do per-tw checking, which needs to dip into the task_struct
for checking flags, do it upfront before running task_work. This places
a 'cancel' member in io_tw_token_t, which is assigned before running
task_work for that given ctx.
This is both more efficient in doing it upfront rather than for every
task_work, and it means that io_should_terminate_tw() can be made
private in io_uring.c rather than need to be called by various
callbacks of task_work.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change the liburing git URL to point to the git.kernel.org servers,
rather than my private git.kernel.dk server. Due to continued AI
scraping of cgit etc, it's becoming quite the chore to maintain a
private git server.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:
- Store ring provided buffers locally for the users, rather than stuff
them into struct io_kiocb.
These types of buffers must always be fully consumed or recycled in
the current context, and leaving them in struct io_kiocb is hence not
a good idea, as that struct has a vastly different lifetime.
Basically just an architecture cleanup that can help prevent issues
with ring provided buffers in the future.
- Support for mixed CQE sizes in the same ring.
Before this change, a CQ ring either used the default 16b CQEs, or it
was set up with 32b CQEs using IORING_SETUP_CQE32. For use cases where
a few 32b CQEs were needed, this caused everything else to use big
CQEs. This is wasteful both in terms of memory usage and memory
bandwidth for the posted CQEs.
With IORING_SETUP_CQE_MIXED, applications may use request types that
post both normal 16b and big 32b CQEs on the same ring.
- Add helpers for async data management, to make it harder for opcode
handlers to mess it up.
- Add support for multishot for uring_cmd, which ublk can use. This
helps improve efficiency, by providing a persistent request type that
can trigger multiple CQEs.
- Add initial support for ring feature querying.
We had basic support for probe operations, but the API isn't great.
Rather than expand that, add support for QUERY which is easily
expandable and can cover a lot more cases than the existing probe
support. This will help applications get a better idea of what
operations are supported on a given host.
- zcrx improvements from Pavel:
- Improve refill entry alignment for better caching
- Various cleanups, especially around deduplicating normal
memory vs dmabuf setup.
- Generalisation of the niov size (Patch 12). It's still hard
coded to PAGE_SIZE on init, but will let the user specify
the rx buffer length on setup.
- Syscall / synchronous buffer return. It'll be used as a slow
fallback path for returning buffers when the refill queue is
full. Useful for tolerating slight queue size misconfiguration
or with inconsistent load.
- Accounting more memory to cgroups.
- Additional independent cleanups that will also be useful for
multi-area support.
- Various fixes and cleanups
* tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
io_uring: fix nvme's 32b cqes on mixed cq
io_uring/query: cap number of queries
io_uring/query: prevent infinite loops
io_uring/zcrx: account niov arrays to cgroup
io_uring/zcrx: allow synchronous buffer return
io_uring/zcrx: introduce io_parse_rqe()
io_uring/zcrx: don't adjust free cache space
io_uring/zcrx: use guards for the refill lock
io_uring/zcrx: reduce netmem scope in refill
io_uring/zcrx: protect netdev with pp_lock
io_uring/zcrx: rename dma lock
io_uring/zcrx: make niov size variable
io_uring/zcrx: set sgt for umem area
io_uring/zcrx: remove dmabuf_offset
io_uring/zcrx: deduplicate area mapping
io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback()
io_uring/zcrx: check all niovs filled with dma addresses
io_uring/zcrx: move area reg checks into io_import_area
io_uring/zcrx: don't pass slot to io_zcrx_create_area
...
A recent commit:
fc582cd26e ("io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU")
fixed an issue with not deferring freeing of io_kiocb structs that
msg_ring allocates to after the current RCU grace period. But this only
covers requests that don't end up in the allocation cache. If a request
goes into the alloc cache, it can get reused before it is sane to do so.
A recent syzbot report would seem to indicate that there's something
there; however, it may very well just be because of the KASAN poisoning
that the alloc_cache handles manually.
Rather than attempt to make the alloc_cache sane for that use case, just
drop the usage of the alloc_cache for msg_ring request payload data.
Fixes: 50cf5f3842 ("io_uring/msg_ring: add an alloc cache for io_kiocb entries")
Link: https://lore.kernel.org/io-uring/68cc2687.050a0220.139b6.0005.GAE@google.com/
Reported-by: syzbot+baa2e0f4e02df602583e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When running task_work for an exiting task, rather than perform the
issue retry attempt, the task_work is canceled. However, this isn't
done for a ring that has been closed. This can lead to requests being
successfully completed after the ring has been closed, which is somewhat
confusing and surprising to an application.
Rather than just check the task exit state, also include the ring
ref state in deciding whether or not to terminate a given request when
run from task_work.
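The combined check then looks roughly like this (a sketch based on the
description above, not necessarily the exact helper):

    static inline bool io_should_terminate_tw(struct io_ring_ctx *ctx)
    {
        return current->flags & (PF_KTHREAD | PF_EXITING) ||
               percpu_ref_is_dying(&ctx->refs);
    }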
Cc: stable@vger.kernel.org # 6.1+
Link: https://github.com/axboe/liburing/discussions/1459
Reported-by: Benedek Thaler <thaler@thaler.hu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a 32b CQE is required, don't double the size of the overflow struct,
just add the size of the io_uring_cqe addition that is needed. This
avoids allocating too much memory, as the io_overflow_cqe size includes
the list member required to queue them too.
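In pattern form (a sketch, not the exact allocation code):

    size_t ocq_size = sizeof(struct io_overflow_cqe);

    if (is_cqe32)
        ocq_size += sizeof(struct io_uring_cqe); /* one extra CQE's worth */
    ocqe = kmalloc(ocq_size, gfp);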
Fixes: e26dca67fd ("io_uring: add support for IORING_SETUP_CQE_MIXED")
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work()
uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same
applies to schedule_work(), which uses system_wq, and queue_work(),
which again makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the
API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Add system_dfl_wq to encourage its use when unbound work should be used.
queue_work() / queue_delayed_work() / mod_delayed_work() will now use
the new unbound wq: if the user still uses the old wq, a warning will be
printed along with a redirect to the new one.
The old system_unbound_wq will be kept for a few release cycles.
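A minimal sketch of the intended usage for callers that don't need CPU
locality:

    static void my_work_fn(struct work_struct *work)
    {
        /* ... */
    }
    static DECLARE_WORK(my_work, my_work_fn);

    /* explicitly unbound; no CPU locality implied */
    queue_work(system_dfl_wq, &my_work);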
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work()
uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same
applies to schedule_work(), which uses system_wq, and queue_work(),
which again makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the
API.
system_wq is a per-CPU workqueue, yet nothing in its name tells about
that CPU affinity constraint, which is very often not required by
users. Make it clear by adding a system_percpu_wq.
queue_work() / queue_delayed_work() / mod_delayed_work() will now use
the new per-cpu wq: if the user still sticks to the old name, a warning
will be printed along with a redirect to the new one.
This patch adds the new system_percpu_wq except for the mm, fs and net
subsystems, which are handled in separate patches.
The old wq will be kept for a few release cycles.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring/io_uring.h doesn't use anything declared in
io_uring/filetable.h, so drop the unnecessary #include. Add filetable.h
includes in .c files previously relying on the transitive include from
io_uring.h.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add constants for supported setup / request / feature flags as well as
the feature mask. They'll be used in the next patch.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's no need to use WRITE_ONCE() to set ctx->submitter_task in
io_uring_create() since no other task can access the io_ring_ctx until a
file descriptor is associated with it. So use a normal assignment
instead of WRITE_ONCE().
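The store in question then becomes a plain assignment (a sketch,
assuming the IORING_SETUP_SINGLE_ISSUER setup path):

    if (ctx->flags & IORING_SETUP_SINGLE_ISSUER)
        ctx->submitter_task = get_task_struct(current);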
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250904161223.2600435-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Normal rings support 16b CQEs for posting completions, while certain
features require the ring to be configured with IORING_SETUP_CQE32, as
they need to convey more information per completion. This, in turn,
makes ALL the CQEs be 32b in size. This is somewhat wasteful and
inefficient, particularly when only certain CQEs need to be of the
bigger variant.
This adds support for setting up a ring with mixed CQE sizes, using
IORING_SETUP_CQE_MIXED. When setup in this mode, CQEs posted to the ring
may be either 16b or 32b in size. If a CQE is 32b in size, then
IORING_CQE_F_32 is set in the CQE flags to indicate that this is the
case. If this flag isn't set, the CQE is the normal 16b variant.
CQEs on these types of mixed rings may also have IORING_CQE_F_SKIP set.
This can happen if the ring is one (small) CQE entry away from wrapping,
and an attempt is made to post a 32b CQE. As CQEs must be contiguous in
the CQ ring, a 32b CQE cannot wrap the ring. For this case, a single
dummy CQE is posted with the SKIP flag set. The application should
simply ignore those.
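A rough userspace sketch of reaping such a ring; cq_next(), cq_advance()
and handle_completion() are hypothetical application helpers over the
mmap'ed CQ, and the point is only the flag handling:

    struct io_uring_cqe *cqe;

    while ((cqe = cq_next(cq)) != NULL) {
        /* a 32b CQE is flagged and occupies two 16b slots */
        unsigned slots = (cqe->flags & IORING_CQE_F_32) ? 2 : 1;

        if (!(cqe->flags & IORING_CQE_F_SKIP))
            handle_completion(cqe);
        /* else: padding posted to avoid wrapping a 32b CQE, ignore it */
        cq_advance(cq, slots);
    }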
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently the buffer list is stored in struct io_kiocb. The buffer list
can be of two types:
1) Classic/legacy buffer list. These don't need to get referenced after
a buffer pick, and hence storing them in struct io_kiocb is perfectly
fine.
2) Ring provided buffer lists. These DO need to be referenced after the
initial buffer pick, as they need to get consumed later on. This can
be either just incrementing the head of the ring, or it can be
consuming parts of a buffer if incremental buffer consumption has
been configured.
For case 2, io_uring needs to be careful not to access the buffer list
after the initial pick-and-execute context. The core does recycling of
these, but it's easy to make a mistake, because it's stored in the
io_kiocb which does persist across multiple execution contexts. Either
because it's a multishot request, or simply because it needed some kind
of async trigger (e.g. poll) for retry purposes.
Add a struct io_buffer_list to struct io_br_sel, which is always on
stack for the various users of it. This prevents the buffer list from
leaking outside of that execution context, and additionally it enables
kbuf to not even pass back the struct io_buffer_list if the given
context isn't appropriately locked already.
This doesn't fix any bugs, it's simply a defensive measure to prevent
any issues with reuse of a buffer list.
Link: https://lore.kernel.org/r/20250821020750.598432-12-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than have this implied being in the io_kiocb, pass it in directly
so it's immediately obvious where these users of ->buf_list are coming
from.
Link: https://lore.kernel.org/r/20250821020750.598432-6-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Picking multiple buffers always requires the ring lock to be held across
the operation, so there's no need to pass in the issue_flags to
io_put_kbufs(). On the single buffer side, if the initial picking of a
ring buffer was unlocked, then it will have been committed already. For
legacy buffers, no locking is required, as they will simply be freed.
Link: https://lore.kernel.org/r/20250821020750.598432-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Poison various request fields on free. __io_req_caches_free() is a slow
path, so poisoning can be done there unconditionally, but gate it on
kasan for
io_req_add_to_cache(). Note that some fields are logically retained
between cache allocations and can't be poisoned in
io_req_add_to_cache().
Ideally, it'd be replaced with KASAN'ed caches, but that can't be
enabled because of some synchronisation nuances.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7a78e8a7f5be434313c400650b862e36c211b312.1755459452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Opcode handlers like POLL_ADD will use ->async_data as the pointer for
double poll handling, which is a bit different than the usual case
where it's strictly gated by the REQ_F_ASYNC_DATA flag. Be a bit more
proactive in handling ->async_data, and clear it to NULL as part of
regular init. Init is touching that cacheline anyway, so might as well
clear it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Optimization to avoid reference counts on non-cloned registered
buffers. This is how these buffers were handled prior to having
cloning support, and we can still use that approach as long as the
buffers haven't been cloned to another ring.
- Cleanup and improvement for uring_cmd, where btrfs was the only user
of storing allocated data for the lifetime of the uring_cmd. Clean
that up so we can get rid of the need to do that.
- Avoid unnecessary memory copies in uring_cmd usage. This is
particularly important as a lot of uring_cmd usage necessitates the
use of 128b SQEs.
- A few updates for recv multishot, where it's now possible to add
fairness limits for limiting how much is transferred for each retry
loop. Additionally, recv multishot now supports an overall cap as
well, where once reached the multishot recv will terminate. The
latter is useful for buffer management and juggling many recv streams
at the same time.
- Add support for returning the TX timestamps via a new socket command.
This feature can work in either singleshot or multishot mode, where
the latter triggers a completion whenever new timestamps are
available. This is an alternative to using the existing error queue.
- Add support for an io_uring "mock" file, which is the start of being
able to do 100% targeted testing in terms of exercising io_uring
request handling. The idea is to have a file type that can be
anything the tester would like, and behave exactly how you want it to
behave in terms of hitting the code paths you want.
- Improve zcrx by using sgtables to de-duplicate and improve dma
address handling.
- Prep work for supporting larger pages for zcrx.
- Various little improvements and fixes.
* tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux: (42 commits)
io_uring/zcrx: fix leaking pages on sg init fail
io_uring/zcrx: don't leak pages on account failure
io_uring/zcrx: fix null ifq on area destruction
io_uring: fix breakage in EXPERT menu
io_uring/cmd: remove struct io_uring_cmd_data
btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd
io_uring/cmd: introduce IORING_URING_CMD_REISSUE flag
io_uring/zcrx: account area memory
io_uring: export io_[un]account_mem
io_uring/net: Support multishot receive len cap
io_uring: deduplicate wakeup handling
io_uring/net: cast min_not_zero() type
io_uring/poll: cleanup apoll freeing
io_uring/net: allow multishot receive per-invocation cap
io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags
io_uring/net: use passed in 'len' in io_recv_buf_select()
io_uring/zcrx: prepare fallback for larger pages
io_uring/zcrx: assert area type in io_zcrx_iov_page
io_uring/zcrx: allocate sgtable for umem areas
io_uring/zcrx: introduce io_populate_area_dma
...
No point having REQ_F_POLLED in both IO_REQ_CLEAN_FLAGS and in
IO_REQ_CLEAN_SLOW_FLAGS, and having both io_free_batch_list() and then
io_clean_op() check for it and clean it.
Move REQ_F_POLLED to IO_REQ_CLEAN_SLOW_FLAGS and drop it from
IO_REQ_CLEAN_FLAGS, and have only io_free_batch_list() do the check and
freeing.
Link: https://lore.kernel.org/io-uring/20250712000344.1579663-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 6f11adcc6f.
The problematic commit was fixed in mainline, so the work-around in
io_uring can be removed at this point. Anonymous inodes no longer
pretend to be regular files after:
1e7ab6f678 ("anon_inode: rework assertions")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge in 6.16 io_uring fixes, to avoid clashes with pending net and
settings changes.
* io_uring-6.16:
io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well
io_uring/kbuf: flag partial buffer mappings
io_uring/net: mark iov as dynamically allocated even for single segments
io_uring: fix resource leak in io_import_dmabuf()
io_uring: don't assume uaddr alignment in io_vec_fill_bvec
io_uring/rsrc: don't rely on user vaddr alignment
io_uring/rsrc: fix folio unpinning
io_uring: make fallocate be hashed work
io_uring marks a request as dealing with a regular file on S_ISREG. This
drives things like retries on short reads or writes, which is generally
not expected on a regular file (or bdev). Applications tend to not
expect that, so io_uring tries hard to ensure it doesn't deliver short
IO on regular files.
However, a recent commit added S_IFREG to anonymous inodes. When
io_uring is used to read from various things that are backed by anon
inodes, like eventfd, timerfd, etc, then it'll now all of a sudden wait
for more data rather than deliver what was read or written in a
single operation. This breaks applications that issue reads on anon
inodes, if they ask for more data than a single read delivers.
Add a check for !S_ANON_INODE as well before setting REQ_F_ISREG to
prevent that.
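The resulting check is shaped roughly like this (a sketch; the exact
helper and flag test in io_uring may differ):

    struct inode *inode = file_inode(file);

    if (S_ISREG(inode->i_mode) && !(inode->i_flags & S_ANON_INODE))
        res |= REQ_F_ISREG;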
Cc: Christian Brauner <brauner@kernel.org>
Cc: stable@vger.kernel.org
Link: https://github.com/ghostty-org/ghostty/discussions/7720
Fixes: cfd86ef7e8 ("anon_inode: use a proper mode internally")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper for posting 32 byte CQEs in a multishot mode and add a cmd
helper on top. As it specifically works with requests, the helper ignores
the passed-in cqe->user_data and sets it to the one stored in the
request.
The command helper is only valid with multishot requests.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c29d7720c16e1f981cfaa903df187138baa3946b.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>