mirror of https://github.com/torvalds/linux.git
207 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
2b2fc0be98
|
fs: fix missing declaration of init_files
fs/file.c should include include/linux/init_task.h for declaration of init_files. This fixes the sparse warning: fs/file.c:501:21: warning: symbol 'init_files' was not declared. Should it be static? Signed-off-by: Zhang Kunbo <zhangkunbo@huawei.com> Link: https://lore.kernel.org/r/20241217071836.2634868-1-zhangkunbo@huawei.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
4c797b11a8 |
vfs-6.13.file
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcW4gAKCRCRxhvAZXjc
okF+AP9xTMb2SlnRPBOBd9yFcmVXmQi86TSCUPAEVb+wIldGYwD/RIOdvXYJlp9v
RgJkU1DC3ddkXtONNDY6gFaP+siIWA0=
=gMc7
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs file updates from Christian Brauner:
"This contains changes the changes for files for this cycle:
- Introduce a new reference counting mechanism for files.
As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop
it has O(N^2) behaviour under contention with N concurrent
operations and it is in a hot path in __fget_files_rcu().
The rcuref infrastructures remedies this problem by using an
unconditional increment relying on safe- and dead zones to make
this work and requiring rcu protection for the data structure in
question. This not just scales better it also introduces overflow
protection.
However, in contrast to generic rcuref, files require a memory
barrier and thus cannot rely on *_relaxed() atomic operations and
also require to be built on atomic_long_t as having massive amounts
of reference isn't unheard of even if it is just an attack.
This adds a file specific variant instead of making this a generic
library.
This has been tested by various people and it gives consistent
improvement up to 3-5% on workloads with loads of threads.
- Add a fastpath for find_next_zero_bit(). Skip 2-levels searching
via find_next_zero_bit() when there is a free slot in the word that
contains the next fd. This improves pts/blogbench-1.1.0 read by 8%
and write by 4% on Intel ICX 160.
- Conditionally clear full_fds_bits since it's very likely that a bit
in full_fds_bits has been cleared during __clear_open_fds(). This
improves pts/blogbench-1.1.0 read up to 13%, and write up to 5% on
Intel ICX 160.
- Get rid of all lookup_*_fdget_rcu() variants. They were used to
lookup files without taking a reference count. That became invalid
once files were switched to SLAB_TYPESAFE_BY_RCU and now we're
always taking a reference count. Switch to an already existing
helper and remove the legacy variants.
- Remove pointless includes of <linux/fdtable.h>.
- Avoid cmpxchg() in close_files() as nobody else has a reference to
the files_struct at that point.
- Move close_range() into fs/file.c and fold __close_range() into it.
- Cleanup calling conventions of alloc_fdtable() and expand_files().
- Merge __{set,clear}_close_on_exec() into one.
- Make __set_open_fd() set cloexec as well instead of doing it in two
separate steps"
* tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests: add file SLAB_TYPESAFE_BY_RCU recycling stressor
fs: port files to file_ref
fs: add file_ref
expand_files(): simplify calling conventions
make __set_open_fd() set cloexec state as well
fs: protect backing files with rcu
file.c: merge __{set,clear}_close_on_exec()
alloc_fdtable(): change calling conventions.
fs/file.c: add fast path in find_next_fd()
fs/file.c: conditionally clear full_fds
fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
move close_range(2) into fs/file.c, fold __close_range() into it
close_files(): don't bother with xchg()
remove pointless includes of <linux/fdtable.h>
get rid of ...lookup...fdget_rcu() family
|
|
|
|
2ec67bb4f9
|
Merge branch 'work.fdtable' into vfs.file
Bring in the fdtable changes for this cycle. Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
90ee6ed776
|
fs: port files to file_ref
Port files to rely on file_ref reference to improve scaling and gain
overflow protection.
- We continue to WARN during get_file() in case a file that is already
marked dead is revived as get_file() is only valid if the caller
already holds a reference to the file. This hasn't changed just the
check changes.
- The semantics for epoll and ttm's dmabuf usage have changed. Both
epoll and ttm synchronize with __fput() to prevent the underlying file
from beeing freed.
(1) epoll
Explaining epoll is straightforward using a simple diagram.
Essentially, the mutex of the epoll instance needs to be taken in both
__fput() and around epi_fget() preventing the file from being freed
while it is polled or preventing the file from being resurrected.
CPU1 CPU2
fput(file)
-> __fput(file)
-> eventpoll_release(file)
-> eventpoll_release_file(file)
mutex_lock(&ep->mtx)
epi_item_poll()
-> epi_fget()
-> file_ref_get(file)
mutex_unlock(&ep->mtx)
mutex_lock(&ep->mtx);
__ep_remove()
mutex_unlock(&ep->mtx);
-> kmem_cache_free(file)
(2) ttm dmabuf
This explanation is a bit more involved. A regular dmabuf file stashed
the dmabuf in file->private_data and the file in dmabuf->file:
file->private_data = dmabuf;
dmabuf->file = file;
The generic release method of a dmabuf file handles file specific
things:
f_op->release::dma_buf_file_release()
while the generic dentry release method of a dmabuf handles dmabuf
freeing including driver specific things:
dentry->d_release::dma_buf_release()
During ttm dmabuf initialization in ttm_object_device_init() the ttm
driver copies the provided struct dma_buf_ops into a private location:
struct ttm_object_device {
spinlock_t object_lock;
struct dma_buf_ops ops;
void (*dmabuf_release)(struct dma_buf *dma_buf);
struct idr idr;
};
ttm_object_device_init(const struct dma_buf_ops *ops)
{
// copy original dma_buf_ops in private location
tdev->ops = *ops;
// stash the release method of the original struct dma_buf_ops
tdev->dmabuf_release = tdev->ops.release;
// override the release method in the copy of the struct dma_buf_ops
// with ttm's own dmabuf release method
tdev->ops.release = ttm_prime_dmabuf_release;
}
When a new dmabuf is created the struct dma_buf_ops with the overriden
release method set to ttm_prime_dmabuf_release is passed in exp_info.ops:
DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
exp_info.ops = &tdev->ops;
exp_info.size = prime->size;
exp_info.flags = flags;
exp_info.priv = prime;
The call to dma_buf_export() then sets
mutex_lock_interruptible(&prime->mutex);
dma_buf = dma_buf_export(&exp_info)
{
dmabuf->ops = exp_info->ops;
}
mutex_unlock(&prime->mutex);
which creates a new dmabuf file and then install a file descriptor to
it in the callers file descriptor table:
ret = dma_buf_fd(dma_buf, flags);
When that dmabuf file is closed we now get:
fput(file)
-> __fput(file)
-> f_op->release::dma_buf_file_release()
-> dput()
-> d_op->d_release::dma_buf_release()
-> dmabuf->ops->release::ttm_prime_dmabuf_release()
mutex_lock(&prime->mutex);
if (prime->dma_buf == dma_buf)
prime->dma_buf = NULL;
mutex_unlock(&prime->mutex);
Where we can see that prime->dma_buf is set to NULL. So when we have
the following diagram:
CPU1 CPU2
fput(file)
-> __fput(file)
-> f_op->release::dma_buf_file_release()
-> dput()
-> d_op->d_release::dma_buf_release()
-> dmabuf->ops->release::ttm_prime_dmabuf_release()
ttm_prime_handle_to_fd()
mutex_lock_interruptible(&prime->mutex)
dma_buf = prime->dma_buf
dma_buf && get_dma_buf_unless_doomed(dma_buf)
-> file_ref_get(dma_buf->file)
mutex_unlock(&prime->mutex);
mutex_lock(&prime->mutex);
if (prime->dma_buf == dma_buf)
prime->dma_buf = NULL;
mutex_unlock(&prime->mutex);
-> kmem_cache_free(file)
The logic of the mechanism is the same as for epoll: sync with
__fput() preventing the file from being freed. Here the
synchronization happens through the ttm instance's prime->mutex.
Basically, the lifetime of the dma_buf and the file are tighly
coupled.
Both (1) and (2) used to call atomic_inc_not_zero() to check whether
the file has already been marked dead and then refuse to revive it.
This is only safe because both (1) and (2) sync with __fput() and thus
prevent kmem_cache_free() on the file being called and thus prevent
the file from being immediately recycled due to SLAB_TYPESAFE_BY_RCU.
Both (1) and (2) have been ported from atomic_inc_not_zero() to
file_ref_get(). That means a file that is already in the process of
being marked as FILE_REF_DEAD:
file_ref_put()
cnt = atomic_long_dec_return()
-> __file_ref_put(cnt)
if (cnt == FIlE_REF_NOREF)
atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)
can be revived again:
CPU1 CPU2
file_ref_put()
cnt = atomic_long_dec_return()
-> __file_ref_put(cnt)
if (cnt == FIlE_REF_NOREF)
file_ref_get()
// Brings reference back to FILE_REF_ONEREF
atomic_long_add_negative()
atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)
This is fine and inherent to the file_ref_get()/file_ref_put()
semantics. For both (1) and (2) this is safe because __fput() is
prevented from making progress if file_ref_get() fails due to the
aforementioned synchronization mechanisms.
Two cases need to be considered that affect both (1) epoll and (2) ttm
dmabuf:
(i) fput()'s file_ref_put() and marks the file as FILE_REF_NOREF but
before that fput() can mark the file as FILE_REF_DEAD someone
manages to sneak in a file_ref_get() and brings the refcount back
from FILE_REF_NOREF to FILE_REF_ONEREF. In that case the original
fput() doesn't call __fput(). For epoll the poll will finish and
for ttm dmabuf the file can be used again. For ttm dambuf this is
actually an advantage because it avoids immediately allocating
a new dmabuf object.
CPU1 CPU2
file_ref_put()
cnt = atomic_long_dec_return()
-> __file_ref_put(cnt)
if (cnt == FIlE_REF_NOREF)
file_ref_get()
// Brings reference back to FILE_REF_ONEREF
atomic_long_add_negative()
atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)
(ii) fput()'s file_ref_put() marks the file FILE_REF_NOREF and
also suceeds in actually marking it FILE_REF_DEAD and then calls
into __fput() to free the file.
When either (1) or (2) call file_ref_get() they fail as
atomic_long_add_negative() will return true.
At the same time, both (1) and (2) all file_ref_get() under
mutexes that __fput() must also acquire preventing
kmem_cache_free() from freeing the file.
So while this might be treated as a change in semantics for (1) and
(2) it really isn't. It if should end up causing issues this can be
fixed by adding a helper that does something like:
long cnt = atomic_long_read(&ref->refcnt);
do {
if (cnt < 0)
return false;
} while (!atomic_long_try_cmpxchg(&ref->refcnt, &cnt, cnt + 1));
return true;
which would block FILE_REF_NOREF to FILE_REF_ONEREF transitions.
- Jann correctly pointed out that kmem_cache_zalloc() cannot be used
anymore once files have been ported to file_ref_t.
The kmem_cache_zalloc() call will memset() the whole struct file to
zero when it is reallocated. This will also set file->f_ref to zero
which mens that a concurrent file_ref_get() can return true:
CPU1 CPU2
__get_file_rcu()
rcu_dereference_raw()
close()
[frees file]
alloc_empty_file()
kmem_cache_zalloc()
[reallocates same file]
memset(..., 0, ...)
file_ref_get()
[increments 0->1, returns true]
init_file()
file_ref_init(..., 1)
[sets to 0]
rcu_dereference_raw()
fput()
file_ref_put()
[decrements 0->FILE_REF_NOREF, frees file]
[UAF]
causing a concurrent __get_file_rcu() call to acquire a reference to
the file that is about to be reallocated and immediately freeing it
on realizing that it has been recycled. This causes a UAF for the
task that reallocated/recycled the file.
This is prevented by switching from kmem_cache_zalloc() to
kmem_cache_alloc() and initializing the fields manually. With
file->f_ref initialized last.
Note that a memset() also isn't guaranteed to atomically update an
unsigned long so it's theoretically possible to see torn and
therefore bogus counter values.
Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-3-387e24dc9163@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
|
|
08ef26ea9a
|
fs: add file_ref
As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop it has
O(N^2) behaviour under contention with N concurrent operations and it is
in a hot path in __fget_files_rcu().
The rcuref infrastructures remedies this problem by using an
unconditional increment relying on safe- and dead zones to make this
work and requiring rcu protection for the data structure in question.
This not just scales better it also introduces overflow protection.
However, in contrast to generic rcuref, files require a memory barrier
and thus cannot rely on *_relaxed() atomic operations and also require
to be built on atomic_long_t as having massive amounts of reference
isn't unheard of even if it is just an attack.
As suggested by Linus, add a file specific variant instead of making
this a generic library.
Files are SLAB_TYPESAFE_BY_RCU and thus don't have "regular" rcu
protection. In short, freeing of files isn't delayed until a grace
period has elapsed. Instead, they are freed immediately and thus can be
reused (multiple times) within the same grace period.
So when picking a file from the file descriptor table via its file
descriptor number it is thus possible to see an elevated reference count
on file->f_count even though the file has already been recycled possibly
multiple times by another task.
To guard against this the vfs will pick the file from the file
descriptor table twice. Once before the refcount increment and once
after to compare the pointers (grossly simplified). If they match then
the file is still valid. If not the caller needs to fput() it.
The unconditional increment makes the following race possible as
illustrated by rcuref:
> Deconstruction race
> ===================
>
> The release operation must be protected by prohibiting a grace period in
> order to prevent a possible use after free:
>
> T1 T2
> put() get()
> // ref->refcnt = ONEREF
> if (!atomic_add_negative(-1, &ref->refcnt))
> return false; <- Not taken
>
> // ref->refcnt == NOREF
> --> preemption
> // Elevates ref->refcnt to ONEREF
> if (!atomic_add_negative(1, &ref->refcnt))
> return true; <- taken
>
> if (put(&p->ref)) { <-- Succeeds
> remove_pointer(p);
> kfree_rcu(p, rcu);
> }
>
> RCU grace period ends, object is freed
>
> atomic_cmpxchg(&ref->refcnt, NOREF, DEAD); <- UAF
>
> [...] it prevents the grace period which keeps the object alive until
> all put() operations complete.
Having files by SLAB_TYPESAFE_BY_RCU shouldn't cause any problems for
this deconstruction race. Afaict, the only interesting case would be
someone freeing the file and someone immediately recycling it within the
same grace period and reinitializing file->f_count to ONEREF while a
concurrent fput() is doing atomic_cmpxchg(&ref->refcnt, NOREF, DEAD) as
in the race above.
But this is safe from SLAB_TYPESAFE_BY_RCU's perspective and it should
be safe from rcuref's perspective.
T1 T2 T3
fput() fget()
// f_count->refcnt = ONEREF
if (!atomic_add_negative(-1, &f_count->refcnt))
return false; <- Not taken
// f_count->refcnt == NOREF
--> preemption
// Elevates f_count->refcnt to ONEREF
if (!atomic_add_negative(1, &f_count->refcnt))
return true; <- taken
if (put(&f_count)) { <-- Succeeds
remove_pointer(p);
/*
* Cache is SLAB_TYPESAFE_BY_RCU
* so this is freed without a grace period.
*/
kmem_cache_free(p);
}
kmem_cache_alloc()
init_file() {
// Sets f_count->refcnt to ONEREF
rcuref_long_init(&f->f_count, 1);
}
Object has been reused within the same grace period
via kmem_cache_alloc()'s SLAB_TYPESAFE_BY_RCU.
/*
* With SLAB_TYPESAFE_BY_RCU this would be a safe UAF access and
* it would work correctly because the atomic_cmpxchg()
* will fail because the refcount has been reset to ONEREF by T3.
*/
atomic_cmpxchg(&ref->refcnt, NOREF, DEAD); <- UAF
However, there are other cases to consider:
(1) Benign race due to multiple atomic_long_read()
CPU1 CPU2
file_ref_put()
// last reference
// => count goes negative/FILE_REF_NOREF
atomic_long_add_negative_release(-1, &ref->refcnt)
-> __file_ref_put()
file_ref_get()
// goes back from negative/FILE_REF_NOREF to 0
// and file_ref_get() succeeds
atomic_long_add_negative(1, &ref->refcnt)
// This is immediately followed by file_ref_put()
// managing to set FILE_REF_DEAD
file_ref_put()
// __file_ref_put() continues and sees
// cnt > FILE_REF_RELEASED // and splats with
// "imbalanced put on file reference count"
cnt = atomic_long_read(&ref->refcnt);
The race however is benign and the problem is the
atomic_long_read(). Instead of performing a separate read this uses
atomic_long_dec_return() and pass the value to __file_ref_put().
Thanks to Linus for pointing out that braino.
(2) SLAB_TYPESAFE_BY_RCU may cause recycled files to be marked dead
When a file is recycled the following race exists:
CPU1 CPU2
// @file is already dead and thus
// cnt >= FILE_REF_RELEASED.
file_ref_get(file)
atomic_long_add_negative(1, &ref->refcnt)
// We thus call into __file_ref_get()
-> __file_ref_get()
// which sees cnt >= FILE_REF_RELEASED
cnt = atomic_long_read(&ref->refcnt);
// In the meantime @file gets freed
kmem_cache_free()
// and is immediately recycled
file = kmem_cache_zalloc()
// and the reference count is reinitialized
// and the file alive again in someone
// else's file descriptor table
file_ref_init(&ref->refcnt, 1);
// the __file_ref_get() slowpath now continues
// and as it saw earlier that cnt >= FILE_REF_RELEASED
// it wants to ensure that we're staying in the middle
// of the deadzone and unconditionally sets
// FILE_REF_DEAD.
// This marks @file dead for CPU2...
atomic_long_set(&ref->refcnt, FILE_REF_DEAD);
// Caller issues a close() system call to close @file
close(fd)
file = file_close_fd_locked()
filp_flush()
// The caller sees that cnt >= FILE_REF_RELEASED
// and warns the first time...
CHECK_DATA_CORRUPTION(file_count(file) == 0)
// and then splats a second time because
// __file_ref_put() sees cnt >= FILE_REF_RELEASED
file_ref_put(&ref->refcnt);
-> __file_ref_put()
My initial inclination was to replace the unconditional
atomic_long_set() with an atomic_long_try_cmpxchg() but Linus
pointed out that:
> I think we should just make file_ref_get() do a simple
>
> return !atomic_long_add_negative(1, &ref->refcnt));
>
> and nothing else. Yes, multiple CPU's can race, and you can increment
> more than once, but the gap - even on 32-bit - between DEAD and
> becoming close to REF_RELEASED is so big that we simply don't care.
> That's the point of having a gap.
I've been testing this with will-it-scale using fstat() on a machine
that Jens gave me access (thank you very much!):
processor : 511
vendor_id : AuthenticAMD
cpu family : 25
model : 160
model name : AMD EPYC 9754 128-Core Processor
and I consistently get a 3-5% improvement on 256+ threads.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202410151043.5d224a27-oliver.sang@intel.com
Closes: https://lore.kernel.org/all/202410151611.f4cd71f2-oliver.sang@intel.com
Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-2-387e24dc9163@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
|
|
6a8126f077 |
expand_files(): simplify calling conventions
All callers treat 0 and 1 returned by expand_files() in the same way now since the call in alloc_fd() had been made conditional. Just make it return 0 on success and be done with it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
b8ea429d72 |
make __set_open_fd() set cloexec state as well
->close_on_exec[] state is maintained only for opened descriptors; as the result, anything that marks a descriptor opened has to set its cloexec state explicitly. As the result, all calls of __set_open_fd() are followed by __set_close_on_exec(); might as well fold it into __set_open_fd() so that cloexec state is defined as soon as the descriptor is marked opened. [braino fix folded] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
70d7f7dbd9
|
Merge patch series "File abstractions needed by Rust Binder"
Alice Ryhl <aliceryhl@google.com> says: This patchset contains the file abstractions needed by the Rust implementation of the Binder driver. Please see the Rust Binder RFC for usage examples: https://lore.kernel.org/rust-for-linux/20231101-rust-binder-v1-0-08ba9197f637@google.com Users of "rust: types: add `NotThreadSafe`": [PATCH 5/9] rust: file: add `FileDescriptorReservation` Users of "rust: task: add `Task::current_raw`": [PATCH 7/9] rust: file: add `Kuid` wrapper [PATCH 8/9] rust: file: add `DeferredFdCloser` Users of "rust: file: add Rust abstraction for `struct file`": [PATCH RFC 02/20] rust_binder: add binderfs support to Rust binder [PATCH RFC 03/20] rust_binder: add threading support Users of "rust: cred: add Rust abstraction for `struct cred`": [PATCH RFC 05/20] rust_binder: add nodes and context managers [PATCH RFC 06/20] rust_binder: add oneway transactions [PATCH RFC 11/20] rust_binder: send nodes in transaction [PATCH RFC 13/20] rust_binder: add BINDER_TYPE_FD support Users of "rust: security: add abstraction for secctx": [PATCH RFC 06/20] rust_binder: add oneway transactions Users of "rust: file: add `FileDescriptorReservation`": [PATCH RFC 13/20] rust_binder: add BINDER_TYPE_FD support [PATCH RFC 14/20] rust_binder: add BINDER_TYPE_FDA support Users of "rust: file: add `Kuid` wrapper": [PATCH RFC 05/20] rust_binder: add nodes and context managers [PATCH RFC 06/20] rust_binder: add oneway transactions Users of "rust: file: add abstraction for `poll_table`": [PATCH RFC 07/20] rust_binder: add epoll support This patchset has some uses of read_volatile in place of READ_ONCE. Please see the following rfc for context on this: https://lore.kernel.org/all/20231025195339.1431894-1-boqun.feng@gmail.com/ * patches from https://lore.kernel.org/r/20240915-alice-file-v10-0-88484f7a3dcf@google.com: rust: file: add abstraction for `poll_table` rust: file: add `Kuid` wrapper rust: file: add `FileDescriptorReservation` rust: security: add abstraction for secctx rust: cred: add Rust abstraction for `struct cred` rust: file: add Rust abstraction for `struct file` rust: task: add `Task::current_raw` rust: types: add `NotThreadSafe` Link: https://lore.kernel.org/r/20240915-alice-file-v10-0-88484f7a3dcf@google.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
e880d33b49 |
file.c: merge __{set,clear}_close_on_exec()
they are always go in pairs; seeing that they are inlined, might
as well make that a single inline function taking a boolean
argument ("do we want close_on_exec set for that descriptor")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
1d3b4bec3c |
alloc_fdtable(): change calling conventions.
First of all, tell it how many slots do we want, not which slot is wanted. It makes one caller (dup_fd()) more straightforward and doesn't harm another (expand_fdtable()). Furthermore, make it return ERR_PTR() on failure rather than returning NULL. Simplifies the callers. Simplify the size calculation, while we are at it - note that we always have slots_wanted greater than BITS_PER_LONG. What the rules boil down to is * use the smallest power of two large enough to give us that many slots * on 32bit skip 64 and 128 - the minimal capacity we want there is 256 slots (i.e. 1Kb fd array). * on 64bit don't skip anything, the minimal capacity is 128 - and we'll never be asked for 64 or less. 128 slots means 1Kb fd array, again. * on 128bit, if that ever happens, don't skip anything - we'll never be asked for 128 or less, so the fd array allocation will be at least 2Kb. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
0c40bf47cf |
fs/file.c: add fast path in find_next_fd()
Skip 2-levels searching via find_next_zero_bit() when there is free slot in the word contains next_fd, as: (1) next_fd indicates the lower bound for the first free fd. (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up searching. (3) After fdt is expanded (the bitmap size doubled for each time of expansion), it would never be shrunk. The search size increases but there are few open fds available here. This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by Jan Kara <jack@suse.cz>, which is more generic and scalable than previous versions. And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Yu Ma <yu.ma@intel.com> Link: https://lore.kernel.org/r/20240717145018.3972922-4-yu.ma@intel.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
c9a3019603 |
fs/file.c: conditionally clear full_fds
64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
likely that a bit in full_fds_bits has been cleared before in
__clear_open_fds()'s operation. Check the clear bit in full_fds_bits before
clearing to avoid unnecessary write and cache bouncing. See commit
|
|
|
|
52732bb9ab |
fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
alloc_fd() has a sanity check inside to make sure the struct file mapping to the allocated fd is NULL. Remove this sanity check since it can be assured by exisitng zero initilization and NULL set when recycling fd. Meanwhile, add likely/unlikely and expand_file() call avoidance to reduce the work under file_lock. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Yu Ma <yu.ma@intel.com> Link: https://lore.kernel.org/r/20240717145018.3972922-2-yu.ma@intel.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
cab0515211 |
move close_range(2) into fs/file.c, fold __close_range() into it
We never had callers for __close_range() except for close_range(2) itself. Nothing of that sort has appeared in four years and if any users do show up, we can always separate those suckers again. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
1fa4ffd8e6 |
close_files(): don't bother with xchg()
At that point nobody else has references to the victim files_struct; as the matter of fact, the caller will free it immediately after close_files() returns, with no RCU delays or anything of that sort. That's why we are not protecting against fdtable reallocation on expansion, not cleaning the bitmaps, etc. There's no point zeroing the pointers in ->fd[] either, let alone make that an atomic operation. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
8fd3395ec9 |
get rid of ...lookup...fdget_rcu() family
Once upon a time, predecessors of those used to do file lookup without bumping a refcount, provided that caller held rcu_read_lock() across the lookup and whatever it wanted to read from the struct file found. When struct file allocation switched to SLAB_TYPESAFE_BY_RCU, that stopped being feasible and these primitives started to bump the file refcount for lookup result, requiring the caller to call fput() afterwards. But that turned them pointless - e.g. rcu_read_lock(); file = lookup_fdget_rcu(fd); rcu_read_unlock(); is equivalent to file = fget_raw(fd); and all callers of lookup_fdget_rcu() are of that form. Similarly, task_lookup_fdget_rcu() calls can be replaced with calling fget_task(). task_lookup_next_fdget_rcu() doesn't have direct counterparts, but its callers would be happier if we replaced it with an analogue that deals with RCU internally. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
851849824b
|
rust: file: add Rust abstraction for `struct file`
This abstraction makes it possible to manipulate the open files for a process. The new `File` struct wraps the C `struct file`. When accessing it using the smart pointer `ARef<File>`, the pointer will own a reference count to the file. When accessing it as `&File`, then the reference does not own a refcount, but the borrow checker will ensure that the reference count does not hit zero while the `&File` is live. Since this is intended to manipulate the open files of a process, we introduce an `fget` constructor that corresponds to the C `fget` method. In future patches, it will become possible to create a new fd in a process and bind it to a `File`. Rust Binder will use these to send fds from one process to another. We also provide a method for accessing the file's flags. Rust Binder will use this to access the flags of the Binder fd to check whether the non-blocking flag is set, which affects what the Binder ioctl does. This introduces a struct for the EBADF error type, rather than just using the Error type directly. This has two advantages: * `File::fget` returns a `Result<ARef<File>, BadFdError>`, which the compiler will represent as a single pointer, with null being an error. This is possible because the compiler understands that `BadFdError` has only one possible value, and it also understands that the `ARef<File>` smart pointer is guaranteed non-null. * Additionally, we promise to users of the method that the method can only fail with EBADF, which means that they can rely on this promise without having to inspect its implementation. That said, there are also two disadvantages: * Defining additional error types involves boilerplate. * The question mark operator will only utilize the `From` trait once, which prevents you from using the question mark operator on `BadFdError` in methods that return some third error type that the kernel `Error` is convertible into. (However, it works fine in methods that return `Error`.) Signed-off-by: Wedson Almeida Filho <wedsonaf@gmail.com> Co-developed-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Co-developed-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Benno Lossin <benno.lossin@proton.me> Signed-off-by: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/r/20240915-alice-file-v10-3-88484f7a3dcf@google.com Reviewed-by: Gary Guo <gary@garyguo.net> Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
678379e1d4 |
close_range(): fix the logics in descriptor table trimming
Cloning a descriptor table picks the size that would cover all currently opened files. That's fine for clone() and unshare(), but for close_range() there's an additional twist - we clone before we close, and it would be a shame to have close_range(3, ~0U, CLOSE_RANGE_UNSHARE) leave us with a huge descriptor table when we are not going to keep anything past stderr, just because some large file descriptor used to be open before our call has taken it out. Unfortunately, it had been dealt with in an inherently racy way - sane_fdtable_size() gets a "don't copy anything past that" argument (passed via unshare_fd() and dup_fd()), close_range() decides how much should be trimmed and passes that to unshare_fd(). The problem is, a range that used to extend to the end of descriptor table back when close_range() had looked at it might very well have stuff grown after it by the time dup_fd() has allocated a new files_struct and started to figure out the capacity of fdtable to be attached to that. That leads to interesting pathological cases; at the very least it's a QoI issue, since unshare(CLONE_FILES) is atomic in a sense that it takes a snapshot of descriptor table one might have observed at some point. Since CLOSE_RANGE_UNSHARE close_range() is supposed to be a combination of unshare(CLONE_FILES) with plain close_range(), ending up with a weird state that would never occur with unshare(2) is confusing, to put it mildly. It's not hard to get rid of - all it takes is passing both ends of the range down to sane_fdtable_size(). There we are under ->files_lock, so the race is trivially avoided. So we do the following: * switch close_files() from calling unshare_fd() to calling dup_fd(). * undo the calling convention change done to unshare_fd() in |
|
|
|
f8ffbc365f |
struct fd layout change (and conversion to accessor helpers)
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZvDNmgAKCRBZ7Krx/gZQ 63zrAP9vI0rf55v27twiabe9LnI7aSx5ckoqXxFIFxyT3dOYpQD/bPmoApnWDD3d 592+iDgLsema/H/0/CqfqlaNtDNY8Q0= =HUl5 -----END PGP SIGNATURE----- Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers" * tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add struct fd constructors, get rid of __to_fd() struct fd: representation change introduce fd_file(), convert all accessors to it. |
|
|
|
215ab0d8af |
file: remove outdated comment after close_fd()
Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: linux-fsdevel@vger.kernel.org The comment on EXPORT_SYMBOL(close_fd) was added in commit |
|
|
|
de12c3391b |
add struct fd constructors, get rid of __to_fd()
Make __fdget() et.al. return struct fd directly. New helpers: BORROWED_FD(file) and CLONED_FD(file), for borrowed and cloned file references resp. NOTE: this might need tuning; in particular, inline on __fget_light() is there to keep the code generation same as before - we probably want to keep it inlined in fdget() et.al. (especially so in fdget_pos()), but that needs profiling. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
9a2fa14720 |
fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE
copy_fd_bitmaps(new, old, count) is expected to copy the first count/BITS_PER_LONG bits from old->full_fds_bits[] and fill the rest with zeroes. What it does is copying enough words (BITS_TO_LONGS(count/BITS_PER_LONG)), then memsets the rest. That works fine, *if* all bits past the cutoff point are clear. Otherwise we are risking garbage from the last word we'd copied. For most of the callers that is true - expand_fdtable() has count equal to old->max_fds, so there's no open descriptors past count, let alone fully occupied words in ->open_fds[], which is what bits in ->full_fds_bits[] correspond to. The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds), which is the smallest multiple of BITS_PER_LONG that covers all opened descriptors below max_fds. In the common case (copying on fork()) max_fds is ~0U, so all opened descriptors will be below it and we are fine, by the same reasons why the call in expand_fdtable() is safe. Unfortunately, there is a case where max_fds is less than that and where we might, indeed, end up with junk in ->full_fds_bits[] - close_range(from, to, CLOSE_RANGE_UNSHARE) with * descriptor table being currently shared * 'to' being above the current capacity of descriptor table * 'from' being just under some chunk of opened descriptors. In that case we end up with observably wrong behaviour - e.g. spawn a child with CLONE_FILES, get all descriptors in range 0..127 open, then close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending up with descriptor #128, despite #64 being observably not open. The minimally invasive fix would be to deal with that in dup_fd(). If this proves to add measurable overhead, we can go that way, but let's try to fix copy_fd_bitmaps() first. * new helper: bitmap_copy_and_expand(to, from, bits_to_copy, size). * make copy_fd_bitmaps() take the bitmap size in words, rather than bits; it's 'count' argument is always a multiple of BITS_PER_LONG, so we are not losing any information, and that way we can use the same helper for all three bitmaps - compiler will see that count is a multiple of BITS_PER_LONG for the large ones, so it'll generate plain memcpy()+memset(). Reproducer added to tools/testing/selftests/core/close_range_test.c Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
8aa37bde1a |
protect the fetch of ->fd[fd] in do_dup2() from mispredictions
both callers have verified that fd is not greater than ->max_fds;
however, misprediction might end up with
tofree = fdt->fd[fd];
being speculatively executed. That's wrong for the same reasons
why it's wrong in close_fd()/file_close_fd_locked(); the same
solution applies - array_index_nospec(fd, fdt->max_fds) could differ
from fd only in case of speculative execution on mispredicted path.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
ed8c7fbdfe |
fs/file: fix the check in find_next_fd()
The maximum possible return value of find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) is maxbit. This return value, multiplied by BITS_PER_LONG, gives the value of bitbit, which can never be greater than maxfd, it can only be equal to maxfd at most, so the following check 'if (bitbit > maxfd)' will never be true. Moreover, when bitbit equals maxfd, it indicates that there are no unused fds, and the function can directly return. Fix this check. Signed-off-by: Yuntao Wang <yuntao.wang@linux.dev> Link: https://lore.kernel.org/r/20240529160656.209352-1-yuntao.wang@linux.dev Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
613aee94dd |
get_file_rcu(): no need to check for NULL separately
IS_ERR(NULL) is false and IS_ERR() already comes with unlikely()... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
c4aab26253 |
fd_is_open(): move to fs/file.c
no users outside that... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
f60d374d2c |
close_on_exec(): pass files_struct instead of fdtable
both callers are happier that way... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
4e94ddfe2a
|
file: remove __receive_fd()
Honestly, there's little value in having a helper with and without that int __user *ufd argument. It's just messy and doesn't really give us anything. Just expose receive_fd() with that argument and get rid of that helper. Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-5-e73ca6f4ea83@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
24fa3ae946
|
file: remove pointless wrapper
Only io_uring uses __close_fd_get_file(). All it does is hide current->files but io_uring accesses files_struct directly right now anyway so it's a bit pointless. Just rename pick_file() to file_close_fd_locked() and let io_uring use it. Add a lockdep assert in there that we expect the caller to hold file_lock while we're at it. Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-2-e73ca6f4ea83@kernel.org Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
a88c955fcf
|
file: s/close_fd_get_file()/file_close_fd()/g
That really shouldn't have "get" in there as that implies we're bumping the reference count which we don't do at all. We used to but not anmore. Now we're just closing the fd and pick that file from the fdtable without bumping the reference count. Update the wrong documentation while at it. Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-1-e73ca6f4ea83@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
253ca8678d
|
Improve __fget_files_rcu() code generation (and thus __fget_light())
Commit
|
|
|
|
61d4fb0b34
|
file, i915: fix file reference for mmap_singleton()
Today we got a report at [1] for rcu stalls on the i915 testsuite in [2]
due to the conversion of files to SLAB_TYPSSAFE_BY_RCU. Afaict,
get_file_rcu() goes into an infinite loop trying to carefully verify
that i915->gem.mmap_singleton hasn't changed - see the splat below.
So I stared at this code to figure out what it actually does. It seems
that the i915->gem.mmap_singleton pointer itself never had rcu semantics.
The i915->gem.mmap_singleton is replaced in
file->f_op->release::singleton_release():
static int singleton_release(struct inode *inode, struct file *file)
{
struct drm_i915_private *i915 = file->private_data;
cmpxchg(&i915->gem.mmap_singleton, file, NULL);
drm_dev_put(&i915->drm);
return 0;
}
The cmpxchg() is ordered against a concurrent update of
i915->gem.mmap_singleton from mmap_singleton(). IOW, when
mmap_singleton() fails to get a reference on i915->gem.mmap_singleton:
While mmap_singleton() does
rcu_read_lock();
file = get_file_rcu(&i915->gem.mmap_singleton);
rcu_read_unlock();
it allocates a new file via anon_inode_getfile() and does
smp_store_mb(i915->gem.mmap_singleton, file);
So, then what happens in the case of this bug is that at some point
fput() is called and drops the file->f_count to zero leaving the pointer
in i915->gem.mmap_singleton in tact.
Now, there might be delays until
file->f_op->release::singleton_release() is called and
i915->gem.mmap_singleton is set to NULL.
Say concurrently another task hits mmap_singleton() and does:
rcu_read_lock();
file = get_file_rcu(&i915->gem.mmap_singleton);
rcu_read_unlock();
When get_file_rcu() fails to get a reference via atomic_inc_not_zero()
it will try the reload from i915->gem.mmap_singleton expecting it to be
NULL, assuming it has comparable semantics as we expect in
__fget_files_rcu().
But it hasn't so it reloads the same pointer again, trying the same
atomic_inc_not_zero() again and doing so until
file->f_op->release::singleton_release() of the old file has been
called.
So, in contrast to __fget_files_rcu() here we want to not retry when
atomic_inc_not_zero() has failed. We only want to retry in case we
managed to get a reference but the pointer did change on reload.
<3> [511.395679] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
<3> [511.395716] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-9): P6238
<3> [511.395934] rcu: (detected by 16, t=65002 jiffies, g=123977, q=439 ncpus=20)
<6> [511.395944] task:i915_selftest state:R running task stack:10568 pid:6238 tgid:6238 ppid:1001 flags:0x00004002
<6> [511.395962] Call Trace:
<6> [511.395966] <TASK>
<6> [511.395974] ? __schedule+0x3a8/0xd70
<6> [511.395995] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
<6> [511.396003] ? lockdep_hardirqs_on+0xc3/0x140
<6> [511.396013] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
<6> [511.396029] ? get_file_rcu+0x10/0x30
<6> [511.396039] ? get_file_rcu+0x10/0x30
<6> [511.396046] ? i915_gem_object_mmap+0xbc/0x450 [i915]
<6> [511.396509] ? i915_gem_mmap+0x272/0x480 [i915]
<6> [511.396903] ? mmap_region+0x253/0xb60
<6> [511.396925] ? do_mmap+0x334/0x5c0
<6> [511.396939] ? vm_mmap_pgoff+0x9f/0x1c0
<6> [511.396949] ? rcu_is_watching+0x11/0x50
<6> [511.396962] ? igt_mmap_offset+0xfc/0x110 [i915]
<6> [511.397376] ? __igt_mmap+0xb3/0x570 [i915]
<6> [511.397762] ? igt_mmap+0x11e/0x150 [i915]
<6> [511.398139] ? __trace_bprintk+0x76/0x90
<6> [511.398156] ? __i915_subtests+0xbf/0x240 [i915]
<6> [511.398586] ? __pfx___i915_live_setup+0x10/0x10 [i915]
<6> [511.399001] ? __pfx___i915_live_teardown+0x10/0x10 [i915]
<6> [511.399433] ? __run_selftests+0xbc/0x1a0 [i915]
<6> [511.399875] ? i915_live_selftests+0x4b/0x90 [i915]
<6> [511.400308] ? i915_pci_probe+0x106/0x200 [i915]
<6> [511.400692] ? pci_device_probe+0x95/0x120
<6> [511.400704] ? really_probe+0x164/0x3c0
<6> [511.400715] ? __pfx___driver_attach+0x10/0x10
<6> [511.400722] ? __driver_probe_device+0x73/0x160
<6> [511.400731] ? driver_probe_device+0x19/0xa0
<6> [511.400741] ? __driver_attach+0xb6/0x180
<6> [511.400749] ? __pfx___driver_attach+0x10/0x10
<6> [511.400756] ? bus_for_each_dev+0x77/0xd0
<6> [511.400770] ? bus_add_driver+0x114/0x210
<6> [511.400781] ? driver_register+0x5b/0x110
<6> [511.400791] ? i915_init+0x23/0xc0 [i915]
<6> [511.401153] ? __pfx_i915_init+0x10/0x10 [i915]
<6> [511.401503] ? do_one_initcall+0x57/0x270
<6> [511.401515] ? rcu_is_watching+0x11/0x50
<6> [511.401521] ? kmalloc_trace+0xa3/0xb0
<6> [511.401532] ? do_init_module+0x5f/0x210
<6> [511.401544] ? load_module+0x1d00/0x1f60
<6> [511.401581] ? init_module_from_file+0x86/0xd0
<6> [511.401590] ? init_module_from_file+0x86/0xd0
<6> [511.401613] ? idempotent_init_module+0x17c/0x230
<6> [511.401639] ? __x64_sys_finit_module+0x56/0xb0
<6> [511.401650] ? do_syscall_64+0x3c/0x90
<6> [511.401659] ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
<6> [511.401684] </TASK>
Link: [1]: https://lore.kernel.org/intel-gfx/SJ1PR11MB6129CB39EED831784C331BAFB9DEA@SJ1PR11MB6129.namprd11.prod.outlook.com
Link: [2]: https://intel-gfx-ci.01.org/tree/linux-next/next-20231013/bat-dg2-11/igt@i915_selftest@live@mman.html#dmesg-warnings10963
Cc: Jann Horn <jannh@google.com>,
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231025-formfrage-watscheln-84526cd3bd7d@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
|
|
6cf41fcfe0
|
backing file: free directly
Backing files as used by overlayfs are never installed into file descriptor tables and are explicitly documented as such. They aren't subject to rcu access conditions like regular files are. Their lifetime is bound to the lifetime of the overlayfs file, i.e., they're stashed in ovl_file->private_data and go away otherwise. If they're set as vma->vm_file - which is their main purpose - then they're subject to regular refcount rules and vma->vm_file can't be installed into an fdtable after having been set. All in all I don't see any need for rcu delay here. So free it directly. This all hinges on such hybrid beasts to never actually be installed into fdtables which - as mentioned before - is not allowed. So add an explicit WARN_ON_ONCE() so we catch any case where someone is suddenly trying to install one of those things into a file descriptor table so we can have a nice long chat with them. Link: https://lore.kernel.org/r/20231005-sakralbau-wappnen-f5c31755ed70@brauner Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
0ede61d858
|
file: convert to SLAB_TYPESAFE_BY_RCU
In recent discussions around some performance improvements in the file handling area we discussed switching the file cache to rely on SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based freeing for files completely. This is a pretty sensitive change overall but it might actually be worth doing. The main downside is the subtlety. The other one is that we should really wait for Jann's patch to land that enables KASAN to handle SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this exists. With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times which requires a few changes. So it isn't sufficient anymore to just acquire a reference to the file in question under rcu using atomic_long_inc_not_zero() since the file might have already been recycled and someone else might have bumped the reference. In other words, callers might see reference count bumps from newer users. For this reason it is necessary to verify that the pointer is the same before and after the reference count increment. This pattern can be seen in get_file_rcu() and __files_get_rcu(). In addition, it isn't possible to access or check fields in struct file without first aqcuiring a reference on it. Not doing that was always very dodgy and it was only usable for non-pointer data in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a reference under rcu or they must hold the files_lock of the fdtable. Failing to do either one of this is a bug. Thanks to Jann for pointing out that we need to ensure memory ordering between reallocations and pointer check by ensuring that all subsequent loads have a dependency on the second load in get_file_rcu() and providing a fixup that was folded into this patch. Cc: Jann Horn <jannh@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
de16588a77 |
v6.6-vfs.misc
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTxQAKCRCRxhvAZXjc
okaVAP94WAlItvDRt/z2Wtzf0+RqPZeTXEdGTxua8+RxqCyYIQD+OO5nRfKQPHlV
AqqGJMKItQMSMIYgB5ftqVhNWZfnHgM=
=pSEW
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual filesystems.
Features:
- Block mode changes on symlinks and rectify our broken semantics
- Report file modifications via fsnotify() for splice
- Allow specifying an explicit timeout for the "rootwait" kernel
command line option. This allows to timeout and reboot instead of
always waiting indefinitely for the root device to show up
- Use synchronous fput for the close system call
Cleanups:
- Get rid of open-coded lockdep workarounds for async io submitters
and replace it all with a single consolidated helper
- Simplify epoll allocation helper
- Convert simple_write_begin and simple_write_end to use a folio
- Convert page_cache_pipe_buf_confirm() to use a folio
- Simplify __range_close to avoid pointless locking
- Disable per-cpu buffer head cache for isolated cpus
- Port ecryptfs to kmap_local_page() api
- Remove redundant initialization of pointer buf in pipe code
- Unexport the d_genocide() function which is only used within core
vfs
- Replace printk(KERN_ERR) and WARN_ON() with WARN()
Fixes:
- Fix various kernel-doc issues
- Fix refcount underflow for eventfds when used as EFD_SEMAPHORE
- Fix a mainly theoretical issue in devpts
- Check the return value of __getblk() in reiserfs
- Fix a racy assert in i_readcount_dec
- Fix integer conversion issues in various functions
- Fix LSM security context handling during automounts that prevented
NFS superblock sharing"
* tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
cachefiles: use kiocb_{start,end}_write() helpers
ovl: use kiocb_{start,end}_write() helpers
aio: use kiocb_{start,end}_write() helpers
io_uring: use kiocb_{start,end}_write() helpers
fs: create kiocb_{start,end}_write() helpers
fs: add kerneldoc to file_{start,end}_write() helpers
io_uring: rename kiocb_end_write() local helper
splice: Convert page_cache_pipe_buf_confirm() to use a folio
libfs: Convert simple_write_begin and simple_write_end to use a folio
fs/dcache: Replace printk and WARN_ON by WARN
fs/pipe: remove redundant initialization of pointer buf
fs: Fix kernel-doc warnings
devpts: Fix kernel-doc warnings
doc: idmappings: fix an error and rephrase a paragraph
init: Add support for rootwait timeout parameter
vfs: fix up the assert in i_readcount_dec
fs: Fix one kernel-doc comment
docs: filesystems: idmappings: clarify from where idmappings are taken
fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs
vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing
...
|
|
|
|
35931eb394 |
fs: Fix kernel-doc warnings
These have a variety of causes and a corresponding variety of solutions. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Message-Id: <20230818200824.2720007-1-willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
7d84d1b9af
|
fs: rely on ->iterate_shared to determine f_pos locking
Now that we removed ->iterate we don't need to check for either ->iterate or ->iterate_shared in file_needs_f_pos_lock(). Simply check for ->iterate_shared instead. This will tell us whether we need to unconditionally take the lock. Not just does it allow us to avoid checking f_inode's mode it also actually clearly shows that we're locking because of readdir. Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
797964253d |
file: reinstate f_pos locking optimization for regular files
In commit
|
|
|
|
ed192c59f8 |
file: mostly eliminate spurious relocking in __range_close
Stock code takes a lock trip for every fd in range, but this can be trivially avoided and real-world consumers do have plenty of already closed cases. Just booting Debian 12 with a debug printk shows: (sh) min 3 max 17 closed 15 empty 0 (sh) min 19 max 63 closed 31 empty 14 (sh) min 4 max 63 closed 0 empty 60 (spawn) min 3 max 63 closed 13 empty 48 (spawn) min 3 max 63 closed 13 empty 48 (mount) min 3 max 17 closed 15 empty 0 (mount) min 19 max 63 closed 32 empty 13 and so on. While here use more idiomatic naming. An avoidable relock is left in place to avoid uglifying the code. The code was not switched to bitmap traversal for the same reason. Tested with ltp kernel/syscalls/close_range Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Message-Id: <20230727113809.800067-1-mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
20ea1e7d13 |
file: always lock position for FMODE_ATOMIC_POS
The pidfd_getfd() system call allows a caller with ptrace_may_access()
abilities on another process to steal a file descriptor from this
process. This system call is used by debuggers, container runtimes,
system call supervisors, networking proxies etc. So while it is a
special interest system call it is used in common tools.
That ability ends up breaking our long-time optimization in fdget_pos(),
which "knew" that if we had exclusive access to the file descriptor
nobody else could access it, and we didn't need the lock for the file
position.
That check for file_count(file) was always fairly subtle - it depended
on __fdget() not incrementing the file count for single-threaded
processes and thus included that as part of the rule - but it did mean
that we didn't need to take the lock in all those traditional unix
process contexts.
So it's sad to see this go, and I'd love to have some way to re-instate
the optimization. At the same time, the lock obviously isn't ever
contended in the case we optimized, so all we were optimizing away is
the atomics and the cacheline dirtying. Let's see if anybody even
notices that the optimization is gone.
Link: https://lore.kernel.org/linux-fsdevel/20230724-vfs-fdget_pos-v1-1-a4abfd7103f3@kernel.org/
Fixes:
|
|
|
|
609d544414 |
fs: prevent out-of-bounds array speculation when closing a file descriptor
Google-Bug-Id: 114199369 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
7ee47dcfff |
fs: use acquire ordering in __fget_light()
We must prevent the CPU from reordering the files->count read with the
FD table access like this, on architectures where read-read reordering is
possible:
files_lookup_fd_raw()
close_fd()
put_files_struct()
atomic_read(&files->count)
I would like to mark this for stable, but the stable rules explicitly say
"no theoretical races", and given that the FD table pointer and
files->count are explicitly stored in the same cacheline, this sort of
reordering seems quite unlikely in practice...
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
4480c27ca3 |
gfs2: Add glockfd debugfs file
When a process has a gfs2 file open, the file is keeping a reference on the underlying gfs2 inode, and the inode is keeping the inode's iopen glock held in shared mode. In other words, the process depends on the iopen glock of each open gfs2 file. Expose those dependencies in a new "glockfd" debugfs file. The new debugfs file contains one line for each gfs2 file descriptor, specifying the tgid, file descriptor number, and glock name, e.g., 1601 6 5/816d This list is compiled by iterating all tasks on the system using find_ge_pid(), and all file descriptors of each task using task_lookup_next_fd_rcu(). To make that work from gfs2, export those two functions. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> |
|
|
|
40a1926022 |
fix the breakage in close_fd_get_file() calling conventions change
It used to grab an extra reference to struct file rather than
just transferring to caller the one it had removed from descriptor
table. New variant doesn't, and callers need to be adjusted.
Reported-and-tested-by: syzbot+47dd250f527cb7bebf24@syzkaller.appspotmail.com
Fixes:
|
|
|
|
6319194ec5 |
Unify the primitives for file descriptor closing
Currently we have 3 primitives for removing an opened file from descriptor
table - pick_file(), __close_fd_get_file() and close_fd_get_file(). Their
calling conventions are rather odd and there's a code duplication for no
good reason. They can be unified -
1) have __range_close() cap max_fd in the very beginning; that way
we don't need separate way for pick_file() to report being past the end
of descriptor table.
2) make {__,}close_fd_get_file() return file (or NULL) directly, rather
than returning it via struct file ** argument. Don't bother with
(bogus) return value - nobody wants that -ENOENT.
3) make pick_file() return NULL on unopened descriptor - the only caller
that used to care about the distinction between descriptor past the end
of descriptor table and finding NULL in descriptor table doesn't give
a damn after (1).
4) lift ->files_lock out of pick_file()
That actually simplifies the callers, as well as the primitives themselves.
Code duplication is also gone...
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
81132a39c1 |
fs: remove fget_many and fput_many interface
These two interface were added in |
|
|
|
d888c83fce |
fs: fix fd table size alignment properly
Jason Donenfeld reports that my commit |
|
|
|
1c24a18639 |
fs: fd tables have to be multiples of BITS_PER_LONG
This has always been the rule: fdtables have several bitmaps in them,
and as a result they have to be sized properly for bitmaps. We walk
those bitmaps in chunks of 'unsigned long' in serveral cases, but even
when we don't, we use the regular kernel bitops that are defined to work
on arrays of 'unsigned long', not on some byte array.
Now, the distinction between arrays of bytes and 'unsigned long'
normally only really ends up being noticeable on big-endian systems, but
Fedor Pchelkin and Alexey Khoroshilov reported that copy_fd_bitmaps()
could be called with an argument that wasn't even a multiple of
BITS_PER_BYTE. And then it fails to do the proper copy even on
little-endian machines.
The bug wasn't in copy_fd_bitmap(), but in sane_fdtable_size(), which
didn't actually sanitize the fdtable size sufficiently, and never made
sure it had the proper BITS_PER_LONG alignment.
That's partly because the alignment historically came not from having to
explicitly align things, but simply from previous fdtable sizes, and
from count_open_files(), which counts the file descriptors by walking
them one 'unsigned long' word at a time and thus naturally ends up doing
sizing in the proper 'chunks of unsigned long'.
But with the introduction of close_range(), we now have an external
source of "this is how many files we want to have", and so
sane_fdtable_size() needs to do a better job.
This also adds that explicit alignment to alloc_fdtable(), although
there it is mainly just for documentation at a source code level. The
arithmetic we do there to pick a reasonable fdtable size already aligns
the result sufficiently.
In fact,clang notices that the added ALIGN() in that function doesn't
actually do anything, and does not generate any extra code for it.
It turns out that gcc ends up confusing itself by combining a previous
constant-sized shift operation with the variable-sized shift operations
in roundup_pow_of_two(). And probably due to that doesn't notice that
the ALIGN() is a no-op. But that's a (tiny) gcc misfeature that doesn't
matter. Having the explicit alignment makes sense, and would actually
matter on a 128-bit architecture if we ever go there.
This also adds big comments above both functions about how fdtable sizes
have to have that BITS_PER_LONG alignment.
Fixes:
|
|
|
|
e386dfc56f |
fget: clarify and improve __fget_files() implementation
Commit
|
|
|
|
054aa8d439 |
fget: check that the fd still exists after getting a ref to it
Jann Horn points out that there is another possible race wrt Unix domain socket garbage collection, somewhat reminiscent of the one fixed in commit |