Merge tag 'vfs-6.19-rc1.minix' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull minix fixes from Christian Brauner:
"Fix two syzbot corruption bugs in the minix filesystem.
Syzbot fuzzes filesystems by trying to mount and manipulate
deliberately corrupted images. This should not lead to BUG_ONs and
WARN_ONs for easy-to-detect corruptions.
- Add error handling to minix filesystem for inode corruption
detection, enabling the filesystem to report such corruptions
cleanly.
- Fix a drop_nlink warning in minix_rmdir() triggered by corrupted
directory link counts.
- Fix a drop_nlink warning in minix_rename() triggered by corrupted
inode link counts"
* tag 'vfs-6.19-rc1.minix' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Fix a drop_nlink warning in minix_rename
Fix a drop_nlink warning in minix_rmdir
Add error handling to minix filesystem for inode corruption detection
Merge tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull superblock lock guard updates from Christian Brauner:
"This starts the work of introducing guards for superblock related
locks.
Introduce super_write_guard for scoped superblock write protection.
This provides a guard-based alternative to the manual sb_start_write()
and sb_end_write() pattern, allowing the compiler to automatically
handle the cleanup"
* tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
xfs: use super write guard in xfs_file_ioctl()
open: use super write guard in do_ftruncate()
btrfs: use super write guard in relocating_repair_kthread()
ext4: use super write guard in write_mmp_block()
btrfs: use super write guard in sb_start_write()
btrfs: use super write guard in btrfs_run_defrag_inode()
btrfs: use super write guard in btrfs_reclaim_bgs_work()
fs: add super_write_guard
Merge tag 'vfs-6.19-rc1.fs_header' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fs header updates from Christian Brauner:
"This contains initial work to start splitting up fs.h.
Begin the long-overdue work of splitting up the monolithic fs.h
header. The header has grown to over 3000 lines and includes types and
functions for many different subsystems, making it difficult to
navigate and causing excessive compilation dependencies.
This series introduces new focused headers for superblock-related
code:
- Rename fs_types.h to fs_dirent.h to better reflect its actual
content (directory entry types)
- Add fs/super_types.h containing superblock type definitions
- Add fs/super.h containing superblock function declarations
This is the first step in a longer effort to modularize the VFS
headers.
Cleanups:
- Inode Field Layout Optimization (Mateusz Guzik)
Move inode fields used during fast path lookup closer together to
improve cache locality during path resolution.
- current_umask() Optimization (Mateusz Guzik)
Inline current_umask() and move it to fs_struct.h. This improves
performance by avoiding function call overhead for this
frequently-used function, and places it in a more appropriate
header since it operates on fs_struct"
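The helper itself is tiny, which is what makes inlining it worthwhile; its long-standing out-of-line body is just:

    /* now a static inline in fs_struct.h instead of a function call */
    static inline int current_umask(void)
    {
            return current->fs->umask;
    }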
* tag 'vfs-6.19-rc1.fs_header' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: move inode fields used during fast path lookup closer together
fs: inline current_umask() and move it to fs_struct.h
fs: add fs/super.h header
fs: add fs/super_types.h header
fs: rename fs_types.h to fs_dirent.h
Merge tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull cred guard updates from Christian Brauner:
"This contains substantial credential infrastructure improvements
adding guard-based credential management that simplifies code and
eliminates manual reference counting in many subsystems.
Features:
- Kernel Credential Guards
Add with_kernel_creds() and scoped_with_kernel_creds() guards that
allow using the kernel credentials without allocating and copying
them. This was requested by Linus after seeing repeated
prepare_kernel_cred() calls that duplicate the kernel credentials
only to drop them again later.
The new guards completely avoid the allocation and never expose the
temporary variable to hold the kernel credentials anywhere in
callers.
- Generic Credential Guards
Add scoped_with_creds() guards for the common override_creds() and
revert_creds() pattern. This builds on earlier work that made
override_creds()/revert_creds() completely reference count free.
- Prepare Credential Guards
Add prepare credential guards for the more complex pattern of
preparing a new set of credentials and overriding the current
credentials with them:
- prepare_creds()
- modify new creds
- override_creds()
- revert_creds()
- put_cred()
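A hedged before/after sketch of those patterns (only the guard names quoted above are from the series; the macro shapes are assumptions and do_work() is a stand-in):

    /* before: duplicate the kernel creds only to drop them again */
    struct cred *kcred = prepare_kernel_cred(&init_task);
    const struct cred *old = override_creds(kcred);
    do_work();
    revert_creds(old);
    put_cred(kcred);

    /* after (sketch): no allocation, no visible temporary */
    scoped_with_kernel_creds() {
            do_work();
    }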
Cleanups:
- Make init_cred static since it should not be directly accessed
- Add kernel_cred() helper to properly access the kernel credentials
- Fix scoped_class() macro that was introduced two cycles ago
- coredump: split out do_coredump() from vfs_coredump() for cleaner
credential handling
- coredump: move revert_cred() before coredump_cleanup()
- coredump: mark struct mm_struct as const
- coredump: pass struct linux_binfmt as const
- sev-dev: use guard for path"
* tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
trace: use override credential guard
trace: use prepare credential guard
coredump: use override credential guard
coredump: use prepare credential guard
coredump: split out do_coredump() from vfs_coredump()
coredump: mark struct mm_struct as const
coredump: pass struct linux_binfmt as const
coredump: move revert_cred() before coredump_cleanup()
sev-dev: use override credential guards
sev-dev: use prepare credential guard
sev-dev: use guard for path
cred: add prepare credential guard
net/dns_resolver: use credential guards in dns_query()
cgroup: use credential guards in cgroup_attach_permissions()
acct: use credential guards in acct_write_process()
smb: use credential guards in cifs_get_spnego_key()
nfs: use credential guards in nfs_idmap_get_key()
nfs: use credential guards in nfs_local_call_write()
nfs: use credential guards in nfs_local_call_read()
erofs: use credential guards
...
Merge tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull folio updates from Christian Brauner:
"Add a new folio_next_pos() helper function that returns the file
position of the first byte after the current folio. This is a common
operation in filesystems when needing to know the end of the current
folio.
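The helper presumably reduces to the obvious arithmetic, as a sketch:

    static inline loff_t folio_next_pos(const struct folio *folio)
    {
            return folio_pos(folio) + folio_size(folio);
    }

The 32-bit ocfs2 bug mentioned below presumably came from open-coding this with an unsigned long page-index shift, which overflows once the file position passes 2GiB.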
The helper is lifted from btrfs which already had its own version, and
is now used across multiple filesystems and subsystems:
- btrfs
- buffer
- ext4
- f2fs
- gfs2
- iomap
- netfs
- xfs
- mm
This fixes a long-standing bug in ocfs2 on 32-bit systems with files
larger than 2GiB. Presumably this is not a common configuration, but
the fix is backported anyway. The other filesystems did not have bugs,
they were just mildly inefficient.
This also introduces uoff_t as the unsigned version of loff_t. A recent
commit inadvertently changed a comparison from being unsigned (on
64-bit systems) to being signed (which it had always been on 32-bit
systems), leading to sporadic fstests failures.
Generally file sizes are restricted to being a signed integer, but in
places where -1 is passed to indicate "up to the end of the file", it
is convenient to have an unsigned type to ensure comparisons are
always unsigned regardless of architecture"
* tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Add uoff_t
mm: Use folio_next_pos()
xfs: Use folio_next_pos()
netfs: Use folio_next_pos()
iomap: Use folio_next_pos()
gfs2: Use folio_next_pos()
f2fs: Use folio_next_pos()
ext4: Use folio_next_pos()
buffer: Use folio_next_pos()
btrfs: Use folio_next_pos()
filemap: Add folio_next_pos()
Merge tag 'vfs-6.19-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfd and coredump updates from Christian Brauner:
"Features:
- Expose coredump signal via pidfd
Expose the signal that caused the coredump through the pidfd
interface. The recent changes to rework coredump handling to rely
on unix sockets are in the process of being used in systemd. The
previous systemd coredump container interface requires the coredump
file descriptor and basic information including the signal number
to be sent to the container. This means the signal number needs to
be available before sending the coredump to the container.
- Add supported_mask field to pidfd
Add a new supported_mask field to struct pidfd_info that indicates
which information fields are supported by the running kernel. This
allows userspace to detect feature availability without relying on
error codes or kernel version checks.
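In userspace this enables feature probing along these lines (a sketch; PIDFD_INFO_COREDUMP_SIGNAL follows the selftest commit subjects, and the exact probing convention is an assumption):

    struct pidfd_info info = {
            .mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP,
    };

    if (ioctl(pidfd, PIDFD_GET_INFO, &info) == 0 &&
        (info.supported_mask & PIDFD_INFO_COREDUMP_SIGNAL))
            /* this kernel can report the signal that caused a coredump */;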
Cleanups:
- Drop struct pidfs_exit_info and prepare to drop exit_info pointer,
simplifying the internal publication mechanism for exit and
coredump information retrievable via the pidfd ioctl
- Use guard() for task_lock in pidfs
- Reduce wait_pidfd lock scope
- Add missing PIDFD_INFO_SIZE_VER1 constant
- Add missing BUILD_BUG_ON() assert on struct pidfd_info
Fixes:
- Fix PIDFD_INFO_COREDUMP handling
Selftests:
- Split out coredump socket tests and common helpers into separate
files for better organization
- Fix userspace coredump client detection issues
- Handle edge-triggered epoll correctly
- Ignore ENOSPC errors in tests
- Add debug logging to coredump socket tests, socket protocol tests,
and test helpers
- Add tests for PIDFD_INFO_COREDUMP_SIGNAL
- Add tests for supported_mask field
- Update pidfd header for selftests"
* tag 'vfs-6.19-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits)
pidfs: reduce wait_pidfd lock scope
selftests/coredump: add second PIDFD_INFO_COREDUMP_SIGNAL test
selftests/coredump: add first PIDFD_INFO_COREDUMP_SIGNAL test
selftests/coredump: ignore ENOSPC errors
selftests/coredump: add debug logging to coredump socket protocol tests
selftests/coredump: add debug logging to coredump socket tests
selftests/coredump: add debug logging to test helpers
selftests/coredump: handle edge-triggered epoll correctly
selftests/coredump: fix userspace coredump client detection
selftests/coredump: fix userspace client detection
selftests/coredump: split out coredump socket tests
selftests/coredump: split out common helpers
selftests/pidfd: add second supported_mask test
selftests/pidfd: add first supported_mask test
selftests/pidfd: update pidfd header
pidfs: expose coredump signal
pidfs: drop struct pidfs_exit_info
pidfs: prepare to drop exit_info pointer
pidfd: add a new supported_mask field
pidfs: add missing BUILD_BUG_ON() assert on struct pidfd_info
...
Merge tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull namespace updates from Christian Brauner:
"This contains substantial namespace infrastructure changes including a new
system call, active reference counting, and extensive header cleanups.
The branch depends on the shared kbuild branch for -fms-extensions support.
Features:
- listns() system call
Add a new listns() system call that allows userspace to iterate
through namespaces in the system. This provides a programmatic
interface to discover and inspect namespaces, addressing
longstanding limitations:
Currently, there is no direct way for userspace to enumerate
namespaces. Applications must resort to scanning /proc/*/ns/ across
all processes, which is:
- Inefficient - requires iterating over all processes
- Incomplete - misses namespaces not attached to any running
process but kept alive by file descriptors, bind mounts, or
parent references
- Permission-heavy - requires access to /proc for many processes
- No ordering or ownership information
- No filtering per namespace type
The listns() system call solves these problems:
ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
size_t nr_ns_ids, unsigned int flags);
struct ns_id_req {
__u32 size;
__u32 spare;
__u64 ns_id;
struct /* listns */ {
__u32 ns_type;
__u32 spare2;
__u64 user_ns_id;
};
};
Features include:
- Pagination support for large namespace sets
- Filtering by namespace type (MNT_NS, NET_NS, USER_NS, etc.)
- Filtering by owning user namespace
- Permission checks respecting namespace isolation
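As a usage sketch of paginated iteration (assuming, plausibly but unconfirmed here, that ns_id acts as a resume cursor and that there is no libc wrapper yet; handle_ns_id() is a hypothetical consumer):

    struct ns_id_req req = {
            .size    = sizeof(req),
            .ns_type = 0,                   /* assumption: 0 selects all types */
    };
    __u64 ids[64];
    ssize_t n;

    while ((n = syscall(__NR_listns, &req, ids, 64, 0)) > 0) {
            for (ssize_t i = 0; i < n; i++)
                    handle_ns_id(ids[i]);
            req.ns_id = ids[n - 1];         /* resume after the last id seen */
    }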
- Active Reference Counting
Introduce an active reference count that tracks namespace
visibility to userspace. A namespace is visible in the following
cases:
- The namespace is in use by a task
- The namespace is persisted through a VFS object (namespace file
descriptor or bind-mount)
- The namespace is a hierarchical type and is the parent of child
namespaces
The active reference count does not regulate lifetime (that's still
done by the normal reference count) - it only regulates visibility
to namespace file handles and listns().
This prevents resurrection of namespaces that are pinned only for
internal kernel reasons (e.g., user namespaces held by
file->f_cred, lazy TLB references on idle CPUs, etc.) which should
not be accessible via any of the three cases above.
- Unified Namespace Tree
Introduce a unified tree structure for all namespaces with:
- Fixed IDs assigned to initial namespaces
- Lookup based solely on inode number
- Maintained list of owned namespaces per user namespace
- Simplified rbtree comparison helpers
Cleanups
- Header Reorganization:
- Move namespace types into separate header (ns_common_types.h)
- Decouple nstree from ns_common header
- Move nstree types into separate header
- Switch to new ns_tree_{node,root} structures with helper functions
- Use guards for ns_tree_lock
- Initial Namespace Reference Count Optimization
- Make all reference counts on initial namespaces a nop to avoid
pointless cacheline ping-pong for namespaces that can never go
away
- Drop custom reference count initialization for initial namespaces
- Add NS_COMMON_INIT() macro and use it for all namespaces
- pid: rely on common reference count behavior
- Miscellaneous Cleanups
- Rename exit_task_namespaces() to exit_nsproxy_namespaces()
- Rename is_initial_namespace() and make argument const
- Use boolean to indicate anonymous mount namespace
- Simplify owner list iteration in nstree
- nsfs: raise SB_I_NODEV, SB_I_NOEXEC, and DCACHE_DONTCACHE explicitly
- nsfs: use inode_just_drop()
- pidfs: raise DCACHE_DONTCACHE explicitly
- pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
- libfs: allow to specify s_d_flags
- cgroup: add cgroup namespace to tree after owner is set
- nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
Fixes:
- setns(pidfd, ...) race condition
Fix a subtle race when using pidfds with setns(). When the target
task exits after prepare_nsset() but before commit_nsset(), the
namespace's active reference count might have been dropped. If
setns() then installs the namespaces, it would bump the active
reference count from zero without taking the required reference on
the owner namespace, leading to underflow when later decremented.
The fix resurrects the ownership chain if necessary - if the caller
succeeded in grabbing passive references, the setns() should
succeed even if the target task exits or gets reaped.
- Return EFAULT on put_user() error instead of success
- Make sure references are dropped outside of RCU lock (some
namespaces like mount namespace sleep when putting the last
reference)
- Don't skip active reference count initialization for network
namespace
- Add asserts for active refcount underflow
- Add asserts for initial namespace reference counts (both passive
and active)
- ipc: enable is_ns_init_id() assertions
- Fix kernel-doc comments for internal nstree functions
- Selftests
- 15 active reference count tests
- 9 listns() functionality tests
- 7 listns() permission tests
- 12 inactive namespace resurrection tests
- 3 threaded active reference count tests
- commit_creds() active reference tests
- Pagination and stress tests
- EFAULT handling test
- nsid tests fixes"
* tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (103 commits)
pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
nstree: fix kernel-doc comments for internal functions
nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
selftests/namespaces: fix nsid tests
ns: drop custom reference count initialization for initial namespaces
pid: rely on common reference count behavior
ns: add asserts for initial namespace active reference counts
ns: add asserts for initial namespace reference counts
ns: make all reference counts on initial namespace a nop
ipc: enable is_ns_init_id() assertions
fs: use boolean to indicate anonymous mount namespace
ns: rename is_initial_namespace()
ns: make is_initial_namespace() argument const
nstree: use guards for ns_tree_lock
nstree: simplify owner list iteration
nstree: switch to new structures
nstree: add helper to operate on struct ns_tree_{node,root}
nstree: move nstree types into separate header
nstree: decouple from ns_common header
ns: move namespace types into separate header
...
Merge tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull writeback updates from Christian Brauner:
"Features:
- Allow file systems to increase the minimum writeback chunk size.
The relatively low minimal writeback size of 4MiB means that
written back inodes on rotational media are switched a lot. Besides
introducing additional seeks, this also can lead to extreme file
fragmentation on zoned devices when a lot of files are cached
relative to the available writeback bandwidth.
This adds a superblock field that allows the file system to
override the default size, and sets it to the zone size for zoned
XFS.
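For a filesystem, opting in is presumably a one-liner at mount time; a sketch, where the field name follows the commit subject and the units are assumed to be pages:

    /* in fill_super, for a zoned device */
    sb->s_min_writeback_pages = zone_size_bytes >> PAGE_SHIFT;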
- Add logging for slow writeback when it exceeds
sysctl_hung_task_timeout_secs. This helps identify tasks waiting
for a long time and pinpoint potential issues. Recording the
starting jiffies is also useful when debugging a crashed vmcore.
- Wake up waiting tasks when finishing the writeback of a chunk
Cleanups:
- filemap_* writeback interface cleanups.
Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
the original btrfs caller should be using better high level
interfaces instead.
This series removes all these low-level interfaces, switches btrfs
to a more specific interface, and cleans up other too low-level
interfaces. With this the writeback_control that is passed to the
writeback code is only initialized in three places.
- Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
filemap_fdatawrite_wbc
- Add filemap_flush_nr helper for btrfs
- Push struct writeback_control into start_delalloc_inodes in btrfs
- Rename filemap_fdatawrite_range_kick to filemap_flush_range
- Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm
- Make wbc_to_tag() inline and use it in fs"
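wbc_to_tag() maps the writeback mode to the page-cache tag to scan; its body is presumably the long-standing pattern from write_cache_pages():

    static inline xa_mark_t wbc_to_tag(const struct writeback_control *wbc)
    {
            if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
                    return PAGECACHE_TAG_TOWRITE;
            return PAGECACHE_TAG_DIRTY;
    }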
* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Make wbc_to_tag() inline and use it in fs.
xfs: set s_min_writeback_pages for zoned file systems
writeback: allow the file system to override MIN_WRITEBACK_PAGES
writeback: cleanup writeback_chunk_size
mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
mm: remove __filemap_fdatawrite_range
mm: remove filemap_fdatawrite_wbc
mm: remove __filemap_fdatawrite
mm,btrfs: add a filemap_flush_nr helper
btrfs: push struct writeback_control into start_delalloc_inodes
btrfs: use the local tmp_inode variable in start_delalloc_inodes
ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
writeback: Wake up waiting tasks when finishing the writeback of a chunk.
Merge tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"Features:
- Hide inode->i_state behind accessors. Open-coded accesses prevent
asserting they are done correctly. One obvious aspect is locking,
but significantly more can be checked. For example it can be
detected when the code is clearing flags which are already missing,
or is setting flags when it is illegal (e.g., I_FREEING when
->i_count > 0)
- Provide accessors for ->i_state, converts all filesystems using
coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to
compile
- Rework I_NEW handling to operate without fences, simplifying the
code after the accessor infrastructure is in place
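As a sketch of what the accessors buy (helper names here are assumptions; the series' actual helpers may differ), a setter can assert locking and flag sanity at every call site:

    static inline void inode_state_add(struct inode *inode, unsigned int flags)
    {
            lockdep_assert_held(&inode->i_lock);
            /* e.g. setting I_FREEING on a still-referenced inode is a bug */
            WARN_ON_ONCE((flags & I_FREEING) && atomic_read(&inode->i_count));
            inode->i_state |= flags;
    }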
Cleanups:
- Move wait_on_inode() from writeback.h to fs.h
- Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
for clarity
- Cosmetic fixes to LRU handling
- Push list presence check into inode_io_list_del()
- Touch up predicts in __d_lookup_rcu()
- ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
- Assert on ->i_count in iput_final()
- Assert ->i_lock held in __iget()
Fixes:
- Add missing fences to I_NEW handling"
* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
dcache: touch up predicts in __d_lookup_rcu()
fs: push list presence check into inode_io_list_del()
fs: cosmetic fixes to lru handling
fs: rework I_NEW handling to operate without fences
fs: make plain ->i_state access fail to compile
xfs: use the new ->i_state accessors
nilfs2: use the new ->i_state accessors
overlayfs: use the new ->i_state accessors
gfs2: use the new ->i_state accessors
f2fs: use the new ->i_state accessors
smb: use the new ->i_state accessors
ceph: use the new ->i_state accessors
btrfs: use the new ->i_state accessors
Manual conversion to use ->i_state accessors of all places not covered by coccinelle
Coccinelle-based conversion to use ->i_state accessors
fs: provide accessors for ->i_state
fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
fs: move wait_on_inode() from writeback.h to fs.h
fs: add missing fences to I_NEW handling
ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
...
Merge tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE
permission checks during path lookup and adds the
IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid
expensive permission work.
- Hide dentry_cache behind runtime const machinery.
- Add German Maglione as virtiofs co-maintainer.
Cleanups:
- Tidy up and inline step_into() and walk_component() for improved
code generation.
- Re-enable IOCB_NOWAIT writes to files. This refactors file
timestamp update logic, fixing a layering bypass in btrfs when
updating timestamps on device files and improving FMODE_NOCMTIME
handling in VFS now that nfsd started using it.
- Path lookup optimizations extracting slowpaths into dedicated
routines and adding branch prediction hints for mntput_no_expire(),
fd_install(), lookup_slow(), and various other hot paths.
- Enable clang's -fms-extensions flag, requiring a JFS rename to
avoid conflicts.
- Remove spurious exports in fs/file_attr.c.
- Stop duplicating union pipe_index declaration. This depends on the
shared kbuild branch that brings in -fms-extensions support which
is merged into this branch.
- Use MD5 library instead of crypto_shash in ecryptfs.
- Use largest_zero_folio() in iomap_dio_zero().
- Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and
initrd code.
- Various typo fixes.
Fixes:
- Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs()
call with wait == 1 to commit super blocks. The emergency sync path
never passed this, leaving btrfs data uncommitted during emergency
sync.
- Use local kmap in watch_queue's post_one_notification().
- Add hint prints in sb_set_blocksize() for LBS dependency on THP"
* tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
MAINTAINERS: add German Maglione as virtiofs co-maintainer
fs: inline step_into() and walk_component()
fs: tidy up step_into() & friends before inlining
orangefs: use inode_update_timestamps directly
btrfs: fix the comment on btrfs_update_time
btrfs: use vfs_utimes to update file timestamps
fs: export vfs_utimes
fs: lift the FMODE_NOCMTIME check into file_update_time_flags
fs: refactor file timestamp update logic
include/linux/fs.h: trivial fix: regualr -> regular
fs/splice.c: trivial fix: pipes -> pipe's
fs: mark lookup_slow() as noinline
fs: add predicts based on nd->depth
fs: move mntput_no_expire() slowpath into a dedicated routine
fs: remove spurious exports in fs/file_attr.c
watch_queue: Use local kmap in post_one_notification()
fs: touch up predicts in path lookup
fs: move fd_install() slowpath into a dedicated routine and provide commentary
fs: hide dentry_cache behind runtime const machinery
fs: touch predicts in do_dentry_open()
...
Merge tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:
"FUSE iomap Support for Buffered Reads:
This adds iomap support for FUSE buffered reads and readahead. This
enables granular uptodate tracking with large folios so only
non-uptodate portions need to be read. Also fixes a race condition
with large folios + writeback cache that could cause data corruption
on partial writes followed by reads.
- Refactored iomap read/readahead bio logic into helpers
- Added caller-provided callbacks for read operations
- Moved buffered IO bio logic into new file
- FUSE now uses iomap for read_folio and readahead
Zero Range Folio Batch Support:
Add folio batch support for iomap_zero_range() to handle dirty
folios over unwritten mappings. Fix raciness issues where dirty data
could be lost during zero range operations.
- filemap_get_folios_tag_range() helper for dirty folio lookup
- Optional zero range dirty folio processing
- XFS fills dirty folios on zero range of unwritten mappings
- Removed old partial EOF zeroing optimization
DIO Write Completions from Interrupt Context:
Restore pre-iomap behavior where pure overwrite completions run
inline rather than being deferred to workqueue. Reduces context
switches for high-performance workloads like ScyllaDB.
- Removed unused IOCB_DIO_CALLER_COMP code
- Error completions always run in user context (fixes zonefs)
- Reworked REQ_FUA selection logic
- Inverted IOMAP_DIO_INLINE_COMP to IOMAP_DIO_OFFLOAD_COMP
Buffered IO Cleanups:
Some performance and code clarity improvements:
- Replace manual bitmap scanning with find_next_bit()
- Simplify read skip logic for writes
- Optimize pending async writeback accounting
- Better variable naming
- Documentation for iomap_finish_folio_write() requirements
Misaligned Vectors for Zoned XFS:
Enables sub-block aligned vectors in XFS always-COW mode for zoned
devices via new IOMAP_DIO_FSBLOCK_ALIGNED flag.
Bug Fixes:
- Allocate s_dio_done_wq for async reads (fixes syzbot report after
error completion changes)
- Fix iomap_read_end() for already uptodate folios (regression fix)"
* tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (40 commits)
iomap: allocate s_dio_done_wq for async reads as well
iomap: fix iomap_read_end() for already uptodate folios
iomap: invert the polarity of IOMAP_DIO_INLINE_COMP
iomap: support write completions from interrupt context
iomap: rework REQ_FUA selection
iomap: always run error completions in user context
fs, iomap: remove IOCB_DIO_CALLER_COMP
iomap: use find_next_bit() for uptodate bitmap scanning
iomap: use find_next_bit() for dirty bitmap scanning
iomap: simplify when reads can be skipped for writes
iomap: simplify ->read_folio_range() error handling for reads
iomap: optimize pending async writeback accounting
docs: document iomap writeback's iomap_finish_folio_write() requirement
iomap: account for unaligned end offsets when truncating read range
iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted
xfs: support sub-block aligned vectors in always COW mode
iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag
xfs: error tag to force zeroing on debug kernels
iomap: remove old partial eof zeroing optimization
xfs: fill dirty folios on zero range of unwritten mappings
...
Rationale is that if the parent dentry is the same and the length is the
same, then you have to be unlucky for the name to not match.
At the same time the dentry was literally just found on the hash, so you
have to be even more unlucky to determine it is unhashed.
While here add commentary on why d_unhashed() is necessary. It was
already removed once and brought back in:
2e321806b6 ("Revert "vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu"")
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251127131526.4137768-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
German Maglione is a co-maintainer of the virtiofsd userspace device
implementation (https://gitlab.com/virtio-fs/virtiofsd) and is currently
one of the most active virtiofs developers outside the kernel.
I have not worked on virtiofs except to review kernel patches for a few
years now and would like German to take over from me gradually. It is
healthier to have a kernel maintainer who is actively involved. I expect
to remove myself in a few months.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://patch.msgid.link/20251126211548.598469-1-stefanha@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
The primary consumer is link_path_walk(), calling walk_component() every
time which in turn calls step_into().
Inlining these saves the overhead of two function calls per path
component and allows the compiler to do a better job optimizing them in
place.
step_into() had absolutely atrocious assembly to facilitate the
slowpath. In order to lessen the burden at the callsite all the hard
work is moved into step_into_slowpath() and instead an inline-able
fastpath is implemented for rcu-walk.
The new fast path is the stripped-down RCU handling from step_into(),
plus the d_managed() check from handle_mounts().
Benchmarked as follows on Sapphire Rapids:
1. the "before" was a kernel with not-yet-merged optimizations (notably
elision of calls to security_inode_permission() and marking ext4
inodes as not having acls as applicable)
2. "after" is the same + the prep patch + this patch
3. benchmark consists of issuing 205 calls to access(2) in a loop with
pathnames lifted out of gcc and the linker building real code, most
of which have several path components and 118 of which fail with
-ENOENT.
Result in terms of ops/s:
before: 21619
after: 22536 (+4%)
profile before:
20.25% [kernel] [k] __d_lookup_rcu
10.54% [kernel] [k] link_path_walk
10.22% [kernel] [k] entry_SYSCALL_64
6.50% libc.so.6 [.] __GI___access
6.35% [kernel] [k] strncpy_from_user
4.87% [kernel] [k] step_into
3.68% [kernel] [k] kmem_cache_alloc_noprof
2.88% [kernel] [k] walk_component
2.86% [kernel] [k] kmem_cache_free
2.14% [kernel] [k] set_root
2.08% [kernel] [k] lookup_fast
after:
23.38% [kernel] [k] __d_lookup_rcu
11.27% [kernel] [k] entry_SYSCALL_64
10.89% [kernel] [k] link_path_walk
7.00% libc.so.6 [.] __GI___access
6.88% [kernel] [k] strncpy_from_user
3.50% [kernel] [k] kmem_cache_alloc_noprof
2.01% [kernel] [k] kmem_cache_free
2.00% [kernel] [k] set_root
1.99% [kernel] [k] lookup_fast
1.81% [kernel] [k] do_syscall_64
1.69% [kernel] [k] entry_SYSCALL_64_safe_stack
While walk_component() and step_into() of course disappear from the
profile, link_path_walk() barely gets more overhead despite the
inlining, thanks to the added fast path, while completing more walks
per second.
I did not investigate why overhead grew a lot on __d_lookup_rcu().
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251120003803.2979978-2-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Symlink handling is already marked as unlikely and pushing out some of
it into pick_link() reduces register spillage on entry to step_into()
with gcc 14.2.
The compiler needed additional convincing that handle_mounts() is
unlikely to fail.
At the same time neither clang nor gcc could be convinced to tail-call
into pick_link().
While pick_link() takes the address of a stack-based object as an
argument (which definitely prevents the optimization), splitting it into
a separate <dentry, mount> tuple did not help. The issue persists even
when compiled without stack protector. As such, nothing was done about
this for the time being, to avoid growing the diff.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251120003803.2979978-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christoph Hellwig <hch@lst.de> says:
[Fix] the layering bypass in btrfs when updating timestamps on device
files for devices removed from btrfs usage, and FMODE_NOCMTIME handling
in the VFS now that nfsd started using it. Note that I'm still not sure
that nfsd usage is fully correct for all file systems, as only XFS
explicitly supports FMODE_NOCMTIME, but at least the generic code does
the right thing now.
* patches from https://patch.msgid.link/20251120064859.2911749-1-hch@lst.de:
orangefs: use inode_update_timestamps directly
btrfs: fix the comment on btrfs_update_time
btrfs: use vfs_utimes to update file timestamps
fs: export vfs_utimes
fs: lift the FMODE_NOCMTIME check into file_update_time_flags
fs: refactor file timestamp update logic
Link: https://patch.msgid.link/20251120064859.2911749-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Orangefs has no i_version handling and __orangefs_setattr already
explicitly marks the inode dirty. So instead of using the flags return
value from generic_update_time, just call the lower-level
inode_update_timestamps helper directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-7-hch@lst.de
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Since commit e41f941a23 ("Btrfs: move over to use ->update_time") this
is not a copy of the high-level file_update_time helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-6-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Btrfs updates the device node timestamps for block device special files
when it stops using the device.
Commit 8f96a5bfa1 ("btrfs: update the bdev time directly when closing")
switched that update from the correct layering to directly call the
low-level helper on the bdev inode. This is wrong and got fixed in
commit 54fde91f52 ("btrfs: update device path inode time instead of
bd_inode") by updating the file system inode instead of the bdev inode,
but this kept the incorrect bypassing of the VFS interfaces and file
system ->update_times method. Fix this by using the proper vfs_utimes
interface.
Fixes: 8f96a5bfa1 ("btrfs: update the bdev time directly when closing")
Fixes: 54fde91f52 ("btrfs: update device path inode time instead of bd_inode")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-5-hch@lst.de
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This will be used to replace an incorrect direct call into
generic_update_time in btrfs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-4-hch@lst.de
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
FMODE_NOCMTIME used to be just a hack for the legacy XFS handle-based
"invisible I/O", but commit e5e9b24ab8 ("nfsd: freeze c/mtime updates
with outstanding WRITE_ATTRS delegation") started using it from
generic callers.
I'm not sure other file systems are actually ready for this in general,
so the above commit should get a closer look, but for it to make any
sense, file_update_time needs to respect the flag.
Lift the check from file_modified_flags to file_update_time so that
users of file_update_time inherit the behavior and so that all the
checks are done in one place.
Fixes: e5e9b24ab8 ("nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-3-hch@lst.de
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently the two high-level APIs use two helper functions to implement
almost all of the logic. Refactor the two helpers and the common logic
into a new file_update_time_flags routine that is passed the iocb flags
(or 0 in the file_update_time case) so that the entire logic is
contained in a single function and can be easily understood and modified.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-2-hch@lst.de
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
For consistency with sb routines.
ext4 is the only consumer outside of evict(). Damage-controlling it is
outside of the scope of this cleanup.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251103230911.516866-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
1. inode_bit_waitqueue() was somehow placed between __inode_add_lru() and
inode_add_lru(). Move it up.
2. assert ->i_lock is held in __inode_add_lru instead of just claiming it is
needed
3. s/__inode_add_lru/__inode_lru_list_add/ for consistency with itself
(inode_lru_list_del()) and similar routines for sb and io list
management
4. push list presence check into inode_lru_list_del(), just like sb and
io list
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251029131428.654761-2-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In the inode hash code grab the state while ->i_lock is held. If found
to be set, synchronize the sleep once more with the lock held.
In the real world the flag is not set most of the time.
Apart from being simpler to reason about, it comes with a minor speed up
as now clearing the flag does not require the smp_mb() fence.
While here rename wait_on_inode() to wait_on_new_inode() to line it up
with __wait_on_freeing_inode().
Christian Brauner <brauner@kernel.org> says:
As per the discussion in [1] I folded in the diff sent in [2].
Link: https://lore.kernel.org/69238e4d.a70a0220.d98e3.006e.GAE@google.com [1]
Link: https://lore.kernel.org/c2kpawomkbvtahjm7y5mposbhckb7wxthi3iqy5yr22ggpucrm@ufvxwy233qxo [2]
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251010221737.1403539-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Since commit 222f2c7c6d14 ("iomap: always run error completions in user
context"), read error completions are deferred to s_dio_done_wq. This
means the workqueue also needs to be allocated for async reads.
Fixes: 222f2c7c6d14 ("iomap: always run error completions in user context")
Reported-by: syzbot+a2b9a4ed0d61b1efb3f5@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251124140013.902853-1-hch@lst.de
Tested-by: syzbot+a2b9a4ed0d61b1efb3f5@syzkaller.appspotmail.com
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
There are some cases where, when iomap_read_end() is called, the folio
may already have been marked uptodate. For example, if the iomap block
needed zeroing, then the folio may have been marked uptodate after the
zeroing.
iomap_read_end() should unlock the folio instead of calling
folio_end_read(), which is how these cases were handled prior to commit
f8eaf79406 ("iomap: simplify ->read_folio_range() error handling for
reads"). Calling folio_end_read() on an uptodate folio leads to buggy
behavior where marking an already uptodate folio as uptodate will XOR it
to be marked nonuptodate.
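The fix presumably distinguishes the two cases along these lines ("ok" is a stand-in for whether the range read succeeded):

    /* sketch of iomap_read_end() when the folio may already be uptodate */
    if (folio_test_uptodate(folio))
            folio_unlock(folio);       /* folio_end_read() would XOR uptodate off */
    else
            folio_end_read(folio, ok);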
Fixes: f8eaf79406 ("iomap: simplify ->read_folio_range() error handling for reads")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251118211111.1027272-2-joannelkoong@gmail.com
Tested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christoph Hellwig <hch@lst.de> says:
Currently iomap defers all write completions to interrupt context. This
was based on my assumption that no one cares about the latency of those
to simplify the code vs the old direct-io.c. It turns out someone cared,
as Avi reported a lot of context switches with ScyllaDB, which at least
in older kernels with workqueue scheduling issues caused really high
tail latencies.
Fortunately allowing the direct completions is pretty easy with all the
other iomap changes we had since.
While doing this I've also found dead code which gets removed (patch 1)
and an incorrect assumption in zonefs that read completions are called
in user context, which it assumes for its error handling. Fix this by
always calling error completions from user context (patch 2).
Against the vfs-6.19.iomap branch.
* patches from https://patch.msgid.link/20251113170633.1453259-1-hch@lst.de:
iomap: invert the polarity of IOMAP_DIO_INLINE_COMP
iomap: support write completions from interrupt context
iomap: rework REQ_FUA selection
iomap: always run error completions in user context
fs, iomap: remove IOCB_DIO_CALLER_COMP
Link: https://patch.msgid.link/20251113170633.1453259-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Replace IOMAP_DIO_INLINE_COMP with a flag to indicate that the
completion should be offloaded. This removes a tiny bit of boilerplate
code, but more importantly just makes the code easier to follow as this
new flag gets set most of the time and only cleared in one place, while
it was the inverse for the old version.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-6-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Completions for pure overwrites don't need to be deferred to a workqueue
as there is no work to be done, or at least no work that needs a user
context. Set the IOMAP_DIO_INLINE_COMP by default for writes like we
already do for reads, and then clear it for all the cases that actually
do need a user context for completions to update the inode size or
record updates to the logical to physical mapping.
I've audited all users of the ->end_io callback, and they only require
user context for I/O that involves unwritten extents, COW, size
extensions, or error handling and all those are still run from workqueue
context.
This restores the behavior of the old pre-iomap direct I/O code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-5-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
The way iomap_dio_can_use_fua() and its caller are structured is
a bit confusing, as the main guarding condition is hidden in the
helper, and the secondary conditions are split between caller and
callee.
Refactor the code, so that iomap_dio_bio_iter itself tracks if a write
might need metadata updates based on the iomap type and flags, and
then have a condition based on that to use the FUA flag.
Note that this also moves the REQ_OP_WRITE assignment to the end of
the branch to improve readability a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-4-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
At least zonefs expects error completions to be able to sleep. Because
error completions aren't performance critical, just defer them to workqueue
context unconditionally.
Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-3-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This was added by commit 099ada2c87 ("io_uring/rw: add write support
for IOCB_DIO_CALLER_COMP") and disabled a little later by commit
838b35bb6a ("io_uring/rw: disable IOCB_DIO_CALLER_COMP") because it
didn't work. Remove all the related code that sat unused for 2 years.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-2-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This series contains several fixes and cleanups:
* Renaming bytes_pending/bytes_accounted to
bytes_submitted/bytes_not_submitted for improved code clarity
* Accounting for unaligned end offsets when truncating read ranges
* Adding documentation for iomap_finish_folio_write() requirements
* Optimizing pending async writeback accounting logic
* Simplifying error handling in ->read_folio_range() for read operations
* Streamlining logic for skipping reads during write operations
* Replacing manual bitmap scanning with find_next_bit() for both dirty
and uptodate bitmaps, improving performance
* patches from https://patch.msgid.link/20251111193658.3495942-1-joannelkoong@gmail.com:
iomap: use find_next_bit() for uptodate bitmap scanning
iomap: use find_next_bit() for dirty bitmap scanning
iomap: simplify when reads can be skipped for writes
iomap: simplify ->read_folio_range() error handling for reads
iomap: optimize pending async writeback accounting
docs: document iomap writeback's iomap_finish_folio_write() requirement
iomap: account for unaligned end offsets when truncating read range
iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted
Link: https://patch.msgid.link/20251111193658.3495942-1-joannelkoong@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Use find_next_bit()/find_next_zero_bit() for iomap uptodate bitmap
scanning. This uses __ffs() internally and is more efficient for
finding the next uptodate or non-uptodate bit than iterating through
the bitmap range testing every bit.
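A sketch of the scanning pattern, where ifs->state stands in for the per-folio state bitmap and the block range bounds are assumptions:

    /* first not-yet-uptodate block at or after first_blk, below nr_blks */
    blk = find_next_zero_bit(ifs->state, nr_blks, first_blk);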
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-10-joannelkoong@gmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Use find_next_bit()/find_next_zero_bit() for iomap dirty bitmap
scanning. This uses __ffs() internally and is more efficient for
finding the next dirty or clean bit than iterating through the bitmap
range testing every bit.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-9-joannelkoong@gmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In a recent commit, I inadvertently changed a comparison from being an
unsigned comparison (on 64-bit systems) to being a signed comparison
(which it had always been on 32-bit systems). This led to a sporadic
fstests failure.
To make sure this comparison is always unsigned, introduce a new type,
uoff_t which is the unsigned version of loff_t. Generally file sizes
are restricted to being a signed integer, but in these two places it is
convenient to pass -1 to indicate "up to the end of the file".
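The new type is presumably just the unsigned counterpart of the signed long long behind loff_t; a sketch (truncate_range() is a hypothetical consumer):

    typedef unsigned long long uoff_t;      /* sketch; may be spelled u64 */

    uoff_t end = (uoff_t)-1;                /* "up to the end of the file" */
    if (pos < end)                          /* unsigned on 32-bit and 64-bit alike */
            truncate_range(pos, end);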
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251123220518.1447261-1-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Otherwise it gets inlined notably in walk_component(), which convinces
the compiler to push/pop additional registers in the fast path to
accommodate the existence of the inlined version.
Shortens the fast path of that routine from 87 to 71 bytes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251119144930.2911698-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Stats from nd->depth usage during the venerable kernel build collected like so:
bpftrace -e 'kprobe:terminate_walk,kprobe:walk_component,kprobe:legitimize_links
{ @[probe] = lhist(((struct nameidata *)arg0)->depth, 0, 8, 1); }'
@[kprobe:legitimize_links]:
[0, 1) 6554906 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2) 3534 | |
@[kprobe:terminate_walk]:
[0, 1) 12153664 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
@[kprobe:walk_component]:
[0, 1) 53075749 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2) 971421 | |
[2, 3) 84946 | |
Additionally a custom probe was added for depth within link_path_walk():
bpftrace -e 'kprobe:link_path_walk_probe { @[probe] = lhist(arg0, 0, 8, 1); }'
@[kprobe:link_path_walk_probe]:
[0, 1) 7528231 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2) 407905 |@@ |
Given these results:
1. terminate_walk() is called towards the end of the lookup and in this
test it never had any links to clean up.
2. legitimize_links() is also called towards the end of lookup and most
of the time there is 0 depth. Patch consumers to avoid calling into it
in that case.
3. walk_component() is typically called with WALK_MORE and zero depth,
checked in that order. Check depth first and predict it is 0.
4. link_path_walk() also does not deal with a symlink most of the time
when !*name
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251119142954.2909394-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
In the stock variant the compiler spills several registers on the stack
and employs stack smashing protection, adding even more code + a branch
on exit.
The actual fast path is small enough that the compiler inlines it for
all callers -- the symbol is no longer emitted.
Forcing noinline on it just for code-measurement purposes shows the fast
path dropping from 111 to 39 bytes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251114201803.2183505-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Commit 2f952c9e8f ("fs: split fileattr related helpers into separate
file") added various exports without users despite claiming to be a
simple refactor. Drop them again.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251119101415.2732320-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Replace the now deprecated kmap_atomic() with kmap_local_page().
Optimize for the non-highmem cases and avoid disabling preemption and
pagefaults; the caller's context is atomic anyway, but that is irrelevant
to kmap. The memcpy itself does not require any such semantics and the
mapping would remain valid across context switches anyway. Further,
highmem is planned to be removed [1].
[1] https://lore.kernel.org/all/4ff89b72-03ff-4447-9d21-dd6a5fe1550f@app.fastmail.com/
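The conversion is mechanical (offset, data, and len are stand-ins):

    /* before: implies preempt/pagefault disabling even without highmem */
    void *p = kmap_atomic(page);
    memcpy(p + offset, data, len);
    kunmap_atomic(p);

    /* after: plain local mapping; a no-op address computation on !HIGHMEM */
    void *q = kmap_local_page(page);
    memcpy(q + offset, data, len);
    kunmap_local(q);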
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Link: https://patch.msgid.link/20251118210706.1816303-1-dave@stgolabs.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
Documentation build reported:
Warning: kernel/nstree.c:325 function parameter 'ns_tree' not described in '__ns_tree_adjoined_rcu'
Warning: kernel/nstree.c:325 expecting prototype for ns_tree_adjoined_rcu(). Prototype was for __ns_tree_adjoined_rcu() instead
Warning: kernel/nstree.c:353 expecting prototype for ns_tree_gen_id(). Prototype was for __ns_tree_gen_id() instead
The kernel-doc comments for `__ns_tree_adjoined_rcu()` and
`__ns_tree_gen_id()` had mismatched function names and a missing
parameter description. This patch updates the function names in the
kernel-doc headers and adds the missing `@ns_tree` parameter description
for `__ns_tree_adjoined_rcu()`.
Fixes: 885fc8ac0a ("nstree: make iterator generic")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511061542.0LO7xKs8-lkp@intel.com
Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com>
Link: https://patch.msgid.link/20251111112533.2254432-1-kriish.sharma2006@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Rationale:
- ND_ROOT_PRESET is only set in a condition already marked unlikely
- LOOKUP_IS_SCOPED already has unlikely on it, but inconsistently
applied
- set_root() only fails if there is a bug
- most names are not empty (see !*s)
- most of the time path_init() does not encounter LOOKUP_CACHED without
LOOKUP_RCU
- LOOKUP_IN_ROOT is a rarely seen flag
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251105150630.756606-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
On stock kernel gcc 14 emits avoidable register spillage:
endbr64
call ffffffff81374630 <__fentry__>
push %r13
push %r12
push %rbx
sub $0x8,%rsp
[snip]
Total fast path is 99 bytes.
Moving the slowpath out avoids it and shortens the fast path to 74
bytes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251110095634.1433061-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Helps out some of the asm; the routine is still a mess.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251109125254.1288882-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Support for block sizes greater than the page size depends on large
folios, which in turn require CONFIG_TRANSPARENT_HUGEPAGE to be enabled.
Because the code is wrapped in multiple layers of abstraction, this
dependency is rather obscure, so users may not realize it and may be
unsure how to enable LBS.
As suggested by Theodore, I have added hint messages in sb_set_blocksize
so that users can distinguish whether a mount failure with block size
larger than page size is due to lack of filesystem support or the absence
of CONFIG_TRANSPARENT_HUGEPAGE.
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20251110043226.GD2988753@mit.edu
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20251110124714.1329978-1-libaokun@huaweicloud.com
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Mateusz Guzik <mjguzik@gmail.com> says:
In short, MAY_WRITE checks are elided.
This obsoletes the idea of pre-computing if perm checks are necessary as
that turned out to be too hairy. The new code has 2 more branches per
path component compared to that idea, but the perf difference for
typical paths (< 6 components) was basically within noise. To be
revisited if someone(tm) removes other slowdowns.
Instead of the pre-computing thing I added IOP_FASTPERM_MAY_EXEC so that
filesystems like btrfs can still avoid the hard work.
* patches from https://patch.msgid.link/20251107142149.989998-1-mjguzik@gmail.com:
fs: retire now stale MAY_WRITE predicts in inode_permission()
btrfs: utilize IOP_FASTPERM_MAY_EXEC
fs: speed up path lookup with cheaper handling of MAY_EXEC
Link: https://patch.msgid.link/20251107142149.989998-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
The primary non-MAY_WRITE consumer now uses lookup_inode_permission_may_exec().
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251107142149.989998-4-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Root filesystem was ext4, btrfs was mounted on /testfs.
Then issuing access(2) in a loop on /testfs/repos/linux/include/linux/fs.h
on Sapphire Rapids (ops/s):
before: 3447976
after: 3620879 (+5%)
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251107142149.989998-3-mjguzik@gmail.com
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The generic inode_permission() routine does work which is known to be of
no significance for lookup. There are checks for MAY_WRITE, while the
requested permission is MAY_EXEC. Additionally devcgroup_inode_permission()
is called to check for devices, but it is an invariant the inode is a
directory.
Absent a ->permission func, execution lands in generic_permission()
which checks upfront if the requested permission is granted for
everyone.
We can elide the branches which are guaranteed to be false and cut
straight to the check if everyone happens to be allowed MAY_EXEC on the
inode (which holds true most of the time).
Moreover, filesystems which provide their own ->permission routine can
take advantage of the optimization by setting the IOP_FASTPERM_MAY_EXEC
flag on their inodes, which they can legitimately do if their MAY_EXEC
handling matches generic_permission().
As a simple benchmark, as part of compilation gcc issues access(2) on
numerous long paths, for example /usr/lib/gcc/x86_64-linux-gnu/12/crtendS.o
Issuing access(2) on it in a loop on ext4 on Sapphire Rapids (ops/s):
before: 3797556
after: 3987789 (+5%)
Note: this depends on the not-yet-landed ext4 patch to mark inodes with
cache_no_acl()
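The "granted for everyone" shortcut boils down to a single mode test (a
hedged sketch of the idea behind lookup_inode_permission_may_exec(); the
helper here is illustrative):
#include <sys/stat.h>	/* S_IXUSR et al. */
/* if owner, group and other all have x, MAY_EXEC can be granted without
 * uid/gid comparisons or ACL lookups (hence the dependency on inodes
 * marked with cache_no_acl()); true for e.g. 0755 directories */
static inline int everyone_may_exec(unsigned short mode)
{
	return (mode & (S_IXUSR | S_IXGRP | S_IXOTH)) ==
	       (S_IXUSR | S_IXGRP | S_IXOTH);
}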
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251107142149.989998-2-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now that we build with -fms-extensions, union pipe_index can be
included as an anonymous member in struct pipe_inode_info, avoiding
the duplication.
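A condensed sketch of what -fms-extensions permits here (member types
simplified):
/* a tagged union defined once... */
union pipe_index {
	unsigned long long head_tail;
	struct {
		unsigned int head;
		unsigned int tail;
	};
};
/* ...included as an anonymous member instead of duplicating its fields;
 * p->head, p->tail and p->head_tail all remain directly addressable */
struct pipe_inode_info_sketch {
	union pipe_index;	/* anonymous member: needs -fms-extensions */
	unsigned int max_usage;
};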
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Link: https://patch.msgid.link/20251023082142.2104456-1-linux@rasmusvillemoes.dk
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently, the logic for skipping the read range for a write is
if (!(iter->flags & IOMAP_UNSHARE) &&
(from <= poff || from >= poff + plen) &&
(to <= poff || to >= poff + plen))
which breaks down to skipping the read if any of these are true:
a) from <= poff && to <= poff
b) from <= poff && to >= poff + plen
c) from >= poff + plen && to <= poff
d) from >= poff + plen && to >= poff + plen
This can be simplified to
if (!(iter->flags & IOMAP_UNSHARE) && from <= poff && to >= poff + plen)
from the following reasoning:
a) from <= poff && to <= poff
This reduces to 'to <= poff' since it is guaranteed that 'from <= to'
(since to = from + len). It is not possible for 'to <= poff' to be true
here because we only reach here if plen > 0 (thanks to the preceding 'if
(plen == 0)' check that would break us out of the loop). If 'to <=
poff', plen would have to be 0 since poff and plen get adjusted in
lockstep for uptodate blocks. This means we can eliminate this check.
c) from >= poff + plen && to <= poff
This is not possible since 'from <= to' and 'plen > 0'. We can eliminate
this check.
d) from >= poff + plen && to >= poff + plen
This reduces to 'from >= poff + plen' since 'from <= to'.
It is not possible for 'from >= poff + plen' to be true here. We only
reach here if plen > 0 and for writes, poff and plen will always be
block-aligned, which means poff <= from < poff + plen. We can eliminate
this check.
The only valid check is b) from <= poff && to >= poff + plen.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-7-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Instead of requiring that the caller calls iomap_finish_folio_read()
even if the ->read_folio_range() callback returns an error, account for
this internally in iomap instead, which makes the interface simpler and
makes it match writeback's ->read_folio_range() error handling
expectations.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-6-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pending writebacks must be accounted for to determine when all requests
have completed and writeback on the folio should be ended. Currently
this is done by atomically incrementing ifs->write_bytes_pending for
every range to be written back.
Instead, the number of atomic operations can be minimized by setting
ifs->write_bytes_pending to the folio size, internally tracking how many
bytes are written back asynchronously, and then after sending off all
the requests, decrementing ifs->write_bytes_pending by the number of
bytes not written back asynchronously. Now, for N ranges written back,
only N + 2 atomic operations are required instead of 2N + 2.
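Schematically, on the submission side (a sketch with simplified types;
the per-range completion path contributes the remaining N atomic subs):
/* 1 atomic: claim the whole folio upfront */
atomic_set(&ifs->write_bytes_pending, folio_size(folio));
/* submission only bumps a plain local counter per async range;
 * each async completion later does one atomic sub (N total) */
size_t async_bytes = 0;
/* for each range written back asynchronously: async_bytes += len; */
/* 1 atomic: give back whatever was not submitted asynchronously */
if (atomic_sub_and_test(folio_size(folio) - async_bytes,
			&ifs->write_bytes_pending))
	folio_end_writeback(folio);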
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-5-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Document that iomap_finish_folio_write() must be called after writeback
on the range completes.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-4-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The end position to start truncating from may be at an offset into a
block, which under the current logic would result in overtruncation.
Adjust the calculation to account for unaligned end offsets.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-3-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The naming "bytes_pending" and "bytes_accounted" may be confusing and
could be better named. Rename this to "bytes_submitted" and
"bytes_not_submitted" to make it more clear that these are bytes we
passed to the IO helper to read in.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-2-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This should avoid *some* cache misses.
Successful path lookup is guaranteed to load at least ->i_mode,
->i_opflags and ->i_acl. At the same time the common case will avoid
looking at more fields.
struct inode is not guaranteed to have any particular alignment, notably
ext4 has it only aligned to 8 bytes meaning nearby fields might happen
to be on the same or only adjacent cache lines depending on luck (or no
luck).
According to pahole:
umode_t i_mode; /* 0 2 */
short unsigned int i_opflags; /* 2 2 */
kuid_t i_uid; /* 4 4 */
kgid_t i_gid; /* 8 4 */
unsigned int i_flags; /* 12 4 */
struct posix_acl * i_acl; /* 16 8 */
struct posix_acl * i_default_acl; /* 24 8 */
->i_acl is unnecessarily separated by 8 bytes from the other fields.
With struct inode being offset 48 bytes into the cacheline this means an
avoidable miss. Note it will still be there for the 56 byte case.
New layout:
umode_t i_mode; /* 0 2 */
short unsigned int i_opflags; /* 2 2 */
unsigned int i_flags; /* 4 4 */
struct posix_acl * i_acl; /* 8 8 */
struct posix_acl * i_default_acl; /* 16 8 */
kuid_t i_uid; /* 24 4 */
kgid_t i_gid; /* 28 4 */
I verified with pahole there are no size or hole changes.
This is a stopgap until someone(tm) sanitizes the layout in the first
place, allocation methods aside.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251109121931.1285366-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner <brauner@kernel.org> says:
Cleanup the namespace headers by splitting them into types and helpers.
Better separate common namespace types and functions from namespace tree
types and functions.
Fix the reference counts of initial namespaces so we don't do any
pointless cacheline ping-pong for them when we know they can never go
away. Add a bunch of asserts for both the passive and active reference
counts to catch any changes that would break it.
* patches from https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-0-e8a9264e0fb9@kernel.org:
selftests/namespaces: fix nsid tests
ns: drop custom reference count initialization for initial namespaces
pid: rely on common reference count behavior
ns: add asserts for initial namespace active reference counts
ns: add asserts for initial namespace reference counts
ns: make all reference counts on initial namespace a nop
ipc: enable is_ns_init_id() assertions
fs: use boolean to indicate anonymous mount namespace
ns: rename is_initial_namespace()
ns: make is_initial_namespace() argument const
nstree: use guards for ns_tree_lock
nstree: simplify owner list iteration
nstree: switch to new structures
nstree: add helper to operate on struct ns_tree_{node,root}
nstree: move nstree types into separate header
nstree: decouple from ns_common header
ns: move namespace types into separate header
Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-0-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Stop playing games with the namespace id and use a boolean instead:
* This will remove the special-casing we need to do everywhere for mount
namespaces.
* It will allow us to use asserts on the namespace id for initial
namespaces everywhere.
* It will allow us to put anonymous mount namespaces on the namespaces
trees in the future and thus make them available to statmount() and
listmount().
Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-10-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Bring in the shared branch with the kbuild tree to enable
'-fms-extensions' for 6.19. Further namespace cleanup work
requires this extension.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner <brauner@kernel.org> says:
* Make sure to initialize the active reference count for the initial
network namespace and prevent __ns_common_init() from returning too
early.
* Make sure that passive reference counts are dropped outside of rcu
read locks as some namespaces such as the mount namespace do in fact
sleep when putting the last reference.
* The setns() system call supports:
(1) namespace file descriptors (nsfd)
(2) process file descriptors (pidfd)
When using nsfds the namespaces will remain active because they are
pinned by the vfs. However, when pidfds are used things are more
complicated.
When the target task exits and passes through exit_nsproxy_namespaces(),
or is reaped and thus also passes through exit_cred_namespaces(), after
the setns()'ing task has called prepare_nsset() but before it has called
commit_nsset(), the active reference count of the set of namespaces it
wants to setns() to might already have been dropped:
P1 P2
pid_p1 = clone(CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
pidfd = pidfd_open(pid_p1)
setns(pidfd, CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
prepare_nsset()
exit(0)
// ns->__ns_active_ref == 1
// parent_ns->__ns_active_ref == 1
-> exit_nsproxy_namespaces()
-> exit_cred_namespaces()
// ns_active_ref_put() will also put
// the reference on the owner of the
// namespace. If the only reason the
// owning namespace was alive was
// because it was a parent of @ns
// its active reference count now goes
// to zero... --------------------------------
// |
// ns->__ns_active_ref == 0 |
// parent_ns->__ns_active_ref == 0 |
| commit_nsset()
-----------------> // If setns()
// now manages to install the namespaces
// it will call ns_active_ref_get()
// on them thus bumping the active reference
// count from zero again but without also
// taking the required reference on the owner.
// Thus we get:
//
// ns->__ns_active_ref == 1
// parent_ns->__ns_active_ref == 0
When later someone does ns_active_ref_put() on @ns it will underflow
parent_ns->__ns_active_ref leading to a splat from our asserts
thinking there are still active references when in fact the counter
just underflowed.
So resurrect the ownership chain if necessary as well. If the caller
succeeded in grabbing passive references to the set of namespaces,
setns() should simply succeed even if the target task exits or gets
reaped in the meantime.
The race is rare and can only be triggered when using pidfds to setns()
to namespaces. Also note that active references on initial namespaces
are nops.
Since we now always handle parent references directly we can drop
ns_ref_active_get_owner() when adding a namespace to a namespace tree.
This is now all handled uniformly in the places where the new namespaces
actually become active.
* patches from https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-0-ae8a4ad5a3b3@kernel.org:
selftests/namespaces: test for efault
selftests/namespaces: add active reference count regression test
ns: add asserts for active refcount underflow
ns: handle setns(pidfd, ...) cleanly
ns: return EFAULT on put_user() error
ns: make sure reference are dropped outside of rcu lock
ns: don't increment or decrement initial namespaces
ns: don't skip active reference count initialization
Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-0-ae8a4ad5a3b3@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQR74yXHMTGczQHYypIdayaRccAalgUCaQzbRwAKCRAdayaRccAa
ls8lAP9Dj1mOl+KTtajMvDnDTym4Sso9CaFP+5maFAv9CflAIwEA5QEtSwI9sMcH
ty8x9Y6TTuib+ns37i2jxR8cIt4jHwU=
=WA6M
-----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaRGyjgAKCRCRxhvAZXjc
okJ7AQDFRmoidpqHmRlvZQ3aismKkNHfx2k67QtRlX+YxDi8rQEAmmKyKUiX/SZV
39TroGfiJ5ytQuXwz3QxG/34cA+kAgQ=
=HqhX
-----END PGP SIGNATURE-----
Merge patch "kbuild: Add '-fms-extensions' to areas with dedicated CFLAGS"
Nathan Chancellor <nathan@kernel.org> says:
Shared branch between Kbuild and other trees for enabling
'-fms-extensions' for 6.19.
* tag 'kbuild-ms-extensions-6.19' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
kbuild: Add '-fms-extensions' to areas with dedicated CFLAGS
Kbuild: enable -fms-extensions
jfs: Rename _inline to avoid conflict with clang's '-fms-extensions'
Link: https://patch.msgid.link/20251101-kbuild-ms-extensions-dedicated-cflags-v1-1-38004aba524b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
The setns() system call supports:
(1) namespace file descriptors (nsfd)
(2) process file descriptors (pidfd)
When using nsfds the namespaces will remain active because they are
pinned by the vfs. However, when pidfds are used things are more
complicated.
When the target task exits and passes through exit_nsproxy_namespaces(),
or is reaped and thus also passes through exit_cred_namespaces(), after
the setns()'ing task has called prepare_nsset() but before it has called
commit_nsset(), the active reference count of the set of namespaces it
wants to setns() to might already have been dropped:
P1 P2
pid_p1 = clone(CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
pidfd = pidfd_open(pid_p1)
setns(pidfd, CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
prepare_nsset()
exit(0)
// ns->__ns_active_ref == 1
// parent_ns->__ns_active_ref == 1
-> exit_nsproxy_namespaces()
-> exit_cred_namespaces()
// ns_active_ref_put() will also put
// the reference on the owner of the
// namespace. If the only reason the
// owning namespace was alive was
// because it was a parent of @ns
// its active reference count now goes
// to zero... --------------------------------
// |
// ns->__ns_active_ref == 0 |
// parent_ns->__ns_active_ref == 0 |
| commit_nsset()
-----------------> // If setns()
// now manages to install the namespaces
// it will call ns_active_ref_get()
// on them thus bumping the active reference
// count from zero again but without also
// taking the required reference on the owner.
// Thus we get:
//
// ns->__ns_active_ref == 1
// parent_ns->__ns_active_ref == 0
When later someone does ns_active_ref_put() on @ns it will underflow
parent_ns->__ns_active_ref leading to a splat from our asserts
thinking there are still active references when in fact the counter
just underflowed.
So resurrect the ownership chain if necessary as well. If the caller
succeeded in grabbing passive references to the set of namespaces,
setns() should simply succeed even if the target task exits or gets
reaped in the meantime and has thus dropped all active references to its
namespaces.
The race is rare and can only be triggered when using pidfds to setns()
to namespaces. Also note that active references on initial namespaces
are nops.
Since we now always handle parent references directly we can drop
ns_ref_active_get_owner() when adding a namespace to a namespace tree.
This is now all handled uniformly in the places where the new namespaces
actually become active.
Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-5-ae8a4ad5a3b3@kernel.org
Fixes: 3c9820d5c64a ("ns: add active reference count")
Reported-by: syzbot+1957b26299cf3ff7890c@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
The mount namespace may in fact sleep when putting the last passive
reference so we need to drop the namespace reference outside of the rcu
read lock. Do this by delaying the put until the next iteration where
we've already moved on to the next namespace and legitimized it. Once we
drop the rcu read lock to call put_user() we will also drop the
reference to the previous namespace in the tree.
Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-3-ae8a4ad5a3b3@kernel.org
Fixes: 76b6f5dfb3 ("nstree: add listns()")
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner <brauner@kernel.org> says:
This converts most users combining
* prepare_creds()
* modify new creds
* override_creds()
* revert_creds()
* put_cred()
to rely on credentials guards.
* patches from https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-0-b447b82f2c9b@kernel.org:
trace: use override credential guard
trace: use prepare credential guard
coredump: use override credential guard
coredump: use prepare credential guard
coredump: split out do_coredump() from vfs_coredump()
coredump: mark struct mm_struct as const
coredump: pass struct linux_binfmt as const
coredump: move revert_cred() before coredump_cleanup()
sev-dev: use override credential guards
sev-dev: use prepare credential guard
sev-dev: use guard for path
cred: add prepare credential guard
Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-0-b447b82f2c9b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner <brauner@kernel.org> says:
I'm in the process of adding a few more guards for vfs constructs.
I've chosen the easy case of sb_start_write() and sb_end_write()
and converted eligible callers. I think long-term we can move a lot of
the manual placement to completely rely on guards - where sensible.
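Conversions then look roughly like this (a sketch assuming the guard is
defined via the usual cleanup.h machinery; the exact spelling is
whatever the final patch defines):
static long example_op(struct super_block *sb)
{
	/* replaces a manual sb_start_write()/sb_end_write() pair; the
	 * compiler ends the write section on every return path */
	guard(super_write)(sb);
	return do_protected_update(sb);	/* hypothetical callee */
}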
* patches from https://patch.msgid.link/20251104-work-guards-v1-0-5108ac78a171@kernel.org:
xfs: use super write guard in xfs_file_ioctl()
open: use super write guard in do_ftruncate()
btrfs: use super write guard in relocating_repair_kthread()
ext4: use super write guard in write_mmp_block()
btrfs: use super write guard in sb_start_write()
btrfs: use super write guard btrfs_run_defrag_inode()
btrfs: use super write guard in btrfs_reclaim_bgs_work()
fs: add super_write_guard
Link: https://patch.msgid.link/20251104-work-guards-v1-0-5108ac78a171@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
There is no good reason to have this as a func call, other than avoiding
the churn of adding fs_struct.h as needed.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251104170448.630414-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Jori Koolstra <jkoolstra@xs4all.nl> says:
Syzbot fuzzes filesystems by trying to mount and manipulate deliberately
corrupted images. This should not lead to BUG_ONs and WARN_ONs for
easy-to-detect corruptions. This series adds code to report such
corruptions and fixes two syzbot bugs of this kind.
* patches from https://patch.msgid.link/20251104143005.3283980-1-jkoolstra@xs4all.nl:
Fix a drop_nlink warning in minix_rename
Fix a drop_nlink warning in minix_rmdir
Add error handling to minix filesystem for inode corruption detection
Link: https://patch.msgid.link/20251104143005.3283980-1-jkoolstra@xs4all.nl
Signed-off-by: Christian Brauner <brauner@kernel.org>
We would like to provide early and specific warnings of filesystem
corruption without running into generic WARN_ONs and BUG_ONs.
Towards this goal, ext4, for example, has an EFSCORRUPTED errno and a
standardized inode corruption message format. This patch adds this
errno and message format to the minix filesystem.
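For reference, ext4 maps the errno onto an existing error code, so a
minimal sketch of the additions could look like this (the helper and the
corruption check are hypothetical; the real message format is the
patch's):
#define EFSCORRUPTED	EUCLEAN	/* same mapping ext4 uses */
/* report the corruption and fail instead of tripping WARN_ON/BUG_ON */
if (inode->i_nlink == 0) {
	minix_error_inode(inode, "corrupted link count");	/* hypothetical */
	return -EFSCORRUPTED;
}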
Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
Link: https://patch.msgid.link/20251104143005.3283980-2-jkoolstra@xs4all.nl
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christoph Hellwig <hch@lst.de> says:
This series enables the new block layer support for misaligned
individual vectors for zoned XFS.
The first patch is the from Qu and supposedly already applied to
the vfs iomap 6.19 branch, but I can't find it there. The next
two are small fixups for it, and the last one makes use of this
new functionality in XFS.
* patches from https://patch.msgid.link/20251031131045.1613229-1-hch@lst.de:
xfs: support sub-block aligned vectors in always COW mode
iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag
Link: https://patch.msgid.link/20251031131045.1613229-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now that the block layer and iomap have grown support to indicate
the bio sector size explicitly instead of assuming the device sector
size, we can ask for logical block size alignment and thus support
direct I/O writes where the overall size is logical block size
aligned, but the boundaries between vectors might not be.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251031131045.1613229-3-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Btrfs requires all of its bios to be fs block aligned. Normally that is
fine, but with the incoming block size larger than page size (bs > ps)
support the requirement is no longer met for direct IO, because
iomap_dio_bio_iter() calls bio_iov_iter_get_pages(), which only requires
alignment to bdev_logical_block_size().
In the real world that value is either 512 or 4K; on 4K page sized
systems it means bio_iov_iter_get_pages() can break the bio at any page
boundary, violating btrfs' requirement for the bs > ps cases.
To address this problem, introduce a new public iomap dio flag,
IOMAP_DIO_FSBLOCK_ALIGNED.
When calling __iomap_dio_rw() with that new flag, iomap_dio::flags will
inherit that new flag, and iomap_dio_bio_iter() will take fs block size
into the calculation of the alignment, and pass the alignment to
bio_iov_iter_get_pages(), respecting the fs block size requirement.
The initial user of this flag will be btrfs, which needs to calculate the
checksum for direct read and thus requires the biovec to be fs block
aligned for the incoming bs > ps support.
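A filesystem opts in by passing the flag through __iomap_dio_rw()'s
dio_flags argument, roughly (a sketch; the ops names are placeholders):
/* vectors must now be fs-block aligned, not merely aligned to
 * bdev_logical_block_size() */
struct iomap_dio *dio;
dio = __iomap_dio_rw(iocb, iter, &fs_iomap_ops, &fs_dio_ops,
		     IOMAP_DIO_FSBLOCK_ALIGNED, NULL, 0);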
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
[hch: also align pos/len, incorporate the trace flags from Darrick]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251031131045.1613229-2-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster <bfoster@redhat.com> says:
This adds folio batch support for iomap. This initially only targets
zero range, the use case being zeroing of dirty folios over unwritten
mappings. There is potential to support other operations in the future:
iomap seek data/hole has similar raciness issues as zero range, the
prospect of using this for buffered write has been raised for granular
locking purposes, etc.
The one major caveat with this zero range implementation is that it
doesn't look at iomap_folio_state to determine whether to zero a
sub-folio portion of the folio. Instead it just relies on whether the
folio was dirty or not. This means that spurious zeroing of unwritten
ranges is possible if a folio is dirty but the target range includes a
subrange that is not.
The reasoning is that this is essentially a complexity tradeoff. The
current use cases for iomap_zero_range() are limited mostly to partial
block zeroing scenarios. It's relatively harmless to zero an unwritten
block (i.e. not a correctness issue), and this is something that
filesystems have done in the past without much notice or issue. The
advantage is less code and this makes it a little easier to use a
filemap lookup function for the batch rather than open coding more logic
in iomap. That said, this can probably be enhanced to look at ifs in the
future if the use case expands and/or other operations justify it.
WRT testing, I've tested with and without a local hack to redirect
fallocate zero range calls to iomap_zero_range() in XFS. This helps test
beyond the partial block/folio use case, i.e. to cover boundary
conditions like full folio batch handling, etc. I recently added patch 7
in the spirit of that, which turns this logic into an XFS errortag. Further
comments on that are inline with patch 7.
* patches from https://lore.kernel.org/20251003134642.604736-1-bfoster@redhat.com:
xfs: error tag to force zeroing on debug kernels
iomap: remove old partial eof zeroing optimization
xfs: fill dirty folios on zero range of unwritten mappings
xfs: always trim mapping to requested range for zero range
iomap: optional zero range dirty folio processing
iomap: remove pos+len BUG_ON() to after folio lookup
filemap: add helper to look up dirty folios in a range
Signed-off-by: Christian Brauner <brauner@kernel.org>
iomap_zero_range() has to cover various corner cases that are
difficult to test on production kernels because it is used in fairly
limited use cases. For example, it is currently only used by XFS and
mostly only in partial block zeroing cases.
While it's possible to test most of these functional cases, we can
provide more robust test coverage by co-opting fallocate zero range
to invoke zeroing of the entire range instead of the more efficient
block punch/allocate sequence. Add an errortag to occasionally
invoke forced zeroing.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
iomap_zero_range() optimizes the partial eof block zeroing use case
by force zeroing if the mapping is dirty. This is to avoid frequent
flushing on file extending workloads, which hurts performance.
Now that the folio batch mechanism provides a more generic solution
and is used by the only real zero range user (XFS), this isolated
optimization is no longer needed. Remove the unnecessary code and
let callers use the folio batch or fall back to flushing by default.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Use the iomap folio batch mechanism to select folios to zero on zero
range of unwritten mappings. Trim the resulting mapping if the batch
is filled (unlikely for current use cases) to distinguish between a
range to skip and one that requires another iteration due to a full
batch.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Refactor and tweak the IOMAP_ZERO logic in preparation to support
filling the folio batch for unwritten mappings. Drop the superfluous
imap offset check since the hole case has already been filtered out.
Split the delalloc case handling into a sub-branch, and always
trim the imap to the requested offset/count so it can be more easily
used to bound the range to lookup in pagecache.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Joanne Koong <joannelkoong@gmail.com> says:
This series adds fuse iomap support for buffered reads and readahead.
This is needed so that granular uptodate tracking can be used in fuse when
large folios are enabled so that only the non-uptodate portions of the folio
need to be read in instead of having to read in the entire folio. It is also
needed in order to turn on large folios for servers that use the writeback
cache: without it there is a race condition that may lead to data corruption
when a partial write is followed by a read that happens before the write has
undergone writeback. The folio will not be marked uptodate by the partial
write, so the read will read in the entire folio from disk and overwrite the
partial write.
This is on top of two local iomap patches [1] [2] applied on top of
commit f1c864be6e88 ("Merge branch 'vfs-6.18.async' into vfs.all") in
Christian's vfs.all tree.
This series was run through fstests on fuse passthrough_hp with an
out-of-kernel patch enabling fuse large folios.
This patchset does not enable large folios on fuse yet. That will be part
of a different patchset.
* patches from https://lore.kernel.org/20250926002609.1302233-1-joannelkoong@gmail.com:
fuse: remove fc->blkbits workaround for partial writes
fuse: use iomap for readahead
fuse: use iomap for read_folio
iomap: make iomap_read_folio() a void return
iomap: move buffered io bio logic into new file
iomap: add caller-provided callbacks for read and readahead
iomap: set accurate iter->pos when reading folio ranges
iomap: track pending read bytes more optimally
iomap: rename iomap_readpage_ctx struct to iomap_read_folio_ctx
iomap: rename iomap_readpage_iter() to iomap_read_folio_iter()
iomap: iterate over folio mapping in iomap_readpage_iter()
iomap: store read/readahead bio generically
iomap: move read/readahead bio submission logic into helper function
iomap: move bio read logic into helper function
Signed-off-by: Christian Brauner <brauner@kernel.org>
The only way zero range can currently process unwritten mappings
with dirty pagecache is to check whether the range is dirty before
mapping lookup and then flush when at least one underlying mapping
is unwritten. This ordering is required to prevent iomap lookup from
racing with folio writeback and reclaim.
Since zero range can skip ranges of unwritten mappings that are
clean in cache, this operation can be improved by allowing the
filesystem to provide a set of dirty folios that require zeroing. In
turn, rather than flush or iterate file offsets, zero range can
iterate on folios in the batch and advance over clean or uncached
ranges in between.
Add a folio_batch in struct iomap and provide a helper for
filesystems to populate the batch at lookup time. Update the folio
lookup path to return the next folio in the batch, if provided, and
advance the iter if the folio starts beyond the current offset.
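From a filesystem's ->iomap_begin path this looks roughly like the
following (a hedged sketch; the helper name is illustrative for whatever
this patch exports):
/* only dirty folios over the unwritten mapping need zeroing; hand them
 * to iomap so it can skip the clean/uncached gaps in between */
if (mapping_is_unwritten && (flags & IOMAP_ZERO))
	iomap_fill_dirty_folios(iomap, offset, length);	/* illustrative */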
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now that fuse is integrated with iomap for read/readahead, we can remove
the workaround that was added in commit bd24d2108e ("fuse: fix fuseblk
i_blkbits for iomap partial writes"), which was previously needed to
avoid a race condition where an iomap partial write may be overwritten
by a read if blocksize < PAGE_SIZE. Now that fuse does iomap
read/readahead, this is protected against since there is granular
uptodate tracking of blocks, which means this workaround can be removed.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The bug checks at the top of iomap_write_begin() assume the pos/len
reflect exactly the next range to process. This may no longer be the
case once the get folio path is able to process a folio batch from
the filesystem. On top of that, len is already trimmed to within the
iomap/srcmap by iomap_length(), so these checks aren't terribly
useful. Remove the unnecessary BUG_ON() checks.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Do readahead in fuse using iomap. This gives us granular uptodate
tracking for large folios, which optimizes how much data needs to be
read in. If some portions of the folio are already uptodate (eg through
a prior write), we only need to read in the non-uptodate portions.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add a new filemap_get_folios_dirty() helper to look up existing dirty
folios in a range and add them to a folio_batch. This is to support
optimization of certain iomap operations that only care about dirty
folios in a target range. For example, zero range only zeroes the subset
of dirty pages over unwritten mappings, seek hole/data may use similar
logic in the future, etc.
Note that the helper is intended for use under internal fs locks.
Therefore it trylocks folios in order to filter out clean folios.
This loosely follows the logic from filemap_range_has_writeback().
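Usage presumably mirrors filemap_get_folios() (a hedged sketch; the
exact signature is the one the patch adds):
struct folio_batch fbatch;
pgoff_t start = first_index;	/* hypothetical bounds */
folio_batch_init(&fbatch);
while (filemap_get_folios_dirty(mapping, &start, last_index, &fbatch)) {
	/* act on fbatch.folios[0..folio_batch_count(&fbatch) - 1] */
	folio_batch_release(&fbatch);
}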
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Read folio data into the page cache using iomap. This gives us granular
uptodate tracking for large folios, which optimizes how much data needs
to be read in. If some portions of the folio are already uptodate (eg
through a prior write), we only need to read in the non-uptodate
portions.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
No errors are propagated in iomap_read_folio(). Change
iomap_read_folio() to a void return to make this clearer to callers.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Move bio logic in the buffered io code into its own file and remove
CONFIG_BLOCK gating for iomap read/readahead.
[1] https://lore.kernel.org/linux-fsdevel/aMK2GuumUf93ep99@infradead.org/
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add caller-provided callbacks for read and readahead so that it can be
used generically, especially by filesystems that are not block-based.
In particular, this:
* Modifies the read and readahead interface to take in a
struct iomap_read_folio_ctx that is publicly defined as:
struct iomap_read_folio_ctx {
const struct iomap_read_ops *ops;
struct folio *cur_folio;
struct readahead_control *rac;
void *read_ctx;
};
where struct iomap_read_ops is defined as:
struct iomap_read_ops {
int (*read_folio_range)(const struct iomap_iter *iter,
struct iomap_read_folio_ctx *ctx,
size_t len);
void (*read_submit)(struct iomap_read_folio_ctx *ctx);
};
read_folio_range() reads in the folio range and must be provided by
the caller. read_submit() is optional and is used for submitting any
pending read requests.
* Modifies existing filesystems that use iomap for read and readahead to
use the new API, through the new statically inlined helpers
iomap_bio_read_folio() and iomap_bio_readahead(). There is no change
in functionality for those filesystems.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Advance iter to the correct position before calling an IO helper to read
in a folio range. This allows the helper to reliably use iter->pos to
determine the starting offset for reading.
This will simplify the interface for reading in folio ranges when iomap
read/readahead supports caller-provided callbacks.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Instead of incrementing read_bytes_pending for every folio range read in
(which requires acquiring the spinlock to do so), set read_bytes_pending
to the folio size when the first range is asynchronously read in, keep
track of how many bytes total are asynchronously read in, and adjust
read_bytes_pending accordingly after issuing requests to read in all the
necessary ranges.
iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
value for pending bytes necessarily indicates the folio is in the bio.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Replace deprecated simple_strtoul() with kstrtouint() for better error
handling and input validation. Return 0 on parsing failure to indicate
invalid parameter, maintaining existing behavior for valid inputs.
The simple_strtoul() function is deprecated in favor of the kstrtoint()
family of functions, which provide better error handling and are
recommended for new code and replacements.
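The conversion follows the standard kstrto*() pattern described above
(context simplified):
unsigned int val;
/* kstrtouint() returns 0 on success and -errno on bad input */
if (kstrtouint(str, 10, &val))
	return 0;	/* parse failure: report invalid parameter */
return val;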
Signed-off-by: Kaushlendra Kumar <kaushlendra.kumar@intel.com>
Link: https://patch.msgid.link/20251103080627.1844645-1-kaushlendra.kumar@intel.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Qu Wenruo <wqu@suse.com> says:
The first patch is a cleanup related to the sync_inodes_one_sb() callback.
Since it always waits for the writeback, there is no need to pass any
parameter to it.
The second patch is a fix mostly affecting btrfs, as btrfs requires an
explicit sync_fs() call with wait == 1 to commit its super blocks,
and sync_bdevs() won't cut it at all.
However, the current emergency sync never passes wait == 1, which means
btrfs will write back all dirty data and metadata but still not update
the super block, resulting in everything still pointing back to the old
data/metadata.
This leads to a problem where btrfs doesn't seem to do anything during
an emergency sync.
The second patch fixes the problem by passing wait == 1 for the second
iteration of sync_fs_one_sb().
* patches from https://patch.msgid.link/cover.1762142636.git.wqu@suse.com:
fs: fully sync all fses even for an emergency sync
fs: do not pass a parameter for sync_inodes_one_sb()
Link: https://patch.msgid.link/cover.1762142636.git.wqu@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
[BUG]
There is a bug report that during an emergency sync, btrfs only writes
back all the dirty data and metadata without a full transaction commit,
leaving the super block still pointing to the old trees; thus the end
user can only see the old data, not the newer one.
[CAUSE]
Initially this looks like a btrfs-specific bug, since ext4 is not
affected by it.
But the root problem here is a combination of btrfs features and the
no-wait nature of emergency sync.
First, do_sync_work() will call sync_inodes_one_sb() for every fs to
write back all the dirty pages for the fs.
Btrfs will properly write back all dirty pages, including both data and
the updated metadata. So far so good.
Then sync_fs_one_sb() is called with @nowait; in the case of btrfs that
means no full transaction commit, thus no super block update.
At this stage, btrfs is only one super block update away from being
fully committed. I believe it's more or less the same for other fses too.
The problem is the next step, sync_bdevs().
Normally other fses have their super block already updated in the page
cache of the block device, but btrfs only updates the super block during
full transaction commit.
So sync_bdevs() may work for other fses, but not for btrfs: btrfs is
still using its older super block, all pointing back to the old metadata
and data.
Thus if power loss happens after an emergency sync, the end user will
only see the old data, not the newer one, even though everything but the
super block has already been written back.
[FIX]
Since the emergency sync is already executing in a workqueue, I didn't
see much need to only do a nowait sync, especially given that
sync_inodes_one_sb() always waits for the writeback to finish.
Instead, for the second iteration of sync_fs_one_sb(), pass wait == 1
into it, so fses like btrfs can properly commit their super blocks.
Reported-by: Askar Safin <safinaskar@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/20251101150429.321537-1-safinaskar@gmail.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Link: https://patch.msgid.link/7b7fd40c5fe440b633b6c0c741d96ce93eb5a89a.1762142636.git.wqu@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
The function sync_inodes_one_sb() will always wait for the writeback
and ignores the optional parameter.
Explicitly pass NULL as parameter for the call sites inside
do_sync_work().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Link: https://patch.msgid.link/8079af1c4798cb36887022a8c51547a727c353cf.1762142636.git.wqu@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner <brauner@kernel.org> says:
The recent changes to rework coredump handling to rely on unix sockets
are in the process of being used in systemd. Yu reported one shortcoming,
namely that the signal causing the coredump was not available before the
crashing process was reaped.
The previous systemd coredump container interface requires the coredump
file descriptor and basic information, including the signal number, to
be sent to the container. This means we need to have the signal number
available before sending the coredump to the container.
In general, the extension makes sense and fits with the rest of the
coredump information.
In addition to this extension this fixes a bunch of the tests that were
failing and reworks the publication mechanism for exit and coredump info
retrievable via the pidfd ioctl.
* patches from https://patch.msgid.link/20251028-work-coredump-signal-v1-0-ca449b7b7aa0@kernel.org: (22 commits)
selftests/coredump: add second PIDFD_INFO_COREDUMP_SIGNAL test
selftests/coredump: add first PIDFD_INFO_COREDUMP_SIGNAL test
selftests/coredump: ignore ENOSPC errors
selftests/coredump: add debug logging to coredump socket protocol tests
selftests/coredump: add debug logging to coredump socket tests
selftests/coredump: add debug logging to test helpers
selftests/coredump: handle edge-triggered epoll correctly
selftests/coredump: fix userspace coredump client detection
selftests/coredump: fix userspace client detection
selftests/coredump: split out coredump socket tests
selftests/coredump: split out common helpers
selftests/pidfd: add second supported_mask test
selftests/pidfd: add first supported_mask test
selftests/pidfd: update pidfd header
pidfs: expose coredump signal
pidfs: drop struct pidfs_exit_info
pidfs: prepare to drop exit_info pointer
pidfd: add a new supported_mask field
pidfs: add missing BUILD_BUG_ON() assert on struct pidfd_info
pidfs: add missing PIDFD_INFO_SIZE_VER1
...
Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-0-ca449b7b7aa0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Verify that when using simple socket-based coredump (@ pattern),
the coredump_signal field is correctly exposed as SIGABRT.
Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-22-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Verify that when using simple socket-based coredump (@ pattern),
the coredump_signal field is correctly exposed as SIGSEGV.
Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-21-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
If we crash multiple processes at the same time we may run out of space.
Just ignore those errors. They're not actually all that relevant for the
test.
Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-20-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
PIDFD_INFO_COREDUMP is only retrievable until the task has exited. After
it has exited, task->mm is NULL. So if the task didn't actually coredump
we can't retrieve its dumpability settings anymore. Only if the task
did coredump will we have stashed the coredump information in the
respective struct pid.
Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-15-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Verify that when PIDFD_INFO_SUPPORTED_MASK is requested, the kernel
returns the supported_mask field indicating which flags the kernel
supports.
Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-10-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner <brauner@kernel.org> says:
This converts all users of override_creds() to rely on credentials
guards. Leave all those that do the prepare_creds() + modify creds +
override_creds() dance alone for now. Some of them qualify for their own
variant.
* patches from https://patch.msgid.link/20251103-work-creds-guards-simple-v1-0-a3e156839e7f@kernel.org:
net/dns_resolver: use credential guards in dns_query()
cgroup: use credential guards in cgroup_attach_permissions()
act: use credential guards in acct_write_process()
smb: use credential guards in cifs_get_spnego_key()
nfs: use credential guards in nfs_idmap_get_key()
nfs: use credential guards in nfs_local_call_write()
nfs: use credential guards in nfs_local_call_read()
erofs: use credential guards
binfmt_misc: use credential guards
backing-file: use credential guards for mmap
backing-file: use credential guards for splice write
backing-file: use credential guards for splice read
backing-file: use credential guards for writes
backing-file: use credential guards for reads
aio: use credential guards
cred: add {scoped_}with_creds() guards
Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-0-a3e156839e7f@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner <brauner@kernel.org> says:
A few months ago I did work to make override_creds()/revert_creds()
completely reference count free - mostly for the sake of
overlayfs but it has been beneficial to everyone using this.
In a recent pull request from Jens that introduced another round of
override_creds()/revert_creds() for nbd, Linus asked whether we could
avoid the prepare_kernel_cred() calls that duplicate the kernel
credentials and then drop them again later.
Yes, we can actually. We can use the guard infrastructure to completely
avoid the allocation and then also to never expose the temporary
variable to hold the kernel credentials anywhere in the callers.
So add with_kernel_creds() and scoped_with_kernel_creds() for this
purpose. Also take the opportunity to fix up the scoped_class() macro I
introduced two cycles ago.
* patches from https://patch.msgid.link/20251103-work-creds-init_cred-v1-0-cb3ec8711a6a@kernel.org:
unix: don't copy creds
target: don't copy kernel creds
nbd: don't copy kernel creds
firmware: don't copy kernel creds
cred: add {scoped_}with_kernel_creds
cred: make init_cred static
cred: add kernel_cred() helper
cleanup: fix scoped_class()
Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-0-cb3ec8711a6a@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add a new cleanup class for override creds. We can make use of this in a
bunch of places going forward.
Based on this, add scoped_with_kernel_creds() that can be used to temporarily
assume kernel credentials for specific tasks such as firmware loading,
or coredump socket connections. At no point will the caller interact
with the kernel credentials directly.
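Based on that description, usage presumably looks something like this (a
hedged sketch; the real macro shape is whatever this patch defines):
/* run a block with kernel credentials; no struct cred pointer is ever
 * exposed to, or managed by, the caller */
scoped_with_kernel_creds() {
	ret = connect_coredump_socket();	/* hypothetical callee */
}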
Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-4-cb3ec8711a6a@kernel.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is a class, not a guard, so why on earth is it checking for guard
pointers or conditional lock acquisition? None of it makes any sense at
all.
I'm not sure what happened back then. Maybe I had a brief psychedelic
period that I completely forgot about and spaced out into a zone where
that initial macro implementation made any sense at all.
Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-1-cb3ec8711a6a@kernel.org
Fixes: 5c21c5f22d ("cleanup: add a scoped version of CLASS()")
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner <brauner@kernel.org> says:
As announced a while ago this is the next step building on the nstree
work from prior cycles. There's a bunch of fixes and semantic cleanups
in here and a ton of tests.
Currently listns() is relying on active namespace reference counts which
are introduced alongside this series.
While a namespace is on the namespace trees with a valid reference count
it is possible to reopen it through a namespace file handle. This is all
fine but has some issues that should be addressed.
On current kernels a namespace is visible to userspace in the
following cases:
(1) The namespace is in use by a task.
(2) The namespace is persisted through a VFS object (namespace file
descriptor or bind-mount).
Note that (2) only cares about direct persistence of the namespace
itself, not indirect persistence via e.g. file->f_cred references or
similar.
(3) The namespace is a hierarchical namespace type and is the parent of
a single or multiple child namespaces.
Case (3) is interesting because it is possible that a parent namespace
might not fulfill any of (1) or (2), i.e., is invisible to userspace but
it may still be resurrected through the NS_GET_PARENT ioctl().
Currently namespace file handles allow much broader access to namespaces
than what is possible via (1)-(3). The reason is that
namespaces may remain pinned for completely internal reasons yet are
inaccessible to userspace.
For example, a user namespace may remain pinned by get_cred() calls to
stash the opener's credentials into file->f_cred. As it stands, file
handles allow resurrecting such a user namespace even though this
should not be possible via (1)-(3). This is a fundamental uapi change
that we shouldn't do if we don't have to.
Consider the following insane case: Various architectures support the
CONFIG_MMU_LAZY_TLB_REFCOUNT option which uses lazy TLB destruction.
When this option is set a userspace task's struct mm_struct may be used
for kernel threads such as the idle task and will only be destroyed once
the cpu's runqueue switches back to another task. But because of ptrace()
permission checks struct mm_struct stashes the user namespace of the
task that struct mm_struct originally belonged to. The kernel thread
will take a reference on the struct mm_struct and thus pin it.
So on an idle system user namespaces can be persisted for arbitrary
amounts of time which also means that they can be resurrected using
namespace file handles. That makes no sense whatsoever. The problem is
of course exacerbated on large systems with a huge number of cpus.
To handle this nicely we introduce an active reference count which
tracks (1)-(3). This is easy to do as all of these things are already
managed centrally. Only (1)-(3) will count towards the active reference
count and only namespaces which are active may be opened via namespace
file handles.
The problem is that namespaces may be resurrected, which means that they
can become temporarily inactive and be reactivated some time later.
Currently the only example of this is the SIOCGSKNS socket ioctl. The
SIOCGSKNS ioctl allows opening a network namespace file descriptor based
on a socket file descriptor.
If a socket is tied to a network namespace that subsequently becomes
inactive but that socket is persisted by another process in another
network namespace (e.g., via SCM_RIGHTS or pidfd_getfd()) then the
SIOCGSKNS ioctl will resurrect this network namespace.
So calls to open_related_ns() and open_namespace() will end up
resurrecting the corresponding namespace tree.
Note that the active reference count does not regulate the lifetime of
the namespace itself. This is still done by the normal reference count.
The active reference count can only be elevated if the regular reference
count is elevated.
The active reference count also doesn't regulate the presence of a
namespace on the namespace trees. It only regulates its visibility to
namespace file handles (and in later patches to listns()).
A namespace remains on the namespace trees from creation until its
actual destruction. This will allow the kernel to always reach any
namespace trivially and it will also enable subsystems like bpf to walk
the namespace lists on the system for tracing or general introspection
purposes.
Note that different namespaces have different visibility lifetimes on
current kernels. While most namespaces are immediately released when the
last task using them exits, the user- and pid namespaces are persisted
and thus both remain accessible via /proc/<pid>/ns/<ns_type>.
The user namespace lifetime is aligned with struct cred and is only
released through exit_creds(). However, it becomes inaccessible to
userspace once the last task using it is reaped, i.e., when
release_task() is called and all proc entries are flushed. Similarly,
the pid namespace is also visible until the last task using it has been
reaped and the associated pid numbers are freed.
The active reference counts of the user and pid namespaces are
decremented once the task is reaped.
Based on the namespace trees and the active reference count, a new
listns() system call is added that allows userspace to iterate through
namespaces in the system. This provides a programmatic interface to
discover and inspect namespaces, enhancing the existing namespace APIs.
Currently, there is no direct way for userspace to enumerate namespaces
in the system. Applications must resort to scanning /proc/<pid>/ns/
across all processes, which is:
1. Inefficient - requires iterating over all processes
2. Incomplete - misses inactive namespaces that aren't attached to any
running process but are kept alive by file descriptors, bind mounts,
or parent namespace references
3. Permission-heavy - requires access to /proc for many processes
4. Unordered - provides no ordering or ownership information.
5. Unfiltered - must always iterate and check all namespaces to select
a single namespace type.
The list goes on. The listns() system call solves these problems by
providing direct kernel-level enumeration of namespaces. It is similar
to listmount() but obviously tailored to namespaces.
/*
* @req: Pointer to struct ns_id_req specifying search parameters
* @ns_ids: User buffer to receive namespace IDs
* @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
* @flags: Reserved for future use (must be 0)
*/
ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
size_t nr_ns_ids, unsigned int flags);
Returns:
- On success: Number of namespace IDs written to ns_ids
- On error: Negative error code
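Since libc has no wrapper yet, the examples below assume a raw syscall
shim along these lines (__NR_listns is whatever number the target
architecture assigns):
	#include <unistd.h>
	#include <sys/syscall.h>
	static ssize_t listns(const struct ns_id_req *req, uint64_t *ns_ids,
			      size_t nr_ns_ids, unsigned int flags)
	{
		return syscall(__NR_listns, req, ns_ids, nr_ns_ids, flags);
	}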
/*
* @size: Structure size
* @ns_id: Starting point for iteration; use 0 for first call, then
* use the last returned ID for subsequent calls to paginate
* @ns_type: Bitmask of namespace types to include (from enum ns_type):
* 0: Return all namespace types
* MNT_NS: Mount namespaces
* NET_NS: Network namespaces
* USER_NS: User namespaces
* etc. Can be OR'd together
* @user_ns_id: Filter results to namespaces owned by this user namespace:
* 0: Return all namespaces (subject to permission checks)
* LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace
* Other value: Namespaces owned by the specified user namespace ID
*/
struct ns_id_req {
__u32 size; /* sizeof(struct ns_id_req) */
__u32 spare; /* Reserved, must be 0 */
__u64 ns_id; /* Last seen namespace ID (for pagination) */
__u32 ns_type; /* Filter by namespace type(s) */
__u32 spare2; /* Reserved, must be 0 */
__u64 user_ns_id; /* Filter by owning user namespace */
};
Example 1: List all namespaces
void list_all_namespaces(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0, /* Start from beginning */
.ns_type = 0, /* All types */
.user_ns_id = 0, /* All user namespaces */
};
uint64_t ids[100];
ssize_t ret;
printf("All namespaces in the system:\n");
do {
ret = listns(&req, ids, 100, 0);
if (ret < 0) {
perror("listns");
break;
}
for (ssize_t i = 0; i < ret; i++)
printf(" Namespace ID: %llu\n", (unsigned long long)ids[i]);
/* Continue from last seen ID */
if (ret > 0)
req.ns_id = ids[ret - 1];
} while (ret == 100); /* Buffer was full, more may exist */
}
Example 2: List network namespaces only
void list_network_namespaces(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0,
.ns_type = NET_NS, /* Only network namespaces */
.user_ns_id = 0,
};
uint64_t ids[100];
ssize_t ret;
ret = listns(&req, ids, 100, 0);
if (ret < 0) {
perror("listns");
return;
}
printf("Network namespaces: %zd found\n", ret);
for (ssize_t i = 0; i < ret; i++)
printf(" netns ID: %llu\n", (unsigned long long)ids[i]);
}
Example 3: List namespaces owned by current user namespace
void list_owned_namespaces(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0,
.ns_type = 0, /* All types */
.user_ns_id = LISTNS_CURRENT_USER, /* Current userns */
};
uint64_t ids[100];
ssize_t ret;
ret = listns(&req, ids, 100, 0);
if (ret < 0) {
perror("listns");
return;
}
printf("Namespaces owned by my user namespace: %zd\n", ret);
for (ssize_t i = 0; i < ret; i++)
printf(" ns ID: %llu\n", (unsigned long long)ids[i]);
}
Example 4: List multiple namespace types
void list_network_and_mount_namespaces(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0,
.ns_type = NET_NS | MNT_NS, /* Network and mount */
.user_ns_id = 0,
};
uint64_t ids[100];
ssize_t ret;
ret = listns(&req, ids, 100, 0);
printf("Network and mount namespaces: %zd found\n", ret);
}
Example 5: Pagination through large namespace sets
void list_all_with_pagination(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0,
.ns_type = 0,
.user_ns_id = 0,
};
uint64_t ids[50];
size_t total = 0;
ssize_t ret;
printf("Enumerating all namespaces with pagination:\n");
while (1) {
ret = listns(&req, ids, 50, 0);
if (ret < 0) {
perror("listns");
break;
}
if (ret == 0)
break; /* No more namespaces */
total += ret;
printf(" Batch: %zd namespaces\n", ret);
/* Last ID in this batch becomes start of next batch */
req.ns_id = ids[ret - 1];
if (ret < 50)
break; /* Partial batch = end of results */
}
printf("Total: %zu namespaces\n", total);
}
listns() respects namespace isolation and capabilities:
(1) Global listing (user_ns_id = 0):
- Requires CAP_SYS_ADMIN in the namespace's owning user namespace
- OR the namespace must be in the caller's namespace context (e.g.,
a namespace the caller is currently using)
- User namespaces additionally allow listing if the caller has
CAP_SYS_ADMIN in that user namespace itself
(2) Owner-filtered listing (user_ns_id != 0):
- Requires CAP_SYS_ADMIN in the specified owner user namespace
- OR the namespace must be in the caller's namespace context
- This allows unprivileged processes to enumerate namespaces they own
(3) Visibility:
- Only "active" namespaces are listed
- A namespace is active if it has a non-zero __ns_ref_active count
- This includes namespaces used by running processes, held by open
file descriptors, or kept active by bind mounts
- Inactive namespaces (kept alive only by internal kernel
references) are not visible via listns()
* patches from https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-0-2e6f823ebdc0@kernel.org: (74 commits)
selftests/namespace: test listns() pagination
selftests/namespace: add stress test
selftests/namespace: commit_creds() active reference tests
selftests/namespace: third threaded active reference count test
selftests/namespace: second threaded active reference count test
selftests/namespace: first threaded active reference count test
selftests/namespaces: twelfth inactive namespace resurrection test
selftests/namespaces: eleventh inactive namespace resurrection test
selftests/namespaces: tenth inactive namespace resurrection test
selftests/namespaces: ninth inactive namespace resurrection test
selftests/namespaces: eighth inactive namespace resurrection test
selftests/namespaces: seventh inactive namespace resurrection test
selftests/namespaces: sixth inactive namespace resurrection test
selftests/namespaces: fifth inactive namespace resurrection test
selftests/namespaces: fourth inactive namespace resurrection test
selftests/namespaces: third inactive namespace resurrection test
selftests/namespaces: second inactive namespace resurrection test
selftests/namespaces: first inactive namespace resurrection test
selftests/namespaces: seventh listns() permission test
selftests/namespaces: sixth listns() permission test
...
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-0-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Stress tests for namespace active reference counting.
These tests validate that the active reference counting system can
handle high load scenarios including rapid namespace
creation/destruction, large numbers of concurrent namespaces, and
various edge cases under stress.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-71-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Test that namespaces become inactive after subprocess with multiple
threads exits. Create a subprocess that unshares user and network
namespaces, then creates two threads that share those namespaces. Verify
that after all threads and subprocess exit, the namespaces are no longer
listed by listns() and cannot be opened by open_by_handle_at().
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-69-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Test that a namespace remains active while a thread holds an fd to it.
Even after the thread exits, the namespace should remain active as long
as another thread holds a file descriptor to it.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-68-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Test multi-level namespace resurrection across three user namespace levels.
This test creates a complex namespace hierarchy with three levels of user
namespaces and a network namespace at the deepest level. It verifies that
the resurrection semantics work correctly when SIOCGSKNS is called on a
socket from an inactive namespace tree, and that listns() and
open_by_handle_at() correctly respect visibility rules.
Hierarchy after child processes exit (all with 0 active refcount):
net_L3A (0) <- Level 3 network namespace
|
+
userns_L3 (0) <- Level 3 user namespace
|
+
userns_L2 (0) <- Level 2 user namespace
|
+
userns_L1 (0) <- Level 1 user namespace
|
x
init_user_ns
The test verifies:
1. SIOCGSKNS on a socket from inactive net_L3A resurrects the entire chain
2. After resurrection, all namespaces are visible in listns()
3. Resurrected namespaces can be reopened via file handles
4. Closing the netns FD cascades down: the entire ownership chain
(userns_L3 -> userns_L2 -> userns_L1) becomes inactive again
5. Inactive namespaces disappear from listns() and cannot be reopened
6. Calling SIOCGSKNS again on the same socket resurrects the tree again
7. After second resurrection, namespaces are visible and can be reopened
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-66-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Test combined listns() and file handle operations with socket-kept
netns. Create a netns, keep it alive with a socket, verify it appears in
listns(), then reopen it via file handle obtained from listns() entry.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-65-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Test that socket-kept netns can be reopened via file handle.
Verify that a network namespace kept alive by a socket FD can be
reopened using file handles even after the creating process exits.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-64-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Test that socket-kept netns appears in listns() output.
Verify that a network namespace kept alive by a socket FD appears in
listns() output even after the creating process exits, and that it
disappears when the socket is closed.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-63-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Test that user namespace as a child also propagates correctly.
Create user_A -> user_B, verify when user_B is active that user_A
is also active. This is different from non-user namespace children.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-36-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Test that parent stays active as long as ANY child is active.
Create parent user namespace with two child net namespaces.
Parent should remain active until BOTH children are inactive.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-35-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Test hierarchical propagation with deep namespace hierarchy.
Create: init_user_ns -> user_A -> user_B -> net_ns
When net_ns is active, both user_A and user_B should be active.
This verifies the conditional recursion in __ns_ref_active_put() works.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-34-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Test hierarchical active reference propagation.
When a child namespace is active, its owning user namespace should also
be active automatically due to hierarchical active reference propagation.
This ensures parents are always reachable when children are active.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-29-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Test namespace lifecycle: create a namespace in a child process, get a
file handle while it's active, then try to reopen after the process
exits (namespace becomes inactive).
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-24-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add a new listns() system call that allows userspace to iterate through
namespaces in the system. This provides a programmatic interface to
discover and inspect namespaces, enhancing the existing namespace APIs.
Currently, there is no direct way for userspace to enumerate namespaces
in the system. Applications must resort to scanning /proc/<pid>/ns/
across all processes, which is:
1. Inefficient - requires iterating over all processes
2. Incomplete - misses inactive namespaces that aren't attached to any
running process but are kept alive by file descriptors, bind mounts,
or parent namespace references
3. Permission-heavy - requires access to /proc for many processes
4. Unordered - provides no ordering or ownership information.
5. Unfiltered - must always iterate and check all namespaces to select
a single namespace type.
The list goes on. The listns() system call solves these problems by
providing direct kernel-level enumeration of namespaces. It is similar
to listmount() but obviously tailored to namespaces.
/*
* @req: Pointer to struct ns_id_req specifying search parameters
* @ns_ids: User buffer to receive namespace IDs
* @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
* @flags: Reserved for future use (must be 0)
*/
ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
size_t nr_ns_ids, unsigned int flags);
Returns:
- On success: Number of namespace IDs written to ns_ids
- On error: Negative error code
/*
* @size: Structure size
* @ns_id: Starting point for iteration; use 0 for first call, then
* use the last returned ID for subsequent calls to paginate
* @ns_type: Bitmask of namespace types to include (from enum ns_type):
* 0: Return all namespace types
* MNT_NS: Mount namespaces
* NET_NS: Network namespaces
* USER_NS: User namespaces
* etc. Can be OR'd together
* @user_ns_id: Filter results to namespaces owned by this user namespace:
* 0: Return all namespaces (subject to permission checks)
* LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace
* Other value: Namespaces owned by the specified user namespace ID
*/
struct ns_id_req {
__u32 size; /* sizeof(struct ns_id_req) */
__u32 spare; /* Reserved, must be 0 */
__u64 ns_id; /* Last seen namespace ID (for pagination) */
__u32 ns_type; /* Filter by namespace type(s) */
__u32 spare2; /* Reserved, must be 0 */
__u64 user_ns_id; /* Filter by owning user namespace */
};
Example 1: List all namespaces
void list_all_namespaces(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0, /* Start from beginning */
.ns_type = 0, /* All types */
.user_ns_id = 0, /* All user namespaces */
};
uint64_t ids[100];
ssize_t ret;
printf("All namespaces in the system:\n");
do {
ret = listns(&req, ids, 100, 0);
if (ret < 0) {
perror("listns");
break;
}
for (ssize_t i = 0; i < ret; i++)
printf(" Namespace ID: %llu\n", (unsigned long long)ids[i]);
/* Continue from last seen ID */
if (ret > 0)
req.ns_id = ids[ret - 1];
} while (ret == 100); /* Buffer was full, more may exist */
}
Example 2: List network namespaces only
void list_network_namespaces(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0,
.ns_type = NET_NS, /* Only network namespaces */
.user_ns_id = 0,
};
uint64_t ids[100];
ssize_t ret;
ret = listns(&req, ids, 100, 0);
if (ret < 0) {
perror("listns");
return;
}
printf("Network namespaces: %zd found\n", ret);
for (ssize_t i = 0; i < ret; i++)
printf(" netns ID: %llu\n", (unsigned long long)ids[i]);
}
Example 3: List namespaces owned by current user namespace
void list_owned_namespaces(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0,
.ns_type = 0, /* All types */
.user_ns_id = LISTNS_CURRENT_USER, /* Current userns */
};
uint64_t ids[100];
ssize_t ret;
ret = listns(&req, ids, 100, 0);
if (ret < 0) {
perror("listns");
return;
}
printf("Namespaces owned by my user namespace: %zd\n", ret);
for (ssize_t i = 0; i < ret; i++)
printf(" ns ID: %llu\n", (unsigned long long)ids[i]);
}
Example 4: List multiple namespace types
void list_network_and_mount_namespaces(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0,
.ns_type = NET_NS | MNT_NS, /* Network and mount */
.user_ns_id = 0,
};
uint64_t ids[100];
ssize_t ret;
ret = listns(&req, ids, 100, 0);
printf("Network and mount namespaces: %zd found\n", ret);
}
Example 5: Pagination through large namespace sets
void list_all_with_pagination(void)
{
struct ns_id_req req = {
.size = sizeof(req),
.ns_id = 0,
.ns_type = 0,
.user_ns_id = 0,
};
uint64_t ids[50];
size_t total = 0;
ssize_t ret;
printf("Enumerating all namespaces with pagination:\n");
while (1) {
ret = listns(&req, ids, 50, 0);
if (ret < 0) {
perror("listns");
break;
}
if (ret == 0)
break; /* No more namespaces */
total += ret;
printf(" Batch: %zd namespaces\n", ret);
/* Last ID in this batch becomes start of next batch */
req.ns_id = ids[ret - 1];
if (ret < 50)
break; /* Partial batch = end of results */
}
printf("Total: %zu namespaces\n", total);
}
Permission Model
listns() respects namespace isolation and capabilities:
(1) Global listing (user_ns_id = 0):
- Requires CAP_SYS_ADMIN in the namespace's owning user namespace
- OR the namespace must be in the caller's namespace context (e.g.,
a namespace the caller is currently using)
- User namespaces additionally allow listing if the caller has
CAP_SYS_ADMIN in that user namespace itself
(2) Owner-filtered listing (user_ns_id != 0):
- Requires CAP_SYS_ADMIN in the specified owner user namespace
- OR the namespace must be in the caller's namespace context
- This allows unprivileged processes to enumerate namespaces they own
(3) Visibility:
- Only "active" namespaces are listed
- A namespace is active if it has a non-zero __ns_ref_active count
- This includes namespaces used by running processes, held by open
file descriptors, or kept active by bind mounts
- Inactive namespaces (kept alive only by internal kernel
references) are not visible via listns()
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-19-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
The namespace tree doesn't express the ownership concept of namespaces
appropriately. Maintain a list of directly owned namespaces per user
namespace. This will allow userspace and the kernel to use the listns()
system call to walk the namespace tree by owning user namespace. The
rbtree is used to find the relevant namespace entry point, which allows
iteration to continue, and the owner list can be used to walk the tree
completely lock free.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-16-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
The initial set of namespaces comes with fixed inode numbers making it
easy for userspace to identify them solely based on that information.
This has long preceded anything here.
Similarly, let's assign fixed namespace ids for the initial namespaces.
Kill the cookie and use a sequentially increasing number. This has the
nice side-effect that the owning user namespace will always have a
namespace id that is smaller than any of its descendant namespaces.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-15-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
The namespace file handle struct nsfs_file_handle is uapi and userspace
is expressly allowed to generate file handles without going through
name_to_handle_at().
Allow userspace to generate a file handle where both the inode number
and the namespace type are zero and just pass in the unique namespace
id. The kernel uses the unified namespace tree to find the namespace and
open the file handle.
When the kernel creates a file handle via name_to_handle_at() it will
always fill in the type and the inode number allowing userspace to
retrieve core information.
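A rough sketch of the userspace side under the scheme described above
(FILEID_NSFS as the handle type and the nsfs_file_handle field names
are assumptions here; error handling elided):
	int mntfd, nsfd;
	struct file_handle *fh;
	struct nsfs_file_handle *nfh;
	fh = malloc(sizeof(*fh) + sizeof(struct nsfs_file_handle));
	fh->handle_bytes = sizeof(struct nsfs_file_handle);
	fh->handle_type = FILEID_NSFS;	/* assumed nsfs handle type */
	nfh = (struct nsfs_file_handle *)fh->f_handle;
	nfh->ns_id = ns_id;	/* unique id, e.g. obtained via listns() */
	nfh->ns_type = 0;	/* zero: resolve purely by namespace id */
	nfh->ns_inum = 0;
	/* Assumption: any nsfs fd serves as the mount reference. */
	mntfd = open("/proc/self/ns/user", O_RDONLY);
	nsfd = open_by_handle_at(mntfd, fh, O_RDONLY);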
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-14-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The namespace tree is, among other things, currently used to support
file handles for namespaces. When a namespace is created it is placed on
the namespace trees and when it is destroyed it is removed from the
namespace trees.
While a namespace is on the namespace trees with a valid reference count
it is possible to reopen it through a namespace file handle. This is all
fine but has some issues that should be addressed.
On current kernels a namespace is visible to userspace in the
following cases:
(1) The namespace is in use by a task.
(2) The namespace is persisted through a VFS object (namespace file
descriptor or bind-mount).
Note that (2) only cares about direct persistence of the namespace
itself, not indirect persistence via, e.g., file->f_cred references or
similar.
(3) The namespace is a hierarchical namespace type and is the parent of
a single or multiple child namespaces.
Case (3) is interesting because it is possible that a parent namespace
fulfills neither (1) nor (2), i.e., it is invisible to userspace, but it
may still be resurrected through the NS_GET_PARENT ioctl().
Namespace file handles currently allow much broader access to
namespaces than what is possible via (1)-(3). The reason is that
namespaces may remain pinned for completely internal reasons yet are
inaccessible to userspace.
For example, a user namespace may remain pinned by get_cred() calls to
stash the opener's credentials into file->f_cred. As it stands, file
handles allow resurrecting such a user namespace even though this
should not be possible via (1)-(3). This is a fundamental uapi change
that we shouldn't make if we don't have to.
Consider the following insane case: Various architectures support the
CONFIG_MMU_LAZY_TLB_REFCOUNT option which uses lazy TLB destruction.
When this option is set a userspace task's struct mm_struct may be used
for kernel threads such as the idle task and will only be destroyed once
the cpu's runqueue switches back to another task. But because of ptrace()
permission checks struct mm_struct stashes the user namespace of the
task that struct mm_struct originally belonged to. The kernel thread
will take a reference on the struct mm_struct and thus pin it.
So on an idle system user namespaces can be persisted for arbitrary
amounts of time, which also means that they can be resurrected using
namespace file handles. That makes no sense whatsoever. The problem is
of course exacerbated on large systems with a huge number of cpus.
To handle this nicely we introduce an active reference count which
tracks (1)-(3). This is easy to do as all of these things are already
managed centrally. Only (1)-(3) will count towards the active reference
count and only namespaces which are active may be opened via namespace
file handles.
The problem is that namespaces may be resurrected, which means that they
can become temporarily inactive and be reactivated some time later.
Currently the only example of this is the SIOCGSKNS socket ioctl. The
SIOCGSKNS ioctl allows opening a network namespace file descriptor based
on a socket file descriptor.
If a socket is tied to a network namespace that subsequently becomes
inactive but that socket is persisted by another process in another
network namespace (e.g., via SCM_RIGHTS or pidfd_getfd()) then the
SIOCGSKNS ioctl will resurrect this network namespace.
So calls to open_related_ns() and open_namespace() will end up
resurrecting the corresponding namespace tree.
Note that the active reference count does not regulate the lifetime of
the namespace itself. This is still done by the normal reference count.
The active reference count can only be elevated if the regular reference
count is elevated.
The active reference count also doesn't regulate the presence of a
namespace on the namespace trees. It only regulates its visibility to
namespace file handles (and in later patches to listns()).
A namespace remains on the namespace trees from creation until its
actual destruction. This will allow the kernel to always reach any
namespace trivially and it will also enable subsystems like bpf to walk
the namespace lists on the system for tracing or general introspection
purposes.
Note that different namespaces have different visibility lifetimes on
current kernels. While most namespaces are immediately released when the
last task using them exits, the user and pid namespaces are persisted
and thus both remain accessible via /proc/<pid>/ns/<ns_type>.
The user namespace lifetime is aligned with struct cred and is only
released through exit_creds(). However, it becomes inaccessible to
userspace once the last task using it is reaped, i.e., when
release_task() is called and all proc entries are flushed. Similarly,
the pid namespace is also visible until the last task using it has been
reaped and the associated pid numbers are freed.
The active reference counts of the user and pid namespaces are
decremented once the task is reaped.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-11-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
The current naming is very misleading as this really isn't exiting all
of the task's namespaces. It is only exiting the namespaces that hang
off of nsproxy. Reflect that in the name.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-10-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add an initializer that can be used for the ns common initialization for
static namespaces such as most init namespaces.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/87ecqhy2y5.ffs@tglx
Signed-off-by: Christian Brauner <brauner@kernel.org>
1. we already expect the refcount to be 1.
2. path creation predicts name == iname.
I verified this straightens out the asm; no functional changes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251029134952.658450-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Matthew Wilcox (Oracle) <willy@infradead.org> says:
It's relatively common in filesystems to want to know the end of the
current folio we're looking at. So common in fact that btrfs has its own
helper for that. Lift that helper to filemap and use it everywhere that
I've noticed it could be used. This actually fixes a long-standing bug
in ocfs2 on 32-bit systems with files larger than 2GiB. Presumably this
is not a common configuration, but I've marked it for backport anyway.
The other filesystems are all fine; none of them have a bug, they're
just mildly inefficient. I think this should all go in via Christian's
tree, ideally with acks from the various fs maintainers (cc'd on their
individual patches).
* patches from https://patch.msgid.link/20251024170822.1427218-1-willy@infradead.org:
mm: Use folio_next_pos()
xfs: Use folio_next_pos()
netfs: Use folio_next_pos()
iomap: Use folio_next_pos()
gfs2: Use folio_next_pos()
f2fs: Use folio_next_pos()
ext4: Use folio_next_pos()
buffer: Use folio_next_pos()
btrfs: Use folio_next_pos()
filemap: Add folio_next_pos()
Link: https://patch.msgid.link/20251024170822.1427218-1-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is one instruction more efficient than open-coding folio_pos() +
folio_size(). It's the equivalent of (x + y) << z rather than
x << z + y << z.
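For reference, a plausible shape of the helper, consistent with the
description above (an illustrative sketch, not copied from the patch):
	static inline loff_t folio_next_pos(struct folio *folio)
	{
		/* One shift: (index + nr_pages) << PAGE_SHIFT,
		 * i.e. (x + y) << z. */
		return ((loff_t)folio->index + folio_nr_pages(folio)) << PAGE_SHIFT;
	}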
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-11-willy@infradead.org
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is one instruction more efficient than open-coding folio_pos() +
folio_size(). It's the equivalent of (x + y) << z rather than
x << z + y << z.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-10-willy@infradead.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is one instruction more efficient than open-coding folio_pos() +
folio_size(). It's the equivalent of (x + y) << z rather than
x << z + y << z.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-9-willy@infradead.org
Acked-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Paulo Alcantara <pc@manguebit.org>
Cc: netfs@lists.linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is one instruction more efficient than open-coding folio_pos() +
folio_size(). It's the equivalent of (x + y) << z rather than
x << z + y << z.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-8-willy@infradead.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is one instruction more efficient than open-coding folio_pos() +
folio_size(). It's the equivalent of (x + y) << z rather than
x << z + y << z.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-7-willy@infradead.org
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: gfs2@lists.linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is one instruction more efficient than open-coding folio_pos() +
folio_size(). It's the equivalent of (x + y) << z rather than
x << z + y << z.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-6-willy@infradead.org
Reviewed-by: Chao Yu <chao@kernel.org>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is one instruction more efficient than open-coding folio_pos() +
folio_size(). It's the equivalent of (x + y) << z rather than
x << z + y << z.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-5-willy@infradead.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is one instruction more efficient than open-coding folio_pos() +
folio_size(). It's the equivalent of (x + y) << z rather than
x << z + y << z.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-4-willy@infradead.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
btrfs defined its own variant of folio_next_pos() called folio_end().
This is an ambiguous name as 'end' might be exclusive or inclusive.
Switch to the new folio_next_pos().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-3-willy@infradead.org
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Replace the open-coded implementation in ocfs2 (which loses the top
32 bits on 32-bit architectures) with a helper in pagemap.h.
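The hazard, sketched (illustrative, not the exact ocfs2 code): pgoff_t
is 32 bits wide on 32-bit architectures, so an open-coded computation
can wrap before it is widened to loff_t:
	loff_t end;
	end = (folio->index + 1) << PAGE_SHIFT;	/* wraps for large files */
	end = folio_next_pos(folio);		/* computed as loff_t */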
Fixes: 35edec1d52 ("ocfs2: update truncate handling of partial clusters")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-2-willy@infradead.org
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: ocfs2-devel@lists.linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
While pidfs dentries are never hashed, and thus retain_dentry() will
never consider placing them on the LRU, it isn't great to always have
to go and remember that. Raise DCACHE_DONTCACHE explicitly as a visual
marker that dentries aren't kept but freed immediately instead.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-4-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
While nsfs dentries are never hashed, and thus retain_dentry() will
never consider placing them on the LRU, it isn't great to always have
to go and remember that. Raise DCACHE_DONTCACHE explicitly as a visual
marker that dentries aren't kept but freed immediately instead.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-3-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently nsfs uses the default inode_generic_drop() fallback which
drops the inode when it's unlinked or when it's unhashed. Since nsfs
never hashes inodes that always amounts to dropping the inode.
But that's just annoying to have to reason through every time we look at
this code. Switch to inode_just_drop() which always drops the inode
explicitly. This also aligns the behavior with pidfs which does the
same.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-2-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
eCryptfs uses MD5 for a couple of unusual purposes: to "mix" the key into
the IVs for file contents encryption (similar to ESSIV), and to prepend
some key-dependent bytes to the plaintext when encrypting filenames
(which is useless since eCryptfs encrypts the filenames with ECB).
Currently, eCryptfs computes these MD5 hashes using the crypto_shash
API. Update it to instead use the MD5 library API. This is simpler and
faster: the library doesn't require memory allocations, can't fail, and
provides direct access to MD5 without overhead such as indirect calls.
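For comparison, a sketch of the library-style call following the
lib/crypto pattern (header name and one-shot helper assumed), given a
data/len buffer:
	#include <crypto/md5.h>
	u8 digest[MD5_DIGEST_SIZE];
	/* One-shot hash: no allocation, no error path, no indirect calls. */
	md5(data, len, digest);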
To preserve the existing behavior of eCryptfs support being disabled
when the kernel is booted with "fips=1", make ecryptfs_get_tree() check
fips_enabled itself. Previously it relied on crypto_alloc_shash("md5")
failing. I don't know for sure that this is actually needed; e.g., it
could be argued that eCryptfs's use of MD5 isn't for a security purpose
as far as FIPS is concerned. But this preserves the existing behavior.
Tested by verifying that an existing eCryptfs can still be mounted with
a kernel that has this commit, with all the files matching. Also tested
creating a filesystem with this commit and mounting+reading it without.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20251011200010.193140-1-ebiggers@kernel.org
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
iomap_dio_zero() uses custom-allocated zeroed memory for padding. This
was a temporary solution until there was a way to request a zero folio
larger than PAGE_SIZE.
Use the largest_zero_folio() function instead of the custom-allocated
zeroed memory. largest_zero_folio() does not guarantee that it will
always return a PMD-sized folio, so adapt the code to also work if it
returns a ZERO_PAGE.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Replace simple_strtol() with the recommended kstrtoint() for parsing the
'ramdisk_start=' boot parameter. Unlike simple_strtol(), which returns a
long, kstrtoint() converts the string directly to an integer and avoids
implicit casting.
Check the return value of kstrtoint() and reject invalid values. This
adds error handling while preserving existing behavior for valid values,
and removes use of the deprecated simple_strtol() helper.
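A sketch of what the converted parser can look like (handler and
variable names assumed):
	static int __init ramdisk_start_setup(char *str)
	{
		if (kstrtoint(str, 0, &rd_image_start))
			return 0;	/* reject invalid values */
		return 1;
	}
	__setup("ramdisk_start=", ramdisk_start_setup);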
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is a follow up to commit c4781dc3d1 ("Kbuild: enable
-fms-extensions") but in a separate change due to being substantially
different from the initial submission.
There are many places within the kernel that use their own CFLAGS
instead of the main KBUILD_CFLAGS, meaning that code written with the
main kernel's use of '-fms-extensions' in mind, if tangentially included
in these areas, will result in "error: declaration does not declare
anything" messages from the compiler.
Add '-fms-extensions' to all these areas to ensure consistency, along
with -Wno-microsoft-anon-tag to silence clang's warning about use of the
extension that the kernel cares about using. parisc does not build with
clang so it does not need this warning flag. LoongArch does not need it
either because -W flags from KBUILD_CFLAGS are pulled into cflags-vdso.
Reported-by: Christian Brauner <brauner@kernel.org>
Closes: https://lore.kernel.org/20251030-meerjungfrau-getrocknet-7b46eacc215d@brauner/
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Userspace needs access to the signal that caused the coredump before the
coredumping process has been reaped. Expose it as part of the coredump
information in struct pidfd_info. After the process has been reaped that
info is also available as part of PIDFD_INFO_EXIT's exit_code field.
Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-8-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is not needed anymore now that we have the new scheme to guarantee
all-or-nothing information exposure.
Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-7-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
There will likely be more info that we need to store in struct
pidfs_attr. We need to make sure that some of the information such as
exit info or coredump info that consists of multiple bits is either
available completely or not at all, but never partially. Currently we
use a pointer that we assign to. That doesn't scale. We can't waste a
pointer for each multi-part information struct we want to expose. Use a
bitmask instead.
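The idea, sketched with hypothetical names: one validity bit per
multi-part blob instead of one pointer per blob:
	struct pidfs_attr {
		u64 valid_mask;	/* e.g. PIDFS_ATTR_EXIT | PIDFS_ATTR_COREDUMP */
		struct pidfd_exit_info exit_info;		/* valid iff bit set */
		struct pidfd_coredump_info coredump_info;	/* valid iff bit set */
	};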
Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-6-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Some of the future fields in struct pidfd_info can be optional. If the
kernel has nothing to emit in that field, then it doesn't set the flag
in the reply. This presents a problem: There is currently no way to know
what mask flags the kernel supports since one can't always count on them
being in the reply.
Add a new PIDFD_INFO_SUPPORTED_MASK flag and field that the kernel can
set in the reply. Userspace can use this to determine if the fields it
requires from the kernel are supported. This also gives us a way to
deprecate fields in the future, if that should become necessary.
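A userspace probe might look like this (the supported_mask field name
is an assumption based on the description above):
	struct pidfd_info info = {
		.mask = PIDFD_INFO_SUPPORTED_MASK,
	};
	if (ioctl(pidfd, PIDFD_GET_INFO, &info) == 0 &&
	    (info.mask & PIDFD_INFO_SUPPORTED_MASK) &&
	    !(info.supported_mask & PIDFD_INFO_COREDUMP))
		fprintf(stderr, "kernel cannot report coredump info\n");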
Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-5-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Validate that the size of struct pidfd_info is correctly updated.
Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-4-ca449b7b7aa0@kernel.org
Fixes: 1d8db6fd69 ("pidfs, coredump: add PIDFD_INFO_COREDUMP")
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When PIDFD_INFO_COREDUMP is requested we raise it unconditionally in the
returned mask even if no coredump actually took place. This was done
because we assumed that the later check whether ->coredump_mask is
non-zero would detect that it is zero and then retrieve the dumpability
settings from the task's mm. This has issues though because there are
tasks that might not have any mm. It's also just not very cleanly
implemented. Fix this.
Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-2-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Building fs/jfs with clang and '-fms-extensions' errors with:
In file included from fs/jfs/jfs_unicode.c:8:
fs/jfs/jfs_incore.h:86:13: error: type name does not allow function specifier to be specified
86 | unchar _inline[128];
| ^
fs/jfs/jfs_incore.h:86:20: error: expected member name or ';' after declaration specifiers
86 | unchar _inline[128];
| ~~~~~~~~~~~~~~^
'-fms-extensions' in clang enables several other Microsoft specific
keywords such as _inline [1], presumably for compatibility with MSVC, as
Microsoft's documentation [2] mentions:
For compatibility with previous versions, _inline and _forceinline are
synonyms for __inline and __forceinline, respectively
Rename the _inline array in 'struct jfs_inode_info' to _inline_sym to
avoid this conflict, which is not a large workaround as this member is
only ever referred to via the i_inline macro.
Link: 249883d0c5/clang/include/clang/Basic/TokenKinds.def (L744-L79) [1]
Link: https://learn.microsoft.com/en-us/cpp/c-language/inline-functions [2]
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Link: https://patch.msgid.link/20251023-jfs-fix-conflict-with-clang-ms-ext-v1-1-e219d59a1e68@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
The logic in wbc_to_tag() is widely used in file systems, so make this
function inline and use it in the file systems directly.
This patch has only passed compilation tests, but it should be fine.
Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christoph Hellwig <hch@lst.de> says:
The relatively low minimal writeback size of 4MiB means that
written back inodes on rotational media are switched a lot. Besides
introducing additional seeks, this also can lead to extreme file
fragmentation on zoned devices when a lot of files are cached relative
to the available writeback bandwidth.
Add a superblock field that allows the file system to override the
default size, and set it to the zone size for zoned XFS.
* patches from https://patch.msgid.link/20251017034611.651385-1-hch@lst.de:
xfs: set s_min_writeback_pages for zoned file systems
writeback: allow the file system to override MIN_WRITEBACK_PAGES
writeback: cleanup writeback_chunk_size
Link: https://patch.msgid.link/20251017034611.651385-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Set s_min_writeback_pages to the zone size, so that writeback always
writes up to a full zone. This ensures that writeback does not add
spurious file fragmentation when writing back a large number of
files that are larger than the zone size.
Fixes: 4e4d520755 ("xfs: add the zoned space allocator")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251017034611.651385-4-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The relatively low minimal writeback size of 4MiB means that written back
inodes on rotational media are switched a lot. Besides introducing
additional seeks, this also can lead to extreme file fragmentation on
zoned devices when a lot of files are cached relative to the available
writeback bandwidth.
Add a superblock field that allows the file system to override the
default size.
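A filesystem would override the default at mount time along these lines
(zone_size_bytes is illustrative):
	/* e.g. in fill_super(), for a zoned device: */
	sb->s_min_writeback_pages = zone_size_bytes >> PAGE_SHIFT;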
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251017034611.651385-3-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Return the pages directly when calculated instead of first assigning
them back to a variable, and directly return for the data integrity /
tagged case instead of going through an else clause.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251017034611.651385-2-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christoph Hellwig <hch@lst.de> says:
While looking at the filemap writeback code, I think adding
filemap_fdatawrite_wbc ended up being a mistake, as all but the original
btrfs caller should be using better high level interfaces instead. This
series removes all these, switches btrfs to a more specific interfaces
and also cleans up another too low-level interface. With this the
writeback_control that is passed to the writeback code is only
initialized in three places, although there are a lot more places in
file system code that never reach the common writeback code.
* patches from https://patch.msgid.link/20251024080431.324236-1-hch@lst.de:
mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
mm: remove __filemap_fdatawrite_range
mm: remove filemap_fdatawrite_wbc
mm: remove __filemap_fdatawrite
mm,btrfs: add a filemap_flush_nr helper
btrfs: push struct writeback_control into start_delalloc_inodes
btrfs: use the local tmp_inode variable in start_delalloc_inodes
ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
Link: https://patch.msgid.link/20251024080431.324236-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Rename filemap_fdatawrite_range_kick to filemap_flush_range because it
is the ranged version of filemap_flush.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-11-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Use filemap_fdatawrite_range and filemap_fdatawrite_range_kick instead
of the low-level __filemap_fdatawrite_range that requires the caller
to know the internals of the writeback_control structure, and remove
__filemap_fdatawrite_range now that it is trivial and only two callers
would be left.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-10-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Replace filemap_fdatawrite_wbc, which exposes a writeback_control to the
callers with a filemap_writeback helper that takes all the possible
arguments and declares the writeback_control itself.
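A plausible shape of the new helper, inferred from the description
(signature and internals assumed, not taken from the patch):
	int filemap_writeback(struct address_space *mapping,
			      enum writeback_sync_modes sync_mode,
			      loff_t start, loff_t end, long nr_to_write)
	{
		struct writeback_control wbc = {
			.sync_mode	= sync_mode,
			.range_start	= start,
			.range_end	= end,
			.nr_to_write	= nr_to_write,
		};
		/* The real helper presumably also handles cgroup writeback
		 * attach/detach around this call. */
		return do_writepages(mapping, &wbc);
	}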
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-9-hch@lst.de
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
And rewrite filemap_fdatawrite to use filemap_fdatawrite_range instead
to have a simpler call chain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-8-hch@lst.de
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Abstract out the btrfs-specific behavior of kicking off I/O on a number
of pages on an address_space into a well-defined helper.
Note: there is no kerneldoc comment for the new function because it is
not part of the public API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-7-hch@lst.de
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In preparation for changing the filemap_fdatawrite_wbc API to not expose
the writeback_control to the callers, push the wbc declaration next to
the filemap_fdatawrite_wbc call and just pass the nr_to_write value to
start_delalloc_inodes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-6-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
start_delalloc_inodes has a struct inode * pointer available in the
main loop, use it instead of re-calculating it from the btrfs inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-5-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Use filemap_fdatawrite_range instead of opencoding the logic using
filemap_fdatawrite_wbc. There is a slight change in the conversion
as nr_to_write is now set to LONG_MAX instead of double the number
of the pages in the range. LONG_MAX is the usual nr_to_write for
WB_SYNC_ALL writeback, and the value expected by lower layers here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-4-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Use filemap_fdatawrite_range instead of opencoding the logic using
filemap_fdatawrite_wbc.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-3-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Use filemap_fdatawrite_range instead of opencoding the logic using
filemap_fdatawrite_wbc.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-2-hch@lst.de
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When a writeback work lasts for sysctl_hung_task_timeout_secs, we want
to identify that there are tasks waiting for a long time; this helps us
pinpoint potential issues.
Additionally, recording the starting jiffies is useful when debugging a
crashed vmcore.
Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Writing back a large number of pages can take a lot of time.
This issue is exacerbated when the underlying device is slow or
subject to block layer rate limiting, which in turn triggers
unexpected hung task warnings.
We can trigger a wake-up once a chunk has been written back and the
waiting time for writeback exceeds half of
sysctl_hung_task_timeout_secs.
This action allows the hung task detector to be aware of the writeback
progress, thereby eliminating these unexpected hung task warnings.
This patch has passed the xfstests 'check -g quick' test based on ext4,
with no additional failures introduced.
Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Mateusz Guzik <mjguzik@gmail.com> says:
Open-coded accesses prevent asserting they are done correctly. One
obvious aspect is locking, but significantly more can be checked. For
example, it can be detected when the code is clearing flags which are
already missing, or is setting flags when it is illegal (e.g., I_FREEING
when ->i_count > 0).
In order to keep things manageable this patchset merely gets the thing
off the ground with only lockdep checks baked in.
Current consumers can be trivially converted.
Suppose flags I_A and I_B are to be handled.
If ->i_lock is held, then:
state = inode->i_state => state = inode_state_read(inode)
inode->i_state |= (I_A | I_B) => inode_state_set(inode, I_A | I_B)
inode->i_state &= ~(I_A | I_B) => inode_state_clear(inode, I_A | I_B)
inode->i_state = I_A | I_B => inode_state_assign(inode, I_A | I_B)
If ->i_lock is not held or only held conditionally:
state = inode->i_state => state = inode_state_read_once(inode)
inode->i_state |= (I_A | I_B) => inode_state_set_raw(inode, I_A | I_B)
inode->i_state &= ~(I_A | I_B) => inode_state_clear_raw(inode, I_A | I_B)
inode->i_state = I_A | I_B => inode_state_assign_raw(inode, I_A | I_B)
The "_once" vs "_raw" discrepancy stems from the read variant differing
by READ_ONCE as opposed to just lockdep checks.
Finally, if you want to atomically clear flags and set new ones, the
following:
state = inode->i_state;
state &= ~I_A;
state |= I_B;
inode->i_state = state;
turns into:
inode_state_replace(inode, I_A, I_B);
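A sketch of what such an accessor can look like with the lockdep check
baked in (illustrative; the actual series additionally wraps the field
in a struct so that plain accesses fail to compile, see below):
	static inline void inode_state_replace(struct inode *inode,
					       unsigned int clear,
					       unsigned int set)
	{
		lockdep_assert_held(&inode->i_lock);
		inode->i_state = (inode->i_state & ~clear) | set;
	}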
* patches from https://lore.kernel.org/20251009075929.1203950-1-mjguzik@gmail.com:
fs: make plain ->i_state access fail to compile
xfs: use the new ->i_state accessors
nilfs2: use the new ->i_state accessors
overlayfs: use the new ->i_state accessors
gfs2: use the new ->i_state accessors
f2fs: use the new ->i_state accessors
smb: use the new ->i_state accessors
ceph: use the new ->i_state accessors
btrfs: use the new ->i_state accessors
Manual conversion to use ->i_state accessors of all places not covered by coccinelle
Coccinelle-based conversion to use ->i_state accessors
fs: provide accessors for ->i_state
fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
fs: move wait_on_inode() from writeback.h to fs.h
Signed-off-by: Christian Brauner <brauner@kernel.org>
... to make sure all accesses are properly validated.
Merely renaming the var to __i_state still lets the compiler make the
following suggestion:
error: 'struct inode' has no member named 'i_state'; did you mean '__i_state'?
Unfortunately some people will add the __'s and call it a day.
In order to make it harder to mess up in this way, hide the field behind a
struct. The resulting error message should be convincing enough to make
people check what to do instead:
error: invalid operands to binary & (have 'struct inode_state_flags' and 'int')
Of course people determined to do a plain access can still do it, but
nothing can be done for that case.
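As a minimal illustration (the member type is an assumption):

struct inode_state_flags {
        unsigned long __i_state;
};

/* in struct inode: struct inode_state_flags i_state;
 *
 * inode->i_state &= ~I_NEW; now fails with:
 * error: invalid operands to binary & (have 'struct inode_state_flags' and 'int')
 */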
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Change generated with coccinelle and fixed up by hand as appropriate.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Change generated with coccinelle and fixed up by hand as appropriate.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Change generated with coccinelle and fixed up by hand as appropriate.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Change generated with coccinelle and fixed up by hand as appropriate.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Change generated with coccinelle and fixed up by hand as appropriate.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Change generated with coccinelle and fixed up by hand as appropriate.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Change generated with coccinelle and fixed up by hand as appropriate.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Change generated with coccinelle and fixed up by hand as appropriate.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The incoming helpers don't ship with _release/_acquire variants, for
the time being anyway.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The only consumer outside of fs/inode.c is gfs2 and it already includes
fs.h in the relevant file.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Suppose two CPUs are racing: one in an inode hash lookup function (say
ilookup5()) and the other in unlock_new_inode().
In principle the latter can clear the I_NEW flag before prior stores
into the inode become visible.
The former can in turn observe that I_NEW is clear and proceed to use the
inode while possibly reading from not-yet-published areas.
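As a hedged sketch of the publish/observe pairing being spelled out
(simplified; the wake-up mechanism shown is illustrative):

/* Publisher, e.g. unlock_new_inode(): */
static void publish_inode_sketch(struct inode *inode)
{
        smp_wmb();      /* order inode setup before clearing I_NEW */
        inode_state_clear_raw(inode, I_NEW);
        wake_up_var(&inode->i_state);
}

/* Observer, e.g. a hash lookup such as ilookup5(): */
static bool inode_ready_sketch(struct inode *inode)
{
        if (inode_state_read_once(inode) & I_NEW)
                return false;
        smp_rmb();      /* pairs with the smp_wmb() above */
        return true;    /* prior stores into the inode are now visible */
}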
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This postpones the writeout to ocfs2_evict_inode(), which I'm told is
fine (tm).
The intent is to retire the I_WILL_FREE flag.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Tinguely <mark.tinguely@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Notably make sure the count is 0 after the return from ->drop_inode(),
provided we are going to drop.
Inspired by suspicious games played by f2fs.
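A hedged sketch of the kind of assertion meant here (the placement in the
real iput()/evict() path is simplified):

static void iput_final_sketch(struct inode *inode)
{
        const struct super_operations *op = inode->i_sb->s_op;
        int drop;

        lockdep_assert_held(&inode->i_lock);
        drop = op->drop_inode ? op->drop_inode(inode)
                              : generic_drop_inode(inode);
        /* if we are going to drop, nobody may have re-grabbed a ref */
        WARN_ON_ONCE(drop && atomic_read(&inode->i_count));
}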
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Also remove the now redundant comment.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
->readpage was deprecated and reads are now on folios.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
->readpage was deprecated and reads are now on folios.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Iterate over all non-uptodate ranges of a folio mapping in a single call
to iomap_readpage_iter() instead of leaving the partial iteration to the
caller.
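A hedged before/after fragment (signatures approximate the pre-change code):

/* Before: the caller loops, one non-uptodate range per call. */
while (iomap_length(iter) > 0) {
        ret = iomap_readpage_iter(iter, ctx, offset);
        if (ret <= 0)
                break;
        offset += ret;
}

/* After: one call consumes every non-uptodate range of the folio. */
ret = iomap_readpage_iter(iter, ctx);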
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
iomap_adjust_read_range() assumes that the position and length passed in
are block-aligned. This is not always the case, however, as shown in the
syzbot-generated case for erofs. This causes too many bytes to be
skipped for uptodate blocks, which results in returning an incorrect
position and length to read in. If all the blocks are uptodate, this
underflows the length and returns a position beyond the folio.
Fix the calculation to also take into account the block offset when
calculating how many bytes can be skipped for uptodate blocks.
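As a small worked sketch of the corrected arithmetic (names are
illustrative, not the iomap code): with a 1024-byte block size and
pos = 512, the first uptodate block can only satisfy 512 more bytes, not
1024; skipping a full block is what over-advanced the position and, when
every block was uptodate, underflowed the length.

static u64 skippable_bytes_sketch(loff_t pos, unsigned int block_size,
                                  unsigned int uptodate_blocks)
{
        unsigned int off = pos & (block_size - 1); /* offset in first block */

        if (!uptodate_blocks)
                return 0;
        /* first block contributes only its tail, the rest are whole */
        return (block_size - off) + (u64)(uptodate_blocks - 1) * block_size;
}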
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Store the iomap_readpage_ctx bio generically as a "void *read_ctx".
This makes the read/readahead interface more generic, which allows it to
be used by filesystems that may not be block-based and may not have
CONFIG_BLOCK set.
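A hedged sketch of the resulting context (the field set approximates the
description, not the exact struct):

struct iomap_readpage_ctx {
        struct folio                    *cur_folio;
        bool                            cur_folio_in_bio;
        void                            *read_ctx;  /* was: struct bio *bio */
        struct readahead_control        *rac;
};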
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Most callers of iomap_iter_advance() do not need the remaining length
returned. Get rid of the extra iomap_length() call that
iomap_iter_advance() does.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Move the read/readahead bio submission logic into a separate helper.
This is needed to make iomap read/readahead more generically usable,
especially for filesystems that do not require CONFIG_BLOCK.
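For the block-based case, such a helper could be as small as the following
sketch (the helper name is an assumption):

static void iomap_read_submit_sketch(struct iomap_readpage_ctx *ctx)
{
        struct bio *bio = ctx->read_ctx;

        if (bio)
                submit_bio(bio);
}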
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Move the iomap_readpage_iter() bio read logic into a separate helper
function, iomap_bio_read_folio_range(). This is needed to make iomap
read/readahead more generically usable, especially for filesystems that
do not require CONFIG_BLOCK.
Additionally rename buffered write's iomap_read_folio_range() function
to iomap_bio_read_folio_range_sync() to better describe its synchronous
behavior.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>