Compare commits

...

339 Commits

Author SHA1 Message Date
Linus Torvalds 4664fb427c vfs-6.19-rc1.minix
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZgAKCRCRxhvAZXjc
 olEcAP4qG313oT/tm4W3nC4g2k8S//KqET97B80pSX0K3DvQEwD+LSCf1Th3RnsV
 EAMHczmCtRlbcFPqYOFVAMS8VxOyVg0=
 =7Ca4
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.minix' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull minix fixes from Christian Brauner:
 "Fix two syzbot corruption bugs in the minix filesystem.

  Syzbot fuzzes filesystems by trying to mount and manipulate
  deliberately corrupted images. This should not lead to BUG_ONs and
  WARN_ONs for easy-to-detect corruptions.

   - Add error handling to minix filesystem for inode corruption
     detection, enabling the filesystem to report such corruptions
     cleanly.

   - Fix a drop_nlink warning in minix_rmdir() triggered by corrupted
     directory link counts.

   - Fix a drop_nlink warning in minix_rename() triggered by corrupted
     inode link counts"
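
A rough sketch of the pattern such fixes take (helper shape and error code here are assumptions, not the series' literal diff):

    /* sketch: refuse to underflow a corrupted on-disk link count */
    if (inode->i_nlink == 0) {
            /* report the corruption cleanly instead of tripping the
             * WARN_ON in drop_nlink() */
            return -EUCLEAN;
    }
    drop_nlink(inode);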

* tag 'vfs-6.19-rc1.minix' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  Fix a drop_nlink warning in minix_rename
  Fix a drop_nlink warning in minix_rmdir
  Add error handling to minix filesystem for inode corruption detection
2025-12-01 15:22:40 -08:00
Linus Torvalds 978d337c2e vfs-6.19-rc1.guards
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZgAKCRCRxhvAZXjc
 opxBAQCjNjr0yTSoaGRM0CJXg79Of3DLIlBdB7TygibTN16WhwEA+VKWoHL5eRjg
 PZlwZD4Ei2ymeQYxi+6owTF8G806tAs=
 =m/Bt
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull superblock lock guard updates from Christian Brauner:
 "This starts the work of introducing guards for superblock related
  locks.

  Introduce super_write_guard for scoped superblock write protection.

  This provides a guard-based alternative to the manual sb_start_write()
  and sb_end_write() pattern, allowing the compiler to automatically
  handle the cleanup"
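
A minimal sketch of the guard-based pattern (the definition shown is an
assumption built on the kernel's cleanup.h guard machinery; only the
guard's name comes from the series, and do_work() is a stand-in):

    /* manual pattern */
    sb_start_write(sb);
    ret = do_work(sb);
    sb_end_write(sb);

    /* guard-based: sb_end_write() runs automatically at scope exit,
     * assuming a definition along the lines of
     * DEFINE_GUARD(super_write, struct super_block *,
     *              sb_start_write(_T), sb_end_write(_T)) */
    guard(super_write)(sb);
    ret = do_work(sb);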

* tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  xfs: use super write guard in xfs_file_ioctl()
  open: use super write guard in do_ftruncate()
  btrfs: use super write guard in relocating_repair_kthread()
  ext4: use super write guard in write_mmp_block()
  btrfs: use super write guard in sb_start_write()
  btrfs: use super write guard btrfs_run_defrag_inode()
  btrfs: use super write guard in btrfs_reclaim_bgs_work()
  fs: add super_write_guard
2025-12-01 14:39:03 -08:00
Linus Torvalds afdf0fb340 vfs-6.19-rc1.fs_header
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZgAKCRCRxhvAZXjc
 oq2EAQD09y/qVU81E7Qg7Cn4n5/3WTlnQjx0aSvhb4p6dFUcFwD+K9uVJNP8x8tA
 xTaPt59nZbEX9BIAwtLChSPa4CZsnwM=
 =XrvE
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.fs_header' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull fs header updates from Christian Brauner:
 "This contains initial work to start splitting up fs.h.

  Begin the long-overdue work of splitting up the monolithic fs.h
  header. The header has grown to over 3000 lines and includes types and
  functions for many different subsystems, making it difficult to
  navigate and causing excessive compilation dependencies.

  This series introduces new focused headers for superblock-related
  code:

   - Rename fs_types.h to fs_dirent.h to better reflect its actual
     content (directory entry types)

   - Add fs/super_types.h containing superblock type definitions

   - Add fs/super.h containing superblock function declarations

  This is the first step in a longer effort to modularize the VFS
  headers.

  Cleanups:

   - Inode Field Layout Optimization (Mateusz Guzik)

     Move inode fields used during fast path lookup closer together to
     improve cache locality during path resolution.

   - current_umask() Optimization (Mateusz Guzik)

     Inline current_umask() and move it to fs_struct.h. This improves
     performance by avoiding function call overhead for this
     frequently-used function, and places it in a more appropriate
     header since it operates on fs_struct"
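
The inlined helper presumably reduces to a direct read of the umask
stored in current->fs, roughly:

    static inline int current_umask(void)
    {
            return current->fs->umask;
    }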

* tag 'vfs-6.19-rc1.fs_header' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: move inode fields used during fast path lookup closer together
  fs: inline current_umask() and move it to fs_struct.h
  fs: add fs/super.h header
  fs: add fs/super_types.h header
  fs: rename fs_types.h to fs_dirent.h
2025-12-01 14:18:01 -08:00
Linus Torvalds 1d18101a64 kernel-6.19-rc1.cred
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc
 orJLAP9UD+dX6cicJDkzFZowDakmoIQkR5ZSDwChSlmvLcmquwEAlSq4svVd9Bdl
 7kOFUk71DqhVHrPAwO7ap0BxehokEAA=
 =Cli6
 -----END PGP SIGNATURE-----

Merge tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull cred guard updates from Christian Brauner:
 "This contains substantial credential infrastructure improvements
  adding guard-based credential management that simplifies code and
  eliminates manual reference counting in many subsystems.

  Features:

   - Kernel Credential Guards

     Add with_kernel_creds() and scoped_with_kernel_creds() guards that
     allow using the kernel credentials without allocating and copying
     them. This was requested by Linus after seeing repeated
     prepare_kernel_cred() calls that duplicate the kernel credentials
     only to drop them again later.

     The new guards completely avoid the allocation and never expose a
     temporary variable holding the kernel credentials anywhere in
     callers.

   - Generic Credential Guards

     Add scoped_with_creds() guards for the common override_creds() and
     revert_creds() pattern. This builds on earlier work that made
     override_creds()/revert_creds() completely reference count free.

   - Prepare Credential Guards

     Add prepare credential guards for the more complex pattern of
     preparing a new set of credentials and overriding the current
     credentials with them:
      - prepare_creds()
      - modify new creds
      - override_creds()
      - revert_creds()
      - put_cred()
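
      As a rough before/after sketch of the override pattern (the usage
      shapes are assumptions from the description above; do_work() is a
      stand-in):

        /* manual pattern */
        const struct cred *old = override_creds(new);
        ret = do_work();
        revert_creds(old);

        /* guard-based: revert_creds() runs automatically at scope exit */
        scoped_with_creds(new) {
                ret = do_work();
        }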

  Cleanups:

   - Make init_cred static since it should not be directly accessed

   - Add kernel_cred() helper to properly access the kernel credentials

   - Fix scoped_class() macro that was introduced two cycles ago

   - coredump: split out do_coredump() from vfs_coredump() for cleaner
     credential handling

   - coredump: move revert_cred() before coredump_cleanup()

   - coredump: mark struct mm_struct as const

   - coredump: pass struct linux_binfmt as const

   - sev-dev: use guard for path"

* tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
  trace: use override credential guard
  trace: use prepare credential guard
  coredump: use override credential guard
  coredump: use prepare credential guard
  coredump: split out do_coredump() from vfs_coredump()
  coredump: mark struct mm_struct as const
  coredump: pass struct linux_binfmt as const
  coredump: move revert_cred() before coredump_cleanup()
  sev-dev: use override credential guards
  sev-dev: use prepare credential guard
  sev-dev: use guard for path
  cred: add prepare credential guard
  net/dns_resolver: use credential guards in dns_query()
  cgroup: use credential guards in cgroup_attach_permissions()
  act: use credential guards in acct_write_process()
  smb: use credential guards in cifs_get_spnego_key()
  nfs: use credential guards in nfs_idmap_get_key()
  nfs: use credential guards in nfs_local_call_write()
  nfs: use credential guards in nfs_local_call_read()
  erofs: use credential guards
  ...
2025-12-01 13:45:41 -08:00
Linus Torvalds f2e74ecfba vfs-6.19-rc1.folio
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc
 onGBAQDtqeO0jZzS7q9UxlJ84Wj/H9w+9INpO4jMxtWK4svhUAEAghG4qVxRvkE2
 Qh+wrpTPIC7OCQ78k8psDRmkj9cn8QA=
 =FCVN
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull folio updates from Christian Brauner:
 "Add a new folio_next_pos() helper function that returns the file
  position of the first byte after the current folio. This is a common
  operation in filesystems when needing to know the end of the current
  folio.

  The helper is lifted from btrfs which already had its own version, and
  is now used across multiple filesystems and subsystems:
   - btrfs
   - buffer
   - ext4
   - f2fs
   - gfs2
   - iomap
   - netfs
   - xfs
   - mm
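
  A sketch of what the helper presumably reduces to (folio_pos() and
  folio_size() are existing helpers; the series' version may differ):

    static inline loff_t folio_next_pos(struct folio *folio)
    {
            return folio_pos(folio) + folio_size(folio);
    }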

  This fixes a long-standing bug in ocfs2 on 32-bit systems with files
  larger than 2GiB. Presumably this is not a common configuration, but
  the fix is backported anyway. The other filesystems did not have bugs,
  they were just mildly inefficient.

  This also introduces uoff_t as the unsigned version of loff_t. A recent
  commit inadvertently changed a comparison from being unsigned (on
  64-bit systems) to being signed (which it had always been on 32-bit
  systems), leading to sporadic fstests failures.

  Generally file sizes are restricted to being a signed integer, but in
  places where -1 is passed to indicate "up to the end of the file", it
  is convenient to have an unsigned type to ensure comparisons are
  always unsigned regardless of architecture"
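
A sketch of the idea (the typedef shown is an assumption; only the name
uoff_t comes from the series, and pos_in_range() is a stand-in):

    typedef u64 uoff_t;     /* assumed: unsigned counterpart of loff_t */

    /* passing -1 as a uoff_t means "up to the end of the file", and any
     * comparison against it is unsigned on every architecture */
    static bool pos_in_range(uoff_t pos, uoff_t end)
    {
            return pos < end;
    }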

* tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: Add uoff_t
  mm: Use folio_next_pos()
  xfs: Use folio_next_pos()
  netfs: Use folio_next_pos()
  iomap: Use folio_next_pos()
  gfs2: Use folio_next_pos()
  f2fs: Use folio_next_pos()
  ext4: Use folio_next_pos()
  buffer: Use folio_next_pos()
  btrfs: Use folio_next_pos()
  filemap: Add folio_next_pos()
2025-12-01 10:26:38 -08:00
Linus Torvalds 212c4053a1 vfs-6.19-rc1.coredump
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc
 oji0AQC5jl35xh04fJKB343InVAxtRFp8mSkJJ9Bx6x7xA7a+QEAiBMxYilUgYIW
 bZMcI5LU+gNO/1y076QkVt84jTUQLww=
 =WIBZ
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull pidfd and coredump updates from Christian Brauner:
 "Features:

   - Expose coredump signal via pidfd

     Expose the signal that caused the coredump through the pidfd
     interface. The recent changes to rework coredump handling to rely
     on unix sockets are in the process of being used in systemd. The
     previous systemd coredump container interface requires the coredump
     file descriptor and basic information including the signal number
     to be sent to the container. This means the signal number needs to
     be available before sending the coredump to the container.

   - Add supported_mask field to pidfd

     Add a new supported_mask field to struct pidfd_info that indicates
     which information fields are supported by the running kernel. This
     allows userspace to detect feature availability without relying on
     error codes or kernel version checks.
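
     A hedged userspace sketch of such a check (PIDFD_GET_INFO and
     PIDFD_INFO_COREDUMP already exist; testing supported_mask is the
     new part, and use_coredump_info() is a stand-in):

       struct pidfd_info info = { .mask = PIDFD_INFO_COREDUMP };

       if (ioctl(pidfd, PIDFD_GET_INFO, &info) == 0 &&
           (info.supported_mask & PIDFD_INFO_COREDUMP))
               /* this kernel can report coredump information */
               use_coredump_info(&info);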

  Cleanups:

   - Drop struct pidfs_exit_info and prepare to drop exit_info pointer,
     simplifying the internal publication mechanism for exit and
     coredump information retrievable via the pidfd ioctl

   - Use guard() for task_lock in pidfs

   - Reduce wait_pidfd lock scope

   - Add missing PIDFD_INFO_SIZE_VER1 constant

   - Add missing BUILD_BUG_ON() assert on struct pidfd_info

  Fixes:

   - Fix PIDFD_INFO_COREDUMP handling

  Selftests:

   - Split out coredump socket tests and common helpers into separate
     files for better organization

   - Fix userspace coredump client detection issues

   - Handle edge-triggered epoll correctly

   - Ignore ENOSPC errors in tests

   - Add debug logging to coredump socket tests, socket protocol tests,
     and test helpers

   - Add tests for PIDFD_INFO_COREDUMP_SIGNAL

   - Add tests for supported_mask field

   - Update pidfd header for selftests"

* tag 'vfs-6.19-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits)
  pidfs: reduce wait_pidfd lock scope
  selftests/coredump: add second PIDFD_INFO_COREDUMP_SIGNAL test
  selftests/coredump: add first PIDFD_INFO_COREDUMP_SIGNAL test
  selftests/coredump: ignore ENOSPC errors
  selftests/coredump: add debug logging to coredump socket protocol tests
  selftests/coredump: add debug logging to coredump socket tests
  selftests/coredump: add debug logging to test helpers
  selftests/coredump: handle edge-triggered epoll correctly
  selftests/coredump: fix userspace coredump client detection
  selftests/coredump: fix userspace client detection
  selftests/coredump: split out coredump socket tests
  selftests/coredump: split out common helpers
  selftests/pidfd: add second supported_mask test
  selftests/pidfd: add first supported_mask test
  selftests/pidfd: update pidfd header
  pidfs: expose coredump signal
  pidfs: drop struct pidfs_exit_info
  pidfs: prepare to drop exit_info pointer
  pidfd: add a new supported_mask field
  pidfs: add missing BUILD_BUG_ON() assert on struct pidfd_info
  ...
2025-12-01 10:17:39 -08:00
Linus Torvalds 415d34b92c namespace-6.19-rc1
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc
 ooKwAP4kR5kMjHlthf8jHmmCjVU3nQFO9hUZsIQL9gFJLOIQMAD+LLoTaq1WJufl
 oSgZpREXZVmI1TK61eR6EZMB1YikGAo=
 =TExi
 -----END PGP SIGNATURE-----

Merge tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull namespace updates from Christian Brauner:
 "This contains substantial namespace infrastructure changes including a new
  system call, active reference counting, and extensive header cleanups.
  The branch depends on the shared kbuild branch for -fms-extensions support.

  Features:

   - listns() system call

     Add a new listns() system call that allows userspace to iterate
     through namespaces in the system. This provides a programmatic
     interface to discover and inspect namespaces, addressing
     longstanding limitations:

     Currently, there is no direct way for userspace to enumerate
     namespaces. Applications must resort to scanning /proc/*/ns/ across
     all processes, which is:
      - Inefficient - requires iterating over all processes
      - Incomplete - misses namespaces not attached to any running
        process but kept alive by file descriptors, bind mounts, or
        parent references
      - Permission-heavy - requires access to /proc for many processes
      - No ordering or ownership information
      - No filtering per namespace type

     The listns() system call solves these problems:

       ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
                      size_t nr_ns_ids, unsigned int flags);

       struct ns_id_req {
             __u32 size;
             __u32 spare;
             __u64 ns_id;
             struct /* listns */ {
                     __u32 ns_type;
                     __u32 spare2;
                     __u64 user_ns_id;
             };
       };
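
      A hedged userspace sketch of paginating through all namespaces
      (the raw syscall(2) invocation and the __NR_listns number are
      assumptions here):

        struct ns_id_req req = { .size = sizeof(req) };
        __u64 ids[64];
        ssize_t n;

        while ((n = syscall(__NR_listns, &req, ids, 64, 0)) > 0) {
                /* consume ids[0..n-1], then resume after the last ID */
                req.ns_id = ids[n - 1];
                if (n < 64)
                        break;
        }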

     Features include:
      - Pagination support for large namespace sets
      - Filtering by namespace type (MNT_NS, NET_NS, USER_NS, etc.)
      - Filtering by owning user namespace
      - Permission checks respecting namespace isolation

   - Active Reference Counting

     Introduce an active reference count that tracks namespace
     visibility to userspace. A namespace is visible in the following
     cases:
      (1) The namespace is in use by a task
      (2) The namespace is persisted through a VFS object (namespace file
          descriptor or bind-mount)
      (3) The namespace is a hierarchical type and is the parent of child
          namespaces

     The active reference count does not regulate lifetime (that's still
     done by the normal reference count) - it only regulates visibility
     to namespace file handles and listns().

     This prevents resurrection of namespaces that are pinned only for
     internal kernel reasons (e.g., user namespaces held by
     file->f_cred, lazy TLB references on idle CPUs, etc.) which should
     not be accessible via (1)-(3).

   - Unified Namespace Tree

     Introduce a unified tree structure for all namespaces with:
      - Fixed IDs assigned to initial namespaces
      - Lookup based solely on inode number
      - Maintained list of owned namespaces per user namespace
      - Simplified rbtree comparison helpers

  Cleanups:

   - Header Reorganization:
      - Move namespace types into separate header (ns_common_types.h)
      - Decouple nstree from ns_common header
      - Move nstree types into separate header
      - Switch to new ns_tree_{node,root} structures with helper functions
      - Use guards for ns_tree_lock

   - Initial Namespace Reference Count Optimization
      - Make all reference counts on initial namespaces a nop to avoid
        pointless cacheline ping-pong for namespaces that can never go
        away
      - Drop custom reference count initialization for initial namespaces
      - Add NS_COMMON_INIT() macro and use it for all namespaces
      - pid: rely on common reference count behavior

   - Miscellaneous Cleanups
      - Rename exit_task_namespaces() to exit_nsproxy_namespaces()
      - Rename is_initial_namespace() and make argument const
      - Use boolean to indicate anonymous mount namespace
      - Simplify owner list iteration in nstree
      - nsfs: raise SB_I_NODEV, SB_I_NOEXEC, and DCACHE_DONTCACHE explicitly
      - nsfs: use inode_just_drop()
      - pidfs: raise DCACHE_DONTCACHE explicitly
      - pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
      - libfs: allow to specify s_d_flags
      - cgroup: add cgroup namespace to tree after owner is set
      - nsproxy: fix free_nsproxy() and simplify create_new_namespaces()

  Fixes:

   - setns(pidfd, ...) race condition

     Fix a subtle race when using pidfds with setns(). When the target
     task exits after prepare_nsset() but before commit_nsset(), the
     namespace's active reference count might have been dropped. If
     setns() then installs the namespaces, it would bump the active
     reference count from zero without taking the required reference on
     the owner namespace, leading to underflow when later decremented.

     The fix resurrects the ownership chain if necessary - if the caller
     succeeded in grabbing passive references, the setns() should
     succeed even if the target task exits or gets reaped.

   - Return EFAULT on put_user() error instead of success

   - Make sure references are dropped outside of RCU lock (some
     namespaces like mount namespace sleep when putting the last
     reference)

   - Don't skip active reference count initialization for network
     namespace

   - Add asserts for active refcount underflow

   - Add asserts for initial namespace reference counts (both passive
     and active)

   - ipc: enable is_ns_init_id() assertions

   - Fix kernel-doc comments for internal nstree functions

   - Selftests
      - 15 active reference count tests
      - 9 listns() functionality tests
      - 7 listns() permission tests
      - 12 inactive namespace resurrection tests
      - 3 threaded active reference count tests
      - commit_creds() active reference tests
      - Pagination and stress tests
      - EFAULT handling test
      - nsid tests fixes"

* tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (103 commits)
  pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
  nstree: fix kernel-doc comments for internal functions
  nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
  selftests/namespaces: fix nsid tests
  ns: drop custom reference count initialization for initial namespaces
  pid: rely on common reference count behavior
  ns: add asserts for initial namespace active reference counts
  ns: add asserts for initial namespace reference counts
  ns: make all reference counts on initial namespace a nop
  ipc: enable is_ns_init_id() assertions
  fs: use boolean to indicate anonymous mount namespace
  ns: rename is_initial_namespace()
  ns: make is_initial_namespace() argument const
  nstree: use guards for ns_tree_lock
  nstree: simplify owner list iteration
  nstree: switch to new structures
  nstree: add helper to operate on struct ns_tree_{node,root}
  nstree: move nstree types into separate header
  nstree: decouple from ns_common header
  ns: move namespace types into separate header
  ...
2025-12-01 09:47:41 -08:00
Linus Torvalds ebaeabfa5a vfs-6.19-rc1.writeback
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc
 or4UAP9FbpFsZd0DpsYnKuv7kFepl291PuR0x2dKmseJ/wcf8AEAzI8FR5wd/fey
 25ZNdExoUojAOj5wVn+jUep3u54jBws=
 =/toi
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull writeback updates from Christian Brauner:
 "Features:

   - Allow file systems to increase the minimum writeback chunk size.

      The relatively low minimum writeback size of 4MiB means that
      writeback switches between inodes a lot on rotational media. Besides
     introducing additional seeks, this also can lead to extreme file
     fragmentation on zoned devices when a lot of files are cached
     relative to the available writeback bandwidth.

     This adds a superblock field that allows the file system to
     override the default size, and sets it to the zone size for zoned
     XFS.
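
      A sketch of the opt-in (the field name comes from the shortlog
      below; the exact conversion is assumed):

        /* in the file system's mount path; zone_size is a stand-in for
         * the device's zone size in bytes */
        sb->s_min_writeback_pages = zone_size >> PAGE_SHIFT;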

   - Add logging for slow writeback when it exceeds
     sysctl_hung_task_timeout_secs. This helps identify tasks waiting
     for a long time and pinpoint potential issues. Recording the
     starting jiffies is also useful when debugging a crashed vmcore.

   - Wake up waiting tasks when finishing the writeback of a chunk

  Cleanups:

   - filemap_* writeback interface cleanups.

     Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
      the original btrfs caller should be using better high-level
     interfaces instead.

     This series removes all these low-level interfaces, switches btrfs
     to a more specific interface, and cleans up other too low-level
     interfaces. With this the writeback_control that is passed to the
     writeback code is only initialized in three places.

   - Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
     filemap_fdatawrite_wbc

   - Add filemap_flush_nr helper for btrfs

   - Push struct writeback_control into start_delalloc_inodes in btrfs

   - Rename filemap_fdatawrite_range_kick to filemap_flush_range

   - Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm

   - Make wbc_to_tag() inline and use it in fs"
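
The helper presumably keeps writeback's long-standing tag selection, now
inline, roughly:

    static inline xa_mark_t wbc_to_tag(struct writeback_control *wbc)
    {
            if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
                    return PAGECACHE_TAG_TOWRITE;
            return PAGECACHE_TAG_DIRTY;
    }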

* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: Make wbc_to_tag() inline and use it in fs.
  xfs: set s_min_writeback_pages for zoned file systems
  writeback: allow the file system to override MIN_WRITEBACK_PAGES
  writeback: cleanup writeback_chunk_size
  mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
  mm: remove __filemap_fdatawrite_range
  mm: remove filemap_fdatawrite_wbc
  mm: remove __filemap_fdatawrite
  mm,btrfs: add a filemap_flush_nr helper
  btrfs: push struct writeback_control into start_delalloc_inodes
  btrfs: use the local tmp_inode variable in start_delalloc_inodes
  ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
  9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
  mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
  writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
  writeback: Wake up waiting tasks when finishing the writeback of a chunk.
2025-12-01 09:20:51 -08:00
Linus Torvalds 9368f0f941 vfs-6.19-rc1.inode
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc
 omMSAP9GLhavxyWQ24Q+49CNWWRQWDY1wTOiUK2BwtIvZ0YEcAD8D1dAiMckL5pC
 RwEAVA5p+y+qi+bZP0KXCBxQddoTIQM=
 =zo/J
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode updates from Christian Brauner:
 "Features:

   - Hide inode->i_state behind accessors. Open-coded accesses prevent
     asserting they are done correctly. One obvious aspect is locking,
     but significantly more can be checked. For example it can be
     detected when the code is clearing flags which are already missing,
     or is setting flags when it is illegal (e.g., I_FREEING when
     ->i_count > 0)

   - Provide accessors for ->i_state, convert all filesystems using
     coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
     overlayfs, nilfs2, xfs), and make plain ->i_state access fail to
     compile

   - Rework I_NEW handling to operate without fences, simplifying the
     code after the accessor infrastructure is in place
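
     As a sketch of what such accessors can assert (the name and the
     checks here are assumptions, not the series' literal code):

       static inline void inode_state_set(struct inode *inode,
                                          unsigned long flags)
       {
               lockdep_assert_held(&inode->i_lock);
               /* e.g. catch raising I_FREEING on a referenced inode */
               WARN_ON_ONCE((flags & I_FREEING) &&
                            atomic_read(&inode->i_count));
               inode->i_state |= flags;
       }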

  Cleanups:

   - Move wait_on_inode() from writeback.h to fs.h

   - Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
     for clarity

   - Cosmetic fixes to LRU handling

   - Push list presence check into inode_io_list_del()

   - Touch up predicts in __d_lookup_rcu()

   - ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage

   - Assert on ->i_count in iput_final()

   - Assert ->i_lock held in __iget()

  Fixes:

   - Add missing fences to I_NEW handling"

* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
  dcache: touch up predicts in __d_lookup_rcu()
  fs: push list presence check into inode_io_list_del()
  fs: cosmetic fixes to lru handling
  fs: rework I_NEW handling to operate without fences
  fs: make plain ->i_state access fail to compile
  xfs: use the new ->i_state accessors
  nilfs2: use the new ->i_state accessors
  overlayfs: use the new ->i_state accessors
  gfs2: use the new ->i_state accessors
  f2fs: use the new ->i_state accessors
  smb: use the new ->i_state accessors
  ceph: use the new ->i_state accessors
  btrfs: use the new ->i_state accessors
  Manual conversion to use ->i_state accessors of all places not covered by coccinelle
  Coccinelle-based conversion to use ->i_state accessors
  fs: provide accessors for ->i_state
  fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
  fs: move wait_on_inode() from writeback.h to fs.h
  fs: add missing fences to I_NEW handling
  ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
  ...
2025-12-01 09:02:34 -08:00
Linus Torvalds b04b2e7a61 vfs-6.19-rc1.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc
 onGCAQDEHKNEuZMhkyd3K5YsJtMzZlW/uXp4+Wddeob+5yQp0wEA09xN4CJNMwhP
 J6Kjaa80hWfrFacqSvyMUwQHHw6mngs=
 =5Mom
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Features:

   - Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE
     permission checks during path lookup and adds the
     IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid
     expensive permission work.

   - Hide dentry_cache behind runtime const machinery.

   - Add German Maglione as virtiofs co-maintainer.

  Cleanups:

   - Tidy up and inline step_into() and walk_component() for improved
     code generation.

   - Re-enable IOCB_NOWAIT writes to files. This refactors file
     timestamp update logic, fixing a layering bypass in btrfs when
     updating timestamps on device files and improving FMODE_NOCMTIME
     handling in the VFS now that nfsd started using it.

   - Path lookup optimizations extracting slowpaths into dedicated
     routines and adding branch prediction hints for mntput_no_expire(),
     fd_install(), lookup_slow(), and various other hot paths.

   - Enable clang's -fms-extensions flag, requiring a JFS rename to
     avoid conflicts.

   - Remove spurious exports in fs/file_attr.c.

   - Stop duplicating union pipe_index declaration. This depends on the
     shared kbuild branch that brings in -fms-extensions support which
     is merged into this branch.

   - Use MD5 library instead of crypto_shash in ecryptfs.

   - Use largest_zero_folio() in iomap_dio_zero().

   - Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and
     initrd code.

   - Various typo fixes.

  Fixes:

   - Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs()
     call with wait == 1 to commit super blocks. The emergency sync path
     never passed this, leaving btrfs data uncommitted during emergency
     sync.

   - Use local kmap in watch_queue's post_one_notification().

   - Add hint prints in sb_set_blocksize() for LBS dependency on THP"

* tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
  MAINTAINERS: add German Maglione as virtiofs co-maintainer
  fs: inline step_into() and walk_component()
  fs: tidy up step_into() & friends before inlining
  orangefs: use inode_update_timestamps directly
  btrfs: fix the comment on btrfs_update_time
  btrfs: use vfs_utimes to update file timestamps
  fs: export vfs_utimes
  fs: lift the FMODE_NOCMTIME check into file_update_time_flags
  fs: refactor file timestamp update logic
  include/linux/fs.h: trivial fix: regualr -> regular
  fs/splice.c: trivial fix: pipes -> pipe's
  fs: mark lookup_slow() as noinline
  fs: add predicts based on nd->depth
  fs: move mntput_no_expire() slowpath into a dedicated routine
  fs: remove spurious exports in fs/file_attr.c
  watch_queue: Use local kmap in post_one_notification()
  fs: touch up predicts in path lookup
  fs: move fd_install() slowpath into a dedicated routine and provide commentary
  fs: hide dentry_cache behind runtime const machinery
  fs: touch predicts in do_dentry_open()
  ...
2025-12-01 08:44:26 -08:00
Linus Torvalds 1885cdbfbb vfs-6.19-rc1.iomap
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc
 ooCXAQCwzX2GS/55QHV6JXBBoNxguuSQ5dCj91ZmTfHzij0xNAEAhKEBw7iMGX72
 c2/x+xYf+Pc6mAfxdus5RLMggqBFPAk=
 =jInB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull iomap updates from Christian Brauner:
 "FUSE iomap Support for Buffered Reads:

    This adds iomap support for FUSE buffered reads and readahead. This
    enables granular uptodate tracking with large folios so only
    non-uptodate portions need to be read. Also fixes a race condition
    with large folios + writeback cache that could cause data corruption
    on partial writes followed by reads.

     - Refactored iomap read/readahead bio logic into helpers
     - Added caller-provided callbacks for read operations
     - Moved buffered IO bio logic into new file
     - FUSE now uses iomap for read_folio and readahead

  Zero Range Folio Batch Support:

    Add folio batch support for iomap_zero_range() to handle dirty
    folios over unwritten mappings. Fix raciness issues where dirty data
    could be lost during zero range operations.

     - filemap_get_folios_tag_range() helper for dirty folio lookup
     - Optional zero range dirty folio processing
     - XFS fills dirty folios on zero range of unwritten mappings
     - Removed old partial EOF zeroing optimization

  DIO Write Completions from Interrupt Context:

    Restore pre-iomap behavior where pure overwrite completions run
    inline rather than being deferred to workqueue. Reduces context
    switches for high-performance workloads like ScyllaDB.

     - Removed unused IOCB_DIO_CALLER_COMP code
     - Error completions always run in user context (fixes zonefs)
     - Reworked REQ_FUA selection logic
     - Inverted IOMAP_DIO_INLINE_COMP to IOMAP_DIO_OFFLOAD_COMP

  Buffered IO Cleanups:

    Some performance and code clarity improvements:

     - Replace manual bitmap scanning with find_next_bit()
     - Simplify read skip logic for writes
     - Optimize pending async writeback accounting
     - Better variable naming
     - Documentation for iomap_finish_folio_write() requirements

  Misaligned Vectors for Zoned XFS:

    Enables sub-block aligned vectors in XFS always-COW mode for zoned
    devices via new IOMAP_DIO_FSBLOCK_ALIGNED flag.

  Bug Fixes:

     - Allocate s_dio_done_wq for async reads (fixes syzbot report after
       error completion changes)
     - Fix iomap_read_end() for already uptodate folios (regression fix)"

* tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (40 commits)
  iomap: allocate s_dio_done_wq for async reads as well
  iomap: fix iomap_read_end() for already uptodate folios
  iomap: invert the polarity of IOMAP_DIO_INLINE_COMP
  iomap: support write completions from interrupt context
  iomap: rework REQ_FUA selection
  iomap: always run error completions in user context
  fs, iomap: remove IOCB_DIO_CALLER_COMP
  iomap: use find_next_bit() for uptodate bitmap scanning
  iomap: use find_next_bit() for dirty bitmap scanning
  iomap: simplify when reads can be skipped for writes
  iomap: simplify ->read_folio_range() error handling for reads
  iomap: optimize pending async writeback accounting
  docs: document iomap writeback's iomap_finish_folio_write() requirement
  iomap: account for unaligned end offsets when truncating read range
  iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted
  xfs: support sub-block aligned vectors in always COW mode
  iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag
  xfs: error tag to force zeroing on debug kernels
  iomap: remove old partial eof zeroing optimization
  xfs: fill dirty folios on zero range of unwritten mappings
  ...
2025-12-01 08:14:00 -08:00
Mateusz Guzik ca0d620b0a
dcache: touch up predicts in __d_lookup_rcu()
The rationale is that if the parent dentry is the same and the length is the
same, then you have to be unlucky for the name to not match.

At the same time the dentry was literally just found on the hash, so you
have to be even more unlucky to determine it is unhashed.

While here add commentary on why d_unhashed() is necessary. It was
already removed once and brought back in:
2e321806b6 ("Revert "vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu"")

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251127131526.4137768-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 10:31:45 +01:00
Stefan Hajnoczi ebf8538979
MAINTAINERS: add German Maglione as virtiofs co-maintainer
German Maglione is a co-maintainer of the virtiofsd userspace device
implementation (https://gitlab.com/virtio-fs/virtiofsd) and is currently
one of the most active virtiofs developers outside the kernel.

I have not worked on virtiofs except to review kernel patches for a few
years now and would like German to take over from me gradually. It is
healthier to have a kernel maintainer who is actively involved. I expect
to remove myself in a few months.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://patch.msgid.link/20251126211548.598469-1-stefanha@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-27 10:00:09 +01:00
Christian Brauner f403e1206b
Merge patch series "fs: tidy up step_into() & friends before inlining"
Cleanup step_into() and walk_component() and inline them both.

* patches from https://patch.msgid.link/20251120003803.2979978-1-mjguzik@gmail.com:
  fs: inline step_into() and walk_component()
  fs: tidy up step_into() & friends before inlining

Link: https://patch.msgid.link/20251120003803.2979978-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26 14:52:07 +01:00
Mateusz Guzik 177fdbae39
fs: inline step_into() and walk_component()
The primary consumer is link_path_walk(), calling walk_component() every
time which in turn calls step_into().

Inlining these saves overhead of 2 function calls per path component,
along with allowing the compiler to do better job optimizing them in place.

step_into() had absolutely atrocious assembly to facilitate the
slowpath. In order to lessen the burden at the callsite, all the hard
work is moved into step_into_slowpath() and instead an inline-able
fastpath is implemented for rcu-walk.

The new fastpath is a stripped down step_into() RCU handling with a
d_managed() check from handle_mounts().

Benchmarked as follows on Sapphire Rapids:
1. the "before" was a kernel with not-yet-merged optimizations (notably
   elision of calls to security_inode_permission() and marking ext4
   inodes as not having acls as applicable)
2. "after" is the same + the prep patch + this patch
3. benchmark consists of issuing 205 calls to access(2) in a loop with
   pathnames lifted out of gcc and the linker building real code, most
   of which have several path components and 118 of which fail with
   -ENOENT.

Result in terms of ops/s:
before:	21619
after:	22536 (+4%)

profile before:
  20.25%  [kernel]                  [k] __d_lookup_rcu
  10.54%  [kernel]                  [k] link_path_walk
  10.22%  [kernel]                  [k] entry_SYSCALL_64
   6.50%  libc.so.6                 [.] __GI___access
   6.35%  [kernel]                  [k] strncpy_from_user
   4.87%  [kernel]                  [k] step_into
   3.68%  [kernel]                  [k] kmem_cache_alloc_noprof
   2.88%  [kernel]                  [k] walk_component
   2.86%  [kernel]                  [k] kmem_cache_free
   2.14%  [kernel]                  [k] set_root
   2.08%  [kernel]                  [k] lookup_fast

after:
  23.38%  [kernel]                  [k] __d_lookup_rcu
  11.27%  [kernel]                  [k] entry_SYSCALL_64
  10.89%  [kernel]                  [k] link_path_walk
   7.00%  libc.so.6                 [.] __GI___access
   6.88%  [kernel]                  [k] strncpy_from_user
   3.50%  [kernel]                  [k] kmem_cache_alloc_noprof
   2.01%  [kernel]                  [k] kmem_cache_free
   2.00%  [kernel]                  [k] set_root
   1.99%  [kernel]                  [k] lookup_fast
   1.81%  [kernel]                  [k] do_syscall_64
   1.69%  [kernel]                  [k] entry_SYSCALL_64_safe_stack

While walk_component() and step_into() of course disappear from the
profile, link_path_walk() gains barely any overhead despite the
inlining thanks to the added fast path, all while completing more walks
per second.

I did not investigate why overhead grew a lot on __d_lookup_rcu().

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251120003803.2979978-2-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26 14:52:02 +01:00
Mateusz Guzik 9d2a6211a7
fs: tidy up step_into() & friends before inlining
Symlink handling is already marked as unlikely and pushing out some of
it into pick_link() reduces register spillage on entry to step_into()
with gcc 14.2.

The compiler needed additional convincing that handle_mounts() is
unlikely to fail.

At the same time neither clang nor gcc could be convinced to tail-call
into pick_link().

While pick_link() takes the address of a stack-based object as an argument
(which definitely prevents the optimization), splitting it into separate
<dentry, mount> tuple did not help. The issue persists even when
compiled without stack protector. As such nothing was done about this
for the time being to not grow the diff.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251120003803.2979978-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26 14:52:02 +01:00
Christian Brauner 1ed45a4ddc
Merge patch series "re-enable IOCB_NOWAIT writes to files v2"
Christoph Hellwig <hch@lst.de> says:

[Fix] the layering bypass in btrfs when updating timestamps on device
files for devices removed from btrfs usage, and FMODE_NOCMTIME handling
in the VFS now that nfsd started using it.  Note that I'm still not sure
that nfsd usage is fully correct for all file systems, as only XFS
explicitly supports FMODE_NOCMTIME, but at least the generic code does
the right thing now.

* patches from https://patch.msgid.link/20251120064859.2911749-1-hch@lst.de:
  orangefs: use inode_update_timestamps directly
  btrfs: fix the comment on btrfs_update_time
  btrfs: use vfs_utimes to update file timestamps
  fs: export vfs_utimes
  fs: lift the FMODE_NOCMTIME check into file_update_time_flags
  fs: refactor file timestamp update logic

Link: https://patch.msgid.link/20251120064859.2911749-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26 14:50:17 +01:00
Christoph Hellwig eff094a58d
orangefs: use inode_update_timestamps directly
Orangefs has no i_version handling and __orangefs_setattr already
explicitly marks the inode dirty.  So instead of using
the flags return value from generic_update_time, just call the
lower level inode_update_timestamps helper directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-7-hch@lst.de
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26 14:50:10 +01:00
Christoph Hellwig f981264ae7
btrfs: fix the comment on btrfs_update_time
Since commit e41f941a23 ("Btrfs: move over to use ->update_time") this
is not a copy of the high-level file_update_time helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-6-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26 14:50:10 +01:00
Christoph Hellwig ded9958704
btrfs: use vfs_utimes to update file timestamps
Btrfs updates the device node timestamps for block device special files
when it stops using the device.

Commit 8f96a5bfa1 ("btrfs: update the bdev time directly when closing")
switched that update from the correct layering to directly call the
low-level helper on the bdev inode.  This is wrong and got fixed in
commit 54fde91f52 ("btrfs: update device path inode time instead of
bd_inode") by updating the file system inode instead of the bdev inode,
but this kept the incorrect bypassing of the VFS interfaces and file
system ->update_time method.  Fix this by using the proper vfs_utimes
interface.

Fixes: 8f96a5bfa1 ("btrfs: update the bdev time directly when closing")
Fixes: 54fde91f52 ("btrfs: update device path inode time instead of bd_inode")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-5-hch@lst.de
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26 14:50:10 +01:00
Christoph Hellwig 0139836652
fs: export vfs_utimes
This will be used to replace an incorrect direct call into
generic_update_time in btrfs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-4-hch@lst.de
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26 14:50:10 +01:00
Christoph Hellwig 7f30e7a423
fs: lift the FMODE_NOCMTIME check into file_update_time_flags
FMODE_NOCMTIME used to be just a hack for the legacy XFS handle-based
"invisible I/O", but commit e5e9b24ab8 ("nfsd: freeze c/mtime updates
with outstanding WRITE_ATTRS delegation") started using it from
generic callers.

I'm not sure other file systems are actually ready for this in general,
so the above commit should get a closer look, but for it to make any
sense, file_update_time needs to respect the flag.

Lift the check from file_modified_flags to file_update_time so that
users of file_update_time inherit the behavior and so that all the
checks are done in one place.

Fixes: e5e9b24ab8 ("nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-3-hch@lst.de
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26 14:50:10 +01:00
Christoph Hellwig 3cd9a42f1b
fs: refactor file timestamp update logic
Currently the two high-level APIs use two helper functions to implement
almost all of the logic.  Refactor the two helpers and the common logic
into a new file_update_time_flags routine that gets the iocb flags or
0 in case of file_update_time passed so that the entire logic is
contained in a single function and can be easily understood and modified.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-2-hch@lst.de
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26 14:50:10 +01:00
Mateusz Guzik 003a660730
fs: push list presence check into inode_io_list_del()
For consistency with sb routines.

ext4 is the only consumer outside of evict(). Damage-controlling it is
outside of the scope of this cleanup.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251103230911.516866-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:34:49 +01:00
Mateusz Guzik 4c6b40877b
fs: cosmetic fixes to lru handling
1. inode_bit_waitqueue() was somehow placed between __inode_add_lru() and
   inode_add_lru(). Move it up
2. assert ->i_lock is held in __inode_add_lru instead of just claiming it is
   needed
3. s/__inode_add_lru/__inode_lru_list_add/ for consistency with itself
   (inode_lru_list_del()) and similar routines for sb and io list
   management
4. push list presence check into inode_lru_list_del(), just like sb and
   io list

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251029131428.654761-2-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:34:49 +01:00
Mateusz Guzik a27628f436
fs: rework I_NEW handling to operate without fences
In the inode hash code grab the state while ->i_lock is held. If found
to be set, synchronize the sleep once more with the lock held.

In the real world the flag is not set most of the time.

Apart from being simpler to reason about, it comes with a minor speed up
as now clearing the flag does not require the smp_mb() fence.

While here rename wait_on_inode() to wait_on_new_inode() to line it up
with __wait_on_freeing_inode().

Christian Brauner <brauner@kernel.org> says:

As per the discussion in [1] I folded in the diff sent in [2].

Link: https://lore.kernel.org/69238e4d.a70a0220.d98e3.006e.GAE@google.com [1]
Link: https://lore.kernel.org/c2kpawomkbvtahjm7y5mposbhckb7wxthi3iqy5yr22ggpucrm@ufvxwy233qxo [2]
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251010221737.1403539-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:32:39 +01:00
Christoph Hellwig 7fd8720dff
iomap: allocate s_dio_done_wq for async reads as well
Since commit 222f2c7c6d14 ("iomap: always run error completions in user
context"), read error completions are deferred to s_dio_done_wq.  This
means the workqueue also needs to be allocated for async reads.

Fixes: 222f2c7c6d14 ("iomap: always run error completions in user context")
Reported-by: syzbot+a2b9a4ed0d61b1efb3f5@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251124140013.902853-1-hch@lst.de
Tested-by: syzbot+a2b9a4ed0d61b1efb3f5@syzkaller.appspotmail.com
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:19 +01:00
Joanne Koong d7ff85d4b8
iomap: fix iomap_read_end() for already uptodate folios
There are some cases where when iomap_read_end() is called, the folio
may already have been marked uptodate. For example, if the iomap block
needed zeroing, then the folio may have been marked uptodate after the
zeroing.

iomap_read_end() should unlock the folio instead of calling
folio_end_read(), which is how these cases were handled prior to commit
f8eaf79406 ("iomap: simplify ->read_folio_range() error handling for
reads"). Calling folio_end_read() on an uptodate folio leads to buggy
behavior where marking an already uptodate folio as uptodate will XOR it
to be marked nonuptodate.
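
So the fix presumably restores a shape along these lines (a sketch, not
the literal diff; "ok" stands in for the read's success flag):

    if (folio_test_uptodate(folio))
            folio_unlock(folio);       /* already uptodate: just unlock */
    else
            folio_end_read(folio, ok); /* sets uptodate on success, unlocks */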

Fixes: f8eaf79406 ("iomap: simplify ->read_folio_range() error handling for reads")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251118211111.1027272-2-joannelkoong@gmail.com
Tested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:19 +01:00
Christian Brauner 5ec58e6acd
Merge patch series "enable iomap dio write completions from interrupt context v2"
Christoph Hellwig <hch@lst.de> says:

Currently iomap defers all write completions to interrupt context.  This
was based on my assumption that no one cares about the latency of those
to simplify the code vs the old direct-io.c.  It turns out someone cared,
as Avi reported a lot of context switches with ScyllaDB, which at least
in older kernels with workqueue scheduling issues caused really high
tail latencies.

Fortunately allowing the direct completions is pretty easy with all the
other iomap changes we had since.

While doing this I've also found dead code which gets removed (patch 1)
and an incorrect assumption in zonefs that read completions are called
in user context, which it assumes for its error handling.  Fix this by
always calling error completions from user context (patch 2).
Against the vfs-6.19.iomap branch.

* patches from https://patch.msgid.link/20251113170633.1453259-1-hch@lst.de:
  iomap: invert the polarity of IOMAP_DIO_INLINE_COMP
  iomap: support write completions from interrupt context
  iomap: rework REQ_FUA selection
  iomap: always run error completions in user context
  fs, iomap: remove IOCB_DIO_CALLER_COMP

Link: https://patch.msgid.link/20251113170633.1453259-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:19 +01:00
Christoph Hellwig 76192a42c2
iomap: invert the polarity of IOMAP_DIO_INLINE_COMP
Replace IOMAP_DIO_INLINE_COMP with a flag to indicate that the
completion should be offloaded.  This removes a tiny bit of boilerplate
code, but more importantly just makes the code easier to follow as this
new flag gets set most of the time and only cleared in one place, while
it was the inverse for the old version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-6-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:19 +01:00
Christoph Hellwig eca9dc2089
iomap: support write completions from interrupt context
Completions for pure overwrites don't need to be deferred to a workqueue
as there is no work to be done, or at least no work that needs a user
context.  Set IOMAP_DIO_INLINE_COMP by default for writes like we
already do for reads, and then clear it for all the cases that actually
do need a user context for completions to update the inode size or
record updates to the logical to physical mapping.

I've audited all users of the ->end_io callback, and they only require
user context for I/O that involves unwritten extents, COW, size
extensions, or error handling and all those are still run from workqueue
context.

This restores the behavior of the old pre-iomap direct I/O code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-5-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:19 +01:00
Christoph Hellwig 29086a31b3
iomap: rework REQ_FUA selection
The way iomap_dio_can_use_fua() and its caller are structured is
a bit confusing, as the main guarding condition is hidden in the
helper, and the secondary conditions are split between caller and
callee.

Refactor the code, so that iomap_dio_bio_iter itself tracks if a write
might need metadata updates based on the iomap type and flags, and
then have a condition based on that to use the FUA flag.

Note that this also moves the REQ_OP_WRITE assignment to the end of
the branch to improve readability a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-4-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:18 +01:00
Christoph Hellwig ddb4873286
iomap: always run error completions in user context
At least zonefs expects error completions to be able to sleep.  Because
error completions aren't performance critical, just defer them to workqueue
context unconditionally.

Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-3-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:18 +01:00
Christoph Hellwig f9f8514999
fs, iomap: remove IOCB_DIO_CALLER_COMP
This was added by commit 099ada2c87 ("io_uring/rw: add write support
for IOCB_DIO_CALLER_COMP") and disabled a little later by commit
838b35bb6a ("io_uring/rw: disable IOCB_DIO_CALLER_COMP") because it
didn't work.  Remove all the related code that sat unused for 2 years.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-2-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:18 +01:00
Christian Brauner f53d302ee8
Merge patch series "iomap: buffered io changes"
This series contains several fixes and cleanups:

* Renaming bytes_pending/bytes_accounted to
  bytes_submitted/bytes_not_submitted for improved code clarity

* Accounting for unaligned end offsets when truncating read ranges

* Adding documentation for iomap_finish_folio_write() requirements

* Optimizing pending async writeback accounting logic

* Simplifying error handling in ->read_folio_range() for read operations

* Streamlining logic for skipping reads during write operations

* Replacing manual bitmap scanning with find_next_bit() for both dirty
  and uptodate bitmaps, improving performance

* patches from https://patch.msgid.link/20251111193658.3495942-1-joannelkoong@gmail.com:
  iomap: use find_next_bit() for uptodate bitmap scanning
  iomap: use find_next_bit() for dirty bitmap scanning
  iomap: simplify when reads can be skipped for writes
  iomap: simplify ->read_folio_range() error handling for reads
  iomap: optimize pending async writeback accounting
  docs: document iomap writeback's iomap_finish_folio_write() requirement
  iomap: account for unaligned end offsets when truncating read range
  iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted

Link: https://patch.msgid.link/20251111193658.3495942-1-joannelkoong@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:10 +01:00
Joanne Koong b56c1c54f2
iomap: use find_next_bit() for uptodate bitmap scanning
Use find_next_bit()/find_next_zero_bit() for iomap uptodate bitmap
scanning. This uses __ffs() internally and is more efficient for
finding the next uptodate or non-uptodate bit than iterating through
the bitmap range testing every bit.
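
Schematically (using find_next_bit()'s existing signature; not the
series' literal diff):

    /* before: test every bit in [start, end) */
    for (i = start; i < end; i++)
            if (test_bit(i, bitmap))
                    break;

    /* after: jump straight to the next set bit */
    i = find_next_bit(bitmap, end, start);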

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-10-joannelkoong@gmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:10 +01:00
Joanne Koong fed9c62d28
iomap: use find_next_bit() for dirty bitmap scanning
Use find_next_bit()/find_next_zero_bit() for iomap dirty bitmap
scanning. This uses __ffs() internally and is more efficient for
finding the next dirty or clean bit than iterating through the bitmap
range testing every bit.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-9-joannelkoong@gmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:10 +01:00
Askar Safin 54ca9e913e
include/linux/fs.h: trivial fix: regualr -> regular
Trivial fix.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
Link: https://patch.msgid.link/20251120195140.571608-1-safinaskar@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:13:09 +01:00
Askar Safin bef0202fb7
fs/splice.c: trivial fix: pipes -> pipe's
Trivial fix.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
Link: https://patch.msgid.link/20251120211316.706725-1-safinaskar@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:11:16 +01:00
Matthew Wilcox (Oracle) 37d369fa97
fs: Add uoff_t
In a recent commit, I inadvertently changed a comparison from being an
unsigned comparison (on 64-bit systems) to being a signed comparison
(which it had always been on 32-bit systems).  This led to a sporadic
fstests failure.

To make sure this comparison is always unsigned, introduce a new type,
uoff_t, which is the unsigned version of loff_t.  Generally file sizes
are restricted to being a signed integer, but in these two places it is
convenient to pass -1 to indicate "up to the end of the file".
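
A minimal sketch of the idea (the exact definition may differ):

typedef unsigned long long uoff_t;	/* unsigned twin of loff_t */

/* comparisons stay unsigned on 32-bit and 64-bit alike, and
 * (uoff_t)-1 can still be passed to mean "up to the end of the file" */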

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251123220518.1447261-1-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:07:42 +01:00
Mateusz Guzik 8d79ec9e7f
fs: mark lookup_slow() as noinline
Otherwise it gets inlined, notably into walk_component(), which convinces
the compiler to push/pop additional registers in the fast path to
accommodate the existence of the inlined version.

Shortens the fast path of that routine from 87 to 71 bytes.
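
The change itself is just the attribute; sketched:

static noinline struct dentry *lookup_slow(const struct qstr *name,
					   struct dentry *dir,
					   unsigned int flags)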

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251119144930.2911698-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:04:38 +01:00
Mateusz Guzik 7c179096e7
fs: add predicts based on nd->depth
Stats on nd->depth usage during the venerable kernel build were collected like so:
bpftrace -e 'kprobe:terminate_walk,kprobe:walk_component,kprobe:legitimize_links
{ @[probe] = lhist(((struct nameidata *)arg0)->depth, 0, 8, 1); }'

@[kprobe:legitimize_links]:
[0, 1)           6554906 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2)              3534 |                                                    |

@[kprobe:terminate_walk]:
[0, 1)          12153664 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@[kprobe:walk_component]:
[0, 1)          53075749 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2)            971421 |                                                    |
[2, 3)             84946 |                                                    |

Additionally a custom probe was added for depth within link_path_walk():
bpftrace -e 'kprobe:link_path_walk_probe { @[probe] = lhist(arg0, 0, 8, 1); }'
@[kprobe:link_path_walk_probe]:
[0, 1)           7528231 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2)            407905 |@@                                                  |

Given these results:
1. terminate_walk() is called towards the end of the lookup and in this
   test it never had any links to clean up.
2. legitimize_links() is also called towards the end of lookup and most
   of the time there is 0 depth. Patch consumers to avoid calling into it
   in that case.
3. walk_component() is typically called with WALK_MORE and zero depth,
   checked in that order. Check depth first and predict it is 0 (see the
   sketch after this list).
4. link_path_walk() also does not deal with a symlink most of the time
   when !*name
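
The shape of such a predict, sketched (illustrative, not the exact hunk):

if (unlikely(nd->depth) && !(flags & WALK_MORE))
	put_link(nd);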

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251119142954.2909394-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:04:01 +01:00
Mateusz Guzik bfef6e1f34
fs: move mntput_no_expire() slowpath into a dedicated routine
In the stock variant the compiler spills several registers on the stack
and employs stack smashing protection, adding even more code + a branch
on exit.

The actual fast path is small enough that the compiler inlines it for
all callers -- the symbol is no longer emitted.

Forcing noinline on it just for code-measurement purposes shows the fast
path dropping from 111 to 39 bytes.
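
The generic shape of such a split (names and predicate hypothetical):

static noinline void mntput_no_expire_slow(struct mount *mnt);

static void mntput_no_expire(struct mount *mnt)
{
	if (likely(try_fast_drop(mnt)))	/* hypothetical fast drop */
		return;
	mntput_no_expire_slow(mnt);
}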

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251114201803.2183505-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 14:49:28 +01:00
Christoph Hellwig 6d228c181e
fs: remove spurious exports in fs/file_attr.c
Commit 2f952c9e8f ("fs: split fileattr related helpers into separate
file") added various exports without users despite claiming to be a
simple refactor.  Drop them again.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251119101415.2732320-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 12:17:31 +01:00
Davidlohr Bueso c29383a874
watch_queue: Use local kmap in post_one_notification()
Replace the now deprecated kmap_atomic() with kmap_local_page().

Optimize for the non-highmem cases and avoid disabling preemption and
pagefaults; the caller's context is atomic anyway, but that is irrelevant
to kmap. The memcpy itself does not require any such semantics and the
mapping would stay valid across context switches anyway. Further, highmem
is planned to be removed[1].

[1] https://lore.kernel.org/all/4ff89b72-03ff-4447-9d21-dd6a5fe1550f@app.fastmail.com/
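
The replacement pattern, sketched (offset and length hypothetical):

void *p = kmap_local_page(page);	/* no preempt/pagefault side effects */
memcpy(p + offset, data, len);
kunmap_local(p);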

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Link: https://patch.msgid.link/20251118210706.1816303-1-dave@stgolabs.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 12:17:28 +01:00
Christian Brauner a71e4f103a
pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
We have reworked namespaces sufficiently that all this special-casing
shouldn't be needed anymore.

Link: https://patch.msgid.link/20251117-eidesstattlich-apotheke-36d2e644079f@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-17 16:23:13 +01:00
Kriish Sharma cc7d6c65b8
nstree: fix kernel-doc comments for internal functions
Documentation build reported:

  Warning: kernel/nstree.c:325 function parameter 'ns_tree' not described in '__ns_tree_adjoined_rcu'
  Warning: kernel/nstree.c:325 expecting prototype for ns_tree_adjoined_rcu(). Prototype was for __ns_tree_adjoined_rcu() instead
  Warning: kernel/nstree.c:353 expecting prototype for ns_tree_gen_id(). Prototype was for __ns_tree_gen_id() instead

The kernel-doc comments for `__ns_tree_adjoined_rcu()` and
`__ns_tree_gen_id()` had mismatched function names and a missing
parameter description. This patch updates the function names in the
kernel-doc headers and adds the missing `@ns_tree` parameter description
for `__ns_tree_adjoined_rcu()`.

Fixes: 885fc8ac0a ("nstree: make iterator generic")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511061542.0LO7xKs8-lkp@intel.com
Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com>
Link: https://patch.msgid.link/20251111112533.2254432-1-kriish.sharma2006@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:10:38 +01:00
Christian Brauner cefd55bd21
nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
Make it possible to handle NULL being passed to the reference count
helpers instead of forcing the caller to handle this. Afterwards we can
nicely allow a cleanup guard to handle nsproxy freeing.

Active reference count handling is not done in nsproxy_free() but rather
in free_nsproxy() as nsproxy_free() is also called from setns() failure
paths where a new nsproxy has been prepared but has not been marked as
active via switch_task_namespaces().

Link: https://lore.kernel.org/690bfb9e.050a0220.2e3c35.0013.GAE@google.com
Link: https://patch.msgid.link/20251111-sakralbau-guthaben-7dcc277d337f@brauner
Fixes: 3c9820d5c64a ("ns: add active reference count")
Reported-by: syzbot+0b2e79f91ff6579bfa5b@syzkaller.appspotmail.com
Reported-by: syzbot+0a8655a80e189278487e@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:10:38 +01:00
Mateusz Guzik 030e86dfda
fs: touch up predicts in path lookup
Rationale:
- ND_ROOT_PRESET is only set in a condition already marked unlikely
- LOOKUP_IS_SCOPED already has unlikely on it, but inconsistently
  applied
- set_root() only fails if there is a bug
- most names are not empty (see !*s)
- most of the time path_init() does not encounter LOOKUP_CACHED without
  LOOKUP_RCU
- LOOKUP_IN_ROOT is a rarely seen flag

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251105150630.756606-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-13 14:22:25 +01:00
Mateusz Guzik 9eda581bfe
fs: move fd_install() slowpath into a dedicated routine and provide commentary
On stock kernel gcc 14 emits avoidable register spillage:
	endbr64
	call   ffffffff81374630 <__fentry__>
	push   %r13
	push   %r12
	push   %rbx
	sub    $0x8,%rsp
	[snip]

Total fast path is 99 bytes.

Moving the slowpath out avoids it and shortens the fast path to 74
bytes.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251110095634.1433061-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 12:19:09 +01:00
Mateusz Guzik 21b561dab1
fs: hide dentry_cache behind runtime const machinery
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251105153622.758836-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 12:19:09 +01:00
Mateusz Guzik e41c1f4291
fs: touch predicts in do_dentry_open()
Helps out some of the asm; the routine is still a mess.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251109125254.1288882-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 12:19:09 +01:00
Baokun Li 50b2a4f19b
bdev: add hint prints in sb_set_blocksize() for LBS dependency on THP
Support for block sizes greater than the page size depends on large
folios, which in turn require CONFIG_TRANSPARENT_HUGEPAGE to be enabled.

Because the code is wrapped in multiple layers of abstraction, this
dependency is rather obscure, so users may not realize it and may be
unsure how to enable LBS.

As suggested by Theodore, I have added hint messages in sb_set_blocksize
so that users can distinguish whether a mount failure with block size
larger than page size is due to lack of filesystem support or the absence
of CONFIG_TRANSPARENT_HUGEPAGE.

Suggested-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20251110043226.GD2988753@mit.edu
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20251110124714.1329978-1-libaokun@huaweicloud.com
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 12:19:09 +01:00
Christian Brauner 04f0955b60
Merge patch series "cheaper MAY_EXEC handling for path lookup"
Mateusz Guzik <mjguzik@gmail.com> says:

In short, MAY_WRITE checks are elided.

This obsoletes the idea of pre-computing whether perm checks are necessary
as that turned out to be too hairy. The new code has 2 more branches per
path component compared to that idea, but the perf difference for
typical paths (< 6 components) was basically within noise. To be
revisited if someone(tm) removes other slowdowns.

Instead of the pre-computing thing I added IOP_FASTPERM_MAY_EXEC so that
filesystems like btrfs can still avoid the hard work.

* patches from https://patch.msgid.link/20251107142149.989998-1-mjguzik@gmail.com:
  fs: retire now stale MAY_WRITE predicts in inode_permission()
  btrfs: utilize IOP_FASTPERM_MAY_EXEC
  fs: speed up path lookup with cheaper handling of MAY_EXEC

Link: https://patch.msgid.link/20251107142149.989998-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 12:19:08 +01:00
Mateusz Guzik a0a28c4e41
fs: retire now stale MAY_WRITE predicts in inode_permission()
The primary non-MAY_WRITE consumer now uses lookup_inode_permission_may_exec().

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251107142149.989998-4-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 12:19:08 +01:00
Mateusz Guzik 3e18f6256e
btrfs: utilize IOP_FASTPERM_MAY_EXEC
Root filesystem was ext4, btrfs was mounted on /testfs.

Then issuing access(2) in a loop on /testfs/repos/linux/include/linux/fs.h
on Sapphire Rapids (ops/s):

before: 3447976
after:	3620879 (+5%)

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251107142149.989998-3-mjguzik@gmail.com
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 12:19:08 +01:00
Mateusz Guzik e631df89cd
fs: speed up path lookup with cheaper handling of MAY_EXEC
The generic inode_permission() routine does work which is known to be of
no significance for lookup. There are checks for MAY_WRITE, while the
requested permission is MAY_EXEC. Additionally devcgroup_inode_permission()
is called to check for devices, but it is an invariant that the inode is
a directory.

Absent a ->permission func, execution lands in generic_permission()
which checks upfront if the requested permission is granted for
everyone.

We can elide the branches which are guaranteed to be false and cut
straight to the check if everyone happens to be allowed MAY_EXEC on the
inode (which holds true most of the time).

Moreover, filesystems which provide their own ->permission routine can
take advantage of the optimization by setting the IOP_FASTPERM_MAY_EXEC
flag on their inodes, which they can legitimately do if their MAY_EXEC
handling matches generic_permission().
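
Roughly, the shortcut amounts to this (simplified sketch; the real check
must also agree with generic_permission()'s ACL handling):

/* everyone may exec iff all three classes carry the x bit */
if ((inode->i_mode & S_IXUGO) == S_IXUGO)
	return 0;	/* MAY_EXEC granted, skip the heavy machinery */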

As a simple benchmark, as part of compilation gcc issues access(2) on
numerous long paths, for example /usr/lib/gcc/x86_64-linux-gnu/12/crtendS.o

Issuing access(2) on it in a loop on ext4 on Sapphire Rapids (ops/s):
before: 3797556
after:  3987789 (+5%)

Note: this depends on the not-yet-landed ext4 patch to mark inodes with
cache_no_acl()

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251107142149.989998-2-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 12:19:08 +01:00
Rasmus Villemoes 854e8df2ce
fs/pipe: stop duplicating union pipe_index declaration
Now that we build with -fms-extensions, union pipe_index can be
included as an anonymous member in struct pipe_inode_info, avoiding
the duplication.
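
Sketched (field types abbreviated):

union pipe_index {
	unsigned long head_tail;
	struct {
		unsigned int head;
		unsigned int tail;
	};
};

struct pipe_inode_info {
	union pipe_index;	/* anonymous member, needs -fms-extensions */
	/* ... */
};

/* pipe->head and pipe->tail now resolve through the anonymous union */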

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Link: https://patch.msgid.link/20251023082142.2104456-1-linux@rasmusvillemoes.dk
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 12:18:56 +01:00
Joanne Koong a298febc47
iomap: simplify when reads can be skipped for writes
Currently, the logic for skipping the read range for a write is

if (!(iter->flags & IOMAP_UNSHARE) &&
    (from <= poff || from >= poff + plen) &&
    (to <= poff || to >= poff + plen))

which breaks down to skipping the read if any of these are true:
a) from <= poff && to <= poff
b) from <= poff && to >= poff + plen
c) from >= poff + plen && to <= poff
d) from >= poff + plen && to >= poff + plen

This can be simplified to
if (!(iter->flags & IOMAP_UNSHARE) && from <= poff && to >= poff + plen)

from the following reasoning:

a) from <= poff && to <= poff
This reduces to 'to <= poff' since it is guaranteed that 'from <= to'
(since to = from + len). It is not possible for 'to <= poff' to be true
here because we only reach here if plen > 0 (thanks to the preceding 'if
(plen == 0)' check that would break us out of the loop). If 'to <=
poff', plen would have to be 0 since poff and plen get adjusted in
lockstep for uptodate blocks. This means we can eliminate this check.

c) from >= poff + plen && to <= poff
This is not possible since 'from <= to' and 'plen > 0'. We can eliminate
this check.

d) from >= poff + plen && to >= poff + plen
This reduces to 'from >= poff + plen' since 'from <= to'.
It is not possible for 'from >= poff + plen' to be true here. We only
reach here if plen > 0 and for writes, poff and plen will always be
block-aligned, which means poff <= from < poff + plen. We can eliminate
this check.

The only valid check is b) from <= poff && to >= poff + plen.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-7-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 10:50:32 +01:00
Joanne Koong f8eaf79406
iomap: simplify ->read_folio_range() error handling for reads
Instead of requiring that the caller calls iomap_finish_folio_read()
even if the ->read_folio_range() callback returns an error, account for
this internally in iomap instead, which makes the interface simpler and
makes it match writeback's ->read_folio_range() error handling
expectations.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-6-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 10:50:32 +01:00
Joanne Koong 6b1fd2281f
iomap: optimize pending async writeback accounting
Pending writebacks must be accounted for to determine when all requests
have completed and writeback on the folio should be ended. Currently
this is done by atomically incrementing ifs->write_bytes_pending for
every range to be written back.

Instead, the number of atomic operations can be minimized by setting
ifs->write_bytes_pending to the folio size, internally tracking how many
bytes are written back asynchronously, and then after sending off all
the requests, decrementing ifs->write_bytes_pending by the number of
bytes not written back asynchronously. Now, for N ranges written back,
only N + 2 atomic operations are required instead of 2N + 2.
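
Condensed, the new scheme looks like this (loop elided):

atomic_set(&ifs->write_bytes_pending, folio_size(folio));	/* 1 op */

size_t submitted = 0;
/* for each of the N ranges handed off: submitted += range_len;
 * no atomic here -- each range's completion does one atomic_sub() (N ops) */

atomic_sub(folio_size(folio) - submitted,
	   &ifs->write_bytes_pending);				/* 1 op */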

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-5-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 10:50:32 +01:00
Joanne Koong 7e6cea5ae2
docs: document iomap writeback's iomap_finish_folio_write() requirement
Document that iomap_finish_folio_write() must be called after writeback
on the range completes.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-4-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 10:50:32 +01:00
Joanne Koong 9d875e0eef
iomap: account for unaligned end offsets when truncating read range
The end position to start truncating from may be at an offset into a
block, which under the current logic would result in overtruncation.

Adjust the calculation to account for unaligned end offsets.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-3-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 10:50:31 +01:00
Joanne Koong a0f1cabe29
iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted
The naming "bytes_pending" and "bytes_accounted" may be confusing and
could be better named. Rename this to "bytes_submitted" and
"bytes_not_submitted" to make it more clear that these are bytes we
passed to the IO helper to read in.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-2-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 10:50:31 +01:00
Mateusz Guzik dca3aa666f
fs: move inode fields used during fast path lookup closer together
This should avoid *some* cache misses.

Successful path lookup is guaranteed to load at least ->i_mode,
->i_opflags and ->i_acl. At the same time the common case will avoid
looking at more fields.

struct inode is not guaranteed to have any particular alignment; notably
ext4 has it aligned only to 8 bytes, meaning nearby fields might happen
to be on the same or only adjacent cache lines depending on luck (or no
luck).

According to pahole:
        umode_t                    i_mode;               /*     0     2 */
        short unsigned int         i_opflags;            /*     2     2 */
        kuid_t                     i_uid;                /*     4     4 */
        kgid_t                     i_gid;                /*     8     4 */
        unsigned int               i_flags;              /*    12     4 */
        struct posix_acl *         i_acl;                /*    16     8 */
        struct posix_acl *         i_default_acl;        /*    24     8 */

->i_acl is unnecessarily separated by 8 bytes from the other fields.
With struct inode being offset 48 bytes into the cacheline this means an
avoidable miss. Note it will still be there for the 56 byte case.

New layout:
        umode_t                    i_mode;               /*     0     2 */
        short unsigned int         i_opflags;            /*     2     2 */
        unsigned int               i_flags;              /*     4     4 */
        struct posix_acl *         i_acl;                /*     8     8 */
        struct posix_acl *         i_default_acl;        /*    16     8 */
        kuid_t                     i_uid;                /*    24     4 */
        kgid_t                     i_gid;                /*    28     4 */

I verified with pahole there are no size or hole changes.

This is a stopgap until someone(tm) sanitizes the layout in the first
place, allocation methods aside.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251109121931.1285366-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:49:54 +01:00
Christian Brauner 18b5c40048
Merge patch series "ns: header cleanups and initial namespace reference count improvements"
Christian Brauner <brauner@kernel.org> says:

Clean up the namespace headers by splitting them into types and helpers.
Better separate common namespace types and functions from namespace tree
types and functions.

Fix the reference counts of initial namespaces so we don't do any
pointless cacheline ping-pong for them when we know they can never go
away. Add a bunch of asserts for both the passive and active reference
counts to catch any changes that would break it.

* patches from https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-0-e8a9264e0fb9@kernel.org:
  selftests/namespaces: fix nsid tests
  ns: drop custom reference count initialization for initial namespaces
  pid: rely on common reference count behavior
  ns: add asserts for initial namespace active reference counts
  ns: add asserts for initial namespace reference counts
  ns: make all reference counts on initial namespace a nop
  ipc: enable is_ns_init_id() assertions
  fs: use boolean to indicate anonymous mount namespace
  ns: rename is_initial_namespace()
  ns: make is_initial_namespace() argument const
  nstree: use guards for ns_tree_lock
  nstree: simplify owner list iteration
  nstree: switch to new structures
  nstree: add helper to operate on struct ns_tree_{node,root}
  nstree: move nstree types into separate header
  nstree: decouple from ns_common header
  ns: move namespace types into separate header

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-0-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:37 +01:00
Christian Brauner 6453937581
selftests/namespaces: fix nsid tests
Ensure that we always kill and cleanup all processes.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-17-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:32 +01:00
Christian Brauner c2bbd2db52
ns: drop custom reference count initialization for initial namespaces
Initial namespaces don't modify their reference count anymore.
They remain fixed at one so drop the custom refcount initializations.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-16-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:32 +01:00
Christian Brauner 282879afa0
pid: rely on common reference count behavior
Now that we changed the generic reference counting mechanism for all
namespaces to never manipulate reference counts of initial namespaces we
can drop the special handling for pid namespaces.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-15-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:32 +01:00
Christian Brauner 7118daabb6
ns: add asserts for initial namespace active reference counts
They always remain fixed at one. Notice when that assumption is broken.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-14-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:32 +01:00
Christian Brauner 2b60d56acc
ns: add asserts for initial namespace reference counts
They always remain fixed at one. Notice when that assumption is broken.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-13-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:31 +01:00
Christian Brauner 657aeb436d
ns: make all reference counts on initial namespace a nop
They are always active, so there is no need for needless cacheline ping-pong.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-12-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:31 +01:00
Christian Brauner 3826d5dd06
ipc: enable is_ns_init_id() assertions
The ipc namespace may call put_ipc_ns() and get_ipc_ns() before it is
added to the namespace tree. Assign the id early like we do for some
other namespaces.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-11-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:31 +01:00
Christian Brauner d9a44089ac
fs: use boolean to indicate anonymous mount namespace
Stop playing games with the namespace id and use a boolean instead:

* This will remove the special-casing we need to do everywhere for mount
  namespaces.

* It will allow us to use asserts on the namespace id for initial
  namespaces everywhere.

* It will allow us to put anonymous mount namespaces on the namespaces
  trees in the future and thus make them available to statmount() and
  listmount().

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-10-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:31 +01:00
Christian Brauner 6bf253855a
ns: rename is_initial_namespace()
Rename is_initial_namespace() to ns_init_inum() and make it symmetrical
with the ns id variant.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-9-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:31 +01:00
Christian Brauner ed93c0697a
ns: make is_initial_namespace() argument const
We don't modify the data structure at all so pass it as const.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-8-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:31 +01:00
Christian Brauner 298ab06ae4
nstree: use guards for ns_tree_lock
Make use of the guard infrastructure for ns_tree_lock.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-7-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:31 +01:00
Christian Brauner 8a30420c89
nstree: simplify owner list iteration
Make use of list_for_each_entry_from_rcu().

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-6-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:30 +01:00
Christian Brauner a657bc8a75
nstree: switch to new structures
Switch the nstree management to the new combined structures.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-5-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:30 +01:00
Christian Brauner d12ea8062f
nstree: add helper to operate on struct ns_tree_{node,root}
Add helpers that work on the combined rbtree and rculist structure.
This will make the code a lot more manageable and legible.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-4-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:30 +01:00
Christian Brauner 1c64fb02ac
nstree: move nstree types into separate header
Introduce two new fundamental data structures for namespace tree
management in a separate header file.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-3-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:30 +01:00
Christian Brauner ea1549e628
nstree: decouple from ns_common header
Forward declare struct ns_common and remove the include of ns_common.h.
We want ns_common.h to possibly include nstree structures but not the
other way around.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-2-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:30 +01:00
Christian Brauner 2b9a0f21fb
ns: move namespace types into separate header
Add a dedicated header for namespace types.

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-1-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:30 +01:00
Christian Brauner a67ee4e2ba
Merge branch 'kbuild-6.19.fms.extension'
Bring in the shared branch with the kbuild tree to enable
'-fms-extensions' for 6.19. Further namespace cleanup work
requires this extension.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 09:59:08 +01:00
Christian Brauner ae901e5e2e
Merge patch series "ns: fixes for namespace iteration and active reference counting"
Christian Brauner <brauner@kernel.org> says:

* Make sure to initialize the active reference count for the initial
  network namespace and prevent __ns_common_init() from returning too
  early.

* Make sure that passive reference counts are dropped outside of rcu
  read locks as some namespaces such as the mount namespace do in fact
  sleep when putting the last reference.

* The setns() system call supports:

  (1) namespace file descriptors (nsfd)
  (2) process file descriptors (pidfd)

  When using nsfds the namespaces will remain active because they are
  pinned by the vfs. However, when pidfds are used things are more
  complicated.

  When the target task exits (passing through exit_nsproxy_namespaces())
  or is reaped (thus also passing through exit_cred_namespaces()) after
  the setns()'ing task has called prepare_nsset() but before it has
  called commit_nsset(), the active reference counts of the set of
  namespaces it wants to setns() to might have been dropped already:

    P1                                                              P2

    pid_p1 = clone(CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
                                                                    pidfd = pidfd_open(pid_p1)
                                                                    setns(pidfd, CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
                                                                    prepare_nsset()

    exit(0)
    // ns->__ns_active_ref        == 1
    // parent_ns->__ns_active_ref == 1
    -> exit_nsproxy_namespaces()
    -> exit_cred_namespaces()

    // ns_active_ref_put() will also put
    // the reference on the owner of the
    // namespace. If the only reason the
    // owning namespace was alive was
    // because it was a parent of @ns
    // its active reference count now goes
    // to zero... --------------------------------
    //                                           |
    // ns->__ns_active_ref        == 0           |
    // parent_ns->__ns_active_ref == 0           |
                                                 |                  commit_nsset()
                                                 -----------------> // If setns()
                                                                    // now manages to install the namespaces
                                                                    // it will call ns_active_ref_get()
                                                                    // on them thus bumping the active reference
                                                                    // count from zero again but without also
                                                                    // taking the required reference on the owner.
                                                                    // Thus we get:
                                                                    //
                                                                    // ns->__ns_active_ref        == 1
                                                                    // parent_ns->__ns_active_ref == 0

    When later someone does ns_active_ref_put() on @ns it will underflow
    parent_ns->__ns_active_ref leading to a splat from our asserts
    thinking there are still active references when in fact the counter
    just underflowed.

  So resurrect the ownership chain if necessary as well. If the caller
  succeeded in grabbing passive references to the set of namespaces,
  setns() should simply succeed even if the target task exits or gets
  reaped in the meantime.

  The race is rare and can only be triggered when using pidfds to setns()
  to namespaces. Also note that active references on initial namespaces
  are nops.

  Since we now always handle parent references directly we can drop
  ns_ref_active_get_owner() when adding a namespace to a namespace tree.
  This is now all handled uniformly in the places where the new namespaces
  actually become active.

* patches from https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-0-ae8a4ad5a3b3@kernel.org:
  selftests/namespaces: test for efault
  selftests/namespaces: add active reference count regression test
  ns: add asserts for active refcount underflow
  ns: handle setns(pidfd, ...) cleanly
  ns: return EFAULT on put_user() error
  ns: make sure reference are dropped outside of rcu lock
  ns: don't increment or decrement initial namespaces
  ns: don't skip active reference count initialization

Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-0-ae8a4ad5a3b3@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 15:54:02 +01:00
Christian Brauner 07d7ad46da
selftests/namespaces: test for efault
Ensure that put_user() can fail and that namespace cleanup works
correctly.

Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-8-ae8a4ad5a3b3@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 15:53:56 +01:00
Christian Brauner aa70b9cf68
Merge branch 'kbuild-6.19.fms.extension'
Bring in the shared branch with the Kbuild tree for enabling
'-fms-extensions' for 6.19.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 10:41:58 +01:00
Christian Brauner 3c60b0b1e5 Shared branch between Kbuild and other trees for enabling '-fms-extensions' for 6.19
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQR74yXHMTGczQHYypIdayaRccAalgUCaQzbRwAKCRAdayaRccAa
 ls8lAP9Dj1mOl+KTtajMvDnDTym4Sso9CaFP+5maFAv9CflAIwEA5QEtSwI9sMcH
 ty8x9Y6TTuib+ns37i2jxR8cIt4jHwU=
 =WA6M
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaRGyjgAKCRCRxhvAZXjc
 okJ7AQDFRmoidpqHmRlvZQ3aismKkNHfx2k67QtRlX+YxDi8rQEAmmKyKUiX/SZV
 39TroGfiJ5ytQuXwz3QxG/34cA+kAgQ=
 =HqhX
 -----END PGP SIGNATURE-----

Merge patch "kbuild: Add '-fms-extensions' to areas with dedicated CFLAGS"

Nathan Chancellor <nathan@kernel.org> says:

Shared branch between Kbuild and other trees for enabling
'-fms-extensions' for 6.19.

* tag 'kbuild-ms-extensions-6.19' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
  kbuild: Add '-fms-extensions' to areas with dedicated CFLAGS
  Kbuild: enable -fms-extensions
  jfs: Rename _inline to avoid conflict with clang's '-fms-extensions'

Link: https://patch.msgid.link/20251101-kbuild-ms-extensions-dedicated-cflags-v1-1-38004aba524b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 10:38:07 +01:00
Christian Brauner 88efd7c699
selftests/namespaces: add active reference count regression test
Add a regression test for setns() with pidfd.

Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-7-ae8a4ad5a3b3@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 10:20:54 +01:00
Christian Brauner 57b39aabb9
ns: add asserts for active refcount underflow
Add a few more asserts to detect active reference count underflows.

Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-6-ae8a4ad5a3b3@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 10:20:54 +01:00
Christian Brauner f8d5a8970d
ns: handle setns(pidfd, ...) cleanly
The setns() system call supports:

(1) namespace file descriptors (nsfd)
(2) process file descriptors (pidfd)

When using nsfds the namespaces will remain active because they are
pinned by the vfs. However, when pidfds are used things are more
complicated.

When the target task exits (passing through exit_nsproxy_namespaces())
or is reaped (thus also passing through exit_cred_namespaces()) after
the setns()'ing task has called prepare_nsset() but before it has called
commit_nsset(), the active reference counts of the set of namespaces it
wants to setns() to might have been dropped already:

  P1                                                              P2

  pid_p1 = clone(CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
                                                                  pidfd = pidfd_open(pid_p1)
                                                                  setns(pidfd, CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS)
                                                                  prepare_nsset()

  exit(0)
  // ns->__ns_active_ref        == 1
  // parent_ns->__ns_active_ref == 1
  -> exit_nsproxy_namespaces()
  -> exit_cred_namespaces()

  // ns_active_ref_put() will also put
  // the reference on the owner of the
  // namespace. If the only reason the
  // owning namespace was alive was
  // because it was a parent of @ns
  // its active reference count now goes
  // to zero... --------------------------------
  //                                           |
  // ns->__ns_active_ref        == 0           |
  // parent_ns->__ns_active_ref == 0           |
                                               |                  commit_nsset()
                                               -----------------> // If setns()
                                                                  // now manages to install the namespaces
                                                                  // it will call ns_active_ref_get()
                                                                  // on them thus bumping the active reference
                                                                  // count from zero again but without also
                                                                  // taking the required reference on the owner.
                                                                  // Thus we get:
                                                                  //
                                                                  // ns->__ns_active_ref        == 1
                                                                  // parent_ns->__ns_active_ref == 0

  When later someone does ns_active_ref_put() on @ns it will underflow
  parent_ns->__ns_active_ref leading to a splat from our asserts
  thinking there are still active references when in fact the counter
  just underflowed.

So resurrect the ownership chain if necessary as well. If the caller
succeeded in grabbing passive references to the set of namespaces,
setns() should simply succeed even if the target task exits or gets
reaped in the meantime and thus has dropped all active references to its
namespaces.

The race is rare and can only be triggered when using pidfds to setns()
to namespaces. Also note that active references on initial namespaces
are nops.

Since we now always handle parent references directly we can drop
ns_ref_active_get_owner() when adding a namespace to a namespace tree.
This is now all handled uniformly in the places where the new namespaces
actually become active.

Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-5-ae8a4ad5a3b3@kernel.org
Fixes: 3c9820d5c64a ("ns: add active reference count")
Reported-by: syzbot+1957b26299cf3ff7890c@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 10:20:54 +01:00
Christian Brauner a51dce7c32
ns: return EFAULT on put_user() error
Don't return EINVAL, return EFAULT just like we do in other system
calls.

Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-4-ae8a4ad5a3b3@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 10:20:54 +01:00
Christian Brauner 2ec2aff3c8
ns: make sure reference are dropped outside of rcu lock
The mount namespace may in fact sleep when putting the last passive
reference, so we need to drop the namespace reference outside of the rcu
read lock. Do this by delaying the put until the next iteration, where
we've already moved on to the next namespace and legitimized it. Once we
drop the rcu read lock to call put_user() we will also drop the
reference to the previous namespace in the tree.

Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-3-ae8a4ad5a3b3@kernel.org
Fixes: 76b6f5dfb3 ("nstree: add listns()")
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 10:20:53 +01:00
Christian Brauner 7cd3d20441
ns: don't increment or decrement initial namespaces
There's no need to bump the active reference counts of initial
namespaces as they're always active and can simply remain at 1.

Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-2-ae8a4ad5a3b3@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 10:20:53 +01:00
Christian Brauner 0355dcae2d
ns: don't skip active reference count initialization
Don't skip active reference count initialization for initial namespaces.
Doing so breaks network namespace active reference counting.

Link: https://patch.msgid.link/20251109-namespace-6-19-fixes-v1-1-ae8a4ad5a3b3@kernel.org
Fixes: 3a18f80918 ("ns: add active reference count")
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10 10:20:50 +01:00
Christian Brauner c8e00cdc74
Merge patch series "credential guards: credential preparation"
Christian Brauner <brauner@kernel.org> says:

This converts most users combining

* prepare_creds()
* modify new creds
* override_creds()
* revert_creds()
* put_cred()

to rely on credentials guards.

* patches from https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-0-b447b82f2c9b@kernel.org:
  trace: use override credential guard
  trace: use prepare credential guard
  coredump: use override credential guard
  coredump: use prepare credential guard
  coredump: split out do_coredump() from vfs_coredump()
  coredump: mark struct mm_struct as const
  coredump: pass struct linux_binfmt as const
  coredump: move revert_cred() before coredump_cleanup()
  sev-dev: use override credential guards
  sev-dev: use prepare credential guard
  sev-dev: use guard for path
  cred: add prepare credential guard

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-0-b447b82f2c9b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 23:11:52 +01:00
Christian Brauner 06765b6efc
trace: use override credential guard
Use override credential guards for scoped credential override with
automatic restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-12-b447b82f2c9b@kernel.org
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 23:11:52 +01:00
Christian Brauner 2ed6a34de9
trace: use prepare credential guard
Use the prepare credential guard for allocating a new set of
credentials.

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-11-b447b82f2c9b@kernel.org
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 23:11:52 +01:00
Christian Brauner 545985dd37
coredump: use override credential guard
Use override credential guards for scoped credential override with
automatic restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-10-b447b82f2c9b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 23:11:52 +01:00
Christian Brauner 8ed3473c5a
coredump: use prepare credential guard
Use the prepare credential guard for allocating a new set of
credentials.

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-9-b447b82f2c9b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 23:11:52 +01:00
Christian Brauner af9803d4b8
coredump: split out do_coredump() from vfs_coredump()
Make the function easier to follow and prepare for some of the following
changes.

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-8-b447b82f2c9b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 23:11:52 +01:00
Christian Brauner 313a335057
coredump: mark struct mm_struct as const
We don't actually modify it.

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-7-b447b82f2c9b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 23:11:51 +01:00
Christian Brauner 1ec760fb42
coredump: pass struct linux_binfmt as const
We don't actually modify it.

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-6-b447b82f2c9b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 23:11:51 +01:00
Christian Brauner eb937201ba
coredump: move revert_cred() before coredump_cleanup()
There's no need to pin the credentials across the coredump_cleanup()
call. Nothing in there depends on elevated credentials.

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-5-b447b82f2c9b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 23:11:51 +01:00
Christian Brauner b7b4f7554b
sev-dev: use override credential guards
Use override credential guards for scoped credential override with
automatic restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-4-b447b82f2c9b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 23:11:42 +01:00
Christian Brauner 73fd0dba0b
Merge patch series "fs: introduce super write guard"
Christian Brauner <brauner@kernel.org> says:

I'm in the process of adding a few more guards for vfs constructs.
I've chosen the easy case of sb_start_write() and sb_end_write()
and converted eligible callers. I think long-term we can move a lot of
the manual placement to completely rely on guards - where sensible.
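
A sketch of what eligible callers end up looking like, assuming the guard
is built on the usual DEFINE_GUARD() machinery around
sb_start_write()/sb_end_write():

DEFINE_GUARD(super_write, struct super_block *,
	     sb_start_write(_T), sb_end_write(_T))

scoped_guard(super_write, sb) {
	/* writes permitted; sb_end_write() runs automatically on scope
	 * exit, including early returns */
}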

* patches from https://patch.msgid.link/20251104-work-guards-v1-0-5108ac78a171@kernel.org:
  xfs: use super write guard in xfs_file_ioctl()
  open: use super write guard in do_ftruncate()
  btrfs: use super write guard in relocating_repair_kthread()
  ext4: use super write guard in write_mmp_block()
  btrfs: use super write guard in sb_start_write()
  btrfs: use super write guard btrfs_run_defrag_inode()
  btrfs: use super write guard in btrfs_reclaim_bgs_work()
  fs: add super_write_guard

Link: https://patch.msgid.link/20251104-work-guards-v1-0-5108ac78a171@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 22:59:31 +01:00
Christian Brauner ab5f296076
xfs: use super write guard in xfs_file_ioctl()
Link: https://patch.msgid.link/20251104-work-guards-v1-8-5108ac78a171@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 22:59:27 +01:00
Christian Brauner 97f9d2d282
open: use super write guard in do_ftruncate()
Link: https://patch.msgid.link/20251104-work-guards-v1-7-5108ac78a171@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 22:52:15 +01:00
Christian Brauner b7b8aca68e
btrfs: use super write guard in relocating_repair_kthread()
Link: https://patch.msgid.link/20251104-work-guards-v1-6-5108ac78a171@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 22:52:15 +01:00
Christian Brauner 2774bac21f
ext4: use super write guard in write_mmp_block()
Link: https://patch.msgid.link/20251104-work-guards-v1-5-5108ac78a171@kernel.org
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 22:52:15 +01:00
Christian Brauner 6e5b78cb17
btrfs: use super write guard in sb_start_write()
Link: https://patch.msgid.link/20251104-work-guards-v1-4-5108ac78a171@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 22:52:15 +01:00
Christian Brauner e79a4512cc
btrfs: use super write guard btrfs_run_defrag_inode()
Link: https://patch.msgid.link/20251104-work-guards-v1-3-5108ac78a171@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 22:52:15 +01:00
Christian Brauner a5e3d0be9e
btrfs: use super write guard in btrfs_reclaim_bgs_work()
Link: https://patch.msgid.link/20251104-work-guards-v1-2-5108ac78a171@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 22:52:15 +01:00
Christian Brauner 8e4d576ed3
fs: add super_write_guard
Link: https://patch.msgid.link/20251104-work-guards-v1-1-5108ac78a171@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 22:52:15 +01:00
Mateusz Guzik 5b8ed52866
fs: inline current_umask() and move it to fs_struct.h
There is no good reason to have this as a func call, other than avoiding
the churn of adding fs_struct.h as needed.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251104170448.630414-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 22:51:23 +01:00
Christian Brauner 723cd9872d
Merge patch series "fs: start to split up fs.h"
Christian Brauner <brauner@kernel.org> says:

Take first steps to split up fs.h. Add fs/super_types.h and fs/super.h
headers that contain the types and functions associated with super
blocks respectively.

* patches from https://patch.msgid.link/20251104-work-fs-header-v1-0-fb39a2efe39e@kernel.org:
  fs: add fs/super.h header
  fs: add fs/super_types.h header
  fs: rename fs_types.h to fs_dirent.h

Link: https://patch.msgid.link/20251104-work-fs-header-v1-0-fb39a2efe39e@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 22:51:23 +01:00
Christian Brauner f7b3d14165
fs: add fs/super.h header
Split out super block associated functions into a separate header.

Link: https://patch.msgid.link/20251104-work-fs-header-v1-3-fb39a2efe39e@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 22:51:21 +01:00
Christian Brauner e0b62a4dee
fs: add fs/super_types.h header
Split out super block associated structures into a separate header.

Link: https://patch.msgid.link/20251104-work-fs-header-v1-2-fb39a2efe39e@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 22:47:52 +01:00
Christian Brauner 0d534518ce
Merge patch series "Fix two syzbot corruption bugs in minix filesystem"
Jori Koolstra <jkoolstra@xs4all.nl> says:

Syzbot fuzzes fs/ by trying to mount and manipulate deliberately
corrupted filesystems. This should not lead to BUG_ONs and WARN_ONs for
easy to detect corruptions. This series adds code to be able to report
such corruptions and fixes two syzbot bugs of this kind.

* patches from https://patch.msgid.link/20251104143005.3283980-1-jkoolstra@xs4all.nl:
  Fix a drop_nlink warning in minix_rename
  Fix a drop_nlink warning in minix_rmdir
  Add error handling to minix filesystem for inode corruption detection

Link: https://patch.msgid.link/20251104143005.3283980-1-jkoolstra@xs4all.nl
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 13:45:26 +01:00
Jori Koolstra 009a2ba403
Fix a drop_nlink warning in minix_rename
Syzbot found a drop_nlink warning that is triggered by an easy to
detect nlink corruption. This patch adds sanity checks to minix_unlink
and minix_rename to prevent the warning and instead return EFSCORRUPTED
to the caller.

The changes were tested using the syzbot reproducer as well as local
testing.

Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
Link: https://patch.msgid.link/20251104143005.3283980-4-jkoolstra@xs4all.nl
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: syzbot+a65e824272c5f741247d@syzkaller.appspotmail.com
Closes: https://syzbot.org/bug?extid=a65e824272c5f741247d
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 13:45:21 +01:00
Jori Koolstra d3e0e8661c
Fix a drop_nlink warning in minix_rmdir
Syzbot found a drop_nlink warning that is triggered by an easy to
detect nlink corruption of a directory. This patch adds a sanity check
to minix_rmdir to prevent the warning and instead return EFSCORRUPTED to
the caller.

The changes were tested using the syzbot reproducer as well as local
testing.

Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
Link: https://patch.msgid.link/20251104143005.3283980-3-jkoolstra@xs4all.nl
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: syzbot+4e49728ec1cbaf3b91d2@syzkaller.appspotmail.com
Closes: https://syzbot.org/bug?extid=4e49728ec1cbaf3b91d2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 13:45:21 +01:00
Jori Koolstra 21215ce7a9
Add error handling to minix filesystem for inode corruption detection
We would like to provide early and specific warnings of filesystem
corruption without running into generic WARN_ONs and BUG_ONs.
Towards this goal, ext4, for example, has an EFSCORRUPTED errno and a
standardized inode corruption message format. This patch adds this
errno and message format to the minix filesystem.
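
The convention borrowed from ext4 boils down to this (sketch; the minix
helper name here is hypothetical):

#define EFSCORRUPTED	EUCLEAN	/* the same alias ext4 uses */

/* report the corruption and bail out instead of WARNing */
minix_error_inode(inode, "corrupted nlink (%u)", inode->i_nlink);
return -EFSCORRUPTED;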

Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
Link: https://patch.msgid.link/20251104143005.3283980-2-jkoolstra@xs4all.nl
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 13:45:21 +01:00
Christian Brauner ca3557a686
Merge patch series "alloc misaligned vectors for zoned XFS v2"
Christoph Hellwig <hch@lst.de> says:

This series enables the new block layer support for misaligned
individual vectors for zoned XFS.

The first patch is from Qu and was supposedly already applied to
the vfs iomap 6.19 branch, but I can't find it there.  The next
two are small fixups for it, and the last one makes use of this
new functionality in XFS.

* patches from https://patch.msgid.link/20251031131045.1613229-1-hch@lst.de:
  xfs: support sub-block aligned vectors in always COW mode
  iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag

Link: https://patch.msgid.link/20251031131045.1613229-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 13:09:32 +01:00
Christoph Hellwig 8caec6c9fe
xfs: support sub-block aligned vectors in always COW mode
Now that the block layer and iomap have grown support to indicate
the bio sector size explicitly instead of assuming the device sector
size, we can ask for logical block size alignment and thus support
direct I/O writes where the overall size is logical block size
aligned, but the boundaries between vectors might not be.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251031131045.1613229-3-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 13:09:27 +01:00
Qu Wenruo 001397f5ef
iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag
Btrfs requires all of its bios to be fs block aligned. Normally that
is fine, but with the incoming support for block sizes larger than
the page size (bs > ps), the requirement is no longer met for direct
I/O.

That is because iomap_dio_bio_iter() calls bio_iov_iter_get_pages(),
which only requires alignment to bdev_logical_block_size().

In the real world that value is either 512 or 4K, so on 4K page-sized
systems bio_iov_iter_get_pages() can break the bio at any page
boundary, violating btrfs' requirement in the bs > ps cases.

To address this problem, introduce a new public iomap dio flag,
IOMAP_DIO_FSBLOCK_ALIGNED.

When __iomap_dio_rw() is called with that new flag, iomap_dio::flags
inherits it, and iomap_dio_bio_iter() factors the fs block size into
the alignment calculation and passes that alignment to
bio_iov_iter_get_pages(), respecting the fs block size requirement.

The initial user of this flag will be btrfs, which needs to calculate the
checksum for direct read and thus requires the biovec to be fs block
aligned for the incoming bs > ps support.
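
A hedged sketch of a btrfs call site opting in (the flag comes from
this patch; the call itself is illustrative and elides the other
arguments btrfs actually passes):

	struct iomap_dio *dio;

	dio = __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops,
			     &btrfs_dio_ops, IOMAP_DIO_FSBLOCK_ALIGNED,
			     NULL, 0);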

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
[hch: also align pos/len, incorporate the trace flags from Darrick]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251031131045.1613229-2-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 13:09:27 +01:00
Christian Brauner 560507cbc1
Merge patch series "iomap: zero range folio batch support"
Brian Foster <bfoster@redhat.com> says:

This adds folio batch support for iomap. This initially only targets
zero range, the use case being zeroing of dirty folios over unwritten
mappings. There is potential to support other operations in the future:
iomap seek data/hole has similar raciness issues as zero range, the
prospect of using this for buffered write has been raised for granular
locking purposes, etc.

The one major caveat with this zero range implementation is that it
doesn't look at iomap_folio_state to determine whether to zero a
sub-folio portion of the folio. Instead it just relies on whether the
folio was dirty or not. This means that spurious zeroing of unwritten
ranges is possible if a folio is dirty but the target range includes a
subrange that is not.

The reasoning is that this is essentially a complexity tradeoff. The
current use cases for iomap_zero_range() are limited mostly to partial
block zeroing scenarios. It's relatively harmless to zero an unwritten
block (i.e. not a correctness issue), and this is something that
filesystems have done in the past without much notice or issue. The
advantage is less code and this makes it a little easier to use a
filemap lookup function for the batch rather than open coding more logic
in iomap. That said, this can probably be enhanced to look at ifs in the
future if the use case expands and/or other operations justify it.

WRT testing, I've tested with and without a local hack to redirect
fallocate zero range calls to iomap_zero_range() in XFS. This helps test
beyond the partial block/folio use case, i.e. to cover boundary
conditions like full folio batch handling, etc. I recently added patch 7
in the spirit of that, which turns this logic into an XFS errortag. Further
comments on that are inline with patch 7.

* patches from https://lore.kernel.org/20251003134642.604736-1-bfoster@redhat.com:
  xfs: error tag to force zeroing on debug kernels
  iomap: remove old partial eof zeroing optimization
  xfs: fill dirty folios on zero range of unwritten mappings
  xfs: always trim mapping to requested range for zero range
  iomap: optional zero range dirty folio processing
  iomap: remove pos+len BUG_ON() to after folio lookup
  filemap: add helper to look up dirty folios in a range

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:25 +01:00
Brian Foster 66d78a1147
xfs: error tag to force zeroing on debug kernels
iomap_zero_range() has to cover various corner cases that are
difficult to test on production kernels because it is used in fairly
limited use cases. For example, it is currently only used by XFS and
mostly only in partial block zeroing cases.

While it's possible to test most of these functional cases, we can
provide more robust test coverage by co-opting fallocate zero range
to invoke zeroing of the entire range instead of the more efficient
block punch/allocate sequence. Add an errortag to occasionally
invoke forced zeroing.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:25 +01:00
Brian Foster 39be21386d
iomap: remove old partial eof zeroing optimization
iomap_zero_range() optimizes the partial eof block zeroing use case
by force zeroing if the mapping is dirty. This is to avoid frequent
flushing on file extending workloads, which hurts performance.

Now that the folio batch mechanism provides a more generic solution
and is used by the only real zero range user (XFS), this isolated
optimization is no longer needed. Remove the unnecessary code and
let callers use the folio batch or fall back to flushing by default.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:25 +01:00
Brian Foster 77c475692c
xfs: fill dirty folios on zero range of unwritten mappings
Use the iomap folio batch mechanism to select folios to zero on zero
range of unwritten mappings. Trim the resulting mapping if the batch
is filled (unlikely for current use cases) to distinguish between a
range to skip and one that requires another iteration due to a full
batch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:25 +01:00
Brian Foster 5c13dde963
xfs: always trim mapping to requested range for zero range
Refactor and tweak the IOMAP_ZERO logic in preparation for supporting
filling the folio batch for unwritten mappings. Drop the superfluous
imap offset check since the hole case has already been filtered out.
Split the delalloc case handling into a sub-branch, and always trim
the imap to the requested offset/count so it can be more easily used
to bound the range to look up in pagecache.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:25 +01:00
Christian Brauner 4966b46652
Merge patch series "fuse: use iomap for buffered reads + readahead"
Joanne Koong <joannelkoong@gmail.com> says:

This series adds fuse iomap support for buffered reads and readahead.
This is needed so that fuse can use granular uptodate tracking when
large folios are enabled, reading in only the non-uptodate portions of
a folio instead of the entire folio. It is also needed to turn on large
folios for servers that use the writeback cache, since otherwise there
is a race condition that may lead to data corruption: if a partial
write is followed by a read that happens before the write has undergone
writeback, the folio will not be marked uptodate by the partial write,
so the read will read in the entire folio from disk, overwriting the
partial write.

This is on top of two locally-patched iomap patches [1] [2] patched on top of
commit f1c864be6e88 ("Merge branch 'vfs-6.18.async' into vfs.all") in
Christian's vfs.all tree.

This series was run through fstests on fuse passthrough_hp with an
out-of-kernel patch enabling fuse large folios.

This patchset does not enable large folios on fuse yet. That will be part
of a different patchset.

* patches from https://lore.kernel.org/20250926002609.1302233-1-joannelkoong@gmail.com:
  fuse: remove fc->blkbits workaround for partial writes
  fuse: use iomap for readahead
  fuse: use iomap for read_folio
  iomap: make iomap_read_folio() a void return
  iomap: move buffered io bio logic into new file
  iomap: add caller-provided callbacks for read and readahead
  iomap: set accurate iter->pos when reading folio ranges
  iomap: track pending read bytes more optimally
  iomap: rename iomap_readpage_ctx struct to iomap_read_folio_ctx
  iomap: rename iomap_readpage_iter() to iomap_read_folio_iter()
  iomap: iterate over folio mapping in iomap_readpage_iter()
  iomap: store read/readahead bio generically
  iomap: move read/readahead bio submission logic into helper function
  iomap: move bio read logic into helper function

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:24 +01:00
Brian Foster 395ed1ef00
iomap: optional zero range dirty folio processing
The only way zero range can currently process unwritten mappings
with dirty pagecache is to check whether the range is dirty before
mapping lookup and then flush when at least one underlying mapping
is unwritten. This ordering is required to prevent iomap lookup from
racing with folio writeback and reclaim.

Since zero range can skip ranges of unwritten mappings that are
clean in cache, this operation can be improved by allowing the
filesystem to provide a set of dirty folios that require zeroing. In
turn, rather than flush or iterate file offsets, zero range can
iterate on folios in the batch and advance over clean or uncached
ranges in between.

Add a folio_batch in struct iomap and provide a helper for
filesystems to populate the batch at lookup time. Update the folio
lookup path to return the next folio in the batch, if provided, and
advance the iter if the folio starts beyond the current offset.
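
A sketch of the intended use from a filesystem's ->iomap_begin() (the
helper comes from this series; its exact signature and return value
are assumptions here):

	/* For IOMAP_ZERO over an unwritten mapping: hand iomap the
	 * dirty folios backing [offset, offset + length) so zero range
	 * can walk them instead of flushing first. */
	end = iomap_fill_dirty_folios(iter, offset, length);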

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:24 +01:00
Joanne Koong 93570c652b
fuse: remove fc->blkbits workaround for partial writes
Now that fuse is integrated with iomap for read/readahead, we can remove
the workaround that was added in commit bd24d2108e ("fuse: fix fuseblk
i_blkbits for iomap partial writes"), which was needed to avoid a race
condition where an iomap partial write may be overwritten by a read if
blocksize < PAGE_SIZE. With iomap read/readahead there is granular
uptodate tracking of blocks, so the race is protected against and the
workaround can be removed.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:24 +01:00
Brian Foster 49590716be
iomap: remove pos+len BUG_ON() to after folio lookup
The bug checks at the top of iomap_write_begin() assume the pos/len
reflect exactly the next range to process. This may no longer be the
case once the get folio path is able to process a folio batch from
the filesystem. On top of that, len is already trimmed to within the
iomap/srcmap by iomap_length(), so these checks aren't terribly
useful. Remove the unnecessary BUG_ON() checks.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:24 +01:00
Joanne Koong 4ea907108a
fuse: use iomap for readahead
Do readahead in fuse using iomap. This gives us granular uptodate
tracking for large folios, which optimizes how much data needs to be
read in. If some portions of the folio are already uptodate (e.g.
through a prior write), we only need to read in the non-uptodate
portions.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:24 +01:00
Brian Foster f8d98072fe
filemap: add helper to look up dirty folios in a range
Add a new filemap_get_folios_dirty() helper to look up existing dirty
folios in a range and add them to a folio_batch. This is to support
optimization of certain iomap operations that only care about dirty
folios in a target range. For example, zero range only zeroes the subset
of dirty pages over unwritten mappings, seek hole/data may use similar
logic in the future, etc.

Note that the helper is intended for use under internal fs locks.
Therefore it trylocks folios in order to filter out clean folios.
This loosely follows the logic from filemap_range_has_writeback().
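
A usage sketch, assuming a signature modeled on the existing
filemap_get_folios_tag() (the real prototype may differ):

	struct folio_batch fbatch;
	pgoff_t index = start >> PAGE_SHIFT;

	folio_batch_init(&fbatch);
	filemap_get_folios_dirty(mapping, &index, end >> PAGE_SHIFT,
				 &fbatch);
	/* fbatch now holds the dirty folios in [start, end]; clean
	 * folios were filtered out via trylock. */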

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:24 +01:00
Joanne Koong 03e9618e82
fuse: use iomap for read_folio
Read folio data into the page cache using iomap. This gives us granular
uptodate tracking for large folios, which optimizes how much data needs
to be read in. If some portions of the folio are already uptodate (e.g.
through a prior write), we only need to read in the non-uptodate
portions.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:23 +01:00
Joanne Koong d4e88bb08e
iomap: make iomap_read_folio() a void return
No errors are propagated in iomap_read_folio(). Change
iomap_read_folio() to a void return to make this clearer to callers.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:23 +01:00
Christoph Hellwig c2b1adc462
iomap: move buffered io bio logic into new file
Move bio logic in the buffered io code into its own file and remove
CONFIG_BLOCK gating for iomap read/readahead.

[1] https://lore.kernel.org/linux-fsdevel/aMK2GuumUf93ep99@infradead.org/

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:23 +01:00
Joanne Koong b2f35ac414
iomap: add caller-provided callbacks for read and readahead
Add caller-provided callbacks for read and readahead so that iomap
read/readahead can be used generically, especially by filesystems that
are not block-based.

In particular, this:
* Modifies the read and readahead interface to take in a
  struct iomap_read_folio_ctx that is publicly defined as:

  struct iomap_read_folio_ctx {
	const struct iomap_read_ops *ops;
	struct folio *cur_folio;
	struct readahead_control *rac;
	void *read_ctx;
  };

  where struct iomap_read_ops is defined as:

  struct iomap_read_ops {
      int (*read_folio_range)(const struct iomap_iter *iter,
                             struct iomap_read_folio_ctx *ctx,
                             size_t len);
      void (*read_submit)(struct iomap_read_folio_ctx *ctx);
  };

  read_folio_range() reads in the folio range and must be provided by
  the caller. read_submit() is optional and is used for submitting any
  pending read requests.

* Modifies existing filesystems that use iomap for read and readahead to
  use the new API, through the new statically inlined helpers
  iomap_bio_read_folio() and iomap_bio_readahead(). There is no change
  in functionality for those filesystems.
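
A hedged sketch of a non-block-based filesystem wiring up the new ops
(the struct layout is from this commit; the fuse-side names are
illustrative):

	static int fuse_iomap_read_folio_range(const struct iomap_iter *iter,
					       struct iomap_read_folio_ctx *ctx,
					       size_t len)
	{
		/* Issue a read for [iter->pos, iter->pos + len) into
		 * ctx->cur_folio; ctx->read_ctx carries fs-private
		 * state across calls. */
		return 0;
	}

	static const struct iomap_read_ops fuse_iomap_read_ops = {
		.read_folio_range	= fuse_iomap_read_folio_range,
		/* .read_submit is optional */
	};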

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:23 +01:00
Joanne Koong fb7a10ac47
iomap: set accurate iter->pos when reading folio ranges
Advance iter to the correct position before calling an IO helper to read
in a folio range. This allows the helper to reliably use iter->pos to
determine the starting offset for reading.

This will simplify the interface for reading in folio ranges when iomap
read/readahead supports caller-provided callbacks.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:23 +01:00
Joanne Koong d43558ae67
iomap: track pending read bytes more optimally
Instead of incrementing read_bytes_pending for every folio range read in
(which requires acquiring the spinlock to do so), set read_bytes_pending
to the folio size when the first range is asynchronously read in, keep
track of how many bytes total are asynchronously read in, and adjust
read_bytes_pending accordingly after issuing requests to read in all the
necessary ranges.

iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero
value for pending bytes necessarily indicates the folio is in the bio.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:57:23 +01:00
Kaushlendra Kumar a6446829f8
init: Replace simple_strtoul() with kstrtouint() in root_delay_setup()
Replace deprecated simple_strtoul() with kstrtouint() for better error
handling and input validation. Return 0 on parsing failure to indicate
invalid parameter, maintaining existing behavior for valid inputs.

The simple_strtoul() function is deprecated in favor of the
kstrtoint() family of functions, which provide better error handling
and are recommended for new code and replacements.
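
A sketch of the converted parser matching the described behavior
(illustrative; see init/do_mounts.c for the real code):

	static int __init root_delay_setup(char *str)
	{
		unsigned int delay;

		if (kstrtouint(str, 0, &delay))
			return 0;	/* invalid parameter */
		root_delay = delay;
		return 1;
	}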

Signed-off-by: Kaushlendra Kumar <kaushlendra.kumar@intel.com>
Link: https://patch.msgid.link/20251103080627.1844645-1-kaushlendra.kumar@intel.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:49:38 +01:00
Christian Brauner a4db63b88f
Merge patch series "fs: fully sync all fsese even for an emergency sync"
Qu Wenruo <wqu@suse.com> says:

The first patch is a cleanup related to the sync_inodes_one_sb()
callback. Since it always waits for the writeback, there is no need to
pass any parameter to it.

The second patch is a fix mostly affecting btrfs, as btrfs requires an
explicit sync_fs() call with wait == 1 to commit its super blocks,
and sync_bdevs() won't cut it at all.

However, the current emergency sync never passes wait == 1, which means
btrfs will write back all dirty data and metadata but still not update
its super block, leaving everything pointing back to the old
data/metadata.

This leads to a problem where btrfs doesn't seem to do anything during
an emergency sync.

The second patch fixes the problem by passing wait == 1 for the second
iteration of sync_fs_one_sb().

* patches from https://patch.msgid.link/cover.1762142636.git.wqu@suse.com:
  fs: fully sync all fses even for an emergency sync
  fs: do not pass a parameter for sync_inodes_one_sb()

Link: https://patch.msgid.link/cover.1762142636.git.wqu@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:30:05 +01:00
Qu Wenruo 2706659d64
fs: fully sync all fses even for an emergency sync
[BUG]
There is a bug report that during an emergency sync, btrfs only writes
back all the dirty data and metadata, with no full transaction commit,
so the super block still points to the old trees and the end user can
only see the old data, not the newer one.

[CAUSE]
Initially this looks like a btrfs-specific bug, since ext4 isn't
affected by it.

But the root problem here is a combination of btrfs features and the
no-wait nature of emergency sync.

Firstly, do_sync_work() calls sync_inodes_one_sb() for every fs, to
write back all the dirty pages for the fs.

Btrfs will properly write back all dirty pages, including both data and
the updated metadata. So far so good.

Then sync_fs_one_sb() is called with @nowait; in the case of btrfs that
means no full transaction commit, thus no super block update.

At this stage, btrfs is only one super block update away from being
fully committed. I believe it's more or less the same for other fses too.

The problem is the next step, sync_bdevs().
Normally other fses already have their super block updated in the page
cache of the block device, but btrfs only updates the super block
during a full transaction commit.

So sync_bdevs() may work for other fses, but not for btrfs: btrfs is
still using its older super block, all pointing back to the old metadata
and data.

Thus if power loss happens after an emergency sync, the end user will
only see the old data, not the newer one, despite everything but the
super block having been written back.

[FIX]
Since the emergency sync is already executing in a workqueue, I didn't
see much need to only do a nowait sync, especially given that
sync_inodes_one_sb() always waits for the writeback to finish.

Instead, for the second iteration of sync_fs_one_sb(), pass wait == 1
into it, so fses like btrfs can properly commit their super blocks.
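
A sketch of the resulting sequence in do_sync_work() (based on the two
patches in this series; surrounding code elided):

	int nowait = 0, wait = 1;

	iterate_supers(sync_inodes_one_sb, NULL);
	iterate_supers(sync_fs_one_sb, &nowait);
	sync_bdevs(false);
	iterate_supers(sync_inodes_one_sb, NULL);
	iterate_supers(sync_fs_one_sb, &wait);	/* now waits, so btrfs
						 * commits its supers */
	sync_bdevs(false);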

Reported-by: Askar Safin <safinaskar@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/20251101150429.321537-1-safinaskar@gmail.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Link: https://patch.msgid.link/7b7fd40c5fe440b633b6c0c741d96ce93eb5a89a.1762142636.git.wqu@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:29:59 +01:00
Qu Wenruo fbc22c2996
fs: do not pass a parameter for sync_inodes_one_sb()
The function sync_inodes_one_sb() always waits for the writeback and
ignores the optional parameter.

Explicitly pass NULL as the parameter at the call sites inside
do_sync_work().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Link: https://patch.msgid.link/8079af1c4798cb36887022a8c51547a727c353cf.1762142636.git.wqu@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 12:29:59 +01:00
Christian Brauner 0485a18d91
fs: rename fs_types.h to fs_dirent.h
We will split out a bunch of types into a separate header.
So free up the appropriate name for it.

Link: https://patch.msgid.link/20251104-work-fs-header-v1-1-fb39a2efe39e@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 09:51:30 +01:00
Christian Brauner 390d967653
pidfs: reduce wait_pidfd lock scope
There's no need to hold the lock after we realized that pid->attr is
set. We're holding a reference to struct pid so it won't go away and
pidfs_exit() is called once per struct pid.

Link: https://patch.msgid.link/20251105-work-pidfs-wait_pidfd-lock-v1-1-02638783be07@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 00:09:06 +01:00
Christian Brauner a45ff1c7c9
Merge patch series "coredump: cleanups & pidfd extension"
Christian Brauner <brauner@kernel.org> says:

The recent changes to rework coredump handling to rely on unix sockets
are in the process of being used in systemd. Yu reported one
shortcoming, namely that the signal causing the coredump was not
available before the crashing process was reaped.

The previous systemd coredump container interface requires the coredump
file descriptor and basic information, including the signal number, to
be sent to the container. This means we need to have the signal number
available before sending the coredump to the container.

In general, the extension makes sense and fits with the rest of the
coredump information.

In addition to this extension this fixes a bunch of the tests that were
failing and reworks the publication mechanism for exit and coredump info
retrievable via the pidfd ioctl.

* patches from https://patch.msgid.link/20251028-work-coredump-signal-v1-0-ca449b7b7aa0@kernel.org: (22 commits)
  selftests/coredump: add second PIDFD_INFO_COREDUMP_SIGNAL test
  selftests/coredump: add first PIDFD_INFO_COREDUMP_SIGNAL test
  selftests/coredump: ignore ENOSPC errors
  selftests/coredump: add debug logging to coredump socket protocol tests
  selftests/coredump: add debug logging to coredump socket tests
  selftests/coredump: add debug logging to test helpers
  selftests/coredump: handle edge-triggered epoll correctly
  selftests/coredump: fix userspace coredump client detection
  selftests/coredump: fix userspace client detection
  selftests/coredump: split out coredump socket tests
  selftests/coredump: split out common helpers
  selftests/pidfd: add second supported_mask test
  selftests/pidfd: add first supported_mask test
  selftests/pidfd: update pidfd header
  pidfs: expose coredump signal
  pidfs: drop struct pidfs_exit_info
  pidfs: prepare to drop exit_info pointer
  pidfd: add a new supported_mask field
  pidfs: add missing BUILD_BUG_ON() assert on struct pidfd_info
  pidfs: add missing PIDFD_INFO_SIZE_VER1
  ...

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-0-ca449b7b7aa0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:05:03 +01:00
Christian Brauner cbb842548a
selftests/coredump: add second PIDFD_INFO_COREDUMP_SIGNAL test
Verify that when using simple socket-based coredump (@ pattern),
the coredump_signal field is correctly exposed as SIGABRT.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-22-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:05:01 +01:00
Christian Brauner 619e2227cc
selftests/coredump: add first PIDFD_INFO_COREDUMP_SIGNAL test
Verify that when using simple socket-based coredump (@ pattern),
the coredump_signal field is correctly exposed as SIGSEGV.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-21-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:04:59 +01:00
Christian Brauner 32ae33f796
selftests/coredump: ignore ENOSPC errors
If we crash multiple processes at the same time we may run out of space.
Just ignore those errors. They're not actually all that relevant for the
test.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-20-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:04:57 +01:00
Christian Brauner 408a0ed9ee
selftests/coredump: add debug logging to coredump socket protocol tests
So it's easier to figure out bugs.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-19-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:04:55 +01:00
Christian Brauner 2343cbee9f
selftests/coredump: add debug logging to coredump socket tests
So it's easier to figure out bugs.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-18-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:04:53 +01:00
Christian Brauner d5694db5dc
selftests/coredump: add debug logging to test helpers
so we can easily figure out why something failed.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-17-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:04:51 +01:00
Christian Brauner 305e6b167c
selftests/coredump: handle edge-triggered epoll correctly
by putting the file descriptor into non-blocking mode.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-16-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:04:48 +01:00
Christian Brauner 8b64f54c81
selftests/coredump: fix userspace coredump client detection
PIDFD_INFO_COREDUMP is only retrievable until the task has exited. After
it has exited, task->mm is NULL, so if the task didn't actually coredump
we can't retrieve its dumpability settings anymore. Only if the task
did coredump will we have stashed the coredump information in the
respective struct pid.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-15-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:04:46 +01:00
Christian Brauner 32ae9fa406
selftests/coredump: fix userspace client detection
We need to request PIDFD_INFO_COREDUMP in the first place.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-14-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:04:44 +01:00
Christian Brauner c09ea6659e
selftests/coredump: split out coredump socket tests
Split the coredump socket tests into separate files.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-13-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:04:42 +01:00
Christian Brauner c71147f42b
selftests/coredump: split out common helpers
into separate files.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-12-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:04:40 +01:00
Christian Brauner 2593deaac8
selftests/pidfd: add second supported_mask test
Verify that supported_mask is returned even when other fields are
requested.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-11-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:04:38 +01:00
Christian Brauner e12f734208
selftests/pidfd: add first supported_mask test
Verify that when PIDFD_INFO_SUPPORTED_MASK is requested, the kernel
returns the supported_mask field indicating which flags the kernel
supports.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-10-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:04:36 +01:00
Christian Brauner a945535dfd
selftests/pidfd: update pidfd header
Include the new defines and members.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-9-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 22:04:32 +01:00
Christian Brauner 89c545e29e
sev-dev: use prepare credential guard
Use the prepare credential guard for allocating a new set of
credentials.

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-3-b447b82f2c9b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:37:01 +01:00
Christian Brauner 4c5941ca11
sev-dev: use guard for path
Just use a guard and also move the path_put() out of the credential
change's scope. There's no need to do this with the overridden
credentials.

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-2-b447b82f2c9b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:37:00 +01:00
Christian Brauner c8ad3098e1
cred: add prepare credential guard
A lot of code uses the following pattern:

* prepare new credentials
* modify them for their use-case
* drop them

Make that easier with the new guard infrastructure.

Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-1-b447b82f2c9b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:57 +01:00
Christian Brauner a85787996a
Merge patch series "credentials guards: the easy cases"
Christian Brauner <brauner@kernel.org> says:

This converts all users of override_creds() to rely on credentials
guards. Leave all those that do the prepare_creds() + modify creds +
override_creds() dance alone for now. Some of them qualify for their own
variant.

* patches from https://patch.msgid.link/20251103-work-creds-guards-simple-v1-0-a3e156839e7f@kernel.org:
  net/dns_resolver: use credential guards in dns_query()
  cgroup: use credential guards in cgroup_attach_permissions()
  act: use credential guards in acct_write_process()
  smb: use credential guards in cifs_get_spnego_key()
  nfs: use credential guards in nfs_idmap_get_key()
  nfs: use credential guards in nfs_local_call_write()
  nfs: use credential guards in nfs_local_call_read()
  erofs: use credential guards
  binfmt_misc: use credential guards
  backing-file: use credential guards for mmap
  backing-file: use credential guards for splice write
  backing-file: use credential guards for splice read
  backing-file: use credential guards for writes
  backing-file: use credential guards for reads
  aio: use credential guards
  cred: add {scoped_}with_creds() guards

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-0-a3e156839e7f@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:53 +01:00
Christian Brauner 4037e28cd4
net/dns_resolver: use credential guards in dns_query()
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-16-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:51 +01:00
Christian Brauner b66c7af4d8
cgroup: use credential guards in cgroup_attach_permissions()
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-15-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:50 +01:00
Christian Brauner 5db84abd2a
act: use credential guards in acct_write_process()
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-14-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:49 +01:00
Christian Brauner c5c92c624a
smb: use credential guards in cifs_get_spnego_key()
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-13-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:48 +01:00
Christian Brauner f41799b2e1
nfs: use credential guards in nfs_idmap_get_key()
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-12-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:45 +01:00
Christian Brauner bff3c841f7
nfs: use credential guards in nfs_local_call_write()
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-11-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:43 +01:00
Christian Brauner 94afb627df
nfs: use credential guards in nfs_local_call_read()
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-10-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:42 +01:00
Christian Brauner 5e88d1aadc
erofs: use credential guards
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-9-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:40 +01:00
Christian Brauner ff2044cd27
binfmt_misc: use credential guards
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-8-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:39 +01:00
Christian Brauner 6e1d1c1fa7
backing-file: use credential guards for mmap
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-7-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:38 +01:00
Christian Brauner b688171f91
backing-file: use credential guards for splice write
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-6-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:37 +01:00
Christian Brauner c3076d146e
backing-file: use credential guards for splice read
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-5-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:36 +01:00
Christian Brauner f119feaa06
backing-file: use credential guards for writes
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-4-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:35 +01:00
Christian Brauner 4f0a482578
backing-file: use credential guards for reads
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-3-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:34 +01:00
Christian Brauner 84c1a329b4
aio: use credential guards
Use credential guards for scoped credential override with automatic
restoration on scope exit.

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-2-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:33 +01:00
Christian Brauner 019e52e8d3
cred: add scoped_with_creds() guards
and implement scoped_with_kernel_creds() on top of it.
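
A before/after sketch (the override_creds()/revert_creds() pair is the
long-standing API; the guard form follows this series' naming):

	/* before */
	old = override_creds(cred);
	ret = do_something();
	revert_creds(old);

	/* after: credentials revert automatically on scope exit */
	scoped_with_creds(cred)
		ret = do_something();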

Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-1-a3e156839e7f@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:29 +01:00
Christian Brauner e0876bde29
Merge patch series "creds: add {scoped_}with_kernel_creds()"
Christian Brauner <brauner@kernel.org> says:

A few months ago I did work to make override_creds()/revert_creds()
completely reference count free - mostly for the sake of
overlayfs, but it has been beneficial to everyone using this.

In a recent pull request from Jens that introduced another round of
override_creds()/revert_creds() for nbd, Linus asked whether we could
avoid the prepare_kernel_cred() calls that duplicate the kernel
credentials and then drop them again later.

Yes, we can actually. We can use the guard infrastructure to completely
avoid the allocation and then also to never expose the temporary
variable holding the kernel credentials anywhere in the callers.

So add with_kernel_creds() and scoped_with_kernel_creds() for this
purpose. Also take the opportunity to fixup the scoped_class() macro I
introduced two cycles ago.

* patches from https://patch.msgid.link/20251103-work-creds-init_cred-v1-0-cb3ec8711a6a@kernel.org:
  unix: don't copy creds
  target: don't copy kernel creds
  nbd: don't copy kernel creds
  firmware: don't copy kernel creds
  cred: add {scoped_}with_kernel_creds
  cred: make init_cred static
  cred: add kernel_cred() helper
  cleanup: fix scoped_class()

Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-0-cb3ec8711a6a@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:24 +01:00
Christian Brauner 1ad5b411af
unix: don't copy creds
No need to copy kernel credentials.

Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-8-cb3ec8711a6a@kernel.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:21 +01:00
Christian Brauner 0f0e7cee34
target: don't copy kernel creds
Get rid of all the boilerplate and tightly scope when the task runs with
kernel creds.

Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-7-cb3ec8711a6a@kernel.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:18 +01:00
Christian Brauner 4601b7923d
nbd: don't copy kernel creds
No need to copy kernel credentials.

Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-6-cb3ec8711a6a@kernel.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:16 +01:00
Christian Brauner b9e3594e70
firmware: don't copy kernel creds
No need to copy kernel credentials.

Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-5-cb3ec8711a6a@kernel.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:10 +01:00
Christian Brauner ae40e6c657
cred: add scoped_with_kernel_creds()
Add a new cleanup class for override creds. We can make use of this in a
bunch of places going forward.

Based on this, add scoped_with_kernel_creds(), which can be used to
temporarily assume kernel credentials for specific tasks such as
firmware loading or coredump socket connections. At no point will the
caller interact with the kernel credentials directly.
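
A usage sketch (the macro is from this commit; the call site and block
form are illustrative assumptions):

	scoped_with_kernel_creds() {
		/* Everything in this scope runs with kernel
		 * credentials; they are reverted on scope exit, and the
		 * caller never sees the cred pointer itself. */
		err = do_privileged_setup();
	}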

Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-4-cb3ec8711a6a@kernel.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:07 +01:00
Christian Brauner 40314c2818
cred: make init_cred static
There's zero need to expose init_cred. The very few places that
need access can just go through init_task, which is already exported.

Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-3-cb3ec8711a6a@kernel.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:36:02 +01:00
Christian Brauner 4c7ceeb62d
cred: add kernel_cred() helper
Access kernel creds based off of init_task. This will let us avoid any
direct access to init_cred.
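
A plausible shape for the helper (an assumption; the commit only says
it resolves the creds via init_task, and RCU annotations are elided):

	static inline const struct cred *kernel_cred(void)
	{
		return init_task.cred;
	}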

Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-2-cb3ec8711a6a@kernel.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:35:52 +01:00
Christian Brauner 4e97bae1b4
cleanup: fix scoped_class()
This is a class, not a guard so why on earth is it checking for guard
pointers or conditional lock acquisition? None of it makes any sense at
all.

I'm not sure what happened back then. Maybe I had a brief psychedelic
period that I completely forgot about and spaced out into a zone where
that initial macro implementation made any sense at all.

Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-1-cb3ec8711a6a@kernel.org
Fixes: 5c21c5f22d ("cleanup: add a scoped version of CLASS()")
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04 12:35:49 +01:00
Christian Brauner 8ebfb9896c
Merge patch series "nstree: listns()"
Christian Brauner <brauner@kernel.org> says:

As announced a while ago this is the next step building on the nstree
work from prior cycles. There's a bunch of fixes and semantic cleanups
in here and a ton of tests.

Currently listns() is relying on active namespace reference counts which
are introduced alongside this series.

While a namespace is on the namespace trees with a valid reference count
it is possible to reopen it through a namespace file handle. This is all
fine but has some issues that should be addressed.

On current kernels a namespace is visible to userspace in the
following cases:

(1) The namespace is in use by a task.
(2) The namespace is persisted through a VFS object (namespace file
    descriptor or bind-mount).
    Note that (2) only cares about direct persistence of the namespace
    itself, not indirect persistence via, e.g., file->f_cred references
    or similar.
(3) The namespace is a hierarchical namespace type and is the parent of
    a single or multiple child namespaces.

Case (3) is interesting because it is possible that a parent namespace
fulfills neither (1) nor (2), i.e., it is invisible to userspace, yet
it may still be resurrected through the NS_GET_PARENT ioctl().

Currently namespace file handles allow much broader access to namespaces
than what is currently possible via (1)-(3). The reason is that
namespaces may remain pinned for completely internal reasons yet are
inaccessible to userspace.

For example, a user namespace may remain pinned by get_cred() calls that
stash the opener's credentials into file->f_cred. As it stands, file
handles allow resurrecting such a user namespace even though this
should not be possible via (1)-(3). This is a fundamental uapi change
that we shouldn't make if we don't have to.

Consider the following insane case: Various architectures support the
CONFIG_MMU_LAZY_TLB_REFCOUNT option which uses lazy TLB destruction.
When this option is set a userspace task's struct mm_struct may be used
for kernel threads such as the idle task and will only be destroyed once
the cpu's runqueue switches back to another task. But because of ptrace()
permission checks struct mm_struct stashes the user namespace of the
task that struct mm_struct originally belonged to. The kernel thread
will take a reference on the struct mm_struct and thus pin it.

So on an idle system user namespaces can be persisted for arbitrary
amounts of time, which also means that they can be resurrected using
namespace file handles. That makes no sense whatsoever. The problem is
of course exacerbated on large systems with a huge number of cpus.

To handle this nicely we introduce an active reference count which
tracks (1)-(3). This is easy to do as all of these things are already
managed centrally. Only (1)-(3) will count towards the active reference
count and only namespaces which are active may be opened via namespace
file handles.

The problem is that namespaces may be resurrected, which means that they
can become temporarily inactive and be reactivated some time later.
Currently the only example of this is the SIOCGSKNS socket ioctl, which
allows opening a network namespace file descriptor based on a socket
file descriptor.

If a socket is tied to a network namespace that subsequently becomes
inactive, but that socket is persisted by another process in another
network namespace (e.g., via SCM_RIGHTS or pidfd_getfd()), then the
SIOCGSKNS ioctl will resurrect this network namespace.

So calls to open_related_ns() and open_namespace() will end up
resurrecting the corresponding namespace tree.

Note that the active reference count does not regulate the lifetime of
the namespace itself. This is still done by the normal reference count.
The active reference count can only be elevated if the regular reference
count is elevated.

The active reference count also doesn't regulate the presence of a
namespace on the namespace trees. It only regulates its visibility to
namespace file handles (and in later patches to listns()).

A namespace remains on the namespace trees from creation until its
actual destruction. This will allow the kernel to always reach any
namespace trivially and it will also enable subsystems like bpf to walk
the namespace lists on the system for tracing or general introspection
purposes.

Note that different namespaces have different visibility lifetimes on
current kernels. While most namespaces are immediately released when the
last task using them exits, the user and pid namespaces are persisted
and thus both remain accessible via /proc/<pid>/ns/<ns_type>.

The user namespace lifetime is aligned with struct cred and is only
released through exit_creds(). However, it becomes inaccessible to
userspace once the last task using it is reaped, i.e., when
release_task() is called and all proc entries are flushed. Similarly,
the pid namespace is also visible until the last task using it has been
reaped and the associated pid numbers are freed.

The active reference counts of the user- and pid namespace are
decremented once the task is reaped.

Based on the namespace trees and the active reference count, a new
listns() system call allows userspace to iterate through namespaces
in the system. This provides a programmatic interface to discover and
inspect namespaces, enhancing the existing namespace APIs.

Currently, there is no direct way for userspace to enumerate namespaces
in the system. Applications must resort to scanning /proc/<pid>/ns/
across all processes, which is:

1. Inefficient - requires iterating over all processes
2. Incomplete - misses inactive namespaces that aren't attached to any
   running process but are kept alive by file descriptors, bind mounts,
   or parent namespace references
3. Permission-heavy - requires access to /proc for many processes
4. No ordering or ownership.
5. No filtering per namespace type: Must always iterate and check all
   namespaces.

The list goes on. The listns() system call solves these problems by
providing direct kernel-level enumeration of namespaces. It is similar
to listmount() but obviously tailored to namespaces.

/*
 * @req: Pointer to struct ns_id_req specifying search parameters
 * @ns_ids: User buffer to receive namespace IDs
 * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
 * @flags: Reserved for future use (must be 0)
 */
ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
               size_t nr_ns_ids, unsigned int flags);

Returns:
- On success: Number of namespace IDs written to ns_ids
- On error: Negative error code

/*
 * @size: Structure size
 * @ns_id: Starting point for iteration; use 0 for first call, then
 *         use the last returned ID for subsequent calls to paginate
 * @ns_type: Bitmask of namespace types to include (from enum ns_type):
 *           0: Return all namespace types
 *           MNT_NS: Mount namespaces
 *           NET_NS: Network namespaces
 *           USER_NS: User namespaces
 *           etc. Can be OR'd together
 * @user_ns_id: Filter results to namespaces owned by this user namespace:
 *              0: Return all namespaces (subject to permission checks)
 *              LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace
 *              Other value: Namespaces owned by the specified user namespace ID
 */
struct ns_id_req {
        __u32 size;         /* sizeof(struct ns_id_req) */
        __u32 spare;        /* Reserved, must be 0 */
        __u64 ns_id;        /* Last seen namespace ID (for pagination) */
        __u32 ns_type;      /* Filter by namespace type(s) */
        __u32 spare2;       /* Reserved, must be 0 */
        __u64 user_ns_id;   /* Filter by owning user namespace */
};

Example 1: List all namespaces

void list_all_namespaces(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,      /* Start from beginning */
		.ns_type = 0,    /* All types */
		.user_ns_id = 0, /* All user namespaces */
	};
	uint64_t ids[100];
	ssize_t ret;

	printf("All namespaces in the system:\n");
	do {
		ret = listns(&req, ids, 100, 0);
		if (ret < 0) {
			perror("listns");
			break;
		}

		for (ssize_t i = 0; i < ret; i++)
			printf("  Namespace ID: %llu\n", (unsigned long long)ids[i]);

		/* Continue from last seen ID */
		if (ret > 0)
			req.ns_id = ids[ret - 1];
	} while (ret == 100); /* Buffer was full, more may exist */
}

Example 2 : List network namespaces only

void list_network_namespaces(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,
		.ns_type = NET_NS, /* Only network namespaces */
		.user_ns_id = 0,
	};
	uint64_t ids[100];
	ssize_t ret;

	ret = listns(&req, ids, 100, 0);
	if (ret < 0) {
		perror("listns");
		return;
	}

	printf("Network namespaces: %zd found\n", ret);
	for (ssize_t i = 0; i < ret; i++)
		printf("  netns ID: %llu\n", (unsigned long long)ids[i]);
}

Example 3 : List namespaces owned by current user namespace

void list_owned_namespaces(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,
		.ns_type = 0,                      /* All types */
		.user_ns_id = LISTNS_CURRENT_USER, /* Current userns */
	};
	uint64_t ids[100];
	ssize_t ret;

	ret = listns(&req, ids, 100, 0);
	if (ret < 0) {
		perror("listns");
		return;
	}

	printf("Namespaces owned by my user namespace: %zd\n", ret);
	for (ssize_t i = 0; i < ret; i++)
		printf("  ns ID: %llu\n", (unsigned long long)ids[i]);
}

Example 4 : List multiple namespace types

void list_network_and_mount_namespaces(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,
		.ns_type = NET_NS | MNT_NS, /* Network and mount */
		.user_ns_id = 0,
	};
	uint64_t ids[100];
	ssize_t ret;

	ret = listns(&req, ids, 100, 0);
	printf("Network and mount namespaces: %zd found\n", ret);
}

Example 5 : Pagination through large namespace sets

void list_all_with_pagination(void)
{
	struct ns_id_req req = {
		.size = sizeof(req),
		.ns_id = 0,
		.ns_type = 0,
		.user_ns_id = 0,
	};
	uint64_t ids[50];
	size_t total = 0;
	ssize_t ret;

	printf("Enumerating all namespaces with pagination:\n");

	while (1) {
		ret = listns(&req, ids, 50, 0);
		if (ret < 0) {
			perror("listns");
			break;
		}
		if (ret == 0)
			break; /* No more namespaces */

		total += ret;
		printf("  Batch: %zd namespaces\n", ret);

		/* Last ID in this batch becomes start of next batch */
		req.ns_id = ids[ret - 1];

		if (ret < 50)
			break; /* Partial batch = end of results */
	}

	printf("Total: %zu namespaces\n", total);
}

listns() respects namespace isolation and capabilities:

(1) Global listing (user_ns_id = 0):
    - Requires CAP_SYS_ADMIN in the namespace's owning user namespace
    - OR the namespace must be in the caller's namespace context (e.g.,
      a namespace the caller is currently using)
    - User namespaces additionally allow listing if the caller has
      CAP_SYS_ADMIN in that user namespace itself
(2) Owner-filtered listing (user_ns_id != 0):
    - Requires CAP_SYS_ADMIN in the specified owner user namespace
    - OR the namespace must be in the caller's namespace context
    - This allows unprivileged processes to enumerate namespaces they own
(3) Visibility:
    - Only "active" namespaces are listed
    - A namespace is active if it has a non-zero __ns_ref_active count
    - This includes namespaces used by running processes, held by open
      file descriptors, or kept active by bind mounts
    - Inactive namespaces (kept alive only by internal kernel
      references) are not visible via listns()

* patches from https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-0-2e6f823ebdc0@kernel.org: (74 commits)
  selftests/namespace: test listns() pagination
  selftests/namespace: add stress test
  selftests/namespace: commit_creds() active reference tests
  selftests/namespace: third threaded active reference count test
  selftests/namespace: second threaded active reference count test
  selftests/namespace: first threaded active reference count test
  selftests/namespaces: twelth inactive namespace resurrection test
  selftests/namespaces: eleventh inactive namespace resurrection test
  selftests/namespaces: tenth inactive namespace resurrection test
  selftests/namespaces: ninth inactive namespace resurrection test
  selftests/namespaces: eighth inactive namespace resurrection test
  selftests/namespaces: seventh inactive namespace resurrection test
  selftests/namespaces: sixth inactive namespace resurrection test
  selftests/namespaces: fifth inactive namespace resurrection test
  selftests/namespaces: fourth inactive namespace resurrection test
  selftests/namespaces: third inactive namespace resurrection test
  selftests/namespaces: second inactive namespace resurrection test
  selftests/namespaces: first inactive namespace resurrection test
  selftests/namespaces: seventh listns() permission test
  selftests/namespaces: sixth listns() permission test
  ...

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-0-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:25 +01:00
Christian Brauner 2cc1c01fe9
selftests/namespace: test listns() pagination
Minimal test case to reproduce KASAN out-of-bounds in listns pagination.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-72-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:25 +01:00
Christian Brauner fc85885692
selftests/namespace: add stress test
Stress tests for namespace active reference counting.

These tests validate that the active reference counting system can
handle high load scenarios including rapid namespace
creation/destruction, large numbers of concurrent namespaces, and
various edge cases under stress.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-71-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:25 +01:00
Christian Brauner d18cf3f9a4
selftests/namespace: commit_creds() active reference tests
Test credential changes and their impact on namespace active references.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-70-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:24 +01:00
Christian Brauner 80fedf8168
selftests/namespace: third threaded active reference count test
Test that namespaces become inactive after subprocess with multiple
threads exits. Create a subprocess that unshares user and network
namespaces, then creates two threads that share those namespaces. Verify
that after all threads and subprocess exit, the namespaces are no longer
listed by listns() and cannot be opened by open_by_handle_at().

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-69-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:24 +01:00
Christian Brauner ee86103238
selftests/namespace: second threaded active reference count test
Test that a namespace remains active while a thread holds an fd to it.
Even after the thread exits, the namespace should remain active as long
as another thread holds a file descriptor to it.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-68-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:24 +01:00
Christian Brauner 29f083c499
selftests/namespace: first threaded active reference count test
Test that namespace becomes inactive after thread exits. This verifies
active reference counting works with threads, not just processes.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-67-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:24 +01:00
Christian Brauner c89d100f6a
selftests/namespaces: twelfth inactive namespace resurrection test
Test multi-level namespace resurrection across three user namespace levels.

This test creates a complex namespace hierarchy with three levels of user
namespaces and a network namespace at the deepest level. It verifies that
the resurrection semantics work correctly when SIOCGSKNS is called on a
socket from an inactive namespace tree, and that listns() and
open_by_handle_at() correctly respect visibility rules.

Hierarchy after child processes exit (all with 0 active refcount):

         net_L3A (0)                <- Level 3 network namespace
             |
             +
         userns_L3 (0)              <- Level 3 user namespace
             |
             +
         userns_L2 (0)              <- Level 2 user namespace
             |
             +
         userns_L1 (0)              <- Level 1 user namespace
             |
             x
         init_user_ns

The test verifies:
1. SIOCGSKNS on a socket from inactive net_L3A resurrects the entire chain
2. After resurrection, all namespaces are visible in listns()
3. Resurrected namespaces can be reopened via file handles
4. Closing the netns FD cascades down: the entire ownership chain
   (userns_L3 -> userns_L2 -> userns_L1) becomes inactive again
5. Inactive namespaces disappear from listns() and cannot be reopened
6. Calling SIOCGSKNS again on the same socket resurrects the tree again
7. After second resurrection, namespaces are visible and can be reopened

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-66-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:24 +01:00
Christian Brauner c80168b677
selftests/namespaces: eleventh inactive namespace resurrection test
Test combined listns() and file handle operations with socket-kept
netns. Create a netns, keep it alive with a socket, verify it appears in
listns(), then reopen it via file handle obtained from listns() entry.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-65-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:24 +01:00
Christian Brauner 3798991a9f
selftests/namespaces: tenth inactive namespace resurrection test
Test that socket-kept netns can be reopened via file handle.
Verify that a network namespace kept alive by a socket FD can be
reopened using file handles even after the creating process exits.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-64-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:24 +01:00
Christian Brauner b9d09f568b
selftests/namespaces: ninth inactive namespace resurrection test
Test that socket-kept netns appears in listns() output.
Verify that a network namespace kept alive by a socket FD appears in
listns() output even after the creating process exits, and that it
disappears when the socket is closed.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-63-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:24 +01:00
Christian Brauner 6de17ec3cc
selftests/namespaces: eighth inactive namespace resurrection test
Test IPv6 sockets also work with SIOCGSKNS.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-62-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:23 +01:00
Christian Brauner 54a29d1233
selftests/namespaces: seventh inactive namespace resurrection test
Test socket keeps netns active after creating process exits. Verify that
as long as the socket FD exists, the namespace remains active.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-61-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:23 +01:00
Christian Brauner aec2237695
selftests/namespaces: sixth inactive namespace resurrection test
Test multiple sockets keep the same network namespace active. Create
multiple sockets, verify closing some doesn't affect others.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-60-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:23 +01:00
Christian Brauner 2b9fa5bf0c
selftests/namespaces: fifth inactive namespace resurrection test
Test SIOCGSKNS fails on non-socket file descriptors.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-59-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:23 +01:00
Christian Brauner 40226da471
selftests/namespaces: fourth inactive namespace resurrection test
Test SIOCGSKNS across setns. Create a socket in netns A, switch to netns
B, verify SIOCGSKNS still returns netns A.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-58-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:23 +01:00
Christian Brauner 5aec9f455c
selftests/namespaces: third inactive namespace resurrection test
Test SIOCGSKNS with different socket types (TCP, UDP, RAW).

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-57-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:23 +01:00
Christian Brauner c0f06da568
selftests/namespaces: second inactive namespace resurrection test
Test that socket file descriptors keep network namespaces active. Create
a network namespace, create a socket in it, then exit the namespace. The
namespace should remain active while the socket FD is held.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-56-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:23 +01:00
Christian Brauner a1e49d8d18
selftests/namespaces: first inactive namespace resurrection test
Test basic SIOCGSKNS functionality. Create a socket and verify SIOCGSKNS
returns the correct network namespace.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-55-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:22 +01:00
Christian Brauner 39bcc7ae57
selftests/namespaces: seventh listns() permission test
Test that dropping CAP_SYS_ADMIN restricts what we can see.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-54-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:22 +01:00
Christian Brauner cff66421ee
selftests/namespaces: sixth listns() permission test
Test that we can see user namespaces we have CAP_SYS_ADMIN inside of.
This is different from seeing namespaces owned by a user namespace.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-53-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:22 +01:00
Christian Brauner 1c28817eb3
selftests/namespaces: fifth listns() permission test
Test that CAP_SYS_ADMIN in parent user namespace allows seeing
child user namespace's owned namespaces.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-52-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:22 +01:00
Christian Brauner 6f360f2b2f
selftests/namespaces: fourth listns() permission test
Test permission checking with LISTNS_CURRENT_USER.
Verify that listing with LISTNS_CURRENT_USER respects permissions.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-51-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:22 +01:00
Christian Brauner 2635f93989
selftests/namespaces: third listns() permission test
Test that users cannot see namespaces from unrelated user namespaces.
Create two sibling user namespaces, verify they can't see each other's
owned namespaces.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-50-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:22 +01:00
Christian Brauner ec38237731
selftests/namespaces: second listns() permission test
Test that users with CAP_SYS_ADMIN in a user namespace can see
all namespaces owned by that user namespace.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-49-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:22 +01:00
Christian Brauner 1f8ee4a1f9
selftests/namespaces: first listns() permission test
Test that unprivileged users can only see namespaces they're currently
in. Create a namespace, drop privileges, verify we can only see our own
namespaces.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-48-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:22 +01:00
Christian Brauner 674294a479
selftests/namespaces: ninth listns() test
Test error cases for listns().

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-47-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:21 +01:00
Christian Brauner b0de4c80fb
selftests/namespaces: eighth listns() test
Test that hierarchical active reference propagation keeps parent
user namespaces visible in listns().

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-46-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:21 +01:00
Christian Brauner 6aeca1dd49
selftests/namespaces: seventh listns() test
Test listns() with multiple namespace types filter.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-45-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:21 +01:00
Christian Brauner bc8da67e0e
selftests/namespaces: sixth listns() test
Test listns() with specific user namespace ID.
Create a user namespace and list namespaces it owns.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-44-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:21 +01:00
Christian Brauner 4080b9d946
selftests/namespaces: fifth listns() test
Test that listns() only returns active namespaces.
Create a namespace, let it become inactive, verify it's not listed.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-43-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:21 +01:00
Christian Brauner abac8de3e5
selftests/namespaces: fourth listns() test
Test listns() with LISTNS_CURRENT_USER.
List namespaces owned by current user namespace.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-42-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:21 +01:00
Christian Brauner 46909d1343
selftests/namespaces: third listns() test
Test listns() pagination.
List namespaces in batches.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-41-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:21 +01:00
Christian Brauner 6a68c7f919
selftests/namespaces: second listns() test
Test listns() with type filtering.
List only network namespaces.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-40-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:21 +01:00
Christian Brauner e2ff8d8864
selftests/namespaces: first listns() test
Test basic listns() functionality with the unified namespace tree.
List all active namespaces globally.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-39-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:20 +01:00
Christian Brauner 158c5c786e
selftests/namespaces: add listns() wrapper
Add a wrapper for the listns() system call.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-38-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:20 +01:00
Christian Brauner da3c02b70c
selftests/namespaces: fifteenth active reference count tests
Test different namespace types (net, uts, ipc) all contributing
active references to the same owning user namespace.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-37-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:20 +01:00
Christian Brauner a9d84bf7bf
selftests/namespaces: fourteenth active reference count tests
Test that user namespace as a child also propagates correctly.
Create user_A -> user_B, verify when user_B is active that user_A
is also active. This is different from non-user namespace children.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-36-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:20 +01:00
Christian Brauner 2a94bf7bb8
selftests/namespaces: thirteenth active reference count tests
Test that parent stays active as long as ANY child is active.
Create parent user namespace with two child net namespaces.
Parent should remain active until BOTH children are inactive.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-35-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:20 +01:00
Christian Brauner 04aee1a346
selftests/namespaces: twelfth active reference count tests
Test hierarchical propagation with deep namespace hierarchy.
Create: init_user_ns -> user_A -> user_B -> net_ns
When net_ns is active, both user_A and user_B should be active.
This verifies the conditional recursion in __ns_ref_active_put() works.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-34-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:20 +01:00
Christian Brauner 26d238ea6a
selftests/namespaces: eleventh active reference count tests
Test that different namespace types with same owner all contribute
active references to the owning user namespace.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-33-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:20 +01:00
Christian Brauner e7585a9ef5
selftests/namespaces: tenth active reference count tests
Test multiple children sharing same parent.
Parent should stay active as long as ANY child is active.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-32-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:20 +01:00
Christian Brauner a8ce47a1ac
selftests/namespaces: ninth active reference count tests
Test multi-level hierarchy (3+ levels deep).
Grandparent → Parent → Child
When child is active, both parent AND grandparent should be active.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-31-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:19 +01:00
Christian Brauner 94f8711080
selftests/namespaces: eighth active reference count tests
Test that bind mounts keep namespaces in the tree even when inactive

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-30-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:19 +01:00
Christian Brauner 4b971b07e4
selftests/namespaces: seventh active reference count tests
Test hierarchical active reference propagation.
When a child namespace is active, its owning user namespace should also
be active automatically due to hierarchical active reference propagation.
This ensures parents are always reachable when children are active.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-29-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:19 +01:00
Christian Brauner 47a5fd8ce1
selftests/namespaces: sixth active reference count tests
Test that an open file descriptor keeps a namespace active.
Even after the creating process exits, the namespace should remain
active as long as an fd is held open.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-28-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:19 +01:00
Christian Brauner c4803b255f
selftests/namespaces: fifth active reference count tests
Test PID namespace active ref tracking

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-27-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:19 +01:00
Christian Brauner 28655ff253
selftests/namespaces: fourth active reference count tests
Test user namespace active ref tracking via credential lifecycle.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-26-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:19 +01:00
Christian Brauner c6e25d930b
selftests/namespaces: third active reference count tests
Test that a namespace remains active while a process is using it,
even after the creating process exits.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-25-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:19 +01:00
Christian Brauner 721c7e41b1
selftests/namespaces: second active reference count tests
Test namespace lifecycle: create a namespace in a child process, get a
file handle while it's active, then try to reopen after the process
exits (namespace becomes inactive).

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-24-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:19 +01:00
Christian Brauner 6bdce845fd
selftests/namespaces: first active reference count tests
Test that initial namespaces can be reopened via file handle. Initial
namespaces should always have a ref count of one from boot.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-23-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:18 +01:00
Christian Brauner e2b6e5eadc
selftests/filesystems: remove CLONE_NEWPIDNS from setup_userns() helper
This is effectively unused and doesn't really serve any purpose after
having reviewed all of the tests that rely on it.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-22-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:18 +01:00
Christian Brauner 6fc9baa49d
nsfs: update tools header
Ensure all the new uapi bits are visible for the selftests.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-21-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:18 +01:00
Christian Brauner b36d4b6aa8
arch: hookup listns() system call
Add the listns() system call to all architectures.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-20-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:18 +01:00
Christian Brauner 76b6f5dfb3
nstree: add listns()
Add a new listns() system call that allows userspace to iterate through
namespaces in the system. This provides a programmatic interface to
discover and inspect namespaces, enhancing the existing namespace APIs.

Currently, there is no direct way for userspace to enumerate namespaces
in the system. Applications must resort to scanning /proc/<pid>/ns/
across all processes, which is:

1. Inefficient - requires iterating over all processes
2. Incomplete - misses inactive namespaces that aren't attached to any
   running process but are kept alive by file descriptors, bind mounts,
   or parent namespace references
3. Permission-heavy - requires access to /proc for many processes
4. Unordered - no ordering or ownership information is provided
5. Unfiltered - no filtering per namespace type; all namespaces must
   always be iterated and checked

The list goes on. The listns() system call solves these problems by
providing direct kernel-level enumeration of namespaces. It is similar
to listmount() but obviously tailored to namespaces.

/*
 * @req: Pointer to struct ns_id_req specifying search parameters
 * @ns_ids: User buffer to receive namespace IDs
 * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
 * @flags: Reserved for future use (must be 0)
 */
ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
               size_t nr_ns_ids, unsigned int flags);

Returns:
- On success: Number of namespace IDs written to ns_ids
- On error: Negative error code
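
There is no libc wrapper yet (the selftests add their own), so a raw
syscall(2) shim can be used. A minimal sketch; the __NR_listns value
below is a placeholder, take the real number from your kernel's
generated unistd headers:

#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_listns
#define __NR_listns 0 /* placeholder: use your kernel's syscall number */
#endif

static ssize_t listns(const struct ns_id_req *req, uint64_t *ns_ids,
                      size_t nr_ns_ids, unsigned int flags)
{
    return syscall(__NR_listns, req, ns_ids, nr_ns_ids, flags);
}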

/*
 * @size: Structure size
 * @ns_id: Starting point for iteration; use 0 for first call, then
 *         use the last returned ID for subsequent calls to paginate
 * @ns_type: Bitmask of namespace types to include (from enum ns_type):
 *           0: Return all namespace types
 *           MNT_NS: Mount namespaces
 *           NET_NS: Network namespaces
 *           USER_NS: User namespaces
 *           etc. Can be OR'd together
 * @user_ns_id: Filter results to namespaces owned by this user namespace:
 *              0: Return all namespaces (subject to permission checks)
 *              LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace
 *              Other value: Namespaces owned by the specified user namespace ID
 */
struct ns_id_req {
        __u32 size;         /* sizeof(struct ns_id_req) */
        __u32 spare;        /* Reserved, must be 0 */
        __u64 ns_id;        /* Last seen namespace ID (for pagination) */
        __u32 ns_type;      /* Filter by namespace type(s) */
        __u32 spare2;       /* Reserved, must be 0 */
        __u64 user_ns_id;   /* Filter by owning user namespace */
};

Example 1: List all namespaces

void list_all_namespaces(void)
{
    struct ns_id_req req = {
        .size = sizeof(req),
        .ns_id = 0,          /* Start from beginning */
        .ns_type = 0,        /* All types */
        .user_ns_id = 0,     /* All user namespaces */
    };
    uint64_t ids[100];
    ssize_t ret;

    printf("All namespaces in the system:\n");
    do {
        ret = listns(&req, ids, 100, 0);
        if (ret < 0) {
            perror("listns");
            break;
        }

        for (ssize_t i = 0; i < ret; i++)
            printf("  Namespace ID: %llu\n", (unsigned long long)ids[i]);

        /* Continue from last seen ID */
        if (ret > 0)
            req.ns_id = ids[ret - 1];
    } while (ret == 100);  /* Buffer was full, more may exist */
}

Example 2: List network namespaces only

void list_network_namespaces(void)
{
    struct ns_id_req req = {
        .size = sizeof(req),
        .ns_id = 0,
        .ns_type = NET_NS,   /* Only network namespaces */
        .user_ns_id = 0,
    };
    uint64_t ids[100];
    ssize_t ret;

    ret = listns(&req, ids, 100, 0);
    if (ret < 0) {
        perror("listns");
        return;
    }

    printf("Network namespaces: %zd found\n", ret);
    for (ssize_t i = 0; i < ret; i++)
        printf("  netns ID: %llu\n", (unsigned long long)ids[i]);
}

Example 3: List namespaces owned by current user namespace

void list_owned_namespaces(void)
{
    struct ns_id_req req = {
        .size = sizeof(req),
        .ns_id = 0,
        .ns_type = 0,                      /* All types */
        .user_ns_id = LISTNS_CURRENT_USER, /* Current userns */
    };
    uint64_t ids[100];
    ssize_t ret;

    ret = listns(&req, ids, 100, 0);
    if (ret < 0) {
        perror("listns");
        return;
    }

    printf("Namespaces owned by my user namespace: %zd\n", ret);
    for (ssize_t i = 0; i < ret; i++)
        printf("  ns ID: %llu\n", (unsigned long long)ids[i]);
}

Example 4: List multiple namespace types

void list_network_and_mount_namespaces(void)
{
    struct ns_id_req req = {
        .size = sizeof(req),
        .ns_id = 0,
        .ns_type = NET_NS | MNT_NS,  /* Network and mount */
        .user_ns_id = 0,
    };
    uint64_t ids[100];
    ssize_t ret;

    ret = listns(&req, ids, 100, 0);
    if (ret < 0) {
        perror("listns");
        return;
    }

    printf("Network and mount namespaces: %zd found\n", ret);
}

Example 5: Pagination through large namespace sets

void list_all_with_pagination(void)
{
    struct ns_id_req req = {
        .size = sizeof(req),
        .ns_id = 0,
        .ns_type = 0,
        .user_ns_id = 0,
    };
    uint64_t ids[50];
    size_t total = 0;
    ssize_t ret;

    printf("Enumerating all namespaces with pagination:\n");

    while (1) {
        ret = listns(&req, ids, 50, 0);
        if (ret < 0) {
            perror("listns");
            break;
        }
        if (ret == 0)
            break;  /* No more namespaces */

        total += ret;
        printf("  Batch: %zd namespaces\n", ret);

        /* Last ID in this batch becomes start of next batch */
        req.ns_id = ids[ret - 1];

        if (ret < 50)
            break;  /* Partial batch = end of results */
    }

    printf("Total: %zu namespaces\n", total);
}

Permission Model

listns() respects namespace isolation and capabilities:

(1) Global listing (user_ns_id = 0):
    - Requires CAP_SYS_ADMIN in the namespace's owning user namespace
    - OR the namespace must be in the caller's namespace context (e.g.,
      a namespace the caller is currently using)
    - User namespaces additionally allow listing if the caller has
      CAP_SYS_ADMIN in that user namespace itself
(2) Owner-filtered listing (user_ns_id != 0):
    - Requires CAP_SYS_ADMIN in the specified owner user namespace
    - OR the namespace must be in the caller's namespace context
    - This allows unprivileged processes to enumerate namespaces they own
(3) Visibility:
    - Only "active" namespaces are listed
    - A namespace is active if it has a non-zero __ns_ref_active count
    - This includes namespaces used by running processes, held by open
      file descriptors, or kept active by bind mounts
    - Inactive namespaces (kept alive only by internal kernel
      references) are not visible via listns()
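
As an illustration of case (2), filtering by an explicit owner ID - a
sketch reusing the wrapper and headers from the examples above; the
owner's namespace ID would come from a previous listns() call or a
namespace file handle:

void list_owned_by(uint64_t userns_id)
{
    struct ns_id_req req = {
        .size = sizeof(req),
        .ns_id = 0,
        .ns_type = 0,            /* all types */
        .user_ns_id = userns_id, /* explicit owner user namespace */
    };
    uint64_t ids[100];
    ssize_t ret;

    ret = listns(&req, ids, 100, 0);
    if (ret < 0)
        perror("listns"); /* e.g. EPERM without CAP_SYS_ADMIN there */
}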

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-19-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:18 +01:00
Christian Brauner 560e25e70f
nstree: add unified namespace list
Allow walking the unified namespace list completely locklessly.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-18-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:18 +01:00
Christian Brauner a202a50092
nstree: simplify rbtree comparison helpers
They all do the same basic thing.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-17-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:18 +01:00
Christian Brauner 3c1a52f2a6
nstree: maintain list of owned namespaces
The namespace tree doesn't express the ownership concept of namespaces
appropriately. Maintain a list of directly owned namespaces per user
namespace. This will allow userspace and the kernel to use the listns()
system call to walk the namespace tree by owning user namespace. The
rbtree is used to find the relevant namespace entry point, from which
iteration can continue, and the owner list can be used to walk the tree
completely lock-free.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-16-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:17 +01:00
Christian Brauner 3760342fd6
nstree: assign fixed ids to the initial namespaces
The initial set of namespaces comes with fixed inode numbers, making it
easy for userspace to identify them solely based on that information.
This has long preceded anything here.

Similarly, let's assign fixed namespace ids for the initial namespaces.

Kill the cookie and use a sequentially increasing number. This has the
nice side-effect that the owning user namespace will always have a
namespace id that is smaller than any of its descendant namespaces.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-15-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:17 +01:00
Christian Brauner 04173501a6
nstree: allow lookup solely based on inode
The namespace file handle struct nsfs_file_handle is uapi and userspace
is expressly allowed to generate file handles without going through
name_to_handle_at().

Allow userspace to generate a file handle where both the inode number
and the namespace type are zero and just pass in the unique namespace
id. The kernel uses the unified namespace tree to find the namespace and
open the file handle.

When the kernel creates a file handle via name_to_handle_at() it will
always fill in the type and the inode number allowing userspace to
retrieve core information.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-14-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:17 +01:00
Christian Brauner 2ccaebc686
nstree: introduce a unified tree
This will allow userspace to look up and stat a namespace simply by its
identifier without having to know what type of namespace it is.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-13-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:17 +01:00
Christian Brauner 8895d2a3db
ns: use anonymous struct to group list member
Make it easier to spot that they belong together conceptually.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-12-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:17 +01:00
Christian Brauner 3a18f80918
ns: add active reference count
The namespace tree is, among other things, currently used to support
file handles for namespaces. When a namespace is created it is placed on
the namespace trees and when it is destroyed it is removed from the
namespace trees.

While a namespace is on the namespace trees with a valid reference count
it is possible to reopen it through a namespace file handle. This is all
fine but has some issues that should be addressed.

On current kernels a namespace is visible to userspace in the
following cases:

(1) The namespace is in use by a task.
(2) The namespace is persisted through a VFS object (namespace file
    descriptor or bind-mount).
    Note that (2) only cares about direct persistence of the namespace
    itself not indirectly via e.g., file->f_cred file references or
    similar.
(3) The namespace is a hierarchical namespace type and is the parent of
    a single or multiple child namespaces.

Case (3) is interesting because it is possible that a parent namespace
might not fulfill either (1) or (2), i.e., it is invisible to userspace
but may still be resurrected through the NS_GET_PARENT ioctl().

Namespace file handles currently allow much broader access to
namespaces than what is possible via (1)-(3). The reason is that
namespaces may remain pinned for completely internal reasons yet are
inaccessible to userspace.

For example, a user namespace may remain pinned by get_cred() calls
that stash the opener's credentials into file->f_cred. As it stands,
file handles allow resurrecting such a user namespace even though this
should not be possible via (1)-(3). This is a fundamental uapi change
that we shouldn't make if we don't have to.

Consider the following insane case: Various architectures support the
CONFIG_MMU_LAZY_TLB_REFCOUNT option which uses lazy TLB destruction.
When this option is set a userspace task's struct mm_struct may be used
for kernel threads such as the idle task and will only be destroyed once
the cpu's runqueue switches back to another task. But because of ptrace()
permission checks struct mm_struct stashes the user namespace of the
task that struct mm_struct originally belonged to. The kernel thread
will take a reference on the struct mm_struct and thus pin it.

So on an idle system user namespaces can be persisted for arbitrary
amounts of time which also means that they can be resurrected using
namespace file handles. That makes no sense whatsoever. The problem is
of course exacerbated on large systems with a huge number of CPUs.

To handle this nicely we introduce an active reference count which
tracks (1)-(3). This is easy to do as all of these things are already
managed centrally. Only (1)-(3) will count towards the active reference
count and only namespaces which are active may be opened via namespace
file handles.

The problem is that namespaces may be resurrected, which means that
they can become temporarily inactive and will be reactivated some time
later. Currently the only example of this is the SIOCGSKNS socket
ioctl. The SIOCGSKNS ioctl allows opening a network namespace file
descriptor based on a socket file descriptor.

If a socket is tied to a network namespace that subsequently becomes
inactive but that socket is persisted by another process in another
network namespace (e.g., via SCM_RIGHTS or pidfd_getfd()) then the
SIOCGSKNS ioctl will resurrect this network namespace.

So calls to open_related_ns() and open_namespace() will end up
resurrecting the corresponding namespace tree.
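
From userspace the resurrection trigger is just the existing ioctl. A
minimal sketch (SIOCGSKNS comes from <linux/sockios.h>;
netns_fd_from_socket() is an illustrative helper):

#include <sys/ioctl.h>
#include <linux/sockios.h>

int netns_fd_from_socket(int sockfd)
{
    /* If sockfd's netns is inactive (e.g. the socket travelled here
     * via SCM_RIGHTS), a successful SIOCGSKNS resurrects it together
     * with its owning user namespace chain. */
    return ioctl(sockfd, SIOCGSKNS);
}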

Note that the active reference count does not regulate the lifetime of
the namespace itself. This is still done by the normal reference count.
The active reference count can only be elevated if the regular reference
count is elevated.

The active reference count also doesn't regulate the presence of a
namespace on the namespace trees. It only regulates its visibility to
namespace file handles (and in later patches to listns()).

A namespace remains on the namespace trees from creation until its
actual destruction. This will allow the kernel to always reach any
namespace trivially and it will also enable subsystems like bpf to walk
the namespace lists on the system for tracing or general introspection
purposes.

Note that different namespaces have different visibility lifetimes on
current kernels. While most namespaces are immediately released when the
last task using them exits, the user- and pid namespace are persisted
and thus both remain accessible via /proc/<pid>/ns/<ns_type>.

The user namespace lifetime is aligned with struct cred and is only
released through exit_creds(). However, it becomes inaccessible to
userspace once the last task using it is reaped, i.e., when
release_task() is called and all proc entries are flushed. Similarly,
the pid namespace is also visible until the last task using it has been
reaped and the associated pid numbers are freed.

The active reference counts of the user- and pid namespace are
decremented once the task is reaped.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-11-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:17 +01:00
Christian Brauner 4b06b70c82
ns: rename to exit_nsproxy_namespaces()
The current naming is very misleading as this really isn't exiting all
of the task's namespaces. It is only exiting the namespaces that hang
off of nsproxy. Reflect that in the name.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-10-2e6f823ebdc0@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:17 +01:00
Christian Brauner 6b053576ed
ns: add __ns_ref_read()
Implement ns_ref_read() the same way as ns_ref_{get,put}().
No point in making that any more special or different from the other
helpers.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-9-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:16 +01:00
Christian Brauner 3dd50c5866
ns: initialize ns_list_node for initial namespaces
Make sure that the list is always initialized for initial namespaces.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-8-2e6f823ebdc0@kernel.org
Fixes: 885fc8ac0a ("nstree: make iterator generic")
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:16 +01:00
Christian Brauner 0b1765830c
ns: use NS_COMMON_INIT() for all namespaces
Now that we have a common initializer use it for all static namespaces.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:16 +01:00
Christian Brauner d915fe20e5
ns: add NS_COMMON_INIT()
Add an initializer that can be used for the ns common initialization of
static namespaces such as most init namespaces.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/87ecqhy2y5.ffs@tglx
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:41:16 +01:00
Christian Brauner 8627bc8c7d
ns: add missing authorship
I authored the files a short while ago.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03 17:39:20 +01:00
Mateusz Guzik 20052f2ef0
fs: touch up predicts in putname()
1. We already expect the refcount to be 1.
2. Path creation predicts name == iname.

I verified this straightens out the asm, no functional changes.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251029134952.658450-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 13:18:08 +01:00
Christian Brauner a77a59592f
Merge patch series "Add and use folio_next_pos()"
Matthew Wilcox (Oracle) <willy@infradead.org> says:

It's relatively common in filesystems to want to know the end of the
current folio we're looking at.  So common in fact that btrfs has its own
helper for that.  Lift that helper to filemap and use it everywhere that
I've noticed it could be used.  This actually fixes a long-standing bug
in ocfs2 on 32-bit systems with files larger than 2GiB.  Presumably this
is not a common configuration, but I've marked it for backport anyway.

The other filesystems are all fine; none of them have a bug, they're
just mildly inefficient.  I think this should all go in via Christian's
tree, ideally with acks from the various fs maintainers (cc'd on their
individual patches).

* patches from https://patch.msgid.link/20251024170822.1427218-1-willy@infradead.org:
  mm: Use folio_next_pos()
  xfs: Use folio_next_pos()
  netfs: Use folio_next_pos()
  iomap: Use folio_next_pos()
  gfs2: Use folio_next_pos()
  f2fs: Use folio_next_pos()
  ext4: Use folio_next_pos()
  buffer: Use folio_next_pos()
  btrfs: Use folio_next_pos()
  filemap: Add folio_next_pos()

Link: https://patch.msgid.link/20251024170822.1427218-1-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 13:11:44 +01:00
Matthew Wilcox (Oracle) 60a70e6143
mm: Use folio_next_pos()
This is one instruction more efficient than open-coding folio_pos() +
folio_size().  It's the equivalent of (x + y) << z rather than
x << z + y << z.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-11-willy@infradead.org
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 13:11:38 +01:00
Matthew Wilcox (Oracle) ac0a11113d
xfs: Use folio_next_pos()
This is one instruction more efficient than open-coding folio_pos() +
folio_size().  It's the equivalent of (x + y) << z rather than
x << z + y << z.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-10-willy@infradead.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 13:11:38 +01:00
Matthew Wilcox (Oracle) 2408900d40
netfs: Use folio_next_pos()
This is one instruction more efficient than open-coding folio_pos() +
folio_size().  It's the equivalent of (x + y) << z rather than
x << z + y << z.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-9-willy@infradead.org
Acked-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Paulo Alcantara <pc@manguebit.org>
Cc: netfs@lists.linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 13:11:38 +01:00
Matthew Wilcox (Oracle) ac97520804
iomap: Use folio_next_pos()
This is one instruction more efficient than open-coding folio_pos() +
folio_size().  It's the equivalent of (x + y) << z rather than
x << z + y << z.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-8-willy@infradead.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 13:11:38 +01:00
Matthew Wilcox (Oracle) 5f0fc78532
gfs2: Use folio_next_pos()
This is one instruction more efficient than open-coding folio_pos() +
folio_size().  It's the equivalent of (x + y) << z rather than
x << z + y << z.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-7-willy@infradead.org
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: gfs2@lists.linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 13:11:38 +01:00
Matthew Wilcox (Oracle) 4fcafa30b7
f2fs: Use folio_next_pos()
This is one instruction more efficient than open-coding folio_pos() +
folio_size().  It's the equivalent of (x + y) << z rather than
x << z + y << z.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-6-willy@infradead.org
Reviewed-by: Chao Yu <chao@kernel.org>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 13:11:38 +01:00
Matthew Wilcox (Oracle) 4db47b2521
ext4: Use folio_next_pos()
This is one instruction more efficient than open-coding folio_pos() +
folio_size().  It's the equivalent of (x + y) << z rather than
x << z + y << z.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-5-willy@infradead.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 13:11:37 +01:00
Matthew Wilcox (Oracle) 6870892b64
buffer: Use folio_next_pos()
This is one instruction more efficient than open-coding folio_pos() +
folio_size().  It's the equivalent of (x + y) << z rather than
x << z + y << z.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-4-willy@infradead.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 13:11:37 +01:00
Matthew Wilcox (Oracle) 48f3784b17
btrfs: Use folio_next_pos()
btrfs defined its own variant of folio_next_pos() called folio_end().
This is an ambiguous name as 'end' might be exclusive or inclusive.
Switch to the new folio_next_pos().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-3-willy@infradead.org
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 13:11:37 +01:00
Matthew Wilcox (Oracle) 4511fd86db
filemap: Add folio_next_pos()
Replace the open-coded implementation in ocfs2 (which loses the top
32 bits on 32-bit architectures) with a helper in pagemap.h.
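
A plausible shape for the helper - a sketch, not necessarily the exact
patch; the point is the cast before the shift:

/* Position of the first byte after this folio. */
static inline loff_t folio_next_pos(struct folio *folio)
{
	/* Casting to loff_t before shifting avoids the 32-bit
	 * truncation the open-coded ocfs2 version suffered from. */
	return ((loff_t)folio->index + folio_nr_pages(folio)) << PAGE_SHIFT;
}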

Fixes: 35edec1d52 ("ocfs2: update truncate handling of partial clusters")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-2-willy@infradead.org
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: ocfs2-devel@lists.linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 13:11:37 +01:00
Christian Brauner 36a304de26
nstree: simplify return
node_to_ns() checks for NULL, and the assert isn't really helpful and
will have to be dropped later anyway.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-7-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 10:16:24 +01:00
Christian Brauner 768b1565d9
cgroup: add cgroup namespace to tree after owner is set
Otherwise we trip VFS_WARN_ON_ONCE() in __ns_tree_add_raw().

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-6-2e6f823ebdc0@kernel.org
Fixes: 7c60593985 ("cgroup: support ns lookup")
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 10:16:24 +01:00
Christian Brauner 4af033dad6
nsfs: raise SB_I_NODEV and SB_I_NOEXEC
There's zero need for nsfs to allow device nodes or execution.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-5-2e6f823ebdc0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 10:16:24 +01:00
Christian Brauner b21cba8d87
pidfs: raise DCACHE_DONTCACHE explicitly
While pidfs dentries are never hashed and thus retain_dentry() will never
consider them for placing them on the LRU it isn't great to always have
to go and remember that. Raise DCACHE_DONTCACHE explicitly as a visual
marker that dentries aren't kept but freed immediately instead.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-4-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 10:16:24 +01:00
Christian Brauner 6dbe134e4b
nsfs: raise DCACHE_DONTCACHE explicitly
While nsfs dentries are never hashed, and thus retain_dentry() will
never consider placing them on the LRU, it isn't great to always have
to go and remember that. Raise DCACHE_DONTCACHE explicitly as a visual
marker that dentries aren't kept but freed immediately instead.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-3-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 10:16:23 +01:00
Christian Brauner 1e9a9be249
nsfs: use inode_just_drop()
Currently nsfs uses the default inode_generic_drop() fallback which
drops the inode when it's unlinked or when it's unhashed. Since nsfs
never hashes inodes, that always amounts to dropping the inode.

But that's just annoying to have to reason through every time we look at
this code. Switch to inode_just_drop() which always drops the inode
explicitly. This also aligns the behavior with pidfs which does the
same.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-2-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 10:16:23 +01:00
Christian Brauner c9822fad80
libfs: allow to specify s_d_flags
Make it possible for pseudo filesystems to specify default dentry flags.

Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-1-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 10:16:23 +01:00
Eric Biggers 0bbb838f38
ecryptfs: Use MD5 library instead of crypto_shash
eCryptfs uses MD5 for a couple unusual purposes: to "mix" the key into
the IVs for file contents encryption (similar to ESSIV), and to prepend
some key-dependent bytes to the plaintext when encrypting filenames
(which is useless since eCryptfs encrypts the filenames with ECB).

Currently, eCryptfs computes these MD5 hashes using the crypto_shash
API.  Update it to instead use the MD5 library API.  This is simpler and
faster: the library doesn't require memory allocations, can't fail, and
provides direct access to MD5 without overhead such as indirect calls.
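
For comparison, the one-shot library call is roughly the following - a
sketch assuming lib/crypto's md5() mirrors the existing sha256()
one-shot; check <crypto/md5.h> for the exact signature:

#include <crypto/md5.h>

u8 digest[MD5_DIGEST_SIZE];

/* No allocation, no error path, no indirect call - unlike the old
 * crypto_alloc_shash("md5") + crypto_shash_digest() sequence. */
md5(src, src_len, digest);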

To preserve the existing behavior of eCryptfs support being disabled
when the kernel is booted with "fips=1", make ecryptfs_get_tree() check
fips_enabled itself.  Previously it relied on crypto_alloc_shash("md5")
failing.  I don't know for sure that this is actually needed; e.g., it
could be argued that eCryptfs's use of MD5 isn't for a security purpose
as far as FIPS is concerned.  But this preserves the existing behavior.

Tested by verifying that an existing eCryptfs can still be mounted with
a kernel that has this commit, with all the files matching.  Also tested
creating a filesystem with this commit and mounting+reading it without.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20251011200010.193140-1-ebiggers@kernel.org
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 10:12:35 +01:00
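For flavor, a sketch of the library-style call, assuming the one-shot
md5() helper from <crypto/md5.h> (example_derive_iv is a made-up name,
not the actual eCryptfs code):

  #include <crypto/md5.h>

  static void example_derive_iv(const u8 *key, size_t keylen,
                                u8 iv[MD5_DIGEST_SIZE])
  {
          /* no allocation and no error path, unlike crypto_shash */
          md5(key, keylen, iv);
  }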
Pankaj Raghav 10436adf9d
iomap: use largest_zero_folio() in iomap_dio_zero()
iomap_dio_zero() uses custom allocated memory of zeroes for padding.
This was a temporary solution until there was a way to request a zero
folio larger than PAGE_SIZE.

Use the largest_zero_folio() function instead of the custom allocated
memory of zeroes. There is no guarantee that largest_zero_folio() will
always return a PMD-sized folio, so adapt the code to also work if
largest_zero_folio() returns a ZERO_PAGE.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 10:12:35 +01:00
Thorsten Blum b2c43efc3c
initrd: Replace simple_strtol with kstrtoint to improve ramdisk_start_setup
Replace simple_strtol() with the recommended kstrtoint() for parsing the
'ramdisk_start=' boot parameter. Unlike simple_strtol(), which returns a
long, kstrtoint() converts the string directly to an integer and avoids
implicit casting.

Check the return value of kstrtoint() and reject invalid values. This
adds error handling while preserving existing behavior for valid values,
and removes use of the deprecated simple_strtol() helper.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31 10:12:32 +01:00
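A sketch of the resulting pattern (the exact rejection policy is an
assumption, not necessarily what the patch does):

  static int __init ramdisk_start_setup(char *str)
  {
          int start;

          /* kstrtoint() rejects trailing garbage and overflow outright */
          if (kstrtoint(str, 0, &start) || start < 0)
                  return 0;
          rd_image_start = start;
          return 1;
  }
  __setup("ramdisk_start=", ramdisk_start_setup);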
Nathan Chancellor 5ff8ad3909 kbuild: Add '-fms-extensions' to areas with dedicated CFLAGS
This is a follow up to commit c4781dc3d1 ("Kbuild: enable
-fms-extensions") but in a separate change due to being substantially
different from the initial submission.

There are many places within the kernel that use their own CFLAGS
instead of the main KBUILD_CFLAGS, meaning code written with the main
kernel's use of '-fms-extensions' in mind that may be tangentially
included in these areas will result in "error: declaration does not
declare anything" messages from the compiler.

Add '-fms-extensions' to all these areas to ensure consistency, along
with -Wno-microsoft-anon-tag to silence clang's warning about use of the
extension that the kernel cares about using. parisc does not build with
clang so it does not need this warning flag. LoongArch does not need it
either because -W flags from KBUILD_CFLAGS are pulled into cflags-vdso.

Reported-by: Christian Brauner <brauner@kernel.org>
Closes: https://lore.kernel.org/20251030-meerjungfrau-getrocknet-7b46eacc215d@brauner/
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
2025-10-30 21:26:28 -04:00
Christian Brauner 036375522b pidfs: expose coredump signal
Userspace needs access to the signal that caused the coredump before the
coredumping process has been reaped. Expose it as part of the coredump
information in struct pidfd_info. After the process has been reaped that
info is also available as part of PIDFD_INFO_EXIT's exit_code field.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-8-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-30 14:25:14 +01:00
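From userspace this is reachable via the PIDFD_GET_INFO ioctl; a sketch,
where the coredump_signal field name is an assumption based on the
description above:

  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/pidfd.h>

  static void report_coredump_signal(int pidfd)
  {
          struct pidfd_info info = { .mask = PIDFD_INFO_COREDUMP };

          if (ioctl(pidfd, PIDFD_GET_INFO, &info) < 0)
                  return;
          if ((info.mask & PIDFD_INFO_COREDUMP) &&
              (info.coredump_mask & PIDFD_COREDUMPED))
                  printf("dumped via signal %d\n", info.coredump_signal);
  }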
Christian Brauner 90df6ff685 pidfs: drop struct pidfs_exit_info
This is not needed anymore now that we have the new scheme to guarantee
all-or-nothing information exposure.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-7-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-30 14:25:14 +01:00
Christian Brauner ad6e3ea683 pidfs: prepare to drop exit_info pointer
There will likely be more info that we need to store in struct
pidfs_attr. We need to make sure that some of the information such as
exit info or coredump info that consists of multiple bits is either
available completely or not at all, but never partially. Currently we
use a pointer that we assign to. That doesn't scale. We can't waste a
pointer for each multi-part information struct we want to expose. Use a
bitmask instead.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-6-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-30 14:25:14 +01:00
Christian Brauner dfd78546c9 pidfd: add a new supported_mask field
Some of the future fields in struct pidfd_info can be optional. If the
kernel has nothing to emit in that field, then it doesn't set the flag
in the reply. This presents a problem: There is currently no way to know
what mask flags the kernel supports since one can't always count on them
being in the reply.

Add a new PIDFD_INFO_SUPPORTED_MASK flag and field that the kernel can
set in the reply. Userspace can use this to determine if the fields it
requires from the kernel are supported. This also gives us a way to
deprecate fields in the future, if that should become necessary.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-5-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-30 14:25:13 +01:00
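A sketch of the intended userspace probe, using the flag and field named
in this commit:

  #include <stdbool.h>
  #include <sys/ioctl.h>
  #include <linux/pidfd.h>

  static bool kernel_supports_coredump_info(int pidfd)
  {
          struct pidfd_info info = { .mask = PIDFD_INFO_SUPPORTED_MASK };

          if (ioctl(pidfd, PIDFD_GET_INFO, &info) < 0 ||
              !(info.mask & PIDFD_INFO_SUPPORTED_MASK))
                  return false;   /* kernel predates supported_mask */
          return info.supported_mask & PIDFD_INFO_COREDUMP;
  }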
Christian Brauner d8fc51d8fa pidfs: add missing BUILD_BUG_ON() assert on struct pidfd_info
Validate that the size of struct pidfd_info is correctly updated.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-4-ca449b7b7aa0@kernel.org
Fixes: 1d8db6fd69 ("pidfs, coredump: add PIDFD_INFO_COREDUMP")
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-30 14:25:13 +01:00
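The idiom, sketched against the VER1 constant from the commit below; the
actual assert may target the newest versioned size:

  BUILD_BUG_ON(sizeof(struct pidfd_info) != PIDFD_INFO_SIZE_VER1);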
Christian Brauner 4061c43a99 pidfs: add missing PIDFD_INFO_SIZE_VER1
We grew struct pidfd_info not too long ago.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-3-ca449b7b7aa0@kernel.org
Fixes: 1d8db6fd69 ("pidfs, coredump: add PIDFD_INFO_COREDUMP")
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-30 14:25:13 +01:00
Christian Brauner fe0e6ce3fd pidfs: fix PIDFD_INFO_COREDUMP handling
When PIDFD_INFO_COREDUMP is requested we raise it unconditionally in the
returned mask even if no coredump actually took place. This was done
because we assumed that the later check of whether ->coredump_mask is
non-zero would detect the zero case and then retrieve the dumpability
settings from the task's mm. That has issues because there are tasks
that might not have any mm, and it's just not very cleanly implemented.
Fix this.

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-2-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-30 14:25:13 +01:00
Christian Brauner ccb3851ce7 pidfs: use guard() for task_lock
Use a guard().

Link: https://patch.msgid.link/20251028-work-coredump-signal-v1-1-ca449b7b7aa0@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-30 14:25:13 +01:00
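A sketch of the pattern, assuming a task_lock guard class declared via
DEFINE_GUARD() (example_get_mm is a made-up name):

  static struct mm_struct *example_get_mm(struct task_struct *task)
  {
          /* task_unlock() runs automatically when the scope is left,
           * including on early return */
          guard(task_lock)(task);
          return task->mm;
  }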
Rasmus Villemoes c4781dc3d1
Kbuild: enable -fms-extensions
Once in a while, it turns out that enabling -fms-extensions could
allow some slightly prettier code. But every time it has come up, the
code that had to be used instead has been deemed "not too awful" and
not worth introducing another compiler flag for.

That's probably true for each individual case, but then it's somewhat
of a chicken/egg situation.

If we just "bite the bullet" as Linus says and enable it once and for
all, it is available whenever a use case turns up, and no individual
case has to justify it.

A lore.kernel.org search provides these examples:

- https://lore.kernel.org/lkml/200706301813.58435.agruen@suse.de/
- https://lore.kernel.org/lkml/20180419152817.GD25406@bombadil.infradead.org/
- https://lore.kernel.org/lkml/170622208395.21664.2510213291504081000@noble.neil.brown.name/
- https://lore.kernel.org/lkml/87h6475w9q.fsf@prevas.dk/
- https://lore.kernel.org/lkml/CAHk-=wjeZwww6Zswn6F_iZTpUihTSNKYppLqj36iQDDhfntuEw@mail.gmail.com/

Undoubtedly, there are more places in the code where this could also
be used but where -fms-extensions just didn't come up in any
discussion.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://patch.msgid.link/20251020142228.1819871-2-linux@rasmusvillemoes.dk
[nathan: Move disabled clang warning to scripts/Makefile.extrawarn and
         adjust comment]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
2025-10-29 16:23:47 -07:00
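What the flag buys, in a nutshell: a tagged struct can be embedded
anonymously and its members accessed directly (a generic sketch, not
kernel code):

  struct point { int x, y; };

  struct pixel {
          struct point;   /* anonymous member by tag: -fms-extensions */
          unsigned int rgb;
  };

  static void pixel_origin(struct pixel *p)
  {
          p->x = 0;       /* fields of the unnamed point member */
          p->y = 0;
  }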
Nathan Chancellor a6773e6932
jfs: Rename _inline to avoid conflict with clang's '-fms-extensions'
Building fs/jfs with clang and '-fms-extensions' errors with:

  In file included from fs/jfs/jfs_unicode.c:8:
  fs/jfs/jfs_incore.h:86:13: error: type name does not allow function specifier to be specified
     86 |                                         unchar _inline[128];
        |                                                ^
  fs/jfs/jfs_incore.h:86:20: error: expected member name or ';' after declaration specifiers
     86 |                                         unchar _inline[128];
        |                                         ~~~~~~~~~~~~~~^

'-fms-extensions' in clang enables several other Microsoft specific
keywords such as _inline [1], presumably for compatibility with MSVC, as
Microsoft's documentation [2] mentions:

  For compatibility with previous versions, _inline and _forceinline are
  synonyms for __inline and __forceinline, respectively

Rename the _inline array in 'struct jfs_inode_info' to _inline_sym to
avoid this conflict, which is not a large workaround as this member is
only ever referred to via the i_inline macro.

Link: 249883d0c5/clang/include/clang/Basic/TokenKinds.def (L744-L79) [1]
Link: https://learn.microsoft.com/en-us/cpp/c-language/inline-functions [2]
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Link: https://patch.msgid.link/20251023-jfs-fix-conflict-with-clang-ms-ext-v1-1-e219d59a1e68@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
2025-10-29 16:22:21 -07:00
Julian Sun 4952f35f05
fs: Make wbc_to_tag() inline and use it in fs.
The logic in wbc_to_tag() is widely open-coded in file systems, so make
this function inline and use it in the file systems.

This patch has only passed compilation tests, but it should be fine.

Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 23:33:48 +01:00
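A sketch of the logic the helper centralizes, matching the long-standing
pattern in write_cache_pages() (the exact signature is an assumption):

  static inline xa_mark_t wbc_to_tag(struct writeback_control *wbc)
  {
          if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
                  return PAGECACHE_TAG_TOWRITE;
          return PAGECACHE_TAG_DIRTY;
  }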
Christian Brauner 891bea757c
Merge patch series "allow file systems to increase the minimum writeback chunk size v2"
Christoph Hellwig <hch@lst.de> says:

The relatively low minimal writeback size of 4MiB means that
written back inodes on rotational media are switched a lot.  Besides
introducing additional seeks, this also can lead to extreme file
fragmentation on zoned devices when a lot of files are cached relative
to the available writeback bandwidth.

Add a superblock field that allows the file system to override the
default size, and set it to the zone size for zoned XFS.

* patches from https://patch.msgid.link/20251017034611.651385-1-hch@lst.de:
  xfs: set s_min_writeback_pages for zoned file systems
  writeback: allow the file system to override MIN_WRITEBACK_PAGES
  writeback: cleanup writeback_chunk_size

Link: https://patch.msgid.link/20251017034611.651385-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:54:36 +01:00
Christoph Hellwig 015a544077
xfs: set s_min_writeback_pages for zoned file systems
Set s_min_writeback_pages to the zone size, so that writeback always
writes up to a full zone.  This ensures that writeback does not add
spurious file fragmentation when writing back a large number of
files that are larger than the zone size.

Fixes: 4e4d520755 ("xfs: add the zoned space allocator")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251017034611.651385-4-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:54:31 +01:00
Christoph Hellwig 90db4d4441
writeback: allow the file system to override MIN_WRITEBACK_PAGES
The relatively low minimal writeback size of 4MiB means that written back
inodes on rotational media are switched a lot.  Besides introducing
additional seeks, this also can lead to extreme file fragmentation on
zoned devices when a lot of files are cached relative to the available
writeback bandwidth.

Add a superblock field that allows the file system to override the
default size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251017034611.651385-3-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:54:31 +01:00
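A sketch of how the override could be consumed, assuming a value of 0
means no override (min_writeback_pages is a made-up helper name):

  static unsigned long min_writeback_pages(const struct super_block *sb)
  {
          unsigned long floor = MIN_WRITEBACK_PAGES;

          if (sb && sb->s_min_writeback_pages > floor)
                  floor = sb->s_min_writeback_pages;
          return floor;
  }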
Christoph Hellwig 151d0922bf
writeback: cleanup writeback_chunk_size
Return the pages directly when calculated instead of first assigning
them back to a variable, and directly return for the data integrity /
tagged case instead of going through an else clause.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251017034611.651385-2-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:54:31 +01:00
Christian Brauner 211c43d093
Merge patch series "filemap_* writeback interface cleanups v2"
Christoph Hellwig <hch@lst.de> says:

While looking at the filemap writeback code, I think adding
filemap_fdatawrite_wbc ended up being a mistake, as all but the original
btrfs caller should be using better high level interfaces instead.  This
series removes all these, switches btrfs to a more specific interfaces
and also cleans up another too low-level interface.  With this the
writeback_control that is passed to the writeback code is only
initialized in three places, although there are a lot more places in
file system code that never reach the common writeback code.

* patches from https://patch.msgid.link/20251024080431.324236-1-hch@lst.de:
  mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
  mm: remove __filemap_fdatawrite_range
  mm: remove filemap_fdatawrite_wbc
  mm: remove __filemap_fdatawrite
  mm,btrfs: add a filemap_flush_nr helper
  btrfs: push struct writeback_control into start_delalloc_inodes
  btrfs: use the local tmp_inode variable in start_delalloc_inodes
  ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
  9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
  mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode

Link: https://patch.msgid.link/20251024080431.324236-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:50:48 +01:00
Christoph Hellwig c28d67b33c
mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
Rename filemap_fdatawrite_range_kick to filemap_flush_range because it
is the ranged version of filemap_flush.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-11-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:50:42 +01:00
Christoph Hellwig 45cbce5b88
mm: remove __filemap_fdatawrite_range
Use filemap_fdatawrite_range and filemap_fdatawrite_range_kick instead
of the low-level __filemap_fdatawrite_range that requires the caller
to know the internals of the writeback_control structure and remove
__filemap_fdatawrite_range now that it is trivial and only two callers
would be left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-10-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:50:42 +01:00
Christoph Hellwig 1bcb413d0c
mm: remove filemap_fdatawrite_wbc
Replace filemap_fdatawrite_wbc, which exposes a writeback_control to the
callers with a filemap_writeback helper that takes all the possible
arguments and declares the writeback_control itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-9-hch@lst.de
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:50:41 +01:00
Christoph Hellwig 7359651448
mm: remove __filemap_fdatawrite
And rewrite filemap_fdatawrite to use filemap_fdatawrite_range instead
to have a simpler call chain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-8-hch@lst.de
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:50:41 +01:00
Christoph Hellwig 7fabcb7fba
mm,btrfs: add a filemap_flush_nr helper
Abstract out the btrfs-specific behavior of kicking off I/O on a number
of pages on an address_space into a well-defined helper.

Note: there is no kerneldoc comment for the new function because it is
not part of the public API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-7-hch@lst.de
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:50:41 +01:00
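A sketch of the helper's likely shape (WB_SYNC_NONE writeback capped at
nr pages); the body is an assumption modeled on the old
filemap_fdatawrite_wbc:

  int filemap_flush_nr(struct address_space *mapping, long nr)
  {
          struct writeback_control wbc = {
                  .sync_mode   = WB_SYNC_NONE,
                  .nr_to_write = nr,
                  .range_start = 0,
                  .range_end   = LLONG_MAX,
          };

          if (!mapping_can_writeback(mapping))
                  return 0;
          return do_writepages(mapping, &wbc);
  }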
Christoph Hellwig c9501112e3
btrfs: push struct writeback_control into start_delalloc_inodes
In preparation for changing the filemap_fdatawrite_wbc API to not expose
the writeback_control to the callers, push the wbc declaration next to
the filemap_fdatawrite_wbc call and just pass the nr_to_write value to
start_delalloc_inodes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-6-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:50:41 +01:00
Christoph Hellwig 41e52c6447
btrfs: use the local tmp_inode variable in start_delalloc_inodes
start_delalloc_inodes has a struct inode * pointer available in the
main loop, use it instead of re-calculating it from the btrfs inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-5-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:50:41 +01:00
Christoph Hellwig 890f141da0
ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
Use filemap_fdatawrite_range instead of opencoding the logic using
filemap_fdatawrite_wbc.  There is a slight change in the conversion
as nr_to_write is now set to LONG_MAX instead of double the number
of the pages in the range.  LONG_MAX is the usual nr_to_write for
WB_SYNC_ALL writeback, and the value expected by lower layers here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-4-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:50:41 +01:00
Christoph Hellwig 3c2e5cee5e
9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
Use filemap_fdatawrite_range instead of opencoding the logic using
filemap_fdatawrite_wbc.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-3-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:50:41 +01:00
Christoph Hellwig a21134b5d6
mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
Use filemap_fdatawrite_range instead of opencoding the logic using
filemap_fdatawrite_wbc.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-2-hch@lst.de
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29 15:50:40 +01:00
Julian Sun d6e6215907
writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
When a writeback work lasts for sysctl_hung_task_timeout_secs, we want
to identify that there are tasks waiting for a long time; this helps us
pinpoint potential issues.

Additionally, recording the starting jiffies is useful when debugging a
crashed vmcore.

Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:39 +02:00
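A fragment sketching the check; the field names (work->start_jiffies in
particular) are assumptions based on the description above:

  unsigned long timeout = READ_ONCE(sysctl_hung_task_timeout_secs);

  if (timeout && time_after(jiffies, work->start_jiffies + timeout * HZ))
          pr_warn("writeback on %s running for more than %lu seconds\n",
                  bdi_dev_name(wb->bdi), timeout);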
Julian Sun 1888635532
writeback: Wake up waiting tasks when finishing the writeback of a chunk.
Writing back a large number of pages can take a lot of time.
This issue is exacerbated when the underlying device is slow or
subject to block layer rate limiting, which in turn triggers
unexpected hung task warnings.

We can trigger a wake-up once a chunk has been written back and the
waiting time for writeback exceeds half of
sysctl_hung_task_timeout_secs.
This action allows the hung task detector to be aware of the writeback
progress, thereby eliminating these unexpected hung task warnings.

This patch has passed the xfstests 'check -g quick' test based on ext4,
with no additional failures introduced.

Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:39 +02:00
Christian Brauner 11f2af2a80
Merge patch series "hide ->i_state behind accessors"
Mateusz Guzik <mjguzik@gmail.com> says:

Open-coded accesses prevent asserting they are done correctly. One
obvious aspect is locking, but significantly more can be checked. For
example it can be detected when the code is clearing flags which are
already missing, or is setting flags when it is illegal (e.g., I_FREEING
when ->i_count > 0).

In order to keep things manageable this patchset merely gets the thing
off the ground with only lockdep checks baked in.

Current consumers can be trivially converted.

Suppose flags I_A and I_B are to be handled.

If ->i_lock is held, then:

state = inode->i_state          => state = inode_state_read(inode)
inode->i_state |= (I_A | I_B)   => inode_state_set(inode, I_A | I_B)
inode->i_state &= ~(I_A | I_B)  => inode_state_clear(inode, I_A | I_B)
inode->i_state = I_A | I_B      => inode_state_assign(inode, I_A | I_B)

If ->i_lock is not held or only held conditionally:

state = inode->i_state          => state = inode_state_read_once(inode)
inode->i_state |= (I_A | I_B)   => inode_state_set_raw(inode, I_A | I_B)
inode->i_state &= ~(I_A | I_B)  => inode_state_clear_raw(inode, I_A | I_B)
inode->i_state = I_A | I_B      => inode_state_assign_raw(inode, I_A | I_B)

The "_once" vs "_raw" discrepancy stems from the read variant differing
by READ_ONCE as opposed to just lockdep checks.

Finally, if you want to atomically clear flags and set new ones, the
following:

state = inode->i_state;
state &= ~I_A;
state |= I_B;
inode->i_state = state;

turns into:

inode_state_replace(inode, I_A, I_B);

* patches from https://lore.kernel.org/20251009075929.1203950-1-mjguzik@gmail.com:
  fs: make plain ->i_state access fail to compile
  xfs: use the new ->i_state accessors
  nilfs2: use the new ->i_state accessors
  overlayfs: use the new ->i_state accessors
  gfs2: use the new ->i_state accessors
  f2fs: use the new ->i_state accessors
  smb: use the new ->i_state accessors
  ceph: use the new ->i_state accessors
  btrfs: use the new ->i_state accessors
  Manual conversion to use ->i_state accessors of all places not covered by coccinelle
  Coccinelle-based conversion to use ->i_state accessors
  fs: provide accessors for ->i_state
  fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
  fs: move wait_on_inode() from writeback.h to fs.h

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:28 +02:00
Mateusz Guzik 2ed81b4bef
fs: make plain ->i_state access fail to compile
... to make sure all accesses are properly validated.

Merely renaming the var to __i_state still lets the compiler make the
following suggestion:
error: 'struct inode' has no member named 'i_state'; did you mean '__i_state'?

Unfortunately some people will add the __'s and call it a day.

In order to make it harder to mess up in this way, hide it behind a
struct. The resulting error message should be convincing in terms of
checking what to do:
error: invalid operands to binary & (have 'struct inode_state_flags' and 'int')

Of course people determined to do a plain access can still do it, but
nothing can be done for that case.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:28 +02:00
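A condensed sketch of the trick (struct inode is abridged; the wrapper
name matches the error message quoted above):

  struct inode_state_flags {
          unsigned long __i_state;
  };

  /* inside struct inode, replacing the plain unsigned long field: */
  struct inode_state_flags i_state;

  static inline unsigned long inode_state_read(struct inode *inode)
  {
          lockdep_assert_held(&inode->i_lock);
          return inode->i_state.__i_state;
  }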
Mateusz Guzik 18c61399f6
xfs: use the new ->i_state accessors
Change generated with coccinelle and fixed up by hand as appropriate.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:27 +02:00
Mateusz Guzik a18d43041b
nilfs2: use the new ->i_state accessors
Change generated with coccinelle and fixed up by hand as appropriate.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:27 +02:00
Mateusz Guzik ff175a4fc2
overlayfs: use the new ->i_state accessors
Change generated with coccinelle and fixed up by hand as appropriate.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:27 +02:00
Mateusz Guzik 40a4c512ad
gfs2: use the new ->i_state accessors
Change generated with coccinelle and fixed up by hand as appropriate.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:27 +02:00
Mateusz Guzik ba69118c52
f2fs: use the new ->i_state accessors
Change generated with coccinelle and fixed up by hand as appropriate.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:27 +02:00
Mateusz Guzik f5a67689ba
smb: use the new ->i_state accessors
Change generated with coccinelle and fixed up by hand as appropriate.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:27 +02:00
Mateusz Guzik fa49168ea0
ceph: use the new ->i_state accessors
Change generated with coccinelle and fixed up by hand as appropriate.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:27 +02:00
Mateusz Guzik 7b12a794bf
btrfs: use the new ->i_state accessors
Change generated with coccinelle and fixed up by hand as appropriate.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:26 +02:00
Mateusz Guzik f5aa78e2be
Manual conversion to use ->i_state accessors of all places not covered by coccinelle
Nothing to look at apart from iput_final().

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:26 +02:00
Mateusz Guzik b4dbfd8653
Coccinelle-based conversion to use ->i_state accessors
All places were patched by coccinelle with the default expecting that
->i_lock is held, afterwards entries got fixed up by hand to use
unlocked variants as needed.

The script:
@@
expression inode, flags;
@@

- inode->i_state & flags
+ inode_state_read(inode) & flags

@@
expression inode, flags;
@@

- inode->i_state &= ~flags
+ inode_state_clear(inode, flags)

@@
expression inode, flag1, flag2;
@@

- inode->i_state &= ~flag1 & ~flag2
+ inode_state_clear(inode, flag1 | flag2)

@@
expression inode, flags;
@@

- inode->i_state |= flags
+ inode_state_set(inode, flags)

@@
expression inode, flags;
@@

- inode->i_state = flags
+ inode_state_assign(inode, flags)

@@
expression inode, flags;
@@

- flags = inode->i_state
+ flags = inode_state_read(inode)

@@
expression inode, flags;
@@

- READ_ONCE(inode->i_state) & flags
+ inode_state_read(inode) & flags

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:26 +02:00
Mateusz Guzik d8753f788a
fs: provide accessors for ->i_state
Open-coded accesses prevent asserting they are done correctly. One
obvious aspect is locking, but significantly more can be checked. For
example it can be detected when the code is clearing flags which are
already missing, or is setting flags when it is illegal (e.g., I_FREEING
when ->i_count > 0).

In order to keep things manageable this patchset merely gets the thing
off the ground with only lockdep checks baked in.

Current consumers can be trivially converted.

Suppose flags I_A and I_B are to be handled.

If ->i_lock is held, then:

state = inode->i_state  	=> state = inode_state_read(inode)
inode->i_state |= (I_A | I_B) 	=> inode_state_set(inode, I_A | I_B)
inode->i_state &= ~(I_A | I_B) 	=> inode_state_clear(inode, I_A | I_B)
inode->i_state = I_A | I_B	=> inode_state_assign(inode, I_A | I_B)

If ->i_lock is not held or only held conditionally:

state = inode->i_state  	=> state = inode_state_read_once(inode)
inode->i_state |= (I_A | I_B) 	=> inode_state_set_raw(inode, I_A | I_B)
inode->i_state &= ~(I_A | I_B) 	=> inode_state_clear_raw(inode, I_A | I_B)
inode->i_state = I_A | I_B	=> inode_state_assign_raw(inode, I_A | I_B)

The "_once" vs "_raw" discrepancy stems from the read variant differing
by READ_ONCE as opposed to just lockdep checks.

Finally, if you want to atomically clear flags and set new ones, the
following:

state = inode->i_state;
state &= ~I_A;
state |= I_B;
inode->i_state = state;

turns into:

inode_state_replace(inode, I_A, I_B);

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:26 +02:00
Mateusz Guzik cb5db358ab
fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
The incoming helpers don't ship with _release/_acquire variants, for
the time being anyway.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:26 +02:00
Mateusz Guzik af6023e2ce
fs: move wait_on_inode() from writeback.h to fs.h
The only consumer outside of fs/inode.c is gfs2 and it already includes
fs.h in the relevant file.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:26 +02:00
Mateusz Guzik 31e332b911
fs: add missing fences to I_NEW handling
Suppose there are 2 CPUs racing inode hash lookup func (say ilookup5())
and unlock_new_inode().

In principle the latter can clear the I_NEW flag before prior stores
into the inode are made visible.

The former can in turn observe that I_NEW is cleared and proceed to use
the inode, possibly reading from not-yet-published areas.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:25 +02:00
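A fragment sketching the pairing this implies, written with the
accessors introduced earlier in the series (payload and use() are
placeholders):

  /* publisher, unlock_new_inode()-like */
  inode->i_private = payload;
  smp_wmb();      /* order the stores above before clearing I_NEW */
  inode_state_clear_raw(inode, I_NEW);

  /* observer, ilookup5()-like */
  if (!(inode_state_read_once(inode) & I_NEW)) {
          smp_rmb();      /* pairs with the smp_wmb() above */
          use(inode->i_private);
  }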
Mateusz Guzik 0f607a89af
ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
This postpones the writeout to ocfs2_evict_inode(), which I'm told is
fine (tm).

The intent is to retire the I_WILL_FREE flag.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Tinguely <mark.tinguely@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:25 +02:00
Mateusz Guzik be97a4b63c
fs: assert on ->i_count in iput_final()
Notably make sure the count is 0 after the return from ->drop_inode(),
provided we are going to drop.

Inspired by suspicious games played by f2fs.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:25 +02:00
Mateusz Guzik dc816f8d92
fs: assert ->i_lock held in __iget()
Also remove the now redundant comment.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:22:25 +02:00
Joanne Koong 87a13819dd
iomap: rename iomap_readpage_ctx struct to iomap_read_folio_ctx
->readpage was deprecated and reads are now on folios.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:21:26 +02:00
Joanne Koong 8805a9c64b
iomap: rename iomap_readpage_iter() to iomap_read_folio_iter()
->readpage was deprecated and reads are now on folios.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:21:26 +02:00
Joanne Koong e0e15340e4
iomap: iterate over folio mapping in iomap_readpage_iter()
Iterate over all non-uptodate ranges of a folio mapping in a single call
to iomap_readpage_iter() instead of leaving the partial iteration to the
caller.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:21:26 +02:00
Joanne Koong 7aa6bc3e87
iomap: adjust read range correctly for non-block-aligned positions
iomap_adjust_read_range() assumes that the position and length passed in
are block-aligned. This is not always the case however, as shown in the
syzbot-generated case for erofs. This causes too many bytes to be
skipped for uptodate blocks, which results in returning the incorrect
position and length to read in. If all the blocks are uptodate, this
underflows length and returns a position beyond the folio.

Fix the calculation to also take into account the block offset when
calculating how many bytes can be skipped for uptodate blocks.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:21:25 +02:00
Joanne Koong d1f9893fcd
iomap: store read/readahead bio generically
Store the iomap_readpage_ctx bio generically as a "void *read_ctx".
This makes the read/readahead interface more generic, which allows it to
be used by filesystems that may not be block-based and may not have
CONFIG_BLOCK set.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:21:25 +02:00
Joanne Koong ca82a7ea22
iomap: simplify iomap_iter_advance()
Most callers of iomap_iter_advance() do not need the remaining length
returned. Get rid of the extra iomap_length() call that
iomap_iter_advance() does.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:21:25 +02:00
Joanne Koong 7588469b5e
iomap: move read/readahead bio submission logic into helper function
Move the read/readahead bio submission logic into a separate helper.
This is needed to make iomap read/readahead more generically usable,
especially for filesystems that do not require CONFIG_BLOCK.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:21:25 +02:00
Joanne Koong 573c14c821
iomap: move bio read logic into helper function
Move the iomap_readpage_iter() bio read logic into a separate helper
function, iomap_bio_read_folio_range(). This is needed to make iomap
read/readahead more generically usable, especially for filesystems that
do not require CONFIG_BLOCK.

Additionally rename buffered write's iomap_read_folio_range() function
to iomap_bio_read_folio_range_sync() to better describe its synchronous
behavior.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20 20:21:25 +02:00
313 changed files with 16067 additions and 4720 deletions

View File

@ -135,6 +135,27 @@ These ``struct kiocb`` flags are significant for buffered I/O with iomap:
* ``IOCB_DONTCACHE``: Turns on ``IOMAP_DONTCACHE``.
``struct iomap_read_ops``
--------------------------
.. code-block:: c
struct iomap_read_ops {
int (*read_folio_range)(const struct iomap_iter *iter,
struct iomap_read_folio_ctx *ctx, size_t len);
void (*submit_read)(struct iomap_read_folio_ctx *ctx);
};
iomap calls these functions:
- ``read_folio_range``: Called to read in the range. This must be provided
by the caller. If this succeeds, iomap_finish_folio_read() must be called
after the range is read in, regardless of whether the read succeeded or
failed.
- ``submit_read``: Submit any pending read requests. This function is
optional.
Internal per-Folio State
------------------------
@ -182,6 +203,28 @@ The ``flags`` argument to ``->iomap_begin`` will be set to zero.
The pagecache takes whatever locks it needs before calling the
filesystem.
Both ``iomap_readahead`` and ``iomap_read_folio`` pass in a ``struct
iomap_read_folio_ctx``:
.. code-block:: c
struct iomap_read_folio_ctx {
const struct iomap_read_ops *ops;
struct folio *cur_folio;
struct readahead_control *rac;
void *read_ctx;
};
``iomap_readahead`` must set:
* ``ops->read_folio_range()`` and ``rac``
``iomap_read_folio`` must set:
* ``ops->read_folio_range()`` and ``cur_folio``
``ops->submit_read()`` and ``read_ctx`` are optional. ``read_ctx`` is used to
pass in any custom data the caller needs accessible in the ops callbacks for
fulfilling reads.
Buffered Writes
---------------
@ -317,6 +360,9 @@ The fields are as follows:
delalloc reservations to avoid having delalloc reservations for
clean pagecache.
This function must be supplied by the filesystem.
If this succeeds, iomap_finish_folio_write() must be called once writeback
completes for the range, regardless of whether the writeback succeeded or
failed.
- ``writeback_submit``: Submit the previous built writeback context.
Block based file systems should use the iomap_ioend_writeback_submit
@ -444,10 +490,6 @@ These ``struct kiocb`` flags are significant for direct I/O with iomap:
Only meaningful for asynchronous I/O, and only if the entire I/O can
be issued as a single ``struct bio``.
* ``IOCB_DIO_CALLER_COMP``: Try to run I/O completion from the caller's
process context.
See ``linux/fs.h`` for more details.
Filesystems should call ``iomap_dio_rw`` from ``->read_iter`` and
``->write_iter``, and set ``FMODE_CAN_ODIRECT`` in the ``->open``
function for the file.

View File

@ -211,7 +211,7 @@ test and set for you.
e.g.::
inode = iget_locked(sb, ino);
if (inode->i_state & I_NEW) {
if (inode_state_read_once(inode) & I_NEW) {
err = read_inode_from_disk(inode);
if (err < 0) {
iget_failed(inode);

View File

@ -27166,6 +27166,7 @@ F: arch/s390/include/uapi/asm/virtio-ccw.h
F: drivers/s390/virtio/
VIRTIO FILE SYSTEM
M: German Maglione <gmaglione@redhat.com>
M: Vivek Goyal <vgoyal@redhat.com>
M: Stefan Hajnoczi <stefanha@redhat.com>
M: Miklos Szeredi <miklos@szeredi.hu>

View File

@ -1061,6 +1061,9 @@ NOSTDINC_FLAGS += -nostdinc
# perform bounds checking.
KBUILD_CFLAGS += $(call cc-option, -fstrict-flex-arrays=3)
# Allow including a tagged struct or union anonymously in another struct/union.
KBUILD_CFLAGS += -fms-extensions
# disable invalid "can't wrap" optimizations for signed / pointers
KBUILD_CFLAGS += -fno-strict-overflow

View File

@ -509,3 +509,4 @@
577 common open_tree_attr sys_open_tree_attr
578 common file_getattr sys_file_getattr
579 common file_setattr sys_file_setattr
580 common listns sys_listns

View File

@ -484,3 +484,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns

View File

@ -63,7 +63,7 @@ VDSO_CFLAGS += -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
$(filter -Werror,$(KBUILD_CPPFLAGS)) \
-Werror-implicit-function-declaration \
-Wno-format-security \
-std=gnu11
-std=gnu11 -fms-extensions
VDSO_CFLAGS += -O2
# Some useful compiler-dependent flags from top-level Makefile
VDSO_CFLAGS += $(call cc32-option,-Wno-pointer-sign)
@ -71,6 +71,7 @@ VDSO_CFLAGS += -fno-strict-overflow
VDSO_CFLAGS += $(call cc32-option,-Werror=strict-prototypes)
VDSO_CFLAGS += -Werror=date-time
VDSO_CFLAGS += $(call cc32-option,-Werror=incompatible-pointer-types)
VDSO_CFLAGS += $(if $(CONFIG_CC_IS_CLANG),-Wno-microsoft-anon-tag)
# Compile as THUMB2 or ARM. Unwinding via frame-pointers in THUMB2 is
# unreliable.

View File

@ -481,3 +481,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns

View File

@ -19,7 +19,7 @@ ccflags-vdso := \
cflags-vdso := $(ccflags-vdso) \
-isystem $(shell $(CC) -print-file-name=include) \
$(filter -W%,$(filter-out -Wa$(comma)%,$(KBUILD_CFLAGS))) \
-std=gnu11 -O2 -g -fno-strict-aliasing -fno-common -fno-builtin \
-std=gnu11 -fms-extensions -O2 -g -fno-strict-aliasing -fno-common -fno-builtin \
-fno-stack-protector -fno-jump-tables -DDISABLE_BRANCH_PROFILING \
$(call cc-option, -fno-asynchronous-unwind-tables) \
$(call cc-option, -fno-stack-protector)

View File

@ -469,3 +469,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns

View File

@ -475,3 +475,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns

View File

@ -408,3 +408,4 @@
467 n32 open_tree_attr sys_open_tree_attr
468 n32 file_getattr sys_file_getattr
469 n32 file_setattr sys_file_setattr
470 n32 listns sys_listns

View File

@ -384,3 +384,4 @@
467 n64 open_tree_attr sys_open_tree_attr
468 n64 file_getattr sys_file_getattr
469 n64 file_setattr sys_file_setattr
470 n64 listns sys_listns

View File

@ -457,3 +457,4 @@
467 o32 open_tree_attr sys_open_tree_attr
468 o32 file_getattr sys_file_getattr
469 o32 file_setattr sys_file_setattr
470 o32 listns sys_listns

View File

@ -18,7 +18,7 @@ KBUILD_CFLAGS += -fno-PIE -mno-space-regs -mdisable-fpregs -Os
ifndef CONFIG_64BIT
KBUILD_CFLAGS += -mfast-indirect-calls
endif
KBUILD_CFLAGS += -std=gnu11
KBUILD_CFLAGS += -std=gnu11 -fms-extensions
LDFLAGS_vmlinux := -X -e startup --as-needed -T
$(obj)/vmlinux: $(obj)/vmlinux.lds $(addprefix $(obj)/, $(OBJECTS)) $(LIBGCC) FORCE

View File

@ -468,3 +468,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns

View File

@ -70,7 +70,7 @@ BOOTCPPFLAGS := -nostdinc $(LINUXINCLUDE)
BOOTCPPFLAGS += -isystem $(shell $(BOOTCC) -print-file-name=include)
BOOTCFLAGS := $(BOOTTARGETFLAGS) \
-std=gnu11 \
-std=gnu11 -fms-extensions \
-Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
-fno-strict-aliasing -O2 \
-msoft-float -mno-altivec -mno-vsx \
@ -86,6 +86,7 @@ BOOTARFLAGS := -crD
ifdef CONFIG_CC_IS_CLANG
BOOTCFLAGS += $(CLANG_FLAGS)
BOOTCFLAGS += -Wno-microsoft-anon-tag
BOOTAFLAGS += $(CLANG_FLAGS)
endif

View File

@ -560,3 +560,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns

View File

@ -22,7 +22,7 @@ KBUILD_AFLAGS_DECOMPRESSOR := $(CLANG_FLAGS) -m64 -D__ASSEMBLY__
ifndef CONFIG_AS_IS_LLVM
KBUILD_AFLAGS_DECOMPRESSOR += $(if $(CONFIG_DEBUG_INFO),$(aflags_dwarf))
endif
KBUILD_CFLAGS_DECOMPRESSOR := $(CLANG_FLAGS) -m64 -O2 -mpacked-stack -std=gnu11
KBUILD_CFLAGS_DECOMPRESSOR := $(CLANG_FLAGS) -m64 -O2 -mpacked-stack -std=gnu11 -fms-extensions
KBUILD_CFLAGS_DECOMPRESSOR += -DDISABLE_BRANCH_PROFILING -D__NO_FORTIFY
KBUILD_CFLAGS_DECOMPRESSOR += -D__DECOMPRESSOR
KBUILD_CFLAGS_DECOMPRESSOR += -Wno-pointer-sign
@ -35,6 +35,7 @@ KBUILD_CFLAGS_DECOMPRESSOR += $(call cc-disable-warning, address-of-packed-membe
KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_DEBUG_INFO),-g)
KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_DEBUG_INFO_DWARF4), $(call cc-option, -gdwarf-4,))
KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_CC_NO_ARRAY_BOUNDS),-Wno-array-bounds)
KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_CC_IS_CLANG),-Wno-microsoft-anon-tag)
UTS_MACHINE := s390x
STACK_SIZE := $(if $(CONFIG_KASAN),65536,$(if $(CONFIG_KMSAN),65536,16384))

View File

@ -472,3 +472,4 @@
467 common open_tree_attr sys_open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr sys_file_setattr
470 common listns sys_listns sys_listns

View File

@ -13,7 +13,7 @@ CFLAGS_sha256.o := -D__NO_FORTIFY
$(obj)/mem.o: $(srctree)/arch/s390/lib/mem.S FORCE
$(call if_changed_rule,as_o_S)
KBUILD_CFLAGS := -std=gnu11 -fno-strict-aliasing -Wall -Wstrict-prototypes
KBUILD_CFLAGS := -std=gnu11 -fms-extensions -fno-strict-aliasing -Wall -Wstrict-prototypes
KBUILD_CFLAGS += -Wno-pointer-sign -Wno-sign-compare
KBUILD_CFLAGS += -fno-zero-initialized-in-bss -fno-builtin -ffreestanding
KBUILD_CFLAGS += -Os -m64 -msoft-float -fno-common
@ -21,6 +21,7 @@ KBUILD_CFLAGS += -fno-stack-protector
KBUILD_CFLAGS += -DDISABLE_BRANCH_PROFILING
KBUILD_CFLAGS += -D__DISABLE_EXPORTS
KBUILD_CFLAGS += $(CLANG_FLAGS)
KBUILD_CFLAGS += $(if $(CONFIG_CC_IS_CLANG),-Wno-microsoft-anon-tag)
KBUILD_CFLAGS += $(call cc-option,-fno-PIE)
KBUILD_AFLAGS := $(filter-out -DCC_USING_EXPOLINE,$(KBUILD_AFLAGS))
KBUILD_AFLAGS += -D__DISABLE_EXPORTS

View File

@ -473,3 +473,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns

View File

@ -515,3 +515,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns

View File

@ -48,7 +48,8 @@ endif
# How to compile the 16-bit code. Note we always compile for -march=i386;
# that way we can complain to the user if the CPU is insufficient.
REALMODE_CFLAGS := -std=gnu11 -m16 -g -Os -DDISABLE_BRANCH_PROFILING -D__DISABLE_EXPORTS \
REALMODE_CFLAGS := -std=gnu11 -fms-extensions -m16 -g -Os \
-DDISABLE_BRANCH_PROFILING -D__DISABLE_EXPORTS \
-Wall -Wstrict-prototypes -march=i386 -mregparm=3 \
-fno-strict-aliasing -fomit-frame-pointer -fno-pic \
-mno-mmx -mno-sse $(call cc-option,-fcf-protection=none)
@ -60,6 +61,7 @@ REALMODE_CFLAGS += $(cc_stack_align4)
REALMODE_CFLAGS += $(CLANG_FLAGS)
ifdef CONFIG_CC_IS_CLANG
REALMODE_CFLAGS += -Wno-gnu
REALMODE_CFLAGS += -Wno-microsoft-anon-tag
endif
export REALMODE_CFLAGS

View File

@ -25,7 +25,7 @@ targets := vmlinux vmlinux.bin vmlinux.bin.gz vmlinux.bin.bz2 vmlinux.bin.lzma \
# avoid errors with '-march=i386', and future flags may depend on the target to
# be valid.
KBUILD_CFLAGS := -m$(BITS) -O2 $(CLANG_FLAGS)
KBUILD_CFLAGS += -std=gnu11
KBUILD_CFLAGS += -std=gnu11 -fms-extensions
KBUILD_CFLAGS += -fno-strict-aliasing -fPIE
KBUILD_CFLAGS += -Wundef
KBUILD_CFLAGS += -DDISABLE_BRANCH_PROFILING
@ -36,7 +36,10 @@ KBUILD_CFLAGS += -mno-mmx -mno-sse
KBUILD_CFLAGS += -ffreestanding -fshort-wchar
KBUILD_CFLAGS += -fno-stack-protector
KBUILD_CFLAGS += $(call cc-disable-warning, address-of-packed-member)
KBUILD_CFLAGS += $(call cc-disable-warning, gnu)
ifdef CONFIG_CC_IS_CLANG
KBUILD_CFLAGS += -Wno-gnu
KBUILD_CFLAGS += -Wno-microsoft-anon-tag
endif
KBUILD_CFLAGS += -Wno-pointer-sign
KBUILD_CFLAGS += -fno-asynchronous-unwind-tables
KBUILD_CFLAGS += -D__DISABLE_EXPORTS

View File

@ -475,3 +475,4 @@
467 i386 open_tree_attr sys_open_tree_attr
468 i386 file_getattr sys_file_getattr
469 i386 file_setattr sys_file_setattr
470 i386 listns sys_listns

View File

@ -394,6 +394,7 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
#
# Due to a historical design error, certain syscalls are numbered differently

View File

@ -440,3 +440,4 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns

View File

@ -67,7 +67,7 @@ static void bdev_write_inode(struct block_device *bdev)
int ret;
spin_lock(&inode->i_lock);
while (inode->i_state & I_DIRTY) {
while (inode_state_read(inode) & I_DIRTY) {
spin_unlock(&inode->i_lock);
ret = write_inode_now(inode, true);
if (ret)
@ -217,9 +217,26 @@ int set_blocksize(struct file *file, int size)
EXPORT_SYMBOL(set_blocksize);
static int sb_validate_large_blocksize(struct super_block *sb, int size)
{
const char *err_str = NULL;
if (!(sb->s_type->fs_flags & FS_LBS))
err_str = "not supported by filesystem";
else if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
err_str = "is only supported with CONFIG_TRANSPARENT_HUGEPAGE";
if (!err_str)
return 0;
pr_warn_ratelimited("%s: block size(%d) > page size(%lu) %s\n",
sb->s_type->name, size, PAGE_SIZE, err_str);
return -EINVAL;
}
int sb_set_blocksize(struct super_block *sb, int size)
{
if (!(sb->s_type->fs_flags & FS_LBS) && size > PAGE_SIZE)
if (size > PAGE_SIZE && sb_validate_large_blocksize(sb, size))
return 0;
if (set_blocksize(sb->s_bdev_file, size))
return 0;
@ -1265,7 +1282,7 @@ void sync_bdevs(bool wait)
struct block_device *bdev;
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE | I_NEW) ||
mapping->nrpages == 0) {
spin_unlock(&inode->i_lock);
continue;

View File

@ -540,12 +540,13 @@ const struct address_space_operations def_blk_aops = {
#else /* CONFIG_BUFFER_HEAD */
static int blkdev_read_folio(struct file *file, struct folio *folio)
{
return iomap_read_folio(folio, &blkdev_iomap_ops);
iomap_bio_read_folio(folio, &blkdev_iomap_ops);
return 0;
}
static void blkdev_readahead(struct readahead_control *rac)
{
iomap_readahead(rac, &blkdev_iomap_ops);
iomap_bio_readahead(rac, &blkdev_iomap_ops);
}
static ssize_t blkdev_writeback_range(struct iomap_writepage_ctx *wpc,

View File

@ -829,8 +829,6 @@ _request_firmware(const struct firmware **firmware_p, const char *name,
size_t offset, u32 opt_flags)
{
struct firmware *fw = NULL;
struct cred *kern_cred = NULL;
const struct cred *old_cred;
bool nondirect = false;
int ret;
@ -871,45 +869,38 @@ _request_firmware(const struct firmware **firmware_p, const char *name,
* called by a driver when serving an unrelated request from userland, we use
* the kernel credentials to read the file.
*/
kern_cred = prepare_kernel_cred(&init_task);
if (!kern_cred) {
ret = -ENOMEM;
goto out;
}
old_cred = override_creds(kern_cred);
scoped_with_kernel_creds() {
ret = fw_get_filesystem_firmware(device, fw->priv, "", NULL);
ret = fw_get_filesystem_firmware(device, fw->priv, "", NULL);
/* Only full reads can support decompression, platform, and sysfs. */
if (!(opt_flags & FW_OPT_PARTIAL))
nondirect = true;
/* Only full reads can support decompression, platform, and sysfs. */
if (!(opt_flags & FW_OPT_PARTIAL))
nondirect = true;
#ifdef CONFIG_FW_LOADER_COMPRESS_ZSTD
if (ret == -ENOENT && nondirect)
ret = fw_get_filesystem_firmware(device, fw->priv, ".zst",
fw_decompress_zstd);
if (ret == -ENOENT && nondirect)
ret = fw_get_filesystem_firmware(device, fw->priv, ".zst",
fw_decompress_zstd);
#endif
#ifdef CONFIG_FW_LOADER_COMPRESS_XZ
if (ret == -ENOENT && nondirect)
ret = fw_get_filesystem_firmware(device, fw->priv, ".xz",
fw_decompress_xz);
if (ret == -ENOENT && nondirect)
ret = fw_get_filesystem_firmware(device, fw->priv, ".xz",
fw_decompress_xz);
#endif
if (ret == -ENOENT && nondirect)
ret = firmware_fallback_platform(fw->priv);
if (ret == -ENOENT && nondirect)
ret = firmware_fallback_platform(fw->priv);
if (ret) {
if (!(opt_flags & FW_OPT_NO_WARN))
dev_warn(device,
"Direct firmware load for %s failed with error %d\n",
name, ret);
if (nondirect)
ret = firmware_fallback_sysfs(fw, name, device,
opt_flags, ret);
} else
ret = assign_fw(fw, device);
revert_creds(old_cred);
put_cred(kern_cred);
if (ret) {
if (!(opt_flags & FW_OPT_NO_WARN))
dev_warn(device,
"Direct firmware load for %s failed with error %d\n",
name, ret);
if (nondirect)
ret = firmware_fallback_sysfs(fw, name, device,
opt_flags, ret);
} else {
ret = assign_fw(fw, device);
}
}
out:
if (ret < 0) {

View File

@ -52,7 +52,6 @@
static DEFINE_IDR(nbd_index_idr);
static DEFINE_MUTEX(nbd_index_mutex);
static struct workqueue_struct *nbd_del_wq;
static struct cred *nbd_cred;
static int nbd_total_devices = 0;
struct nbd_sock {
@ -555,7 +554,6 @@ static int __sock_xmit(struct nbd_device *nbd, struct socket *sock, int send,
int result;
struct msghdr msg = {} ;
unsigned int noreclaim_flag;
const struct cred *old_cred;
if (unlikely(!sock)) {
dev_err_ratelimited(disk_to_dev(nbd->disk),
@ -564,34 +562,33 @@ static int __sock_xmit(struct nbd_device *nbd, struct socket *sock, int send,
return -EINVAL;
}
old_cred = override_creds(nbd_cred);
msg.msg_iter = *iter;
noreclaim_flag = memalloc_noreclaim_save();
do {
sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
sock->sk->sk_use_task_frag = false;
msg.msg_flags = msg_flags | MSG_NOSIGNAL;
if (send)
result = sock_sendmsg(sock, &msg);
else
result = sock_recvmsg(sock, &msg, msg.msg_flags);
scoped_with_kernel_creds() {
do {
sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
sock->sk->sk_use_task_frag = false;
msg.msg_flags = msg_flags | MSG_NOSIGNAL;
if (result <= 0) {
if (result == 0)
result = -EPIPE; /* short read */
break;
}
if (sent)
*sent += result;
} while (msg_data_left(&msg));
if (send)
result = sock_sendmsg(sock, &msg);
else
result = sock_recvmsg(sock, &msg, msg.msg_flags);
if (result <= 0) {
if (result == 0)
result = -EPIPE; /* short read */
break;
}
if (sent)
*sent += result;
} while (msg_data_left(&msg));
}
memalloc_noreclaim_restore(noreclaim_flag);
revert_creds(old_cred);
return result;
}
@ -2683,15 +2680,7 @@ static int __init nbd_init(void)
return -ENOMEM;
}
nbd_cred = prepare_kernel_cred(&init_task);
if (!nbd_cred) {
destroy_workqueue(nbd_del_wq);
unregister_blkdev(NBD_MAJOR, "nbd");
return -ENOMEM;
}
if (genl_register_family(&nbd_genl_family)) {
put_cred(nbd_cred);
destroy_workqueue(nbd_del_wq);
unregister_blkdev(NBD_MAJOR, "nbd");
return -EINVAL;
@ -2746,7 +2735,6 @@ static void __exit nbd_cleanup(void)
/* Also wait for nbd_dev_remove_work() completes */
destroy_workqueue(nbd_del_wq);
put_cred(nbd_cred);
idr_destroy(&nbd_index_idr);
unregister_blkdev(NBD_MAJOR, "nbd");
}

View File

@ -259,27 +259,20 @@ static int sev_cmd_buffer_len(int cmd)
static struct file *open_file_as_root(const char *filename, int flags, umode_t mode)
{
struct file *fp;
struct path root;
struct cred *cred;
const struct cred *old_cred;
struct path root __free(path_put) = {};
task_lock(&init_task);
get_fs_root(init_task.fs, &root);
task_unlock(&init_task);
cred = prepare_creds();
CLASS(prepare_creds, cred)();
if (!cred)
return ERR_PTR(-ENOMEM);
cred->fsuid = GLOBAL_ROOT_UID;
old_cred = override_creds(cred);
fp = file_open_root(&root, filename, flags, mode);
path_put(&root);
put_cred(revert_creds(old_cred));
return fp;
scoped_with_creds(cred)
return file_open_root(&root, filename, flags, mode);
}
static int sev_read_init_ex_file(void)
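The rewrite above stacks three scope-based cleanups: __free(path_put) on the local path, CLASS(prepare_creds, cred)() which pairs prepare_creds() with an automatic put_cred(), and scoped_with_creds(cred) which bounds the override_creds()/revert_creds() window (returning from inside the scope still runs the revert, as the converted code relies on). A hedged sketch of the same shape, with open_as_uid() as a made-up wrapper:

static struct file *open_as_uid(const char *name, kuid_t fsuid)
{
	struct path root __free(path_put) = {};	/* path_put() at scope exit */

	task_lock(&init_task);
	get_fs_root(init_task.fs, &root);
	task_unlock(&init_task);

	CLASS(prepare_creds, cred)();		/* put_cred() at scope exit */
	if (!cred)
		return ERR_PTR(-ENOMEM);
	cred->fsuid = fsuid;

	scoped_with_creds(cred)			/* override/revert around the open */
		return file_open_root(&root, name, O_RDONLY, 0);
}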

View File

@ -433,7 +433,7 @@ static struct dax_device *dax_dev_get(dev_t devt)
return NULL;
dax_dev = to_dax_dev(inode);
if (inode->i_state & I_NEW) {
if (inode_state_read_once(inode) & I_NEW) {
set_bit(DAXDEV_ALIVE, &dax_dev->flags);
inode->i_cdev = &dax_dev->cdev;
inode->i_mode = S_IFCHR;
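The i_state conversions repeated throughout this series follow one convention: inode_state_read_once() is the lockless accessor (presumably a READ_ONCE-style load) used where the old code read inode->i_state without holding i_lock, inode_state_read() is the variant used under inode->i_lock, and inode_state_set()/inode_state_assign_raw() are the write-side helpers. Roughly, as used in the hunks below:

/* lockless check, e.g. right after iget_locked()/iget5_locked() */
if (inode_state_read_once(inode) & I_NEW)
	setup_new_inode(inode);		/* hypothetical */

/* locked read/update under i_lock */
spin_lock(&inode->i_lock);
if (!(inode_state_read(inode) & (I_FREEING | I_WILL_FREE | I_NEW)))
	inode_state_set(inode, I_DONTCACHE);
spin_unlock(&inode->i_lock);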

View File

@ -11,12 +11,12 @@ cflags-y := $(KBUILD_CFLAGS)
cflags-$(CONFIG_X86_32) := -march=i386
cflags-$(CONFIG_X86_64) := -mcmodel=small
cflags-$(CONFIG_X86) += -m$(BITS) -D__KERNEL__ -std=gnu11 \
cflags-$(CONFIG_X86) += -m$(BITS) -D__KERNEL__ -std=gnu11 -fms-extensions \
-fPIC -fno-strict-aliasing -mno-red-zone \
-mno-mmx -mno-sse -fshort-wchar \
-Wno-pointer-sign \
$(call cc-disable-warning, address-of-packed-member) \
$(call cc-disable-warning, gnu) \
$(if $(CONFIG_CC_IS_CLANG),-Wno-gnu -Wno-microsoft-anon-tag) \
-fno-asynchronous-unwind-tables \
$(CLANG_FLAGS)

View File

@ -3670,8 +3670,6 @@ static int __init target_core_init_configfs(void)
{
struct configfs_subsystem *subsys = &target_core_fabrics;
struct t10_alua_lu_gp *lu_gp;
struct cred *kern_cred;
const struct cred *old_cred;
int ret;
pr_debug("TARGET_CORE[0]: Loading Generic Kernel Storage"
@ -3748,16 +3746,8 @@ static int __init target_core_init_configfs(void)
if (ret < 0)
goto out;
/* We use the kernel credentials to access the target directory */
kern_cred = prepare_kernel_cred(&init_task);
if (!kern_cred) {
ret = -ENOMEM;
goto out;
}
old_cred = override_creds(kern_cred);
target_init_dbroot();
revert_creds(old_cred);
put_cred(kern_cred);
scoped_with_kernel_creds()
target_init_dbroot();
return 0;

View File

@ -6,6 +6,7 @@
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/fs_struct.h>
#include <net/9p/9p.h>
#include <net/9p/client.h>
#include <linux/slab.h>

View File

@ -483,24 +483,15 @@ v9fs_vm_page_mkwrite(struct vm_fault *vmf)
static void v9fs_mmap_vm_close(struct vm_area_struct *vma)
{
struct inode *inode;
struct writeback_control wbc = {
.nr_to_write = LONG_MAX,
.sync_mode = WB_SYNC_ALL,
.range_start = (loff_t)vma->vm_pgoff * PAGE_SIZE,
/* absolute end, byte at end included */
.range_end = (loff_t)vma->vm_pgoff * PAGE_SIZE +
(vma->vm_end - vma->vm_start - 1),
};
if (!(vma->vm_flags & VM_SHARED))
return;
p9_debug(P9_DEBUG_VFS, "9p VMA close, %p, flushing", vma);
inode = file_inode(vma->vm_file);
filemap_fdatawrite_wbc(inode->i_mapping, &wbc);
filemap_fdatawrite_range(file_inode(vma->vm_file)->i_mapping,
(loff_t)vma->vm_pgoff * PAGE_SIZE,
(loff_t)vma->vm_pgoff * PAGE_SIZE +
(vma->vm_end - vma->vm_start - 1));
}
static const struct vm_operations_struct v9fs_mmap_file_vm_ops = {
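The removed writeback_control was just spelling out what filemap_fdatawrite_range() already does internally (a WB_SYNC_ALL write of the given inclusive byte range), so the two forms below should be equivalent:

/* before: hand-rolled wbc over [start, end] */
struct writeback_control wbc = {
	.nr_to_write = LONG_MAX,
	.sync_mode   = WB_SYNC_ALL,
	.range_start = start,
	.range_end   = end,	/* last byte, inclusive */
};
filemap_fdatawrite_wbc(mapping, &wbc);

/* after: one call with the same semantics */
filemap_fdatawrite_range(mapping, start, end);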

View File

@ -422,7 +422,7 @@ static struct inode *v9fs_qid_iget(struct super_block *sb,
inode = iget5_locked(sb, QID2INO(qid), test, v9fs_set_inode, st);
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->i_state & I_NEW))
if (!(inode_state_read_once(inode) & I_NEW))
return inode;
/*
* initialize the inode with the stat info

View File

@ -112,7 +112,7 @@ static struct inode *v9fs_qid_iget_dotl(struct super_block *sb,
inode = iget5_locked(sb, QID2INO(qid), test, v9fs_set_inode_dotl, st);
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->i_state & I_NEW))
if (!(inode_state_read_once(inode) & I_NEW))
return inode;
/*
* initialize the inode with the stat info

View File

@ -14,7 +14,7 @@ obj-y := open.o read_write.o file_table.o super.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o d_path.o \
stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
fs_dirent.o fs_context.o fs_parser.o fsopen.o init.o \
kernel_read_file.o mnt_idmapping.o remap_range.o pidfs.o \
file_attr.o

View File

@ -29,7 +29,7 @@ struct inode *affs_iget(struct super_block *sb, unsigned long ino)
inode = iget_locked(sb, ino);
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->i_state & I_NEW))
if (!(inode_state_read_once(inode) & I_NEW))
return inode;
pr_debug("affs_iget(%lu)\n", inode->i_ino);

View File

@ -779,7 +779,7 @@ static struct inode *afs_do_lookup(struct inode *dir, struct dentry *dentry)
struct afs_vnode *dvnode = AFS_FS_I(dir), *vnode;
struct inode *inode = NULL, *ti;
afs_dataversion_t data_version = READ_ONCE(dvnode->status.data_version);
bool supports_ibulk;
bool supports_ibulk, isnew;
long ret;
int i;
@ -850,7 +850,7 @@ static struct inode *afs_do_lookup(struct inode *dir, struct dentry *dentry)
* callback counters.
*/
ti = ilookup5_nowait(dir->i_sb, vp->fid.vnode,
afs_ilookup5_test_by_fid, &vp->fid);
afs_ilookup5_test_by_fid, &vp->fid, &isnew);
if (!IS_ERR_OR_NULL(ti)) {
vnode = AFS_FS_I(ti);
vp->dv_before = vnode->status.data_version;

View File

@ -64,7 +64,7 @@ static struct inode *afs_iget_pseudo_dir(struct super_block *sb, ino_t ino)
vnode = AFS_FS_I(inode);
if (inode->i_state & I_NEW) {
if (inode_state_read_once(inode) & I_NEW) {
netfs_inode_init(&vnode->netfs, NULL, false);
simple_inode_init_ts(inode);
set_nlink(inode, 2);
@ -259,7 +259,7 @@ static struct dentry *afs_lookup_atcell(struct inode *dir, struct dentry *dentry
vnode = AFS_FS_I(inode);
if (inode->i_state & I_NEW) {
if (inode_state_read_once(inode) & I_NEW) {
netfs_inode_init(&vnode->netfs, NULL, false);
simple_inode_init_ts(inode);
set_nlink(inode, 1);
@ -384,7 +384,7 @@ struct inode *afs_dynroot_iget_root(struct super_block *sb)
vnode = AFS_FS_I(inode);
/* there shouldn't be an existing inode */
if (inode->i_state & I_NEW) {
if (inode_state_read_once(inode) & I_NEW) {
netfs_inode_init(&vnode->netfs, NULL, false);
simple_inode_init_ts(inode);
set_nlink(inode, 2);

View File

@ -427,7 +427,7 @@ static void afs_fetch_status_success(struct afs_operation *op)
struct afs_vnode *vnode = vp->vnode;
int ret;
if (vnode->netfs.inode.i_state & I_NEW) {
if (inode_state_read_once(&vnode->netfs.inode) & I_NEW) {
ret = afs_inode_init_from_status(op, vp, vnode);
afs_op_set_error(op, ret);
if (ret == 0)
@ -579,7 +579,7 @@ struct inode *afs_iget(struct afs_operation *op, struct afs_vnode_param *vp)
inode, vnode->fid.vid, vnode->fid.vnode, vnode->fid.unique);
/* deal with an existing inode */
if (!(inode->i_state & I_NEW)) {
if (!(inode_state_read_once(inode) & I_NEW)) {
_leave(" = %p", inode);
return inode;
}
@ -639,7 +639,7 @@ struct inode *afs_root_iget(struct super_block *sb, struct key *key)
_debug("GOT ROOT INODE %p { vl=%llx }", inode, as->volume->vid);
BUG_ON(!(inode->i_state & I_NEW));
BUG_ON(!(inode_state_read_once(inode) & I_NEW));
vnode = AFS_FS_I(inode);
vnode->cb_v_check = atomic_read(&as->volume->cb_v_break);
@ -748,7 +748,7 @@ void afs_evict_inode(struct inode *inode)
if ((S_ISDIR(inode->i_mode) ||
S_ISLNK(inode->i_mode)) &&
(inode->i_state & I_DIRTY) &&
(inode_state_read_once(inode) & I_DIRTY) &&
!sbi->dyn_root) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,

View File

@ -1640,10 +1640,10 @@ static int aio_write(struct kiocb *req, const struct iocb *iocb,
static void aio_fsync_work(struct work_struct *work)
{
struct aio_kiocb *iocb = container_of(work, struct aio_kiocb, fsync.work);
const struct cred *old_cred = override_creds(iocb->fsync.creds);
iocb->ki_res.res = vfs_fsync(iocb->fsync.file, iocb->fsync.datasync);
revert_creds(old_cred);
scoped_with_creds(iocb->fsync.creds)
iocb->ki_res.res = vfs_fsync(iocb->fsync.file, iocb->fsync.datasync);
put_cred(iocb->fsync.creds);
iocb_put(iocb);
}

View File

@ -157,13 +157,37 @@ static int backing_aio_init_wq(struct kiocb *iocb)
return sb_init_dio_done_wq(sb);
}
static int do_backing_file_read_iter(struct file *file, struct iov_iter *iter,
struct kiocb *iocb, int flags)
{
struct backing_aio *aio = NULL;
int ret;
if (is_sync_kiocb(iocb)) {
rwf_t rwf = iocb_to_rw_flags(flags);
return vfs_iter_read(file, iter, &iocb->ki_pos, rwf);
}
aio = kmem_cache_zalloc(backing_aio_cachep, GFP_KERNEL);
if (!aio)
return -ENOMEM;
aio->orig_iocb = iocb;
kiocb_clone(&aio->iocb, iocb, get_file(file));
aio->iocb.ki_complete = backing_aio_rw_complete;
refcount_set(&aio->ref, 2);
ret = vfs_iocb_iter_read(file, &aio->iocb, iter);
backing_aio_put(aio);
if (ret != -EIOCBQUEUED)
backing_aio_cleanup(aio, ret);
return ret;
}
ssize_t backing_file_read_iter(struct file *file, struct iov_iter *iter,
struct kiocb *iocb, int flags,
struct backing_file_ctx *ctx)
{
struct backing_aio *aio = NULL;
const struct cred *old_cred;
ssize_t ret;
if (WARN_ON_ONCE(!(file->f_mode & FMODE_BACKING)))
@ -176,28 +200,8 @@ ssize_t backing_file_read_iter(struct file *file, struct iov_iter *iter,
!(file->f_mode & FMODE_CAN_ODIRECT))
return -EINVAL;
old_cred = override_creds(ctx->cred);
if (is_sync_kiocb(iocb)) {
rwf_t rwf = iocb_to_rw_flags(flags);
ret = vfs_iter_read(file, iter, &iocb->ki_pos, rwf);
} else {
ret = -ENOMEM;
aio = kmem_cache_zalloc(backing_aio_cachep, GFP_KERNEL);
if (!aio)
goto out;
aio->orig_iocb = iocb;
kiocb_clone(&aio->iocb, iocb, get_file(file));
aio->iocb.ki_complete = backing_aio_rw_complete;
refcount_set(&aio->ref, 2);
ret = vfs_iocb_iter_read(file, &aio->iocb, iter);
backing_aio_put(aio);
if (ret != -EIOCBQUEUED)
backing_aio_cleanup(aio, ret);
}
out:
revert_creds(old_cred);
scoped_with_creds(ctx->cred)
ret = do_backing_file_read_iter(file, iter, iocb, flags);
if (ctx->accessed)
ctx->accessed(iocb->ki_filp);
@ -206,11 +210,47 @@ ssize_t backing_file_read_iter(struct file *file, struct iov_iter *iter,
}
EXPORT_SYMBOL_GPL(backing_file_read_iter);
static int do_backing_file_write_iter(struct file *file, struct iov_iter *iter,
struct kiocb *iocb, int flags,
void (*end_write)(struct kiocb *, ssize_t))
{
struct backing_aio *aio;
int ret;
if (is_sync_kiocb(iocb)) {
rwf_t rwf = iocb_to_rw_flags(flags);
ret = vfs_iter_write(file, iter, &iocb->ki_pos, rwf);
if (end_write)
end_write(iocb, ret);
return ret;
}
ret = backing_aio_init_wq(iocb);
if (ret)
return ret;
aio = kmem_cache_zalloc(backing_aio_cachep, GFP_KERNEL);
if (!aio)
return -ENOMEM;
aio->orig_iocb = iocb;
aio->end_write = end_write;
kiocb_clone(&aio->iocb, iocb, get_file(file));
aio->iocb.ki_flags = flags;
aio->iocb.ki_complete = backing_aio_queue_completion;
refcount_set(&aio->ref, 2);
ret = vfs_iocb_iter_write(file, &aio->iocb, iter);
backing_aio_put(aio);
if (ret != -EIOCBQUEUED)
backing_aio_cleanup(aio, ret);
return ret;
}
ssize_t backing_file_write_iter(struct file *file, struct iov_iter *iter,
struct kiocb *iocb, int flags,
struct backing_file_ctx *ctx)
{
const struct cred *old_cred;
ssize_t ret;
if (WARN_ON_ONCE(!(file->f_mode & FMODE_BACKING)))
@ -227,46 +267,8 @@ ssize_t backing_file_write_iter(struct file *file, struct iov_iter *iter,
!(file->f_mode & FMODE_CAN_ODIRECT))
return -EINVAL;
/*
* Stacked filesystems don't support deferred completions, don't copy
* this property in case it is set by the issuer.
*/
flags &= ~IOCB_DIO_CALLER_COMP;
old_cred = override_creds(ctx->cred);
if (is_sync_kiocb(iocb)) {
rwf_t rwf = iocb_to_rw_flags(flags);
ret = vfs_iter_write(file, iter, &iocb->ki_pos, rwf);
if (ctx->end_write)
ctx->end_write(iocb, ret);
} else {
struct backing_aio *aio;
ret = backing_aio_init_wq(iocb);
if (ret)
goto out;
ret = -ENOMEM;
aio = kmem_cache_zalloc(backing_aio_cachep, GFP_KERNEL);
if (!aio)
goto out;
aio->orig_iocb = iocb;
aio->end_write = ctx->end_write;
kiocb_clone(&aio->iocb, iocb, get_file(file));
aio->iocb.ki_flags = flags;
aio->iocb.ki_complete = backing_aio_queue_completion;
refcount_set(&aio->ref, 2);
ret = vfs_iocb_iter_write(file, &aio->iocb, iter);
backing_aio_put(aio);
if (ret != -EIOCBQUEUED)
backing_aio_cleanup(aio, ret);
}
out:
revert_creds(old_cred);
return ret;
scoped_with_creds(ctx->cred)
return do_backing_file_write_iter(file, iter, iocb, flags, ctx->end_write);
}
EXPORT_SYMBOL_GPL(backing_file_write_iter);
@ -275,15 +277,13 @@ ssize_t backing_file_splice_read(struct file *in, struct kiocb *iocb,
unsigned int flags,
struct backing_file_ctx *ctx)
{
const struct cred *old_cred;
ssize_t ret;
if (WARN_ON_ONCE(!(in->f_mode & FMODE_BACKING)))
return -EIO;
old_cred = override_creds(ctx->cred);
ret = vfs_splice_read(in, &iocb->ki_pos, pipe, len, flags);
revert_creds(old_cred);
scoped_with_creds(ctx->cred)
ret = vfs_splice_read(in, &iocb->ki_pos, pipe, len, flags);
if (ctx->accessed)
ctx->accessed(iocb->ki_filp);
@ -297,7 +297,6 @@ ssize_t backing_file_splice_write(struct pipe_inode_info *pipe,
size_t len, unsigned int flags,
struct backing_file_ctx *ctx)
{
const struct cred *old_cred;
ssize_t ret;
if (WARN_ON_ONCE(!(out->f_mode & FMODE_BACKING)))
@ -310,11 +309,11 @@ ssize_t backing_file_splice_write(struct pipe_inode_info *pipe,
if (ret)
return ret;
old_cred = override_creds(ctx->cred);
file_start_write(out);
ret = out->f_op->splice_write(pipe, out, &iocb->ki_pos, len, flags);
file_end_write(out);
revert_creds(old_cred);
scoped_with_creds(ctx->cred) {
file_start_write(out);
ret = out->f_op->splice_write(pipe, out, &iocb->ki_pos, len, flags);
file_end_write(out);
}
if (ctx->end_write)
ctx->end_write(iocb, ret);
@ -326,7 +325,6 @@ EXPORT_SYMBOL_GPL(backing_file_splice_write);
int backing_file_mmap(struct file *file, struct vm_area_struct *vma,
struct backing_file_ctx *ctx)
{
const struct cred *old_cred;
struct file *user_file = vma->vm_file;
int ret;
@ -338,9 +336,8 @@ int backing_file_mmap(struct file *file, struct vm_area_struct *vma,
vma_set_file(vma, file);
old_cred = override_creds(ctx->cred);
ret = vfs_mmap(vma->vm_file, vma);
revert_creds(old_cred);
scoped_with_creds(ctx->cred)
ret = vfs_mmap(vma->vm_file, vma);
if (ctx->accessed)
ctx->accessed(user_file);

View File

@ -307,7 +307,7 @@ static struct inode *befs_iget(struct super_block *sb, unsigned long ino)
inode = iget_locked(sb, ino);
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->i_state & I_NEW))
if (!(inode_state_read_once(inode) & I_NEW))
return inode;
befs_ino = BEFS_I(inode);

View File

@ -42,7 +42,7 @@ struct inode *bfs_iget(struct super_block *sb, unsigned long ino)
inode = iget_locked(sb, ino);
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->i_state & I_NEW))
if (!(inode_state_read_once(inode) & I_NEW))
return inode;
if ((ino < BFS_ROOT_INO) || (ino > BFS_SB(inode->i_sb)->si_lasti)) {

View File

@ -782,8 +782,6 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
return PTR_ERR(e);
if (e->flags & MISC_FMT_OPEN_FILE) {
const struct cred *old_cred;
/*
* Now that we support unprivileged binfmt_misc mounts make
* sure we use the credentials that the register @file was
@ -791,9 +789,8 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
* didn't matter much as only a privileged process could open
* the register file.
*/
old_cred = override_creds(file->f_cred);
f = open_exec(e->interpreter);
revert_creds(old_cred);
scoped_with_creds(file->f_cred)
f = open_exec(e->interpreter);
if (IS_ERR(f)) {
pr_notice("register: failed to install interpreter file %s\n",
e->interpreter);

View File

@ -1850,12 +1850,10 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
if (!btrfs_should_reclaim(fs_info))
return;
sb_start_write(fs_info->sb);
guard(super_write)(fs_info->sb);
if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) {
sb_end_write(fs_info->sb);
if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE))
return;
}
/*
* Long running balances can keep us blocked here for eternity, so
@ -1863,7 +1861,6 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
*/
if (!mutex_trylock(&fs_info->reclaim_bgs_lock)) {
btrfs_exclop_finish(fs_info);
sb_end_write(fs_info->sb);
return;
}
@ -1947,7 +1944,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
/*
* Get out fast, in case we're read-only or unmounting the
* filesystem. It is OK to drop block groups from the list even
* for the read-only case. As we did sb_start_write(),
* for the read-only case. As we did take the super write lock,
* "mount -o remount,ro" won't happen and read-only filesystem
* means it is forced read-only due to a fatal error. So, it
* never gets back to read-write to let us reclaim again.
@ -2030,7 +2027,6 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
list_splice_tail(&retry_list, &fs_info->reclaim_bgs);
spin_unlock(&fs_info->unused_bgs_lock);
btrfs_exclop_finish(fs_info);
sb_end_write(fs_info->sb);
}
void btrfs_reclaim_bgs(struct btrfs_fs_info *fs_info)
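As in the other super_write conversions, guard(super_write)(sb) takes sb_start_write(sb) and arranges for sb_end_write(sb) to run automatically when the scope is left; that is what lets the early-return paths above drop their explicit sb_end_write() calls. The scoped form limits the protection to a single statement or block. A sketch (the helpers are illustrative, not from the tree):

static void repair_work(struct super_block *sb)
{
	guard(super_write)(sb);		/* sb_end_write() on every return below */

	if (!exclusive_op_start(sb))	/* hypothetical */
		return;
	do_repair(sb);			/* hypothetical */
	exclusive_op_finish(sb);	/* hypothetical */
}

/* or, bounded to one call: */
scoped_guard(super_write, sb)
	do_repair(sb);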

View File

@ -85,8 +85,8 @@ static inline u32 btrfs_calc_input_length(struct folio *folio, u64 range_end, u6
{
/* @cur must be inside the folio. */
ASSERT(folio_pos(folio) <= cur);
ASSERT(cur < folio_end(folio));
return min(range_end, folio_end(folio)) - cur;
ASSERT(cur < folio_next_pos(folio));
return umin(range_end, folio_next_pos(folio)) - cur;
}
int btrfs_alloc_compress_wsm(struct btrfs_fs_info *fs_info);

View File

@ -254,10 +254,9 @@ static int btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
range.extent_thresh = defrag->extent_thresh;
file_ra_state_init(ra, inode->vfs_inode.i_mapping);
sb_start_write(fs_info->sb);
ret = btrfs_defrag_file(inode, ra, &range, defrag->transid,
BTRFS_DEFRAG_BATCH);
sb_end_write(fs_info->sb);
scoped_guard(super_write, fs_info->sb)
ret = btrfs_defrag_file(inode, ra, &range,
defrag->transid, BTRFS_DEFRAG_BATCH);
iput(&inode->vfs_inode);
if (ret < 0)
@ -886,7 +885,7 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
}
lock_start = folio_pos(folio);
lock_end = folio_end(folio) - 1;
lock_end = folio_next_pos(folio) - 1;
/* Wait for any existing ordered extent in the range */
while (1) {
struct btrfs_ordered_extent *ordered;
@ -1178,7 +1177,8 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
if (!folio)
break;
if (start >= folio_end(folio) || start + len <= folio_pos(folio))
if (start >= folio_next_pos(folio) ||
start + len <= folio_pos(folio))
continue;
btrfs_folio_clamp_clear_checked(fs_info, folio, start, len);
btrfs_folio_clamp_set_dirty(fs_info, folio, start, len);
@ -1219,7 +1219,7 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
folios[i] = NULL;
goto free_folios;
}
cur = folio_end(folios[i]);
cur = folio_next_pos(folios[i]);
}
for (int i = 0; i < nr_pages; i++) {
if (!folios[i])

View File

@ -333,7 +333,7 @@ static noinline int lock_delalloc_folios(struct inode *inode,
goto out;
}
range_start = max_t(u64, folio_pos(folio), start);
range_len = min_t(u64, folio_end(folio), end + 1) - range_start;
range_len = min_t(u64, folio_next_pos(folio), end + 1) - range_start;
btrfs_folio_set_lock(fs_info, folio, range_start, range_len);
processed_end = range_start + range_len - 1;
@ -387,7 +387,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
ASSERT(orig_end > orig_start);
/* The range should at least cover part of the folio */
ASSERT(!(orig_start >= folio_end(locked_folio) ||
ASSERT(!(orig_start >= folio_next_pos(locked_folio) ||
orig_end <= folio_pos(locked_folio)));
again:
/* step one, find a bunch of delalloc bytes starting at start */
@ -493,7 +493,7 @@ static void end_folio_read(struct folio *folio, bool uptodate, u64 start, u32 le
struct btrfs_fs_info *fs_info = folio_to_fs_info(folio);
ASSERT(folio_pos(folio) <= start &&
start + len <= folio_end(folio));
start + len <= folio_next_pos(folio));
if (uptodate && btrfs_verify_folio(folio, start, len))
btrfs_folio_set_uptodate(fs_info, folio, start, len);
@ -1201,7 +1201,7 @@ static bool can_skip_one_ordered_range(struct btrfs_inode *inode,
* finished our folio read and unlocked the folio.
*/
if (btrfs_folio_test_dirty(fs_info, folio, cur, blocksize)) {
u64 range_len = min(folio_end(folio),
u64 range_len = umin(folio_next_pos(folio),
ordered->file_offset + ordered->num_bytes) - cur;
ret = true;
@ -1223,7 +1223,7 @@ static bool can_skip_one_ordered_range(struct btrfs_inode *inode,
* So we return true and update @next_ret to the OE/folio boundary.
*/
if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
u64 range_len = min(folio_end(folio),
u64 range_len = umin(folio_next_pos(folio),
ordered->file_offset + ordered->num_bytes) - cur;
/*
@ -2215,7 +2215,7 @@ static noinline_for_stack void write_one_eb(struct extent_buffer *eb,
for (int i = 0; i < num_extent_folios(eb); i++) {
struct folio *folio = eb->folios[i];
u64 range_start = max_t(u64, eb->start, folio_pos(folio));
u32 range_len = min_t(u64, folio_end(folio),
u32 range_len = min_t(u64, folio_next_pos(folio),
eb->start + eb->len) - range_start;
folio_lock(folio);
@ -2468,10 +2468,7 @@ static int extent_write_cache_pages(struct address_space *mapping,
&BTRFS_I(inode)->runtime_flags))
wbc->tagged_writepages = 1;
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
tag = PAGECACHE_TAG_TOWRITE;
else
tag = PAGECACHE_TAG_DIRTY;
tag = wbc_to_tag(wbc);
retry:
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
tag_pages_for_writeback(mapping, index, end);
@ -2627,7 +2624,7 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
continue;
}
cur_end = min_t(u64, folio_end(folio) - 1, end);
cur_end = min_t(u64, folio_next_pos(folio) - 1, end);
cur_len = cur_end + 1 - cur;
ASSERT(folio_test_locked(folio));
@ -3868,7 +3865,7 @@ int read_extent_buffer_pages_nowait(struct extent_buffer *eb, int mirror_num,
for (int i = 0; i < num_extent_folios(eb); i++) {
struct folio *folio = eb->folios[i];
u64 range_start = max_t(u64, eb->start, folio_pos(folio));
u32 range_len = min_t(u64, folio_end(folio),
u32 range_len = min_t(u64, folio_next_pos(folio),
eb->start + eb->len) - range_start;
bio_add_folio_nofail(&bbio->bio, folio, range_len,

View File

@ -89,7 +89,8 @@ int btrfs_dirty_folio(struct btrfs_inode *inode, struct folio *folio, loff_t pos
num_bytes = round_up(write_bytes + pos - start_pos,
fs_info->sectorsize);
ASSERT(num_bytes <= U32_MAX);
ASSERT(folio_pos(folio) <= pos && folio_end(folio) >= pos + write_bytes);
ASSERT(folio_pos(folio) <= pos &&
folio_next_pos(folio) >= pos + write_bytes);
end_of_last_block = start_pos + num_bytes - 1;
@ -799,7 +800,7 @@ static int prepare_uptodate_folio(struct inode *inode, struct folio *folio, u64
u64 len)
{
u64 clamp_start = max_t(u64, pos, folio_pos(folio));
u64 clamp_end = min_t(u64, pos + len, folio_end(folio));
u64 clamp_end = min_t(u64, pos + len, folio_next_pos(folio));
const u32 blocksize = inode_to_fs_info(inode)->sectorsize;
int ret = 0;
@ -1254,8 +1255,8 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *iter,
* The reserved range goes beyond the current folio, shrink the reserved
* space to the folio boundary.
*/
if (reserved_start + reserved_len > folio_end(folio)) {
const u64 last_block = folio_end(folio);
if (reserved_start + reserved_len > folio_next_pos(folio)) {
const u64 last_block = folio_next_pos(folio);
shrink_reserved_space(inode, *data_reserved, reserved_start,
reserved_len, last_block - reserved_start,

View File

@ -9,6 +9,7 @@
#include <linux/blk-cgroup.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/fs_struct.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/time.h>
@ -411,7 +412,7 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
continue;
}
index = folio_end(folio) >> PAGE_SHIFT;
index = folio_next_index(folio);
/*
* Here we just clear all Ordered bits for every page in the
* range, then btrfs_mark_ordered_io_finished() will handle
@ -2338,7 +2339,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct folio *locked_fol
* The range must cover part of the @locked_folio, or a return of 1
* can confuse the caller.
*/
ASSERT(!(end <= folio_pos(locked_folio) || start >= folio_end(locked_folio)));
ASSERT(!(end <= folio_pos(locked_folio) ||
start >= folio_next_pos(locked_folio)));
if (should_nocow(inode, start, end)) {
ret = run_delalloc_nocow(inode, locked_folio, start, end);
@ -2745,7 +2747,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
struct btrfs_inode *inode = fixup->inode;
struct btrfs_fs_info *fs_info = inode->root->fs_info;
u64 page_start = folio_pos(folio);
u64 page_end = folio_end(folio) - 1;
u64 page_end = folio_next_pos(folio) - 1;
int ret = 0;
bool free_delalloc_space = true;
@ -3886,7 +3888,7 @@ static int btrfs_add_inode_to_root(struct btrfs_inode *inode, bool prealloc)
ASSERT(ret != -ENOMEM);
return ret;
} else if (existing) {
WARN_ON(!(existing->vfs_inode.i_state & (I_WILL_FREE | I_FREEING)));
WARN_ON(!(inode_state_read_once(&existing->vfs_inode) & (I_WILL_FREE | I_FREEING)));
}
return 0;
@ -4857,7 +4859,7 @@ static int truncate_block_zero_beyond_eof(struct btrfs_inode *inode, u64 start)
*/
zero_start = max_t(u64, folio_pos(folio), start);
zero_end = folio_end(folio);
zero_end = folio_next_pos(folio);
folio_zero_range(folio, zero_start - folio_pos(folio),
zero_end - zero_start);
@ -5040,7 +5042,7 @@ int btrfs_truncate_block(struct btrfs_inode *inode, u64 offset, u64 start, u64 e
* not reach disk, it still affects our page caches.
*/
zero_start = max_t(u64, folio_pos(folio), start);
zero_end = min_t(u64, folio_end(folio) - 1, end);
zero_end = min_t(u64, folio_next_pos(folio) - 1, end);
} else {
zero_start = max_t(u64, block_start, start);
zero_end = min_t(u64, block_end, end);
@ -5363,7 +5365,7 @@ static void evict_inode_truncate_pages(struct inode *inode)
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
struct rb_node *node;
ASSERT(inode->i_state & I_FREEING);
ASSERT(inode_state_read_once(inode) & I_FREEING);
truncate_inode_pages_final(&inode->i_data);
btrfs_drop_extent_map_range(BTRFS_I(inode), 0, (u64)-1, false);
@ -5801,7 +5803,7 @@ struct btrfs_inode *btrfs_iget_path(u64 ino, struct btrfs_root *root,
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->vfs_inode.i_state & I_NEW))
if (!(inode_state_read_once(&inode->vfs_inode) & I_NEW))
return inode;
ret = btrfs_read_locked_inode(inode, path);
@ -5825,7 +5827,7 @@ struct btrfs_inode *btrfs_iget(u64 ino, struct btrfs_root *root)
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->vfs_inode.i_state & I_NEW))
if (!(inode_state_read_once(&inode->vfs_inode) & I_NEW))
return inode;
path = btrfs_alloc_path();
@ -5839,6 +5841,8 @@ struct btrfs_inode *btrfs_iget(u64 ino, struct btrfs_root *root)
if (ret)
return ERR_PTR(ret);
if (S_ISDIR(inode->vfs_inode.i_mode))
inode->vfs_inode.i_opflags |= IOP_FASTPERM_MAY_EXEC;
unlock_new_inode(&inode->vfs_inode);
return inode;
}
@ -6291,8 +6295,8 @@ static int btrfs_dirty_inode(struct btrfs_inode *inode)
}
/*
* This is a copy of file_update_time. We need this so we can return error on
* ENOSPC for updating the inode in the case of file write and mmap writes.
* We need our own ->update_time so that we can return error on ENOSPC for
* updating the inode in the case of file write and mmap writes.
*/
static int btrfs_update_time(struct inode *inode, int flags)
{
@ -6790,8 +6794,11 @@ static int btrfs_create_common(struct inode *dir, struct dentry *dentry,
}
ret = btrfs_create_new_inode(trans, &new_inode_args);
if (!ret)
if (!ret) {
if (S_ISDIR(inode->i_mode))
inode->i_opflags |= IOP_FASTPERM_MAY_EXEC;
d_instantiate_new(dentry, inode);
}
btrfs_end_transaction(trans);
btrfs_btree_balance_dirty(fs_info);
@ -7481,7 +7488,7 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset,
u64 page_start = folio_pos(folio);
u64 page_end = page_start + folio_size(folio) - 1;
u64 cur;
int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
int inode_evicting = inode_state_read_once(&inode->vfs_inode) & I_FREEING;
/*
* We have folio locked so no new ordered extent can be created on this
@ -8710,15 +8717,13 @@ static struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode
* some fairly slow code that needs optimization. This walks the list
* of all the inodes with pending delalloc and forces them to disk.
*/
static int start_delalloc_inodes(struct btrfs_root *root,
struct writeback_control *wbc, bool snapshot,
bool in_reclaim_context)
static int start_delalloc_inodes(struct btrfs_root *root, long *nr_to_write,
bool snapshot, bool in_reclaim_context)
{
struct btrfs_delalloc_work *work, *next;
LIST_HEAD(works);
LIST_HEAD(splice);
int ret = 0;
bool full_flush = wbc->nr_to_write == LONG_MAX;
mutex_lock(&root->delalloc_mutex);
spin_lock(&root->delalloc_lock);
@ -8744,10 +8749,10 @@ static int start_delalloc_inodes(struct btrfs_root *root,
if (snapshot)
set_bit(BTRFS_INODE_SNAPSHOT_FLUSH, &inode->runtime_flags);
if (full_flush) {
work = btrfs_alloc_delalloc_work(&inode->vfs_inode);
if (nr_to_write == NULL) {
work = btrfs_alloc_delalloc_work(tmp_inode);
if (!work) {
iput(&inode->vfs_inode);
iput(tmp_inode);
ret = -ENOMEM;
goto out;
}
@ -8755,9 +8760,11 @@ static int start_delalloc_inodes(struct btrfs_root *root,
btrfs_queue_work(root->fs_info->flush_workers,
&work->work);
} else {
ret = filemap_fdatawrite_wbc(inode->vfs_inode.i_mapping, wbc);
ret = filemap_flush_nr(tmp_inode->i_mapping,
nr_to_write);
btrfs_add_delayed_iput(inode);
if (ret || wbc->nr_to_write <= 0)
if (ret || *nr_to_write <= 0)
goto out;
}
cond_resched();
@ -8783,29 +8790,17 @@ static int start_delalloc_inodes(struct btrfs_root *root,
int btrfs_start_delalloc_snapshot(struct btrfs_root *root, bool in_reclaim_context)
{
struct writeback_control wbc = {
.nr_to_write = LONG_MAX,
.sync_mode = WB_SYNC_NONE,
.range_start = 0,
.range_end = LLONG_MAX,
};
struct btrfs_fs_info *fs_info = root->fs_info;
if (BTRFS_FS_ERROR(fs_info))
return -EROFS;
return start_delalloc_inodes(root, &wbc, true, in_reclaim_context);
return start_delalloc_inodes(root, NULL, true, in_reclaim_context);
}
int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
bool in_reclaim_context)
{
struct writeback_control wbc = {
.nr_to_write = nr,
.sync_mode = WB_SYNC_NONE,
.range_start = 0,
.range_end = LLONG_MAX,
};
long *nr_to_write = nr == LONG_MAX ? NULL : &nr;
struct btrfs_root *root;
LIST_HEAD(splice);
int ret;
@ -8817,13 +8812,6 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
spin_lock(&fs_info->delalloc_root_lock);
list_splice_init(&fs_info->delalloc_roots, &splice);
while (!list_empty(&splice)) {
/*
* Reset nr_to_write here so we know that we're doing a full
* flush.
*/
if (nr == LONG_MAX)
wbc.nr_to_write = LONG_MAX;
root = list_first_entry(&splice, struct btrfs_root,
delalloc_root);
root = btrfs_grab_root(root);
@ -8832,9 +8820,10 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
&fs_info->delalloc_roots);
spin_unlock(&fs_info->delalloc_root_lock);
ret = start_delalloc_inodes(root, &wbc, false, in_reclaim_context);
ret = start_delalloc_inodes(root, nr_to_write, false,
in_reclaim_context);
btrfs_put_root(root);
if (ret < 0 || wbc.nr_to_write <= 0)
if (ret < 0 || nr <= 0)
goto out;
spin_lock(&fs_info->delalloc_root_lock);
}
@ -9170,6 +9159,11 @@ int btrfs_prealloc_file_range_trans(struct inode *inode,
min_size, actual_len, alloc_hint, trans);
}
/*
* NOTE: in case you are adding MAY_EXEC check for directories:
* we are marking them with IOP_FASTPERM_MAY_EXEC, allowing path lookup to
* elide calls here.
*/
static int btrfs_permission(struct mnt_idmap *idmap,
struct inode *inode, int mask)
{

View File

@ -209,9 +209,4 @@ static inline bool bitmap_test_range_all_zero(const unsigned long *addr,
return (found_set == start + nbits);
}
static inline u64 folio_end(struct folio *folio)
{
return folio_pos(folio) + folio_size(folio);
}
#endif
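The btrfs-local folio_end() helper deleted here computed folio_pos(folio) + folio_size(folio); the folio_next_pos() calls substituted throughout this series presumably carry the same meaning as a generic helper, i.e. the file position of the first byte after the folio:

/* inferred equivalence, based on the removed helper above (sketch) */
static inline u64 folio_next_pos(struct folio *folio)
{
	return folio_pos(folio) + folio_size(folio);
}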

View File

@ -359,7 +359,7 @@ static bool can_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
if (folio) {
ASSERT(folio->mapping);
ASSERT(folio_pos(folio) <= file_offset);
ASSERT(file_offset + len <= folio_end(folio));
ASSERT(file_offset + len <= folio_next_pos(folio));
/*
* Ordered flag indicates whether we still have

View File

@ -186,7 +186,8 @@ static void btrfs_subpage_assert(const struct btrfs_fs_info *fs_info,
* unmapped page like dummy extent buffer pages.
*/
if (folio->mapping)
ASSERT(folio_pos(folio) <= start && start + len <= folio_end(folio),
ASSERT(folio_pos(folio) <= start &&
start + len <= folio_next_pos(folio),
"start=%llu len=%u folio_pos=%llu folio_size=%zu",
start, len, folio_pos(folio), folio_size(folio));
}
@ -217,7 +218,7 @@ static void btrfs_subpage_clamp_range(struct folio *folio, u64 *start, u32 *len)
if (folio_pos(folio) >= orig_start + orig_len)
*len = 0;
else
*len = min_t(u64, folio_end(folio), orig_start + orig_len) - *start;
*len = min_t(u64, folio_next_pos(folio), orig_start + orig_len) - *start;
}
static bool btrfs_subpage_end_and_test_lock(const struct btrfs_fs_info *fs_info,

View File

@ -2002,14 +2002,11 @@ static int btrfs_add_dev_item(struct btrfs_trans_handle *trans,
static void update_dev_time(const char *device_path)
{
struct path path;
int ret;
ret = kern_path(device_path, LOOKUP_FOLLOW, &path);
if (ret)
return;
inode_update_time(d_inode(path.dentry), S_MTIME | S_CTIME | S_VERSION);
path_put(&path);
if (!kern_path(device_path, LOOKUP_FOLLOW, &path)) {
vfs_utimes(&path, NULL);
path_put(&path);
}
}
static int btrfs_rm_dev_item(struct btrfs_trans_handle *trans,
@ -4660,12 +4657,12 @@ static int balance_kthread(void *data)
struct btrfs_fs_info *fs_info = data;
int ret = 0;
sb_start_write(fs_info->sb);
guard(super_write)(fs_info->sb);
mutex_lock(&fs_info->balance_mutex);
if (fs_info->balance_ctl)
ret = btrfs_balance(fs_info, fs_info->balance_ctl, NULL);
mutex_unlock(&fs_info->balance_mutex);
sb_end_write(fs_info->sb);
return ret;
}
@ -8177,12 +8174,12 @@ static int relocating_repair_kthread(void *data)
target = cache->start;
btrfs_put_block_group(cache);
sb_start_write(fs_info->sb);
guard(super_write)(fs_info->sb);
if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) {
btrfs_info(fs_info,
"zoned: skip relocating block group %llu to repair: EBUSY",
target);
sb_end_write(fs_info->sb);
return -EBUSY;
}
@ -8210,7 +8207,6 @@ static int relocating_repair_kthread(void *data)
btrfs_put_block_group(cache);
mutex_unlock(&fs_info->reclaim_bgs_lock);
btrfs_exclop_finish(fs_info);
sb_end_write(fs_info->sb);
return ret;
}

View File

@ -611,9 +611,9 @@ int generic_buffers_fsync_noflush(struct file *file, loff_t start, loff_t end,
return err;
ret = sync_mapping_buffers(inode->i_mapping);
if (!(inode->i_state & I_DIRTY_ALL))
if (!(inode_state_read_once(inode) & I_DIRTY_ALL))
goto out;
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
if (datasync && !(inode_state_read_once(inode) & I_DIRTY_DATASYNC))
goto out;
err = sync_inode_metadata(inode, 1);
@ -2732,7 +2732,7 @@ int block_write_full_folio(struct folio *folio, struct writeback_control *wbc,
loff_t i_size = i_size_read(inode);
/* Is the folio fully inside i_size? */
if (folio_pos(folio) + folio_size(folio) <= i_size)
if (folio_next_pos(folio) <= i_size)
return __block_write_full_folio(inode, folio, get_block, wbc);
/* Is the folio fully outside i_size? (truncate in progress) */

View File

@ -1045,11 +1045,7 @@ void ceph_init_writeback_ctl(struct address_space *mapping,
ceph_wbc->index = ceph_wbc->start_index;
ceph_wbc->end = -1;
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) {
ceph_wbc->tag = PAGECACHE_TAG_TOWRITE;
} else {
ceph_wbc->tag = PAGECACHE_TAG_DIRTY;
}
ceph_wbc->tag = wbc_to_tag(wbc);
ceph_wbc->op_idx = -1;
ceph_wbc->num_ops = 0;
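This ceph hunk and the earlier btrfs extent_write_cache_pages hunk replace the identical open-coded conditional, so wbc_to_tag() can be read as folding that test into one helper:

/* inferred from the code it replaces in btrfs and ceph */
static inline xa_mark_t wbc_to_tag(struct writeback_control *wbc)
{
	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
		return PAGECACHE_TAG_TOWRITE;
	return PAGECACHE_TAG_DIRTY;
}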

View File

@ -26,7 +26,7 @@ void ceph_fscache_register_inode_cookie(struct inode *inode)
return;
/* Only new inodes! */
if (!(inode->i_state & I_NEW))
if (!(inode_state_read_once(inode) & I_NEW))
return;
WARN_ON_ONCE(ci->netfs.cache);

View File

@ -329,7 +329,7 @@ int ceph_encode_encrypted_dname(struct inode *parent, char *buf, int elen)
out:
kfree(cryptbuf);
if (dir != parent) {
if ((dir->i_state & I_NEW))
if ((inode_state_read_once(dir) & I_NEW))
discard_new_inode(dir);
else
iput(dir);
@ -438,7 +438,7 @@ int ceph_fname_to_usr(const struct ceph_fname *fname, struct fscrypt_str *tname,
fscrypt_fname_free_buffer(&_tname);
out_inode:
if (dir != fname->dir) {
if ((dir->i_state & I_NEW))
if ((inode_state_read_once(dir) & I_NEW))
discard_new_inode(dir);
else
iput(dir);

View File

@ -740,7 +740,7 @@ static int ceph_finish_async_create(struct inode *dir, struct inode *inode,
vino.ino, ceph_ino(dir), dentry->d_name.name);
ceph_dir_clear_ordered(dir);
ceph_init_inode_acls(inode, as_ctx);
if (inode->i_state & I_NEW) {
if (inode_state_read_once(inode) & I_NEW) {
/*
* If it's not I_NEW, then someone created this before
* we got here. Assume the server is aware of it at
@ -901,7 +901,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
new_inode = NULL;
goto out_req;
}
WARN_ON_ONCE(!(new_inode->i_state & I_NEW));
WARN_ON_ONCE(!(inode_state_read_once(new_inode) & I_NEW));
spin_lock(&dentry->d_lock);
di->flags |= CEPH_DENTRY_ASYNC_CREATE;

View File

@ -132,7 +132,7 @@ struct inode *ceph_new_inode(struct inode *dir, struct dentry *dentry,
goto out_err;
}
inode->i_state = 0;
inode_state_assign_raw(inode, 0);
inode->i_mode = *mode;
err = ceph_security_init_secctx(dentry, *mode, as_ctx);
@ -201,7 +201,7 @@ struct inode *ceph_get_inode(struct super_block *sb, struct ceph_vino vino,
doutc(cl, "on %llx=%llx.%llx got %p new %d\n",
ceph_present_inode(inode), ceph_vinop(inode), inode,
!!(inode->i_state & I_NEW));
!!(inode_state_read_once(inode) & I_NEW));
return inode;
}
@ -228,7 +228,7 @@ struct inode *ceph_get_snapdir(struct inode *parent)
goto err;
}
if (!(inode->i_state & I_NEW) && !S_ISDIR(inode->i_mode)) {
if (!(inode_state_read_once(inode) & I_NEW) && !S_ISDIR(inode->i_mode)) {
pr_warn_once_client(cl, "bad snapdir inode type (mode=0%o)\n",
inode->i_mode);
goto err;
@ -261,7 +261,7 @@ struct inode *ceph_get_snapdir(struct inode *parent)
}
}
#endif
if (inode->i_state & I_NEW) {
if (inode_state_read_once(inode) & I_NEW) {
inode->i_op = &ceph_snapdir_iops;
inode->i_fop = &ceph_snapdir_fops;
ci->i_snap_caps = CEPH_CAP_PIN; /* so we can open */
@ -270,7 +270,7 @@ struct inode *ceph_get_snapdir(struct inode *parent)
return inode;
err:
if ((inode->i_state & I_NEW))
if ((inode_state_read_once(inode) & I_NEW))
discard_new_inode(inode);
else
iput(inode);
@ -744,7 +744,7 @@ void ceph_evict_inode(struct inode *inode)
netfs_wait_for_outstanding_io(inode);
truncate_inode_pages_final(&inode->i_data);
if (inode->i_state & I_PINNING_NETFS_WB)
if (inode_state_read_once(inode) & I_PINNING_NETFS_WB)
ceph_fscache_unuse_cookie(inode, true);
clear_inode(inode);
@ -1013,7 +1013,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
le64_to_cpu(info->version), ci->i_version);
/* Once I_NEW is cleared, we can't change type or dev numbers */
if (inode->i_state & I_NEW) {
if (inode_state_read_once(inode) & I_NEW) {
inode->i_mode = mode;
} else {
if (inode_wrong_type(inode, mode)) {
@ -1090,7 +1090,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
#ifdef CONFIG_FS_ENCRYPTION
if (iinfo->fscrypt_auth_len &&
((inode->i_state & I_NEW) || (ci->fscrypt_auth_len == 0))) {
((inode_state_read_once(inode) & I_NEW) || (ci->fscrypt_auth_len == 0))) {
kfree(ci->fscrypt_auth);
ci->fscrypt_auth_len = iinfo->fscrypt_auth_len;
ci->fscrypt_auth = iinfo->fscrypt_auth;
@ -1692,13 +1692,13 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
pr_err_client(cl, "badness %p %llx.%llx\n", in,
ceph_vinop(in));
req->r_target_inode = NULL;
if (in->i_state & I_NEW)
if (inode_state_read_once(in) & I_NEW)
discard_new_inode(in);
else
iput(in);
goto done;
}
if (in->i_state & I_NEW)
if (inode_state_read_once(in) & I_NEW)
unlock_new_inode(in);
}
@ -1898,11 +1898,11 @@ static int readdir_prepopulate_inodes_only(struct ceph_mds_request *req,
pr_err_client(cl, "inode badness on %p got %d\n", in,
rc);
err = rc;
if (in->i_state & I_NEW) {
if (inode_state_read_once(in) & I_NEW) {
ihold(in);
discard_new_inode(in);
}
} else if (in->i_state & I_NEW) {
} else if (inode_state_read_once(in) & I_NEW) {
unlock_new_inode(in);
}
@ -2114,7 +2114,7 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
pr_err_client(cl, "badness on %p %llx.%llx\n", in,
ceph_vinop(in));
if (d_really_is_negative(dn)) {
if (in->i_state & I_NEW) {
if (inode_state_read_once(in) & I_NEW) {
ihold(in);
discard_new_inode(in);
}
@ -2124,7 +2124,7 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
err = ret;
goto next_item;
}
if (in->i_state & I_NEW)
if (inode_state_read_once(in) & I_NEW)
unlock_new_inode(in);
if (d_really_is_negative(dn)) {

View File

@ -70,7 +70,7 @@ struct inode * coda_iget(struct super_block * sb, struct CodaFid * fid,
if (!inode)
return ERR_PTR(-ENOMEM);
if (inode->i_state & I_NEW) {
if (inode_state_read_once(inode) & I_NEW) {
cii = ITOC(inode);
/* we still need to set i_ino for things like stat(2) */
inode->i_ino = hash;
@ -148,7 +148,7 @@ struct inode *coda_fid_to_inode(struct CodaFid *fid, struct super_block *sb)
/* we should never see newly created inodes because we intentionally
* fail in the initialization callback */
BUG_ON(inode->i_state & I_NEW);
BUG_ON(inode_state_read_once(inode) & I_NEW);
return inode;
}

View File

@ -1036,7 +1036,7 @@ static bool coredump_pipe(struct core_name *cn, struct coredump_params *cprm,
static bool coredump_write(struct core_name *cn,
struct coredump_params *cprm,
struct linux_binfmt *binfmt)
const struct linux_binfmt *binfmt)
{
if (dump_interrupted())
@ -1086,15 +1086,80 @@ static inline bool coredump_skip(const struct coredump_params *cprm,
return false;
}
static void do_coredump(struct core_name *cn, struct coredump_params *cprm,
size_t **argv, int *argc, const struct linux_binfmt *binfmt)
{
if (!coredump_parse(cn, cprm, argv, argc)) {
coredump_report_failure("format_corename failed, aborting core");
return;
}
switch (cn->core_type) {
case COREDUMP_FILE:
if (!coredump_file(cn, cprm, binfmt))
return;
break;
case COREDUMP_PIPE:
if (!coredump_pipe(cn, cprm, *argv, *argc))
return;
break;
case COREDUMP_SOCK_REQ:
fallthrough;
case COREDUMP_SOCK:
if (!coredump_socket(cn, cprm))
return;
break;
default:
WARN_ON_ONCE(true);
return;
}
/* Don't even generate the coredump. */
if (cn->mask & COREDUMP_REJECT)
return;
/* get us an unshared descriptor table; almost always a no-op */
/* The cell spufs coredump code reads the file descriptor tables */
if (unshare_files())
return;
if ((cn->mask & COREDUMP_KERNEL) && !coredump_write(cn, cprm, binfmt))
return;
coredump_sock_shutdown(cprm->file);
/* Let the parent know that a coredump was generated. */
if (cn->mask & COREDUMP_USERSPACE)
cn->core_dumped = true;
/*
* When core_pipe_limit is set we wait for the coredump server
* or usermodehelper to finish before exiting so it can e.g.,
* inspect /proc/<pid>.
*/
if (cn->mask & COREDUMP_WAIT) {
switch (cn->core_type) {
case COREDUMP_PIPE:
wait_for_dump_helpers(cprm->file);
break;
case COREDUMP_SOCK_REQ:
fallthrough;
case COREDUMP_SOCK:
coredump_sock_wait(cprm->file);
break;
default:
break;
}
}
}
void vfs_coredump(const kernel_siginfo_t *siginfo)
{
struct cred *cred __free(put_cred) = NULL;
size_t *argv __free(kfree) = NULL;
struct core_state core_state;
struct core_name cn;
struct mm_struct *mm = current->mm;
struct linux_binfmt *binfmt = mm->binfmt;
const struct cred *old_cred;
const struct mm_struct *mm = current->mm;
const struct linux_binfmt *binfmt = mm->binfmt;
int argc = 0;
struct coredump_params cprm = {
.siginfo = siginfo,
@ -1116,7 +1181,7 @@ void vfs_coredump(const kernel_siginfo_t *siginfo)
if (coredump_skip(&cprm, binfmt))
return;
cred = prepare_creds();
CLASS(prepare_creds, cred)();
if (!cred)
return;
/*
@ -1131,74 +1196,9 @@ void vfs_coredump(const kernel_siginfo_t *siginfo)
if (coredump_wait(siginfo->si_signo, &core_state) < 0)
return;
old_cred = override_creds(cred);
if (!coredump_parse(&cn, &cprm, &argv, &argc)) {
coredump_report_failure("format_corename failed, aborting core");
goto close_fail;
}
switch (cn.core_type) {
case COREDUMP_FILE:
if (!coredump_file(&cn, &cprm, binfmt))
goto close_fail;
break;
case COREDUMP_PIPE:
if (!coredump_pipe(&cn, &cprm, argv, argc))
goto close_fail;
break;
case COREDUMP_SOCK_REQ:
fallthrough;
case COREDUMP_SOCK:
if (!coredump_socket(&cn, &cprm))
goto close_fail;
break;
default:
WARN_ON_ONCE(true);
goto close_fail;
}
/* Don't even generate the coredump. */
if (cn.mask & COREDUMP_REJECT)
goto close_fail;
/* get us an unshared descriptor table; almost always a no-op */
/* The cell spufs coredump code reads the file descriptor tables */
if (unshare_files())
goto close_fail;
if ((cn.mask & COREDUMP_KERNEL) && !coredump_write(&cn, &cprm, binfmt))
goto close_fail;
coredump_sock_shutdown(cprm.file);
/* Let the parent know that a coredump was generated. */
if (cn.mask & COREDUMP_USERSPACE)
cn.core_dumped = true;
/*
* When core_pipe_limit is set we wait for the coredump server
* or usermodehelper to finish before exiting so it can e.g.,
* inspect /proc/<pid>.
*/
if (cn.mask & COREDUMP_WAIT) {
switch (cn.core_type) {
case COREDUMP_PIPE:
wait_for_dump_helpers(cprm.file);
break;
case COREDUMP_SOCK_REQ:
fallthrough;
case COREDUMP_SOCK:
coredump_sock_wait(cprm.file);
break;
default:
break;
}
}
close_fail:
scoped_with_creds(cred)
do_coredump(&cn, &cprm, &argv, &argc, binfmt);
coredump_cleanup(&cn, &cprm);
revert_creds(old_cred);
return;
}

View File

@ -95,7 +95,7 @@ static struct inode *get_cramfs_inode(struct super_block *sb,
inode = iget_locked(sb, cramino(cramfs_inode, offset));
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->i_state & I_NEW))
if (!(inode_state_read_once(inode) & I_NEW))
return inode;
switch (cramfs_inode->mode & S_IFMT) {

View File

@ -945,7 +945,7 @@ static void evict_dentries_for_decrypted_inodes(struct fscrypt_master_key *mk)
list_for_each_entry(ci, &mk->mk_decrypted_inodes, ci_master_key_link) {
inode = ci->ci_inode;
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE | I_NEW)) {
spin_unlock(&inode->i_lock);
continue;
}

View File

@ -834,7 +834,7 @@ int fscrypt_drop_inode(struct inode *inode)
* userspace is still using the files, inodes can be dirtied between
* then and now. We mustn't lose any writes, so skip dirty inodes here.
*/
if (inode->i_state & I_DIRTY_ALL)
if (inode_state_read(inode) & I_DIRTY_ALL)
return 0;
/*

View File

@ -1507,7 +1507,7 @@ static int dax_zero_iter(struct iomap_iter *iter, bool *did_zero)
/* already zeroed? we're done. */
if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
return iomap_iter_advance(iter, &length);
return iomap_iter_advance(iter, length);
/*
* invalidate the pages whose sharing state is to be changed
@ -1536,10 +1536,10 @@ static int dax_zero_iter(struct iomap_iter *iter, bool *did_zero)
if (ret < 0)
return ret;
ret = iomap_iter_advance(iter, &length);
ret = iomap_iter_advance(iter, length);
if (ret)
return ret;
} while (length > 0);
} while ((length = iomap_length(iter)) > 0);
if (did_zero)
*did_zero = true;
@ -1597,7 +1597,7 @@ static int dax_iomap_iter(struct iomap_iter *iomi, struct iov_iter *iter)
if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_UNWRITTEN) {
done = iov_iter_zero(min(length, end - pos), iter);
return iomap_iter_advance(iomi, &done);
return iomap_iter_advance(iomi, done);
}
}
@ -1681,12 +1681,12 @@ static int dax_iomap_iter(struct iomap_iter *iomi, struct iov_iter *iter)
xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr,
map_len, iter);
length = xfer;
ret = iomap_iter_advance(iomi, &length);
ret = iomap_iter_advance(iomi, xfer);
if (!ret && xfer == 0)
ret = -EFAULT;
if (xfer < map_len)
break;
length = iomap_length(iomi);
}
dax_read_unlock(id);
@ -1919,10 +1919,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, unsigned long *pfnp,
ret |= VM_FAULT_MAJOR;
}
if (!(ret & VM_FAULT_ERROR)) {
u64 length = PAGE_SIZE;
iter.status = iomap_iter_advance(&iter, &length);
}
if (!(ret & VM_FAULT_ERROR))
iter.status = iomap_iter_advance(&iter, PAGE_SIZE);
}
if (iomap_errp)
@ -2034,10 +2032,8 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, unsigned long *pfnp,
continue; /* actually breaks out of the loop */
ret = dax_fault_iter(vmf, &iter, pfnp, &xas, &entry, true);
if (ret != VM_FAULT_FALLBACK) {
u64 length = PMD_SIZE;
iter.status = iomap_iter_advance(&iter, &length);
}
if (ret != VM_FAULT_FALLBACK)
iter.status = iomap_iter_advance(&iter, PMD_SIZE);
}
unlock_entry:
@ -2163,7 +2159,6 @@ static int dax_range_compare_iter(struct iomap_iter *it_src,
const struct iomap *smap = &it_src->iomap;
const struct iomap *dmap = &it_dest->iomap;
loff_t pos1 = it_src->pos, pos2 = it_dest->pos;
u64 dest_len;
void *saddr, *daddr;
int id, ret;
@ -2196,10 +2191,9 @@ static int dax_range_compare_iter(struct iomap_iter *it_src,
dax_read_unlock(id);
advance:
dest_len = len;
ret = iomap_iter_advance(it_src, &len);
ret = iomap_iter_advance(it_src, len);
if (!ret)
ret = iomap_iter_advance(it_dest, &dest_len);
ret = iomap_iter_advance(it_dest, len);
return ret;
out_unlock:
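These dax hunks track an iomap_iter_advance() signature change: the byte count is now passed by value instead of through a u64 pointer that was rewritten with the remaining length, and looping callers re-derive the remainder with iomap_length(), as the converted dax_zero_iter() shows. In outline (process_chunk() is illustrative):

/* old shape: 'length' updated in place to what is left */
u64 length = iomap_length(iter);
ret = iomap_iter_advance(iter, &length);

/* new shape: pass the bytes consumed, re-read the remainder */
do {
	u64 done = process_chunk(iter);		/* illustrative */
	ret = iomap_iter_advance(iter, done);
	if (ret)
		break;
} while (iomap_length(iter) > 0);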

View File

@ -86,7 +86,8 @@ __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
EXPORT_SYMBOL(rename_lock);
static struct kmem_cache *dentry_cache __ro_after_init;
static struct kmem_cache *__dentry_cache __ro_after_init;
#define dentry_cache runtime_const_ptr(__dentry_cache)
const struct qstr empty_name = QSTR_INIT("", 0);
EXPORT_SYMBOL(empty_name);
@ -794,7 +795,7 @@ void d_mark_dontcache(struct inode *inode)
de->d_flags |= DCACHE_DONTCACHE;
spin_unlock(&de->d_lock);
}
inode->i_state |= I_DONTCACHE;
inode_state_set(inode, I_DONTCACHE);
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL(d_mark_dontcache);
@ -1073,7 +1074,7 @@ struct dentry *d_find_alias_rcu(struct inode *inode)
spin_lock(&inode->i_lock);
// ->i_dentry and ->i_rcu are colocated, but the latter won't be
// used without having I_FREEING set, which means no aliases left
if (likely(!(inode->i_state & I_FREEING) && !hlist_empty(l))) {
if (likely(!(inode_state_read(inode) & I_FREEING) && !hlist_empty(l))) {
if (S_ISDIR(inode->i_mode)) {
de = hlist_entry(l->first, struct dentry, d_u.d_alias);
} else {
@ -1980,14 +1981,8 @@ void d_instantiate_new(struct dentry *entry, struct inode *inode)
security_d_instantiate(entry, inode);
spin_lock(&inode->i_lock);
__d_instantiate(entry, inode);
WARN_ON(!(inode->i_state & I_NEW));
inode->i_state &= ~I_NEW & ~I_CREATING;
/*
* Pairs with the barrier in prepare_to_wait_event() to make sure
* ___wait_var_event() either sees the bit cleared or
* waitqueue_active() check in wake_up_var() sees the waiter.
*/
smp_mb();
WARN_ON(!(inode_state_read(inode) & I_NEW));
inode_state_clear(inode, I_NEW | I_CREATING);
inode_wake_up_bit(inode, __I_NEW);
spin_unlock(&inode->i_lock);
}
@ -2306,11 +2301,20 @@ struct dentry *__d_lookup_rcu(const struct dentry *parent,
seq = raw_seqcount_begin(&dentry->d_seq);
if (dentry->d_parent != parent)
continue;
if (d_unhashed(dentry))
continue;
if (dentry->d_name.hash_len != hashlen)
continue;
if (dentry_cmp(dentry, str, hashlen_len(hashlen)) != 0)
if (unlikely(dentry_cmp(dentry, str, hashlen_len(hashlen)) != 0))
continue;
/*
* Check for the dentry being unhashed.
*
* As tempting as it is, we *can't* skip it because of a race window
* between us finding the dentry before it gets unhashed and loading
* the sequence counter after unhashing is finished.
*
* We can at least predict on it.
*/
if (unlikely(d_unhashed(dentry)))
continue;
*seqp = seq;
return dentry;
@ -3222,9 +3226,10 @@ static void __init dcache_init(void)
* but it is probably not worth it because of the cache nature
* of the dcache.
*/
dentry_cache = KMEM_CACHE_USERCOPY(dentry,
__dentry_cache = KMEM_CACHE_USERCOPY(dentry,
SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_ACCOUNT,
d_shortname.string);
runtime_const_init(ptr, __dentry_cache);
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
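The dentry_cache change uses the kernel's runtime-const patching: the pointer is emitted as an instruction immediate, runtime_const_init() fixes it up once at boot after the cache is created, and the dentry_cache macro then reads it through runtime_const_ptr() without a memory load. Shape of the pattern (sketch; struct example_obj and the flags are made up):

static struct kmem_cache *__example_cache __ro_after_init;
#define example_cache runtime_const_ptr(__example_cache)

/* at init, once the cache exists: */
__example_cache = KMEM_CACHE(example_obj, SLAB_PANIC);
runtime_const_init(ptr, __example_cache);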

View File

@ -28,7 +28,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
* inodes without pages but we deliberately won't in case
* we need to reschedule to avoid softlockups.
*/
if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
if ((inode_state_read(inode) & (I_FREEING | I_WILL_FREE | I_NEW)) ||
(mapping_empty(inode->i_mapping) && !need_resched())) {
spin_unlock(&inode->i_lock);
continue;

View File

@ -4,7 +4,7 @@ config ECRYPT_FS
depends on KEYS && CRYPTO && (ENCRYPTED_KEYS || ENCRYPTED_KEYS=n)
select CRYPTO_ECB
select CRYPTO_CBC
select CRYPTO_MD5
select CRYPTO_LIB_MD5
help
Encrypted filesystem that operates on the VFS layer. See
<file:Documentation/filesystems/ecryptfs.rst> to learn more about

View File

@ -9,7 +9,6 @@
* Michael C. Thompson <mcthomps@us.ibm.com>
*/
#include <crypto/hash.h>
#include <crypto/skcipher.h>
#include <linux/fs.h>
#include <linux/mount.h>
@ -48,32 +47,6 @@ void ecryptfs_from_hex(char *dst, char *src, int dst_size)
}
}
/**
* ecryptfs_calculate_md5 - calculates the md5 of @src
* @dst: Pointer to 16 bytes of allocated memory
* @crypt_stat: Pointer to crypt_stat struct for the current inode
* @src: Data to be md5'd
* @len: Length of @src
*
* Uses the allocated crypto context that crypt_stat references to
* generate the MD5 sum of the contents of src.
*/
static int ecryptfs_calculate_md5(char *dst,
struct ecryptfs_crypt_stat *crypt_stat,
char *src, int len)
{
int rc = crypto_shash_tfm_digest(crypt_stat->hash_tfm, src, len, dst);
if (rc) {
printk(KERN_ERR
"%s: Error computing crypto hash; rc = [%d]\n",
__func__, rc);
goto out;
}
out:
return rc;
}
static int ecryptfs_crypto_api_algify_cipher_name(char **algified_name,
char *cipher_name,
char *chaining_modifier)
@ -104,13 +77,10 @@ static int ecryptfs_crypto_api_algify_cipher_name(char **algified_name,
*
* Generate the initialization vector from the given root IV and page
* offset.
*
* Returns zero on success; non-zero on error.
*/
int ecryptfs_derive_iv(char *iv, struct ecryptfs_crypt_stat *crypt_stat,
loff_t offset)
void ecryptfs_derive_iv(char *iv, struct ecryptfs_crypt_stat *crypt_stat,
loff_t offset)
{
int rc = 0;
char dst[MD5_DIGEST_SIZE];
char src[ECRYPTFS_MAX_IV_BYTES + 16];
@ -129,20 +99,12 @@ int ecryptfs_derive_iv(char *iv, struct ecryptfs_crypt_stat *crypt_stat,
ecryptfs_printk(KERN_DEBUG, "source:\n");
ecryptfs_dump_hex(src, (crypt_stat->iv_bytes + 16));
}
rc = ecryptfs_calculate_md5(dst, crypt_stat, src,
(crypt_stat->iv_bytes + 16));
if (rc) {
ecryptfs_printk(KERN_WARNING, "Error attempting to compute "
"MD5 while generating IV for a page\n");
goto out;
}
md5(src, crypt_stat->iv_bytes + 16, dst);
memcpy(iv, dst, crypt_stat->iv_bytes);
if (unlikely(ecryptfs_verbosity > 0)) {
ecryptfs_printk(KERN_DEBUG, "derived iv:\n");
ecryptfs_dump_hex(iv, crypt_stat->iv_bytes);
}
out:
return rc;
}
/**
@ -151,29 +113,14 @@ int ecryptfs_derive_iv(char *iv, struct ecryptfs_crypt_stat *crypt_stat,
*
* Initialize the crypt_stat structure.
*/
int ecryptfs_init_crypt_stat(struct ecryptfs_crypt_stat *crypt_stat)
void ecryptfs_init_crypt_stat(struct ecryptfs_crypt_stat *crypt_stat)
{
struct crypto_shash *tfm;
int rc;
tfm = crypto_alloc_shash(ECRYPTFS_DEFAULT_HASH, 0, 0);
if (IS_ERR(tfm)) {
rc = PTR_ERR(tfm);
ecryptfs_printk(KERN_ERR, "Error attempting to "
"allocate crypto context; rc = [%d]\n",
rc);
return rc;
}
memset((void *)crypt_stat, 0, sizeof(struct ecryptfs_crypt_stat));
INIT_LIST_HEAD(&crypt_stat->keysig_list);
mutex_init(&crypt_stat->keysig_list_mutex);
mutex_init(&crypt_stat->cs_mutex);
mutex_init(&crypt_stat->cs_tfm_mutex);
crypt_stat->hash_tfm = tfm;
crypt_stat->flags |= ECRYPTFS_STRUCT_INITIALIZED;
return 0;
}
/**
@ -187,7 +134,6 @@ void ecryptfs_destroy_crypt_stat(struct ecryptfs_crypt_stat *crypt_stat)
struct ecryptfs_key_sig *key_sig, *key_sig_tmp;
crypto_free_skcipher(crypt_stat->tfm);
crypto_free_shash(crypt_stat->hash_tfm);
list_for_each_entry_safe(key_sig, key_sig_tmp,
&crypt_stat->keysig_list, crypt_stat_list) {
list_del(&key_sig->crypt_stat_list);
@ -361,14 +307,7 @@ static int crypt_extent(struct ecryptfs_crypt_stat *crypt_stat,
int rc;
extent_base = (((loff_t)page_index) * (PAGE_SIZE / extent_size));
rc = ecryptfs_derive_iv(extent_iv, crypt_stat,
(extent_base + extent_offset));
if (rc) {
ecryptfs_printk(KERN_ERR, "Error attempting to derive IV for "
"extent [0x%.16llx]; rc = [%d]\n",
(unsigned long long)(extent_base + extent_offset), rc);
goto out;
}
ecryptfs_derive_iv(extent_iv, crypt_stat, extent_base + extent_offset);
sg_init_table(&src_sg, 1);
sg_init_table(&dst_sg, 1);
@ -609,31 +548,20 @@ void ecryptfs_set_default_sizes(struct ecryptfs_crypt_stat *crypt_stat)
*/
int ecryptfs_compute_root_iv(struct ecryptfs_crypt_stat *crypt_stat)
{
int rc = 0;
char dst[MD5_DIGEST_SIZE];
BUG_ON(crypt_stat->iv_bytes > MD5_DIGEST_SIZE);
BUG_ON(crypt_stat->iv_bytes <= 0);
if (!(crypt_stat->flags & ECRYPTFS_KEY_VALID)) {
rc = -EINVAL;
ecryptfs_printk(KERN_WARNING, "Session key not valid; "
"cannot generate root IV\n");
goto out;
}
rc = ecryptfs_calculate_md5(dst, crypt_stat, crypt_stat->key,
crypt_stat->key_size);
if (rc) {
ecryptfs_printk(KERN_WARNING, "Error attempting to compute "
"MD5 while generating root IV\n");
goto out;
}
memcpy(crypt_stat->root_iv, dst, crypt_stat->iv_bytes);
out:
if (rc) {
memset(crypt_stat->root_iv, 0, crypt_stat->iv_bytes);
crypt_stat->flags |= ECRYPTFS_SECURITY_WARNING;
return -EINVAL;
}
return rc;
md5(crypt_stat->key, crypt_stat->key_size, dst);
memcpy(crypt_stat->root_iv, dst, crypt_stat->iv_bytes);
return 0;
}
static void ecryptfs_generate_new_key(struct ecryptfs_crypt_stat *crypt_stat)
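The hunks above drop the per-inode crypto_shash context in favor of the one-shot md5() library helper, which performs no allocation and cannot fail; that is why ecryptfs_derive_iv() and ecryptfs_compute_root_iv() lose their rc plumbing. A minimal sketch of the resulting shape, assuming the lib/crypto signature void md5(const u8 *data, size_t len, u8 out[MD5_DIGEST_SIZE]) from <crypto/md5.h>:

        #include <crypto/md5.h>

        /* Sketch only: mirrors the new ecryptfs_compute_root_iv() body.
         * The digest call has no error return, so the caller only needs
         * to copy out as many bytes as the cipher's IV requires.
         */
        static void compute_root_iv_sketch(struct ecryptfs_crypt_stat *crypt_stat)
        {
                u8 dst[MD5_DIGEST_SIZE];

                md5(crypt_stat->key, crypt_stat->key_size, dst);
                memcpy(crypt_stat->root_iv, dst, crypt_stat->iv_bytes);
        }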


@ -14,6 +14,7 @@
#ifndef ECRYPTFS_KERNEL_H
#define ECRYPTFS_KERNEL_H
#include <crypto/md5.h>
#include <crypto/skcipher.h>
#include <keys/user-type.h>
#include <keys/encrypted-type.h>
@ -137,8 +138,6 @@ ecryptfs_get_key_payload_data(struct key *key)
+ MAGIC_ECRYPTFS_MARKER_SIZE_BYTES)
#define ECRYPTFS_DEFAULT_CIPHER "aes"
#define ECRYPTFS_DEFAULT_KEY_BYTES 16
#define ECRYPTFS_DEFAULT_HASH "md5"
#define ECRYPTFS_TAG_70_DIGEST ECRYPTFS_DEFAULT_HASH
#define ECRYPTFS_TAG_1_PACKET_TYPE 0x01
#define ECRYPTFS_TAG_3_PACKET_TYPE 0x8C
#define ECRYPTFS_TAG_11_PACKET_TYPE 0xED
@ -163,8 +162,6 @@ ecryptfs_get_key_payload_data(struct key *key)
* ECRYPTFS_MAX_IV_BYTES */
#define ECRYPTFS_FILENAME_MIN_RANDOM_PREPEND_BYTES 16
#define ECRYPTFS_NON_NULL 0x42 /* A reasonable substitute for NULL */
#define MD5_DIGEST_SIZE 16
#define ECRYPTFS_TAG_70_DIGEST_SIZE MD5_DIGEST_SIZE
#define ECRYPTFS_TAG_70_MIN_METADATA_SIZE (1 + ECRYPTFS_MIN_PKT_LEN_SIZE \
+ ECRYPTFS_SIG_SIZE + 1 + 1)
#define ECRYPTFS_TAG_70_MAX_METADATA_SIZE (1 + ECRYPTFS_MAX_PKT_LEN_SIZE \
@ -237,8 +234,6 @@ struct ecryptfs_crypt_stat {
unsigned int extent_mask;
struct ecryptfs_mount_crypt_stat *mount_crypt_stat;
struct crypto_skcipher *tfm;
struct crypto_shash *hash_tfm; /* Crypto context for generating
* the initialization vectors */
unsigned char cipher[ECRYPTFS_MAX_CIPHER_NAME_SIZE + 1];
unsigned char key[ECRYPTFS_MAX_KEY_BYTES];
unsigned char root_iv[ECRYPTFS_MAX_IV_BYTES];
@ -558,7 +553,7 @@ int virt_to_scatterlist(const void *addr, int size, struct scatterlist *sg,
int sg_size);
int ecryptfs_compute_root_iv(struct ecryptfs_crypt_stat *crypt_stat);
void ecryptfs_rotate_iv(unsigned char *iv);
int ecryptfs_init_crypt_stat(struct ecryptfs_crypt_stat *crypt_stat);
void ecryptfs_init_crypt_stat(struct ecryptfs_crypt_stat *crypt_stat);
void ecryptfs_destroy_crypt_stat(struct ecryptfs_crypt_stat *crypt_stat);
void ecryptfs_destroy_mount_crypt_stat(
struct ecryptfs_mount_crypt_stat *mount_crypt_stat);
@ -693,8 +688,8 @@ ecryptfs_parse_tag_70_packet(char **filename, size_t *filename_size,
char *data, size_t max_packet_size);
int ecryptfs_set_f_namelen(long *namelen, long lower_namelen,
struct ecryptfs_mount_crypt_stat *mount_crypt_stat);
int ecryptfs_derive_iv(char *iv, struct ecryptfs_crypt_stat *crypt_stat,
loff_t offset);
void ecryptfs_derive_iv(char *iv, struct ecryptfs_crypt_stat *crypt_stat,
loff_t offset);
extern const struct xattr_handler * const ecryptfs_xattr_handlers[];


@ -95,7 +95,7 @@ static struct inode *__ecryptfs_get_inode(struct inode *lower_inode,
iput(lower_inode);
return ERR_PTR(-EACCES);
}
if (!(inode->i_state & I_NEW))
if (!(inode_state_read_once(inode) & I_NEW))
iput(lower_inode);
return inode;
@ -106,7 +106,7 @@ struct inode *ecryptfs_get_inode(struct inode *lower_inode,
{
struct inode *inode = __ecryptfs_get_inode(lower_inode, sb);
if (!IS_ERR(inode) && (inode->i_state & I_NEW))
if (!IS_ERR(inode) && (inode_state_read_once(inode) & I_NEW))
unlock_new_inode(inode);
return inode;
@ -364,7 +364,7 @@ static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry,
}
}
if (inode->i_state & I_NEW)
if (inode_state_read_once(inode) & I_NEW)
unlock_new_inode(inode);
return d_splice_alias(inode, dentry);
}
@ -903,11 +903,8 @@ static int ecryptfs_setattr(struct mnt_idmap *idmap,
struct ecryptfs_crypt_stat *crypt_stat;
crypt_stat = &ecryptfs_inode_to_private(d_inode(dentry))->crypt_stat;
if (!(crypt_stat->flags & ECRYPTFS_STRUCT_INITIALIZED)) {
rc = ecryptfs_init_crypt_stat(crypt_stat);
if (rc)
return rc;
}
if (!(crypt_stat->flags & ECRYPTFS_STRUCT_INITIALIZED))
ecryptfs_init_crypt_stat(crypt_stat);
inode = d_inode(dentry);
lower_inode = ecryptfs_inode_to_lower(inode);
lower_dentry = ecryptfs_dentry_to_lower(dentry);
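Conversions from open-coded inode->i_state reads to inode_state_read_once(), inode_state_read(), inode_state_set() and inode_state_clear() recur through the rest of this diff. The helper names come from the hunks themselves; the bodies below are an assumption about the intended semantics (locked accessors assert i_lock, the _once variant tolerates racy readers), not the actual implementation:

        /* Assumed shapes, inferred from the call sites in this diff. */
        static inline unsigned long inode_state_read(struct inode *inode)
        {
                lockdep_assert_held(&inode->i_lock);
                return inode->i_state;
        }

        static inline unsigned long inode_state_read_once(struct inode *inode)
        {
                return READ_ONCE(inode->i_state);       /* lockless, racy by design */
        }

        static inline void inode_state_set(struct inode *inode, unsigned long flags)
        {
                lockdep_assert_held(&inode->i_lock);
                inode->i_state |= flags;
        }

        static inline void inode_state_clear(struct inode *inode, unsigned long flags)
        {
                lockdep_assert_held(&inode->i_lock);
                inode->i_state &= ~flags;
        }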


@ -11,7 +11,6 @@
* Trevor S. Highland <trevor.highland@gmail.com>
*/
#include <crypto/hash.h>
#include <crypto/skcipher.h>
#include <linux/string.h>
#include <linux/pagemap.h>
@ -601,10 +600,7 @@ struct ecryptfs_write_tag_70_packet_silly_stack {
struct crypto_skcipher *skcipher_tfm;
struct skcipher_request *skcipher_req;
char iv[ECRYPTFS_MAX_IV_BYTES];
char hash[ECRYPTFS_TAG_70_DIGEST_SIZE];
char tmp_hash[ECRYPTFS_TAG_70_DIGEST_SIZE];
struct crypto_shash *hash_tfm;
struct shash_desc *hash_desc;
char hash[MD5_DIGEST_SIZE];
};
/*
@ -741,51 +737,15 @@ ecryptfs_write_tag_70_packet(char *dest, size_t *remaining_bytes,
"password tokens\n", __func__);
goto out_free_unlock;
}
s->hash_tfm = crypto_alloc_shash(ECRYPTFS_TAG_70_DIGEST, 0, 0);
if (IS_ERR(s->hash_tfm)) {
rc = PTR_ERR(s->hash_tfm);
printk(KERN_ERR "%s: Error attempting to "
"allocate hash crypto context; rc = [%d]\n",
__func__, rc);
goto out_free_unlock;
}
s->hash_desc = kmalloc(sizeof(*s->hash_desc) +
crypto_shash_descsize(s->hash_tfm), GFP_KERNEL);
if (!s->hash_desc) {
rc = -ENOMEM;
goto out_release_free_unlock;
}
s->hash_desc->tfm = s->hash_tfm;
rc = crypto_shash_digest(s->hash_desc,
(u8 *)s->auth_tok->token.password.session_key_encryption_key,
s->auth_tok->token.password.session_key_encryption_key_bytes,
s->hash);
if (rc) {
printk(KERN_ERR
"%s: Error computing crypto hash; rc = [%d]\n",
__func__, rc);
goto out_release_free_unlock;
}
md5(s->auth_tok->token.password.session_key_encryption_key,
s->auth_tok->token.password.session_key_encryption_key_bytes,
s->hash);
for (s->j = 0; s->j < (s->num_rand_bytes - 1); s->j++) {
s->block_aligned_filename[s->j] =
s->hash[(s->j % ECRYPTFS_TAG_70_DIGEST_SIZE)];
if ((s->j % ECRYPTFS_TAG_70_DIGEST_SIZE)
== (ECRYPTFS_TAG_70_DIGEST_SIZE - 1)) {
rc = crypto_shash_digest(s->hash_desc, (u8 *)s->hash,
ECRYPTFS_TAG_70_DIGEST_SIZE,
s->tmp_hash);
if (rc) {
printk(KERN_ERR
"%s: Error computing crypto hash; "
"rc = [%d]\n", __func__, rc);
goto out_release_free_unlock;
}
memcpy(s->hash, s->tmp_hash,
ECRYPTFS_TAG_70_DIGEST_SIZE);
}
s->hash[s->j % MD5_DIGEST_SIZE];
if ((s->j % MD5_DIGEST_SIZE) == (MD5_DIGEST_SIZE - 1))
md5(s->hash, MD5_DIGEST_SIZE, s->hash);
if (s->block_aligned_filename[s->j] == '\0')
s->block_aligned_filename[s->j] = ECRYPTFS_NON_NULL;
}
@ -798,7 +758,7 @@ ecryptfs_write_tag_70_packet(char *dest, size_t *remaining_bytes,
"convert filename memory to scatterlist; rc = [%d]. "
"block_aligned_filename_size = [%zd]\n", __func__, rc,
s->block_aligned_filename_size);
goto out_release_free_unlock;
goto out_free_unlock;
}
rc = virt_to_scatterlist(&dest[s->i], s->block_aligned_filename_size,
s->dst_sg, 2);
@ -807,7 +767,7 @@ ecryptfs_write_tag_70_packet(char *dest, size_t *remaining_bytes,
"convert encrypted filename memory to scatterlist; "
"rc = [%d]. block_aligned_filename_size = [%zd]\n",
__func__, rc, s->block_aligned_filename_size);
goto out_release_free_unlock;
goto out_free_unlock;
}
/* The characters in the first block effectively do the job
* of the IV here, so we just use 0's for the IV. Note the
@ -825,7 +785,7 @@ ecryptfs_write_tag_70_packet(char *dest, size_t *remaining_bytes,
rc,
s->auth_tok->token.password.session_key_encryption_key,
mount_crypt_stat->global_default_fn_cipher_key_bytes);
goto out_release_free_unlock;
goto out_free_unlock;
}
skcipher_request_set_crypt(s->skcipher_req, s->src_sg, s->dst_sg,
s->block_aligned_filename_size, s->iv);
@ -833,13 +793,11 @@ ecryptfs_write_tag_70_packet(char *dest, size_t *remaining_bytes,
if (rc) {
printk(KERN_ERR "%s: Error attempting to encrypt filename; "
"rc = [%d]\n", __func__, rc);
goto out_release_free_unlock;
goto out_free_unlock;
}
s->i += s->block_aligned_filename_size;
(*packet_size) = s->i;
(*remaining_bytes) -= (*packet_size);
out_release_free_unlock:
crypto_free_shash(s->hash_tfm);
out_free_unlock:
kfree_sensitive(s->block_aligned_filename);
out_unlock:
@ -850,7 +808,6 @@ ecryptfs_write_tag_70_packet(char *dest, size_t *remaining_bytes,
key_put(auth_tok_key);
}
skcipher_request_free(s->skcipher_req);
kfree_sensitive(s->hash_desc);
kfree(s);
return rc;
}
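The rewritten loop above generates the block-aligned filename prefix from an MD5-based keystream: hash the session key encryption key once, emit the digest bytes one at a time, and re-hash the digest in place each time the 16-byte pool is drained (the diff itself relies on md5() accepting overlapping input and output). Condensed, with illustrative buffer names:

        u8 hash[MD5_DIGEST_SIZE];
        size_t j;

        md5(sek, sek_bytes, hash);              /* seed the pool */
        for (j = 0; j < num_rand_bytes - 1; j++) {
                buf[j] = hash[j % MD5_DIGEST_SIZE];
                if ((j % MD5_DIGEST_SIZE) == MD5_DIGEST_SIZE - 1)
                        md5(hash, MD5_DIGEST_SIZE, hash);       /* refill pool */
                if (buf[j] == '\0')
                        buf[j] = ECRYPTFS_NON_NULL;     /* prefix must be NUL-free */
        }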


@ -12,6 +12,7 @@
#include <linux/dcache.h>
#include <linux/file.h>
#include <linux/fips.h>
#include <linux/module.h>
#include <linux/namei.h>
#include <linux/skbuff.h>
@ -454,6 +455,12 @@ static int ecryptfs_get_tree(struct fs_context *fc)
goto out;
}
if (fips_enabled) {
rc = -EINVAL;
err = "eCryptfs support is disabled due to FIPS";
goto out;
}
s = sget_fc(fc, NULL, set_anon_super_fc);
if (IS_ERR(s)) {
rc = PTR_ERR(s);


@ -41,10 +41,7 @@ static struct inode *ecryptfs_alloc_inode(struct super_block *sb)
inode_info = alloc_inode_sb(sb, ecryptfs_inode_info_cache, GFP_KERNEL);
if (unlikely(!inode_info))
goto out;
if (ecryptfs_init_crypt_stat(&inode_info->crypt_stat)) {
kmem_cache_free(ecryptfs_inode_info_cache, inode_info);
goto out;
}
ecryptfs_init_crypt_stat(&inode_info->crypt_stat);
mutex_init(&inode_info->lower_file_mutex);
atomic_set(&inode_info->lower_file_count, 0);
inode_info->lower_file = NULL;


@ -62,7 +62,7 @@ struct inode *efs_iget(struct super_block *super, unsigned long ino)
inode = iget_locked(super, ino);
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->i_state & I_NEW))
if (!(inode_state_read_once(inode) & I_NEW))
return inode;
in = INODE_INFO(inode);


@ -371,7 +371,8 @@ static int erofs_read_folio(struct file *file, struct folio *folio)
{
trace_erofs_read_folio(folio, true);
return iomap_read_folio(folio, &erofs_iomap_ops);
iomap_bio_read_folio(folio, &erofs_iomap_ops);
return 0;
}
static void erofs_readahead(struct readahead_control *rac)
@ -379,7 +380,7 @@ static void erofs_readahead(struct readahead_control *rac)
trace_erofs_readahead(rac->mapping->host, readahead_index(rac),
readahead_count(rac), true);
return iomap_readahead(rac, &erofs_iomap_ops);
iomap_bio_readahead(rac, &erofs_iomap_ops);
}
static sector_t erofs_bmap(struct address_space *mapping, sector_t block)


@ -47,7 +47,6 @@ static void erofs_fileio_ki_complete(struct kiocb *iocb, long ret)
static void erofs_fileio_rq_submit(struct erofs_fileio_rq *rq)
{
const struct cred *old_cred;
struct iov_iter iter;
int ret;
@ -61,9 +60,8 @@ static void erofs_fileio_rq_submit(struct erofs_fileio_rq *rq)
rq->iocb.ki_flags = IOCB_DIRECT;
iov_iter_bvec(&iter, ITER_DEST, rq->bvecs, rq->bio.bi_vcnt,
rq->bio.bi_iter.bi_size);
old_cred = override_creds(rq->iocb.ki_filp->f_cred);
ret = vfs_iocb_iter_read(rq->iocb.ki_filp, &rq->iocb, &iter);
revert_creds(old_cred);
scoped_with_creds(rq->iocb.ki_filp->f_cred)
ret = vfs_iocb_iter_read(rq->iocb.ki_filp, &rq->iocb, &iter);
if (ret != -EIOCBQUEUED)
erofs_fileio_ki_complete(&rq->iocb, ret);
}
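scoped_with_creds() replaces the manual override_creds()/revert_creds() pair. A usage sketch, on the assumption that the macro overrides current's credentials for the attached statement and reverts them automatically when the scope exits (including early returns), in the style of the kernel's cleanup.h guards:

        /* Equivalent to:
         *      old = override_creds(file->f_cred);
         *      ret = vfs_iocb_iter_read(file, iocb, &iter);
         *      revert_creds(old);
         * but the revert can no longer be forgotten on an error path.
         */
        scoped_with_creds(file->f_cred)
                ret = vfs_iocb_iter_read(file, iocb, &iter);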


@ -295,7 +295,7 @@ struct inode *erofs_iget(struct super_block *sb, erofs_nid_t nid)
if (!inode)
return ERR_PTR(-ENOMEM);
if (inode->i_state & I_NEW) {
if (inode_state_read_once(inode) & I_NEW) {
int err = erofs_fill_inode(inode);
if (err) {


@ -1398,7 +1398,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
inode = iget_locked(sb, ino);
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->i_state & I_NEW))
if (!(inode_state_read_once(inode) & I_NEW))
return inode;
ei = EXT2_I(inode);


@ -202,8 +202,7 @@ void ext4_evict_inode(struct inode *inode)
* the inode. Flush worker is ignoring it because of I_FREEING flag but
* we still need to remove the inode from the writeback lists.
*/
if (!list_empty_careful(&inode->i_io_list))
inode_io_list_del(inode);
inode_io_list_del(inode);
/*
* Protect us against freezing - iput() caller didn't have to have any
@ -425,7 +424,7 @@ void ext4_check_map_extents_env(struct inode *inode)
if (!S_ISREG(inode->i_mode) ||
IS_NOQUOTA(inode) || IS_VERITY(inode) ||
is_special_ino(inode->i_sb, inode->i_ino) ||
(inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) ||
(inode_state_read_once(inode) & (I_FREEING | I_WILL_FREE | I_NEW)) ||
ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE) ||
ext4_verity_in_progress(inode))
return;
@ -1319,8 +1318,8 @@ static int ext4_write_begin(const struct kiocb *iocb,
if (IS_ERR(folio))
return PTR_ERR(folio);
if (pos + len > folio_pos(folio) + folio_size(folio))
len = folio_pos(folio) + folio_size(folio) - pos;
if (len > folio_next_pos(folio) - pos)
len = folio_next_pos(folio) - pos;
from = offset_in_folio(folio, pos);
to = from + len;
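folio_next_pos() is new in this diff, and the expressions it replaces pin down its meaning. Presumably something along these lines:

        /* Inferred from the "folio_pos(folio) + folio_size(folio)" it
         * replaces: the file position of the first byte after this folio.
         */
        static inline loff_t folio_next_pos(struct folio *folio)
        {
                return folio_pos(folio) + folio_size(folio);
        }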
@ -2619,10 +2618,7 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
handle_t *handle = NULL;
int bpp = ext4_journal_blocks_per_folio(mpd->inode);
if (mpd->wbc->sync_mode == WB_SYNC_ALL || mpd->wbc->tagged_writepages)
tag = PAGECACHE_TAG_TOWRITE;
else
tag = PAGECACHE_TAG_DIRTY;
tag = wbc_to_tag(mpd->wbc);
mpd->map.m_len = 0;
mpd->next_pos = mpd->start_pos;
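Likewise, wbc_to_tag() condenses the open-coded tag selection that both ext4 and f2fs used to carry. From the removed lines, its body is presumably:

        /* Sketch inferred from the logic it replaces: data-integrity
         * writeback walks TOWRITE-tagged pages, ordinary writeback
         * walks DIRTY-tagged pages.
         */
        static inline xa_mark_t wbc_to_tag(struct writeback_control *wbc)
        {
                if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
                        return PAGECACHE_TAG_TOWRITE;
                return PAGECACHE_TAG_DIRTY;
        }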
@ -2704,7 +2700,7 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
if (mpd->map.m_len == 0)
mpd->start_pos = folio_pos(folio);
mpd->next_pos = folio_pos(folio) + folio_size(folio);
mpd->next_pos = folio_next_pos(folio);
/*
* Writeout when we cannot modify metadata is simple.
* Just submit the page. For data=journal mode we
@ -3146,8 +3142,8 @@ static int ext4_da_write_begin(const struct kiocb *iocb,
if (IS_ERR(folio))
return PTR_ERR(folio);
if (pos + len > folio_pos(folio) + folio_size(folio))
len = folio_pos(folio) + folio_size(folio) - pos;
if (len > folio_next_pos(folio) - pos)
len = folio_next_pos(folio) - pos;
ret = ext4_block_write_begin(NULL, folio, pos, len,
ext4_da_get_block_prep);
@ -3473,7 +3469,7 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
/* Any metadata buffers to write? */
if (!list_empty(&inode->i_mapping->i_private_list))
return true;
return inode->i_state & I_DIRTY_DATASYNC;
return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
}
static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
@ -4552,7 +4548,7 @@ int ext4_truncate(struct inode *inode)
* or it's a completely new inode. In those cases we might not
* have i_rwsem locked because it's not necessary.
*/
if (!(inode->i_state & (I_NEW|I_FREEING)))
if (!(inode_state_read_once(inode) & (I_NEW | I_FREEING)))
WARN_ON(!inode_is_locked(inode));
trace_ext4_truncate_enter(inode);
@ -5210,7 +5206,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
inode = iget_locked(sb, ino);
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->i_state & I_NEW)) {
if (!(inode_state_read_once(inode) & I_NEW)) {
ret = check_igot_inode(inode, flags, function, line);
if (ret) {
iput(inode);
@ -5549,7 +5545,7 @@ static void __ext4_update_other_inode_time(struct super_block *sb,
if (inode_is_dirtytime_only(inode)) {
struct ext4_inode_info *ei = EXT4_I(inode);
inode->i_state &= ~I_DIRTY_TIME;
inode_state_clear(inode, I_DIRTY_TIME);
spin_unlock(&inode->i_lock);
spin_lock(&ei->i_raw_lock);


@ -57,16 +57,12 @@ static int write_mmp_block_thawed(struct super_block *sb,
static int write_mmp_block(struct super_block *sb, struct buffer_head *bh)
{
int err;
/*
* We protect against freezing so that we don't create dirty buffers
* on frozen filesystem.
*/
sb_start_write(sb);
err = write_mmp_block_thawed(sb, bh);
sb_end_write(sb);
return err;
scoped_guard(super_write, sb)
return write_mmp_block_thawed(sb, bh);
}
/*
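scoped_guard(super_write, sb) is the guard-based replacement for the sb_start_write()/sb_end_write() bracket, so freeze protection is released on every exit path, including the early return inside the scope. A plausible definition under the kernel's guard infrastructure (an assumption; the real one may differ):

        /* Hypothetical sketch using linux/cleanup.h: */
        DEFINE_GUARD(super_write, struct super_block *,
                     sb_start_write(_T), sb_end_write(_T))

        /* ...after which the pattern in write_mmp_block() reads: */
        scoped_guard(super_write, sb)
                return write_mmp_block_thawed(sb, bh);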


@ -107,7 +107,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
if (!sbi->s_journal || is_bad_inode(inode))
return 0;
WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
WARN_ON_ONCE(!(inode_state_read_once(inode) & (I_NEW | I_FREEING)) &&
!inode_is_locked(inode));
if (ext4_inode_orphan_tracked(inode))
return 0;
@ -232,7 +232,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
return 0;
WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
WARN_ON_ONCE(!(inode_state_read_once(inode) & (I_NEW | I_FREEING)) &&
!inode_is_locked(inode));
if (ext4_test_inode_state(inode, EXT4_STATE_ORPHAN_FILE))
return ext4_orphan_file_del(handle, inode);


@ -9,6 +9,7 @@
*
* Copyright (C) 2001-2003 Andreas Gruenbacher, <agruen@suse.de>
*/
#include <linux/fs_struct.h>
#include <linux/f2fs_fs.h>
#include "f2fs.h"
#include "xattr.h"


@ -1329,7 +1329,7 @@ static int f2fs_write_compressed_pages(struct compress_ctx *cc,
}
folio = page_folio(cc->rpages[last_index]);
psize = folio_pos(folio) + folio_size(folio);
psize = folio_next_pos(folio);
err = f2fs_get_node_info(fio.sbi, dn.nid, &ni, false);
if (err)


@ -2986,10 +2986,7 @@ static int f2fs_write_cache_pages(struct address_space *mapping,
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
}
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
tag = PAGECACHE_TAG_TOWRITE;
else
tag = PAGECACHE_TAG_DIRTY;
tag = wbc_to_tag(wbc);
retry:
retry = 0;
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
@ -4222,7 +4219,7 @@ static int f2fs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
if (map.m_flags & F2FS_MAP_NEW)
iomap->flags |= IOMAP_F_NEW;
if ((inode->i_state & I_DIRTY_DATASYNC) ||
if ((inode_state_read_once(inode) & I_DIRTY_DATASYNC) ||
offset + length > i_size_read(inode))
iomap->flags |= IOMAP_F_DIRTY;


@ -569,7 +569,7 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->i_state & I_NEW)) {
if (!(inode_state_read_once(inode) & I_NEW)) {
if (is_meta_ino(sbi, ino)) {
f2fs_err(sbi, "inaccessible inode: %lu, run fsck to repair", ino);
set_sbi_flag(sbi, SBI_NEED_FSCK);


@ -844,7 +844,7 @@ static int __f2fs_tmpfile(struct mnt_idmap *idmap, struct inode *dir,
f2fs_i_links_write(inode, false);
spin_lock(&inode->i_lock);
inode->i_state |= I_LINKABLE;
inode_state_set(inode, I_LINKABLE);
spin_unlock(&inode->i_lock);
} else {
if (file)
@ -1057,7 +1057,7 @@ static int f2fs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
goto put_out_dir;
spin_lock(&whiteout->i_lock);
whiteout->i_state &= ~I_LINKABLE;
inode_state_clear(whiteout, I_LINKABLE);
spin_unlock(&whiteout->i_lock);
iput(whiteout);


@ -1798,7 +1798,7 @@ static int f2fs_drop_inode(struct inode *inode)
* - f2fs_gc -> iput -> evict
* - inode_wait_for_writeback(inode)
*/
if ((!inode_unhashed(inode) && inode->i_state & I_SYNC)) {
if ((!inode_unhashed(inode) && inode_state_read(inode) & I_SYNC)) {
if (!inode->i_nlink && !is_bad_inode(inode)) {
/* to avoid evict_inode call simultaneously */
__iget(inode);


@ -22,6 +22,7 @@
#include <linux/unaligned.h>
#include <linux/random.h>
#include <linux/iversion.h>
#include <linux/fs_struct.h>
#include "fat.h"
#ifndef CONFIG_FAT_DEFAULT_IOCHARSET


@ -641,6 +641,34 @@ void put_unused_fd(unsigned int fd)
EXPORT_SYMBOL(put_unused_fd);
/*
* Install a file pointer in the fd array while it is being resized.
*
* We need to make sure our update to the array does not get lost as the resizing
* thread can be copying the content as we modify it.
*
* We have two ways to do it:
* - go off CPU waiting for resize_in_progress to clear
* - take the spin lock
*
* The latter is trivial to implement and saves us from having to might_sleep()
* for debugging purposes.
*
* This is moved out of line from fd_install() to convince gcc to optimize that
* routine better.
*/
static void noinline fd_install_slowpath(unsigned int fd, struct file *file)
{
struct files_struct *files = current->files;
struct fdtable *fdt;
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
VFS_BUG_ON(rcu_access_pointer(fdt->fd[fd]) != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
}
/**
* fd_install - install a file pointer in the fd array
* @fd: file descriptor to install the file in
@ -658,14 +686,9 @@ void fd_install(unsigned int fd, struct file *file)
return;
rcu_read_lock_sched();
if (unlikely(files->resize_in_progress)) {
rcu_read_unlock_sched();
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
VFS_BUG_ON(rcu_access_pointer(fdt->fd[fd]) != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
fd_install_slowpath(fd, file);
return;
}
/* coupled with smp_wmb() in expand_fdtable() */
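With the slow path moved out of line, the fast path of fd_install() stays compact. A sketch of its overall shape after the change (context lines the hunk elides are paraphrased, not verbatim):

        void fd_install(unsigned int fd, struct file *file)
        {
                struct files_struct *files = current->files;
                struct fdtable *fdt;

                rcu_read_lock_sched();
                if (unlikely(files->resize_in_progress)) {
                        rcu_read_unlock_sched();
                        fd_install_slowpath(fd, file);
                        return;
                }
                /* coupled with smp_wmb() in expand_fdtable() */
                smp_rmb();
                fdt = rcu_dereference_sched(files->fdt);
                rcu_assign_pointer(fdt->fd[fd], file);
                rcu_read_unlock_sched();
        }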


@ -316,7 +316,6 @@ int ioctl_getflags(struct file *file, unsigned int __user *argp)
err = put_user(fa.flags, argp);
return err;
}
EXPORT_SYMBOL(ioctl_getflags);
int ioctl_setflags(struct file *file, unsigned int __user *argp)
{
@ -337,7 +336,6 @@ int ioctl_setflags(struct file *file, unsigned int __user *argp)
}
return err;
}
EXPORT_SYMBOL(ioctl_setflags);
int ioctl_fsgetxattr(struct file *file, void __user *argp)
{
@ -350,7 +348,6 @@ int ioctl_fsgetxattr(struct file *file, void __user *argp)
return err;
}
EXPORT_SYMBOL(ioctl_fsgetxattr);
int ioctl_fssetxattr(struct file *file, void __user *argp)
{
@ -369,7 +366,6 @@ int ioctl_fssetxattr(struct file *file, void __user *argp)
}
return err;
}
EXPORT_SYMBOL(ioctl_fssetxattr);
SYSCALL_DEFINE5(file_getattr, int, dfd, const char __user *, filename,
struct file_attr __user *, ufattr, size_t, usize,


@ -258,7 +258,7 @@ vxfs_iget(struct super_block *sbp, ino_t ino)
ip = iget_locked(sbp, ino);
if (!ip)
return ERR_PTR(-ENOMEM);
if (!(ip->i_state & I_NEW))
if (!(inode_state_read_once(ip) & I_NEW))
return ip;
vip = VXFS_INO(ip);


@ -14,6 +14,7 @@
* Additions for address_space-based writeback
*/
#include <linux/sched/sysctl.h>
#include <linux/kernel.h>
#include <linux/export.h>
#include <linux/spinlock.h>
@ -31,11 +32,6 @@
#include <linux/memcontrol.h>
#include "internal.h"
/*
* 4MB minimal write chunk size
*/
#define MIN_WRITEBACK_PAGES (4096UL >> (PAGE_SHIFT - 10))
/*
* Passed into wb_writeback(), essentially a subset of writeback_control
*/
@ -121,7 +117,7 @@ static bool inode_io_list_move_locked(struct inode *inode,
{
assert_spin_locked(&wb->list_lock);
assert_spin_locked(&inode->i_lock);
WARN_ON_ONCE(inode->i_state & I_FREEING);
WARN_ON_ONCE(inode_state_read(inode) & I_FREEING);
list_move(&inode->i_io_list, head);
@ -200,6 +196,19 @@ static void wb_queue_work(struct bdi_writeback *wb,
spin_unlock_irq(&wb->work_lock);
}
static bool wb_wait_for_completion_cb(struct wb_completion *done)
{
unsigned long waited_secs = (jiffies - done->wait_start) / HZ;
done->progress_stamp = jiffies;
if (waited_secs > sysctl_hung_task_timeout_secs)
pr_info("INFO: The task %s:%d has been waiting for writeback "
"completion for more than %lu seconds.",
current->comm, current->pid, waited_secs);
return !atomic_read(&done->cnt);
}
/**
* wb_wait_for_completion - wait for completion of bdi_writeback_works
* @done: target wb_completion
@ -212,8 +221,9 @@ static void wb_queue_work(struct bdi_writeback *wb,
*/
void wb_wait_for_completion(struct wb_completion *done)
{
done->wait_start = jiffies;
atomic_dec(&done->cnt); /* put down the initial count */
wait_event(*done->waitq, !atomic_read(&done->cnt));
wait_event(*done->waitq, wb_wait_for_completion_cb(done));
}
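The waiter now funnels through a predicate that doubles as a progress report; writeback_sb_inodes() (further down in this diff) wakes the waitqueue periodically so the callback re-runs even while work is still pending. The fields it touches are presumably additions to struct wb_completion along these lines:

        /* Assumed additions to struct wb_completion (names from the hunks):
         *      unsigned long wait_start;       jiffies when waiting began
         *      unsigned long progress_stamp;   jiffies of last predicate run
         */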
#ifdef CONFIG_CGROUP_WRITEBACK
@ -304,9 +314,9 @@ static void inode_cgwb_move_to_attached(struct inode *inode,
{
assert_spin_locked(&wb->list_lock);
assert_spin_locked(&inode->i_lock);
WARN_ON_ONCE(inode->i_state & I_FREEING);
WARN_ON_ONCE(inode_state_read(inode) & I_FREEING);
inode->i_state &= ~I_SYNC_QUEUED;
inode_state_clear(inode, I_SYNC_QUEUED);
if (wb != &wb->bdi->wb)
list_move(&inode->i_io_list, &wb->b_attached);
else
@ -408,7 +418,7 @@ static bool inode_do_switch_wbs(struct inode *inode,
* Once I_FREEING or I_WILL_FREE are visible under i_lock, the eviction
* path owns the inode and we shouldn't modify ->i_io_list.
*/
if (unlikely(inode->i_state & (I_FREEING | I_WILL_FREE)))
if (unlikely(inode_state_read(inode) & (I_FREEING | I_WILL_FREE)))
goto skip_switch;
trace_inode_switch_wbs(inode, old_wb, new_wb);
@ -451,7 +461,7 @@ static bool inode_do_switch_wbs(struct inode *inode,
if (!list_empty(&inode->i_io_list)) {
inode->i_wb = new_wb;
if (inode->i_state & I_DIRTY_ALL) {
if (inode_state_read(inode) & I_DIRTY_ALL) {
/*
* We need to keep b_dirty list sorted by
* dirtied_time_when. However properly sorting the
@ -476,10 +486,11 @@ static bool inode_do_switch_wbs(struct inode *inode,
switched = true;
skip_switch:
/*
* Paired with load_acquire in unlocked_inode_to_wb_begin() and
* Paired with an acquire fence in unlocked_inode_to_wb_begin() and
* ensures that the new wb is visible if they see !I_WB_SWITCH.
*/
smp_store_release(&inode->i_state, inode->i_state & ~I_WB_SWITCH);
smp_wmb();
inode_state_clear(inode, I_WB_SWITCH);
xa_unlock_irq(&mapping->i_pages);
spin_unlock(&inode->i_lock);
@ -600,12 +611,12 @@ static bool inode_prepare_wbs_switch(struct inode *inode,
/* while holding I_WB_SWITCH, no one else can update the association */
spin_lock(&inode->i_lock);
if (!(inode->i_sb->s_flags & SB_ACTIVE) ||
inode->i_state & (I_WB_SWITCH | I_FREEING | I_WILL_FREE) ||
inode_state_read(inode) & (I_WB_SWITCH | I_FREEING | I_WILL_FREE) ||
inode_to_wb(inode) == new_wb) {
spin_unlock(&inode->i_lock);
return false;
}
inode->i_state |= I_WB_SWITCH;
inode_state_set(inode, I_WB_SWITCH);
__iget(inode);
spin_unlock(&inode->i_lock);
@ -635,7 +646,7 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
struct bdi_writeback *new_wb = NULL;
/* noop if seems to be already in progress */
if (inode->i_state & I_WB_SWITCH)
if (inode_state_read_once(inode) & I_WB_SWITCH)
return;
/* avoid queueing a new switch if too many are already in flight */
@ -807,9 +818,9 @@ static void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
* @wbc: writeback_control of interest
* @inode: target inode
*
* This function is to be used by __filemap_fdatawrite_range(), which is an
* alternative entry point into writeback code, and first ensures @inode is
* associated with a bdi_writeback and attaches it to @wbc.
* This function is to be used by filemap_writeback(), which is an alternative
* entry point into writeback code, and first ensures @inode is associated with
* a bdi_writeback and attaches it to @wbc.
*/
void wbc_attach_fdatawrite_inode(struct writeback_control *wbc,
struct inode *inode)
@ -1236,9 +1247,9 @@ static void inode_cgwb_move_to_attached(struct inode *inode,
{
assert_spin_locked(&wb->list_lock);
assert_spin_locked(&inode->i_lock);
WARN_ON_ONCE(inode->i_state & I_FREEING);
WARN_ON_ONCE(inode_state_read(inode) & I_FREEING);
inode->i_state &= ~I_SYNC_QUEUED;
inode_state_clear(inode, I_SYNC_QUEUED);
list_del_init(&inode->i_io_list);
wb_io_lists_depopulated(wb);
}
@ -1348,10 +1359,17 @@ void inode_io_list_del(struct inode *inode)
{
struct bdi_writeback *wb;
/*
* FIXME: ext4 can call here from ext4_evict_inode() after evict() already
* unlinked the inode.
*/
if (list_empty_careful(&inode->i_io_list))
return;
wb = inode_to_wb_and_lock_list(inode);
spin_lock(&inode->i_lock);
inode->i_state &= ~I_SYNC_QUEUED;
inode_state_clear(inode, I_SYNC_QUEUED);
list_del_init(&inode->i_io_list);
wb_io_lists_depopulated(wb);
@ -1409,13 +1427,13 @@ static void redirty_tail_locked(struct inode *inode, struct bdi_writeback *wb)
{
assert_spin_locked(&inode->i_lock);
inode->i_state &= ~I_SYNC_QUEUED;
inode_state_clear(inode, I_SYNC_QUEUED);
/*
* When the inode is being freed just don't bother with dirty list
* tracking. Flush worker will ignore this inode anyway and it will
* trigger assertions in inode_io_list_move_locked().
*/
if (inode->i_state & I_FREEING) {
if (inode_state_read(inode) & I_FREEING) {
list_del_init(&inode->i_io_list);
wb_io_lists_depopulated(wb);
return;
@ -1449,9 +1467,9 @@ static void inode_sync_complete(struct inode *inode)
{
assert_spin_locked(&inode->i_lock);
inode->i_state &= ~I_SYNC;
inode_state_clear(inode, I_SYNC);
/* If inode is clean and unused, put it into LRU now... */
inode_add_lru(inode);
inode_lru_list_add(inode);
/* Called with inode->i_lock which ensures memory ordering. */
inode_wake_up_bit(inode, __I_SYNC);
}
@ -1493,7 +1511,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
spin_lock(&inode->i_lock);
list_move(&inode->i_io_list, &tmp);
moved++;
inode->i_state |= I_SYNC_QUEUED;
inode_state_set(inode, I_SYNC_QUEUED);
spin_unlock(&inode->i_lock);
if (sb_is_blkdev_sb(inode->i_sb))
continue;
@ -1579,14 +1597,14 @@ void inode_wait_for_writeback(struct inode *inode)
assert_spin_locked(&inode->i_lock);
if (!(inode->i_state & I_SYNC))
if (!(inode_state_read(inode) & I_SYNC))
return;
wq_head = inode_bit_waitqueue(&wqe, inode, __I_SYNC);
for (;;) {
prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
/* Checking I_SYNC with inode->i_lock guarantees memory ordering. */
if (!(inode->i_state & I_SYNC))
if (!(inode_state_read(inode) & I_SYNC))
break;
spin_unlock(&inode->i_lock);
schedule();
@ -1612,7 +1630,7 @@ static void inode_sleep_on_writeback(struct inode *inode)
wq_head = inode_bit_waitqueue(&wqe, inode, __I_SYNC);
prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
/* Checking I_SYNC with inode->i_lock guarantees memory ordering. */
sleep = !!(inode->i_state & I_SYNC);
sleep = !!(inode_state_read(inode) & I_SYNC);
spin_unlock(&inode->i_lock);
if (sleep)
schedule();
@ -1631,7 +1649,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
struct writeback_control *wbc,
unsigned long dirtied_before)
{
if (inode->i_state & I_FREEING)
if (inode_state_read(inode) & I_FREEING)
return;
/*
@ -1639,7 +1657,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
* shot. If still dirty, it will be redirty_tail()'ed below. Update
* the dirty time to prevent enqueue and sync it again.
*/
if ((inode->i_state & I_DIRTY) &&
if ((inode_state_read(inode) & I_DIRTY) &&
(wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
inode->dirtied_when = jiffies;
@ -1650,7 +1668,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
* is odd for clean inodes, it can happen for some
* filesystems so handle that gracefully.
*/
if (inode->i_state & I_DIRTY_ALL)
if (inode_state_read(inode) & I_DIRTY_ALL)
redirty_tail_locked(inode, wb);
else
inode_cgwb_move_to_attached(inode, wb);
@ -1676,17 +1694,17 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
*/
redirty_tail_locked(inode, wb);
}
} else if (inode->i_state & I_DIRTY) {
} else if (inode_state_read(inode) & I_DIRTY) {
/*
* Filesystems can dirty the inode during writeback operations,
* such as delayed allocation during submission or metadata
* updates after data IO completion.
*/
redirty_tail_locked(inode, wb);
} else if (inode->i_state & I_DIRTY_TIME) {
} else if (inode_state_read(inode) & I_DIRTY_TIME) {
inode->dirtied_when = jiffies;
inode_io_list_move_locked(inode, wb, &wb->b_dirty_time);
inode->i_state &= ~I_SYNC_QUEUED;
inode_state_clear(inode, I_SYNC_QUEUED);
} else {
/* The inode is clean. Remove from writeback lists. */
inode_cgwb_move_to_attached(inode, wb);
@ -1712,7 +1730,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
unsigned dirty;
int ret;
WARN_ON(!(inode->i_state & I_SYNC));
WARN_ON(!(inode_state_read_once(inode) & I_SYNC));
trace_writeback_single_inode_start(inode, wbc, nr_to_write);
@ -1736,7 +1754,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* mark_inode_dirty_sync() to notify the filesystem about it and to
* change I_DIRTY_TIME into I_DIRTY_SYNC.
*/
if ((inode->i_state & I_DIRTY_TIME) &&
if ((inode_state_read_once(inode) & I_DIRTY_TIME) &&
(wbc->sync_mode == WB_SYNC_ALL ||
time_after(jiffies, inode->dirtied_time_when +
dirtytime_expire_interval * HZ))) {
@ -1751,8 +1769,8 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* after handling timestamp expiration, as that may dirty the inode too.
*/
spin_lock(&inode->i_lock);
dirty = inode->i_state & I_DIRTY;
inode->i_state &= ~dirty;
dirty = inode_state_read(inode) & I_DIRTY;
inode_state_clear(inode, dirty);
/*
* Paired with smp_mb() in __mark_inode_dirty(). This allows
@ -1768,10 +1786,10 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
smp_mb();
if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
inode->i_state |= I_DIRTY_PAGES;
else if (unlikely(inode->i_state & I_PINNING_NETFS_WB)) {
if (!(inode->i_state & I_DIRTY_PAGES)) {
inode->i_state &= ~I_PINNING_NETFS_WB;
inode_state_set(inode, I_DIRTY_PAGES);
else if (unlikely(inode_state_read(inode) & I_PINNING_NETFS_WB)) {
if (!(inode_state_read(inode) & I_DIRTY_PAGES)) {
inode_state_clear(inode, I_PINNING_NETFS_WB);
wbc->unpinned_netfs_wb = true;
dirty |= I_PINNING_NETFS_WB; /* Cause write_inode */
}
@ -1807,11 +1825,11 @@ static int writeback_single_inode(struct inode *inode,
spin_lock(&inode->i_lock);
if (!icount_read(inode))
WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
WARN_ON(!(inode_state_read(inode) & (I_WILL_FREE | I_FREEING)));
else
WARN_ON(inode->i_state & I_WILL_FREE);
WARN_ON(inode_state_read(inode) & I_WILL_FREE);
if (inode->i_state & I_SYNC) {
if (inode_state_read(inode) & I_SYNC) {
/*
* Writeback is already running on the inode. For WB_SYNC_NONE,
* that's enough and we can just return. For WB_SYNC_ALL, we
@ -1822,7 +1840,7 @@ static int writeback_single_inode(struct inode *inode,
goto out;
inode_wait_for_writeback(inode);
}
WARN_ON(inode->i_state & I_SYNC);
WARN_ON(inode_state_read(inode) & I_SYNC);
/*
* If the inode is already fully clean, then there's nothing to do.
*
@ -1830,11 +1848,11 @@ static int writeback_single_inode(struct inode *inode,
* still under writeback, e.g. due to prior WB_SYNC_NONE writeback. If
* there are any such pages, we'll need to wait for them.
*/
if (!(inode->i_state & I_DIRTY_ALL) &&
if (!(inode_state_read(inode) & I_DIRTY_ALL) &&
(wbc->sync_mode != WB_SYNC_ALL ||
!mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK)))
goto out;
inode->i_state |= I_SYNC;
inode_state_set(inode, I_SYNC);
wbc_attach_and_unlock_inode(wbc, inode);
ret = __writeback_single_inode(inode, wbc);
@ -1847,18 +1865,18 @@ static int writeback_single_inode(struct inode *inode,
* If the inode is freeing, its i_io_list shouldn't be updated
* as it can be finally deleted at this moment.
*/
if (!(inode->i_state & I_FREEING)) {
if (!(inode_state_read(inode) & I_FREEING)) {
/*
* If the inode is now fully clean, then it can be safely
* removed from its writeback list (if any). Otherwise the
* flusher threads are responsible for the writeback lists.
*/
if (!(inode->i_state & I_DIRTY_ALL))
if (!(inode_state_read(inode) & I_DIRTY_ALL))
inode_cgwb_move_to_attached(inode, wb);
else if (!(inode->i_state & I_SYNC_QUEUED)) {
if ((inode->i_state & I_DIRTY))
else if (!(inode_state_read(inode) & I_SYNC_QUEUED)) {
if ((inode_state_read(inode) & I_DIRTY))
redirty_tail_locked(inode, wb);
else if (inode->i_state & I_DIRTY_TIME) {
else if (inode_state_read(inode) & I_DIRTY_TIME) {
inode->dirtied_when = jiffies;
inode_io_list_move_locked(inode,
wb,
@ -1874,8 +1892,8 @@ static int writeback_single_inode(struct inode *inode,
return ret;
}
static long writeback_chunk_size(struct bdi_writeback *wb,
struct wb_writeback_work *work)
static long writeback_chunk_size(struct super_block *sb,
struct bdi_writeback *wb, struct wb_writeback_work *work)
{
long pages;
@ -1893,16 +1911,13 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
* (maybe slowly) sync all tagged pages
*/
if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
pages = LONG_MAX;
else {
pages = min(wb->avg_write_bandwidth / 2,
global_wb_domain.dirty_limit / DIRTY_SCOPE);
pages = min(pages, work->nr_pages);
pages = round_down(pages + MIN_WRITEBACK_PAGES,
MIN_WRITEBACK_PAGES);
}
return LONG_MAX;
return pages;
pages = min(wb->avg_write_bandwidth / 2,
global_wb_domain.dirty_limit / DIRTY_SCOPE);
pages = min(pages, work->nr_pages);
return round_down(pages + sb->s_min_writeback_pages,
sb->s_min_writeback_pages);
}
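The old global floor was MIN_WRITEBACK_PAGES, i.e. 4 MB expressed in pages (with 4 KiB pages: 4096UL >> (12 - 10) = 1024 pages); the new code takes the floor and rounding unit from the superblock instead. The round_down(pages + unit, unit) idiom guarantees at least one full chunk. For example, assuming sb->s_min_writeback_pages == 1024:

        /* pages = 100:   round_down(100 + 1024, 1024)  = 1024  (floor applies)
         * pages = 5000:  round_down(5000 + 1024, 1024) = 5120  (whole chunks)
         */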
/*
@ -1967,12 +1982,12 @@ static long writeback_sb_inodes(struct super_block *sb,
* kind writeout is handled by the freer.
*/
spin_lock(&inode->i_lock);
if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
if (inode_state_read(inode) & (I_NEW | I_FREEING | I_WILL_FREE)) {
redirty_tail_locked(inode, wb);
spin_unlock(&inode->i_lock);
continue;
}
if ((inode->i_state & I_SYNC) && wbc.sync_mode != WB_SYNC_ALL) {
if ((inode_state_read(inode) & I_SYNC) && wbc.sync_mode != WB_SYNC_ALL) {
/*
* If this inode is locked for writeback and we are not
* doing writeback-for-data-integrity, move it to
@ -1994,17 +2009,17 @@ static long writeback_sb_inodes(struct super_block *sb,
* are doing WB_SYNC_NONE writeback. So this catches only the
* WB_SYNC_ALL case.
*/
if (inode->i_state & I_SYNC) {
if (inode_state_read(inode) & I_SYNC) {
/* Wait for I_SYNC. This function drops i_lock... */
inode_sleep_on_writeback(inode);
/* Inode may be gone, start again */
spin_lock(&wb->list_lock);
continue;
}
inode->i_state |= I_SYNC;
inode_state_set(inode, I_SYNC);
wbc_attach_and_unlock_inode(&wbc, inode);
write_chunk = writeback_chunk_size(wb, work);
write_chunk = writeback_chunk_size(inode->i_sb, wb, work);
wbc.nr_to_write = write_chunk;
wbc.pages_skipped = 0;
@ -2014,6 +2029,12 @@ static long writeback_sb_inodes(struct super_block *sb,
*/
__writeback_single_inode(inode, &wbc);
/* Report progress so the hung task detector sees writeback advancing. */
if (work->done && work->done->progress_stamp &&
(jiffies - work->done->progress_stamp) > HZ *
sysctl_hung_task_timeout_secs / 2)
wake_up_all(work->done->waitq);
wbc_detach_inode(&wbc);
work->nr_pages -= write_chunk - wbc.nr_to_write;
wrote = write_chunk - wbc.nr_to_write - wbc.pages_skipped;
@ -2039,7 +2060,7 @@ static long writeback_sb_inodes(struct super_block *sb,
*/
tmp_wb = inode_to_wb_and_lock_list(inode);
spin_lock(&inode->i_lock);
if (!(inode->i_state & I_DIRTY_ALL))
if (!(inode_state_read(inode) & I_DIRTY_ALL))
total_wrote++;
requeue_inode(inode, tmp_wb, &wbc, dirtied_before);
inode_sync_complete(inode);
@ -2545,10 +2566,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* We tell ->dirty_inode callback that timestamps need to
* be updated by setting I_DIRTY_TIME in flags.
*/
if (inode->i_state & I_DIRTY_TIME) {
if (inode_state_read_once(inode) & I_DIRTY_TIME) {
spin_lock(&inode->i_lock);
if (inode->i_state & I_DIRTY_TIME) {
inode->i_state &= ~I_DIRTY_TIME;
if (inode_state_read(inode) & I_DIRTY_TIME) {
inode_state_clear(inode, I_DIRTY_TIME);
flags |= I_DIRTY_TIME;
}
spin_unlock(&inode->i_lock);
@ -2585,16 +2606,16 @@ void __mark_inode_dirty(struct inode *inode, int flags)
*/
smp_mb();
if ((inode->i_state & flags) == flags)
if ((inode_state_read_once(inode) & flags) == flags)
return;
spin_lock(&inode->i_lock);
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;
if ((inode_state_read(inode) & flags) != flags) {
const int was_dirty = inode_state_read(inode) & I_DIRTY;
inode_attach_wb(inode, NULL);
inode->i_state |= flags;
inode_state_set(inode, flags);
/*
* Grab inode's wb early because it requires dropping i_lock and we
@ -2613,7 +2634,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* the inode it will place it on the appropriate superblock
* list, based upon its state.
*/
if (inode->i_state & I_SYNC_QUEUED)
if (inode_state_read(inode) & I_SYNC_QUEUED)
goto out_unlock;
/*
@ -2624,7 +2645,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (inode_unhashed(inode))
goto out_unlock;
}
if (inode->i_state & I_FREEING)
if (inode_state_read(inode) & I_FREEING)
goto out_unlock;
/*
@ -2639,7 +2660,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (dirtytime)
inode->dirtied_time_when = jiffies;
if (inode->i_state & I_DIRTY)
if (inode_state_read(inode) & I_DIRTY)
dirty_list = &wb->b_dirty;
else
dirty_list = &wb->b_dirty_time;
@ -2736,7 +2757,7 @@ static void wait_sb_inodes(struct super_block *sb)
spin_unlock_irq(&sb->s_inode_wblist_lock);
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE | I_NEW)) {
spin_unlock(&inode->i_lock);
spin_lock_irq(&sb->s_inode_wblist_lock);

Some files were not shown because too many files have changed in this diff.