Commit Graph

922 Commits

Author SHA1 Message Date
Linus Torvalds e64aeecbbb mount-related stuff for this cycle
* saner handling of guards in fs/namespace.c, getting
 rid of needlessly strong locking in some of the users.
 
 	* lock_mount() calling conventions change - have it set
 the environment for attaching to given location, storing the
 results in caller-supplied object, without altering the passed
 struct path.  Make unlock_mount() called as __cleanup for those
 objects.  It's not exactly guard(), but similar to it.
 
 	* MNT_WRITE_HOLD done right - mnt_hold_writers() does *not*
 mess with ->mnt_flags anymore, so insertion of a new mount into
 ->s_mounts of underlying superblock does not, in itself, expose
 ->mnt_flags of that mount to concurrent modifications.
 
 	* getting rid of pathological cases when umount() spends
 quadratic time removing the victims from propagation graph -
 part of that had been dealt with last cycle, this should finish
 it.
 
 	* a bunch of stuff constified.
 
 	* assorted cleanups.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Merge tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs mount updates from Al Viro:
 "Several piles this cycle, this mount-related one being the largest and
  trickiest:

   - saner handling of guards in fs/namespace.c, getting rid of
     needlessly strong locking in some of the users

   - lock_mount() calling conventions change - have it set the
     environment for attaching to given location, storing the results in
     caller-supplied object, without altering the passed struct path.

     Make unlock_mount() called as __cleanup for those objects. It's not
     exactly guard(), but similar to it

   - MNT_WRITE_HOLD done right.

     mnt_hold_writers() does *not* mess with ->mnt_flags anymore, so
     insertion of a new mount into ->s_mounts of underlying superblock
     does not, in itself, expose ->mnt_flags of that mount to concurrent
     modifications

   - getting rid of pathological cases when umount() spends quadratic
     time removing the victims from propagation graph - part of that had
     been dealt with last cycle, this should finish it

   - a bunch of stuff constified

   - assorted cleanups

* tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
  constify {__,}mnt_is_readonly()
  WRITE_HOLD machinery: no need for to bump mount_lock seqcount
  struct mount: relocate MNT_WRITE_HOLD bit
  preparations to taking MNT_WRITE_HOLD out of ->mnt_flags
  setup_mnt(): primitive for connecting a mount to filesystem
  simplify the callers of mnt_unhold_writers()
  copy_mnt_ns(): use guards
  copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure
  open_detached_copy(): separate creation of namespace into helper
  open_detached_copy(): don't bother with mount_lock_hash()
  path_has_submounts(): use guard(mount_locked_reader)
  fs/namespace.c: sanitize descriptions for {__,}lookup_mnt()
  ecryptfs: get rid of pointless mount references in ecryptfs dentries
  umount_tree(): take all victims out of propagation graph at once
  do_mount(): use __free(path_put)
  do_move_mount_old(): use __free(path_put)
  constify can_move_mount_beneath() arguments
  path_umount(): constify struct path argument
  may_copy_tree(), __do_loopback(): constify struct path argument
  path_mount(): constify struct path argument
  ...
2025-10-03 10:19:44 -07:00
Linus Torvalds 18b19abc37 namespace-6.18-rc1

Merge tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull namespace updates from Christian Brauner:
 "This contains a larger set of changes around the generic namespace
  infrastructure of the kernel.

  Each specific namespace type (net, cgroup, mnt, ...) embeds a struct
  ns_common which carries the reference count of the namespace and so
  on.

  We open-coded and cargo-culted so many quirks for each namespace type
  that it just wasn't scalable anymore. So given there's a bunch of new
  changes coming in that area I've started cleaning all of this up.

  The core change is to make it possible to correctly initialize every
  namespace uniformly and derive the correct initialization settings
  from the type of the namespace such as namespace operations, namespace
  type and so on. This leaves the new ns_common_init() function with a
  single parameter which is the specific namespace type which derives
  the correct parameters statically. This also means the compiler will
  yell as soon as someone does something remotely fishy.

  The ns_common_init() addition also allows us to remove ns_alloc_inum()
  and drops any special-casing of the initial network namespace in the
  network namespace initialization code that Linus complained about.

  Another part is reworking the reference counting. The reference
  counting was open-coded and copy-pasted for each namespace type even
  though they all followed the same rules. This also removes all open
  accesses to the reference count and makes it private and only uses a
  very small set of dedicated helpers to manipulate them just like we do
  for e.g., files.

  In addition this generalizes the mount namespace iteration
  infrastructure introduced a few cycles ago. As a reminder, the vfs makes
  it possible to iterate sequentially and bidirectionally through all
  mount namespaces on the system or all mount namespaces that the caller
  holds privilege over. This allows userspace to iterate over all mounts
  in all mount namespaces using the listmount() and statmount() system
  calls.

  Each mount namespace has a unique identifier for the lifetime of the
  system that is exposed to userspace. The network namespace also has a
  unique identifier working exactly the same way. This extends the
  concept to all other namespace types.

  The new nstree type makes it possible to lookup namespaces purely by
  their identifier and to walk the namespace list sequentially and
  bidirectionally for all namespace types, allowing userspace to iterate
  through all namespaces. Looking up namespaces in the namespace tree
  works completely locklessly.

  This also means we can move the mount namespace onto the generic
  infrastructure and remove a bunch of code and members from struct
  mnt_namespace itself.

  There's a bunch of stuff coming on top of this in the future but for
  now this uses the generic namespace tree to extend a concept
  introduced first for pidfs a few cycles ago. For a while now we have
  supported pidfs file handles for pidfds. This has proven to be very
  useful.

  This extends the concept to cover namespaces as well. It is possible
  to encode and decode namespace file handles using the common
  name_to_handle_at() and open_by_handle_at() apis.

  As with pidfs file handles, namespace file handles are exhaustive,
  meaning it is not required to actually hold a reference to nsfs in
  order to decode aka open_by_handle_at() a namespace file handle.
  Instead the FD_NSFS_ROOT constant can be passed which will let the
  kernel grab a reference to the root of nsfs internally and thus decode
  the file handle.

  Namespace file descriptors can already be derived from pidfds which
  means they aren't subject to overmount protection bugs. IOW, it's
  irrelevant if the caller would not have access to an appropriate
  /proc/<pid>/ns/ directory as they could always just derive the
  namespace based on a pidfd already.

  It has the same advantage as pidfds. It's possible to reliably and for
  the lifetime of the system refer to a namespace without pinning any
  resources and to compare them trivially.

  Permission checking is kept simple. If the caller is located in the
  namespace the file handle refers to, they are able to open it; otherwise
  they must hold privilege over the owning namespace of the relevant
  namespace.

  The namespace file handle layout is exposed as uapi and has a stable
  and extensible format. For now it simply contains the namespace
  identifier, the namespace type, and the inode number. The stable
  format means that userspace may construct its own namespace file
  handles without going through name_to_handle_at() as they are already
  allowed for pidfs and cgroup file handles"

* tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits)
  ns: drop assert
  ns: move ns type into struct ns_common
  nstree: make struct ns_tree private
  ns: add ns_debug()
  ns: simplify ns_common_init() further
  cgroup: add missing ns_common include
  ns: use inode initializer for initial namespaces
  selftests/namespaces: verify initial namespace inode numbers
  ns: rename to __ns_ref
  nsfs: port to ns_ref_*() helpers
  net: port to ns_ref_*() helpers
  uts: port to ns_ref_*() helpers
  ipv4: use check_net()
  net: use check_net()
  net-sysfs: use check_net()
  user: port to ns_ref_*() helpers
  time: port to ns_ref_*() helpers
  pid: port to ns_ref_*() helpers
  ipc: port to ns_ref_*() helpers
  cgroup: port to ns_ref_*() helpers
  ...
2025-09-29 11:20:29 -07:00
Linus Torvalds 722df25ddf kernel-6.18-rc1.clone3

Merge tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull copy_process updates from Christian Brauner:
 "This contains the changes to enable support for clone3() on nios2
  which apparently is still a thing.

  The more exciting part of this is that it cleans up the inconsistency
  in how the 64-bit flag argument is passed from copy_process() into the
  various other copy_*() helpers"
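
The inconsistency matters because clone3() defines flags above bit 31, and those bits are lost whenever the flags pass through a 32-bit unsigned long. A minimal userspace sketch of that failure mode, assuming uint32_t as a stand-in for a 32-bit unsigned long. Both helper names are hypothetical; the flag value is the one given for CLONE_INTO_CGROUP in the uapi header <linux/sched.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as CLONE_INTO_CGROUP in <linux/sched.h>: a clone3()-only flag. */
#define SKETCH_CLONE_INTO_CGROUP 0x200000000ULL

/* Hypothetical helper taking the flags as a 32-bit quantity. */
static bool wants_cgroup_narrow(uint32_t clone_flags)
{
	/* The upper 32 bits were already dropped by the caller's conversion. */
	return (clone_flags & SKETCH_CLONE_INTO_CGROUP) != 0;
}

/* Hypothetical helper taking the flags as u64, as the series does. */
static bool wants_cgroup_wide(uint64_t clone_flags)
{
	return (clone_flags & SKETCH_CLONE_INTO_CGROUP) != 0;
}

/* Self-check: the same flag word answers differently once truncated. */
static void check_truncation(void)
{
	uint64_t flags = SKETCH_CLONE_INTO_CGROUP | 0x11;

	assert(wants_cgroup_wide(flags));
	assert(!wants_cgroup_narrow((uint32_t)flags));
}
```

Passing a u64 down the whole call tree removes the need for every helper to know which flags live in the upper half.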

[ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ]

* tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  nios2: implement architecture-specific portion of sys_clone3
  arch: copy_thread: pass clone_flags as u64
  copy_process: pass clone_flags as u64 across calltree
  copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
2025-09-29 10:36:50 -07:00
Linus Torvalds 3a2a5b278f vfs-6.18-rc1.mount

Merge tag 'vfs-6.18-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs mount updates from Christian Brauner:
 "This contains some work around mount api handling:

   - Output the warning message for mnt_too_revealing() triggered during
     fsmount() to the fscontext log. This makes it possible for the
     mount tool to output appropriate warnings on the command line.

     For example, with the newest fsopen()-based mount(8) from
     util-linux, the error messages now look like:

       # mount -t proc proc /tmp
       mount: /tmp: fsmount() failed: VFS: Mount too revealing.
              dmesg(1) may have more information after failed mount system call.

   - Do not consume fscontext log entries when returning -EMSGSIZE

     Userspace generally expects APIs that return -EMSGSIZE to allow for
     them to adjust their buffer size and retry the operation.

     However, the fscontext log would previously clear the message even
     in the -EMSGSIZE case.

     Given that it is very cheap for us to check whether the buffer is
     too small before we remove the message from the ring buffer, let's
     just do that instead.

   - Drop an unused argument from do_remount()"
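
The -EMSGSIZE item above describes a grow-and-retry contract. A minimal sketch of the userspace side, assuming a hypothetical reader callback so the example does not depend on a particular kernel; in real use the callback would presumably wrap read(2) on the fd returned by fsopen(). The fake_log type and fake_log_read() exist only to model a log that keeps its entry when the buffer is too small.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical reader type: returns bytes copied, or -1 with errno set. */
typedef ssize_t (*log_reader_t)(void *ctx, char *buf, size_t len);

/*
 * Fetch one log entry, doubling the buffer while the reader reports EMSGSIZE.
 * Returns a malloc()ed NUL-terminated string, or NULL on any other failure.
 */
static char *read_log_entry(log_reader_t reader, void *ctx, size_t initial)
{
	size_t cap = initial ? initial : 16;
	char *buf = NULL;

	assert(reader != NULL);
	for (;;) {
		char *tmp = realloc(buf, cap + 1);
		ssize_t n;

		if (!tmp) {
			free(buf);
			return NULL;
		}
		buf = tmp;

		n = reader(ctx, buf, cap);
		if (n >= 0) {
			buf[n] = '\0';
			return buf;
		}
		if (errno != EMSGSIZE) {
			free(buf);
			return NULL;
		}
		cap *= 2;	/* the entry is still queued; retry with more room */
	}
}

/* Stand-in for the kernel side: one pending message, kept on EMSGSIZE. */
struct fake_log {
	const char *msg;	/* NULL once consumed */
	int attempts;
};

static ssize_t fake_log_read(void *ctx, char *buf, size_t len)
{
	struct fake_log *log = ctx;
	size_t need;

	log->attempts++;
	if (!log->msg) {
		errno = ENODATA;
		return -1;
	}
	need = strlen(log->msg);
	if (need > len) {
		errno = EMSGSIZE;	/* message stays queued */
		return -1;
	}
	memcpy(buf, log->msg, need);
	log->msg = NULL;		/* consumed only on success */
	return (ssize_t)need;
}
```

With the old behaviour, the first short read would have discarded the message and the retry would have found nothing to return.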

* tag 'vfs-6.18-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  vfs: fs/namespace.c: remove ms_flags argument from do_remount
  selftests/filesystems: add basic fscontext log tests
  fscontext: do not consume log entries when returning -EMSGSIZE
  vfs: output mount_too_revealing() errors to fscontext
  docs/vfs: Remove mentions to the old mount API helpers
  fscontext: add custom-prefix log helpers
  fs: Remove mount_bdev
  fs: Remove mount_nodev
2025-09-29 09:32:34 -07:00
Christian Brauner 6c7ca6a02f mount: handle NULL values in mnt_ns_release()
When called from listmount(), mnt_ns_release() may be passed a NULL
pointer. Handle that case gracefully.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-09-29 09:08:18 -07:00
Linus Torvalds b7ce6fa90f vfs-6.18-rc1.misc

Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual selections of misc updates for this cycle.

  Features:

   - Add "initramfs_options" parameter to set initramfs mount options.
     This allows adding specific mount options to the rootfs to e.g.,
     limit the memory size

   - Add RWF_NOSIGNAL flag for pwritev2()

     Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
     signal from being raised when writing on disconnected pipes or
     sockets. The flag is handled directly by the pipe filesystem and
     converted to the existing MSG_NOSIGNAL flag for sockets

   - Allow to pass pid namespace as procfs mount option

     Ever since the introduction of pid namespaces, procfs has had very
     implicit behaviour surrounding them (the pidns used by a procfs
     mount is auto-selected based on the mounting process's active
     pidns, and the pidns itself is basically hidden once the mount has
     been constructed)

     This implicit behaviour has historically meant that userspace was
     required to do some special dances in order to configure the pidns
     of a procfs mount as desired. Examples include:

     * In order to bypass the mnt_too_revealing() check, Kubernetes
       creates a procfs mount from an empty pidns so that user
       namespaced containers can be nested (without this, the nested
       containers would fail to mount procfs)

       But this requires forking off a helper process because you cannot
       just one-shot this using mount(2)

     * Container runtimes in general need to fork into a container
       before configuring its mounts, which can lead to security issues
       in the case of shared-pidns containers (a privileged process in
       the pidns can interact with your container runtime process)

       While SUID_DUMP_DISABLE and user namespaces make this less of an
       issue, the strict need for this due to a minor uAPI wart is kind
       of unfortunate

       Things would be much easier if there was a way for userspace to
       just specify the pidns they want. So this pull request contains
       changes to implement a new "pidns" argument which can be set
       using fsconfig(2):

           fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
           fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

       or classic mount(2) / mount(8):

           // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
           mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

  Cleanups:

   - Remove the last references to EXPORT_OP_ASYNC_LOCK

   - Make file_remove_privs_flags() static

   - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used

   - Use try_cmpxchg() in start_dir_add()

   - Use try_cmpxchg() in sb_init_done_wq()

   - Replace offsetof() with struct_size() in ioctl_file_dedupe_range()

   - Remove vfs_ioctl() export

   - Replace rwlock() with spinlock in epoll code as rwlock causes
     priority inversion on preempt rt kernels

   - Make ns_entries in fs/proc/namespaces const

   - Use a switch() statement in init_special_inode() just like we do
     in may_open()

   - Use struct_size() in dir_add() in the initramfs code

   - Use str_plural() in rd_load_image()

   - Replace strcpy() with strscpy() in find_link()

   - Rename generic_delete_inode() to inode_just_drop() and
     generic_drop_inode() to inode_generic_drop()

   - Remove unused arguments from fcntl_{g,s}et_rw_hint()

  Fixes:

   - Document @name parameter for name_contains_dotdot() helper

   - Fix spelling mistake

   - Always return zero from replace_fd() instead of the file descriptor
     number

   - Limit the size for copy_file_range() in compat mode to prevent a
     signed overflow

   - Fix debugfs mount options not being applied

   - Verify the inode mode when loading it from disk in minixfs

   - Verify the inode mode when loading it from disk in cramfs

   - Don't trigger automounts with RESOLVE_NO_XDEV

     If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
     through automounts, but could still trigger them

   - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
     tracepoints

   - Fix unused variable warning in rd_load_image() on s390

   - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD

   - Use ns_capable_noaudit() when determining net sysctl permissions

   - Don't call path_put() under namespace semaphore in listmount() and
     statmount()"

* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
  fcntl: trim arguments
  listmount: don't call path_put() under namespace semaphore
  statmount: don't call path_put() under namespace semaphore
  pid: use ns_capable_noaudit() when determining net sysctl permissions
  fs: rename generic_delete_inode() and generic_drop_inode()
  init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
  initramfs: Replace strcpy() with strscpy() in find_link()
  initrd: Use str_plural() in rd_load_image()
  initramfs: Use struct_size() helper to improve dir_add()
  initrd: Fix unused variable warning in rd_load_image() on s390
  fs: use the switch statement in init_special_inode()
  fs/proc/namespaces: make ns_entries const
  filelock: add FL_RECLAIM to show_fl_flags() macro
  eventpoll: Replace rwlock with spinlock
  selftests/proc: add tests for new pidns APIs
  procfs: add "pidns" mount option
  pidns: move is-ancestor logic to helper
  openat2: don't trigger automounts with RESOLVE_NO_XDEV
  namei: move cross-device check to __traverse_mounts
  namei: remove LOOKUP_NO_XDEV check from handle_mounts
  ...
2025-09-29 09:03:07 -07:00
Christian Brauner c1f86d0ac3
listmount: don't call path_put() under namespace semaphore
Massage listmount() and make sure we don't call path_put() under the
namespace semaphore. If we put the last reference we're fscked.

Fixes: b4c2bea8ce ("add listmount(2) syscall")
Cc: stable@vger.kernel.org # v6.8+
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-26 10:20:29 +02:00
Christian Brauner e8c84e2082
statmount: don't call path_put() under namespace semaphore
Massage statmount() and make sure we don't call path_put() under the
namespace semaphore. If we put the last reference we're fscked.

Fixes: 46eae99ef7 ("add statmount(2) syscall")
Cc: stable@vger.kernel.org # v6.8+
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-26 10:16:06 +02:00
Christian Brauner 4055526d35
ns: move ns type into struct ns_common
It's misplaced in struct proc_ns_operations and ns->ops might be NULL if
the namespace is compiled out but we still want to know the type of the
namespace for the initial namespace struct.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-25 09:23:54 +02:00
Christian Brauner d7610cb745 ns: simplify ns_common_init() further
Simply derive the ns operations from the namespace type.

Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-22 14:47:10 +02:00
Christian Brauner 7cf7303211
ns: use inode initializer for initial namespaces
Just use the common helper we have.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 16:22:38 +02:00
Christian Brauner 024596a4e2
ns: rename to __ns_ref
Make it easier to grep and rename to ns_count.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 16:22:38 +02:00
Christian Brauner 2e9e697227
mnt: port to ns_ref_*() helpers
Stop accessing ns.count directly.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 16:22:36 +02:00
Christian Brauner be5f21d398
ns: add ns_common_free()
And drop ns_free_inum(). Anything common that can be wasted centrally
should be wasted in the new common helper.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 16:22:36 +02:00
Christian Brauner 5612ff3ec5
nscommon: simplify initialization
There's a lot of information that namespace implementers don't need to
know about at all. Encapsulate this all in the initialization helper.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 14:26:19 +02:00
Christian Brauner 86cdbae5c6
mnt: simplify ns_common_init() handling
Assign the reserved MNT_NS_ANON_INO sentinel to anonymous mount
namespaces and cleanup the initial mount ns allocation. This is just a
preparatory patch and the ns->inum check in ns_common_init() will be
dropped in the next patch.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 14:26:18 +02:00
Christian Brauner b2a0b19208
mnt: expose pointer to init_mnt_ns
There's various scenarios where we need to know whether we are in the
initial set of namespaces or not to e.g., shortcut permission checking.
All namespaces expose that information. Let's do that too.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 14:26:18 +02:00
Christian Brauner 7d7d164989
mnt: support ns lookup
Move the mount namespace to the generic ns lookup infrastructure.
This allows us to drop a bunch of members from struct mnt_namespace.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 14:26:15 +02:00
Christian Brauner 7914f15c5e
Merge branch 'no-rebase-mnt_ns_tree_remove'
Bring in the fix for removing a mount namespace from the mount namespace
rbtree and list.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 14:26:14 +02:00
Christian Brauner 96ece8eb67
mnt: use ns_common_init()
Don't cargo-cult the same thing over and over.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 14:26:13 +02:00
Al Viro a797652486 constify {__,}mnt_is_readonly()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 15:58:29 -04:00
Al Viro 1e414adf03 WRITE_HOLD machinery: no need for to bump mount_lock seqcount
... neither for insertion into the list of instances, nor for
mnt_{un,}hold_writers(), nor for mnt_get_write_access() deciding
to be nice to RT during a busy-wait loop - all of that only needs
the spinlock side of mount_lock.

IOW, it's mount_locked_reader, not mount_writer.

Clarify the comment re locking rules for mnt_unhold_writers() - it's
not just that mount_lock needs to be held when calling that, it must
have been held all along since the matching mnt_hold_writers().

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 15:58:29 -04:00
Al Viro 3371fa2f27 struct mount: relocate MNT_WRITE_HOLD bit
... from ->mnt_flags to LSB of ->mnt_pprev_for_sb.

This is safe - we always set and clear it within the same mount_lock
scope, so we won't interfere with list operations - traversals are
always forward, so they don't even look at ->mnt_pprev_for_sb and
both insertions and removals are in mount_lock scopes of their own,
so that bit will be clear in *all* mount instances during those.
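
A minimal sketch of the pointer-tagging idea, assuming a toy struct mnt with only the two list fields; the helper names are hypothetical and this is not the kernel's code. It relies on a pointer-to-pointer being at least 2-byte aligned, so bit 0 of the stored address is always zero and can carry the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WRITE_HOLD_BIT ((uintptr_t)1)

/* Toy stand-in for struct mount, reduced to the per-superblock linkage. */
struct mnt {
	struct mnt *mnt_next_for_sb;
	struct mnt **mnt_pprev_for_sb;	/* bit 0 doubles as the "write hold" flag */
};

/* Store an untagged back pointer; the low bit must be free. */
static void set_back_pointer(struct mnt *m, struct mnt **pprev)
{
	assert(((uintptr_t)pprev & WRITE_HOLD_BIT) == 0);
	m->mnt_pprev_for_sb = pprev;
}

static void set_write_hold(struct mnt *m)
{
	m->mnt_pprev_for_sb =
		(struct mnt **)((uintptr_t)m->mnt_pprev_for_sb | WRITE_HOLD_BIT);
}

static void clear_write_hold(struct mnt *m)
{
	m->mnt_pprev_for_sb =
		(struct mnt **)((uintptr_t)m->mnt_pprev_for_sb & ~WRITE_HOLD_BIT);
}

static bool test_write_hold(const struct mnt *m)
{
	return ((uintptr_t)m->mnt_pprev_for_sb & WRITE_HOLD_BIT) != 0;
}

/* The real back pointer, with the flag bit masked off. */
static struct mnt **back_pointer(const struct mnt *m)
{
	return (struct mnt **)((uintptr_t)m->mnt_pprev_for_sb & ~WRITE_HOLD_BIT);
}
```

Because the flag is set and cleared inside one lock scope, list insertion and removal in this model always see the untagged pointer, which matches the safety argument above.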

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 15:58:29 -04:00
Al Viro 09a1b33c08 preparations to taking MNT_WRITE_HOLD out of ->mnt_flags
We have an unpleasant wart in accessibility rules for struct mount.  There
are per-superblock lists of mounts, used by sb_prepare_remount_readonly()
to check if any of those is currently claimed for write access and to
block further attempts to get write access on those until we are done.

As soon as it is attached to a filesystem, mount becomes reachable
via that list.  Only sb_prepare_remount_readonly() traverses it and
it only accesses a few members of struct mount.  Unfortunately,
->mnt_flags is one of those and it is modified - MNT_WRITE_HOLD set
and then cleared.  It is done under mount_lock, so from the locking
rules POV everything's fine.

However, it has easily overlooked implications - once mount has been
attached to a filesystem, it has to be treated as globally visible.
In particular, initializing ->mnt_flags *must* be done either prior
to that point or under mount_lock.  All other members are still
private at that point.

Life gets simpler if we move that bit (and that's *all* that can get
touched by access via this list) out of ->mnt_flags.  It's not even
hard to do - currently the list is implemented as list_head one,
anchored in super_block->s_mounts and linked via mount->mnt_instance.

As the first step, switch it to hlist-like open-coded structure -
address of the first mount in the set is stored in ->s_mounts
and ->mnt_instance replaced with ->mnt_next_for_sb and ->mnt_pprev_for_sb -
the former either NULL or pointing to the next mount in set, the
latter - address of either ->s_mounts or ->mnt_next_for_sb in the
previous element of the set.

In the next commit we'll steal the LSB of ->mnt_pprev_for_sb as
replacement for MNT_WRITE_HOLD.
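
The open-coded structure described above can be sketched as follows. This is a simplified userspace model with hypothetical helper names and no locking; only the three field names come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct mount, reduced to the per-superblock linkage. */
struct mnt {
	struct mnt *mnt_next_for_sb;	/* next mount in the set, or NULL */
	struct mnt **mnt_pprev_for_sb;	/* &sb->s_mounts or &prev->mnt_next_for_sb */
};

/* Toy stand-in for struct super_block. */
struct sb {
	struct mnt *s_mounts;		/* first mount in the set, or NULL */
};

/* Insert at the head of the set. */
static void sb_add_mount(struct sb *sb, struct mnt *m)
{
	m->mnt_next_for_sb = sb->s_mounts;
	if (m->mnt_next_for_sb)
		m->mnt_next_for_sb->mnt_pprev_for_sb = &m->mnt_next_for_sb;
	m->mnt_pprev_for_sb = &sb->s_mounts;
	sb->s_mounts = m;
}

/* Unlink without needing the head: pprev says which pointer to patch. */
static void sb_del_mount(struct mnt *m)
{
	assert(m->mnt_pprev_for_sb != NULL);
	*m->mnt_pprev_for_sb = m->mnt_next_for_sb;
	if (m->mnt_next_for_sb)
		m->mnt_next_for_sb->mnt_pprev_for_sb = m->mnt_pprev_for_sb;
	m->mnt_next_for_sb = NULL;
	m->mnt_pprev_for_sb = NULL;
}

/* Forward-only traversal, which never reads mnt_pprev_for_sb. */
static size_t sb_count_mounts(const struct sb *sb)
{
	size_t n = 0;

	for (const struct mnt *m = sb->s_mounts; m; m = m->mnt_next_for_sb)
		n++;
	return n;
}
```

Traversal only follows the forward pointer, which is what leaves the low bit of the back pointer free for other uses.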

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 15:58:29 -04:00
Al Viro 5d132cfafb setup_mnt(): primitive for connecting a mount to filesystem
Take the identical logics in vfs_create_mount() and clone_mnt() into
a new helper that takes an empty struct mount and attaches it to
given dentry (sub)tree.

Should be called once in the lifetime of every mount, prior to making
it visible in any data structures.

After that point ->mnt_root and ->mnt_sb never change; ->mnt_root
is a counting reference to dentry and ->mnt_sb - an active reference
to superblock.

Mount remains associated with that dentry tree all the way until
the call of cleanup_mnt(), when the refcount eventually drops
to zero.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 15:58:28 -04:00
Al Viro 7f954a6f49 simplify the callers of mnt_unhold_writers()
The logics in cleanup on failure in mount_setattr_prepare() is simplified
by having the mnt_hold_writers() failure followed by advancing m to the
next node in the tree before leaving the loop.

And since all calls are preceded by the same check that flag has been set
and the function is inlined, let's just shift the check into it.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 15:58:28 -04:00
Al Viro d7b7253a0a copy_mnt_ns(): use guards
* mntput() of rootmnt and pwdmnt done via __free(mntput)
* mnt_ns_tree_add() can be done within namespace_excl scope.
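
For readers unfamiliar with the pattern: the kernel's __free() and guard() helpers are built on the compiler's cleanup attribute (GCC and Clang). A minimal userspace sketch of scope-based release, assuming a toy refcounted object; every name below is hypothetical and none of it is the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy refcounted object standing in for a mount. */
struct obj {
	int refs;
};

static struct obj *obj_get(struct obj *o)
{
	o->refs++;
	return o;
}

static void obj_put(struct obj *o)
{
	assert(o->refs > 0);
	o->refs--;
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void obj_put_cleanup(struct obj **p)
{
	if (*p)
		obj_put(*p);
}

#define auto_put __attribute__((cleanup(obj_put_cleanup)))

/* Every return path drops the reference; no goto-based unwinding needed. */
static int use_object(struct obj *o, int fail_early)
{
	struct obj *ref auto_put = obj_get(o);

	if (fail_early)
		return -1;	/* obj_put_cleanup(&ref) runs on this path */
	return ref->refs;	/* value is computed first, then cleanup runs */
}
```

The return value is evaluated before the handler runs, so the second path reports the count while the reference is still held.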

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 15:58:28 -04:00
Al Viro 7bb4c851dc copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure
Now that free_mnt_ns() works prior to mnt_ns_tree_add(), there's no need for
an open-coded analogue of free_mnt_ns() there - yes, we do avoid one call_rcu()
use per failing call of clone() or unshare(), if they fail due to OOM in that
particular spot, but it's not really worth bothering.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 15:58:28 -04:00
Al Viro 1b966c4471 Merge branch 'no-rebase-mnt_ns_tree_remove' into work.mount 2025-09-17 15:58:06 -04:00
Al Viro 38f4885088 mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list
Actual removal is done under the lock, but for checking whether we need to
bother the lockless RB_EMPTY_NODE() is safe - either that namespace had never
been added to mnt_ns_tree, in which case the node will stay empty, or
whoever had allocated it has called mnt_ns_tree_add() and it has already
run to completion.  After that point RB_EMPTY_NODE() will become false and
will remain false, no matter what we do with other nodes in the tree.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-16 00:33:37 -04:00
Al Viro 57a7b5b0b6 open_detached_copy(): separate creation of namespace into helper
... and convert the helper to use of a guard(namespace_excl)

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro 71cf10ce45 open_detached_copy(): don't bother with mount_lock_hash()
we are holding namespace_sem and a reference to root of tree;
iterating through that tree does not need mount_lock.  Neither
does the insertion into the rbtree of new namespace or incrementing
the mount count of that namespace.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro 19ac81735c fs/namespace.c: sanitize descriptions for {__,}lookup_mnt()
Comments regarding "shadow mounts" were stale - no such thing anymore.
Document the locking requirements for __lookup_mnt().

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro 75db7fd990 umount_tree(): take all victims out of propagation graph at once
For each removed mount we need to calculate where the slaves will end up.
To avoid duplicating that work, do it for all mounts to be removed
at once, taking the mounts themselves out of propagation graph as
we go, then do all transfers; the duplicate work on finding destinations
is avoided since if we run into a mount that already had destination found,
we don't need to trace the rest of the way.  That's guaranteed
O(removed mounts) for finding destinations and removing from propagation
graph and O(surviving mounts that have master removed) for transfers.
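
The "already had destination found" shortcut can be illustrated with a deliberately simplified model: each node has at most one master, peer groups are ignored, and the destination for a removed node is its nearest surviving master. All names are hypothetical and this is not the kernel's algorithm, only a sketch of why caching bounds the walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy propagation node: at most one master; NULL means none. */
struct pnode {
	struct pnode *master;
	bool removed;
	bool dest_known;
	struct pnode *dest;	/* nearest surviving master, once known */
};

static size_t hops;		/* counts master-chain steps, for illustration */

/* Find where slaves of a removed node should end up, caching the answer. */
static struct pnode *find_dest(struct pnode *victim)
{
	struct pnode *p = victim, *dest;

	assert(victim->removed);
	/* Walk up until a survivor, the end of the chain, or a cached answer. */
	while (p && p->removed && !p->dest_known) {
		p = p->master;
		hops++;
	}
	dest = (p && p->removed) ? p->dest : p;

	/* Record the answer on every victim that was walked through. */
	for (p = victim; p && p->removed && !p->dest_known; p = p->master) {
		p->dest_known = true;
		p->dest = dest;
	}
	return dest;
}

/* Re-point a surviving slave whose master is going away. */
static void transfer_slave(struct pnode *slave)
{
	if (!slave->removed && slave->master && slave->master->removed)
		slave->master = find_dest(slave->master);
}
```

Each removed node is stepped past at most once before its answer is cached, so in this model the total walking cost is linear in the number of removed nodes.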

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro fc9d5efc4c do_mount(): use __free(path_put)
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro 43d672dbf1 do_move_mount_old(): use __free(path_put)
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro 86af25b01d constify can_move_mount_beneath() arguments
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro f91c433a5c path_umount(): constify struct path argument
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro 4f4b18af4c may_copy_tree(), __do_loopback(): constify struct path argument
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro 8ec7ee2e0b path_mount(): constify struct path argument
now it finally can be done.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro a8be822f61 do_{loopback,change_type,remount,reconfigure_mnt}(): constify struct path argument
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro 17d44b452c do_new_mount{,_fc}(): constify struct path argument
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro 27e4b78559 mnt_warn_timestamp_expiry(): constify struct path argument
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro 44b58cdaf9 do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s)
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro b42ffcd506 collect_paths(): constify the return value
callers have no business modifying the paths they get

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:44 -04:00
Al Viro 1f6df58474 drop_collected_paths(): constify arguments
... and use that to constify the pointers in callers

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:43 -04:00
Al Viro 6e024a0e28 do_set_group(): constify path arguments
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:43 -04:00
Al Viro 08404199f3 do_mount_setattr(): constify path argument
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:43 -04:00
Al Viro 8be87700c9 constify check_mnt()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:43 -04:00
Al Viro 90006f21b7 do_lock_mount(): don't modify path.
Currently do_lock_mount() has the target path switched to whatever
might be overmounting it.  We _do_ want to have the parent
mount/mountpoint chosen on top of the overmounting pile; however,
the way it's done has unpleasant races - if umount propagation
removes the overmount while we'd been trying to set the environment
up, we might end up failing if our target path strays into that overmount
just before the overmount gets kicked out.

Users of do_lock_mount() do not need the target path changed - they
have all information in res->{parent,mp}; only one place (in
do_move_mount()) currently uses the resulting path->mnt, and that value
is trivial to reconstruct from the original value of path->mnt + chosen
parent mount.

Let's keep the target path unchanged; it avoids a bunch of subtle races
and it's not hard to do:
	do
		as mount_locked_reader
			find the prospective parent mount/mountpoint dentry
			grab references if it's not the original target
		lock the prospective mountpoint dentry
		take namespace_sem exclusive
		if prospective parent/mountpoint would be different now
			err = -EAGAIN
		else if location has been unmounted
			err = -ENOENT
		else if mountpoint dentry is not allowed to be mounted on
			err = -ENOENT
		else if beneath and the top of the pile was the absolute root
			err = -EINVAL
		else
			try to get struct mountpoint (by dentry), set
			err to 0 on success and -ENO{MEM,ENT} on failure
		if err != 0
			res->parent = ERR_PTR(err)
			drop locks
		else
			res->parent = prospective parent
		drop temporary references
	while err == -EAGAIN

A somewhat subtle part is that dropping temporary references is allowed.
Neither mounts nor dentries should be evicted by a thread that holds
namespace_sem.  On success we are dropping those references under
namespace_sem, so we need to be sure that these are not the last
references remaining.  However, on success we'd already verified (under
namespace_sem) that the original target is still mounted and that the mount
and dentry we are about to drop are still reachable from it via the
mount tree.  That guarantees that we are not about to drop the last
remaining references.
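
The shape of the loop described above (look up optimistically, take the
lock, recheck, retry on -EAGAIN) can be sketched in userspace C.
Everything here is an assumption made for illustration: struct env,
struct pinned, lock_target() and the race hook are hypothetical, the
"lock" is a plain flag rather than namespace_sem, and the error is kept
in an int field instead of being stored as ERR_PTR() in res->parent.
Reference counting and the mountpoint dentry checks are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of the state the real code inspects. */
struct env {
	int top;	/* id of the mount currently on top of the pile */
	int mounted;	/* is the original target still mounted? */
	int locked;	/* stand-in for namespace_sem held exclusive */
};

struct pinned {
	int parent;	/* chosen parent; valid only when err == 0 */
	int err;
};

/* Hook run between lookup and locking, to simulate a concurrent change. */
typedef void (*race_fn)(struct env *);

static void pop_overmount(struct env *e)
{
	e->top--;	/* umount propagation removed the topmost overmount */
}

static int lock_target(struct env *e, struct pinned *res,
		       race_fn race, int *tries)
{
	int err;

	assert(e && res && tries);
	do {
		int candidate = e->top;	/* optimistic lookup, lock not held */

		if (race) {		/* inject the race on the first pass only */
			race(e);
			race = NULL;
		}
		e->locked = 1;		/* take the lock */
		if (e->top != candidate)
			err = -EAGAIN;	/* the answer changed under us; redo */
		else if (!e->mounted)
			err = -ENOENT;	/* location has been unmounted */
		else
			err = 0;
		if (err)
			e->locked = 0;	/* drop locks on any failure */
		else
			res->parent = candidate;
		(*tries)++;
	} while (err == -EAGAIN);
	res->err = err;
	return err;
}
```

On success the caller returns with the lock still held and res->parent
set; the target description passed in is never rewritten, which is the
point of the change.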

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:42 -04:00