linux

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	e64aeecbbb	mount-related stuff for this cycle * saner handling of guards in fs/namespace.c, getting rid of needlessly strong locking in some of the users. * lock_mount() calling conventions change - have it set the environment for attaching to given location, storing the results in caller-supplied object, without altering the passed struct path. Make unlock_mount() called as __cleanup for those objects. It's not exactly guard(), but similar to it. * MNT_WRITE_HOLD done right - mnt_hold_writers() does not mess with ->mnt_flags anymore, so insertion of a new mount into ->s_mounts of underlying superblock does not, in itself, expose ->mnt_flags of that mount to concurrent modifications. * getting rid of pathological cases when umount() spends quadratic time removing the victims from propagation graph - part of that had been dealt with last cycle, this should finish it. * a bunch of stuff constified. * assorted cleanups. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaNhzLAAKCRBZ7Krx/gZQ 63/IAP4yxJ6e3Pt66Uw0MeuSNmeLsQwb7mYo72lsYHpxjYANZAEAspMaLDU9NHxM Dy6WDVoJnf7+aDlD6E443YMfPX8XRQM= =5T+t -----END PGP SIGNATURE----- Merge tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs mount updates from Al Viro: "Several piles this cycle, this mount-related one being the largest and trickiest: - saner handling of guards in fs/namespace.c, getting rid of needlessly strong locking in some of the users - lock_mount() calling conventions change - have it set the environment for attaching to given location, storing the results in caller-supplied object, without altering the passed struct path. Make unlock_mount() called as __cleanup for those objects. It's not exactly guard(), but similar to it - MNT_WRITE_HOLD done right. mnt_hold_writers() does not mess with ->mnt_flags anymore, so insertion of a new mount into ->s_mounts of underlying superblock does not, in itself, expose ->mnt_flags of that mount to concurrent modifications - getting rid of pathological cases when umount() spends quadratic time removing the victims from propagation graph - part of that had been dealt with last cycle, this should finish it - a bunch of stuff constified - assorted cleanups * tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits) constify {__,}mnt_is_readonly() WRITE_HOLD machinery: no need for to bump mount_lock seqcount struct mount: relocate MNT_WRITE_HOLD bit preparations to taking MNT_WRITE_HOLD out of ->mnt_flags setup_mnt(): primitive for connecting a mount to filesystem simplify the callers of mnt_unhold_writers() copy_mnt_ns(): use guards copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure open_detached_copy(): separate creation of namespace into helper open_detached_copy(): don't bother with mount_lock_hash() path_has_submounts(): use guard(mount_locked_reader) fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() ecryptfs: get rid of pointless mount references in ecryptfs dentries umount_tree(): take all victims out of propagation graph at once do_mount(): use __free(path_put) do_move_mount_old(): use __free(path_put) constify can_move_mount_beneath() arguments path_umount(): constify struct path argument may_copy_tree(), __do_loopback(): constify struct path argument path_mount(): constify struct path argument ...	2025-10-03 10:19:44 -07:00
Linus Torvalds	18b19abc37	namespace-6.18-rc1 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQgQAKCRCRxhvAZXjc oiFXAQCpbLvkWbld9wLgxUBhq+q+kw5NvGxzpvqIhXwJB9F9YAEA44/Wevln4xGx +kRUbP+xlRQqenIYs2dLzVHzAwAdfQ4= =EO4Y -----END PGP SIGNATURE----- Merge tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace updates from Christian Brauner: "This contains a larger set of changes around the generic namespace infrastructure of the kernel. Each specific namespace type (net, cgroup, mnt, ...) embedds a struct ns_common which carries the reference count of the namespace and so on. We open-coded and cargo-culted so many quirks for each namespace type that it just wasn't scalable anymore. So given there's a bunch of new changes coming in that area I've started cleaning all of this up. The core change is to make it possible to correctly initialize every namespace uniformly and derive the correct initialization settings from the type of the namespace such as namespace operations, namespace type and so on. This leaves the new ns_common_init() function with a single parameter which is the specific namespace type which derives the correct parameters statically. This also means the compiler will yell as soon as someone does something remotely fishy. The ns_common_init() addition also allows us to remove ns_alloc_inum() and drops any special-casing of the initial network namespace in the network namespace initialization code that Linus complained about. Another part is reworking the reference counting. The reference counting was open-coded and copy-pasted for each namespace type even though they all followed the same rules. This also removes all open accesses to the reference count and makes it private and only uses a very small set of dedicated helpers to manipulate them just like we do for e.g., files. In addition this generalizes the mount namespace iteration infrastructure introduced a few cycles ago. As reminder, the vfs makes it possible to iterate sequentially and bidirectionally through all mount namespaces on the system or all mount namespaces that the caller holds privilege over. This allow userspace to iterate over all mounts in all mount namespaces using the listmount() and statmount() system call. Each mount namespace has a unique identifier for the lifetime of the systems that is exposed to userspace. The network namespace also has a unique identifier working exactly the same way. This extends the concept to all other namespace types. The new nstree type makes it possible to lookup namespaces purely by their identifier and to walk the namespace list sequentially and bidirectionally for all namespace types, allowing userspace to iterate through all namespaces. Looking up namespaces in the namespace tree works completely locklessly. This also means we can move the mount namespace onto the generic infrastructure and remove a bunch of code and members from struct mnt_namespace itself. There's a bunch of stuff coming on top of this in the future but for now this uses the generic namespace tree to extend a concept introduced first for pidfs a few cycles ago. For a while now we have supported pidfs file handles for pidfds. This has proven to be very useful. This extends the concept to cover namespaces as well. It is possible to encode and decode namespace file handles using the common name_to_handle_at() and open_by_handle_at() apis. As with pidfs file handles, namespace file handles are exhaustive, meaning it is not required to actually hold a reference to nsfs in able to decode aka open_by_handle_at() a namespace file handle. Instead the FD_NSFS_ROOT constant can be passed which will let the kernel grab a reference to the root of nsfs internally and thus decode the file handle. Namespaces file descriptors can already be derived from pidfds which means they aren't subject to overmount protection bugs. IOW, it's irrelevant if the caller would not have access to an appropriate /proc/<pid>/ns/ directory as they could always just derive the namespace based on a pidfd already. It has the same advantage as pidfds. It's possible to reliably and for the lifetime of the system refer to a namespace without pinning any resources and to compare them trivially. Permission checking is kept simple. If the caller is located in the namespace the file handle refers to they are able to open it otherwise they must hold privilege over the owning namespace of the relevant namespace. The namespace file handle layout is exposed as uapi and has a stable and extensible format. For now it simply contains the namespace identifier, the namespace type, and the inode number. The stable format means that userspace may construct its own namespace file handles without going through name_to_handle_at() as they are already allowed for pidfs and cgroup file handles" * tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits) ns: drop assert ns: move ns type into struct ns_common nstree: make struct ns_tree private ns: add ns_debug() ns: simplify ns_common_init() further cgroup: add missing ns_common include ns: use inode initializer for initial namespaces selftests/namespaces: verify initial namespace inode numbers ns: rename to __ns_ref nsfs: port to ns_ref_() helpers net: port to ns_ref_() helpers uts: port to ns_ref_() helpers ipv4: use check_net() net: use check_net() net-sysfs: use check_net() user: port to ns_ref_() helpers time: port to ns_ref_() helpers pid: port to ns_ref_() helpers ipc: port to ns_ref_() helpers cgroup: port to ns_ref_() helpers ...	2025-09-29 11:20:29 -07:00
Linus Torvalds	722df25ddf	kernel-6.18-rc1.clone3 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZgMQAKCRCRxhvAZXjc ornXAP954dZjz+OJw6lJLCf0j9TXJOczGHvK3oW5ZD9KnqtTdwEA7p1A6WMOKJyl 8VtTgCS0yNt8QlznUnsSDfVm0jXVGAY= =tUXG -----END PGP SIGNATURE----- Merge tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull copy_process updates from Christian Brauner: "This contains the changes to enable support for clone3() on nios2 which apparently is still a thing. The more exciting part of this is that it cleans up the inconsistency in how the 64-bit flag argument is passed from copy_process() into the various other copy_() helpers" [ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ] tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nios2: implement architecture-specific portion of sys_clone3 arch: copy_thread: pass clone_flags as u64 copy_process: pass clone_flags as u64 across calltree copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)	2025-09-29 10:36:50 -07:00
Linus Torvalds	3a2a5b278f	vfs-6.18-rc1.mount -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQOwAKCRCRxhvAZXjc oth/AQDvlOo+23/f2djgDGS8akjkBYVLW14OYzC0q5cbEnnGgAEAycHL50pp3n1o 3jMYlCByuv507vpCsDupo7QcJapmQAk= =/NwE -----END PGP SIGNATURE----- Merge tag 'vfs-6.18-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: "This contains some work around mount api handling: - Output the warning message for mnt_too_revealing() triggered during fsmount() to the fscontext log. This makes it possible for the mount tool to output appropriate warnings on the command line. For example, with the newest fsopen()-based mount(8) from util-linux, the error messages now look like: # mount -t proc proc /tmp mount: /tmp: fsmount() failed: VFS: Mount too revealing. dmesg(1) may have more information after failed mount system call. - Do not consume fscontext log entries when returning -EMSGSIZE Userspace generally expects APIs that return -EMSGSIZE to allow for them to adjust their buffer size and retry the operation. However, the fscontext log would previously clear the message even in the -EMSGSIZE case. Given that it is very cheap for us to check whether the buffer is too small before we remove the message from the ring buffer, let's just do that instead. - Drop an unused argument from do_remount()" * tag 'vfs-6.18-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: vfs: fs/namespace.c: remove ms_flags argument from do_remount selftests/filesystems: add basic fscontext log tests fscontext: do not consume log entries when returning -EMSGSIZE vfs: output mount_too_revealing() errors to fscontext docs/vfs: Remove mentions to the old mount API helpers fscontext: add custom-prefix log helpers fs: Remove mount_bdev fs: Remove mount_nodev	2025-09-29 09:32:34 -07:00
Christian Brauner	6c7ca6a02f	mount: handle NULL values in mnt_ns_release() When calling in listmount() mnt_ns_release() may be passed a NULL pointer. Handle that case gracefully. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2025-09-29 09:08:18 -07:00
Linus Torvalds	b7ce6fa90f	vfs-6.18-rc1.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQMQAKCRCRxhvAZXjc omNLAQCgrwzd9sa1JTlixweu3OAxQlSEbLuMpEv7Ztm+B7Wz0AD9HtwPC44Kev03 GbMcB2DCFLC4evqYECj6IG7NBmoKsAs= =1ICf -----END PGP SIGNATURE----- Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add "initramfs_options" parameter to set initramfs mount options. This allows to add specific mount options to the rootfs to e.g., limit the memory size - Add RWF_NOSIGNAL flag for pwritev2() Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets - Allow to pass pid namespace as procfs mount option Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed) This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs) But this requires forking off a helper process because you cannot just one-shot this using mount(2) * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process) While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate Things would be much easier if there was a way for userspace to just specify the pidns they want. So this pull request contains changes to implement a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); Cleanups: - Remove the last references to EXPORT_OP_ASYNC_LOCK - Make file_remove_privs_flags() static - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used - Use try_cmpxchg() in start_dir_add() - Use try_cmpxchg() in sb_init_done_wq() - Replace offsetof() with struct_size() in ioctl_file_dedupe_range() - Remove vfs_ioctl() export - Replace rwlock() with spinlock in epoll code as rwlock causes priority inversion on preempt rt kernels - Make ns_entries in fs/proc/namespaces const - Use a switch() statement() in init_special_inode() just like we do in may_open() - Use struct_size() in dir_add() in the initramfs code - Use str_plural() in rd_load_image() - Replace strcpy() with strscpy() in find_link() - Rename generic_delete_inode() to inode_just_drop() and generic_drop_inode() to inode_generic_drop() - Remove unused arguments from fcntl_{g,s}et_rw_hint() Fixes: - Document @name parameter for name_contains_dotdot() helper - Fix spelling mistake - Always return zero from replace_fd() instead of the file descriptor number - Limit the size for copy_file_range() in compat mode to prevent a signed overflow - Fix debugfs mount options not being applied - Verify the inode mode when loading it from disk in minixfs - Verify the inode mode when loading it from disk in cramfs - Don't trigger automounts with RESOLVE_NO_XDEV If openat2() was called with RESOLVE_NO_XDEV it didn't traverse through automounts, but could still trigger them - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in tracepoints - Fix unused variable warning in rd_load_image() on s390 - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD - Use ns_capable_noaudit() when determining net sysctl permissions - Don't call path_put() under namespace semaphore in listmount() and statmount()" * tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) fcntl: trim arguments listmount: don't call path_put() under namespace semaphore statmount: don't call path_put() under namespace semaphore pid: use ns_capable_noaudit() when determining net sysctl permissions fs: rename generic_delete_inode() and generic_drop_inode() init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD initramfs: Replace strcpy() with strscpy() in find_link() initrd: Use str_plural() in rd_load_image() initramfs: Use struct_size() helper to improve dir_add() initrd: Fix unused variable warning in rd_load_image() on s390 fs: use the switch statement in init_special_inode() fs/proc/namespaces: make ns_entries const filelock: add FL_RECLAIM to show_fl_flags() macro eventpoll: Replace rwlock with spinlock selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts ...	2025-09-29 09:03:07 -07:00
Christian Brauner	c1f86d0ac3	listmount: don't call path_put() under namespace semaphore Massage listmount() and make sure we don't call path_put() under the namespace semaphore. If we put the last reference we're fscked. Fixes: `b4c2bea8ce` ("add listmount(2) syscall") Cc: stable@vger.kernel.org # v6.8+ Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-26 10:20:29 +02:00
Christian Brauner	e8c84e2082	statmount: don't call path_put() under namespace semaphore Massage statmount() and make sure we don't call path_put() under the namespace semaphore. If we put the last reference we're fscked. Fixes: `46eae99ef7` ("add statmount(2) syscall") Cc: stable@vger.kernel.org # v6.8+ Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-26 10:16:06 +02:00
Christian Brauner	4055526d35	ns: move ns type into struct ns_common It's misplaced in struct proc_ns_operations and ns->ops might be NULL if the namespace is compiled out but we still want to know the type of the namespace for the initial namespace struct. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-25 09:23:54 +02:00
Christian Brauner	d7610cb745	ns: simplify ns_common_init() further Simply derive the ns operations from the namespace type. Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-22 14:47:10 +02:00
Christian Brauner	7cf7303211	ns: use inode initializer for initial namespaces Just use the common helper we have. Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 16:22:38 +02:00
Christian Brauner	024596a4e2	ns: rename to __ns_ref Make it easier to grep and rename to ns_count. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 16:22:38 +02:00
Christian Brauner	2e9e697227	mnt: port to ns_ref_*() helpers Stop accessing ns.count directly. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 16:22:36 +02:00
Christian Brauner	be5f21d398	ns: add ns_common_free() And drop ns_free_inum(). Anything common that can be wasted centrally should be wasted in the new common helper. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 16:22:36 +02:00
Christian Brauner	5612ff3ec5	nscommon: simplify initialization There's a lot of information that namespace implementers don't need to know about at all. Encapsulate this all in the initialization helper. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 14:26:19 +02:00
Christian Brauner	86cdbae5c6	mnt: simplify ns_common_init() handling Assign the reserved MNT_NS_ANON_INO sentinel to anonymous mount namespaces and cleanup the initial mount ns allocation. This is just a preparatory patch and the ns->inum check in ns_common_init() will be dropped in the next patch. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 14:26:18 +02:00
Christian Brauner	b2a0b19208	mnt: expose pointer to init_mnt_ns There's various scenarios where we need to know whether we are in the initial set of namespaces or not to e.g., shortcut permission checking. All namespaces expose that information. Let's do that too. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 14:26:18 +02:00
Christian Brauner	7d7d164989	mnt: support ns lookup Move the mount namespace to the generic ns lookup infrastructure. This allows us to drop a bunch of members from struct mnt_namespace. Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 14:26:15 +02:00
Christian Brauner	7914f15c5e	Merge branch 'no-rebase-mnt_ns_tree_remove' Bring in the fix for removing a mount namespace from the mount namespace rbtree and list. Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 14:26:14 +02:00
Christian Brauner	96ece8eb67	mnt: use ns_common_init() Don't cargo-cult the same thing over and over. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 14:26:13 +02:00
Al Viro	a797652486	constify {__,}mnt_is_readonly() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-17 15:58:29 -04:00
Al Viro	1e414adf03	WRITE_HOLD machinery: no need for to bump mount_lock seqcount ... neither for insertion into the list of instances, nor for mnt_{un,}hold_writers(), nor for mnt_get_write_access() deciding to be nice to RT during a busy-wait loop - all of that only needs the spinlock side of mount_lock. IOW, it's mount_locked_reader, not mount_writer. Clarify the comment re locking rules for mnt_unhold_writers() - it's not just that mount_lock needs to be held when calling that, it must have been held all along since the matching mnt_hold_writers(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-17 15:58:29 -04:00
Al Viro	3371fa2f27	struct mount: relocate MNT_WRITE_HOLD bit ... from ->mnt_flags to LSB of ->mnt_pprev_for_sb. This is safe - we always set and clear it within the same mount_lock scope, so we won't interfere with list operations - traversals are always forward, so they don't even look at ->mnt_prev_for_sb and both insertions and removals are in mount_lock scopes of their own, so that bit will be clear in all mount instances during those. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-17 15:58:29 -04:00
Al Viro	09a1b33c08	preparations to taking MNT_WRITE_HOLD out of ->mnt_flags We have an unpleasant wart in accessibility rules for struct mount. There are per-superblock lists of mounts, used by sb_prepare_remount_readonly() to check if any of those is currently claimed for write access and to block further attempts to get write access on those until we are done. As soon as it is attached to a filesystem, mount becomes reachable via that list. Only sb_prepare_remount_readonly() traverses it and it only accesses a few members of struct mount. Unfortunately, ->mnt_flags is one of those and it is modified - MNT_WRITE_HOLD set and then cleared. It is done under mount_lock, so from the locking rules POV everything's fine. However, it has easily overlooked implications - once mount has been attached to a filesystem, it has to be treated as globally visible. In particular, initializing ->mnt_flags must be done either prior to that point or under mount_lock. All other members are still private at that point. Life gets simpler if we move that bit (and that's all that can get touched by access via this list) out of ->mnt_flags. It's not even hard to do - currently the list is implemented as list_head one, anchored in super_block->s_mounts and linked via mount->mnt_instance. As the first step, switch it to hlist-like open-coded structure - address of the first mount in the set is stored in ->s_mounts and ->mnt_instance replaced with ->mnt_next_for_sb and ->mnt_pprev_for_sb - the former either NULL or pointing to the next mount in set, the latter - address of either ->s_mounts or ->mnt_next_for_sb in the previous element of the set. In the next commit we'll steal the LSB of ->mnt_pprev_for_sb as replacement for MNT_WRITE_HOLD. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-17 15:58:29 -04:00
Al Viro	5d132cfafb	setup_mnt(): primitive for connecting a mount to filesystem Take the identical logics in vfs_create_mount() and clone_mnt() into a new helper that takes an empty struct mount and attaches it to given dentry (sub)tree. Should be called once in the lifetime of every mount, prior to making it visible in any data structures. After that point ->mnt_root and ->mnt_sb never change; ->mnt_root is a counting reference to dentry and ->mnt_sb - an active reference to superblock. Mount remains associated with that dentry tree all the way until the call of cleanup_mnt(), when the refcount eventually drops to zero. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-17 15:58:28 -04:00
Al Viro	7f954a6f49	simplify the callers of mnt_unhold_writers() The logics in cleanup on failure in mount_setattr_prepare() is simplified by having the mnt_hold_writers() failure followed by advancing m to the next node in the tree before leaving the loop. And since all calls are preceded by the same check that flag has been set and the function is inlined, let's just shift the check into it. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-17 15:58:28 -04:00
Al Viro	d7b7253a0a	copy_mnt_ns(): use guards * mntput() of rootmnt and pwdmnt done via __free(mntput) * mnt_ns_tree_add() can be done within namespace_excl scope. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-17 15:58:28 -04:00
Al Viro	7bb4c851dc	copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure Now that free_mnt_ns() works prior to mnt_ns_tree_add(), there's no need for an open-coded analogue free_mnt_ns() there - yes, we do avoid one call_rcu() use per failing call of clone() or unshare(), if they fail due to OOM in that particular spot, but it's not really worth bothering. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-17 15:58:28 -04:00
Al Viro	1b966c4471	Merge branch 'no-rebase-mnt_ns_tree_remove' into work.mount	2025-09-17 15:58:06 -04:00
Al Viro	38f4885088	mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list Actual removal is done under the lock, but for checking if need to bother the lockless RB_EMPTY_NODE() is safe - either that namespace had never been added to mnt_ns_tree, in which case the the node will stay empty, or whoever had allocated it has called mnt_ns_tree_add() and it has already run to completion. After that point RB_EMPTY_NODE() will become false and will remain false, no matter what we do with other nodes in the tree. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-16 00:33:37 -04:00
Al Viro	57a7b5b0b6	open_detached_copy(): separate creation of namespace into helper ... and convert the helper to use of a guard(namespace_excl) Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	71cf10ce45	open_detached_copy(): don't bother with mount_lock_hash() we are holding namespace_sem and a reference to root of tree; iterating through that tree does not need mount_lock. Neither does the insertion into the rbtree of new namespace or incrementing the mount count of that namespace. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	19ac81735c	fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() Comments regarding "shadow mounts" were stale - no such thing anymore. Document the locking requirements for __lookup_mnt(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	75db7fd990	umount_tree(): take all victims out of propagation graph at once For each removed mount we need to calculate where the slaves will end up. To avoid duplicating that work, do it for all mounts to be removed at once, taking the mounts themselves out of propagation graph as we go, then do all transfers; the duplicate work on finding destinations is avoided since if we run into a mount that already had destination found, we don't need to trace the rest of the way. That's guaranteed O(removed mounts) for finding destinations and removing from propagation graph and O(surviving mounts that have master removed) for transfers. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	fc9d5efc4c	do_mount(): use __free(path_put) Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	43d672dbf1	do_move_mount_old(): use __free(path_put) Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	86af25b01d	constify can_move_mount_beneath() arguments Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	f91c433a5c	path_umount(): constify struct path argument Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	4f4b18af4c	may_copy_tree(), __do_loopback(): constify struct path argument Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	8ec7ee2e0b	path_mount(): constify struct path argument now it finally can be done. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	a8be822f61	do_{loopback,change_type,remount,reconfigure_mnt}(): constify struct path argument Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	17d44b452c	do_new_mount{,_fc}(): constify struct path argument Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	27e4b78559	mnt_warn_timestamp_expiry(): constify struct path argument Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	44b58cdaf9	do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s) Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	b42ffcd506	collect_paths(): constify the return value callers have no business modifying the paths they get Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:44 -04:00
Al Viro	1f6df58474	drop_collected_paths(): constify arguments ... and use that to constify the pointers in callers Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:43 -04:00
Al Viro	6e024a0e28	do_set_group(): constify path arguments Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:43 -04:00
Al Viro	08404199f3	do_mount_setattr(): constify path argument Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:43 -04:00
Al Viro	8be87700c9	constify check_mnt() Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:43 -04:00
Al Viro	90006f21b7	do_lock_mount(): don't modify path. Currently do_lock_mount() has the target path switched to whatever might be overmounting it. We _do_ want to have the parent mount/mountpoint chosen on top of the overmounting pile; however, the way it's done has unpleasant races - if umount propagation removes the overmount while we'd been trying to set the environment up, we might end up failing if our target path strays into that overmount just before the overmount gets kicked out. Users of do_lock_mount() do not need the target path changed - they have all information in res->{parent,mp}; only one place (in do_move_mount()) currently uses the resulting path->mnt, and that value is trivial to reconstruct by the original value of path->mnt + chosen parent mount. Let's keep the target path unchanged; it avoids a bunch of subtle races and it's not hard to do: do as mount_locked_reader find the prospective parent mount/mountpoint dentry grab references if it's not the original target lock the prospective mountpoint dentry take namespace_sem exclusive if prospective parent/mountpoint would be different now err = -EAGAIN else if location has been unmounted err = -ENOENT else if mountpoint dentry is not allowed to be mounted on err = -ENOENT else if beneath and the top of the pile was the absolute root err = -EINVAL else try to get struct mountpoint (by dentry), set err to 0 on success and -ENO{MEM,ENT} on failure if err != 0 res->parent = ERR_PTR(err) drop locks else res->parent = prospective parent drop temporary references while err == -EAGAIN A somewhat subtle part is that dropping temporary references is allowed. Neither mounts nor dentries should be evicted by a thread that holds namespace_sem. On success we are dropping those references under namespace_sem, so we need to be sure that these are not the last references remaining. However, on success we'd already verified (under namespace_sem) that original target is still mounted and that mount and dentry we are about to drop are still reachable from it via the mount tree. That guarantees that we are not about to drop the last remaining references. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:42 -04:00

1 2 3 4 5 ...

922 Commits