Commit Graph

101 Commits

Author SHA1 Message Date
Christian Brauner d9a44089ac
fs: use boolean to indicate anonymous mount namespace
Stop playing games with the namespace id and use a boolean instead:

* This will remove the special-casing we need to do everywhere for mount
  namespaces.

* It will allow us to use asserts on the namespace id for initial
  namespaces everywhere.

* It will allow us to put anonymous mount namespaces on the namespaces
  trees in the future and thus make them available to statmount() and
  listmount().

Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-10-e8a9264e0fb9@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11 10:01:31 +01:00
Linus Torvalds e64aeecbbb mount-related stuff for this cycle
* saner handling of guards in fs/namespace.c, getting
 rid of needlessly strong locking in some of the users.
 
 	* lock_mount() calling conventions change - have it set
 the environment for attaching to given location, storing the
 results in caller-supplied object, without altering the passed
 struct path.  Make unlock_mount() called as __cleanup for those
 objects.  It's not exactly guard(), but similar to it.
 
 	* MNT_WRITE_HOLD done right - mnt_hold_writers() does *not*
 mess with ->mnt_flags anymore, so insertion of a new mount into
 ->s_mounts of underlying superblock does not, in itself, expose
 ->mnt_flags of that mount to concurrent modifications.
 
 	* getting rid of pathological cases when umount() spends
 quadratic time removing the victims from propagation graph -
 part of that had been dealt with last cycle, this should finish
 it.
 
 	* a bunch of stuff constified.
 
 	* assorted cleanups.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaNhzLAAKCRBZ7Krx/gZQ
 63/IAP4yxJ6e3Pt66Uw0MeuSNmeLsQwb7mYo72lsYHpxjYANZAEAspMaLDU9NHxM
 Dy6WDVoJnf7+aDlD6E443YMfPX8XRQM=
 =5T+t
 -----END PGP SIGNATURE-----

Merge tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs mount updates from Al Viro:
 "Several piles this cycle, this mount-related one being the largest and
  trickiest:

   - saner handling of guards in fs/namespace.c, getting rid of
     needlessly strong locking in some of the users

   - lock_mount() calling conventions change - have it set the
     environment for attaching to given location, storing the results in
     caller-supplied object, without altering the passed struct path.

     Make unlock_mount() called as __cleanup for those objects. It's not
     exactly guard(), but similar to it

   - MNT_WRITE_HOLD done right.

     mnt_hold_writers() does *not* mess with ->mnt_flags anymore, so
     insertion of a new mount into ->s_mounts of underlying superblock
     does not, in itself, expose ->mnt_flags of that mount to concurrent
     modifications

   - getting rid of pathological cases when umount() spends quadratic
     time removing the victims from propagation graph - part of that had
     been dealt with last cycle, this should finish it

   - a bunch of stuff constified

   - assorted cleanups

* tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
  constify {__,}mnt_is_readonly()
  WRITE_HOLD machinery: no need for to bump mount_lock seqcount
  struct mount: relocate MNT_WRITE_HOLD bit
  preparations to taking MNT_WRITE_HOLD out of ->mnt_flags
  setup_mnt(): primitive for connecting a mount to filesystem
  simplify the callers of mnt_unhold_writers()
  copy_mnt_ns(): use guards
  copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure
  open_detached_copy(): separate creation of namespace into helper
  open_detached_copy(): don't bother with mount_lock_hash()
  path_has_submounts(): use guard(mount_locked_reader)
  fs/namespace.c: sanitize descriptions for {__,}lookup_mnt()
  ecryptfs: get rid of pointless mount references in ecryptfs dentries
  umount_tree(): take all victims out of propagation graph at once
  do_mount(): use __free(path_put)
  do_move_mount_old(): use __free(path_put)
  constify can_move_mount_beneath() arguments
  path_umount(): constify struct path argument
  may_copy_tree(), __do_loopback(): constify struct path argument
  path_mount(): constify struct path argument
  ...
2025-10-03 10:19:44 -07:00
Christian Brauner 2e9e697227
mnt: port to ns_ref_*() helpers
Stop accessing ns.count directly.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 16:22:36 +02:00
Christian Brauner 7d7d164989
mnt: support ns lookup
Move the mount namespace to the generic ns lookup infrastructure.
This allows us to drop a bunch of members from struct mnt_namespace.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 14:26:15 +02:00
Al Viro 3371fa2f27 struct mount: relocate MNT_WRITE_HOLD bit
... from ->mnt_flags to LSB of ->mnt_pprev_for_sb.

This is safe - we always set and clear it within the same mount_lock
scope, so we won't interfere with list operations - traversals are
always forward, so they don't even look at ->mnt_prev_for_sb and
both insertions and removals are in mount_lock scopes of their own,
so that bit will be clear in *all* mount instances during those.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 15:58:29 -04:00
Al Viro 09a1b33c08 preparations to taking MNT_WRITE_HOLD out of ->mnt_flags
We have an unpleasant wart in accessibility rules for struct mount.  There
are per-superblock lists of mounts, used by sb_prepare_remount_readonly()
to check if any of those is currently claimed for write access and to
block further attempts to get write access on those until we are done.

As soon as it is attached to a filesystem, mount becomes reachable
via that list.  Only sb_prepare_remount_readonly() traverses it and
it only accesses a few members of struct mount.  Unfortunately,
->mnt_flags is one of those and it is modified - MNT_WRITE_HOLD set
and then cleared.  It is done under mount_lock, so from the locking
rules POV everything's fine.

However, it has easily overlooked implications - once mount has been
attached to a filesystem, it has to be treated as globally visible.
In particular, initializing ->mnt_flags *must* be done either prior
to that point or under mount_lock.  All other members are still
private at that point.

Life gets simpler if we move that bit (and that's *all* that can get
touched by access via this list) out of ->mnt_flags.  It's not even
hard to do - currently the list is implemented as list_head one,
anchored in super_block->s_mounts and linked via mount->mnt_instance.

As the first step, switch it to hlist-like open-coded structure -
address of the first mount in the set is stored in ->s_mounts
and ->mnt_instance replaced with ->mnt_next_for_sb and ->mnt_pprev_for_sb -
the former either NULL or pointing to the next mount in set, the
latter - address of either ->s_mounts or ->mnt_next_for_sb in the
previous element of the set.

In the next commit we'll steal the LSB of ->mnt_pprev_for_sb as
replacement for MNT_WRITE_HOLD.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-17 15:58:29 -04:00
Al Viro 25423edc78 new helper: topmost_overmount()
Returns the final (topmost) mount in the chain of overmounts
starting at given mount.  Same locking rules as for any mount
tree traversal - either the spinlock side of mount_lock, or
rcu + sample the seqcount side of mount_lock before the call
and recheck afterwards.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15 21:26:05 -04:00
Al Viro d154f18575 introduced guards for mount_lock
mount_writer: write_seqlock; that's an equivalent of {un,}lock_mount_hash()
mount_locked_reader: read_seqlock_excl; these tend to be open-coded.

No bulk conversions, please - if nothing else, quite a few places take
use mount_writer form when mount_locked_reader is sufficent.  It needs
to be dealt with carefully.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-02 19:35:56 -04:00
Al Viro 663206854f copy_tree(): don't link the mounts via mnt_list
The only place that really needs to be adjusted is commit_tree() -
there we need to iterate through the copy and we might as well
use next_mnt() for that.  However, in case when our tree has been
slid under something already mounted (propagation to a mountpoint
that already has something mounted on it or a 'beneath' move_mount)
we need to take care not to walk into the overmounting tree.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:37 -04:00
Al Viro 8c5a853f58 mnt_slave_list/mnt_slave: turn into hlist_head/hlist_node
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:30 -04:00
Al Viro 406fea7999 mount: separate the flags accessed only under namespace_sem
Several flags are updated and checked only under namespace_sem; we are
already making use of that when we are checking them without mount_lock,
but we have to hold mount_lock for all updates, which makes things
clumsier than they have to be.

Take MNT_SHARED, MNT_UNBINDABLE, MNT_MARKED and MNT_UMOUNT_CANDIDATE
into a separate field (->mnt_t_flags), renaming them to T_SHARED,
etc. to avoid confusion.  All accesses must be under namespace_sem.

That changes locking requirements for mnt_change_propagation() and
set_mnt_shared() - only namespace_sem is needed now.  The same goes
for SET_MNT_MARKED et.al.

There might be more flags moved from ->mnt_flags to that field;
this is just the initial set.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 19:03:29 -04:00
Al Viro d72c773237 get rid of mountpoint->m_count
struct mountpoint has an odd kinda-sorta refcount in it.  It's always
either equal to or one above the number of mounts attached to that
mountpoint.

"One above" happens when a function takes a temporary reference to
mountpoint.  Things get simpler if we express that as inserting
a local object into ->m_list and removing it to drop the reference.

New calling conventions:

1) lock_mount(), do_lock_mount(), get_mountpoint() and lookup_mountpoint()
take an extra struct pinned_mountpoint * argument and returns 0/-E...
(or true/false in case of lookup_mountpoint()) instead of returning
struct mountpoint pointers.  In case of success, the struct mountpoint *
we used to get can be found as pinned_mountpoint.mp

2) unlock_mount() (always paired with lock_mount()/do_lock_mount()) takes
an address of struct pinned_mountpoint - the same that had been passed to
lock_mount()/do_lock_mount().

3) put_mountpoint() for a temporary reference (paired with get_mountpoint()
or lookup_mountpoint()) is replaced with unpin_mountpoint(), which takes
the address of pinned_mountpoint we passed to matching {get,lookup}_mountpoint().

4) all instances of pinned_mountpoint are local variables; they always live on
stack.  {} is used for initializer, after successful {get,lookup}_mountpoint()
we must make sure to call unpin_mountpoint() before leaving the scope and
after successful {do_,}lock_mount() we must make sure to call unlock_mount()
before leaving the scope.

5) all manipulations of ->m_count are gone, along with ->m_count itself.
struct mountpoint lives while its ->m_list is non-empty.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:42 -04:00
Al Viro f0d0ba1998 Rewrite of propagate_umount()
The variant currently in the tree has problems; trying to prove
correctness has caught at least one class of bugs (reparenting
that ends up moving the visible location of reparented mount, due
to not excluding some of the counterparts on propagation that
should've been included).

I tried to prove that it's the only bug there; I'm still not sure
whether it is.  If anyone can reconstruct and write down an analysis
of the mainline implementation, I'll gladly review it; as it is,
I ended up doing a different implementation.  Candidate collection
phase is similar, but trimming the set down until it satisfies the
constraints turned out pretty different.

I hoped to do transformation as a massage series, but that turns out
to be too convoluted.  So it's a single patch replacing propagate_umount()
and friends in one go, with notes and analysis in D/f/propagate_umount.txt
(in addition to inline comments).

As far I can tell, it is provably correct and provably linear by the number
of mounts we need to look at in order to decide what should be unmounted.
It even builds and seems to survive testing...

Another nice thing that fell out of that is that ->mnt_umounting is no longer
needed.

Compared to the first version:
	* explicit MNT_UMOUNT_CANDIDATE flag for is_candidate()
	* trim_ancestors() only clears that flag, leaving the suckers on list
	* trim_one() and handle_locked() take the stuff with flag cleared off
the list.  That allows to iterate with list_for_each_entry_safe() when calling
trim_one() - it removes at most one element from the list now.
	* no globals - I didn't bother with any kind of context, not worth it.

	* Notes updated accordingly; I have not touch the terms yet.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro 05da054d43 new predicate: anon_ns_root(mount)
checks if mount is the root of an anonymouns namespace.
Switch open-coded equivalents to using it.

For mounts that belong to anon namespace !mnt_has_parent(mount)
is the same as mount == ns->root, and intent is more obvious in
the latter form.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro e031251cb2 constify is_local_mountpoint()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro 0e84653ea5 constify mnt_has_parent()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro ffdc52fbbd prevent mount hash conflicts
Currently it's still possible to run into a pathological situation when
two hashed mounts share both parent and mountpoint.  That does not work
well, for obvious reasons.

We are not far from getting rid of that; the only remaining gap is
attach_recursive_mnt() not being careful enough when sliding a tree
under existing mount (for propagated copies or in 'beneath' case for
the original one).

To deal with that cleanly we need to be able to find overmounts
(i.e. mounts on top of parent's root); we could do hash lookups or scan
the list of children but either would be costly.  Since one of the results
we get from that will be prevention of multiple parallel overmounts, let's
just bite the bullet and store a (non-counting) reference to overmount
in struct mount.

With that done, closing the hole in attach_recursive_mnt() becomes easy
- we just need to follow the chain of overmounts before we change the
mountpoint of the mount we are sliding things under.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29 18:13:41 -04:00
Al Viro 3b5260d12b Don't propagate mounts into detached trees
All versions up to 6.14 did not propagate mount events into detached
tree.  Shortly after 6.14 a merge of vfs-6.15-rc1.mount.namespace
(130e696aa6) has changed that.

Unfortunately, that has caused userland regressions (reported in
https://lore.kernel.org/all/CAOYeF9WQhFDe+BGW=Dp5fK8oRy5AgZ6zokVyTj1Wp4EUiYgt4w@mail.gmail.com/)

Straight revert wouldn't be an option - in particular, the variant in 6.14
had a bug that got fixed in d1ddc6f1d9 ("fix IS_MNT_PROPAGATING uses")
and we don't want to bring the bug back.

This is a modification of manual revert posted by Christian, with changes
needed to avoid reintroducing the breakage in scenario described in
d1ddc6f1d9.

Cc: stable@vger.kernel.org
Reported-by: Allison Karlitskaya <lis@redhat.com>
Tested-by: Allison Karlitskaya <lis@redhat.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Co-developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-26 17:35:32 -04:00
Linus Torvalds 130e696aa6 vfs-6.15-rc1.mount.namespace
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90r2wAKCRCRxhvAZXjc
 ouC6AQCk3MoqskN0WeNcaZT23dB7dHbEhf/7YXOFC9MFRMKXqQD9Fbn95+GuIe3U
 nBVPbVyQfDtfXE08ml6gbDJrCsbkkQI=
 =Xm1C
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.mount.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs mount namespace updates from Christian Brauner:
 "This expands the ability of anonymous mount namespaces:

   - Creating detached mounts from detached mounts

     Currently, detached mounts can only be created from attached
     mounts. This limitaton prevents various use-cases. For example, the
     ability to mount a subdirectory without ever having to make the
     whole filesystem visible first.

     The current permission modelis:

      (1) Check that the caller is privileged over the owning user
          namespace of it's current mount namespace.

      (2) Check that the caller is located in the mount namespace of the
          mount it wants to create a detached copy of.

     While it is not strictly necessary to do it this way it is
     consistently applied in the new mount api. This model will also be
     used when allowing the creation of detached mount from another
     detached mount.

     The (1) requirement can simply be met by performing the same check
     as for the non-detached case, i.e., verify that the caller is
     privileged over its current mount namespace.

     To meet the (2) requirement it must be possible to infer the origin
     mount namespace that the anonymous mount namespace of the detached
     mount was created from.

     The origin mount namespace of an anonymous mount is the mount
     namespace that the mounts that were copied into the anonymous mount
     namespace originate from.

     In order to check the origin mount namespace of an anonymous mount
     namespace the sequence number of the original mount namespace is
     recorded in the anonymous mount namespace.

     With this in place it is possible to perform an equivalent check
     (2') to (2). The origin mount namespace of the anonymous mount
     namespace must be the same as the caller's mount namespace. To
     establish this the sequence number of the caller's mount namespace
     and the origin sequence number of the anonymous mount namespace are
     compared.

     The caller is always located in a non-anonymous mount namespace
     since anonymous mount namespaces cannot be setns()ed into. The
     caller's mount namespace will thus always have a valid sequence
     number.

     The owning namespace of any mount namespace, anonymous or
     non-anonymous, can never change. A mount attached to a
     non-anonymous mount namespace can never change mount namespace.

     If the sequence number of the non-anonymous mount namespace and the
     origin sequence number of the anonymous mount namespace match, the
     owning namespaces must match as well.

     Hence, the capability check on the owning namespace of the caller's
     mount namespace ensures that the caller has the ability to copy the
     mount tree.

   - Allow mount detached mounts on detached mounts

     Currently, detached mounts can only be mounted onto attached
     mounts. This limitation makes it impossible to assemble a new
     private rootfs and move it into place. Instead, a detached tree
     must be created, attached, then mounted open and then either moved
     or detached again. Lift this restriction.

     In order to allow mounting detached mounts onto other detached
     mounts the same permission model used for creating detached mounts
     from detached mounts can be used (cf. above).

     Allowing to mount detached mounts onto detached mounts leaves three
     cases to consider:

      (1) The source mount is an attached mount and the target mount is
          a detached mount. This would be equivalent to moving a mount
          between different mount namespaces. A caller could move an
          attached mount to a detached mount. The detached mount can now
          be freely attached to any mount namespace. This changes the
          current delegatioh model significantly for no good reason. So
          this will fail.

      (2) Anonymous mount namespaces are always attached fully, i.e., it
          is not possible to only attach a subtree of an anoymous mount
          namespace. This simplifies the implementation and reasoning.

          Consequently, if the anonymous mount namespace of the source
          detached mount and the target detached mount are the identical
          the mount request will fail.

      (3) The source mount's anonymous mount namespace is different from
          the target mount's anonymous mount namespace.

          In this case the source anonymous mount namespace of the
          source mount tree must be freed after its mounts have been
          moved to the target anonymous mount namespace. The source
          anonymous mount namespace must be empty afterwards.

     By allowing to mount detached mounts onto detached mounts a caller
     may do the following:

       fd_tree1 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE)
       fd_tree2 = open_tree(-EBADF, "/tmp", OPEN_TREE_CLONE)

     fd_tree1 and fd_tree2 refer to two different detached mount trees
     that belong to two different anonymous mount namespace.

     It is important to note that fd_tree1 and fd_tree2 both refer to
     the root of their respective anonymous mount namespaces.

     By allowing to mount detached mounts onto detached mounts the
     caller may now do:

         move_mount(fd_tree1, "", fd_tree2, "",
                    MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH)

     This will cause the detached mount referred to by fd_tree1 to be
     mounted on top of the detached mount referred to by fd_tree2.

     Thus, the detached mount fd_tree1 is moved from its separate
     anonymous mount namespace into fd_tree2's anonymous mount
     namespace.

     It also means that while fd_tree2 continues to refer to the root of
     its respective anonymous mount namespace fd_tree1 doesn't anymore.

     This has the consequence that only fd_tree2 can be moved to another
     anonymous or non-anonymous mount namespace. Moving fd_tree1 will
     now fail as fd_tree1 doesn't refer to the root of an anoymous mount
     namespace anymore.

     Now fd_tree1 and fd_tree2 refer to separate detached mount trees
     referring to the same anonymous mount namespace.

     This is conceptually fine. The new mount api does allow for this to
     happen already via:

       mount -t tmpfs tmpfs /mnt
       mkdir -p /mnt/A
       mount -t tmpfs tmpfs /mnt/A

       fd_tree3 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE | AT_RECURSIVE)
       fd_tree4 = open_tree(-EBADF, "/mnt/A", 0)

     Both fd_tree3 and fd_tree4 refer to two different detached mount
     trees but both detached mount trees refer to the same anonymous
     mount namespace. An as with fd_tree1 and fd_tree2, only fd_tree3
     may be moved another mount namespace as fd_tree3 refers to the root
     of the anonymous mount namespace just while fd_tree4 doesn't.

     However, there's an important difference between the
     fd_tree3/fd_tree4 and the fd_tree1/fd_tree2 example.

     Closing fd_tree4 and releasing the respective struct file will have
     no further effect on fd_tree3's detached mount tree.

     However, closing fd_tree3 will cause the mount tree and the
     respective anonymous mount namespace to be destroyed causing the
     detached mount tree of fd_tree4 to be invalid for further mounting.

     By allowing to mount detached mounts on detached mounts as in the
     fd_tree1/fd_tree2 example both struct files will affect each other.

     Both fd_tree1 and fd_tree2 refer to struct files that have
     FMODE_NEED_UNMOUNT set.

     To handle this we use the fact that @fd_tree1 will have a parent
     mount once it has been attached to @fd_tree2.

     When dissolve_on_fput() is called the mount that has been passed in
     will refer to the root of the anonymous mount namespace. If it
     doesn't it would mean that mounts are leaked. So before allowing to
     mount detached mounts onto detached mounts this would be a bug.

     Now that detached mounts can be mounted onto detached mounts it
     just means that the mount has been attached to another anonymous
     mount namespace and thus dissolve_on_fput() must not unmount the
     mount tree or free the anonymous mount namespace as the file
     referring to the root of the namespace hasn't been closed yet.

     If it had been closed yet it would be obvious because the mount
     namespace would be NULL, i.e., the @fd_tree1 would have already
     been unmounted. If @fd_tree1 hasn't been unmounted yet and has a
     parent mount it is safe to skip any cleanup as closing @fd_tree2
     will take care of all cleanup operations.

   - Allow mount propagation for detached mount trees

     In commit ee2e3f5062 ("mount: fix mounting of detached mounts
     onto targets that reside on shared mounts") I fixed a bug where
     propagating the source mount tree of an anonymous mount namespace
     into a target mount tree of a non-anonymous mount namespace could
     be used to trigger an integer overflow in the non-anonymous mount
     namespace causing any new mounts to fail.

     The cause of this was that the propagation algorithm was unable to
     recognize mounts from the source mount tree that were already
     propagated into the target mount tree and then reappeared as
     propagation targets when walking the destination propagation mount
     tree.

     When fixing this I disabled mount propagation into anonymous mount
     namespaces. Make it possible for anonymous mount namespace to
     receive mount propagation events correctly. This is now also a
     correctness issue now that we allow mounting detached mount trees
     onto detached mount trees.

     Mark the source anonymous mount namespace with MNTNS_PROPAGATING
     indicating that all mounts belonging to this mount namespace are
     currently in the process of being propagated and make the
     propagation algorithm discard those if they appear as propagation
     targets"

* tag 'vfs-6.15-rc1.mount.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits)
  selftests: test subdirectory mounting
  selftests: add test for detached mount tree propagation
  fs: namespace: fix uninitialized variable use
  mount: handle mount propagation for detached mount trees
  fs: allow creating detached mounts from fsmount() file descriptors
  selftests: seventh test for mounting detached mounts onto detached mounts
  selftests: sixth test for mounting detached mounts onto detached mounts
  selftests: fifth test for mounting detached mounts onto detached mounts
  selftests: fourth test for mounting detached mounts onto detached mounts
  selftests: third test for mounting detached mounts onto detached mounts
  selftests: second test for mounting detached mounts onto detached mounts
  selftests: first test for mounting detached mounts onto detached mounts
  fs: mount detached mounts onto detached mounts
  fs: support getname_maybe_null() in move_mount()
  selftests: create detached mounts from detached mounts
  fs: create detached mounts from detached mounts
  fs: add may_copy_tree()
  fs: add fastpath for dissolve_on_fput()
  fs: add assert for move_mount()
  fs: add mnt_ns_empty() helper
  ...
2025-03-24 11:41:41 -07:00
Christian Brauner 064fe6e233
mount: handle mount propagation for detached mount trees
In commit ee2e3f5062 ("mount: fix mounting of detached mounts onto
targets that reside on shared mounts") I fixed a bug where propagating
the source mount tree of an anonymous mount namespace into a target
mount tree of a non-anonymous mount namespace could be used to trigger
an integer overflow in the non-anonymous mount namespace causing any new
mounts to fail.

The cause of this was that the propagation algorithm was unable to
recognize mounts from the source mount tree that were already propagated
into the target mount tree and then reappeared as propagation targets
when walking the destination propagation mount tree.

When fixing this I disabled mount propagation into anonymous mount
namespaces. Make it possible for anonymous mount namespace to receive
mount propagation events correctly. This is no also a correctness issue
now that we allow mounting detached mount trees onto detached mount
trees.

Mark the source anonymous mount namespace with MNTNS_PROPAGATING
indicating that all mounts belonging to this mount namespace are
currently in the process of being propagated and make the propagation
algorithm discard those if they appear as propagation targets.

Link: https://lore.kernel.org/r/20250225-work-mount-propagation-v1-1-e6e3724500eb@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04 09:29:54 +01:00
Christian Brauner 2f576220cd
fs: add mnt_ns_empty() helper
Add a helper that checks whether a give mount namespace is empty instead
of open-coding the specific data structure check. This also be will be
used in follow-up patches.

Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-2-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04 09:29:52 +01:00
Christian Brauner 3b0cdba4da
fs: record sequence number of origin mount namespace
Store the sequence number of the mount namespace the anonymous mount
namespace has been created from. This information will be used in
follow-up patches.

Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-1-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04 09:29:19 +01:00
Miklos Szeredi bf630c4016
vfs: add notifications for mount attach and detach
Add notifications for attaching and detaching mounts to fs/namespace.c

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20250129165803.72138-4-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-05 17:21:11 +01:00
Miklos Szeredi 0f46d81f2b
fanotify: notify on mount attach and detach
Add notifications for attaching and detaching mounts.  The following new
event masks are added:

  FAN_MNT_ATTACH  - Mount was attached
  FAN_MNT_DETACH  - Mount was detached

If a mount is moved, then the event is reported with (FAN_MNT_ATTACH |
FAN_MNT_DETACH).

These events add an info record of type FAN_EVENT_INFO_TYPE_MNT containing
these fields identifying the affected mounts:

  __u64 mnt_id    - the ID of the mount (see statmount(2))

FAN_REPORT_MNT must be supplied to fanotify_init() to receive these events
and no other type of event can be received with this report type.

Marks are added with FAN_MARK_MNTNS, which records the mount namespace from
an nsfs file (e.g. /proc/self/ns/mnt).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20250129165803.72138-3-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-05 17:21:07 +01:00
Miklos Szeredi b944249bce
fsnotify: add mount notification infrastructure
This is just the plumbing between the event source (fs/namespace.c) and the
event consumer (fanotify).  In itself it does nothing.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20250129165803.72138-2-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-04 11:14:47 +01:00
Christian Brauner 2ce23285d7
fs: cache first and last mount
Speed up listmount() by caching the first and last node making retrieval
of the first and last mount of each mount namespace O(1).

Link: https://lore.kernel.org/r/20241215-vfs-6-14-mount-work-v1-2-fd55922c4af8@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-09 16:58:54 +01:00
Christian Brauner 4368898b27
fs: lockless mntns lookup for nsfs
We already made the rbtree lookup lockless for the simple lookup case.
However, walking the list of mount namespaces via nsfs still happens
with taking the read lock blocking concurrent additions of new mount
namespaces pointlessly. Plus, such additions are rare anyway so allow
lockless lookup of the previous and next mount namespace by keeping a
separate list. This also allows to make some things simpler in the code.

Link: https://lore.kernel.org/r/20241213-work-mount-rbtree-lockless-v3-5-6e3cdaf9b280@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-09 16:58:52 +01:00
Christian Brauner 5dcbd85d35
fs: lockless mntns rbtree lookup
Currently we use a read-write lock but for the simple search case we can
make this lockless. Creating a new mount namespace is a rather rare
event compared with querying mounts in a foreign mount namespace. Once
this is picked up by e.g., systemd to list mounts in another mount in
it's isolated services or in containers this will be used a lot so this
seems worthwhile doing.

Link: https://lore.kernel.org/r/20241213-work-mount-rbtree-lockless-v3-3-6e3cdaf9b280@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-09 16:58:52 +01:00
Christian Brauner 344bac8f0d
fs: kill MNT_ONRB
Move mnt->mnt_node into the union with mnt->mnt_rcu and mnt->mnt_llist
instead of keeping it with mnt->mnt_list. This allows us to use
RB_CLEAR_NODE(&mnt->mnt_node) in umount_tree() as well as
list_empty(&mnt->mnt_node). That in turn allows us to remove MNT_ONRB.

This also fixes the bug reported in [1] where seemingly MNT_ONRB wasn't
set in @mnt->mnt_flags even though the mount was present in the mount
rbtree of the mount namespace.

The root cause is the following race. When a btrfs subvolume is mounted
a temporary mount is created:

btrfs_get_tree_subvol()
{
        mnt = fc_mount()
        // Register the newly allocated mount with sb->mounts:
        lock_mount_hash();
        list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts);
        unlock_mount_hash();
}

and registered on sb->s_mounts. Later it is added to an anonymous mount
namespace via mount_subvol():

-> mount_subvol()
   -> mount_subtree()
      -> alloc_mnt_ns()
         mnt_add_to_ns()
         vfs_path_lookup()
         put_mnt_ns()

The mnt_add_to_ns() call raises MNT_ONRB in @mnt->mnt_flags. If someone
concurrently does a ro remount:

reconfigure_super()
-> sb_prepare_remount_readonly()
   {
           list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
   }

all mounts registered in sb->s_mounts are visited and first
MNT_WRITE_HOLD is raised, then MNT_READONLY is raised, and finally
MNT_WRITE_HOLD is removed again.

The flag modification for MNT_WRITE_HOLD/MNT_READONLY and MNT_ONRB race
so MNT_ONRB might be lost.

Fixes: 2eea9ce431 ("mounts: keep list of mounts in an rbtree")
Cc: <stable@kernel.org> # v6.8+
Link: https://lore.kernel.org/r/20241215-vfs-6-14-mount-work-v1-1-fd55922c4af8@kernel.org
Link: https://lore.kernel.org/r/ec6784ed-8722-4695-980a-4400d4e7bd1a@gmx.com [1]
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-09 16:58:50 +01:00
Linus Torvalds 9020d0d844 vfs-6.12.mount
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEmwAKCRCRxhvAZXjc
 otRsAQCUdlBS/ky2JiYn3ePURKYVBgRq/+PnmhRrBNDuv+ToZwD+NRLNlOM8FzQy
 c8BMSq0rkwO2C5Aax3kGxgTPMEuuCwc=
 =QLvm
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs mount updates from Christian Brauner:
 "Recently, we added the ability to list mounts in other mount
  namespaces and the ability to retrieve namespace file descriptors
  without having to go through procfs by deriving them from pidfds.

  This extends nsfs in two ways:

   (1) Add the ability to retrieve information about a mount namespace
       via NS_MNT_GET_INFO.

       This will return the mount namespace id and the number of mounts
       currently in the mount namespace. The number of mounts can be
       used to size the buffer that needs to be used for listmount() and
       is in general useful without having to actually iterate through
       all the mounts.

      The structure is extensible.

   (2) Add the ability to iterate through all mount namespaces over
       which the caller holds privilege returning the file descriptor
       for the next or previous mount namespace.

       To retrieve a mount namespace the caller must be privileged wrt
       to it's owning user namespace. This means that PID 1 on the host
       can list all mounts in all mount namespaces or that a container
       can list all mounts of its nested containers.

       Optionally pass a structure for NS_MNT_GET_INFO with
       NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount
       namespace in one go.

  (1) and (2) can be implemented for other namespace types easily.

  Together with recent api additions this means one can iterate through
  all mounts in all mount namespaces without ever touching procfs.

  The commit message in 49224a345c ('Merge patch series "nsfs: iterate
  through mount namespaces"') contains example code how to do this"

* tag 'vfs-6.12.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  nsfs: iterate through mount namespaces
  file: add fput() cleanup helper
  fs: add put_mnt_ns() cleanup helper
  fs: allow mount namespace fd
2024-09-16 11:15:26 +02:00
Yue Haibing 2e91f69afa fs: mounts: Remove unused declaration mnt_cursor_del()
Commit 2eea9ce431 ("mounts: keep list of mounts in an rbtree")
removed the implementation but leave declaration.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20240803115000.589872-1-yuehaibing@huawei.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30 08:22:32 +02:00
Christian Brauner a1d220d9da
nsfs: iterate through mount namespaces
It is already possible to list mounts in other mount namespaces and to
retrieve namespace file descriptors without having to go through procfs
by deriving them from pidfds.

Augment these abilities by adding the ability to retrieve information
about a mount namespace via NS_MNT_GET_INFO. This will return the mount
namespace id and the number of mounts currently in the mount namespace.
The number of mounts can be used to size the buffer that needs to be
used for listmount() and is in general useful without having to actually
iterate through all the mounts. The structure is extensible.

And add the ability to iterate through all mount namespaces over which
the caller holds privilege returning the file descriptor for the next or
previous mount namespace.

To retrieve a mount namespace the caller must be privileged wrt to it's
owning user namespace. This means that PID 1 on the host can list all
mounts in all mount namespaces or that a container can list all mounts
of its nested containers.

Optionally pass a structure for NS_MNT_GET_INFO with
NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount namespace
in one go. Both ioctls can be implemented for other namespace types
easily.

Together with recent api additions this means one can iterate through
all mounts in all mount namespaces without ever touching procfs.

Link: https://lore.kernel.org/r/20240719-work-mount-namespace-v1-5-834113cab0d2@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-09 12:46:59 +02:00
Linus Torvalds f608cabaed vfs-6.11.mount
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEHCgAKCRCRxhvAZXjc
 om+gAQCC4nJqJqs9bJZIItRtZ71GnxZQO3HVohhIlNM2KKh0VgEA47JhD0O0Srfb
 CleII4cQWqB32BfB/vMeff6hpNa7SA4=
 =7ltk
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.11.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs mount query updates from Christian Brauner:
 "This contains work to extend the abilities of listmount() and
  statmount() and various fixes and cleanups.

  Features:

   - Allow iterating through mounts via listmount() from newest to
     oldest. This makes it possible for mount(8) to keep iterating the
     mount table in reverse order so it gets newest mounts first.

   - Relax permissions on listmount() and statmount().

     It's not necessary to have capabilities in the initial namespace:
     it is sufficient to have capabilities in the owning namespace of
     the mount namespace we're located in to list unreachable mounts in
     that namespace.

   - Extend both listmount() and statmount() to list and stat mounts in
     foreign mount namespaces.

     Currently the only way to iterate over mount entries in mount
     namespaces that aren't in the caller's mount namespace is by
     crawling through /proc in order to find /proc/<pid>/mountinfo for
     the relevant mount namespace.

     This is both very clumsy and hugely inefficient. So extend struct
     mnt_id_req with a new member that allows to specify the mount
     namespace id of the mount namespace we want to look at.

     Luckily internally we already have most of the infrastructure for
     this so we just need to expose it to userspace. Give userspace a
     way to retrieve the id of a mount namespace via statmount() and
     through a new nsfs ioctl() on mount namespace file descriptor.

     This comes with appropriate selftests.

   - Expose mount options through statmount().

     Currently if userspace wants to get mount options for a mount and
     with statmount(), they still have to open /proc/<pid>/mountinfo to
     parse mount options. Simply the information through statmount()
     directly.

     Afterwards it's possible to only rely on statmount() and
     listmount() to retrieve all and more information than
     /proc/<pid>/mountinfo provides.

     This comes with appropriate selftests.

  Fixes:

   - Avoid copying to userspace under the namespace semaphore in
     listmount.

  Cleanups:

   - Simplify the error handling in listmount by relying on our newly
     added cleanup infrastructure.

   - Refuse invalid mount ids early for both listmount and statmount"

* tag 'vfs-6.11.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: reject invalid last mount id early
  fs: refuse mnt id requests with invalid ids early
  fs: find rootfs mount of the mount namespace
  fs: only copy to userspace on success in listmount()
  sefltests: extend the statmount test for mount options
  fs: use guard for namespace_sem in statmount()
  fs: export mount options via statmount()
  fs: rename show_mnt_opts -> show_vfsmnt_opts
  selftests: add a test for the foreign mnt ns extensions
  fs: add an ioctl to get the mnt ns id from nsfs
  fs: Allow statmount() in foreign mount namespace
  fs: Allow listmount() in foreign mount namespace
  fs: export the mount ns id via statmount
  fs: keep an index of current mount namespaces
  fs: relax permissions for statmount()
  listmount: allow listing in reverse order
  fs: relax permissions for listmount()
  fs: simplify error handling
  fs: don't copy to userspace under namespace semaphore
  path: add cleanup helper
2024-07-15 11:54:04 -07:00
Josef Bacik 1901c92497
fs: keep an index of current mount namespaces
In order to allow for listmount() to be used on different namespaces we
need a way to lookup a mount ns by its id.  Keep a rbtree of the current
!anonymous mount name spaces indexed by ID that we can use to look up
the namespace.

Co-developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/e5fdd78a90f5b00a75bd893962a70f52a2c015cd.1719243756.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-28 09:53:30 +02:00
Christian Brauner 620c266f39 fhandle: relax open_by_handle_at() permission checks
A current limitation of open_by_handle_at() is that it's currently not possible
to use it from within containers at all because we require CAP_DAC_READ_SEARCH
in the initial namespace. That's unfortunate because there are scenarios where
using open_by_handle_at() from within containers.

Two examples:

(1) cgroupfs allows to encode cgroups to file handles and reopen them with
    open_by_handle_at().
(2) Fanotify allows placing filesystem watches they currently aren't usable in
    containers because the returned file handles cannot be used.

Here's a proposal for relaxing the permission check for open_by_handle_at().

(1) Opening file handles when the caller has privileges over the filesystem
    (1.1) The caller has an unobstructed view of the filesystem.
    (1.2) The caller has permissions to follow a path to the file handle.

This doesn't address the problem of opening a file handle when only a portion
of a filesystem is exposed as is common in containers by e.g., bind-mounting a
subtree. The proposal to solve this use-case is:

(2) Opening file handles when the caller has privileges over a subtree
    (2.1) The caller is able to reach the file from the provided mount fd.
    (2.2) The caller has permissions to construct an unobstructed path to the
          file handle.
    (2.3) The caller has permissions to follow a path to the file handle.

The relaxed permission checks are currently restricted to directory file
handles which are what both cgroupfs and fanotify need. Handling disconnected
non-directory file handles would lead to a potentially non-deterministic api.
If a disconnected non-directory file handle is provided we may fail to decode
a valid path that we could use for permission checking. That in itself isn't a
problem as we would just return EACCES in that case. However, confusion may
arise if a non-disconnected dentry ends up in the cache later and those opening
the file handle would suddenly succeed.

* It's potentially possible to use timing information (side-channel) to infer
  whether a given inode exists. I don't think that's particularly
  problematic. Thanks to Jann for bringing this to my attention.

* An unrelated note (IOW, these are thoughts that apply to
  open_by_handle_at() generically and are unrelated to the changes here):
  Jann pointed out that we should verify whether deleted files could
  potentially be reopened through open_by_handle_at(). I don't think that's
  possible though.

  Another potential thing to check is whether open_by_handle_at() could be
  abused to open internal stuff like memfds or gpu stuff. I don't think so
  but I haven't had the time to completely verify this.

This dates back to discussions Amir and I had quite some time ago and thanks to
him for providing a lot of details around the export code and related patches!

Link: https://lore.kernel.org/r/20240524-vfs-open_by_handle_at-v1-1-3d4b7d22736b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-28 15:57:23 +02:00
Miklos Szeredi 2eea9ce431
mounts: keep list of mounts in an rbtree
When adding a mount to a namespace insert it into an rbtree rooted in the
mnt_namespace instead of a linear list.

The mnt.mnt_list is still used to set up the mount tree and for
propagation, but not after the mount has been added to a namespace.  Hence
mnt_list can live in union with rb_node.  Use MNT_ONRB mount flag to
validate that the mount is on the correct list.

This allows removing the cursor used for reading /proc/$PID/mountinfo.  The
mnt_id_unique of the next mount can be used as an index into the seq file.

Tested by inserting 100k bind mounts, unsharing the mount namespace, and
unmounting.  No performance regressions have been observed.

For the last mount in the 100k list the statmount() call was more than 100x
faster due to the mount ID lookup not having to do a linear search.  This
patch makes the overhead of mount ID lookup non-observable in this range.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20231025140205.3586473-3-mszeredi@redhat.com
Reviewed-by: Ian Kent <raven@themaw.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18 14:56:16 +01:00
Miklos Szeredi 98d2b43081
add unique mount ID
If a mount is released then its mnt_id can immediately be reused.  This is
bad news for user interfaces that want to uniquely identify a mount.

Implementing a unique mount ID is trivial (use a 64bit counter).
Unfortunately userspace assumes 32bit size and would overflow after the
counter reaches 2^32.

Introduce a new 64bit ID alongside the old one.  Initialize the counter to
2^32, this guarantees that the old and new IDs are never mixed up.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20231025140205.3586473-2-mszeredi@redhat.com
Reviewed-by: Ian Kent <raven@themaw.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18 14:56:16 +01:00
Al Viro 7e4745a094 switch try_to_unlazy_next() to __legitimize_mnt()
The tricky case (__legitimize_mnt() failing after having grabbed
a reference) can be trivially dealt with by leaving nd->path.mnt
non-NULL, for terminate_walk() to drop it.

legitimize_mnt() becomes static after that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-05 16:18:21 -04:00
Christian Brauner d033cb6784
mount: make {lock,unlock}_mount_hash() static
The lock_mount_hash() and unlock_mount_hash() helpers are never called
outside a single file. Remove them from the header and make them static
to reflect this fact. There's no need to have them callable from other
places right now, as Christoph observed.

Link: https://lore.kernel.org/r/20210121131959.646623-31-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:29:34 +01:00
Kirill Tkhai 1a7b8969e6
mnt: Use generic ns_common::count
Switch over mount namespaces to use the newly introduced common lifetime
counter.

Currently every namespace type has its own lifetime counter which is stored
in the specific namespace struct. The lifetime counters are used
identically for all namespaces types. Namespaces may of course have
additional unrelated counters and these are not altered.

This introduces a common lifetime counter into struct ns_common. The
ns_common struct encompasses information that all namespaces share. That
should include the lifetime counter since its common for all of them.

It also allows us to unify the type of the counters across all namespaces.
Most of them use refcount_t but one uses atomic_t and at least one uses
kref. Especially the last one doesn't make much sense since it's just a
wrapper around refcount_t since 2016 and actually complicates cleanup
operations by having to use container_of() to cast the correct namespace
struct out of struct ns_common.

Having the lifetime counter for the namespaces in one place reduces
maintenance cost. Not just because after switching all namespaces over we
will have removed more code than we added but also because the logic is
more easily understandable and we indicate to the user that the basic
lifetime requirements for all namespaces are currently identical.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/159644980287.604812.761686947449081169.stgit@localhost.localdomain
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-08-19 14:14:19 +02:00
Miklos Szeredi 9f6c61f96f proc/mounts: add cursor
If mounts are deleted after a read(2) call on /proc/self/mounts (or its
kin), the subsequent read(2) could miss a mount that comes after the
deleted one in the list.  This is because the file position is interpreted
as the number mount entries from the start of the list.

E.g. first read gets entries #0 to #9; the seq file index will be 10.  Then
entry #5 is deleted, resulting in #10 becoming #9 and #11 becoming #10,
etc...  The next read will continue from entry #10, and #9 is missed.

Solve this by adding a cursor entry for each open instance.  Taking the
global namespace_sem for write seems excessive, since we are only dealing
with a per-namespace list.  Instead add a per-namespace spinlock and use
that together with namespace_sem taken for read to protect against
concurrent modification of the mount list.  This may reduce parallelism of
is_local_mountpoint(), but it's hardly a big contention point.  We could
also use RCU freeing of cursors to make traversal not need additional
locks, if that turns out to be neceesary.

Only move the cursor once for each read (cursor is not added on open) to
minimize cacheline invalidation.  When EOF is reached, the cursor is taken
off the list, in order to prevent an excessive number of cursors due to
inactive open file descriptors.

Reported-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-14 16:44:24 +02:00
Al Viro 56cbb429d9 switch the remnants of releasing the mountpoint away from fs_pin
We used to need rather convoluted ordering trickery to guarantee
that dput() of ex-mountpoints happens before the final mntput()
of the same.  Since we don't need that anymore, there's no point
playing with fs_pin for that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-16 22:52:37 -04:00
Al Viro 4edbe133f8 make struct mountpoint bear the dentry reference to mountpoint, not struct mount
Using dput_to_list() to shift the contributing reference from ->mnt_mountpoint
to ->mnt_mp->m_dentry.  Dentries are dropped (with dput_to_list()) as soon
as struct mountpoint is destroyed; in cases where we are under namespace_sem
we use the global list, shrinking it in namespace_unlock().  In case of
detaching stuck MNT_LOCKed children at final mntput_no_expire() we use a local
list and shrink it ourselves.  ->mnt_ex_mountpoint crap is gone.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-16 22:43:40 -04:00
Al Viro 74e831221c saner handling of temporary namespaces
mount_subtree() creates (and soon destroys) a temporary namespace,
so that automounts could function normally.  These beasts should
never become anyone's current namespaces; they don't, but it would
be better to make prevention of that more straightforward.  And
since they don't become anyone's current namespace, we don't need
to bother with reserving procfs inums for those.

Teach alloc_mnt_ns() to skip inum allocation if told so, adjust
put_mnt_ns() accordingly, make mount_subtree() use temporary
(anon) namespace.  is_anon_ns() checks if a namespace is such.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:07 -05:00
Greg Kroah-Hartman b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Linus Torvalds e06fdaf40a Now that IPC and other changes have landed, enable manual markings for
randstruct plugin, including the task_struct.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJZbRgGAAoJEIly9N/cbcAmk2AQAIL60aQ+9RIcFAXriFhnd7Z2
 x9Jqi9JNc8NgPFXx8GhE4J4eTZ5PwcjgXBpNRWY/laBkRyoBHn24ku09YxrJjmHz
 ZSUsP+/iO9lVeEfbmU9Tnk50afkfwx6bHXBwkiVGQWHtybNVUqA19JbqkHeg8ubx
 myKLGeUv5PPCodRIcBDD0+HaAANcsqtgbDpgmWU8s+IXWwvWCE2p7PuBw7v3HHgH
 qzlPDHYQCRDw+LWsSqPaHj+9mbRO18P/ydMoZHGH4Hl3YYNtty8ZbxnraI3A7zBL
 6mLUVcZ+/l88DqHc5I05T8MmLU1yl2VRxi8/jpMAkg9wkvZ5iNAtlEKIWU6eqsvk
 vaImNOkViLKlWKF+oUD1YdG16d8Segrc6m4MGdI021tb+LoGuUbkY7Tl4ee+3dl/
 9FM+jPv95HjJnyfRNGidh2TKTa9KJkh6DYM9aUnktMFy3ca1h/LuszOiN0LTDiHt
 k5xoFURk98XslJJyXM8FPwXCXiRivrXMZbg5ixNoS4aYSBLv7Cn1M6cPnSOs7UPh
 FqdNPXLRZ+vabSxvEg5+41Ioe0SHqACQIfaSsV5BfF2rrRRdaAxK4h7DBcI6owV2
 7ziBN1nBBq2onYGbARN6ApyCqLcchsKtQfiZ0iFsvW7ZawnkVOOObDTCgPl3tdkr
 403YXzphQVzJtpT5eRV6
 =ngAW
 -----END PGP SIGNATURE-----

Merge tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull structure randomization updates from Kees Cook:
 "Now that IPC and other changes have landed, enable manual markings for
  randstruct plugin, including the task_struct.

  This is the rest of what was staged in -next for the gcc-plugins, and
  comes in three patches, largest first:

   - mark "easy" structs with __randomize_layout

   - mark task_struct with an optional anonymous struct to isolate the
     __randomize_layout section

   - mark structs to opt _out_ of automated marking (which will come
     later)

  And, FWIW, this continues to pass allmodconfig (normal and patched to
  enable gcc-plugins) builds of x86_64, i386, arm64, arm, powerpc, and
  s390 for me"

* tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  randstruct: opt-out externally exposed function pointer structs
  task_struct: Allow randomized layout
  randstruct: Mark various structs for randomization
2017-07-19 08:55:18 -07:00
Kees Cook 3859a271a0 randstruct: Mark various structs for randomization
This marks many critical kernel structures for randomization. These are
structures that have been targeted in the past in security exploits, or
contain functions pointers, pointers to function pointer tables, lists,
workqueues, ref-counters, credentials, permissions, or are otherwise
sensitive. This initial list was extracted from Brad Spengler/PaX Team's
code in the last public patch of grsecurity/PaX based on my understanding
of the code. Changes or omissions from the original code are mine and
don't reflect the original grsecurity/PaX code.

Left out of this list is task_struct, which requires special handling
and will be covered in a subsequent patch.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-06-30 12:00:51 -07:00
Eric W. Biederman 99b19d1647 mnt: In propgate_umount handle visiting mounts in any order
While investigating some poor umount performance I realized that in
the case of overlapping mount trees where some of the mounts are locked
the code has been failing to unmount all of the mounts it should
have been unmounting.

This failure to unmount all of the necessary
mounts can be reproduced with:

$ cat locked_mounts_test.sh

mount -t tmpfs test-base /mnt
mount --make-shared /mnt
mkdir -p /mnt/b

mount -t tmpfs test1 /mnt/b
mount --make-shared /mnt/b
mkdir -p /mnt/b/10

mount -t tmpfs test2 /mnt/b/10
mount --make-shared /mnt/b/10
mkdir -p /mnt/b/10/20

mount --rbind /mnt/b /mnt/b/10/20

unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
sleep 1
umount -l /mnt/b
wait %%

$ unshare -Urm ./locked_mounts_test.sh

This failure is corrected by removing the prepass that marks mounts
that may be umounted.

A first pass is added that umounts mounts if possible and if not sets
mount mark if they could be unmounted if they weren't locked and adds
them to a list to umount possibilities.  This first pass reconsiders
the mounts parent if it is on the list of umount possibilities, ensuring
that information of umoutability will pass from child to mount parent.

A second pass then walks through all mounts that are umounted and processes
their children unmounting them or marking them for reparenting.

A last pass cleans up the state on the mounts that could not be umounted
and if applicable reparents them to their first parent that remained
mounted.

While a bit longer than the old code this code is much more robust
as it allows information to flow up from the leaves and down
from the trunk making the order in which mounts are encountered
in the umount propgation tree irrelevant.

Cc: stable@vger.kernel.org
Fixes: 0c56fe3142 ("mnt: Don't propagate unmounts to locked mounts")
Reviewed-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-05-23 08:41:16 -05:00
Eric W. Biederman 570487d3fa mnt: In umount propagation reparent in a separate pass
It was observed that in some pathlogical cases that the current code
does not unmount everything it should.  After investigation it
was determined that the issue is that mnt_change_mntpoint can
can change which mounts are available to be unmounted during mount
propagation which is wrong.

The trivial reproducer is:
$ cat ./pathological.sh

mount -t tmpfs test-base /mnt
cd /mnt
mkdir 1 2 1/1
mount --bind 1 1
mount --make-shared 1
mount --bind 1 2
mount --bind 1/1 1/1
mount --bind 1/1 1/1
echo
grep test-base /proc/self/mountinfo
umount 1/1
echo
grep test-base /proc/self/mountinfo

$ unshare -Urm ./pathological.sh

The expected output looks like:
46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

The output without the fix looks like:
46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

That last mount in the output was in the propgation tree to be unmounted but
was missed because the mnt_change_mountpoint changed it's parent before the walk
through the mount propagation tree observed it.

Cc: stable@vger.kernel.org
Fixes: 1064f874ab ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-05-23 08:40:32 -05:00
Jan Kara 08991e83b7 fsnotify: Free fsnotify_mark_connector when there is no mark attached
Currently we free fsnotify_mark_connector structure only when inode /
vfsmount is getting freed. This can however impose noticeable memory
overhead when marks get attached to inodes only temporarily. So free the
connector structure once the last mark is detached from the object.
Since notification infrastructure can be working with the connector
under the protection of fsnotify_mark_srcu, we have to be careful and
free the fsnotify_mark_connector only after SRCU period passes.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00