mirror of https://github.com/torvalds/linux.git
93 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
9b93f5069f |
fs.idmapped.mnt_idmap.v6.2
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5by6AAKCRCRxhvAZXjc
onblAPsFzodV8/9UoCIkKxwn0aiclbiAITTWI9ZLulmKhm0I6wD/RUOLKjt12uZJ
m81UTfkWHopWKtQ+X3saZEcyYTNLugE=
=AtGb
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.mnt_idmap.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull idmapping updates from Christian Brauner:
"Last cycle we've already made the interaction with idmapped mounts
more robust and type safe by introducing the vfs{g,u}id_t type. This
cycle we concluded the conversion and removed the legacy helpers.
Currently we still pass around the plain namespace that was attached
to a mount. This is in general pretty convenient but it makes it easy
to conflate namespaces that are relevant on the filesystem - with
namespaces that are relevent on the mount level. Especially for
filesystem developers without detailed knowledge in this area this can
be a potential source for bugs.
Instead of passing the plain namespace we introduce a dedicated type
struct mnt_idmap and replace the pointer with a pointer to a struct
mnt_idmap. There are no semantic or size changes for the mount struct
caused by this.
We then start converting all places aware of idmapped mounts to rely
on struct mnt_idmap. Once the conversion is done all helpers down to
the really low-level make_vfs{g,u}id() and from_vfs{g,u}id() will take
a struct mnt_idmap argument instead of two namespace arguments. This
way it becomes impossible to conflate the two removing and thus
eliminating the possibility of any bugs. Fwiw, I fixed some issues in
that area a while ago in ntfs3 and ksmbd in the past. Afterwards only
low-level code can ultimately use the associated namespace for any
permission checks. Even most of the vfs can be completely obivious
about this ultimately and filesystems will never interact with it in
any form in the future.
A struct mnt_idmap currently encompasses a simple refcount and pointer
to the relevant namespace the mount is idmapped to. If a mount isn't
idmapped then it will point to a static nop_mnt_idmap and if it
doesn't that it is idmapped. As usual there are no allocations or
anything happening for non-idmapped mounts. Everthing is carefully
written to be a nop for non-idmapped mounts as has always been the
case.
If an idmapped mount is created a struct mnt_idmap is allocated and a
reference taken on the relevant namespace. Each mount that gets
idmapped or inherits the idmap simply bumps the reference count on
struct mnt_idmap. Just a reminder that we only allow a mount to change
it's idmapping a single time and only if it hasn't already been
attached to the filesystems and has no active writers.
The actual changes are fairly straightforward but this will have huge
benefits for maintenance and security in the long run even if it
causes some churn.
Note that this also makes it possible to extend struct mount_idmap in
the future. For example, it would be possible to place the namespace
pointer in an anonymous union together with an idmapping struct. This
would allow us to expose an api to userspace that would let it specify
idmappings directly instead of having to go through the detour of
setting up namespaces at all"
* tag 'fs.idmapped.mnt_idmap.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
acl: conver higher-level helpers to rely on mnt_idmap
fs: introduce dedicated idmap type for mounts
|
|
|
|
d6fdf29f7b
|
posix_acl: Fix the type of sentinel in get_acl
The type should be struct posix_acl * instead of void *. Cc: Christian Brauner <brauner@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
|
|
|
5a6f52d20c
|
acl: conver higher-level helpers to rely on mnt_idmap
Convert an initial portion to rely on struct mnt_idmap by converting the high level xattr helpers. Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
|
|
|
a351b1f444
|
acl: make vfs_posix_acl_to_xattr() static
After reworking posix acls this helper isn't used anywhere outside the core posix acl paths. Make it static. Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
|
|
|
0a26bde2c9
|
acl: remove a slew of now unused helpers
Now that the posix acl api is active we can remove all the hacky helpers we had to keep around for all these years and also remove the set and get posix acl xattr handler methods as they aren't needed anymore. Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
|
|
|
318e66856d
|
xattr: use posix acl api
In previous patches we built a new posix api solely around get and set inode operations. Now that we have all the pieces in place we can switch the system calls and the vfs over to only rely on this api when interacting with posix acls. This finally removes all type unsafety and type conversion issues explained in detail in [1] that we aim to get rid of. With the new posix acl api we immediately translate into an appropriate kernel internal struct posix_acl format both when getting and setting posix acls. This is a stark contrast to before were we hacked unsafe raw values into the uapi struct that was stored in a void pointer relying and having filesystems and security modules hack around in the uapi struct as well. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
|
|
|
aeb7f00542
|
acl: add vfs_remove_acl()
In previous patches we implemented get and set inode operations for all non-stacking filesystems that support posix acls but didn't yet implement get and/or set acl inode operations. This specifically affected cifs and 9p. Now we can build a posix acl api based solely on get and set inode operations. We add a new vfs_remove_acl() api that can be used to set posix acls. This finally removes all type unsafety and type conversion issues explained in detail in [1] that we aim to get rid of. After we finished building the vfs api we can switch stacking filesystems to rely on the new posix api and then finally switch the xattr system calls themselves to rely on the posix acl api. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
|
|
|
4f353ba4a9
|
acl: add vfs_get_acl()
In previous patches we implemented get and set inode operations for all non-stacking filesystems that support posix acls but didn't yet implement get and/or set acl inode operations. This specifically affected cifs and 9p. Now we can build a posix acl api based solely on get and set inode operations. We add a new vfs_get_acl() api that can be used to get posix acls. This finally removes all type unsafety and type conversion issues explained in detail in [1] that we aim to get rid of. After we finished building the vfs api we can switch stacking filesystems to rely on the new posix api and then finally switch the xattr system calls themselves to rely on the posix acl api. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
|
|
|
e4cc916303
|
acl: add vfs_set_acl()
In previous patches we implemented get and set inode operations for all non-stacking filesystems that support posix acls but didn't yet implement get and/or set acl inode operations. This specifically affected cifs and 9p. Now we can build a posix acl api based solely on get and set inode operations. We add a new vfs_set_acl() api that can be used to set posix acls. This finally removes all type unsafety and type conversion issues explained in detail in [1] that we aim to get rid of. After we finished building the vfs api we can switch stacking filesystems to rely on the new posix api and then finally switch the xattr system calls themselves to rely on the posix acl api. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
|
|
|
cac2f8b8d8
|
fs: rename current get acl method
The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. The current inode operation for getting posix acls takes an inode argument but various filesystems (e.g., 9p, cifs, overlayfs) need access to the dentry. In contrast to the ->set_acl() inode operation we cannot simply extend ->get_acl() to take a dentry argument. The ->get_acl() inode operation is called from: acl_permission_check() -> check_acl() -> get_acl() which is part of generic_permission() which in turn is part of inode_permission(). Both generic_permission() and inode_permission() are called in the ->permission() handler of various filesystems (e.g., overlayfs). So simply passing a dentry argument to ->get_acl() would amount to also having to pass a dentry argument to ->permission(). We should avoid this unnecessary change. So instead of extending the existing inode operation rename it from ->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that passes a dentry argument and which filesystems that need access to the dentry can implement instead of ->get_inode_acl(). Filesystems like cifs which allow setting and getting posix acls but not using them for permission checking during lookup can simply not implement ->get_inode_acl(). This is intended to be a non-functional change. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
|
|
|
138060ba92
|
fs: pass dentry to set acl method
The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. Since some filesystem rely on the dentry being available to them when setting posix acls (e.g., 9p and cifs) they cannot rely on set acl inode operation. But since ->set_acl() is required in order to use the generic posix acl xattr handlers filesystems that do not implement this inode operation cannot use the handler and need to implement their own dedicated posix acl handlers. Update the ->set_acl() inode method to take a dentry argument. This allows all filesystems to rely on ->set_acl(). As far as I can tell all codepaths can be switched to rely on the dentry instead of just the inode. Note that the original motivation for passing the dentry separate from the inode instead of just the dentry in the xattr handlers was because of security modules that call security_d_instantiate(). This hook is called during d_instantiate_new(), d_add(), __d_instantiate_anon(), and d_splice_alias() to initialize the inode's security context and possibly to set security.* xattrs. Since this only affects security.* xattrs this is completely irrelevant for posix acls. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
|
|
|
27bc50fc90 |
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam R. Howlett. An overlapping range-based tree for vmas. It it apparently slight more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com). This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU= =xfWx -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ... |
|
|
|
36f05cab0a |
tmpfs: add support for an i_version counter
NFSv4 mandates a change attribute to avoid problems with timestamp granularity, which Linux implements using the i_version counter. This is particularly important when the underlying filesystem is fast. Give tmpfs an i_version counter. Since it doesn't have to be persistent, we can just turn on SB_I_VERSION and sprinkle some inode_inc_iversion calls in the right places. Also, while there is no formal spec for xattrs, most implementations update the ctime on setxattr. Fix shmem_xattr_handler_set to update the ctime and bump the i_version appropriately. Link: https://lkml.kernel.org/r/20220909130031.15477-1-jlayton@kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
0978c7c41f
|
acl: fix the comments of posix_acl_xattr_set
remove the double world of 'in'. Signed-off-by: Deming Wang <wangdeming@inspur.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
|
|
|
52edb4080e
|
acl: move idmapping handling into posix_acl_xattr_set()
The uapi POSIX ACL struct passed through the value argument during
setxattr() contains {g,u}id values encoded via ACL_{GROUP,USER} entries
that should actually be stored in the form of k{g,u}id_t (See [1] for a
long explanation of the issue.).
In
|
|
|
|
6b70fe0601
|
acl: add vfs_set_acl_prepare()
Various filesystems store POSIX ACLs on the backing store in their uapi
format. Such filesystems need to translate from the uapi POSIX ACL
format into the VFS format during i_op->get_acl(). The VFS provides the
posix_acl_from_xattr() helper for this task.
But the usage of posix_acl_from_xattr() is currently ambiguous. It is
intended to transform from a uapi POSIX ACL to the VFS represenation.
For example, when retrieving POSIX ACLs for permission checking during
lookup or when calling getxattr() to retrieve system.posix_acl_{access,default}.
Calling posix_acl_from_xattr() during i_op->get_acl() will map the raw
{g,u}id values stored as ACL_{GROUP,USER} entries in the uapi POSIX ACL
format into k{g,u}id_t in the filesystem's idmapping and return a struct
posix_acl ready to be returned to the VFS for caching and to perform
permission checks on.
However, posix_acl_from_xattr() is also called during setxattr() for all
filesystems that rely on VFS provides posix_acl_{access,default}_xattr_handler.
The posix_acl_xattr_set() handler which is used for the ->set() method
of posix_acl_{access,default}_xattr_handler uses posix_acl_from_xattr()
to translate from the uapi POSIX ACL format to the VFS format so that it
can be passed to the i_op->set_acl() handler of the filesystem or for
direct caching in case no i_op->set_acl() handler is defined.
During setxattr() the {g,u}id values stored as ACL_{GROUP,USER} entries
in the uapi POSIX ACL format aren't raw {g,u}id values that need to be
mapped according to the filesystem's idmapping. Instead they are {g,u}id
values in the caller's idmapping which have been generated during
posix_acl_fix_xattr_from_user(). In other words, they are k{g,u}id_t
which are passed as raw {g,u}id values abusing the uapi POSIX ACL format
(Please note that this type safety violation has existed since the
introduction of k{g,u}id_t. Please see [1] for more details.).
So when posix_acl_from_xattr() is called in posix_acl_xattr_set() the
filesystem idmapping is completely irrelevant. Instead, we abuse the
initial idmapping to recover the k{g,u}id_t base on the value stored in
raw {g,u}id as ACL_{GROUP,USER} in the uapi POSIX ACL format.
We need to clearly distinguish betweeen these two operations as it is
really easy to confuse for filesystems as can be seen in ntfs3.
In order to do this we factor out make_posix_acl() which takes callbacks
allowing callers to pass dedicated methods to generate the correct
k{g,u}id_t. This is just an internal static helper which is not exposed
to any filesystems but it neatly encapsulates the basic logic of walking
through a uapi POSIX ACL and returning an allocated VFS POSIX ACL with
the correct k{g,u}id_t values.
The posix_acl_from_xattr() helper can then be implemented as a simple
call to make_posix_acl() with callbacks that generate the correct
k{g,u}id_t from the raw {g,u}id values in ACL_{GROUP,USER} entries in
the uapi POSIX ACL format as read from the backing store.
For setxattr() we add a new helper vfs_set_acl_prepare() which has
callbacks to map the POSIX ACLs from the uapi format with the k{g,u}id_t
values stored in raw {g,u}id format in ACL_{GROUP,USER} entries into the
correct k{g,u}id_t values in the filesystem idmapping. In contrast to
posix_acl_from_xattr() the vfs_set_acl_prepare() helper needs to take
the mount idmapping into account. The differences are explained in more
detail in the kernel doc for the new functions.
In follow up patches we will remove all abuses of posix_acl_from_xattr()
for setxattr() operations and replace it with calls to vfs_set_acl_prepare().
The new vfs_set_acl_prepare() helper allows us to deal with the
ambiguity in how the POSI ACL uapi struct stores {g,u}id values
depending on whether this is a getxattr() or setxattr() operation.
This also allows us to remove the posix_acl_setxattr_idmapped_mnt()
helper reducing the abuse of the POSIX ACL uapi format to pass values
that should be distinct types in {g,u}id values stored as
ACL_{GROUP,USER} entries.
The removal of posix_acl_setxattr_idmapped_mnt() in turn allows us to
re-constify the value parameter of vfs_setxattr() which in turn allows
us to avoid the nasty cast from a const void pointer to a non-const void
pointer on ovl_do_setxattr().
Ultimately, the plan is to get rid of the type violations completely and
never pass the values from k{g,u}id_t as raw {g,u}id in ACL_{GROUP,USER}
entries in uapi POSIX ACL format. But that's a longer way to go and this
is a preparatory step.
Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Co-Developed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
|
|
985a6d0b3c
|
acl: return EOPNOTSUPP in posix_acl_fix_xattr_common()
Return EOPNOTSUPP when the POSIX ACL version doesn't match and zero if
there are no entries. This will allow us to reuse the helper in
posix_acl_from_xattr(). This change will have no user visible effects.
Fixes:
|
|
|
|
abfcf55d8b
|
acl: handle idmapped mounts for idmapped filesystems
Ensure that POSIX ACLs checking, getting, and setting works correctly
for filesystems mountable with a filesystem idmapping ("fs_idmapping")
that want to support idmapped mounts ("mnt_idmapping").
Note that no filesystems mountable with an fs_idmapping do yet support
idmapped mounts. This is required infrastructure work to unblock this.
As we explained in detail in [1] the fs_idmapping is irrelevant for
getxattr() and setxattr() when mapping the ACL_{GROUP,USER} {g,u}ids
stored in the uapi struct posix_acl_xattr_entry in
posix_acl_fix_xattr_{from,to}_user().
But for acl_permission_check() and posix_acl_{g,s}etxattr_idmapped_mnt()
the fs_idmapping matters.
acl_permission_check():
During lookup POSIX ACLs are retrieved directly via i_op->get_acl() and
are returned via the kernel internal struct posix_acl which contains
e_{g,u}id members of type k{g,u}id_t that already take the
fs_idmapping into acccount.
For example, a POSIX ACL stored with u4 on the backing store is mapped
to k10000004 in the fs_idmapping. The mnt_idmapping remaps the POSIX ACL
to k20000004. In order to do that the fs_idmapping needs to be taken
into account but that doesn't happen yet (Again, this is a
counterfactual currently as fuse doesn't support idmapped mounts
currently. It's just used as a convenient example.):
fs_idmapping: u0:k10000000:r65536
mnt_idmapping: u0:v20000000:r65536
ACL_USER: k10000004
acl_permission_check()
-> check_acl()
-> get_acl()
-> i_op->get_acl() == fuse_get_acl()
-> posix_acl_from_xattr(u0:k10000000:r65536 /* fs_idmapping */, ...)
{
k10000004 = make_kuid(u0:k10000000:r65536 /* fs_idmapping */,
u4 /* ACL_USER */);
}
-> posix_acl_permission()
{
-1 = make_vfsuid(u0:v20000000:r65536 /* mnt_idmapping */,
&init_user_ns,
k10000004);
vfsuid_eq_kuid(-1, k10000004 /* caller_fsuid */)
}
In order to correctly map from the fs_idmapping into mnt_idmapping we
require the relevant fs_idmaping to be passed:
acl_permission_check()
-> check_acl()
-> get_acl()
-> i_op->get_acl() == fuse_get_acl()
-> posix_acl_from_xattr(u0:k10000000:r65536 /* fs_idmapping */, ...)
{
k10000004 = make_kuid(u0:k10000000:r65536 /* fs_idmapping */,
u4 /* ACL_USER */);
}
-> posix_acl_permission()
{
v20000004 = make_vfsuid(u0:v20000000:r65536 /* mnt_idmapping */,
u0:k10000000:r65536 /* fs_idmapping */,
k10000004);
vfsuid_eq_kuid(v20000004, k10000004 /* caller_fsuid */)
}
The initial_idmapping is only correct for the current situation because
all filesystems that currently support idmapped mounts do not support
being mounted with an fs_idmapping.
Note that ovl_get_acl() is used to retrieve the POSIX ACLs from the
relevant lower layer and the lower layer's mnt_idmapping needs to be
taken into account and so does the fs_idmapping. See
|
|
|
|
8043bffd01
|
acl: make posix_acl_clone() available to overlayfs
The ovl_get_acl() function needs to alter the POSIX ACLs retrieved from the lower filesystem. Instead of hand-rolling a overlayfs specific posix_acl_clone() variant allow export it. It's not special and it's not deeply internal anyway. Link: https://lore.kernel.org/r/20220708090134.385160-3-brauner@kernel.org Cc: Seth Forshee <sforshee@digitalocean.com> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: linux-unionfs@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Seth Forshee <sforshee@digitalocean.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
|
|
|
e933c15f76
|
acl: port to vfs{g,u}id_t
Port the few remaining pieces to vfs{g,u}id_t and associated type safe helpers.
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
|
|
0c5fd887d2
|
acl: move idmapped mount fixup into vfs_{g,s}etxattr()
This cycle we added support for mounting overlayfs on top of idmapped mounts.
Recently I've started looking into potential corner cases when trying to add
additional tests and I noticed that reporting for POSIX ACLs is currently wrong
when using idmapped layers with overlayfs mounted on top of it.
I'm going to give a rather detailed explanation to both the origin of the
problem and the solution.
Let's assume the user creates the following directory layout and they have a
rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you would
expect files on your host system to be owned. For example, ~/.bashrc for your
regular user would be owned by 1000:1000 and /root/.bashrc would be owned by
0:0. IOW, this is just regular boring filesystem tree on an ext4 or xfs
filesystem.
The user chooses to set POSIX ACLs using the setfacl binary granting the user
with uid 4 read, write, and execute permissions for their .bashrc file:
setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
Now they to expose the whole rootfs to a container using an idmapped mount. So
they first create:
mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
mkdir -pv /vol/contpool/ctrover/{over,work}
chown 10000000:10000000 /vol/contpool/ctrover/{over,work}
The user now creates an idmapped mount for the rootfs:
mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
/var/lib/lxc/c2/rootfs \
/vol/contpool/lowermap
This for example makes it so that /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
which is owned by uid and gid 1000 as being owned by uid and gid 10001000 at
/vol/contpool/lowermap/home/ubuntu/.bashrc.
Assume the user wants to expose these idmapped mounts through an overlayfs
mount to a container.
mount -t overlay overlay \
-o lowerdir=/vol/contpool/lowermap, \
upperdir=/vol/contpool/overmap/over, \
workdir=/vol/contpool/overmap/work \
/vol/contpool/merge
The user can do this in two ways:
(1) Mount overlayfs in the initial user namespace and expose it to the
container.
(2) Mount overlayfs on top of the idmapped mounts inside of the container's
user namespace.
Let's assume the user chooses the (1) option and mounts overlayfs on the host
and then changes into a container which uses the idmapping 0:10000000:65536
which is the same used for the two idmapped mounts.
Now the user tries to retrieve the POSIX ACLs using the getfacl command
getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc
and to their surprise they see:
# file: vol/contpool/merge/home/ubuntu/.bashrc
# owner: 1000
# group: 1000
user::rw-
user:4294967295:rwx
group::r--
mask::rwx
other::r--
indicating the the uid wasn't correctly translated according to the idmapped
mount. The problem is how we currently translate POSIX ACLs. Let's inspect the
callchain in this example:
idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get == ovl_posix_acl_xattr_get()
| -> ovl_xattr_get()
| -> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get() /* lower filesystem callback */
|> posix_acl_fix_xattr_to_user()
{
4 = make_kuid(&init_user_ns, 4);
4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
/* FAILURE */
-1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
}
If the user chooses to use option (2) and mounts overlayfs on top of idmapped
mounts inside the container things don't look that much better:
idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get == ovl_posix_acl_xattr_get()
| -> ovl_xattr_get()
| -> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get() /* lower filesystem callback */
|> posix_acl_fix_xattr_to_user()
{
4 = make_kuid(&init_user_ns, 4);
4 = mapped_kuid_fs(&init_user_ns, 4);
/* FAILURE */
-1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
}
As is easily seen the problem arises because the idmapping of the lower mount
isn't taken into account as all of this happens in do_gexattr(). But
do_getxattr() is always called on an overlayfs mount and inode and thus cannot
possible take the idmapping of the lower layers into account.
This problem is similar for fscaps but there the translation happens as part of
vfs_getxattr() already. Let's walk through an fscaps overlayfs callchain:
setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
The expected outcome here is that we'll receive the cap_net_raw capability as
we are able to map the uid associated with the fscap to 0 within our container.
IOW, we want to see 0 as the result of the idmapping translations.
If the user chooses option (1) we get the following callchain for fscaps:
idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
-> vfs_getxattr()
-> xattr_getsecurity()
-> security_inode_getsecurity() ________________________________
-> cap_inode_getsecurity() | |
{ V |
10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000); |
10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); |
/* Expected result is 0 and thus that we own the fscap. */ |
0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); |
} |
-> vfs_getxattr_alloc() |
-> handler->get == ovl_other_xattr_get() |
-> vfs_getxattr() |
-> xattr_getsecurity() |
-> security_inode_getsecurity() |
-> cap_inode_getsecurity() |
{ |
0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); |
10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000); |
|____________________________________________________________________|
}
-> vfs_getxattr_alloc()
-> handler->get == /* lower filesystem callback */
And if the user chooses option (2) we get:
idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
-> vfs_getxattr()
-> xattr_getsecurity()
-> security_inode_getsecurity() _______________________________
-> cap_inode_getsecurity() | |
{ V |
10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0); |
10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); |
/* Expected result is 0 and thus that we own the fscap. */ |
0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); |
} |
-> vfs_getxattr_alloc() |
-> handler->get == ovl_other_xattr_get() |
|-> vfs_getxattr() |
-> xattr_getsecurity() |
-> security_inode_getsecurity() |
-> cap_inode_getsecurity() |
{ |
0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); |
10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
|____________________________________________________________________|
}
-> vfs_getxattr_alloc()
-> handler->get == /* lower filesystem callback */
We can see how the translation happens correctly in those cases as the
conversion happens within the vfs_getxattr() helper.
For POSIX ACLs we need to do something similar. However, in contrast to fscaps
we cannot apply the fix directly to the kernel internal posix acl data
structure as this would alter the cached values and would also require a rework
of how we currently deal with POSIX ACLs in general which almost never take the
filesystem idmapping into account (the noteable exception being FUSE but even
there the implementation is special) and instead retrieve the raw values based
on the initial idmapping.
The correct values are then generated right before returning to userspace. The
fix for this is to move taking the mount's idmapping into account directly in
vfs_getxattr() instead of having it be part of posix_acl_fix_xattr_to_user().
To this end we split out two small and unexported helpers
posix_acl_getxattr_idmapped_mnt() and posix_acl_setxattr_idmapped_mnt(). The
former to be called in vfs_getxattr() and the latter to be called in
vfs_setxattr().
Let's go back to the original example. Assume the user chose option (1) and
mounted overlayfs on top of idmapped mounts on the host:
idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| |> __vfs_getxattr()
| | -> handler->get == ovl_posix_acl_xattr_get()
| | -> ovl_xattr_get()
| | -> vfs_getxattr()
| | |> __vfs_getxattr()
| | | -> handler->get() /* lower filesystem callback */
| | |> posix_acl_getxattr_idmapped_mnt()
| | {
| | 4 = make_kuid(&init_user_ns, 4);
| | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
| | 10000004 = from_kuid(&init_user_ns, 10000004);
| | |_______________________
| | } |
| | |
| |> posix_acl_getxattr_idmapped_mnt() |
| { |
| V
| 10000004 = make_kuid(&init_user_ns, 10000004);
| 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
| 10000004 = from_kuid(&init_user_ns, 10000004);
| } |_________________________________________________
| |
| |
|> posix_acl_fix_xattr_to_user() |
{ V
10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
/* SUCCESS */
4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
}
And similarly if the user chooses option (1) and mounted overayfs on top of
idmapped mounts inside the container:
idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| |> __vfs_getxattr()
| | -> handler->get == ovl_posix_acl_xattr_get()
| | -> ovl_xattr_get()
| | -> vfs_getxattr()
| | |> __vfs_getxattr()
| | | -> handler->get() /* lower filesystem callback */
| | |> posix_acl_getxattr_idmapped_mnt()
| | {
| | 4 = make_kuid(&init_user_ns, 4);
| | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
| | 10000004 = from_kuid(&init_user_ns, 10000004);
| | |_______________________
| | } |
| | |
| |> posix_acl_getxattr_idmapped_mnt() |
| { V
| 10000004 = make_kuid(&init_user_ns, 10000004);
| 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
| 10000004 = from_kuid(0(&init_user_ns, 10000004);
| |_________________________________________________
| } |
| |
|> posix_acl_fix_xattr_to_user() |
{ V
10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
/* SUCCESS */
4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004);
}
The last remaining problem we need to fix here is ovl_get_acl(). During
ovl_permission() overlayfs will call:
ovl_permission()
-> generic_permission()
-> acl_permission_check()
-> check_acl()
-> get_acl()
-> inode->i_op->get_acl() == ovl_get_acl()
> get_acl() /* on the underlying filesystem)
->inode->i_op->get_acl() == /*lower filesystem callback */
-> posix_acl_permission()
passing through the get_acl request to the underlying filesystem. This will
retrieve the acls stored in the lower filesystem without taking the idmapping
of the underlying mount into account as this would mean altering the cached
values for the lower filesystem. So we block using ACLs for now until we
decided on a nice way to fix this. Note this limitation both in the
documentation and in the code.
The most straightforward solution would be to have ovl_get_acl() simply
duplicate the ACLs, update the values according to the idmapped mount and
return it to acl_permission_check() so it can be used in posix_acl_permission()
forgetting them afterwards. This is a bit heavy handed but fairly
straightforward otherwise.
Link: https://github.com/brauner/mount-idmapped/issues/9
Link: https://lore.kernel.org/r/20220708090134.385160-2-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
|
|
705191b03d |
fs: fix acl translation
Last cycle we extended the idmapped mounts infrastructure to support
idmapped mounts of idmapped filesystems (No such filesystem yet exist.).
Since then, the meaning of an idmapped mount is a mount whose idmapping
is different from the filesystems idmapping.
While doing that work we missed to adapt the acl translation helpers.
They still assume that checking for the identity mapping is enough. But
they need to use the no_idmapping() helper instead.
Note, POSIX ACLs are always translated right at the userspace-kernel
boundary using the caller's current idmapping and the initial idmapping.
The order depends on whether we're coming from or going to userspace.
The filesystem's idmapping doesn't matter at the border.
Consequently, if a non-idmapped mount is passed we need to make sure to
always pass the initial idmapping as the mount's idmapping and not the
filesystem idmapping. Since it's irrelevant here it would yield invalid
ids and prevent setting acls for filesystems that are mountable in a
userns and support posix acls (tmpfs and fuse).
I verified the regression reported in [1] and verified that this patch
fixes it. A regression test will be added to xfstests in parallel.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215849 [1]
Fixes:
|
|
|
|
bd303368b7
|
fs: support mapped mounts of mapped filesystems
In previous patches we added new and modified existing helpers to handle
idmapped mounts of filesystems mounted with an idmapping. In this final
patch we convert all relevant places in the vfs to actually pass the
filesystem's idmapping into these helpers.
With this the vfs is in shape to handle idmapped mounts of filesystems
mounted with an idmapping. Note that this is just the generic
infrastructure. Actually adding support for idmapped mounts to a
filesystem mountable with an idmapping is follow-up work.
In this patch we extend the definition of an idmapped mount from a mount
that that has the initial idmapping attached to it to a mount that has
an idmapping attached to it which is not the same as the idmapping the
filesystem was mounted with.
As before we do not allow the initial idmapping to be attached to a
mount. In addition this patch prevents that the idmapping the filesystem
was mounted with can be attached to a mount created based on this
filesystem.
This has multiple reasons and advantages. First, attaching the initial
idmapping or the filesystem's idmapping doesn't make much sense as in
both cases the values of the i_{g,u}id and other places where k{g,u}ids
are used do not change. Second, a user that really wants to do this for
whatever reason can just create a separate dedicated identical idmapping
to attach to the mount. Third, we can continue to use the initial
idmapping as an indicator that a mount is not idmapped allowing us to
continue to keep passing the initial idmapping into the mapping helpers
to tell them that something isn't an idmapped mount even if the
filesystem is mounted with an idmapping.
Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-11-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-11-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|
|
|
|
4472071331
|
fs: use low-level mapping helpers
In a few places the vfs needs to interact with bare k{g,u}ids directly
instead of struct inode. These are just a few. In previous patches we
introduced low-level mapping helpers that are able to support
filesystems mounted an idmapping. This patch simply converts the places
to use these new helpers.
Link: https://lore.kernel.org/r/20211123114227.3124056-7-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-7-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-7-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|
|
|
|
a793d79ea3
|
fs: move mapping helpers
The low-level mapping helpers were so far crammed into fs.h. They are
out of place there. The fs.h header should just contain the higher-level
mapping helpers that interact directly with vfs objects such as struct
super_block or struct inode and not the bare mapping helpers. Similarly,
only vfs and specific fs code shall interact with low-level mapping
helpers. And so they won't be made accessible automatically through
regular {g,u}id helpers.
Link: https://lore.kernel.org/r/20211123114227.3124056-3-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-3-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-3-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|
|
|
|
d1cef29adc |
fs/posix_acl.c: avoid -Wempty-body warning
The fallthrough comment for an ignored cmpxchg() return value produces a
harmless warning with 'make W=1':
fs/posix_acl.c: In function 'get_acl':
fs/posix_acl.c:127:36: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
127 | /* fall through */ ;
| ^
Simplify it as a step towards a clean W=1 build. As all architectures
define cmpxchg() as a statement expression these days, it is no longer
necessary to evaluate its return code, and the if() can just be droped.
Link: https://lkml.kernel.org/r/20210927102410.1863853-1-arnd@kernel.org
Link: https://lore.kernel.org/all/20210322132103.qiun2rjilnlgztxe@wittgenstein/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
332f606b32 |
ovl: enable RCU'd ->get_acl()
Overlayfs does not cache ACL's (to avoid double caching). Instead it just calls the underlying filesystem's i_op->get_acl(), which will return the cached value, if possible. In rcu path walk, however, get_cached_acl_rcu() is employed to get the value from the cache, which will fail on overlayfs resulting in dropping out of rcu walk mode. This can result in a big performance hit in certain situations. Fix by calling ->get_acl() with rcu=true in case of ACL_DONT_CACHE (which indicates pass-through) Reported-by: garyhuang <zjh.20052005@163.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> |
|
|
|
0cad624662 |
vfs: add rcu argument to ->get_acl() callback
Add a rcu argument to the ->get_acl() callback to allow get_cached_acl_rcu() to call the ->get_acl() method in the next patch. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> |
|
|
|
549c729771
|
fs: make helpers idmap mount aware
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
e65ce2a50c
|
acl: handle idmapped mounts
The posix acl permission checking helpers determine whether a caller is privileged over an inode according to the acls associated with the inode. Add helpers that make it possible to handle acls on idmapped mounts. The vfs and the filesystems targeted by this first iteration make use of posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to translate basic posix access and default permissions such as the ACL_USER and ACL_GROUP type according to the initial user namespace (or the superblock's user namespace) to and from the caller's current user namespace. Adapt these two helpers to handle idmapped mounts whereby we either map from or into the mount's user namespace depending on in which direction we're translating. Similarly, cap_convert_nscap() is used by the vfs to translate user namespace and non-user namespace aware filesystem capabilities from the superblock's user namespace to the caller's user namespace. Enable it to handle idmapped mounts by accounting for the mount's user namespace. In addition the fileystems targeted in the first iteration of this patch series make use of the posix_acl_chmod() and, posix_acl_update_mode() helpers. Both helpers perform permission checks on the target inode. Let them handle idmapped mounts. These two helpers are called when posix acls are set by the respective filesystems to handle this case we extend the ->set() method to take an additional user namespace argument to pass the mount's user namespace down. Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
21cb47be6f
|
inode: make init and permission helpers idmapped mount aware
The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
47291baa8d
|
namei: make permission helpers idmapped mount aware
The two helpers inode_permission() and generic_permission() are used by the vfs to perform basic permission checking by verifying that the caller is privileged over an inode. In order to handle idmapped mounts we extend the two helpers with an additional user namespace argument. On idmapped mounts the two helpers will make sure to map the inode according to the mount's user namespace and then peform identical permission checks to inode_permission() and generic_permission(). If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
0558c1bf5a
|
capability: handle idmapped mounts
In order to determine whether a caller holds privilege over a given inode the capability framework exposes the two helpers privileged_wrt_inode_uidgid() and capable_wrt_inode_uidgid(). The former verifies that the inode has a mapping in the caller's user namespace and the latter additionally verifies that the caller has the requested capability in their current user namespace. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped inodes. If the initial user namespace is passed all operations are a nop so non-idmapped mounts will not see a change in behavior. Link: https://lore.kernel.org/r/20210121131959.646623-5-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
63d72b93f2 |
vfs: clean up posix_acl_permission() logic aroudn MAY_NOT_BLOCK
posix_acl_permission() does not care about MAY_NOT_BLOCK, and in fact the permission logic internally must not check that bit (it's only for upper layers to decide whether they can block to do IO to look up the acl information or not). But the way the code was written, it _looked_ like it cared, since the function explicitly did not mask that bit off. But it has exactly two callers: one for when that bit is set, which first clears the bit before calling posix_acl_permission(), and the other call site when that bit was clear. So stop the silly games "saving" the MAY_NOT_BLOCK bit that must not be used for the actual permission test, and that currently is pointlessly cleared by the callers when the function itself should just not care. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
e39e773ad1 |
fs/posix_acl.c: fix kernel-doc warnings
Fix kernel-doc warnings in fs/posix_acl.c.
Also fix one typo (setgit -> setgid).
fs/posix_acl.c:647: warning: Function parameter or member 'inode' not described in 'posix_acl_update_mode'
fs/posix_acl.c:647: warning: Function parameter or member 'mode_p' not described in 'posix_acl_update_mode'
fs/posix_acl.c:647: warning: Function parameter or member 'acl' not described in 'posix_acl_update_mode'
Link: http://lkml.kernel.org/r/29b0dc46-1f28-a4e5-b1d0-ba2b65629779@infradead.org
Fixes:
|
|
|
|
457c899653 |
treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
6671726054 |
posix_acl: convert posix_acl.a_refcount from atomic_t to refcount_t
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable posix_acl.a_refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. **Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the posix_acl.a_refcount it might make a difference in following places: - get_cached_acl(): increment in refcount_inc_not_zero() only guarantees control dependency on success vs. fully ordered atomic counterpart. However this operation is performed under rcu_read_lock(), so this should be fine. - posix_acl_release(): decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> |
|
|
|
5b825c3af1 |
sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>
Add #include <linux/cred.h> dependencies to all .c files rely on sched.h doing that for them. Note that even if the count where we need to add extra headers seems high, it's still a net win, because <linux/sched.h> is included in over 2,200 files ... Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
497de07d89 |
tmpfs: clear S_ISGID when setting posix ACLs
This change was missed the tmpfs modification in In CVE-2016-7097
commit
|
|
|
|
101105b171 |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro: ">rename2() work from Miklos + current_time() from Deepa" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: Replace current_fs_time() with current_time() fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps fs: Replace CURRENT_TIME with current_time() for inode timestamps fs: proc: Delete inode time initializations in proc_alloc_inode() vfs: Add current_time() api vfs: add note about i_op->rename changes to porting fs: rename "rename2" i_op to "rename" vfs: remove unused i_op->rename fs: make remaining filesystems use .rename2 libfs: support RENAME_NOREPLACE in simple_rename() fs: support RENAME_NOREPLACE for local filesystems ncpfs: fix unused variable warning |
|
|
|
e55f1d1d13 | Merge remote-tracking branch 'jk/vfs' into work.misc | |
|
|
f334bcd94b | Merge remote-tracking branch 'ovl/misc' into work.misc | |
|
|
2211d5ba5c |
posix_acl: xattr representation cleanups
Remove the unnecessary typedefs and the zero-length a_entries array in struct posix_acl_xattr_header. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
078cd8279e |
fs: Replace CURRENT_TIME with current_time() for inode timestamps
CURRENT_TIME macro is not appropriate for filesystems as it doesn't use the right granularity for filesystem timestamps. Use current_time() instead. CURRENT_TIME is also not y2038 safe. This is also in preparation for the patch that transitions vfs timestamps to use 64 bit time and hence make them y2038 safe. As part of the effort current_time() will be extended to do range checks. Hence, it is necessary for all file system timestamps to use current_time(). Also, current_time() will be transitioned along with vfs to be y2038 safe. Note that whenever a single call to current_time() is used to change timestamps in different inodes, it is because they share the same time granularity. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Felipe Balbi <balbi@kernel.org> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
073931017b |
posix_acl: Clear SGID bit when setting file permissions
When file permissions are modified via chmod(2) and the user is not in the owning group or capable of CAP_FSETID, the setgid bit is cleared in inode_change_ok(). Setting a POSIX ACL via setxattr(2) sets the file permissions as well as the new ACL, but doesn't clear the setgid bit in a similar way; this allows to bypass the check in chmod(2). Fix that. References: CVE-2016-7097 Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> |
|
|
|
beaf226b86 |
posix_acl: don't ignore return value of posix_acl_create_masq()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> |
|
|
|
a867d7349e |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull userns vfs updates from Eric Biederman:
"This tree contains some very long awaited work on generalizing the
user namespace support for mounting filesystems to include filesystems
with a backing store. The real world target is fuse but the goal is
to update the vfs to allow any filesystem to be supported. This
patchset is based on a lot of code review and testing to approach that
goal.
While looking at what is needed to support the fuse filesystem it
became clear that there were things like xattrs for security modules
that needed special treatment. That the resolution of those concerns
would not be fuse specific. That sorting out these general issues
made most sense at the generic level, where the right people could be
drawn into the conversation, and the issues could be solved for
everyone.
At a high level what this patchset does a couple of simple things:
- Add a user namespace owner (s_user_ns) to struct super_block.
- Teach the vfs to handle filesystem uids and gids not mapping into
to kuids and kgids and being reported as INVALID_UID and
INVALID_GID in vfs data structures.
By assigning a user namespace owner filesystems that are mounted with
only user namespace privilege can be detected. This allows security
modules and the like to know which mounts may not be trusted. This
also allows the set of uids and gids that are communicated to the
filesystem to be capped at the set of kuids and kgids that are in the
owning user namespace of the filesystem.
One of the crazier corner casees this handles is the case of inodes
whose i_uid or i_gid are not mapped into the vfs. Most of the code
simply doesn't care but it is easy to confuse the inode writeback path
so no operation that could cause an inode write-back is permitted for
such inodes (aka only reads are allowed).
This set of changes starts out by cleaning up the code paths involved
in user namespace permirted mounts. Then when things are clean enough
adds code that cleanly sets s_user_ns. Then additional restrictions
are added that are possible now that the filesystem superblock
contains owner information.
These changes should not affect anyone in practice, but there are some
parts of these restrictions that are changes in behavior.
- Andy's restriction on suid executables that does not honor the
suid bit when the path is from another mount namespace (think
/proc/[pid]/fd/) or when the filesystem was mounted by a less
privileged user.
- The replacement of the user namespace implicit setting of MNT_NODEV
with implicitly setting SB_I_NODEV on the filesystem superblock
instead.
Using SB_I_NODEV is a stronger form that happens to make this state
user invisible. The user visibility can be managed but it caused
problems when it was introduced from applications reasonably
expecting mount flags to be what they were set to.
There is a little bit of work remaining before it is safe to support
mounting filesystems with backing store in user namespaces, beyond
what is in this set of changes.
- Verifying the mounter has permission to read/write the block device
during mount.
- Teaching the integrity modules IMA and EVM to handle filesystems
mounted with only user namespace root and to reduce trust in their
security xattrs accordingly.
- Capturing the mounters credentials and using that for permission
checks in d_automount and the like. (Given that overlayfs already
does this, and we need the work in d_automount it make sense to
generalize this case).
Furthermore there are a few changes that are on the wishlist:
- Get all filesystems supporting posix acls using the generic posix
acls so that posix_acl_fix_xattr_from_user and
posix_acl_fix_xattr_to_user may be removed. [Maintainability]
- Reducing the permission checks in places such as remount to allow
the superblock owner to perform them.
- Allowing the superblock owner to chown files with unmapped uids and
gids to something that is mapped so the files may be treated
normally.
I am not considering even obvious relaxations of permission checks
until it is clear there are no more corner cases that need to be
locked down and handled generically.
Many thanks to Seth Forshee who kept this code alive, and putting up
with me rewriting substantial portions of what he did to handle more
corner cases, and for his diligent testing and reviewing of my
changes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
fs: Call d_automount with the filesystems creds
fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
evm: Translate user/group ids relative to s_user_ns when computing HMAC
dquot: For now explicitly don't support filesystems outside of init_user_ns
quota: Handle quota data stored in s_user_ns in quota_setxquota
quota: Ensure qids map to the filesystem
vfs: Don't create inodes with a uid or gid unknown to the vfs
vfs: Don't modify inodes with a uid or gid unknown to the vfs
cred: Reject inodes with invalid ids in set_create_file_as()
fs: Check for invalid i_uid in may_follow_link()
vfs: Verify acls are valid within superblock's s_user_ns.
userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
fs: Refuse uid/gid changes which don't map into s_user_ns
selinux: Add support for unprivileged mounts from user namespaces
Smack: Handle labels consistently in untrusted mounts
Smack: Add support for unprivileged mounts from user namespaces
fs: Treat foreign mounts as nosuid
fs: Limit file caps to the user namespace of the super block
userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
userns: Remove implicit MNT_NODEV fragility.
...
|
|
|
|
0d4d717f25 |
vfs: Verify acls are valid within superblock's s_user_ns.
Update posix_acl_valid to verify that an acl is within a user namespace. Update the callers of posix_acl_valid to pass in an appropriate user namespace. For posix_acl_xattr_set and v9fs_xattr_set_acl pass in inode->i_sb->s_user_ns to posix_acl_valid. For md_unpack_acl pass in &init_user_ns as no inode or superblock is in sight. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
485e71e8fb |
posix_acl: Add set_posix_acl
Factor out part of posix_acl_xattr_set into a common function that takes a posix_acl, which nfsd can also call. The prototype already exists in include/linux/posix_acl.h. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: stable@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: J. Bruce Fields <bfields@redhat.com> |
|
|
|
5930122683 |
switch xattr_handler->set() to passing dentry and inode separately
preparation for similar switch in ->setxattr() (see the next commit for rationale). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |