mirror of https://github.com/torvalds/linux.git
751 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
275498a98b |
exec: Add comments on check_unsafe_exec() fs counting
Add some comments about what the fs counting is doing in check_unsafe_exec() and how it relates to the call graph. Specifically, we can't force an unshare of the fs because of at least Chrome: https://lore.kernel.org/lkml/86CE201B-5632-4BB7-BCF6-7CB2C2895409@chromium.org/ Cc: Eric Biederman <ebiederm@xmission.com> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/r/20221018071537.never.662-kees@kernel.org |
|
|
|
2b5f9dad32 |
fs/exec: switch timens when a task gets a new mm
Changing a time namespace requires remapping a vvar page, so we don't want
to allow doing that if any other tasks can use the same mm.
Currently, we install a time namespace when a task is created with a new
vm. exec() is another case when a task gets a new mm and so it can switch
a time namespace safely, but it isn't handled now.
One more issue of the current interface is that clone() with CLONE_VM isn't
allowed if the current task has unshared a time namespace
(timens_for_children doesn't match the current timens).
Both these issues make some inconvenience for users. For example, Alexey
and Florian reported that posix_spawn() uses vfork+exec and this pattern
doesn't work with time namespaces due to the both described issues.
LXC needed to workaround the exec() issue by calling setns.
In the commit
|
|
|
|
5bf2fedca8 |
exec: Copy oldsighand->action under spin-lock
unshare_sighand should only access oldsighand->action
while holding oldsighand->siglock, to make sure that
newsighand->action is in a consistent state.
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: stable@vger.kernel.org
Fixes:
|
|
|
|
27bc50fc90 |
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam R. Howlett. An overlapping range-based tree for vmas. It it apparently slight more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com). This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU= =xfWx -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ... |
|
|
|
493ffd6605 |
ucounts: Split rlimit and ucount values and max values
After the ucount rlimit code was merged a bunch of small but
siginificant bugs were found and fixed. At the time it was realized
that part of the problem was that while the ucount rlimits were very
similar to the oridinary ucounts (in being nested counts with limits)
the semantics were slightly different and the code would be less error
prone if there was less sharing. This is the long awaited cleanup
that should hopefully keep things more comprehensible and less error
prone for whoever needs to touch that code next.
Alexey Gladkov (1):
ucounts: Split rlimit and ucount values and max values
fs/exec.c | 2 +-
fs/proc/array.c | 2 +-
include/linux/user_namespace.h | 35 ++++++++++++++++++++++-------------
kernel/fork.c | 12 ++++++------
kernel/sys.c | 2 +-
kernel/ucount.c | 34 +++++++++++++++-------------------
kernel/user_namespace.c | 10 +++++-----
7 files changed, 51 insertions(+), 46 deletions(-)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmM7U4cACgkQC/v6Eiaj
j0AbRA//RVrGJ9n5iYyHM7WgeoTlFbaupEyLTq5dEpkOMD9CEB4OpMymGA/VXbeX
cjgF5dqykfrdpYBwJdosl1fgq15ZFe9ChKhPGQkI5CGlwyRYTl2kq+FrZLC790s8
c4TN3fKO1DyQPn5+UNzlBgLP8ofiUqeScZJDGa+LeMlUIv1OFS3m05jHuG/uzl6b
bbbdcn61tFKOFCapbE72hWusEQssPOAN+dSY1/lwKO05WOKR0N2CR0EHyZhW2Owd
GIQ27Zh5ed/9xRNlxa8VIa+JDfuATbPeoWcvRmiWSEoAxKtPBUf8lwcltlHBUcKK
72MH+KU9AaIZ1prq9ng4xEaM+vXiSSNspYB8siwph7au1gWx1Yu2yYVavEPeFB9o
C0JaD7kTh6Mhk6xdPhnmFUHFOLLGC5LdnBcIwwoMb1jlwP4QJRVucbjpqaOptoiE
SeWhRRKUBwpcQdztQZCR+X0h1paHRJJXplHFmeEGcMviGWntgKUaxXJQ4BJrnRTO
pagn7h181KVF7u9Toh0IWzrd322mXNqmcgwhzE/S9pa5EJMQHt7qYkDzQCgEwoap
JmIld9tKkv/0fYMOHordjMb1OY37feI7FyDAuZuLP1ZWYgKhOq0LrD5x8PzKmoyM
6oKAOfXZUVT/Pnw21nEzAtHsazV3mLRpW+gLLiLSiWoSaYT4x14=
=kjVh
-----END PGP SIGNATURE-----
Merge tag 'ucount-rlimits-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ucounts update from Eric Biederman:
"Split rlimit and ucount values and max values
After the ucount rlimit code was merged a bunch of small but
siginificant bugs were found and fixed. At the time it was realized
that part of the problem was that while the ucount rlimits were very
similar to the oridinary ucounts (in being nested counts with limits)
the semantics were slightly different and the code would be less error
prone if there was less sharing.
This is the long awaited cleanup that should hopefully keep things
more comprehensible and less error prone for whoever needs to touch
that code next"
* tag 'ucount-rlimits-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucounts: Split rlimit and ucount values and max values
|
|
|
|
ab29622157 |
whack-a-mole: cropped up open-coded file_inode() uses...
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzxj0gAKCRBZ7Krx/gZQ 66/1AQC/KfIAINNOPxozsZaxOaOKo0ouVJ7sJV4ZGsPKpU69gwD/UodJZCtyZ52h wwkmfzTDjAgGt1QCKj96zk2XFqg4swE= =u0pv -----END PGP SIGNATURE----- Merge tag 'pull-file_inode' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull file_inode() updates from Al Vrio: "whack-a-mole: cropped up open-coded file_inode() uses..." * tag 'pull-file_inode' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: orangefs: use ->f_mapping _nfs42_proc_copy(): use ->f_mapping instead of file_inode()->i_mapping dma_buf: no need to bother with file_inode()->i_mapping nfs_finish_open(): don't open-code file_inode() bprm_fill_uid(): don't open-code file_inode() sgx: use ->f_mapping... exfat_iterate(): don't open-code file_inode(file) ibmvmc: don't open-code file_inode() |
|
|
|
987f20a9dc |
a.out: Remove the a.out implementation
In commit
|
|
|
|
19066e5868 |
exec: use VMA iterator instead of linked list
Remove a use of the vm_next list by doing the initial lookup with the VMA iterator and then using it to find the next entry. Link: https://lkml.kernel.org/r/20220906194824.2110408-42-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
7964cf8caa |
mm: remove vmacache
By using the maple tree and the maple tree state, the vmacache is no longer beneficial and is complicating the VMA code. Remove the vmacache to reduce the work in keeping it up to date and code complexity. Link: https://lkml.kernel.org/r/20220906194824.2110408-26-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
bd74fdaea1 |
mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page tables
to search for young PTEs and promote hot pages. A kill switch will be
added in the next patch to disable this behavior. When disabled, the
aging relies on the rmap only.
NB: this behavior has nothing similar with the page table scanning in the
2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the traversal
of an entire mm_struct list; the term "walk" will be applied to page
tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct follows
its owner task to the new memcg when this task is migrated. Given an
lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot pages
before it increments max_seq.
When multiple page table walkers iterate the same list, each of them gets
a unique mm_struct; therefore they can run concurrently. Page table
walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
pages it left in the previous memcg will not be promoted when its current
memcg is under reclaim. Similarly, page table walkers will not promote
pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[8, 10]%
Ops/sec KB/sec
patch1-7: 1147696.57 44640.29
patch1-8: 1245274.91 48435.66
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-7
48.16% lzo1x_1_do_compress (real work)
8.20% page_vma_mapped_walk (overhead)
7.06% _raw_spin_unlock_irq
2.92% ptep_clear_flush
2.53% __zram_bvec_write
2.11% do_raw_spin_lock
2.02% memmove
1.93% lru_gen_look_around
1.56% free_unref_page_list
1.40% memset
patch1-8
49.44% lzo1x_1_do_compress (real work)
6.19% page_vma_mapped_walk (overhead)
5.97% _raw_spin_unlock_irq
3.13% get_pfn_folio
2.85% ptep_clear_flush
2.42% __zram_bvec_write
2.08% do_raw_spin_lock
1.92% memmove
1.44% alloc_zspage
1.36% memset
Configurations:
no change
Thanks to the following developers for their efforts [3].
kernel test robot <lkp@intel.com>
[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
33a2d6bc34 |
Revert "fs/exec: allow to unshare a time namespace on vfork+exec"
This reverts commit
|
|
|
|
e6ae438124 |
bprm_fill_uid(): don't open-code file_inode()
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
50cd95ac46 |
execve fix for v6.0-rc2
- Replace remaining kmap() uses with kmap_local_page() (Fabio M. De Francesco)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmL/3lEWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJu9xD/9SivxMYldHN8E+KrVUaOUZB6OF
k97f+C4d438/kiZKIIGUqHv2Zbq4w9IFX3YQ9cmbN9dArEbTnvuGH/RTGI2L4yK1
uS4BG+9RZZgZfZ+I2JOxX1elqo2qloEkQQJMSvLSXFqElrb6eQWDSdizWMWto+Yb
l3Fsfq3zNf2cbxcCmlSaOZ4FFS3t5Yc7r/ArIRNPXVXIMcZkBdqj0fJ7nR7PplJY
YKg7ioAIYUoY2nH97hIWvMWTP+DJc9vXp9+SRUB45qph9gkOBBl34aOC39iMIkEH
AhPcGPf09DCZAu9sSfx3jH/YVV2jJg2DNbRw9bQlmfu+fdQBYCw1a2mhenpc0rUI
OA4R26KMCI7338aWjDJy9N+kY2fhm0J8L2NxZ4ySo0ZMQf3VxbPsXDh6ijX69ijr
DllH09o18RCIxhCRLOfGa3Je93FGnd2b/CI1z1CdQqQ/mdCWRvl3jRCpOXstqcAq
5ptU0tIEeY8+hor6rd2RfT5xqd1LjIrNdE0m5UUuNEv30IP2ldKIBWLH6jvZ5LJ2
U/YWJ8/GR+SLrpMNIX6fUAS+2VZNHUbUj2cfmB87OpBmKOcp+vJwf65OzujIfT21
OeMBkjql8DQlXhHk41SMDPrzzx76TJ0IaK0JCRPeUOggZwBX7txfUW5Xyi2UxseD
SRT6R+SXibr0vQyemQ==
=DTae
-----END PGP SIGNATURE-----
Merge tag 'execve-v6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve fix from Kees Cook:
- Replace remaining kmap() uses with kmap_local_page() (Fabio M. De
Francesco)
* tag 'execve-v6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
exec: Replace kmap{,_atomic}() with kmap_local_page()
|
|
|
|
3a608cfee9 |
exec: Replace kmap{,_atomic}() with kmap_local_page()
The use of kmap() and kmap_atomic() are being deprecated in favor of kmap_local_page(). There are two main problems with kmap(): (1) It comes with an overhead as mapping space is restricted and protected by a global lock for synchronization and (2) it also requires global TLB invalidation when the kmap’s pool wraps and it might block when the mapping space is fully utilized until a slot becomes available. With kmap_local_page() the mappings are per thread, CPU local, can take page faults, and can be called from any context (including interrupts). It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore, the tasks can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. Since the use of kmap_local_page() in exec.c is safe, it should be preferred everywhere in exec.c. As said, since kmap_local_page() can be also called from atomic context, and since remove_arg_zero() doesn't (and shouldn't ever) rely on an implicit preempt_disable(), this function can also safely replace kmap_atomic(). Therefore, replace kmap() and kmap_atomic() with kmap_local_page() in fs/exec.c. Tested with xfstests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel with HIGHMEM64GB enabled. Cc: Eric W. Biederman <ebiederm@xmission.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220803182856.28246-1-fmdefrancesco@gmail.com |
|
|
|
e362359ace |
posix-cpu-timers: Cleanup CPU timers before freeing them during exec
Commit |
|
|
|
d7b767b508 |
execve updates for v5.20-rc1
- Allow unsharing time namespace on vfork+exec (Andrei Vagin) - Replace usage of deprecated kmap APIs (Fabio M. De Francesco) - Fix spelling mistake (Zhang Jiaming) -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmLoDyAWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJh0mEACL07hj3eT3rWg6ohZx9sCTcAjY /tG+zxLQ7xu717nM1a4j7CI5kdNNpYbsCqG71ikDDRrOCeEutu7M8zE1emctjtHv oh853D6BKhV2Hvsiuk1oM2ZHR1bmgiW1eFNAJcCLz6rE6wYu564R0wYJV0h418fH Rjk+Y989A7Srs9t/9GQSktjX3Q039/PG28avhA5q144/ZNycr5FnLFOf4RlmzEUz 7E8TfGsftX8eRAfxW/dPiWuIKMuYPLqspca9pT3aFj3ze2qKnldjNV3c9M5ajL5Q q7KKWeWzunKyYHMaRzIxkHyhs396ZGKFN2PbcNYyml+NBItyc3fCHishMF7bW0Vb nyZbmYJslBloYmrSJYgqCfxyjUuhe0cMMk9iMzDVp6ROwtLgFFLwfwunM6RwRmnr dAmM8QGwSE3qYLhVnLEcRqpgdXzVd+S0TGhB5k5AyI3628/mLxhE66/eWq0X8QF5 los5zku1GagMkylt6SOGb3TME4JZe6ZdZpU4fe/ilM22qw852xgbF3+6Zap6IBbD AdzXVCHyU/obORfIxx5KTF213m4KpkWBBi3N1/vVlxIAFAUy1WdXDM1o2RPMD7hw DeHe8sgfTZxLmSqfWLuX+3qC94IvrbDPFaRCIMj1QNK0ltM8I9oHRPcUFyZMaV0O xHN/5QtmgVDfKA3mTw== =82SS -----END PGP SIGNATURE----- Merge tag 'execve-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: - Allow unsharing time namespace on vfork+exec (Andrei Vagin) - Replace usage of deprecated kmap APIs (Fabio M. De Francesco) - Fix spelling mistake (Zhang Jiaming) * tag 'execve-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: exec: Call kmap_local_page() in copy_string_kernel() exec: Fix a spelling mistake selftests/timens: add a test for vfork+exit fs/exec: allow to unshare a time namespace on vfork+exec |
|
|
|
c6e8e36c6a |
exec: Call kmap_local_page() in copy_string_kernel()
The use of kmap_atomic() is being deprecated in favor of kmap_local_page(). With kmap_local_page(), the mappings are per thread, CPU local and not globally visible. Furthermore, the mappings can be acquired from any context (including interrupts). Therefore, replace kmap_atomic() with kmap_local_page() in copy_string_kernel(). Instead of open-coding local mapping + memcpy(), use memcpy_to_page(). Delete a redundant call to flush_dcache_page(). Tested with xfstests on a QEMU/ KVM x86_32 VM, 6GB RAM, booting a kernel with HIGHMEM64GB enabled. Suggested-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220724212523.13317-1-fmdefrancesco@gmail.com |
|
|
|
d5b36a4dbd |
fix race between exit_itimers() and /proc/pid/timers
As Chris explains, the comment above exit_itimers() is not correct, we can race with proc_timers_seq_ops. Change exit_itimers() to clear signal->posix_timers with ->siglock held. Cc: <stable@vger.kernel.org> Reported-by: chris@accessvector.net Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
5036793d7d |
exec: Fix a spelling mistake
Change 'wont't' to 'won't'. Signed-off-by: Zhang Jiaming <jiaming@nfschina.com> Reviewed-by: Souptick Joarder (HPE) <jrdr.linux@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220629072932.27506-1-jiaming@nfschina.com |
|
|
|
133e2d3e81 |
fs/exec: allow to unshare a time namespace on vfork+exec
Right now, a new process can't be forked in another time namespace if it shares mm with its parent. It is prohibited, because each time namespace has its own vvar page that is mapped into a process address space. When a process calls exec, it gets a new mm and so it could be "legal" to switch time namespace in that case. This was not implemented and now if we want to do this, we need to add another clone flag to not break backward compatibility. We don't have any user requests to switch times on exec except the vfork+exec combination, so there is no reason to add a new clone flag. As for vfork+exec, this should be safe to allow switching timens with the current clone flag. Right now, vfork (CLONE_VFORK | CLONE_VM) fails if a child is forked into another time namespace. With this change, vfork creates a new process in parent's timens, and the following exec does the actual switch to the target time namespace. Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220613060723.197407-1-avagin@gmail.com |
|
|
|
1ec6574a3c |
This set of changes updates init and user mode helper tasks to be
ordinary user mode tasks. In commit |
|
|
|
de399236e2 |
ucounts: Split rlimit and ucount values and max values
Since the semantics of maximum rlimit values are different, it would be better not to mix ucount and rlimit values. This will prevent the error of using inc_count/dec_ucount for rlimit parameters. This patch also renames the functions to emphasize the lack of connection between rlimit and ucount. v3: - Fix BUG:KASAN:use-after-free_in_dec_ucount. v2: - Fix the array-index-out-of-bounds that was found by the lkp project. Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Link: https://lkml.kernel.org/r/20220518171730.l65lmnnjtnxnftpq@example.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> |
|
|
|
4a18419f71 |
mm/mprotect: use mmu_gather
Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6. This patchset is intended to remove unnecessary TLB flushes during mprotect() syscalls. Once this patch-set make it through, similar and further optimizations for MADV_COLD and userfaultfd would be possible. Basically, there are 3 optimizations in this patch-set: 1. Use TLB batching infrastructure to batch flushes across VMAs and do better/fewer flushes. This would also be handy for later userfaultfd enhancements. 2. Avoid unnecessary TLB flushes. This optimization is the one that provides most of the performance benefits. Unlike previous versions, we now only avoid flushes that would not result in spurious page-faults. 3. Avoiding TLB flushes on change_huge_pmd() that are only needed to prevent the A/D bits from changing. Andrew asked for some benchmark numbers. I do not have an easy determinate macrobenchmark in which it is easy to show benefit. I therefore ran a microbenchmark: a loop that does the following on anonymous memory, just as a sanity check to see that time is saved by avoiding TLB flushes. The loop goes: mprotect(p, PAGE_SIZE, PROT_READ) mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE) *p = 0; // make the page writable The test was run in KVM guest with 1 or 2 threads (the second thread was busy-looping). I measured the time (cycles) of each operation: 1 thread 2 threads mmots +patch mmots +patch PROT_READ 3494 2725 (-22%) 8630 7788 (-10%) PROT_READ|WRITE 3952 2724 (-31%) 9075 2865 (-68%) [ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ] The exact numbers are really meaningless, but the benefit is clear. There are 2 interesting results though. (1) PROT_READ is cheaper, while one can expect it not to be affected. This is presumably due to TLB miss that is saved (2) Without memory access (*p = 0), the speedup of the patch is even greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush. As a result both operations on the patched kernel take roughly ~1500 cycles (with either 1 or 2 threads), whereas on mmotm their cost is as high as presented in the table. This patch (of 3): change_pXX_range() currently does not use mmu_gather, but instead implements its own deferred TLB flushes scheme. This both complicates the code, as developers need to be aware of different invalidation schemes, and prevents opportunities to avoid TLB flushes or perform them in finer granularity. The use of mmu_gather for modified PTEs has benefits in various scenarios even if pages are not released. For instance, if only a single page needs to be flushed out of a range of many pages, only that page would be flushed. If a THP page is flushed, on x86 a single TLB invlpg instruction can be used instead of 512 instructions (or a full TLB flush, which would Linux would actually use by default). mprotect() over multiple VMAs requires a single flush. Use mmu_gather in change_pXX_range(). As the pages are not released, only record the flushed range using tlb_flush_pXX_range(). Handle THP similarly and get rid of flush_cache_range() which becomes redundant since tlb_start_vma() calls it when needed. Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Xu <peterx@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Nick Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
1b2552cbdb |
fork: Stop allowing kthreads to call execve
Now that kernel_execve is no longer called from kernel threads stop supporting kernel threads calling kernel_execve. Remove the code for converting a kthread to a normal thread in execve. Document the restriction that kthreads may not call kernel_execve by having kernel_execve fail if called by a kthread. Link: https://lkml.kernel.org/r/20220506141512.516114-7-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
343f4c49f2 |
kthread: Don't allocate kthread_struct for init and umh
If kthread_is_per_cpu runs concurrently with free_kthread_struct the kthread_struct that was just freed may be read from. This bug was introduced by commit |
|
|
|
1930a6e739 |
ptrace: Cleanups for v5.18
This set of changes removes tracehook.h, moves modification of all of
the ptrace fields inside of siglock to remove races, adds a missing
permission check to ptrace.c
The removal of tracehook.h is quite significant as it has been a major
source of confusion in recent years. Much of that confusion was
around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled
making the semantics clearer).
For people who don't know tracehook.h is a vestiage of an attempt to
implement uprobes like functionality that was never fully merged, and
was later superseeded by uprobes when uprobes was merged. For many
years now we have been removing what tracehook functionaly a little
bit at a time. To the point where now anything left in tracehook.h is
some weird strange thing that is difficult to understand.
Eric W. Biederman (15):
ptrace: Move ptrace_report_syscall into ptrace.h
ptrace/arm: Rename tracehook_report_syscall report_syscall
ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
ptrace: Remove arch_syscall_{enter,exit}_tracehook
ptrace: Remove tracehook_signal_handler
task_work: Remove unnecessary include from posix_timers.h
task_work: Introduce task_work_pending
task_work: Call tracehook_notify_signal from get_signal on all architectures
task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
resume_user_mode: Move to resume_user_mode.h
tracehook: Remove tracehook.h
ptrace: Move setting/clearing ptrace_message into ptrace_stop
ptrace: Return the signal to continue with from ptrace_stop
Jann Horn (1):
ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
Yang Li (1):
ptrace: Remove duplicated include in ptrace.c
MAINTAINERS | 1 -
arch/Kconfig | 5 +-
arch/alpha/kernel/ptrace.c | 5 +-
arch/alpha/kernel/signal.c | 4 +-
arch/arc/kernel/ptrace.c | 5 +-
arch/arc/kernel/signal.c | 4 +-
arch/arm/kernel/ptrace.c | 12 +-
arch/arm/kernel/signal.c | 4 +-
arch/arm64/kernel/ptrace.c | 14 +--
arch/arm64/kernel/signal.c | 4 +-
arch/csky/kernel/ptrace.c | 5 +-
arch/csky/kernel/signal.c | 4 +-
arch/h8300/kernel/ptrace.c | 5 +-
arch/h8300/kernel/signal.c | 4 +-
arch/hexagon/kernel/process.c | 4 +-
arch/hexagon/kernel/signal.c | 1 -
arch/hexagon/kernel/traps.c | 6 +-
arch/ia64/kernel/process.c | 4 +-
arch/ia64/kernel/ptrace.c | 6 +-
arch/ia64/kernel/signal.c | 1 -
arch/m68k/kernel/ptrace.c | 5 +-
arch/m68k/kernel/signal.c | 4 +-
arch/microblaze/kernel/ptrace.c | 5 +-
arch/microblaze/kernel/signal.c | 4 +-
arch/mips/kernel/ptrace.c | 5 +-
arch/mips/kernel/signal.c | 4 +-
arch/nds32/include/asm/syscall.h | 2 +-
arch/nds32/kernel/ptrace.c | 5 +-
arch/nds32/kernel/signal.c | 4 +-
arch/nios2/kernel/ptrace.c | 5 +-
arch/nios2/kernel/signal.c | 4 +-
arch/openrisc/kernel/ptrace.c | 5 +-
arch/openrisc/kernel/signal.c | 4 +-
arch/parisc/kernel/ptrace.c | 7 +-
arch/parisc/kernel/signal.c | 4 +-
arch/powerpc/kernel/ptrace/ptrace.c | 8 +-
arch/powerpc/kernel/signal.c | 4 +-
arch/riscv/kernel/ptrace.c | 5 +-
arch/riscv/kernel/signal.c | 4 +-
arch/s390/include/asm/entry-common.h | 1 -
arch/s390/kernel/ptrace.c | 1 -
arch/s390/kernel/signal.c | 5 +-
arch/sh/kernel/ptrace_32.c | 5 +-
arch/sh/kernel/signal_32.c | 4 +-
arch/sparc/kernel/ptrace_32.c | 5 +-
arch/sparc/kernel/ptrace_64.c | 5 +-
arch/sparc/kernel/signal32.c | 1 -
arch/sparc/kernel/signal_32.c | 4 +-
arch/sparc/kernel/signal_64.c | 4 +-
arch/um/kernel/process.c | 4 +-
arch/um/kernel/ptrace.c | 5 +-
arch/x86/kernel/ptrace.c | 1 -
arch/x86/kernel/signal.c | 5 +-
arch/x86/mm/tlb.c | 1 +
arch/xtensa/kernel/ptrace.c | 5 +-
arch/xtensa/kernel/signal.c | 4 +-
block/blk-cgroup.c | 2 +-
fs/coredump.c | 1 -
fs/exec.c | 1 -
fs/io-wq.c | 6 +-
fs/io_uring.c | 11 +-
fs/proc/array.c | 1 -
fs/proc/base.c | 1 -
include/asm-generic/syscall.h | 2 +-
include/linux/entry-common.h | 47 +-------
include/linux/entry-kvm.h | 2 +-
include/linux/posix-timers.h | 1 -
include/linux/ptrace.h | 81 ++++++++++++-
include/linux/resume_user_mode.h | 64 ++++++++++
include/linux/sched/signal.h | 17 +++
include/linux/task_work.h | 5 +
include/linux/tracehook.h | 226 -----------------------------------
include/uapi/linux/ptrace.h | 2 +-
kernel/entry/common.c | 19 +--
kernel/entry/kvm.c | 9 +-
kernel/exit.c | 3 +-
kernel/livepatch/transition.c | 1 -
kernel/ptrace.c | 47 +++++---
kernel/seccomp.c | 1 -
kernel/signal.c | 62 +++++-----
kernel/task_work.c | 4 +-
kernel/time/posix-cpu-timers.c | 1 +
mm/memcontrol.c | 2 +-
security/apparmor/domain.c | 1 -
security/selinux/hooks.c | 1 -
85 files changed, 372 insertions(+), 495 deletions(-)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmJCQkoACgkQC/v6Eiaj
j0DCWQ/5AZVFU+hX32obUNCLackHTwgcCtSOs3JNBmNA/zL/htPiYYG0ghkvtlDR
Dw5J5DnxC6P7PVAdAqrpvx2uX2FebHYU0bRlyLx8LYUEP5dhyNicxX9jA882Z+vw
Ud0Ue9EojwGWS76dC9YoKUj3slThMATbhA2r4GVEoof8fSNJaBxQIqath44t0FwU
DinWa+tIOvZANGBZr6CUUINNIgqBIZCH/R4h6ArBhMlJpuQ5Ufk2kAaiWFwZCkX4
0LuuAwbKsCKkF8eap5I2KrIg/7zZVgxAg9O3cHOzzm8OPbKzRnNnQClcDe8perqp
S6e/f3MgpE+eavd1EiLxevZ660cJChnmikXVVh8ZYYoefaMKGqBaBSsB38bNcLjY
3+f2dB+TNBFRnZs1aCujK3tWBT9QyjZDKtCBfzxDNWBpXGLhHH6j6lA5Lj+Cef5K
/HNHFb+FuqedlFZh5m1Y+piFQ70hTgCa2u8b+FSOubI2hW9Zd+WzINV0ANaZ2LvZ
4YGtcyDNk1q1+c87lxP9xMRl/xi6rNg+B9T2MCo4IUnHgpSVP6VEB3osgUmrrrN0
eQlUI154G/AaDlqXLgmn1xhRmlPGfmenkxpok1AuzxvNJsfLKnpEwQSc13g3oiZr
disZQxNY0kBO2Nv3G323Z6PLinhbiIIFez6cJzK5v0YJ2WtO3pY=
=uEro
-----END PGP SIGNATURE-----
Merge tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ptrace cleanups from Eric Biederman:
"This set of changes removes tracehook.h, moves modification of all of
the ptrace fields inside of siglock to remove races, adds a missing
permission check to ptrace.c
The removal of tracehook.h is quite significant as it has been a major
source of confusion in recent years. Much of that confusion was around
task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the
semantics clearer).
For people who don't know tracehook.h is a vestiage of an attempt to
implement uprobes like functionality that was never fully merged, and
was later superseeded by uprobes when uprobes was merged. For many
years now we have been removing what tracehook functionaly a little
bit at a time. To the point where anything left in tracehook.h was
some weird strange thing that was difficult to understand"
* tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ptrace: Remove duplicated include in ptrace.c
ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
ptrace: Return the signal to continue with from ptrace_stop
ptrace: Move setting/clearing ptrace_message into ptrace_stop
tracehook: Remove tracehook.h
resume_user_mode: Move to resume_user_mode.h
resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
task_work: Call tracehook_notify_signal from get_signal on all architectures
task_work: Introduce task_work_pending
task_work: Remove unnecessary include from posix_timers.h
ptrace: Remove tracehook_signal_handler
ptrace: Remove arch_syscall_{enter,exit}_tracehook
ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
ptrace/arm: Rename tracehook_report_syscall report_syscall
ptrace: Move ptrace_report_syscall into ptrace.h
|
|
|
|
194dfe88d6 |
asm-generic updates for 5.18
There are three sets of updates for 5.18 in the asm-generic tree:
- The set_fs()/get_fs() infrastructure gets removed for good. This
was already gone from all major architectures, but now we can
finally remove it everywhere, which loses some particularly
tricky and error-prone code.
There is a small merge conflict against a parisc cleanup, the
solution is to use their new version.
- The nds32 architecture ends its tenure in the Linux kernel. The
hardware is still used and the code is in reasonable shape, but
the mainline port is not actively maintained any more, as all
remaining users are thought to run vendor kernels that would never
be updated to a future release.
There are some obvious conflicts against changes to the removed
files.
- A series from Masahiro Yamada cleans up some of the uapi header
files to pass the compile-time checks.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmI69BsACgkQmmx57+YA
GNn/zA//f4d5VTT0ThhRxRWTu9BdThGHoB8TUcY7iOhbsWu0X/913NItRC3UeWNl
IdmisaXgVtirg1dcC2pWUmrcHdoWOCEGfK4+Zr2NhSWfuZDWvODHK9pGWk4WLnhe
cQgUNBvIuuAMryGtrOBwHPO4TpfCyy2ioeVP36ZfcsWXdDxTrqfaq/56mk3sxIP6
sUTk1UEjut9NG4C9xIIvcSU50R3l6LryQE/H9kyTLtaSvfvTOvprcVYCq0GPmSzo
DtQ1Wwa9zbJ+4EqoMiP5RrgQwWvOTg2iRByLU8ytwlX3e/SEF0uihvMv1FQbL8zG
G8RhGUOKQSEhaBfc3lIkm8GpOVPh0uHzB6zhn7daVmAWtazRD2Nu59BMjipa+ims
a8Z58iHH7jRAnKeEkVZqXKb1CEiUxaQx/IeVPzN4QlwMhDtwrI76LY7ZJ1zCqTGY
ENG0yRLav1XselYBslOYXGtOEWcY5EZPWqLyWbp4P9vz2g0Fe0gZxoIOvPmNQc89
QnfXpCt7vm/DGkyO255myu08GOLeMkisVqUIzLDB9avlym5mri7T7vk9abBa2YyO
CRpTL5gl1/qKPWuH1UI5mvhT+sbbBE2SUHSuy84btns39ZKKKynwCtdu+hSQkKLE
h9pV30Gf1cLTD4JAE0RWlUgOmbBLVp34loTOexQj4MrLM1noOnw=
=vtCN
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"There are three sets of updates for 5.18 in the asm-generic tree:
- The set_fs()/get_fs() infrastructure gets removed for good.
This was already gone from all major architectures, but now we can
finally remove it everywhere, which loses some particularly tricky
and error-prone code. There is a small merge conflict against a
parisc cleanup, the solution is to use their new version.
- The nds32 architecture ends its tenure in the Linux kernel.
The hardware is still used and the code is in reasonable shape, but
the mainline port is not actively maintained any more, as all
remaining users are thought to run vendor kernels that would never
be updated to a future release.
- A series from Masahiro Yamada cleans up some of the uapi header
files to pass the compile-time checks"
* tag 'asm-generic-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (27 commits)
nds32: Remove the architecture
uaccess: remove CONFIG_SET_FS
ia64: remove CONFIG_SET_FS support
sh: remove CONFIG_SET_FS support
sparc64: remove CONFIG_SET_FS support
lib/test_lockup: fix kernel pointer check for separate address spaces
uaccess: generalize access_ok()
uaccess: fix type mismatch warnings from access_ok()
arm64: simplify access_ok()
m68k: fix access_ok for coldfire
MIPS: use simpler access_ok()
MIPS: Handle address errors for accesses above CPU max virtual user address
uaccess: add generic __{get,put}_kernel_nofault
nios2: drop access_ok() check from __put_user()
x86: use more conventional access_ok() definition
x86: remove __range_not_ok()
sparc64: add __{get,put}_kernel_nofault()
nds32: fix access_ok() checks in get/put_user
uaccess: fix nios2 and microblaze get_user_8()
sparc64: fix building assembly files
...
|
|
|
|
355f841a3f |
tracehook: Remove tracehook.h
Now that all of the definitions have moved out of tracehook.h into ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in tracehook.h so remove it. Update the few files that were depending upon tracehook.h to bring in definitions to use the headers they need directly. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-13-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
b452722e6f |
exec: cleanup comments
Remove the second 'from'. Replace 'backwords' with 'backwards'. Replace 'visibile' with 'visible'. Signed-off-by: Tom Rix <trix@redhat.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220211160940.2516243-1-trix@redhat.com |
|
|
|
dcd46d897a |
exec: Force single empty string when argv is empty
Quoting[1] Ariadne Conill:
"In several other operating systems, it is a hard requirement that the
second argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[2]:
The argument arg0 should point to a filename string that is
associated with the process being started by one of the exec
functions.
...
Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
of this bug in a shellcode, we can reconsider.
This issue is being tracked in the KSPP issue tracker[5]."
While the initial code searches[6][7] turned up what appeared to be
mostly corner case tests, trying to that just reject argv == NULL
(or an immediately terminated pointer list) quickly started tripping[8]
existing userspace programs.
The next best approach is forcing a single empty string into argv and
adjusting argc to match. The number of programs depending on argc == 0
seems a smaller set than those calling execve with a NULL argv.
Account for the additional stack space in bprm_stack_limits(). Inject an
empty string when argc == 0 (and set argc = 1). Warn about the case so
userspace has some notice about the change:
process './argc0' launched './argc0' with NULL argv: empty string added
Additionally WARN() and reject NULL argv usage for kernel threads.
[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
[4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[5] https://github.com/KSPP/linux/issues/176
[6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
[7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
[8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/
Reported-by: Ariadne Conill <ariadne@dereferenced.org>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Ariadne Conill <ariadne@dereferenced.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org
|
|
|
|
967747bbc0 |
uaccess: remove CONFIG_SET_FS
There are no remaining callers of set_fs(), so CONFIG_SET_FS can be removed globally, along with the thread_info field and any references to it. This turns access_ok() into a cheaper check against TASK_SIZE_MAX. As CONFIG_SET_FS is now gone, drop all remaining references to set_fs()/get_fs(), mm_segment_t, user_addr_max() and uaccess_kernel(). Acked-by: Sam Ravnborg <sam@ravnborg.org> # for sparc32 changes Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Tested-by: Sergey Matyukevich <sergey.matyukevich@synopsys.com> # for arc changes Acked-by: Stafford Horne <shorne@gmail.com> # [openrisc, asm-generic] Acked-by: Dinh Nguyen <dinguyen@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> |
|
|
|
f0bc21b268 |
fs/coredump: move coredump sysctls into its own file
This moves the fs/coredump.c respective sysctls to its own file. Link: https://lkml.kernel.org/r/20211129211943.640266-6-mcgrof@kernel.org Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Antti Palosaari <crope@iki.fi> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Lukas Middendorf <kernel@tuxforce.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Stephen Kitt <steve@sk2.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
66ad398634 |
fs: move fs/exec.c sysctls into its own file
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty dishes, this makes it very difficult to maintain. To help with this maintenance let's start by moving sysctls to places where they actually belong. The proc sysctl maintainers do not want to know what sysctl knobs you wish to add for your own piece of code, we just care about the core logic. So move the fs/exec.c respective sysctls to its own file. Since checkpatch complains about style issues with the old code, this move also fixes a few of those minor style issues: * Use pr_warn() instead of prink(WARNING * New empty lines are wanted at the beginning of routines Link: https://lkml.kernel.org/r/20211129205548.605569-9-mcgrof@kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Antti Palosaari <crope@iki.fi> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Lukas Middendorf <kernel@tuxforce.de> Cc: Stephen Kitt <steve@sk2.org> Cc: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
f4484d138b |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton: "55 patches. Subsystems affected by this patch series: percpu, procfs, sysctl, misc, core-kernel, get_maintainer, lib, checkpatch, binfmt, nilfs2, hfs, fat, adfs, panic, delayacct, kconfig, kcov, and ubsan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (55 commits) lib: remove redundant assignment to variable ret ubsan: remove CONFIG_UBSAN_OBJECT_SIZE kcov: fix generic Kconfig dependencies if ARCH_WANTS_NO_INSTR lib/Kconfig.debug: make TEST_KMOD depend on PAGE_SIZE_LESS_THAN_256KB btrfs: use generic Kconfig option for 256kB page size limit arch/Kconfig: split PAGE_SIZE_LESS_THAN_256KB from PAGE_SIZE_LESS_THAN_64KB configs: introduce debug.config for CI-like setup delayacct: track delays from memory compact Documentation/accounting/delay-accounting.rst: add thrashing page cache and direct compact delayacct: cleanup flags in struct task_delay_info and functions use it delayacct: fix incomplete disable operation when switch enable to disable delayacct: support swapin delay accounting for swapping without blkio panic: remove oops_id panic: use error_report_end tracepoint on warnings fs/adfs: remove unneeded variable make code cleaner FAT: use io_schedule_timeout() instead of congestion_wait() hfsplus: use struct_group_attr() for memcpy() region nilfs2: remove redundant pointer sbufs fs/binfmt_elf: use PT_LOAD p_align values for static PIE const_structs.checkpatch: add frequently used ops structs ... |
|
|
|
503471ac36 |
fs/exec: replace strncpy with strscpy_pad in __get_task_comm
If the dest buffer size is smaller than sizeof(tsk->comm), the buffer will be without null ternimator, that may cause problem. Using strscpy_pad() instead of strncpy() in __get_task_comm() can make the string always nul ternimated and zero padded. Link: https://lkml.kernel.org/r/20211120112738.45980-3-laoar.shao@gmail.com Suggested-by: Kees Cook <keescook@chromium.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: Michal Miroslaw <mirq-linux@rere.qmqm.pl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
06c5088aee |
fs/exec: replace strlcpy with strscpy_pad in __set_task_comm
Patch series "task comm cleanups", v2. This patchset is part of the patchset "extend task comm from 16 to 24"[1]. Now we have different opinion that dynamically allocates memory to store kthread's long name into a separate pointer, so I decide to take the useful cleanups apart from the original patchset and send it separately[2]. These useful cleanups can make the usage around task comm less error-prone. Furthermore, it will be useful if we want to extend task comm in the future. [1]. https://lore.kernel.org/lkml/20211101060419.4682-1-laoar.shao@gmail.com/ [2]. https://lore.kernel.org/lkml/CALOAHbAx55AUo3bm8ZepZSZnw7A08cvKPdPyNTf=E_tPqmw5hw@mail.gmail.com/ This patch (of 7): strlcpy() can trigger out-of-bound reads on the source string[1], we'd better use strscpy() instead. To make it be robust against full tsk->comm copies that got noticed in other places, we should make sure it's zero padded. [1] https://github.com/KSPP/linux/issues/89 Link: https://lkml.kernel.org/r/20211120112738.45980-1-laoar.shao@gmail.com Link: https://lkml.kernel.org/r/20211120112738.45980-2-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: Michal Miroslaw <mirq-linux@rere.qmqm.pl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
49697335e0 |
signal: Remove the helper signal_group_exit
This helper is misleading. It tests for an ongoing exec as well as the process having received a fatal signal. Sometimes it is appropriate to treat an on-going exec differently than a process that is shutting down due to a fatal signal. In particular taking the fast path out of exit_signals instead of retargeting signals is not appropriate during exec, and not changing the the exit code in do_group_exit during exec. Removing the helper makes it more obvious what is going on as both cases must be coded for explicitly. While removing the helper fix the two cases where I have observed using signal_group_exit resulted in the wrong result. In exit_signals only test for SIGNAL_GROUP_EXIT so that signals are retargetted during an exec. In do_group_exit use 0 as the exit code during an exec as de_thread does not set group_exit_code. As best as I can determine group_exit_code has been is set to 0 most of the time during de_thread. During a thread group stop group_exit_code is set to the stop signal and when the thread group receives SIGCONT group_exit_code is reset to 0. Link: https://lkml.kernel.org/r/20211213225350.27481-8-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
60700e38fb |
signal: Rename group_exit_task group_exec_task
The only remaining user of group_exit_task is exec. Rename the field so that it is clear which part of the code uses it. Update the comment above the definition of group_exec_task to document how it is currently used. Link: https://lkml.kernel.org/r/20211213225350.27481-7-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
40966e316f |
kthread: Ensure struct kthread is present for all kthreads
Today the rules are a bit iffy and arbitrary about which kernel threads have struct kthread present. Both idle threads and thread started with create_kthread want struct kthread present so that is effectively all kernel threads. Make the rule that if PF_KTHREAD and the task is running then struct kthread is present. This will allow the kernel thread code to using tsk->exit_code with different semantics from ordinary processes. To make ensure that struct kthread is present for all kernel threads move it's allocation into copy_process. Add a deallocation of struct kthread in exec for processes that were kernel threads. Move the allocation of struct kthread for the initial thread earlier so that it is not repeated for each additional idle thread. Move the initialization of struct kthread into set_kthread_struct so that the structure is always and reliably initailized. Clear set_child_tid in free_kthread_struct to ensure the kthread struct is reliably freed during exec. The function free_kthread_struct does not need to clear vfork_done during exec as exec_mm_release called from exec_mmap has already cleared vfork_done. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
5147da902e |
Merge branch 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull exit cleanups from Eric Biederman: "While looking at some issues related to the exit path in the kernel I found several instances where the code is not using the existing abstractions properly. This set of changes introduces force_fatal_sig a way of sending a signal and not allowing it to be caught, and corrects the misuse of the existing abstractions that I found. A lot of the misuse of the existing abstractions are silly things such as doing something after calling a no return function, rolling BUG by hand, doing more work than necessary to terminate a kernel thread, or calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL). In the review a deficiency in force_fatal_sig and force_sig_seccomp where ptrace or sigaction could prevent the delivery of the signal was found. I have added a change that adds SA_IMMUTABLE to change that makes it impossible to interrupt the delivery of those signals, and allows backporting to fix force_sig_seccomp And Arnd found an issue where a function passed to kthread_run had the wrong prototype, and after my cleanup was failing to build." * 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits) soc: ti: fix wkup_m3_rproc_boot_thread return type signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV) exit/r8188eu: Replace the macro thread_exit with a simple return 0 exit/rtl8712: Replace the macro thread_exit with a simple return 0 exit/rtl8723bs: Replace the macro thread_exit with a simple return 0 signal/x86: In emulate_vsyscall force a signal instead of calling do_exit signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails exit/syscall_user_dispatch: Send ordinary signals on failure signal: Implement force_fatal_sig exit/kthread: Have kernel threads return instead of calling do_exit signal/s390: Use force_sigsegv in default_trap_handler signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved. signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON signal/sparc: In setup_tsb_params convert open coded BUG into BUG signal/powerpc: On swapcontext failure force SIGSEGV signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL) signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT signal/sparc32: Remove unreachable do_exit in do_sparc_fault ... |
|
|
|
e21294a7aa |
signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
Now that force_fatal_sig exists it is unnecessary and a bit confusing to use force_sigsegv in cases where the simpler force_fatal_sig is wanted. So change every instance we can to make the code clearer. Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Link: https://lkml.kernel.org/r/877de7jrev.fsf@disp2133 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
7e3c4fb7fc |
exec: Check for a pending fatal signal instead of core_state
Prevent exec continuing when a fatal signal is pending by replacing mmap_read_lock with mmap_read_lock_killable. This is always the right thing to do as userspace will never observe an exec complete when there is a fatal signal pending. With that change it becomes unnecessary to explicitly test for a core dump in progress. In coredump_wait zap_threads arranges under mmap_write_lock for all tasks that use a mm to also have SIGKILL pending, which means mmap_read_lock_killable will always return -EINTR when old_mm->core_state is present. Link: https://lkml.kernel.org/r/87fstux27w.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
49624efa65 |
Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux
Pull MAP_DENYWRITE removal from David Hildenbrand:
"Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove
VM_DENYWRITE.
There are some (minor) user-visible changes:
- We no longer deny write access to shared libaries loaded via legacy
uselib(); this behavior matches modern user space e.g. dlopen().
- We no longer deny write access to the elf interpreter after exec
completed, treating it just like shared libraries (which it often
is).
- We always deny write access to the file linked via /proc/pid/exe:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the
file cannot be denied, and write access to the file will remain
denied until the link is effectivel gone (exec, termination,
sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.
Cross-compiled for a bunch of architectures (alpha, microblaze, i386,
s390x, ...) and verified via ltp that especially the relevant tests
(i.e., creat07 and execve04) continue working as expected"
* tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux:
fs: update documentation of get_write_access() and friends
mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
mm: remove VM_DENYWRITE
binfmt: remove in-tree usage of MAP_DENYWRITE
kernel/fork: always deny write access to current MM exe_file
kernel/fork: factor out replacing the current MM exe_file
binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
|
|
|
|
14726903c8 |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits) mm/madvise: add MADV_WILLNEED to process_madvise() mm/vmstat: remove unneeded return value mm/vmstat: simplify the array size calculation mm/vmstat: correct some wrong comments mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() selftests: vm: add COW time test for KSM pages selftests: vm: add KSM merging time test mm: KSM: fix data type selftests: vm: add KSM merging across nodes test selftests: vm: add KSM zero page merging test selftests: vm: add KSM unmerge test selftests: vm: add KSM merge test mm/migrate: correct kernel-doc notation mm: wire up syscall process_mrelease mm: introduce process_mrelease system call memblock: make memblock_find_in_range method private mm/mempolicy.c: use in_task() in mempolicy_slab_node() mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies mm/mempolicy: advertise new MPOL_PREFERRED_MANY mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY ... |
|
|
|
5b78ed24e8 |
mm/pagemap: add mmap_assert_locked() annotations to find_vma*()
find_vma() and variants need protection when used. This patch adds mmap_assert_lock() calls in the functions. To make sure the invariant is satisfied, we also need to add a mmap_read_lock() around the get_user_pages_remote() call in get_arg_page(). The lock is not strictly necessary because the mm has been newly created, but the extra cost is limited because the same mutex was also acquired shortly before in __bprm_mm_init(), so it is hot and uncontended. [penguin-kernel@i-love.sakura.ne.jp: TOMOYO needs the same protection which get_arg_page() needs] Link: https://lkml.kernel.org/r/58bb6bf7-a57e-8a40-e74b-39584b415152@i-love.sakura.ne.jp Link: https://lkml.kernel.org/r/20210731175341.3458608-1-lrizzo@google.com Signed-off-by: Luigi Rizzo <lrizzo@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
f358afc52c |
mm: remove flush_kernel_dcache_page
flush_kernel_dcache_page is a rather confusing interface that implements a
subset of flush_dcache_page by not being able to properly handle page
cache mapped pages.
The only callers left are in the exec code as all other previous callers
were incorrect as they could have dealt with page cache pages. Replace
the calls to flush_kernel_dcache_page with calls to flush_dcache_page,
which for all architectures does either exactly the same thing, can
contains one or more of the following:
1) an optimization to defer the cache flush for page cache pages not
mapped into userspace
2) additional flushing for mapped page cache pages if cache aliases
are possible
Link: https://lkml.kernel.org/r/20210712060928.4161649-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
fe69d560b5 |
kernel/fork: always deny write access to current MM exe_file
We want to remove VM_DENYWRITE only currently only used when mapping the executable during exec. During exec, we already deny_write_access() the executable, however, after exec completes the VMAs mapped with VM_DENYWRITE effectively keeps write access denied via deny_write_access(). Let's deny write access when setting or replacing the MM exe_file. With this change, we can remove VM_DENYWRITE for mapping executables. Make set_mm_exe_file() return an error in case deny_write_access() fails; note that this should never happen, because exec code does a deny_write_access() early and keeps write access denied when calling set_mm_exe_file. However, it makes the code easier to read and makes set_mm_exe_file() and replace_mm_exe_file() look more similar. This represents a minor user space visible change: sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file cannot be opened writable. Note that we can already fail with -EACCES if the file doesn't have execute permissions. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com> |
|
|
|
8228e2c313 |
namei: add getname_uflags()
There are a couple of places where we already open-code the (flags & AT_EMPTY_PATH) check and io_uring will likely add another one in the future. Let's just add a simple helper getname_uflags() that handles this directly and use it. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210415100815.edrn4a7cy26wkowe@wittgenstein/ Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-7-dkadashev@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
|
|
71bd934101 |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton: "190 patches. Subsystems affected by this patch series: mm (hugetlb, userfaultfd, vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock, migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap, zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc, core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs, signals, exec, kcov, selftests, compress/decompress, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) ipc/util.c: use binary search for max_idx ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock ipc: use kmalloc for msg_queue and shmid_kernel ipc sem: use kvmalloc for sem_undo allocation lib/decompressors: remove set but not used variabled 'level' selftests/vm/pkeys: exercise x86 XSAVE init state selftests/vm/pkeys: refill shadow register after implicit kernel write selftests/vm/pkeys: handle negative sys_pkey_alloc() return code selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random kcov: add __no_sanitize_coverage to fix noinstr for all architectures exec: remove checks in __register_bimfmt() x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned hfsplus: report create_date to kstat.btime hfsplus: remove unnecessary oom message nilfs2: remove redundant continue statement in a while-loop kprobes: remove duplicated strong free_insn_page in x86 and s390 init: print out unknown kernel parameters checkpatch: do not complain about positive return values starting with EPOLL checkpatch: improve the indented label test checkpatch: scripts/spdxcheck.py now requires python3 ... |
|
|
|
bae7702a17 |
exec: remove checks in __register_bimfmt()
Delete NULL check, all callers pass valid pointer. Delete ->load_binary check -- failure to provide hook in a custom module will be very noticeable at the very first execve call. Link: https://lkml.kernel.org/r/YK1Gy1qXaLAR+tPl@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
21d1c5e386 |
Reimplement RLIMIT_NPROC on top of ucounts
The rlimit counter is tied to uid in the user_namespace. This allows rlimit values to be specified in userns even if they are already globally exceeded by the user. However, the value of the previous user_namespaces cannot be exceeded. To illustrate the impact of rlimits, let's say there is a program that does not fork. Some service-A wants to run this program as user X in multiple containers. Since the program never fork the service wants to set RLIMIT_NPROC=1. service-A \- program (uid=1000, container1, rlimit_nproc=1) \- program (uid=1000, container2, rlimit_nproc=1) The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails since user X already has one running process. We cannot use existing inc_ucounts / dec_ucounts because they do not allow us to exceed the maximum for the counter. Some rlimits can be overlimited by root or if the user has the appropriate capability. Changelog v11: * Change inc_rlimit_ucounts() which now returns top value of ucounts. * Drop inc_rlimit_ucounts_and_test() because the return code of inc_rlimit_ucounts() can be checked. Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/c5286a8aa16d2d698c222f7532f3d735c82bc6bc.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> |
|
|
|
905ae01c4a |
Add a reference to ucounts for each cred
For RLIMIT_NPROC and some other rlimits the user_struct that holds the global limit is kept alive for the lifetime of a process by keeping it in struct cred. Adding a pointer to ucounts in the struct cred will allow to track RLIMIT_NPROC not only for user in the system, but for user in the user_namespace. Updating ucounts may require memory allocation which may fail. So, we cannot change cred.ucounts in the commit_creds() because this function cannot fail and it should always return 0. For this reason, we modify cred.ucounts before calling the commit_creds(). Changelog v6: * Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This error was caused by the fact that cred_alloc_blank() left the ucounts pointer empty. Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/b37aaef28d8b9b0d757e07ba6dd27281bbe39259.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> |
|
|
|
3d742d4b6e |
fs: delete repeated words in comments
Delete duplicate words in fs/*.c. The doubled words that are being dropped are: that, be, the, in, and, for Link: https://lkml.kernel.org/r/20201224052810.25315-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
7d6beb71da |
idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
=yPaw
-----END PGP SIGNATURE-----
Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
"This introduces idmapped mounts which has been in the making for some
time. Simply put, different mounts can expose the same file or
directory with different ownership. This initial implementation comes
with ports for fat, ext4 and with Christoph's port for xfs with more
filesystems being actively worked on by independent people and
maintainers.
Idmapping mounts handle a wide range of long standing use-cases. Here
are just a few:
- Idmapped mounts make it possible to easily share files between
multiple users or multiple machines especially in complex
scenarios. For example, idmapped mounts will be used in the
implementation of portable home directories in
systemd-homed.service(8) where they allow users to move their home
directory to an external storage device and use it on multiple
computers where they are assigned different uids and gids. This
effectively makes it possible to assign random uids and gids at
login time.
- It is possible to share files from the host with unprivileged
containers without having to change ownership permanently through
chown(2).
- It is possible to idmap a container's rootfs and without having to
mangle every file. For example, Chromebooks use it to share the
user's Download folder with their unprivileged containers in their
Linux subsystem.
- It is possible to share files between containers with
non-overlapping idmappings.
- Filesystem that lack a proper concept of ownership such as fat can
use idmapped mounts to implement discretionary access (DAC)
permission checking.
- They allow users to efficiently changing ownership on a per-mount
basis without having to (recursively) chown(2) all files. In
contrast to chown (2) changing ownership of large sets of files is
instantenous with idmapped mounts. This is especially useful when
ownership of a whole root filesystem of a virtual machine or
container is changed. With idmapped mounts a single syscall
mount_setattr syscall will be sufficient to change the ownership of
all files.
- Idmapped mounts always take the current ownership into account as
idmappings specify what a given uid or gid is supposed to be mapped
to. This contrasts with the chown(2) syscall which cannot by itself
take the current ownership of the files it changes into account. It
simply changes the ownership to the specified uid and gid. This is
especially problematic when recursively chown(2)ing a large set of
files which is commong with the aforementioned portable home
directory and container and vm scenario.
- Idmapped mounts allow to change ownership locally, restricting it
to specific mounts, and temporarily as the ownership changes only
apply as long as the mount exists.
Several userspace projects have either already put up patches and
pull-requests for this feature or will do so should you decide to pull
this:
- systemd: In a wide variety of scenarios but especially right away
in their implementation of portable home directories.
https://systemd.io/HOME_DIRECTORY/
- container runtimes: containerd, runC, LXD:To share data between
host and unprivileged containers, unprivileged and privileged
containers, etc. The pull request for idmapped mounts support in
containerd, the default Kubernetes runtime is already up for quite
a while now: https://github.com/containerd/containerd/pull/4734
- The virtio-fs developers and several users have expressed interest
in using this feature with virtual machines once virtio-fs is
ported.
- ChromeOS: Sharing host-directories with unprivileged containers.
I've tightly synced with all those projects and all of those listed
here have also expressed their need/desire for this feature on the
mailing list. For more info on how people use this there's a bunch of
talks about this too. Here's just two recent ones:
https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
https://fosdem.org/2021/schedule/event/containers_idmap/
This comes with an extensive xfstests suite covering both ext4 and
xfs:
https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts
It covers truncation, creation, opening, xattrs, vfscaps, setid
execution, setgid inheritance and more both with idmapped and
non-idmapped mounts. It already helped to discover an unrelated xfs
setgid inheritance bug which has since been fixed in mainline. It will
be sent for inclusion with the xfstests project should you decide to
merge this.
In order to support per-mount idmappings vfsmounts are marked with
user namespaces. The idmapping of the user namespace will be used to
map the ids of vfs objects when they are accessed through that mount.
By default all vfsmounts are marked with the initial user namespace.
The initial user namespace is used to indicate that a mount is not
idmapped. All operations behave as before and this is verified in the
testsuite.
Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users
to setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
The user namespace the mount will be marked with can be specified by
passing a file descriptor refering to the user namespace as an
argument to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
of extensibility.
The following conditions must be met in order to create an idmapped
mount:
- The caller must currently have the CAP_SYS_ADMIN capability in the
user namespace the underlying filesystem has been mounted in.
- The underlying filesystem must support idmapped mounts.
- The mount must not already be idmapped. This also implies that the
idmapping of a mount cannot be altered once it has been idmapped.
- The mount must be a detached/anonymous mount, i.e. it must have
been created by calling open_tree() with the OPEN_TREE_CLONE flag
and it must not already have been visible in the filesystem.
The last two points guarantee easier semantics for userspace and the
kernel and make the implementation significantly simpler.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes are observed.
The manpage with a detailed description can be found here:
|
|
|
|
a72afd8730 |
tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu()
The 'start' and 'end' arguments to tlb_gather_mmu() are no longer needed now that there is a separate function for 'fullmm' flushing. Remove the unused arguments and update all callers. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com |
|
|
|
ae8eba8b5d |
tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu()
Since commit
|
|
|
|
1ab29965b3
|
exec: handle idmapped mounts
When executing a setuid binary the kernel will verify in bprm_fill_uid() that the inode has a mapping in the caller's user namespace before setting the callers uid and gid. Let bprm_fill_uid() handle idmapped mounts. If the inode is accessed through an idmapped mount it is mapped according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-24-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
435ac6214e
|
would_dump: handle idmapped mounts
When determining whether or not to create a coredump the vfs will verify that the caller is privileged over the inode. Make the would_dump() helper handle idmapped mounts by passing down the mount's user namespace of the exec file. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-23-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
47291baa8d
|
namei: make permission helpers idmapped mount aware
The two helpers inode_permission() and generic_permission() are used by the vfs to perform basic permission checking by verifying that the caller is privileged over an inode. In order to handle idmapped mounts we extend the two helpers with an additional user namespace argument. On idmapped mounts the two helpers will make sure to map the inode according to the mount's user namespace and then peform identical permission checks to inode_permission() and generic_permission(). If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
0558c1bf5a
|
capability: handle idmapped mounts
In order to determine whether a caller holds privilege over a given inode the capability framework exposes the two helpers privileged_wrt_inode_uidgid() and capable_wrt_inode_uidgid(). The former verifies that the inode has a mapping in the caller's user namespace and the latter additionally verifies that the caller has the requested capability in their current user namespace. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped inodes. If the initial user namespace is passed all operations are a nop so non-idmapped mounts will not see a change in behavior. Link: https://lore.kernel.org/r/20210121131959.646623-5-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
5ee863bec7 |
Merge branch 'parisc-5.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller: "A change to increase the default maximum stack size on parisc to 100MB and the ability to further increase the stack hard limit size at runtime with ulimit for newly started processes. The other patches fix compile warnings, utilize the Kbuild logic and cleanups the parisc arch code" * 'parisc-5.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: pci-dma: fix warning unused-function parisc/uapi: Use Kbuild logic to provide <asm/types.h> parisc: Make user stack size configurable parisc: Use _TIF_USER_WORK_MASK in entry.S parisc: Drop loops_per_jiffy from per_cpu struct |
|
|
|
d01e7f10da |
Merge branch 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull exec-update-lock update from Eric Biederman: "The key point of this is to transform exec_update_mutex into a rw_semaphore so readers can be separated from writers. This makes it easier to understand what the holders of the lock are doing, and makes it harder to contend or deadlock on the lock. The real deadlock fix wound up in perf_event_open" * 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: exec: Transform exec_update_mutex into a rw_semaphore |
|
|
|
faf145d6f3 |
Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman: "This set of changes ultimately fixes the interaction of posix file lock and exec. Fundamentally most of the change is just moving where unshare_files is called during exec, and tweaking the users of files_struct so that the count of files_struct is not unnecessarily played with. Along the way fcheck and related helpers were renamed to more accurately reflect what they do. There were also many other small changes that fell out, as this is the first time in a long time much of this code has been touched. Benchmarks haven't turned up any practical issues but Al Viro has observed a possibility for a lot of pounding on task_lock. So I have some changes in progress to convert put_files_struct to always rcu free files_struct. That wasn't ready for the merge window so that will have to wait until next time" * 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits) exec: Move io_uring_task_cancel after the point of no return coredump: Document coredump code exclusively used by cell spufs file: Remove get_files_struct file: Rename __close_fd_get_file close_fd_get_file file: Replace ksys_close with close_fd file: Rename __close_fd to close_fd and remove the files parameter file: Merge __alloc_fd into alloc_fd file: In f_dupfd read RLIMIT_NOFILE once. file: Merge __fd_install into fd_install proc/fd: In fdinfo seq_show don't use get_files_struct bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu file: Implement task_lookup_next_fd_rcu kcmp: In get_file_raw_ptr use task_lookup_fd_rcu proc/fd: In tid_fd_mode use task_lookup_fd_rcu file: Implement task_lookup_fd_rcu file: Rename fcheck lookup_fd_rcu file: Replace fcheck_files with files_lookup_fd_rcu file: Factor files_lookup_fd_locked out of fcheck_files file: Rename __fcheck_files to files_lookup_fd_raw ... |
|
|
|
f7cfd871ae |
exec: Transform exec_update_mutex into a rw_semaphore
Recently syzbot reported[0] that there is a deadlock amongst the users
of exec_update_mutex. The problematic lock ordering found by lockdep
was:
perf_event_open (exec_update_mutex -> ovl_i_mutex)
chown (ovl_i_mutex -> sb_writes)
sendfile (sb_writes -> p->lock)
by reading from a proc file and writing to overlayfs
proc_pid_syscall (p->lock -> exec_update_mutex)
While looking at possible solutions it occured to me that all of the
users and possible users involved only wanted to state of the given
process to remain the same. They are all readers. The only writer is
exec.
There is no reason for readers to block on each other. So fix
this deadlock by transforming exec_update_mutex into a rw_semaphore
named exec_update_lock that only exec takes for writing.
Cc: Jann Horn <jannh@google.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christopher Yeoh <cyeoh@au1.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Fixes:
|
|
|
|
9ee1206dcf |
exec: Move io_uring_task_cancel after the point of no return
Now that unshare_files happens in begin_new_exec after the point of no return, io_uring_task_cancel can also happen later. Effectively this means io_uring activities for a task are only canceled when exec succeeds. Link: https://lkml.kernel.org/r/878saih2op.fsf@x220.int.ebiederm.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> |
|
|
|
1f702603e7 |
exec: Simplify unshare_files
Now that exec no longer needs to return the unshared files to their previous value there is no reason to return displaced. Instead when unshare_fd creates a copy of the file table, call put_files_struct before returning from unshare_files. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-2-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-2-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> |
|
|
|
b604350128 |
exec: Move unshare_files to fix posix file locking during exec
Many moons ago the binfmts were doing some very questionable things
with file descriptors and an unsharing of the file descriptor table
was added to make things better[1][2]. The helper steal_lockss was
added to avoid breaking the userspace programs[3][4][6].
Unfortunately it turned out that steal_locks did not work for network
file systems[5], so it was removed to see if anyone would
complain[7][8]. It was thought at the time that NPTL would not be
affected as the unshare_files happened after the other threads were
killed[8]. Unfortunately because there was an unshare_files in
binfmt_elf.c before the threads were killed this analysis was
incorrect.
This unshare_files in binfmt_elf.c resulted in the unshares_files
happening whenever threads were present. Which led to unshare_files
being moved to the start of do_execve[9].
Later the problems were rediscovered and the suggested approach was to
readd steal_locks under a different name[10]. I happened to be
reviewing patches and I noticed that this approach was a step
backwards[11].
I proposed simply moving unshare_files[12] and it was pointed
out that moving unshare_files without auditing the code was
also unsafe[13].
There were then several attempts to solve this[14][15][16] and I even
posted this set of changes[17]. Unfortunately because auditing all of
execve is time consuming this change did not make it in at the time.
Well now that I am cleaning up exec I have made the time to read
through all of the binfmts and the only playing with file descriptors
is either the security modules closing them in
security_bprm_committing_creds or is in the generic code in fs/exec.c.
None of it happens before begin_new_exec is called.
So move unshare_files into begin_new_exec, after the point of no
return. If memory is very very very low and the application calling
exec is sharing file descriptor tables between processes we might fail
past the point of no return. Which is unfortunate but no different
than any of the other places where we allocate memory after the point
of no return.
This movement allows another process that shares the file table, or
another thread of the same process and that closes files or changes
their close on exec behavior and races with execve to cause some
unexpected things to happen. There is only one time of check to time
of use race and it is just there so that execve fails instead of
an interpreter failing when it tries to open the file it is supposed
to be interpreting. Failing later if userspace is being silly is
not a problem.
With this change it the following discription from the removal
of steal_locks[8] finally becomes true.
Apps using NPTL are not affected, since all other threads are killed before
execve.
Apps using LinuxThreads are only affected if they
- have multiple threads during exec (LinuxThreads doesn't kill other
threads, the app may do it with pthread_kill_other_threads_np())
- rely on POSIX locks being inherited across exec
Both conditions are documented, but not their interaction.
Apps using clone() natively are affected if they
- use clone(CLONE_FILES)
- rely on POSIX locks being inherited across exec
I have investigated some paths to make it possible to solve this
without moving unshare_files but they all look more complicated[18].
Reported-by: Daniel P. Berrangé <berrange@redhat.com>
Reported-by: Jeff Layton <jlayton@redhat.com>
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
[1] 02cda956de0b ("[PATCH] unshare_files"
[2] 04e9bcb4d106 ("[PATCH] use new unshare_files helper")
[3] 088f5d7244de ("[PATCH] add steal_locks helper")
[4] 02c541ec8ffa ("[PATCH] use new steal_locks helper")
[5] https://lkml.kernel.org/r/E1FLIlF-0007zR-00@dorka.pomaz.szeredi.hu
[6] https://lkml.kernel.org/r/0060321191605.GB15997@sorel.sous-sol.org
[7] https://lkml.kernel.org/r/E1FLwjC-0000kJ-00@dorka.pomaz.szeredi.hu
[8]
|
|
|
|
878f12dbb8 |
exec: Don't open code get_close_on_exec
Al Viro pointed out that using the phrase "close_on_exec(fd, rcu_dereference_raw(current->files->fdt))" instead of wrapping it in rcu_read_lock(), rcu_read_unlock() is a very questionable optimization[1]. Once wrapped with rcu_read_lock()/rcu_read_unlock() that phrase becomes equivalent the helper function get_close_on_exec so simplify the code and make it more robust by simply using get_close_on_exec. [1] https://lkml.kernel.org/r/20201207222214.GA4115853@ZenIV.linux.org.uk Suggested-by: Al Viro <viro@ftp.linux.org.uk> Link: https://lkml.kernel.org/r/87k0tqr6zi.fsf_-_@x220.int.ebiederm.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> |
|
|
|
1446e1df9e |
kernel: Implement selective syscall userspace redirection
Introduce a mechanism to quickly disable/enable syscall handling for a specific process and redirect to userspace via SIGSYS. This is useful for processes with parts that require syscall redirection and parts that don't, but who need to perform this boundary crossing really fast, without paying the cost of a system call to reconfigure syscall handling on each boundary transition. This is particularly important for Windows games running over Wine. The proposed interface looks like this: prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector]) The range [<offset>,<offset>+<length>) is a part of the process memory map that is allowed to by-pass the redirection code and dispatch syscalls directly, such that in fast paths a process doesn't need to disable the trap nor the kernel has to check the selector. This is essential to return from SIGSYS to a blocked area without triggering another SIGSYS from rt_sigreturn. selector is an optional pointer to a char-sized userspace memory region that has a key switch for the mechanism. This key switch is set to either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the redirection without calling the kernel. The feature is meant to be set per-thread and it is disabled on fork/clone/execv. Internally, this doesn't add overhead to the syscall hot path, and it requires very little per-architecture support. I avoided using seccomp, even though it duplicates some functionality, due to previous feedback that maybe it shouldn't mix with seccomp since it is not a security mechanism. And obviously, this should never be considered a security mechanism, since any part of the program can by-pass it by using the syscall dispatcher. For the sysinfo benchmark, which measures the overhead added to executing a native syscall that doesn't require interception, the overhead using only the direct dispatcher region to issue syscalls is pretty much irrelevant. The overhead of using the selector goes around 40ns for a native (unredirected) syscall in my system, and it is (as expected) dominated by the supervisor-mode user-address access. In fact, with SMAP off, the overhead is consistently less than 5ns on my test box. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com |
|
|
|
22ee3ea588 |
parisc: Make user stack size configurable
On parisc we need to initialize the memory layout for the user stack at process start time to a fixed size, which up until now was limited to the size as given by CONFIG_MAX_STACK_SIZE_MB at compile time. This hard limit was too small and showed problems when compiling ruby2.7, qmlcachegen and some Qt packages. This patch changes two things: a) It increases the default maximum stack size to 100MB. b) Users can modify the stack hard limit size with ulimit and then newly forked processes will use the given stack size which can even be bigger than the default 100MB. Reported-by: John David Anglin <dave.anglin@bell.net> Signed-off-by: Helge Deller <deller@gmx.de> |
|
|
|
96685f8666 |
powerpc updates for 5.10
- A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting it for
powerpc, as well as a related fix for sparc.
- Remove support for PowerPC 601.
- Some fixes for watchpoints & addition of a new ptrace flag for detecting ISA
v3.1 (Power10) watchpoint features.
- A fix for kernels using 4K pages and the hash MMU on bare metal Power9
systems with > 16TB of RAM, or RAM on the 2nd node.
- A basic idle driver for shallow stop states on Power10.
- Tweaks to our sched domains code to better inform the scheduler about the
hardware topology on Power9/10, where two SMT4 cores can be presented by
firmware as an SMT8 core.
- A series doing further reworks & cleanups of our EEH code.
- Addition of a filter for RTAS (firmware) calls done via sys_rtas(), to
prevent root from overwriting kernel memory.
- Other smaller features, fixes & cleanups.
Thanks to:
Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Athira Rajeev, Biwen
Li, Cameron Berkenpas, Cédric Le Goater, Christophe Leroy, Christoph Hellwig,
Colin Ian King, Daniel Axtens, David Dai, Finn Thain, Frederic Barrat, Gautham
R. Shenoy, Greg Kurz, Gustavo Romero, Ira Weiny, Jason Yan, Joel Stanley,
Jordan Niethe, Kajol Jain, Konrad Rzeszutek Wilk, Laurent Dufour, Leonardo
Bras, Liu Shixin, Luca Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar,
Nathan Lynch, Nicholas Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver
O'Halloran, Pedro Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai,
Qinglang Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott
Cheloha, Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
Yingliang, zhengbin.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl+JBQoTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgJJAD/0e3tsFP+9rFlxKSJlDcMW3w7kXDRXE
tG40F1ubYFLU8wtFVR0De3njTRsz5HyaNU6SI8CwPq48mCa7OFn1D1OeHonHXDX9
w6v3GE2S1uXXQnjm+czcfdjWQut0IwWBLx007/S23WcPff3Abc2irupKLNu+Gx29
b/yxJHZSRJVX59jSV94HkdJS75mDHQ3oUOlFGXtuGcUZDufpD1ynRcQOjr0V/8JU
F4WAblFSe7hiczHGqIvfhFVJ+OikEhnj2aEMAL8U7vxzrAZ7RErKCN9s/0Tf0Ktx
FzNEFNLHZGqh+qNDpKKmM+RnaeO2Lcoc9qVn7vMHOsXPzx9F5LJwkI/DgPjtgAq/
mFvGnQB/FapATnQeMluViC/qhEe5bQXLUfPP5i2+QOjK0QqwyFlUMgaVNfsY8jRW
0Q/sNA72Opzst4WUTveCd4SOInlUuat09e5nLooCRLW7u7/jIiXNRSFNvpOiwkfF
EcIPJsi6FUQ4SNbqpRSNEO9fK5JZrrUtmr0pg8I7fZhHYGcxEjqPR6IWCs3DTsak
4/KhjhhTnP/IWJRw6qKAyNhEyEwpWqYZ97SIQbvSb1g/bS47AIdQdJRb0eEoRjhx
sbbnnYFwPFkG4c1yQSIFanT9wNDQ2hFx/c/mRfbd7J+ordx9JsoqXjqrGuhsU/pH
GttJLmkJ5FH+pQ==
=akeX
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting
it for powerpc, as well as a related fix for sparc.
- Remove support for PowerPC 601.
- Some fixes for watchpoints & addition of a new ptrace flag for
detecting ISA v3.1 (Power10) watchpoint features.
- A fix for kernels using 4K pages and the hash MMU on bare metal
Power9 systems with > 16TB of RAM, or RAM on the 2nd node.
- A basic idle driver for shallow stop states on Power10.
- Tweaks to our sched domains code to better inform the scheduler about
the hardware topology on Power9/10, where two SMT4 cores can be
presented by firmware as an SMT8 core.
- A series doing further reworks & cleanups of our EEH code.
- Addition of a filter for RTAS (firmware) calls done via sys_rtas(),
to prevent root from overwriting kernel memory.
- Other smaller features, fixes & cleanups.
Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe
Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn
Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero,
Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad
Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca
Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas
Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro
Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang
Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha,
Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
Yingliang, zhengbin.
* tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits)
Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"
selftests/powerpc: Fix eeh-basic.sh exit codes
cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
powerpc/time: Make get_tb() common to PPC32 and PPC64
powerpc/time: Make get_tbl() common to PPC32 and PPC64
powerpc/time: Remove get_tbu()
powerpc/time: Avoid using get_tbl() and get_tbu() internally
powerpc/time: Make mftb() common to PPC32 and PPC64
powerpc/time: Rename mftbl() to mftb()
powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S
powerpc/32s: Rename head_32.S to head_book3s_32.S
powerpc/32s: Setup the early hash table at all time.
powerpc/time: Remove ifdef in get_dec() and set_dec()
powerpc: Remove get_tb_or_rtc()
powerpc: Remove __USE_RTC()
powerpc: Tidy up a bit after removal of PowerPC 601.
powerpc: Remove support for PowerPC 601
powerpc: Remove PowerPC 601
powerpc: Drop SYNC_601() ISYNC_601() and SYNC()
powerpc: Remove CONFIG_PPC601_SYNC_FIX
...
|
|
|
|
726eb70e0d |
Char/Misc driver patches for 5.10-rc1
Here is the big set of char, misc, and other assorted driver subsystem patches for 5.10-rc1. There's a lot of different things in here, all over the drivers/ directory. Some summaries: - soundwire driver updates - habanalabs driver updates - extcon driver updates - nitro_enclaves new driver - fsl-mc driver and core updates - mhi core and bus updates - nvmem driver updates - eeprom driver updates - binder driver updates and fixes - vbox minor bugfixes - fsi driver updates - w1 driver updates - coresight driver updates - interconnect driver updates - misc driver updates - other minor driver updates All of these have been in linux-next for a while with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCX4g8YQ8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+yngKgCeNpArCP/9vQJRK9upnDm8ZLunSCUAn1wUT/2A /bTQ42c/WRQ+LU828GSM =6sO2 -----END PGP SIGNATURE----- Merge tag 'char-misc-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver updates from Greg KH: "Here is the big set of char, misc, and other assorted driver subsystem patches for 5.10-rc1. There's a lot of different things in here, all over the drivers/ directory. Some summaries: - soundwire driver updates - habanalabs driver updates - extcon driver updates - nitro_enclaves new driver - fsl-mc driver and core updates - mhi core and bus updates - nvmem driver updates - eeprom driver updates - binder driver updates and fixes - vbox minor bugfixes - fsi driver updates - w1 driver updates - coresight driver updates - interconnect driver updates - misc driver updates - other minor driver updates All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (396 commits) binder: fix UAF when releasing todo list docs: w1: w1_therm: Fix broken xref, mistakes, clarify text misc: Kconfig: fix a HISI_HIKEY_USB dependency LSM: Fix type of id parameter in kernel_post_load_data prototype misc: Kconfig: add a new dependency for HISI_HIKEY_USB firmware_loader: fix a kernel-doc markup w1: w1_therm: make w1_poll_completion static binder: simplify the return expression of binder_mmap test_firmware: Test partial read support firmware: Add request_partial_firmware_into_buf() firmware: Store opt_flags in fw_priv fs/kernel_file_read: Add "offset" arg for partial reads IMA: Add support for file reads without contents LSM: Add "contents" flag to kernel_read_file hook module: Call security_kernel_post_load_data() firmware_loader: Use security_post_load_data() LSM: Introduce kernel_post_load_data() hook fs/kernel_read_file: Add file_size output argument fs/kernel_read_file: Switch buffer size arg to size_t fs/kernel_read_file: Remove redundant size argument ... |
|
|
|
5287b07f6d |
fs/kernel_read_file: Split into separate source file
These routines are used in places outside of exec(2), so in preparation for refactoring them, move them into a separate source file, fs/kernel_read_file.c. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Scott Branden <scott.branden@broadcom.com> Link: https://lore.kernel.org/r/20201002173828.2099543-5-keescook@chromium.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
b89999d004 |
fs/kernel_read_file: Split into separate include file
Move kernel_read_file* out of linux/fs.h to its own linux/kernel_read_file.h include file. That header gets pulled in just about everywhere and doesn't really need functions not related to the general fs interface. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Scott Branden <scott.branden@broadcom.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: James Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/r/20200706232309.12010-2-scott.branden@broadcom.com Link: https://lore.kernel.org/r/20201002173828.2099543-4-keescook@chromium.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
c307459b9d |
fs/kernel_read_file: Remove FIRMWARE_PREALLOC_BUFFER enum
FIRMWARE_PREALLOC_BUFFER is a "how", not a "what", and confuses the LSMs that are interested in filtering between types of things. The "how" should be an internal detail made uninteresting to the LSMs. Fixes: |
|
|
|
0f2122045b |
io_uring: don't rely on weak ->files references
Grab actual references to the files_struct. To avoid circular references issues due to this, we add a per-task note that keeps track of what io_uring contexts a task has used. When the tasks execs or exits its assigned files, we cancel requests based on this tracking. With that, we can grab proper references to the files table, and no longer need to rely on stashing away ring_fd and ring_file to check if the ring_fd may have been closed. Cc: stable@vger.kernel.org # v5.5+ Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
|
|
d53c3dfb23 |
mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.
This is similar to commit
|
|
|
|
64019a2e46 |
mm/gup: remove task_struct pointer for all gup code
After the cleanup of page fault accounting, gup does not need to pass task_struct around any more. Remove that parameter in the whole gup stack. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
0fd338b2d2 |
exec: move path_noexec() check earlier
The path_noexec() check, like the regular file check, was happening too
late, letting LSMs see impossible execve()s. Check it earlier as well in
may_open() and collect the redundant fs/exec.c path_noexec() test under
the same robustness comment as the S_ISREG() check.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs path_noexec() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
security_file_open(f)
open()
/* old location of path_noexec() test */
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
633fb6ac39 |
exec: move S_ISREG() check earlier
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late. This was
fixed in commit
|
|
|
|
db19c91c3b |
exec: change uselib(2) IS_SREG() failure to EACCES
Patch series "Relocate execve() sanity checks", v2. While looking at the code paths for the proposed O_MAYEXEC flag, I saw some things that looked like they should be fixed up. exec: Change uselib(2) IS_SREG() failure to EACCES This just regularizes the return code on uselib(2). exec: Move S_ISREG() check earlier This moves the S_ISREG() check even earlier than it was already. exec: Move path_noexec() check earlier This adds the path_noexec() check to the same place as the S_ISREG() check. This patch (of 3): Change uselib(2)' S_ISREG() error return to EACCES instead of EINVAL so the behavior matches execve(2), and the seemingly documented value. The "not a regular file" failure mode of execve(2) is explicitly documented[1], but it is not mentioned in uselib(2)[2] which does, however, say that open(2) and mmap(2) errors may apply. The documentation for open(2) does not include a "not a regular file" error[3], but mmap(2) does[4], and it is EACCES. [1] http://man7.org/linux/man-pages/man2/execve.2.html#ERRORS [2] http://man7.org/linux/man-pages/man2/uselib.2.html#ERRORS [3] http://man7.org/linux/man-pages/man2/open.2.html#ERRORS [4] http://man7.org/linux/man-pages/man2/mmap.2.html#ERRORS Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: http://lkml.kernel.org/r/20200605160013.3954297-1-keescook@chromium.org Link: http://lkml.kernel.org/r/20200605160013.3954297-2-keescook@chromium.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
fe81417596 |
exec: use force_uaccess_begin during exec and exit
Both exec and exit want to ensure that the uaccess routines actually do access user pointers. Use the newly added force_uaccess_begin helper instead of an open coded set_fs for that to prepare for kernel builds where set_fs() does not exist. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nick Hu <nickhu@andestech.com> Cc: Greentime Hu <green.hu@gmail.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Link: http://lkml.kernel.org/r/20200710135706.537715-7-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
be619f7f06 |
exec: Implement kernel_execve
To allow the kernel not to play games with set_fs to call exec implement kernel_execve. The function kernel_execve takes pointers into kernel memory and copies the values pointed to onto the new userspace stack. The calls with arguments from kernel space of do_execve are replaced with calls to kernel_execve. The calls do_execve and do_execveat are made static as there are now no callers outside of exec. The comments that mention do_execve are updated to refer to kernel_execve or execve depending on the circumstances. In addition to correcting the comments, this makes it easy to grep for do_execve and verify it is not used. Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
d8b9cd549e |
exec: Factor bprm_stack_limits out of prepare_arg_pages
In preparation for implementiong kernel_execve (which will take kernel pointers not userspace pointers) factor out bprm_stack_limits out of prepare_arg_pages. This separates the counting which depends upon the getting data from userspace from the calculations of the stack limits which is usable in kernel_execve. The remove prepare_args_pages and compute bprm->argc and bprm->envc directly in do_execveat_common, before bprm_stack_limits is called. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lkml.kernel.org/r/87365u6x60.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
0c9cdff054 |
exec: Factor bprm_execve out of do_execve_common
Currently it is necessary for the usermode helper code and the code that launches init to use set_fs so that pages coming from the kernel look like they are coming from userspace. To allow that usage of set_fs to be removed cleanly the argument copying from userspace needs to happen earlier. Factor bprm_execve out of do_execve_common to separate out the copying of arguments to the newe stack, and the rest of exec. In separating bprm_execve from do_execve_common the copying of the arguments onto the new stack happens earlier. As the copying of the arguments does not depend any security hooks, files, the file table, current->in_execve, current->fs->in_exec, bprm->unsafe, or creds this is safe. Likewise the security hook security_creds_for_exec does not depend upon preventing the argument copying from happening. In addition to making it possible to implement kernel_execve that performs the copying differently, this separation of bprm_execve from do_execve_common makes for a nice separation of responsibilities making the exec code easier to navigate. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lkml.kernel.org/r/878sfm6x6x.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
f18ac551e5 |
exec: Move bprm_mm_init into alloc_bprm
Currently it is necessary for the usermode helper code and the code that launches init to use set_fs so that pages coming from the kernel look like they are coming from userspace. To allow that usage of set_fs to be removed cleanly the argument copying from userspace needs to happen earlier. Move the allocation and initialization of bprm->mm into alloc_bprm so that the bprm->mm is available early to store the new user stack into. This is a prerequisite for copying argv and envp into the new user stack early before ther rest of exec. To keep the things consistent the cleanup of bprm->mm is moved into free_bprm. So that bprm->mm will be cleaned up whenever bprm->mm is allocated and free_bprm are called. Moving bprm_mm_init earlier is safe as it does not depend on any files, current->in_execve, current->fs->in_exec, bprm->unsafe, or the if the file table is shared. (AKA bprm_mm_init does not depend on any of the code that happens between alloc_bprm and where it was previously called.) This moves bprm->mm cleanup after current->fs->in_exec is set to 0. This is safe because current->fs->in_exec is only used to preventy taking an additional reference on the fs_struct. This moves bprm->mm cleanup after current->in_execve is set to 0. This is safe because current->in_execve is only used by the lsms (apparmor and tomoyou) and always for LSM specific functions, never for anything to do with the mm. This adds bprm->mm cleanup into the successful return path. This is safe because being on the successful return path implies that begin_new_exec succeeded and set brpm->mm to NULL. As bprm->mm is NULL bprm cleanup I am moving into free_bprm will do nothing. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lkml.kernel.org/r/87eepe6x7p.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
60d9ad1d1d |
exec: Move initialization of bprm->filename into alloc_bprm
Currently it is necessary for the usermode helper code and the code that launches init to use set_fs so that pages coming from the kernel look like they are coming from userspace. To allow that usage of set_fs to be removed cleanly the argument copying from userspace needs to happen earlier. Move the computation of bprm->filename and possible allocation of a name in the case of execveat into alloc_bprm to make that possible. The exectuable name, the arguments, and the environment are copied into the new usermode stack which is stored in bprm until exec passes the point of no return. As the executable name is copied first onto the usermode stack it needs to be known. As there are no dependencies to computing the executable name, compute it early in alloc_bprm. As an implementation detail if the filename needs to be generated because it embeds a file descriptor store that filename in a new field bprm->fdpath, and free it in free_bprm. Previously this was done in an independent variable pathbuf. I have renamed pathbuf fdpath because fdpath is more suggestive of what kind of path is in the variable. I moved fdpath into struct linux_binprm because it is tightly tied to the other variables in struct linux_binprm, and as such is needed to allow the call alloc_binprm to move. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lkml.kernel.org/r/87k0z66x8f.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
0a8f36eb48 |
exec: Factor out alloc_bprm
Currently it is necessary for the usermode helper code and the code that launches init to use set_fs so that pages coming from the kernel look like they are coming from userspace. To allow that usage of set_fs to be removed cleanly the argument copying from userspace needs to happen earlier. Move the allocation of the bprm into it's own function (alloc_bprm) and move the call of alloc_bprm before unshare_files so that bprm can ultimately be allocated, the arguments can be placed on the new stack, and then the bprm can be passed into the core of exec. Neither the allocation of struct binprm nor the unsharing depend upon each other so swapping the order in which they are called is trivially safe. To keep things consistent the order of cleanup at the end of do_execve_common swapped to match the order of initialization. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/87pn8y6x9a.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
25cf336de5 |
exec: Remove do_execve_file
Now that the last callser has been removed remove this code from exec. For anyone thinking of resurrecing do_execve_file please note that the code was buggy in several fundamental ways. - It did not ensure the file it was passed was read-only and that deny_write_access had been called on it. Which subtlely breaks invaniants in exec. - The caller of do_execve_file was expected to hold and put a reference to the file, but an extra reference for use by exec was not taken so that when exec put it's reference to the file an underflow occured on the file reference count. - The point of the interface was so that a pathname did not need to exist. Which breaks pathname based LSMs. Tetsuo Handa originally reported these issues[1]. While it was clear that deny_write_access was missing the fundamental incompatibility with the passed in O_RDWR filehandle was not immediately recognized. All of these issues were fixed by modifying the usermode driver code to have a path, so it did not need this hack. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> [1] https://lore.kernel.org/linux-fsdevel/2a8775b4-1dd5-9d5c-aa42-9872445e0942@i-love.sakura.ne.jp/ v1: https://lkml.kernel.org/r/871rm2f0hi.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/87lfk54p0m.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-10-ebiederm@xmission.com Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
c1e8d7c6a7 |
mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
d8ed45c5dc |
mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
bce2b68b89 |
exec: use flush_icache_user_range in read_code
read_code operates on user addresses. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/20200515143646.3857579-27-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
48304f7994 |
exec: only build read_code when needed
Only build read_code when binary formats that use it are built into the kernel. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/20200515143646.3857579-26-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
886d7de631 |
Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton: - More MM work. 100ish more to go. Mike Rapoport's "mm: remove __ARCH_HAS_5LEVEL_HACK" series should fix the current ppc issue - Various other little subsystems * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits) lib/ubsan.c: fix gcc-10 warnings tools/testing/selftests/vm: remove duplicate headers selftests: vm: pkeys: fix multilib builds for x86 selftests: vm: pkeys: use the correct page size on powerpc selftests/vm/pkeys: override access right definitions on powerpc selftests/vm/pkeys: test correct behaviour of pkey-0 selftests/vm/pkeys: introduce a sub-page allocator selftests/vm/pkeys: detect write violation on a mapped access-denied-key page selftests/vm/pkeys: associate key on a mapped page and detect write violation selftests/vm/pkeys: associate key on a mapped page and detect access violation selftests/vm/pkeys: improve checks to determine pkey support selftests/vm/pkeys: fix assertion in test_pkey_alloc_exhaust() selftests/vm/pkeys: fix number of reserved powerpc pkeys selftests/vm/pkeys: introduce powerpc support selftests/vm/pkeys: introduce generic pkey abstractions selftests: vm: pkeys: use the correct huge page size selftests/vm/pkeys: fix alloc_random_pkey() to make it really random selftests/vm/pkeys: fix assertion in pkey_disable_set/clear() selftests/vm/pkeys: fix pkey_disable_clear() selftests: vm: pkeys: add helpers for pkey bits ... |
|
|
|
762a3af6fa |
exec: open code copy_string_kernel
Currently copy_string_kernel is just a wrapper around copy_strings that simplifies the calling conventions and uses set_fs to allow passing a kernel pointer. But due to the fact the we only need to handle a single kernel argument pointer, the logic can be sigificantly simplified while getting rid of the set_fs. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/20200501104105.2621149-3-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
986db2d14a |
exec: simplify the copy_strings_kernel calling convention
copy_strings_kernel is always used with a single argument, adjust the calling convention to that. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/20200501104105.2621149-2-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
15a2bc4dbb |
Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman: "Last cycle for the Nth time I ran into bugs and quality of implementation issues related to exec that could not be easily be fixed because of the way exec is implemented. So I have been digging into exec and cleanup up what I can. I don't think I have exec sorted out enough to fix the issues I started with but I have made some headway this cycle with 4 sets of changes. - promised cleanups after introducing exec_update_mutex - trivial cleanups for exec - control flow simplifications - remove the recomputation of bprm->cred The net result is code that is a bit easier to understand and work with and a decrease in the number of lines of code (if you don't count the added tests)" * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits) exec: Compute file based creds only once exec: Add a per bprm->file version of per_clear binfmt_elf_fdpic: fix execfd build regression selftests/exec: Add binfmt_script regression test exec: Remove recursion from search_binary_handler exec: Generic execfd support exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC exec: Move the call of prepare_binprm into search_binary_handler exec: Allow load_misc_binary to call prepare_binprm unconditionally exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds exec: Teach prepare_exec_creds how exec treats uids & gids exec: Set the point of no return sooner exec: Move handling of the point of no return to the top level exec: Run sync_mm_rss before taking exec_update_mutex exec: Fix spelling of search_binary_handler in a comment exec: Move the comment from above de_thread to above unshare_sighand exec: Rename flush_old_exec begin_new_exec exec: Move most of setup_new_exec into flush_old_exec exec: In setup_new_exec cache current in the local variable me ... |
|
|
|
9ff7258575 |
Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull proc updates from Eric Biederman: "This has four sets of changes: - modernize proc to support multiple private instances - ensure we see the exit of each process tid exactly - remove has_group_leader_pid - use pids not tasks in posix-cpu-timers lookup Alexey updated proc so each mount of proc uses a new superblock. This allows people to actually use mount options with proc with no fear of messing up another mount of proc. Given the kernel's internal mounts of proc for things like uml this was a real problem, and resulted in Android's hidepid mount options being ignored and introducing security issues. The rest of the changes are small cleanups and fixes that came out of my work to allow this change to proc. In essence it is swapping the pids in de_thread during exec which removes a special case the code had to handle. Then updating the code to stop handling that special case" * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: proc_pid_ns takes super_block as an argument remove the no longer needed pid_alive() check in __task_pid_nr_ns() posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type posix-cpu-timers: Extend rcu_read_lock removing task_struct references signal: Remove has_group_leader_pid exec: Remove BUG_ON(has_group_leader_pid) posix-cpu-timer: Unify the now redundant code in lookup_task posix-cpu-timer: Tidy up group_leader logic in lookup_task proc: Ensure we see the exit of each process tid exactly once rculist: Add hlists_swap_heads_rcu proc: Use PIDTYPE_TGID in next_tgid Use proc_pid_ns() to get pid_namespace from the proc superblock proc: use named enums for better readability proc: use human-readable values for hidepid docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior proc: add option to mount only a pids subset proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option proc: allow to mount many instances of proc in one pid namespace proc: rename struct proc_fs_info to proc_fs_opts |
|
|
|
56305aa9b6 |
exec: Compute file based creds only once
Move the computation of creds from prepare_binfmt into begin_new_exec so that the creds need only be computed once. This is just code reorganization no semantic changes of any kind are made. Moving the computation is safe. I have looked through the kernel and verified none of the binfmts look at bprm->cred directly, and that there are no helpers that look at bprm->cred indirectly. Which means that it is not a problem to compute the bprm->cred later in the execution flow as it is not used until it becomes current->cred. A new function bprm_creds_from_file is added to contain the work that needs to be done. bprm_creds_from_file first computes which file bprm->executable or most likely bprm->file that the bprm->creds will be computed from. The funciton bprm_fill_uid is updated to receive the file instead of accessing bprm->file. The now unnecessary work needed to reset the bprm->cred->euid, and bprm->cred->egid is removed from brpm_fill_uid. A small comment to document that bprm_fill_uid now only deals with the work to handle suid and sgid files. The default case is already heandled by prepare_exec_creds. The function security_bprm_repopulate_creds is renamed security_bprm_creds_from_file and now is explicitly passed the file from which to compute the creds. The documentation of the bprm_creds_from_file security hook is updated to explain when the hook is called and what it needs to do. The file is passed from cap_bprm_creds_from_file into get_file_caps so that the caps are computed for the appropriate file. The now unnecessary work in cap_bprm_creds_from_file to reset the ambient capabilites has been removed. A small comment to document that the work of cap_bprm_creds_from_file is to read capabilities from the files secureity attribute and derive capabilities from the fact the user had uid 0 has been added. Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
a7868323c2 |
exec: Add a per bprm->file version of per_clear
There is a small bug in the code that recomputes parts of bprm->cred
for every bprm->file. The code never recomputes the part of
clear_dangerous_personality_flags it is responsible for.
Which means that in practice if someone creates a sgid script
the interpreter will not be able to use any of:
READ_IMPLIES_EXEC
ADDR_NO_RANDOMIZE
ADDR_COMPAT_LAYOUT
MMAP_PAGE_ZERO.
This accentially clearing of personality flags probably does
not matter in practice because no one has complained
but it does make the code more difficult to understand.
Further remaining bug compatible prevents the recomputation from being
removed and replaced by simply computing bprm->cred once from the
final bprm->file.
Making this change removes the last behavior difference between
computing bprm->creds from the final file and recomputing
bprm->cred several times. Which allows this behavior change
to be justified for it's own reasons, and for any but hunts
looking into why the behavior changed to wind up here instead
of in the code that will follow that computes bprm->cred
from the final bprm->file.
This small logic bug appears to have existed since the code
started clearing dangerous personality bits.
History Tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 1bb0fa189c6a ("[PATCH] NX: clean up legacy binary support")
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
|
|
bc2bf338d5 |
exec: Remove recursion from search_binary_handler
Recursion in kernel code is generally a bad idea as it can overflow the kernel stack. Recursion in exec also hides that the code is looping and that the loop changes bprm->file. Instead of recursing in search_binary_handler have the methods that would recurse set bprm->interpreter and return 0. Modify exec_binprm to loop when bprm->interpreter is set. Consolidate all of the reassignments of bprm->file in that loop to make it clear what is going on. The structure of the new loop in exec_binprm is that all errors return immediately, while successful completion (ret == 0 && !bprm->interpreter) just breaks out of the loop and runs what exec_bprm has always run upon successful completion. Fail if the an interpreter is being call after execfd has been set. The code has never properly handled an interpreter being called with execfd being set and with reassignments of bprm->file and the assignment of bprm->executable in generic code it has finally become possible to test and fail when if this problematic condition happens. With the reassignments of bprm->file and the assignment of bprm->executable moved into the generic code add a test to see if bprm->executable is being reassigned. In search_binary_handler remove the test for !bprm->file. With all reassignments of bprm->file moved to exec_binprm bprm->file can never be NULL in search_binary_handler. Link: https://lkml.kernel.org/r/87sgfwyd84.fsf_-_@x220.int.ebiederm.org Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
b8a61c9e7b |
exec: Generic execfd support
Most of the support for passing the file descriptor of an executable to an interpreter already lives in the generic code and in binfmt_elf. Rework the fields in binfmt_elf that deal with executable file descriptor passing to make executable file descriptor passing a first class concept. Move the fd_install from binfmt_misc into begin_new_exec after the new creds have been installed. This means that accessing the file through /proc/<pid>/fd/N is able to see the creds for the new executable before allowing access to the new executables files. Performing the install of the executables file descriptor after the point of no return also means that nothing special needs to be done on error. The exiting of the process will close all of it's open files. Move the would_dump from binfmt_misc into begin_new_exec right after would_dump is called on the bprm->file. This makes it obvious this case exists and that no nesting of bprm->file is currently supported. In binfmt_misc the movement of fd_install into generic code means that it's special error exit path is no longer needed. Link: https://lkml.kernel.org/r/87y2poyd91.fsf_-_@x220.int.ebiederm.org Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
8b72ca9004 |
exec: Move the call of prepare_binprm into search_binary_handler
The code in prepare_binary_handler needs to be run every time search_binary_handler is called so move the call into search_binary_handler itself to make the code simpler and easier to understand. Link: https://lkml.kernel.org/r/87d070zrvx.fsf_-_@x220.int.ebiederm.org Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
a16b3357b2 |
exec: Allow load_misc_binary to call prepare_binprm unconditionally
Add a flag preserve_creds that binfmt_misc can set to prevent
credentials from being updated. This allows binfmt_misc to always
call prepare_binprm. Allowing the credential computation logic to be
consolidated.
Not replacing the credentials with the interpreters credentials is
safe because because an open file descriptor to the executable is
passed to the interpreter. As the interpreter does not need to
reopen the executable it is guaranteed to see the same file that
exec sees.
Ref: c407c033de84 ("[PATCH] binfmt_misc: improve calculation of interpreter's credentials")
Link: https://lkml.kernel.org/r/87imgszrwo.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
|
|
112b714759 |
exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds
Rename bprm->cap_elevated to bprm->active_secureexec and initialize it in prepare_binprm instead of in cap_bprm_set_creds. Initializing bprm->active_secureexec in prepare_binprm allows multiple implementations of security_bprm_repopulate_creds to play nicely with each other. Rename security_bprm_set_creds to security_bprm_reopulate_creds to emphasize that this path recomputes part of bprm->cred. This recomputation avoids the time of check vs time of use problems that are inherent in unix #! interpreters. In short two renames and a move in the location of initializing bprm->active_secureexec. Link: https://lkml.kernel.org/r/87o8qkzrxp.fsf_-_@x220.int.ebiederm.org Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
b8bff59926 |
exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds
Today security_bprm_set_creds has several implementations: apparmor_bprm_set_creds, cap_bprm_set_creds, selinux_bprm_set_creds, smack_bprm_set_creds, and tomoyo_bprm_set_creds. Except for cap_bprm_set_creds they all test bprm->called_set_creds and return immediately if it is true. The function cap_bprm_set_creds ignores bprm->calld_sed_creds entirely. Create a new LSM hook security_bprm_creds_for_exec that is called just before prepare_binprm in __do_execve_file, resulting in a LSM hook that is called exactly once for the entire of exec. Modify the bits of security_bprm_set_creds that only want to be called once per exec into security_bprm_creds_for_exec, leaving only cap_bprm_set_creds behind. Remove bprm->called_set_creds all of it's former users have been moved to security_bprm_creds_for_exec. Add or upate comments a appropriate to bring them up to date and to reflect this change. Link: https://lkml.kernel.org/r/87v9kszrzh.fsf_-_@x220.int.ebiederm.org Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Casey Schaufler <casey@schaufler-ca.com> # For the LSM and Smack bits Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
b127c16d06 |
Merge f87d1c9559 ("exec: Move would_dump into flush_old_exec")
The change to exec is relevant to the cleanup work I have been doing. Merge it here so that I can build on top of it, and so hopefully that other merge logic can pick up on this and see how to deal with the conflict between that change and my exec cleanup work. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
f87d1c9559 |
exec: Move would_dump into flush_old_exec
I goofed when I added mm->user_ns support to would_dump. I missed the
fact that in the case of binfmt_loader, binfmt_em86, binfmt_misc, and
binfmt_script bprm->file is reassigned. Which made the move of
would_dump from setup_new_exec to __do_execve_file before exec_binprm
incorrect as it can result in would_dump running on the script instead
of the interpreter of the script.
The net result is that the code stopped making unreadable interpreters
undumpable. Which allows them to be ptraced and written to disk
without special permissions. Oops.
The move was necessary because the call in set_new_exec was after
bprm->mm was no longer valid.
To correct this mistake move the misplaced would_dump from
__do_execve_file into flos_old_exec, before exec_mmap is called.
I tested and confirmed that without this fix I can attach with gdb to
a script with an unreadable interpreter, and with this fix I can not.
Cc: stable@vger.kernel.org
Fixes:
|
|
|
|
6834e0bb41 |
exec: Set the point of no return sooner
Make the code more robust by marking the point of no return sooner. This ensures that future code changes don't need to worry about how they return errors if they are past this point. This results in no actual change in behavior as __do_execve_file does not force SIGSEGV when there is a pending fatal signal pending past the point of no return. Further the only error returns from de_thread and exec_mmap that can occur result in fatal signals being pending. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/87sgga5klu.fsf_-_@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
8890b29341 |
exec: Move handling of the point of no return to the top level
Move the handing of the point of no return from search_binary_handler into __do_execve_file so that it is easier to find, and to keep things robust in the face of change. Make it clear that an existing fatal signal will take precedence over a forced SIGSEGV by not forcing SIGSEGV if a fatal signal is already pending. This does not change the behavior but it saves a reader of the code the tedium of reading and understanding force_sig and the signal delivery code. Update the comment in begin_new_exec about where SIGSEGV is forced. Keep point_of_no_return from being a mystery by documenting what the code is doing where it forces SIGSEGV if the code is past the point of no return. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/87y2q25knl.fsf_-_@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
a28bf136e6 |
exec: Run sync_mm_rss before taking exec_update_mutex
Like exec_mm_release sync_mm_rss is about flushing out the state of the old_mm, which does not need to happen under exec_update_mutex. Make this explicit by moving sync_mm_rss outside of exec_update_mutex. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/875zd66za3.fsf_-_@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
13c432b514 |
exec: Fix spelling of search_binary_handler in a comment
Link: https://lkml.kernel.org/r/87h7wq6zc1.fsf_-_@x220.int.ebiederm.org Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
7a60ef4803 |
exec: Move the comment from above de_thread to above unshare_sighand
The comment describes work that now happens in unshare_sighand so move the comment where it makes sense. Link: https://lkml.kernel.org/r/87mu6i6zcs.fsf_-_@x220.int.ebiederm.org Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
2388777a0a |
exec: Rename flush_old_exec begin_new_exec
There is and has been for a very long time been a lot more going on in flush_old_exec than just flushing the old state. After the movement of code from setup_new_exec there is a whole lot more going on than just flushing the old executables state. Rename flush_old_exec to begin_new_exec to more accurately reflect what this function does. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
df9e4d2c4a |
exec: Move most of setup_new_exec into flush_old_exec
The current idiom for the callers is:
flush_old_exec(bprm);
set_personality(...);
setup_new_exec(bprm);
In 2010 Linus split flush_old_exec into flush_old_exec and
setup_new_exec. With the intention that setup_new_exec be what is
called after the processes new personality is set.
Move the code that doesn't depend upon the personality from
setup_new_exec into flush_old_exec. This is to facilitate future
changes by having as much code together in one function as possible.
To see why it is safe to move this code please note that effectively
this change moves the personality setting in the binfmt and the following
three lines of code after everything except unlocking the mutexes:
arch_pick_mmap_layout
arch_setup_new_exec
mm->task_size = TASK_SIZE
The function arch_pick_mmap_layout at most sets:
mm->get_unmapped_area
mm->mmap_base
mm->mmap_legacy_base
mm->mmap_compat_base
mm->mmap_compat_legacy_base
which nothing in flush_old_exec or setup_new_exec depends on.
The function arch_setup_new_exec only sets architecture specific
state and the rest of the functions only deal in state that applies
to all architectures.
The last line just sets mm->task_size and again nothing in flush_old_exec
or setup_new_exec depend on task_size.
Ref:
|
|
|
|
7d503feba0 |
exec: In setup_new_exec cache current in the local variable me
At least gcc 8.3 when generating code for x86_64 has a hard time consolidating multiple calls to current aka get_current(), and winds up unnecessarily rereading %gs:current_task several times in setup_new_exec. Caching the value of current in the local variable of me generates slightly better and shorter assembly. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
96ecee29b0 |
exec: Merge install_exec_creds into setup_new_exec
The two functions are now always called one right after the other so merge them together to make future maintenance easier. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
1507b7a30a |
exec: Rename the flag called_exec_mmap point_of_no_return
Update the comments and make the code easier to understand by renaming this flag. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
89826cce37 |
exec: Make unlocking exec_update_mutex explict
With install_exec_creds updated to follow immediately after setup_new_exec, the failure of unshare_sighand is the only code path where exec_update_mutex is held but not explicitly unlocked. Update that code path to explicitly unlock exec_update_mutex. Remove the unlocking of exec_update_mutex from free_bprm. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
610b818856 |
exec: Remove BUG_ON(has_group_leader_pid)
With the introduction of exchange_tids thread_group_leader and has_group_leader_pid have become equivalent. Further at this point in the code a thread group has exactly two threads, the previous thread_group_leader that is waiting to be reaped and tsk. So we know it is impossible for tsk to be the thread_group_leader. This is also the last user of has_group_leader_pid so removing this check will allow has_group_leader_pid to be removed. So remove the "BUG_ON(has_group_leader_pid)" that will never fire. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
6b03d1304a |
proc: Ensure we see the exit of each process tid exactly once
When the thread group leader changes during exec and the old leaders thread is reaped proc_flush_pid will flush the dentries for the entire process because the leader still has it's original pid. Fix this by exchanging the pids in an rcu safe manner, and wrapping the code to do that up in a helper exchange_tids. When I removed switch_exec_pids and introduced this behavior in |
|
|
|
d987ca1c6b |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull exec/proc updates from Eric Biederman: "This contains two significant pieces of work: the work to sort out proc_flush_task, and the work to solve a deadlock between strace and exec. Fixing proc_flush_task so that it no longer requires a persistent mount makes improvements to proc possible. The removal of the persistent mount solves an old regression that that caused the hidepid mount option to only work on remount not on mount. The regression was found and reported by the Android folks. This further allows Alexey Gladkov's work making proc mount options specific to an individual mount of proc to move forward. The work on exec starts solving a long standing issue with exec that it takes mutexes of blocking userspace applications, which makes exec extremely deadlock prone. For the moment this adds a second mutex with a narrower scope that handles all of the easy cases. Which makes the tricky cases easy to spot. With a little luck the code to solve those deadlocks will be ready by next merge window" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits) signal: Extend exec_id to 64bits pidfd: Use new infrastructure to fix deadlocks in execve perf: Use new infrastructure to fix deadlocks in execve proc: io_accounting: Use new infrastructure to fix deadlocks in execve proc: Use new infrastructure to fix deadlocks in execve kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve kernel: doc: remove outdated comment cred.c mm: docs: Fix a comment in process_vm_rw_core selftests/ptrace: add test cases for dead-locks exec: Fix a deadlock in strace exec: Add exec_update_mutex to replace cred_guard_mutex exec: Move exec_mmap right after de_thread in flush_old_exec exec: Move cleanup of posix timers on exec out of de_thread exec: Factor unshare_sighand out of de_thread and call it separately exec: Only compute current once in flush_old_exec pid: Improve the comment about waiting in zap_pid_ns_processes proc: Remove the now unnecessary internal mount of proc uml: Create a private mount of proc for mconsole uml: Don't consult current to find the proc_mnt in mconsole_proc proc: Use a list of inodes to flush from proc ... |
|
|
|
d1e7fd6462 |
signal: Extend exec_id to 64bits
Replace the 32bit exec_id with a 64bit exec_id to make it impossible to wrap the exec_id counter. With care an attacker can cause exec_id wrap and send arbitrary signals to a newly exec'd parent. This bypasses the signal sending checks if the parent changes their credentials during exec. The severity of this problem can been seen that in my limited testing of a 32bit exec_id it can take as little as 19s to exec 65536 times. Which means that it can take as little as 14 days to wrap a 32bit exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7 days. Even my slower timing is in the uptime of a typical server. Which means self_exec_id is simply a speed bump today, and if exec gets noticably faster self_exec_id won't even be a speed bump. Extending self_exec_id to 64bits introduces a problem on 32bit architectures where reading self_exec_id is no longer atomic and can take two read instructions. Which means that is is possible to hit a window where the read value of exec_id does not match the written value. So with very lucky timing after this change this still remains expoiltable. I have updated the update of exec_id on exec to use WRITE_ONCE and the read of exec_id in do_notify_parent to use READ_ONCE to make it clear that there is no locking between these two locations. Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl Fixes: 2.3.23pre2 Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
eea9673250 |
exec: Add exec_update_mutex to replace cred_guard_mutex
The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possible indefinite waits for userspace that I have identified are: The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The cred_guard_mutex is held over "get_user(futex_offset, ...") in exit_robust_list. The cred_guard_mutex held over copy_strings. The functions get_user and put_user can trigger a page fault which can potentially wait indefinitely in the case of userfaultfd or if userspace implements part of the page fault path. In any of those cases the userspace process that the kernel is waiting for might make a different system call that winds up taking the cred_guard_mutex and result in deadlock. Holding a mutex over any of those possibly indefinite waits for userspace does not appear necessary. Add exec_update_mutex that will just cover updating the process during exec where the permissions and the objects pointed to by the task struct may be out of sync. The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions. Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/ Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> |
|
|
|
ccf0fa6be0 |
exec: Move exec_mmap right after de_thread in flush_old_exec
I have read through the code in exec_mmap and I do not see anything that depends on sighand or the sighand lock, or on signals in anyway so this should be safe. This rearrangement of code has two significant benefits. It makes the determination of passing the point of no return by testing bprm->mm accurate. All failures prior to that point in flush_old_exec are either truly recoverable or they are fatal. Further this consolidates all of the possible indefinite waits for userspace together at the top of flush_old_exec. The possible wait for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault to be resolved in clear_child_tid, and the possible wait for a page fault in exit_robust_list. This consolidation allows the creation of a mutex to replace cred_guard_mutex that is not held over possible indefinite userspace waits. Which will allow removing deadlock scenarios from the kernel. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> |
|
|
|
153ffb6ba4 |
exec: Move cleanup of posix timers on exec out of de_thread
These functions have very little to do with de_thread move them out of de_thread an into flush_old_exec proper so it can be more clearly seen what flush_old_exec is doing. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> |
|
|
|
0216915592 |
exec: Factor unshare_sighand out of de_thread and call it separately
This makes the code clearer and makes it easier to implement a mutex that is not taken over any locations that may block indefinitely waiting for userspace. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> |
|
|
|
2ca7be7d55 |
exec: Only compute current once in flush_old_exec
Make it clear that current only needs to be computed once in flush_old_exec. This may have some efficiency improvements and it makes the code easier to change. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> |
|
|
|
901cff7cb9 |
firmware_loader: load files from the mount namespace of init
I have an experimental setup where almost every possible system service (even early startup ones) runs in separate namespace, using a dedicated, minimal file system. In process of minimizing the contents of the file systems with regards to modules and firmware files, I noticed that in my system, the firmware files are loaded from three different mount namespaces, those of systemd-udevd, init and systemd-networkd. The logic of the source namespace is not very clear, it seems to depend on the driver, but the namespace of the current process is used. So, this patch tries to make things a bit clearer and changes the loading of firmware files only from the mount namespace of init. This may also improve security, though I think that using firmware files as attack vector could be too impractical anyway. Later, it might make sense to make the mount namespace configurable, for example with a new file in /proc/sys/kernel/firmware_config/. That would allow a dedicated file system only for firmware files and those need not be present anywhere else. This configurability would make more sense if made also for kernel modules and /sbin/modprobe. Modules are already loaded from init namespace (usermodehelper uses kthreadd namespace) except when directly loaded by systemd-udevd. Instead of using the mount namespace of the current process to load firmware files, use the mount namespace of init process. Link: https://lore.kernel.org/lkml/bb46ebae-4746-90d9-ec5b-fce4c9328c86@gmail.com/ Link: https://lore.kernel.org/lkml/0e3f7653-c59d-9341-9db2-c88f5b988c68@gmail.com/ Signed-off-by: Topi Miettinen <toiwoton@gmail.com> Link: https://lore.kernel.org/r/20200123125839.37168-1-toiwoton@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
7eec11d3a7 |
Merge branch 'akpm' (patches from Andrew)
Pull updates from Andrew Morton:
"Most of -mm and quite a number of other subsystems: hotfixes, scripts,
ocfs2, misc, lib, binfmt, init, reiserfs, exec, dma-mapping, kcov.
MM is fairly quiet this time. Holidays, I assume"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
kcov: ignore fault-inject and stacktrace
include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()
execve: warn if process starts with executable stack
reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
init/main.c: fix misleading "This architecture does not have kernel memory protection" message
init/main.c: fix quoted value handling in unknown_bootoption
init/main.c: remove unnecessary repair_env_string in do_initcall_level
init/main.c: log arguments and environment passed to init
fs/binfmt_elf.c: coredump: allow process with empty address space to coredump
fs/binfmt_elf.c: coredump: delete duplicated overflow check
fs/binfmt_elf.c: coredump: allocate core ELF header on stack
fs/binfmt_elf.c: make BAD_ADDR() unlikely
fs/binfmt_elf.c: better codegen around current->mm
fs/binfmt_elf.c: don't copy ELF header around
fs/binfmt_elf.c: fix ->start_code calculation
fs/binfmt_elf.c: smaller code generation around auxv vector fill
lib/find_bit.c: uninline helper _find_next_bit()
lib/find_bit.c: join _find_next_bit{_le}
uapi: rename ext2_swab() to swab() and share globally in swab.h
lib/scatterlist.c: adjust indentation in __sg_alloc_table
...
|
|
|
|
47a2ebb7f5 |
execve: warn if process starts with executable stack
There were few episodes of silent downgrade to an executable stack over
years:
1) linking innocent looking assembly file will silently add executable
stack if proper linker options is not given as well:
$ cat f.S
.intel_syntax noprefix
.text
.globl f
f:
ret
$ cat main.c
void f(void);
int main(void)
{
f();
return 0;
}
$ gcc main.c f.S
$ readelf -l ./a.out
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RWE 0x10
^^^
2) converting C99 nested function into a closure
https://nullprogram.com/blog/2019/11/15/
void intsort2(int *base, size_t nmemb, _Bool invert)
{
int cmp(const void *a, const void *b)
{
int r = *(int *)a - *(int *)b;
return invert ? -r : r;
}
qsort(base, nmemb, sizeof(*base), cmp);
}
will silently require stack trampolines while non-closure version will
not.
Without doubt this behaviour is documented somewhere, add a warning so
that developers and users can at least notice. After so many years of
x86_64 having proper executable stack support it should not cause too
many problems.
Link: http://lkml.kernel.org/r/20191208171918.GC19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Will Deacon <will@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
ccaaaf6fe5 |
MPX requires recompiling applications, which requires compiler support.
Unfortunately, GCC 9.1 is expected to be be released without support for MPX. This means that there was only a relatively small window where folks could have ever used MPX. It failed to gain wide adoption in the industry, and Linux was the only mainstream OS to ever support it widely. Support for the feature may also disappear on future processors. This set completes the process that we started during the 5.4 merge window. -----BEGIN PGP SIGNATURE----- iQIcBAABCAAGBQJeK1/pAAoJEGg1lTBwyZKwgC8QAIiVn1d7A9Uj/WpnpgfCChCZ 9XiV6Ak999qD9fbAcrgNfPjieaD4mtokocSRVJuRgJu5iLnIJCINlozLPe4yVl7P 7zebnxkLq0CIA8d56bEUoFlC0J+oWYlDVQePZzNQsSk5KHVGXVLpF6U4vDVzZeQy cprgvdeY+ehB7G6IIo0MWTg5ylKYAsOAyVvK8NIGpKY2k6/YqCnsptnsVE7bvlHy TrEOiUWLv+hh0bMkZdP1PwKQKEuMO/IZly0HtviFbMN7T4TB1spfg7ELoBucEq3T s4EVbYRe+nIE4tuEAveaX3CgxJek8cY5MlticskdaKSEACBwabdOF55qsZy0u+WA PYC4iUIXfbOH8OgieKWtGX4IuSkRYdQ2nP4BOpe4ZX4+zvU7zOCIyVSKRrwkX8cc ADtWI5FAtB36KCgUuWnHGHNZpOxPTbTLBuBataFY4Q2uBNJEBJpscZ5H9ObtyGFU ZjlzqFnM0nFNDKEI1EEtv9jLzgZTU1RQ46s7EFeSeEQ2/s9wJ3+s5sBlVbljsmus o658bLOEaRWC/aF15dgmEXW9GAO6uifNdmbzGnRn7oEMYyFQPTWbZvi1zGz58QaG Y6WTtigVtsSrHS4wpYd+p+n1W06VnB6J3BpBM4G1VQv1Vm0dNd1tUOfkqOzPjg7c 33Itmsz2LaW1mb67GlgZ =g4cC -----END PGP SIGNATURE----- Merge tag 'mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-mpx Pull x86 MPX removal from Dave Hansen: "MPX requires recompiling applications, which requires compiler support. Unfortunately, GCC 9.1 is expected to be be released without support for MPX. This means that there was only a relatively small window where folks could have ever used MPX. It failed to gain wide adoption in the industry, and Linux was the only mainstream OS to ever support it widely. Support for the feature may also disappear on future processors. This set completes the process that we started during the 5.4 merge window when the MPX prctl()s were removed. XSAVE support is left in place, which allows MPX-using KVM guests to continue to function" * tag 'mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-mpx: x86/mpx: remove MPX from arch/x86 mm: remove arch_bprm_mm_init() hook x86/mpx: remove bounds exception code x86/mpx: remove build infrastructure x86/alternatives: add missing insn.h include |
|
|
|
42222eae17 |
mm: remove arch_bprm_mm_init() hook
From: Dave Hansen <dave.hansen@linux.intel.com> MPX is being removed from the kernel due to a lack of support in the toolchain going forward (gcc). arch_bprm_mm_init() is used at execve() time. The only non-stub implementation is on x86 for MPX. Remove the hook entirely from all architectures and generic code. Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: x86@kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> |
|
|
|
043cf46825 |
Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Ingo Molnar:
"The main changes in the timer code in this cycle were:
- Clockevent updates:
- timer-of framework cleanups. (Geert Uytterhoeven)
- Use timer-of for the renesas-ostm and the device name to prevent
name collision in case of multiple timers. (Geert Uytterhoeven)
- Check if there is an error after calling of_clk_get in asm9260
(Chuhong Yuan)
- ABI fix: Zero out high order bits of nanoseconds on compat
syscalls. This got broken a year ago, with apparently no side
effects so far.
Since the kernel would use random data otherwise I don't think we'd
have other options but to fix the bug, even if there was a side
effect to applications (Dmitry Safonov)
- Optimize ns_to_timespec64() on 32-bit systems: move away from
div_s64_rem() which can be slow, to div_u64_rem() which is faster
(Arnd Bergmann)
- Annotate KCSAN-reported false positive data races in
hrtimer_is_queued() users by moving timer->state handling over to
the READ_ONCE()/WRITE_ONCE() APIs. This documents these accesses
(Eric Dumazet)
- Misc cleanups and small fixes"
[ I undid the "ABI fix" and updated the comments instead. The reason
there were apparently no side effects is that the fix was a no-op.
The updated comment is to say _why_ it was a no-op. - Linus ]
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Zero the upper 32-bits in __kernel_timespec on 32-bit
time: Rename tsk->real_start_time to ->start_boottime
hrtimer: Remove the comment about not used HRTIMER_SOFTIRQ
time: Fix spelling mistake in comment
time: Optimize ns_to_timespec64()
hrtimer: Annotate lockless access to timer->state
clocksource/drivers/asm9260: Add a check for of_clk_get
clocksource/drivers/renesas-ostm: Use unique device name instead of ostm
clocksource/drivers/renesas-ostm: Convert to timer_of
clocksource/drivers/timer-of: Use unique device name instead of timer
clocksource/drivers/timer-of: Convert last full_name to %pOF
|
|
|
|
6a965666b7 |
Pipework for general notification queue
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl3O0OoACgkQ+7dXa6fL
C2tAwA//VH9Y81azemXFdflDF90sSH3TCASlKHVYHbBNAkH/QP5F00G4BEM4nNqH
F3x7qcU9vzfGdumF1pc90Yt6XSYlsQEGF+xMyMw/VS2wKs40yv+b/doVbzOWbN9C
NfrklgHeuuBk+JzU2llDisVqKRTLt4SmDpYu1ZdcchUQFZCCl3BpgdSEC+xXrHay
+KlRPVNMSd2kXMCDuSWrr71lVNdCTdf3nNC5p1i780+VrgpIBIG/jmiNdCcd7PLH
1aesPlr8UZY3+bmRtqe587fVRAhT2qA2xibKtyf9R0hrDtUKR4NSnpPmaeIjb26e
LhVntcChhYxQqzy/T4ScTDNVjpSlwi6QMo5DwAwzNGf2nf/v5/CZ+vGYDVdXRFHj
tgH1+8eDpHsi7jJp6E4cmZjiolsUx/ePDDTrQ4qbdDMO7fmIV6YQKFAMTLJepLBY
qnJVqoBq3qn40zv6tVZmKgWiXQ65jEkBItZhEUmcQRBiSbBDPweIdEzx/mwzkX7U
1gShGdut6YP4GX7BnOhkiQmzucS85mgkUfG43+mBfYXb+4zNTEjhhkqhEduz2SQP
xnjHxEM+MTGCj3PozIpJxNKzMTEceYY7cAUdNEMDQcHog7OCnIdGBIc7BPnsN8yA
CPzntwP4mmLfK3weq3PIGC6d9xfc9PpmiR9docxQOvE6sk2Ifeo=
=FKC7
-----END PGP SIGNATURE-----
Merge tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull pipe rework from David Howells:
"This is my set of preparatory patches for building a general
notification queue on top of pipes. It makes a number of significant
changes:
- It removes the nr_exclusive argument from __wake_up_sync_key() as
this is always 1. This prepares for the next step:
- Adds wake_up_interruptible_sync_poll_locked() so that poll can be
woken up from a function that's holding the poll waitqueue
spinlock.
- Change the pipe buffer ring to be managed in terms of unbounded
head and tail indices rather than bounded index and length. This
means that reading the pipe only needs to modify one index, not
two.
- A selection of helper functions are provided to query the state of
the pipe buffer, plus a couple to apply updates to the pipe
indices.
- The pipe ring is allowed to have kernel-reserved slots. This allows
many notification messages to be spliced in by the kernel without
allowing userspace to pin too many pages if it writes to the same
pipe.
- Advance the head and tail indices inside the pipe waitqueue lock
and use wake_up_interruptible_sync_poll_locked() to poke poll
without having to take the lock twice.
- Rearrange pipe_write() to preallocate the buffer it is going to
write into and then drop the spinlock. This allows kernel
notifications to then be added the ring whilst it is filling the
buffer it allocated. The read side is stalled because the pipe
mutex is still held.
- Don't wake up readers on a pipe if there was already data in it
when we added more.
- Don't wake up writers on a pipe if the ring wasn't full before we
removed a buffer"
* tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
pipe: Remove sync on wake_ups
pipe: Increase the writer-wakeup threshold to reduce context-switch count
pipe: Check for ring full inside of the spinlock in pipe_write()
pipe: Remove redundant wakeup from pipe_write()
pipe: Rearrange sequence in pipe_write() to preallocate slot
pipe: Conditionalise wakeup in pipe_read()
pipe: Advance tail pointer inside of wait spinlock in pipe_read()
pipe: Allow pipes to have kernel-reserved slots
pipe: Use head and tail pointers for the ring, not cursor and length
Add wake_up_interruptible_sync_poll_locked()
Remove the nr_exclusive argument from __wake_up_sync_key()
pipe: Reduce #inclusion of pipe_fs_i.h
|
|
|
|
4610ba7ad8 |
exit/exec: Seperate mm_release()
mm_release() contains the futex exit handling. mm_release() is called from do_exit()->exit_mm() and from exec()->exec_mm(). In the exit_mm() case PF_EXITING and the futex state is updated. In the exec_mm() case these states are not touched. As the futex exit code needs further protections against exit races, this needs to be split into two functions. Preparatory only, no functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de |
|
|
|
cf25e24db6 |
time: Rename tsk->real_start_time to ->start_boottime
Since it stores CLOCK_BOOTTIME, not, as the name suggests, CLOCK_REALTIME, let's rename ->real_start_time to ->start_bootime. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
d055b4fb4d |
pipe: Reduce #inclusion of pipe_fs_i.h
Remove some #inclusions of linux/pipe_fs_i.h that don't seem to be necessary any more. Signed-off-by: David Howells <dhowells@redhat.com> |
|
|
|
227a4aadc7 |
sched/membarrier: Fix p->mm->membarrier_state racy load
The membarrier_state field is located within the mm_struct, which is not guaranteed to exist when used from runqueue-lock-free iteration on runqueues by the membarrier system call. Copy the membarrier_state from the mm_struct into the scheduler runqueue when the scheduler switches between mm. When registering membarrier for mm, after setting the registration bit in the mm membarrier state, issue a synchronize_rcu() to ensure the scheduler observes the change. In order to take care of the case where a runqueue keeps executing the target mm without swapping to other mm, iterate over each runqueue and issue an IPI to copy the membarrier_state from the mm_struct into each runqueue which have the same mm which state has just been modified. Move the mm membarrier_state field closer to pgd in mm_struct to use a cache line already touched by the scheduler switch_mm. The membarrier_execve() (now membarrier_exec_mmap) hook now needs to clear the runqueue's membarrier state in addition to clear the mm membarrier state, so move its implementation into the scheduler membarrier code so it can access the runqueue structure. Add memory barrier in membarrier_exec_mmap() prior to clearing the membarrier state, ensuring memory accesses executed prior to exec are not reordered with the stores clearing the membarrier state. As suggested by Linus, move all membarrier.c RCU read-side locks outside of the for each cpu loops. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
16d51a590a |
sched/fair: Don't free p->numa_faults with concurrent readers
When going through execve(), zero out the NUMA fault statistics instead of
freeing them.
During execve, the task is reachable through procfs and the scheduler. A
concurrent /proc/*/sched reader can read data from a freed ->numa_faults
allocation (confirmed by KASAN) and write it back to userspace.
I believe that it would also be possible for a use-after-free read to occur
through a race between a NUMA fault and execve(): task_numa_fault() can
lead to task_numa_compare(), which invokes task_weight() on the currently
running task of a different CPU.
Another way to fix this would be to make ->numa_faults RCU-managed or add
extra locking, but it seems easier to wipe the NUMA fault statistics on
execve.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes:
|
|
|
|
5ad18b2e60 |
Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull force_sig() argument change from Eric Biederman: "A source of error over the years has been that force_sig has taken a task parameter when it is only safe to use force_sig with the current task. The force_sig function is built for delivering synchronous signals such as SIGSEGV where the userspace application caused a synchronous fault (such as a page fault) and the kernel responded with a signal. Because the name force_sig does not make this clear, and because the force_sig takes a task parameter the function force_sig has been abused for sending other kinds of signals over the years. Slowly those have been fixed when the oopses have been tracked down. This set of changes fixes the remaining abusers of force_sig and carefully rips out the task parameter from force_sig and friends making this kind of error almost impossible in the future" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits) signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus signal: Remove the signal number and task parameters from force_sig_info signal: Factor force_sig_info_to_task out of force_sig_info signal: Generate the siginfo in force_sig signal: Move the computation of force into send_signal and correct it. signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal signal: Remove the task parameter from force_sig_fault signal: Use force_sig_fault_to_task for the two calls that don't deliver to current signal: Explicitly call force_sig_fault on current signal/unicore32: Remove tsk parameter from __do_user_fault signal/arm: Remove tsk parameter from __do_user_fault signal/arm: Remove tsk parameter from ptrace_break signal/nds32: Remove tsk parameter from send_sigtrap signal/riscv: Remove tsk parameter from do_trap signal/sh: Remove tsk parameter from force_sig_info_fault signal/um: Remove task parameter from send_sigtrap signal/x86: Remove task parameter from send_sigtrap signal: Remove task parameter from force_sig_mceerr signal: Remove task parameter from force_sig signal: Remove task parameter from force_sigsegv ... |
|
|
|
cb44c9a0ab |
signal: Remove task parameter from force_sigsegv
The function force_sigsegv is always called on the current task so passing in current is redundant and not passing in current makes this fact obvious. This also makes it clear force_sigsegv always calls force_sig on the current task. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
457c899653 |
treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
d53ddd0181 |
fs/exec.c: move ->recursion_depth out of critical sections
->recursion_depth is changed only by current, therefore decrementing can be done without taking any locks. Link: http://lkml.kernel.org/r/20190417213150.GA26474@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
6eb3c3d0a5 |
exec: increase BINPRM_BUF_SIZE to 256
Large enterprise clients often run applications out of networked file systems where the IT mandated layout of project volumes can end up leading to paths that are longer than 128 characters. Bumping this up to the next order of two solves this problem in all but the most egregious case while still fitting into a 512b slab. [oleg@redhat.com: update comment, per Kees] Link: http://lkml.kernel.org/r/20181112160956.GA28472@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Ben Woodard <woodard@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
26e152252e |
fs/exec.c: replace opencoded set_mask_bits()
Link: http://lkml.kernel.org/r/1548275584-18096-2-git-send-email-vgupta@synopsys.com Link: http://lkml.kernel.org/g/20150807115710.GA16897@redhat.com Signed-off-by: Vineet Gupta <vgupta@synopsys.com> Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
45802da05e |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- refcount conversions
- Solve the rq->leaf_cfs_rq_list can of worms for real.
- improve power-aware scheduling
- add sysctl knob for Energy Aware Scheduling
- documentation updates
- misc other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
kthread: Do not use TIMER_IRQSAFE
kthread: Convert worker lock to raw spinlock
sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
sched/fair: Remove unused 'sd' parameter from select_idle_smt()
sched/wait: Use freezable_schedule() when possible
sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
sched/fair: Explain LLC nohz kick condition
sched/fair: Simplify nohz_balancer_kick()
sched/topology: Fix percpu data types in struct sd_data & struct s_data
sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
sched/fair: Fix O(nr_cgroups) in the load balancing path
sched/fair: Optimize update_blocked_averages()
sched/fair: Fix insertion in rq->leaf_cfs_rq_list
sched/fair: Add tmp_alone_branch assertion
sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
sched/fair: Update scale invariance of PELT
sched/fair: Move the rq_of() helper function
sched/core: Convert task_struct.stack_refcount to refcount_t
...
|
|
|
|
f612acfae8 |
exec: Fix mem leak in kernel_read_file
syzkaller report this:
BUG: memory leak
unreferenced object 0xffffc9000488d000 (size 9195520):
comm "syz-executor.0", pid 2752, jiffies 4294787496 (age 18.757s)
hex dump (first 32 bytes):
ff ff ff ff ff ff ff ff a8 00 00 00 01 00 00 00 ................
02 00 00 00 00 00 00 00 80 a1 7a c1 ff ff ff ff ..........z.....
backtrace:
[<000000000863775c>] __vmalloc_node mm/vmalloc.c:1795 [inline]
[<000000000863775c>] __vmalloc_node_flags mm/vmalloc.c:1809 [inline]
[<000000000863775c>] vmalloc+0x8c/0xb0 mm/vmalloc.c:1831
[<000000003f668111>] kernel_read_file+0x58f/0x7d0 fs/exec.c:924
[<000000002385813f>] kernel_read_file_from_fd+0x49/0x80 fs/exec.c:993
[<0000000011953ff1>] __do_sys_finit_module+0x13b/0x2a0 kernel/module.c:3895
[<000000006f58491f>] do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
[<00000000ee78baf4>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[<00000000241f889b>] 0xffffffffffffffff
It should goto 'out_free' lable to free allocated buf while kernel_read
fails.
Fixes:
|
|
|
|
d036bda7d0 |
sched/core: Convert sighand_struct.count to refcount_t
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable sighand_struct.count is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the sighand_struct.count it might make a difference in following places: - __cleanup_sighand: decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
9b286efeb5 |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull trivial vfs updates from Al Viro: "A few cleanups + Neil's namespace_unlock() optimization" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: exec: make prepare_bprm_creds static genheaders: %-<width>s had been there since v6; %-*s - since v7 VFS: use synchronize_rcu_expedited() in namespace_unlock() iov_iter: reduce code duplication |