mirror of https://github.com/torvalds/linux.git
23823 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
3472f639c6 |
mm: remove the non-useful else after a break in a if statement
Remove the else block since there is already a break in the statement of if (iter->oom_lock), just set iter->oom_lock true after the if block ends. Link: https://lkml.kernel.org/r/20241115235744.1419580-4-kerensun@google.com Signed-off-by: Keren Sun <kerensun@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
91478b238e |
mm: remove unnecessary whitespace before a quoted newline
Remove whitespaces before newlines for strings in pr_warn_once() Link: https://lkml.kernel.org/r/20241115235744.1419580-3-kerensun@google.com Signed-off-by: Keren Sun <kerensun@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
55c1d6a401 |
mm: prefer 'unsigned int' to bare use of 'unsigned'
Patch series "mm: fix format issues and param types" Change the param 'mode' from type 'unsigned' to 'unsigned int' in memcg_event_wake() and memcg_oom_wake_function(), and for the param 'nid' in VM_BUG_ON(). Link: https://lkml.kernel.org/r/20241115235744.1419580-2-kerensun@google.com Signed-off-by: Keren Sun <kerensun@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
1168b2bec7 |
filemap: remove unused folio_add_wait_queue
folio_add_wait_queue() has been unused since 2021's commit
|
|
|
|
58d534c8c6 |
mm/rodata_test: verify test data is unchanged, rather than non-zero
Verify that the test variable holds the initialization value, rather than any non-zero value. Link: https://lkml.kernel.org/r/386ffda192eb4a26f68c526c496afd48a5cd87ce.1732016064.git.ptesarik@suse.com Signed-off-by: Petr Tesarik <ptesarik@suse.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: Jinbum Park <jinb.park7@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
db27ad8b02 |
mm/rodata_test: use READ_ONCE() to read const variable
Patch series "Fix mm/rodata_test", v2.
Make sure that the test actually reads the read-only memory location.
Verify that the variable contains the expected value rather than any
non-zero value.
This patch (of 2):
The C compiler may optimize away the memory read of a const variable if
its value is known at compile time.
In particular, GCC14 with -O2 generates no code at all for test 1, and it
generates the following x86_64 instructions for test 3:
cmpl $195, 4(%rsp)
je .L14
That is, it replaces the read of rodata_test_data with an immediate value
and compares it to the value of the local variable "zero".
Use READ_ONCE() to undo any such compiler optimizations and enforce a
memory read.
Link: https://lkml.kernel.org/r/cover.1732016064.git.ptesarik@suse.com
Link: https://lkml.kernel.org/r/2a66dee010151b25cb143efb39091ef7530aa00a.1732016064.git.ptesarik@suse.com
Fixes:
|
|
|
|
d635ccdb43 |
mm: shmem: add a kernel command line to change the default huge policy for tmpfs
Now the tmpfs can allow to allocate any sized large folios, and the default huge policy is still preferred to be 'never'. Due to tmpfs not behaving like other file systems in some cases as previously explained by David[1]: : I think I raised this in the past, but tmpfs/shmem is just like any : other file system .. except it sometimes really isn't and behaves much : more like (swappable) anonymous memory. (or mlocked files) : : There are many systems out there that run without swap enabled, or with : extremely minimal swap (IIRC until recently kubernetes was completely : incompatible with swapping). Swap can even be disabled today for shmem : using a mount option. : : That's a big difference to all other file systems where you are : guaranteed to have backend storage where you can simply evict under : memory pressure (might temporarily fail, of course). : : I *think* that's the reason why we have the "huge=" parameter that also : controls the THP allocations during page faults (IOW possible memory : over-allocation). Maybe also because it was a new feature, and we only : had a single THP size. Thus adding a new command line to change the default huge policy will be helpful to use the large folios for tmpfs, which is similar to the 'transparent_hugepage_shmem' cmdline for shmem. [1] https://lore.kernel.org/all/cbadd5fe-69d5-4c21-8eb8-3344ed36c721@redhat.com/ Link: https://lkml.kernel.org/r/ff390b2656f0d39649547f8f2cbb30fcb7e7be2d.1732779148.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
acd7ccb284 |
mm: shmem: add large folio support for tmpfs
Add large folio support for tmpfs write and fallocate paths matching the same high order preference mechanism used in the iomap buffered IO path as used in __filemap_get_folio(). Add shmem_mapping_size_orders() to get a hint for the orders of the folio based on the file size which takes care of the mapping requirements. Traditionally, tmpfs only supported PMD-sized large folios. However nowadays with other file systems supporting any sized large folios, and extending anonymous to support mTHP, we should not restrict tmpfs to allocating only PMD-sized large folios, making it more special. Instead, we should allow tmpfs can allocate any sized large folios. Considering that tmpfs already has the 'huge=' option to control the PMD-sized large folios allocation, we can extend the 'huge=' option to allow any sized large folios. The semantics of the 'huge=' mount option are: huge=never: no any sized large folios huge=always: any sized large folios huge=within_size: like 'always' but respect the i_size huge=advise: like 'always' if requested with madvise() Note: for tmpfs mmap() faults, due to the lack of a write size hint, still allocate the PMD-sized huge folios if huge=always/within_size/advise is set. Moreover, the 'deny' and 'force' testing options controlled by '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same semantics. The 'deny' can disable any sized large folios for tmpfs, while the 'force' can enable PMD sized large folios for tmpfs. Link: https://lkml.kernel.org/r/035bf55fbdebeff65f5cb2cdb9907b7d632c3228.1732779148.git.baolin.wang@linux.alibaba.com Co-developed-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
736bbc6825 |
mm: shmem: change shmem_huge_global_enabled() to return huge order bitmap
Change the shmem_huge_global_enabled() to return the suitable huge order bitmap, and return 0 if huge pages are not allowed. This is a preparation for supporting various huge orders allocation of tmpfs in the following patches. No functional changes. Link: https://lkml.kernel.org/r/9dce1cfad3e9c1587cf1a0ea782ddbebd0e92984.1732779148.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
d40797d672 |
kasan: make kasan_record_aux_stack_noalloc() the default behaviour
kasan_record_aux_stack_noalloc() was introduced to record a stack trace
without allocating memory in the process. It has been added to callers
which were invoked while a raw_spinlock_t was held. More and more callers
were identified and changed over time. Is it a good thing to have this
while functions try their best to do a locklessly setup? The only
downside of having kasan_record_aux_stack() not allocate any memory is
that we end up without a stacktrace if stackdepot runs out of memory and
at the same stacktrace was not recorded before To quote Marco Elver from
https://lore.kernel.org/all/CANpmjNPmQYJ7pv1N3cuU8cP18u7PP_uoZD8YxwZd4jtbof9nVQ@mail.gmail.com/
| I'd be in favor, it simplifies things. And stack depot should be
| able to replenish its pool sufficiently in the "non-aux" cases
| i.e. regular allocations. Worst case we fail to record some
| aux stacks, but I think that's only really bad if there's a bug
| around one of these allocations. In general the probabilities
| of this being a regression are extremely small [...]
Make the kasan_record_aux_stack_noalloc() behaviour default as
kasan_record_aux_stack().
[bigeasy@linutronix.de: dressed the diff as patch]
Link: https://lkml.kernel.org/r/20241122155451.Mb2pmeyJ@linutronix.de
Fixes:
|
|
|
|
773fc6ab71 |
mm/memory: fix a comment typo in lock_mm_and_find_vma()
s/equivalend/equivalent/ Link: https://lkml.kernel.org/r/20241120105041.2394283-1-yikming2222@gmail.com Signed-off-by: Chin Yik Ming <yikming2222@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
afeac03c48 |
mm: change type of cma_area_count to unsigned int
Prefer 'unsigned int' over plain 'unsigned'. Also make it consistent with mm/cma.c Link: https://lkml.kernel.org/r/tencent_1E5E3AA25C261196D8C1F7097F130E382008@qq.com Signed-off-by: Jiale Yang <295107659@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
3658cb16e2 |
mm/hugetlb_cgroup: avoid useless return in void function
The return statement at the end of void function is unnecessary. Just remove it as part of cleanup. Link: https://lkml.kernel.org/r/20241122173558.20670-1-quic_pintu@quicinc.com Signed-off-by: Pintu Kumar <quic_pintu@quicinc.com> Cc: Pintu Agarwal <pintu.ping@gmail.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
9023691d75 |
mm: mmap_lock: optimize mmap_lock tracepoints
We are starting to deploy mmap_lock tracepoint monitoring across our fleet and the early results showed that these tracepoints are consuming significant amount of CPUs in kernfs_path_from_node when enabled. It seems like the kernel is trying to resolve the cgroup path in the fast path of the locking code path when the tracepoints are enabled. In addition for some application their metrics are regressing when monitoring is enabled. The cgroup path resolution can be slow and should not be done in the fast path. Most userspace tools, like bpftrace, provides functionality to get the cgroup path from cgroup id, so let's just trace the cgroup id and the users can use better tools to get the path in the slow path. Link: https://lkml.kernel.org/r/20241125171617.113892-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
6653995262 |
mm/damon/core: remove duplicate list_empty quota->goals check
damos_set_effective_quota() checks quota contidions but there are some duplicate checks for quota->goals inside. This patch reduces one of if statement to simplify the esz calculation logic by setting esz as ULONG_MAX by default. Link: https://lkml.kernel.org/r/20241125184307.41746-1-sj@kernel.org Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
9aec2fb0fd |
slab: allocate frozen pages
Since slab does not use the page refcount, it can allocate and free frozen pages, saving one atomic operation per free. Link: https://lkml.kernel.org/r/20241125210149.2976098-16-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
642975242e |
mm/mempolicy: add alloc_frozen_pages()
Provide an interface to allocate pages from the page allocator without incrementing their refcount. This saves an atomic operation on free, which may be beneficial to some users (eg slab). Link: https://lkml.kernel.org/r/20241125210149.2976098-15-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
49249a2a5e |
mm/page_alloc: add __alloc_frozen_pages()
Defer the initialisation of the page refcount to the new __alloc_pages() wrapper and turn the old __alloc_pages() into __alloc_frozen_pages(). Link: https://lkml.kernel.org/r/20241125210149.2976098-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
c972106db3 |
mm/page_alloc: move set_page_refcounted() to end of __alloc_pages()
Remove some code duplication by calling set_page_refcounted() at the end of __alloc_pages() instead of after each call that can allocate a page. That means that we free a frozen page if we've exceeded the allowed memcg memory. Link: https://lkml.kernel.org/r/20241125210149.2976098-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
a88de400e3 |
mm/page_alloc: move set_page_refcounted() to callers of __alloc_pages_slowpath()
In preparation for allocating frozen pages, stop initialising the page refcount in __alloc_pages_slowpath(). Link: https://lkml.kernel.org/r/20241125210149.2976098-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
30fdb6df4c |
mm/page_alloc: move set_page_refcounted() to callers of __alloc_pages_direct_reclaim()
In preparation for allocating frozen pages, stop initialising the page refcount in __alloc_pages_direct_reclaim(). Link: https://lkml.kernel.org/r/20241125210149.2976098-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
8e4c8a9702 |
mm/page_alloc: move set_page_refcounted() to callers of __alloc_pages_direct_compact()
In preparation for allocating frozen pages, stop initialising the page refcount in __alloc_pages_direct_compact(). Link: https://lkml.kernel.org/r/20241125210149.2976098-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
df544c5eef |
mm/page_alloc: move set_page_refcounted() to callers of __alloc_pages_may_oom()
In preparation for allocating frozen pages, stop initialising the page refcount in __alloc_pages_may_oom(). Link: https://lkml.kernel.org/r/20241125210149.2976098-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
4c9017cc4c |
mm/page_alloc: move set_page_refcounted() to callers of __alloc_pages_cpuset_fallback()
In preparation for allocating frozen pages, stop initialising the page refcount in __alloc_pages_cpuset_fallback(). Link: https://lkml.kernel.org/r/20241125210149.2976098-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
efabfe1420 |
mm/page_alloc: move set_page_refcounted() to callers of get_page_from_freelist()
In preparation for allocating frozen pages, stop initialising the page refcount in get_page_from_freelist(). Link: https://lkml.kernel.org/r/20241125210149.2976098-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
ee66e9c34f |
mm/page_alloc: move set_page_refcounted() to callers of prep_new_page()
In preparation for allocating frozen pages, stop initialising the page refcount in prep_new_page(). Link: https://lkml.kernel.org/r/20241125210149.2976098-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
8fd10a892a |
mm/page_alloc: move set_page_refcounted() to callers of post_alloc_hook()
In preparation for allocating frozen pages, stop initialising the page refcount in post_alloc_hook(). Link: https://lkml.kernel.org/r/20241125210149.2976098-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
520128a1d1 |
mm/page_alloc: export free_frozen_pages() instead of free_unref_page()
We already have the concept of "frozen pages" (eg page_ref_freeze()), so let's not complicate things by also having the concept of "unref pages". Link: https://lkml.kernel.org/r/20241125210149.2976098-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
38558b2460 |
mm: make alloc_pages_mpol() static
All callers outside mempolicy.c now use folio_alloc_mpol() thanks to Kefeng's cleanups, so we can remove this as a visible symbol. And also remove the alloc_hooks for alloc_pages_mpol(), since all users in mempolicy.c are using the nonprof version. Link: https://lkml.kernel.org/r/20241125210149.2976098-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
d4056386ae |
mm/page_alloc: cache page_zone() result in free_unref_page()
Patch series "Allocate and free frozen pages", v3. Slab does not need to use the page refcount at all, and it can avoid an atomic operation on page free. Hugetlb wants to delay setting the refcount until it has assembled a complete gigantic page. We already have the ability to freeze a page (safely reduce its reference count to 0), so this patchset adds APIs to allocate and free pages which are in a frozen state. This patchset is also a step towards the Glorious Future in which struct page doesn't have a refcount; the users which need a refcount will have one in their per-allocation memdesc. This patch (of 15): Save 17 bytes of text by calculating page_zone() once instead of twice. Link: https://lkml.kernel.org/r/20241125210149.2976098-1-willy@infradead.org Link: https://lkml.kernel.org/r/20241125210149.2976098-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
bfc1d17829 |
mm: migrate: remove unused argument vma from migrate_misplaced_folio()
Commit
|
|
|
|
a882dd92de |
mm/zswap: add LRU_STOP to comment about dropping the lru lock
This function has been able to return LRU_STOP since commit
|
|
|
|
2da76e9e12 |
mm/slab: fix kernel-doc func param names
Use corrected function parameter names to eliminate kernel-doc warnings: slab.h:142: warning: Function parameter or struct member 's' not described in 'slab_folio' slab.h:142: warning: Excess function parameter 'slab' description in 'slab_folio' slab.h:168: warning: Function parameter or struct member 's' not described in 'slab_page' slab.h:168: warning: Excess function parameter 'slab' description in 'slab_page' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
|
|
|
84c398ce2a |
mm: kmemleak: convert timeouts to secs_to_jiffies()
Commit
|
|
|
|
1c47c57818 |
mm: fix assertion in folio_end_read()
We only need to assert that the uptodate flag is clear if we're going to set it. This hasn't been a problem before now because we have only used folio_end_read() when completing with an error, but it's convenient to use it in squashfs if we discover the folio is already uptodate. Link: https://lkml.kernel.org/r/20250110163300.3346321-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
bd3d56ffa2 |
mm: vmscan : pgdemote vmstat is not getting updated when MGLRU is enabled.
When MGLRU is enabled, the pgdemote_kswapd, pgdemote_direct, and pgdemote_khugepaged stats in vmstat are not being updated. Commit |
|
|
|
9fd8fcf171 |
vmstat: disable vmstat_work on vmstat_cpu_down_prep()
The upstream commit |
|
|
|
0cef0bb836 |
mm: clear uffd-wp PTE/PMD state on mremap()
When mremap()ing a memory region previously registered with userfaultfd as
write-protected but without UFFD_FEATURE_EVENT_REMAP, an inconsistency in
flag clearing leads to a mismatch between the vma flags (which have
uffd-wp cleared) and the pte/pmd flags (which do not have uffd-wp
cleared). This mismatch causes a subsequent mprotect(PROT_WRITE) to
trigger a warning in page_table_check_pte_flags() due to setting the pte
to writable while uffd-wp is still set.
Fix this by always explicitly clearing the uffd-wp pte/pmd flags on any
such mremap() so that the values are consistent with the existing clearing
of VM_UFFD_WP. Be careful to clear the logical flag regardless of its
physical form; a PTE bit, a swap PTE bit, or a PTE marker. Cover PTE,
huge PMD and hugetlb paths.
Link: https://lkml.kernel.org/r/20250107144755.1871363-2-ryan.roberts@arm.com
Co-developed-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Closes: https://lore.kernel.org/linux-mm/810b44a8-d2ae-4107-b665-5a42eae2d948@arm.com/
Fixes:
|
|
|
|
12dcb0ef54 |
mm: zswap: properly synchronize freeing resources during CPU hotunplug
In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the current CPU at the beginning of the operation is retrieved and used throughout. However, since neither preemption nor migration are disabled, it is possible that the operation continues on a different CPU. If the original CPU is hotunplugged while the acomp_ctx is still in use, we run into a UAF bug as some of the resources attached to the acomp_ctx are freed during hotunplug in zswap_cpu_comp_dead() (i.e. acomp_ctx.buffer, acomp_ctx.req, or acomp_ctx.acomp). The problem was introduced in commit |
|
|
|
4dff389c9f |
Revert "mm: zswap: fix race between [de]compression and CPU hotunplug"
This reverts commit |
|
|
|
4ce718f397 |
mm: fix div by zero in bdi_ratio_from_pages
During testing it has been detected, that it is possible to get div by zero error in bdi_set_min_bytes. The error is caused by the function bdi_ratio_from_pages(). bdi_ratio_from_pages() calls global_dirty_limits. If the dirty threshold is 0, the div by zero is raised. This can happen if the root user is setting: echo 0 > /proc/sys/vm/dirty_ratio The following is a test case: echo 0 > /proc/sys/vm/dirty_ratio cd /sys/class/bdi/<device> echo 1 > strict_limit echo 8192 > min_bytes ==> error is raised. The problem is addressed by returning -EINVAL if dirty_ratio or dirty_bytes is set to 0. [shr@devkernel.io: check for -EINVAL in bdi_set_min_bytes() and bdi_set_max_bytes()] Link: https://lkml.kernel.org/r/20250108014723.166637-1-shr@devkernel.io [shr@devkernel.io: v3] Link: https://lkml.kernel.org/r/20250109063411.6591-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20250104012037.159386-1-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Reported-by: cheung wall <zzqq0103.hey@gmail.com> Closes: https://lore.kernel.org/linux-mm/87pll35yd0.fsf@devkernel.io/T/#t Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Qiang Zhang <zzqq0103.hey@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
f505e6c91e |
filemap: avoid truncating 64-bit offset to 32 bits
On 32-bit kernels, folio_seek_hole_data() was inadvertently truncating a
64-bit value to 32 bits, leading to a possible infinite loop when writing
to an xfs filesystem.
Link: https://lkml.kernel.org/r/20250102190540.1356838-1-marco.nelissen@gmail.com
Fixes:
|
|
|
|
264a88cafd |
mm/mempolicy: count MPOL_WEIGHTED_INTERLEAVE to "interleave_hit"
Commit |
|
|
|
76d5d4c53e |
mm/kmemleak: fix percpu memory leak detection failure
kmemleak_alloc_percpu gives an incorrect min_count parameter, causing
percpu memory to be considered a gray object.
Link: https://lkml.kernel.org/r/20241227092311.3572500-1-guoweikang.kernel@gmail.com
Fixes:
|
|
|
|
bbe658d658 |
mm/slab: Move kvfree_rcu() into SLAB
Move kvfree_rcu() functionality to the slab_common.c file. The reason to have kvfree_rcu() functionality as part of SLAB is that there is a clear trend and need of closer integration. One of the recent example is creating a barrier function for SLAB caches. Another reason is to prevent of having several implementations of RCU machinery for reclaiming objects after a GP. As future steps, it can be more integrated(easier) with SLAB internals. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Tested-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
|
|
|
c6a566f6c1 |
mm: Create/affine kswapd to its preferred node
kswapd is dedicated to a specific node. As such it wants to be preferrably affine to it, memory and CPUs-wise. Use the proper kthread API to achieve that. As a bonus it takes care of CPU-hotplug events and CPU-isolation on its behalf. Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> |
|
|
|
54880b5a2b |
mm: Create/affine kcompactd to its preferred node
Kcompactd is dedicated to a specific node. As such it wants to be preferrably affine to it, memory and CPUs-wise. Use the proper kthread API to achieve that. As a bonus it takes care of CPU-hotplug events and CPU-isolation on its behalf. Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> |
|
|
|
b168ed458d
|
kernel/cgroup: Add "dmem" memory accounting cgroup
This code is based on the RDMA and misc cgroup initially, but now uses page_counter. It uses the same min/low/max semantics as the memory cgroup as a result. There's a small mismatch as TTM uses u64, and page_counter long pages. In practice it's not a problem. 32-bits systems don't really come with >=4GB cards and as long as we're consistently wrong with units, it's fine. The device page size may not be in the same units as kernel page size, and each region might also have a different page size (VRAM vs GART for example). The interface is simple: - Call dmem_cgroup_register_region() - Use dmem_cgroup_try_charge to check if you can allocate a chunk of memory, use dmem_cgroup__uncharge when freeing it. This may return an error code, or -EAGAIN when the cgroup limit is reached. In that case a reference to the limiting pool is returned. - The limiting cs can be used as compare function for dmem_cgroup_state_evict_valuable. - After having evicted enough, drop reference to limiting cs with dmem_cgroup_pool_state_put. This API allows you to limit device resources with cgroups. You can see the supported cards in /sys/fs/cgroup/dmem.capacity You need to echo +dmem to cgroup.subtree_control, and then you can partition device memory. Co-developed-by: Friedrich Vock <friedrich.vock@gmx.de> Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de> Co-developed-by: Maxime Ripard <mripard@kernel.org> Signed-off-by: Maarten Lankhorst <dev@lankhorst.se> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20241204143112.1250983-1-dev@lankhorst.se Signed-off-by: Maxime Ripard <mripard@kernel.org> |
|
|
|
cd6313beae |
Revert "vmstat: disable vmstat_work on vmstat_cpu_down_prep()"
This reverts commit
|
|
|
|
d7bde4f27c
|
Revert "libfs: Add simple_offset_empty()"
simple_empty() and simple_offset_empty() perform the same task. The latter's use as a canary to find bugs has not found any new issues. A subsequent patch will remove the use of the mtree for iterating directory contents, so revert back to using a similar mechanism for determining whether a directory is indeed empty. Only one such mechanism is ever needed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20241228175522.1854234-3-cel@kernel.org Reviewed-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
dd2a5b5514 |
mm/util: make memdup_user_nul() similar to memdup_user()
Since the string data to copy from userspace is likely less than PAGE_SIZE bytes, replace GFP_KERNEL with GFP_USER like commit |
|
|
|
62e72d2cf7 |
mm, madvise: fix potential workingset node list_lru leaks
Since commit |
|
|
|
7d390b5306 |
mm/damon/core: fix ignored quota goals and filters of newly committed schemes
damon_commit_schemes() ignores quota goals and filters of the newly
committed schemes. This makes users confused about the behaviors.
Correctly handle those inputs.
Link: https://lkml.kernel.org/r/20241222231222.85060-3-sj@kernel.org
Fixes:
|
|
|
|
8debfc5b1a |
mm/damon/core: fix new damon_target objects leaks on damon_commit_targets()
Patch series "mm/damon/core: fix memory leaks and ignored inputs from
damon_commit_ctx()".
Due to two bugs in damon_commit_targets() and damon_commit_schemes(),
which are called from damon_commit_ctx(), some user inputs can be ignored,
and some mmeory objects can be leaked. Fix those.
Note that only DAMON sysfs interface users are affected. Other DAMON core
API user modules that more focused more on simple and dedicated production
usages, including DAMON_RECLAIM and DAMON_LRU_SORT are not using the buggy
function in the way, so not affected.
This patch (of 2):
When new DAMON targets are added via damon_commit_targets(), the newly
created targets are not deallocated when updating the internal data
(damon_commit_target()) is failed. Worse yet, even if the setup is
successfully done, the new target is not linked to the context. Hence,
the new targets are always leaked regardless of the internal data setup
failure. Fix the leaks.
Link: https://lkml.kernel.org/r/20241222231222.85060-2-sj@kernel.org
Fixes:
|
|
|
|
98a6abc6ce |
mm/list_lru: fix false warning of negative counter
commit |
|
|
|
adcfb264c3 |
vmstat: disable vmstat_work on vmstat_cpu_down_prep()
Even after mm/vmstat:online teardown, shepherd may still queue work for
the dying cpu until the cpu is removed from online mask. While it's quite
rare, this means that after unbind_workers() unbinds a per-cpu kworker, it
potentially runs vmstat_update for the dying CPU on an irrelevant cpu
before entering atomic AP states. When CONFIG_DEBUG_PREEMPT=y, it results
in the following error with the backtrace.
BUG: using smp_processor_id() in preemptible [00000000] code: \
kworker/7:3/1702
caller is refresh_cpu_vm_stats+0x235/0x5f0
CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G
Tainted: [N]=TEST
Workqueue: mm_percpu_wq vmstat_update
Call Trace:
<TASK>
dump_stack_lvl+0x8d/0xb0
check_preemption_disabled+0xce/0xe0
refresh_cpu_vm_stats+0x235/0x5f0
vmstat_update+0x17/0xa0
process_one_work+0x869/0x1aa0
worker_thread+0x5e5/0x1100
kthread+0x29e/0x380
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x1a/0x30
</TASK>
So, for mm/vmstat:online, disable vmstat_work reliably on teardown and
symmetrically enable it on startup.
Link: https://lkml.kernel.org/r/20241221033321.4154409-1-koichiro.den@canonical.com
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
d77b90d2b2 |
mm: shmem: fix the update of 'shmem_falloc->nr_unswapped'
The 'shmem_falloc->nr_unswapped' is used to record how many writepage
refused to swap out because fallocate() is allocating, but after shmem
supports large folio swap out, the update of 'shmem_falloc->nr_unswapped'
does not use the correct number of pages in the large folio, which may
lead to fallocate() not exiting as soon as possible.
Anyway, this is found through code inspection, and I am not sure whether
it would actually cause serious issues.
Link: https://lkml.kernel.org/r/f66a0119d0564c2c37c84f045835b870d1b2196f.1734593154.git.baolin.wang@linux.alibaba.com
Fixes:
|
|
|
|
d0e6983a6d |
mm: shmem: fix incorrect index alignment for within_size policy
With enabling the shmem per-size within_size policy, using an incorrect
'order' size to round_up() the index can lead to incorrect i_size checks,
resulting in an inappropriate large orders being returned.
Changing to use '1 << order' to round_up() the index to fix this issue.
Additionally, adding an 'aligned_index' variable to avoid affecting the
index checks.
Link: https://lkml.kernel.org/r/77d8ef76a7d3d646e9225e9af88a76549a68aab1.1734593154.git.baolin.wang@linux.alibaba.com
Fixes:
|
|
|
|
eaebeb9392 |
mm: zswap: fix race between [de]compression and CPU hotunplug
In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the current CPU at the beginning of the operation is retrieved and used throughout. However, since neither preemption nor migration are disabled, it is possible that the operation continues on a different CPU. If the original CPU is hotunplugged while the acomp_ctx is still in use, we run into a UAF bug as the resources attached to the acomp_ctx are freed during hotunplug in zswap_cpu_comp_dead(). The problem was introduced in commit |
|
|
|
cddc76b165 |
mm/kmemleak: fix sleeping function called from invalid context at print message
Address a bug in the kernel that triggers a "sleeping function called from
invalid context" warning when /sys/kernel/debug/kmemleak is printed under
specific conditions:
- CONFIG_PREEMPT_RT=y
- Set SELinux as the LSM for the system
- Set kptr_restrict to 1
- kmemleak buffer contains at least one item
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 136, name: cat
preempt_count: 1, expected: 0
RCU nest depth: 2, expected: 2
6 locks held by cat/136:
#0: ffff32e64bcbf950 (&p->lock){+.+.}-{3:3}, at: seq_read_iter+0xb8/0xe30
#1: ffffafe6aaa9dea0 (scan_mutex){+.+.}-{3:3}, at: kmemleak_seq_start+0x34/0x128
#3: ffff32e6546b1cd0 (&object->lock){....}-{2:2}, at: kmemleak_seq_show+0x3c/0x1e0
#4: ffffafe6aa8d8560 (rcu_read_lock){....}-{1:2}, at: has_ns_capability_noaudit+0x8/0x1b0
#5: ffffafe6aabbc0f8 (notif_lock){+.+.}-{2:2}, at: avc_compute_av+0xc4/0x3d0
irq event stamp: 136660
hardirqs last enabled at (136659): [<ffffafe6a80fd7a0>] _raw_spin_unlock_irqrestore+0xa8/0xd8
hardirqs last disabled at (136660): [<ffffafe6a80fd85c>] _raw_spin_lock_irqsave+0x8c/0xb0
softirqs last enabled at (0): [<ffffafe6a5d50b28>] copy_process+0x11d8/0x3df8
softirqs last disabled at (0): [<0000000000000000>] 0x0
Preemption disabled at:
[<ffffafe6a6598a4c>] kmemleak_seq_show+0x3c/0x1e0
CPU: 1 UID: 0 PID: 136 Comm: cat Tainted: G E 6.11.0-rt7+ #34
Tainted: [E]=UNSIGNED_MODULE
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0xa0/0x128
show_stack+0x1c/0x30
dump_stack_lvl+0xe8/0x198
dump_stack+0x18/0x20
rt_spin_lock+0x8c/0x1a8
avc_perm_nonode+0xa0/0x150
cred_has_capability.isra.0+0x118/0x218
selinux_capable+0x50/0x80
security_capable+0x7c/0xd0
has_ns_capability_noaudit+0x94/0x1b0
has_capability_noaudit+0x20/0x30
restricted_pointer+0x21c/0x4b0
pointer+0x298/0x760
vsnprintf+0x330/0xf70
seq_printf+0x178/0x218
print_unreferenced+0x1a4/0x2d0
kmemleak_seq_show+0xd0/0x1e0
seq_read_iter+0x354/0xe30
seq_read+0x250/0x378
full_proxy_read+0xd8/0x148
vfs_read+0x190/0x918
ksys_read+0xf0/0x1e0
__arm64_sys_read+0x70/0xa8
invoke_syscall.constprop.0+0xd4/0x1d8
el0_svc+0x50/0x158
el0t_64_sync+0x17c/0x180
%pS and %pK, in the same back trace line, are redundant, and %pS can void
%pK service in certain contexts.
%pS alone already provides the necessary information, and if it cannot
resolve the symbol, it falls back to printing the raw address voiding
the original intent behind the %pK.
Additionally, %pK requires a privilege check CAP_SYSLOG enforced through
the LSM, which can trigger a "sleeping function called from invalid
context" warning under RT_PREEMPT kernels when the check occurs in an
atomic context. This issue may also affect other LSMs.
This change avoids the unnecessary privilege check and resolves the
sleeping function warning without any loss of information.
Link: https://lkml.kernel.org/r/20241217142032.55793-1-acarmina@redhat.com
Fixes:
|
|
|
|
59d9094df3 |
mm: hugetlb: independent PMD page table shared count
The folio refcount may be increased unexpectly through try_get_folio() by
caller such as split_huge_pages. In huge_pmd_unshare(), we use refcount
to check whether a pmd page table is shared. The check is incorrect if
the refcount is increased by the above caller, and this can cause the page
table leaked:
BUG: Bad page state in process sh pfn:109324
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x66 pfn:0x109324
flags: 0x17ffff800000000(node=0|zone=2|lastcpupid=0xfffff)
page_type: f2(table)
raw: 017ffff800000000 0000000000000000 0000000000000000 0000000000000000
raw: 0000000000000066 0000000000000000 00000000f2000000 0000000000000000
page dumped because: nonzero mapcount
...
CPU: 31 UID: 0 PID: 7515 Comm: sh Kdump: loaded Tainted: G B 6.13.0-rc2master+ #7
Tainted: [B]=BAD_PAGE
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
show_stack+0x20/0x38 (C)
dump_stack_lvl+0x80/0xf8
dump_stack+0x18/0x28
bad_page+0x8c/0x130
free_page_is_bad_report+0xa4/0xb0
free_unref_page+0x3cc/0x620
__folio_put+0xf4/0x158
split_huge_pages_all+0x1e0/0x3e8
split_huge_pages_write+0x25c/0x2d8
full_proxy_write+0x64/0xd8
vfs_write+0xcc/0x280
ksys_write+0x70/0x110
__arm64_sys_write+0x24/0x38
invoke_syscall+0x50/0x120
el0_svc_common.constprop.0+0xc8/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x34/0x128
el0t_64_sync_handler+0xc8/0xd0
el0t_64_sync+0x190/0x198
The issue may be triggered by damon, offline_page, page_idle, etc, which
will increase the refcount of page table.
1. The page table itself will be discarded after reporting the
"nonzero mapcount".
2. The HugeTLB page mapped by the page table miss freeing since we
treat the page table as shared and a shared page table will not be
unmapped.
Fix it by introducing independent PMD page table shared count. As
described by comment, pt_index/pt_mm/pt_frag_refcount are used for s390
gmap, x86 pgds and powerpc, pt_share_count is used for x86/arm64/riscv
pmds, so we can reuse the field as pt_share_count.
Link: https://lkml.kernel.org/r/20241216071147.3984217-1-liushixin2@huawei.com
Fixes:
|
|
|
|
158cdce87c |
mm/readahead: fix large folio support in async readahead
When testing large folio support with XFS on our servers, we observed that
only a few large folios are mapped when reading large files via mmap.
After a thorough analysis, I identified it was caused by the
`/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
parameter is set to 128KB. After I tune it to 2MB, the large folio can
work as expected. However, I believe the large folio behavior should not
be dependent on the value of read_ahead_kb. It would be more robust if
the kernel can automatically adopt to it.
With /sys/block/*/queue/read_ahead_kb set to 128KB and performing a
sequential read on a 1GB file using MADV_HUGEPAGE, the differences in
/proc/meminfo are as follows:
- before this patch
FileHugePages: 18432 kB
FilePmdMapped: 4096 kB
- after this patch
FileHugePages: 1067008 kB
FilePmdMapped: 1048576 kB
This shows that after applying the patch, the entire 1GB file is mapped to
huge pages. The stable list is CCed, as without this patch, large folios
don't function optimally in the readahead path.
It's worth noting that if read_ahead_kb is set to a larger value that
isn't aligned with huge page sizes (e.g., 4MB + 128KB), it may still fail
to map to hugepages.
Link: https://lkml.kernel.org/r/20241108141710.9721-1-laoar.shao@gmail.com
Link: https://lkml.kernel.org/r/20241206083025.3478-1-laoar.shao@gmail.com
Fixes:
|
|
|
|
34d7cf637c |
mm: don't try THP alignment for FS without get_unmapped_area
Commit |
|
|
|
6aaced5abd |
mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()
The task sometimes continues looping in throttle_direct_reclaim() because
allow_direct_reclaim(pgdat) keeps returning false.
#0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac
#1 [ffff80002cb6f900] __schedule at ffff800008abbd1c
#2 [ffff80002cb6f990] schedule at ffff800008abc50c
#3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550
#4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68
#5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660
#6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98
#7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8
#8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974
#9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4
At this point, the pgdat contains the following two zones:
NODE: 4 ZONE: 0 ADDR: ffff00817fffe540 NAME: "DMA32"
SIZE: 20480 MIN/LOW/HIGH: 11/28/45
VM_STAT:
NR_FREE_PAGES: 359
NR_ZONE_INACTIVE_ANON: 18813
NR_ZONE_ACTIVE_ANON: 0
NR_ZONE_INACTIVE_FILE: 50
NR_ZONE_ACTIVE_FILE: 0
NR_ZONE_UNEVICTABLE: 0
NR_ZONE_WRITE_PENDING: 0
NR_MLOCK: 0
NR_BOUNCE: 0
NR_ZSPAGES: 0
NR_FREE_CMA_PAGES: 0
NODE: 4 ZONE: 1 ADDR: ffff00817fffec00 NAME: "Normal"
SIZE: 8454144 PRESENT: 98304 MIN/LOW/HIGH: 68/166/264
VM_STAT:
NR_FREE_PAGES: 146
NR_ZONE_INACTIVE_ANON: 94668
NR_ZONE_ACTIVE_ANON: 3
NR_ZONE_INACTIVE_FILE: 735
NR_ZONE_ACTIVE_FILE: 78
NR_ZONE_UNEVICTABLE: 0
NR_ZONE_WRITE_PENDING: 0
NR_MLOCK: 0
NR_BOUNCE: 0
NR_ZSPAGES: 0
NR_FREE_CMA_PAGES: 0
In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of
inactive/active file-backed pages calculated in zone_reclaimable_pages()
based on the result of zone_page_state_snapshot() is zero.
Additionally, since this system lacks swap, the calculation of inactive/
active anonymous pages is skipped.
crash> p nr_swap_pages
nr_swap_pages = $1937 = {
counter = 0
}
As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to
the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having
free pages significantly exceeding the high watermark.
The problem is that the pgdat->kswapd_failures hasn't been incremented.
crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures
$1935 = 0x0
This is because the node deemed balanced. The node balancing logic in
balance_pgdat() evaluates all zones collectively. If one or more zones
(e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the
entire node is deemed balanced. This causes balance_pgdat() to exit early
before incrementing the kswapd_failures, as it considers the overall
memory state acceptable, even though some zones (like ZONE_NORMAL) remain
under significant pressure.
The patch ensures that zone_reclaimable_pages() includes free pages
(NR_FREE_PAGES) in its calculation when no other reclaimable pages are
available (e.g., file-backed or anonymous pages). This change prevents
zones like ZONE_DMA32, which have sufficient free pages, from being
mistakenly deemed unreclaimable. By doing so, the patch ensures proper
node balancing, avoids masking pressure on other zones like ZONE_NORMAL,
and prevents infinite loops in throttle_direct_reclaim() caused by
allow_direct_reclaim(pgdat) repeatedly returning false.
The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused
by a node being incorrectly deemed balanced despite pressure in certain
zones, such as ZONE_NORMAL. This issue arises from
zone_reclaimable_pages() returning 0 for zones without reclaimable file-
backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient
free pages to be skipped.
The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored
during reclaim, masking pressure in other zones. Consequently,
pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback
mechanisms in allow_direct_reclaim() from being triggered, leading to an
infinite loop in throttle_direct_reclaim().
This patch modifies zone_reclaimable_pages() to account for free pages
(NR_FREE_PAGES) when no other reclaimable pages exist. This ensures zones
with sufficient free pages are not skipped, enabling proper balancing and
reclaim behavior.
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20241130164346.436469-1-snishika@redhat.com
Link: https://lkml.kernel.org/r/20241130161236.433747-2-snishika@redhat.com
Fixes:
|
|
|
|
8ec396d05d |
mm: reinstate ability to map write-sealed memfd mappings read-only
Patch series "mm: reinstate ability to map write-sealed memfd mappings read-only". In commit |
|
|
|
657e726e0c
|
tmpfs: use inode_set_cached_link()
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20241120112037.822078-4-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
d3ac65d274 |
mm: huge_memory: handle strsep not finding delimiter
split_huge_pages_write() does not handle the case where strsep finds no delimiter in the given string and sets the input buffer to NULL, which allows this reproducer to trigger a protection fault. Link: https://lkml.kernel.org/r/20241216042752.257090-2-leocstone@gmail.com Signed-off-by: Leo Stone <leocstone@gmail.com> Reported-by: syzbot+8a3da2f1bbf59227c289@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=8a3da2f1bbf59227c289 Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
42b2eb6983 |
mm: convert partially_mapped set/clear operations to be atomic
Other page flags in the 2nd page, like PG_hwpoison and PG_anon_exclusive
can get modified concurrently. Changes to other page flags might be lost
if they are happening at the same time as non-atomic partially_mapped
operations. Hence, make partially_mapped operations atomic.
Link: https://lkml.kernel.org/r/20241212183351.1345389-1-usamaarif642@gmail.com
Fixes:
|
|
|
|
a2e740e216 |
vmalloc: fix accounting with i915
If the caller of vmap() specifies VM_MAP_PUT_PAGES (currently only the
i915 driver), we will decrement nr_vmalloc_pages and MEMCG_VMALLOC in
vfree(). These counters are incremented by vmalloc() but not by vmap() so
this will cause an underflow. Check the VM_MAP_PUT_PAGES flag before
decrementing either counter.
Link: https://lkml.kernel.org/r/20241211202538.168311-1-willy@infradead.org
Fixes:
|
|
|
|
faeec8e23c |
mm/page_alloc: don't call pfn_to_page() on possibly non-existent PFN in split_large_buddy()
In split_large_buddy(), we might call pfn_to_page() on a PFN that might
not exist. In corner cases, such as when freeing the highest pageblock in
the last memory section, this could result with CONFIG_SPARSEMEM &&
!CONFIG_SPARSEMEM_EXTREME in __pfn_to_section() returning NULL and and
__section_mem_map_addr() dereferencing that NULL pointer.
Let's fix it, and avoid doing a pfn_to_page() call for the first
iteration, where we already have the page.
So far this was found by code inspection, but let's just CC stable as the
fix is easy.
Link: https://lkml.kernel.org/r/20241210093437.174413-1-david@redhat.com
Fixes:
|
|
|
|
c51a4f11e6 |
mm: use clear_user_(high)page() for arch with special user folio handling
Some architectures have special handling after clearing user folios:
architectures, which set cpu_dcache_is_aliasing() to true, require
flushing dcache; arc, which sets cpu_icache_is_aliasing() to true, changes
folio->flags to make icache coherent to dcache. So __GFP_ZERO using only
clear_page() is not enough to zero user folios and clear_user_(high)page()
must be used. Otherwise, user data will be corrupted.
Fix it by always clearing user folios with clear_user_(high)page() when
cpu_dcache_is_aliasing() is true or cpu_icache_is_aliasing() is true.
Rename alloc_zeroed() to user_alloc_needs_zeroing() and invert the logic
to clarify its intend.
Link: https://lkml.kernel.org/r/20241209182326.2955963-2-ziy@nvidia.com
Fixes:
|
|
|
|
31c5629920 |
mm: add RCU annotation to pte_offset_map(_lock)
RCU lock is taken by ___pte_offset_map() unless it returns NULL. Add this information to its inline callers to avoid sparse warning about context imbalance in pte_unmap(). Link: https://lkml.kernel.org/r/20241210000604.700710-1-oss@malat.biz Signed-off-by: Petr Malat <oss@malat.biz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
42c4e4b20d |
mm: correctly reference merged VMA
On second merge attempt on mmap() we incorrectly discard the possibly
merged VMA, resulting in a possible use-after-free (and most certainly a
reference to the wrong VMA) in this instance in the subsequent
__mmap_complete() invocation.
Correct this mistake by reassigning vma correctly if a merge succeeds in
this case.
Link: https://lkml.kernel.org/r/20241206215229.244413-1-lorenzo.stoakes@oracle.com
Fixes:
|
|
|
|
f5d09de9f1 |
mm: use aligned address in copy_user_gigantic_page()
In current kernel, hugetlb_wp() calls copy_user_large_folio() with the
fault address. Where the fault address may be not aligned with the huge
page size. Then, copy_user_large_folio() may call
copy_user_gigantic_page() with the address, while
copy_user_gigantic_page() requires the address to be huge page size
aligned. So, this may cause memory corruption or information leak,
addtional, use more obvious naming 'addr_hint' instead of 'addr' for
copy_user_gigantic_page().
Link: https://lkml.kernel.org/r/20241028145656.932941-2-wangkefeng.wang@huawei.com
Fixes:
|
|
|
|
8aca2bc96c |
mm: use aligned address in clear_gigantic_page()
In current kernel, hugetlb_no_page() calls folio_zero_user() with the
fault address. Where the fault address may be not aligned with the huge
page size. Then, folio_zero_user() may call clear_gigantic_page() with
the address, while clear_gigantic_page() requires the address to be huge
page size aligned. So, this may cause memory corruption or information
leak, addtional, use more obvious naming 'addr_hint' instead of 'addr' for
clear_gigantic_page().
Link: https://lkml.kernel.org/r/20241028145656.932941-1-wangkefeng.wang@huawei.com
Fixes:
|
|
|
|
dad2dc9c92 |
mm: shmem: fix ShmemHugePages at swapout
/proc/meminfo ShmemHugePages has been showing overlarge amounts (more than
Shmem) after swapping out THPs: we forgot to update NR_SHMEM_THPS.
Add shmem_update_stats(), to avoid repetition, and risk of making that
mistake again: the call from shmem_delete_from_page_cache() is the bugfix;
the call from shmem_replace_folio() is reassuring, but not really a bugfix
(replace corrects misplaced swapin readahead, but huge swapin readahead
would be a mistake).
Link: https://lkml.kernel.org/r/5ba477c8-a569-70b5-923e-09ab221af45b@google.com
Fixes:
|
|
|
|
266facde83 |
slab fixes for 6.13-rc3
-----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmdb/84ACgkQu+CwddJF iJqEhQf+KHXQfZ/m7ma6bUEisjVOkUsxnpSj/DM1CP03+czqXoSEr+CHX2g1v0Os pZSOGGPTtvDgSBO0UZgo1P+sXyl0hrs2A/XrVQfKPmTfc9mPoxFTkbhI7Rv9hZfi 9OVh9kg3OE2hixD9ussGmxtFTXE3vXeu9zASa2sEdpakc3gxgeNhIklGK9r3oQuF 4aa33JpRV/oF80XAAUch5ltAO0tqQWtN50CbnoKc5cGRDuFRRdzXMMiL7gGiCctI te3sVMc6mnusdJqvp+AE10aT0iQpRWwL1lzGrgp3PuZ7DN8UQHxBIf8u8VIjyPML D2JZEYmqykKSvXbDzYBdOAF8XLEYCw== =PpHb -----END PGP SIGNATURE----- Merge tag 'slab-for-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: - Fix for memcg unreclaimable slab stats drift when post-charging large kmalloc allocations (Shakeel Butt) * tag 'slab-for-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: memcg: slub: fix SUnreclaim for post charged objects |
|
|
|
8392bc2ff8 |
fsnotify: generate pre-content permission event on page fault
FS_PRE_ACCESS will be generated on page fault depending on the faulting method. This pre-content event is meant to be used by hierarchical storage managers that want to fill in the file content on first read access. Export a simple helper that file systems that have their own ->fault() will use, and have a more complicated helper to be do fancy things in filemap_fault. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/aa56c50ce81b1fd18d7f5d71dd2dfced5eba9687.1731684329.git.josef@toxicpanda.com |
|
|
|
20bf82a898 |
mm: don't allow huge faults for files with pre content watches
There's nothing stopping us from supporting this, we could simply pass the order into the helper and emit the proper length. However currently there's no tests to validate this works properly, so disable it until there's a desire to support this along with the appropriate tests. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/9035b82cff08a3801cef3d06bbf2778b2e5a4dba.1731684329.git.josef@toxicpanda.com |
|
|
|
fac84846a2 |
fanotify: disable readahead if we have pre-content watches
With page faults we can trigger readahead on the file, and then subsequent faults can find these pages and insert them into the file without emitting an fanotify event. To avoid this case, disable readahead if we have pre-content watches on the file. This way we are guaranteed to get an event for every range we attempt to access on a pre-content watched file. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/70a54e859f555e54bc7a47b32fe5aca92b085615.1731684329.git.josef@toxicpanda.com |
|
|
|
b7ffecbe19 |
memcg: slub: fix SUnreclaim for post charged objects
Large kmalloc directly allocates from the page allocator and then use lruvec_stat_mod_folio() to increment the unreclaimable slab stats for global and memcg. However when post memcg charging of slab objects was added in commit |
|
|
|
553c89ec31 |
24 hotfixes. 17 are cc:stable. 15 are MM and 9 are non-MM.
The usual bunch of singletons - please see the relevant changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ1U/QwAKCRDdBJ7gKXxA jnE7AQC0eyNNvaL5pLCIxN/Vmr8YeuWP1dldgI29TjrH/JKjSQEAihZNqVZYjoIT Gf7Y+IKnc4LbfAXcTe+MfJFeDexM5AU= =U5LQ -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "24 hotfixes. 17 are cc:stable. 15 are MM and 9 are non-MM. The usual bunch of singletons - please see the relevant changelogs for details" * tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits) iio: magnetometer: yas530: use signed integer type for clamp limits sched/numa: fix memory leak due to the overwritten vma->numab_state mm/damon: fix order of arguments in damos_before_apply tracepoint lib: stackinit: hide never-taken branch from compiler mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio() scatterlist: fix incorrect func name in kernel-doc mm: correct typo in MMAP_STATE() macro mm: respect mmap hint address when aligning for THP mm: memcg: declare do_memsw_account inline mm/codetag: swap tags when migrate pages ocfs2: update seq_file index in ocfs2_dlm_seq_next stackdepot: fix stack_depot_save_flags() in NMI context mm: open-code page_folio() in dump_page() mm: open-code PageTail in folio_flags() and const_folio_flags() mm: fix vrealloc()'s KASAN poisoning logic Revert "readahead: properly shorten readahead when falling back to do_page_cache_ra()" selftests/damon: add _damon_sysfs.py to TEST_FILES selftest: hugetlb_dio: fix test naming ocfs2: free inode when ocfs2_get_init_inode() fails nilfs2: fix potential out-of-bounds memory access in nilfs_find_entry() ... |
|
|
|
ddfc146ed5 |
memblock: restore check for node validity in arch_numa
Rework of NUMA initialization in arch_numa dropped a check that refused to accept configurations with invalid node IDs. Restore that check to ensure that when firmware passes invalid nodes, such configuration is rejected and kernel gracefully falls back to dummy NUMA. -----BEGIN PGP SIGNATURE----- iQFEBAABCgAuFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmdSz9wQHHJwcHRAa2Vy bmVsLm9yZwAKCRA5A4Ymyw79kQPWCACSCwm7B8K0ctWbqGHsglCkMgF9pI/mUwjM 3c6zjzpsL5z0ii41cAEbDKWNTfroJddkWxZbDveHt3PytEYVM5ZvQL3tGwCfkpG8 wrAQSRE4XMv+ffA4LBB7U4xHxxEKtSc7OpqO3h4RED3T66hlFtKWMhiNYhl2mKwn 4ic7xLqoKj7Nu3hHc3014x/94tVWszgdgsZo+OJyPSxh+kwLdOVpwZWG22CT58UR nTVQu/a13XVFu8R11S3a4iDMTOqb5oBVRw2pnw+knChXFJ4r2Pr/pA8uneTWEAFB TiYclkH/0/eDd9Vpx5JTUQf4xPfuIXHynjQDwXYHWJ/U9jwLAwTH =h/KU -----END PGP SIGNATURE----- Merge tag 'fixes-2024-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock fixes from Mike Rapoport: "Restore check for node validity in arch_numa. The rework of NUMA initialization in arch_numa dropped a check that refused to accept configurations with invalid node IDs. Restore that check to ensure that when firmware passes invalid nodes, such configuration is rejected and kernel gracefully falls back to dummy NUMA" * tag 'fixes-2024-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: arch_numa: Restore nid checks before registering a memblock with a node memblock: allow zero threshold in validate_numa_converage() |
|
|
|
3203b3ab0f |
mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()
The folio can get freed + buddy-merged + reallocated in the meantime,
resulting in us calling folio_test_locked() possibly on a tail page.
This makes const_folio_flags VM_BUG_ON_PGFLAGS() when stumbling over the
tail page.
Could this result in other issues? Doesn't look like it. False positives
and false negatives don't really matter, because this folio would get
skipped either way when detecting that they have been reallocated in the
meantime.
Fix it by performing the folio_test_locked() checked after grabbing a
reference. If this ever becomes a real problem, we could add a special
helper that racily checks if the bit is set even on tail pages ... but
let's hope that's not required so we can just handle it cleaner: work on
the folio after we hold a reference.
Do we really need the folio_test_locked() check if we are going to trylock
briefly after? Well, we can at least avoid a xas_reload().
It's a bit unclear which exact change introduced that issue. Likely, ever
since we made PG_locked obey to the PF_NO_TAIL policy it could have been
triggered in some way.
Link: https://lkml.kernel.org/r/20241129125303.4033164-1-david@redhat.com
Fixes:
|
|
|
|
cbb70e4534 |
mm: correct typo in MMAP_STATE() macro
We mistakenly refer to len rather than len_ here. The only existing caller passes len to the len_ parameter so this has no impact on the code, but it is obviously incorrect to do this, so fix it. Link: https://lkml.kernel.org/r/20241118175414.390827-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
249608ee47 |
mm: respect mmap hint address when aligning for THP
Commit |
|
|
|
89dd878282 |
mm: memcg: declare do_memsw_account inline
In commit |
|
|
|
51f43d5d82 |
mm/codetag: swap tags when migrate pages
Current solution to adjust codetag references during page migration is
done in 3 steps:
1. sets the codetag reference of the old page as empty (not pointing
to any codetag);
2. subtracts counters of the new page to compensate for its own
allocation;
3. sets codetag reference of the new page to point to the codetag of
the old page.
This does not work if CONFIG_MEM_ALLOC_PROFILING_DEBUG=n because
set_codetag_empty() becomes NOOP. Instead, let's simply swap codetag
references so that the new page is referencing the old codetag and the old
page is referencing the new codetag. This way accounting stays valid and
the logic makes more sense.
Link: https://lkml.kernel.org/r/20241129025213.34836-1-00107082@163.com
Fixes:
|
|
|
|
6a7de1bf21 |
mm: open-code page_folio() in dump_page()
page_folio() calls page_fixed_fake_head() which will misidentify this page
as being a fake head and load off the end of 'precise'. We may have a
pointer to a fake head, but that's OK because it contains the right
information for dump_page().
gcc-15 is smart enough to catch this with -Warray-bounds:
In function 'page_fixed_fake_head',
inlined from '_compound_head' at ../include/linux/page-flags.h:251:24,
inlined from '__dump_page' at ../mm/debug.c:123:11:
../include/asm-generic/rwonce.h:44:26: warning: array subscript 9 is outside
+array bounds of 'struct page[1]' [-Warray-bounds=]
Link: https://lkml.kernel.org/r/20241125201721.2963278-2-willy@infradead.org
Fixes:
|
|
|
|
d699440f58 |
mm: fix vrealloc()'s KASAN poisoning logic
When vrealloc() reuses already allocated vmap_area, we need to re-annotate
poisoned and unpoisoned portions of underlying memory according to the new
size.
This results in a KASAN splat recorded at [1]. A KASAN mis-reporting
issue where there is none.
Note, hard-coding KASAN_VMALLOC_PROT_NORMAL might not be exactly correct,
but KASAN flag logic is pretty involved and spread out throughout
__vmalloc_node_range_noprof(), so I'm using the bare minimum flag here and
leaving the rest to mm people to refactor this logic and reuse it here.
Link: https://lkml.kernel.org/r/20241126005206.3457974-1-andrii@kernel.org
Link: https://lore.kernel.org/bpf/67450f9b.050a0220.21d33d.0004.GAE@google.com/ [1]
Fixes:
|
|
|
|
a220d6b95b |
Revert "readahead: properly shorten readahead when falling back to do_page_cache_ra()"
This reverts commit |
|
|
|
e30a0361b8 |
kasan: make report_lock a raw spinlock
If PREEMPT_RT is enabled, report_lock is a sleeping spinlock and must not
be locked when IRQs are disabled. However, KASAN reports may be triggered
in such contexts. For example:
char *s = kzalloc(1, GFP_KERNEL);
kfree(s);
local_irq_disable();
char c = *s; /* KASAN report here leads to spin_lock() */
local_irq_enable();
Make report_spinlock a raw spinlock to prevent rescheduling when
PREEMPT_RT is enabled.
Link: https://lkml.kernel.org/r/20241119210234.1602529-1-jkangas@redhat.com
Fixes:
|
|
|
|
091c1dd2d4 |
mm/mempolicy: fix migrate_to_node() assuming there is at least one VMA in a MM
We currently assume that there is at least one VMA in a MM, which isn't
true.
So we might end up having find_vma() return NULL, to then de-reference
NULL. So properly handle find_vma() returning NULL.
This fixes the report:
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 1 UID: 0 PID: 6021 Comm: syz-executor284 Not tainted 6.12.0-rc7-syzkaller-00187-gf868cd251776 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/30/2024
RIP: 0010:migrate_to_node mm/mempolicy.c:1090 [inline]
RIP: 0010:do_migrate_pages+0x403/0x6f0 mm/mempolicy.c:1194
Code: ...
RSP: 0018:ffffc9000375fd08 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffc9000375fd78 RCX: 0000000000000000
RDX: ffff88807e171300 RSI: dffffc0000000000 RDI: ffff88803390c044
RBP: ffff88807e171428 R08: 0000000000000014 R09: fffffbfff2039ef1
R10: ffffffff901cf78f R11: 0000000000000000 R12: 0000000000000003
R13: ffffc9000375fe90 R14: ffffc9000375fe98 R15: ffffc9000375fdf8
FS: 00005555919e1380(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005555919e1ca8 CR3: 000000007f12a000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
kernel_migrate_pages+0x5b2/0x750 mm/mempolicy.c:1709
__do_sys_migrate_pages mm/mempolicy.c:1727 [inline]
__se_sys_migrate_pages mm/mempolicy.c:1723 [inline]
__x64_sys_migrate_pages+0x96/0x100 mm/mempolicy.c:1723
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
[akpm@linux-foundation.org: add unlikely()]
Link: https://lkml.kernel.org/r/20241120201151.9518-1-david@redhat.com
Fixes:
|
|
|
|
a1268be280 |
mm/gup: handle NULL pages in unpin_user_pages()
The recent addition of "pofs" (pages or folios) handling to gup has a
flaw: it assumes that unpin_user_pages() handles NULL pages in the pages**
array. That's not the case, as I discovered when I ran on a new
configuration on my test machine.
Fix this by skipping NULL pages in unpin_user_pages(), just like
unpin_folios() already does.
Details: when booting on x86 with "numa=fake=2 movablecore=4G" on Linux
6.12, and running this:
tools/testing/selftests/mm/gup_longterm
...I get the following crash:
BUG: kernel NULL pointer dereference, address: 0000000000000008
RIP: 0010:sanity_check_pinned_pages+0x3a/0x2d0
...
Call Trace:
<TASK>
? __die_body+0x66/0xb0
? page_fault_oops+0x30c/0x3b0
? do_user_addr_fault+0x6c3/0x720
? irqentry_enter+0x34/0x60
? exc_page_fault+0x68/0x100
? asm_exc_page_fault+0x22/0x30
? sanity_check_pinned_pages+0x3a/0x2d0
unpin_user_pages+0x24/0xe0
check_and_migrate_movable_pages_or_folios+0x455/0x4b0
__gup_longterm_locked+0x3bf/0x820
? mmap_read_lock_killable+0x12/0x50
? __pfx_mmap_read_lock_killable+0x10/0x10
pin_user_pages+0x66/0xa0
gup_test_ioctl+0x358/0xb20
__se_sys_ioctl+0x6b/0xc0
do_syscall_64+0x7b/0x150
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Link: https://lkml.kernel.org/r/20241121034933.77502-1-jhubbard@nvidia.com
Fixes:
|
|
|
|
cdd30ebb1b |
module: Convert symbol namespace to string literal
Clean up the existing export namespace code along the same lines of
commit
|
|
|
|
eb449bd969 |
mm: convert mm_lock_seq to a proper seqcount
Convert mm_lock_seq to be seqcount_t and change all mmap_write_lock variants to increment it, in-line with the usual seqcount usage pattern. This lets us check whether the mmap_lock is write-locked by checking mm_lock_seq.sequence counter (odd=locked, even=unlocked). This will be used when implementing mmap_lock speculation functions. As a result vm_lock_seq is also change to be unsigned to match the type of mm_lock_seq.sequence. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Link: https://lkml.kernel.org/r/20241122174416.1367052-2-surenb@google.com |
|
|
|
7528585290 |
mm/gup: Use raw_seqcount_try_begin()
David pointed out that gup_fast() does exactly what the new raw_seqcount_try_begin() does -- use it. Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> |
|
|
|
9cdc6423ac |
memblock: allow zero threshold in validate_numa_converage()
Currently memblock validate_numa_converage() returns false negative when threshold set to zero. Make the check if the memory size with invalid node ID is greater than the threshold exclusive to fix that. Link: https://lore.kernel.org/all/Z0mIDBD4KLyxyOCm@kernel.org/ Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> |
|
|
|
6a34dfa15d |
Kbuild updates for v6.13
- Add generic support for built-in boot DTB files
- Enable TAB cycling for dialog buttons in nconfig
- Fix issues in streamline_config.pl
- Refactor Kconfig
- Add support for Clang's AutoFDO (Automatic Feedback-Directed
Optimization)
- Add support for Clang's Propeller, a profile-guided optimization.
- Change the working directory to the external module directory for M=
builds
- Support building external modules in a separate output directory
- Enable objtool for *.mod.o and additional kernel objects
- Use lz4 instead of deprecated lz4c
- Work around a performance issue with "git describe"
- Refactor modpost
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmdKGgEVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGrFoQAIgioJPRG+HC6bGmjy4tL4bq1RAm
78nbD12grrAa+NvQGRZYRs264rWxBGwrNfGGS9BDvlWJZ3fmKEuPlfCIxC0nkKk8
LVLNxSVvgpHE47RQ+E4V+yYhrlZKb4aWZjH3ZICn7vxRgbQ5Veq61aatluVHyn9c
I8g+APYN/S1A4JkFzaLe8GV7v5OM3+zGSn3M9n7/DxVkoiNrMOXJm5hRdRgYfEv/
kMppheY2PPshZsaL+yLAdrJccY5au5vYE/v8wHkMtvM/LF6YwjgqPVDRFQ30BuLM
sAMMd6AUoopiDZQOpqmXYukU0b0MQPswg3jmB+PWUBrlsuydRa8kkyPwUaFrDd+w
9d0jZRc8/O/lxUdD1AefRkNcA/dIJ4lTPr+2NpxwHuj2UFo0gLQmtjBggMFHaWvs
0NQRBPlxfOE4+Htl09gyg230kHuWq+rh7xqbyJCX+hBOaZ6kI2lryl6QkgpAoS+x
KDOcUKnsgGMGARQRrvCOAXnQs+rjkW8fEm6t7eSBFPuWJMK85k4LmxOog8GVYEdM
JcwCnCHt9TtcHlSxLRnVXj2aqGTFNLJXE1aLyCp9u8MxZ7qcx01xOuCmwp6FRzNq
ghal7ngA58Y/S4K/oJ+CW2KupOb6CFne8mpyotpYeWI7MR64t0YWs4voZkuK46E2
CEBfA4PDehA4BxQe
=GDKD
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Add generic support for built-in boot DTB files
- Enable TAB cycling for dialog buttons in nconfig
- Fix issues in streamline_config.pl
- Refactor Kconfig
- Add support for Clang's AutoFDO (Automatic Feedback-Directed
Optimization)
- Add support for Clang's Propeller, a profile-guided optimization.
- Change the working directory to the external module directory for M=
builds
- Support building external modules in a separate output directory
- Enable objtool for *.mod.o and additional kernel objects
- Use lz4 instead of deprecated lz4c
- Work around a performance issue with "git describe"
- Refactor modpost
* tag 'kbuild-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (85 commits)
kbuild: rename .tmp_vmlinux.kallsyms0.syms to .tmp_vmlinux0.syms
gitignore: Don't ignore 'tags' directory
kbuild: add dependency from vmlinux to resolve_btfids
modpost: replace tdb_hash() with hash_str()
kbuild: deb-pkg: add python3:native to build dependency
genksyms: reduce indentation in export_symbol()
modpost: improve error messages in device_id_check()
modpost: rename alias symbol for MODULE_DEVICE_TABLE()
modpost: rename variables in handle_moddevtable()
modpost: move strstarts() to modpost.h
modpost: convert do_usb_table() to a generic handler
modpost: convert do_of_table() to a generic handler
modpost: convert do_pnp_device_entry() to a generic handler
modpost: convert do_pnp_card_entries() to a generic handler
modpost: call module_alias_printf() from all do_*_entry() functions
modpost: pass (struct module *) to do_*_entry() functions
modpost: remove DEF_FIELD_ADDR_VAR() macro
modpost: deduplicate MODULE_ALIAS() for all drivers
modpost: introduce module_alias_printf() helper
modpost: remove unnecessary check in do_acpi_entry()
...
|
|
|
|
ab952fc5c7 |
memblock: updates for 6.13-rc1
* replace hardcoded strings with str_on_off() in report_meminit() * initialize reserved pages to MIGRATE_MOVABLE when deferred struct page initialization is enabled so that if the reserved pages are freed they are put on movable free lists like it is done now when deferred struct page initialization is disabled -----BEGIN PGP SIGNATURE----- iQFEBAABCgAuFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmdG0cAQHHJwcHRAa2Vy bmVsLm9yZwAKCRA5A4Ymyw79kV3JCACjcw5F3plRqPOUYcHpbFT0gVq787CHQTke szhLrRLcSaCMf9P7sjlLdlAkB0SVR64oghCLxa4s8IxkOSBk52WJneUimkN8smsI VAm5K7YPZtC+W4BHmGmrXdMeV6XWzV7RJ/SMxyD3oBd+zpnzevXmxRTglKq+MRok t3WCT0cSb7FFORraqrexKg4zcEEYCad1swDmpSlHBzYFnC05C5qFgEW4hnqsEc1x dikstJujVxnfxFwIu3L2yRLjB2Ti4EDMsh1Z5HjjfE89wm/PRW0YL0bsJ+RG+7t4 GYpG/KxhnAZeEEcOdwYxZKWxqyTgBXYvVyL43Yizw1BGSIyLvOjn =mv+T -----END PGP SIGNATURE----- Merge tag 'memblock-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock updates from Mike Rapoport: - replace hardcoded strings with str_on_off() in report_meminit() - initialize reserved pages to MIGRATE_MOVABLE when deferred struct page initialization is enabled so that if the reserved pages are freed they are put on movable free lists like it is done now when deferred struct page initialization is disabled * tag 'memblock-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: memblock: uniformly initialize all reserved pages to MIGRATE_MOVABLE mm: Use str_on_off() helper function in report_meminit() |
|
|
|
dbefa1f31a |
Rename .data.once to .data..once to fix resetting WARN*_ONCE
Commit |
|
|
|
798bb342e0 |
Rust changes for v6.13
Toolchain and infrastructure:
- Enable a series of lints, including safety-related ones, e.g. the
compiler will now warn about missing safety comments, as well as
unnecessary ones. How safety documentation is organized is a frequent
source of review comments, thus having the compiler guide new
developers on where they are expected (and where not) is very nice.
- Start using '#[expect]': an interesting feature in Rust (stabilized
in 1.81.0) that makes the compiler warn if an expected warning was
_not_ emitted. This is useful to avoid forgetting cleaning up locally
ignored diagnostics ('#[allow]'s).
- Introduce '.clippy.toml' configuration file for Clippy, the Rust
linter, which will allow us to tweak its behaviour. For instance, our
first use cases are declaring a disallowed macro and, more
importantly, enabling the checking of private items.
- Lints-related fixes and cleanups related to the items above.
- Migrate from 'receiver_trait' to 'arbitrary_self_types': to get the
kernel into stable Rust, one of the major pieces of the puzzle is the
support to write custom types that can be used as 'self', i.e. as
receivers, since the kernel needs to write types such as 'Arc' that
common userspace Rust would not. 'arbitrary_self_types' has been
accepted to become stable, and this is one of the steps required to
get there.
- Remove usage of the 'new_uninit' unstable feature.
- Use custom C FFI types. Includes a new 'ffi' crate to contain our
custom mapping, instead of using the standard library 'core::ffi'
one. The actual remapping will be introduced in a later cycle.
- Map '__kernel_{size_t,ssize_t,ptrdiff_t}' to 'usize'/'isize' instead
of 32/64-bit integers.
- Fix 'size_t' in bindgen generated prototypes of C builtins.
- Warn on bindgen < 0.69.5 and libclang >= 19.1 due to a double issue
in the projects, which we managed to trigger with the upcoming
tracepoint support. It includes a build test since some distributions
backported the fix (e.g. Debian -- thanks!). All major distributions
we list should be now OK except Ubuntu non-LTS.
'macros' crate:
- Adapt the build system to be able run the doctests there too; and
clean up and enable the corresponding doctests.
'kernel' crate:
- Add 'alloc' module with generic kernel allocator support and remove
the dependency on the Rust standard library 'alloc' and the extension
traits we used to provide fallible methods with flags.
Add the 'Allocator' trait and its implementations '{K,V,KV}malloc'.
Add the 'Box' type (a heap allocation for a single value of type 'T'
that is also generic over an allocator and considers the kernel's GFP
flags) and its shorthand aliases '{K,V,KV}Box'. Add 'ArrayLayout'
type. Add 'Vec' (a contiguous growable array type) and its shorthand
aliases '{K,V,KV}Vec', including iterator support.
For instance, now we may write code such as:
let mut v = KVec::new();
v.push(1, GFP_KERNEL)?;
assert_eq!(&v, &[1]);
Treewide, move as well old users to these new types.
- 'sync' module: add global lock support, including the
'GlobalLockBackend' trait; the 'Global{Lock,Guard,LockedBy}' types
and the 'global_lock!' macro. Add the 'Lock::try_lock' method.
- 'error' module: optimize 'Error' type to use 'NonZeroI32' and make
conversion functions public.
- 'page' module: add 'page_align' function.
- Add 'transmute' module with the existing 'FromBytes' and 'AsBytes'
traits.
- 'block::mq::request' module: improve rendered documentation.
- 'types' module: extend 'Opaque' type documentation and add simple
examples for the 'Either' types.
drm/panic:
- Clean up a series of Clippy warnings.
Documentation:
- Add coding guidelines for lints and the '#[expect]' feature.
- Add Ubuntu to the list of distributions in the Quick Start guide.
MAINTAINERS:
- Add Danilo Krummrich as maintainer of the new 'alloc' module.
And a few other small cleanups and fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPjU5OPd5QIZ9jqqOGXyLc2htIW0FAmdFMIgACgkQGXyLc2ht
IW16hQ/+KX/jmdGoXtNXx7T6yG6SJ/txPOieGAWfhBf6C3bqkGrU9Gnw/O3VWrxf
eyj1QLQaIVUkmumWCefeiy9u3xRXx5fpS0tWJOjUtxC5NcS7vCs0AHQs1skIa6H+
YD6UKDPOy7CB5fVYqo13B5xnFAlciU0dLo6IGB6bB/lSpCudGLE9+nukfn5H3/R1
DTc3/fbSoYQU6Ij/FKscB+D/A7ojdYaReodhbzNw1lChg1MrJlCpqoQvHPE8ijg+
UDljHFFvgKdhSQL9GTa3LC7X4DsnihMWzXt14m6mMOqBa6TqF47WUhhgC77pHEI2
v/Yy8MLq0pdIzT1wFjsqs6opuvXc7K5Yk5Y60HDsDyIyjk2xgOjh6ZlD0EV161gS
7w1NtaKd/Cn7hnL7Ua51yJDxJTMllne3fTWemhs3Zd63j7ham98yOoiw+6L2QaM4
C9nW48vfUuTwDuYJ5HU0uSugubuHW3Ng5JEvMcvd4QjmaI1bQNkgVzefR5j3dLw8
9kEOTzJoxHpu5B7PZVTEd/L95hlmk1csSQObxi7JYCCimMkusF1S+heBzV/SqWD5
5ioEhCnSKE05fhQs0Uxns1HkcFle8Bn6r3aSAWV6yaR8oF94yHcuaZRUKxKMHw+1
cmBE2X8Yvtldw+CYDwEGWjKDtwOStbqk+b/ZzP7f7/p56QH9lSg=
=Kn7b
-----END PGP SIGNATURE-----
Merge tag 'rust-6.13' of https://github.com/Rust-for-Linux/linux
Pull rust updates from Miguel Ojeda:
"Toolchain and infrastructure:
- Enable a series of lints, including safety-related ones, e.g. the
compiler will now warn about missing safety comments, as well as
unnecessary ones. How safety documentation is organized is a
frequent source of review comments, thus having the compiler guide
new developers on where they are expected (and where not) is very
nice.
- Start using '#[expect]': an interesting feature in Rust (stabilized
in 1.81.0) that makes the compiler warn if an expected warning was
_not_ emitted. This is useful to avoid forgetting cleaning up
locally ignored diagnostics ('#[allow]'s).
- Introduce '.clippy.toml' configuration file for Clippy, the Rust
linter, which will allow us to tweak its behaviour. For instance,
our first use cases are declaring a disallowed macro and, more
importantly, enabling the checking of private items.
- Lints-related fixes and cleanups related to the items above.
- Migrate from 'receiver_trait' to 'arbitrary_self_types': to get the
kernel into stable Rust, one of the major pieces of the puzzle is
the support to write custom types that can be used as 'self', i.e.
as receivers, since the kernel needs to write types such as 'Arc'
that common userspace Rust would not. 'arbitrary_self_types' has
been accepted to become stable, and this is one of the steps
required to get there.
- Remove usage of the 'new_uninit' unstable feature.
- Use custom C FFI types. Includes a new 'ffi' crate to contain our
custom mapping, instead of using the standard library 'core::ffi'
one. The actual remapping will be introduced in a later cycle.
- Map '__kernel_{size_t,ssize_t,ptrdiff_t}' to 'usize'/'isize'
instead of 32/64-bit integers.
- Fix 'size_t' in bindgen generated prototypes of C builtins.
- Warn on bindgen < 0.69.5 and libclang >= 19.1 due to a double issue
in the projects, which we managed to trigger with the upcoming
tracepoint support. It includes a build test since some
distributions backported the fix (e.g. Debian -- thanks!). All
major distributions we list should be now OK except Ubuntu non-LTS.
'macros' crate:
- Adapt the build system to be able run the doctests there too; and
clean up and enable the corresponding doctests.
'kernel' crate:
- Add 'alloc' module with generic kernel allocator support and remove
the dependency on the Rust standard library 'alloc' and the
extension traits we used to provide fallible methods with flags.
Add the 'Allocator' trait and its implementations '{K,V,KV}malloc'.
Add the 'Box' type (a heap allocation for a single value of type
'T' that is also generic over an allocator and considers the
kernel's GFP flags) and its shorthand aliases '{K,V,KV}Box'. Add
'ArrayLayout' type. Add 'Vec' (a contiguous growable array type)
and its shorthand aliases '{K,V,KV}Vec', including iterator
support.
For instance, now we may write code such as:
let mut v = KVec::new();
v.push(1, GFP_KERNEL)?;
assert_eq!(&v, &[1]);
Treewide, move as well old users to these new types.
- 'sync' module: add global lock support, including the
'GlobalLockBackend' trait; the 'Global{Lock,Guard,LockedBy}' types
and the 'global_lock!' macro. Add the 'Lock::try_lock' method.
- 'error' module: optimize 'Error' type to use 'NonZeroI32' and make
conversion functions public.
- 'page' module: add 'page_align' function.
- Add 'transmute' module with the existing 'FromBytes' and 'AsBytes'
traits.
- 'block::mq::request' module: improve rendered documentation.
- 'types' module: extend 'Opaque' type documentation and add simple
examples for the 'Either' types.
drm/panic:
- Clean up a series of Clippy warnings.
Documentation:
- Add coding guidelines for lints and the '#[expect]' feature.
- Add Ubuntu to the list of distributions in the Quick Start guide.
MAINTAINERS:
- Add Danilo Krummrich as maintainer of the new 'alloc' module.
And a few other small cleanups and fixes"
* tag 'rust-6.13' of https://github.com/Rust-for-Linux/linux: (82 commits)
rust: alloc: Fix `ArrayLayout` allocations
docs: rust: remove spurious item in `expect` list
rust: allow `clippy::needless_lifetimes`
rust: warn on bindgen < 0.69.5 and libclang >= 19.1
rust: use custom FFI integer types
rust: map `__kernel_size_t` and friends also to usize/isize
rust: fix size_t in bindgen prototypes of C builtins
rust: sync: add global lock support
rust: macros: enable the rest of the tests
rust: macros: enable paste! use from macro_rules!
rust: enable macros::module! tests
rust: kbuild: expand rusttest target for macros
rust: types: extend `Opaque` documentation
rust: block: fix formatting of `kernel::block::mq::request` module
rust: macros: fix documentation of the paste! macro
rust: kernel: fix THIS_MODULE header path in ThisModule doc comment
rust: page: add Rust version of PAGE_ALIGN
rust: helpers: remove unnecessary header includes
rust: exports: improve grammar in commentary
drm/panic: allow verbose version check
...
|
|
|
|
fb527fc1f3 |
fuse update for 6.13
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCZ0Rb/wAKCRDh3BK/laaZ PK80AQDAUgA6S5SSrbJxwRFNOhbwtZxZqJ8fomJR5xuWIEQ9pwEAkpFqhBhBW0y1 0YaREow2aDANQQtSUrfPtgva1ZXFwQU= =Cyx5 -----END PGP SIGNATURE----- Merge tag 'fuse-update-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Add page -> folio conversions (Joanne Koong, Josef Bacik) - Allow max size of fuse requests to be configurable with a sysctl (Joanne Koong) - Allow FOPEN_DIRECT_IO to take advantage of async code path (yangyun) - Fix large kernel reads (like a module load) in virtio_fs (Hou Tao) - Fix attribute inconsistency in case readdirplus (and plain lookup in corner cases) is racing with inode eviction (Zhang Tianci) - Fix a WARN_ON triggered by virtio_fs (Asahi Lina) * tag 'fuse-update-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (30 commits) virtiofs: dax: remove ->writepages() callback fuse: check attributes staleness on fuse_iget() fuse: remove pages for requests and exclusively use folios fuse: convert direct io to use folios mm/writeback: add folio_mark_dirty_lock() fuse: convert writebacks to use folios fuse: convert retrieves to use folios fuse: convert ioctls to use folios fuse: convert writes (non-writeback) to use folios fuse: convert reads to use folios fuse: convert readdir to use folios fuse: convert readlink to use folios fuse: convert cuse to use folios fuse: add support in virtio for requests using folios fuse: support folios in struct fuse_args_pages and fuse_copy_pages() fuse: convert fuse_notify_store to use folios fuse: convert fuse_retrieve to use folios fuse: use the folio based vmstat helpers fuse: convert fuse_writepage_need_send to take a folio fuse: convert fuse_do_readpage to use folios ... |
|
|
|
e06635e26c |
slab updates for 6.13
-----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmdERvEACgkQu+CwddJF iJre6Af9EBMVQiWJrmoMOjbGLqLgmZzSXRNxR862WGn4D/wesA1HmSlWgEn54hgc GIYIeD++v4JaIRNH0yZqb2UBSKjF/rYPDkKstnqgFaVakLoDrwkkwV2n3Gk5BEgR m/SzLGgoDWKR65I/oMpL6e2KrMOfMfjpB31qiVvdlaQd2Nv/5rw+gUVylxhNIZEH W11N3IC+e9hmgT3ZBpTmHeqNrlXE1+USWPrp/HV05Ndz6yf97JnP4Wr9f9pcyN3R aflLHR38+Q9cCfO7y8wNqtYvIV/kbqgdaqD76frSgalC4Lmz9+L+TZ2NuENCPoGj Xdbip2z+iffWhvqM+qooOLVxR0XqTA== =Sepb -----END PGP SIGNATURE----- Merge tag 'slab-for-6.13-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - Add new slab_strict_numa boot parameter to enforce per-object memory policies on top of slab folio policies, for systems where saving cost of remote accesses is more important than minimizing slab allocation overhead (Christoph Lameter) - Fix for freeptr_offset alignment check being too strict for m68k (Geert Uytterhoeven) - krealloc() fixes for not violating __GFP_ZERO guarantees on krealloc() when slub_debug (redzone and object tracking) is enabled (Feng Tang) - Fix a memory leak in case sysfs registration fails for a slab cache, and also no longer fail to create the cache in that case (Hyeonggon Yoo) - Fix handling of detected consistency problems (due to buggy slab user) with slub_debug enabled, so that it does not cause further list corruption bugs (yuan.gao) - Code cleanup and kerneldocs polishing (Zhen Lei, Vlastimil Babka) * tag 'slab-for-6.13-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slab: Fix too strict alignment check in create_cache() mm/slab: Allow cache creation to proceed even if sysfs registration fails mm/slub: Avoid list corruption when removing a slab from the full list mm/slub, kunit: Add testcase for krealloc redzone and zeroing mm/slub: Improve redzone check and zeroing for krealloc() mm/slub: Consider kfence case for get_orig_size() SLUB: Add support for per object memory policies mm, slab: add kerneldocs for common SLAB_ flags mm/slab: remove duplicate check in create_cache() mm/slub: Move krealloc() and related code to slub.c mm/kasan: Don't store metadata inside kmalloc object when slub_debug_orig_size is on |
|
|
|
f5f4745a7f |
- The series "resource: A couple of cleanups" from Andy Shevchenko
performs some cleanups in the resource management code.
- The series "Improve the copy of task comm" from Yafang Shao addresses
possible race-induced overflows in the management of task_struct.comm[].
- The series "Remove unnecessary header includes from
{tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a
small fix to the list_sort library code and to its selftest.
- The series "Enhance min heap API with non-inline functions and
optimizations" also from Kuan-Wei Chiu optimizes and cleans up the
min_heap library code.
- The series "nilfs2: Finish folio conversion" from Ryusuke Konishi
finishes off nilfs2's folioification.
- The series "add detect count for hung tasks" from Lance Yang adds more
userspace visibility into the hung-task detector's activity.
- Apart from that, singelton patches in many places - please see the
individual changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ0L6lQAKCRDdBJ7gKXxA
jmEIAPwMSglNPKRIOgzOvHh8MUJW1Dy8iKJ2kWCO3f6QTUIM2AEA+PazZbUd/g2m
Ii8igH0UBibIgva7MrCyJedDI1O23AA=
=8BIU
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- The series "resource: A couple of cleanups" from Andy Shevchenko
performs some cleanups in the resource management code
- The series "Improve the copy of task comm" from Yafang Shao addresses
possible race-induced overflows in the management of
task_struct.comm[]
- The series "Remove unnecessary header includes from
{tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a
small fix to the list_sort library code and to its selftest
- The series "Enhance min heap API with non-inline functions and
optimizations" also from Kuan-Wei Chiu optimizes and cleans up the
min_heap library code
- The series "nilfs2: Finish folio conversion" from Ryusuke Konishi
finishes off nilfs2's folioification
- The series "add detect count for hung tasks" from Lance Yang adds
more userspace visibility into the hung-task detector's activity
- Apart from that, singelton patches in many places - please see the
individual changelogs for details
* tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits)
gdb: lx-symbols: do not error out on monolithic build
kernel/reboot: replace sprintf() with sysfs_emit()
lib: util_macros_kunit: add kunit test for util_macros.h
util_macros.h: fix/rework find_closest() macros
Improve consistency of '#error' directive messages
ocfs2: fix uninitialized value in ocfs2_file_read_iter()
hung_task: add docs for hung_task_detect_count
hung_task: add detect count for hung tasks
dma-buf: use atomic64_inc_return() in dma_buf_getfile()
fs/proc/kcore.c: fix coccinelle reported ERROR instances
resource: avoid unnecessary resource tree walking in __region_intersects()
ocfs2: remove unused errmsg function and table
ocfs2: cluster: fix a typo
lib/scatterlist: use sg_phys() helper
checkpatch: always parse orig_commit in fixes tag
nilfs2: convert metadata aops from writepage to writepages
nilfs2: convert nilfs_recovery_copy_block() to take a folio
nilfs2: convert nilfs_page_count_clean_buffers() to take a folio
nilfs2: remove nilfs_writepage
nilfs2: convert checkpoint file to be folio-based
...
|
|
|
|
5c00ff742b |
- The series "zram: optimal post-processing target selection" from
Sergey Senozhatsky improves zram's post-processing selection algorithm.
This leads to improved memory savings.
- Wei Yang has gone to town on the mapletree code, contributing several
series which clean up the implementation:
- "refine mas_mab_cp()"
- "Reduce the space to be cleared for maple_big_node"
- "maple_tree: simplify mas_push_node()"
- "Following cleanup after introduce mas_wr_store_type()"
- "refine storing null"
- The series "selftests/mm: hugetlb_fault_after_madv improvements" from
David Hildenbrand fixes this selftest for s390.
- The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
implements some rationaizations and cleanups in the page mapping code.
- The series "mm: optimize shadow entries removal" from Shakeel Butt
optimizes the file truncation code by speeding up the handling of shadow
entries.
- The series "Remove PageKsm()" from Matthew Wilcox completes the
migration of this flag over to being a folio-based flag.
- The series "Unify hugetlb into arch_get_unmapped_area functions" from
Oscar Salvador implements a bunch of consolidations and cleanups in the
hugetlb code.
- The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
takes away the wp-fault time practice of turning a huge zero page into
small pages. Instead we replace the whole thing with a THP. More
consistent cleaner and potentiall saves a large number of pagefaults.
- The series "percpu: Add a test case and fix for clang" from Andy
Shevchenko enhances and fixes the kernel's built in percpu test code.
- The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
optimizes mremap() by avoiding doing things which we didn't need to do.
- The series "Improve the tmpfs large folio read performance" from
Baolin Wang teaches tmpfs to copy data into userspace at the folio size
rather than as individual pages. A 20% speedup was observed.
- The series "mm/damon/vaddr: Fix issue in
damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting.
- The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt
removes the long-deprecated memcgv2 charge moving feature.
- The series "fix error handling in mmap_region() and refactor" from
Lorenzo Stoakes cleanup up some of the mmap() error handling and
addresses some potential performance issues.
- The series "x86/module: use large ROX pages for text allocations" from
Mike Rapoport teaches x86 to use large pages for read-only-execute
module text.
- The series "page allocation tag compression" from Suren Baghdasaryan
is followon maintenance work for the new page allocation profiling
feature.
- The series "page->index removals in mm" from Matthew Wilcox remove
most references to page->index in mm/. A slow march towards shrinking
struct page.
- The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
interface tests" from Andrew Paniakin performs maintenance work for
DAMON's self testing code.
- The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
improves zswap's batching of compression and decompression. It is a
step along the way towards using Intel IAA hardware acceleration for
this zswap operation.
- The series "kasan: migrate the last module test to kunit" from
Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests
over to the KUnit framework.
- The series "implement lightweight guard pages" from Lorenzo Stoakes
permits userapace to place fault-generating guard pages within a single
VMA, rather than requiring that multiple VMAs be created for this.
Improved efficiencies for userspace memory allocators are expected.
- The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
tracepoints to provide increased visibility into memcg stats flushing
activity.
- The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
fixes a zram buglet which potentially affected performance.
- The series "mm: add more kernel parameters to control mTHP" from
Maíra Canal enhances our ability to control/configuremultisize THP from
the kernel boot command line.
- The series "kasan: few improvements on kunit tests" from Sabyrzhan
Tasbolatov has a couple of fixups for the KASAN KUnit tests.
- The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
from Kairui Song optimizes list_lru memory utilization when lockdep is
enabled.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzwFqgAKCRDdBJ7gKXxA
jkeuAQCkl+BmeYHE6uG0hi3pRxkupseR6DEOAYIiTv0/l8/GggD/Z3jmEeqnZaNq
xyyenpibWgUoShU2wZ/Ha8FE5WDINwg=
=JfWR
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "zram: optimal post-processing target selection" from
Sergey Senozhatsky improves zram's post-processing selection
algorithm. This leads to improved memory savings.
- Wei Yang has gone to town on the mapletree code, contributing several
series which clean up the implementation:
- "refine mas_mab_cp()"
- "Reduce the space to be cleared for maple_big_node"
- "maple_tree: simplify mas_push_node()"
- "Following cleanup after introduce mas_wr_store_type()"
- "refine storing null"
- The series "selftests/mm: hugetlb_fault_after_madv improvements" from
David Hildenbrand fixes this selftest for s390.
- The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
implements some rationaizations and cleanups in the page mapping
code.
- The series "mm: optimize shadow entries removal" from Shakeel Butt
optimizes the file truncation code by speeding up the handling of
shadow entries.
- The series "Remove PageKsm()" from Matthew Wilcox completes the
migration of this flag over to being a folio-based flag.
- The series "Unify hugetlb into arch_get_unmapped_area functions" from
Oscar Salvador implements a bunch of consolidations and cleanups in
the hugetlb code.
- The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
takes away the wp-fault time practice of turning a huge zero page
into small pages. Instead we replace the whole thing with a THP. More
consistent cleaner and potentiall saves a large number of pagefaults.
- The series "percpu: Add a test case and fix for clang" from Andy
Shevchenko enhances and fixes the kernel's built in percpu test code.
- The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
optimizes mremap() by avoiding doing things which we didn't need to
do.
- The series "Improve the tmpfs large folio read performance" from
Baolin Wang teaches tmpfs to copy data into userspace at the folio
size rather than as individual pages. A 20% speedup was observed.
- The series "mm/damon/vaddr: Fix issue in
damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON
splitting.
- The series "memcg-v1: fully deprecate charge moving" from Shakeel
Butt removes the long-deprecated memcgv2 charge moving feature.
- The series "fix error handling in mmap_region() and refactor" from
Lorenzo Stoakes cleanup up some of the mmap() error handling and
addresses some potential performance issues.
- The series "x86/module: use large ROX pages for text allocations"
from Mike Rapoport teaches x86 to use large pages for
read-only-execute module text.
- The series "page allocation tag compression" from Suren Baghdasaryan
is followon maintenance work for the new page allocation profiling
feature.
- The series "page->index removals in mm" from Matthew Wilcox remove
most references to page->index in mm/. A slow march towards shrinking
struct page.
- The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
interface tests" from Andrew Paniakin performs maintenance work for
DAMON's self testing code.
- The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
improves zswap's batching of compression and decompression. It is a
step along the way towards using Intel IAA hardware acceleration for
this zswap operation.
- The series "kasan: migrate the last module test to kunit" from
Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
tests over to the KUnit framework.
- The series "implement lightweight guard pages" from Lorenzo Stoakes
permits userapace to place fault-generating guard pages within a
single VMA, rather than requiring that multiple VMAs be created for
this. Improved efficiencies for userspace memory allocators are
expected.
- The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
tracepoints to provide increased visibility into memcg stats flushing
activity.
- The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
fixes a zram buglet which potentially affected performance.
- The series "mm: add more kernel parameters to control mTHP" from
Maíra Canal enhances our ability to control/configuremultisize THP
from the kernel boot command line.
- The series "kasan: few improvements on kunit tests" from Sabyrzhan
Tasbolatov has a couple of fixups for the KASAN KUnit tests.
- The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
from Kairui Song optimizes list_lru memory utilization when lockdep
is enabled.
* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
mm/kfence: add a new kunit test test_use_after_free_read_nofault()
zram: fix NULL pointer in comp_algorithm_show()
memcg/hugetlb: add hugeTLB counters to memcg
vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
zram: ZRAM_DEF_COMP should depend on ZRAM
MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
Docs/mm/damon: recommend academic papers to read and/or cite
mm: define general function pXd_init()
kmemleak: iommu/iova: fix transient kmemleak false positive
mm/list_lru: simplify the list_lru walk callback function
mm/list_lru: split the lock to per-cgroup scope
mm/list_lru: simplify reparenting and initial allocation
mm/list_lru: code clean up for reparenting
mm/list_lru: don't export list_lru_add
mm/list_lru: don't pass unnecessary key parameters
kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
...
|
|
|
|
341d041daa |
iommufd 6.13 merge window pull
Several new features and uAPI for iommufd:
- IOMMU_IOAS_MAP_FILE allows passing in a file descriptor as the backing
memory for an iommu mapping. To date VFIO/iommufd have used VMA's and
pin_user_pages(), this now allows using memfds and memfd_pin_folios().
Notably this creates a pure folio path from the memfd to the iommu page
table where memory is never broken down to PAGE_SIZE.
- IOMMU_IOAS_CHANGE_PROCESS moves the pinned page accounting between two
processes. Combined with the above this allows iommufd to support a VMM
re-start using exec() where something like qemu would exec() a new
version of itself and fd pass the memfds/iommufd/etc to the new
process. The memfd allows DMA access to the memory to continue while
the new process is getting setup, and the CHANGE_PROCESS updates all
the accounting.
- Support for fault reporting to userspace on non-PRI HW, such as ARM
stall-mode embedded devices.
- IOMMU_VIOMMU_ALLOC introduces the concept of a HW/driver backed virtual
iommu. This will be used by VMMs to access hardware features that are
contained with in a VM. The first use is to inform the kernel of the
virtual SID to physical SID mapping when issuing SID based invalidation
on ARM. Further uses will tie HW features that are directly accessed by
the VM, such as invalidation queue assignment and others.
- IOMMU_VDEVICE_ALLOC informs the kernel about the mapping of virtual
device to physical device within a VIOMMU. Minimially this is used to
translate VM issued cache invalidation commands from virtual to physical
device IDs.
- Enhancements to IOMMU_HWPT_INVALIDATE and IOMMU_HWPT_ALLOC to work with
the VIOMMU
- ARM SMMuv3 support for nested translation. Using the VIOMMU and VDEVICE
the driver can model this HW's behavior for nested translation. This
includes a shared branch from Will.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZzzKKwAKCRCFwuHvBreF
YaCMAQDOQAgw87eUYKnY7vFodlsTUA2E8uSxDmk6nPWySd0NKwD/flOP85MdEs9O
Ot+RoL4/J3IyNH+eg5kN68odmx4mAw8=
=ec8x
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"Several new features and uAPI for iommufd:
- IOMMU_IOAS_MAP_FILE allows passing in a file descriptor as the
backing memory for an iommu mapping. To date VFIO/iommufd have used
VMA's and pin_user_pages(), this now allows using memfds and
memfd_pin_folios(). Notably this creates a pure folio path from the
memfd to the iommu page table where memory is never broken down to
PAGE_SIZE.
- IOMMU_IOAS_CHANGE_PROCESS moves the pinned page accounting between
two processes. Combined with the above this allows iommufd to
support a VMM re-start using exec() where something like qemu would
exec() a new version of itself and fd pass the memfds/iommufd/etc
to the new process. The memfd allows DMA access to the memory to
continue while the new process is getting setup, and the
CHANGE_PROCESS updates all the accounting.
- Support for fault reporting to userspace on non-PRI HW, such as ARM
stall-mode embedded devices.
- IOMMU_VIOMMU_ALLOC introduces the concept of a HW/driver backed
virtual iommu. This will be used by VMMs to access hardware
features that are contained with in a VM. The first use is to
inform the kernel of the virtual SID to physical SID mapping when
issuing SID based invalidation on ARM. Further uses will tie HW
features that are directly accessed by the VM, such as invalidation
queue assignment and others.
- IOMMU_VDEVICE_ALLOC informs the kernel about the mapping of virtual
device to physical device within a VIOMMU. Minimially this is used
to translate VM issued cache invalidation commands from virtual to
physical device IDs.
- Enhancements to IOMMU_HWPT_INVALIDATE and IOMMU_HWPT_ALLOC to work
with the VIOMMU
- ARM SMMuv3 support for nested translation. Using the VIOMMU and
VDEVICE the driver can model this HW's behavior for nested
translation. This includes a shared branch from Will"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (51 commits)
iommu/arm-smmu-v3: Import IOMMUFD module namespace
iommufd: IOMMU_IOAS_CHANGE_PROCESS selftest
iommufd: Add IOMMU_IOAS_CHANGE_PROCESS
iommufd: Lock all IOAS objects
iommufd: Export do_update_pinned
iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object
iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Use S2FWB for NESTED domains
iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC
Documentation: userspace-api: iommufd: Update vDEVICE
iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
iommufd/selftest: Add mock_viommu_cache_invalidate
iommufd/viommu: Add iommufd_viommu_find_dev helper
iommu: Add iommu_copy_struct_from_full_user_array helper
iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
iommu/viommu: Add cache_invalidate to iommufd_viommu_ops
iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage
iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl
...
|
|
|
|
fcc79e1714 |
Networking changes for 6.13.
The most significant set of changes is the per netns RTNL. The new
behavior is disabled by default, regression risk should be contained.
Notably the new config knob PTP_1588_CLOCK_VMCLOCK will inherit its
default value from PTP_1588_CLOCK_KVM, as the first is intended to be
a more reliable replacement for the latter.
Core
----
- Started a very large, in-progress, effort to make the RTNL lock
scope per network-namespace, thus reducing the lock contention
significantly in the containerized use-case, comprising:
- RCU-ified some relevant slices of the FIB control path
- introduce basic per netns locking helpers
- namespacified the IPv4 address hash table
- remove rtnl_register{,_module}() in favour of rtnl_register_many()
- refactor rtnl_{new,del,set}link() moving as much validation as
possible out of RTNL lock
- convert all phonet doit() and dumpit() handlers to RCU
- convert IPv4 addresses manipulation to per-netns RTNL
- convert virtual interface creation to per-netns RTNL
the per-netns lock infra is guarded by the CONFIG_DEBUG_NET_SMALL_RTNL
knob, disabled by default ad interim.
- Introduce NAPI suspension, to efficiently switching between busy
polling (NAPI processing suspended) and normal processing.
- Migrate the IPv4 routing input, output and control path from direct
ToS usage to DSCP macros. This is a work in progress to make ECN
handling consistent and reliable.
- Add drop reasons support to the IPv4 rotue input path, allowing
better introspection in case of packets drop.
- Make FIB seqnum lockless, dropping RTNL protection for read
access.
- Make inet{,v6} addresses hashing less predicable.
- Allow providing timestamp OPT_ID via cmsg, to correlate TX packets
and timestamps
Things we sprinkled into general kernel code
--------------------------------------------
- Add small file operations for debugfs, to reduce the struct ops size.
- Refactoring and optimization for the implementation of page_frag API,
This is a preparatory work to consolidate the page_frag
implementation.
Netfilter
---------
- Optimize set element transactions to reduce memory consumption
- Extended netlink error reporting for attribute parser failure.
- Make legacy xtables configs user selectable, giving users
the option to configure iptables without enabling any other config.
- Address a lot of false-positive RCU issues, pointed by recent
CI improvements.
BPF
---
- Put xsk sockets on a struct diet and add various cleanups. Overall,
this helps to bump performance by 12% for some workloads.
- Extend BPF selftests to increase coverage of XDP features in
combination with BPF cpumap.
- Optimize and homogenize bpf_csum_diff helper for all archs and also
add a batch of new BPF selftests for it.
- Extend netkit with an option to delegate skb->{mark,priority}
scrubbing to its BPF program.
- Make the bpf_get_netns_cookie() helper available also to tc(x) BPF
programs.
Protocols
---------
- Introduces 4-tuple hash for connected udp sockets, speeding-up
significantly connected sockets lookup.
- Add a fastpath for some TCP timers that usually expires after close,
the socket lock contention.
- Add inbound and outbound xfrm state caches to speed up state lookups.
- Avoid sending MPTCP advertisements on stale subflows, reducing
risks on loosing them.
- Make neighbours table flushing more scalable, maintaining per device
neigh lists.
Driver API
----------
- Introduce a unified interface to configure transmission H/W shaping,
and expose it to user-space via generic-netlink.
- Add support for per-NAPI config via netlink. This makes napi
configuration persistent across queues removal and re-creation.
Requires driver updates, currently supported drivers are:
nVidia/Mellanox mlx4 and mlx5, Broadcom brcm and Intel ice.
- Add ethtool support for writing SFP / PHY firmware blocks.
- Track RSS context allocation from ethtool core.
- Implement support for mirroring to DSA CPU port, via TC mirror
offload.
- Consolidate FDB updates notification, to avoid duplicates on
device-specific entries.
- Expose DPLL clock quality level to the user-space.
- Support master-slave PHY config via device tree.
Tests and tooling
-----------------
- forwarding: introduce deferred commands, to simplify
the cleanup phase
Drivers
-------
- Updated several drivers - Amazon vNic, Google vNic, Microsoft vNic,
Intel e1000e and Broadcom Tigon3 - to use netdev-genl to link the
IRQs and queues to NAPI IDs, allowing busy polling and better
introspection.
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- mlx5:
- a large refactor to implement support for cross E-Switch
scheduling
- refactor H/W conter management to let it scale better
- H/W GRO cleanups
- Intel (100G, ice)::
- adds support for ethtool reset
- implement support for per TX queue H/W shaping
- AMD/Solarflare:
- implement per device queue stats support
- Broadcom (bnxt):
- improve wildcard l4proto on IPv4/IPv6 ntuple rules
- Marvell Octeon:
- Adds representor support for each Resource Virtualization Unit
(RVU) device.
- Hisilicon:
- adds support for the BMC Gigabit Ethernet
- IBM (EMAC):
- driver cleanup and modernization
- Cisco (VIC):
- raise the queues number limit to 256
- Ethernet virtual:
- Google vNIC:
- implements page pool support
- macsec:
- inherit lower device's features and TSO limits when offloading
- virtio_net:
- enable premapped mode by default
- support for XDP socket(AF_XDP) zerocopy TX
- wireguard:
- set the TSO max size to be GSO_MAX_SIZE, to aggregate larger
packets.
- Ethernet NICs embedded and virtual:
- Broadcom ASP:
- enable software timestamping
- Freescale:
- add enetc4 PF driver
- MediaTek: Airoha SoC:
- implement BQL support
- RealTek r8169:
- enable TSO by default on r8168/r8125
- implement extended ethtool stats
- Renesas AVB:
- enable TX checksum offload
- Synopsys (stmmac):
- support header splitting for vlan tagged packets
- move common code for DWMAC4 and DWXGMAC into a separate FPE
module.
- Add the dwmac driver support for T-HEAD TH1520 SoC
- Synopsys (xpcs):
- driver refactor and cleanup
- TI:
- icssg_prueth: add VLAN offload support
- Xilinx emaclite:
- adds clock support
- Ethernet switches:
- Microchip:
- implement support for the lan969x Ethernet switch family
- add LAN9646 switch support to KSZ DSA driver
- Ethernet PHYs:
- Marvel: 88q2x: enable auto negotiation
- Microchip: add support for LAN865X Rev B1 and LAN867X Rev C1/C2
- PTP:
- Add support for the Amazon virtual clock device
- Add PtP driver for s390 clocks
- WiFi:
- mac80211
- EHT 1024 aggregation size for transmissions
- new operation to indicate that a new interface is to be added
- support radio separation of multi-band devices
- move wireless extension spy implementation to libiw
- Broadcom:
- brcmfmac: optional LPO clock support
- Microchip:
- add support for Atmel WILC3000
- Qualcomm (ath12k):
- firmware coredump collection support
- add debugfs support for a multitude of statistics
- Qualcomm (ath5k):
- Arcadyan ARV45XX AR2417 & Gigaset SX76[23] AR241[34]A support
- Realtek:
- rtw88: 8821au and 8812au USB adapters support
- rtw89: add thermal protection
- rtw89: fine tune BT-coexsitence to improve user experience
- rtw89: firmware secure boot for WiFi 6 chip
- Bluetooth
- add Qualcomm WCN785x support for ids Foxconn 0xe0fc/0xe0f3 and
0x13d3:0x3623
- add Realtek RTL8852BE support for id Foxconn 0xe123
- add MediaTek MT7920 support for wireless module ids
- btintel_pcie: add handshake between driver and firmware
- btintel_pcie: add recovery mechanism
- btnxpuart: add GPIO support to power save feature
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmc8sukSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkLEYQAIMM6Qjh0bh3Byr3gOS1xZzXG+APLjP4
9Jr0p3i+X53i90jvVqzeVO5FTc95MVHSKZ3kvPkDMXSLUaEJxocNHCI5Dzl/2/qL
wWdpUB6/ou+jKB4Bn6Z8OvVODT7qrr0tVa9M2/fuKWrIsOU/ntIhG8EhnGddk5U/
vKPSf5PUIb81uNRnF58VusY3wrT1dEoh9VfJYxL+ST+inPxjEAMy6Y+lmlsjGaSX
jrS+Pp9KYiUwl3Qt0AQs+cG4OHkJdjbnChrfosWwpkiyddO8klVq06+wX/TiSzfF
b9VZtBfy/GZs3lkE1mQkcILdtX5pP3YHQdpsuxFfVI0JHVszx2ck7WdoRux/8F0v
kKZsYcO7bH9I1wMFP66Ff9hIbdEQaeucK+KdDkXyPNMfP91Vzmfjii8IBxOC36Ie
BbOeFUrXyTxxJ2u0vf/X9JtIq8bcrkNrSd1n1jlGPMqG3FVzsY95+Oi4qfsyeUbl
lS1PlVTqPMPFdX54HnxM3y2rJjhd7iXhkvmtuXNjRFThXlOiK3maAPWlM1aZ3b8u
Vjs4JFUsW0tleZG+RzANjsGjXbf7AiPUGLZt+acem0K+fcjG4i5aGIAJrxwa/ORx
eG74IZRt5cOI371W7gNLGHjwnuge8tFPgOWcRP2eozNm7jvMYALBejYS7eWUTvaf
THcvVM+bupEZ
=GzPr
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"The most significant set of changes is the per netns RTNL. The new
behavior is disabled by default, regression risk should be contained.
Notably the new config knob PTP_1588_CLOCK_VMCLOCK will inherit its
default value from PTP_1588_CLOCK_KVM, as the first is intended to be
a more reliable replacement for the latter.
Core:
- Started a very large, in-progress, effort to make the RTNL lock
scope per network-namespace, thus reducing the lock contention
significantly in the containerized use-case, comprising:
- RCU-ified some relevant slices of the FIB control path
- introduce basic per netns locking helpers
- namespacified the IPv4 address hash table
- remove rtnl_register{,_module}() in favour of
rtnl_register_many()
- refactor rtnl_{new,del,set}link() moving as much validation as
possible out of RTNL lock
- convert all phonet doit() and dumpit() handlers to RCU
- convert IPv4 addresses manipulation to per-netns RTNL
- convert virtual interface creation to per-netns RTNL
the per-netns lock infrastructure is guarded by the
CONFIG_DEBUG_NET_SMALL_RTNL knob, disabled by default ad interim.
- Introduce NAPI suspension, to efficiently switching between busy
polling (NAPI processing suspended) and normal processing.
- Migrate the IPv4 routing input, output and control path from direct
ToS usage to DSCP macros. This is a work in progress to make ECN
handling consistent and reliable.
- Add drop reasons support to the IPv4 rotue input path, allowing
better introspection in case of packets drop.
- Make FIB seqnum lockless, dropping RTNL protection for read access.
- Make inet{,v6} addresses hashing less predicable.
- Allow providing timestamp OPT_ID via cmsg, to correlate TX packets
and timestamps
Things we sprinkled into general kernel code:
- Add small file operations for debugfs, to reduce the struct ops
size.
- Refactoring and optimization for the implementation of page_frag
API, This is a preparatory work to consolidate the page_frag
implementation.
Netfilter:
- Optimize set element transactions to reduce memory consumption
- Extended netlink error reporting for attribute parser failure.
- Make legacy xtables configs user selectable, giving users the
option to configure iptables without enabling any other config.
- Address a lot of false-positive RCU issues, pointed by recent CI
improvements.
BPF:
- Put xsk sockets on a struct diet and add various cleanups. Overall,
this helps to bump performance by 12% for some workloads.
- Extend BPF selftests to increase coverage of XDP features in
combination with BPF cpumap.
- Optimize and homogenize bpf_csum_diff helper for all archs and also
add a batch of new BPF selftests for it.
- Extend netkit with an option to delegate skb->{mark,priority}
scrubbing to its BPF program.
- Make the bpf_get_netns_cookie() helper available also to tc(x) BPF
programs.
Protocols:
- Introduces 4-tuple hash for connected udp sockets, speeding-up
significantly connected sockets lookup.
- Add a fastpath for some TCP timers that usually expires after
close, the socket lock contention.
- Add inbound and outbound xfrm state caches to speed up state
lookups.
- Avoid sending MPTCP advertisements on stale subflows, reducing
risks on loosing them.
- Make neighbours table flushing more scalable, maintaining per
device neigh lists.
Driver API:
- Introduce a unified interface to configure transmission H/W
shaping, and expose it to user-space via generic-netlink.
- Add support for per-NAPI config via netlink. This makes napi
configuration persistent across queues removal and re-creation.
Requires driver updates, currently supported drivers are:
nVidia/Mellanox mlx4 and mlx5, Broadcom brcm and Intel ice.
- Add ethtool support for writing SFP / PHY firmware blocks.
- Track RSS context allocation from ethtool core.
- Implement support for mirroring to DSA CPU port, via TC mirror
offload.
- Consolidate FDB updates notification, to avoid duplicates on
device-specific entries.
- Expose DPLL clock quality level to the user-space.
- Support master-slave PHY config via device tree.
Tests and tooling:
- forwarding: introduce deferred commands, to simplify the cleanup
phase
Drivers:
- Updated several drivers - Amazon vNic, Google vNic, Microsoft vNic,
Intel e1000e and Broadcom Tigon3 - to use netdev-genl to link the
IRQs and queues to NAPI IDs, allowing busy polling and better
introspection.
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- mlx5:
- a large refactor to implement support for cross E-Switch
scheduling
- refactor H/W conter management to let it scale better
- H/W GRO cleanups
- Intel (100G, ice)::
- add support for ethtool reset
- implement support for per TX queue H/W shaping
- AMD/Solarflare:
- implement per device queue stats support
- Broadcom (bnxt):
- improve wildcard l4proto on IPv4/IPv6 ntuple rules
- Marvell Octeon:
- Add representor support for each Resource Virtualization Unit
(RVU) device.
- Hisilicon:
- add support for the BMC Gigabit Ethernet
- IBM (EMAC):
- driver cleanup and modernization
- Cisco (VIC):
- raise the queues number limit to 256
- Ethernet virtual:
- Google vNIC:
- implement page pool support
- macsec:
- inherit lower device's features and TSO limits when
offloading
- virtio_net:
- enable premapped mode by default
- support for XDP socket(AF_XDP) zerocopy TX
- wireguard:
- set the TSO max size to be GSO_MAX_SIZE, to aggregate larger
packets.
- Ethernet NICs embedded and virtual:
- Broadcom ASP:
- enable software timestamping
- Freescale:
- add enetc4 PF driver
- MediaTek: Airoha SoC:
- implement BQL support
- RealTek r8169:
- enable TSO by default on r8168/r8125
- implement extended ethtool stats
- Renesas AVB:
- enable TX checksum offload
- Synopsys (stmmac):
- support header splitting for vlan tagged packets
- move common code for DWMAC4 and DWXGMAC into a separate FPE
module.
- add dwmac driver support for T-HEAD TH1520 SoC
- Synopsys (xpcs):
- driver refactor and cleanup
- TI:
- icssg_prueth: add VLAN offload support
- Xilinx emaclite:
- add clock support
- Ethernet switches:
- Microchip:
- implement support for the lan969x Ethernet switch family
- add LAN9646 switch support to KSZ DSA driver
- Ethernet PHYs:
- Marvel: 88q2x: enable auto negotiation
- Microchip: add support for LAN865X Rev B1 and LAN867X Rev C1/C2
- PTP:
- Add support for the Amazon virtual clock device
- Add PtP driver for s390 clocks
- WiFi:
- mac80211
- EHT 1024 aggregation size for transmissions
- new operation to indicate that a new interface is to be added
- support radio separation of multi-band devices
- move wireless extension spy implementation to libiw
- Broadcom:
- brcmfmac: optional LPO clock support
- Microchip:
- add support for Atmel WILC3000
- Qualcomm (ath12k):
- firmware coredump collection support
- add debugfs support for a multitude of statistics
- Qualcomm (ath5k):
- Arcadyan ARV45XX AR2417 & Gigaset SX76[23] AR241[34]A support
- Realtek:
- rtw88: 8821au and 8812au USB adapters support
- rtw89: add thermal protection
- rtw89: fine tune BT-coexsitence to improve user experience
- rtw89: firmware secure boot for WiFi 6 chip
- Bluetooth
- add Qualcomm WCN785x support for ids Foxconn 0xe0fc/0xe0f3 and
0x13d3:0x3623
- add Realtek RTL8852BE support for id Foxconn 0xe123
- add MediaTek MT7920 support for wireless module ids
- btintel_pcie: add handshake between driver and firmware
- btintel_pcie: add recovery mechanism
- btnxpuart: add GPIO support to power save feature"
* tag 'net-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1475 commits)
mm: page_frag: fix a compile error when kernel is not compiled
Documentation: tipc: fix formatting issue in tipc.rst
selftests: nic_performance: Add selftest for performance of NIC driver
selftests: nic_link_layer: Add selftest case for speed and duplex states
selftests: nic_link_layer: Add link layer selftest for NIC driver
bnxt_en: Add FW trace coredump segments to the coredump
bnxt_en: Add a new ethtool -W dump flag
bnxt_en: Add 2 parameters to bnxt_fill_coredump_seg_hdr()
bnxt_en: Add functions to copy host context memory
bnxt_en: Do not free FW log context memory
bnxt_en: Manage the FW trace context memory
bnxt_en: Allocate backing store memory for FW trace logs
bnxt_en: Add a 'force' parameter to bnxt_free_ctx_mem()
bnxt_en: Refactor bnxt_free_ctx_mem()
bnxt_en: Add mem_valid bit to struct bnxt_ctx_mem_type
bnxt_en: Update firmware interface spec to 1.10.3.85
selftests/bpf: Add some tests with sockmap SK_PASS
bpf: fix recursive lock when verdict program return SK_PASS
wireguard: device: support big tcp GSO
wireguard: selftests: load nf_conntrack if not present
...
|
|
|
|
6e95ef0258 |
bpf-next-bpf-next-6.13
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmc7hIQACgkQ6rmadz2v bTrcRA/+MsUOzJPnjokonHwk8X4KQM21gOua/sUcGArLVGF/JoW5/b1W8UBQ0y5+ +okYaRNGpwF0/2S8M5FAYpM7VSPLl1U7Rihr55I63D9kbAo0pDQwpn4afQFuZhaC l7MzkhBHS7XXx5/70APOzy3kz1GDYvz39jiWuAAhRqVejFO+fa4pDz4W+Ht7jYTQ jJOLn4vJna9fSfVf/U/bbdz5lL0lncIiEnRIEbF7EszbF2CA7sa+/KFENGM7ChEo UlxK2Xz5fpzgT6htZRjMr6jmupfg7gzdT4moOysQQcjkllvv6/4MD0s/GLShtG9H SmpaptpYCEGXLuApGzkSddwiT6iUMTqQr7zs6LPp0gPh+4Z0sSPNoBtBp2v0aVDl w0zhVhMfoF66rMG+IZY684CsMGg5h8UsOS46KLjSU0fW2HpGM7+zZLpXOaGkU3OH UV0womPT/C2kS2fpOn9F91O8qMjOZ4EXd+zuRtIRv9CeuVIpCT9R13lEYn+wfr6d aUci8wybha1UOAvkRiXiqWOPS+0Z/arrSbCSDMQF6DevLpQl0noVbTVssWXcRdUE 9Ve6J0yS29WxNWFtuuw4xP5NcG1AnRXVGh215TuVBX7xK9X/hnDDhfalltsjXfnd m1f64FxU2SGp2D7X8BX/6Aeyo6mITE6I3SNMUrcvk1Zid36zhy8= =TXGS -----END PGP SIGNATURE----- Merge tag 'bpf-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Add BPF uprobe session support (Jiri Olsa) - Optimize uprobe performance (Andrii Nakryiko) - Add bpf_fastcall support to helpers and kfuncs (Eduard Zingerman) - Avoid calling free_htab_elem() under hash map bucket lock (Hou Tao) - Prevent tailcall infinite loop caused by freplace (Leon Hwang) - Mark raw_tracepoint arguments as nullable (Kumar Kartikeya Dwivedi) - Introduce uptr support in the task local storage map (Martin KaFai Lau) - Stringify errno log messages in libbpf (Mykyta Yatsenko) - Add kmem_cache BPF iterator for perf's lock profiling (Namhyung Kim) - Support BPF objects of either endianness in libbpf (Tony Ambardar) - Add ksym to struct_ops trampoline to fix stack trace (Xu Kuohai) - Introduce private stack for eligible BPF programs (Yonghong Song) - Migrate samples/bpf tests to selftests/bpf test_progs (Daniel T. Lee) - Migrate test_sock to selftests/bpf test_progs (Jordan Rife) * tag 'bpf-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (152 commits) libbpf: Change hash_combine parameters from long to unsigned long selftests/bpf: Fix build error with llvm 19 libbpf: Fix memory leak in bpf_program__attach_uprobe_multi bpf: use common instruction history across all states bpf: Add necessary migrate_disable to range_tree. bpf: Do not alloc arena on unsupported arches selftests/bpf: Set test path for token/obj_priv_implicit_token_envvar selftests/bpf: Add a test for arena range tree algorithm bpf: Introduce range_tree data structure and use it in bpf arena samples/bpf: Remove unused variable in xdp2skb_meta_kern.c samples/bpf: Remove unused variables in tc_l2_redirect_kern.c bpftool: Cast variable `var` to long long bpf, x86: Propagate tailcall info only for subprogs bpf: Add kernel symbol for struct_ops trampoline bpf: Use function pointers count as struct_ops links count bpf: Remove unused member rcu from bpf_struct_ops_map selftests/bpf: Add struct_ops prog private stack tests bpf: Support private stack for struct_ops progs selftests/bpf: Add tracing prog private stack tests bpf, x86: Support private stack in jit ... |
|
|
|
9008fe8fad |
slab: Fix too strict alignment check in create_cache()
On m68k, where the minimum alignment of unsigned long is 2 bytes:
Kernel panic - not syncing: __kmem_cache_create_args: Failed to create slab 'io_kiocb'. Error -22
CPU: 0 UID: 0 PID: 1 Comm: swapper Not tainted 6.12.0-atari-03776-g7eaa1f99261a #1783
Stack from 0102fe5c:
0102fe5c 00514a2b 00514a2b ffffff00 00000001 0051f5ed 00425e78 00514a2b
0041eb74 ffffffea 00000310 0051f5ed ffffffea ffffffea 00601f60 00000044
0102ff20 000e7a68 0051ab8e 004383b8 0051f5ed ffffffea 000000b8 00000007
01020c00 00000000 000e77f0 0041e5f0 005f67c0 0051f5ed 000000b6 0102fef4
00000310 0102fef4 00000000 00000016 005f676c 0060a34c 00000010 00000004
00000038 0000009a 01000000 000000b8 005f668e 0102e000 00001372 0102ff88
Call Trace: [<00425e78>] dump_stack+0xc/0x10
[<0041eb74>] panic+0xd8/0x26c
[<000e7a68>] __kmem_cache_create_args+0x278/0x2e8
[<000e77f0>] __kmem_cache_create_args+0x0/0x2e8
[<0041e5f0>] memset+0x0/0x8c
[<005f67c0>] io_uring_init+0x54/0xd2
The minimal alignment of an integral type may differ from its size,
hence is not safe to assume that an arbitrary freeptr_t (which is
basically an unsigned long) is always aligned to 4 or 8 bytes.
As nothing seems to require the additional alignment, it is safe to fix
this by relaxing the check to the actual minimum alignment of freeptr_t.
Fixes:
|
|
|
|
bf9aa14fc5 |
A rather large update for timekeeping and timers:
- The final step to get rid of auto-rearming posix-timers
posix-timers are currently auto-rearmed by the kernel when the signal
of the timer is ignored so that the timer signal can be delivered once
the corresponding signal is unignored.
This requires to throttle the timer to prevent a DoS by small intervals
and keeps the system pointlessly out of low power states for no value.
This is a long standing non-trivial problem due to the lock order of
posix-timer lock and the sighand lock along with life time issues as
the timer and the sigqueue have different life time rules.
Cure this by:
* Embedding the sigqueue into the timer struct to have the same life
time rules. Aside of that this also avoids the lookup of the timer
in the signal delivery and rearm path as it's just a always valid
container_of() now.
* Queuing ignored timer signals onto a seperate ignored list.
* Moving queued timer signals onto the ignored list when the signal is
switched to SIG_IGN before it could be delivered.
* Walking the ignored list when SIG_IGN is lifted and requeue the
signals to the actual signal lists. This allows the signal delivery
code to rearm the timer.
This also required to consolidate the signal delivery rules so they are
consistent across all situations. With that all self test scenarios
finally succeed.
- Core infrastructure for VFS multigrain timestamping
This is required to allow the kernel to use coarse grained time stamps
by default and switch to fine grained time stamps when inode attributes
are actively observed via getattr().
These changes have been provided to the VFS tree as well, so that the
VFS specific infrastructure could be built on top.
- Cleanup and consolidation of the sleep() infrastructure
* Move all sleep and timeout functions into one file
* Rework udelay() and ndelay() into proper documented inline functions
and replace the hardcoded magic numbers by proper defines.
* Rework the fsleep() implementation to take the reality of the timer
wheel granularity on different HZ values into account. Right now the
boundaries are hard coded time ranges which fail to provide the
requested accuracy on different HZ settings.
* Update documentation for all sleep/timeout related functions and fix
up stale documentation links all over the place
* Fixup a few usage sites
- Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks
A system can have multiple PTP clocks which are participating in
seperate and independent PTP clock domains. So far the kernel only
considers the PTP clock which is based on CLOCK TAI relevant as that's
the clock which drives the timekeeping adjustments via the various user
space daemons through adjtimex(2).
The non TAI based clock domains are accessible via the file descriptor
based posix clocks, but their usability is very limited. They can't be
accessed fast as they always go all the way out to the hardware and
they cannot be utilized in the kernel itself.
As Time Sensitive Networking (TSN) gains traction it is required to
provide fast user and kernel space access to these clocks.
The approach taken is to utilize the timekeeping and adjtimex(2)
infrastructure to provide this access in a similar way how the kernel
provides access to clock MONOTONIC, REALTIME etc.
Instead of creating a duplicated infrastructure this rework converts
timekeeping and adjtimex(2) into generic functionality which operates
on pointers to data structures instead of using static variables.
This allows to provide time accessors and adjtimex(2) functionality for
the independent PTP clocks in a subsequent step.
- Consolidate hrtimer initialization
hrtimers are set up by initializing the data structure and then
seperately setting the callback function for historical reasons.
That's an extra unnecessary step and makes Rust support less straight
forward than it should be.
Provide a new set of hrtimer_setup*() functions and convert the core
code and a few usage sites of the less frequently used interfaces over.
The bulk of the htimer_init() to hrtimer_setup() conversion is already
prepared and scheduled for the next merge window.
- Drivers:
* Ensure that the global timekeeping clocksource is utilizing the
cluster 0 timer on MIPS multi-cluster systems.
Otherwise CPUs on different clusters use their cluster specific
clocksource which is not guaranteed to be synchronized with other
clusters.
* Mostly boring cleanups, fixes, improvements and code movement
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7kPITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZKkD/9OUL6fOJrDUmOYBa4QVeMyfTef4EaL
tvwIMM/29XQFeiq3xxCIn+EMnHjXn2lvIhYGQ7GKsbKYwvJ7ZBDpQb+UMhZ2nKI9
6D6BP6WomZohKeH2fZbJQAdqOi3KRYdvQdIsVZUexkqiaVPphRvOH9wOr45gHtZM
EyMRSotPlQTDqcrbUejDMEO94GyjDCYXRsyATLxjmTzL/N4xD4NRIiotjM2vL/a9
8MuCgIhrKUEyYlFoOxxeokBsF3kk3/ez2jlG9b/N8VLH3SYIc2zgL58FBgWxlmgG
bY71nVG3nUgEjxBd2dcXAVVqvb+5widk8p6O7xxOAQKTLMcJ4H0tQDkMnzBtUzvB
DGAJDHAmAr0g+ja9O35Pkhunkh4HYFIbq0Il4d1HMKObhJV0JumcKuQVxrXycdm3
UZfq3seqHsZJQbPgCAhlFU0/2WWScocbee9bNebGT33KVwSp5FoVv89C/6Vjb+vV
Gusc3thqrQuMAZW5zV8g4UcBAA/xH4PB0I+vHib+9XPZ4UQ7/6xKl2jE0kd5hX7n
AAUeZvFNFqIsY+B6vz+Jx/yzyM7u5cuXq87pof5EHVFzv56lyTp4ToGcOGYRgKH5
JXeYV1OxGziSDrd5vbf9CzdWMzqMvTefXrHbWrjkjhNOe8E1A8O88RZ5uRKZhmSw
hZZ4hdM9+3T7cg==
=2VC6
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"A rather large update for timekeeping and timers:
- The final step to get rid of auto-rearming posix-timers
posix-timers are currently auto-rearmed by the kernel when the
signal of the timer is ignored so that the timer signal can be
delivered once the corresponding signal is unignored.
This requires to throttle the timer to prevent a DoS by small
intervals and keeps the system pointlessly out of low power states
for no value. This is a long standing non-trivial problem due to
the lock order of posix-timer lock and the sighand lock along with
life time issues as the timer and the sigqueue have different life
time rules.
Cure this by:
- Embedding the sigqueue into the timer struct to have the same
life time rules. Aside of that this also avoids the lookup of
the timer in the signal delivery and rearm path as it's just a
always valid container_of() now.
- Queuing ignored timer signals onto a seperate ignored list.
- Moving queued timer signals onto the ignored list when the
signal is switched to SIG_IGN before it could be delivered.
- Walking the ignored list when SIG_IGN is lifted and requeue the
signals to the actual signal lists. This allows the signal
delivery code to rearm the timer.
This also required to consolidate the signal delivery rules so they
are consistent across all situations. With that all self test
scenarios finally succeed.
- Core infrastructure for VFS multigrain timestamping
This is required to allow the kernel to use coarse grained time
stamps by default and switch to fine grained time stamps when inode
attributes are actively observed via getattr().
These changes have been provided to the VFS tree as well, so that
the VFS specific infrastructure could be built on top.
- Cleanup and consolidation of the sleep() infrastructure
- Move all sleep and timeout functions into one file
- Rework udelay() and ndelay() into proper documented inline
functions and replace the hardcoded magic numbers by proper
defines.
- Rework the fsleep() implementation to take the reality of the
timer wheel granularity on different HZ values into account.
Right now the boundaries are hard coded time ranges which fail
to provide the requested accuracy on different HZ settings.
- Update documentation for all sleep/timeout related functions
and fix up stale documentation links all over the place
- Fixup a few usage sites
- Rework of timekeeping and adjtimex(2) to prepare for multiple PTP
clocks
A system can have multiple PTP clocks which are participating in
seperate and independent PTP clock domains. So far the kernel only
considers the PTP clock which is based on CLOCK TAI relevant as
that's the clock which drives the timekeeping adjustments via the
various user space daemons through adjtimex(2).
The non TAI based clock domains are accessible via the file
descriptor based posix clocks, but their usability is very limited.
They can't be accessed fast as they always go all the way out to
the hardware and they cannot be utilized in the kernel itself.
As Time Sensitive Networking (TSN) gains traction it is required to
provide fast user and kernel space access to these clocks.
The approach taken is to utilize the timekeeping and adjtimex(2)
infrastructure to provide this access in a similar way how the
kernel provides access to clock MONOTONIC, REALTIME etc.
Instead of creating a duplicated infrastructure this rework
converts timekeeping and adjtimex(2) into generic functionality
which operates on pointers to data structures instead of using
static variables.
This allows to provide time accessors and adjtimex(2) functionality
for the independent PTP clocks in a subsequent step.
- Consolidate hrtimer initialization
hrtimers are set up by initializing the data structure and then
seperately setting the callback function for historical reasons.
That's an extra unnecessary step and makes Rust support less
straight forward than it should be.
Provide a new set of hrtimer_setup*() functions and convert the
core code and a few usage sites of the less frequently used
interfaces over.
The bulk of the htimer_init() to hrtimer_setup() conversion is
already prepared and scheduled for the next merge window.
- Drivers:
- Ensure that the global timekeeping clocksource is utilizing the
cluster 0 timer on MIPS multi-cluster systems.
Otherwise CPUs on different clusters use their cluster specific
clocksource which is not guaranteed to be synchronized with
other clusters.
- Mostly boring cleanups, fixes, improvements and code movement"
* tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
posix-timers: Fix spurious warning on double enqueue versus do_exit()
clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties
clocksource/drivers/gpx: Remove redundant casts
clocksource/drivers/timer-ti-dm: Fix child node refcount handling
dt-bindings: timer: actions,owl-timer: convert to YAML
clocksource/drivers/ralink: Add Ralink System Tick Counter driver
clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource
clocksource/drivers/timer-ti-dm: Don't fail probe if int not found
clocksource/drivers:sp804: Make user selectable
clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions
hrtimers: Delete hrtimer_init_on_stack()
alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
io_uring: Switch to use hrtimer_setup_on_stack()
sched/idle: Switch to use hrtimer_setup_on_stack()
hrtimers: Delete hrtimer_init_sleeper_on_stack()
wait: Switch to use hrtimer_setup_sleeper_on_stack()
timers: Switch to use hrtimer_setup_sleeper_on_stack()
net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack()
futex: Switch to use hrtimer_setup_sleeper_on_stack()
fs/aio: Switch to use hrtimer_setup_sleeper_on_stack()
...
|
|
|
|
ba1f9c8fe3 |
arm64 updates for 6.13:
* Support for running Linux in a protected VM under the Arm Confidential
Compute Architecture (CCA)
* Guarded Control Stack user-space support. Current patches follow the
x86 ABI of implicitly creating a shadow stack on clone(). Subsequent
patches (already on the list) will add support for clone3() allowing
finer-grained control of the shadow stack size and placement from libc
* AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are
getting close with the upcoming dpISA support)
* Other arch features:
- In-kernel use of the memcpy instructions, FEAT_MOPS (previously only
exposed to user; uaccess support not merged yet)
- MTE: hugetlbfs support and the corresponding kselftests
- Optimise CRC32 using the PMULL instructions
- Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG
- Optimise the kernel TLB flushing to use the range operations
- POE/pkey (permission overlays): further cleanups after bringing the
signal handler in line with the x86 behaviour for 6.12
* arm64 perf updates:
- Support for the NXP i.MX91 PMU in the existing IMX driver
- Support for Ampere SoCs in the Designware PCIe PMU driver
- Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC
- Support for Samsung's 'Mongoose' CPU PMU
- Support for PMUv3.9 finer-grained userspace counter access control
- Switch back to platform_driver::remove() now that it returns 'void'
- Add some missing events for the CXL PMU driver
* Miscellaneous arm64 fixes/cleanups:
- Page table accessors cleanup: type updates, drop unused macros,
reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity
check addresses before runtime P4D/PUD folding
- Command line override for ID_AA64MMFR0_EL1.ECV (advertising the
FEAT_ECV for the generic timers) allowing Linux to boot with
firmware deployments that don't set SCTLR_EL3.ECVEn
- ACPI/arm64: tighten the check for the array of platform timer
structures and adjust the error handling procedure in
gtdt_parse_timer_block()
- Optimise the cache flush for the uprobes xol slot (skip if no
change) and other uprobes/kprobes cleanups
- Fix the context switching of tpidrro_el0 when kpti is enabled
- Dynamic shadow call stack fixes
- Sysreg updates
- Various arm64 kselftest improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmc5POIACgkQa9axLQDI
XvEDYA//a3eeNkgMuGdnSCVcLz+zy+oNwAwboG/4X1DqL8jiCbI4npwugPx95RIA
YZOUvo9T2aL3OyefpUHll4gFHqx9OwoZIig2F70TEUmlPsGUbh0KBkdfQF3xZPdl
EwV0kHSGEqMWMBwsGJGwgCYrUaf1MUQzh1GBl7VJ2ts5XsJBaBeOyKkysij26wtZ
V+aHq2IUx7qQS7+HC/4P6IoHxKziFcsCMovaKaynP4cw9xXBQbDMcNlHEwndOMyk
pu2zrv7GG0j3KQuVP/2Alf5FKhmI0GVGP/6Nc/zsOmw96w8Kf7HfzEtkHawr2aRq
rqg/c9ivzDn1p+fUBo4ZYtrRk4IAY+yKu6hdzdLTP5+bQrBTWTO9rjQVBm9FAGYT
sCdEj1NqzvExvNHD7X6ut/GJ05lmce3K+qeSXSEysN9gqiT3eomYWMXrD2V2lxzb
rIDDcb/icfaqjt14Mksh19r/rzNeq7noj9CGSmcqw0BHZfHzl38Lai6pdfYzCNyn
vCM/c4c1D/WWX8/lifO1JZVbhDk1jy82Iphg2KEhL8iKPxDsKBBZLmYuU1oa7tMo
WryGAz9+GQwd+W9chFuaOEtMnzvW2scEJ5Eb2fEf0Qj0aEurkL+C9dZR6o1GN77V
DBUxtU628Ef4PJJGfbNCwZzdd8UPYG3a/mKfQQ3dz0oz2LySlW4=
=wDot
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- Support for running Linux in a protected VM under the Arm
Confidential Compute Architecture (CCA)
- Guarded Control Stack user-space support. Current patches follow the
x86 ABI of implicitly creating a shadow stack on clone(). Subsequent
patches (already on the list) will add support for clone3() allowing
finer-grained control of the shadow stack size and placement from
libc
- AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are
getting close with the upcoming dpISA support)
- Other arch features:
- In-kernel use of the memcpy instructions, FEAT_MOPS (previously
only exposed to user; uaccess support not merged yet)
- MTE: hugetlbfs support and the corresponding kselftests
- Optimise CRC32 using the PMULL instructions
- Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG
- Optimise the kernel TLB flushing to use the range operations
- POE/pkey (permission overlays): further cleanups after bringing
the signal handler in line with the x86 behaviour for 6.12
- arm64 perf updates:
- Support for the NXP i.MX91 PMU in the existing IMX driver
- Support for Ampere SoCs in the Designware PCIe PMU driver
- Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC
- Support for Samsung's 'Mongoose' CPU PMU
- Support for PMUv3.9 finer-grained userspace counter access
control
- Switch back to platform_driver::remove() now that it returns
'void'
- Add some missing events for the CXL PMU driver
- Miscellaneous arm64 fixes/cleanups:
- Page table accessors cleanup: type updates, drop unused macros,
reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity
check addresses before runtime P4D/PUD folding
- Command line override for ID_AA64MMFR0_EL1.ECV (advertising the
FEAT_ECV for the generic timers) allowing Linux to boot with
firmware deployments that don't set SCTLR_EL3.ECVEn
- ACPI/arm64: tighten the check for the array of platform timer
structures and adjust the error handling procedure in
gtdt_parse_timer_block()
- Optimise the cache flush for the uprobes xol slot (skip if no
change) and other uprobes/kprobes cleanups
- Fix the context switching of tpidrro_el0 when kpti is enabled
- Dynamic shadow call stack fixes
- Sysreg updates
- Various arm64 kselftest improvements
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (168 commits)
arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled
kselftest/arm64: Try harder to generate different keys during PAC tests
kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
arm64/ptrace: Clarify documentation of VL configuration via ptrace
kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
acpi/arm64: remove unnecessary cast
arm64/mm: Change protval as 'pteval_t' in map_range()
kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c
kselftest/arm64: Add FPMR coverage to fp-ptrace
kselftest/arm64: Expand the set of ZA writes fp-ptrace does
kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
kselftest/arm64: Enable build of PAC tests with LLVM=1
kselftest/arm64: Check that SVCR is 0 in signal handlers
selftests/mm: Fix unused function warning for aarch64_write_signal_pkey()
kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests
kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
kselftest/arm64: Fix build with stricter assemblers
arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux()
arm64/scs: Deal with 64-bit relative offsets in FDE frames
...
|
|
|
|
3e7447ab48 |
A lot of miscellaneous ext4 bug fixes and cleanups this cycle, most
notably in the journaling code, bufered I/O, and compiler warning cleanups. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmc7NN4ACgkQ8vlZVpUN gaMJRAf+Oc3Tn/ZvuX0amkaBQI+ZNIeYD/U0WBSvarKb00bo1X39mM/0LovqV6ec c51iRgt8U6uDZDUm6zJtppkIUiqkHRj+TmTInueFtmUqhIg8jgfZIpxCn0QkFKnQ jI5EKCkvUqM0B347axH/s+dlOE9JBSlQNKgjkvCYOGknQ1PH6X8oMDt5QAqGEk3P Nsa4QChIxt2yujFvydgFT+RAbjvY3sNvrZ7D3B+KL3VSJpILChVZK/UdFrraSXxq mLO5j4txjtnr/OLgujCTHOfPsTiQReHHXErrSbKhnFhrTXLD0mZSUgJ6irpaxRQ5 wQHQzmsrVwqFfqPU3Hkl8FGeCR0owQ== =26/E -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A lot of miscellaneous ext4 bug fixes and cleanups this cycle, most notably in the journaling code, bufered I/O, and compiler warning cleanups" * tag 'ext4_for_linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (33 commits) jbd2: Fix comment describing journal_init_common() ext4: prevent an infinite loop in the lazyinit thread ext4: use struct_size() to improve ext4_htree_store_dirent() ext4: annotate struct fname with __counted_by() jbd2: avoid dozens of -Wflex-array-member-not-at-end warnings ext4: use str_yes_no() helper function ext4: prevent delalloc to nodelalloc on remount jbd2: make b_frozen_data allocation always succeed ext4: cleanup variable name in ext4_fc_del() ext4: use string choices helpers jbd2: remove the 'success' parameter from the jbd2_do_replay() function jbd2: remove useless 'block_error' variable jbd2: factor out jbd2_do_replay() jbd2: refactor JBD2_COMMIT_BLOCK process in do_one_pass() jbd2: unified release of buffer_head in do_one_pass() jbd2: remove redundant judgments for check v1 checksum ext4: use ERR_CAST to return an error-valued pointer mm: zero range of eof folio exposed by inode size extension ext4: partial zero eof block on unaligned inode size extension ext4: disambiguate the return value of ext4_dio_write_end_io() ... |
|
|
|
0f25f0e4ef |
the bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same
scope where they'd been created, getting rid of reassignments
and passing them by reference, converting to CLASS(fd{,_pos,_raw}).
We are getting very close to having the memory safety of that stuff
trivial to verify.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdikAAKCRBZ7Krx/gZQ
69nJAQCmbQHK3TGUbQhOw6MJXOK9ezpyEDN3FZb4jsu38vTIdgEA6OxAYDO2m2g9
CN18glYmD3wRyU6Bwl4vGODouSJvDgA=
=gVH3
-----END PGP SIGNATURE-----
Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 'struct fd' class updates from Al Viro:
"The bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same scope
where they'd been created, getting rid of reassignments and passing
them by reference, converting to CLASS(fd{,_pos,_raw}).
We are getting very close to having the memory safety of that stuff
trivial to verify"
* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
deal with the last remaing boolean uses of fd_file()
css_set_fork(): switch to CLASS(fd_raw, ...)
memcg_write_event_control(): switch to CLASS(fd)
assorted variants of irqfd setup: convert to CLASS(fd)
do_pollfd(): convert to CLASS(fd)
convert do_select()
convert vfs_dedupe_file_range().
convert cifs_ioctl_copychunk()
convert media_request_get_by_fd()
convert spu_run(2)
switch spufs_calls_{get,put}() to CLASS() use
convert cachestat(2)
convert do_preadv()/do_pwritev()
fdget(), more trivial conversions
fdget(), trivial conversions
privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
o2hb_region_dev_store(): avoid goto around fdget()/fdput()
introduce "fd_pos" class, convert fdget_pos() users to it.
fdget_raw() users: switch to CLASS(fd_raw)
convert vmsplice() to CLASS(fd)
...
|
|
|
|
7956186e75 |
vfs-6.13.tmpfs
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcZIgAKCRCRxhvAZXjc oge4AQDxhsKW+v/jKHydzqzwG3Ks7DIxrUg/mcGfdtBwjiWgvwEA8t0QAAfKECAK B0+bNKJ8XJRUtZ10Jgm3dzURbEhBWgU= =4Lui -----END PGP SIGNATURE----- Merge tag 'vfs-6.13.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull tmpfs case folding updates from Christian Brauner: "This adds case-insensitive support for tmpfs. The work contained in here adds support for case-insensitive file names lookups in tmpfs. The main difference from other casefold filesystems is that tmpfs has no information on disk, just on RAM, so we can't use mkfs to create a case-insensitive tmpfs. For this implementation, there's a mount option for casefolding. The rest of the patchset follows a similar approach as ext4 and f2fs. The use case for this feature is similar to the use case for ext4, to better support compatibility layers (like Wine), particularly in combination with sandboxing/container tools (like Flatpak). Those containerization tools can share a subset of the host filesystem with an application. In the container, the root directory and any parent directories required for a shared directory are on tmpfs, with the shared directories bind-mounted into the container's view of the filesystem. If the host filesystem is using case-insensitive directories, then the application can do lookups inside those directories in a case-insensitive way, without this needing to be implemented in user-space. However, if the host is only sharing a subset of a case-insensitive directory with the application, then the parent directories of the mount point will be part of the container's root tmpfs. When the application tries to do case-insensitive lookups of those parent directories on a case-sensitive tmpfs, the lookup will fail" * tag 'vfs-6.13.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: tmpfs: Initialize sysfs during tmpfs init tmpfs: Fix type for sysfs' casefold attribute libfs: Fix kernel-doc warning in generic_ci_validate_strict_name docs: tmpfs: Add casefold options tmpfs: Expose filesystem features via sysfs tmpfs: Add flag FS_CASEFOLD_FL support for tmpfs dirs tmpfs: Add casefold lookup support libfs: Export generic_ci_ dentry functions unicode: Recreate utf8_parse_version() unicode: Export latest available UTF-8 version number ext4: Use generic_ci_validate_strict_name helper libfs: Create the helper function generic_ci_validate_strict_name() |
|
|
|
56be9aaf98 |
vfs-6.13.pagecache
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcUQAAKCRCRxhvAZXjc onEpAQCUdwIBHpwmSIFvJFA9aNGpbLzi0dDSEIxuWYtp5qVuogD+ImccwqpG3kEi Zq9vokdPpB1zbahxKl1mkvBG4G0GFQE= =LbP6 -----END PGP SIGNATURE----- Merge tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pagecache updates from Christian Brauner: "Cleanup filesystem page flag usage: This continues the work to make the mappedtodisk/owner_2 flag available to filesystems which don't use buffer heads. Further patches remove uses of Private2. This brings us very close to being rid of it entirely" * tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: migrate: Remove references to Private2 ceph: Remove call to PagePrivate2() btrfs: Switch from using the private_2 flag to owner_2 mm: Remove PageMappedToDisk nilfs2: Convert nilfs_copy_buffer() to use folios fs: Move clearing of mappedtodisk to buffer.c |
|
|
|
70e7730c2a |
vfs-6.13.misc
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcToAAKCRCRxhvAZXjc
osL9AP948FFumJRC28gDJ4xp+X4eohNOfkgoEG8FTbF2zU6ulwD+O0pr26FqpFli
pqlG+38UdATImpfqqWjPbb72sBYcfQg=
=wLUh
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Fixup and improve NLM and kNFSD file lock callbacks
Last year both GFS2 and OCFS2 had some work done to make their
locking more robust when exported over NFS. Unfortunately, part of
that work caused both NLM (for NFS v3 exports) and kNFSD (for
NFSv4.1+ exports) to no longer send lock notifications to clients
This in itself is not a huge problem because most NFS clients will
still poll the server in order to acquire a conflicted lock
It's important for NLM and kNFSD that they do not block their
kernel threads inside filesystem's file_lock implementations
because that can produce deadlocks. We used to make sure of this by
only trusting that posix_lock_file() can correctly handle blocking
lock calls asynchronously, so the lock managers would only setup
their file_lock requests for async callbacks if the filesystem did
not define its own lock() file operation
However, when GFS2 and OCFS2 grew the capability to correctly
handle blocking lock requests asynchronously, they started
signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check
for also trusting posix_lock_file() was inadvertently dropped, so
now most filesystems no longer produce lock notifications when
exported over NFS
Fix this by using an fop_flag which greatly simplifies the problem
and grooms the way for future uses by both filesystems and lock
managers alike
- Add a sysctl to delete the dentry when a file is removed instead of
making it a negative dentry
Commit
|
|
|
|
6ac81fd55e |
vfs-6.13.mgtime
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcScQAKCRCRxhvAZXjc
oj+5AP4k822a77wc/3iPFk379naIvQ4dsrgemh0/Pb6ZvzvkFQEAi3vFCfzCDR2x
SkJF/RwXXKZv6U31QXMRt2Qo6wfBuAc=
=nVlm
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs multigrain timestamps from Christian Brauner:
"This is another try at implementing multigrain timestamps. This time
with significant help from the timekeeping maintainers to reduce the
performance impact.
Thomas provided a base branch that contains the required timekeeping
interfaces for the VFS. It serves as the base for the multi-grain
timestamp work:
- Multigrain timestamps allow the kernel to use fine-grained
timestamps when an inode's attributes is being actively observed
via ->getattr(). With this support, it's possible for a file to get
a fine-grained timestamp, and another modified after it to get a
coarse-grained stamp that is earlier than the fine-grained time. If
this happens then the files can appear to have been modified in
reverse order, which breaks VFS ordering guarantees.
To prevent this, a floor value is maintained for multigrain
timestamps. Whenever a fine-grained timestamp is handed out, record
it, and when later coarse-grained stamps are handed out, ensure
they are not earlier than that value. If the coarse-grained
timestamp is earlier than the fine-grained floor, return the floor
value instead.
The timekeeper changes add a static singleton atomic64_t into
timekeeper.c that is used to keep track of the latest fine-grained
time ever handed out. This is tracked as a monotonic ktime_t value
to ensure that it isn't affected by clock jumps. Because it is
updated at different times than the rest of the timekeeper object,
the floor value is managed independently of the timekeeper via a
cmpxchg() operation, and sits on its own cacheline.
Two new public timekeeper interfaces are added:
(1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the
later of the coarse-grained clock and the floor time
(2) ktime_get_real_ts64_mg() gets the fine-grained clock value,
and tries to swap it into the floor. A timespec64 is filled
with the result.
- The VFS has always used coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot metadata updates, down to around
1 per jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting
via NFSv3, which relies on timestamps to validate caches. A lot of
changes can happen in a jiffy, so timestamps aren't sufficient to
help the client decide when to invalidate the cache. Even with
NFSv4, a lot of exported filesystems don't properly support a
change attribute and are subject to the same problems with
timestamp granularity. Other applications have similar issues with
timestamps (e.g backup applications).
If we were to always use fine-grained timestamps, that would
improve the situation, but that becomes rather expensive, as the
underlying filesystem would have to log a lot more metadata
updates.
This adds a way to only use fine-grained timestamps when they are
being actively queried. Use the (unused) top bit in
inode->i_ctime_nsec as a flag that indicates whether the current
timestamps have been queried via stat() or the like. When it's set,
we allow the kernel to use a fine-grained timestamp iff it's
necessary to make the ctime show a different value.
This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible
for a file being changed to get a fine-grained timestamp. A file
that is altered just a bit later can then get a coarse-grained one
that appears older than the earlier fine-grained time. This
violates timestamp ordering guarantees.
This is where the earlier mentioned timkeeping interfaces help. A
global monotonic atomic64_t value is kept that acts as a timestamp
floor. When we go to stamp a file, we first get the latter of the
current floor value and the current coarse-grained time. If the
inode ctime hasn't been queried then we just attempt to stamp it
with that value.
If it has been queried, then first see whether the current coarse
time is later than the existing ctime. If it is, then we accept
that value. If it isn't, then we get a fine-grained time and try to
swap that into the global floor. Whether that succeeds or fails, we
take the resulting floor time, convert it to realtime and try to
swap that into the ctime.
We take the result of the ctime swap whether it succeeds or fails,
since either is just as valid.
Filesystems can opt into this by setting the FS_MGTIME fstype flag.
Others should be unaffected (other than being subject to the same
floor value as multigrain filesystems)"
* tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: reduce pointer chasing in is_mgtime() test
tmpfs: add support for multigrain timestamps
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
Documentation: add a new file documenting multigrain timestamps
fs: add percpu counters for significant multigrain timestamp events
fs: tracepoints around multigrain timestamp events
fs: handle delegated timestamps in setattr_copy_mgtime
timekeeping: Add percpu counter for tracking floor swap events
timekeeping: Add interfaces for handling timestamps with a floor value
fs: have setattr_copy handle multigrain timestamps appropriately
fs: add infrastructure for multigrain timestamps
|
|
|
|
4a5df37964 |
10 hotfixes, 7 of which are cc:stable. All singletons, please see the
changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzkr6AAKCRDdBJ7gKXxA jsb2AP9HCOI4w9rQTmBdnaefXytS7fiiPq+LVNpjJ0NGXX2FSgD/e1NM0wi8KevQ npcvlqTcXtRSJvYNF904aTNyDn+Kuw0= =KFGY -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "10 hotfixes, 7 of which are cc:stable. All singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: revert "mm: shmem: fix data-race in shmem_getattr()" ocfs2: uncache inode which has failed entering the group mm: fix NULL pointer dereference in alloc_pages_bulk_noprof mm, doc: update read_ahead_kb for MADV_HUGEPAGE fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args() sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32 mm/mremap: fix address wraparound in move_page_tables() tools/mm: fix compile error mm, swap: fix allocation and scanning race with swapoff |
|
|
|
d1aa0c0429 |
mm: revert "mm: shmem: fix data-race in shmem_getattr()"
Revert |
|
|
|
9e19aa165c |
Merge branch 'slab/for-6.13/features' into slab/for-next
Merge the slab feature branch for 6.13: - Add new slab_strict_numa parameter for per-object memory policies (Christoph Lameter) |
|
|
|
2420baa8e0 |
mm/slab: Allow cache creation to proceed even if sysfs registration fails
When kobject_init_and_add() fails during cache creation, kobj->name can be leaked because SLUB does not call kobject_put(), which should be invoked per the kobject API documentation. This has a bit of historical context, though; SLUB does not call kobject_put() to avoid double-free for struct kmem_cache because 1) simply calling it would free all resources related to the cache, and 2) struct kmem_cache descriptor is always freed by cache_cache()'s error handling path, causing struct kmem_cache to be freed twice. This issue can be reproduced by creating new slab caches while applying failslab for kernfs_node_cache. This makes kobject_add_varg() succeed, but causes kobject_add_internal() to fail in kobject_init_and_add() during cache creation. Historically, this issue has attracted developers' attention several times. Each time a fix addressed either the leak or the double-free, it caused the other issue. Let's summarize a bit of history here: The leak has existed since the early days of SLUB. Commit |
|
|
|
dbc1691527 |
mm/slub: Avoid list corruption when removing a slab from the full list
Boot with slub_debug=UFPZ.
If allocated object failed in alloc_consistency_checks, all objects of
the slab will be marked as used, and then the slab will be removed from
the partial list.
When an object belonging to the slab got freed later, the remove_full()
function is called. Because the slab is neither on the partial list nor
on the full list, it eventually lead to a list corruption (actually a
list poison being detected).
So we need to mark and isolate the slab page with metadata corruption,
do not put it back in circulation.
Because the debug caches avoid all the fastpaths, reusing the frozen bit
to mark slab page with metadata corruption seems to be fine.
[ 4277.385669] list_del corruption, ffffea00044b3e50->next is LIST_POISON1 (dead000000000100)
[ 4277.387023] ------------[ cut here ]------------
[ 4277.387880] kernel BUG at lib/list_debug.c:56!
[ 4277.388680] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 4277.389562] CPU: 5 PID: 90 Comm: kworker/5:1 Kdump: loaded Tainted: G OE 6.6.1-1 #1
[ 4277.392113] Workqueue: xfs-inodegc/vda1 xfs_inodegc_worker [xfs]
[ 4277.393551] RIP: 0010:__list_del_entry_valid_or_report+0x7b/0xc0
[ 4277.394518] Code: 48 91 82 e8 37 f9 9a ff 0f 0b 48 89 fe 48 c7 c7 28 49 91 82 e8 26 f9 9a ff 0f 0b 48 89 fe 48 c7 c7 58 49 91
[ 4277.397292] RSP: 0018:ffffc90000333b38 EFLAGS: 00010082
[ 4277.398202] RAX: 000000000000004e RBX: ffffea00044b3e50 RCX: 0000000000000000
[ 4277.399340] RDX: 0000000000000002 RSI: ffffffff828f8715 RDI: 00000000ffffffff
[ 4277.400545] RBP: ffffea00044b3e40 R08: 0000000000000000 R09: ffffc900003339f0
[ 4277.401710] R10: 0000000000000003 R11: ffffffff82d44088 R12: ffff888112cf9910
[ 4277.402887] R13: 0000000000000001 R14: 0000000000000001 R15: ffff8881000424c0
[ 4277.404049] FS: 0000000000000000(0000) GS:ffff88842fd40000(0000) knlGS:0000000000000000
[ 4277.405357] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4277.406389] CR2: 00007f2ad0b24000 CR3: 0000000102a3a006 CR4: 00000000007706e0
[ 4277.407589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4277.408780] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4277.410000] PKRU: 55555554
[ 4277.410645] Call Trace:
[ 4277.411234] <TASK>
[ 4277.411777] ? die+0x32/0x80
[ 4277.412439] ? do_trap+0xd6/0x100
[ 4277.413150] ? __list_del_entry_valid_or_report+0x7b/0xc0
[ 4277.414158] ? do_error_trap+0x6a/0x90
[ 4277.414948] ? __list_del_entry_valid_or_report+0x7b/0xc0
[ 4277.415915] ? exc_invalid_op+0x4c/0x60
[ 4277.416710] ? __list_del_entry_valid_or_report+0x7b/0xc0
[ 4277.417675] ? asm_exc_invalid_op+0x16/0x20
[ 4277.418482] ? __list_del_entry_valid_or_report+0x7b/0xc0
[ 4277.419466] ? __list_del_entry_valid_or_report+0x7b/0xc0
[ 4277.420410] free_to_partial_list+0x515/0x5e0
[ 4277.421242] ? xfs_iext_remove+0x41a/0xa10 [xfs]
[ 4277.422298] xfs_iext_remove+0x41a/0xa10 [xfs]
[ 4277.423316] ? xfs_inodegc_worker+0xb4/0x1a0 [xfs]
[ 4277.424383] xfs_bmap_del_extent_delay+0x4fe/0x7d0 [xfs]
[ 4277.425490] __xfs_bunmapi+0x50d/0x840 [xfs]
[ 4277.426445] xfs_itruncate_extents_flags+0x13a/0x490 [xfs]
[ 4277.427553] xfs_inactive_truncate+0xa3/0x120 [xfs]
[ 4277.428567] xfs_inactive+0x22d/0x290 [xfs]
[ 4277.429500] xfs_inodegc_worker+0xb4/0x1a0 [xfs]
[ 4277.430479] process_one_work+0x171/0x340
[ 4277.431227] worker_thread+0x277/0x390
[ 4277.431962] ? __pfx_worker_thread+0x10/0x10
[ 4277.432752] kthread+0xf0/0x120
[ 4277.433382] ? __pfx_kthread+0x10/0x10
[ 4277.434134] ret_from_fork+0x2d/0x50
[ 4277.434837] ? __pfx_kthread+0x10/0x10
[ 4277.435566] ret_from_fork_asm+0x1b/0x30
[ 4277.436280] </TASK>
Fixes:
|
|
|
|
5474d33ca4 |
mm/slub: Improve redzone check and zeroing for krealloc()
For current krealloc(), one problem is its caller doesn't pass the old request size, say the object is 64 bytes kmalloc one, but caller may only requested 48 bytes. Then when krealloc() shrinks or grows in the same object, or allocate a new bigger object, it lacks this 'original size' information to do accurate data preserving or zeroing (when __GFP_ZERO is set). Thus with slub debug redzone and object tracking enabled, parts of the object after krealloc() might contain redzone data instead of zeroes, which is violating the __GFP_ZERO guarantees. Good thing is in this case, kmalloc caches do have this 'orig_size' feature. So solve the problem by utilize 'org_size' to do accurate data zeroing and preserving. [Thanks to syzbot and V, Narasimhan for discovering kfence and big kmalloc related issues in early patch version] Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
|
|
|
9ef8568bd7 |
mm/slub: Consider kfence case for get_orig_size()
When 'orig_size' of kmalloc object is enabled by debug option, it should either contains the actual requested size or the cache's 'object_size'. But it's not true if that object is a kfence-allocated one, and the data at 'orig_size' offset of metadata could be zero or other values. This is not a big issue for current 'orig_size' usage, as init_object() and check_object() during alloc/free process will be skipped for kfence addresses. But it could cause trouble for other usage in future. Use the existing kfence helper kfence_ksize() which can return the real original request size. Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
|
|
|
2532e6c74a |
cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
cma_init_reserved_mem() checks base and size alignment with CMA_MIN_ALIGNMENT_BYTES. However, some users might call this during early boot when pageblock_order is 0. That means if base and size does not have pageblock_order alignment, it can cause functional failures during cma activate area. So let's enforce pageblock_order to be non-zero during cma_init_reserved_mem() to catch such wrong usages. 1. This was seen with fadump on PowerPC which was calling cma_init_reserved_mem() before the pageblock_order was initialized. This is now fixed in the fadump on PowerPC itself. The details of that can be found in the patch including the userspace-visible effect of the issue [1]. 2. However it was also decided that we should add a stronger enforcement check within cma_init_reserved_mem() to catch such wrong usages [2]. Hence this patch. This is ok to be in -next and there is no "Fixes" tag required for this patch. [1]: https://lore.kernel.org/all/3ae208e48c0d9cefe53d2dc4f593388067405b7d.1729146153.git.ritesh.list@gmail.com/ [2]: https://lore.kernel.org/all/83eb128e-4f06-4725-a843-a4563f246a44@redhat.com/ Link: https://lkml.kernel.org/r/e274344b44d5f80fa54c52f530387257fe99ec65.1731505681.git.ritesh.list@gmail.com Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
811808d365 |
mm/kfence: add a new kunit test test_use_after_free_read_nofault()
Faults from copy_from_kernel_nofault() need to be handled by fixup table and should not be handled by kfence. Otherwise while reading /proc/kcore which uses copy_from_kernel_nofault(), kfence can generate false negatives. This can happen when /proc/kcore ends up reading an unmapped address from kfence pool. Let's add a testcase to cover this case. Link: https://lkml.kernel.org/r/210e561f7845697a32de44b643393890f180069f.1729272697.git.ritesh.list@gmail.com Signed-off-by: Nirjhar Roy <nirjhar@linux.ibm.com> Co-developed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Tested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
05d4532b60 |
memcg/hugetlb: add hugeTLB counters to memcg
This patch introduces a new counter to memory.stat that tracks hugeTLB
usage, only if hugeTLB accounting is done to memory.current. This feature
is enabled the same way hugeTLB accounting is enabled, via the
memory_hugetlb_accounting mount flag for cgroupsv2.
1. Why is this patch necessary?
Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
hugeTLB usage to memory.current. However, the metric is not reported in
memory.stat. Given that users often interpret memory.stat as a breakdown
of the value reported in memory.current, the disparity between the two
reports can be confusing. This patch solves this problem by including the
metric in memory.stat as well, but only if it is also reported in
memory.current (it would also be confusing if the value was reported in
memory.stat, but not in memory.current)
Aside from the consistency between the two files, we also see benefits in
observability. Userspace might be interested in the hugeTLB footprint of
cgroups for many reasons. For instance, system admins might want to
verify that hugeTLB usage is distributed as expected across tasks: i.e.
memory-intensive tasks are using more hugeTLB pages than tasks that don't
consume a lot of memory, or are seen to fault frequently. Note that this
is separate from wanting to inspect the distribution for limiting purposes
(in which case, hugeTLB controller makes more sense).
2. We already have a hugeTLB controller. Why not use that?
It is true that hugeTLB tracks the exact value that we want. In fact, by
enabling the hugeTLB controller, we get all of the observability benefits
that I mentioned above, and users can check the total hugeTLB usage,
verify if it is distributed as expected, etc.
With this said, there are 2 problems:
(a) They are still not reported in memory.stat, which means the
disparity between the memcg reports are still there.
(b) We cannot reasonably expect users to enable the hugeTLB controller
just for the sake of hugeTLB usage reporting, especially since
they don't have any use for hugeTLB usage enforcing [2].
3. Implementation Details:
In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio
regardless of whether memcg accounts hugetlb. mem_cgroup_commit_charge
which is called from alloc_hugetlb_folio will set memcg for the folio only
if the CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is used, so
lruvec_stat_mod_folio accounts per-memcg hugetlb counters only if the
feature is enabled. Regardless of whether memcg accounts for hugetlb, the
newly added global counter is updated and shown in /proc/vmstat.
The global counter is added because vmstats is the preferred framework for
cgroup stats. It makes stat items consistent between global and cgroups.
It also provides a per-node breakdown, which is useful. Because it does
not use cgroup-specific hooks, we also keep generic MM code separate from
memcg code.
[1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/
[2] Of course, we can't make a new patch for every feature that can be
duplicated. However, since the existing solution of enabling the
hugeTLB controller is an imperfect solution that still leaves a
discrepancy between memory.stat and memory.curent, I think that it
is reasonable to isolate the feature in this case.
Link: https://lkml.kernel.org/r/20241101204402.1885383-1-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Suggested-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
2ea80b039b |
vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
Since 5.14-rc1, NUMA events will only be folded from per-CPU statistics to
per zone and global statistics when the user actually needs it.
Currently, the kernel has performs the fold operation when reading
/proc/vmstat, but does not perform the fold operation in /proc/zoneinfo.
This can lead to inaccuracies in the following statistics in zoneinfo:
- numa_hit
- numa_miss
- numa_foreign
- numa_interleave
- numa_local
- numa_other
Therefore, before printing per-zone vm_numa_event when reading
/proc/zoneinfo, we should also perform the fold operation.
Link: https://lkml.kernel.org/r/1730433998-10461-1-git-send-email-mengensun@tencent.com
Fixes:
|
|
|
|
8ce41b0f9d |
mm: fix NULL pointer dereference in alloc_pages_bulk_noprof
We triggered a NULL pointer dereference for ac.preferred_zoneref->zone in alloc_pages_bulk_noprof() when the task is migrated between cpusets. When cpuset is enabled, in prepare_alloc_pages(), ac->nodemask may be ¤t->mems_allowed. when first_zones_zonelist() is called to find preferred_zoneref, the ac->nodemask may be modified concurrently if the task is migrated between different cpusets. Assuming we have 2 NUMA Node, when traversing Node1 in ac->zonelist, the nodemask is 2, and when traversing Node2 in ac->zonelist, the nodemask is 1. As a result, the ac->preferred_zoneref points to NULL zone. In alloc_pages_bulk_noprof(), for_each_zone_zonelist_nodemask() finds a allowable zone and calls zonelist_node_idx(ac.preferred_zoneref), leading to NULL pointer dereference. __alloc_pages_noprof() fixes this issue by checking NULL pointer in commit |
|
|
|
a4a282daf1 |
mm/mremap: fix address wraparound in move_page_tables()
On 32-bit platforms, it is possible for the expression `len + old_addr <
old_end` to be false-positive if `len + old_addr` wraps around.
`old_addr` is the cursor in the old range up to which page table entries
have been moved; so if the operation succeeded, `old_addr` is the *end* of
the old region, and adding `len` to it can wrap.
The overflow causes mremap() to mistakenly believe that PTEs have been
copied; the consequence is that mremap() bails out, but doesn't move the
PTEs back before the new VMA is unmapped, causing anonymous pages in the
region to be lost. So basically if userspace tries to mremap() a
private-anon region and hits this bug, mremap() will return an error and
the private-anon region's contents appear to have been zeroed.
The idea of this check is that `old_end - len` is the original start
address, and writing the check that way also makes it easier to read; so
fix the check by rearranging the comparison accordingly.
(An alternate fix would be to refactor this function by introducing an
"orig_old_start" variable or such.)
Tested in a VM with a 32-bit X86 kernel; without the patch:
```
user@horn:~/big_mremap$ cat test.c
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <err.h>
#include <sys/mman.h>
#define ADDR1 ((void*)0x60000000)
#define ADDR2 ((void*)0x10000000)
#define SIZE 0x50000000uL
int main(void) {
unsigned char *p1 = mmap(ADDR1, SIZE, PROT_READ|PROT_WRITE,
MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED_NOREPLACE, -1, 0);
if (p1 == MAP_FAILED)
err(1, "mmap 1");
unsigned char *p2 = mmap(ADDR2, SIZE, PROT_NONE,
MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED_NOREPLACE, -1, 0);
if (p2 == MAP_FAILED)
err(1, "mmap 2");
*p1 = 0x41;
printf("first char is 0x%02hhx\n", *p1);
unsigned char *p3 = mremap(p1, SIZE, SIZE,
MREMAP_MAYMOVE|MREMAP_FIXED, p2);
if (p3 == MAP_FAILED) {
printf("mremap() failed; first char is 0x%02hhx\n", *p1);
} else {
printf("mremap() succeeded; first char is 0x%02hhx\n", *p3);
}
}
user@horn:~/big_mremap$ gcc -static -o test test.c
user@horn:~/big_mremap$ setarch -R ./test
first char is 0x41
mremap() failed; first char is 0x00
```
With the patch:
```
user@horn:~/big_mremap$ setarch -R ./test
first char is 0x41
mremap() succeeded; first char is 0x41
```
Link: https://lkml.kernel.org/r/20241111-fix-mremap-32bit-wrap-v1-1-61d6be73b722@google.com
Fixes:
|
|
|
|
0ec8bc9e88 |
mm, swap: fix allocation and scanning race with swapoff
There are two flags used to synchronize allocation and scanning with swapoff: SWP_WRITEOK and SWP_SCANNING. SWP_WRITEOK: Swapoff will first unset this flag, at this point any further swap allocation or scanning on this device should just abort so no more new entries will be referencing this device. Swapoff will then unuse all existing swap entries. SWP_SCANNING: This flag is set when device is being scanned. Swapoff will wait for all scanner to stop before the final release of the swap device structures to avoid UAF. Note this flag is the highest used bit of si->flags so it could be added up arithmetically, if there are multiple scanner. commit |
|
|
|
a79993b5fc |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.12-rc8). Conflicts: tools/testing/selftests/net/.gitignore |
|
|
|
5a4332062e |
Merge branches 'for-next/gcs', 'for-next/probes', 'for-next/asm-offsets', 'for-next/tlb', 'for-next/misc', 'for-next/mte', 'for-next/sysreg', 'for-next/stacktrace', 'for-next/hwcap3', 'for-next/kselftest', 'for-next/crc32', 'for-next/guest-cca', 'for-next/haft' and 'for-next/scs', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
perf: Switch back to struct platform_driver::remove()
perf: arm_pmuv3: Add support for Samsung Mongoose PMU
dt-bindings: arm: pmu: Add Samsung Mongoose core compatible
perf/dwc_pcie: Fix typos in event names
perf/dwc_pcie: Add support for Ampere SoCs
ARM: pmuv3: Add missing write_pmuacr()
perf/marvell: Marvell PEM performance monitor support
perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control
perf/dwc_pcie: Convert the events with mixed case to lowercase
perf/cxlpmu: Support missing events in 3.1 spec
perf: imx_perf: add support for i.MX91 platform
dt-bindings: perf: fsl-imx-ddr: Add i.MX91 compatible
drivers perf: remove unused field pmu_node
* for-next/gcs: (42 commits)
: arm64 Guarded Control Stack user-space support
kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c
arm64/gcs: Fix outdated ptrace documentation
kselftest/arm64: Ensure stable names for GCS stress test results
kselftest/arm64: Validate that GCS push and write permissions work
kselftest/arm64: Enable GCS for the FP stress tests
kselftest/arm64: Add a GCS stress test
kselftest/arm64: Add GCS signal tests
kselftest/arm64: Add test coverage for GCS mode locking
kselftest/arm64: Add a GCS test program built with the system libc
kselftest/arm64: Add very basic GCS test program
kselftest/arm64: Always run signals tests with GCS enabled
kselftest/arm64: Allow signals tests to specify an expected si_code
kselftest/arm64: Add framework support for GCS to signal handling tests
kselftest/arm64: Add GCS as a detected feature in the signal tests
kselftest/arm64: Verify the GCS hwcap
arm64: Add Kconfig for Guarded Control Stack (GCS)
arm64/ptrace: Expose GCS via ptrace and core files
arm64/signal: Expose GCS state in signal frames
arm64/signal: Set up and restore the GCS context for signal handlers
arm64/mm: Implement map_shadow_stack()
...
* for-next/probes:
: Various arm64 uprobes/kprobes cleanups
arm64: insn: Simulate nop instruction for better uprobe performance
arm64: probes: Remove probe_opcode_t
arm64: probes: Cleanup kprobes endianness conversions
arm64: probes: Move kprobes-specific fields
arm64: probes: Fix uprobes for big-endian kernels
arm64: probes: Fix simulate_ldr*_literal()
arm64: probes: Remove broken LDR (literal) uprobe support
* for-next/asm-offsets:
: arm64 asm-offsets.c cleanup (remove unused offsets)
arm64: asm-offsets: remove PREEMPT_DISABLE_OFFSET
arm64: asm-offsets: remove DMA_{TO,FROM}_DEVICE
arm64: asm-offsets: remove VM_EXEC and PAGE_SZ
arm64: asm-offsets: remove MM_CONTEXT_ID
arm64: asm-offsets: remove COMPAT_{RT_,SIGFRAME_REGS_OFFSET
arm64: asm-offsets: remove VMA_VM_*
arm64: asm-offsets: remove TSK_ACTIVE_MM
* for-next/tlb:
: TLB flushing optimisations
arm64: optimize flush tlb kernel range
arm64: tlbflush: add __flush_tlb_range_limit_excess()
* for-next/misc:
: Miscellaneous patches
arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled
arm64/ptrace: Clarify documentation of VL configuration via ptrace
acpi/arm64: remove unnecessary cast
arm64/mm: Change protval as 'pteval_t' in map_range()
arm64: uprobes: Optimize cache flushes for xol slot
acpi/arm64: Adjust error handling procedure in gtdt_parse_timer_block()
arm64: fix .data.rel.ro size assertion when CONFIG_LTO_CLANG
arm64/ptdump: Test both PTE_TABLE_BIT and PTE_VALID for block mappings
arm64/mm: Sanity check PTE address before runtime P4D/PUD folding
arm64/mm: Drop setting PTE_TYPE_PAGE in pte_mkcont()
ACPI: GTDT: Tighten the check for the array of platform timer structures
arm64/fpsimd: Fix a typo
arm64: Expose ID_AA64ISAR1_EL1.XS to sanitised feature consumers
arm64: Return early when break handler is found on linked-list
arm64/mm: Re-organize arch_make_huge_pte()
arm64/mm: Drop _PROT_SECT_DEFAULT
arm64: Add command-line override for ID_AA64MMFR0_EL1.ECV
arm64: head: Drop SWAPPER_TABLE_SHIFT
arm64: cpufeature: add POE to cpucap_is_possible()
arm64/mm: Change pgattr_change_is_safe() arguments as pteval_t
* for-next/mte:
: Various MTE improvements
selftests: arm64: add hugetlb mte tests
hugetlb: arm64: add mte support
* for-next/sysreg:
: arm64 sysreg updates
arm64/sysreg: Update ID_AA64MMFR1_EL1 to DDI0601 2024-09
* for-next/stacktrace:
: arm64 stacktrace improvements
arm64: preserve pt_regs::stackframe during exec*()
arm64: stacktrace: unwind exception boundaries
arm64: stacktrace: split unwind_consume_stack()
arm64: stacktrace: report recovered PCs
arm64: stacktrace: report source of unwind data
arm64: stacktrace: move dump_backtrace() to kunwind_stack_walk()
arm64: use a common struct frame_record
arm64: pt_regs: swap 'unused' and 'pmr' fields
arm64: pt_regs: rename "pmr_save" -> "pmr"
arm64: pt_regs: remove stale big-endian layout
arm64: pt_regs: assert pt_regs is a multiple of 16 bytes
* for-next/hwcap3:
: Add AT_HWCAP3 support for arm64 (also wire up AT_HWCAP4)
arm64: Support AT_HWCAP3
binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4
* for-next/kselftest: (30 commits)
: arm64 kselftest fixes/cleanups
kselftest/arm64: Try harder to generate different keys during PAC tests
kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
kselftest/arm64: Add FPMR coverage to fp-ptrace
kselftest/arm64: Expand the set of ZA writes fp-ptrace does
kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
kselftest/arm64: Enable build of PAC tests with LLVM=1
kselftest/arm64: Check that SVCR is 0 in signal handlers
kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests
kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
kselftest/arm64: Fix build with stricter assemblers
kselftest/arm64: Test signal handler state modification in fp-stress
kselftest/arm64: Provide a SIGUSR1 handler in the kernel mode FP stress test
kselftest/arm64: Implement irritators for ZA and ZT
kselftest/arm64: Remove unused ADRs from irritator handlers
kselftest/arm64: Correct misleading comments on fp-stress irritators
kselftest/arm64: Poll less often while waiting for fp-stress children
kselftest/arm64: Increase frequency of signal delivery in fp-stress
kselftest/arm64: Fix encoding for SVE B16B16 test
...
* for-next/crc32:
: Optimise CRC32 using PMULL instructions
arm64/crc32: Implement 4-way interleave using PMULL
arm64/crc32: Reorganize bit/byte ordering macros
arm64/lib: Handle CRC-32 alternative in C code
* for-next/guest-cca:
: Support for running Linux as a guest in Arm CCA
arm64: Document Arm Confidential Compute
virt: arm-cca-guest: TSM_REPORT support for realms
arm64: Enable memory encrypt for Realms
arm64: mm: Avoid TLBI when marking pages as valid
arm64: Enforce bounce buffers for realm DMA
efi: arm64: Map Device with Prot Shared
arm64: rsi: Map unprotected MMIO as decrypted
arm64: rsi: Add support for checking whether an MMIO is protected
arm64: realm: Query IPA size from the RMM
arm64: Detect if in a realm and set RIPAS RAM
arm64: rsi: Add RSI definitions
* for-next/haft:
: Support for arm64 FEAT_HAFT
arm64: pgtable: Warn unexpected pmdp_test_and_clear_young()
arm64: Enable ARCH_HAS_NONLEAF_PMD_YOUNG
arm64: Add support for FEAT_HAFT
arm64: setup: name 'tcr2' register
arm64/sysreg: Update ID_AA64MMFR1_EL1 register
* for-next/scs:
: Dynamic shadow call stack fixes
arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux()
arm64/scs: Deal with 64-bit relative offsets in FDE frames
arm64/scs: Fix handling of DWARF augmentation data in CIE/FDE frames
|
|
|
|
8714381703 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR. In particular to bring the fix in commit |
|
|
|
4b49c0ba4e |
10 hotfixes, 7 of which are cc:stable. 7 are MM, 3 are not. All
singletons. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzP1ZAAKCRDdBJ7gKXxA jmBUAP9n2zTKoNeF/WpS0aSg+SpG78mtyMIwSUW2PPfGObYTBwD/bncG9U3fnno1 v6Sey0OjAKwGdV+gTd+5ymWJKPSQbgA= =HxTA -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "10 hotfixes, 7 of which are cc:stable. 7 are MM, 3 are not. All singletons" * tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: swapfile: fix cluster reclaim work crash on rotational devices selftests: hugetlb_dio: fixup check for initial conditions to skip in the start mm/thp: fix deferred split queue not partially_mapped: fix mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases nommu: pass NULL argument to vma_iter_prealloc() ocfs2: fix UBSAN warning in ocfs2_verify_volume() nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint mm: page_alloc: move mlocked flag clearance into free_pages_prepare() mm: count zeromap read and set for swapout and swapin |
|
|
|
52aecaee1c |
mm: zero range of eof folio exposed by inode size extension
On some filesystems, it is currently possible to create a transient data inconsistency between pagecache and on-disk state. For example, on a 1k block size ext4 filesystem: $ xfs_io -fc "pwrite 0 2k" -c "mmap 0 4k" -c "mwrite 2k 2k" \ -c "truncate 8k" -c "fiemap -v" -c "pread -v 2k 16" <file> ... EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..3]: 17410..17413 4 0x1 1: [4..15]: hole 12 00000800: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX $ umount <mnt>; mount <dev> <mnt> $ xfs_io -c "pread -v 2k 16" <file> 00000800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ This allocates and writes two 1k blocks, map writes to the post-eof portion of the (4k) eof folio, extends the file, and then shows that the post-eof data is not cleared before the file size is extended. The result is pagecache with a clean and uptodate folio over a hole that returns non-zero data. Once reclaimed, pagecache begins to return valid data. Some filesystems avoid this problem by flushing the EOF folio before inode size extension. This triggers writeback time partial post-eof zeroing. XFS explicitly zeroes newly exposed file ranges via iomap_zero_range(), but this includes a hack to flush dirty but hole-backed folios, which means writeback actually does the zeroing in this particular case as well. bcachefs explicitly flushes the eof folio on truncate extension to the same effect, but doesn't handle the analogous write extension case (i.e., replace "truncate 8k" with "pwrite 4k 4k" in the above example command to reproduce the same problem on bcachefs). btrfs doesn't seem to support subpage block sizes. The two main options to avoid this behavior are to either flush or do the appropriate zeroing during size extending operations. Zeroing is only required when the size change exposes ranges of the file that haven't been directly written, such as a write or truncate that starts beyond the current eof. The pagecache_isize_extended() helper is already used for this particular scenario. It currently cleans any pte's for the eof folio to ensure preexisting mappings fault and allow the filesystem to take action based on the updated inode size. This is required to ensure the folio is fully backed by allocated blocks, for example, but this also happens to be the same scenario zeroing is required. Update pagecache_isize_extended() to zero the post-eof range of the eof folio if it is dirty at the time of the size change, since writeback now won't have the chance. If non-dirty, the folio has either not been written or the post-eof portion was zeroed by writeback. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://patch.msgid.link/20240919160741.208162-3-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
dcf32ea7ec |
mm: swapfile: fix cluster reclaim work crash on rotational devices
syzbot and Daan report a NULL pointer crash in the new full swap cluster
reclaim work:
> Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN PTI
> KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
> CPU: 1 UID: 0 PID: 51 Comm: kworker/1:1 Not tainted 6.12.0-rc6-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
> Workqueue: events swap_reclaim_work
> RIP: 0010:__list_del_entry_valid_or_report+0x20/0x1c0 lib/list_debug.c:49
> Code: 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 fe 48 83 c7 08 48 83 ec 18 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 19 01 00 00 48 89 f2 48 8b 4e 08 48 b8 00 00 00
> RSP: 0018:ffffc90000bb7c30 EFLAGS: 00010202
> RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff88807b9ae078
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008
> RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 000000000000004f R12: dffffc0000000000
> R13: ffffffffffffffb8 R14: ffff88807b9ae000 R15: ffffc90003af1000
> FS: 0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fffaca68fb8 CR3: 00000000791c8000 CR4: 00000000003526f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> __list_del_entry_valid include/linux/list.h:124 [inline]
> __list_del_entry include/linux/list.h:215 [inline]
> list_move_tail include/linux/list.h:310 [inline]
> swap_reclaim_full_clusters+0x109/0x460 mm/swapfile.c:748
> swap_reclaim_work+0x2e/0x40 mm/swapfile.c:779
The syzbot console output indicates a virtual environment where swapfile
is on a rotational device. In this case, clusters aren't actually used,
and si->full_clusters is not initialized. Daan's report is from qemu, so
likely rotational too.
Make sure to only schedule the cluster reclaim work when clusters are
actually in use.
Link: https://lkml.kernel.org/r/20241107142335.GB1172372@cmpxchg.org
Link: https://lore.kernel.org/lkml/672ac50b.050a0220.2edce.1517.GAE@google.com/
Link: https://github.com/systemd/systemd/issues/35044
Fixes:
|
|
|
|
a3477c9e02 |
mm/thp: fix deferred split queue not partially_mapped: fix
Though even more elusive than before, list_del corruption has still been seen on THP's deferred split queue. The idea in commit |
|
|
|
94efde1d15 |
mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases
commit |
|
|
|
4e6bd13aa3 |
Merge branch 'iommufd/arm-smmuv3-nested' of iommu/linux into iommufd for-next
Common SMMUv3 patches for the following patches adding nesting, shared branch with the iommu tree. * 'iommufd/arm-smmuv3-nested' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/iommu/linux: iommu/arm-smmu-v3: Expose the arm_smmu_attach interface iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS ACPI/IORT: Support CANWBS memory access flag ACPICA: IORT: Update for revision E.f vfio: Remove VFIO_TYPE1_NESTING_IOMMU ... Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> |
|
|
|
9b5c87d479 |
mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
Since
|
|
|
|
7269ed4af3 |
mm: define general function pXd_init()
pud_init(), pmd_init() and kernel_pte_init() are duplicated defined in file kasan.c and sparse-vmemmap.c as weak functions. Move them to generic header file pgtable.h, architecture can redefine them. Link: https://lkml.kernel.org/r/20241104070712.52902-1-maobibo@loongson.cn Signed-off-by: Bibo Mao <maobibo@loongson.cn> Reviewed-by: Huacai Chen <chenhuacai@loongson.cn> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
7591c127f3 |
kmemleak: iommu/iova: fix transient kmemleak false positive
The introduction of iova_depot_pop() in
|
|
|
|
da0c02516c |
mm/list_lru: simplify the list_lru walk callback function
Now isolation no longer takes the list_lru global node lock, only use the per-cgroup lock instead. And this lock is inside the list_lru_one being walked, no longer needed to pass the lock explicitly. Link: https://lkml.kernel.org/r/20241104175257.60853-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
fb56fdf8b9 |
mm/list_lru: split the lock to per-cgroup scope
Currently, every list_lru has a per-node lock that protects adding,
deletion, isolation, and reparenting of all list_lru_one instances
belonging to this list_lru on this node. This lock contention is heavy
when multiple cgroups modify the same list_lru.
This lock can be split into per-cgroup scope to reduce contention.
To achieve this, we need a stable list_lru_one for every cgroup. This
commit adds a lock to each list_lru_one and introduced a helper function
lock_list_lru_of_memcg, making it possible to pin the list_lru of a memcg.
Then reworked the reparenting process.
Reparenting will switch the list_lru_one instances one by one. By locking
each instance and marking it dead using the nr_items counter, reparenting
ensures that all items in the corresponding cgroup (on-list or not,
because items have a stable cgroup, see below) will see the list_lru_one
switch synchronously.
Objcg reparent is also moved after list_lru reparent so items will have a
stable mem cgroup until all list_lru_one instances are drained.
The only caller that doesn't work the *_obj interfaces are direct calls to
list_lru_{add,del}. But it's only used by zswap and that's also based on
objcg, so it's fine.
This also changes the bahaviour of the isolation function when LRU_RETRY
or LRU_REMOVED_RETRY is returned, because now releasing the lock could
unblock reparenting and free the list_lru_one, isolation function will
have to return withoug re-lock the lru.
prepare() {
mkdir /tmp/test-fs
modprobe brd rd_nr=1 rd_size=33554432
mkfs.xfs -f /dev/ram0
mount -t xfs /dev/ram0 /tmp/test-fs
for i in $(seq 1 512); do
mkdir "/tmp/test-fs/$i"
for j in $(seq 1 10240); do
echo TEST-CONTENT > "/tmp/test-fs/$i/$j"
done &
done; wait
}
do_test() {
read_worker() {
sleep 1
tar -cv "$1" &>/dev/null
}
read_in_all() {
cd "/tmp/test-fs" && ls
for i in $(seq 1 512); do
(exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
read_worker "$i" &
done; wait
}
for i in $(seq 1 512); do
mkdir -p "/sys/fs/cgroup/benchmark/$i"
done
echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
echo 512M > /sys/fs/cgroup/benchmark/memory.max
echo 3 > /proc/sys/vm/drop_caches
time read_in_all
}
Above script simulates compression of small files in multiple cgroups
with memory pressure. Run prepare() then do_test for 6 times:
Before:
real 0m7.762s user 0m11.340s sys 3m11.224s
real 0m8.123s user 0m11.548s sys 3m2.549s
real 0m7.736s user 0m11.515s sys 3m11.171s
real 0m8.539s user 0m11.508s sys 3m7.618s
real 0m7.928s user 0m11.349s sys 3m13.063s
real 0m8.105s user 0m11.128s sys 3m14.313s
After this commit (about ~15% faster):
real 0m6.953s user 0m11.327s sys 2m42.912s
real 0m7.453s user 0m11.343s sys 2m51.942s
real 0m6.916s user 0m11.269s sys 2m43.957s
real 0m6.894s user 0m11.528s sys 2m45.346s
real 0m6.911s user 0m11.095s sys 2m43.168s
real 0m6.773s user 0m11.518s sys 2m40.774s
Link: https://lkml.kernel.org/r/20241104175257.60853-6-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
28e98022b3 |
mm/list_lru: simplify reparenting and initial allocation
Currently, there is a lot of code for detecting reparent racing using
kmemcg_id as the synchronization flag. And an intermediate table is
required to record and compare the kmemcg_id.
We can simplify this by just checking the cgroup css status, skip if
cgroup is being offlined. On the reparenting side, ensure no more
allocation is on going and no further allocation will occur by using the
XArray lock as barrier.
Combined with a O(n^2) top-down walk for the allocation, we get rid of the
intermediate table allocation completely. Despite being O(n^2), it should
be actually faster because it's not practical to have a very deep cgroup
level, and in most cases the parent cgroup should have been allocated
already.
This also avoided changing kmemcg_id before reparenting, making cgroups
have a stable index for list_lru_memcg. After this change it's possible
that a dying cgroup will see a NULL value in XArray corresponding to the
kmemcg_id, because the kmemcg_id will point to an empty slot. In such
case, just fallback to use its parent.
As a result the code is simpler, following test also showed a very slight
performance gain (12 test runs):
prepare() {
mkdir /tmp/test-fs
modprobe brd rd_nr=1 rd_size=16777216
mkfs.xfs -f /dev/ram0
mount -t xfs /dev/ram0 /tmp/test-fs
for i in $(seq 10000); do
seq 8000 > "/tmp/test-fs/$i"
done
mkdir -p /sys/fs/cgroup/system.slice/bench/test/1
echo +memory > /sys/fs/cgroup/system.slice/bench/cgroup.subtree_control
echo +memory > /sys/fs/cgroup/system.slice/bench/test/cgroup.subtree_control
echo +memory > /sys/fs/cgroup/system.slice/bench/test/1/cgroup.subtree_control
echo 768M > /sys/fs/cgroup/system.slice/bench/memory.max
}
do_test() {
read_worker() {
mkdir -p "/sys/fs/cgroup/system.slice/bench/test/1/$1"
echo $BASHPID > "/sys/fs/cgroup/system.slice/bench/test/1/$1/cgroup.procs"
read -r __TMP < "/tmp/test-fs/$1";
}
read_in_all() {
for i in $(seq 10000); do
read_worker "$i" &
done; wait
}
echo 3 > /proc/sys/vm/drop_caches
time read_in_all
for i in $(seq 1 10000); do
rmdir "/sys/fs/cgroup/system.slice/bench/test/1/$i" &>/dev/null
done
}
Before:
real 0m3.498s user 0m11.037s sys 0m35.872s
real 1m33.860s user 0m11.593s sys 3m1.169s
real 1m31.883s user 0m11.265s sys 2m59.198s
real 1m32.394s user 0m11.294s sys 3m1.616s
real 1m31.017s user 0m11.379s sys 3m1.349s
real 1m31.931s user 0m11.295s sys 2m59.863s
real 1m32.758s user 0m11.254s sys 2m59.538s
real 1m35.198s user 0m11.145s sys 3m1.123s
real 1m30.531s user 0m11.393s sys 2m58.089s
real 1m31.142s user 0m11.333s sys 3m0.549s
After:
real 0m3.489s user 0m10.943s sys 0m36.036s
real 1m10.893s user 0m11.495s sys 2m38.545s
real 1m29.129s user 0m11.382s sys 3m1.601s
real 1m29.944s user 0m11.494s sys 3m1.575s
real 1m31.208s user 0m11.451s sys 2m59.693s
real 1m25.944s user 0m11.327s sys 2m56.394s
real 1m28.599s user 0m11.312s sys 3m0.162s
real 1m26.746s user 0m11.538s sys 2m55.462s
real 1m30.668s user 0m11.475s sys 3m2.075s
real 1m29.258s user 0m11.292s sys 3m0.780s
Which is slightly faster in real time.
Link: https://lkml.kernel.org/r/20241104175257.60853-5-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
8d42abbfa4 |
mm/list_lru: code clean up for reparenting
No feature change, just change of code structure and fix comment. The list lrus are not empty until memcg_reparent_list_lru_node() calls are all done, so the comments in memcg_offline_kmem were slightly inaccurate. Link: https://lkml.kernel.org/r/20241104175257.60853-4-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
78c0ed0913 |
mm/list_lru: don't export list_lru_add
It's no longer used by any module, just remove it. Link: https://lkml.kernel.org/r/20241104175257.60853-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
3f28bbe56c |
mm/list_lru: don't pass unnecessary key parameters
Patch series "mm/list_lru: Split list_lru lock into per-cgroup scope". When LOCKDEP is not enabled, lock_class_key is an empty struct that is never used. But the list_lru initialization function still takes a placeholder pointer as parameter, and the compiler cannot optimize it because the function is not static and exported. Remove this parameter and move it inside the list_lru struct. Only use it when LOCKDEP is enabled. Kernel builds with LOCKDEP will be slightly larger, while !LOCKDEP builds without it will be slightly smaller (the common case). Link: https://lkml.kernel.org/r/20241104175257.60853-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20241104175257.60853-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
3738290bfc |
kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
The Kunit tests for kmalloc_track_caller and kmalloc_node_track_caller were missing in kasan_test_c.c, which check that these functions poison the memory properly. Add a Kunit test: -> kmalloc_tracker_caller_oob_right(): This includes out-of-bounds access test for kmalloc_track_caller and kmalloc_node_track_caller. Link: https://lkml.kernel.org/r/20241014190128.442059-1-niharchaithanya@gmail.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=216509 Signed-off-by: Nihar Chaithanya <niharchaithanya@gmail.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
247d720b2c |
nommu: pass NULL argument to vma_iter_prealloc()
When deleting a vma entry from a maple tree, it has to pass NULL to
vma_iter_prealloc() in order to calculate internal state of the tree, but
it passed a wrong argument. As a result, nommu kernels crashed upon
accessing a vma iterator, such as acct_collect() reading the size of vma
entries after do_munmap().
This commit fixes this issue by passing a right argument to the
preallocation call.
Link: https://lkml.kernel.org/r/20241108222834.3625217-1-thehajime@gmail.com
Fixes:
|
|
|
|
66edc3a589 |
mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
Syzbot reported a bad page state problem caused by a page being freed using free_page() still having a mlocked flag at free_pages_prepare() stage: BUG: Bad page state in process syz.5.504 pfn:61f45 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x61f45 flags: 0xfff00000080204(referenced|workingset|mlocked|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000080204 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 8443, tgid 8442 (syz.5.504), ts 201884660643, free_ts 201499827394 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537 prep_new_page mm/page_alloc.c:1545 [inline] get_page_from_freelist+0x303f/0x3190 mm/page_alloc.c:3457 __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4733 alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265 kvm_coalesced_mmio_init+0x1f/0xf0 virt/kvm/coalesced_mmio.c:99 kvm_create_vm virt/kvm/kvm_main.c:1235 [inline] kvm_dev_ioctl_create_vm virt/kvm/kvm_main.c:5488 [inline] kvm_dev_ioctl+0x12dc/0x2240 virt/kvm/kvm_main.c:5530 __do_compat_sys_ioctl fs/ioctl.c:1007 [inline] __se_compat_sys_ioctl+0x510/0xc90 fs/ioctl.c:950 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386 do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e page last free pid 8399 tgid 8399 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1108 [inline] free_unref_folios+0xf12/0x18d0 mm/page_alloc.c:2686 folios_put_refs+0x76c/0x860 mm/swap.c:1007 free_pages_and_swap_cache+0x5c8/0x690 mm/swap_state.c:335 __tlb_batch_free_encoded_pages mm/mmu_gather.c:136 [inline] tlb_batch_pages_flush mm/mmu_gather.c:149 [inline] tlb_flush_mmu_free mm/mmu_gather.c:366 [inline] tlb_flush_mmu+0x3a3/0x680 mm/mmu_gather.c:373 tlb_finish_mmu+0xd4/0x200 mm/mmu_gather.c:465 exit_mmap+0x496/0xc40 mm/mmap.c:1926 __mmput+0x115/0x390 kernel/fork.c:1348 exit_mm+0x220/0x310 kernel/exit.c:571 do_exit+0x9b2/0x28e0 kernel/exit.c:926 do_group_exit+0x207/0x2c0 kernel/exit.c:1088 __do_sys_exit_group kernel/exit.c:1099 [inline] __se_sys_exit_group kernel/exit.c:1097 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1097 x64_sys_call+0x2634/0x2640 arch/x86/include/generated/asm/syscalls_64.h:232 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Modules linked in: CPU: 0 UID: 0 PID: 8442 Comm: syz.5.504 Not tainted 6.12.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 bad_page+0x176/0x1d0 mm/page_alloc.c:501 free_page_is_bad mm/page_alloc.c:918 [inline] free_pages_prepare mm/page_alloc.c:1100 [inline] free_unref_page+0xed0/0xf20 mm/page_alloc.c:2638 kvm_destroy_vm virt/kvm/kvm_main.c:1327 [inline] kvm_put_kvm+0xc75/0x1350 virt/kvm/kvm_main.c:1386 kvm_vcpu_release+0x54/0x60 virt/kvm/kvm_main.c:4143 __fput+0x23f/0x880 fs/file_table.c:431 task_work_run+0x24f/0x310 kernel/task_work.c:239 exit_task_work include/linux/task_work.h:43 [inline] do_exit+0xa2f/0x28e0 kernel/exit.c:939 do_group_exit+0x207/0x2c0 kernel/exit.c:1088 __do_sys_exit_group kernel/exit.c:1099 [inline] __se_sys_exit_group kernel/exit.c:1097 [inline] __ia32_sys_exit_group+0x3f/0x40 kernel/exit.c:1097 ia32_sys_call+0x2624/0x2630 arch/x86/include/generated/asm/syscalls_32.h:253 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386 do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e RIP: 0023:0xf745d579 Code: Unable to access opcode bytes at 0xf745d54f. RSP: 002b:00000000f75afd6c EFLAGS: 00000206 ORIG_RAX: 00000000000000fc RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 00000000ffffff9c RDI: 00000000f744cff4 RBP: 00000000f717ae61 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> The problem was originally introduced by commit |
|
|
|
1857099c18 |
kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
During running KASAN Kunit tests with CONFIG_KASAN enabled, the following "warning" is reported by kunit framework: # kasan_atomics: Test should be marked slow (runtime: 2.604703115s) It took 2.6 seconds on my PC (Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz), apparently, due to multiple atomic checks in kasan_atomics_helper(). Let's mark it with KUNIT_CASE_SLOW which reports now as: # kasan_atomics.speed: slow Link: https://lkml.kernel.org/r/20241101184011.3369247-3-snovitoll@gmail.com Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
c28432acf6 |
kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
Patch series "kasan: few improvements on kunit tests". This patch series addresses the issue [1] with KASAN symbols used in the Kunit test, but exported as EXPORT_SYMBOL_GPL. Also a small tweak of marking kasan_atomics() as KUNIT_CASE_SLOW to avoid kunit report that the test should be marked as slow. This patch (of 2): Replace EXPORT_SYMBOL_GPL with EXPORT_SYMBOL_IF_KUNIT to mark the symbols as visible only if CONFIG_KUNIT is enabled. KASAN Kunit test should import the namespace EXPORTED_FOR_KUNIT_TESTING to use these marked symbols. Link: https://lkml.kernel.org/r/20241101184011.3369247-1-snovitoll@gmail.com Link: https://lkml.kernel.org/r/20241101184011.3369247-2-snovitoll@gmail.com Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Reported-by: Andrey Konovalov <andreyknvl@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218315 Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
ad2bc8812f |
mm: remove unnecessary page_table_lock on stack expansion
Ever since commit
|
|
|
|
93c1e57ade |
mm: huge_memory: use strscpy() instead of strcpy()
Replace strcpy() with strscpy() in mm/huge_memory.c strcpy() has been deprecated because it is generally unsafe, so help to eliminate it from the kernel source. Link: https://github.com/KSPP/linux/issues/88 Link: https://lkml.kernel.org/r/20241101165719.1074234-7-mcanal@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
24f9cd195f |
mm: shmem: override mTHP shmem default with a kernel parameter
Add the ``thp_shmem=`` kernel command line to allow specifying the default policy of each supported shmem hugepage size. The kernel parameter accepts the following format: thp_shmem=<size>[KMG],<size>[KMG]:<policy>;<size>[KMG]-<size>[KMG]:<policy> For example, thp_shmem=16K-64K:always;128K,512K:inherit;256K:advise;1M-2M:never;4M-8M:within_size Some GPUs may benefit from using huge pages. Since DRM GEM uses shmem to allocate anonymous pageable memory, it's essential to control the huge page allocation policy for the internal shmem mount. This control can be achieved through the ``transparent_hugepage_shmem=`` parameter. Beyond just setting the allocation policy, it's crucial to have granular control over the size of huge pages that can be allocated. The GPU may support only specific huge page sizes, and allocating pages larger/smaller than those sizes would be ineffective. Link: https://lkml.kernel.org/r/20241101165719.1074234-6-mcanal@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
1c8d484975 |
mm: move ``get_order_from_str()`` to internal.h
In order to implement a kernel parameter similar to ``thp_anon=`` for shmem, we'll need the function ``get_order_from_str()``. Instead of duplicating the function, move the function to a shared header, in which both mm/shmem.c and mm/huge_memory.c will be able to use it. Link: https://lkml.kernel.org/r/20241101165719.1074234-5-mcanal@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
9490428111 |
mm: shmem: control THP support through the kernel command line
Patch series "mm: add more kernel parameters to control mTHP", v5. This series introduces four patches related to the kernel parameters controlling mTHP and a fifth patch replacing `strcpy()` for `strscpy()` in the file `mm/huge_memory.c`. The first patch is a straightforward documentation update, correcting the format of the kernel parameter ``thp_anon=``. The second, third, and fourth patches focus on controlling THP support for shmem via the kernel command line. The second patch introduces a parameter to control the global default huge page allocation policy for the internal shmem mount. The third patch moves a piece of code to a shared header to ease the implementation of the fourth patch. Finally, the fourth patch implements a parameter similar to ``thp_anon=``, but for shmem. The goal of these changes is to simplify the configuration of systems that rely on mTHP support for shmem. For instance, a platform with a GPU that benefits from huge pages may want to enable huge pages for shmem. Having these kernel parameters streamlines the configuration process and ensures consistency across setups. This patch (of 4): Add a new kernel command line to control the hugepage allocation policy for the internal shmem mount, ``transparent_hugepage_shmem``. The parameter is similar to ``transparent_hugepage`` and has the following format: transparent_hugepage_shmem=<policy> where ``<policy>`` is one of the seven valid policies available for shmem. Configuring the default huge page allocation policy for the internal shmem mount can be beneficial for DRM GPU drivers. Just as CPU architectures, GPUs can also take advantage of huge pages, but this is possible only if DRM GEM objects are backed by huge pages. Since GEM uses shmem to allocate anonymous pageable memory, having control over the default huge page allocation policy allows for the exploration of huge pages use on GPUs that rely on GEM objects backed by shmem. Link: https://lkml.kernel.org/r/20241101165719.1074234-2-mcanal@igalia.com Link: https://lkml.kernel.org/r/20241101165719.1074234-4-mcanal@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: dri-devel@lists.freedesktop.org Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kernel-dev@igalia.com Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
8e1817b6ba |
vma: detect infinite loop in vma tree
There have been no reported infinite loops in the tree, but checking the detection of an infinite loop during validation is simple enough. Add the detection to the validate_mm() function so that error reports are clear and don't just report stalls. This does not protect against internal maple tree issues, but it does detect too many vmas being returned from the tree. The variance of +10 is to allow for the debugging output to be more useful for nearly correct counts. In the event of more than 10 over the map_count, the count will be set to -1 for easier identification of a potential infinite loop. Note that the mmap lock is held to ensure a consistent tree state during the validation process. [akpm@linux-foundation.org: add comment] Link: https://lkml.kernel.org/r/20241031193608.1965366-1-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
ec397ea00c |
mm: page_frag: use __alloc_pages() to replace alloc_pages_node()
It seems there is about 24Bytes binary size increase for __page_frag_cache_refill() after refactoring in arm64 system with 64K PAGE_SIZE. By doing the gdb disassembling, It seems we can have more than 100Bytes decrease for the binary size by using __alloc_pages() to replace alloc_pages_node(), as there seems to be some unnecessary checking for nid being NUMA_NO_NODE, especially when page_frag is part of the mm system. CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-8-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
0c3ce2f502 |
mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'
Currently there is one 'struct page_frag' for every 'struct sock' and 'struct task_struct', we are about to replace the 'struct page_frag' with 'struct page_frag_cache' for them. Before begin the replacing, we need to ensure the size of 'struct page_frag_cache' is not bigger than the size of 'struct page_frag', as there may be tens of thousands of 'struct sock' and 'struct task_struct' instances in the system. By or'ing the page order & pfmemalloc with lower bits of 'va' instead of using 'u16' or 'u32' for page size and 'u8' for pfmemalloc, we are able to avoid 3 or 5 bytes space waste. And page address & pfmemalloc & order is unchanged for the same page in the same 'page_frag_cache' instance, it makes sense to fit them together. After this patch, the size of 'struct page_frag_cache' should be the same as the size of 'struct page_frag'. CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-7-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
8218f62c9c |
mm: page_frag: use initial zero offset for page_frag_alloc_align()
We are about to use page_frag_alloc_*() API to not just allocate memory for skb->data, but also use them to do the memory allocation for skb frag too. Currently the implementation of page_frag in mm subsystem is running the offset as a countdown rather than count-up value, there may have several advantages to that as mentioned in [1], but it may have some disadvantages, for example, it may disable skb frag coalescing and more correct cache prefetching We have a trade-off to make in order to have a unified implementation and API for page_frag, so use a initial zero offset in this patch, and the following patch will try to make some optimization to avoid the disadvantages as much as possible. 1. https://lore.kernel.org/all/f4abe71b3439b39d17a6fb2d410180f367cadf5c.camel@gmail.com/ CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-4-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
65941f10ca |
mm: move the page fragment allocator from page_alloc into its own file
Inspired by [1], move the page fragment allocator from page_alloc into its own c file and header file, as we are about to make more change for it to replace another page_frag implementation in sock.c As this patchset is going to replace 'struct page_frag' with 'struct page_frag_cache' in sched.h, including page_frag_cache.h in sched.h has a compiler error caused by interdependence between mm_types.h and mm.h for asm-offsets.c, see [2]. So avoid the compiler error by moving 'struct page_frag_cache' to mm_types_task.h as suggested by Alexander, see [3]. 1. https://lore.kernel.org/all/20230411160902.4134381-3-dhowells@redhat.com/ 2. https://lore.kernel.org/all/15623dac-9358-4597-b3ee-3694a5956920@gmail.com/ 3. https://lore.kernel.org/all/CAKgT0UdH1yD=LSCXFJ=YM_aiA4OomD-2wXykO42bizaWMt_HOA@mail.gmail.com/ CC: David Howells <dhowells@redhat.com> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-3-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
|
|
408a8dc623 |
mm/memory-failure: replace sprintf() with sysfs_emit()
As Documentation/filesystems/sysfs.rst suggested, show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Link: https://lkml.kernel.org/r/20241029101853.37890-1-zhangguopeng@kylinos.cn Signed-off-by: zhangguopeng <zhangguopeng@kylinos.cn> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
f914ac96ee |
memcg: add flush tracepoint
This tracepoint gives visibility on how often the flushing of memcg stats occurs and contains info on whether it was forced, skipped, and the value of stats updated. It can help with understanding how readers are affected by having to perform the flush, and the effectiveness of the flush by inspecting the number of stats updated. Paired with the recently added tracepoints for tracing rstat updates, it can also help show correlation where stats exceed thresholds frequently. Link: https://lkml.kernel.org/r/20241029021106.25587-3-inwardvessel@gmail.com Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
e1479b880c |
memcg: rename do_flush_stats and add force flag
Patch series "memcg: tracepoint for flushing stats", v3. This series adds new capability for understanding frequency and circumstances behind flushing memcg stats. This patch (of 2): Change the name to something more consistent with others in the file and use double unders to signify it is associated with the mem_cgroup_flush_stats() API call. Additionally include a new flag that call sites use to indicate a forced flush; skipping checks and flushing unconditionally. There are no changes in functionality. Link: https://lkml.kernel.org/r/20241029021106.25587-1-inwardvessel@gmail.com Link: https://lkml.kernel.org/r/20241029021106.25587-2-inwardvessel@gmail.com Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
ab6e8e74e4 |
mm: delete the unused put_pages_list()
The last user of put_pages_list() converted away from it in 6.10 commit
|
|
|
|
662df3e5c3 |
mm: madvise: implement lightweight guard page mechanism
Implement a new lightweight guard page feature, that is regions of userland virtual memory that, when accessed, cause a fatal signal to arise. Currently users must establish PROT_NONE ranges to achieve this. However this is very costly memory-wise - we need a VMA for each and every one of these regions AND they become unmergeable with surrounding VMAs. In addition repeated mmap() calls require repeated kernel context switches and contention of the mmap lock to install these ranges, potentially also having to unmap memory if installed over existing ranges. The lightweight guard approach eliminates the VMA cost altogether - rather than establishing a PROT_NONE VMA, it operates at the level of page table entries - establishing PTE markers such that accesses to them cause a fault followed by a SIGSGEV signal being raised. This is achieved through the PTE marker mechanism, which we have already extended to provide PTE_MARKER_GUARD, which we installed via the generic page walking logic which we have extended for this purpose. These guard ranges are established with MADV_GUARD_INSTALL. If the range in which they are installed contain any existing mappings, they will be zapped, i.e. free the range and unmap memory (thus mimicking the behaviour of MADV_DONTNEED in this respect). Any existing guard entries will be left untouched. There is therefore no nesting of guarded pages. Guarded ranges are NOT cleared by MADV_DONTNEED nor MADV_FREE (in both instances the memory range may be reused at which point a user would expect guards to still be in place), but they are cleared via MADV_GUARD_REMOVE, process teardown or unmapping of memory ranges. The guard property can be removed from ranges via MADV_GUARD_REMOVE. The ranges over which this is applied, should they contain non-guard entries, will be untouched, with only guard entries being cleared. We permit this operation on anonymous memory only, and only VMAs which are non-special, non-huge and not mlock()'d (if we permitted this we'd have to drop locked pages which would be rather counterintuitive). Racing page faults can cause repeated attempts to install guard pages that are interrupted, result in a zap, and this process can end up being repeated. If this happens more than would be expected in normal operation, we rescind locks and retry the whole thing, which avoids lock contention in this scenario. Link: https://lkml.kernel.org/r/6aafb5821bf209f277dfae0787abb2ef87a37542.1730123433.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Zankel <chris@zankel.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
7c53dfbdb0 |
mm: add PTE_MARKER_GUARD PTE marker
Add a new PTE marker that results in any access causing the accessing process to segfault. This is preferable to PTE_MARKER_POISONED, which results in the same handling as hardware poisoned memory, and is thus undesirable for cases where we simply wish to 'soft' poison a range. This is in preparation for implementing the ability to specify guard pages at the page table level, i.e. ranges that, when accessed, should cause process termination. Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single purpose was simply incorrect. We then reuse the same logic to determine whether a zap should clear a guard entry - this should only be performed on teardown and never on MADV_DONTNEED or MADV_FREE. We additionally add a WARN_ON_ONCE() in hugetlb logic should a guard marker be encountered there, as we explicitly do not support this operation and this should not occur. Link: https://lkml.kernel.org/r/f47f3d5acca2dcf9bbf655b6d33f3dc713e4a4a0.1730123433.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabkba@suse.cz> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Zankel <chris@zankel.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
5f6170a469 |
mm: pagewalk: add the ability to install PTEs
Patch series "implement lightweight guard pages", v4. Userland library functions such as allocators and threading implementations often require regions of memory to act as 'guard pages' - mappings which, when accessed, result in a fatal signal being sent to the accessing process. The current means by which these are implemented is via a PROT_NONE mmap() mapping, which provides the required semantics however incur an overhead of a VMA for each such region. With a great many processes and threads, this can rapidly add up and incur a significant memory penalty. It also has the added problem of preventing merges that might otherwise be permitted. This series takes a different approach - an idea suggested by Vlastimil Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the provenance becomes a little tricky to ascertain after this - please forgive any omissions!) - rather than locating the guard pages at the VMA layer, instead placing them in page tables mapping the required ranges. Early testing of the prototype version of this code suggests a 5 times speed up in memory mapping invocations (in conjunction with use of process_madvise()) and a 13% reduction in VMAs on an entirely idle android system and unoptimised code. We expect with optimisation and a loaded system with a larger number of guard pages this could significantly increase, but in any case these numbers are encouraging. This way, rather than having separate VMAs specifying which parts of a range are guard pages, instead we have a VMA spanning the entire range of memory a user is permitted to access and including ranges which are to be 'guarded'. After mapping this, a user can specify which parts of the range should result in a fatal signal when accessed. By restricting the ability to specify guard pages to memory mapped by existing VMAs, we can rely on the mappings being torn down when the mappings are ultimately unmapped and everything works simply as if the memory were not faulted in, from the point of view of the containing VMAs. This mechanism in effect poisons memory ranges similar to hardware memory poisoning, only it is an entirely software-controlled form of poisoning. The mechanism is implemented via madvise() behaviour - MADV_GUARD_INSTALL which installs page table-level guard page markers - and MADV_GUARD_REMOVE - which clears them. Guard markers can be installed across multiple VMAs and any existing mappings will be cleared, that is zapped, before installing the guard page markers in the page tables. There is no concept of 'nested' guard markers, multiple attempts to install guard markers in a range will, after the first attempt, have no effect. Importantly, removing guard markers over a range that contains both guard markers and ordinary backed memory has no effect on anything but the guard markers (including leaving huge pages un-split), so a user can safely remove guard markers over a range of memory leaving the rest intact. The actual mechanism by which the page table entries are specified makes use of existing logic - PTE markers, which are used for the userfaultfd UFFDIO_POISON mechanism. Unfortunately PTE_MARKER_POISONED is not suited for the guard page mechanism as it results in VM_FAULT_HWPOISON semantics in the fault handler, so we add our own specific PTE_MARKER_GUARD and adapt existing logic to handle it. We also extend the generic page walk mechanism to allow for installation of PTEs (carefully restricted to memory management logic only to prevent unwanted abuse). We ensure that zapping performed by MADV_DONTNEED and MADV_FREE do not remove guard markers, nor does forking (except when VM_WIPEONFORK is specified for a VMA which implies a total removal of memory characteristics). It's important to note that the guard page implementation is emphatically NOT a security feature, so a user can remove the markers if they wish. We simply implement it in such a way as to provide the least surprising behaviour. An extensive set of self-tests are provided which ensure behaviour is as expected and additionally self-documents expected behaviour of guard ranges. This patch (of 5): The existing generic pagewalk logic permits the walking of page tables, invoking callbacks at individual page table levels via user-provided mm_walk_ops callbacks. This is useful for traversing existing page table entries, but precludes the ability to establish new ones. Existing mechanism for performing a walk which also installs page table entries if necessary are heavily duplicated throughout the kernel, each with semantic differences from one another and largely unavailable for use elsewhere. Rather than add yet another implementation, we extend the generic pagewalk logic to enable the installation of page table entries by adding a new install_pte() callback in mm_walk_ops. If this is specified, then upon encountering a missing page table entry, we allocate and install a new one and continue the traversal. If a THP huge page is encountered at either the PMD or PUD level we split it only if there are ops->pte_entry() (or ops->pmd_entry at PUD level), otherwise if there is only an ops->install_pte(), we avoid the unnecessary split. We do not support hugetlb at this stage. If this function returns an error, or an allocation fails during the operation, we abort the operation altogether. It is up to the caller to deal appropriately with partially populated page table ranges. If install_pte() is defined, the semantics of pte_entry() change - this callback is then only invoked if the entry already exists. This is a useful property, as it allows a caller to handle existing PTEs while installing new ones where necessary in the specified range. If install_pte() is not defined, then there is no functional difference to this patch, so all existing logic will work precisely as it did before. As we only permit the installation of PTEs where a mapping does not already exist there is no need for TLB management, however we do invoke update_mmu_cache() for architectures which require manual maintenance of mappings for other CPUs. We explicitly do not allow the existing page walk API to expose this feature as it is dangerous and intended for internal mm use only. Therefore we provide a new walk_page_range_mm() function exposed only to mm/internal.h. We take the opportunity to additionally clean up the page walker logic to be a little easier to follow. Link: https://lkml.kernel.org/r/cover.1730123433.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/51b432ebef013e3fdf9f92101533435de1bffadf.1730123433.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Zankel <chris@zankel.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Vlastimil Babka <vbabkba@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
4e4d9c72c9 |
kasan: delete CONFIG_KASAN_MODULE_TEST
Since we've migrated all tests to the KUnit framework, we can delete CONFIG_KASAN_MODULE_TEST and mentioning of it in the documentation as well. I've used the online translator to modify the non-English documentation. [snovitoll@gmail.com: fix indentation in translation] Link: https://lkml.kernel.org/r/20241020042813.3223449-1-snovitoll@gmail.com Link: https://lkml.kernel.org/r/20241016131802.3115788-4-snovitoll@gmail.com Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alex Shi <alexs@kernel.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
ca79a00bb9 |
kasan: migrate copy_user_test to kunit
Migrate the copy_user_test to the KUnit framework to verify out-of-bound detection via KASAN reports in copy_from_user(), copy_to_user() and their static functions. This is the last migrated test in kasan_test_module.c, therefore delete the file. [arnd@arndb.de: export copy_to_kernel_nofault] Link: https://lkml.kernel.org/r/20241018151112.3533820-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20241016131802.3115788-3-snovitoll@gmail.com Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alex Shi <alexs@kernel.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
aaf2914aec |
mm: add per-order mTHP swpin counters
This helps profile the sizes of folios being swapped in. Currently,
only mTHP swap-out is being counted.
The new interface can be found at:
/sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats
swpin
For example,
cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin
12809
cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin
4763
[v-songbaohua@oppo.com: add a blank line in doc]
Link: https://lkml.kernel.org/r/20241030233423.80759-1-21cnbao@gmail.com
Link: https://lkml.kernel.org/r/20241026082423.26298-1-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
ed882add6d |
mm: zswap: zswap_store_page() will initialize entry after adding to xarray.
This incorporates Yosry's suggestions in [1] for further simplifying zswap_store_page(). If the page is successfully compressed and added to the xarray, we get the pool/objcg refs, and initialize all the entry's members. Only after this, we add it to the zswap LRU. In the time between the entry's addition to the xarray and it's member initialization, we are protected against concurrent stores/loads/swapoff through the folio lock, and are protected against writeback because the entry is not on the LRU yet. This way, we don't have to drop the pool/objcg refs, now that the entry initialization is centralized to the successful page store code path. zswap_compress() is modified to take a zswap_pool parameter in keeping with this simplification (as against obtaining this from entry->pool). [1]: https://lore.kernel.org/all/CAJD7tkZh6ufHQef5HjXf_F5b5LC1EATexgseD=4WvrO+a6Ni6w@mail.gmail.com/ Link: https://lkml.kernel.org/r/20241002173329.213722-1-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
0c560dd860 |
mm: swap: count successful large folio zswap stores in hugepage zswpout stats
Added a new MTHP_STAT_ZSWPOUT entry to the sysfs transparent_hugepage stats so that successful large folio zswap stores can be accounted under the per-order sysfs "zswpout" stats: /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout Other non-zswap swap device swap-out events will be counted under the existing sysfs "swpout" stats: /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout Also, added documentation for the newly added sysfs per-order hugepage "zswpout" stats. The documentation clarifies that only non-zswap swapouts will be accounted in the existing "swpout" stats. Link: https://lkml.kernel.org/r/20241001053222.6944-8-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
b7c0ccdfba |
mm: zswap: support large folios in zswap_store()
This series enables zswap_store() to accept and store large folios. The
most significant contribution in this series is from the earlier RFC
submitted by Ryan Roberts [1]. Ryan's original RFC has been migrated to
mm-unstable as of 9-30-2024 in patch 6 of this series, and adapted based
on code review comments received for the current patch-series.
[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
The first few patches do the prep work for supporting large folios in
zswap_store. Patch 6 provides the main functionality to swap-out large
folios in zswap. Patch 7 adds sysfs per-order hugepages "zswpout"
counters that get incremented upon successful zswap_store of large folios,
and also updates the documentation for this:
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
This series is a pre-requisite for zswap compress batching of large folio
swap-out and decompress batching of swap-ins based on swapin_readahead(),
using Intel IAA hardware acceleration, which we would like to submit in
subsequent patch-series, with performance improvement data.
Thanks to Ying Huang for pre-posting review feedback and suggestions!
Thanks also to Nhat, Yosry, Johannes, Barry, Chengming, Usama, Ying and
Matthew for their helpful feedback, code/data reviews and suggestions!
I would like to thank Ryan Roberts for his original RFC [1].
System setup for testing:
=========================
Testing of this series was done with mm-unstable as of 9-27-2024, commit
de2fbaa6d9c3576ec7133ed02a370ec9376bf000 (without this patch-series) and
mm-unstable 9-30-2024 commit c121617e3606be6575cdacfdb63cc8d67b46a568
(with this patch-series). Data was gathered on an Intel Sapphire Rapids
server, dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB
RAM and 525G SSD disk partition swap. Core frequency was fixed at
2500MHz.
The vm-scalability "usemem" test was run in a cgroup whose memory.high was
fixed at 150G. The is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, and
sleeping for 10 sec before exiting:
usemem --init-time -w -O -s 10 -n 30 10g
Other kernel configuration parameters:
zswap compressors : zstd, deflate-iaa
zswap allocator : zsmalloc
vm.page-cluster : 2
In the experiments where "deflate-iaa" is used as the zswap compressor,
IAA "compression verification" is enabled by default (cat
/sys/bus/dsa/drivers/crypto/verify_compress). Hence each IAA compression
will be decompressed internally by the "iaa_crypto" driver, the crc-s
returned by the hardware will be compared and errors reported in case of
mismatches. Thus "deflate-iaa" helps ensure better data integrity as
compared to the software compressors, and the experimental data listed
below is with verify_compress set to "1".
Metrics reporting methodology:
==============================
Total and average throughput are derived from the individual 30 processes'
throughputs reported by usemem. elapsed/sys times are measured with perf.
All percentage changes are "new" vs. "old"; hence a positive value
denotes an increase in the metric, whether it is throughput or latency,
and a negative value denotes a reduction in the metric. Positive
throughput change percentages and negative latency change percentages
denote improvements.
The vm stats and sysfs hugepages stats included with the performance data
provide details on the swapout activity to zswap/swap device.
Testing labels used in data summaries:
======================================
The data refers to these test configurations and the before/after
comparisons that they do:
before-case1:
-------------
mm-unstable 9-27-2024, CONFIG_THP_SWAP=N (compares zswap 4K vs. zswap 64K)
In this scenario, CONFIG_THP_SWAP=N results in 64K/2M folios to be split
into 4K folios that get processed by zswap.
before-case2:
-------------
mm-unstable 9-27-2024, CONFIG_THP_SWAP=Y (compares SSD swap large folios vs. zswap large folios)
In this scenario, CONFIG_THP_SWAP=Y results in zswap rejecting large
folios, which will then be stored by the SSD swap device.
after:
------
v10 of this patch-series, CONFIG_THP_SWAP=Y
The "after" is CONFIG_THP_SWAP=Y and v10 of this patch-series, that results
in 64K/2M folios to not be split, and to be processed by zswap_store.
Regression Testing:
===================
I ran vm-scalability usemem without large folios, i.e., only 4K folios with
mm-unstable and this patch-series. The main goal was to make sure that
there is no functional or performance regression wrt the earlier zswap
behavior for 4K folios, now that 4K folios will be processed by the new
zswap_store() code.
The data indicates there is no significant regression.
-------------------------------------------------------------------------------
4K folios:
==========
zswap compressor zstd zstd zstd zstd v10
before-case1 before-case2 after vs. vs.
case1 case2
-------------------------------------------------------------------------------
Total throughput (KB/s) 4,793,363 4,880,978 4,853,074 1% -1%
Average throughput (KB/s) 159,778 162,699 161,769 1% -1%
elapsed time (sec) 130.14 123.17 126.29 -3% 3%
sys time (sec) 3,135.53 2,985.64 3,083.18 -2% 3%
memcg_high 446,826 444,626 452,930
memcg_swap_fail 0 0 0
zswpout 48,932,107 48,931,971 48,931,820
zswpin 383 386 397
pswpout 0 0 0
pswpin 0 0 0
thp_swpout 0 0 0
thp_swpout_fallback 0 0 0
64kB-mthp_swpout_fallback 0 0 0
pgmajfault 3,063 3,077 3,479
swap_ra 93 94 96
swap_ra_hit 47 47 50
ZSWPOUT-64kB n/a n/a 0
SWPOUT-64kB 0 0 0
-------------------------------------------------------------------------------
Performance Testing:
====================
We list the data for 64K folios with before/after data per-compressor,
followed by the same for 2M pmd-mappable folios.
-------------------------------------------------------------------------------
64K folios: zstd:
=================
zswap compressor zstd zstd zstd zstd v10
before-case1 before-case2 after vs. vs.
case1 case2
-------------------------------------------------------------------------------
Total throughput (KB/s) 5,222,213 1,076,611 6,159,776 18% 472%
Average throughput (KB/s) 174,073 35,887 205,325 18% 472%
elapsed time (sec) 120.50 347.16 108.33 -10% -69%
sys time (sec) 2,930.33 248.16 2,549.65 -13% 927%
memcg_high 416,773 552,200 465,874
memcg_swap_fail 3,192,906 1,293 1,012
zswpout 48,931,583 20,903 48,931,218
zswpin 384 363 410
pswpout 0 40,778,448 0
pswpin 0 16 0
thp_swpout 0 0 0
thp_swpout_fallback 0 0 0
64kB-mthp_swpout_fallback 3,192,906 1,293 1,012
pgmajfault 3,452 3,072 3,061
swap_ra 90 87 107
swap_ra_hit 42 43 57
ZSWPOUT-64kB n/a n/a 3,057,173
SWPOUT-64kB 0 2,548,653 0
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
64K folios: deflate-iaa:
========================
zswap compressor deflate-iaa deflate-iaa deflate-iaa deflate-iaa v10
before-case1 before-case2 after vs. vs.
case1 case2
-------------------------------------------------------------------------------
Total throughput (KB/s) 5,652,608 1,089,180 7,189,778 27% 560%
Average throughput (KB/s) 188,420 36,306 239,659 27% 560%
elapsed time (sec) 102.90 343.35 87.05 -15% -75%
sys time (sec) 2,246.86 213.53 1,864.16 -17% 773%
memcg_high 576,104 502,907 642,083
memcg_swap_fail 4,016,117 1,407 1,478
zswpout 61,163,423 22,444 57,798,716
zswpin 401 368 454
pswpout 0 40,862,080 0
pswpin 0 20 0
thp_swpout 0 0 0
thp_swpout_fallback 0 0 0
64kB-mthp_swpout_fallback 4,016,117 1,407 1,478
pgmajfault 3,063 3,153 3,122
swap_ra 96 93 156
swap_ra_hit 46 45 83
ZSWPOUT-64kB n/a n/a 3,611,032
SWPOUT-64kB 0 2,553,880 0
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
2M folios: zstd:
================
zswap compressor zstd zstd zstd zstd v10
before-case1 before-case2 after vs. vs.
case1 case2
-------------------------------------------------------------------------------
Total throughput (KB/s) 5,895,500 1,109,694 6,484,224 10% 484%
Average throughput (KB/s) 196,516 36,989 216,140 10% 484%
elapsed time (sec) 108.77 334.28 106.33 -2% -68%
sys time (sec) 2,657.14 94.88 2,376.13 -11% 2404%
memcg_high 64,200 66,316 56,898
memcg_swap_fail 101,182 70 27
zswpout 48,931,499 36,507 48,890,640
zswpin 380 379 377
pswpout 0 40,166,400 0
pswpin 0 0 0
thp_swpout 0 78,450 0
thp_swpout_fallback 101,182 70 27
2MB-mthp_swpout_fallback 0 0 27
pgmajfault 3,067 3,417 3,311
swap_ra 91 90 854
swap_ra_hit 45 45 810
ZSWPOUT-2MB n/a n/a 95,459
SWPOUT-2MB 0 78,450 0
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
2M folios: deflate-iaa:
=======================
zswap compressor deflate-iaa deflate-iaa deflate-iaa deflate-iaa v10
before-case1 before-case2 after vs. vs.
case1 case2
-------------------------------------------------------------------------------
Total throughput (KB/s) 6,286,587 1,126,785 7,073,464 13% 528%
Average throughput (KB/s) 209,552 37,559 235,782 13% 528%
elapsed time (sec) 96.19 333.03 85.79 -11% -74%
sys time (sec) 2,141.44 99.96 1,826.67 -15% 1727%
memcg_high 99,253 64,666 79,718
memcg_swap_fail 129,074 53 165
zswpout 61,312,794 28,321 56,045,120
zswpin 383 406 403
pswpout 0 40,048,128 0
pswpin 0 0 0
thp_swpout 0 78,219 0
thp_swpout_fallback 129,074 53 165
2MB-mthp_swpout_fallback 0 0 165
pgmajfault 3,430 3,077 31,468
swap_ra 91 103 84,373
swap_ra_hit 47 46 84,317
ZSWPOUT-2MB n/a n/a 109,229
SWPOUT-2MB 0 78,219 0
-------------------------------------------------------------------------------
And finally, this is a comparison of deflate-iaa vs. zstd with v10 of this
patch-series:
---------------------------------------------
zswap_store large folios v10
Impr w/ deflate-iaa vs. zstd
64K folios 2M folios
---------------------------------------------
Throughput (KB/s) 17% 9%
elapsed time (sec) -20% -19%
sys time (sec) -27% -23%
---------------------------------------------
Conclusions based on the performance results:
=============================================
v10 wrt before-case1:
---------------------
We see significant improvements in throughput, elapsed and sys time for
zstd and deflate-iaa, when comparing before-case1 (THP_SWAP=N) vs. after
(THP_SWAP=Y) with zswap_store large folios.
v10 wrt before-case2:
---------------------
We see even more significant improvements in throughput and elapsed time
for zstd and deflate-iaa, when comparing before-case2 (large-folio-SSD)
vs. after (large-folio-zswap). The sys time increases with
large-folio-zswap as expected, due to the CPU compression time
vs. asynchronous disk write times, as pointed out by Ying and Yosry.
In before-case2, when zswap does not store large folios, only allocations
and cgroup charging due to 4K folio zswap stores count towards the cgroup
memory limit. However, in the after scenario, with the introduction of
zswap_store() of large folios, there is an added component of the zswap
compressed pool usage from large folio stores from potentially all 30
processes, that gets counted towards the memory limit. As a result, we see
higher swapout activity in the "after" data.
Summary:
========
The v10 data presented above shows that zswap_store of large folios
demonstrates good throughput/performance improvements compared to
conventional SSD swap of large folios with a sufficiently large 525G SSD
swap device. Hence, it seems reasonable for zswap_store to support large
folios, so that further performance improvements can be implemented.
In the experimental setup used in this patchset, we have enabled IAA
compress verification to ensure additional hardware data integrity CRC
checks not currently done by the software compressors. We see good
throughput/latency improvements with deflate-iaa vs. zstd with zswap_store
of large folios.
Some of the ideas for further reducing latency that have shown promise in
our experiments, are:
1) IAA compress/decompress batching.
2) Distributing compress jobs across all IAA devices on the socket.
The tests run for this patchset are using only 1 IAA device per core, that
avails of 2 compress engines on the device. In our experiments with IAA
batching, we distribute compress jobs from all cores to the 8 compress
engines available per socket. We further compress the pages in each folio
in parallel in the accelerator. As a result, we improve compress latency
and reclaim throughput.
In decompress batching, we use swapin_readahead to generate a prefetch
batch of 4K folios that we decompress in parallel in IAA.
------------------------------------------------------------------------------
IAA compress/decompress batching
Further improvements wrt v10 zswap_store Sequential
subpage store using "deflate-iaa":
"deflate-iaa" Batching "deflate-iaa-canned" [2] Batching
Additional Impr Additional Impr
64K folios 2M folios 64K folios 2M folios
------------------------------------------------------------------------------
Throughput (KB/s) 19% 43% 26% 55%
elapsed time (sec) -5% -14% -10% -21%
sys time (sec) 4% -7% -4% -18%
------------------------------------------------------------------------------
With zswap IAA compress/decompress batching, we are able to demonstrate
significant performance improvements and memory savings in server
scalability experiments in highly contended system scenarios under
significant memory pressure; as compared to software compressors. We hope
to submit this work in subsequent patch series. The current patch-series
is a prequisite for these future submissions.
This patch (of 7):
zswap_store() will store large folios by compressing them page by page.
This patch provides a sequential implementation of storing a large folio
in zswap_store() by iterating through each page in the folio to compress
and store it in the zswap zpool.
zswap_store() calls the newly added zswap_store_page() function for each
page in the folio. zswap_store_page() handles compressing and storing
each page.
We check the global and per-cgroup limits once at the beginning of
zswap_store(), and only check that the limit is not reached yet. This is
racy and inaccurate, but it should be sufficient for now. We also obtain
initial references to the relevant objcg and pool to guarantee that
subsequent references can be acquired by zswap_store_page(). A new
function zswap_pool_get() is added to facilitate this.
If these one-time checks pass, we compress the pages of the folio, while
maintaining a running count of compressed bytes for all the folio's pages.
If all pages are successfully compressed and stored, we do the cgroup
zswap charging with the total compressed bytes, and batch update the
zswap_stored_pages atomic/zswpout event stats with folio_nr_pages() once,
before returning from zswap_store().
If an error is encountered during the store of any page in the folio, all
pages in that folio currently stored in zswap will be invalidated. Thus,
a folio is either entirely stored in zswap, or entirely not stored in
zswap.
The most important value provided by this patch is it enables swapping out
large folios to zswap without splitting them. Furthermore, it batches
some operations while doing so (cgroup charging, stats updates).
This patch also forms the basis for building compress batching of pages in
a large folio in zswap_store() by compressing up to say, 8 pages of the
folio in parallel in hardware using the Intel In-Memory Analytics
Accelerator (Intel IAA).
This change reuses and adapts the functionality in Ryan Roberts' RFC
patch [1]:
"[RFC,v1] mm: zswap: Store large folios without splitting"
[1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
Link: https://lkml.kernel.org/r/20241001053222.6944-1-kanchana.p.sridhar@intel.com
Link: https://lkml.kernel.org/r/20241001053222.6944-7-kanchana.p.sridhar@intel.com
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Originally-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Wajdi Feghali <wajdi.k.feghali@intel.com>
Cc: "Zou, Nanhai" <nanhai.zou@intel.com>
Cc: Barry Song <21cnbao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
6e1fa555ec |
mm: zswap: modify zswap_stored_pages to be atomic_long_t
For zswap_store() to support large folios, we need to be able to do a batch update of zswap_stored_pages upon successful store of all pages in the folio. For this, we need to add folio_nr_pages(), which returns a long, to zswap_stored_pages. Link: https://lkml.kernel.org/r/20241001053222.6944-6-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
0201c054c2 |
mm: zswap: rename zswap_pool_get() to zswap_pool_tryget()
Modify the name of the existing zswap_pool_get() to zswap_pool_tryget() to be representative of the call it makes to percpu_ref_tryget(). A subsequent patch will introduce a new zswap_pool_get() that calls percpu_ref_get(). The intent behind this change is for higher level zswap API such as zswap_store() to call zswap_pool_tryget() to check upfront if the pool's refcount is "0" (which means it could be getting destroyed) and to handle this as an error condition. zswap_store() would proceed only if zswap_pool_tryget() returns success, and any additional pool refcounts that need to be obtained for compressing sub-pages in a large folio could simply call zswap_pool_get(). Link: https://lkml.kernel.org/r/20241001053222.6944-4-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
3d0f560a36 |
mm: zswap: modify zswap_compress() to accept a page instead of a folio
For zswap_store() to be able to store a large folio by compressing it one page at a time, zswap_compress() needs to accept a page as input. This will allow us to iterate through each page in the folio in zswap_store(), compress it and store it in the zpool. Link: https://lkml.kernel.org/r/20241001053222.6944-3-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
2ec0859039 |
Merge branch 'mm-hotfixes-stable' into mm-stable
Pick up
|
|
|
|
e7ac4daeed |
mm: count zeromap read and set for swapout and swapin
When the proportion of folios from the zeromap is small, missing their accounting may not significantly impact profiling. However, it's easy to construct a scenario where this becomes an issue—for example, allocating 1 GB of memory, writing zeros from userspace, followed by MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out and swap-in counts seem to vanish into a black hole, potentially causing semantic ambiguity. On the other hand, Usama reported that zero-filled pages can exceed 10% in workloads utilizing zswap, while Hailong noted that some app in Android have more than 6% zero-filled pages. Before commit |
|
|
|
ace149e083 |
filemap: Fix bounds checking in filemap_read()
If the caller supplies an iocb->ki_pos value that is close to the
filesystem upper limit, and an iterator with a count that causes us to
overflow that limit, then filemap_read() enters an infinite loop.
This behaviour was discovered when testing xfstests generic/525 with the
"localio" optimisation for loopback NFS mounts.
Reported-by: Mike Snitzer <snitzer@kernel.org>
Fixes:
|
|
|
|
28e43197c4 |
20 hotfixes, 14 of which are cc:stable.
Three affect DAMON. Lorenzo's five-patch series to address the
mmap_region error handling is here also.
Apart from that, various singletons.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzBVmAAKCRDdBJ7gKXxA
ju42AQD0EEnzW+zFyI+E7x5FwCmLL6ofmzM8Sw9YrKjaeShdZgEAhcyS2Rc/AaJq
Uty2ZvVMDF2a9p9gqHfKKARBXEbN2w0=
=n+lO
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"20 hotfixes, 14 of which are cc:stable.
Three affect DAMON. Lorenzo's five-patch series to address the
mmap_region error handling is here also.
Apart from that, various singletons"
* tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mailmap: add entry for Thorsten Blum
ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove()
signal: restore the override_rlimit logic
fs/proc: fix compile warning about variable 'vmcore_mmap_ops'
ucounts: fix counter leak in inc_rlimit_get_ucounts()
selftests: hugetlb_dio: check for initial conditions to skip in the start
mm: fix docs for the kernel parameter ``thp_anon=``
mm/damon/core: avoid overflow in damon_feed_loop_next_input()
mm/damon/core: handle zero schemes apply interval
mm/damon/core: handle zero {aggregation,ops_update} intervals
mm/mlock: set the correct prev on failure
objpool: fix to make percpu slot allocation more robust
mm/page_alloc: keep track of free highatomic
mm: resolve faulty mmap_region() error path behaviour
mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
mm: refactor map_deny_write_exec()
mm: unconditionally close VMAs on error
mm: avoid unsafe VMA hook invocation when error arises on mmap hook
mm/thp: fix deferred split unqueue naming and locking
mm/thp: fix deferred split queue not partially_mapped
|
|
|
|
f1dce1f093 |
slab fix for 6.12-rc7
-----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmcuE+8ACgkQu+CwddJF iJoAmAf+JhB/c4xgZ6ztCPNRHAeMTBomr578qFqE1uU7HW4rZaWiVAuIYRghpVgj xXXRU1sITBrMJzakRr3kYDIjchv08yDOd/Bx3nkgRUHAozhNh2DVGR7XVF9qKNDU 0Xof4+hNXSAqHsBTgJm3rYq42qdjVrJ0oA83EfwHFRUxVwrc6pARBrbNHprxfx1q /HbGI/FWqF/O2KEO45XuXHc/G4ZxLu/DlsHEcP7jHKG/TU2u3+wIUzGkIe1zgHH8 pD5ARsRA9QG2zQ3Z12guh4zyLVjc+REg29/ko8J5cLLs79KHV7I9nSHW5+bw0425 zAgOmo3P2NwQSnmNo0fdTWlNPniIsg== =+Co+ -----END PGP SIGNATURE----- Merge tag 'slab-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: - Fix for duplicate caches in some arm64 configurations with CONFIG_SLAB_BUCKETS (Koichiro Den) * tag 'slab-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: fix warning caused by duplicate kmem_cache creation in kmem_buckets_create |
|
|
|
73da523802 |
mm/damon/tests/dbgfs-kunit: fix the header double inclusion guarding ifdef comment
Closing part of double inclusion guarding macro for dbgfs-kunit.h was
copy-pasted from somewhere (maybe before the initial mainline merge of
DAMON), and not properly updated. Fix it.
Link: https://lkml.kernel.org/r/20241028233058.283381-7-sj@kernel.org
Fixes:
|
|
|
|
12d021659c |
mm/damon/Kconfig: update DBGFS_KUNIT prompt copy for SYSFS_KUNIT
CONFIG_DAMON_SYSFS_KUNIT_TEST prompt is copied from that for DAMON debugfs
interface kunit tests, and not correctly updated. Fix it.
Link: https://lkml.kernel.org/r/20241028233058.283381-6-sj@kernel.org
Fixes:
|
|
|
|
2b1d55498b |
memcg: factor out mem_cgroup_stat_aggregate()
Currently mem_cgroup_css_rstat_flush() is used to flush the per-CPU statistics from a specified CPU into the global statistics of the memcg. It processes three kinds of data in three for loops using exactly the same method. Therefore, the for loop can be factored out and may make the code more clean. Link: https://lkml.kernel.org/r/20241026093407.310955-1-xiujianfeng@huaweicloud.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wang Weiyang <wangweiyang2@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
e8c1a296b8 |
mm/show_mem: use str_yes_no() helper in show_free_areas()
Remove hard-coded strings by using the str_yes_no() helper function. Link: https://lkml.kernel.org/r/20241026103552.6790-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
1bc542c6a0 |
mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
Commit |
|
|
|
33d7f15f91 |
mm: use page->private instead of page->index in percpu
The percpu allocator only uses one field in struct page, just change it from page->index to page->private. Link: https://lkml.kernel.org/r/20241005200121.3231142-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
544ec0ed37 |
mm: remove references to page->index in huge_memory.c
We already have folios in all these places; it's just a matter of using them instead of the pages. Link: https://lkml.kernel.org/r/20241005200121.3231142-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
0386aaa6e9 |
bootmem: stop using page->index
Encode the type into the bottom four bits of page->private and the info into the remaining bits. Also turn the bootmem type into a named enum. [arnd@arndb.de: bootmem: add bootmem_type stub function] Link: https://lkml.kernel.org/r/20241015143802.577613-1-arnd@kernel.org [akpm@linux-foundation.org: fix build with !CONFIG_HAVE_BOOTMEM_INFO_NODE] Link: https://lore.kernel.org/oe-kbuild-all/202410090311.eaqcL7IZ-lkp@intel.com/ Link: https://lkml.kernel.org/r/20241005200121.3231142-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
68158bfa3d |
mm: mass constification of folio/page pointers
Now that page_pgoff() takes const pointers, we can constify the pointers to a lot of functions. Link: https://lkml.kernel.org/r/20241005200121.3231142-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
713da0b33b |
mm: renovate page_address_in_vma()
This function doesn't modify any of its arguments, so if we make a few other functions take const pointers, we can make page_address_in_vma() take const pointers too. All of its callers have the containing folio already, so pass that in as an argument instead of recalculating it. Also add kernel-doc Link: https://lkml.kernel.org/r/20241005200121.3231142-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
7d3e93eca3 |
mm: use page_pgoff() in more places
There are several places which currently open-code page_pgoff(), convert them to call it. Link: https://lkml.kernel.org/r/20241005200121.3231142-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
f7470591f8 |
mm: convert page_to_pgoff() to page_pgoff()
Patch series "page->index removals in mm", v2. As part of shrinking struct page, we need to stop using page->index. This patchset gets rid of most of the remaining references to page->index in mm, as well as increasing the number of functions which take a const folio/page pointer. It shrinks the text segment of mm by a few hundred bytes in my test config, probably mostly from removing calls to compound_head() in page_to_pgoff(). This patch (of 7): Change the function signature to pass in the folio as all three callers have it. This removes a reference to page->index, which we're trying to get rid of. And add kernel-doc. Link: https://lkml.kernel.org/r/20241005200121.3231142-1-willy@infradead.org Link: https://lkml.kernel.org/r/20241005200121.3231142-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
e664c2cd98 |
mm/zsmalloc: use memcpy_from/to_page whereever possible
As part of "zsmalloc: replace kmap_atomic with kmap_local_page" [1] we replaced kmap/kunmap_atomic() with kmap_local_page()/kunmap_local(). But later it was found that some of the code could be replaced with already available apis in highmem.h, such as memcpy_from_page()/memcpy_to_page(). Also, update the comments with correct api naming. [1] https://lkml.kernel.org/r/20241001175358.12970-1-quic_pintu@quicinc.com Link: https://lkml.kernel.org/r/20241010175143.27262-1-quic_pintu@quicinc.com Signed-off-by: Pintu Kumar <quic_pintu@quicinc.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Suggested-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Joe Perches <joe@perches.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pintu Agarwal <pintu.ping@gmail.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
91d0ec8347 |
zsmalloc: replace kmap_atomic with kmap_local_page
The use of kmap_atomic/kunmap_atomic is deprecated. Replace it will kmap_local_page/kunmap_local all over the place. Also fix SPDX missing license header. WARNING: Missing or malformed SPDX-License-Identifier tag in line 1 WARNING: Deprecated use of 'kmap_atomic', prefer 'kmap_local_page' instead + vaddr = kmap_atomic(page); Link: https://lkml.kernel.org/r/20241001175358.12970-1-quic_pintu@quicinc.com Signed-off-by: Pintu Kumar <quic_pintu@quicinc.com> Cc: Joe Perches <joe@perches.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pintu Agarwal <pintu.ping@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
4835f747d3 |
alloc_tag: support for page allocation tag compression
Implement support for storing page allocation tag references directly in the page flags instead of page extensions. sysctl.vm.mem_profiling boot parameter it extended to provide a way for a user to request this mode. Enabling compression eliminates memory overhead caused by page_ext and results in better performance for page allocations. However this mode will not work if the number of available page flag bits is insufficient to address all kernel allocations. Such condition can happen during boot or when loading a module. If this condition is detected, memory allocation profiling gets disabled with an appropriate warning. By default compression mode is disabled. Link: https://lkml.kernel.org/r/20241023170759.999909-7-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xiongwei Song <xiongwei.song@windriver.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
0f9b685626 |
alloc_tag: populate memory for module tags as needed
The memory reserved for module tags does not need to be backed by physical pages until there are tags to store there. Change the way we reserve this memory to allocate only virtual area for the tags and populate it with physical pages as needed when we load a module. [surenb@google.com: avoid execmem_vmap() when !MMU] Link: https://lkml.kernel.org/r/20241031233611.3833002-1-surenb@google.com Link: https://lkml.kernel.org/r/20241023170759.999909-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xiongwei Song <xiongwei.song@windriver.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
2e45474ab1 |
execmem: add support for cache of large ROX pages
Using large pages to map text areas reduces iTLB pressure and improves performance. Extend execmem_alloc() with an ability to use huge pages with ROX permissions as a cache for smaller allocations. To populate the cache, a writable large page is allocated from vmalloc with VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then remapped as ROX. The direct map alias of that large page is exculded from the direct map. Portions of that large page are handed out to execmem_alloc() callers without any changes to the permissions. When the memory is freed with execmem_free() it is invalidated again so that it won't contain stale instructions. An architecture has to implement execmem_fill_trapping_insns() callback and select ARCH_HAS_EXECMEM_ROX configuration option to be able to use the ROX cache. The cache is enabled on per-range basis when an architecture sets EXECMEM_ROX_CACHE flag in definition of an execmem_range. Link: https://lkml.kernel.org/r/20241023162711.2579610-8-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
0c133b1e78 |
module: prepare to handle ROX allocations for text
In order to support ROX allocations for module text, it is necessary to handle modifications to the code, such as relocations and alternatives patching, without write access to that memory. One option is to use text patching, but this would make module loading extremely slow and will expose executable code that is not finally formed. A better way is to have memory allocated with ROX permissions contain invalid instructions and keep a writable, but not executable copy of the module text. The relocations and alternative patches would be done on the writable copy using the addresses of the ROX memory. Once the module is completely ready, the updated text will be copied to ROX memory using text patching in one go and the writable copy will be freed. Add support for that to module initialization code and provide necessary interfaces in execmem. Link: https://lkml.kernel.org/r/20241023162711.2579610-5-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewd-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
c82be0be95 |
mm: vmalloc: don't account for number of nodes for HUGE_VMAP allocations
vmalloc allocations with VM_ALLOW_HUGE_VMAP that do not explicitly specify node ID will use huge pages only if size_per_node is larger than a huge page. Still the actual allocated memory is not distributed between nodes and there is no advantage in such approach. On the contrary, BPF allocates SZ_2M * num_possible_nodes() for each new bpf_prog_pack, while it could do with a single huge page per pack. Don't account for number of nodes for VM_ALLOW_HUGE_VMAP with NUMA_NO_NODE and use huge pages whenever the requested allocation size is larger than a huge page. Link: https://lkml.kernel.org/r/20241023162711.2579610-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
4401e9d10a |
mm/damon/core: avoid overflow in damon_feed_loop_next_input()
damon_feed_loop_next_input() is inefficient and fragile to overflows.
Specifically, 'score_goal_diff_bp' calculation can overflow when 'score'
is high. The calculation is actually unnecessary at all because 'goal' is
a constant of value 10,000. Calculation of 'compensation' is again
fragile to overflow. Final calculation of return value for under-achiving
case is again fragile to overflow when the current score is
under-achieving the target.
Add two corner cases handling at the beginning of the function to make the
body easier to read, and rewrite the body of the function to avoid
overflows and the unnecessary bp value calcuation.
Link: https://lkml.kernel.org/r/20241031161203.47751-1-sj@kernel.org
Fixes:
|
|
|
|
8e7bde615f |
mm/damon/core: handle zero schemes apply interval
DAMON's logics to determine if this is the time to apply damos schemes
assumes next_apply_sis is always set larger than current
passed_sample_intervals. And therefore assume continuously incrementing
passed_sample_intervals will make it reaches to the next_apply_sis in
future. The logic hence does apply the scheme and update next_apply_sis
only if passed_sample_intervals is same to next_apply_sis.
If Schemes apply interval is set as zero, however, next_apply_sis is set
same to current passed_sample_intervals, respectively. And
passed_sample_intervals is incremented before doing the next_apply_sis
check. Hence, next_apply_sis becomes larger than next_apply_sis, and the
logic says it is not the time to apply schemes and update next_apply_sis.
In other words, DAMON stops applying schemes until passed_sample_intervals
overflows.
Based on the documents and the common sense, a reasonable behavior for
such inputs would be applying the schemes for every sampling interval.
Handle the case by removing the assumption.
Link: https://lkml.kernel.org/r/20241031183757.49610-3-sj@kernel.org
Fixes:
|
|
|
|
3488af0970 |
mm/damon/core: handle zero {aggregation,ops_update} intervals
Patch series "mm/damon/core: fix handling of zero non-sampling intervals".
DAMON's internal intervals accounting logic is not correctly handling
non-sampling intervals of zero values for a wrong assumption. This could
cause unexpected monitoring behavior, and even result in infinite hang of
DAMON sysfs interface user threads in case of zero aggregation interval.
Fix those by updating the intervals accounting logic. For details of the
root case and solutions, please refer to commit messages of fixes.
This patch (of 2):
DAMON's logics to determine if this is the time to do aggregation and ops
update assumes next_{aggregation,ops_update}_sis are always set larger
than current passed_sample_intervals. And therefore it further assumes
continuously incrementing passed_sample_intervals every sampling interval
will make it reaches to the next_{aggregation,ops_update}_sis in future.
The logic therefore make the action and update
next_{aggregation,ops_updaste}_sis only if passed_sample_intervals is same
to the counts, respectively.
If Aggregation interval or Ops update interval are zero, however,
next_aggregation_sis or next_ops_update_sis are set same to current
passed_sample_intervals, respectively. And passed_sample_intervals is
incremented before doing the next_{aggregation,ops_update}_sis check.
Hence, passed_sample_intervals becomes larger than
next_{aggregation,ops_update}_sis, and the logic says it is not the time
to do the action and update next_{aggregation,ops_update}_sis forever,
until an overflow happens. In other words, DAMON stops doing aggregations
or ops updates effectively forever, and users cannot get monitoring
results.
Based on the documents and the common sense, a reasonable behavior for
such inputs is doing an aggregation and an ops update for every sampling
interval. Handle the case by removing the assumption.
Note that this could incur particular real issue for DAMON sysfs interface
users, in case of zero Aggregation interval. When user starts DAMON with
zero Aggregation interval and asks online DAMON parameter tuning via DAMON
sysfs interface, the request is handled by the aggregation callback.
Until the callback finishes the work, the user who requested the online
tuning just waits. Hence, the user will be stuck until the
passed_sample_intervals overflows.
Link: https://lkml.kernel.org/r/20241031183757.49610-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20241031183757.49610-2-sj@kernel.org
Fixes:
|
|
|
|
faa242b1d2 |
mm/mlock: set the correct prev on failure
After commit |
|
|
|
c928807f6f |
mm/page_alloc: keep track of free highatomic
OOM kills due to vastly overestimated free highatomic reserves were observed: ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ... Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ... Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB The second line above shows that the OOM kill was due to the following condition: free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB) And the third line shows there were no free pages in any MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type 'H'. Therefore __zone_watermark_unusable_free() underestimated the usable free memory by over 1GB, which resulted in the unnecessary OOM kill above. The comments in __zone_watermark_unusable_free() warns about the potential risk, i.e., If the caller does not have rights to reserves below the min watermark then subtract the high-atomic reserves. This will over-estimate the size of the atomic reserve but it avoids a search. However, it is possible to keep track of free pages in reserved highatomic pageblocks with a new per-zone counter nr_free_highatomic protected by the zone lock, to avoid a search when calculating the usable free memory. And the cost would be minimal, i.e., simple arithmetics in the highatomic alloc/free/move paths. Note that since nr_free_highatomic can be relatively small, using a per-cpu counter might cause too much drift and defeat its purpose, in addition to the extra memory overhead. Dependson |
|
|
|
906c38ff52 |
memcg: workingset: remove folio_memcg_rcu usage
The function workingset_activation() is called from folio_mark_accessed() with the guarantee that the given folio can not be freed under us in workingset_activation(). In addition, the association of the folio and its memcg can not be broken here because charge migration is no more. There is no need to use folio_memcg_rcu. Simply use folio_memcg_charged() because that is what this function cares about. [akpm@linux-foundation.org: provide folio_memcg_charged stub for CONFIG_MEMCG=n] Link: https://lkml.kernel.org/r/20241026163707.2479526-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Suggested-by: Yu Zhao <yuzhao@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
642c66d84c |
mm/vma: the pgoff is correct if can_merge_right
By this point can_vma_merge_right() must have returned true, which implies
can_vma_merge_before() also returned true, which already asserts that the
pgoff is as expected for a merge with the following VMA, thus this
assignment is redundant.
Below is a more detail explanation.
Current definition of can_vma_merge_right() is:
static bool can_vma_merge_right(struct vma_merge_struct *vmg,
bool can_merge_left)
{
if (!vmg->next || vmg->end != vmg->next->vm_start ||
!can_vma_merge_before(vmg))
return false;
...
}
And:
static bool can_vma_merge_before(struct vma_merge_struct *vmg)
{
pgoff_t pglen = PHYS_PFN(vmg->end - vmg->start);
...
if (vmg->next->vm_pgoff == vmg->pgoff + pglen)
return true;
...
}
Which implies vmg->pgoff == vmg->next->vm_pgoff - pglen.
None of these values are changed between the check and prior assignment,
so this was an entirely redundant assignment.
[akpm@linux-foundation.org: remove now-unused local]
[lorenzo.stoakes@oracle.com: rephrase the changelog]
Link: https://lkml.kernel.org/r/20241024093347.18057-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
5ac87a885a |
mm: defer second attempt at merge on mmap()
Rather than trying to merge again when ostensibly allocating a new VMA, instead defer until the VMA is added and attempt to merge the existing range. This way we have no complicated unwinding logic midway through the process of mapping the VMA. In addition this removes limitations on the VMA not being able to be the first in the virtual memory address space which was previously implicitly required. In theory, for this very same reason, we should unconditionally attempt merge here, however this is likely to have a performance impact so it is better to avoid this given the unlikely outcome of a merge. [lorenzo.stoakes@oracle.com: remove unnecessary indirection] Link: https://lkml.kernel.org/r/5106696d-e625-4d8a-8545-9d1430301730@lucifer.local Link: https://lkml.kernel.org/r/d4f84502605d7651ac114587f507395c0fc76004.1729858176.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
5a689bac0b |
mm: remove unnecessary reset state logic on merge new VMA
The only place where this was used was in mmap_region(), which we have now adjusted to not require this to be performed (we reset ourselves in effect). It also created a dangerous assumption that VMG state could be safely reused after a merge, at which point it may have been mutated in unexpected ways, leading to subtle bugs. Note that it was discovered by Wei Yang that there was also an error in this code - we are comparing vmg->vma with prev after setting it to NULL. This however had no impact, as we previously reset VMA iterator state before attempting merge again, but it was useless effort. In any case, this patch removes all of the logic so also eliminates this wasted effort. Link: https://lkml.kernel.org/r/5d9a59eee6498ae017cc87d89aa723de7179f75d.1729858176.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
0d11630cc5 |
mm: refactor __mmap_region()
We have seen bugs and resource leaks arise from the complexity of the __mmap_region() function. This, and the generally deeply fragile error handling logic and complexity which makes understanding the function difficult make it highly desirable to refactor it into something readable. Achieve this by separating the function into smaller logical parts which are easier to understand and follow, and which importantly very significantly simplify the error handling. Note that we now call vms_abort_munmap_vmas() in more error paths than we used to, however in cases where no abort need occur, vms->nr_pages will be equal to zero and we simply exit this function without doing more than we would have done previously. Importantly, the invocation of the driver mmap hook via mmap_file() now has very simple and obvious handling (this was previously the most problematic part of the mmap() operation). Use a generalised stack-based 'mmap state' to thread through values and also retrieve state as needed. Also avoid ever relying on vma merge (vmg) state after a merge is attempted, instead maintain meaningful state in the mmap state and establish vmg state as and when required. This avoids any subtle bugs arising from merge logic mutating this state and mmap_region() logic later relying upon it. Link: https://lkml.kernel.org/r/25bd2edc3275450f448cbfe0756ce2a7cd06810f.1729858176.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
52956b0d7f |
mm: isolate mmap internal logic to mm/vma.c
In previous commits we effected improvements to the mmap() logic in mmap_region() and its newly introduced internal implementation function __mmap_region(). However as these changes are intended to be backported, we kept the delta as small as is possible and made as few changes as possible to the newly introduced mm/vma.* files. Take the opportunity to move this logic to mm/vma.c which not only isolates it, but also makes it available for later userland testing which can help us catch such logic errors far earlier. Link: https://lkml.kernel.org/r/93fc2c3aa37dd30590b7e4ee067dfd832007bf7e.1729858176.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
a29c0e4b2e |
memcg-v1: remove memcg move locking code
The memcg v1's charge move feature has been deprecated. All the places using the memcg move lock, have stopped using it as they don't need the protection any more. Let's proceed to remove all the locking code related to charge moving. Link: https://lkml.kernel.org/r/20241025012304.2473312-7-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
cf4a65539c |
memcg-v1: no need for memcg locking for MGLRU
While updating the generation of the folios, MGLRU requires that the folio's memcg association remains stable. With the charge migration deprecated, there is no need for MGLRU to acquire locks to keep the folio and memcg association stable. [yuzhao@google.com: remove !rcu_read_lock_held() assertion] Link: https://lkml.kernel.org/r/ZykEtcHrQRq-KrBC@google.com Link: https://syzkaller.appspot.com/bug?extid=24f45b8beab9788e467e Link: https://lore.kernel.org/lkml/67294349.050a0220.701a.0010.GAE@google.com/ [akpm@linux-foundation.org: remove now-unused local] [shakeel.butt@linux.dev: folio_rcu() fixup, per Yu Zhao] Link: https://lkml.kernel.org/r/iwmabnye3nl4merealrawt3bdvfii2pwavwrddrqpraoveet7h@ezrsdhjwwej7 Link: https://lkml.kernel.org/r/20241025012304.2473312-6-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
568bcf4148 |
memcg-v1: no need for memcg locking for writeback tracking
During the era of memcg charge migration, the kernel has to be make sure that the writeback stat updates do not race with the charge migration. Otherwise it might update the writeback stats of the wrong memcg. Now with the memcg charge migration gone, there is no more race for writeback stat updates and the previous locking can be removed. Link: https://lkml.kernel.org/r/20241025012304.2473312-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
a8cd9d4ce3 |
memcg-v1: no need for memcg locking for dirty tracking
During the era of memcg charge migration, the kernel has to be make sure that the dirty stat updates do not race with the charge migration. Otherwise it might update the dirty stats of the wrong memcg. Now with the memcg charge migration gone, there is no more race for dirty stat updates and the previous locking can be removed. Link: https://lkml.kernel.org/r/20241025012304.2473312-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
6b611388b6 |
memcg-v1: remove charge move code
The memcg-v1 charge move feature has been deprecated completely and let's remove the relevant code as well. Link: https://lkml.kernel.org/r/20241025012304.2473312-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
aa6b4fdf59 |
memcg-v1: fully deprecate move_charge_at_immigrate
Patch series "memcg-v1: fully deprecate charge moving". The memcg v1's charge moving feature has been deprecated for almost 2 years and the kernel warns if someone try to use it. This warning has been backported to all stable kernel and there have not been any report of the warning or the request to support this feature anymore. Let's proceed to fully deprecate this feature. This patch (of 6): Proceed with the complete deprecation of memcg v1's charge moving feature. The deprecation warning has been in the kernel for almost two years and has been ported to all stable kernel since. Now is the time to fully deprecate this feature. Link: https://lkml.kernel.org/r/20241025012304.2473312-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20241025012304.2473312-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
729881ffd3 |
mm: shmem: fallback to page size splice if large folio has poisoned pages
The tmpfs has already supported the PMD-sized large folios, and splice() can not read any pages if the large folio has a poisoned page, which is not good as Matthew pointed out in a previous email[1]: "so if we have hwpoison set on one page in a folio, we now can't read bytes from any page in the folio? That seems like we've made a bad situation worse." Thus add a fallback to the PAGE_SIZE splice() still allows reading normal pages if the large folio has hwpoisoned pages. [1] https://lore.kernel.org/all/Zw_d0EVAJkpNJEbA@casper.infradead.org/ [baolin.wang@linux.alibaba.com: code layout cleaup, per dhowells] Link: https://lkml.kernel.org/r/32dd938c-3531-49f7-93e4-b7ff21fec569@linux.alibaba.com Link: https://lkml.kernel.org/r/e3737fbd5366c4de4337bf5f2044817e77a5235b.1729915173.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
477327e106 |
mm/damon/vaddr: add 'nr_piece == 1' check in damon_va_evenly_split_region()
As discussed in [1], damon_va_evenly_split_region() is called to size-evenly split a region into 'nr_pieces' small regions, when nr_pieces == 1, no actual split is required. Check that case for better code readability and add a simple kunit testcase. [1] https://lore.kernel.org/all/20241021163316.12443-1-sj@kernel.org/ Link: https://lkml.kernel.org/r/20241022083927.3592237-3-zhengyejian@huaweicloud.com Signed-off-by: Zheng Yejian <zhengyejian@huaweicloud.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Fernand Sieber <sieberf@amazon.com> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Ye Weihua <yeweihua4@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
f3c7a1ede4 |
mm/damon/vaddr: fix issue in damon_va_evenly_split_region()
Patch series "mm/damon/vaddr: Fix issue in
damon_va_evenly_split_region()". v2.
According to the logic of damon_va_evenly_split_region(), currently
following split case would not meet the expectation:
Suppose DAMON_MIN_REGION=0x1000,
Case: Split [0x0, 0x3000) into 2 pieces, then the result would be
acutually 3 regions:
[0x0, 0x1000), [0x1000, 0x2000), [0x2000, 0x3000)
but NOT the expected 2 regions:
[0x0, 0x1000), [0x1000, 0x3000) !!!
The root cause is that when calculating size of each split piece in
damon_va_evenly_split_region():
`sz_piece = ALIGN_DOWN(sz_orig / nr_pieces, DAMON_MIN_REGION);`
both the dividing and the ALIGN_DOWN may cause loss of precision, then
each time split one piece of size 'sz_piece' from origin 'start' to 'end'
would cause more pieces are split out than expected!!!
To fix it, count for each piece split and make sure no more than
'nr_pieces'. In addition, add above case into damon_test_split_evenly().
And add 'nr_piece == 1' check in damon_va_evenly_split_region() for better
code readability and add a corresponding kunit testcase.
This patch (of 2):
According to the logic of damon_va_evenly_split_region(), currently
following split case would not meet the expectation:
Suppose DAMON_MIN_REGION=0x1000,
Case: Split [0x0, 0x3000) into 2 pieces, then the result would be
acutually 3 regions:
[0x0, 0x1000), [0x1000, 0x2000), [0x2000, 0x3000)
but NOT the expected 2 regions:
[0x0, 0x1000), [0x1000, 0x3000) !!!
The root cause is that when calculating size of each split piece in
damon_va_evenly_split_region():
`sz_piece = ALIGN_DOWN(sz_orig / nr_pieces, DAMON_MIN_REGION);`
both the dividing and the ALIGN_DOWN may cause loss of precision,
then each time split one piece of size 'sz_piece' from origin 'start' to
'end' would cause more pieces are split out than expected!!!
To fix it, count for each piece split and make sure no more than
'nr_pieces'. In addition, add above case into damon_test_split_evenly().
After this patch, damon-operations test passed:
# ./tools/testing/kunit/kunit.py run damon-operations
[...]
============== damon-operations (6 subtests) ===============
[PASSED] damon_test_three_regions_in_vmas
[PASSED] damon_test_apply_three_regions1
[PASSED] damon_test_apply_three_regions2
[PASSED] damon_test_apply_three_regions3
[PASSED] damon_test_apply_three_regions4
[PASSED] damon_test_split_evenly
================ [PASSED] damon-operations =================
Link: https://lkml.kernel.org/r/20241022083927.3592237-1-zhengyejian@huaweicloud.com
Link: https://lkml.kernel.org/r/20241022083927.3592237-2-zhengyejian@huaweicloud.com
Fixes:
|
|
|
|
ab505e8be0 |
mm/page_alloc: use str_off_on() helper in build_all_zonelists()
Remove hard-coded strings by using the str_off_on() helper function. Link: https://lkml.kernel.org/r/20241021091340.5243-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
8717734fdc |
mm/memcontrol: fix seq_buf size to save memory when PAGE_SIZE is large
Previously the seq_buf used for accumulating the memory.stat output was sized at PAGE_SIZE. But the amount of output is invariant to PAGE_SIZE; If 4K is enough on a 4K page system, then it should also be enough on a 64K page system, so we can save 60K on the static buffer used in mem_cgroup_print_oom_meminfo(). Let's make it so. This also has the beneficial side effect of removing a place in the code that assumed PAGE_SIZE is a compile-time constant. So this helps our quest towards supporting boot-time page size selection. Link: https://lkml.kernel.org/r/20241021130027.3615969-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
628e1b8c47 |
mm: add missing mmu_notifier_clear_young for !MMU_NOTIFIER
Remove the now unnecessary ifdef in mm/damon/vaddr.c as well. Link: https://lkml.kernel.org/r/20241021160212.9935-1-jthoughton@google.com Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
39ac99852f |
mm/page-writeback: raise wb_thresh to prevent write blocking with strictlimit
With the strictlimit flag, wb_thresh acts as a hard limit in balance_dirty_pages() and wb_position_ratio(). When device write operations are inactive, wb_thresh can drop to 0, causing writes to be blocked. The issue occasionally occurs in fuse fs, particularly with network backends, the write thread is blocked frequently during a period. To address it, this patch raises the minimum wb_thresh to a controllable level, similar to the non-strictlimit case. Link: https://lkml.kernel.org/r/20241023100032.62952-1-jimzhao.ai@gmail.com Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
722376934b |
mm/memory.c: simplify pfnmap_lockdep_assert
Use local `mapping' to reduce the pointer chasing.
akpm: extracted from a bugfix which Linus fixed with
|
|
|
|
a284cb8472 |
mm: shmem: improve the tmpfs large folio read performance
tmpfs already supports PMD-sized large folios, but the tmpfs read operation still performs copying at PAGE_SIZE granularity, which is unreasonable. This patch changes tmpfs to copy data at folio granularity, which can improve the read performance, as well as changing to use folio related functions. Moreover, if a large folio has a subpage that is hwpoisoned, it will still fall back to page granularity copying. Use 'fio bs=64k' to read a 1G tmpfs file populated with 2M THPs, and I can see about 20% performance improvement, and no regression with bs=4k. Before the patch: READ: bw=10.0GiB/s After the patch: READ: bw=12.0GiB/s Link: https://lkml.kernel.org/r/2129a21a5b9f77d3bb7ddec152c009ce7c5653c4.1729218573.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
f3650ef89b |
mm: shmem: update iocb->ki_pos directly to simplify tmpfs read logic
Patch series "Improve the tmpfs large folio read performance", v2. tmpfs already supports PMD-sized large folios, but the tmpfs read operation still performs copying at PAGE_SIZE granularity, which is not perfect. This patchset changes tmpfs to copy data at the folio granularity, which can improve the read performance. Use 'fio bs=64k' to read a 1G tmpfs file populated with 2M THPs, and I can see about 20% performance improvement, and no regression with bs=4k. I also did some functional testing with the xfstests suite, and I did not find any regressions with the following xfstests config: FSTYP=tmpfs export TEST_DIR=/mnt/tempfs_mnt export TEST_DEV=/mnt/tempfs_mnt export SCRATCH_MNT=/mnt/scratchdir export SCRATCH_DEV=/mnt/scratchdir This patch (of 2): Using iocb->ki_pos to check if the read bytes exceeds the file size and to calculate the bytes to be read can help simplify the code logic. Meanwhile, this is also a preparation for improving tmpfs large folios read performance in the following patch. Link: https://lkml.kernel.org/r/cover.1729218573.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/e8863e289577e0dc1e365b5419bf2d1c9a24ae3d.1729218573.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
5bb6345cd2 |
mm: remove redundant condition for THP folio
folio_test_pmd_mappable() implies folio_test_large(), therefore, simplify the expression for is_thp. Link: https://lkml.kernel.org/r/20241018094151.3458-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
4b6b0a5188 |
mm/mremap: remove goto from mremap_to()
mremap_to() has a goto label at the end that doesn't unwind anything. Removing the label makes the code cleaner. This commit also adds documentation to the function. Link: https://lkml.kernel.org/r/20241018174114.2871880-3-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Pedro Falcato <pedro.falcato@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
58f1069311 |
mm/mremap: cleanup vma_to_resize()
Patch series "mm/mremap: Remove extra vma tree walk", v2. An extra vma tree walk was discovered in some mremap call paths during the discussion on mseal() changes. This patch set removes the extra vma tree walk and further cleans up mremap_to(). This patch (of 2): vma_to_resize() is used in two locations to find and validate the vma for the mremap location. One of the two locations already has the vma, which is then re-found to validate the same vma. This code can be simplified by moving the vma_lookup() from vma_to_resize() to mremap_to() and changing the return type to an int error. Since the function now just validates the vma, the function is renamed to resize_is_valid() to better reflect what it is doing. This commit also adds documentation about the function. Link: https://lkml.kernel.org/r/20241018174114.2871880-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20241018174114.2871880-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Pedro Falcato <pedro.falcato@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
0938b16146 |
mm: don't set readahead flag on a folio when lookahead_size > nr_to_read
The readahead flag is set on a folio based on the lookahead_size and
nr_to_read. For example, when the readahead happens from index to index +
nr_to_read, then the readahead `mark` offset from index is set at
nr_to_read - lookahead_size.
There are some scenarios where the lookahead_size > nr_to_read. For
example, readahead window was created, but the file was truncated before
the readahead starts. do_page_cache_ra() will clamp the nr_to_read if the
readahead window extends beyond EOF after truncation. If this happens,
readahead flag should not be set on any folio on the current readahead
window.
The current calculation for `mark` with mapping_min_order > 0 gives
incorrect results when lookahead_size > nr_to_read due to rounding up
operation:
index = 128
nr_to_read = 16
lookahead_size = 28
mapping_min_order = 4 (16 pages)
ra_folio_index = round_up(128 + 16 - 28, 16) = 128;
mark = 128 - 128 = 0; # offset from index to set RA flag
In the above example, the lookahead_size is actually lying outside the
current readahead window. Without this patch, RA flag will be set
incorrectly on the folio at index 128. This can lead to marking the
readahead flag on the wrong folio, therefore, triggering a readahead when
it is not necessary.
Explicitly initialize `mark` to be ULONG_MAX and only calculate it when
lookahead_size is within the readahead window.
Link: https://lkml.kernel.org/r/20241017062342.478973-1-kernel@pankajraghav.com
Fixes:
|
|
|
|
4a9a27fdf7 |
mm: shmem: remove __shmem_huge_global_enabled()
Remove __shmem_huge_global_enabled() since it as only one caller, and remove repeated check of VM_NOHUGEPAGE/MMF_DISABLE_THP as they are checked in shmem_allowable_huge_orders(), also remove unnecessary vma parameter. Link: https://lkml.kernel.org/r/20241017141457.1169092-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
9884efd795 |
mm: huge_memory: move file_thp_enabled() into huge_memory.c
file_thp_enabled() is only used in __thp_vma_allowable_orders(), so move it into huge_memory.c, also check READ_ONLY_THP_FOR_FS ahead to avoid unnecessary code if config disabled. Link: https://lkml.kernel.org/r/20241017141457.1169092-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
5a90c155de |
tmpfs: don't enable large folios if not supported
tmpfs can support large folios, but there are some configurable options (mount options and runtime deny/force) to enable/disable large folio allocation, so there is a performance issue when performing writes without large folios. The issue is similar to commit |
|
|
|
f1001f3d3b |
mm/mglru: reset page lru tier bits when activating
When a folio is activated, lru_gen_add_folio() moves the folio to the
youngest generation. But unlike folio_update_gen()/folio_inc_gen(),
lru_gen_add_folio() doesn't reset the folio lru tier bits (LRU_REFS_MASK |
LRU_REFS_FLAGS). This inconsistency can affect how pages are aged via
folio_mark_accessed() (e.g. fd accesses), though no user visible impact
related to this has been detected yet.
Note that lru_gen_add_folio() cannot clear PG_workingset if the activation
is due to workingset refault, otherwise PSI accounting will be skipped.
So fix lru_gen_add_folio() to clear the lru tier bits other than
PG_workingset when activating a folio, and also clear all the lru tier
bits when a folio is activated via folio_activate() in
lru_gen_look_around().
Link: https://lkml.kernel.org/r/20241017181528.3358821-1-weixugc@google.com
Fixes:
|
|
|
|
d3ea85c6c5 |
mm: swap: use str_true_false() helper function
Remove hard-coded strings by using the helper function str_true_false(). Link: https://lkml.kernel.org/r/20241016141040.79168-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
e4137f0881 |
mm, kasan, kmsan: instrument copy_from/to_kernel_nofault
Instrument copy_from_kernel_nofault() with KMSAN for uninitialized kernel memory check and copy_to_kernel_nofault() with KASAN, KCSAN to detect the memory corruption. syzbot reported that bpf_probe_read_kernel() kernel helper triggered KASAN report via kasan_check_range() which is not the expected behaviour as copy_from_kernel_nofault() is meant to be a non-faulting helper. Solution is, suggested by Marco Elver, to replace KASAN, KCSAN check in copy_from_kernel_nofault() with KMSAN detection of copying uninitilaized kernel memory. In copy_to_kernel_nofault() we can retain instrument_write() explicitly for the memory corruption instrumentation. copy_to_kernel_nofault() is tested on x86_64 and arm64 with CONFIG_KASAN_SW_TAGS. On arm64 with CONFIG_KASAN_HW_TAGS, kunit test currently fails. Need more clarification on it. [akpm@linux-foundation.org: fix comment layout, per checkpatch Link: https://lore.kernel.org/linux-mm/CANpmjNMAVFzqnCZhEity9cjiqQ9CVN1X7qeeeAp_6yKjwKo8iw@mail.gmail.com/ Link: https://lkml.kernel.org/r/20241011035310.2982017-1-snovitoll@gmail.com Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Reported-by: syzbot+61123a5daeb9f7454599@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=61123a5daeb9f7454599 Reported-by: Andrey Konovalov <andreyknvl@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=210505 Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> [KASAN] Tested-by: Andrey Konovalov <andreyknvl@gmail.com> [KASAN] Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
f69c2e4dc6 |
mm/vmstat: defer the refresh_zone_stat_thresholds after all CPUs bringup
refresh_zone_stat_thresholds function has two loops which is expensive for higher number of CPUs and NUMA nodes. Below is the rough estimation of total iterations done by these loops based on number of NUMA and CPUs. Total number of iterations: nCPU * 2 * Numa * mCPU Where: nCPU = total number of CPUs Numa = total number of NUMA nodes mCPU = mean value of total CPUs (e.g., 512 for 1024 total CPUs) For the system under test with 16 NUMA nodes and 1024 CPUs, this results in a substantial increase in the number of loop iterations during boot-up when NUMA is enabled: No NUMA = 1024*2*1*512 = 1,048,576 : Here refresh_zone_stat_thresholds takes around 224 ms total for all the CPUs in the system under test. 16 NUMA = 1024*2*16*512 = 16,777,216 : Here refresh_zone_stat_thresholds takes around 4.5 seconds total for all the CPUs in the system under test. Calling this for each CPU is expensive when there are large number of CPUs along with multiple NUMAs. Fix this by deferring refresh_zone_stat_thresholds to be called later at once when all the secondary CPUs are up. Also, register the DYN hooks to keep the existing hotplug functionality intact. Link: https://lkml.kernel.org/r/1723443220-20623-1-git-send-email-ssengar@linux.microsoft.com Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Srivatsa S. Bhat (Microsoft) <srivatsa@csail.mit.edu> Cc: Saurabh Singh Sengar <ssengar@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
1f2d03cc53 |
vmscan: add a vmscan event for reclaim_pages
reclaim_folio_list uses a dummy reclaim_stat and is not being used. To know the memory stat, add a new trace event. This is useful how how many pages are not reclaimed or why. This is an example: mm_vmscan_reclaim_pages: nid=0 nr_scanned=112 nr_reclaimed=112 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0 nr_ref_keep=0 nr_unmap_fail=0 Currently reclaim_folio_list is only called by reclaim_pages, and reclaim_pages is used by damon and madvise. In the latest Android, reclaim_pages is also used by shmem to reclaim all pages in a address_space. [jaewon31.kim@samsung.com: use sc.nr_scanned rather than new counting] Link: https://lkml.kernel.org/r/20241016143227.961162-1-jaewon31.kim@samsung.com Link: https://lkml.kernel.org/r/20241011124928.1224813-1-jaewon31.kim@samsung.com Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jaewon Kim <jaewon31.kim@samsung.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
5708d96da2 |
mm: avoid zeroing user movable page twice with init_on_alloc=1
Commit
|
|
|
|
773ee2cda5 |
mm/zswap: avoid touching XArray for unnecessary invalidation
zswap_invalidation simply calls xa_erase, which acquires the Xarray lock first, then does a look up. This has a higher overhead even if zswap is not used or the tree is empty. So instead, do a very lightweight xa_empty check first, if there is nothing to erase, don't touch the lock or the tree. Using xa_empty rather than zswap_never_enabled is more helpful as it cover both case where zswap wes never used or the particular range doesn't have any zswap entry. And it's safe as the swap slot should be currently pinned by caller with HAS_CACHE. Sequential SWAP in/out tests with zswap disabled showed a minor performance gain, SWAP in of zero page with zswap enabled also showed a performance gain. (swapout is basically unchanged so only test one case): Swapout of 2G zero page using brd as SWAP, zswap disabled (total time, 4 testrun, +0.1%): Before: 1705013 us 1703119 us 1704335 us 1705848 us. After: 1703579 us |
|
|
|
7e1fbaa0df |
mm/hugetlb: perform vmemmap optimization batchly for specific node allocation
When HVO is enabled and huge page memory allocs are made, the freed memory can be aggregated into higher order memory in the following paths, which facilitates further allocs for higher order memory. echo 200000 > /proc/sys/vm/nr_hugepages echo 200000 > /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages grub default_hugepagesz=2M hugepagesz=2M hugepages=200000 Currently not support for releasing aggregations to higher order in the following way, which will releasing to lower order. grub: default_hugepagesz=2M hugepagesz=2M hugepages=0:100000,1:100000 This patch supports the release of huge page optimizations aggregates to higher order memory. eg: cat /proc/cmdline BOOT_IMAGE=/boot/vmlinuz-xxx ... default_hugepagesz=2M hugepagesz=2M hugepages=0:100000,1:100000 Before: Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 ... Node 0, zone Normal, type Unmovable 55282 97039 99307 0 1 1 0 1 1 1 0 Node 0, zone Normal, type Movable 25 11 345 87 48 21 2 20 9 3 75061 Node 0, zone Normal, type Reclaimable 4 2 2 4 3 0 2 1 1 1 0 Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 ... Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 1, zone Normal, type Unmovable 98888 99650 99679 2 3 1 2 2 2 0 0 Node 1, zone Normal, type Movable 1 1 0 1 1 0 1 0 1 1 75937 Node 1, zone Normal, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 1, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 After: Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 ... Node 0, zone Normal, type Unmovable 152 158 37 2 2 0 3 4 2 6 717 Node 0, zone Normal, type Movable 1 37 53 3 55 49 16 6 2 1 75000 Node 0, zone Normal, type Reclaimable 1 4 3 1 2 1 1 1 1 1 0 Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 ... Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 1, zone Normal, type Unmovable 5 3 2 1 3 4 2 2 2 0 779 Node 1, zone Normal, type Movable 1 0 1 1 1 0 1 0 1 1 75849 Node 1, zone Normal, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 1, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Link: https://lkml.kernel.org/r/20241012070802.1876-1-suhua1@kingsoft.com Signed-off-by: suhua <suhua1@kingsoft.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
0aa3ef3637 |
memcg: add tracing for memcg stat updates
The memcg stats are maintained in rstat infrastructure which provides very fast updates side and reasonable read side. However memcg added plethora of stats and made the read side, which is cgroup rstat flush, very slow. To solve that, threshold was added in the memcg stats read side i.e. no need to flush the stats if updates are within the threshold. This threshold based improvement worked for sometime but more stats were added to memcg and also the read codepath was getting triggered in the performance sensitive paths which made threshold based ratelimiting ineffective. We need more visibility into the hot and cold stats i.e. stats with a lot of updates. Let's add trace to get that visibility. [shakeel.butt@linux.dev: use unsigned long type for memcg_rstat_events, per Yosry] Link: https://lkml.kernel.org/r/20241015213721.3804209-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20241010003550.3695245-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: JP Kobryn <inwardvessel@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
6359c39c9d |
mm: remove unused hugepage for vma_alloc_folio()
The hugepage parameter was deprecated since commit
|
|
|
|
f8780515fe |
mm: add pcp high_min high_max to proc zoneinfo
When we do not set percpu_pagelist_high_fraction the kernel will compute the pcp high_min/max by itself, which makes it hard to determine the current high_min/max values. So output the pcp high_min/max values to /proc/zoneinfo. Link: https://lkml.kernel.org/r/20241010120935.656619-1-mengensun@tencent.com Signed-off-by: MengEn Sun <mengensun@tencent.com> Reviewed-by: Jinliang Zheng <alexjlzheng@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
002c5d1ca8 |
mm/kmemleak: fix typo in object_no_scan() comment
Replace "corresponding to the give pointer" with "corresponding to the given pointer" Link: https://lkml.kernel.org/r/20241010155439.554416-1-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
afe789b736 |
kaslr: rename physmem_end and PHYSMEM_END to direct_map_physmem_end
For clarity. It's increasingly hard to reason about the code, when KASLR is moving around the boundaries. In this case where KASLR is randomizing the location of the kernel image within physical memory, the maximum number of address bits for physical memory has not changed. What has changed is the ending address of memory that is allowed to be directly mapped by the kernel. Let's name the variable, and the associated macro accordingly. Also, enhance the comment above the direct_map_physmem_end definition, to further clarify how this all works. Link: https://lkml.kernel.org/r/20241009025024.89813-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jordan Niethe <jniethe@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
1ced09e033 |
mm: allocate THP on hugezeropage wp-fault
Introduce do_huge_zero_wp_pmd() to handle wp-fault on a hugezeropage and replace it with a PMD-mapped THP. Remember to flush TLB entry corresponding to the hugezeropage. In case of failure, fallback to splitting the PMD. Link: https://lkml.kernel.org/r/20241008061746.285961-3-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
ebcfc63d6b |
mm: abstract THP allocation
Patch series "Do not shatter hugezeropage on wp-fault", v7. It was observed at [1] and [2] that the current kernel behaviour of shattering a hugezeropage is inconsistent and suboptimal. For a VMA with a THP allowable order, when we write-fault on it, the kernel installs a PMD-mapped THP. On the other hand, if we first get a read fault, we get a PMD pointing to the hugezeropage; subsequent write will trigger a write-protection fault, shattering the hugezeropage into one writable page, and all the other PTEs write-protected. The conclusion being, as compared to the case of a single write-fault, applications have to suffer 512 extra page faults if they were to use the VMA as such, plus we get the overhead of khugepaged trying to replace that area with a THP anyway. Instead, replace the hugezeropage with a THP on wp-fault. [1]: https://lore.kernel.org/all/3743d7e1-0b79-4eaf-82d5-d1ca29fe347d@arm.com/ [2]: https://lore.kernel.org/all/1cfae0c0-96a2-4308-9c62-f7a640520242@arm.com/ This patch (of 2): In preparation for the second patch, abstract away the THP allocation logic present in the create_huge_pmd() path, which corresponds to the faulting case when no page is present. There should be no functional change as a result of applying this patch, except that, as David notes at [1], a PMD-aligned address should be passed to update_mmu_cache_pmd(). [1]: https://lore.kernel.org/all/ddd3fcd2-48b3-4170-bcaa-2fe66e093f43@redhat.com/ Link: https://lkml.kernel.org/r/20241008061746.285961-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20241008061746.285961-2-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
077c7c1e09 |
mm/memory.c: remove stray newline at top of file
Fixes:
|
|
|
|
018d24539d |
percpu: fix data race with pcpu_nr_empty_pop_pages
Fixes the data race by moving the read to be behind the pcpu_lock. This is okay because the code (initially) above it will not increase the empty populated page count because it is populating backing pages that already have allocations served out of them. Link: https://lkml.kernel.org/r/20241008001942.8114-1-dennis@kernel.org Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202407191651.f24e499d-oliver.sang@intel.com Signed-off-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
7f24cbc9c4 |
mm/mmap: teach generic_get_unmapped_area{_topdown} to handle hugetlb mappings
Patch series "Unify hugetlb into arch_get_unmapped_area functions", v4.
This is an attempt to get rid of a fair amount of duplicated code wrt.
hugetlb and *get_unmapped_area* functions.
HugeTLB registers a .get_unmapped_area function which gets called from
__get_unmapped_area().
hugetlb_get_unmapped_area() is defined by a bunch of architectures and
it also has a generic definition for those that do not define it.
Short-long story is that there is a ton of duplicated code between
specific hugetlb *_get_unmapped_area_* functions and mm-core functions,
so we can do better by teaching arch_get_unmapped_area* functions how
to deal with hugetlb mappings.
Note that not a lot of things need to be taught though.
hugetlb_get_unmapped_area, that gets called for hugetlb mappings, runs
some sanity checks prior to calling mm_get_unmapped_area_vmflags(), so we
do not need to that down the road in the respective
{generic,arch}_get_unmapped_area* functions.
More information can be found in the respective patches.
LTP mmapstress hugetlb selftests were ran succesfully on:
This patch (of 9):
We want to stop special casing hugetlb mappings and make them go through
generic channels, so teach generic_get_unmapped_area{_topdown} to handle
those. The main difference is that we set info.align_mask for huge
mappings.
Link: https://lkml.kernel.org/r/20241007075037.267650-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20241007075037.267650-2-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
04f315a7dc |
mm: remove misleading 'unlikely' hint in vms_gather_munmap_vmas()
Performance analysis using branch annotation on a fleet of 200 hosts running web servers revealed that the 'unlikely' hint in vms_gather_munmap_vmas() was 100% consistently incorrect. In all observed cases, the branch behavior contradicted the hint. Remove the 'unlikely' qualifier from the condition checking 'vms->uf'. By doing so, we allow the compiler to make optimization decisions based on its own heuristics and profiling data, rather than relying on a static hint that has proven to be inaccurate in real-world scenarios. Link: https://lkml.kernel.org/r/20241004164832.218681-1-leitao@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
5f5a3e9530 |
mm/truncate: reset xa_has_values flag on each iteration
Currently mapping_try_invalidate() and invalidate_inode_pages2_range()
traverses the xarray in batches and then for each batch, maintains and
sets the flag named xa_has_values if the batch has a shadow entry to clear
the entries at the end of the iteration.
However they forgot to reset the flag at the end of the iteration which
causes them to always try to clear the shadow entries in the subsequent
iterations where there might not be any shadow entries.
Fix this inefficiency.
Link: https://lkml.kernel.org/r/20241002225150.2334504-1-shakeel.butt@linux.dev
Fixes:
|
|
|
|
e26060d1fb |
mm: swap: make some count_mthp_stat() call-sites be THP-agnostic.
In commit
|
|
|
|
f0327de706 |
gup: convert FOLL_TOUCH case in follow_page_pte() to folio
We already have the folio here, so just use it, removing three hidden calls to compound_head(). Link: https://lkml.kernel.org/r/20241002151403.1345296-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
b9a256352f |
mm: remove PageKsm()
All callers have been converted to use folio_test_ksm() or PageAnonNotKsm(), so we can remove this wrapper. Link: https://lkml.kernel.org/r/20241002152533.1350629-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alex Shi <alexs@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
76f1a82611 |
ksm: convert should_skip_rmap_item() to take a folio
Remove a call to PageKSM() by passing the folio containing tmp_page to should_skip_rmap_item. Removes a hidden call to compound_head(). Link: https://lkml.kernel.org/r/20241002152533.1350629-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alex Shi <alexs@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
98c3ca0015 |
ksm: convert cmp_and_merge_page() to use a folio
By making try_to_merge_two_pages() and stable_tree_search() return a folio, we can replace kpage with kfolio. This replaces 7 calls to compound_head() with one. [cuigaosheng1@huawei.com: add IS_ERR_OR_NULL check for stable_tree_search()] Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Link: https://lkml.kernel.org/r/20241002152533.1350629-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alex Shi <alexs@kernel.org> Cc: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
9c0a1b99e3 |
ksm: use a folio in try_to_merge_one_page()
Patch series "Remove PageKsm()". The KSM flag is almost always tested on the folio rather than on the page. This series removes the final users of PageKsm() and makes the flag only This patch (of 5): It is safe to use a folio here because all callers took a refcount on this page. The one wrinkle is that we have to recalculate the value of folio after splitting the page, since it has probably changed. Replaces nine calls to compound_head() with one. Link: https://lkml.kernel.org/r/20241002152533.1350629-1-willy@infradead.org Link: https://lkml.kernel.org/r/20241002152533.1350629-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alex Shi <alexs@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
65c481f308
|
tmpfs: Initialize sysfs during tmpfs init
Instead of using fs_initcall(), initialize sysfs with the rest of the filesystem. This is the right way to do it because otherwise any error during tmpfs_sysfs_init() would get silently ignored. It's also useful if tmpfs' sysfs ever need to display runtime information. Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241101164251.327884-4-andrealmeid@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
18d2f10f62
|
tmpfs: Fix type for sysfs' casefold attribute
DEVICE_STRING_ATTR_RO should be only used by device drivers since it
relies on `struct device` to use device_show_string() function. Using
this with non device code led to a kCFI violation:
> cat /sys/fs/tmpfs/features/casefold
[ 70.558496] CFI failure at kobj_attr_show+0x2c/0x4c (target: device_show_string+0x0/0x38; expected type: 0xc527b809)
Like the other filesystems, fix this by manually declaring the attribute
using kobj_attribute() and writing a proper show() function.
Also, leave macros for anyone that need to expand tmpfs sysfs' with
more attributes.
Fixes:
|
|
|
|
43731516fa |
mm/util: deduplicate code in {kstrdup,kstrndup,kmemdup_nul}
These three functions follow the same pattern. To deduplicate the code, let's introduce a common helper __kmemdup_nul(). Link: https://lkml.kernel.org/r/20241007144911.27693-7-laoar.shao@gmail.com Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Simon Horman <horms@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alejandro Colomar <alx@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@gmail.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Jan Kara <jack@suse.cz> Cc: Justin Stitt <justinstitt@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Matus Jokay <matus.jokay@stuba.sk> Cc: Maxime Ripard <mripard@kernel.org> Cc: Ondrej Mosnacek <omosnace@redhat.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Quentin Monnet <qmo@kernel.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Stephen Smalley <stephen.smalley.work@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
44ff630170 |
mm/util: fix possible race condition in kstrdup()
In kstrdup(), it is critical to ensure that the dest string is always
NUL-terminated. However, potential race condition can occur between a
writer and a reader.
Consider the following scenario involving task->comm:
reader writer
len = strlen(s) + 1;
strlcpy(tsk->comm, buf, sizeof(tsk->comm));
memcpy(buf, s, len);
In this case, there is a race condition between the reader and the writer.
The reader calculates the length of the string `s` based on the old value
of task->comm. However, during the memcpy(), the string `s` might be
updated by the writer to a new value of task->comm.
If the new task->comm is larger than the old one, the `buf` might not be
NUL-terminated. This can lead to undefined behavior and potential
security vulnerabilities.
Let's fix it by explicitly adding a NUL terminator after the memcpy. It
is worth noting that memcpy() is not atomic, so the new string can be
shorter when memcpy() already copied past the new NUL.
Link: https://lkml.kernel.org/r/20241007144911.27693-6-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Alejandro Colomar <alx@kernel.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matus Jokay <matus.jokay@stuba.sk>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Quentin Monnet <qmo@kernel.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
1fa00a568d |
mm/cma: fix useless return in void function
There is a unnecessary return statement at the end of void function cma_activate_area. This can be dropped. While at it, also fix another warning related to unsigned. These are reported by checkpatch as well. WARNING: Prefer 'unsigned int' to bare use of 'unsigned' +unsigned cma_area_count; WARNING: void function return statements are not generally useful + return; +} Link: https://lkml.kernel.org/r/20240927181637.19941-1-quic_pintu@quicinc.com Signed-off-by: Pintu Kumar <quic_pintu@quicinc.com> Cc: Pintu Agarwal <pintu.ping@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
d3db2c0425 |
mm: optimize invalidation of shadow entries
The kernel invalidates the page cache in batches of PAGEVEC_SIZE. For
each batch, it traverses the page cache tree and collects the entries
(folio and shadow entries) in the struct folio_batch. For the shadow
entries present in the folio_batch, it has to traverse the page cache tree
for each individual entry to remove them. This patch optimize this by
removing them in a single tree traversal.
To evaluate the changes, we created 200GiB file on a fuse fs and in a
memcg. We created the shadow entries by triggering reclaim through
memory.reclaim in that specific memcg and measure the simple
fadvise(DONTNEED) operation.
# time xfs_io -c 'fadvise -d 0 ${file_size}' file
time (sec)
Without 5.12 +- 0.061
With-patch 4.19 +- 0.086 (18.16% decrease)
Link: https://lkml.kernel.org/r/20240925224716.2904498-3-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Chris Mason <clm@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
cb8e64be76 |
mm: optimize truncation of shadow entries
Patch series "mm: optimize shadow entries removal", v2.
Some of our production workloads which processes a large amount of data
spends considerable amount of CPUs on truncation and invalidation of large
sized files (100s of GiBs of size). Tracing the operations showed that
most of the time is in shadow entries removal. This patch series
optimizes the truncation and invalidation operations.
This patch (of 2):
The kernel truncates the page cache in batches of PAGEVEC_SIZE. For each
batch, it traverses the page cache tree and collects the entries (folio
and shadow entries) in the struct folio_batch. For the shadow entries
present in the folio_batch, it has to traverse the page cache tree for
each individual entry to remove them. This patch optimize this by
removing them in a single tree traversal.
On large machines in our production which run workloads manipulating large
amount of data, we have observed that a large amount of CPUs are spent on
truncation of very large files (100s of GiBs file sizes). More
specifically most of time was spent on shadow entries cleanup, so
optimizing the shadow entries cleanup, even a little bit, has good impact.
To evaluate the changes, we created 200GiB file on a fuse fs and in a
memcg. We created the shadow entries by triggering reclaim through
memory.reclaim in that specific memcg and measure the simple truncation
operation.
# time truncate -s 0 file
time (sec)
Without 5.164 +- 0.059
With-patch 4.21 +- 0.066 (18.47% decrease)
Link: https://lkml.kernel.org/r/20240925224716.2904498-1-shakeel.butt@linux.dev
Link: https://lkml.kernel.org/r/20240925224716.2904498-2-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Chris Mason <clm@fb.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
473c371254 |
mm: migrate LRU_REFS_MASK bits in folio_migrate_flags
Bits of LRU_REFS_MASK are not inherited during migration which lead to new folio start from tier0 when MGLRU enabled. Try to bring as much bits of folio->flags as possible since compaction and alloc_contig_range which introduce migration do happen at times. Link: https://lkml.kernel.org/r/20240926050647.5653-1-zhaoyang.huang@unisoc.com Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Suggested-by: Yu Zhao <yuzhao@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
583e66debd |
mm: pgtable: remove pte_offset_map_nolock()
Now no users are using the pte_offset_map_nolock(), remove it. Link: https://lkml.kernel.org/r/d04f9bbbcde048fb6ffa6f2bdbc6f9b22d5286f9.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
2441774f2d |
mm: multi-gen LRU: walk_pte_range() use pte_offset_map_rw_nolock()
In walk_pte_range(), we may modify the pte entry after holding the ptl, so convert it to using pte_offset_map_rw_nolock(). At this time, the pte_same() check is not performed after the ptl held, so we should get pmdval and do pmd_same() check to ensure the stability of pmd entry. Link: https://lkml.kernel.org/r/7e9c194a5efacc9609cfd31abb9c7df88b53b530.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
e9c74b5431 |
mm: userfaultfd: move_pages_pte() use pte_offset_map_rw_nolock()
In move_pages_pte(), we may modify the dst_pte and src_pte after acquiring the ptl, so convert it to using pte_offset_map_rw_nolock(). But since we will use pte_same() to detect the change of the pte entry, there is no need to get pmdval, so just pass a dummy variable to it. Link: https://lkml.kernel.org/r/1530e8fdbfc72eacf3b095babe139ce3d715600a.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
04965da7a4 |
mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
In the caller of map_pte(), we may modify the pvmw->pte after acquiring the pvmw->ptl, so convert it to using pte_offset_map_rw_nolock(). At this time, the pte_same() check is not performed after the pvmw->ptl held, so we should get pmdval and do pmd_same() check to ensure the stability of pvmw->pmd. Link: https://lkml.kernel.org/r/2620a48f34c9f19864ab0169cdbf253d31a8fcaa.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
838d023544 |
mm: mremap: move_ptes() use pte_offset_map_rw_nolock()
In move_ptes(), we may modify the new_pte after acquiring the new_ptl, so convert it to using pte_offset_map_rw_nolock(). Now new_pte is none, so hpage_collapse_scan_file() path can not find this by traversing file->f_mapping, so there is no concurrency with retract_page_tables(). In addition, we already hold the exclusive mmap_lock, so this new_pte page is stable, so there is no need to get pmdval and do pmd_same() check. Link: https://lkml.kernel.org/r/9d582a09dbcf12e562ac5fe0ba05e9248a58f5e0.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
24553a978b |
mm: copy_pte_range() use pte_offset_map_rw_nolock()
In copy_pte_range(), we may modify the src_pte entry after holding the src_ptl, so convert it to using pte_offset_map_rw_nolock(). Since we already hold the exclusive mmap_lock, and the copy_pte_range() and retract_page_tables() are using vma->anon_vma to be exclusive, so the PTE page is stable, there is no need to get pmdval and do pmd_same() check. Link: https://lkml.kernel.org/r/9166f6fad806efbca72e318ab6f0f8af458056a9.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
6dfd0d2cb3 |
mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after acquiring the ptl, so convert it to using pte_offset_map_rw_nolock(). At this time, the pte_same() check is not performed after the PTL held. So we should get pgt_pmd and do pmd_same() check after the ptl held. Link: https://lkml.kernel.org/r/055e42db68da00ac8ecab94bd2633c7cd965eb1c.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
d9c1ddf37b |
mm: handle_pte_fault() use pte_offset_map_rw_nolock()
In handle_pte_fault(), we may modify the vmf->pte after acquiring the vmf->ptl, so convert it to using pte_offset_map_rw_nolock(). But since we will do the pte_same() check, so there is no need to get pmdval to do pmd_same() check, just pass a dummy variable to it. Link: https://lkml.kernel.org/r/af8d694853b44c5a6018403ae435440e275854c7.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
c85507857b |
mm: khugepaged: __collapse_huge_page_swapin() use pte_offset_map_ro_nolock()
In __collapse_huge_page_swapin(), we just use the ptl for pte_same() check in do_swap_page(). In other places, we directly use pte_offset_map_lock(), so convert it to using pte_offset_map_ro_nolock(). Link: https://lkml.kernel.org/r/dc97a6c3cb9ea80cab30c5626eeea79959d93258.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
bd6ad65ddc |
mm: filemap: filemap_fault_recheck_pte_none() use pte_offset_map_ro_nolock()
In filemap_fault_recheck_pte_none(), we just do pte_none() check, so convert it to using pte_offset_map_ro_nolock(). Link: https://lkml.kernel.org/r/9f7cbbaa772385ced1b8931b67a8b9d246c9b82d.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
66efef9b1a |
mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
Patch series "introduce pte_offset_map_{ro|rw}_nolock()", v5.
As proposed by David Hildenbrand [1], this series introduces the following
two new helper functions to replace pte_offset_map_nolock().
1. pte_offset_map_ro_nolock()
2. pte_offset_map_rw_nolock()
As the name suggests, pte_offset_map_ro_nolock() is used for read-only
case. In this case, only read-only operations will be performed on PTE
page after the PTL is held. The RCU lock in pte_offset_map_nolock() will
ensure that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified. Therefore
pte_offset_map_ro_nolock() is just a renamed version of
pte_offset_map_nolock().
pte_offset_map_rw_nolock() is used for may-write case. In this case, the
pte or pmd entry may be modified after the PTL is held, so we need to
ensure that the pmd entry has not been modified concurrently. So in
addition to the name change, it also outputs the pmdval when successful.
The users should make sure the page table is stable like checking
pte_same() or checking pmd_same() by using the output pmdval before
performing the write operations.
This series will convert all pte_offset_map_nolock() into the above two
helper functions one by one, and finally completely delete it.
This also a preparation for reclaiming the empty user PTE page table
pages.
This patch (of 13):
Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:
1) After acquiring PTL, only read-only operations are performed on the PTE
page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified.
2) After acquiring PTL, the pte or pmd entries may be modified. At this
time, we need to ensure that the pmd entry has not been modified
concurrently.
To more clearing distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
to changing the name to pte_offset_map_rw_nolock(), it also outputs the
pmdval when successful. It is applicable for may-write cases where any
modification operations to the page table may happen after the
corresponding spinlock is held afterwards. But the users should make sure
the page table is stable like checking pte_same() or checking pmd_same()
by using the output pmdval before performing the write operations.
Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be read-only/read-write protected.
Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally completely delete it.
Link: https://lkml.kernel.org/r/cover.1727332572.git.zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/5aeecfa131600a454b1f3a038a1a54282ca3b856.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
f2f484085e |
mm: move mm flags to mm_types.h
The types of mm flags are now far beyond the core dump related features. This patch moves mm flags from linux/sched/coredump.h to linux/mm_types.h. The linux/sched/coredump.h has include the mm_types.h, so the C files related to coredump does not need to change head file inclusion. In addition, the inclusion of sched/coredump.h now can be deleted from the C files that irrelevant to core dump. Link: https://lkml.kernel.org/r/20240926074922.2721274-1-sunnanyong@huawei.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
021781b012 |
mm/madvise: unrestrict process_madvise() for current process
The process_madvise() call was introduced in commit |
|
|
|
1cd1a4e71b |
mm/mempolicy: fix comments for better documentation
Fix typo in mempolicy.h and Correct the number of allowed memory policy Link: https://lkml.kernel.org/r/20240926183516.4034-2-tanyaagarwal25699@gmail.com Signed-off-by: Tanya Agarwal <tanyaagarwal25699@gmail.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Cc: Anup Sharma <anupnewsmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
bbc251f30e |
mm: fix shrink nr.unqueued_dirty counter issue
It is needed to ensure sc->nr.unqueued_dirty > 0, which can avoid setting PGDAT_DIRTY flag when sc->nr.unqueued_dirty and sc->nr.file_taken are both zero. Link: https://lkml.kernel.org/r/20240112012353.1387-1-justinjiang@vivo.com Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
cd3f8467af |
mm: refactor mm_access() to not return NULL
mm_access() can return NULL if the mm is not found, but this is handled the same as an error in all callers, with some translating this into an -ESRCH error. Only proc_mem_open() returns NULL if no mm is found, however in this case it is clearer and makes more sense to explicitly handle the error. Additionally we take the opportunity to refactor the function to eliminate unnecessary nesting. Simplify things by simply returning -ESRCH if no mm is found - this both eliminates confusing use of the IS_ERR_OR_NULL() macro, and simplifies callers which would return -ESRCH by returning this error directly. [lorenzo.stoakes@oracle.com: prefer neater pointer error comparison] Link: https://lkml.kernel.org/r/2fae1834-749a-45e1-8594-5e5979cf7103@lucifer.local Link: https://lkml.kernel.org/r/20240924201023.193135-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Arnd Bergmann <arnd@arndb.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
9e9e085eff |
mm/vmalloc: combine all TLB flush operations of KASAN shadow virtual address into one operation
When compiling kernel source 'make -j $(nproc)' with the up-and-running
KASAN-enabled kernel on a 256-core machine, the following soft lockup is
shown:
watchdog: BUG: soft lockup - CPU#28 stuck for 22s! [kworker/28:1:1760]
CPU: 28 PID: 1760 Comm: kworker/28:1 Kdump: loaded Not tainted 6.10.0-rc5 #95
Workqueue: events drain_vmap_area_work
RIP: 0010:smp_call_function_many_cond+0x1d8/0xbb0
Code: 38 c8 7c 08 84 c9 0f 85 49 08 00 00 8b 45 08 a8 01 74 2e 48 89 f1 49 89 f7 48 c1 e9 03 41 83 e7 07 4c 01 e9 41 83 c7 03 f3 90 <0f> b6 01 41 38 c7 7c 08 84 c0 0f 85 d4 06 00 00 8b 45 08 a8 01 75
RSP: 0018:ffffc9000cb3fb60 EFLAGS: 00000202
RAX: 0000000000000011 RBX: ffff8883bc4469c0 RCX: ffffed10776e9949
RDX: 0000000000000002 RSI: ffff8883bb74ca48 RDI: ffffffff8434dc50
RBP: ffff8883bb74ca40 R08: ffff888103585dc0 R09: ffff8884533a1800
R10: 0000000000000004 R11: ffffffffffffffff R12: ffffed1077888d39
R13: dffffc0000000000 R14: ffffed1077888d38 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff8883bc400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005577b5c8d158 CR3: 0000000004850000 CR4: 0000000000350ef0
Call Trace:
<IRQ>
? watchdog_timer_fn+0x2cd/0x390
? __pfx_watchdog_timer_fn+0x10/0x10
? __hrtimer_run_queues+0x300/0x6d0
? sched_clock_cpu+0x69/0x4e0
? __pfx___hrtimer_run_queues+0x10/0x10
? srso_return_thunk+0x5/0x5f
? ktime_get_update_offsets_now+0x7f/0x2a0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? hrtimer_interrupt+0x2ca/0x760
? __sysvec_apic_timer_interrupt+0x8c/0x2b0
? sysvec_apic_timer_interrupt+0x6a/0x90
</IRQ>
<TASK>
? asm_sysvec_apic_timer_interrupt+0x16/0x20
? smp_call_function_many_cond+0x1d8/0xbb0
? __pfx_do_kernel_range_flush+0x10/0x10
on_each_cpu_cond_mask+0x20/0x40
flush_tlb_kernel_range+0x19b/0x250
? srso_return_thunk+0x5/0x5f
? kasan_release_vmalloc+0xa7/0xc0
purge_vmap_node+0x357/0x820
? __pfx_purge_vmap_node+0x10/0x10
__purge_vmap_area_lazy+0x5b8/0xa10
drain_vmap_area_work+0x21/0x30
process_one_work+0x661/0x10b0
worker_thread+0x844/0x10e0
? srso_return_thunk+0x5/0x5f
? __kthread_parkme+0x82/0x140
? __pfx_worker_thread+0x10/0x10
kthread+0x2a5/0x370
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Debugging Analysis:
1. The following ftrace log shows that the lockup CPU spends too much
time iterating vmap_nodes and flushing TLB when purging vm_area
structures. (Some info is trimmed).
kworker: funcgraph_entry: | drain_vmap_area_work() {
kworker: funcgraph_entry: | mutex_lock() {
kworker: funcgraph_entry: 1.092 us | __cond_resched();
kworker: funcgraph_exit: 3.306 us | }
... ...
kworker: funcgraph_entry: | flush_tlb_kernel_range() {
... ...
kworker: funcgraph_exit: # 7533.649 us | }
... ...
kworker: funcgraph_entry: 2.344 us | mutex_unlock();
kworker: funcgraph_exit: $ 23871554 us | }
The drain_vmap_area_work() spends over 23 seconds.
There are 2805 flush_tlb_kernel_range() calls in the ftrace log.
* One is called in __purge_vmap_area_lazy().
* Others are called by purge_vmap_node->kasan_release_vmalloc.
purge_vmap_node() iteratively releases kasan vmalloc
allocations and flushes TLB for each vmap_area.
- [Rough calculation] Each flush_tlb_kernel_range() runs
about 7.5ms.
-- 2804 * 7.5ms = 21.03 seconds.
-- That's why a soft lock is triggered.
2. Extending the soft lockup time can work around the issue (For example,
# echo 60 > /proc/sys/kernel/watchdog_thresh). This confirms the
above-mentioned speculation: drain_vmap_area_work() spends too much
time.
If we combine all TLB flush operations of the KASAN shadow virtual
address into one operation in the call path
'purge_vmap_node()->kasan_release_vmalloc()', the running time of
drain_vmap_area_work() can be saved greatly. The idea is from the
flush_tlb_kernel_range() call in __purge_vmap_area_lazy(). And, the
soft lockup won't be triggered.
Here is the test result based on 6.10:
[6.10 wo/ the patch]
1. ftrace latency profiling (record a trace if the latency > 20s).
echo 20000000 > /sys/kernel/debug/tracing/tracing_thresh
echo drain_vmap_area_work > /sys/kernel/debug/tracing/set_graph_function
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
2. Run `make -j $(nproc)` to compile the kernel source
3. Once the soft lockup is reproduced, check the ftrace log:
cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
76) $ 50412985 us | } /* __purge_vmap_area_lazy */
76) $ 50412997 us | } /* drain_vmap_area_work */
76) $ 29165911 us | } /* __purge_vmap_area_lazy */
76) $ 29165926 us | } /* drain_vmap_area_work */
91) $ 53629423 us | } /* __purge_vmap_area_lazy */
91) $ 53629434 us | } /* drain_vmap_area_work */
91) $ 28121014 us | } /* __purge_vmap_area_lazy */
91) $ 28121026 us | } /* drain_vmap_area_work */
[6.10 w/ the patch]
1. Repeat step 1-2 in "[6.10 wo/ the patch]"
2. The soft lockup is not triggered and ftrace log is empty.
cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
3. Setting 'tracing_thresh' to 10/5 seconds does not get any ftrace
log.
4. Setting 'tracing_thresh' to 1 second gets ftrace log.
cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
23) $ 1074942 us | } /* __purge_vmap_area_lazy */
23) $ 1074950 us | } /* drain_vmap_area_work */
The worst execution time of drain_vmap_area_work() is about 1 second.
Link: https://lore.kernel.org/lkml/ZqFlawuVnOMY2k3E@pc638.lan/
Link: https://lkml.kernel.org/r/20240726165246.31326-1-ahuang12@lenovo.com
Fixes:
|
|
|
|
15ff4d409e |
mm/memcontrol: add per-memcg pgpgin/pswpin counter
In proactive memory reclamation scenarios, it is necessary to estimate the pswpin and pswpout metrics of the cgroup to determine whether to continue reclaiming anonymous pages in the current batch. This patch will collect these metrics and expose them. [linuszeng@tencent.com: v2] Link: https://lkml.kernel.org/r/20240830082244.156923-1-jingxiangzeng.cas@gmail.com Li nk: https://lkml.kernel.org/r/20240913084453.3605621-1-jingxiangzeng.cas@gmail.com Link: https://lkml.kernel.org/r/20240830082244.156923-1-jingxiangzeng.cas@gmail.com Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
ba7196e566 |
mm/damon: fix sparse warning for zero initializer
sparse warns about zero initializing an array with {0,}, change it to
the equivalent {0}.
Fixes the sparse warning:
mm/damon/tests/vaddr-kunit.h:69:47: warning: missing braces around initializer
Link: https://lkml.kernel.org/r/xriwklcwjpwcz7eiavo6f7envdar4jychhsk6sfkj5klaznb6b@j6vrvr2sxjht
Fixes:
|
|
|
|
d2d243df44 |
mm: shmem: fix khugepaged activation policy for shmem
Shmem has a separate interface (different from anonymous pages) to control huge page allocation, that means shmem THP can be enabled while anonymous THP is disabled. However, in this case, khugepaged will not start to collapse shmem THP, which is unreasonable. To fix this issue, we should call start_stop_khugepaged() to activate or deactivate the khugepaged thread when setting shmem mTHP interfaces. Moreover, add a new helper shmem_hpage_pmd_enabled() to help to check whether shmem THP is enabled, which will determine if khugepaged should be activated. Link: https://lkml.kernel.org/r/9b9c6cbc4499bf44c6455367fd9e0f6036525680.1726978977.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
5de195060b |
mm: resolve faulty mmap_region() error path behaviour
The mmap_region() function is somewhat terrifying, with spaghetti-like
control flow and numerous means by which issues can arise and incomplete
state, memory leaks and other unpleasantness can occur.
A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently
observed issues with resource leaks and observable inconsistent state.
Taking advantage of previous patches in this series we move a number of
checks earlier in the code, simplifying things by moving the core of the
logic into a static internal function __mmap_region().
Doing this allows us to perform a number of checks up front before we do
any real work, and allows us to unwind the writable unmap check
unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE
validation unconditionally also.
We move a number of things here:
1. We preallocate memory for the iterator before we call the file-backed
memory hook, allowing us to exit early and avoid having to perform
complicated and error-prone close/free logic. We carefully free
iterator state on both success and error paths.
2. The enclosing mmap_region() function handles the mapping_map_writable()
logic early. Previously the logic had the mapping_map_writable() at the
point of mapping a newly allocated file-backed VMA, and a matching
mapping_unmap_writable() on success and error paths.
We now do this unconditionally if this is a file-backed, shared writable
mapping. If a driver changes the flags to eliminate VM_MAYWRITE, however
doing so does not invalidate the seal check we just performed, and we in
any case always decrement the counter in the wrapper.
We perform a debug assert to ensure a driver does not attempt to do the
opposite.
3. We also move arch_validate_flags() up into the mmap_region()
function. This is only relevant on arm64 and sparc64, and the check is
only meaningful for SPARC with ADI enabled. We explicitly add a warning
for this arch if a driver invalidates this check, though the code ought
eventually to be fixed to eliminate the need for this.
With all of these measures in place, we no longer need to explicitly close
the VMA on error paths, as we place all checks which might fail prior to a
call to any driver mmap hook.
This eliminates an entire class of errors, makes the code easier to reason
about and more robust.
Link: https://lkml.kernel.org/r/6e0becb36d2f5472053ac5d544c0edfe9b899e25.1730224667.git.lorenzo.stoakes@oracle.com
Fixes:
|
|
|
|
5baf8b037d |
mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
Currently MTE is permitted in two circumstances (desiring to use MTE
having been specified by the VM_MTE flag) - where MAP_ANONYMOUS is
specified, as checked by arch_calc_vm_flag_bits() and actualised by
setting the VM_MTE_ALLOWED flag, or if the file backing the mapping is
shmem, in which case we set VM_MTE_ALLOWED in shmem_mmap() when the mmap
hook is activated in mmap_region().
The function that checks that, if VM_MTE is set, VM_MTE_ALLOWED is also
set is the arm64 implementation of arch_validate_flags().
Unfortunately, we intend to refactor mmap_region() to perform this check
earlier, meaning that in the case of a shmem backing we will not have
invoked shmem_mmap() yet, causing the mapping to fail spuriously.
It is inappropriate to set this architecture-specific flag in general mm
code anyway, so a sensible resolution of this issue is to instead move the
check somewhere else.
We resolve this by setting VM_MTE_ALLOWED much earlier in do_mmap(), via
the arch_calc_vm_flag_bits() call.
This is an appropriate place to do this as we already check for the
MAP_ANONYMOUS case here, and the shmem file case is simply a variant of
the same idea - we permit RAM-backed memory.
This requires a modification to the arch_calc_vm_flag_bits() signature to
pass in a pointer to the struct file associated with the mapping, however
this is not too egregious as this is only used by two architectures anyway
- arm64 and parisc.
So this patch performs this adjustment and removes the unnecessary
assignment of VM_MTE_ALLOWED in shmem_mmap().
[akpm@linux-foundation.org: fix whitespace, per Catalin]
Link: https://lkml.kernel.org/r/ec251b20ba1964fb64cf1607d2ad80c47f3873df.1730224667.git.lorenzo.stoakes@oracle.com
Fixes:
|
|
|
|
0fb4a7ad27 |
mm: refactor map_deny_write_exec()
Refactor the map_deny_write_exec() to not unnecessarily require a VMA
parameter but rather to accept VMA flags parameters, which allows us to
use this function early in mmap_region() in a subsequent commit.
While we're here, we refactor the function to be more readable and add
some additional documentation.
Link: https://lkml.kernel.org/r/6be8bb59cd7c68006ebb006eb9d8dc27104b1f70.1730224667.git.lorenzo.stoakes@oracle.com
Fixes:
|
|
|
|
4080ef1579 |
mm: unconditionally close VMAs on error
Incorrect invocation of VMA callbacks when the VMA is no longer in a
consistent state is bug prone and risky to perform.
With regards to the important vm_ops->close() callback We have gone to
great lengths to try to track whether or not we ought to close VMAs.
Rather than doing so and risking making a mistake somewhere, instead
unconditionally close and reset vma->vm_ops to an empty dummy operations
set with a NULL .close operator.
We introduce a new function to do so - vma_close() - and simplify existing
vms logic which tracked whether we needed to close or not.
This simplifies the logic, avoids incorrect double-calling of the .close()
callback and allows us to update error paths to simply call vma_close()
unconditionally - making VMA closure idempotent.
Link: https://lkml.kernel.org/r/28e89dda96f68c505cb6f8e9fc9b57c3e9f74b42.1730224667.git.lorenzo.stoakes@oracle.com
Fixes:
|
|
|
|
3dd6ed34ce |
mm: avoid unsafe VMA hook invocation when error arises on mmap hook
Patch series "fix error handling in mmap_region() and refactor
(hotfixes)", v4.
mmap_region() is somewhat terrifying, with spaghetti-like control flow and
numerous means by which issues can arise and incomplete state, memory
leaks and other unpleasantness can occur.
A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently
observed issues with resource leaks and observable inconsistent state.
This series goes to great lengths to simplify how mmap_region() works and
to avoid unwinding errors late on in the process of setting up the VMA for
the new mapping, and equally avoids such operations occurring while the
VMA is in an inconsistent state.
The patches in this series comprise the minimal changes required to
resolve existing issues in mmap_region() error handling, in order that
they can be hotfixed and backported. There is additionally a follow up
series which goes further, separated out from the v1 series and sent and
updated separately.
This patch (of 5):
After an attempted mmap() fails, we are no longer in a situation where we
can safely interact with VMA hooks. This is currently not enforced,
meaning that we need complicated handling to ensure we do not incorrectly
call these hooks.
We can avoid the whole issue by treating the VMA as suspect the moment
that the file->f_ops->mmap() function reports an error by replacing
whatever VMA operations were installed with a dummy empty set of VMA
operations.
We do so through a new helper function internal to mm - mmap_file() -
which is both more logically named than the existing call_mmap() function
and correctly isolates handling of the vm_op reassignment to mm.
All the existing invocations of call_mmap() outside of mm are ultimately
nested within the call_mmap() from mm, which we now replace.
It is therefore safe to leave call_mmap() in place as a convenience
function (and to avoid churn). The invokers are:
ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap()
coda_file_operations -> mmap -> coda_file_mmap()
shm_file_operations -> shm_mmap()
shm_file_operations_huge -> shm_mmap()
dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops
-> i915_gem_dmabuf_mmap()
None of these callers interact with vm_ops or mappings in a problematic
way on error, quickly exiting out.
Link: https://lkml.kernel.org/r/cover.1730224667.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d41fd763496fd0048a962f3fd9407dc72dd4fd86.1730224667.git.lorenzo.stoakes@oracle.com
Fixes:
|
|
|
|
f8f931bba0 |
mm/thp: fix deferred split unqueue naming and locking
Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting. Before fixing locking: rename misleading folio_undo_large_rmappable(), which does not undo large_rmappable, to folio_unqueue_deferred_split(), which is what it does. But that and its out-of-line __callee are mm internals of very limited usability: add comment and WARN_ON_ONCEs to check usage; and return a bool to say if a deferred split was unqueued, which can then be used in WARN_ON_ONCEs around safety checks (sparing callers the arcane conditionals in __folio_unqueue_deferred_split()). Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all of whose callers now call it beforehand (and if any forget then bad_page() will tell) - except for its caller put_pages_list(), which itself no longer has any callers (and will be deleted separately). Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0 without checking and unqueueing a THP folio from deferred split list; which is unfortunate, since the split_queue_lock depends on the memcg (when memcg is enabled); so swapout has been unqueueing such THPs later, when freeing the folio, using the pgdat's lock instead: potentially corrupting the memcg's list. __remove_mapping() has frozen refcount to 0 here, so no problem with calling folio_unqueue_deferred_split() before resetting memcg_data. That goes back to 5.4 commit |
|
|
|
e66f3185fa |
mm/thp: fix deferred split queue not partially_mapped
Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without). The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.
The new unlocked list_del_init() in deferred_split_scan() is buggy. I
gave bad advice, it looks plausible since that's a local on-stack list,
but the fact is that it can race with a third party freeing or migrating
the preceding folio (properly unqueueing it with refcount 0 while holding
split_queue_lock), thereby corrupting the list linkage.
The obvious answer would be to take split_queue_lock there: but it has a
long history of contention, so I'm reluctant to add to that. Instead,
make sure that there is always one safe (raised refcount) folio before, by
delaying its folio_put(). (And of course I was wrong to suggest updating
split_queue_len without the lock: leave that until the splice.)
And remove two over-eager partially_mapped checks, restoring those tests
to how they were before: if uncharge_folio() or free_tail_page_prepare()
finds _deferred_list non-empty, it's in trouble whether or not that folio
is partially_mapped (and the flag was already cleared in the latter case).
Link: https://lkml.kernel.org/r/81e34a8b-113a-0701-740e-2135c97eb1d7@google.com
Fixes:
|