mirror of https://github.com/torvalds/linux.git
761 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
0b52c42035 |
mm: fix shmem THP counters on migration
The per node numa_stat values for shmem don't change on page migration for
THP:
grep shmem /sys/fs/cgroup/machine.slice/.../memory.numa_stat:
shmem N0=1092616192 N1=10485760
shmem_thp N0=1092616192 N1=10485760
migratepages 9181 0 1:
shmem N0=0 N1=1103101952
shmem_thp N0=1092616192 N1=10485760
Fix that by updating shmem_thp counters likewise to shmem counters on page
migration.
[jglauber@digitalocean.com: use folio_test_pmd_mappable instead of folio_test_transhuge]
Link: https://lkml.kernel.org/r/20230622094720.510540-1-jglauber@digitalocean.com
Link: https://lkml.kernel.org/r/20230619103351.234837-1-jglauber@digitalocean.com
Signed-off-by: Jan Glauber <jglauber@digitalocean.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
c33c794828 |
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use ptep_get() helper. This means that by default, the accesses change from a C dereference to a READ_ONCE(). This is technically the correct thing to do since where pgtables are modified by HW (for access/dirty) they are volatile and therefore we should always ensure READ_ONCE() semantics. But more importantly, by always using the helper, it can be overridden by the architecture to fully encapsulate the contents of the pte. Arch code is deliberately not converted, as the arch code knows best. It is intended that arch code (arm64) will override the default with its own implementation that can (e.g.) hide certain bits from the core code, or determine young/dirty status by mixing in state from another source. Conversion was done using Coccinelle: ---- // $ make coccicheck \ // COCCI=ptepget.cocci \ // SPFLAGS="--include-headers" \ // MODE=patch virtual patch @ depends on patch @ pte_t *v; @@ - *v + ptep_get(v) ---- Then reviewed and hand-edited to avoid multiple unnecessary calls to ptep_get(), instead opting to store the result of a single call in a variable, where it is correct to do so. This aims to negate any cost of READ_ONCE() and will benefit arch-overrides that may be more complex. Included is a fix for an issue in an earlier version of this patch that was pointed out by kernel test robot. The issue arose because config MMU=n elides definition of the ptep helper functions, including ptep_get(). HUGETLB_PAGE=n configs still define a simple huge_ptep_clear_flush() for linking purposes, which dereferences the ptep. So when both configs are disabled, this caused a build error because ptep_get() is not defined. Fix by continuing to do a direct dereference when MMU=n. This is safe because for this config the arch code cannot be trying to virtualize the ptes because none of the ptep helpers are defined. Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
04dee9e85c |
mm/various: give up if pte_offset_map[_lock]() fails
Following the examples of nearby code, various functions can just give up if pte_offset_map() or pte_offset_map_lock() fails. And there's no need for a preliminary pmd_trans_unstable() or other such check, since such cases are now safely handled inside. Link: https://lkml.kernel.org/r/7b9bd85d-1652-cbf2-159d-f503b45e5b@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
0cb8fd4d14 |
mm/migrate: remove cruft from migration_entry_wait()s
migration_entry_wait_on_locked() does not need to take a mapped pte pointer, its callers can do the unmap first. Annotate it with __releases(ptl) to reduce sparse warnings. Fold __migration_entry_wait_huge() into migration_entry_wait_huge(). Fold __migration_entry_wait() into migration_entry_wait(), preferring the tighter pte_offset_map_lock() to pte_offset_map() and pte_lockptr(). Link: https://lkml.kernel.org/r/b0e2a532-cdf2-561b-e999-f3b13b8d6d3@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
4e096ae180 |
mm: convert migrate_pages() to work on folios
Almost all of the callers & implementors of migrate_pages() were already converted to use folios. compaction_alloc() & compaction_free() are trivial to convert a part of this patch and not worth splitting out. Link: https://lkml.kernel.org/r/20230513001101.276972-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
124abced64 |
migrate_pages_batch: simplify retrying and failure counting of large folios
After recent changes to the retrying and failure counting in migrate_pages_batch(), it was found that it's unnecessary to count retrying and failure for normal, large, and THP folios separately. Because we don't use retrying and failure number of large folios directly. So, in this patch, we simplified retrying and failure counting of large folios via counting retrying and failure of normal and large folios together. This results in the reduced line number. Previously, in migrate_pages_batch we need to track whether the source folio is large/THP before splitting. So is_large is used to cache folio_test_large() result. Now, we don't need that variable any more because we don't count retrying and failure of large folios (only counting that of THP folios). So, in this patch, is_large is removed to simplify the code. This is just code cleanup, no functionality changes are expected. Link: https://lkml.kernel.org/r/20230510031829.11513-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
4bb6dc79d9 |
migrate_pages: avoid blocking for IO in MIGRATE_SYNC_LIGHT
The MIGRATE_SYNC_LIGHT mode is intended to block for things that will finish quickly but not for things that will take a long time. Exactly how long is too long is not well defined, but waits of tens of milliseconds is likely non-ideal. When putting a Chromebook under memory pressure (opening over 90 tabs on a 4GB machine) it was fairly easy to see delays waiting for some locks in the kcompactd code path of > 100 ms. While the laptop wasn't amazingly usable in this state, it was still limping along and this state isn't something artificial. Sometimes we simply end up with a lot of memory pressure. Putting the same Chromebook under memory pressure while it was running Android apps (though not stressing them) showed a much worse result (NOTE: this was on a older kernel but the codepaths here are similar). Android apps on ChromeOS currently run from a 128K-block, zlib-compressed, loopback-mounted squashfs disk. If we get a page fault from something backed by the squashfs filesystem we could end up holding a folio lock while reading enough from disk to decompress 128K (and then decompressing it using the somewhat slow zlib algorithms). That reading goes through the ext4 subsystem (because it's a loopback mount) before eventually ending up in the block subsystem. This extra jaunt adds extra overhead. Without much work I could see cases where we ended up blocked on a folio lock for over a second. With more extreme memory pressure I could see up to 25 seconds. We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for the two locks that were seen to be slow [1] and that generated much discussion. After discussion, it was decided that we should avoid waiting for the two locks during MIGRATE_SYNC_LIGHT if they were being held for IO. We'll continue with the unbounded wait for the more full SYNC modes. With this change, I couldn't see any slow waits on these locks with my previous testcases. NOTE: The reason I stated digging into this originally isn't because some benchmark had gone awry, but because we've received in-the-field crash reports where we have a hung task waiting on the page lock (which is the equivalent code path on old kernels). While the root cause of those crashes is likely unrelated and won't be fixed by this patch, analyzing those crash reports did point out these very long waits seemed like something good to fix. With this patch we should no longer hang waiting on these locks, but presumably the system will still be in a bad shape and hang somewhere else. [1] https://lore.kernel.org/r/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid Link: https://lkml.kernel.org/r/20230428135414.v3.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hillf Danton <hdanton@sina.com> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
22b8cc3e78 |
Add support for new Linear Address Masking CPU feature. This is similar
to ARM's Top Byte Ignore and allows userspace to store metadata in some bits of pointers without masking it out before use. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmRK/WIACgkQaDWVMHDJ krAL+RAAw33EhsWyYVkeAtYmYBKkGvlgeSDULtfJKe5bynJBTHkGKfM6RE9MSJIt 5fHWaConGh8HNpy0Us1sDvd/aWcWRm5h7ZcCVD+R4qrgh/vc7ULzM+elXe5jzr4W cyuTckF2eW6SVrYg6fH5q+6Uy/moDtrdkLRvwRBf+AYeepB8gvSSH5XixKDNiVBE pjNy1xXVZQokqD4tjsFelmLttyacR5OabiE/aeVNoFYf9yTwfnN8N3T6kwuOoS4l Lp6NA+/0ux+oBlR+Is+JJG8Mxrjvz96yJGZYdR2YP5k3bMQtHAAjuq2w+GgqZm5i j3/E6KQepEGaCfC+bHl68xy/kKx8ik+jMCEcBalCC25J3uxbLz41g6K3aI890wJn +5ZtfcmoDUk9pnUyLxR8t+UjOSBFAcRSUE+FTjUH1qEGsMPK++9a4iLXz5vYVK1+ +YCt1u5LNJbkDxE8xVX3F5jkXh0G01SJsuUVAOqHSNfqSNmohFK8/omqhVRrRqoK A7cYLtnOGiUXLnvjrwSxPNOzRrG+GAwqaw8gwOTaYogETWbTY8qsSCEVl204uYwd m8io9rk2ZXUdDuha56xpBbPE0JHL9hJ2eKCuPkfvRgJT9YFyTh+e0UdX20k+nDjc ang1S350o/Y0sus6rij1qS8AuxJIjHucG0GdgpZk3KUbcxoRLhI= =qitk -----END PGP SIGNATURE----- Merge tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 LAM (Linear Address Masking) support from Dave Hansen: "Add support for the new Linear Address Masking CPU feature. This is similar to ARM's Top Byte Ignore and allows userspace to store metadata in some bits of pointers without masking it out before use" * tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA selftests/x86/lam: Add test cases for LAM vs thread creation selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking selftests/x86/lam: Add inherit test cases for linear-address masking selftests/x86/lam: Add io_uring test cases for linear-address masking selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking x86/mm/iommu/sva: Make LAM and SVA mutually exclusive iommu/sva: Replace pasid_valid() helper with mm_valid_pasid() mm: Expose untagging mask in /proc/$PID/status x86/mm: Provide arch_prctl() interface for LAM x86/mm: Reduce untagged_addr() overhead for systems without LAM x86/uaccess: Provide untagged_addr() and remove tags before address check mm: Introduce untagged_addr_remote() x86/mm: Handle LAM on context switch x86: CPUID and CR3/CR4 flags for Linear Address Masking x86: Allow atomic MM_CONTEXT flags setting x86/mm: Rework address range check in get_user() and put_user() |
|
|
|
f3ebdf042d |
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
Staring at the comment "Recheck VMA as permissions can change since
migration started" in remove_migration_pte() can result in confusion,
because if the source PTE/PMD indicates write permissions, then there
should be no need to check VMA write permissions when restoring migration
entries or PTE-mapping a PMD.
Commit
|
|
|
|
851ae64246 |
migrate_pages_batch: fix statistics for longterm pin retry
In commit |
|
|
|
3c811f7883 |
mm/migrate: revert "mm/migrate: fix wrongly apply write bit after mkdirty on sparc64"
This reverts commit |
|
|
|
1da28f1b5a |
mm/migrate: drop pte_mkhuge() in remove_migration_pte()
Since the following commit, arch_make_huge_pte() should be used directly
in generic memory subsystem as a platform provided page table helper,
instead of pte_mkhuge(). This just drops pte_mkhuge() from
remove_migration_pte(), which has now become redundant.
'commit
|
|
|
|
428e106ae1 |
mm: Introduce untagged_addr_remote()
untagged_addr() removes tags/metadata from the address and brings it to the canonical form. The helper is implemented on arm64 and sparc. Both of them do untagging based on global rules. However, Linear Address Masking (LAM) on x86 introduces per-process settings for untagging. As a result, untagged_addr() is now only suitable for untagging addresses for the current proccess. The new helper untagged_addr_remote() has to be used when the address targets remote process. It requires the mmap lock for target mm to be taken. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-6-kirill.shutemov%40linux.intel.com |
|
|
|
2ef7dbb269 |
migrate_pages: try migrate in batch asynchronously firstly
When we have locked more than one folios, we cannot wait the lock or bit
(e.g., page lock, buffer head lock, writeback bit) synchronously.
Otherwise deadlock may be triggered. This make it hard to batch the
synchronous migration directly.
This patch re-enables batching synchronous migration via trying to migrate
in batch asynchronously firstly. And any folios that are failed to be
migrated asynchronously will be migrated synchronously one by one.
Test shows that this can restore the TLB flushing batching performance for
synchronous migration effectively.
Link: https://lkml.kernel.org/r/20230303030155.160983-4-ying.huang@intel.com
Fixes:
|
|
|
|
a21d213321 |
migrate_pages: move split folios processing out of migrate_pages_batch()
To simplify the code logic and reduce the line number.
Link: https://lkml.kernel.org/r/20230303030155.160983-3-ying.huang@intel.com
Fixes:
|
|
|
|
fb3592c41a |
migrate_pages: fix deadlock in batched migration
Patch series "migrate_pages: fix deadlock in batched synchronous
migration", v2.
Two deadlock bugs were reported for the migrate_pages() batching series.
Thanks Hugh and Pengfei. Analysis shows that if we have locked some other
folios except the one we are migrating, it's not safe in general to wait
synchronously, for example, to wait the writeback to complete or wait to
lock the buffer head.
So 1/3 fixes the deadlock in a simple way, where the batching support for
the synchronous migration is disabled. The change is straightforward and
easy to be understood. While 3/3 re-introduce the batching for
synchronous migration via trying to migrate asynchronously in batch
optimistically, then fall back to migrate synchronously one by one for
fail-to-migrate folios. Test shows that this can restore the TLB flushing
batching performance for synchronous migration effectively.
This patch (of 3):
Two deadlock bugs were reported for the migrate_pages() batching series.
Thanks Hugh and Pengfei! For example, in the following deadlock trace
snippet,
INFO: task kworker/u4:0:9 blocked for more than 147 seconds.
Not tainted 6.2.0-rc4-kvm+ #1314
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000
Workqueue: loop4 loop_rootcg_workfn
Call Trace:
<TASK>
__schedule+0x43b/0xd00
schedule+0x6a/0xf0
io_schedule+0x4a/0x80
folio_wait_bit_common+0x1b5/0x4e0
? __pfx_wake_page_function+0x10/0x10
__filemap_get_folio+0x73d/0x770
shmem_get_folio_gfp+0x1fd/0xc80
shmem_write_begin+0x91/0x220
generic_perform_write+0x10e/0x2e0
__generic_file_write_iter+0x17e/0x290
? generic_write_checks+0x12b/0x1a0
generic_file_write_iter+0x97/0x180
? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
do_iter_readv_writev+0x13c/0x210
? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
do_iter_write+0xf6/0x330
vfs_iter_write+0x46/0x70
loop_process_work+0x723/0xfe0
loop_rootcg_workfn+0x28/0x40
process_one_work+0x3cc/0x8d0
worker_thread+0x66/0x630
? __pfx_worker_thread+0x10/0x10
kthread+0x153/0x190
? __pfx_kthread+0x10/0x10
ret_from_fork+0x29/0x50
</TASK>
INFO: task repro:1023 blocked for more than 147 seconds.
Not tainted 6.2.0-rc4-kvm+ #1314
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004
Call Trace:
<TASK>
__schedule+0x43b/0xd00
schedule+0x6a/0xf0
io_schedule+0x4a/0x80
folio_wait_bit_common+0x1b5/0x4e0
? compaction_alloc+0x77/0x1150
? __pfx_wake_page_function+0x10/0x10
folio_wait_bit+0x30/0x40
folio_wait_writeback+0x2e/0x1e0
migrate_pages_batch+0x555/0x1ac0
? __pfx_compaction_alloc+0x10/0x10
? __pfx_compaction_free+0x10/0x10
? __this_cpu_preempt_check+0x17/0x20
? lock_is_held_type+0xe6/0x140
migrate_pages+0x100e/0x1180
? __pfx_compaction_free+0x10/0x10
? __pfx_compaction_alloc+0x10/0x10
compact_zone+0xe10/0x1b50
? lock_is_held_type+0xe6/0x140
? check_preemption_disabled+0x80/0xf0
compact_node+0xa3/0x100
? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
? _find_first_bit+0x7b/0x90
sysctl_compaction_handler+0x5d/0xb0
proc_sys_call_handler+0x29d/0x420
proc_sys_write+0x2b/0x40
vfs_write+0x3a3/0x780
ksys_write+0xb7/0x180
__x64_sys_write+0x26/0x30
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7f3a2471f59d
RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
</TASK>
The page migration task has held the lock of the shmem folio A, and is
waiting the writeback of the folio B of the file system on the loop block
device to complete. While the loop worker task which writes back the
folio B is waiting to lock the shmem folio A, because the folio A backs
the folio B in the loop device. Thus deadlock is triggered.
In general, if we have locked some other folios except the one we are
migrating, it's not safe to wait synchronously, for example, to wait the
writeback to complete or wait to lock the buffer head.
To fix the deadlock, in this patch, we avoid to batch the page migration
except for MIGRATE_ASYNC mode. In MIGRATE_ASYNC mode, synchronous waiting
is avoided.
The fix can be improved further. We will do that as soon as possible.
Link: https://lkml.kernel.org/r/20230303030155.160983-1-ying.huang@intel.com
Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/
Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/
Link: https://lore.kernel.org/linux-mm/20230227110614.dngdub2j3exr6dfp@quack3/
Link: https://lkml.kernel.org/r/20230303030155.160983-2-ying.huang@intel.com
Fixes:
|
|
|
|
e77d587a2c |
mm: avoid gcc complaint about pointer casting
The migration code ends up temporarily stashing information of the wrong
type in unused fields of the newly allocated destination folio. That
all works fine, but gcc does complain about the pointer type mis-use:
mm/migrate.c: In function ‘__migrate_folio_extract’:
mm/migrate.c:1050:20: note: randstruct: casting between randomized structure pointer types (ssa): ‘struct anon_vma’ and ‘struct address_space’
1050 | *anon_vmap = (void *)dst->mapping;
| ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
and gcc is actually right to complain since it really doesn't understand
that this is a very temporary special case where this is ok.
This could be fixed in different ways by just obfuscating the assignment
sufficiently that gcc doesn't see what is going on, but the truly
"proper C" way to do this is by explicitly using a union.
Using unions for type conversions like this is normally hugely ugly and
syntactically nasty, but this really is one of the few cases where we
want to make it clear that we're not doing type conversion, we're really
re-using the value bit-for-bit just using another type.
IOW, this should not become a common pattern, but in this one case using
that odd union is probably the best way to document to the compiler what
is conceptually going on here.
[ Side note: there are valid cases where we convert pointers to other
pointer types, notably the whole "folio vs page" situation, where the
types actually have fundamental commonalities.
The fact that the gcc note is limited to just randomized structures
means that we don't see equivalent warnings for those cases, but it
migth also mean that we miss other cases where we do play these kinds
of dodgy games, and this kind of explicit conversion might be a good
idea. ]
I verified that at least for an allmodconfig build on x86-64, this
generates the exact same code, apart from line numbers and assembler
comment changes.
Fixes:
|
|
|
|
3822a7c409 |
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter". These filters provide users
with finer-grained control over DAMOS's actions. SeongJae has also done
some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series "mm:
support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with his
series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings. The previous BPF-based approach had
shortcomings. See "mm: In-kernel support for memory-deny-write-execute
(MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a per-node
basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage during
compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in ths
series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's series
"mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
"fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest of
the kernel in the series "Simplify the external interface for GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the series
"mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
=MlGs
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X
bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()")
which does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter".
These filters provide users with finer-grained control over DAMOS's
actions. SeongJae has also done some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series
"mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
swap PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with
his series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings.
The previous BPF-based approach had shortcomings. See "mm: In-kernel
support for memory-deny-write-execute (MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a
per-node basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage
during compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in
ths series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier
functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's
series "mm, arch: add generic implementation of pfn_valid() for
FLATMEM" and "fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest
of the kernel in the series "Simplify the external interface for
GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the
series "mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
include/linux/migrate.h: remove unneeded externs
mm/memory_hotplug: cleanup return value handing in do_migrate_range()
mm/uffd: fix comment in handling pte markers
mm: change to return bool for isolate_movable_page()
mm: hugetlb: change to return bool for isolate_hugetlb()
mm: change to return bool for isolate_lru_page()
mm: change to return bool for folio_isolate_lru()
objtool: add UACCESS exceptions for __tsan_volatile_read/write
kmsan: disable ftrace in kmsan core code
kasan: mark addr_has_metadata __always_inline
mm: memcontrol: rename memcg_kmem_enabled()
sh: initialize max_mapnr
m68k/nommu: add missing definition of ARCH_PFN_OFFSET
mm: percpu: fix incorrect size in pcpu_obj_full_size()
maple_tree: reduce stack usage with gcc-9 and earlier
mm: page_alloc: call panic() when memoryless node allocation fails
mm: multi-gen LRU: avoid futile retries
migrate_pages: move THP/hugetlb migration support check to simplify code
migrate_pages: batch flushing TLB
migrate_pages: share more code between _unmap and _move
...
|
|
|
|
cd7755800e |
mm: change to return bool for isolate_movable_page()
Now the isolate_movable_page() can only return 0 or -EBUSY, and no users will care about the negative return value, thus we can convert the isolate_movable_page() to return a boolean value to make the code more clear when checking the movable page isolation state. No functional changes intended. [akpm@linux-foundation.org: remove unneeded comment, per Matthew] Link: https://lkml.kernel.org/r/cb877f73f4fff8d309611082ec740a7065b1ade0.1676424378.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
9747b9e924 |
mm: hugetlb: change to return bool for isolate_hugetlb()
Now the isolate_hugetlb() only returns 0 or -EBUSY, and most users did not care about the negative value, thus we can convert the isolate_hugetlb() to return a boolean value to make code more clear when checking the hugetlb isolation state. Moreover converts 2 users which will consider the negative value returned by isolate_hugetlb(). No functional changes intended. [akpm@linux-foundation.org: shorten locked section, per SeongJae Park] Link: https://lkml.kernel.org/r/12a287c5bebc13df304387087bbecc6421510849.1676424378.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
f7f9c00dfa |
mm: change to return bool for isolate_lru_page()
The isolate_lru_page() can only return 0 or -EBUSY, and most users did not care about the negative error of isolate_lru_page(), except one user in add_page_for_migration(). So we can convert the isolate_lru_page() to return a boolean value, which can help to make the code more clear when checking the return value of isolate_lru_page(). Also convert all users' logic of checking the isolation state. No functional changes intended. Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
96a9c287e2 |
mm/migrate: fix wrongly apply write bit after mkdirty on sparc64
Nick Bowler reported another sparc64 breakage after the young/dirty
persistent work for page migration (per "Link:" below). That's after a
similar report [2].
It turns out page migration was overlooked, and it wasn't failing before
because page migration was not enabled in the initial report test
environment.
David proposed another way [2] to fix this from sparc64 side, but that
patch didn't land somehow. Neither did I check whether there's any other
arch that has similar issues.
Let's fix it for now as simple as moving the write bit handling to be
after dirty, like what we did before.
Note: this is based on mm-unstable, because the breakage was since 6.1 and
we're at a very late stage of 6.2 (-rc8), so I assume for this specific
case we should target this at 6.3.
[1] https://lore.kernel.org/all/20221021160603.GA23307@u164.east.ru/
[2] https://lore.kernel.org/all/20221212130213.136267-1-david@redhat.com/
Link: https://lkml.kernel.org/r/20230216153059.256739-1-peterx@redhat.com
Fixes:
|
|
|
|
6f7d760e86 |
migrate_pages: move THP/hugetlb migration support check to simplify code
This is a code cleanup patch, no functionality change is expected. After the change, the line number reduces especially in the long migrate_pages_batch(). Link: https://lkml.kernel.org/r/20230213123444.155149-10-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
7e12beb8ca |
migrate_pages: batch flushing TLB
The TLB flushing will cost quite some CPU cycles during the folio migration in some situations. For example, when migrate a folio of a process with multiple active threads that run on multiple CPUs. After batching the _unmap and _move in migrate_pages(), the TLB flushing can be batched easily with the existing TLB flush batching mechanism. This patch implements that. We use the following test case to test the patch. On a 2-socket Intel server, - Run pmbench memory accessing benchmark - Run `migratepages` to migrate pages of pmbench between node 0 and node 1 back and forth. With the patch, the TLB flushing IPI reduces 99.1% during the test and the number of pages migrated successfully per second increases 291.7%. Haoxin helped to test the patchset on an ARM64 server with 128 cores, 2 NUMA nodes. Test results show that the page migration performance increases up to 78%. NOTE: TLB flushing is batched only for normal folios, not for THP folios. Because the overhead of TLB flushing for THP folios is much lower than that for normal folios (about 1/512 on x86 platform). Link: https://lkml.kernel.org/r/20230213123444.155149-9-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
ebe75e4751 |
migrate_pages: share more code between _unmap and _move
This is a code cleanup patch to reduce the duplicated code between the _unmap and _move stages of migrate_pages(). No functionality change is expected. Link: https://lkml.kernel.org/r/20230213123444.155149-8-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
80562ba0d8 |
migrate_pages: move migrate_folio_unmap()
Just move the position of the functions. There's no any functionality change. This is to make it easier to review the next patch via putting code near its position in the next patch. Link: https://lkml.kernel.org/r/20230213123444.155149-7-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
5dfab109d5 |
migrate_pages: batch _unmap and _move
In this patch the _unmap and _move stage of the folio migration is
batched. That for, previously, it is,
for each folio
_unmap()
_move()
Now, it is,
for each folio
_unmap()
for each folio
_move()
Based on this, we can batch the TLB flushing and use some hardware
accelerator to copy folios between batched _unmap and batched _move
stages.
Link: https://lkml.kernel.org/r/20230213123444.155149-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
64c8902ed4 |
migrate_pages: split unmap_and_move() to _unmap() and _move()
This is a preparation patch to batch the folio unmapping and moving. In this patch, unmap_and_move() is split to migrate_folio_unmap() and migrate_folio_move(). So, we can batch _unmap() and _move() in different loops later. To pass some information between unmap and move, the original unused dst->mapping and dst->private are used. Link: https://lkml.kernel.org/r/20230213123444.155149-5-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
42012e0436 |
migrate_pages: restrict number of pages to migrate in batch
This is a preparation patch to batch the folio unmapping and moving for non-hugetlb folios. If we had batched the folio unmapping, all folios to be migrated would be unmapped before copying the contents and flags of the folios. If the folios that were passed to migrate_pages() were too many in unit of pages, the execution of the processes would be stopped for too long time, thus too long latency. For example, migrate_pages() syscall will call migrate_pages() with all folios of a process. To avoid this possible issue, in this patch, we restrict the number of pages to be migrated to be no more than HPAGE_PMD_NR. That is, the influence is at the same level of THP migration. Link: https://lkml.kernel.org/r/20230213123444.155149-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
e5bfff8b10 |
migrate_pages: separate hugetlb folios migration
This is a preparation patch to batch the folio unmapping and moving for the non-hugetlb folios. Based on that we can batch the TLB shootdown during the folio migration and make it possible to use some hardware accelerator for the folio copying. In this patch the hugetlb folios and non-hugetlb folios migration is separated in migrate_pages() to make it easy to change the non-hugetlb folios migration implementation. Link: https://lkml.kernel.org/r/20230213123444.155149-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
5b85593709 |
migrate_pages: organize stats with struct migrate_pages_stats
Patch series "migrate_pages(): batch TLB flushing", v5.
Now, migrate_pages() migrates folios one by one, like the fake code as
follows,
for each folio
unmap
flush TLB
copy
restore map
If multiple folios are passed to migrate_pages(), there are opportunities
to batch the TLB flushing and copying. That is, we can change the code to
something as follows,
for each folio
unmap
for each folio
flush TLB
for each folio
copy
for each folio
restore map
The total number of TLB flushing IPI can be reduced considerably. And we
may use some hardware accelerator such as DSA to accelerate the folio
copying.
So in this patch, we refactor the migrate_pages() implementation and
implement the TLB flushing batching. Base on this, hardware accelerated
folio copying can be implemented.
If too many folios are passed to migrate_pages(), in the naive batched
implementation, we may unmap too many folios at the same time. The
possibility for a task to wait for the migrated folios to be mapped again
increases. So the latency may be hurt. To deal with this issue, the max
number of folios be unmapped in batch is restricted to no more than
HPAGE_PMD_NR in the unit of page. That is, the influence is at the same
level of THP migration.
We use the following test to measure the performance impact of the
patchset,
On a 2-socket Intel server,
- Run pmbench memory accessing benchmark
- Run `migratepages` to migrate pages of pmbench between node 0 and
node 1 back and forth.
With the patch, the TLB flushing IPI reduces 99.1% during the test and
the number of pages migrated successfully per second increases 291.7%.
Xin Hao helped to test the patchset on an ARM64 server with 128 cores,
2 NUMA nodes. Test results show that the page migration performance
increases up to 78%.
This patch (of 9):
Define struct migrate_pages_stats to organize the various statistics in
migrate_pages(). This makes it easier to collect and consume the
statistics in multiple functions. This will be needed in the following
patches in the series.
Link: https://lkml.kernel.org/r/20230213123444.155149-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20230213123444.155149-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
280d724ac2 |
mm/migrate: convert putback_movable_pages() to use folios
Removes 6 calls to compound_head(), and replaces putback_movable_page() with putback_movable_folio() as well. Link: https://lkml.kernel.org/r/20230130214352.40538-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
19979497c0 |
mm/migrate: convert isolate_movable_page() to use folios
Removes 6 calls to compound_head() and prepares the function to take in a folio instead of page argument. Link: https://lkml.kernel.org/r/20230130214352.40538-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
da707a6d18 |
mm/migrate: add folio_movable_ops()
folio_movable_ops() does the same as page_movable_ops() except uses folios instead of pages. This function will help make folio conversions in migrate.c more readable. Link: https://lkml.kernel.org/r/20230130214352.40538-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
ea8e72f411 |
mm/hugetlb: convert putback_active_hugepage to take in a folio
Convert putback_active_hugepage() to folio_putback_active_hugetlb(), this removes one user of the Huge Page macros which take in a page. The callers in migrate.c are also cleaned up by being able to directly use the src and dst folio variables. Link: https://lkml.kernel.org/r/20230125170537.96973-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
e37d3e838d |
mm/hugetlb: convert alloc_migrate_huge_page to folios
Change alloc_huge_page_nodemask() to alloc_hugetlb_folio_nodemask() and alloc_migrate_huge_page() to alloc_migrate_hugetlb_folio(). Both functions now return a folio rather than a page. Link: https://lkml.kernel.org/r/20230113223057.173292-7-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
6aa3a92012 |
mm/hugetlb: convert isolate_hugetlb to folios
Patch series "continue hugetlb folio conversion", v3. This series continues the conversion of core hugetlb functions to use folios. This series converts many helper funtions in the hugetlb fault path. This is in preparation for another series to convert the hugetlb fault code paths to operate on folios. This patch (of 8): Convert isolate_hugetlb() to take in a folio and convert its callers to pass a folio. Use page_folio() to convert the callers to use a folio is safe as isolate_hugetlb() operates on a head page. Link: https://lkml.kernel.org/r/20230113223057.173292-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20230113223057.173292-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
96f97c438f |
mm: mlock: update the interface to use folios
Update the mlock interface to accept folios rather than pages, bringing the interface in line with the internal implementation. munlock_vma_page() still requires a page_folio() conversion, however this is consistent with the existent mlock_vma_page() implementation and a product of rmap still dealing in pages rather than folios. Link: https://lkml.kernel.org/r/cba12777c5544305014bc0cbec56bb4cc71477d8.1673526881.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
fcd48540d1 |
mm/hugetlb: move swap entry handling into vma lock when faulted
In hugetlb_fault(), there used to have a special path to handle swap entry at the entrance using huge_pte_offset(). That's unsafe because huge_pte_offset() for a pmd sharable range can access freed pgtables if without any lock to protect the pgtable from being freed after pmd unshare. Here the simplest solution to make it safe is to move the swap handling to be after the vma lock being held. We may need to take the fault mutex on either migration or hwpoison entries now (also the vma lock, but that's really needed), however neither of them is hot path. Note that the vma lock cannot be released in hugetlb_fault() when the migration entry is detected, because in migration_entry_wait_huge() the pgtable page will be used again (by taking the pgtable lock), so that also need to be protected by the vma lock. Modify migration_entry_wait_huge() so that it must be called with vma read lock held, and properly release the lock in __migration_entry_wait_huge(). Link: https://lkml.kernel.org/r/20221216155100.2043537-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
e2ca6ba6ba |
MM patches for 6.2-rc1.
- More userfaultfs work from Peter Xu. - Several convert-to-folios series from Sidhartha Kumar and Huang Ying. - Some filemap cleanups from Vishal Moola. - David Hildenbrand added the ability to selftest anon memory COW handling. - Some cpuset simplifications from Liu Shixin. - Addition of vmalloc tracing support by Uladzislau Rezki. - Some pagecache folioifications and simplifications from Matthew Wilcox. - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it. - Miguel Ojeda contributed some cleanups for our use of the __no_sanitize_thread__ gcc keyword. This series shold have been in the non-MM tree, my bad. - Naoya Horiguchi improved the interaction between memory poisoning and memory section removal for huge pages. - DAMON cleanups and tuneups from SeongJae Park - Tony Luck fixed the handling of COW faults against poisoned pages. - Peter Xu utilized the PTE marker code for handling swapin errors. - Hugh Dickins reworked compound page mapcount handling, simplifying it and making it more efficient. - Removal of the autonuma savedwrite infrastructure from Nadav Amit and David Hildenbrand. - zram support for multiple compression streams from Sergey Senozhatsky. - David Hildenbrand reworked the GUP code's R/O long-term pinning so that drivers no longer need to use the FOLL_FORCE workaround which didn't work very well anyway. - Mel Gorman altered the page allocator so that local IRQs can remnain enabled during per-cpu page allocations. - Vishal Moola removed the try_to_release_page() wrapper. - Stefan Roesch added some per-BDI sysfs tunables which are used to prevent network block devices from dirtying excessive amounts of pagecache. - David Hildenbrand did some cleanup and repair work on KSM COW breaking. - Nhat Pham and Johannes Weiner have implemented writeback in zswap's zsmalloc backend. - Brian Foster has fixed a longstanding corner-case oddity in file[map]_write_and_wait_range(). - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen. - Shiyang Ruan has done some work on fsdax, to make its reflink mode work better under xfstests. Better, but still not perfect. - Christoph Hellwig has removed the .writepage() method from several filesystems. They only need .writepages(). - Yosry Ahmed wrote a series which fixes the memcg reclaim target beancounting. - David Hildenbrand has fixed some of our MM selftests for 32-bit machines. - Many singleton patches, as usual. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5j6ZwAKCRDdBJ7gKXxA jkDYAP9qNeVqp9iuHjZNTqzMXkfmJPsw2kmy2P+VdzYVuQRcJgEAgoV9d7oMq4ml CodAgiA51qwzId3GRytIo/tfWZSezgA= =d19R -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - More userfaultfs work from Peter Xu - Several convert-to-folios series from Sidhartha Kumar and Huang Ying - Some filemap cleanups from Vishal Moola - David Hildenbrand added the ability to selftest anon memory COW handling - Some cpuset simplifications from Liu Shixin - Addition of vmalloc tracing support by Uladzislau Rezki - Some pagecache folioifications and simplifications from Matthew Wilcox - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it - Miguel Ojeda contributed some cleanups for our use of the __no_sanitize_thread__ gcc keyword. This series should have been in the non-MM tree, my bad - Naoya Horiguchi improved the interaction between memory poisoning and memory section removal for huge pages - DAMON cleanups and tuneups from SeongJae Park - Tony Luck fixed the handling of COW faults against poisoned pages - Peter Xu utilized the PTE marker code for handling swapin errors - Hugh Dickins reworked compound page mapcount handling, simplifying it and making it more efficient - Removal of the autonuma savedwrite infrastructure from Nadav Amit and David Hildenbrand - zram support for multiple compression streams from Sergey Senozhatsky - David Hildenbrand reworked the GUP code's R/O long-term pinning so that drivers no longer need to use the FOLL_FORCE workaround which didn't work very well anyway - Mel Gorman altered the page allocator so that local IRQs can remnain enabled during per-cpu page allocations - Vishal Moola removed the try_to_release_page() wrapper - Stefan Roesch added some per-BDI sysfs tunables which are used to prevent network block devices from dirtying excessive amounts of pagecache - David Hildenbrand did some cleanup and repair work on KSM COW breaking - Nhat Pham and Johannes Weiner have implemented writeback in zswap's zsmalloc backend - Brian Foster has fixed a longstanding corner-case oddity in file[map]_write_and_wait_range() - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen - Shiyang Ruan has done some work on fsdax, to make its reflink mode work better under xfstests. Better, but still not perfect - Christoph Hellwig has removed the .writepage() method from several filesystems. They only need .writepages() - Yosry Ahmed wrote a series which fixes the memcg reclaim target beancounting - David Hildenbrand has fixed some of our MM selftests for 32-bit machines - Many singleton patches, as usual * tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits) mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio mm: mmu_gather: allow more than one batch of delayed rmaps mm: fix typo in struct pglist_data code comment kmsan: fix memcpy tests mm: add cond_resched() in swapin_walk_pmd_entry() mm: do not show fs mm pc for VM_LOCKONFAULT pages selftests/vm: ksm_functional_tests: fixes for 32bit selftests/vm: cow: fix compile warning on 32bit selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem mm,thp,rmap: fix races between updates of subpages_mapcount mm: memcg: fix swapcached stat accounting mm: add nodes= arg to memory.reclaim mm: disable top-tier fallback to reclaim on proactive reclaim selftests: cgroup: make sure reclaim target memcg is unprotected selftests: cgroup: refactor proactive reclaim code to reclaim_until() mm: memcg: fix stale protection of reclaim target memcg mm/mmap: properly unaccount memory on mas_preallocate() failure omfs: remove ->writepage jfs: remove ->writepage ... |
|
|
|
deb9acc122 |
A large number of cleanups and bug fixes, with many of the bug fixes
found by Syzbot and fuzzing. (Many of the bug fixes involve less-used ext4 features such as fast_commit, inline_data and bigalloc.) In addition, remove the writepage function for ext4, since the medium-term plan is to remove ->writepage() entirely. (The VM doesn't need or want writepage() for writeback, since it is fine with ->writepages() so long as ->migrate_folio() is implemented.) -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmOWqrMACgkQ8vlZVpUN gaMvmgf+P2C6vzjn13ZdF+GwFTi4fx4TJ5BZT78LQqvTZqhkfk4k1q2SFfHI7nXT ZWdu1KUQ0SYLo64oaSU9W+2B2pmGi/KgUlrwNhy8DFeGStogPuDVfmGWB63p1UQL ld42mE9q7bjY6nCZSKYXPp2jfSwsHuliHBJ4UfzVNAIwjiUEJ7pGeIrMFdLAEkVm TVNzvlUZaHUnVxhpsP6hs+5WNhHQ2IhWz4rwX01ussNgHTijYac4iaL05wpTvF5e 6NtvfmpOEMAbYrmIkJX4RVss4JNsHNOC0E8fjEHlgXJxBiAI6w8GxTxrS52Y4ELH nHXl/pc0L+I8+yh9B9+s0LBaSuPuTg== =lezv -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A large number of cleanups and bug fixes, with many of the bug fixes found by Syzbot and fuzzing. (Many of the bug fixes involve less-used ext4 features such as fast_commit, inline_data and bigalloc) In addition, remove the writepage function for ext4, since the medium-term plan is to remove ->writepage() entirely. (The VM doesn't need or want writepage() for writeback, since it is fine with ->writepages() so long as ->migrate_folio() is implemented)" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits) ext4: fix reserved cluster accounting in __es_remove_extent() ext4: fix inode leak in ext4_xattr_inode_create() on an error path ext4: allocate extended attribute value in vmalloc area ext4: avoid unaccounted block allocation when expanding inode ext4: initialize quota before expanding inode in setproject ioctl ext4: stop providing .writepage hook mm: export buffer_migrate_folio_norefs() ext4: switch to using write_cache_pages() for data=journal writeout jbd2: switch jbd2_submit_inode_data() to use fs-provided hook for data writeout ext4: switch to using ext4_do_writepages() for ordered data writeout ext4: move percpu_rwsem protection into ext4_writepages() ext4: provide ext4_do_writepages() ext4: add support for writepages calls that cannot map blocks ext4: drop pointless IO submission from ext4_bio_write_page() ext4: remove nr_submitted from ext4_bio_write_page() ext4: move keep_towrite handling to ext4_bio_write_page() ext4: handle redirtying in ext4_bio_write_page() ext4: fix kernel BUG in 'ext4_write_inline_data_end()' ext4: make ext4_mb_initialize_context return void ext4: fix deadlock due to mbcache entry corruption ... |
|
|
|
893660b0e1 |
slab updates for 6.2-rc1
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEjUuTAak14xi+SF7M4CHKc/GJqRAFAmOTQvYACgkQ4CHKc/GJ
qRCaqQf/UjCDmj1vYKcsTzp5L4MDXdQPA7dKtytbnZtROtClVNUzB0jODsfeMI7C
SwbDJRoUU1y99GRFYIx9oGji1q7TYOWS/PsZxOGkv8ILommmQ1kJdZdxt9rOqYNg
3mjCZoQmZMIRipLDrN55C096Mi+mI89kkE4Lkyrigpmxvc0KyX6QBerr+VmaBMHw
DjmFC6Gj+ZH2AX6z7AzOF1gZ42gPBQUjWdHFRcY41dShOQZNl2FPT5ITAvotlJlH
9mj6woCqW936UOcpUl+Qqk7mekDJb1hqmYXV2VAlhprBi6Vcd9PU6GmPPb6w51bS
HkSNNYjkbuNxBXY13PUPcR0hEHv9zw==
=AlWx
-----END PGP SIGNATURE-----
Merge tag 'slab-for-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- SLOB deprecation and SLUB_TINY
The SLOB allocator adds maintenance burden and stands in the way of
API improvements [1]. Deprecate it by renaming the config option (to
make users notice) to CONFIG_SLOB_DEPRECATED with updated help text.
SLUB should be used instead as SLAB will be the next on the removal
list.
Based on reports from a riscv k210 board with 8MB RAM, add a
CONFIG_SLUB_TINY option to minimize SLUB's memory usage at the
expense of scalability. This has resolved the k210 regression [2] so
in case there are no others (that wouldn't be resolvable by further
tweaks to SLUB_TINY) plan is to remove SLOB in a few cycles.
Existing defconfigs with CONFIG_SLOB are converted to
CONFIG_SLUB_TINY.
- kmalloc() slub_debug redzone improvements
A series from Feng Tang that builds on the tracking or requested size
for kmalloc() allocations (for caches with debugging enabled) added
in 6.1, to make redzone checks consider the requested size and not
the rounded up one, in order to catch more subtle buffer overruns.
Includes new slub_kunit test.
- struct slab fields reordering to accomodate larger rcu_head
RCU folks would like to grow rcu_head with debugging options, which
breaks current struct slab layout's assumptions, so reorganize it to
make this possible.
- Miscellaneous improvements/fixes:
- __alloc_size checking compiler workaround (Kees Cook)
- Optimize and cleanup SLUB's sysfs init (Rasmus Villemoes)
- Make SLAB compatible with PROVE_RAW_LOCK_NESTING (Jiri Kosina)
- Correct SLUB's percpu allocation estimates (Baoquan He)
- Re-enableS LUB's run-time failslab sysfs control (Alexander Atanasov)
- Make tools/vm/slabinfo more user friendly when not run as root (Rong Tao)
- Dead code removal in SLUB (Hyeonggon Yoo)
* tag 'slab-for-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (31 commits)
mm, slob: rename CONFIG_SLOB to CONFIG_SLOB_DEPRECATED
mm, slub: don't aggressively inline with CONFIG_SLUB_TINY
mm, slub: remove percpu slabs with CONFIG_SLUB_TINY
mm, slub: split out allocations from pre/post hooks
mm/slub, kunit: Add a test case for kmalloc redzone check
mm/slub, kunit: add SLAB_SKIP_KFENCE flag for cache creation
mm, slub: refactor free debug processing
mm, slab: ignore SLAB_RECLAIM_ACCOUNT with CONFIG_SLUB_TINY
mm, slub: don't create kmalloc-rcl caches with CONFIG_SLUB_TINY
mm, slub: lower the default slub_max_order with CONFIG_SLUB_TINY
mm, slub: retain no free slabs on partial list with CONFIG_SLUB_TINY
mm, slub: disable SYSFS support with CONFIG_SLUB_TINY
mm, slub: add CONFIG_SLUB_TINY
mm, slab: ignore hardened usercopy parameters when disabled
slab: Remove special-casing of const 0 size allocations
slab: Clean up SLOB vs kmalloc() definition
mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head
mm/migrate: make isolate_movable_page() skip slab pages
mm/slab: move and adjust kernel-doc for kmem_cache_alloc
mm/slub, percpu: correct the calculation of early percpu allocation size
...
|
|
|
|
e26355e215 |
mm: export buffer_migrate_folio_norefs()
Ext4 needs this function to allow safe migration for journalled data pages. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221207112722.22220-11-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
4c74b65f47 |
mm/migrate.c: stop using 0 as NULL pointer
mm/migrate.c:1198:24: warning: Using plain integer as NULL pointer Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3080 Link: https://lkml.kernel.org/r/20221116012345.84870-1-yang.lee@linux.alibaba.com Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
eaec4e639f |
migrate: convert migrate_pages() to use folios
Quite straightforward, the page functions are converted to corresponding folio functions. Same for comments. THP specific code are converted to be large folio. Link: https://lkml.kernel.org/r/20221109012348.93849-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
49f5185922 |
migrate: convert unmap_and_move() to use folios
Patch series "migrate: convert migrate_pages()/unmap_and_move() to use folios", v2. The conversion is quite straightforward, just replace the page API to the corresponding folio API. migrate_pages() and unmap_and_move() mostly work with folios (head pages) only. This patch (of 2): Quite straightforward, the page functions are converted to corresponding folio functions. Same for comments. Link: https://lkml.kernel.org/r/20221109012348.93849-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20221109012348.93849-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
16fd6b31dd |
Revert "mm: migration: fix the FOLL_GET failure on following huge page"
Revert commit |
|
|
|
345c62d163 |
mm/hugetlb: convert move_hugetlb_state() to folios
Clean up unmap_and_move_huge_page() by converting move_hugetlb_state() to take in folios. [akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n build] Link: https://lkml.kernel.org/r/20221101223059.460937-10-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Bui Quang Minh <minhquangbui99@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
8b8817630a |
mm/migrate: make isolate_movable_page() skip slab pages
In the next commit we want to rearrange struct slab fields to allow a larger rcu_head. Afterwards, the page->mapping field will overlap with SLUB's "struct list_head slab_list", where the value of prev pointer can become LIST_POISON2, which is 0x122 + POISON_POINTER_DELTA. Unfortunately the bit 1 being set can confuse PageMovable() to be a false positive and cause a GPF as reported by lkp [1]. To fix this, make isolate_movable_page() skip pages with the PageSlab flag set. This is a bit tricky as we need to add memory barriers to SLAB and SLUB's page allocation and freeing, and their counterparts to isolate_movable_page(). Based on my RFC from [2]. Added a comment update from Matthew's variant in [3] and, as done there, moved the PageSlab checks to happen before trying to take the page lock. [1] https://lore.kernel.org/all/208c1757-5edd-fd42-67d4-1940cc43b50f@intel.com/ [2] https://lore.kernel.org/all/aec59f53-0e53-1736-5932-25407125d4d4@suse.cz/ [3] https://lore.kernel.org/all/YzsVM8eToHUeTP75@casper.infradead.org/ Reported-by: kernel test robot <yujie.liu@intel.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> |
|
|
|
fd4a7ac329 |
mm: migrate: try again if THP split is failed due to page refcnt
When creating a virtual machine, we will use memfd_create() to get a file descriptor which can be used to create share memory mappings using the mmap function, meanwhile the mmap() will set the MAP_POPULATE flag to allocate physical pages for the virtual machine. When allocating physical pages for the guest, the host can fallback to allocate some CMA pages for the guest when over half of the zone's free memory is in the CMA area. In guest os, when the application wants to do some data transaction with DMA, our QEMU will call VFIO_IOMMU_MAP_DMA ioctl to do longterm-pin and create IOMMU mappings for the DMA pages. However, when calling VFIO_IOMMU_MAP_DMA ioctl to pin the physical pages, we found it will be failed to longterm-pin sometimes. After some invetigation, we found the pages used to do DMA mapping can contain some CMA pages, and these CMA pages will cause a possible failure of the longterm-pin, due to failed to migrate the CMA pages. The reason of migration failure may be temporary reference count or memory allocation failure. So that will cause the VFIO_IOMMU_MAP_DMA ioctl returns error, which makes the application failed to start. I observed one migration failure case (which is not easy to reproduce) is that, the 'thp_migration_fail' count is 1 and the 'thp_split_page_failed' count is also 1. That means when migrating a THP which is in CMA area, but can not allocate a new THP due to memory fragmentation, so it will split the THP. However THP split is also failed, probably the reason is temporary reference count of this THP. And the temporary reference count can be caused by dropping page caches (I observed the drop caches operation in the system), but we can not drop the shmem page caches due to they are already dirty at that time. Especially for THP split failure, which is caused by temporary reference count, we can try again to mitigate the failure of migration in this case according to previous discussion [1]. [1] https://lore.kernel.org/all/470dc638-a300-f261-94b4-e27250e42f96@redhat.com/ Link: https://lkml.kernel.org/r/6784730480a1df82e8f4cba1ed088e4ac767994b.1666599848.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
e51da3a9b6 |
mm/hugetlb: add folio_hstate()
Helper function to retrieve hstate information from a hugetlb folio. Link: https://lkml.kernel.org/r/20220922154207.1575343-6-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Colin Cross <ccross@google.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
03e5f82ea6 |
mm: migrate: fix return value if all subpages of THPs are migrated successfully
During THP migration, if THPs are not migrated but they are split and all
subpages are migrated successfully, migrate_pages() will still return the
number of THP pages that were not migrated. This will confuse the callers
of migrate_pages(). For example, the longterm pinning will failed though
all pages are migrated successfully.
Thus we should return 0 to indicate that all pages are migrated in this
case
Link: https://lkml.kernel.org/r/de386aa864be9158d2f3b344091419ea7c38b2f7.1666599848.git.baolin.wang@linux.alibaba.com
Fixes:
|
|
|
|
16ce101db8 |
mm/memory.c: fix race when faulting a device private page
Patch series "Fix several device private page reference counting issues", v2 This series aims to fix a number of page reference counting issues in drivers dealing with device private ZONE_DEVICE pages. These result in use-after-free type bugs, either from accessing a struct page which no longer exists because it has been removed or accessing fields within the struct page which are no longer valid because the page has been freed. During normal usage it is unlikely these will cause any problems. However without these fixes it is possible to crash the kernel from userspace. These crashes can be triggered either by unloading the kernel module or unbinding the device from the driver prior to a userspace task exiting. In modules such as Nouveau it is also possible to trigger some of these issues by explicitly closing the device file-descriptor prior to the task exiting and then accessing device private memory. This involves some minor changes to both PowerPC and AMD GPU code. Unfortunately I lack hardware to test either of those so any help there would be appreciated. The changes mimic what is done in for both Nouveau and hmm-tests though so I doubt they will cause problems. This patch (of 8): When the CPU tries to access a device private page the migrate_to_ram() callback associated with the pgmap for the page is called. However no reference is taken on the faulting page. Therefore a concurrent migration of the device private page can free the page and possibly the underlying pgmap. This results in a race which can crash the kernel due to the migrate_to_ram() function pointer becoming invalid. It also means drivers can't reliably read the zone_device_data field because the page may have been freed with memunmap_pages(). Close the race by getting a reference on the page while holding the ptl to ensure it has not been freed. Unfortunately the elevated reference count will cause the migration required to handle the fault to fail. To avoid this failure pass the faulting page into the migrate_vma functions so that if an elevated reference count is found it can be checked to see if it's expected or not. [mpe@ellerman.id.au: fix build] Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Lyude Paul <lyude@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
29eea9b5a9 |
mm: convert page_get_anon_vma() to folio_get_anon_vma()
With all callers now passing in a folio, rename the function and convert all callers. Removes a couple of calls to compound_head() and a reference to page->mapping. Link: https://lkml.kernel.org/r/20220902194653.1739778-55-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
c33db29231 |
migrate: convert unmap_and_move_huge_page() to use folios
Saves several calls to compound_head() and removes a couple of uses of page->lru. Link: https://lkml.kernel.org/r/20220902194653.1739778-52-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
682a71a1b6 |
migrate: convert __unmap_and_move() to use folios
Removes a lot of calls to compound_head(). Also remove a VM_BUG_ON that can never trigger as the PageAnon bit is the bottom bit of page->mapping. Link: https://lkml.kernel.org/r/20220902194653.1739778-51-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
f7091ed64e |
mm: fix the handling Non-LRU pages returned by follow_page
The handling Non-LRU pages returned by follow_page() jumps directly, it
doesn't call put_page() to handle the reference count, since 'FOLL_GET'
flag for follow_page() has get_page() called. Fix the zone device page
check by handling the page reference count correctly before returning.
And as David reviewed, "device pages are never PageKsm pages". Drop this
zone device page check for break_ksm().
Since the zone device page can't be a transparent huge page, so drop the
redundant zone device page check for split_huge_pages_pid(). (by Miaohe)
Link: https://lkml.kernel.org/r/20220823135841.934465-3-haiyue.wang@intel.com
Fixes:
|
|
|
|
467b171af8 |
mm/demotion: update node_is_toptier to work with memory tiers
With memory tier support we can have memory only NUMA nodes in the top tier from which we want to avoid promotion tracking NUMA faults. Update node_is_toptier to work with memory tiers. All NUMA nodes are by default top tier nodes. With lower(slower) memory tiers added we consider all memory tiers above a memory tier having CPU NUMA nodes as a top memory tier [sj@kernel.org: include missed header file, memory-tiers.h] Link: https://lkml.kernel.org/r/20220820190720.248704-1-sj@kernel.org [akpm@linux-foundation.org: mm/memory.c needs linux/memory-tiers.h] [aneesh.kumar@linux.ibm.com: make toptier_distance inclusive upper bound of toptiers] Link: https://lkml.kernel.org/r/20220830081457.118960-1-aneesh.kumar@linux.ibm.com Link: https://lkml.kernel.org/r/20220818131042.113280-10-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Wei Xu <weixugc@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hesham Almatary <hesham.almatary@huawei.com> Cc: Jagdish Gediya <jvgediya.oss@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
6c542ab757 |
mm/demotion: build demotion targets based on explicit memory tiers
This patch switch the demotion target building logic to use memory tiers instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the default memory tier and additional memory tiers will be added by drivers like dax kmem. This patch builds the demotion target for a NUMA node by looking at all memory tiers below the tier to which the NUMA node belongs. The closest node in the immediately following memory tier is used as a demotion target. Since we are now only building demotion target for N_MEMORY NUMA nodes the CPU hotplug calls are removed in this patch. Link: https://lkml.kernel.org/r/20220818131042.113280-6-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Wei Xu <weixugc@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hesham Almatary <hesham.almatary@huawei.com> Cc: Jagdish Gediya <jvgediya.oss@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
9195244022 |
mm/demotion: move memory demotion related code
This moves memory demotion related code to mm/memory-tiers.c. No functional change in this patch. Link: https://lkml.kernel.org/r/20220818131042.113280-3-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Wei Xu <weixugc@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hesham Almatary <hesham.almatary@huawei.com> Cc: Jagdish Gediya <jvgediya.oss@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
7047b5a40b |
mm: migrate: do not retry 10 times for the subpages of fail-to-migrate THP
If THP is failed to migrate due to -ENOSYS or -ENOMEM case, the THP will be split, and the subpages of fail-to-migrate THP will be tried to migrate again, so we should not account the retry counter in the second loop, since we already accounted 'nr_thp_failed' in the first loop. Moreover we also do not need retry 10 times for -EAGAIN case for the subpages of fail-to-migrate THP in the second loop, since we already regarded the THP as migration failure, and save some migration time (for the worst case, will try 512 * 10 times) according to previous discussion [1]. [1] https://lore.kernel.org/linux-mm/87r13a7n04.fsf@yhuang6-desk2.ccr.corp.intel.com/ Link: https://lkml.kernel.org/r/20220817081408.513338-9-ying.huang@intel.com Tested-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
077309bc1e |
migrate_pages(): fix failure counting for retry
After 10 retries, we will give up and the remaining pages will be counted
as failure in nr_failed and nr_thp_failed. We should count the failure in
nr_failed_pages too. This is done in this patch.
Link: https://lkml.kernel.org/r/20220817081408.513338-8-ying.huang@intel.com
Fixes:
|
|
|
|
e6fa8a79fe |
migrate_pages(): fix failure counting for THP splitting
If THP is failed to be migrated, it may be split and retry. But after
splitting, the head page will be left in "from" list, although THP
migration failure has been counted already. If the head page is failed to
be migrated too, the failure will be counted twice incorrectly. So this
is fixed in this patch via moving the head page of THP after splitting to
"thp_split_pages" too.
Link: https://lkml.kernel.org/r/20220817081408.513338-7-ying.huang@intel.com
Fixes:
|
|
|
|
577be05c89 |
migrate_pages(): fix failure counting for THP on -ENOSYS
If THP or hugetlbfs page migration isn't supported, unmap_and_move() or
unmap_and_move_huge_page() will return -ENOSYS. For THP, splitting will
be tried, but if splitting doesn't succeed, the THP will be left in "from"
list wrongly. If some other pages are retried, the THP migration failure
will counted again. This is fixed via moving the failure THP from "from"
to "ret_pages".
Another issue of the original code is that the unsupported failure
processing isn't consistent between THP and hugetlbfs page. Make them
consistent in this patch to make the code easier to be understood too.
Link: https://lkml.kernel.org/r/20220817081408.513338-6-ying.huang@intel.com
Fixes:
|
|
|
|
5fc30916b5 |
migrate_pages(): fix failure counting for THP subpages retrying
If THP is failed to be migrated for -ENOSYS and -ENOMEM, the THP will be
split into thp_split_pages, and after other pages are migrated, pages in
thp_split_pages will be migrated with no_subpage_counting == true, because
its failure have been counted already. If some pages in thp_split_pages
are retried during migration, we should not count their failure if
no_subpage_counting == true too. This is done this patch to fix the
failure counting for THP subpages retrying.
Link: https://lkml.kernel.org/r/20220817081408.513338-5-ying.huang@intel.com
Fixes:
|
|
|
|
fbed53b477 |
migrate_pages(): fix THP failure counting for -ENOMEM
In unmap_and_move(), if the new THP cannot be allocated, -ENOMEM will be
returned, and migrate_pages() will try to split the THP unless "reason" is
MR_NUMA_MISPLACED (that is, nosplit == true). But when nosplit == true,
the THP migration failure will not be counted.
This is incorrect, so in this patch, the THP migration failure will be
counted for -ENOMEM regardless of nosplit is true or false. The nr_failed
counting isn't fixed because it's not used. Added some comments for it
per Baolin's suggestion.
Link: https://lkml.kernel.org/r/20220817081408.513338-4-ying.huang@intel.com
Fixes:
|
|
|
|
9c62ff005f |
migrate_pages(): remove unnecessary list_safe_reset_next()
Before commit
|
|
|
|
a7504ed14f |
migrate: fix syscall move_pages() return value for failure
Patch series "migrate_pages(): fix several bugs in error path", v3.
During review the code of migrate_pages() and build a test program for
it. Several bugs in error path are identified and fixed in this
series.
Most patches are tested via
- Apply error-inject.patch in Linux kernel
- Compile test-migrate.c (with -lnuma)
- Test with test-migrate.sh
error-inject.patch, test-migrate.c, and test-migrate.sh are as below.
It turns out that error injection is an important tool to fix bugs in
error path.
This patch (of 8):
The return value of move_pages() syscall is incorrect when counting
the remaining pages to be migrated. For example, for the following
test program,
"
#define _GNU_SOURCE
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <numaif.h>
#include <numa.h>
#ifndef MADV_FREE
#define MADV_FREE 8 /* free pages only if memory pressure */
#endif
#define ONE_MB (1024 * 1024)
#define MAP_SIZE (16 * ONE_MB)
#define THP_SIZE (2 * ONE_MB)
#define THP_MASK (THP_SIZE - 1)
#define ERR_EXIT_ON(cond, msg) \
do { \
int __cond_in_macro = (cond); \
if (__cond_in_macro) \
error_exit(__cond_in_macro, (msg)); \
} while (0)
void error_msg(int ret, int nr, int *status, const char *msg)
{
int i;
fprintf(stderr, "Error: %s, ret : %d, error: %s\n",
msg, ret, strerror(errno));
if (!nr)
return;
fprintf(stderr, "status: ");
for (i = 0; i < nr; i++)
fprintf(stderr, "%d ", status[i]);
fprintf(stderr, "\n");
}
void error_exit(int ret, const char *msg)
{
error_msg(ret, 0, NULL, msg);
exit(1);
}
int page_size;
bool do_vmsplice;
bool do_thp;
static int pipe_fds[2];
void *addr;
char *pn;
char *pn1;
void *pages[2];
int status[2];
void prepare()
{
int ret;
struct iovec iov;
if (addr) {
munmap(addr, MAP_SIZE);
close(pipe_fds[0]);
close(pipe_fds[1]);
}
ret = pipe(pipe_fds);
ERR_EXIT_ON(ret, "pipe");
addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
ERR_EXIT_ON(addr == MAP_FAILED, "mmap");
if (do_thp) {
ret = madvise(addr, MAP_SIZE, MADV_HUGEPAGE);
ERR_EXIT_ON(ret, "advise hugepage");
}
pn = (char *)(((unsigned long)addr + THP_SIZE) & ~THP_MASK);
pn1 = pn + THP_SIZE;
pages[0] = pn;
pages[1] = pn1;
*pn = 1;
if (do_vmsplice) {
iov.iov_base = pn;
iov.iov_len = page_size;
ret = vmsplice(pipe_fds[1], &iov, 1, 0);
ERR_EXIT_ON(ret < 0, "vmsplice");
}
status[0] = status[1] = 1024;
}
void test_migrate()
{
int ret;
int nodes[2] = { 1, 1 };
pid_t pid = getpid();
prepare();
ret = move_pages(pid, 1, pages, nodes, status, MPOL_MF_MOVE_ALL);
error_msg(ret, 1, status, "move 1 page");
prepare();
ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
error_msg(ret, 2, status, "move 2 pages, page 1 not mapped");
prepare();
*pn1 = 1;
ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
error_msg(ret, 2, status, "move 2 pages");
prepare();
*pn1 = 1;
nodes[1] = 0;
ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
error_msg(ret, 2, status, "move 2 pages, page 1 to node 0");
}
int main(int argc, char *argv[])
{
numa_run_on_node(0);
page_size = getpagesize();
test_migrate();
fprintf(stderr, "\nMake page 0 cannot be migrated:\n");
do_vmsplice = true;
test_migrate();
fprintf(stderr, "\nTest THP:\n");
do_thp = true;
do_vmsplice = false;
test_migrate();
fprintf(stderr, "\nTHP: make page 0 cannot be migrated:\n");
do_vmsplice = true;
test_migrate();
return 0;
}
"
The output of the current kernel is,
"
Error: move 1 page, ret : 0, error: Success
status: 1
Error: move 2 pages, page 1 not mapped, ret : 0, error: Success
status: 1 -14
Error: move 2 pages, ret : 0, error: Success
status: 1 1
Error: move 2 pages, page 1 to node 0, ret : 0, error: Success
status: 1 0
Make page 0 cannot be migrated:
Error: move 1 page, ret : 0, error: Success
status: 1024
Error: move 2 pages, page 1 not mapped, ret : 1, error: Success
status: 1024 -14
Error: move 2 pages, ret : 0, error: Success
status: 1024 1024
Error: move 2 pages, page 1 to node 0, ret : 1, error: Success
status: 1024 1024
"
While the expected output is,
"
Error: move 1 page, ret : 0, error: Success
status: 1
Error: move 2 pages, page 1 not mapped, ret : 0, error: Success
status: 1 -14
Error: move 2 pages, ret : 0, error: Success
status: 1 1
Error: move 2 pages, page 1 to node 0, ret : 0, error: Success
status: 1 0
Make page 0 cannot be migrated:
Error: move 1 page, ret : 1, error: Success
status: 1024
Error: move 2 pages, page 1 not mapped, ret : 1, error: Success
status: 1024 -14
Error: move 2 pages, ret : 1, error: Success
status: 1024 1024
Error: move 2 pages, page 1 to node 0, ret : 2, error: Success
status: 1024 1024
"
Fix this via correcting the remaining pages counting. With the fix,
the output for the test program as above is expected.
Link: https://lkml.kernel.org/r/20220817081408.513338-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220817081408.513338-2-ying.huang@intel.com
Fixes:
|
|
|
|
2e3468778d |
mm: remember young/dirty bit for page migrations
When page migration happens, we always ignore the young/dirty bit settings in the old pgtable, and marking the page as old in the new page table using either pte_mkold() or pmd_mkold(), and keeping the pte clean. That's fine from functional-wise, but that's not friendly to page reclaim because the moving page can be actively accessed within the procedure. Not to mention hardware setting the young bit can bring quite some overhead on some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit. The same slowdown problem to dirty bits when the memory is first written after page migration happened. Actually we can easily remember the A/D bit configuration and recover the information after the page is migrated. To achieve it, define a new set of bits in the migration swap offset field to cache the A/D bits for old pte. Then when removing/recovering the migration entry, we can recover the A/D bits even if the page changed. One thing to mention is that here we used max_swapfile_size() to detect how many swp offset bits we have, and we'll only enable this feature if we know the swp offset is big enough to store both the PFN value and the A/D bits. Otherwise the A/D bits are dropped like before. Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
33024536ba |
memory tiering: hot page selection with hint page fault latency
Patch series "memory tiering: hot page selection", v4.
To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory nodes need to be identified.
Essentially, the original NUMA balancing implementation selects the mostly
recently accessed (MRU) pages to promote. But this isn't a perfect
algorithm to identify the hot pages. Because the pages with quite low
access frequency may be accessed eventually given the NUMA balancing page
table scanning period could be quite long (e.g. 60 seconds). So in this
patchset, we implement a new hot page identification algorithm based on
the latency between NUMA balancing page table scanning and hint page
fault. Which is a kind of mostly frequently accessed (MFU) algorithm.
In NUMA balancing memory tiering mode, if there are hot pages in slow
memory node and cold pages in fast memory node, we need to promote/demote
hot/cold pages between the fast and cold memory nodes.
A choice is to promote/demote as fast as possible. But the CPU cycles and
memory bandwidth consumed by the high promoting/demoting throughput will
hurt the latency of some workload because of accessing inflating and slow
memory bandwidth contention.
A way to resolve this issue is to restrict the max promoting/demoting
throughput. It will take longer to finish the promoting/demoting. But
the workload latency will be better. This is implemented in this patchset
as the page promotion rate limit mechanism.
The promotion hot threshold is workload and system configuration
dependent. So in this patchset, a method to adjust the hot threshold
automatically is implemented. The basic idea is to control the number of
the candidate promotion pages to match the promotion rate limit.
We used the pmbench memory accessing benchmark tested the patchset on a
2-socket server system with DRAM and PMEM installed. The test results are
as follows,
pmbench score promote rate
(accesses/s) MB/s
------------- ------------
base 146887704.1 725.6
hot selection 165695601.2 544.0
rate limit 162814569.8 165.2
auto adjustment 170495294.0 136.9
From the results above,
With hot page selection patch [1/3], the pmbench score increases about
12.8%, and promote rate (overhead) decreases about 25.0%, compared with
base kernel.
With rate limit patch [2/3], pmbench score decreases about 1.7%, and
promote rate decreases about 69.6%, compared with hot page selection
patch.
With threshold auto adjustment patch [3/3], pmbench score increases about
4.7%, and promote rate decrease about 17.1%, compared with rate limit
patch.
Baolin helped to test the patchset with MySQL on a machine which contains
1 DRAM node (30G) and 1 PMEM node (126G).
sysbench /usr/share/sysbench/oltp_read_write.lua \
......
--tables=200 \
--table-size=1000000 \
--report-interval=10 \
--threads=16 \
--time=120
The tps can be improved about 5%.
This patch (of 3):
To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory node need to be identified. Essentially,
the original NUMA balancing implementation selects the mostly recently
accessed (MRU) pages to promote. But this isn't a perfect algorithm to
identify the hot pages. Because the pages with quite low access frequency
may be accessed eventually given the NUMA balancing page table scanning
period could be quite long (e.g. 60 seconds). The most frequently
accessed (MFU) algorithm is better.
So, in this patch we implemented a better hot page selection algorithm.
Which is based on NUMA balancing page table scanning and hint page fault
as follows,
- When the page tables of the processes are scanned to change PTE/PMD
to be PROT_NONE, the current time is recorded in struct page as scan
time.
- When the page is accessed, hint page fault will occur. The scan
time is gotten from the struct page. And The hint page fault
latency is defined as
hint page fault time - scan time
The shorter the hint page fault latency of a page is, the higher the
probability of their access frequency to be higher. So the hint page
fault latency is a better estimation of the page hot/cold.
It's hard to find some extra space in struct page to hold the scan time.
Fortunately, we can reuse some bits used by the original NUMA balancing.
NUMA balancing uses some bits in struct page to store the page accessing
CPU and PID (referring to page_cpupid_xchg_last()). Which is used by the
multi-stage node selection algorithm to avoid to migrate pages shared
accessed by the NUMA nodes back and forth. But for pages in the slow
memory node, even if they are shared accessed by multiple NUMA nodes, as
long as the pages are hot, they need to be promoted to the fast memory
node. So the accessing CPU and PID information are unnecessary for the
slow memory pages. We can reuse these bits in struct page to record the
scan time. For the fast memory pages, these bits are used as before.
For the hot threshold, the default value is 1 second, which works well in
our performance test. All pages with hint page fault latency < hot
threshold will be considered hot.
It's hard for users to determine the hot threshold. So we don't provide a
kernel ABI to set it, just provide a debugfs interface for advanced users
to experiment. We will continue to work on a hot threshold automatic
adjustment mechanism.
The downside of the above method is that the response time to the workload
hot spot changing may be much longer. For example,
- A previous cold memory area becomes hot
- The hint page fault will be triggered. But the hint page fault
latency isn't shorter than the hot threshold. So the pages will
not be promoted.
- When the memory area is scanned again, maybe after a scan period,
the hint page fault latency measured will be shorter than the hot
threshold and the pages will be promoted.
To mitigate this, if there are enough free space in the fast memory node,
the hot threshold will not be used, all pages will be promoted upon the
hint page fault for fast response.
Thanks Zhong Jiang reported and tested the fix for a bug when disabling
memory tiering mode dynamically.
Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: osalvador <osalvador@suse.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
8315682148 |
mm: migration: fix the FOLL_GET failure on following huge page
Not all huge page APIs support FOLL_GET option, so move_pages() syscall
will fail to get the page node information for some huge pages.
Like x86 on linux 5.19 with 1GB huge page API follow_huge_pud(), it will
return NULL page for FOLL_GET when calling move_pages() syscall with the
NULL 'nodes' parameter, the 'status' parameter has '-2' error in array.
Note: follow_huge_pud() now supports FOLL_GET in linux 6.0.
Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev
But these huge page APIs don't support FOLL_GET:
1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
It will cause WARN_ON_ONCE for FOLL_GET.
3. follow_huge_pgd() in mm/hugetlb.c
This is an temporary solution to mitigate the side effect of the race
condition fix by calling follow_page() with FOLL_GET set for huge pages.
After supporting follow huge page by FOLL_GET is done, this fix can be
reverted safely.
Link: https://lkml.kernel.org/r/20220823135841.934465-2-haiyue.wang@intel.com
Link: https://lkml.kernel.org/r/20220812084921.409142-1-haiyue.wang@intel.com
Fixes:
|
|
|
|
6614a3c316 |
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve latency
and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
=w/UH
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...
|
|
|
|
9d0ddc0cb5 |
fs: Remove aops->migratepage()
With all users converted to migrate_folio(), remove this operation. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> |
|
|
|
b890ec2a2c |
hugetlb: Convert to migrate_folio
This involves converting migrate_huge_page_move_mapping(). We also need a folio variant of hugetlb_set_page_subpool(), but that's for a later patch. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> |
|
|
|
2ec810d596 |
mm/migrate: Add filemap_migrate_folio()
There is nothing iomap-specific about iomap_migratepage(), and it fits a pattern used by several other filesystems, so move it to mm/migrate.c, convert it to be filemap_migrate_folio() and convert the iomap filesystems to use it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> |
|
|
|
541846502f |
mm/migrate: Convert migrate_page() to migrate_folio()
Convert all callers to pass a folio. Most have the folio already available. Switch all users from aops->migratepage to aops->migrate_folio. Also turn the documentation into kerneldoc. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> |
|
|
|
108ca83581 |
mm/migrate: Convert expected_page_refs() to folio_expected_refs()
Now that both callers have a folio, convert this function to take a folio & rename it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> |
|
|
|
67235182a4 |
mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
Use a folio throughout __buffer_migrate_folio(), add kernel-doc for buffer_migrate_folio() and buffer_migrate_folio_norefs(), move their declarations to buffer.h and switch all filesystems that have wired them up. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> |
|
|
|
2be7fa10c0 |
mm/migrate: Convert writeout() to take a folio
Use a folio throughout this function. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> |
|
|
|
8faa8ef5dd |
mm/migrate: Convert fallback_migrate_page() to fallback_migrate_folio()
Use a folio throughout. migrate_page() will be converted to migrate_folio() later. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> |
|
|
|
5490da4f06 |
fs: Add aops->migrate_folio
Provide a folio-based replacement for aops->migratepage. Update the documentation to document migrate_folio instead of migratepage. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> |
|
|
|
68f2736a85 |
mm: Convert all PageMovable users to movable_operations
These drivers are rather uncomfortably hammered into the address_space_operations hole. They aren't filesystems and don't behave like filesystems. They just need their own movable_operations structure, which we can point to directly from page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
3218f8712d |
mm: handling Non-LRU pages returned by vm_normal_pages
With DEVICE_COHERENT, we'll soon have vm_normal_pages() return device-managed anonymous pages that are not LRU pages. Although they behave like normal pages for purposes of mapping in CPU page, and for COW. They do not support LRU lists, NUMA migration or THP. Callers to follow_page() currently don't expect ZONE_DEVICE pages, however, with DEVICE_COHERENT we might now return ZONE_DEVICE. Check for ZONE_DEVICE pages in applicable users of follow_page() as well. Link: https://lkml.kernel.org/r/20220715150521.18165-5-alex.sierra@amd.com Signed-off-by: Alex Sierra <alex.sierra@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> [v2] Reviewed-by: Alistair Popple <apopple@nvidia.com> [v6] Cc: Christoph Hellwig <hch@lst.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
ad1ac596e8 |
mm/migration: fix potential pte_unmap on an not mapped pte
__migration_entry_wait and migration_entry_wait_on_locked assume pte is
always mapped from caller. But this is not the case when it's called from
migration_entry_wait_huge and follow_huge_pmd. Add a hugetlbfs variant
that calls hugetlb_migration_entry_wait(ptep == NULL) to fix this issue.
Link: https://lkml.kernel.org/r/20220530113016.16663-5-linmiaohe@huawei.com
Fixes:
|
|
|
|
7ce82f4c3f |
mm/migration: return errno when isolate_huge_page failed
We might fail to isolate huge page due to e.g. the page is under
migration which cleared HPageMigratable. We should return errno in this
case rather than always return 1 which could confuse the user, i.e. the
caller might think all of the memory is migrated while the hugetlb page is
left behind. We make the prototype of isolate_huge_page consistent with
isolate_lru_page as suggested by Huang Ying and rename isolate_huge_page
to isolate_hugetlb as suggested by Muchun to improve the readability.
Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
Fixes:
|
|
|
|
160088b3b6 |
mm/migration: remove unneeded lock page and PageMovable check
When non-lru movable page was freed from under us, __ClearPageMovable must have been done. So we can remove unneeded lock page and PageMovable check here. Also free_pages_prepare() will clear PG_isolated for us, so we can further remove ClearPageIsolated as suggested by David. Link: https://lkml.kernel.org/r/20220530113016.16663-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Howells <dhowells@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: kernel test robot <lkp@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
b653db7735 |
mm: Clear page->private when splitting or migrating a page
In our efforts to remove uses of PG_private, we have found folios with
the private flag clear and folio->private not-NULL. That is the root
cause behind
|
|
|
|
98931dd95f |
Yang Shi has improved the behaviour of khugepaged collapsing of readonly
file-backed transparent hugepages.
Johannes Weiner has arranged for zswap memory use to be tracked and
managed on a per-cgroup basis.
Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime
enablement of the recent huge page vmemmap optimization feature.
Baolin Wang contributes a series to fix some issues around hugetlb
pagetable invalidation.
Zhenwei Pi has fixed some interactions between hwpoisoned pages and
virtualization.
Tong Tiangen has enabled the use of the presently x86-only
page_table_check debugging feature on arm64 and riscv.
David Vernet has done some fixup work on the memcg selftests.
Peter Xu has taught userfaultfd to handle write protection faults against
shmem- and hugetlbfs-backed files.
More DAMON development from SeongJae Park - adding online tuning of the
feature and support for monitoring of fixed virtual address ranges. Also
easier discovery of which monitoring operations are available.
Nadav Amit has done some optimization of TLB flushing during mprotect().
Neil Brown continues to labor away at improving our swap-over-NFS support.
David Hildenbrand has some fixes to anon page COWing versus
get_user_pages().
Peng Liu fixed some errors in the core hugetlb code.
Joao Martins has reduced the amount of memory consumed by device-dax's
compound devmaps.
Some cleanups of the arch-specific pagemap code from Anshuman Khandual.
Muchun Song has found and fixed some errors in the TLB flushing of
transparent hugepages.
Roman Gushchin has done more work on the memcg selftests.
And, of course, many smaller fixes and cleanups. Notably, the customary
million cleanup serieses from Miaohe Lin.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYo52xQAKCRDdBJ7gKXxA
jtJFAQD238KoeI9z5SkPMaeBRYSRQmNll85mxs25KapcEgWgGQD9FAb7DJkqsIVk
PzE+d9hEfirUGdL6cujatwJ6ejYR8Q8=
=nFe6
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Almost all of MM here. A few things are still getting finished off,
reviewed, etc.
- Yang Shi has improved the behaviour of khugepaged collapsing of
readonly file-backed transparent hugepages.
- Johannes Weiner has arranged for zswap memory use to be tracked and
managed on a per-cgroup basis.
- Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for
runtime enablement of the recent huge page vmemmap optimization
feature.
- Baolin Wang contributes a series to fix some issues around hugetlb
pagetable invalidation.
- Zhenwei Pi has fixed some interactions between hwpoisoned pages and
virtualization.
- Tong Tiangen has enabled the use of the presently x86-only
page_table_check debugging feature on arm64 and riscv.
- David Vernet has done some fixup work on the memcg selftests.
- Peter Xu has taught userfaultfd to handle write protection faults
against shmem- and hugetlbfs-backed files.
- More DAMON development from SeongJae Park - adding online tuning of
the feature and support for monitoring of fixed virtual address
ranges. Also easier discovery of which monitoring operations are
available.
- Nadav Amit has done some optimization of TLB flushing during
mprotect().
- Neil Brown continues to labor away at improving our swap-over-NFS
support.
- David Hildenbrand has some fixes to anon page COWing versus
get_user_pages().
- Peng Liu fixed some errors in the core hugetlb code.
- Joao Martins has reduced the amount of memory consumed by
device-dax's compound devmaps.
- Some cleanups of the arch-specific pagemap code from Anshuman
Khandual.
- Muchun Song has found and fixed some errors in the TLB flushing of
transparent hugepages.
- Roman Gushchin has done more work on the memcg selftests.
... and, of course, many smaller fixes and cleanups. Notably, the
customary million cleanup serieses from Miaohe Lin"
* tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits)
mm: kfence: use PAGE_ALIGNED helper
selftests: vm: add the "settings" file with timeout variable
selftests: vm: add "test_hmm.sh" to TEST_FILES
selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests
selftests: vm: add migration to the .gitignore
selftests/vm/pkeys: fix typo in comment
ksm: fix typo in comment
selftests: vm: add process_mrelease tests
Revert "mm/vmscan: never demote for memcg reclaim"
mm/kfence: print disabling or re-enabling message
include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace"
include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion"
mm: fix a potential infinite loop in start_isolate_page_range()
MAINTAINERS: add Muchun as co-maintainer for HugeTLB
zram: fix Kconfig dependency warning
mm/shmem: fix shmem folio swapoff hang
cgroup: fix an error handling path in alloc_pagecache_max_30M()
mm: damon: use HPAGE_PMD_SIZE
tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
nodemask.h: fix compilation error with GCC12
...
|
|
|
|
e7e3ffeb27 |
mm/migrate: convert move_to_new_page() into move_to_new_folio()
Pass in the folios that we already have in each caller. Saves a lot of calls to compound_head(). Link: https://lkml.kernel.org/r/20220504182857.4013401-27-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
717aeab429 |
mm: convert sysfs input to bool using kstrtobool()
Sysfs input conversion to corrosponding bool value e.g. "false" or "0" to false, "true" or "1" to true are currently handled through strncmp at multiple places. Use kstrtobool() to convert sysfs input to bool value. [akpm@linux-foundation.org: propagate kstrtobool() return value, per Andy] Link: https://lkml.kernel.org/r/20220426180203.70782-2-jvgediya@linux.ibm.com Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Fitzgerald <rf@opensource.cirrus.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
68189fef88 |
fs: Change try_to_free_buffers() to take a folio
All but two of the callers already have a folio; pass a folio into try_to_free_buffers(). This removes the last user of cancel_dirty_page() so remove that wrapper function too. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> |
|
|
|
6c287605fd |
mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
exclusive, and use that information to make GUP pins reliable and stay
consistent with the page mapped into the page table even if the page table
entry gets write-protected.
With that information at hand, we can extend our COW logic to always reuse
anonymous pages that are exclusive. For anonymous pages that might be
shared, the existing logic applies.
As already documented, PG_anon_exclusive is usually only expressive in
combination with a page table entry. Especially PTE vs. PMD-mapped
anonymous pages require more thought, some examples: due to mremap() we
can easily have a single compound page PTE-mapped into multiple page
tables exclusively in a single process -- multiple page table locks apply.
Further, due to MADV_WIPEONFORK we might not necessarily write-protect
all PTEs, and only some subpages might be pinned. Long story short: once
PTE-mapped, we have to track information about exclusivity per sub-page,
but until then, we can just track it for the compound page in the head
page and not having to update a whole bunch of subpages all of the time
for a simple PMD mapping of a THP.
For simplicity, this commit mostly talks about "anonymous pages", while
it's for THP actually "the part of an anonymous folio referenced via a
page table entry".
To not spill PG_anon_exclusive code all over the mm code-base, we let the
anon rmap code to handle all PG_anon_exclusive logic it can easily handle.
If a writable, present page table entry points at an anonymous (sub)page,
that (sub)page must be PG_anon_exclusive. If GUP wants to take a reliably
pin (FOLL_PIN) on an anonymous page references via a present page table
entry, it must only pin if PG_anon_exclusive is set for the mapped
(sub)page.
This commit doesn't adjust GUP, so this is only implicitly handled for
FOLL_WRITE, follow-up commits will teach GUP to also respect it for
FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
reliable.
Whenever an anonymous page is to be shared (fork(), KSM), or when
temporarily unmapping an anonymous page (swap, migration), the relevant
PG_anon_exclusive bit has to be cleared to mark the anonymous page
possibly shared. Clearing will fail if there are GUP pins on the page:
* For fork(), this means having to copy the page and not being able to
share it. fork() protects against concurrent GUP using the PT lock and
the src_mm->write_protect_seq.
* For KSM, this means sharing will fail. For swap this means, unmapping
will fail, For migration this means, migration will fail early. All
three cases protect against concurrent GUP using the PT lock and a
proper clear/invalidate+flush of the relevant page table entry.
This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
pinned page gets mapped R/O and the successive write fault ends up
replacing the page instead of reusing it. It improves the situation for
O_DIRECT/vmsplice/... that still use FOLL_GET instead of FOLL_PIN, if
fork() is *not* involved, however swapout and fork() are still
problematic. Properly using FOLL_PIN instead of FOLL_GET for these GUP
users will fix the issue for them.
I. Details about basic handling
I.1. Fresh anonymous pages
page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
given page exclusive via __page_set_anon_rmap(exclusive=1). As that is
the mechanism fresh anonymous pages come into life (besides migration code
where we copy the page->mapping), all fresh anonymous pages will start out
as exclusive.
I.2. COW reuse handling of anonymous pages
When a COW handler stumbles over a (sub)page that's marked exclusive, it
simply reuses it. Otherwise, the handler tries harder under page lock to
detect if the (sub)page is exclusive and can be reused. If exclusive,
page_move_anon_rmap() will mark the given (sub)page exclusive.
Note that hugetlb code does not yet check for PageAnonExclusive(), as it
still uses the old COW logic that is prone to the COW security issue
because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
pages are a scarce resource.
I.3. Migration handling
try_to_migrate() has to try marking an exclusive anonymous page shared via
page_try_share_anon_rmap(). If it fails because there are GUP pins on the
page, unmap fails. migrate_vma_collect_pmd() and
__split_huge_pmd_locked() are handled similarly.
Writable migration entries implicitly point at shared anonymous pages.
For readable migration entries that information is stored via a new
"readable-exclusive" migration entry, specific to anonymous pages.
When restoring a migration entry in remove_migration_pte(), information
about exlusivity is detected via the migration entry type, and
RMAP_EXCLUSIVE is set accordingly for
page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.
I.4. Swapout handling
try_to_unmap() has to try marking the mapped page possibly shared via
page_try_share_anon_rmap(). If it fails because there are GUP pins on the
page, unmap fails. For now, information about exclusivity is lost. In
the future, we might want to remember that information in the swap entry
in some cases, however, it requires more thought, care, and a way to store
that information in swap entries.
I.5. Swapin handling
do_swap_page() will never stumble over exclusive anonymous pages in the
swap cache, as try_to_migrate() prohibits that. do_swap_page() always has
to detect manually if an anonymous page is exclusive and has to set
RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
I.6. THP handling
__split_huge_pmd_locked() has to move the information about exclusivity
from the PMD to the PTEs.
a) In case we have a readable-exclusive PMD migration entry, simply
insert readable-exclusive PTE migration entries.
b) In case we have a present PMD entry and we don't want to freeze
("convert to migration entries"), simply forward PG_anon_exclusive to
all sub-pages, no need to temporarily clear the bit.
c) In case we have a present PMD entry and want to freeze, handle it
similar to try_to_migrate(): try marking the page shared first. In
case we fail, we ignore the "freeze" instruction and simply split
ordinarily. try_to_migrate() will properly fail because the THP is
still mapped via PTEs.
When splitting a compound anonymous folio (THP), the information about
exclusivity is implicitly handled via the migration entries: no need to
replicate PG_anon_exclusive manually.
I.7. fork() handling fork() handling is relatively easy, because
PG_anon_exclusive is only expressive for some page table entry types.
a) Present anonymous pages
page_try_dup_anon_rmap() will mark the given subpage shared -- which will
fail if the page is pinned. If it failed, we have to copy (or PTE-map a
PMD to handle it on the PTE level).
Note that device exclusive entries are just a pointer at a PageAnon()
page. fork() will first convert a device exclusive entry to a present
page table and handle it just like present anonymous pages.
b) Device private entry
Device private entries point at PageAnon() pages that cannot be mapped
directly and, therefore, cannot get pinned.
page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
fail because they cannot get pinned.
c) HW poison entries
PG_anon_exclusive will remain untouched and is stale -- the page table
entry is just a placeholder after all.
d) Migration entries
Writable and readable-exclusive entries are converted to readable entries:
possibly shared.
I.8. mprotect() handling
mprotect() only has to properly handle the new readable-exclusive
migration entry:
When write-protecting a migration entry that points at an anonymous page,
remember the information about exclusivity via the "readable-exclusive"
migration entry type.
II. Migration and GUP-fast
Whenever replacing a present page table entry that maps an exclusive
anonymous page by a migration entry, we have to mark the page possibly
shared and synchronize against GUP-fast by a proper clear/invalidate+flush
to make the following scenario impossible:
1. try_to_migrate() places a migration entry after checking for GUP pins
and marks the page possibly shared.
2. GUP-fast pins the page due to lack of synchronization
3. fork() converts the "writable/readable-exclusive" migration entry into a
readable migration entry
4. Migration fails due to the GUP pin (failing to freeze the refcount)
5. Migration entries are restored. PG_anon_exclusive is lost
-> We have a pinned page that is not marked exclusive anymore.
Note that we move information about exclusivity from the page to the
migration entry as it otherwise highly overcomplicates fork() and
PTE-mapping a THP.
III. Swapout and GUP-fast
Whenever replacing a present page table entry that maps an exclusive
anonymous page by a swap entry, we have to mark the page possibly shared
and synchronize against GUP-fast by a proper clear/invalidate+flush to
make the following scenario impossible:
1. try_to_unmap() places a swap entry after checking for GUP pins and
clears exclusivity information on the page.
2. GUP-fast pins the page due to lack of synchronization.
-> We have a pinned page that is not marked exclusive anymore.
If we'd ever store information about exclusivity in the swap entry,
similar to migration handling, the same considerations as in II would
apply. This is future work.
Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
28c5209dfd |
mm/rmap: pass rmap flags to hugepage_add_anon_rmap()
Let's prepare for passing RMAP_EXCLUSIVE, similarly as we do for page_add_anon_rmap() now. RMAP_COMPOUND is implicit for hugetlb pages and ignored. Link: https://lkml.kernel.org/r/20220428083441.37290-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Rientjes <rientjes@google.com> Cc: Don Dutile <ddutile@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Liang Zhang <zhangliang5@huawei.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Nadav Amit <namit@vmware.com> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
f1e2db12e4 |
mm/rmap: remove do_page_add_anon_rmap()
... and instead convert page_add_anon_rmap() to accept flags. Passing flags instead of bools is usually nicer either way, and we want to more often also pass RMAP_EXCLUSIVE in follow up patches when detecting that an anonymous page is exclusive: for example, when restoring an anonymous page from a writable migration entry. This is a preparation for marking an anonymous page inside page_add_anon_rmap() as exclusive when RMAP_EXCLUSIVE is passed. Link: https://lkml.kernel.org/r/20220428083441.37290-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Rientjes <rientjes@google.com> Cc: Don Dutile <ddutile@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Liang Zhang <zhangliang5@huawei.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Nadav Amit <namit@vmware.com> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
fb3d824d1a |
mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()
... and move the special check for pinned pages into page_try_dup_anon_rmap() to prepare for tracking exclusive anonymous pages via a new pageflag, clearing it only after making sure that there are no GUP pins on the anonymous page. We really only care about pins on anonymous pages, because they are prone to getting replaced in the COW handler once mapped R/O. For !anon pages in cow-mappings (!VM_SHARED && VM_MAYWRITE) we shouldn't really care about that, at least not that I could come up with an example. Let's drop the is_cow_mapping() check from page_needs_cow_for_dma(), as we know we're dealing with anonymous pages. Also, drop the handling of pinned pages from copy_huge_pud() and add a comment if ever supporting anonymous pages on the PUD level. This is a preparation for tracking exclusivity of anonymous pages in the rmap code, and disallowing marking a page shared (-> failing to duplicate) if there are GUP pins on a page. Link: https://lkml.kernel.org/r/20220428083441.37290-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Rientjes <rientjes@google.com> Cc: Don Dutile <ddutile@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Liang Zhang <zhangliang5@huawei.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Nadav Amit <namit@vmware.com> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
7d6e2d9638 |
mm: untangle config dependencies for demote-on-reclaim
At the time demote-on-reclaim was introduced, it was tied to CONFIG_HOTPLUG_CPU + CONFIG_MIGRATE, but that is not really accurate. The only two things we need to depend on are CONFIG_NUMA + CONFIG_MIGRATE, so clean this up. Furthermore, we only register the hotplug memory notifier when the system has CONFIG_MEMORY_HOTPLUG. Link: https://lkml.kernel.org/r/20220322224016.4574-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Abhishek Goel <huntbag@linux.vnet.ibm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
9c42fe4e30 |
mm: migrate: simplify the refcount validation when migrating hugetlb mapping
There is no need to validate the hugetlb page's refcount before trying to freeze the hugetlb page's expected refcount, instead we can just rely on the page_ref_freeze() to simplify the validation. Moreover we are always under the page lock when migrating the hugetlb page mapping, which means nowhere else can remove it from the page cache, so we can remove the xas_load() validation under the i_pages lock. Link: https://lkml.kernel.org/r/eb2fbbeaef2b1714097b9dec457426d682ee0635.1649676424.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
4cd614841c |
mm/migration: fix possible do_pages_stat_array racing with memory offline
When follow_page peeks a page, the page could be migrated and then be offlined while it's still being used by the do_pages_stat_array(). Use FOLL_GET to hold the page refcnt to fix this potential race. Link: https://lkml.kernel.org/r/20220318111709.60311-12-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
3f26c88bd6 |
mm/migration: fix potential invalid node access for reclaim-based migration
If we failed to setup hotplug state callbacks for mm/demotion:online in
some corner cases, node_demotion will be left uninitialized. Invalid node
might be returned from the next_demotion_node() when doing reclaim-based
migration. Use kcalloc to allocate node_demotion to fix the issue.
Link: https://lkml.kernel.org/r/20220318111709.60311-11-linmiaohe@huawei.com
Fixes:
|
|
|
|
69a041ff50 |
mm/migration: fix potential page refcounts leak in migrate_pages
In -ENOMEM case, there might be some subpages of fail-to-migrate THPs left
in thp_split_pages list. We should move them back to migration list so
that they could be put back to the right list by the caller otherwise the
page refcnt will be leaked here. Also adjust nr_failed and nr_thp_failed
accordingly to make vm events account more accurate.
Link: https://lkml.kernel.org/r/20220318111709.60311-10-linmiaohe@huawei.com
Fixes:
|
|
|
|
f430893b01 |
mm/migration: remove some duplicated codes in migrate_pages
Remove the duplicated codes in migrate_pages to simplify the code. Minor readability improvement. No functional change intended. Link: https://lkml.kernel.org/r/20220318111709.60311-9-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
91925ab8cc |
mm/migration: avoid unneeded nodemask_t initialization
Avoid unneeded next_pass and this_pass initialization as they're always set before using to save possible cpu cycles when there are plenty of nodes in the system. Link: https://lkml.kernel.org/r/20220318111709.60311-8-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
3eefb826c5 |
mm/migration: use helper macro min in do_pages_stat
We could use helper macro min to help set the chunk_nr to simplify the code. Link: https://lkml.kernel.org/r/20220318111709.60311-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
cb1c37b1c6 |
mm/migration: use helper function vma_lookup() in add_page_for_migration
We could use helper function vma_lookup() to lookup the needed vma to simplify the code. Link: https://lkml.kernel.org/r/20220318111709.60311-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
b75454e101 |
mm/migration: remove unneeded local variable page_lru
We can use page_is_file_lru() directly to help account the isolated pages to simplify the code a bit. Link: https://lkml.kernel.org/r/20220318111709.60311-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
5202978b48 |
mm/migration: remove unneeded local variable mapping_locked
Patch series "A few cleanup and fixup patches for migration", v2. This series contains a few patches to remove unneeded variables, jump label and use helper to simplify the code. Also we fix some bugs such as page refcounts leak , invalid node access and so on. More details can be found in the respective changelogs. This patch (of 11): When mapping_locked is true, TTU_RMAP_LOCKED is always set to ttu. We can check ttu instead so mapping_locked can be removed. And ttu is either 0 or TTU_RMAP_LOCKED now. Change '|=' to '=' to reflect this. Link: https://lkml.kernel.org/r/20220318111709.60311-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220318111709.60311-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alistair Popple <apopple@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
bc53008eea |
mm/vmscan: make sure wakeup_kswapd with managed zone
wakeup_kswapd() only wake up kswapd when the zone is managed. For two callers of wakeup_kswapd(), they are node perspective. * wake_all_kswapds * numamigrate_isolate_page If we picked up a !managed zone, this is not we expected. This patch makes sure we pick up a managed zone for wakeup_kswapd(). And it also use managed_zone in migrate_balanced_pgdat() to get the proper zone. [richard.weiyang@gmail.com: adjust the usage in migrate_balanced_pgdat()] Link: https://lkml.kernel.org/r/20220329010901.1654-2-richard.weiyang@gmail.com Link: https://lkml.kernel.org/r/20220327024101.10378-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
c185e494ae |
mm/migrate: Use a folio in migrate_misplaced_transhuge_page()
Unify alloc_misplaced_dst_page() and alloc_misplaced_dst_page_thp(). Removes an assumption that compound pages are HPAGE_PMD_ORDER. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> |
|
|
|
ffe06786b5 |
mm/migrate: Use a folio in alloc_migration_target()
This removes an assumption that a large folio is HPAGE_PMD_ORDER as well as letting us remove the call to prep_transhuge_page() and a few hidden calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> |
|
|
|
adb11e78c5 |
mm/munlock: protect the per-CPU pagevec by a local_lock_t
The access to mlock_pvec is protected by disabling preemption via get_cpu_var() or implicit by having preemption disabled by the caller (in mlock_page_drain() case). This breaks on PREEMPT_RT since folio_lruvec_lock_irq() acquires a sleeping lock in this section. Create struct mlock_pvec which consits of the local_lock_t and the pagevec. Acquire the local_lock() before accessing the per-CPU pagevec. Replace mlock_page_drain() with a _local() version which is invoked on the local CPU and acquires the local_lock_t and a _remote() version which uses the pagevec from a remote CPU which offline. Link: https://lkml.kernel.org/r/YjizWi9IY0mpvIfb@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
4cc79b3303 |
mm/migration: add trace events for base page and HugeTLB migrations
This adds two trace events for base page and HugeTLB page migrations. These events, closely follow the implementation details like setting and removing of PTE migration entries, which are essential operations for migration. The new CREATE_TRACE_POINTS in <mm/rmap.c> covers both <events/migration.h> and <events/tlb.h> based trace events. Hence drop redundant CREATE_TRACE_POINTS from other places which could have otherwise conflicted during build. Link: https://lkml.kernel.org/r/1643368182-9588-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
9030fb0bb9 |
Folio changes for 5.18
- Rewrite how munlock works to massively reduce the contention
on i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4ucgACgkQDpNsjXcp
gj69Wgf6AwqwmO5Tmy+fLScDPqWxmXJofbocae1kyoGHf7Ui91OK4U2j6IpvAr+g
P/vLIK+JAAcTQcrSCjymuEkf4HkGZOR03QQn7maPIEe4eLrZRQDEsmHC1L9gpeJp
s/GMvDWiGE0Tnxu0EOzfVi/yT+qjIl/S8VvqtCoJv1HdzxitZ7+1RDuqImaMC5MM
Qi3uHag78vLmCltLXpIOdpgZhdZexCdL2Y/1npf+b6FVkAJRRNUnA0gRbS7YpoVp
CbxEJcmAl9cpJLuj5i5kIfS9trr+/QcvbUlzRxh4ggC58iqnmF2V09l2MJ7YU3XL
v1O/Elq4lRhXninZFQEm9zjrri7LDQ==
=n9Ad
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
- Rewrite how munlock works to massively reduce the contention on
i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph
Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew
Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew
Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
* tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits)
mm/damon: minor cleanup for damon_pa_young
selftests/vm/transhuge-stress: Support file-backed PMD folios
mm/filemap: Support VM_HUGEPAGE for file mappings
mm/readahead: Switch to page_cache_ra_order
mm/readahead: Align file mappings for non-DAX
mm/readahead: Add large folio readahead
mm: Support arbitrary THP sizes
mm: Make large folios depend on THP
mm: Fix READ_ONLY_THP warning
mm/filemap: Allow large folios to be added to the page cache
mm: Turn can_split_huge_page() into can_split_folio()
mm/vmscan: Convert pageout() to take a folio
mm/vmscan: Turn page_check_references() into folio_check_references()
mm/vmscan: Account large folios correctly
mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios
mm/vmscan: Free non-shmem folios without splitting them
mm/rmap: Constify the rmap_walk_control argument
mm/rmap: Convert rmap_walk() to take a folio
mm: Turn page_anon_vma() into folio_anon_vma()
mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
...
|
|
|
|
734c15700c |
mm: only re-generate demotion targets when a numa node changes its N_CPU state
Abhishek reported that after patch [1], hotplug operations are taking
roughly double the expected time. [2]
The reason behind is that the CPU callbacks that
migrate_on_reclaim_init() sets always call set_migration_target_nodes()
whenever a CPU is brought up/down.
But we only care about numa nodes going from having cpus to become
cpuless, and vice versa, as that influences the demotion_target order.
We do already have two CPU callbacks (vmstat_cpu_online() and
vmstat_cpu_dead()) that check exactly that, so get rid of the CPU
callbacks in migrate_on_reclaim_init() and only call
set_migration_target_nodes() from vmstat_cpu_{dead,online}() whenever a
numa node change its N_CPU state.
[1] https://lore.kernel.org/linux-mm/20210721063926.3024591-2-ying.huang@intel.com/
[2] https://lore.kernel.org/linux-mm/eb438ddd-2919-73d4-bd9f-b7eecdd9577a@linux.vnet.ibm.com/
[osalvador@suse.de: add feedback from Huang Ying]
Link: https://lkml.kernel.org/r/20220314150945.12694-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20220310120749.23077-1-osalvador@suse.de
Fixes:
|
|
|
|
c574bbe917 |
NUMA balancing: optimize page placement for memory tiering system
With the advent of various new memory types, some machines will have multiple types of memory, e.g. DRAM and PMEM (persistent memory). The memory subsystem of these machines can be called memory tiering system, because the performance of the different types of memory are usually different. In such system, because of the memory accessing pattern changing etc, some pages in the slow memory may become hot globally. So in this patch, the NUMA balancing mechanism is enhanced to optimize the page placement among the different memory types according to hot/cold dynamically. In a typical memory tiering system, there are CPUs, fast memory and slow memory in each physical NUMA node. The CPUs and the fast memory will be put in one logical node (called fast memory node), while the slow memory will be put in another (faked) logical node (called slow memory node). That is, the fast memory is regarded as local while the slow memory is regarded as remote. So it's possible for the recently accessed pages in the slow memory node to be promoted to the fast memory node via the existing NUMA balancing mechanism. The original NUMA balancing mechanism will stop to migrate pages if the free memory of the target node becomes below the high watermark. This is a reasonable policy if there's only one memory type. But this makes the original NUMA balancing mechanism almost do not work to optimize page placement among different memory types. Details are as follows. It's the common cases that the working-set size of the workload is larger than the size of the fast memory nodes. Otherwise, it's unnecessary to use the slow memory at all. So, there are almost always no enough free pages in the fast memory nodes, so that the globally hot pages in the slow memory node cannot be promoted to the fast memory node. To solve the issue, we have 2 choices as follows, a. Ignore the free pages watermark checking when promoting hot pages from the slow memory node to the fast memory node. This will create some memory pressure in the fast memory node, thus trigger the memory reclaiming. So that, the cold pages in the fast memory node will be demoted to the slow memory node. b. Define a new watermark called wmark_promo which is higher than wmark_high, and have kswapd reclaiming pages until free pages reach such watermark. The scenario is as follows: when we want to promote hot-pages from a slow memory to a fast memory, but fast memory's free pages would go lower than high watermark with such promotion, we wake up kswapd with wmark_promo watermark in order to demote cold pages and free us up some space. So, next time we want to promote hot-pages we might have a chance of doing so. The choice "a" may create high memory pressure in the fast memory node. If the memory pressure of the workload is high, the memory pressure may become so high that the memory allocation latency of the workload is influenced, e.g. the direct reclaiming may be triggered. The choice "b" works much better at this aspect. If the memory pressure of the workload is high, the hot pages promotion will stop earlier because its allocation watermark is higher than that of the normal memory allocation. So in this patch, choice "b" is implemented. A new zone watermark (WMARK_PROMO) is added. Which is larger than the high watermark and can be controlled via watermark_scale_factor. In addition to the original page placement optimization among sockets, the NUMA balancing mechanism is extended to be used to optimize page placement according to hot/cold among different memory types. So the sysctl user space interface (numa_balancing) is extended in a backward compatible way as follow, so that the users can enable/disable these functionality individually. The sysctl is converted from a Boolean value to a bits field. The definition of the flags is, - 0: NUMA_BALANCING_DISABLED - 1: NUMA_BALANCING_NORMAL - 2: NUMA_BALANCING_MEMORY_TIERING We have tested the patch with the pmbench memory accessing benchmark with the 80:20 read/write ratio and the Gauss access address distribution on a 2 socket Intel server with Optane DC Persistent Memory Model. The test results shows that the pmbench score can improve up to 95.9%. Thanks Andrew Morton to help fix the document format error. Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Wei Xu <weixugc@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Feng Tang <feng.tang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
e39bb6be9f |
NUMA Balancing: add page promotion counter
Patch series "NUMA balancing: optimize memory placement for memory tiering system", v13 With the advent of various new memory types, some machines will have multiple types of memory, e.g. DRAM and PMEM (persistent memory). The memory subsystem of these machines can be called memory tiering system, because the performance of the different types of memory are different. After commit |
|
|
|
356ea38656 |
mm/migrate: fix race between lock page and clear PG_Isolated
When memory is tight, system may start to compact memory for large
continuous memory demands. If one process tries to lock a memory page
that is being locked and isolated for compaction, it may wait a long time
or even forever. This is because compaction will perform non-atomic
PG_Isolated clear while holding page lock, this may overwrite PG_waiters
set by the process that can't obtain the page lock and add itself to the
waiting queue to wait for the lock to be unlocked.
CPU1 CPU2
lock_page(page); (successful)
lock_page(); (failed)
__ClearPageIsolated(page); SetPageWaiters(page) (may be overwritten)
unlock_page(page);
The solution is to not perform non-atomic operation on page flags while
holding page lock.
Link: https://lkml.kernel.org/r/20220315030515.20263-1-andrew.yang@mediatek.com
Signed-off-by: andrew.yang <andrew.yang@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Vlastimil Babka" <vbabka@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Cc: "William Kucharski" <william.kucharski@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Nicholas Tang <nicholas.tang@mediatek.com>
Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
fc89213a63 |
mm,migrate: fix establishing demotion target
In commit |
|
|
|
b698f0a177 |
mm/fs: delete PF_SWAPWRITE
PF_SWAPWRITE has been redundant since v3.2 commit
|
|
|
|
87d2762e22 |
mm: remove unneeded local variable follflags
We can pass FOLL_GET | FOLL_DUMP to follow_page directly to simplify the code a bit in add_page_for_migration and split_huge_pages_pid. Link: https://lkml.kernel.org/r/20220311072002.35575-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
3150be8fa8 |
mm: replace multiple dcache flush with flush_dcache_folio()
Simplify the code by using flush_dcache_folio(). Link: https://lkml.kernel.org/r/20220210123058.79206-8-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lars Persson <lars.persson@axis.com> Cc: Peter Xu <peterx@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
2771739a71 |
mm: fix missing cache flush for all tail pages of compound page
The D-cache maintenance inside move_to_new_page() only consider one
page, there is still D-cache maintenance issue for tail pages of
compound page (e.g. THP or HugeTLB).
THP migration is only enabled on x86_64, ARM64 and powerpc, while
powerpc and arm64 need to maintain the consistency between I-Cache and
D-Cache, which depends on flush_dcache_page() to maintain the
consistency between I-Cache and D-Cache.
But there is no issues on arm64 and powerpc since they already considers
the compound page cache flushing in their icache flush function.
HugeTLB migration is enabled on arm, arm64, mips, parisc, powerpc,
riscv, s390 and sh, while arm has handled the compound page cache flush
in flush_dcache_page(), but most others do not.
In theory, the issue exists on many architectures. Fix this by not
using flush_dcache_folio() since it is not backportable.
Link: https://lkml.kernel.org/r/20220210123058.79206-3-songmuchun@bytedance.com
Fixes:
|
|
|
|
65462462ff |
mm/gup: follow_pfn_pte(): -EEXIST cleanup
Remove a quirky special case from follow_pfn_pte(), and adjust its callers to match. Caller changes include: __get_user_pages(): Regardless of any FOLL_* flags, get_user_pages() and its variants should handle PFN-only entries by stopping early, if the caller expected **pages to be filled in. This makes for a more reliable API, as compared to the previous approach of skipping over such entries (and thus leaving them silently unwritten). move_pages(): squash the -EEXIST error return from follow_page() into -EFAULT, because -EFAULT is listed in the man page, whereas -EEXIST is not. Link: https://lkml.kernel.org/r/20220204020010.68930-3-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Peter Xu <peterx@redhat.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
2f031c6f04 |
mm/rmap: Convert rmap_walk() to take a folio
This ripples all the way through to every calling and called function from rmap. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
4eecb8b916 |
mm/migrate: Convert remove_migration_ptes() to folios
Convert the implementation and all callers. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
4b8554c527 |
mm/rmap: Convert try_to_migrate() to folios
Convert the callers to pass a folio and the try_to_migrate_one() worker to use a folio throughout. Fixes an assumption that a folio must be <= PMD size. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
2aff7a4755 |
mm: Convert page_vma_mapped_walk to work on PFNs
page_mapped_in_vma() really just wants to walk one page, but as the code stands, if passed the head page of a compound page, it will walk every page in the compound page. Extract pfn/nr_pages/pgoff from the struct page early, so they can be overridden by page_mapped_in_vma(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
eed05e54d2 |
mm: Add DEFINE_PAGE_VMA_WALK and DEFINE_FOLIO_VMA_WALK
Instead of declaring a struct page_vma_mapped_walk directly, use these helpers to allow us to transition to a PFN approach in the following patches. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
76cbbead25 |
mm: move the migrate_vma_* device migration code into its own file
Split the code used to migrate to and from ZONE_DEVICE memory from migrate.c into a new file. Link: https://lkml.kernel.org/r/20220210072828.2930359-14-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
aaf7d70cc5 |
mm: refactor the ZONE_DEVICE handling in migrate_vma_pages
Make the flow a little more clear and prepare for adding a new ZONE_DEVICE memory type. Link: https://lkml.kernel.org/r/20220210072828.2930359-13-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
1776c0d102 |
mm: refactor the ZONE_DEVICE handling in migrate_vma_insert_page
Make the flow a little more clear and prepare for adding a new ZONE_DEVICE memory type. Link: https://lkml.kernel.org/r/20220210072828.2930359-12-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
27674ef6c7 |
mm: remove the extra ZONE_DEVICE struct page refcount
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE pages. Note that this excludes the special idle page wakeup for fsdax pages, which still happens at refcount 1. This is a separate issue and will be sorted out later. Given that only fsdax pages require the notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig symbol can go away and be replaced with a FS_DAX check for this hook in the put_page fastpath. Based on an earlier patch from Ralph Campbell <rcampbell@nvidia.com>. Link: https://lkml.kernel.org/r/20220210072828.2930359-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
b74355078b |
mm/munlock: page migration needs mlock pagevec drained
Page migration of a VM_LOCKED page tends to fail, because when the old page is unmapped, it is put on the mlock pagevec with raised refcount, which then fails the freeze. At first I thought this would be fixed by a local mlock_page_drain() at the upper rmap_walk() level - which would have nicely batched all the munlocks of that page; but tests show that the task can too easily move to another cpu, leaving pagevec residue behind which fails the migration. So try_to_migrate_one() drain the local pagevec after page_remove_rmap() from a VM_LOCKED vma; and do the same in try_to_unmap_one(), whose TTU_IGNORE_MLOCK users would want the same treatment; and do the same in remove_migration_pte() - not important when successfully inserting a new page, but necessary when hoping to retry after failure. Any new pagevec runs the risk of adding a new way of stranding, and we might discover other corners where mlock_page_drain() or lru_add_drain() would now help. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
c3096e6782 |
mm/migrate: __unmap_and_move() push good newpage to LRU
Compaction, NUMA page movement, THP collapse/split, and memory failure do isolate unevictable pages from their "LRU", losing the record of mlock_count in doing so (isolators are likely to use page->lru for their own private lists, so mlock_count has to be presumed lost). That's unfortunate, and we should put in some work to correct that: one can imagine a function to build up the mlock_count again - but it would require i_mmap_rwsem for read, so be careful where it's called. Or page_referenced_one() and try_to_unmap_one() might do that extra work. But one place that can very easily be improved is page migration's __unmap_and_move(): a small adjustment to where the successful new page is put back on LRU, and its mlock_count (if any) is built back up by remove_migration_ptes(). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
cea86fe246 |
mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
Add vma argument to mlock_vma_page() and munlock_vma_page(), make them inline functions which check (vma->vm_flags & VM_LOCKED) before calling mlock_page() and munlock_page() in mm/mlock.c. Add bool compound to mlock_vma_page() and munlock_vma_page(): this is because we have understandable difficulty in accounting pte maps of THPs, and if passed a PageHead page, mlock_page() and munlock_page() cannot tell whether it's a pmd map to be counted or a pte map to be ignored. Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the others, and use that to call mlock_vma_page() at the end of the page adds, and munlock_vma_page() at the end of page_remove_rmap() (end or beginning? unimportant, but end was easier for assertions in testing). No page lock is required (although almost all adds happen to hold it): delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s. Certainly page lock did serialize with page migration, but I'm having difficulty explaining why that was ever important. Mlock accounting on THPs has been hard to define, differed between anon and file, involved PageDoubleMap in some places and not others, required clear_page_mlock() at some points. Keep it simple now: just count the pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks. page_add_new_anon_rmap() callers unchanged: they have long been calling lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED handling (it also checks for not VM_SPECIAL: I think that's overcautious, and inconsistent with other checks, that mmap_region() already prevents VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
ffa65753c4 |
mm/migrate.c: rework migration_entry_wait() to not take a pageref
This fixes the FIXME in migrate_vma_check_page(). Before migrating a page migration code will take a reference and check there are no unexpected page references, failing the migration if there are. When a thread faults on a migration entry it will take a temporary reference to the page to wait for the page to become unlocked signifying the migration entry has been removed. This reference is dropped just prior to waiting on the page lock, however the extra reference can cause migration failures so it is desirable to avoid taking it. As migration code already has a reference to the migrating page an extra reference to wait on PG_locked is unnecessary so long as the reference can't be dropped whilst setting up the wait. When faulting on a migration entry the ptl is taken to check the migration entry. Removing a migration entry also requires the ptl, and migration code won't drop its page reference until after the migration entry has been removed. Therefore retaining the ptl of a migration entry is sufficient to ensure the page has a reference. Reworking migration_entry_wait() to hold the ptl until the wait setup is complete means the extra page reference is no longer needed. [apopple@nvidia.com: v5] Link: https://lkml.kernel.org/r/20211213033848.1973946-1-apopple@nvidia.com Link: https://lkml.kernel.org/r/20211118020754.954425-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
f56caedaf9 |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "146 patches. Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak, dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap, memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb, userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp, ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits) mm/damon: hide kernel pointer from tracepoint event mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging mm/damon/dbgfs: remove an unnecessary variable mm/damon: move the implementation of damon_insert_region to damon.h mm/damon: add access checking for hugetlb pages Docs/admin-guide/mm/damon/usage: update for schemes statistics mm/damon/dbgfs: support all DAMOS stats Docs/admin-guide/mm/damon/reclaim: document statistics parameters mm/damon/reclaim: provide reclamation statistics mm/damon/schemes: account how many times quota limit has exceeded mm/damon/schemes: account scheme actions that successfully applied mm/damon: remove a mistakenly added comment for a future feature Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning Docs/admin-guide/mm/damon/usage: remove redundant information Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks mm/damon: convert macro functions to static inline functions mm/damon: modify damon_rand() macro to static inline function mm/damon: move damon_rand() definition into damon.h ... |
|
|
|
f1e8db04b6 |
mm/migrate: remove redundant variables used in a for-loop
The variable addr is being set and incremented in a for-loop but not actually being used. It is redundant and so addr and also variable start can be removed. Link: https://lkml.kernel.org/r/20211221185729.609630-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
dcee9bf5bf |
mm/migrate: move node demotion code to near its user
Now, node_demotion and next_demotion_node() are placed between __unmap_and_move() and unmap_and_move(). This hurts code readability. So move them near their users in the file. There's no functionality change in this patch. Link: https://lkml.kernel.org/r/20211206031227.3323097-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Wei Xu <weixugc@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
7813a1b525 |
mm: migrate: add more comments for selecting target node randomly
As Yang Shi suggested [1], it will be helpful to explain why we should select target node randomly now if there are multiple target nodes. [1] https://lore.kernel.org/all/CAHbLzkqSqCL+g7dfzeOw8fPyeEC0BBv13Ny1UVGHDkadnQdR=g@mail.gmail.com/ Link: https://lkml.kernel.org/r/c31d36bd097c6e9e69fc0f409c43b78e53e64fc2.1637766801.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Cc: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
ac16ec8353 |
mm: migrate: support multiple target nodes demotion
We have some machines with multiple memory types like below, which have
one fast (DRAM) memory node and two slow (persistent memory) memory
nodes. According to current node demotion policy, if node 0 fills up,
its memory should be migrated to node 1, when node 1 fills up, its
memory will be migrated to node 2: node 0 -> node 1 -> node 2 ->stop.
But this is not efficient and suitbale memory migration route for our
machine with multiple slow memory nodes. Since the distance between
node 0 to node 1 and node 0 to node 2 is equal, and memory migration
between slow memory nodes will increase persistent memory bandwidth
greatly, which will hurt the whole system's performance.
Thus for this case, we can treat the slow memory node 1 and node 2 as a
whole slow memory region, and we should migrate memory from node 0 to
node 1 and node 2 if node 0 fills up.
This patch changes the node_demotion data structure to support multiple
target nodes, and establishes the migration path to support multiple
target nodes with validating if the node distance is the best or not.
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 62153 MB
node 0 free: 55135 MB
node 1 cpus:
node 1 size: 127007 MB
node 1 free: 126930 MB
node 2 cpus:
node 2 size: 126968 MB
node 2 free: 126878 MB
node distances:
node 0 1 2
0: 10 20 20
1: 20 10 20
2: 20 20 10
Link: https://lkml.kernel.org/r/00728da107789bb4ed9e0d28b1d08fd8056af2ef.1636697263.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Cc: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
5d39a7ebc8 |
mm: migrate: correct the hugetlb migration stats
Correct the migration stats for hugetlb with using compound_nr() instead of thp_nr_pages(), meanwhile change 'nr_failed_pages' to record the number of normal pages failed to migrate, including THP and hugetlb, and 'nr_succeeded' will record the number of normal pages migrated successfully. [baolin.wang@linux.alibaba.com: fix docs, per Mike] Link: https://lkml.kernel.org/r/141bdfc6-f898-3cc3-f692-726c5f6cb74d@linux.alibaba.com Link: https://lkml.kernel.org/r/71a4b6c22f208728fe8c78ad26375436c4ff9704.1636275127.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
b5bade978e |
mm: migrate: fix the return value of migrate_pages()
Patch series "Improve the migration stats". According to talk with Zi Yan [1], this patch set changes the return value of migrate_pages() to avoid returning a number which is larger than the number of pages the users tried to migrate by move_pages() syscall. Also fix the hugetlb migration stats and migration stats in trace_mm_compaction_migratepages(). [1] https://lore.kernel.org/linux-mm/7E44019D-2A5D-4BA7-B4D5-00D4712F1687@nvidia.com/ This patch (of 3): As Zi Yan pointed out, the syscall move_pages() can return a non-migrated number larger than the number of pages the users tried to migrate, when a THP page is failed to migrate. This is confusing for users. Since other migration scenarios do not care about the actual non-migrated number of pages except the memory compaction migration which will fix in following patch. Thus we can change the return value to return the number of {normal page, THP, hugetlb} instead to avoid this issue, and the number of THP splits will be considered as the number of non-migrated THP, no matter how many subpages of the THP are migrated successfully. Meanwhile we should still keep the migration counters using the number of normal pages. Link: https://lkml.kernel.org/r/cover.1636275127.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/6486fabc3e8c66ff613e150af25e89b3147977a6.1636275127.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Zi Yan <ziy@nvidia.com> Co-developed-by: Zi Yan <ziy@nvidia.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
1eba86c096 |
mm: change page type prior to adding page table entry
Patch series "page table check", v3. Ensure that some memory corruptions are prevented by checking at the time of insertion of entries into user page tables that there is no illegal sharing. We have recently found a problem [1] that existed in kernel since 4.14. The problem was caused by broken page ref count and led to memory leaking from one process into another. The problem was accidentally detected by studying a dump of one process and noticing that one page contains memory that should not belong to this process. There are some other page->_refcount related problems that were recently fixed: [2], [3] which potentially could also lead to illegal sharing. In addition to hardening refcount [4] itself, this work is an attempt to prevent this class of memory corruption issues. It uses a simple state machine that is independent from regular MM logic to check for illegal sharing at time pages are inserted and removed from page tables. [1] https://lore.kernel.org/all/xr9335nxwc5y.fsf@gthelen2.svl.corp.google.com [2] https://lore.kernel.org/all/1582661774-30925-2-git-send-email-akaher@vmware.com [3] https://lore.kernel.org/all/20210622021423.154662-3-mike.kravetz@oracle.com [4] https://lore.kernel.org/all/20211221150140.988298-1-pasha.tatashin@soleen.com This patch (of 4): There are a few places where we first update the entry in the user page table, and later change the struct page to indicate that this is anonymous or file page. In most places, however, we first configure the page metadata and then insert entries into the page table. Page table check, will use the information from struct page to verify the type of entry is inserted. Change the order in all places to first update struct page, and later to update page table. This means that we first do calls that may change the type of page (anon or file): page_move_anon_rmap page_add_anon_rmap do_page_add_anon_rmap page_add_new_anon_rmap page_add_file_rmap hugepage_add_anon_rmap hugepage_add_new_anon_rmap And after that do calls that add entries to the page table: set_huge_pte_at set_pte_at Link: https://lkml.kernel.org/r/20211221154650.1047963-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20211221154650.1047963-2-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: David Rientjes <rientjes@google.com> Cc: Paul Turner <pjt@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Will Deacon <will@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
6b24ca4a1a |
mm: Use multi-index entries in the page cache
We currently store large folios as 2^N consecutive entries. While this consumes rather more memory than necessary, it also turns out to be buggy. A writeback operation which starts within a tail page of a dirty folio will not write back the folio as the xarray's dirty bit is only set on the head index. With multi-index entries, the dirty bit will be found no matter where in the folio the operation starts. This does end up simplifying the page cache slightly, although not as much as I had hoped. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> |
|
|
|
9f2b04a25a |
filemap: Add folio_put_wait_locked()
Convert all three callers of put_and_wait_on_page_locked() to folio_put_wait_locked(). This shrinks the kernel overall by 19 bytes. filemap_update_page() shrinks by 19 bytes while __migration_entry_wait() is unchanged. folio_put_wait_locked() is 14 bytes smaller than put_and_wait_on_page_locked(), but pmd_migration_entry_wait() grows by 14 bytes. It removes the assumption from pmd_migration_entry_wait() that pages cannot be larger than a PMD (which is true today, but may be interesting to explore in the future). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> |
|
|
|
ab09243aa9 |
mm/migrate.c: remove MIGRATE_PFN_LOCKED
MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a source page was already locked during migrate_vma_collect(). If it wasn't then the a second attempt is made to lock the page. However if the first attempt failed it's unlikely a second attempt will succeed, and the retry adds complexity. So clean this up by removing the retry and MIGRATE_PFN_LOCKED flag. Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag set, but nothing actually checks that. Link: https://lkml.kernel.org/r/20211025041608.289017-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ben Skeggs <bskeggs@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
0ef0246214 |
mm: migrate: simplify the file-backed pages validation when migrating its mapping
There is no need to validate the file-backed page's refcount before trying to freeze the page's expected refcount, instead we can rely on the folio_ref_freeze() to validate if the page has the expected refcount before migrating its mapping. Moreover we are always under the page lock when migrating the page mapping, which means nowhere else can remove it from the page cache, so we can remove the xas_load() validation under the i_pages lock. Link: https://lkml.kernel.org/r/cover.1629447552.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/df4c129fd8e86a95dbc55f4663d77441cc0d3bd1.1629447552.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
512b7931ad |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "257 patches. Subsystems affected by this patch series: scripts, ocfs2, vfs, and mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache, gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools, memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm, vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram, cleanups, kfence, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits) mm/damon: remove return value from before_terminate callback mm/damon: fix a few spelling mistakes in comments and a pr_debug message mm/damon: simplify stop mechanism Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Docs/admin-guide/mm/damon/start: simplify the content Docs/admin-guide/mm/damon/start: fix a wrong link Docs/admin-guide/mm/damon/start: fix wrong example commands mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on mm/damon: remove unnecessary variable initialization Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) selftests/damon: support watermarks mm/damon/dbgfs: support watermarks mm/damon/schemes: activate schemes based on a watermarks mechanism tools/selftests/damon: update for regions prioritization of schemes mm/damon/dbgfs: support prioritization weights mm/damon/vaddr,paddr: support pageout prioritization mm/damon/schemes: prioritize regions within the quotas mm/damon/selftests: support schemes quotas mm/damon/dbgfs: support quotas of schemes ... |
|
|
|
20f9ba4f99 |
mm: migrate: make demotion knob depend on migration
The memory demotion needs to call migrate_pages() to do the jobs. And it is controlled by a knob, however, the knob doesn't depend on CONFIG_MIGRATION. The knob could be truned on even though MIGRATION is disabled, this will not cause any crash since migrate_pages() would just return -ENOSYS. But it is definitely not optimal to go through demotion path then retry regular swap every time. And it doesn't make too much sense to have the knob visible to the users when !MIGRATION. Move the related code from mempolicy.[h|c] to migrate.[h|c]. Link: https://lkml.kernel.org/r/20211015005559.246709-1-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
49f8275c7d |
Memory folios
Add memory folios, a new type to represent either order-0 pages or the head page of a compound page. This should be enough infrastructure to support filesystems converting from pages to folios. -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmF9uI0ACgkQDpNsjXcp gj7MUAf/R7LCZ+xFiIedw7SAgb/DGK0C9uVjuBEIZgAw21ZUw/GuPI6cuKBMFGGf rRcdtlvMpwi7yZJcoNXxaqU/xPaaJMjf2XxscIvYJP1mjlZVuwmP9dOx0neNvWOc T+8lqR6c1TLl82lpqIjGFLwvj2eVowq2d3J5jsaIJFd4odmmYVInrhJXOzC/LQ54 Niloj5ksehf+KUIRLDz7ycppvIHhlVsoAl0eM2dWBAtL0mvT7Nyn/3y+vnMfV2v3 Flb4opwJUgTJleYc16oxTn9svT2yS8q2uuUemRDLW8ABghoAtH3fUUk43RN+5Krd LYCtbeawtkikPVXZMfWybsx5vn0c3Q== =7SBe -----END PGP SIGNATURE----- Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache Pull memory folios from Matthew Wilcox: "Add memory folios, a new type to represent either order-0 pages or the head page of a compound page. This should be enough infrastructure to support filesystems converting from pages to folios. The point of all this churn is to allow filesystems and the page cache to manage memory in larger chunks than PAGE_SIZE. The original plan was to use compound pages like THP does, but I ran into problems with some functions expecting only a head page while others expect the precise page containing a particular byte. The folio type allows a function to declare that it's expecting only a head page. Almost incidentally, this allows us to remove various calls to VM_BUG_ON(PageTail(page)) and compound_head(). This converts just parts of the core MM and the page cache. For 5.17, we intend to convert various filesystems (XFS and AFS are ready; other filesystems may make it) and also convert more of the MM and page cache to folios. For 5.18, multi-page folios should be ready. The multi-page folios offer some improvement to some workloads. The 80% win is real, but appears to be an artificial benchmark (postgres startup, which isn't a serious workload). Real workloads (eg building the kernel, running postgres in a steady state, etc) seem to benefit between 0-10%. I haven't heard of any performance losses as a result of this series. Nobody has done any serious performance tuning; I imagine that tweaking the readahead algorithm could provide some more interesting wins. There are also other places where we could choose to create large folios and currently do not, such as writes that are larger than PAGE_SIZE. I'd like to thank all my reviewers who've offered review/ack tags: Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil Babka, William Kucharski, Yu Zhao and Zi Yan. I'd also like to thank those who gave feedback I incorporated but haven't offered up review tags for this part of the series: Nick Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard, Hugh Dickins, and probably a few others who I forget" * tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits) mm/writeback: Add folio_write_one mm/filemap: Add FGP_STABLE mm/filemap: Add filemap_get_folio mm/filemap: Convert mapping_get_entry to return a folio mm/filemap: Add filemap_add_folio() mm/filemap: Add filemap_alloc_folio mm/page_alloc: Add folio allocation functions mm/lru: Add folio_add_lru() mm/lru: Convert __pagevec_lru_add_fn to take a folio mm: Add folio_evictable() mm/workingset: Convert workingset_refault() to take a folio mm/filemap: Add readahead_folio() mm/filemap: Add folio_mkwrite_check_truncate() mm/filemap: Add i_blocks_per_folio() mm/writeback: Add folio_redirty_for_writepage() mm/writeback: Add folio_account_redirty() mm/writeback: Add folio_clear_dirty_for_io() mm/writeback: Add folio_cancel_dirty() mm/writeback: Add folio_account_cleaned() mm/writeback: Add filemap_dirty_folio() ... |