mirror of https://github.com/torvalds/linux.git
486 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
6ffcd825e7 |
mm: Remove __delete_from_page_cache()
This wrapper is no longer used. Remove it and all references to it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
67f22ba775 |
mm/memory-failure: disable unpoison once hw error happens
Currently unpoison_memory(unsigned long pfn) is designed for soft poison(hwpoison-inject) only. Since |
|
|
|
98931dd95f |
Yang Shi has improved the behaviour of khugepaged collapsing of readonly
file-backed transparent hugepages.
Johannes Weiner has arranged for zswap memory use to be tracked and
managed on a per-cgroup basis.
Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime
enablement of the recent huge page vmemmap optimization feature.
Baolin Wang contributes a series to fix some issues around hugetlb
pagetable invalidation.
Zhenwei Pi has fixed some interactions between hwpoisoned pages and
virtualization.
Tong Tiangen has enabled the use of the presently x86-only
page_table_check debugging feature on arm64 and riscv.
David Vernet has done some fixup work on the memcg selftests.
Peter Xu has taught userfaultfd to handle write protection faults against
shmem- and hugetlbfs-backed files.
More DAMON development from SeongJae Park - adding online tuning of the
feature and support for monitoring of fixed virtual address ranges. Also
easier discovery of which monitoring operations are available.
Nadav Amit has done some optimization of TLB flushing during mprotect().
Neil Brown continues to labor away at improving our swap-over-NFS support.
David Hildenbrand has some fixes to anon page COWing versus
get_user_pages().
Peng Liu fixed some errors in the core hugetlb code.
Joao Martins has reduced the amount of memory consumed by device-dax's
compound devmaps.
Some cleanups of the arch-specific pagemap code from Anshuman Khandual.
Muchun Song has found and fixed some errors in the TLB flushing of
transparent hugepages.
Roman Gushchin has done more work on the memcg selftests.
And, of course, many smaller fixes and cleanups. Notably, the customary
million cleanup serieses from Miaohe Lin.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYo52xQAKCRDdBJ7gKXxA
jtJFAQD238KoeI9z5SkPMaeBRYSRQmNll85mxs25KapcEgWgGQD9FAb7DJkqsIVk
PzE+d9hEfirUGdL6cujatwJ6ejYR8Q8=
=nFe6
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Almost all of MM here. A few things are still getting finished off,
reviewed, etc.
- Yang Shi has improved the behaviour of khugepaged collapsing of
readonly file-backed transparent hugepages.
- Johannes Weiner has arranged for zswap memory use to be tracked and
managed on a per-cgroup basis.
- Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for
runtime enablement of the recent huge page vmemmap optimization
feature.
- Baolin Wang contributes a series to fix some issues around hugetlb
pagetable invalidation.
- Zhenwei Pi has fixed some interactions between hwpoisoned pages and
virtualization.
- Tong Tiangen has enabled the use of the presently x86-only
page_table_check debugging feature on arm64 and riscv.
- David Vernet has done some fixup work on the memcg selftests.
- Peter Xu has taught userfaultfd to handle write protection faults
against shmem- and hugetlbfs-backed files.
- More DAMON development from SeongJae Park - adding online tuning of
the feature and support for monitoring of fixed virtual address
ranges. Also easier discovery of which monitoring operations are
available.
- Nadav Amit has done some optimization of TLB flushing during
mprotect().
- Neil Brown continues to labor away at improving our swap-over-NFS
support.
- David Hildenbrand has some fixes to anon page COWing versus
get_user_pages().
- Peng Liu fixed some errors in the core hugetlb code.
- Joao Martins has reduced the amount of memory consumed by
device-dax's compound devmaps.
- Some cleanups of the arch-specific pagemap code from Anshuman
Khandual.
- Muchun Song has found and fixed some errors in the TLB flushing of
transparent hugepages.
- Roman Gushchin has done more work on the memcg selftests.
... and, of course, many smaller fixes and cleanups. Notably, the
customary million cleanup serieses from Miaohe Lin"
* tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits)
mm: kfence: use PAGE_ALIGNED helper
selftests: vm: add the "settings" file with timeout variable
selftests: vm: add "test_hmm.sh" to TEST_FILES
selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests
selftests: vm: add migration to the .gitignore
selftests/vm/pkeys: fix typo in comment
ksm: fix typo in comment
selftests: vm: add process_mrelease tests
Revert "mm/vmscan: never demote for memcg reclaim"
mm/kfence: print disabling or re-enabling message
include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace"
include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion"
mm: fix a potential infinite loop in start_isolate_page_range()
MAINTAINERS: add Muchun as co-maintainer for HugeTLB
zram: fix Kconfig dependency warning
mm/shmem: fix shmem folio swapoff hang
cgroup: fix an error handling path in alloc_pagecache_max_30M()
mm: damon: use HPAGE_PMD_SIZE
tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
nodemask.h: fix compilation error with GCC12
...
|
|
|
|
6d4675e601 |
mm: don't be stuck to rmap lock on reclaim path
The rmap locks(i_mmap_rwsem and anon_vma->root->rwsem) could be contended
under memory pressure if processes keep working on their vmas(e.g., fork,
mmap, munmap). It makes reclaim path stuck. In our real workload traces,
we see kswapd is waiting the lock for 300ms+(worst case, a sec) and it
makes other processes entering direct reclaim, which were also stuck on
the lock.
This patch makes lru aging path try_lock mode like shink_page_list so the
reclaim context will keep working with next lru pages without being stuck.
if it found the rmap lock contended, it rotates the page back to head of
lru in both active/inactive lrus to make them consistent behavior, which
is basic starting point rather than adding more heristic.
Since this patch introduces a new "contended" field as out-param along
with try_lock in-param in rmap_walk_control, it's not immutable any longer
if the try_lock is set so remove const keywords on rmap related functions.
Since rmap walking is already expensive operation, I doubt the const
would help sizable benefit( And we didn't have it until 5.17).
In a heavy app workload in Android, trace shows following statistics. It
almost removes rmap lock contention from reclaim path.
Martin Liu reported:
Before:
max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function
1632 0 1631 151.542173 31672 209 page_lock_anon_vma_read
601 0 601 145.544681 28817 198 rmap_walk_file
After:
max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function
NaN NaN NaN NaN NaN 0.0 NaN
0 0 0 0.127645 1 12 rmap_walk_file
[minchan@kernel.org: add comment, per Matthew]
Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Martin Liu <liumartin@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
e240ac52f7 |
mm/memory-failure.c: simplify num_poisoned_pages_inc/dec
Originally, do num_poisoned_pages_inc() in memory failure routine, use num_poisoned_pages_dec() to rollback the number if filtered/ cancelled. Suggested by Naoya, do num_poisoned_pages_inc() only in action_result(), this make this clear and simple. Link: https://lkml.kernel.org/r/20220509105641.491313-6-pizhenwei@bytedance.com Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
9113eaf331 |
mm/memory-failure.c: add hwpoison_filter for soft offline
hwpoison_filter is missing in the soft offline path, this leads an issue: after enabling the corrupt filter, the user process still has a chance to inject hwpoison fault by madvise(addr, len, MADV_SOFT_OFFLINE) at PFN which is expected to reject. Also do a minor change in comment of memory_failure(). Link: https://lkml.kernel.org/r/20220509105641.491313-4-pizhenwei@bytedance.com Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
c8bd84f73f |
mm/memory-failure.c: simplify num_poisoned_pages_dec
Don't decrease the number of poisoned pages in page_alloc.c, let the memory-failure.c do inc/dec poisoned pages only. Also simplify unpoison_memory(), only decrease the number of poisoned pages when: - TestClearPageHWPoison() succeed - put_page_back_buddy succeed After decreasing, print necessary log. Finally, remove clear_page_hwpoison() and unpoison_taken_off_page(). Link: https://lkml.kernel.org/r/20220509105641.491313-3-pizhenwei@bytedance.com Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
60f272f6b0 |
mm/memory-failure.c: move clear_hwpoisoned_pages
Patch series "memory-failure: fix hwpoison_filter", v2. As well known, the memory failure mechanism handles memory corrupted event, and try to send SIGBUS to the user process which uses this corrupted page. For the virtualization case, QEMU catches SIGBUS and tries to inject MCE into the guest, and the guest handles memory failure again. Thus the guest gets the minimal effect from hardware memory corruption. The further step I'm working on: 1, try to modify code to decrease poisoned pages in a single place (mm/memofy-failure.c: simplify num_poisoned_pages_dec in this series). 2, try to use page_handle_poison() to handle SetPageHWPoison() and num_poisoned_pages_inc() together. It would be best to call num_poisoned_pages_inc() in a single place too. 3, introduce memory failure notifier list in memory-failure.c: notify the corrupted PFN to someone who registers this list. If I can complete [1] and [2] part, [3] will be quite easy(just call notifier list after increasing poisoned page). 4, introduce memory recover VQ for memory balloon device, and registers memory failure notifier list. During the guest kernel handles memory failure, balloon device gets notified by memory failure notifier list, and tells the host to recover the corrupted PFN(GPA) by the new VQ. 5, host side remaps the corrupted page(HVA), and tells the guest side to unpoison the PFN(GPA). Then the guest fixes the corrupted page(GPA) dynamically. This patch (of 5): clear_hwpoisoned_pages() clears HWPoison flag and decreases the number of poisoned pages, this actually works as part of memory failure. Move this function from sparse.c to memory-failure.c, finally there is no CONFIG_MEMORY_FAILURE in sparse.c. Link: https://lkml.kernel.org/r/20220509105641.491313-1-pizhenwei@bytedance.com Link: https://lkml.kernel.org/r/20220509105641.491313-2-pizhenwei@bytedance.com Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
014bb1de4f |
mm: create new mm/swap.h header file
Patch series "MM changes to improve swap-over-NFS support". Assorted improvements for swap-via-filesystem. This is a resend of these patches, rebased on current HEAD. The only substantial changes is that swap_dirty_folio has replaced swap_set_page_dirty. Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem. It has previously worked for NFS but that broke a few releases back. This series changes to use a new ->swap_rw rather than ->readpage and ->direct_IO. It also makes other improvements. There is a companion series already in linux-next which fixes various issues with NFS. Once both series land, a final patch is needed which changes NFS over to use ->swap_rw. This patch (of 10): Many functions declared in include/linux/swap.h are only used within mm/ Create a new "mm/swap.h" and move some of these declarations there. Remove the redundant 'extern' from the function declarations. [akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h] Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brown Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: David Howells <dhowells@redhat.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
b283d983a7 |
mm, hugetlb, hwpoison: separate branch for free and in-use hugepage
We know that HPageFreed pages should have page refcount 0, so get_page_unless_zero() always fails and returns 0. So explicitly separate the branch based on page state for minor optimization and better readability. Link: https://lkml.kernel.org/r/20220415041848.GA3034499@ik1-406-35019.vs.sakura.ne.jp Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
ef526b17bc |
mm/memory-failure.c: dissolve truncated hugetlb page
If me_huge_page meets a truncated but not yet freed hugepage, it won't be dissolved even if we hold the last refcnt. It's because the hugepage has NULL page_mapping while it's not anonymous hugepage too. Thus we lose the last chance to dissolve it into buddy to save healthy subpages. Remove PageAnon check to handle these hugepages too. Link: https://lkml.kernel.org/r/20220414114941.11223-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
3f87137068 |
mm/memory-failure.c: minor cleanup for HWPoisonHandlable
Patch series "A few fixup and cleanup patches for memory failure", v2. This series contains a patch to clean up the HWPoisonHandlable and another one to dissolve truncated hugetlb page. More details can be found in the respective changelogs. This patch (of 2): The local variable movable can be removed by returning true directly. Also fix typo 'mirgate'. No functional change intended. Link: https://lkml.kernel.org/r/20220414114941.11223-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220414114941.11223-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
2ba2b008a8 |
Revert "mm/memory-failure.c: fix race with changing page compound again"
Reverts commit
|
|
|
|
f361e2462e |
mm/hwpoison: put page in already hwpoisoned case with MF_COUNT_INCREASED
In already hwpoisoned case, memory_failure() is supposed to return with releasing the page refcount taken for error handling. But currently the refcount is not released when called with MF_COUNT_INCREASED, which makes page refcount inconsistent. This should be rare and non-critical, but it might be inconvenient in testing (unpoison doesn't work). Link: https://lkml.kernel.org/r/20220408135323.1559401-3-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Suggested-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
f142e70750 |
mm/memory-failure.c: remove unnecessary (void*) conversions
No need cast (void*) to (struct hwp_walk*). Link: https://lkml.kernel.org/r/20220322142826.25939-1-liqiong@nfschina.com Signed-off-by: liqiong <liqiong@nfschina.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
1825b93b62 |
mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()
The following VM_BUG_ON_FOLIO() is triggered when memory error event
happens on the (thp/folio) pages which are about to be freed:
[ 1160.232771] page:00000000b36a8a0f refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x16a000
[ 1160.236916] page:00000000b36a8a0f refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x16a000
[ 1160.240684] flags: 0x57ffffc0800000(hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
[ 1160.243458] raw: 0057ffffc0800000 dead000000000100 dead000000000122 0000000000000000
[ 1160.246268] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[ 1160.249197] page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio))
[ 1160.251815] ------------[ cut here ]------------
[ 1160.253438] kernel BUG at include/linux/mm.h:788!
[ 1160.256162] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 1160.258172] CPU: 2 PID: 115368 Comm: mceinj.sh Tainted: G E 5.18.0-rc1-v5.18-rc1-220404-2353-005-g83111+ #3
[ 1160.262049] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
[ 1160.265103] RIP: 0010:dump_page.cold+0x27e/0x2bd
[ 1160.266757] Code: fe ff ff 48 c7 c6 81 f1 5a 98 e9 4c fe ff ff 48 c7 c6 a1 95 59 98 e9 40 fe ff ff 48 c7 c6 50 bf 5a 98 48 89 ef e8 9d 04 6d ff <0f> 0b 41 f7 c4 ff 0f 00 00 0f 85 9f fd ff ff 49 8b 04 24 a9 00 00
[ 1160.273180] RSP: 0018:ffffaa2c4d59fd18 EFLAGS: 00010292
[ 1160.274969] RAX: 000000000000003e RBX: 0000000000000001 RCX: 0000000000000000
[ 1160.277263] RDX: 0000000000000001 RSI: ffffffff985995a1 RDI: 00000000ffffffff
[ 1160.279571] RBP: ffffdc9c45a80000 R08: 0000000000000000 R09: 00000000ffffdfff
[ 1160.281794] R10: ffffaa2c4d59fb08 R11: ffffffff98940d08 R12: ffffdc9c45a80000
[ 1160.283920] R13: ffffffff985b6f94 R14: 0000000000000000 R15: ffffdc9c45a80000
[ 1160.286641] FS: 00007eff54ce1740(0000) GS:ffff99c67bd00000(0000) knlGS:0000000000000000
[ 1160.289498] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1160.291106] CR2: 00005628381a5f68 CR3: 0000000104712003 CR4: 0000000000170ee0
[ 1160.293031] Call Trace:
[ 1160.293724] <TASK>
[ 1160.294334] get_hwpoison_page+0x47d/0x570
[ 1160.295474] memory_failure+0x106/0xaa0
[ 1160.296474] ? security_capable+0x36/0x50
[ 1160.297524] hard_offline_page_store+0x43/0x80
[ 1160.298684] kernfs_fop_write_iter+0x11c/0x1b0
[ 1160.299829] new_sync_write+0xf9/0x160
[ 1160.300810] vfs_write+0x209/0x290
[ 1160.301835] ksys_write+0x4f/0xc0
[ 1160.302718] do_syscall_64+0x3b/0x90
[ 1160.303664] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1160.304981] RIP: 0033:0x7eff54b018b7
As shown in the RIP address, this VM_BUG_ON in folio_entire_mapcount() is
called from dump_page("hwpoison: unhandlable page") in get_any_page().
The below explains the mechanism of the race:
CPU 0 CPU 1
memory_failure
get_hwpoison_page
get_any_page
dump_page
compound = PageCompound
free_pages_prepare
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP
folio_entire_mapcount
VM_BUG_ON_FOLIO(!folio_test_large(folio))
So replace dump_page() with safer one, pr_err().
Link: https://lkml.kernel.org/r/20220427053220.719866-1-naoya.horiguchi@linux.dev
Fixes:
|
|
|
|
b4e61fc031 |
Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"
Patch series "mm/memory-failure: rework fix on huge_zero_page splitting". This patch (of 2): This reverts commit |
|
|
|
d173d5417f |
mm/memory-failure.c: skip huge_zero_page in memory_failure()
Kernel panic when injecting memory_failure for the global
huge_zero_page, when CONFIG_DEBUG_VM is enabled, as follows.
Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
------------[ cut here ]------------
kernel BUG at mm/huge_memory.c:2499!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
RIP: 0010:split_huge_page_to_list+0x66a/0x880
Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
FS: 00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
try_to_split_thp_page+0x3a/0x130
memory_failure+0x128/0x800
madvise_inject_error.cold+0x8b/0xa1
__x64_sys_madvise+0x54/0x60
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fc3754f8bf9
Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
This makes huge_zero_page bail out explicitly before split in
memory_failure(), thus the panic above won't happen again.
Link: https://lkml.kernel.org/r/497d3835612610e370c74e697ea3c721d1d55b9c.1649775850.git.xuyu@linux.alibaba.com
Fixes:
|
|
|
|
405ce05123 |
mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
There is a race condition between memory_failure_hugetlb() and hugetlb
free/demotion, which causes setting PageHWPoison flag on the wrong page.
The one simple result is that wrong processes can be killed, but another
(more serious) one is that the actual error is left unhandled, so no one
prevents later access to it, and that might lead to more serious results
like consuming corrupted data.
Think about the below race window:
CPU 1 CPU 2
memory_failure_hugetlb
struct page *head = compound_head(p);
hugetlb page might be freed to
buddy, or even changed to another
compound page.
get_hwpoison_page -- page is not what we want now...
The current code first does prechecks roughly and then reconfirms after
taking refcount, but it's found that it makes code overly complicated,
so move the prechecks in a single hugetlb_lock range.
A newly introduced function, try_memory_failure_hugetlb(), always takes
hugetlb_lock (even for non-hugetlb pages). That can be improved, but
memory_failure() is rare in principle, so should not be a big problem.
Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
Fixes:
|
|
|
|
9030fb0bb9 |
Folio changes for 5.18
- Rewrite how munlock works to massively reduce the contention
on i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4ucgACgkQDpNsjXcp
gj69Wgf6AwqwmO5Tmy+fLScDPqWxmXJofbocae1kyoGHf7Ui91OK4U2j6IpvAr+g
P/vLIK+JAAcTQcrSCjymuEkf4HkGZOR03QQn7maPIEe4eLrZRQDEsmHC1L9gpeJp
s/GMvDWiGE0Tnxu0EOzfVi/yT+qjIl/S8VvqtCoJv1HdzxitZ7+1RDuqImaMC5MM
Qi3uHag78vLmCltLXpIOdpgZhdZexCdL2Y/1npf+b6FVkAJRRNUnA0gRbS7YpoVp
CbxEJcmAl9cpJLuj5i5kIfS9trr+/QcvbUlzRxh4ggC58iqnmF2V09l2MJ7YU3XL
v1O/Elq4lRhXninZFQEm9zjrri7LDQ==
=n9Ad
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
- Rewrite how munlock works to massively reduce the contention on
i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph
Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew
Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew
Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
* tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits)
mm/damon: minor cleanup for damon_pa_young
selftests/vm/transhuge-stress: Support file-backed PMD folios
mm/filemap: Support VM_HUGEPAGE for file mappings
mm/readahead: Switch to page_cache_ra_order
mm/readahead: Align file mappings for non-DAX
mm/readahead: Add large folio readahead
mm: Support arbitrary THP sizes
mm: Make large folios depend on THP
mm: Fix READ_ONLY_THP warning
mm/filemap: Allow large folios to be added to the page cache
mm: Turn can_split_huge_page() into can_split_folio()
mm/vmscan: Convert pageout() to take a folio
mm/vmscan: Turn page_check_references() into folio_check_references()
mm/vmscan: Account large folios correctly
mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios
mm/vmscan: Free non-shmem folios without splitting them
mm/rmap: Constify the rmap_walk_control argument
mm/rmap: Convert rmap_walk() to take a folio
mm: Turn page_anon_vma() into folio_anon_vma()
mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
...
|
|
|
|
bf6445bc8f |
mm/memory-failure.c: make non-LRU movable pages unhandlable
We can not really handle non-LRU movable pages in memory failure. Typically they are balloon, zsmalloc, etc. Assuming we run into a base (4K) non-LRU movable page, we could reach as far as identify_page_state(), it should not fall into any category except me_unknown. For the non-LRU compound movable pages, they could be taken for transhuge pages but it's unexpected to split non-LRU movable pages using split_huge_page_to_list in memory_failure. So we could just simply make non-LRU movable pages unhandlable to avoid these possible nasty cases. Link: https://lkml.kernel.org/r/20220312074613.4798-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
593396b86e |
mm/memory-failure.c: avoid calling invalidate_inode_page() with unexpected pages
Since commit 042c4f32323b ("mm/truncate: Inline invalidate_complete_page()
into its one caller"), invalidate_inode_page() can invalidate the pages
in the swap cache because the check of page->mapping != mapping is
removed. But invalidate_inode_page() is not expected to deal with the
pages in swap cache. Also non-lru movable page can reach here too.
They're not page cache pages. Skip these pages by checking
PageSwapCache and PageLRU.
Link: https://lkml.kernel.org/r/20220312074613.4798-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
888af2701d |
mm/memory-failure.c: fix race with changing page compound again
Patch series "A few fixup patches for memory failure", v2. This series contains a few patches to fix the race with changing page compound page, make non-LRU movable pages unhandlable and so on. More details can be found in the respective changelogs. There is a race window where we got the compound_head, the hugetlb page could be freed to buddy, or even changed to another compound page just before we try to get hwpoison page. Think about the below race window: CPU 1 CPU 2 memory_failure_hugetlb struct page *head = compound_head(p); hugetlb page might be freed to buddy, or even changed to another compound page. get_hwpoison_page -- page is not what we want now... If this race happens, just bail out. Also MF_MSG_DIFFERENT_PAGE_SIZE is introduced to record this event. [akpm@linux-foundation.org: s@/**@/*@, per Naoya Horiguchi] Link: https://lkml.kernel.org/r/20220312074613.4798-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220312074613.4798-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
a06ad3c0c7 |
mm/hwpoison: add in-use hugepage hwpoison filter judgement
After successfully obtaining the reference count of the huge page, it is still necessary to call hwpoison_filter() to make a filter judgement, otherwise the filter hugepage will be unmaped and the related process may be killed. Link: https://lkml.kernel.org/r/20220223082254.2769757-1-luofei@unicloud.com Signed-off-by: luofei <luofei@unicloud.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
d1fe111fb6 |
mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler
When the hwpoison page meets the filter conditions, it should not be regarded as successful memory_failure() processing for mce handler, but should return a distinct value, otherwise mce handler regards the error page has been identified and isolated, which may lead to calling set_mce_nospec() to change page attribute, etc. Here memory_failure() return -EOPNOTSUPP to indicate that the error event is filtered, mce handler should not take any action for this situation and hwpoison injector should treat as correct. Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com Signed-off-by: luofei <luofei@unicloud.com> Acked-by: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
b04d3eebeb |
mm/memory-failure.c: remove unnecessary PageTransTail check
When we reach here, we're guaranteed to have non-compound page as thp is already splited. Remove this unnecessary PageTransTail check. Link: https://lkml.kernel.org/r/20220218090118.1105-9-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
2ab916790f |
mm/memory-failure.c: remove obsolete comment in __soft_offline_page
Since commit
|
|
|
|
357670f79e |
mm/memory-failure.c: rework the try_to_unmap logic in hwpoison_user_mappings()
Only for hugetlb pages in shared mappings, try_to_unmap should take semaphore in write mode here. Rework the code to make it clear. Link: https://lkml.kernel.org/r/20220218090118.1105-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
67ff51c6a6 |
mm/memory-failure.c: remove PageSlab check in hwpoison_filter_dev
Since commit
|
|
|
|
75ee64b3c9 |
mm/memory-failure.c: fix race with changing page more robustly
We're only intended to deal with the non-Compound page after we split thp in memory_failure. However, the page could have changed compound pages due to race window. If this happens, we could retry once to hopefully handle the page next round. Also remove unneeded orig_head. It's always equal to the hpage. So we can use hpage directly and remove this redundant one. Link: https://lkml.kernel.org/r/20220218090118.1105-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
49775047cf |
mm/memory-failure.c: rework the signaling logic in kill_proc
BUS_MCEERR_AR code is only sent when MF_ACTION_REQUIRED is set and the target is current. Rework the code to make this clear. Link: https://lkml.kernel.org/r/20220218090118.1105-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
a994402bc4 |
mm/memory-failure.c: catch unexpected -EFAULT from vma_address()
It's unexpected to walk the page table when vma_address() return
-EFAULT. But dev_pagemap_mapping_shift() is called only when vma
associated to the error page is found already in
collect_procs_{file,anon}, so vma_address() should not return -EFAULT
except with some bug, as Naoya pointed out. We can use VM_BUG_ON_VMA()
to catch this bug here.
Link: https://lkml.kernel.org/r/20220218090118.1105-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
577553f489 |
mm/memory-failure.c: minor clean up for memory_failure_dev_pagemap
Patch series "A few cleanup and fixup patches for memory failure", v3. This series contains a few patches to simplify the code logic, remove unneeded variable and remove obsolete comment. Also we fix race changing page more robustly in memory_failure. More details can be found in the respective changelogs. This patch (of 8): The flags always has MF_ACTION_REQUIRED and MF_MUST_KILL set. So we do not need to check these flags again. Link: https://lkml.kernel.org/r/20220218090118.1105-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220218090118.1105-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
046545a661 |
mm/hwpoison: fix error page recovered but reported "not recovered"
When an uncorrected memory error is consumed there is a race between the
CMCI from the memory controller reporting an uncorrected error with a
UCNA signature, and the core reporting and SRAR signature machine check
when the data is about to be consumed.
If the CMCI wins that race, the page is marked poisoned when
uc_decode_notifier() calls memory_failure() and the machine check
processing code finds the page already poisoned. It calls
kill_accessing_process() to make sure a SIGBUS is sent. But returns the
wrong error code.
Console log looks like this:
mce: Uncorrected hardware memory error in user-access at 3710b3400
Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
Memory failure: 0x3710b3: already hardware poisoned
Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
mce: Memory error not recovered
kill_accessing_process() is supposed to return -EHWPOISON to notify that
SIGBUS is already set to the process and kill_me_maybe() doesn't have to
send it again. But current code simply fails to do this, so fix it to
make sure to work as intended. This change avoids the noise message
"Memory error not recovered" and skips duplicate SIGBUSs.
[tony.luck@intel.com: reword some parts of commit message]
Link: https://lkml.kernel.org/r/20220113231117.1021405-1-naoya.horiguchi@linux.dev
Fixes:
|
|
|
|
ae483c2006 |
mm/memory-failure.c: remove obsolete comment
With the introduction of mf_mutex, most of memory error handling process is mutually exclusive, so the in-line comment about subtlety about double-checking PageHWPoison is no more correct. So remove it. Link: https://lkml.kernel.org/r/20220125025601.3054511-1-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
9595d76942 |
mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
Add back page_lock_anon_vma_read() as a wrapper. This saves a few calls to compound_head(). If any callers were passing a tail page before, this would have failed to lock the anon VMA as page->mapping is not valid for tail pages. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
869f7ee6f6 |
mm/rmap: Convert try_to_unmap() to take a folio
Change all three callers and the worker function try_to_unmap_one(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
|
|
|
d6c75dc22c |
mm/truncate: Split invalidate_inode_page() into mapping_evict_folio()
Some of the callers already have the address_space and can avoid calling folio_mapping() and checking if the folio was already truncated. Also add kernel-doc and fix the return type (in case we ever support folios larger than 4TB). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> |
|
|
|
61e28cf054 |
memory-failure: fetch compound_head after pgmap_pfn_valid()
memory_failure_dev_pagemap() at the moment assumes base pages (e.g. dax_lock_page()). For devmap with compound pages fetch the compound_head in case a tail page memory failure is being handled. Currently this is a nop, but in the advent of compound pages in dev_pagemap it allows memory_failure_dev_pagemap() to keep working. Without this fix memory-failure handling (i.e. MCEs on pmem) with device-dax configured namespaces will regress (and crash). Link: https://lkml.kernel.org/r/20211202204422.26777-2-joao.m.martins@oracle.com Reported-by: Jane Chu <jane.chu@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
f56caedaf9 |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "146 patches. Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak, dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap, memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb, userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp, ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits) mm/damon: hide kernel pointer from tracepoint event mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging mm/damon/dbgfs: remove an unnecessary variable mm/damon: move the implementation of damon_insert_region to damon.h mm/damon: add access checking for hugetlb pages Docs/admin-guide/mm/damon/usage: update for schemes statistics mm/damon/dbgfs: support all DAMOS stats Docs/admin-guide/mm/damon/reclaim: document statistics parameters mm/damon/reclaim: provide reclamation statistics mm/damon/schemes: account how many times quota limit has exceeded mm/damon/schemes: account scheme actions that successfully applied mm/damon: remove a mistakenly added comment for a future feature Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning Docs/admin-guide/mm/damon/usage: remove redundant information Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks mm/damon: convert macro functions to static inline functions mm/damon: modify damon_rand() macro to static inline function mm/damon: move damon_rand() definition into damon.h ... |
|
|
|
0b8f0d8700 |
mm: fix some comment errors
Link: https://lkml.kernel.org/r/20211101040208.460810-1-fuqf0919@gmail.com Signed-off-by: Quanfa Fu <fuqf0919@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
bf181c5825 |
mm/hwpoison: fix unpoison_memory()
After recent soft-offline rework, error pages can be taken off from
buddy allocator, but the existing unpoison_memory() does not properly
undo the operation. Moreover, due to the recent change on
__get_hwpoison_page(), get_page_unless_zero() is hardly called for
hwpoisoned pages. So __get_hwpoison_page() highly likely returns -EBUSY
(meaning to fail to grab page refcount) and unpoison just clears
PG_hwpoison without releasing a refcount. That does not lead to a
critical issue like kernel panic, but unpoisoned pages never get back to
buddy (leaked permanently), which is not good.
To (partially) fix this, we need to identify "taken off" pages from
other types of hwpoisoned pages. We can't use refcount or page flags
for this purpose, so a pseudo flag is defined by hacking ->private
field. Someone might think that put_page() is enough to cancel
taken-off pages, but the normal free path contains some operations not
suitable for the current purpose, and can fire VM_BUG_ON().
Note that unpoison_memory() is now supposed to be cancel hwpoison events
injected only by madvise() or
/sys/devices/system/memory/{hard,soft}_offline_page, not by MCE
injection, so please don't try to use unpoison when testing with MCE
injection.
[lkp@intel.com: report build failure for ARCH=i386]
Link: https://lkml.kernel.org/r/20211115084006.3728254-4-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ding Hui <dinghui@sangfor.com.cn>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
c9fdc4d548 |
mm/hwpoison: remove MF_MSG_BUDDY_2ND and MF_MSG_POISONED_HUGE
These action_page_types are no longer used, so remove them. Link: https://lkml.kernel.org/r/20211115084006.3728254-3-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Yang Shi <shy828301@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ding Hui <dinghui@sangfor.com.cn> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
91d005479e |
mm/hwpoison: mf_mutex for soft offline and unpoison
Patch series "mm/hwpoison: fix unpoison_memory()", v4. The main purpose of this series is to sync unpoison code to recent changes around how hwpoison code takes page refcount. Unpoison should work or simply fail (without crash) if impossible. The recent works of keeping hwpoison pages in shmem pagecache introduce a new state of hwpoisoned pages, but unpoison for such pages is not supported yet with this series. It seems that soft-offline and unpoison can be used as general purpose page offline/online mechanism (not in the context of memory error). I think that we need some additional works to realize it because currently soft-offline and unpoison are assumed not to happen so frequently (print out too many messages for aggressive usecases). But anyway this could be another interesting next topic. v1: https://lore.kernel.org/linux-mm/20210614021212.223326-1-nao.horiguchi@gmail.com/ v2: https://lore.kernel.org/linux-mm/20211025230503.2650970-1-naoya.horiguchi@linux.dev/ v3: https://lore.kernel.org/linux-mm/20211105055058.3152564-1-naoya.horiguchi@linux.dev/ This patch (of 3): Originally mf_mutex is introduced to serialize multiple MCE events, but it is not that useful to allow unpoison to run in parallel with memory_failure() and soft offline. So apply mf_mutex to soft offline and unpoison. The memory failure handler and soft offline handler get simpler with this. Link: https://lkml.kernel.org/r/20211115084006.3728254-1-naoya.horiguchi@linux.dev Link: https://lkml.kernel.org/r/20211115084006.3728254-2-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ding Hui <dinghui@sangfor.com.cn> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
a760542666 |
mm: shmem: don't truncate page if memory failure happens
The current behavior of memory failure is to truncate the page cache regardless of dirty or clean. If the page is dirty the later access will get the obsolete data from disk without any notification to the users. This may cause silent data loss. It is even worse for shmem since shmem is in-memory filesystem, truncating page cache means discarding data blocks. The later read would return all zero. The right approach is to keep the corrupted page in page cache, any later access would return error for syscalls or SIGBUS for page fault, until the file is truncated, hole punched or removed. The regular storage backed filesystems would be more complicated so this patch is focused on shmem. This also unblock the support for soft offlining shmem THP. [akpm@linux-foundation.org: coding style fixes] [arnd@arndb.de: fix uninitialized variable use in me_pagecache_clean()] Link: https://lkml.kernel.org/r/20211022064748.4173718-1-arnd@kernel.org [Fix invalid pointer dereference in shmem_read_mapping_page_gfp() with a slight different implementation from what Ajay Garg <ajaygargnsit@gmail.com> and Muchun Song <songmuchun@bytedance.com> proposed and reworked the error handling of shmem_write_begin() suggested by Linus] Link: https://lore.kernel.org/linux-mm/20211111084617.6746-1-ajaygargnsit@gmail.com/ Link: https://lkml.kernel.org/r/20211020210755.23964-6-shy828301@gmail.com Link: https://lkml.kernel.org/r/20211116193247.21102-1-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ajay Garg <ajaygargnsit@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Andy Lavr <andy.lavr@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
bfed6efb8e |
- Add support for handling hw errors in SGX pages: poisoning, recovering
from poison memory and error injection into SGX pages - A bunch of changes to the SGX selftests to simplify and allow of SGX features testing without the need of a whole SGX software stack - Add a sysfs attribute which is supposed to show the amount of SGX memory in a NUMA node, similar to what /proc/meminfo is to normal memory - The usual bunch of fixes and cleanups too -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHcDQMACgkQEsHwGGHe VUq42xAAjWM0AFpIxgUBpbE0swV3ZMulnndl3/vA5XN+9Yn7Q52+AFyPRE0s7Zam Ap+cInh2Il7d/sv54rZ4x/j7+TH4i7s8fWPVU/XiPALQuOuw0/B1wJJ+jmMiPFiU 3jr7DkUPyWjWTHduMY/tk+xMOpkx1XsxJheYnKvsKVW+fjJ0vPuftAZtfu2z2VOh 3JLcp5cAXPxW0UK9gdoF5bCBQhBu0NRguTbhHhbByAixQO2GyVSKLSRovUdj0a+y QRrQ6hgcvpTOsVHJoWJ7yIX4SBzQTe9Bg6dT9DghOxE4Sc2GH89hu7wRztGawBJO nLyzWgiW9ttjQutDpBvZANNVcFAPAdtDWczrzZpREbrGKkzT+kOBnIIL1LWITWOy 2YWTO3ytW0KNIK85GzMjSVOKRMgaHJeBaGuYZ7Z0kb3GuUPJ9zRlaRxNapKQFuzA 0PGoA4IDT+2Afy7VYBBNUA2d/WverFQuXKusSxK6b5zJ173o5/DXL2q0d3gn/j8Z hhxJUJyVOsfRXSG4NKrj4se4FiA0n/RL4oyUZR9iJ8kWzzZTd0eZTAn468bpGIp5 yiOlPOLgsmu0xzVmAtG1+4d2+S2x+Ec5YE0sP1V/JLNciYk3Ebp7UyfnS3tn33Xc cpdWjELvD1LJVpMEURnbjRrwU6OiiAekYJCP/9lmK9zfOGpwRHc= =vFTM -----END PGP SIGNATURE----- Merge tag 'x86_sgx_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SGX updates from Borislav Petkov: - Add support for handling hw errors in SGX pages: poisoning, recovering from poison memory and error injection into SGX pages - A bunch of changes to the SGX selftests to simplify and allow of SGX features testing without the need of a whole SGX software stack - Add a sysfs attribute which is supposed to show the amount of SGX memory in a NUMA node, similar to what /proc/meminfo is to normal memory - The usual bunch of fixes and cleanups too * tag 'x86_sgx_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) x86/sgx: Fix NULL pointer dereference on non-SGX systems selftests/sgx: Fix corrupted cpuid macro invocation x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node x86/sgx: Fix minor documentation issues selftests/sgx: Add test for multiple TCS entry selftests/sgx: Enable multiple thread support selftests/sgx: Add page permission and exception test selftests/sgx: Rename test properties in preparation for more enclave tests selftests/sgx: Provide per-op parameter structs for the test enclave selftests/sgx: Add a new kselftest: Unclobbered_vdso_oversubscribed selftests/sgx: Move setup_test_encl() to each TEST_F() selftests/sgx: Encpsulate the test enclave creation selftests/sgx: Dump segments and /proc/self/maps only on failure selftests/sgx: Create a heap for the test enclave selftests/sgx: Make data measurement for an enclave segment optional selftests/sgx: Assign source for each segment selftests/sgx: Fix a benign linker warning x86/sgx: Add check for SGX pages to ghes_do_memory_failure() x86/sgx: Add hook to error injection address validation x86/sgx: Hook arch_memory_failure() into mainline code ... |
|
|
|
2a57d83c78 |
mm/hwpoison: clear MF_COUNT_INCREASED before retrying get_any_page()
Hulk Robot reported a panic in put_page_testzero() when testing
madvise() with MADV_SOFT_OFFLINE. The BUG() is triggered when retrying
get_any_page(). This is because we keep MF_COUNT_INCREASED flag in
second try but the refcnt is not increased.
page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:737!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 5 PID: 2135 Comm: sshd Tainted: G B 5.16.0-rc6-dirty #373
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: release_pages+0x53f/0x840
Call Trace:
free_pages_and_swap_cache+0x64/0x80
tlb_flush_mmu+0x6f/0x220
unmap_page_range+0xe6c/0x12c0
unmap_single_vma+0x90/0x170
unmap_vmas+0xc4/0x180
exit_mmap+0xde/0x3a0
mmput+0xa3/0x250
do_exit+0x564/0x1470
do_group_exit+0x3b/0x100
__do_sys_exit_group+0x13/0x20
__x64_sys_exit_group+0x16/0x20
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Modules linked in:
---[ end trace e99579b570fe0649 ]---
RIP: 0010:release_pages+0x53f/0x840
Link: https://lkml.kernel.org/r/20211221074908.3910286-1-liushixin2@huawei.com
Fixes:
|
|
|
|
e37e7b0b3b |
mm, hwpoison: fix condition in free hugetlb page path
When a memory error hits a tail page of a free hugepage,
__page_handle_poison() is expected to be called to isolate the error in
4kB unit, but it's not called due to the outdated if-condition in
memory_failure_hugetlb(). This loses the chance to isolate the error in
the finer unit, so it's not optimal. Drop the condition.
This "(p != head && TestSetPageHWPoison(head)" condition is based on the
old semantics of PageHWPoison on hugepage (where PG_hwpoison flag was
set on the subpage), so it's not necessray any more. By getting to set
PG_hwpoison on head page for hugepages, concurrent error events on
different subpages in a single hugepage can be prevented by
TestSetPageHWPoison(head) at the beginning of memory_failure_hugetlb().
So dropping the condition should not reopen the race window originally
mentioned in commit
|
|
|
|
03b122da74 |
x86/sgx: Hook arch_memory_failure() into mainline code
Add a call inside memory_failure() to call the arch specific code to check if the address is an SGX EPC page and handle it. Note the SGX EPC pages do not have a "struct page" entry, so the hook goes in at the same point as the device mapping hook. Pull the call to acquire the mutex earlier so the SGX errors are also protected. Make set_mce_nospec() skip SGX pages when trying to adjust the 1:1 map. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lkml.kernel.org/r/20211026220050.697075-6-tony.luck@intel.com |
|
|
|
d0b51bfb23 |
Revert "mm: shmem: don't truncate page if memory failure happens"
This reverts commit
|
|
|
|
512b7931ad |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "257 patches. Subsystems affected by this patch series: scripts, ocfs2, vfs, and mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache, gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools, memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm, vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram, cleanups, kfence, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits) mm/damon: remove return value from before_terminate callback mm/damon: fix a few spelling mistakes in comments and a pr_debug message mm/damon: simplify stop mechanism Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Docs/admin-guide/mm/damon/start: simplify the content Docs/admin-guide/mm/damon/start: fix a wrong link Docs/admin-guide/mm/damon/start: fix wrong example commands mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on mm/damon: remove unnecessary variable initialization Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) selftests/damon: support watermarks mm/damon/dbgfs: support watermarks mm/damon/schemes: activate schemes based on a watermarks mechanism tools/selftests/damon: update for regions prioritization of schemes mm/damon/dbgfs: support prioritization weights mm/damon/vaddr,paddr: support pageout prioritization mm/damon/schemes: prioritize regions within the quotas mm/damon/selftests: support schemes quotas mm/damon/dbgfs: support quotas of schemes ... |
|
|
|
4966455d91 |
mm: hwpoison: handle non-anonymous THP correctly
Currently hwpoison doesn't handle non-anonymous THP, but since v4.8 THP support for tmpfs and read-only file cache has been added. They could be offlined by split THP, just like anonymous THP. Link: https://lkml.kernel.org/r/20211020210755.23964-7-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
b9d02f1bdd |
mm: shmem: don't truncate page if memory failure happens
The current behavior of memory failure is to truncate the page cache regardless of dirty or clean. If the page is dirty the later access will get the obsolete data from disk without any notification to the users. This may cause silent data loss. It is even worse for shmem since shmem is in-memory filesystem, truncating page cache means discarding data blocks. The later read would return all zero. The right approach is to keep the corrupted page in page cache, any later access would return error for syscalls or SIGBUS for page fault, until the file is truncated, hole punched or removed. The regular storage backed filesystems would be more complicated so this patch is focused on shmem. This also unblock the support for soft offlining shmem THP. [arnd@arndb.de: fix uninitialized variable use in me_pagecache_clean()] Link: https://lkml.kernel.org/r/20211022064748.4173718-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20211020210755.23964-6-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
dd0f230a0a |
mm: hwpoison: refactor refcount check handling
Memory failure will report failure if the page still has extra pinned refcount other than from hwpoison after the handler is done. Actually the check is not necessary for all handlers, so move the check into specific handlers. This would make the following keeping shmem page in page cache patch easier. There may be expected extra pin for some cases, for example, when the page is dirty and in swapcache. Link: https://lkml.kernel.org/r/20211020210755.23964-5-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
ba9eb3cef9 |
mm/memory_failure: constify static mm_walk_ops
The only usage of hwp_walk_ops is to pass its address to walk_page_range() which takes a pointer to const mm_walk_ops as argument. Make it const to allow the compiler to put it in read-only memory. Link: https://lkml.kernel.org/r/20211014075042.17174-3-rikard.falkeborn@gmail.com Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
96c84dde36 |
mm: don't include <linux/dax.h> in <linux/mempolicy.h>
Not required at all, and having this causes a huge kernel rebuild as soon as something in dax.h changes. Link: https://lkml.kernel.org/r/20210921082253.1859794-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
0aaa58eca6 |
printk changes for 5.16
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmGBCBkACgkQUqAMR0iA lPLMdg/6Ag9V5Q6DPvbYe0WK8wfrrRL39Eic+K6wrYBVK/8rvMUy4Oee5tyOqCz7 z9GM+SivWRtEdEy8X/HzoawMQEuy3jLcaFoCNxHcScmc6R5Sd8otxPU5Lo8aZPLN Pulni9EprysI2zhLqq5m6o/F9pMOY0y8uKbD1mgIHEV9yoLan+CZ+vahf/eFwYQu NtYlMoK2KbS2mChGOZuLsthhyNxcCNFWWNwpBBQz7iJ9ZvnKCZ3EwG7Nx34Rx7ZE TYZ2iga3TTONsoCk0IClbA6zRIowgumKQl9aY9Oci1MXdIEug42i0GEl+p4iCkrH VhLyPsvJG6xyE6aCg/p2SB1vPasY+pp94VfTjFfmMulYdUHK7ipfZCR3ddxayR4B PEsITibo/hHYEVerMMSyVXttiPS7qFhIyZkNuX/xpCMLz8RSFjgU5QhR848A4scM r+qv1p7xkdBRvH3jlStrpLRnGtqOucvbNQgyvQiinm0yunpJN8FZgEsHnP60E5+j DLpQF/bK2h7PhE2Wy8/iINi49/dZiIldZ1gZV4BxjuJ5zwSLdiuR9aP51RK4IRhV qraLwU6yNv0k4v6sjXV78inQQ2vkqy/MBYMe3zqnpYbJB2DZYCbeRE62whrdEd4W wxHxiY7r9dR6gtJB52kGepbryd3JIMdI49oFRjvGi2shaXG1AZ0= =t12m -----END PGP SIGNATURE----- Merge tag 'printk-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Extend %pGp print format to print hex value of the page flags - Use kvmalloc instead of kmalloc to allocate devkmsg buffers - Misc cleanup and warning fixes * tag 'printk-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: vsprintf: Update %pGp documentation about that it prints hex value lib/vsprintf.c: Amend static asserts for format specifier flags vsprintf: Make %pGp print the hex value test_printf: Append strings more efficiently test_printf: Remove custom appending of '|' test_printf: Remove separate page_flags variable test_printf: Make pft array const ia64: don't do IA64_CMPXCHG_DEBUG without CONFIG_PRINTK printk: use gnu_printf format attribute for printk_sprint() printk: avoid -Wsometimes-uninitialized warning printk: use kvmalloc instead of kmalloc for devkmsg_user |
|
|
|
49f8275c7d |
Memory folios
Add memory folios, a new type to represent either order-0 pages or the head page of a compound page. This should be enough infrastructure to support filesystems converting from pages to folios. -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmF9uI0ACgkQDpNsjXcp gj7MUAf/R7LCZ+xFiIedw7SAgb/DGK0C9uVjuBEIZgAw21ZUw/GuPI6cuKBMFGGf rRcdtlvMpwi7yZJcoNXxaqU/xPaaJMjf2XxscIvYJP1mjlZVuwmP9dOx0neNvWOc T+8lqR6c1TLl82lpqIjGFLwvj2eVowq2d3J5jsaIJFd4odmmYVInrhJXOzC/LQ54 Niloj5ksehf+KUIRLDz7ycppvIHhlVsoAl0eM2dWBAtL0mvT7Nyn/3y+vnMfV2v3 Flb4opwJUgTJleYc16oxTn9svT2yS8q2uuUemRDLW8ABghoAtH3fUUk43RN+5Krd LYCtbeawtkikPVXZMfWybsx5vn0c3Q== =7SBe -----END PGP SIGNATURE----- Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache Pull memory folios from Matthew Wilcox: "Add memory folios, a new type to represent either order-0 pages or the head page of a compound page. This should be enough infrastructure to support filesystems converting from pages to folios. The point of all this churn is to allow filesystems and the page cache to manage memory in larger chunks than PAGE_SIZE. The original plan was to use compound pages like THP does, but I ran into problems with some functions expecting only a head page while others expect the precise page containing a particular byte. The folio type allows a function to declare that it's expecting only a head page. Almost incidentally, this allows us to remove various calls to VM_BUG_ON(PageTail(page)) and compound_head(). This converts just parts of the core MM and the page cache. For 5.17, we intend to convert various filesystems (XFS and AFS are ready; other filesystems may make it) and also convert more of the MM and page cache to folios. For 5.18, multi-page folios should be ready. The multi-page folios offer some improvement to some workloads. The 80% win is real, but appears to be an artificial benchmark (postgres startup, which isn't a serious workload). Real workloads (eg building the kernel, running postgres in a steady state, etc) seem to benefit between 0-10%. I haven't heard of any performance losses as a result of this series. Nobody has done any serious performance tuning; I imagine that tweaking the readahead algorithm could provide some more interesting wins. There are also other places where we could choose to create large folios and currently do not, such as writes that are larger than PAGE_SIZE. I'd like to thank all my reviewers who've offered review/ack tags: Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil Babka, William Kucharski, Yu Zhao and Zi Yan. I'd also like to thank those who gave feedback I incorporated but haven't offered up review tags for this part of the series: Nick Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard, Hugh Dickins, and probably a few others who I forget" * tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits) mm/writeback: Add folio_write_one mm/filemap: Add FGP_STABLE mm/filemap: Add filemap_get_folio mm/filemap: Convert mapping_get_entry to return a folio mm/filemap: Add filemap_add_folio() mm/filemap: Add filemap_alloc_folio mm/page_alloc: Add folio allocation functions mm/lru: Add folio_add_lru() mm/lru: Convert __pagevec_lru_add_fn to take a folio mm: Add folio_evictable() mm/workingset: Convert workingset_refault() to take a folio mm/filemap: Add readahead_folio() mm/filemap: Add folio_mkwrite_check_truncate() mm/filemap: Add i_blocks_per_folio() mm/writeback: Add folio_redirty_for_writepage() mm/writeback: Add folio_account_redirty() mm/writeback: Add folio_clear_dirty_for_io() mm/writeback: Add folio_cancel_dirty() mm/writeback: Add folio_account_cleaned() mm/writeback: Add filemap_dirty_folio() ... |
|
|
|
eac96c3efd |
mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
When handling shmem page fault the THP with corrupted subpage could be PMD mapped if certain conditions are satisfied. But kernel is supposed to send SIGBUS when trying to map hwpoisoned page. There are two paths which may do PMD map: fault around and regular fault. Before commit |
|
|
|
c7cb42e944 |
mm: hwpoison: remove the unnecessary THP check
When handling THP hwpoison checked if the THP is in allocation or free
stage since hwpoison may mistreat it as hugetlb page. After commit
|
|
|
|
23efd0804c |
vsprintf: Make %pGp print the hex value
All existing users of %pGp want the hex value as well as the decoded flag names. This looks awkward (passing the same parameter to printf twice), so move that functionality into the core. If we want, we can make that optional with flag arguments to %pGp in the future. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20211019142621.2810043-6-willy@infradead.org |
|
|
|
bbc6b703b2 |
mm/memcg: Convert mem_cgroup_uncharge() to take a folio
Convert all the callers to call page_folio(). Most of them were already using a head page, but a few of them I can't prove were, so this may actually fix a bug. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> |
|
|
|
5c91c0e77b |
mm/memory_failure: fix the missing pte_unmap() call
The paired pte_unmap() call is missing before the dev_pagemap_mapping_shift() returns. So fix it. David says: "I guess this code never runs on 32bit / highmem, that's why we didn't notice so far". [akpm@linux-foundation.org: cleanup] Link: https://lkml.kernel.org/r/20210923122642.4999-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
acfa299a4a |
mm, hwpoison: add is_free_buddy_page() in HWPoisonHandlable()
Commit |
|
|
|
14726903c8 |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits) mm/madvise: add MADV_WILLNEED to process_madvise() mm/vmstat: remove unneeded return value mm/vmstat: simplify the array size calculation mm/vmstat: correct some wrong comments mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() selftests: vm: add COW time test for KSM pages selftests: vm: add KSM merging time test mm: KSM: fix data type selftests: vm: add KSM merging across nodes test selftests: vm: add KSM zero page merging test selftests: vm: add KSM unmerge test selftests: vm: add KSM merge test mm/migrate: correct kernel-doc notation mm: wire up syscall process_mrelease mm: introduce process_mrelease system call memblock: make memblock_find_in_range method private mm/mempolicy.c: use in_task() in mempolicy_slab_node() mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies mm/mempolicy: advertise new MPOL_PREFERRED_MANY mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY ... |
|
|
|
5ac95884a7 |
mm/migrate: enable returning precise migrate_pages() success count
Under normal circumstances, migrate_pages() returns the number of pages migrated. In error conditions, it returns an error code. When returning an error code, there is no way to know how many pages were migrated or not migrated. Make migrate_pages() return how many pages are demoted successfully for all cases, including when encountering errors. Page reclaim behavior will depend on this in subsequent patches. Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter] Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
f87060d345 |
mm: fix panic caused by __page_handle_poison()
In commit |
|
|
|
941ca063eb |
mm: hwpoison: dump page for unhandlable page
Currently just very simple message is shown for unhandlable page, e.g. non-LRU page, like: soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 () It is not very helpful for further debug, calling dump_page() could show more useful information. Calling dump_page() in get_any_page() in order to not duplicate the call in a couple of different places. It may be called with pcp disabled and holding memory hotplug lock, it should be not a big deal since hwpoison handler is not called very often. [shy828301@gmail.com: remove redundant pr_info per Noaya Horiguchi] Link: https://lkml.kernel.org/r/20210824020946.195257-3-shy828301@gmail.com Link: https://lkml.kernel.org/r/20210819054116.266126-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Mackey <tdmackey@twitter.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
d0505e9f7d |
mm: hwpoison: don't drop slab caches for offlining non-LRU page
In the current implementation of soft offline, if non-LRU page is met,
all the slab caches will be dropped to free the page then offline. But
if the page is not slab page all the effort is wasted in vain. Even
though it is a slab page, it is not guaranteed the page could be freed
at all.
However the side effect and cost is quite high. It does not only drop
the slab caches, but also may drop a significant amount of page caches
which are associated with inode caches. It could make the most
workingset gone in order to just offline a page. And the offline is not
guaranteed to succeed at all, actually I really doubt the success rate
for real life workload.
Furthermore the worse consequence is the system may be locked up and
unusable since the page cache release may incur huge amount of works
queued for memcg release.
Actually we ran into such unpleasant case in our production environment.
Firstly, the workqueue of memory_failure_work_func is locked up as
below:
BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
in-flight: 409271:memory_failure_work_func
pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
workqueue mm_percpu_wq: flags=0x8
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
pending: vmstat_update
workqueue cgroup_destroy: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
pending: css_release_work_fn
There were over 12K css_release_work_fn queued, and this caused a few
lockups due to the contention of worker pool lock with IRQ disabled, for
example:
NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G L 5.10.32-t1.el7.twitter.x86_64 #1
Hardware name: TYAN F5AMT /z /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
Workqueue: events memory_failure_work_func
RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
FS: 0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
Call Trace:
__queue_work+0xd6/0x3c0
queue_work_on+0x1c/0x30
uncharge_batch+0x10e/0x110
mem_cgroup_uncharge_list+0x6d/0x80
release_pages+0x37f/0x3f0
__pagevec_release+0x1c/0x50
__invalidate_mapping_pages+0x348/0x380
inode_lru_isolate+0x10a/0x160
__list_lru_walk_one+0x7b/0x170
list_lru_walk_one+0x4a/0x60
prune_icache_sb+0x37/0x50
super_cache_scan+0x123/0x1a0
do_shrink_slab+0x10c/0x2c0
shrink_slab+0x1f1/0x290
drop_slab_node+0x4d/0x70
soft_offline_page+0x1ac/0x5b0
memory_failure_work_func+0x6a/0x90
process_one_work+0x19e/0x340
worker_thread+0x30/0x360
kthread+0x116/0x130
The lockup made the machine is quite unusable. And it also made the
most workingset gone, the reclaimabled slab caches were reduced from 12G
to 300MB, the page caches were decreased from 17G to 4G.
But the most disappointing thing is all the effort doesn't make the page
offline, it just returns:
soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
It seems the aggressive behavior for non-LRU page didn't pay back, so it
doesn't make too much sense to keep it considering the terrible side
effect.
Link: https://lkml.kernel.org/r/20210819054116.266126-1-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reported-by: David Mackey <tdmackey@twitter.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
a21c184fe2 |
mm/hwpoison: fix some obsolete comments
Since commit |
|
|
|
ed8c2f492d |
mm/hwpoison: change argument struct page **hpagep to *hpage
It's unnecessary to pass in a struct page **hpagep because it's never modified. Changing to use *hpage to simplify the code. Link: https://lkml.kernel.org/r/20210814105131.48814-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
ea3732f7a1 |
mm/hwpoison: fix potential pte_unmap_unlock pte error
If the first pte is equal to poisoned_pfn, i.e. check_hwpoisoned_entry() return 1, the wrong ptep - 1 would be passed to pte_unmap_unlock(). Link: https://lkml.kernel.org/r/20210814105131.48814-3-linmiaohe@huawei.com Fixes: ad9c59c24095 ("mm,hwpoison: send SIGBUS with error virutal address") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
ae611d072c |
mm/hwpoison: remove unneeded variable unmap_success
Patch series "Cleanups and fixup for hwpoison" This series contains cleanups to remove unneeded variable, fix some obsolete comments and so on. Also we fix potential pte_unmap_unlock on wrong pte. More details can be found in the respective changelogs. This patch (of 4): unmap_success is used to indicate whether page is successfully unmapped but it's irrelated with ZONE_DEVICE page and unmap_success is always true here. Remove this unneeded one. Link: https://lkml.kernel.org/r/20210814105131.48814-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210814105131.48814-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
aa99f3c2b9 |
\n
-----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmTZcACgkQnJ2qBz9k QNkkmAgArW6XoF1CePds/ZaC9vfg/nk66/zVo0n+J8xXjMWAPxcKbWFfV0uWVixq yk4lcLV47a2Mu/B/1oLNd3vrSmhwU+srWqNwOFn1nv+lP/6wJqr8oztRHn/0L9Q3 ZSRrukSejbQ6AvTL/WzTNnCjjCc2ne3Kyko6W41aU6uyJuzhSM32wbx7qlV6t54Z iint9OrB4gM0avLohNafTUq6I+tEGzBMNwpCG/tqCmkcvDcv3rTDVAnPSCTm0Tx2 hdrYDcY/rLxo93pDBaW1rYA/fohR+mIVye6k2TjkPAL6T1x+rxeT5qnc+YijH5yF sFPDhlD+ZsfOLi8stWXLOJ+8+gLODg== =pDBR -----END PGP SIGNATURE----- Merge tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fs hole punching vs cache filling race fixes from Jan Kara: "Fix races leading to possible data corruption or stale data exposure in multiple filesystems when hole punching races with operations such as readahead. This is the series I was sending for the last merge window but with your objection fixed - now filemap_fault() has been modified to take invalidate_lock only when we need to create new page in the page cache and / or bring it uptodate" * tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: filesystems/locking: fix Malformed table warning cifs: Fix race between hole punch and page fault ceph: Fix race between hole punch and page fault fuse: Convert to using invalidate_lock f2fs: Convert to using invalidate_lock zonefs: Convert to using invalidate_lock xfs: Convert double locking of MMAPLOCK to use VFS helpers xfs: Convert to use invalidate_lock xfs: Refactor xfs_isilocked() ext2: Convert to using invalidate_lock ext4: Convert to use mapping->invalidate_lock mm: Add functions to lock invalidate_lock for two mappings mm: Protect operations adding pages to page cache with invalidate_lock documentation: Sync file_operations members with reality mm: Fix comments mentioning i_mutex |
|
|
|
fcc00621d8 |
mm/hwpoison: retry with shake_page() for unhandlable pages
HWPoisonHandlable() sometimes returns false for typical user pages due
to races with average memory events like transfers over LRU lists. This
causes failures in hwpoison handling.
There's retry code for such a case but does not work because the retry
loop reaches the retry limit too quickly before the page settles down to
handlable state. Let get_any_page() call shake_page() to fix it.
[naoya.horiguchi@nec.com: get_any_page(): return -EIO when retry limit reached]
Link: https://lkml.kernel.org/r/20210819001958.2365157-1-naoya.horiguchi@linux.dev
Link: https://lkml.kernel.org/r/20210817053703.2267588-1-naoya.horiguchi@linux.dev
Fixes:
|
|
|
|
9608703e48 |
mm: Fix comments mentioning i_mutex
inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix comments still mentioning i_mutex. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Jan Kara <jack@suse.cz> |
|
|
|
041711ce7c |
mm: fix spelling mistakes
Fix some spelling mistakes in comments: each having differents usage ==> each has a different usage statments ==> statements adresses ==> addresses aggresive ==> aggressive datas ==> data posion ==> poison higer ==> higher precisly ==> precisely wont ==> won't We moves tha ==> We move the endianess ==> endianness Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
36af67370e |
mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC
TTU_SYNC prevents an unlikely race, when try_to_unmap() returns shortly before the page is accounted as unmapped. It is unlikely to coincide with hwpoisoning, but now that we have the flag, hwpoison_user_mappings() would do well to use it. Link: https://lkml.kernel.org/r/329c28ed-95df-9a2c-8893-b444d8a6d340@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
1fb08ac63b |
mm: rmap: make try_to_unmap() void function
Currently try_to_unmap() return bool value by checking page_mapcount(), however this may return false positive since page_mapcount() doesn't check all subpages of compound page. The total_mapcount() could be used instead, but its cost is higher since it traverses all subpages. Actually the most callers of try_to_unmap() don't care about the return value at all. So just need check if page is still mapped by page_mapped() when necessary. And page_mapped() does bail out early when it finds mapped subpage. Link: https://lkml.kernel.org/r/bb27e3fe-6036-b637-5086-272befbfe3da@google.com Suggested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
510d25c92e |
mm/hwpoison: disable pcp for page_handle_poison()
Recent changes by patch "mm/page_alloc: allow high-order pages to be stored on the per-cpu lists" makes kernels determine whether to use pcp by pcp_allowed_order(), which breaks soft-offline for hugetlb pages. Soft-offline dissolves a migration source page, then removes it from buddy free list, so it's assumed that any subpage of the soft-offlined hugepage are recognized as a buddy page just after returning from dissolve_free_huge_page(). pcp_allowed_order() returns true for hugetlb, so this assumption is no longer true. So disable pcp during dissolve_free_huge_page() and take_page_off_buddy() to prevent soft-offlined hugepages from linking to pcp lists. Soft-offline should not be common events so the impact on performance should be minimal. And I think that the optimization of Mel's patch could benefit to hugetlb so zone_pcp_disable() is called only in hwpoison context. Link: https://lkml.kernel.org/r/20210617092626.291006-1-nao.horiguchi@gmail.com Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
0ed950d1f2 |
mm,hwpoison: make get_hwpoison_page() call get_any_page()
__get_hwpoison_page() could fail to grab refcount by some race condition, so it's helpful if we can handle it by retrying. We already have retry logic, so make get_hwpoison_page() call get_any_page() when called from memory_failure(). As a result, get_hwpoison_page() can return negative values (i.e. error code), so some callers are also changed to handle error cases. soft_offline_page() does nothing for -EBUSY because that's enough and users in userspace can easily handle it. unpoison_memory() is also unchanged because it's broken and need thorough fixes (will be done later). Link: https://lkml.kernel.org/r/20210603233632.2964832-3-nao.horiguchi@gmail.com Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
a3f5d80ea4 |
mm,hwpoison: send SIGBUS with error virutal address
Now an action required MCE in already hwpoisoned address surely sends a SIGBUS to current process, but the SIGBUS doesn't convey error virtual address. That's not optimal for hwpoison-aware applications. To fix the issue, make memory_failure() call kill_accessing_process(), that does pagetable walk to find the error virtual address. It could find multiple virtual addresses for the same error page, and it seems hard to tell which virtual address is correct one. But that's rare and sending incorrect virtual address could be better than no address. So let's report the first found virtual address for now. [naoya.horiguchi@nec.com: fix walk_page_range() return] Link: https://lkml.kernel.org/r/20210603051055.GA244241@hori.linux.bs1.fc.nec.co.jp Link: https://lkml.kernel.org/r/20210521030156.2612074-4-nao.horiguchi@gmail.com Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Jue Wang <juew@google.com> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
ea6d063010 |
mm/hwpoison: do not lock page again when me_huge_page() successfully recovers
Currently me_huge_page() temporary unlocks page to perform some actions
then locks it again later. My testcase (which calls hard-offline on
some tail page in a hugetlb, then accesses the address of the hugetlb
range) showed that page allocation code detects this page lock on buddy
page and printed out "BUG: Bad page state" message.
check_new_page_bad() does not consider a page with __PG_HWPOISON as bad
page, so this flag works as kind of filter, but this filtering doesn't
work in this case because the "bad page" is not the actual hwpoisoned
page. So stop locking page again. Actions to be taken depend on the
page type of the error, so page unlocking should be done in ->action()
callbacks. So let's make it assumed and change all existing callbacks
that way.
Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com
Fixes: commit
|
|
|
|
47af12bae1 |
mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned
When memory_failure() is called with MF_ACTION_REQUIRED on the page that has already been hwpoisoned, memory_failure() could fail to send SIGBUS to the affected process, which results in infinite loop of MCEs. Currently memory_failure() returns 0 if it's called for already hwpoisoned page, then the caller, kill_me_maybe(), could return without sending SIGBUS to current process. An action required MCE is raised when the current process accesses to the broken memory, so no SIGBUS means that the current process continues to run and access to the error page again soon, so running into MCE loop. This issue can arise for example in the following scenarios: - Two or more threads access to the poisoned page concurrently. If local MCE is enabled, MCE handler independently handles the MCE events. So there's a race among MCE events, and the second or latter threads fall into the situation in question. - If there was a precedent memory error event and memory_failure() for the event failed to unmap the error page for some reason, the subsequent memory access to the error page triggers the MCE loop situation. To fix the issue, make memory_failure() return an error code when the error page has already been hwpoisoned. This allows memory error handler to control how it sends signals to userspace. And make sure that any process touching a hwpoisoned page should get a SIGBUS even in "already hwpoisoned" path of memory_failure() as is done in page fault path. Link: https://lkml.kernel.org/r/20210521030156.2612074-3-nao.horiguchi@gmail.com Signed-off-by: Aili Yao <yaoaili@kingsoft.com> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jue Wang <juew@google.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
171936ddaf |
mm/memory-failure: use a mutex to avoid memory_failure() races
Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5. I wrote this patchset to materialize what I think is the current allowable solution mentioned by the previous discussion [1]. I simply borrowed Tony's mutex patch and Aili's return code patch, then I queued another one to find error virtual address in the best effort manner. I know that this is not a perfect solution, but should work for some typical case. [1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/ This patch (of 2): There can be races when multiple CPUs consume poison from the same page. The first into memory_failure() atomically sets the HWPoison page flag and begins hunting for tasks that map this page. Eventually it invalidates those mappings and may send a SIGBUS to the affected tasks. But while all that work is going on, other CPUs see a "success" return code from memory_failure() and so they believe the error has been handled and continue executing. Fix by wrapping most of the internal parts of memory_failure() in a mutex. [akpm@linux-foundation.org: make mf_mutex local to memory_failure()] Link: https://lkml.kernel.org/r/20210521030156.2612074-1-nao.horiguchi@gmail.com Link: https://lkml.kernel.org/r/20210521030156.2612074-2-nao.horiguchi@gmail.com Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jue Wang <juew@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
e8675d291a |
mm/memory-failure: make sure wait for page writeback in memory_failure
Our syzkaller trigger the "BUG_ON(!list_empty(&inode->i_wb_list))" in
clear_inode:
kernel BUG at fs/inode.c:519!
Internal error: Oops - BUG: 0 [#1] SMP
Modules linked in:
Process syz-executor.0 (pid: 249, stack limit = 0x00000000a12409d7)
CPU: 1 PID: 249 Comm: syz-executor.0 Not tainted 4.19.95
Hardware name: linux,dummy-virt (DT)
pstate: 80000005 (Nzcv daif -PAN -UAO)
pc : clear_inode+0x280/0x2a8
lr : clear_inode+0x280/0x2a8
Call trace:
clear_inode+0x280/0x2a8
ext4_clear_inode+0x38/0xe8
ext4_free_inode+0x130/0xc68
ext4_evict_inode+0xb20/0xcb8
evict+0x1a8/0x3c0
iput+0x344/0x460
do_unlinkat+0x260/0x410
__arm64_sys_unlinkat+0x6c/0xc0
el0_svc_common+0xdc/0x3b0
el0_svc_handler+0xf8/0x160
el0_svc+0x10/0x218
Kernel panic - not syncing: Fatal exception
A crash dump of this problem show that someone called __munlock_pagevec
to clear page LRU without lock_page: do_mmap -> mmap_region -> do_munmap
-> munlock_vma_pages_range -> __munlock_pagevec.
As a result memory_failure will call identify_page_state without
wait_on_page_writeback. And after truncate_error_page clear the mapping
of this page. end_page_writeback won't call sb_clear_inode_writeback to
clear inode->i_wb_list. That will trigger BUG_ON in clear_inode!
Fix it by checking PageWriteback too to help determine should we skip
wait_on_page_writeback.
Link: https://lkml.kernel.org/r/20210604084705.3729204-1-yangerkun@huawei.com
Fixes:
|
|
|
|
25182f05ff |
mm,hwpoison: fix race with hugetlb page allocation
When hugetlb page fault (under overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following
race:
CPU0: CPU1:
gather_surplus_pages()
page = alloc_surplus_huge_page()
memory_failure_hugetlb()
get_hwpoison_page(page)
__get_hwpoison_page(page)
get_page_unless_zero(page)
zero = put_page_testzero(page)
VM_BUG_ON_PAGE(!zero, page)
enqueue_huge_page(h, page)
put_page(page)
__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is not enough because
there's a time window where compound pages have non-zero refcount during
hugetlb page initialization.
So make __get_hwpoison_page() check page status a bit more for hugetlb
pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags
under hugetlb_lock makes sure that the hugetlb page is not transitive.
It's notable that another new function, HWPoisonHandlable(), is helpful
to prevent a race against other transitive page states (like a generic
compound page just before PageHuge becomes true).
Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com
Fixes:
|
|
|
|
f0953a1bba |
mm: fix typos in comments
Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes. Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
4d75136be8 |
mm/memory-failure: unnecessary amount of unmapping
It appears that unmap_mapping_range() actually takes a 'size' as its third
argument rather than a location, the current calling fashion causes
unnecessary amount of unmapping to occur.
Link: https://lkml.kernel.org/r/20210420002821.2749748-1-jane.chu@oracle.com
Fixes:
|
|
|
|
34dc45be45 |
mm: fix memory_failure() handling of dax-namespace metadata
Given 'struct dev_pagemap' spans both data pages and metadata pages be
careful to consult the altmap if present to delineate metadata. In fact
the pfn_first() helper already identifies the first valid data pfn, so
export that helper for other code paths via pgmap_pfn_valid().
Other usage of get_dev_pagemap() are not a concern because those are
operating on known data pfns having been looked up by get_user_pages().
I.e. metadata pfns are never user mapped.
Link: https://lkml.kernel.org/r/161058501758.1840162.4239831989762604527.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes:
|
|
|
|
30c9cf4927 |
mm,hwpoison: send SIGBUS to PF_MCE_EARLY processes on action required events
When a memory uncorrected error is triggered by process who accessed the address with error, It's Action Required Case for only current process which triggered this; This Action Required case means Action optional to other process who share the same page. Usually killing current process will be sufficient, other processes sharing the same page will get be signaled when they really touch the poisoned page. But there is another scenario that other processes sharing the same page want to be signaled early with PF_MCE_EARLY set. In this case, we should get them into kill list and signal BUS_MCEERR_AO to them. So in this patch, task_early_kill will check current process if force_early is set, and if not current,the code will fallback to find_early_kill_thread() to check if there is PF_MCE_EARLY process who cares the error. In kill_proc(), BUS_MCEERR_AR is only send to current, other processes in kill list will be signaled with BUS_MCEERR_AO. Link: https://lkml.kernel.org/r/20210122132424.313c8f5f.yaoaili@kingsoft.com Signed-off-by: Aili Yao <yaoaili@kingsoft.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
dad4e5b390 |
mm: fix page reference leak in soft_offline_page()
The conversion to move pfn_to_online_page() internal to
soft_offline_page() missed that the get_user_pages() reference taken by
the madvise() path needs to be dropped when pfn_to_online_page() fails.
Note the direct sysfs-path to soft_offline_page() does not perform a
get_user_pages() lookup.
When soft_offline_page() is handed a pfn_valid() && !pfn_to_online_page()
pfn the kernel hangs at dax-device shutdown due to a leaked reference.
Link: https://lkml.kernel.org/r/161058501210.1840162.8108917599181157327.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes:
|
|
|
|
6696d2a6f3 |
mm,hwpoison: fix printing of page flags
Format %pG expects a lower case 'p' in order to print the flags.
Fix it.
Link: https://lkml.kernel.org/r/20210108085202.4506-1-osalvador@suse.de
Fixes:
|
|
|
|
3f4b815a43 |
mm,hwpoison: return -EBUSY when migration fails
Currently, we return -EIO when we fail to migrate the page. Migrations' failures are rather transient as they can happen due to several reasons, e.g: high page refcount bump, mapping->migrate_page failing etc. All meaning that at that time the page could not be migrated, but that has nothing to do with an EIO error. Let us return -EBUSY instead, as we do in case we failed to isolate the page. While are it, let us remove the "ret" print as its value does not change. Link: https://lkml.kernel.org/r/20201209092818.30417-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
1e8aaedb18 |
mm,memory_failure: always pin the page in madvise_inject_error
madvise_inject_error() uses get_user_pages_fast to translate the address
we specified to a page. After [1], we drop the extra reference count for
memory_failure() path. That commit says that memory_failure wanted to
keep the pin in order to take the page out of circulation.
The truth is that we need to keep the page pinned, otherwise the page
might be re-used after the put_page() and we can end up messing with
someone else's memory.
E.g:
CPU0
process X CPU1
madvise_inject_error
get_user_pages
put_page
page gets reclaimed
process Y allocates the page
memory_failure
// We mess with process Y memory
madvise() is meant to operate on a self address space, so messing with
pages that do not belong to us seems the wrong thing to do.
To avoid that, let us keep the page pinned for memory_failure as well.
Pages for DAX mappings will release this extra refcount in
memory_failure_dev_pagemap.
[1] ("23e7b5c2e271: mm, madvise_inject_error:
Let memory_failure() optionally take a page reference")
Link: https://lkml.kernel.org/r/20201207094818.8518-1-osalvador@suse.de
Fixes:
|
|
|
|
47e431f43b |
mm,hwpoison: remove drain_all_pages from shake_page
get_hwpoison_page already drains pcplists, previously disabling them when trying to grab a refcount. We do not need shake_page to take care of it anymore. Link: https://lkml.kernel.org/r/20201204102558.31607-4-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Qian Cai <qcai@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
2f7141600d |
mm,hwpoison: disable pcplists before grabbing a refcount
Currently, we have a sort of retry mechanism to make sure pages in pcp-lists are spilled to the buddy system, so we can handle those. We can save us this extra checks with the new disable-pcplist mechanism that is available with [1]. zone_pcplist_disable makes sure to 1) disable pcplists, so any page that is freed up from that point onwards will end up in the buddy system and 2) drain pcplists, so those pages that already in pcplists are spilled to buddy. With that, we can make a common entry point for grabbing a refcount from both soft_offline and memory_failure paths that is guarded by zone_pcplist_disable/zone_pcplist_enable. [1] https://patchwork.kernel.org/project/linux-mm/cover/20201111092812.11329-1-vbabka@suse.cz/ Link: https://lkml.kernel.org/r/20201204102558.31607-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Qian Cai <qcai@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
8295d535e2 |
mm,hwpoison: refactor get_any_page
Patch series "HWPoison: Refactor get page interface", v2. This patch (of 3): When we want to grab a refcount via get_any_page, we call __get_any_page that calls get_hwpoison_page to get the actual refcount. get_any_page() is only there because we have a sort of retry mechanism in case the page we met is unknown to us or if we raced with an allocation. Also __get_any_page() prints some messages about the page type in case the page was a free page or the page type was unknown, but if anything, we only need to print a message in case the pagetype was unknown, as that is reporting an error down the chain. Let us merge get_any_page() and __get_any_page(), and let the message be printed in soft_offline_page. While we are it, we can also remove the 'pfn' parameter as it is no longer used. Link: https://lkml.kernel.org/r/20201204102558.31607-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20201204102558.31607-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Vlastimil Babka <Vbabka@suse.cz> Cc: Qian Cai <qcai@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
a8b2c2ce89 |
mm,hwpoison: take free pages off the buddy freelists
The crux of the matter is that historically we left poisoned pages in the
buddy system because we have some checks in place when allocating a page
that are gatekeeper for poisoned pages. Unfortunately, we do have other
users (e.g: compaction [1]) that scan buddy freelists and try to get a
page from there without checking whether the page is HWPoison.
As I stated already, I think it is fundamentally wrong to keep HWPoison
pages within the buddy systems, checks in place or not.
Let us fix this the same way we did for soft_offline [2], taking the page
off the buddy freelist so it is completely unreachable.
Note that this is fairly simple to trigger, as we only need to poison free
buddy pages (madvise MADV_HWPOISON) and then run some sort of memory
stress system.
Just for a matter of reference, I put a dump_page() in compaction_alloc()
to trigger for HWPoison patches:
page:0000000012b2982b refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1d5db
flags: 0xfffffc0800000(hwpoison)
raw: 000fffffc0800000 ffffea00007573c8 ffffc90000857de0 0000000000000000
raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: compaction_alloc
CPU: 4 PID: 123 Comm: kcompactd0 Tainted: G E 5.9.0-rc2-mm1-1-default+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0x6d/0x8b
compaction_alloc+0xb2/0xc0
migrate_pages+0x2a6/0x12a0
compact_zone+0x5eb/0x11c0
proactive_compact_node+0x89/0xf0
kcompactd+0x2d0/0x3a0
kthread+0x118/0x130
ret_from_fork+0x22/0x30
After that, if e.g: a process faults in the page, it will get killed
unexpectedly.
Fix it by containing the page immediatelly.
Besides that, two more changes can be noticed:
* MF_DELAYED no longer suits as we are fixing the issue by containing
the page immediately, so it does no longer rely on the allocation-time
checks to stop HWPoison to be handed over.
gain unless it is unpoisoned, so we fixed the situation.
Because of that, let us use MF_RECOVERED from now on.
* The second block that handles PageBuddy pages is no longer needed:
We call shake_page and then check whether the page is Buddy
because shake_page calls drain_all_pages, which sends pcp-pages back to
the buddy freelists, so we could have a chance to handle free pages.
Currently, get_hwpoison_page already calls drain_all_pages, and we call
get_hwpoison_page right before coming here, so we should be on the safe
side.
[1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
[2] https://patchwork.kernel.org/cover/11792607/
[osalvador@suse.de: take the poisoned subpage off the buddy frelists]
Link: https://lkml.kernel.org/r/20201013144447.6706-4-osalvador@suse.de
Link: https://lkml.kernel.org/r/20201013144447.6706-3-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
17e395b60f |
mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page
Patch series "HWpoison: further fixes and cleanups", v5. This patchset includes some more fixes and a cleanup. Patch#2 and patch#3 are both fixes for taking a HWpoison page off a buddy freelist, since having them there has proved to be bad (see [1] and pathch#2's commit log). Patch#3 does the same for hugetlb pages. [1] https://lkml.org/lkml/2020/9/22/565 This patch (of 4): A page with 0-refcount and !PageBuddy could perfectly be a pcppage. Currently, we bail out with an error if we encounter such a page, meaning that we do not handle pcppages neither from hard-offline nor from soft-offline path. Fix this by draining pcplists whenever we find this kind of page and retry the check again. It might be that pcplists have been spilled into the buddy allocator and so we can handle it. Link: https://lkml.kernel.org/r/20201013144447.6706-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20201013144447.6706-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
013339df11 |
mm/rmap: always do TTU_IGNORE_ACCESS
Since commit |
|
|
|
336bf30eb7 |
hugetlbfs: fix anon huge page migration race
Qian Cai reported the following BUG in [1]
LTP: starting move_pages12
BUG: unable to handle page fault for address: ffffffffffffffe0
...
RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
Call Trace:
rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
migrate_pages+0x1005/0x1fb0
move_pages_and_store_status.isra.47+0xd7/0x1a0
__x64_sys_move_pages+0xa5c/0x1100
do_syscall_64+0x5f/0x310
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Hugh Dickins diagnosed this as a migration bug caused by code introduced
to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the
routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for
anon pages as the anon_vma_lock should be held in this case. Further
analysis suggested that i_mmap_rwsem was not required to he held at all
when calling try_to_unmap for anon pages as an anon page could never be
part of a shared pmd mapping.
Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
to drop page lock and acquire i_mmap_rwsem is wrong. There is no way to
keep mapping valid while dropping page lock.
This patch does the following:
- Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
calling try_to_unmap.
- Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
will now simply do a 'trylock' while still holding the page lock. If
the trylock fails, it will return NULL. This could impact the
callers:
- migration calling code will receive -EAGAIN and retry up to the
hard coded limit (10).
- memory error code will treat the page as BUSY. This will force
killing (SIGKILL) instead of SIGBUS any mapping tasks.
Do note that this change in behavior only happens when there is a
race. None of the standard kernel testing suites actually hit this
race, but it is possible.
[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
[2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/
Fixes:
|
|
|
|
5460875999 |
mm/memory-failure: remove a wrapper for alloc_migration_target()
There is a well-defined standard migration target callback. Use it directly. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/1594622517-20681-9-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
b94e02822d |
mm,hwpoison: try to narrow window race for free pages
Aristeu Rozanski reported that a customer test case started to report -EBUSY after the hwpoison rework patchset. There is a race window between spotting a free page and taking it off its buddy freelist, so it might be that by the time we try to take it off, the page has been already allocated. This patch tries to handle such race window by trying to handle the new type of page again if the page was allocated under us. Reported-by: Aristeu Rozanski <aris@ruivo.org> Signed-off-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Aristeu Rozanski <aris@ruivo.org> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200922135650.1634-15-osalvador@suse.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
1f2481ddbe |
mm,hwpoison: double-check page count in __get_any_page()
Soft offlining could fail with EIO due to the race condition with hugepage migration. This issuse became visible due to the change by previous patch that makes soft offline handler take page refcount by its own. We have no way to directly pin zero refcount page, and the page considered as a zero refcount page could be allocated just after the first check. This patch adds the second check to find the race and gives us chance to handle it more reliably. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Aristeu Rozanski <aris@ruivo.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200922135650.1634-14-osalvador@suse.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
5d1fd5dc87 |
mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
memory_failure() is supposed to call action_result() when it handles a memory error event, but there's one missing case. So let's add it. I find that include/ras/ras_event.h has some other MF_MSG_* undefined, so this patch also adds them. Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Aristeu Rozanski <aris@ruivo.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200922135650.1634-13-osalvador@suse.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
5a2ffca3c2 |
mm,hwpoison: return 0 if the page is already poisoned in soft-offline
Currently, there is an inconsistency when calling soft-offline from
different paths on a page that is already poisoned.
1) madvise:
madvise_inject_error skips any poisoned page and continues
the loop.
If that was the only page to madvise, it returns 0.
2) /sys/devices/system/memory/:
When calling soft_offline_page_store()->soft_offline_page(),
we return -EBUSY in case the page is already poisoned.
This is inconsistent with a) the above example and b)
memory_failure, where we return 0 if the page was poisoned.
Fix this by dropping the PageHWPoison() check in madvise_inject_error, and
let soft_offline_page return 0 if it finds the page already poisoned.
Please, note that this represents a user-api change, since now the return
error when calling soft_offline_page_store()->soft_offline_page() will be
different.
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-12-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
6b9a217eda |
mm,hwpoison: refactor soft_offline_huge_page and __soft_offline_page
Merging soft_offline_huge_page and __soft_offline_page let us get rid of quite some duplicated code, and makes the code much easier to follow. Now, __soft_offline_page will handle both normal and hugetlb pages. Signed-off-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Aristeu Rozanski <aris@ruivo.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200922135650.1634-11-osalvador@suse.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
79f5f8fab4 |
mm,hwpoison: rework soft offline for in-use pages
This patch changes the way we set and handle in-use poisoned pages. Until
now, poisoned pages were released to the buddy allocator, trusting that
the checks that take place at allocation time would act as a safe net and
would skip that page.
This has proved to be wrong, as we got some pfn walkers out there, like
compaction, that all they care is the page to be in a buddy freelist.
Although this might not be the only user, having poisoned pages in the
buddy allocator seems a bad idea as we should only have free pages that
are ready and meant to be used as such.
Before explaining the taken approach, let us break down the kind of pages
we can soft offline.
- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalited)
* Normal pages (order-0 and anon-THP)
- If they are clean and unmapped page cache pages, we invalidate
then by means of invalidate_inode_page().
- If they are mapped/dirty, we do the isolate-and-migrate dance.
Either way, do not call put_page directly from those paths. Instead, we
keep the page and send it to page_handle_poison to perform the right
handling.
page_handle_poison sets the HWPoison flag and does the last put_page.
Down the chain, we placed a check for HWPoison page in
free_pages_prepare, that just skips any poisoned page, so those pages
do not end up in any pcplist/freelist.
After that, we set the refcount on the page to 1 and we increment
the poisoned pages counter.
If we see that the check in free_pages_prepare creates trouble, we can
always do what we do for free pages:
- wait until the page hits buddy's freelists
- take it off, and flag it
The downside of the above approach is that we could race with an
allocation, so by the time we want to take the page off the buddy, the
page has been already allocated so we cannot soft offline it.
But the user could always retry it.
* Hugetlb pages
- We isolate-and-migrate them
After the migration has been successful, we call dissolve_free_huge_page,
and we set HWPoison on the page if we succeed.
Hugetlb has a slightly different handling though.
While for non-hugetlb pages we cared about closing the race with an
allocation, doing so for hugetlb pages requires quite some additional
and intrusive code (we would need to hook in free_huge_page and some other
places).
So I decided to not make the code overly complicated and just fail
normally if the page we allocated in the meantime.
We can always build on top of this.
As a bonus, because of the way we handle now in-use pages, we no longer
need the put-as-isolation-migratetype dance, that was guarding for poisoned
pages to end up in pcplists.
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Aristeu Rozanski <aris@ruivo.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
06be6ff3d2 |
mm,hwpoison: rework soft offline for free pages
When trying to soft-offline a free page, we need to first take it off the buddy allocator. Once we know is out of reach, we can safely flag it as poisoned. take_page_off_buddy will be used to take a page meant to be poisoned off the buddy allocator. take_page_off_buddy calls break_down_buddy_pages, which splits a higher-order page in case our page belongs to one. Once the page is under our control, we call page_handle_poison to set it as poisoned and grab a refcount on it. Signed-off-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Aristeu Rozanski <aris@ruivo.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200922135650.1634-9-osalvador@suse.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
694bf0b0cd |
mm,hwpoison: unify THP handling for hard and soft offline
Place the THP's page handling in a helper and use it from both hard and soft-offline machinery, so we get rid of some duplicated code. Signed-off-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Aristeu Rozanski <aris@ruivo.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200922135650.1634-8-osalvador@suse.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
dd6e2402fa |
mm,hwpoison: kill put_hwpoison_page
After commit
|
|
|
|
7e27f22c9e |
mm,hwpoison: unexport get_hwpoison_page and make it static
Since get_hwpoison_page is only used in memory-failure code now, let us un-export it and make it private to that code. Signed-off-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Aristeu Rozanski <aris@ruivo.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200922135650.1634-5-osalvador@suse.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
1b473becde |
mm, hwpoison: remove recalculating hpage
hpage is never used after try_to_split_thp_page() in memory_failure(), so we don't have to update hpage. So let's not recalculate/use hpage. Suggested-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Aristeu Rozanski <aris@ruivo.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200922135650.1634-3-osalvador@suse.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
7d9d46ac87 |
mm,hwpoison: cleanup unused PageHuge() check
Patch series "HWPOISON: soft offline rework", v7. This patchset fixes a couple of issues that the patchset Naoya sent [1] contained due to rebasing problems and a misunterdansting. Main focus of this series is to stabilize soft offline. Historically soft offlined pages have suffered from racy conditions because PageHWPoison is used to a little too aggressively, which (directly or indirectly) invades other mm code which cares little about hwpoison. This results in unexpected behavior or kernel panic, which is very far from soft offline's "do not disturb userspace or other kernel component" policy. An example of this can be found here [2]. Along with several cleanups, this code refactors and changes the way soft offline work. Main point of this change set is to contain target page "via buddy allocator" or in migrating path. For ther former we first free the target page as we do for normal pages, and once it has reached buddy and it has been taken off the freelists, we flag it as HWpoison. For the latter we never get to release the page in unmap_and_move, so the page is under our control and we can handle it in hwpoison code. [1] https://patchwork.kernel.org/cover/11704083/ [2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u This patch (of 14): Drop the PageHuge check, which is dead code since memory_failure() forks into memory_failure_hugetlb() for hugetlb pages. memory_failure() and memory_failure_hugetlb() shares some functions like hwpoison_user_mappings() and identify_page_state(), so they should properly handle 4kB page, thp, and hugetlb. Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Qian Cai <cai@lca.pw> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Aristeu Rozanski <aris@ruivo.org> Cc: Oscar Salvador <osalvador@suse.com> Link: https://lkml.kernel.org/r/20200922135650.1634-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20200922135650.1634-2-osalvador@suse.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
2c3125977e |
mm/memory-failure.c: remove unused macro `writeback'
Unlike others we don't use the marco writeback. so let's remove it to tame gcc warning: mm/memory-failure.c:827: warning: macro "writeback" is not used [-Wunused-macros] Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Link: https://lkml.kernel.org/r/1599715096-20369-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
c43bc03d0a |
mm/memory-failure: do pgoff calculation before for_each_process()
There is no need to calculate pgoff in each loop of for_each_process(), so move it to the place before for_each_process(), which can save some CPU cycles. Signed-off-by: Xianting Tian <tian.xianting@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Link: http://lkml.kernel.org/r/20200818082647.34322-1-tian.xianting@h3c.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
f56753ac2a |
bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
Replace the two negative flags that are always used together with a single positive flag that indicates the writeback capability instead of two related non-capabilities. Also remove the pointless wrappers to just check the flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
|
|
19fc7bed25 |
mm/migrate: introduce a standard migration target allocation function
There are some similar functions for migration target allocation. Since there is no fundamental difference, it's better to keep just one rather than keeping all variants. This patch implements base migration target allocation function. In the following patches, variants will be converted to use this function. Changes should be mechanical, but, unfortunately, there are some differences. First, some callers' nodemask is assgined to NULL since NULL nodemask will be considered as all available nodes, that is, &node_states[N_MEMORY]. Second, for hugetlb page allocation, gfp_mask is redefined as regular hugetlb allocation gfp_mask plus __GFP_THISNODE if user provided gfp_mask has it. This is because future caller of this function requires to set this node constaint. Lastly, if provided nodeid is NUMA_NO_NODE, nodeid is set up to the node where migration source lives. It helps to remove simple wrappers for setting up the nodeid. Note that PageHighmem() call in previous function is changed to open-code "is_highmem_idx()" since it provides more readability. [akpm@linux-foundation.org: tweak patch title, per Vlastimil] [akpm@linux-foundation.org: fix typo in comment] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
03151c6e0b |
mm/memory-failure: send SIGBUS(BUS_MCEERR_AR) only to current thread
Action Required memory error should happen only when a processor is
about to access to a corrupted memory, so it's synchronous and only
affects current process/thread.
Recently commit
|
|
|
|
4e018b450a |
mm/memory-failure: prioritize prctl(PR_MCE_KILL) over vm.memory_failure_early_kill
Patch series "hwpoison: fixes signaling on memory error" This is a small patchset to solve issues in memory error handler to send SIGBUS to proper process/thread as expected in configuration. Please see descriptions in individual patches for more details. This patch (of 2): Early-kill policy is controlled from two types of settings, one is per-process setting prctl(PR_MCE_KILL) and the other is system-wide setting vm.memory_failure_early_kill. Users expect per-process setting to override system-wide setting as many other settings do, but early-kill setting doesn't work as such. For example, if a system configures vm.memory_failure_early_kill to 1 (enabled), a process receives SIGBUS even if it's configured to explicitly disable PF_MCE_KILL by prctl(). That's not desirable for applications with their own policies. This patch is suggesting to change the priority of these two types of settings, by checking sysctl_memory_failure_early_kill only when a given process has the default kill policy. Note that this patch is solving a thread choice issue too. Originally, collect_procs() always chooses the main thread when vm.memory_failure_early_kill is 1, even if the process has a dedicated thread for memory error handling. SIGBUS should be sent to the dedicated thread if early-kill is enabled via vm.memory_failure_early_kill as we are doing for PR_MCE_KILL_EARLY processes. Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/1591321039-22141-1-git-send-email-naoya.horiguchi@nec.com Link: http://lkml.kernel.org/r/1591321039-22141-2-git-send-email-naoya.horiguchi@nec.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
118d6e9829 |
ACPI updates for 5.8-rc1
- Update the ACPICA code in the kernel to upstream revision
20200430:
* Move acpi_gbl_next_cmd_num definition (Erik Kaneda).
* Ignore AE_ALREADY_EXISTS status in the disassembler when parsing
create operators (Erik Kaneda).
* Add status checks to the dispatcher (Erik Kaneda).
* Fix required parameters for _NIG and _NIH (Erik Kaneda).
* Make acpi_protocol_lengths static (Yue Haibing).
- Fix ACPI table reference counting errors in several places, mostly
in error code paths (Hanjun Guo).
- Extend the Generic Event Device (GED) driver to support _Exx and
_Lxx handler methods (Ard Biesheuvel).
- Add new acpi_evaluate_reg() helper and modify the ACPI PCI hotplug
code to use it (Hans de Goede).
- Add new DPTF battery participant driver and make the DPFT power
participant driver create more sysfs device attributes (Srinivas
Pandruvada).
- Improve the handling of memory failures in APEI (James Morse).
- Add new blacklist entry for Acer TravelMate 5735Z to the backlight
driver (Paul Menzel).
- Add i2c address for thermal control to the PMIC driver (Mauro
Carvalho Chehab).
- Allow the ACPI processor idle driver to work on platforms with
only one ACPI C-state present (Zhang Rui).
- Fix kobject reference count leaks in error code paths in two
places (Qiushi Wu).
- Delete unused proc filename macros and make some symbols static
(Pascal Terjan, Zheng Zengkai, Zou Wei).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl7VHb8SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxVboQAIjYda2RhQANIlIvoEa+Qd2/FBd3HXgU
Mv0LZ6y1xxxEZYeKne7zja1hzt5WetuZ1hZHGfg8YkXyrLqZGxfCIFbbhSA90BGG
PGzFerGmOBNzB3I9SN6iQY7vSqoFHvQEV1PVh24d+aHWZqj2lnaRRq+GT54qbRLX
/U3Hy5glFl8A/DCBP4cpoEjDr4IJHY68DathkDK2Ep2ybXV6B401uuqx8Su/OBd/
MQmJTYI1UK/RYBXfdzS9TIZahnkxBbU1cnLFy08Ve2mawl5YsHPEbvm77a0yX2M6
sOAerpgyzYNivAuOLpNIwhUZjpOY66nQuKAQaEl2cfRUkqt4nbmq7yDoH3d2MJLC
/Ccz955rV2YyD1DtyV+PyT+HB+/EVwH/+UCZ+gsSbdHvOiwdFU6VaTc2eI1qq8K9
4m5eEZFrAMPlvTzj/xVxr2Hfw1lbm23J5B5n7sM5HzYbT6MUWRQpvfV4zM3jTGz0
rQd8JmcHVvZk/MV1mGrYHrN5TnGTLWpbS4Yv1lAQa6FP0N0NxzVud7KRfLKnCnJ1
vh5yzW2fCYmVulJpuqxJDfXSqNV7n40CFrIewSp6nJRQXnWpImqHwwiA8fl51+hC
fBL72Ey08EHGFnnNQqbebvNglsodRWJddBy43ppnMHtuLBA/2GVKYf2GihPbpEBq
NHtX+Rd3vlWW
=xH3i
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI updates from Rafael Wysocki:
"These update the ACPICA code in the kernel to upstream revision
20200430, fix several reference counting errors related to ACPI
tables, add _Exx / _Lxx support to the GED driver, add a new
acpi_evaluate_reg() helper, add new DPTF battery participant driver
and extend the DPFT power participant driver, improve the handling of
memory failures in the APEI code, add a blacklist entry to the
backlight driver, update the PMIC driver and the processor idle
driver, fix two kobject reference count leaks, and make a few janitory
changes.
Specifics:
- Update the ACPICA code in the kernel to upstream revision 20200430:
- Move acpi_gbl_next_cmd_num definition (Erik Kaneda).
- Ignore AE_ALREADY_EXISTS status in the disassembler when parsing
create operators (Erik Kaneda).
- Add status checks to the dispatcher (Erik Kaneda).
- Fix required parameters for _NIG and _NIH (Erik Kaneda).
- Make acpi_protocol_lengths static (Yue Haibing).
- Fix ACPI table reference counting errors in several places, mostly
in error code paths (Hanjun Guo).
- Extend the Generic Event Device (GED) driver to support _Exx and
_Lxx handler methods (Ard Biesheuvel).
- Add new acpi_evaluate_reg() helper and modify the ACPI PCI hotplug
code to use it (Hans de Goede).
- Add new DPTF battery participant driver and make the DPFT power
participant driver create more sysfs device attributes (Srinivas
Pandruvada).
- Improve the handling of memory failures in APEI (James Morse).
- Add new blacklist entry for Acer TravelMate 5735Z to the backlight
driver (Paul Menzel).
- Add i2c address for thermal control to the PMIC driver (Mauro
Carvalho Chehab).
- Allow the ACPI processor idle driver to work on platforms with only
one ACPI C-state present (Zhang Rui).
- Fix kobject reference count leaks in error code paths in two places
(Qiushi Wu).
- Delete unused proc filename macros and make some symbols static
(Pascal Terjan, Zheng Zengkai, Zou Wei)"
* tag 'acpi-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (32 commits)
ACPI: CPPC: Fix reference count leak in acpi_cppc_processor_probe()
ACPI: sysfs: Fix reference count leak in acpi_sysfs_add_hotplug_profile()
ACPI: GED: use correct trigger type field in _Exx / _Lxx handling
ACPI: DPTF: Add battery participant driver
ACPI: DPTF: Additional sysfs attributes for power participant driver
ACPI: video: Use native backlight on Acer TravelMate 5735Z
arm64: acpi: Make apei_claim_sea() synchronise with APEI's irq work
ACPI: APEI: Kick the memory_failure() queue for synchronous errors
mm/memory-failure: Add memory_failure_queue_kick()
ACPI / PMIC: Add i2c address for thermal control
ACPI: GED: add support for _Exx / _Lxx handler methods
ACPI: Delete unused proc filename macros
ACPI: hotplug: PCI: Use the new acpi_evaluate_reg() helper
ACPI: utils: Add acpi_evaluate_reg() helper
ACPI: debug: Make two functions static
ACPI: sleep: Put the FACS table after using it
ACPI: scan: Put SPCR and STAO table after using it
ACPI: EC: Put the ACPI table after using it
ACPI: APEI: Put the HEST table for error path
ACPI: APEI: Put the error record serialization table for error path
...
|
|
|
|
872e9a205c |
mm, memory_failure: don't send BUS_MCEERR_AO for action required error
Some processes dont't want to be killed early, but in "Action Required" case, those also may be killed by BUS_MCEERR_AO when sharing memory with other which is accessing the fail memory. And sending SIGBUS with BUS_MCEERR_AO for action required error is strange, so ignore the non-current processes here. Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/1590817116-21281-1-git-send-email-wetp.zy@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
062022315e |
mm/memory-failure: Add memory_failure_queue_kick()
The GHES code calls memory_failure_queue() from IRQ context to schedule work on the current CPU so that memory_failure() can sleep. For synchronous memory errors the arch code needs to know any signals that memory_failure() will trigger are pending before it returns to user-space, possibly when exiting from the IRQ. Add a helper to kick the memory failure queue, to ensure the scheduled work has happened. This has to be called from process context, so may have been migrated from the original cpu. Pass the cpu the work was queued on. Change memory_failure_work_func() to permit being called on the 'wrong' cpu. Signed-off-by: James Morse <james.morse@arm.com> Tested-by: Tyler Baicar <baicar@os.amperecomputing.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
|
|
|
9de4f22a60 |
mm: code cleanup for MADV_FREE
Some comments for MADV_FREE is revised and added to help people understand the MADV_FREE code, especially the page flag, PG_swapbacked. This makes page_is_file_cache() isn't consistent with its comments. So the function is renamed to page_is_file_lru() to make them consistent again. All these are put in one patch as one logical change. Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
c0d0381ade |
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2. While discussing the issue with huge_pte_offset [1], I remembered that there were more outstanding hugetlb races. These issues are: 1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become invalid via a call to huge_pmd_unshare by another thread. 2) hugetlbfs page faults can race with truncation causing invalid global reserve counts and state. A previous attempt was made to use i_mmap_rwsem in this manner as described at [2]. However, those patches were reverted starting with [3] due to locking issues. To effectively use i_mmap_rwsem to address the above issues it needs to be held (in read mode) during page fault processing. However, during fault processing we need to lock the page we will be adding. Lock ordering requires we take page lock before i_mmap_rwsem. Waiting until after taking the page lock is too late in the fault process for the synchronization we want to do. To address this lock ordering issue, the following patches change the lock ordering for hugetlb pages. This is not too invasive as hugetlbfs processing is done separate from core mm in many places. However, I don't really like this idea. Much ugliness is contained in the new routine hugetlb_page_mapping_lock_write() of patch 1. The only other way I can think of to address these issues is by catching all the races. After catching a race, cleanup, backout, retry ... etc, as needed. This can get really ugly, especially for huge page reservations. At one time, I started writing some of the reservation backout code for page faults and it got so ugly and complicated I went down the path of adding synchronization to avoid the races. Any other suggestions would be welcome. [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/ [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/ [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/ [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/ This patch (of 2): While looking at BUGs associated with invalid huge page map counts, it was discovered and observed that a huge pte pointer could become 'invalid' and point to another task's page table. Consider the following: A task takes a page fault on a shared hugetlbfs file and calls huge_pte_alloc to get a ptep. Suppose the returned ptep points to a shared pmd. Now, another task truncates the hugetlbfs file. As part of truncation, it unmaps everyone who has the file mapped. If the range being truncated is covered by a shared pmd, huge_pmd_unshare will be called. For all but the last user of the shared pmd, huge_pmd_unshare will clear the pud pointing to the pmd. If the task in the middle of the page fault is not the last user, the ptep returned by huge_pte_alloc now points to another task's page table or worse. This leads to bad things such as incorrect page map/reference counts or invalid memory references. To fix, expand the use of i_mmap_rwsem as follows: - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called. huge_pmd_share is only called via huge_pte_alloc, so callers of huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers of huge_pte_alloc continue to hold the semaphore until finished with the ptep. - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called. One problem with this scheme is that it requires taking i_mmap_rwsem before taking the page lock during page faults. This is not the order specified in the rest of mm code. Handling of hugetlbfs pages is mostly isolated today. Therefore, we use this alternative locking order for PageHuge() pages. mapping->i_mmap_rwsem hugetlb_fault_mutex (hugetlbfs specific page fault mutex) page->flags PG_locked (lock_page) To help with lock ordering issues, hugetlb_page_mapping_lock_write() is introduced to write lock the i_mmap_rwsem associated with a page. In most cases it is easy to get address_space via vma->vm_file->f_mapping. However, in the case of migration or memory errors for anon pages we do not have an associated vma. A new routine _get_hugetlb_page_mapping() will use anon_vma to get address_space in these cases. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
7506851837 |
mm/memory-failure.c: use page_shift() in add_to_kill()
page_shift() is supported after the commit
|
|
|
|
feec24a613 |
mm, soft-offline: convert parameter to pfn
Currently soft_offline_page() receives struct page, and its sibling memory_failure() receives pfn. This discrepancy looks weird and makes precheck on pfn validity tricky. So let's align them. Link: http://lkml.kernel.org/r/20191016234706.GA5493@www9186uo.sakura.ne.jp Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
996ff7a08d |
mm/memory-failure.c clean up around tk pre-allocation
add_to_kill() expects the first 'tk' to be pre-allocated, it makes subsequent allocations on need basis, this makes the code a bit difficult to read. Move all the allocation internal to add_to_kill() and drop the **tk argument. Link: http://lkml.kernel.org/r/1565112345-28754-2-git-send-email-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
96c804a6ae |
mm/memory-failure.c: don't access uninitialized memmaps in memory_failure()
We should check for pfn_to_online_page() to not access uninitialized memmaps. Reshuffle the code so we don't have to duplicate the error message. Link: http://lkml.kernel.org/r/20191009142435.3975-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Fixes: |
|
|
|
3d7fed4ad8 |
mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped more than once
Mmap /dev/dax more than once, then read the poison location using
address from one of the mappings. The other mappings due to not having
the page mapped in will cause SIGKILLs delivered to the process.
SIGKILL succeeds over SIGBUS, so user process loses the opportunity to
handle the UE.
Although one may add MAP_POPULATE to mmap(2) to work around the issue,
MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so
isn't always an option.
Details -
ndctl inject-error --block=10 --count=1 namespace6.0
./read_poison -x dax6.0 -o 5120 -m 2
mmaped address 0x7f5bb6600000
mmaped address 0x7f3cf3600000
doing local read at address 0x7f3cf3601400
Killed
Console messages in instrumented kernel -
mce: Uncorrected hardware memory error in user-access at edbe201400
Memory failure: tk->addr = 7f5bb6601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
dev_pagemap_mapping_shift: page edbe201: no PUD
Memory failure: tk->size_shift == 0
Memory failure: Unable to find user space address edbe201 in read_poison
Memory failure: tk->addr = 7f3cf3601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
Memory failure: tk->size_shift = 21
Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of failure to unmap corrupted page
=> to deliver SIGKILL
Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory corruption
=> to deliver SIGBUS
Link: http://lkml.kernel.org/r/1565112345-28754-3-git-send-email-jane.chu@oracle.com
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
fec88ab0af |
HMM patches for 5.3
Improvements and bug fixes for the hmm interface in the kernel:
- Improve clarity, locking and APIs related to the 'hmm mirror' feature
merged last cycle. In linux-next we now see AMDGPU and nouveau to be
using this API.
- Remove old or transitional hmm APIs. These are hold overs from the past
with no users, or APIs that existed only to manage cross tree conflicts.
There are still a few more of these cleanups that didn't make the merge
window cut off.
- Improve some core mm APIs:
* export alloc_pages_vma() for driver use
* refactor into devm_request_free_mem_region() to manage
DEVICE_PRIVATE resource reservations
* refactor duplicative driver code into the core dev_pagemap
struct
- Remove hmm wrappers of improved core mm APIs, instead have drivers use
the simplified API directly
- Remove DEVICE_PUBLIC
- Simplify the kconfig flow for the hmm users and core code
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl0k1zkACgkQOG33FX4g
mxrO+w//QF/yI/9Hh30RWEBq8W107cODkDlaT0Z/7cVEXfGetZzIUpqzxnJofRfQ
xTw1XmYkc9WpJe/mTTuFZFewNQwWuMM6X0Xi25fV438/Y64EclevlcJTeD49TIH1
CIMsz8bX7CnCEq5sz+UypLg9LPnaD9L/JLyuSbyjqjms/o+yzqa7ji7p/DSINuhZ
Qva9OZL1ZSEDJfNGi8uGpYBqryHoBAonIL12R9sCF5pbJEnHfWrH7C06q7AWOAjQ
4vjN/p3F4L9l/v2IQ26Kn/S0AhmN7n3GT//0K66e2gJPfXa8fxRKGuFn/Kd79EGL
YPASn5iu3cM23up1XkbMNtzacL8yiIeTOcMdqw26OaOClojy/9OJduv5AChe6qL/
VUQIAn1zvPsJTyC5U7mhmkrGuTpP6ivHpxtcaUp+Ovvi1cyK40nLCmSNvLnbN5ES
bxbb0SjE4uupDG5qU6Yct/hFp6uVMSxMqXZOb9Xy8ZBkbMsJyVOLj71G1/rVIfPU
hO1AChX5CRG1eJoMo6oBIpiwmSvcOaPp3dqIOQZvwMOqrO869LR8qv7RXyh/g9gi
FAEKnwLl4GK3YtEO4Kt/1YI5DXYjSFUbfgAs0SPsRKS6hK2+RgRk2M/B/5dAX0/d
lgOf9WPODPwiSXBYLtJB8qHVDX0DIY8faOyTx6BYIKClUtgbBI8=
=wKvp
-----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull HMM updates from Jason Gunthorpe:
"Improvements and bug fixes for the hmm interface in the kernel:
- Improve clarity, locking and APIs related to the 'hmm mirror'
feature merged last cycle. In linux-next we now see AMDGPU and
nouveau to be using this API.
- Remove old or transitional hmm APIs. These are hold overs from the
past with no users, or APIs that existed only to manage cross tree
conflicts. There are still a few more of these cleanups that didn't
make the merge window cut off.
- Improve some core mm APIs:
- export alloc_pages_vma() for driver use
- refactor into devm_request_free_mem_region() to manage
DEVICE_PRIVATE resource reservations
- refactor duplicative driver code into the core dev_pagemap
struct
- Remove hmm wrappers of improved core mm APIs, instead have drivers
use the simplified API directly
- Remove DEVICE_PUBLIC
- Simplify the kconfig flow for the hmm users and core code"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
mm: remove the HMM config option
mm: sort out the DEVICE_PRIVATE Kconfig mess
mm: simplify ZONE_DEVICE page private data
mm: remove hmm_devmem_add
mm: remove hmm_vma_alloc_locked_page
nouveau: use devm_memremap_pages directly
nouveau: use alloc_page_vma directly
PCI/P2PDMA: use the dev_pagemap internal refcount
device-dax: use the dev_pagemap internal refcount
memremap: provide an optional internal refcount in struct dev_pagemap
memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
memremap: remove the data field in struct dev_pagemap
memremap: add a migrate_to_ram method to struct dev_pagemap_ops
memremap: lift the devmap_enable manipulation into devm_memremap_pages
memremap: pass a struct dev_pagemap to ->kill and ->cleanup
memremap: move dev_pagemap callbacks into a separate structure
memremap: validate the pagemap type passed to devm_memremap_pages
mm: factor out a devm_request_free_mem_region helper
mm: export alloc_pages_vma
...
|
|
|
|
135e53514e |
mm/memory-failure.c: clarify error message
Some user who install SIGBUS handler that does longjmp out therefore keeping the process alive is confused by the error message "[188988.765862] Memory failure: 0x1840200: Killing cellsrv:33395 due to hardware memory corruption" Slightly modify the error message to improve clarity. Link: http://lkml.kernel.org/r/1558403523-22079-1-git-send-email-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Pankaj Gupta <pagupta@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
5ad18b2e60 |
Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull force_sig() argument change from Eric Biederman: "A source of error over the years has been that force_sig has taken a task parameter when it is only safe to use force_sig with the current task. The force_sig function is built for delivering synchronous signals such as SIGSEGV where the userspace application caused a synchronous fault (such as a page fault) and the kernel responded with a signal. Because the name force_sig does not make this clear, and because the force_sig takes a task parameter the function force_sig has been abused for sending other kinds of signals over the years. Slowly those have been fixed when the oopses have been tracked down. This set of changes fixes the remaining abusers of force_sig and carefully rips out the task parameter from force_sig and friends making this kind of error almost impossible in the future" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits) signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus signal: Remove the signal number and task parameters from force_sig_info signal: Factor force_sig_info_to_task out of force_sig_info signal: Generate the siginfo in force_sig signal: Move the computation of force into send_signal and correct it. signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal signal: Remove the task parameter from force_sig_fault signal: Use force_sig_fault_to_task for the two calls that don't deliver to current signal: Explicitly call force_sig_fault on current signal/unicore32: Remove tsk parameter from __do_user_fault signal/arm: Remove tsk parameter from __do_user_fault signal/arm: Remove tsk parameter from ptrace_break signal/nds32: Remove tsk parameter from send_sigtrap signal/riscv: Remove tsk parameter from do_trap signal/sh: Remove tsk parameter from force_sig_info_fault signal/um: Remove task parameter from send_sigtrap signal/x86: Remove task parameter from send_sigtrap signal: Remove task parameter from force_sig_mceerr signal: Remove task parameter from force_sig signal: Remove task parameter from force_sigsegv ... |
|
|
|
25b2995a35 |
mm: remove MEMORY_DEVICE_PUBLIC support
The code hasn't been used since it was added to the tree, and doesn't appear to actually be usable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> |
|
|
|
faf53def3b |
mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge
madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
for hugepages with overcommitting enabled. That was caused by the
suboptimal code in current soft-offline code. See the following part:
ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
if (ret) {
...
} else {
/*
* We set PG_hwpoison only when the migration source hugepage
* was successfully dissolved, because otherwise hwpoisoned
* hugepage remains on free hugepage list, then userspace will
* find it as SIGBUS by allocation failure. That's not expected
* in soft-offlining.
*/
ret = dissolve_free_huge_page(page);
if (!ret) {
if (set_hwpoison_free_buddy_page(page))
num_poisoned_pages_inc();
}
}
return ret;
Here dissolve_free_huge_page() returns -EBUSY if the migration source page
was freed into buddy in migrate_pages(), but even in that case we actually
has a chance that set_hwpoison_free_buddy_page() succeeds. So that means
current code gives up offlining too early now.
dissolve_free_huge_page() checks that a given hugepage is suitable for
dissolving, where we should return success for !PageHuge() case because
the given hugepage is considered as already dissolved.
This change also affects other callers of dissolve_free_huge_page(), which
are cleaned up together.
[n-horiguchi@ah.jp.nec.com: v3]
Link: http://lkml.kernel.org/r/1560761476-4651-3-git-send-email-n-horiguchi@ah.jp.nec.comLink: http://lkml.kernel.org/r/1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com
Fixes:
|
|
|
|
b38e5962f8 |
mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails
The pass/fail of soft offline should be judged by checking whether the
raw error page was finally contained or not (i.e. the result of
set_hwpoison_free_buddy_page()), but current code do not work like
that. It might lead us to misjudge the test result when
set_hwpoison_free_buddy_page() fails.
Without this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may
not offline the original page and will not return an error.
Link: http://lkml.kernel.org/r/1560154686-18497-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Fixes:
|
|
|
|
1439f94c54 |
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 263
Based on 1 normalized pattern(s): this software may be redistributed and or modified under the terms of the gnu general public license gpl version 2 only as published by the free software foundation extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 1 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190529141333.676969322@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
f8eac9011b |
signal: Remove task parameter from force_sig_mceerr
All of the callers pass current into force_sig_mceer so remove the task parameter to make this obvious. This also makes it clear that force_sig_mceerr passes current into force_sig_info. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
46612b751c |
mm: hwpoison: fix thp split handing in soft_offline_in_use_page()
When soft_offline_in_use_page() runs on a thp tail page after pmd is split, we trigger the following VM_BUG_ON_PAGE(): Memory failure: 0x3755ff: non anonymous thp __get_any_page: 0x3755ff: unknown zero refcount page type 2fffff80000000 Soft offlining pfn 0x34d805 at process virtual address 0x20fff000 page:ffffea000d360140 count:0 mapcount:0 mapping:0000000000000000 index:0x1 flags: 0x2fffff80000000() raw: 002fffff80000000 ffffea000d360108 ffffea000d360188 0000000000000000 raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) ------------[ cut here ]------------ kernel BUG at ./include/linux/mm.h:519! soft_offline_in_use_page() passed refcount and page lock from tail page to head page, which is not needed because we can pass any subpage to split_huge_page(). Naoya had fixed a similar issue in |
|
|
|
6376360ecb |
mm: hwpoison: use do_send_sig_info() instead of force_sig()
Currently memory_failure() is racy against process's exiting, which
results in kernel crash by null pointer dereference.
The root cause is that memory_failure() uses force_sig() to forcibly
kill asynchronous (meaning not in the current context) processes. As
discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM
fixes, this is not a right thing to do. OOM solves this issue by using
do_send_sig_info() as done in commit
|
|
|
|
ddeaab32a8 |
hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization"
This reverts
|
|
|
|
b43a999005 |
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with the
ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
called.
[mike.kravetz@oracle.com: add explicit check for mapping != null]
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes:
|
|
|
|
27359fd6e5 |
dax: Fix unlock mismatch with updated API
Internal to dax_unlock_mapping_entry(), dax_unlock_entry() is used to
store a replacement entry in the Xarray at the given xas-index with the
DAX_LOCKED bit clear. When called, dax_unlock_entry() expects the unlocked
value of the entry relative to the current Xarray state to be specified.
In most contexts dax_unlock_entry() is operating in the same scope as
the matched dax_lock_entry(). However, in the dax_unlock_mapping_entry()
case the implementation needs to recall the original entry. In the case
where the original entry is a 'pmd' entry it is possible that the pfn
performed to do the lookup is misaligned to the value retrieved in the
Xarray.
Change the api to return the unlock cookie from dax_lock_page() and pass
it to dax_unlock_page(). This fixes a bug where dax_unlock_page() was
assuming that the page was PMD-aligned if the entry was a PMD entry with
signatures like:
WARNING: CPU: 38 PID: 1396 at fs/dax.c:340 dax_insert_entry+0x2b2/0x2d0
RIP: 0010:dax_insert_entry+0x2b2/0x2d0
[..]
Call Trace:
dax_iomap_pte_fault.isra.41+0x791/0xde0
ext4_dax_huge_fault+0x16f/0x1f0
? up_read+0x1c/0xa0
__do_fault+0x1f/0x160
__handle_mm_fault+0x1033/0x1490
handle_mm_fault+0x18b/0x3d0
Link: https://lkml.kernel.org/r/20181130154902.GL10377@bombadil.infradead.org
Fixes:
|
|
|
|
2923b27e54 |
libnvdimm-for-4.19_dax-memory-failure
* memory_failure() gets confused by dev_pagemap backed mappings. The
recovery code has specific enabling for several possible page states
that needs new enabling to handle poison in dax mappings. Teach
memory_failure() about ZONE_DEVICE pages.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9ui8ACgkQYGjFFmlT
OEpNRw//XGj9s7sezfJFeol4psJlRUd935yii/gmJRgi/yPf2VxxQG9qyM6SMBUc
75jASfOL6FSsfxHz0kplyWzMDNdrTkNNAD+9rv80FmY7GqWgcas9DaJX7jZ994vI
5SRO7pfvNZcXlo7IhqZippDw3yxkIU9Ufi0YQKaEUm7GFieptvCZ0p9x3VYfdvwM
BExrxQe0X1XUF4xErp5P78+WUbKxP47DLcucRDig8Q7dmHELUdyNzo3E1SVoc7m+
3CmvyTj6XuFQgOZw7ZKun1BJYfx/eD5ZlRJLZbx6wJHRtTXv/Uea8mZ8mJ31ykN9
F7QVd0Pmlyxys8lcXfK+nvpL09QBE0/PhwWKjmZBoU8AdgP/ZvBXLDL/D6YuMTg6
T4wwtPNJorfV4lVD06OliFkVI4qbKbmNsfRq43Ns7PCaLueu4U/eMaSwSH99UMaZ
MGbO140XW2RZsHiU9yTRUmZq73AplePEjxtzR8oHmnjo45nPDPy8mucWPlkT9kXA
oUFMhgiviK7dOo19H4eaPJGqLmHM93+x5tpYxGqTr0dUOXUadKWxMsTnkID+8Yi7
/kzQWCFvySz3VhiEHGuWkW08GZT6aCcpkREDomnRh4MEnETlZI8bblcuXYOCLs6c
nNf1SIMtLdlsl7U1fEX89PNeQQ2y237vEDhFQZftaalPeu/JJV0=
=Ftop
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm memory-failure update from Dave Jiang:
"As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.
In order to support reliable reverse mapping of user space addresses:
1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.
2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine
the size of the mapping that encompasses a given poisoned pfn.
3/ Given pmem errors can be repaired, change the speculatively
accessed poison protection, mce_unmap_kpfn(), to be reversible and
otherwise allow ongoing access from the kernel.
A side effect of this enabling is that MADV_HWPOISON becomes usable
for dax mappings, however the primary motivation is to allow the
system to survive userspace consumption of hardware-poison via dax.
Specifically the current behavior is:
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
<reboot>
...and with these changes:
Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
Memory failure: 0x20cb00: recovery action for dax page: Recovered
Given all the cross dependencies I propose taking this through
nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
folks"
* tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm, pmem: Restore page attributes when clearing errors
x86/memory_failure: Introduce {set, clear}_mce_nospec()
x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
mm, memory_failure: Teach memory_failure() about dev_pagemap pages
filesystem-dax: Introduce dax_lock_mapping_entry()
mm, memory_failure: Collect mapping size in collect_procs()
mm, madvise_inject_error: Let memory_failure() optionally take a page reference
mm, dev_pagemap: Do not clear ->mapping on final put
mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
filesystem-dax: Set page->index
device-dax: Set page->index
device-dax: Enable page_mapping()
device-dax: Convert to vmf_insert_mixed and vm_fault_t
|
|
|
|
d4ae9916ea |
mm: soft-offline: close the race against page allocation
A process can be killed with SIGBUS(BUS_MCEERR_AR) when it tries to allocate a page that was just freed on the way of soft-offline. This is undesirable because soft-offline (which is about corrected error) is less aggressive than hard-offline (which is about uncorrected error), and we can make soft-offline fail and keep using the page for good reason like "system is busy." Two main changes of this patch are: - setting migrate type of the target page to MIGRATE_ISOLATE. As done in free_unref_page_commit(), this makes kernel bypass pcplist when freeing the page. So we can assume that the page is in freelist just after put_page() returns, - setting PG_hwpoison on free page under zone->lock which protects freelists, so this allows us to avoid setting PG_hwpoison on a page that is decided to be allocated soon. [akpm@linux-foundation.org: tweak set_hwpoison_free_buddy_page() comment] Link: http://lkml.kernel.org/r/1531452366-11661-3-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <zy.zhengyi@alibaba-inc.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
6bc9b56433 |
mm: fix race on soft-offlining free huge pages
Patch series "mm: soft-offline: fix race against page allocation".
Xishi recently reported the issue about race on reusing the target pages
of soft offlining. Discussion and analysis showed that we need make
sure that setting PG_hwpoison should be done in the right place under
zone->lock for soft offline. 1/2 handles free hugepage's case, and 2/2
hanldes free buddy page's case.
This patch (of 2):
There's a race condition between soft offline and hugetlb_fault which
causes unexpected process killing and/or hugetlb allocation failure.
The process killing is caused by the following flow:
CPU 0 CPU 1 CPU 2
soft offline
get_any_page
// find the hugetlb is free
mmap a hugetlb file
page fault
...
hugetlb_fault
hugetlb_no_page
alloc_huge_page
// succeed
soft_offline_free_page
// set hwpoison flag
mmap the hugetlb file
page fault
...
hugetlb_fault
hugetlb_no_page
find_lock_page
return VM_FAULT_HWPOISON
mm_fault_error
do_sigbus
// kill the process
The hugetlb allocation failure comes from the following flow:
CPU 0 CPU 1
mmap a hugetlb file
// reserve all free page but don't fault-in
soft offline
get_any_page
// find the hugetlb is free
soft_offline_free_page
// set hwpoison flag
dissolve_free_huge_page
// fail because all free hugepages are reserved
page fault
...
hugetlb_fault
hugetlb_no_page
alloc_huge_page
...
dequeue_huge_page_node_exact
// ignore hwpoisoned hugepage
// and finally fail due to no-mem
The root cause of this is that current soft-offline code is written based
on an assumption that PageHWPoison flag should be set at first to avoid
accessing the corrupted data. This makes sense for memory_failure() or
hard offline, but does not for soft offline because soft offline is about
corrected (not uncorrected) error and is safe from data lost. This patch
changes soft offline semantics where it sets PageHWPoison flag only after
containment of the error page completes successfully.
Link: http://lkml.kernel.org/r/1531452366-11661-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Suggested-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <zy.zhengyi@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
1c4c3b99c0 |
mm: fix page_freeze_refs and page_unfreeze_refs in comments
page_freeze_refs/page_unfreeze_refs have already been relplaced by page_ref_freeze/page_ref_unfreeze , but they are not modified in the comments. Link: http://lkml.kernel.org/r/1532590226-106038-1-git-send-email-jiang.biao2@zte.com.cn Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
6100e34b25 |
mm, memory_failure: Teach memory_failure() about dev_pagemap pages
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
In contrast to typical memory, dev_pagemap pages may be dax mapped. With
dax there is no possibility to map in another page dynamically since dax
establishes 1:1 physical address to file offset associations. Also
dev_pagemap pages associated with NVDIMM / persistent memory devices can
internal remap/repair addresses with poison. While memory_failure()
assumes that it can discard typical poisoned pages and keep them
unmapped indefinitely, dev_pagemap pages may be returned to service
after the error is cleared.
Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
dev_pagemap pages that have poison consumed by userspace. Mark the
memory as UC instead of unmapping it completely to allow ongoing access
via the device driver (nd_pmem). Later, nd_pmem will grow support for
marking the page back to WB when the error is cleared.
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
|
|
|
|
ae1139ece1 |
mm, memory_failure: Collect mapping size in collect_procs()
In preparation for supporting memory_failure() for dax mappings, teach collect_procs() to also determine the mapping size. Unlike typical mappings the dax mapping size is determined by walking page-table entries rather than using the compound-page accounting for THP pages. Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> |