mirror of https://github.com/torvalds/linux.git
942 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
4c6355b25e |
mm: memcontrol: charge swapin pages on instantiation
Right now, users that are otherwise memory controlled can easily escape their containment and allocate significant amounts of memory that they're not being charged for. That's because swap readahead pages are not being charged until somebody actually faults them into their page table. This can be exploited with MADV_WILLNEED, which triggers arbitrary readahead allocations without charging the pages. There are additional problems with the delayed charging of swap pages: 1. To implement refault/workingset detection for anonymous pages, we need to have a target LRU available at swapin time, but the LRU is not determinable until the page has been charged. 2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be stable when the page is isolated from the LRU; otherwise, the locks change under us. But swapcache gets charged after it's already on the LRU, and even if we cannot isolate it ourselves (since charging is not exactly optional). The previous patch ensured we always maintain cgroup ownership records for swap pages. This patch moves the swapcache charging point from the fault handler to swapin time to fix all of the above problems. v2: simplify swapin error checking (Joonsoo) [hughd@google.com: fix livelock in __read_swap_cache_async()] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005212246080.8458@eggly.anvils Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Link: http://lkml.kernel.org/r/20200508183105.225460-17-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
9d82c69438 |
mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() API
With the page->mapping requirement gone from memcg, we can charge anon and file-thp pages in one single step, right after they're allocated. This removes two out of three API calls - especially the tricky commit step that needed to happen at just the right time between when the page is "set up" and when it's "published" - somewhat vague and fluid concepts that varied by page type. All we need is a freshly allocated page and a memcg context to charge. v2: prevent double charges on pre-allocated hugepages in khugepaged [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL] Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
be5d0a74c6 |
mm: memcontrol: switch to native NR_ANON_MAPPED counter
Memcg maintains a private MEMCG_RSS counter. This divergence from the generic VM accounting means unnecessary code overhead, and creates a dependency for memcg that page->mapping is set up at the time of charging, so that page types can be told apart. Convert the generic accounting sites to mod_lruvec_page_state and friends to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the same way we do for NR_FILE_MAPPED. With the previous patch removing MEMCG_CACHE and the private NR_SHMEM counter, this patch finally eliminates the need to have page->mapping set up at charge time. However, we need to have page->mem_cgroup set up by the time rmap runs and does the accounting, so switch the commit and the rmap callbacks around. v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo) Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
3fba69a56e |
mm: memcontrol: drop @compound parameter from memcg charging API
The memcg charging API carries a boolean @compound parameter that tells whether the page we're dealing with is a hugepage. mem_cgroup_commit_charge() has another boolean @lrucare that indicates whether the page needs LRU locking or not while charging. The majority of callsites know those parameters at compile time, which results in a lot of naked "false, false" argument lists. This makes for cryptic code and is a breeding ground for subtle mistakes. Thankfully, the huge page state can be inferred from the page itself and doesn't need to be passed along. This is safe because charging completes before the page is published and somebody may split it. Simplify the callsites by removing @compound, and let memcg infer the state by using hpage_nr_pages() unconditionally. That function does PageTransHuge() to identify huge pages, which also helpfully asserts that nobody passes in tail pages by accident. The following patches will introduce a new charging API, best not to carry over unnecessary weight. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
8226f11318 |
MIPS updates for v5.8:
- added support for MIPSr5 and P5600 cores - converted Loongson PCI driver into a PCI host driver using the generic PCI framework - added emulation of CPUCFG command for Loogonson64 cpus - removed of LASAT, PMC MSP71xx and NEC MARKEINS/EMMA - ioremap cleanup - fix for a race between two threads faulting the same page - various cleanups and fixes -----BEGIN PGP SIGNATURE----- iQJOBAABCAA4FiEEbt46xwy6kEcDOXoUeZbBVTGwZHAFAl7WK54aHHRzYm9nZW5k QGFscGhhLmZyYW5rZW4uZGUACgkQeZbBVTGwZHAbjA/9EEFeqNg9UNUH6/TS18QV qkxKp0+LC4Jk+SduzLyYsYy6l/dSaKYl8m9jyJsWjM6BvBZTcMJJOnzIPRafI0s+ MK8GCSZunAkm25DsDvfobQUkbQ/UHjY/fuRpNslbDcsYqIKv90hUMd21ccXY6KC5 RY+aMlpjgksg1X8JJ7k1Rs05sXyUPqpESteyqehF1b/+Iyv7H2L3v5EvQwvPDs6f TyVgNJU2B3RCU6/uAcWmHdVLxXd+Y8fM0vC8DCO0pg0rGf4be0FbZztHmeq6r2wy g7wsO7acKWGzulFQD5ftVSQ6i8KHIDNDePmDMtU5oFcXkzUDdGvd3j3Gst19/nve ZftNmQHOY1JqGUOhdq1fDG/4M3Vc5bvh3W6eMG22TuMLEWsOF8teY8uUa/vxOb+B 2NsJ9q6ylRS7RDWWOrApJWfFYPvhr5wlLxT+azWNa9y3bjV8vDLjNdU0mRLA1nsu yLzYMwIhtWfZhkJZ+xJVSmQ6LjAHDN5TF/LEx/9itLg5t9wrEosFPAtOv8V15hy4 KBNvvWeoy7RRmBTNuKh7r9Ui4jw7GgxL4D1OwzCsF//GAiGyuuh0zMuUE8EXA6K5 MpdGt+bSOcLl8ILTtGir8e4MXLawDH8n94f8QWLb9FcOvU4KHUjRKU7EQ6dyD5dk a7xskGLXWdVO3IJ/Xvxcaeo= =eAtN -----END PGP SIGNATURE----- Merge tag 'mips_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Pull MIPS updates from Thomas Bogendoerfer: - added support for MIPSr5 and P5600 cores - converted Loongson PCI driver into a PCI host driver using the generic PCI framework - added emulation of CPUCFG command for Loogonson64 cpus - removed of LASAT, PMC MSP71xx and NEC MARKEINS/EMMA - ioremap cleanup - fix for a race between two threads faulting the same page - various cleanups and fixes * tag 'mips_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (143 commits) MIPS: ralink: drop ralink_clk_init for mt7621 MIPS: ralink: bootrom: mark a function as __init to save some memory MIPS: Loongson64: Reorder CPUCFG model match arms MIPS: Expose Loongson CPUCFG availability via HWCAP MIPS: Loongson64: Guard against future cores without CPUCFG MIPS: Fix build warning about "PTR_STR" redefinition MIPS: Loongson64: Remove not used pci.c MIPS: Loongson64: Define PCI_IOBASE MIPS: CPU_LOONGSON2EF need software to maintain cache consistency MIPS: DTS: Fix build errors used with various configs MIPS: Loongson64: select NO_EXCEPT_FILL MIPS: Fix IRQ tracing when call handle_fpe() and handle_msa_fpe() MIPS: mm: add page valid judgement in function pte_modify mm/memory.c: Add memory read privilege on page fault handling mm/memory.c: Update local TLB if PTE entry exists MIPS: Do not flush tlb page when updating PTE entry MIPS: ingenic: Default to a generic board MIPS: ingenic: Add support for GCW Zero prototype MIPS: ingenic: DTS: Add memory info of GCW Zero MIPS: Loongson64: Switch to generic PCI driver ... |
|
|
|
6972f55c41 |
mm/memory: remove unnecessary pte_devmap case in copy_one_pte()
Since commit
|
|
|
|
44bf431b47 |
mm/memory.c: Add memory read privilege on page fault handling
Here add pte_sw_mkyoung function to make page readable on MIPS platform during page fault handling. This patch improves page fault latency about 10% on my MIPS machine with lmbench lat_pagefault case. It is noop function on other arches, there is no negative influence on those architectures. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> |
|
|
|
7df6769743 |
mm/memory.c: Update local TLB if PTE entry exists
If two threads concurrently fault at the same page, the thread that won the race updates the PTE and its local TLB. For now, the other thread gives up, simply does nothing, and continues. It could happen that this second thread triggers another fault, whereby it only updates its local TLB while handling the fault. Instead of triggering another fault, let's directly update the local TLB of the second thread. Function update_mmu_tlb is used here to update local TLB on the second thread, and it is defined as empty on other arches. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> |
|
|
|
8cd3984d81 |
mm/memory.c: add vm_insert_pages()
Add the ability to insert multiple pages at once to a user VM with lower PTE spinlock operations. The intention of this patch-set is to reduce atomic ops for tcp zerocopy receives, which normally hits the same spinlock multiple times consecutively. [akpm@linux-foundation.org: pte_alloc() no longer takes the `addr' argument] [arjunroy@google.com: add missing page_count() check to vm_insert_pages()] Link: http://lkml.kernel.org/r/20200214005929.104481-1-arjunroy.kdev@gmail.com [arjunroy@google.com: vm_insert_pages() checks if pte_index defined] Link: http://lkml.kernel.org/r/20200228054714.204424-2-arjunroy.kdev@gmail.com Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Miller <davem@davemloft.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200128025958.43490-2-arjunroy.kdev@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
8efd6f5b17 |
mm/memory.c: refactor insert_page to prepare for batched-lock insert
Add helper methods for vm_insert_page()/insert_page() to prepare for vm_insert_pages(), which batch-inserts pages to reduce spinlock operations when inserting multiple consecutive pages into the user page table. The intention of this patch-set is to reduce atomic ops for tcp zerocopy receives, which normally hits the same spinlock multiple times consecutively. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Miller <davem@davemloft.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200128025958.43490-1-arjunroy.kdev@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
552657b7b3 |
mm: fix ambiguous comments for better code readability
The parameter of remap_pfn_range() @pfn passed from the caller is actually a page-frame number converted by corresponding physical address of kernel memory, the original comment is ambiguous that may mislead the users. Meanwhile, there is an ambiguous typo "VMM" in the comment of vm_area_struct. So fixing them will make the code more readable. Signed-off-by: chenqiwu <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1583026921-15279-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
f45ec5ff16 |
userfaultfd: wp: support swap and page migration
For either swap and page migration, we all use the bit 2 of the entry to identify whether this entry is uffd write-protected. It plays a similar role as the existing soft dirty bit in swap entries but only for keeping the uffd-wp tracking for a specific PTE/PMD. Something special here is that when we want to recover the uffd-wp bit from a swap/migration entry to the PTE bit we'll also need to take care of the _PAGE_RW bit and make sure it's cleared, otherwise even with the _PAGE_UFFD_WP bit we can't trap it at all. In change_pte_range() we do nothing for uffd if the PTE is a swap entry. That can lead to data mismatch if the page that we are going to write protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch also applies/removes the uffd-wp bit even for the swap entries. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
b569a17607 |
userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork
UFFD_EVENT_FORK support for uffd-wp should be already there, except that we should clean the uffd-wp bit if uffd fork event is not enabled. Detect that to avoid _PAGE_UFFD_WP being set even if the VMA is not being tracked by VM_UFFD_WP. Do this for both small PTEs and huge PMDs. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-9-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
292924b260 |
userfaultfd: wp: apply _PAGE_UFFD_WP bit
Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
change_protection() when used with uffd-wp and make sure the two new flags
are exclusively used. Then,
- For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
when a range of memory is write protected by uffd
- For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
_PAGE_RW when write protection is resolved from userspace
And use this new interface in mwriteprotect_range() to replace the old
MM_CP_DIRTY_ACCT.
Do this change for both PTEs and huge PMDs. Then we can start to identify
which PTE/PMD is write protected by general (e.g., COW or soft dirty
tracking), and which is for userfaultfd-wp.
Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
into _PAGE_CHG_MASK as well. Meanwhile, since we have this new bit, we
can be even more strict when detecting uffd-wp page faults in either
do_wp_page() or wp_huge_pmd().
After we're with _PAGE_UFFD_WP, a special case is when a page is both
protected by the general COW logic and also userfault-wp. Here the
userfault-wp will have higher priority and will be handled first. Only
after the uffd-wp bit is cleared on the PTE/PMD will we continue to handle
the general COW. These are the steps on what will happen with such a
page:
1. CPU accesses write protected shared page (so both protected by
general COW and uffd-wp), blocked by uffd-wp first because in
do_wp_page we'll handle uffd-wp first, so it has higher priority
than general COW.
2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT
to remove the uffd-wp bit upon the PTE/PMD. However here we
still keep the write bit cleared. Notify the blocked CPU.
3. The blocked CPU resumes the page fault process with a fault
retry, during retry it'll notice it was not with the uffd-wp bit
this time but it is still write protected by general COW, then
it'll go though the COW path in the fault handler, copy the page,
apply write bit where necessary, and retry again.
4. The CPU will be able to access this page with write bit set.
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
529b930b87 |
userfaultfd: wp: hook userfault handler to write protection fault
There are several cases write protection fault happens. It could be a write to zero page, swaped page or userfault write protected page. When the fault happens, there is no way to know if userfault write protect the page before. Here we just blindly issue a userfault notification for vma with VM_UFFD_WP regardless if app write protects it yet. Application should be ready to handle such wp fault. In the swapin case, always swapin as readonly. This will cause false positive userfaults. We need to decide later if to eliminate them with a flag like soft-dirty in the swap entry (see _PAGE_SWP_SOFT_DIRTY). hugetlbfs wouldn't need to worry about swapouts but and tmpfs would be handled by a swap entry bit like anonymous memory. The main problem with no easy solution to eliminate the false positives, will be if/when userfaultfd is extended to real filesystem pagecache. When the pagecache is freed by reclaim we can't leave the radix tree pinned if the inode and in turn the radix tree is reclaimed as well. The estimation is that full accuracy and lack of false positives could be easily provided only to anonymous memory (as long as there's no fork or as long as MADV_DONTFORK is used on the userfaultfd anonymous range) tmpfs and hugetlbfs, it's most certainly worth to achieve it but in a later incremental patch. [peterx@redhat.com: don't conditionally drop FAULT_FLAG_WRITE in do_swap_page] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Link: http://lkml.kernel.org/r/20200220163112.11409-3-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
396bcc5299 |
mm: remove CONFIG_TRANSPARENT_HUGE_PAGECACHE
Commit |
|
|
|
3122e80efc |
mm/vma: make vma_is_accessible() available for general use
Lets move vma_is_accessible() helper to include/linux/mm.h which makes it available for general use. While here, this replaces all remaining open encodings for VMA access check with vma_is_accessible(). Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Guo Ren <guoren@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Guo Ren <guoren@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Burton <paulburton@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Nick Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
ea9448b254 |
drm: add support for hugepages to TTM
-----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJehnToAAoJEAx081l5xIa+bYEP/3IW+bip83OSR/Ay/29qmeBh FMZjz9G+jClVArea+8dlbmGohpQfkLuBiDBE1Ujxl9iqsm3STdIdbv9bHccqs2g8 mtptkZ5qKwuOi7NhcNG5E5vy60bEAbZ9/QtXok5nckega2sdP7cr+uzZgp/Zc/Vo v9H8Wk6/l/MUF8agIXmgChpXII17lIyYbtbH5NV+PpsZMhAaAg2g4Z4vBP5Ue+Nc myNcdzKLF3nq++gBfIZ4gzAAnnqN2eYFvkSdvRSdn9HuXcur1tQHjMwC/DJuk8h7 5dsaplrRLceMEqn6d61oWBJclPefXlkazvHzqNA9Zwr98yVev5h7tiT3BKNVTbKW iPoXCt55fJosvXAsJxW4UgXZy7kMGZdZ8GmSlwmZsA0kJRvOuuvWChvu/ugwnIeR DUWb5sa0Bn9aoczJ4Qq61O7CqtvhOf6NK24Jcc/HSk/iDbZ2tEnCPEXeCm0GibQ5 PAFLfE1fZUcEeZlOp+zbZ6ni6XbLL9LX2Dkum/3zEvhf1rdF+0692ZM4o9VwedAX 2TpE4kywhbYxhUq3MbyRzP3knu7pJYb0KCOfyg6Rqn/vCo17+PksRF+6XvzUVlzr VtRYU87TVP5FqIw+e3yela2alP/oo4kEe37n536TcRgFtU7vItcCA5vLuDSOivjX 08B6Hy4QK2M0yKFuuAT5 =KO6E -----END PGP SIGNATURE----- Merge tag 'drm-next-2020-04-03-1' of git://anongit.freedesktop.org/drm/drm Pull drm hugepage support from Dave Airlie: "This adds support for hugepages to TTM and has been tested with the vmwgfx drivers, though I expect other drivers to start using it" * tag 'drm-next-2020-04-03-1' of git://anongit.freedesktop.org/drm/drm: drm/vmwgfx: Hook up the helpers to align buffer objects drm/vmwgfx: Introduce a huge page aligning TTM range manager drm: Add a drm_get_unmapped_area() helper drm/vmwgfx: Support huge page faults drm/ttm, drm/vmwgfx: Support huge TTM pagefaults mm: Add vmf_insert_pfn_xxx_prot() for huge page-table entries mm: Split huge pages on write-notify or COW mm: Introduce vma_is_special_huge fs: Constify vma argument to vma_is_dax |
|
|
|
abd69b9e00 |
mm/memory.c: clarify a confusing comment for vm_iomap_memory
The param "start" actually referes to the physical memory start, which is to be mapped into virtual area vma. And it is the field vma->vm_start which stands for the start of the area. Most of the time, we do not read through whole implementation of a function but only the definition and essential comments. Accurate comments are definitely the base stone. Signed-off-by: Wang Wenhu <wenhu.wang@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200318052206.105104-1-wenhu.wang@vivo.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
86a76331d9 |
mm: clarify a confusing comment for remap_pfn_range()
It really made me scratch my head. Replace the comment with an accurate and consistent description. The parameter pfn actually refers to the page frame number which is right-shifted by PAGE_SHIFT from the physical address. Signed-off-by: WANG Wenhu <wenhu.wang@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200310073955.43415-1-wenhu.wang@vivo.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
327e9fd489 |
mm: Split huge pages on write-notify or COW
The functions wp_huge_pmd() and wp_huge_pud() currently relies on the huge_fault() callback to split huge page table entries if needed. However for module users that requires export of the split_huge_xxx() functionality which may be undesired. Instead split pre-existing huge page-table entries on VM_FAULT_FALLBACK return. We currently only do COW and write-notify on the PTE level, so if the huge_fault() handler returns VM_FAULT_FALLBACK on wp faults, split the huge pages and page-table entries. Also do this for huge PUDs if there is no huge_fault() handler and the vma is not anonymous, similar to how it's done for PMDs. Note that fs/dax.c still does the splitting in the huge_fault() handler, but as huge_fault() A follow-up patch can remove the dax.c split_huge_pmd() if needed. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Hellstrom (VMware) <thomas_os@shipmail.org> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
c3e5ea6ee5 |
mm: avoid data corruption on CoW fault into PFN-mapped VMA
Jeff Moyer has reported that one of xfstests triggers a warning when run
on DAX-enabled filesystem:
WARNING: CPU: 76 PID: 51024 at mm/memory.c:2317 wp_page_copy+0xc40/0xd50
...
wp_page_copy+0x98c/0xd50 (unreliable)
do_wp_page+0xd8/0xad0
__handle_mm_fault+0x748/0x1b90
handle_mm_fault+0x120/0x1f0
__do_page_fault+0x240/0xd70
do_page_fault+0x38/0xd0
handle_page_fault+0x10/0x30
The warning happens on failed __copy_from_user_inatomic() which tries to
copy data into a CoW page.
This happens because of race between MADV_DONTNEED and CoW page fault:
CPU0 CPU1
handle_mm_fault()
do_wp_page()
wp_page_copy()
do_wp_page()
madvise(MADV_DONTNEED)
zap_page_range()
zap_pte_range()
ptep_get_and_clear_full()
<TLB flush>
__copy_from_user_inatomic()
sees empty PTE and fails
WARN_ON_ONCE(1)
clear_page()
The solution is to re-try __copy_from_user_inatomic() under PTL after
checking that PTE is matches the orig_pte.
The second copy attempt can still fail, like due to non-readable PTE, but
there's nothing reasonable we can do about, except clearing the CoW page.
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Justin He <Justin.He@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Link: http://lkml.kernel.org/r/20200218154151.13349-1-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
9717c1cea1 |
drm ttm/mm changes for 5.6-rc1
-----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJeOKqlAAoJEAx081l5xIa+NKoP/1hhIF6iV6QwNxCvNukhiuNc VhS9LPllEQ3G86QE+CfPWmFmh1IBnbJi3CwigC1UX4nA2XoN2Eccm8f5BmitG+Dm YZNZBk8e1z9E68gqpdjA2qr3hddB8ZWcWW8SNqL0PID1UMdXlVr5QM+2Tw7flTQp YsajCwssMVZWIgWya+n+A//Qdu5z95KC0ycLjIrT1g+Hl1dxUUixqcbEWppE2wrF mJqTBJsfrPo2Njb6PFUKvUmzAZPJGWxLp4cGYJhHgqNrEtwJmmwhWpUDd+BpfxHn jq+ir0ALAph+quWdtouArrf1ibS+uipQl+7uaIW5J963czxxNRr7eNl5KP9gaHod zmtplXlw0ruNgCrDhRIJ4B4SB5M2nAZk7Y/xeHK+GwaCOGVt4rCaBWGBz/lorm7u 9phhygqYKdoIJc3JmXcKcXZHgLWuGgobWmYFlA/vNloXMeY5C5LPgDmAKIW40nut pKg9iGH6fJoN0HUcECPIbv1bterZy7YKK6OWl9TRHfMnabLJXNCej8hcEnf6KQ2l spDO//L8tICBftrcZdJFunzoFFlTavF18XBqm9ZdGNfNo9BIipwFJQEQDahVKHSM 6Du0kmuVoB02QLPXEUpA6+W5rFw7M5Qzi47pUnzM/VFJFkFx/eSsfs119aXJRXtc jIgI1vHIPFCX8Zg4SN/O =uiav -----END PGP SIGNATURE----- Merge tag 'drm-next-2020-02-04' of git://anongit.freedesktop.org/drm/drm Pull drm ttm/mm updates from Dave Airlie: "Thomas Hellstrom has some more changes to the TTM layer that needed a patch to the mm subsystem. This adds a new mm API vmf_insert_mixed_prot to avoid an ugly hack that has limitations in the TTM layer" * tag 'drm-next-2020-02-04' of git://anongit.freedesktop.org/drm/drm: mm, drm/ttm: Fix vm page protection handling mm: Add a vmf_insert_mixed_prot() function |
|
|
|
b45f1b3b58 |
Merge branch 'ttm-prot-fix' of git://people.freedesktop.org/~thomash/linux into drm-next
A small fix for the long-standing ttm vm page protection hack. Sent as a separate PR as it touches mm, has all acks in place. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Hellström (VMware) <thellstrom@vmware.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200116102411.3056-1-thomas_os@shipmail.org |
|
|
|
5379e4dd32 |
mm, drm/ttm: Fix vm page protection handling
TTM graphics buffer objects may, transparently to user-space, move between IO and system memory. When that happens, all PTEs pointing to the old location are zapped before the move and then faulted in again if needed. When that happens, the page protection caching mode- and encryption bits may change and be different from those of struct vm_area_struct::vm_page_prot. We were using an ugly hack to set the page protection correctly. Fix that and instead export and use vmf_insert_mixed_prot() or use vmf_insert_pfn_prot(). Also get the default page protection from struct vm_area_struct::vm_page_prot rather than using vm_get_page_prot(). This way we catch modifications done by the vm system for drivers that want write-notification. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Christian König" <christian.koenig@amd.com> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by: Christian König <christian.koenig@amd.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
574c5b3d0e |
mm: Add a vmf_insert_mixed_prot() function
The TTM module today uses a hack to be able to set a different page protection than struct vm_area_struct::vm_page_prot. To be able to do this properly, add the needed vm functionality as vmf_insert_mixed_prot(). Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Christian König" <christian.koenig@amd.com> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> |
|
|
|
1e5f8a3085 |
Linux 5.5-rc3
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl4AEiYeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGR3sH/ixrBBYUVyjRPOxS ce4iVoTqphGSoAzq/3FA1YZZOPQ/Ep0NXL4L2fTGxmoiqIiuy8JPp07/NKbHQjj1 Rt6PGm6cw2pMJHaK9gRdlTH/6OyXkp06OkH1uHqKYrhPnpCWDnj+i2SHAX21Hr1y oBQh4/XKvoCMCV96J2zxRsLvw8OkQFE0ouWWfj6LbpXIsmWZ++s0OuaO1cVdP/oG j+j2Voi3B3vZNQtGgJa5W7YoZN5Qk4ZIj9bMPg7bmKRd3wNB228AiJH2w68JWD/I jCA+JcITilxC9ud96uJ6k7SMS2ufjQlnP0z6Lzd0El1yGtHYRcPOZBgfOoPU2Euf 33WGSyI= =iEwx -----END PGP SIGNATURE----- Merge tag 'v5.5-rc3' into sched/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
be1db4753e |
mm/memory.c: add apply_to_existing_page_range() helper
apply_to_page_range() takes an address range, and if any parts of it are not covered by the existing page table hierarchy, it allocates memory to fill them in. In some use cases, this is not what we want - we want to be able to operate exclusively on PTEs that are already in the tables. Add apply_to_existing_page_range() for this. Adjust the walker functions for apply_to_page_range to take 'create', which switches them between the old and new modes. This will be used in KASAN vmalloc. [akpm@linux-foundation.org: reduce code duplication] [akpm@linux-foundation.org: s/apply_to_existing_pages/apply_to_existing_page_range/] [akpm@linux-foundation.org: initialize __apply_to_page_range::err] Link: http://lkml.kernel.org/r/20191205140407.1874-1-dja@axtens.net Signed-off-by: Daniel Axtens <dja@axtens.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Daniel Axtens <dja@axtens.net> Cc: Qian Cai <cai@lca.pw> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
923717cbab |
sched/rt, mm: Use CONFIG_PREEMPTION
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same functionality which today depends on CONFIG_PREEMPT. Switch the pte_unmap_same() and SLUB code over to use CONFIG_PREEMPTION. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Chistoph Lameter <cl@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20191015191821.11479-26-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
f949286c66 |
mm: remove __ARCH_HAS_4LEVEL_HACK and include/asm-generic/4level-fixup.h
There are no architectures that use include/asm-generic/4level-fixup.h therefore it can be removed along with __ARCH_HAS_4LEVEL_HACK define. Link: http://lkml.kernel.org/r/1572938135-31886-14-git-send-email-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Anatoly Pugachev <matorola@gmail.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Peter Rosin <peda@axentia.se> Cc: Richard Weinberger <richard@nod.at> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Russell King <linux@armlinux.org.uk> Cc: Russell King <rmk+kernel@armlinux.org.uk> Cc: Sam Creasey <sammy@sammy.net> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
3cde287bb4 |
mm/memory.c: replace is_zero_pfn with is_huge_zero_pmd for thp
For hugely mapped thp, we use is_huge_zero_pmd() to check if it's zero page or not. We do fill ptes with my_zero_pfn() when we split zero thp pmd, but this is not what we have in vm_normal_page_pmd() -- pmd_trans_huge_lock() makes sure of it. This is a trivial fix for /proc/pid/numa_maps, and AFAIK nobody complains about it. Gerald Schaefer asked: : Maybe the description could also mention the symptom of this bug? : I would assume that it affects anon/dirty accounting in gather_pte_stats(), : for huge mappings, if zero page mappings are not correctly recognized. I came across this while I was looking at the code, so I'm not aware of any symptom. Link: http://lkml.kernel.org/r/20191108192629.201556-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Dave Airlie <airlied@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
f4f5329d45 |
mm: fix typos in comments when calling __SetPageUptodate()
There are several places emphasise the effect of __SetPageUptodate(), while the comment seems to have a typo in two places. Link: http://lkml.kernel.org/r/20190926023705.7226-1-richardw.yang@linux.intel.com Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
625110b5e9 |
mm/memory.c: fix a huge pud insertion race during faulting
A huge pud page can theoretically be faulted in racing with pmd_alloc()
in __handle_mm_fault(). That will lead to pmd_alloc() returning an
invalid pmd pointer.
Fix this by adding a pud_trans_unstable() function similar to
pmd_trans_unstable() and check whether the pud is really stable before
using the pmd pointer.
Race:
Thread 1: Thread 2: Comment
create_huge_pud() Fallback - not taken.
create_huge_pud() Taken.
pmd_alloc() Returns an invalid pointer.
This will result in user-visible huge page data corruption.
Note that this was caught during a code audit rather than a real
experienced problem. It looks to me like the only implementation that
currently creates huge pud pagetable entries is dev_dax_huge_fault()
which doesn't appear to care much about private (COW) mappings or
write-tracking which is, I believe, a prerequisite for create_huge_pud()
falling back on thread 1, but not in thread 2.
Link: http://lkml.kernel.org/r/20191115115808.21181-2-thomas_os@shipmail.org
Fixes:
|
|
|
|
e4dcad204d |
rss_stat: add support to detect RSS updates of external mm
When a process updates the RSS of a different process, the rss_stat tracepoint appears in the context of the process doing the update. This can confuse userspace that the RSS of process doing the update is updated, while in reality a different process's RSS was updated. This issue happens in reclaim paths such as with direct reclaim or background reclaim. This patch adds more information to the tracepoint about whether the mm being updated belongs to the current process's context (curr field). We also include a hash of the mm pointer so that the process who the mm belongs to can be uniquely identified (mm_id field). Also vsprintf.c is refactored a bit to allow reuse of hashing code. [akpm@linux-foundation.org: remove unused local `str'] [joelaf@google.com: inline call to ptr_to_hashval] Link: http://lore.kernel.org/r/20191113153816.14b95acd@gandalf.local.home Link: http://lkml.kernel.org/r/20191114164622.GC233237@google.com Link: http://lkml.kernel.org/r/20191106024452.81923-1-joel@joelfernandes.org Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reported-by: Ioannis Ilkos <ilkos@google.com> Acked-by: Petr Mladek <pmladek@suse.com> [lib/vsprintf.c] Cc: Tim Murray <timmurray@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Carmen Jackson <carmenjackson@google.com> Cc: Mayank Gupta <mayankgupta@google.com> Cc: Daniel Colascione <dancol@google.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Minchan Kim <minchan@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
b3d1411b67 |
mm: emit tracepoint when RSS changes
Useful to track how RSS is changing per TGID to detect spikes in RSS and
memory hogs. Several Android teams have been using this patch in
various kernel trees for half a year now. Many reported to me it is
really useful so I'm posting it upstream.
Initial patch developed by Tim Murray. Changes I made from original
patch: o Prevent any additional space consumed by mm_struct.
Regarding the fact that the RSS may change too often thus flooding the
traces - note that, there is some "hysterisis" with this already. That
is - We update the counter only if we receive 64 page faults due to
SPLIT_RSS_ACCOUNTING. However, during zapping or copying of pte range,
the RSS is updated immediately which can become noisy/flooding. In a
previous discussion, we agreed that BPF or ftrace can be used to rate
limit the signal if this becomes an issue.
Also note that I added wrappers to trace_rss_stat to prevent compiler
errors where linux/mm.h is included from tracing code, causing errors
such as:
CC kernel/trace/power-traces.o
In file included from ./include/trace/define_trace.h:102,
from ./include/trace/events/kmem.h:342,
from ./include/linux/mm.h:31,
from ./include/linux/ring_buffer.h:5,
from ./include/linux/trace_events.h:6,
from ./include/trace/events/power.h:12,
from kernel/trace/power-traces.c:15:
./include/trace/trace_events.h:113:22: error: field `ent' has incomplete type
struct trace_entry ent; \
Link: http://lore.kernel.org/r/20190903200905.198642-1-joel@joelfernandes.org
Link: http://lkml.kernel.org/r/20191001172817.234886-1-joel@joelfernandes.org
Co-developed-by: Tim Murray <timmurray@google.com>
Signed-off-by: Tim Murray <timmurray@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Carmen Jackson <carmenjackson@google.com>
Cc: Mayank Gupta <mayankgupta@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
89b15332af |
mm: drop mmap_sem before calling balance_dirty_pages() in write fault
One of our services is observing hanging ps/top/etc under heavy write
IO, and the task states show this is an mmap_sem priority inversion:
A write fault is holding the mmap_sem in read-mode and waiting for
(heavily cgroup-limited) IO in balance_dirty_pages():
balance_dirty_pages+0x724/0x905
balance_dirty_pages_ratelimited+0x254/0x390
fault_dirty_shared_page.isra.96+0x4a/0x90
do_wp_page+0x33e/0x400
__handle_mm_fault+0x6f0/0xfa0
handle_mm_fault+0xe4/0x200
__do_page_fault+0x22b/0x4a0
page_fault+0x45/0x50
Somebody tries to change the address space, contending for the mmap_sem in
write-mode:
call_rwsem_down_write_failed_killable+0x13/0x20
do_mprotect_pkey+0xa8/0x330
SyS_mprotect+0xf/0x20
do_syscall_64+0x5b/0x100
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
The waiting writer locks out all subsequent readers to avoid lock
starvation, and several threads can be seen hanging like this:
call_rwsem_down_read_failed+0x14/0x30
proc_pid_cmdline_read+0xa0/0x480
__vfs_read+0x23/0x140
vfs_read+0x87/0x130
SyS_read+0x42/0x90
do_syscall_64+0x5b/0x100
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
To fix this, do what we do for cache read faults already: drop the
mmap_sem before calling into anything IO bound, in this case the
balance_dirty_pages() function, and return VM_FAULT_RETRY.
Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
83d116c530 |
mm: fix double page fault on arm64 if PTE_AF is cleared
When we tested pmdk unit test [1] vmmalloc_fork TEST3 on arm64 guest, there
will be a double page fault in __copy_from_user_inatomic of cow_user_page.
To reproduce the bug, the cmd is as follows after you deployed everything:
make -C src/test/vmmalloc_fork/ TEST_TIME=60m check
Below call trace is from arm64 do_page_fault for debugging purpose:
[ 110.016195] Call trace:
[ 110.016826] do_page_fault+0x5a4/0x690
[ 110.017812] do_mem_abort+0x50/0xb0
[ 110.018726] el1_da+0x20/0xc4
[ 110.019492] __arch_copy_from_user+0x180/0x280
[ 110.020646] do_wp_page+0xb0/0x860
[ 110.021517] __handle_mm_fault+0x994/0x1338
[ 110.022606] handle_mm_fault+0xe8/0x180
[ 110.023584] do_page_fault+0x240/0x690
[ 110.024535] do_mem_abort+0x50/0xb0
[ 110.025423] el0_da+0x20/0x24
The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[ffff9b007000] pgd=000000023d4f8003, pud=000000023da9b003,
pmd=000000023d4b3003, pte=360000298607bd3
As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."
This patch fixes it by calling pte_mkyoung. Also, the parameter is
changed because vmf should be passed to cow_user_page()
Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error
in case there can be some obscure use-case (by Kirill).
[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
Signed-off-by: Jia He <justin.he@arm.com>
Reported-by: Yibo Cai <Yibo.Cai@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
|
|
6aa9b8b2c6 |
mm: do not hash address in print_bad_pte()
Using %px to show the actual address in print_bad_pte() to help us to debug issue. Link: http://lkml.kernel.org/r/20190831011816.141002-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
7b167b6810 |
mm: release the spinlock on zap_pte_range
In our testing (camera recording), Miguel and Wei found unmap_page_range() takes above 6ms with preemption disabled easily. When I see that, the reason is it holds page table spinlock during entire 512 page operation in a PMD. 6.2ms is never trivial for user experince if RT task couldn't run in the time because it could make frame drop or glitch audio problem. I had a time to benchmark it via adding some trace_printk hooks between pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing device is 2018 premium mobile device. I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the task runs on little core even though it doesn't have any IPI and LRU lock contention. It's already too heavy. If I remove activate_page, 35-40% overhead of zap_pte_range is gone so most of overhead(about 0.7ms) comes from activate_page via mark_page_accessed. Thus, if there are LRU contention, that 0.7ms could accumulate up to several ms. So this patch adds a check for need_resched() in the loop, and a preemption point if necessary. Link: http://lkml.kernel.org/r/20190731061440.GC155569@google.com Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Miguel de Dios <migueldedios@google.com> Reported-by: Wei Wang <wvw@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
9da99f20ec |
mm: remove redundant assignment of entry
Since ptent will not be changed after previous assignment of entry, it is not necessary to do the assignment again. Link: http://lkml.kernel.org/r/20190708082740.21111-1-richardw.yang@linux.intel.com Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
dc617f29db |
vfs: don't allow writes to swap files
Don't let userspace write to an active swap file because the kernel effectively has a long term lease on the storage and things could get seriously corrupted if we let this happen. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> |
|
|
|
43675e6fbb |
mm: thp: make transhuge_vma_suitable available for anonymous THP
transhuge_vma_suitable() was only available for shmem THP, but anonymous THP has the same check except pgoff check. And, it will be used for THP eligible check in the later patch, so make it available for all kind of THPs. This also helps reduce code duplication slightly. Since anonymous THP doesn't have to check pgoff, so make pgoff check shmem vma only. And regroup some functions in include/linux/mm.h to solve compile issue since transhuge_vma_suitable() needs call vma_is_anonymous() which was defined after huge_mm.h is included. [akpm@linux-foundation.org: fix typo] [yang.shi@linux.alibaba.com: v4] Link: http://lkml.kernel.org/r/1563400758-124759-2-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1560401041-32207-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
fec88ab0af |
HMM patches for 5.3
Improvements and bug fixes for the hmm interface in the kernel:
- Improve clarity, locking and APIs related to the 'hmm mirror' feature
merged last cycle. In linux-next we now see AMDGPU and nouveau to be
using this API.
- Remove old or transitional hmm APIs. These are hold overs from the past
with no users, or APIs that existed only to manage cross tree conflicts.
There are still a few more of these cleanups that didn't make the merge
window cut off.
- Improve some core mm APIs:
* export alloc_pages_vma() for driver use
* refactor into devm_request_free_mem_region() to manage
DEVICE_PRIVATE resource reservations
* refactor duplicative driver code into the core dev_pagemap
struct
- Remove hmm wrappers of improved core mm APIs, instead have drivers use
the simplified API directly
- Remove DEVICE_PUBLIC
- Simplify the kconfig flow for the hmm users and core code
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl0k1zkACgkQOG33FX4g
mxrO+w//QF/yI/9Hh30RWEBq8W107cODkDlaT0Z/7cVEXfGetZzIUpqzxnJofRfQ
xTw1XmYkc9WpJe/mTTuFZFewNQwWuMM6X0Xi25fV438/Y64EclevlcJTeD49TIH1
CIMsz8bX7CnCEq5sz+UypLg9LPnaD9L/JLyuSbyjqjms/o+yzqa7ji7p/DSINuhZ
Qva9OZL1ZSEDJfNGi8uGpYBqryHoBAonIL12R9sCF5pbJEnHfWrH7C06q7AWOAjQ
4vjN/p3F4L9l/v2IQ26Kn/S0AhmN7n3GT//0K66e2gJPfXa8fxRKGuFn/Kd79EGL
YPASn5iu3cM23up1XkbMNtzacL8yiIeTOcMdqw26OaOClojy/9OJduv5AChe6qL/
VUQIAn1zvPsJTyC5U7mhmkrGuTpP6ivHpxtcaUp+Ovvi1cyK40nLCmSNvLnbN5ES
bxbb0SjE4uupDG5qU6Yct/hFp6uVMSxMqXZOb9Xy8ZBkbMsJyVOLj71G1/rVIfPU
hO1AChX5CRG1eJoMo6oBIpiwmSvcOaPp3dqIOQZvwMOqrO869LR8qv7RXyh/g9gi
FAEKnwLl4GK3YtEO4Kt/1YI5DXYjSFUbfgAs0SPsRKS6hK2+RgRk2M/B/5dAX0/d
lgOf9WPODPwiSXBYLtJB8qHVDX0DIY8faOyTx6BYIKClUtgbBI8=
=wKvp
-----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull HMM updates from Jason Gunthorpe:
"Improvements and bug fixes for the hmm interface in the kernel:
- Improve clarity, locking and APIs related to the 'hmm mirror'
feature merged last cycle. In linux-next we now see AMDGPU and
nouveau to be using this API.
- Remove old or transitional hmm APIs. These are hold overs from the
past with no users, or APIs that existed only to manage cross tree
conflicts. There are still a few more of these cleanups that didn't
make the merge window cut off.
- Improve some core mm APIs:
- export alloc_pages_vma() for driver use
- refactor into devm_request_free_mem_region() to manage
DEVICE_PRIVATE resource reservations
- refactor duplicative driver code into the core dev_pagemap
struct
- Remove hmm wrappers of improved core mm APIs, instead have drivers
use the simplified API directly
- Remove DEVICE_PUBLIC
- Simplify the kconfig flow for the hmm users and core code"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
mm: remove the HMM config option
mm: sort out the DEVICE_PRIVATE Kconfig mess
mm: simplify ZONE_DEVICE page private data
mm: remove hmm_devmem_add
mm: remove hmm_vma_alloc_locked_page
nouveau: use devm_memremap_pages directly
nouveau: use alloc_page_vma directly
PCI/P2PDMA: use the dev_pagemap internal refcount
device-dax: use the dev_pagemap internal refcount
memremap: provide an optional internal refcount in struct dev_pagemap
memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
memremap: remove the data field in struct dev_pagemap
memremap: add a migrate_to_ram method to struct dev_pagemap_ops
memremap: lift the devmap_enable manipulation into devm_memremap_pages
memremap: pass a struct dev_pagemap to ->kill and ->cleanup
memremap: move dev_pagemap callbacks into a separate structure
memremap: validate the pagemap type passed to devm_memremap_pages
mm: factor out a devm_request_free_mem_region helper
mm: export alloc_pages_vma
...
|
|
|
|
1e426fe282 |
mm: use down_read_killable for locking mmap_sem in access_remote_vm
This function is used by ptrace and proc files like /proc/pid/cmdline and /proc/pid/environ. Access_remote_vm never returns error codes, all errors are ignored and only size of successfully read data is returned. So, if current task was killed we'll simply return 0 (bytes read). Mmap_sem could be locked for a long time or forever if something goes wrong. Using a killable lock permits cleanup of stuck tasks and simplifies investigation. Link: http://lkml.kernel.org/r/156007494202.3335.16782303099589302087.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
96756fcb83 |
mm/memory.c: fail when offset == num in first check of __vm_map_pages()
If the caller asks us for offset == num, we should already fail in the first check, i.e. the one testing for offsets beyond the object. At the moment, we are failing on the second test anyway, since count cannot be 0. Still, to agree with the comment of the first test, we should first test it there. Link: http://lkml.kernel.org/r/20190528193004.GA7744@gmail.com Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
8b1e0f81fb |
mm/pgtable: drop pgtable_t variable from pte_fn_t functions
Drop the pgtable_t variable from all implementation for pte_fn_t as none of them use it. apply_to_pte_range() should stop computing it as well. Should help us save some cycles. Link: http://lkml.kernel.org/r/1556803126-26596-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Matthew Wilcox <willy@infradead.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: <jglisse@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
eb085574a7 |
mm, swap: fix race between swapoff and some swap operations
When swapin is performed, after getting the swap entry information from
the page table, system will swap in the swap entry, without any lock held
to prevent the swap device from being swapoff. This may cause the race
like below,
CPU 1 CPU 2
----- -----
do_swap_page
swapin_readahead
__read_swap_cache_async
swapoff swapcache_prepare
p->swap_map = NULL __swap_duplicate
p->swap_map[?] /* !!! NULL pointer access */
Because swapoff is usually done when system shutdown only, the race may
not hit many people in practice. But it is still a race need to be fixed.
To fix the race, get_swap_device() is added to check whether the specified
swap entry is valid in its swap device. If so, it will keep the swap
entry valid via preventing the swap device from being swapoff, until
put_swap_device() is called.
Because swapoff() is very rare code path, to make the normal path runs as
fast as possible, rcu_read_lock/unlock() and synchronize_rcu() instead of
reference count is used to implement get/put_swap_device(). >From
get_swap_device() to put_swap_device(), RCU reader side is locked, so
synchronize_rcu() in swapoff() will wait until put_swap_device() is
called.
In addition to swap_map, cluster_info, etc. data structure in the struct
swap_info_struct, the swap cache radix tree will be freed after swapoff,
so this patch fixes the race between swap cache looking up and swapoff
too.
Races between some other swap cache usages and swapoff are fixed too via
calling synchronize_rcu() between clearing PageSwapCache() and freeing
swap cache data structure.
Another possible method to fix this is to use preempt_off() +
stop_machine() to prevent the swap device from being swapoff when its data
structure is being accessed. The overhead in hot-path of both methods is
similar. The advantages of RCU based method are,
1. stop_machine() may disturb the normal execution code path on other
CPUs.
2. File cache uses RCU to protect its radix tree. If the similar
mechanism is used for swap cache too, it is easier to share code
between them.
3. RCU is used to protect swap cache in total_swapcache_pages() and
exit_swap_address_space() already. The two mechanisms can be
merged to simplify the logic.
Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
Fixes:
|
|
|
|
465fc3a9b3 |
mm/memory.c: trivial clean up in insert_page()
Make the success case use the same cleanup path as the failure case. Link: http://lkml.kernel.org/r/20190523134024.GC24093@localhost.localdomain Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
897e6365cd |
memremap: add a migrate_to_ram method to struct dev_pagemap_ops
This replaces the hacky ->fault callback, which is currently directly called from common code through a hmm specific data structure as an exercise in layering violations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> |
|
|
|
25b2995a35 |
mm: remove MEMORY_DEVICE_PUBLIC support
The code hasn't been used since it was added to the tree, and doesn't appear to actually be usable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> |
|
|
|
457c899653 |
treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
|
|
a667d7456f |
mm: introduce new vm_map_pages() and vm_map_pages_zero() API
Patch series "mm: Use vm_map_pages() and vm_map_pages_zero() API", v5. This patch (of 5): Previouly drivers have their own way of mapping range of kernel pages/memory into user vma and this was done by invoking vm_insert_page() within a loop. As this pattern is common across different drivers, it can be generalized by creating new functions and using them across the drivers. vm_map_pages() is the API which can be used to map kernel memory/pages in drivers which have considered vm_pgoff vm_map_pages_zero() is the API which can be used to map a range of kernel memory/pages in drivers which have not considered vm_pgoff. vm_pgoff is passed as default 0 for those drivers. We _could_ then at a later "fix" these drivers which are using vm_map_pages_zero() to behave according to the normal vm_pgoff offsetting simply by removing the _zero suffix on the function name and if that causes regressions, it gives us an easy way to revert. Tested on Rockchip hardware and display is working, including talking to Lima via prime. Link: http://lkml.kernel.org/r/751cb8a0f4c3e67e95c58a3b072937617f338eea.1552921225.git.jrdr.linux@gmail.com Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Suggested-by: Russell King <linux@armlinux.org.uk> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Tested-by: Heiko Stuebner <heiko@sntech.de> Cc: Michal Hocko <mhocko@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@surriel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thierry Reding <treding@nvidia.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Stefan Richter <stefanr@s5r6.in-berlin.de> Cc: Sandy Huang <hjc@rock-chips.com> Cc: David Airlie <airlied@linux.ie> Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Pawel Osciak <pawel@osciak.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
7269f99993 |
mm/mmu_notifier: use correct mmu_notifier events for each invalidation
This updates each existing invalidation to use the correct mmu notifier event that represent what is happening to the CPU page table. See the patch which introduced the events to see the rational behind this. Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
6f4f13e8d9 |
mm/mmu_notifier: contextual information for event triggering invalidation
CPU page table update can happens for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compression, reclaim, migration,
...).
Users of mmu notifier API track changes to the CPU page table and take
specific action for them. While current API only provide range of virtual
address affected by the change, not why the changes is happening.
This patchset do the initial mechanical convertion of all the places that
calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is know (most invalidation happens against
a given vma). Passing down the vma allows the users of mmu notifier to
inspect the new vma page protection.
The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
should assume that every for the range is going away when that event
happens. A latter patch do convert mm call path to use a more appropriate
events for each call.
This is done as 2 patches so that no call site is forgotten especialy
as it uses this following coccinelle patch:
%<----------------------------------------------------------------------
@@
identifier I1, I2, I3, I4;
@@
static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
+enum mmu_notifier_event event,
+unsigned flags,
+struct vm_area_struct *vma,
struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
@@
@@
-#define mmu_notifier_range_init(range, mm, start, end)
+#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>
@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }
@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }
@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...> }
---------------------------------------------------------------------->%
Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place
Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
0968621917 |
Printk changes for 5.2
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAlzP8nQACgkQUqAMR0iA lPK79A/+NkRouqA9ihAZhUbgW0DHzOAFvUJSBgX11HQAZbGjngakuoyYFvwUx0T0 m80SUTCysxQrWl+xLdccPZ9ZrhP2KFQrEBEdeYHZ6ymcYcl83+3bOIBS7VwdZAbO EzB8u/58uU/sI6ABL4lF7ZF/+R+U4CXveEUoVUF04bxdPOxZkRX4PT8u3DzCc+RK r4yhwQUXGcKrHa2GrRL3GXKsDxcnRdFef/nzq4RFSZsi0bpskzEj34WrvctV6j+k FH/R3kEcZrtKIMPOCoDMMWq07yNqK/QKj0MJlGoAlwfK4INgcrSXLOx+pAmr6BNq uMKpkxCFhnkZVKgA/GbKEGzFf+ZGz9+2trSFka9LD2Ig6DIstwXqpAgiUK8JFQYj lq1mTaJZD3DfF2vnGHGeAfBFG3XETv+mIT/ow6BcZi3NyNSVIaqa5GAR+lMc6xkR waNkcMDkzLFuP1r0p7ZizXOksk9dFkMP3M6KqJomRtApwbSNmtt+O2jvyLPvB3+w wRyN9WT7IJZYo4v0rrD5Bl6BjV15ZeCPRSFZRYofX+vhcqJQsFX1M9DeoNqokh55 Cri8f6MxGzBVjE1G70y2/cAFFvKEKJud0NUIMEuIbcy+xNrEAWPF8JhiwpKKnU10 c0u674iqHJ2HeVsYWZF0zqzqQ6E1Idhg/PrXfuVuhAaL5jIOnYY= =WZfC -----END PGP SIGNATURE----- Merge tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk Pull printk updates from Petr Mladek: - Allow state reset of printk_once() calls. - Prevent crashes when dereferencing invalid pointers in vsprintf(). Only the first byte is checked for simplicity. - Make vsprintf warnings consistent and inlined. - Treewide conversion of obsolete %pf, %pF to %ps, %pF printf modifiers. - Some clean up of vsprintf and test_printf code. * tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: lib/vsprintf: Make function pointer_string static vsprintf: Limit the length of inlined error messages vsprintf: Avoid confusion between invalid address and value vsprintf: Prevent crash when dereferencing invalid pointers vsprintf: Consolidate handling of unknown pointer specifiers vsprintf: Factor out %pO handler as kobject_string() vsprintf: Factor out %pV handler as va_format() vsprintf: Factor out %p[iI] handler as ip_addr_string() vsprintf: Do not check address of well-known strings vsprintf: Consistent %pK handling for kptr_restrict == 0 vsprintf: Shuffle restricted_pointer() printk: Tie printk_once / printk_deferred_once into .data.once for reset treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively lib/test_printf: Switch to bitmap_zalloc() |
|
|
|
d75f773c86 |
treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
%pF and %pf are functionally equivalent to %pS and %ps conversion specifiers. The former are deprecated, therefore switch the current users to use the preferred variant. The changes have been produced by the following command: git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \ while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done And verifying the result. Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: linux-arm-kernel@lists.infradead.org Cc: sparclinux@vger.kernel.org Cc: linux-um@lists.infradead.org Cc: xen-devel@lists.xenproject.org Cc: linux-acpi@vger.kernel.org Cc: linux-pm@vger.kernel.org Cc: drbd-dev@lists.linbit.com Cc: linux-block@vger.kernel.org Cc: linux-mmc@vger.kernel.org Cc: linux-nvdimm@lists.01.org Cc: linux-pci@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: linux-btrfs@vger.kernel.org Cc: linux-f2fs-devel@lists.sourceforge.net Cc: linux-mm@kvack.org Cc: ceph-devel@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Acked-by: David Sterba <dsterba@suse.com> (for btrfs) Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c) Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci) Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Petr Mladek <pmladek@suse.com> |
|
|
|
fa0aafb8ac |
asm-generic/tlb: Remove tlb_flush_mmu_free()
As the comment notes; it is a potentially dangerous operation. Just use tlb_flush_mmu(), that will skip the (double) TLB invalidate if it really isn't needed anyway. No change in behavior intended. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
ed6a79352c |
asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
Move the mmu_gather::page_size things into the generic code instead of PowerPC specific bits. No change in behavior intended. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nick Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
cae85cb8ad |
mm/memory.c: fix modifying of page protection by insert_pfn()
Aneesh has reported that PPC triggers the following warning when
excercising DAX code:
IP set_pte_at+0x3c/0x190
LR insert_pfn+0x208/0x280
Call Trace:
insert_pfn+0x68/0x280
dax_iomap_pte_fault.isra.7+0x734/0xa40
__xfs_filemap_fault+0x280/0x2d0
do_wp_page+0x48c/0xa40
__handle_mm_fault+0x8d0/0x1fd0
handle_mm_fault+0x140/0x250
__do_page_fault+0x300/0xd60
handle_page_fault+0x18
Now that is WARN_ON in set_pte_at which is
VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
The problem is that on some architectures set_pte_at() cannot cope with
a situation where there is already some (different) valid entry present.
Use ptep_set_access_flags() instead to modify the pfn which is built to
deal with modifying existing PTE.
Link: http://lkml.kernel.org/r/20190311084537.16029-1-jack@suse.cz
Fixes:
|
|
|
|
fc8efd2ddf |
mm/memory.c: do_fault: avoid usage of stale vm_area_struct
LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8. This is a stress test, where one thread mmaps/writes/munmaps memory area and other thread is trying to read from it: CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51 Hardware name: IBM 2964 N63 400 (z/VM 6.4.0) Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8) Call Trace: ([<0000000000000000>] (null)) [<00000000001adae4>] lock_acquire+0xec/0x258 [<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98 [<000000000012a780>] page_table_free+0x48/0x1a8 [<00000000002f6e54>] do_fault+0xdc/0x670 [<00000000002fadae>] __handle_mm_fault+0x416/0x5f0 [<00000000002fb138>] handle_mm_fault+0x1b0/0x320 [<00000000001248cc>] do_dat_exception+0x19c/0x2c8 [<000000000080e5ee>] pgm_check_handler+0x19e/0x200 page_table_free() is called with NULL mm parameter, but because "0" is a valid address on s390 (see S390_lowcore), it keeps going until it eventually crashes in lockdep's lock_acquire. This crash is reproducible at least since 4.14. Problem is that "vmf->vma" used in do_fault() can become stale. Because mmap_sem may be released, other threads can come in, call munmap() and cause "vma" be returned to kmem cache, and get zeroed/re-initialized and re-used: handle_mm_fault | __handle_mm_fault | do_fault | vma = vmf->vma | do_read_fault | __do_fault | vma->vm_ops->fault(vmf); | mmap_sem is released | | | do_munmap() | remove_vma_list() | remove_vma() | vm_area_free() | # vma is released | ... | # same vma is allocated | # from kmem cache | do_mmap() | vm_area_alloc() | memset(vma, 0, ...) | pte_free(vma->vm_mm, ...); | page_table_free | spin_lock_bh(&mm->context.lock);| <crash> | Cache mm_struct to avoid using potentially stale "vma". [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c Link: http://lkml.kernel.org/r/5b3fdf19e2a5be460a384b936f5b56e13733f1b8.1551595137.git.jstancek@redhat.com Signed-off-by: Jan Stancek <jstancek@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Acked-by: Rafael Aquini <aquini@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
a862f68a8b |
docs/core-api/mm: fix return value descriptions in mm/
Many kernel-doc comments in mm/ have the return value descriptions either misformatted or omitted at all which makes kernel-doc script unhappy: $ make V=1 htmldocs ... ./mm/util.c:36: info: Scanning doc for kstrdup ./mm/util.c:41: warning: No description found for return value of 'kstrdup' ./mm/util.c:57: info: Scanning doc for kstrdup_const ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const' ./mm/util.c:75: info: Scanning doc for kstrndup ./mm/util.c:83: warning: No description found for return value of 'kstrndup' ... Fixing the formatting and adding the missing return value descriptions eliminates ~100 such warnings. Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
04a8645304 |
mm: update ptep_modify_prot_commit to take old pte value as arg
Architectures like ppc64 require to do a conditional tlb flush based on the old and new value of pte. Enable that by passing old pte value as the arg. Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
0cbe3e26ab |
mm: update ptep_modify_prot_start/commit to take vm_area_struct as arg
Patch series "NestMMU pte upgrade workaround for mprotect", v5.
We can upgrade pte access (R -> RW transition) via mprotect. We need to
make sure we follow the recommended pte update sequence as outlined in
commit
|
|
|
|
d9f7979c92 |
mm: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Laura Abbott <labbott@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
0ee930e6ca |
mm/memory.c: prevent mapping typed pages to userspace
Pages which use page_type must never be mapped to userspace as it would destroy their page type. Add an explicit check for this instead of assuming that kernel drivers always get this right. Link: http://lkml.kernel.org/r/20190129053830.3749-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
2d432cb709 |
mm: prevent mapping slab pages to userspace
It's never appropriate to map a page allocated by SLAB into userspace. A buggy device driver might try this, or an attacker might be able to find a way to make it happen. Christoph said: : Let's just fail the code. Currently this may work with SLUB. But SLAB : and SLOB overlay fields with mapcount. So you would have a corrupted page : struct if you mapped a slab page to user space. Link: http://lkml.kernel.org/r/20190125173827.2658-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
52d1e606ee |
mm: reuse only-pte-mapped KSM page in do_wp_page()
Add an optimization for KSM pages almost in the same way that we have for ordinary anonymous pages. If there is a write fault in a page, which is mapped to an only pte, and it is not related to swap cache; the page may be reused without copying its content. [ Note that we do not consider PageSwapCache() pages at least for now, since we don't want to complicate __get_ksm_page(), which has nice optimization based on this (for the migration case). Currenly it is spinning on PageSwapCache() pages, waiting for when they have unfreezed counters (i.e., for the migration finish). But we don't want to make it also spinning on swap cache pages, which we try to reuse, since there is not a very high probability to reuse them. So, for now we do not consider PageSwapCache() pages at all. ] So in reuse_ksm_page() we check for 1) PageSwapCache() and 2) page_stable_node(), to skip a page, which KSM is currently trying to link to stable tree. Then we do page_ref_freeze() to prohibit KSM to merge one more page into the page, we are reusing. After that, nobody can refer to the reusing page: KSM skips !PageSwapCache() pages with zero refcount; and the protection against of all other participants is the same as for reused ordinary anon pages pte lock, page lock and mmap_sem. [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s] Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Cc: Rik van Riel <riel@surriel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
98fa15f34c |
mm: replace all open encodings for NUMA_NO_NODE
Patch series "Replace all open encodings for NUMA_NO_NODE", v3. All these places for replacement were found by running the following grep patterns on the entire kernel code. Please let me know if this might have missed some instances. This might also have replaced some false positives. I will appreciate suggestions, inputs and review. 1. git grep "nid == -1" 2. git grep "node == -1" 3. git grep "nid = -1" 4. git grep "node = -1" This patch (of 2): At present there are multiple places where invalid node number is encoded as -1. Even though implicitly understood it is always better to have macros in there. Replace these open encodings for an invalid node number with the global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like 'invalid node' from various places redirecting them to a common definition. Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe] Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx] Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband] Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
1ed7293ac4 |
mm/memory.c: initialise mmu_notifier_range correctly
One of the paths in follow_pte_pmd() initialised the mmu_notifier_range
incorrectly.
Link: http://lkml.kernel.org/r/20190103002126.GM6310@bombadil.infradead.org
Fixes:
|
|
|
|
63f3655f95 |
mm, memcg: fix reclaim deadlock with writeback
Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback
task1:
wait_on_page_bit+0x82/0xa0
shrink_page_list+0x907/0x960
shrink_inactive_list+0x2c7/0x680
shrink_node_memcg+0x404/0x830
shrink_node+0xd8/0x300
do_try_to_free_pages+0x10d/0x330
try_to_free_mem_cgroup_pages+0xd5/0x1b0
try_charge+0x14d/0x720
memcg_kmem_charge_memcg+0x3c/0xa0
memcg_kmem_charge+0x7e/0xd0
__alloc_pages_nodemask+0x178/0x260
alloc_pages_current+0x95/0x140
pte_alloc_one+0x17/0x40
__pte_alloc+0x1e/0x110
alloc_set_pte+0x5fe/0xc20
do_fault+0x103/0x970
handle_mm_fault+0x61e/0xd10
__do_page_fault+0x252/0x4d0
do_page_fault+0x30/0x80
page_fault+0x28/0x30
task2:
__lock_page+0x86/0xa0
mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
ext4_writepages+0x479/0xd60
do_writepages+0x1e/0x30
__writeback_single_inode+0x45/0x320
writeback_sb_inodes+0x272/0x600
__writeback_inodes_wb+0x92/0xc0
wb_writeback+0x268/0x300
wb_workfn+0xb4/0x390
process_one_work+0x189/0x420
worker_thread+0x4e/0x4b0
kthread+0xe6/0x100
ret_from_fork+0x41/0x50
He adds
"task1 is waiting for the PageWriteback bit of the page that task2 has
collected in mpd->io_submit->io_bio, and tasks2 is waiting for the
LOCKED bit the page which tasks1 has locked"
More precisely task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg. That in turn hits a
memory limit reclaim and the memcg reclaim for legacy controller is
waiting on the writeback but that is never going to finish because the
writeback itself is waiting for the page locked in the #PF path. So
this is essentially ABBA deadlock:
lock_page(A)
SetPageWriteback(A)
unlock_page(A)
lock_page(B)
lock_page(B)
pte_alloc_pne
shrink_page_list
wait_on_page_writeback(A)
SetPageWriteback(B)
unlock_page(B)
# flush A, B to clear the writeback
This accumulating of more pages to flush is used by several filesystems
to generate a more optimal IO patterns.
Waiting for the writeback in legacy memcg controller is a workaround for
pre-mature OOM killer invocations because there is no dirty IO
throttling available for the controller. There is no easy way around
that unfortunately. Therefore fix this specific issue by pre-allocating
the page table outside of the page lock. We have that handy
infrastructure for that already so simply reuse the fault-around pattern
which already does this.
There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a fs page locked but they should be really rare. I am not
aware of a better solution unfortunately.
[akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@kernel.org: enhance comment, per Johannes]
Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
Fixes:
|
|
|
|
4cf5892495 |
mm: treewide: remove unused address argument from pte_alloc functions
Patch series "Add support for fast mremap".
This series speeds up the mremap(2) syscall by copying page tables at
the PMD level even for non-THP systems. There is concern that the extra
'address' argument that mremap passes to pte_alloc may do something
subtle architecture related in the future that may make the scheme not
work. Also we find that there is no point in passing the 'address' to
pte_alloc since its unused. This patch therefore removes this argument
tree-wide resulting in a nice negative diff as well. Also ensuring
along the way that the enabled architectures do not do anything funky
with the 'address' argument that goes unnoticed by the optimization.
Build and boot tested on x86-64. Build tested on arm64. The config
enablement patch for arm64 will be posted in the future after more
testing.
The changes were obtained by applying the following Coccinelle script.
(thanks Julia for answering all Coccinelle questions!).
Following fix ups were done manually:
* Removal of address argument from pte_fragment_alloc
* Removal of pte_alloc_one_fast definitions from m68k and microblaze.
// Options: --include-headers --no-includes
// Note: I split the 'identifier fn' line, so if you are manually
// running it, please unsplit it so it runs for you.
virtual patch
@pte_alloc_func_def depends on patch exists@
identifier E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
type T2;
@@
fn(...
- , T2 E2
)
{ ... }
@pte_alloc_func_proto_noarg depends on patch exists@
type T1, T2, T3, T4;
identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
(
- T3 fn(T1, T2);
+ T3 fn(T1);
|
- T3 fn(T1, T2, T4);
+ T3 fn(T1, T2);
)
@pte_alloc_func_proto depends on patch exists@
identifier E1, E2, E4;
type T1, T2, T3, T4;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
(
- T3 fn(T1 E1, T2 E2);
+ T3 fn(T1 E1);
|
- T3 fn(T1 E1, T2 E2, T4 E4);
+ T3 fn(T1 E1, T2 E2);
)
@pte_alloc_func_call depends on patch exists@
expression E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
fn(...
-, E2
)
@pte_alloc_macro depends on patch exists@
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
identifier a, b, c;
expression e;
position p;
@@
(
- #define fn(a, b, c) e
+ #define fn(a, b) e
|
- #define fn(a, b) e
+ #define fn(a) e
)
Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
7635d9cbe8 |
mm, thp, proc: report THP eligibility for each vma
Userspace falls short when trying to find out whether a specific memory range is eligible for THP. There are usecases that would like to know that http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com : This is used to identify heap mappings that should be able to fault thp : but do not, and they normally point to a low-on-memory or fragmentation : issue. The only way to deduce this now is to query for hg resp. nh flags and confronting the state with the global setting. Except that there is also PR_SET_THP_DISABLE that might change the picture. So the final logic is not trivial. Moreover the eligibility of the vma depends on the type of VMA as well. In the past we have supported only anononymous memory VMAs but things have changed and shmem based vmas are supported as well these days and the query logic gets even more complicated because the eligibility depends on the mount option and another global configuration knob. Simplify the current state and report the THP eligibility in /proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled for this purpose. The original implementation of this function assumes that the caller knows that the vma itself is supported for THP so make the core checks into __transparent_hugepage_enabled and use it for existing callers. __show_smap just use the new transparent_hugepage_enabled which also checks the vma support status (please note that this one has to be out of line due to include dependency issues). [mhocko@kernel.org: fix oops with NULL ->f_mapping] Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Oppenheimer <bepvte@gmail.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
ac46d4f3c4 |
mm/mmu_notifier: use structure for invalidate_range_start/end calls v2
To avoid having to change many call sites everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end cakks. No functional changes with this patch. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> From: Jérôme Glisse <jglisse@redhat.com> Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
f2c57d91b0 |
mm: Fix warning in insert_pfn()
In DAX mode a write pagefault can race with write(2) in the following
way:
CPU0 CPU1
write fault for mapped zero page (hole)
dax_iomap_rw()
iomap_apply()
xfs_file_iomap_begin()
- allocates blocks
dax_iomap_actor()
invalidate_inode_pages2_range()
- invalidates radix tree entries in given range
dax_iomap_pte_fault()
grab_mapping_entry()
- no entry found, creates empty
...
xfs_file_iomap_begin()
- finds already allocated block
...
vmf_insert_mixed_mkwrite()
- WARNs and does nothing because there
is still zero page mapped in PTE
unmap_mapping_pages()
This race results in WARN_ON from insert_pfn() and is occasionally
triggered by fstest generic/344. Note that the race is otherwise
harmless as before write(2) on CPU0 is finished, we will invalidate page
tables properly and thus user of mmap will see modified data from
write(2) from that point on. So just restrict the warning only to the
case when the PFN in PTE is not zero page.
Link: http://lkml.kernel.org/r/20180824154542.26872-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
ff09d7ec97 |
mm/memory.c: recheck page table entry with page table lock held
We clear the pte temporarily during read/modify/write update of the pte. If we take a page fault while the pte is cleared, the application can get SIGBUS. One such case is with remap_pfn_range without a backing vm_ops->fault callback. do_fault will return SIGBUS in that case. cpu 0 cpu1 mprotect() ptep_modify_prot_start()/pte cleared. . . page fault. . . prep_modify_prot_commit() Fix this by taking page table lock and rechecking for pte_none. [aneesh.kumar@linux.ibm.com: fix crash observed with syzkaller run] Link: http://lkml.kernel.org/r/87va6bwlfg.fsf@linux.ibm.com Link: http://lkml.kernel.org/r/20180926031858.9692-1-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ido Schimmel <idosch@idosch.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
9b5a8e00d4 |
mm: convert insert_pfn() to vm_fault_t
All callers convert its errno into a vm_fault_t, so convert it to return a vm_fault_t directly. Link: http://lkml.kernel.org/r/20180828145728.11873-11-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
79f3aa5ba9 |
mm: convert __vm_insert_mixed() to vm_fault_t
Both of its callers currently convert its errno return into a vm_fault_t, so move the conversion into __vm_insert_mixed(). Link: http://lkml.kernel.org/r/20180828145728.11873-10-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
6d958546ff |
mm: inline vm_insert_pfn_prot() into caller
vm_insert_pfn_prot() is only called from vmf_insert_pfn_prot(), so inline it and convert some of the errnos into vm_fault codes earlier. Link: http://lkml.kernel.org/r/20180828145728.11873-9-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
ae2b01f370 |
mm: remove vm_insert_pfn()
All callers are now converted to vmf_insert_pfn() so convert vmf_insert_pfn() from being a compatibility wrapper around vm_insert_pfn() to being a compatibility wrapper around vmf_insert_pfn_prot(). Link: http://lkml.kernel.org/r/20180828145728.11873-8-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
bc12e6ad96 |
mm: make vm_insert_pfn_prot() static
Now this is no longer used outside mm/memory.c, make it static. Link: http://lkml.kernel.org/r/20180828145728.11873-6-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
f5e6d1d5f8 |
mm: introduce vmf_insert_pfn_prot()
Like vm_insert_pfn_prot(), but returns a vm_fault_t instead of an errno. Also unexport vm_insert_pfn_prot as it has no modular users. Link: http://lkml.kernel.org/r/20180828145728.11873-4-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
5d74763745 |
mm: remove vm_insert_mixed()
All callers are now converted to vmf_insert_mixed() so convert vmf_insert_mixed() from being a compatibility wrapper into the real function. Link: http://lkml.kernel.org/r/20180828145728.11873-3-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
196d9d8bb7 |
mm/memory: Move mmu_gather and TLB invalidation code into its own file
In preparation for maintaining the mmu_gather code as its own entity, move the implementation out of memory.c and into its own file. Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Will Deacon <will.deacon@arm.com> |
|
|
|
a6d60245d6 |
asm-generic/tlb: Track which levels of the page tables have been cleared
It is common for architectures with hugepage support to require only a single TLB invalidation operation per hugepage during unmap(), rather than iterating through the mapping at a PAGE_SIZE increment. Currently, however, the level in the page table where the unmap() operation occurs is not stored in the mmu_gather structure, therefore forcing architectures to issue additional TLB invalidation operations or to give up and over-invalidate by e.g. invalidating the entire TLB. Ideally, we could add an interval rbtree to the mmu_gather structure, which would allow us to associate the correct mapping granule with the various sub-mappings within the range being invalidated. However, this is costly in terms of book-keeping and memory management, so instead we approximate by keeping track of the page table levels that are cleared and provide a means to query the smallest granule required for invalidation. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Will Deacon <will.deacon@arm.com> |
|
|
|
1b2de5d039 |
mm/cow: don't bother write protecting already write-protected pages
This is not normally noticeable, but repeated forks are unnecessarily
expensive because they repeatedly dirty the parent page tables during
the page table copy operation.
It's trivial to just avoid write protecting the page table entry if it
was already not writable.
This patch was inspired by
https://bugzilla.kernel.org/show_bug.cgi?id=200447
which points to an ancient "waste time re-doing fork" issue in the
presence of lots of signals.
That bug was fixed by Eric Biederman's signal handling series
culminating in commit
|
|
|
|
33e17876ea |
Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:
- the rest of MM
- various misc fixes and tweaks
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
mm: Change return type int to vm_fault_t for fault handlers
lib/fonts: convert comments to utf-8
s390: ebcdic: convert comments to UTF-8
treewide: convert ISO_8859-1 text comments to utf-8
drivers/gpu/drm/gma500/: change return type to vm_fault_t
docs/core-api: mm-api: add section about GFP flags
docs/mm: make GFP flags descriptions usable as kernel-doc
docs/core-api: split memory management API to a separate file
docs/core-api: move *{str,mem}dup* to "String Manipulation"
docs/core-api: kill trailing whitespace in kernel-api.rst
mm/util: add kernel-doc for kvfree
mm/util: make strndup_user description a kernel-doc comment
fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
treewide: correct "differenciate" and "instanciate" typos
fs/afs: use new return type vm_fault_t
drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
mm: soft-offline: close the race against page allocation
mm: fix race on soft-offlining free huge pages
namei: allow restricted O_CREAT of FIFOs and regular files
hfs: prevent crash on exit from failed search
...
|
|
|
|
2b74030354 |
mm: Change return type int to vm_fault_t for fault handlers
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.
Ref-> commit
|
|
|
|
fd1102f0aa |
mm: mmu_notifier fix for tlb_end_vma
The generic tlb_end_vma does not call invalidate_range mmu notifier, and it resets resets the mmu_gather range, which means the notifier won't be called on part of the range in case of an unmap that spans multiple vmas. ARM64 seems to be the only arch I could see that has notifiers and uses the generic tlb_end_vma. I have not actually tested it. [ Catalin and Will point out that ARM64 currently only uses the notifiers for KVM, which doesn't use the ->invalidate_range() callback right now, so it's a bug, but one that happens to not affect them. So not necessary for stable. - Linus ] Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
d86564a2f0 |
mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
Jann reported that x86 was missing required TLB invalidates when he
hit the !*batch slow path in tlb_remove_table().
This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
invalidates, the PowerPC-hash where this code originated and the
Sparc-hash where this was subsequently used did not need that. ARM
which later used this put an explicit TLB invalidate in their
__p*_free_tlb() functions, and PowerPC-radix followed that example.
But when we hooked up x86 we failed to consider this. Fix this by
(optionally) hooking tlb_remove_table() into the TLB invalidate code.
NOTE: s390 was also needing something like this and might now
be able to use the generic code again.
[ Modified to be on top of Nick's cleanups, which simplified this patch
now that tlb_flush_mmu_tlbonly() really only flushes the TLB - Linus ]
Fixes:
|
|
|
|
a6f572084f |
mm/tlb: Remove tlb_remove_table() non-concurrent condition
Will noted that only checking mm_users is incorrect; we should also
check mm_count in order to cover CPUs that have a lazy reference to
this mm (and could do speculative TLB operations).
If removing this turns out to be a performance issue, we can
re-instate a more complete check, but in tlb_table_flush() eliding the
call_rcu_sched().
Fixes:
|
|
|
|
db7ddef301 |
mm: move tlb_table_flush to tlb_flush_mmu_free
There is no need to call this from tlb_flush_mmu_tlbonly, it logically belongs with tlb_flush_mmu_free. This makes future fixes simpler. [ This was originally done to allow code consolidation for the mmu_notifier fix, but it also ends up helping simplify the HAVE_RCU_TABLE_INVALIDATE fix. - Linus ] Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
52a288c736 |
x86/mm/tlb: Revert the recent lazy TLB patches
Revert commits: |
|
|
|
50c150f262 |
Revert "mm: always flush VMA ranges affected by zap_page_range"
There was a bug in Linux that could cause madvise (and mprotect?) system calls to return to userspace without the TLB having been flushed for all the pages involved. This could happen when multiple threads of a process made simultaneous madvise and/or mprotect calls. This was noticed in the summer of 2017, at which time two solutions were created: |
|
|
|
29ef680ae7 |
memcg, oom: move out_of_memory back to the charge path
Commit
|
|
|
|
c9f4cd7138 |
mm, huge page: copy target sub-page last when copy huge page
Huge page helps to reduce TLB miss rate, but it has higher cache
footprint, sometimes this may cause some issue. For example, when
copying huge page on x86_64 platform, the cache footprint is 4M. But on
a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC
(last level cache). That is, in average, there are 2.5M LLC for each
core and 1.25M LLC for each thread.
If the cache contention is heavy when copying the huge page, and we copy
the huge page from the begin to the end, it is possible that the begin
of huge page is evicted from the cache after we finishing copying the
end of the huge page. And it is possible for the application to access
the begin of the huge page after copying the huge page.
In
|
|
|
|
c6ddfb6c58 |
mm, clear_huge_page: move order algorithm into a separate function
Patch series "mm, huge page: Copy target sub-page last when copy huge page", v2. Huge page helps to reduce TLB miss rate, but it has higher cache footprint, sometimes this may cause some issue. For example, when copying huge page on x86_64 platform, the cache footprint is 4M. But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC (last level cache). That is, in average, there are 2.5M LLC for each core and 1.25M LLC for each thread. If the cache contention is heavy when copying the huge page, and we copy the huge page from the begin to the end, it is possible that the begin of huge page is evicted from the cache after we finishing copying the end of the huge page. And it is possible for the application to access the begin of the huge page after copying the huge page. In |
|
|
|
fadae29530 |
thp: use mm_file_counter to determine update which rss counter
Since commit |
|
|
|
e1fb4a0864 |
dax: remove VM_MIXEDMAP for fsdax and device dax
This patch is reworked from an earlier patch that Dan has posted: https://patchwork.kernel.org/patch/10131727/ VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that the memory page it is dealing with is not typical memory from the linear map. The get_user_pages_fast() path, since it does not resolve the vma, is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we use that as a VM_MIXEDMAP replacement in some locations. In the cases where there is no pte to consult we fallback to using vma_is_dax() to detect the VM_MIXEDMAP special case. Now that we have explicit driver pfn_t-flag opt-in/opt-out for get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This also means we no longer need to worry about safely manipulating vm_flags in a future where we support dynamically changing the dax mode of a file. DAX should also now be supported with madvise_behavior(), vma_merge(), and copy_page_range(). This patch has been tested against ndctl unit test. It has also been tested against xfstests commit: 625515d using fake pmem created by memmap and no additional issues have been observed. Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
73ba2fb33c |
for-4.19/block-20180812
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltwvasQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv65EACTq5gSLnJBI6ZPr1RAHruVDnjfzO2Veitl
tUtjm0XfWmnEiwQ3dYvnyhy99xbyaG3900d9BClCTlH6xaUdSiQkDpcKG/R2F36J
5mZitYukQcpFAQJWF8YKsTTE7JPl4VglCIDqYiC4+C3rOSVi8lrKn2qp4J4MMCFn
thRg3jCcq7c5s9Eigsop1pXWQSasubkXfk55Krcp4oybKYpYRKXXf74Mj14QAbwJ
QHN3VisyAUWoBRg7UQZo1Npe2oPk6bbnJypnjf8M0M2EnlvddEkIlHob91sodka8
6p4APOEu5cbyXOBCAQsw/koff14mb8aEadqeQA68WvXfIdX9ZjfxCX0OoC3sBEXk
yqJhZ0C980AM13zIBD8ejv4uasGcPca8W+47mE5P8sRiI++5kBsFWDZPCtUBna0X
2Kh24NsmEya9XRR5vsB84dsIPQ3tLMkxg/IgQRVDaSnfJz0c/+zm54xDyKRaFT4l
5iERk2WSkm9+8jNfVmWG0edrv6nRAXjpGwFfOCPh6/LCSCi4xQRULYN7sVzsX8ZK
FRjt24HftBI8mJbh4BtweJvg+ppVe1gAk3IO3HvxAQhv29Hz+uvFYe9kL+3N8LJA
Qosr9n9O4+wKYizJcDnw+5iPqCHfAwOm9th4pyedR+R7SmNcP3yNC8AbbheNBiF5
Zolos5H+JA==
=b9ib
-----END PGP SIGNATURE-----
Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"First pull request for this merge window, there will also be a
followup request with some stragglers.
This pull request contains:
- Fix for a thundering heard issue in the wbt block code (Anchal
Agarwal)
- A few NVMe pull requests:
* Improved tracepoints (Keith)
* Larger inline data support for RDMA (Steve Wise)
* RDMA setup/teardown fixes (Sagi)
* Effects log suppor for NVMe target (Chaitanya Kulkarni)
* Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
* TP4004 (ANA) support (Christoph)
* Various NVMe fixes
- Block io-latency controller support. Much needed support for
properly containing block devices. (Josef)
- Series improving how we handle sense information on the stack
(Kees)
- Lightnvm fixes and updates/improvements (Mathias/Javier et al)
- Zoned device support for null_blk (Matias)
- AIX partition fixes (Mauricio Faria de Oliveira)
- DIF checksum code made generic (Max Gurtovoy)
- Add support for discard in iostats (Michael Callahan / Tejun)
- Set of updates for BFQ (Paolo)
- Removal of async write support for bsg (Christoph)
- Bio page dirtying and clone fixups (Christoph)
- Set of bcache fix/changes (via Coly)
- Series improving blk-mq queue setup/teardown speed (Ming)
- Series improving merging performance on blk-mq (Ming)
- Lots of other fixes and cleanups from a slew of folks"
* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
blkcg: Make blkg_root_lookup() work for queues in bypass mode
bcache: fix error setting writeback_rate through sysfs interface
null_blk: add lock drop/acquire annotation
Blk-throttle: reduce tail io latency when iops limit is enforced
block: paride: pd: mark expected switch fall-throughs
block: Ensure that a request queue is dissociated from the cgroup controller
block: Introduce blk_exit_queue()
blkcg: Introduce blkg_root_lookup()
block: Remove two superfluous #include directives
blk-mq: count the hctx as active before allocating tag
block: bvec_nr_vecs() returns value for wrong slab
bcache: trivial - remove tailing backslash in macro BTREE_FLAG
bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
bcache: set max writeback rate when I/O request is idle
bcache: add code comments for bset.c
bcache: fix mistaken comments in request.c
bcache: fix mistaken code comments in bcache.h
bcache: add a comment in super.c
bcache: avoid unncessary cache prefetch bch_btree_node_get()
bcache: display rate debug parameters to 0 when writeback is not running
...
|
|
|
|
958f338e96 |
Merge branch 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Merge L1 Terminal Fault fixes from Thomas Gleixner:
"L1TF, aka L1 Terminal Fault, is yet another speculative hardware
engineering trainwreck. It's a hardware vulnerability which allows
unprivileged speculative access to data which is available in the
Level 1 Data Cache when the page table entry controlling the virtual
address, which is used for the access, has the Present bit cleared or
other reserved bits set.
If an instruction accesses a virtual address for which the relevant
page table entry (PTE) has the Present bit cleared or other reserved
bits set, then speculative execution ignores the invalid PTE and loads
the referenced data if it is present in the Level 1 Data Cache, as if
the page referenced by the address bits in the PTE was still present
and accessible.
While this is a purely speculative mechanism and the instruction will
raise a page fault when it is retired eventually, the pure act of
loading the data and making it available to other speculative
instructions opens up the opportunity for side channel attacks to
unprivileged malicious code, similar to the Meltdown attack.
While Meltdown breaks the user space to kernel space protection, L1TF
allows to attack any physical memory address in the system and the
attack works across all protection domains. It allows an attack of SGX
and also works from inside virtual machines because the speculation
bypasses the extended page table (EPT) protection mechanism.
The assoicated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646
The mitigations provided by this pull request include:
- Host side protection by inverting the upper address bits of a non
present page table entry so the entry points to uncacheable memory.
- Hypervisor protection by flushing L1 Data Cache on VMENTER.
- SMT (HyperThreading) control knobs, which allow to 'turn off' SMT
by offlining the sibling CPU threads. The knobs are available on
the kernel command line and at runtime via sysfs
- Control knobs for the hypervisor mitigation, related to L1D flush
and SMT control. The knobs are available on the kernel command line
and at runtime via sysfs
- Extensive documentation about L1TF including various degrees of
mitigations.
Thanks to all people who have contributed to this in various ways -
patches, review, testing, backporting - and the fruitful, sometimes
heated, but at the end constructive discussions.
There is work in progress to provide other forms of mitigations, which
might be less horrible performance wise for a particular kind of
workloads, but this is not yet ready for consumption due to their
complexity and limitations"
* 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
x86/microcode: Allow late microcode loading with SMT disabled
tools headers: Synchronise x86 cpufeatures.h for L1TF additions
x86/mm/kmmio: Make the tracer robust against L1TF
x86/mm/pat: Make set_memory_np() L1TF safe
x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
x86/speculation/l1tf: Invert all not present mappings
cpu/hotplug: Fix SMT supported evaluation
KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
Documentation/l1tf: Remove Yonah processors from not vulnerable list
x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
x86: Don't include linux/irq.h from asm/hardirq.h
x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
cpu/hotplug: detect SMT disabled by BIOS
...
|