mirror of https://github.com/torvalds/linux.git
335 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
f9a705ad1c |
ARM:
- New page table code for both hypervisor and guest stage-2 - Introduction of a new EL2-private host context - Allow EL2 to have its own private per-CPU variables - Support of PMU event filtering - Complete rework of the Spectre mitigation PPC: - Fix for running nested guests with in-kernel IRQ chip - Fix race condition causing occasional host hard lockup - Minor cleanups and bugfixes x86: - allow trapping unknown MSRs to userspace - allow userspace to force #GP on specific MSRs - INVPCID support on AMD - nested AMD cleanup, on demand allocation of nested SVM state - hide PV MSRs and hypercalls for features not enabled in CPUID - new test for MSR_IA32_TSC writes from host and guest - cleanups: MMU, CPUID, shared MSRs - LAPIC latency optimizations ad bugfixes For x86, also included in this pull request is a new alternative and (in the future) more scalable implementation of extended page tables that does not need a reverse map from guest physical addresses to host physical addresses. For now it is disabled by default because it is still lacking a few of the existing MMU's bells and whistles. However it is a very solid piece of work and it is already available for people to hammer on it. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl+S8dsUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroM40Af+M46NJmuS5rcwFfybvK/c42KT6svX Co1NrZDwzSQ2mMy3WQzH9qeLvb+nbY4sT3n5BPNPNsT+aIDPOTDt//qJ2/Ip9UUs tRNea0MAR96JWLE7MSeeRxnTaQIrw/AAZC0RXFzZvxcgytXwdqBExugw4im+b+dn Dcz8QxX1EkwT+4lTm5HC0hKZAuo4apnK1QkqCq4SdD2QVJ1YE6+z7pgj4wX7xitr STKD6q/Yt/0ndwqS0GSGbyg0jy6mE620SN6isFRkJYwqfwLJci6KnqvEK67EcNMu qeE017K+d93yIVC46/6TfVHzLR/D1FpQ8LZ16Yl6S13OuGIfAWBkQZtPRg== =AD6a -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Paolo Bonzini: "For x86, there is a new alternative and (in the future) more scalable implementation of extended page tables that does not need a reverse map from guest physical addresses to host physical addresses. For now it is disabled by default because it is still lacking a few of the existing MMU's bells and whistles. However it is a very solid piece of work and it is already available for people to hammer on it. Other updates: ARM: - New page table code for both hypervisor and guest stage-2 - Introduction of a new EL2-private host context - Allow EL2 to have its own private per-CPU variables - Support of PMU event filtering - Complete rework of the Spectre mitigation PPC: - Fix for running nested guests with in-kernel IRQ chip - Fix race condition causing occasional host hard lockup - Minor cleanups and bugfixes x86: - allow trapping unknown MSRs to userspace - allow userspace to force #GP on specific MSRs - INVPCID support on AMD - nested AMD cleanup, on demand allocation of nested SVM state - hide PV MSRs and hypercalls for features not enabled in CPUID - new test for MSR_IA32_TSC writes from host and guest - cleanups: MMU, CPUID, shared MSRs - LAPIC latency optimizations ad bugfixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (232 commits) kvm: x86/mmu: NX largepage recovery for TDP MMU kvm: x86/mmu: Don't clear write flooding count for direct roots kvm: x86/mmu: Support MMIO in the TDP MMU kvm: x86/mmu: Support write protection for nesting in tdp MMU kvm: x86/mmu: Support disabling dirty logging for the tdp MMU kvm: x86/mmu: Support dirty logging for the TDP MMU kvm: x86/mmu: Support changed pte notifier in tdp MMU kvm: x86/mmu: Add access tracking for tdp_mmu kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU kvm: x86/mmu: Add TDP MMU PF handler kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg kvm: x86/mmu: Support zapping SPTEs in the TDP MMU KVM: Cache as_id in kvm_memory_slot kvm: x86/mmu: Add functions to handle changed TDP SPTEs kvm: x86/mmu: Allocate and free TDP MMU roots kvm: x86/mmu: Init / Uninit the TDP MMU kvm: x86/mmu: Introduce tdp_iter KVM: mmu: extract spte.h and spte.c KVM: mmu: Separate updating a PTE from kvm_set_pte_rmapp ... |
|
|
|
66af4f5cb1 |
x86/kvm: Update the comment about asynchronous page fault in exc_page_fault()
KVM was switched to interrupt-based mechanism for 'page ready' event delivery in Linux-5.8 (see commit |
|
|
|
ca1b66922a |
* Extend the recovery from MCE in kernel space also to processes which
encounter an MCE in kernel space but while copying from user memory by
sending them a SIGBUS on return to user space and umapping the faulty
memory, by Tony Luck and Youquan Song.
* memcpy_mcsafe() rework by splitting the functionality into
copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables
support for new hardware which can recover from a machine check
encountered during a fast string copy and makes that the default and
lets the older hardware which does not support that advance recovery,
opt in to use the old, fragile, slow variant, by Dan Williams.
* New AMD hw enablement, by Yazen Ghannam and Akshay Gupta.
* Do not use MSR-tracing accessors in #MC context and flag any fault
while accessing MCA architectural MSRs as an architectural violation
with the hope that such hw/fw misdesigns are caught early during the hw
eval phase and they don't make it into production.
* Misc fixes, improvements and cleanups, as always.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+EIpUACgkQEsHwGGHe
VUouoBAAgwb+NkWZtIqGImV4f+LOyFjhTR/r/7ZyiijXdbhOIuAdc/jQM31mQxug
sX2jxaRYnf1n6SLA0ggX99gwr2deRQ/hsNf5Abw55GC+Z1dOxpGL0k59A3ELl1IR
H9KYmCAFQIHvzfk38qcdND73XHcgthQoXFBOG9wAPAdgDWnaiWt6lcLAq8OiJTmp
D8pInAYhcnL8YXwMGyQQ1KkFn9HwydoWDsK5Ff2shaw2/+dMQqd1zetenbVtjhLb
iNYGvV7Bi/RQ8PyMbzmtTWa4kwQJAHC2gptkGxty//2ADGVBbqUQdqF9TjIWCNy5
V6Ldv5zo0/1s7DOzji3htzqkSs/K1Ea6d2LtZjejkJipHKV5x068UC6Fu+PlfS2D
VZfcICeapU4G2F3Zvks2DlZ7dVTbHCvoI78Qi7bBgczPUVmk6iqah4xuQaiHyBJc
kTFDA4Nnf/026GpoWRiFry9vqdnHBZyLet5A6Y+SoWF0FbhYnCVPpq4MnussYoav
lUIi9ZZav6X2RZp9DDM1f9d5xubtKq0DKt93wvzqAhjK0T2DikckJ+riOYkI6N8t
fHCBNUkdfgyMzJUTBPAzYQ7RmjbjKWJi7xWP0oz6+GqOJkQfSTVC5/2yEffbb3ya
whYRS6iklbl7yshzaOeecXsZcAeK2oGPfoHg34WkHFgXdF5mNgA=
=u1Wg
-----END PGP SIGNATURE-----
Merge tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:
- Extend the recovery from MCE in kernel space also to processes which
encounter an MCE in kernel space but while copying from user memory
by sending them a SIGBUS on return to user space and umapping the
faulty memory, by Tony Luck and Youquan Song.
- memcpy_mcsafe() rework by splitting the functionality into
copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables
support for new hardware which can recover from a machine check
encountered during a fast string copy and makes that the default and
lets the older hardware which does not support that advance recovery,
opt in to use the old, fragile, slow variant, by Dan Williams.
- New AMD hw enablement, by Yazen Ghannam and Akshay Gupta.
- Do not use MSR-tracing accessors in #MC context and flag any fault
while accessing MCA architectural MSRs as an architectural violation
with the hope that such hw/fw misdesigns are caught early during the
hw eval phase and they don't make it into production.
- Misc fixes, improvements and cleanups, as always.
* tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Allow for copy_mc_fragile symbol checksum to be generated
x86/mce: Decode a kernel instruction to determine if it is copying from user
x86/mce: Recover from poison found while copying from user space
x86/mce: Avoid tail copy when machine check terminated a copy from user
x86/mce: Add _ASM_EXTABLE_CPY for copy user access
x86/mce: Provide method to find out the type of an exception handler
x86/mce: Pass pointer to saved pt_regs to severity calculation routines
x86/copy_mc: Introduce copy_mc_enhanced_fast_string()
x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list
x86/mce: Add Skylake quirk for patrol scrub reported errors
RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE()
x86/mce: Annotate mce_rd/wrmsrl() with noinstr
x86/mce/dev-mcelog: Do not update kflags on AMD systems
x86/mce: Stop mce_reign() from re-computing severity for every CPU
x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR
x86/mce: Increase maximum number of banks to 64
x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check()
x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap
RAS/CEC: Fix cec_init() prototype
|
|
|
|
3006381013 |
x86/mce: Decode a kernel instruction to determine if it is copying from user
All instructions copying data between kernel and user memory are tagged with either _ASM_EXTABLE_UA or _ASM_EXTABLE_CPY entries in the exception table. ex_fault_handler_type() returns EX_HANDLER_UACCESS for both of these. Recovery is only possible when the machine check was triggered on a read from user memory. In this case the same strategy for recovery applies as if the user had made the access in ring3. If the fault was in kernel memory while copying to user there is no current recovery plan. For MOV and MOVZ instructions a full decode of the instruction is done to find the source address. For MOVS instructions the source address is in the %rsi register. The function fault_in_kernel_space() determines whether the source address is kernel or user, upgrade it from "static" so it can be used here. Co-developed-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201006210910.21062-7-tony.luck@intel.com |
|
|
|
4819e15f74 |
x86/mm/32: Bring back vmalloc faulting on x86_32
One can not simply remove vmalloc faulting on x86-32. Upstream commit: |
|
|
|
968614fc7b |
mm/x86: use general page fault accounting
Use the general page fault accounting by passing regs into handle_mm_fault(). Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Link: http://lkml.kernel.org/r/20200707225021.200906-23-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
bce617edec |
mm: do page fault accounting in handle_mm_fault
Patch series "mm: Page fault accounting cleanups", v5.
This is v5 of the pf accounting cleanup series. It originates from Gerald
Schaefer's report on an issue a week ago regarding to incorrect page fault
accountings for retried page fault after commit
|
|
|
|
ca15ca406f |
mm: remove unneeded includes of <asm/pgalloc.h>
Patch series "mm: cleanup usage of <asm/pgalloc.h>"
Most architectures have very similar versions of pXd_alloc_one() and
pXd_free_one() for intermediate levels of page table. These patches add
generic versions of these functions in <asm-generic/pgalloc.h> and enable
use of the generic functions where appropriate.
In addition, functions declared and defined in <asm/pgalloc.h> headers are
used mostly by core mm and early mm initialization in arch and there is no
actual reason to have the <asm/pgalloc.h> included all over the place.
The first patch in this series removes unneeded includes of
<asm/pgalloc.h>
In the end it didn't work out as neatly as I hoped and moving
pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
unnecessary changes to arches that have custom page table allocations, so
I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
to mm/.
This patch (of 8):
In most cases <asm/pgalloc.h> header is required only for allocations of
page table memory. Most of the .c files that include that header do not
use symbols declared in <asm/pgalloc.h> and do not require that header.
As for the other header files that used to include <asm/pgalloc.h>, it is
possible to move that include into the .c file that actually uses symbols
from <asm/pgalloc.h> and drop the include from the header file.
The process was somewhat automated using
sed -i -E '/[<"]asm\/pgalloc\.h/d' \
$(grep -L -w -f /tmp/xx \
$(git grep -E -l '[<"]asm/pgalloc\.h'))
where /tmp/xx contains all the symbols defined in
arch/*/include/asm/pgalloc.h.
[rppt@linux.ibm.com: fix powerpc warning]
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
a27a0a5549 |
x86/entry: Cleanup idtentry_enter/exit
Remove the temporary defines and fixup all references. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20200722220520.855839271@linutronix.de |
|
|
|
b037b09b90 |
x86/entry: Rename idtentry_enter/exit_cond_rcu() to idtentry_enter/exit()
They were originally called _cond_rcu because they were special versions with conditional RCU handling. Now they're the standard entry and exit path, so the _cond_rcu part is just confusing. Drop it. Also change the signature to make them more extensible and more foolproof. No functional change -- it's pure refactoring. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/247fc67685263e0b673e1d7f808182d28ff80359.1593795633.git.luto@kernel.org |
|
|
|
25f12ae45f |
maccess: rename probe_kernel_address to get_kernel_nofault
Better describe what this helper does, and match the naming of copy_from_kernel_nofault. Also switch the argument order around, so that it acts and looks like get_user(). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
fe557319aa |
maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault
Better describe what these functions do. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
3e77abda65 |
x86/idt: Consolidate idt functionality
- Move load_current_idt() out of line and replace the hideous comment with a lockdep assert. This allows to make idt_table and idt_descr static. - Mark idt_table read only after the IDT initialization is complete. - Shuffle code around to consolidate the #ifdef sections into one. - Adapt the F00F bug code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200528145523.084915381@linutronix.de |
|
|
|
91eeafea1e |
x86/entry: Switch page fault exception to IDTENTRY_RAW
Convert page fault exceptions to IDTENTRY_RAW:
- Implement the C entry point with DEFINE_IDTENTRY_RAW
- Add the CR2 read into the exception handler
- Add the idtentry_enter/exit_cond_rcu() invocations in
in the regular page fault handler and in the async PF
part.
- Emit the ASM stub with DECLARE_IDTENTRY_RAW
- Remove the ASM idtentry in 64-bit
- Remove the CR2 read from 64-bit
- Remove the open coded ASM entry code in 32-bit
- Fix up the XEN/PV code
- Remove the old prototypes
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20200521202118.238455120@linutronix.de
|
|
|
|
49893c5cb2 |
x86/entry: Convert Invalid Opcode exception to IDTENTRY
Convert #UD to IDTENTRY: - Implement the C entry point with DEFINE_IDTENTRY - Emit the ASM stub with DECLARE_IDTENTRY - Remove the ASM idtentry in 64bit - Remove the open coded ASM entry code in 32bit - Fixup the XEN/PV code - Fixup the FOOF bug call in fault.c - Remove the old prototypes No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Andy Lutomirski <luto@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134904.955511913@linutronix.de |
|
|
|
ca4c6a9858 |
x86/traps: Make interrupt enable/disable symmetric in C code
Traps enable interrupts conditionally but rely on the ASM return code to disable them again. That results in redundant interrupt disable and trace calls. Make the trap handlers disable interrupts before returning to avoid that, which allows simplification of the ASM entry code in follow up changes. Originally-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20200505134903.622702796@linutronix.de |
|
|
|
c1e8d7c6a7 |
mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
da1c55f1b2 |
mmap locking API: rename mmap_sem to mmap_lock
Rename the mmap_sem field to mmap_lock. Any new uses of this lock should now go through the new mmap locking api. The mmap_lock is still implemented as a rwsem, though this could change in the future. [akpm@linux-foundation.org: fix it for mm-gup-might_lock_readmmap_sem-in-get_user_pages_fast.patch] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-11-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
d8ed45c5dc |
mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
039aeb9deb |
ARM:
- Move the arch-specific code into arch/arm64/kvm
- Start the post-32bit cleanup
- Cherry-pick a few non-invasive pre-NV patches
x86:
- Rework of TLB flushing
- Rework of event injection, especially with respect to nested virtualization
- Nested AMD event injection facelift, building on the rework of generic code
and fixing a lot of corner cases
- Nested AMD live migration support
- Optimization for TSC deadline MSR writes and IPIs
- Various cleanups
- Asynchronous page fault cleanups (from tglx, common topic branch with tip tree)
- Interrupt-based delivery of asynchronous "page ready" events (host side)
- Hyper-V MSRs and hypercalls for guest debugging
- VMX preemption timer fixes
s390:
- Cleanups
Generic:
- switch vCPU thread wakeup from swait to rcuwait
The other architectures, and the guest side of the asynchronous page fault
work, will come next week.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl7VJcYUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPf6QgAq4wU5wdd1lTGz/i3DIhNVJNJgJlp
ozLzRdMaJbdbn5RpAK6PEBd9+pt3+UlojpFB3gpJh2Nazv2OzV4yLQgXXXyyMEx1
5Hg7b4UCJYDrbkCiegNRv7f/4FWDkQ9dx++RZITIbxeskBBCEI+I7GnmZhGWzuC4
7kj4ytuKAySF2OEJu0VQF6u0CvrNYfYbQIRKBXjtOwuRK4Q6L63FGMJpYo159MBQ
asg3B1jB5TcuGZ9zrjL5LkuzaP4qZZHIRs+4kZsH9I6MODHGUxKonrkablfKxyKy
CFK+iaHCuEXXty5K0VmWM3nrTfvpEjVjbMc7e1QGBQ5oXsDM0pqn84syRg==
=v7Wn
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Move the arch-specific code into arch/arm64/kvm
- Start the post-32bit cleanup
- Cherry-pick a few non-invasive pre-NV patches
x86:
- Rework of TLB flushing
- Rework of event injection, especially with respect to nested
virtualization
- Nested AMD event injection facelift, building on the rework of
generic code and fixing a lot of corner cases
- Nested AMD live migration support
- Optimization for TSC deadline MSR writes and IPIs
- Various cleanups
- Asynchronous page fault cleanups (from tglx, common topic branch
with tip tree)
- Interrupt-based delivery of asynchronous "page ready" events (host
side)
- Hyper-V MSRs and hypercalls for guest debugging
- VMX preemption timer fixes
s390:
- Cleanups
Generic:
- switch vCPU thread wakeup from swait to rcuwait
The other architectures, and the guest side of the asynchronous page
fault work, will come next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (256 commits)
KVM: selftests: fix rdtsc() for vmx_tsc_adjust_test
KVM: check userspace_addr for all memslots
KVM: selftests: update hyperv_cpuid with SynDBG tests
x86/kvm/hyper-v: Add support for synthetic debugger via hypercalls
x86/kvm/hyper-v: enable hypercalls regardless of hypercall page
x86/kvm/hyper-v: Add support for synthetic debugger interface
x86/hyper-v: Add synthetic debugger definitions
KVM: selftests: VMX preemption timer migration test
KVM: nVMX: Fix VMX preemption timer migration
x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
KVM: x86/pmu: Support full width counting
KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
KVM: x86: acknowledgment mechanism for async pf page ready notifications
KVM: x86: interrupt based APF 'page ready' event delivery
KVM: introduce kvm_read_guest_offset_cached()
KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present()
KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously"
KVM: VMX: Replace zero-length array with flexible-array
...
|
|
|
|
7f0a002b5a |
x86/mm: remove vmalloc faulting
Remove fault handling on vmalloc areas, as the vmalloc code now takes care of synchronizing changes to all page-tables in the system. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Andy Lutomirski <luto@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200515140023.25469-8-joro@8bytes.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
73f693c3a7 |
mm: remove vmalloc_sync_(un)mappings()
These functions are not needed anymore because the vmalloc and ioremap mappings are now synchronized when they are created or torn down. Remove all callers and function definitions. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Andy Lutomirski <luto@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200515140023.25469-7-joro@8bytes.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
86cf69f1d8 |
x86/mm/32: implement arch_sync_kernel_mappings()
Implement the function to sync changes in vmalloc and ioremap ranges to all page-tables. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Andy Lutomirski <luto@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200515140023.25469-6-joro@8bytes.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
ef68017eb5 |
x86/kvm: Handle async page faults directly through do_page_fault()
KVM overloads #PF to indicate two types of not-actually-page-fault events. Right now, the KVM guest code intercepts them by modifying the IDT and hooking the #PF vector. This makes the already fragile fault code even harder to understand, and it also pollutes call traces with async_page_fault and do_async_page_fault for normal page faults. Clean it up by moving the logic into do_page_fault() using a static branch. This gets rid of the platform trap_init override mechanism completely. [ tglx: Fixed up 32bit, removed error code from the async functions and massaged coding style ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134059.169270470@linutronix.de |
|
|
|
3122e80efc |
mm/vma: make vma_is_accessible() available for general use
Lets move vma_is_accessible() helper to include/linux/mm.h which makes it available for general use. While here, this replaces all remaining open encodings for VMA access check with vma_is_accessible(). Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Guo Ren <guoren@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Guo Ren <guoren@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Burton <paulburton@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Nick Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
4064b98270 |
mm: allow VM_FAULT_RETRY for multiple times
The idea comes from a discussion between Linus and Andrea [1].
Before this patch we only allow a page fault to retry once. We achieved
this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
handle_mm_fault() the second time. This was majorly used to avoid
unexpected starvation of the system by looping over forever to handle the
page fault on a single page. However that should hardly happen, and after
all for each code path to return a VM_FAULT_RETRY we'll first wait for a
condition (during which time we should possibly yield the cpu) to happen
before VM_FAULT_RETRY is really returned.
This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
flag when we receive VM_FAULT_RETRY. It means that the page fault handler
now can retry the page fault for multiple times if necessary without the
need to generate another page fault event. Meanwhile we still keep the
FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
page fault is the first attempt or not.
Then we'll have these combinations of fault flags (only considering
ALLOW_RETRY flag and TRIED flag):
- ALLOW_RETRY and !TRIED: this means the page fault allows to
retry, and this is the first try
- ALLOW_RETRY and TRIED: this means the page fault allows to
retry, and this is not the first try
- !ALLOW_RETRY and !TRIED: this means the page fault does not allow
to retry at all
- !ALLOW_RETRY and TRIED: this is forbidden and should never be used
In existing code we have multiple places that has taken special care of
the first condition above by checking against (fault_flags &
FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect
the first retry of a page fault by checking against both (fault_flags &
FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED) because now
even the 2nd try will have the ALLOW_RETRY set, then use that helper in
all existing special paths. One example is in __lock_page_or_retry(), now
we'll drop the mmap_sem only in the first attempt of page fault and we'll
keep it in follow up retries, so old locking behavior will be retained.
This will be a nice enhancement for current code [2] at the same time a
supporting material for the future userfaultfd-writeprotect work, since in
that work there will always be an explicit userfault writeprotect retry
for protected pages, and if that cannot resolve the page fault (e.g., when
userfaultfd-writeprotect is used in conjunction with swapped pages) then
we'll possibly need a 3rd retry of the page fault. It might also benefit
other potential users who will have similar requirement like userfault
write-protection.
GUP code is not touched yet and will be covered in follow up patch.
Please read the thread below for more information.
[1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
[2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
dde1607248 |
mm: introduce FAULT_FLAG_DEFAULT
Although there're tons of arch-specific page fault handlers, most of them are still sharing the same initial value of the page fault flags. Say, merely all of the page fault handlers would allow the fault to be retried, and they also allow the fault to respond to SIGKILL. Let's define a default value for the fault flags to replace those initial page fault flags that were copied over. With this, it'll be far easier to introduce new fault flag that can be used by all the architectures instead of touching all the archs. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220160238.9694-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
39678191cd |
x86/mm: use helper fault_signal_pending()
Let's move the fatal signal check even earlier so that we can directly use the new fault_signal_pending() in x86 mm code. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Brian Geffon <bgeffon@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220155353.8676-5-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
763802b53a |
x86/mm: split vmalloc_sync_all()
Commit |
|
|
|
bcc8aff6af |
Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
"Misc updates:
- Remove last remaining calls to exception_enter/exception_exit() and
simplify the entry code some more.
- Remove force_iret()
- Add support for "Fast Short Rep Mov", which is available starting
with Ice Lake Intel CPUs - and make the x86 assembly version of
memmove() use REP MOV for all sizes when FSRM is available.
- Micro-optimize/simplify the 32-bit boot code a bit.
- Use a more future-proof SYSRET instruction mnemonic"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Simplify calculation of output address
x86/entry/64: Add instruction suffix to SYSRET
x86: Remove force_iret()
x86/cpufeatures: Add support for fast short REP; MOVSB
x86/context-tracking: Remove exception_enter/exit() from KVM_PV_REASON_PAGE_NOT_PRESENT async page fault
x86/context-tracking: Remove exception_enter/exit() from do_page_fault()
|
|
|
|
ee6352b2c4 |
x86/context-tracking: Remove exception_enter/exit() from do_page_fault()
do_page_fault(), like other exceptions, is already covered by user_enter() and user_exit() when the exception triggers in userspace. As explained in: |
|
|
|
186525bd6b |
mm, x86/mm: Untangle address space layout definitions from basic pgtable type definitions
- Untangle the somewhat incestous way of how VMALLOC_START is used all across the
kernel, but is, on x86, defined deep inside one of the lowest level page table headers.
It doesn't help that vmalloc.h only includes a single asm header:
#include <asm/page.h> /* pgprot_t */
So there was no existing cross-arch way to decouple address layout
definitions from page.h details. I used this:
#ifndef VMALLOC_START
# include <asm/vmalloc.h>
#endif
This way every architecture that wants to simplify page.h can do so.
- Also on x86 we had a couple of LDT related inline functions that used
the late-stage address space layout positions - but these could be
uninlined without real trouble - the end result is cleaner this way as
well.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
|
|
9a62d20027 |
x86/mm/32: Sync only to VMALLOC_END in vmalloc_sync_all()
The job of vmalloc_sync_all() is to help the lazy freeing of vmalloc()
ranges: before such vmap ranges are reused we make sure that they are
unmapped from every task's page tables.
This is really easy on pagetable setups where the kernel page tables
are shared between all tasks - this is the case on 32-bit kernels
with SHARED_KERNEL_PMD = 1.
But on !SHARED_KERNEL_PMD 32-bit kernels this involves iterating
over the pgd_list and clearing all pmd entries in the pgds that
are cleared in the init_mm.pgd, which is the reference pagetable
that the vmalloc() code uses.
In that context the current practice of vmalloc_sync_all() iterating
until FIX_ADDR_TOP is buggy:
for (address = VMALLOC_START & PMD_MASK;
address >= TASK_SIZE_MAX && address < FIXADDR_TOP;
address += PMD_SIZE) {
struct page *page;
Because iterating up to FIXADDR_TOP will involve a lot of non-vmalloc
address ranges:
VMALLOC -> PKMAP -> LDT -> CPU_ENTRY_AREA -> FIX_ADDR
This is mostly harmless for the FIX_ADDR and CPU_ENTRY_AREA ranges
that don't clear their pmds, but it's lethal for the LDT range,
which relies on having different mappings in different processes,
and 'synchronizing' them in the vmalloc sense corrupts those
pagetable entries (clearing them).
This got particularly prominent with PTI, which turns SHARED_KERNEL_PMD
off and makes this the dominant mapping mode on 32-bit.
To make LDT working again vmalloc_sync_all() must only iterate over
the volatile parts of the kernel address range that are identical
between all processes.
So the correct check in vmalloc_sync_all() is "address < VMALLOC_END"
to make sure the VMALLOC areas are synchronized and the LDT
mapping is not falsely overwritten.
The CPU_ENTRY_AREA and the FIXMAP area are no longer synced either,
but this is not really a proplem since their PMDs get established
during bootup and never change.
This change fixes the ldt_gdt selftest in my setup.
[ mingo: Fixed up the changelog to explain the logic and modified the
copying to only happen up until VMALLOC_END. ]
Reported-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Fixes: 7757d607c6b3: ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32")
Link: https://lkml.kernel.org/r/20191126111119.GA110513@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
|
|
8e998fc24d |
x86/mm: Sync also unmappings in vmalloc_sync_all()
With huge-page ioremap areas the unmappings also need to be synced between
all page-tables. Otherwise it can cause data corruption when a region is
unmapped and later re-used.
Make the vmalloc_sync_one() function ready to sync unmappings and make sure
vmalloc_sync_all() iterates over all page-tables even when an unmapped PMD
is found.
Fixes:
|
|
|
|
51b75b5b56 |
x86/mm: Check for pfn instead of page in vmalloc_sync_one()
Do not require a struct page for the mapped memory location because it
might not exist. This can happen when an ioremapped region is mapped with
2MB pages.
Fixes:
|
|
|
|
c6dd78fcb8 |
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of x86 specific fixes and updates:
- The CR2 corruption fixes which store CR2 early in the entry code
and hand the stored address to the fault handlers.
- Revert a forgotten leftover of the dropped FSGSBASE series.
- Plug a memory leak in the boot code.
- Make the Hyper-V assist functionality robust by zeroing the shadow
page.
- Remove a useless check for dead processes with LDT
- Update paravirt and VMware maintainers entries.
- A few cleanup patches addressing various compiler warnings"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/64: Prevent clobbering of saved CR2 value
x86/hyper-v: Zero out the VP ASSIST PAGE on allocation
x86, boot: Remove multiple copy of static function sanitize_boot_params()
x86/boot/compressed/64: Remove unused variable
x86/boot/efi: Remove unused variables
x86/mm, tracing: Fix CR2 corruption
x86/entry/64: Update comments and sanity tests for create_gap
x86/entry/64: Simplify idtentry a little
x86/entry/32: Simplify common_exception
x86/paravirt: Make read_cr2() CALLEE_SAVE
MAINTAINERS: Update PARAVIRT_OPS_INTERFACE and VMWARE_HYPERVISOR_INTERFACE
x86/process: Delete useless check for dead process with LDT
x86: math-emu: Hide clang warnings for 16-bit overflow
x86/e820: Use proper booleans instead of 0/1
x86/apic: Silence -Wtype-limits compiler warnings
x86/mm: Free sme_early_buffer after init
x86/boot: Fix memory leak in default_get_smp_config()
Revert "x86/ptrace: Prevent ptrace from clearing the FS/GS selector" and fix the test
|
|
|
|
a0d14b8909 |
x86/mm, tracing: Fix CR2 corruption
Despite the current efforts to read CR2 before tracing happens there still
exist a number of possible holes:
idtentry page_fault do_page_fault has_error_code=1
call error_entry
TRACE_IRQS_OFF
call trace_hardirqs_off*
#PF // modifies CR2
CALL_enter_from_user_mode
__context_tracking_exit()
trace_user_exit(0)
#PF // modifies CR2
call do_page_fault
address = read_cr2(); /* whoopsie */
And similar for i386.
Fix it by pulling the CR2 read into the entry code, before any of that
stuff gets a chance to run and ruin things.
Reported-by: He Zhe <zhe.he@windriver.com>
Reported-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: bp@alien8.de
Cc: rostedt@goodmis.org
Cc: torvalds@linux-foundation.org
Cc: hpa@zytor.com
Cc: dave.hansen@linux.intel.com
Cc: jgross@suse.com
Cc: joel@joelfernandes.org
Link: https://lkml.kernel.org/r/20190711114336.116812491@infradead.org
Debugged-by: Steven Rostedt <rostedt@goodmis.org>
|
|
|
|
b98cca444d |
mm, kprobes: generalize and rename notify_page_fault() as kprobe_page_fault()
Architectures which support kprobes have very similar boilerplate around
calling kprobe_fault_handler(). Use a helper function in kprobes.h to
unify them, based on the x86 code.
This changes the behaviour for other architectures when preemption is
enabled. Previously, they would have disabled preemption while calling
the kprobe handler. However, preemption would be disabled if this fault
was due to a kprobe, so we know the fault was not due to a kprobe
handler and can simply return failure.
This behaviour was introduced in commit
|
|
|
|
5ad18b2e60 |
Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull force_sig() argument change from Eric Biederman: "A source of error over the years has been that force_sig has taken a task parameter when it is only safe to use force_sig with the current task. The force_sig function is built for delivering synchronous signals such as SIGSEGV where the userspace application caused a synchronous fault (such as a page fault) and the kernel responded with a signal. Because the name force_sig does not make this clear, and because the force_sig takes a task parameter the function force_sig has been abused for sending other kinds of signals over the years. Slowly those have been fixed when the oopses have been tracked down. This set of changes fixes the remaining abusers of force_sig and carefully rips out the task parameter from force_sig and friends making this kind of error almost impossible in the future" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits) signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus signal: Remove the signal number and task parameters from force_sig_info signal: Factor force_sig_info_to_task out of force_sig_info signal: Generate the siginfo in force_sig signal: Move the computation of force into send_signal and correct it. signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal signal: Remove the task parameter from force_sig_fault signal: Use force_sig_fault_to_task for the two calls that don't deliver to current signal: Explicitly call force_sig_fault on current signal/unicore32: Remove tsk parameter from __do_user_fault signal/arm: Remove tsk parameter from __do_user_fault signal/arm: Remove tsk parameter from ptrace_break signal/nds32: Remove tsk parameter from send_sigtrap signal/riscv: Remove tsk parameter from do_trap signal/sh: Remove tsk parameter from force_sig_info_fault signal/um: Remove task parameter from send_sigtrap signal/x86: Remove task parameter from send_sigtrap signal: Remove task parameter from force_sig_mceerr signal: Remove task parameter from force_sig signal: Remove task parameter from force_sigsegv ... |
|
|
|
e0a446ce39 |
x86/vsyscall: Document odd SIGSEGV error code for vsyscalls
Even if vsyscall=none, user page faults on the vsyscall page are reported as though the PROT bit in the error code was set. Add a comment explaining why this is probably okay and display the value in the test case. While at it, explain why the behavior is correct with respect to PKRU. Modify also the selftest to print the odd error code so that there is a way to demonstrate the odd behaviour. If anyone really cares about more accurate emulation, the behaviour could be changed. But that needs a real good justification. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Kernel Hardening <kernel-hardening@lists.openwall.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/75c91855fd850649ace162eec5495a1354221aaa.1561610354.git.luto@kernel.org |
|
|
|
918ce32509 |
x86/vsyscall: Show something useful on a read fault
Just segfaulting the application when it tries to read the vsyscall page in xonly mode is not helpful for those who need to debug it. Emit a hint. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jann Horn <jannh@google.com> Link: https://lkml.kernel.org/r/8016afffe0eab497be32017ad7f6f7030dc3ba66.1561610354.git.luto@kernel.org |
|
|
|
318759b473 |
signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
Stephen Rothwell <sfr@canb.auug.org.au> reported:
> After merging the userns tree, today's linux-next build (i386 defconfig)
> produced this warning:
>
> arch/x86/mm/fault.c: In function 'do_sigbus':
> arch/x86/mm/fault.c:1017:22: warning: unused variable 'tsk' [-Wunused-variable]
> struct task_struct *tsk = current;
> ^~~
>
> Introduced by commit
>
>
|
|
|
|
2e1661d267 |
signal: Remove the task parameter from force_sig_fault
As synchronous exceptions really only make sense against the current task (otherwise how are you synchronous) remove the task parameter from from force_sig_fault to make it explicit that is what is going on. The two known exceptions that deliver a synchronous exception to a stopped ptraced task have already been changed to force_sig_fault_to_task. The callers have been changed with the following emacs regular expression (with obvious variations on the architectures that take more arguments) to avoid typos: force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)] -> force_sig_fault(\1,\2,\3) Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
351b6825b3 |
signal: Explicitly call force_sig_fault on current
Update the calls of force_sig_fault that pass in a variable that is set to current earlier to explicitly use current. This is to make the next change that removes the task parameter from force_sig_fault easier to verify. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
f8eac9011b |
signal: Remove task parameter from force_sig_mceerr
All of the callers pass current into force_sig_mceer so remove the task parameter to make this obvious. This also makes it clear that force_sig_mceerr passes current into force_sig_info. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
|
|
|
0bc40e549a |
Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
"The changes in here are:
- text_poke() fixes and an extensive set of executability lockdowns,
to (hopefully) eliminate the last residual circumstances under
which we are using W|X mappings even temporarily on x86 kernels.
This required a broad range of surgery in text patching facilities,
module loading, trampoline handling and other bits.
- tweak page fault messages to be more informative and more
structured.
- remove DISCONTIGMEM support on x86-32 and make SPARSEMEM the
default.
- reduce KASLR granularity on 5-level paging kernels from 512 GB to
1 GB.
- misc other changes and updates"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
x86/mm: Initialize PGD cache during mm initialization
x86/alternatives: Add comment about module removal races
x86/kprobes: Use vmalloc special flag
x86/ftrace: Use vmalloc special flag
bpf: Use vmalloc special flag
modules: Use vmalloc special flag
mm/vmalloc: Add flag for freeing of special permsissions
mm/hibernation: Make hibernation handle unmapped pages
x86/mm/cpa: Add set_direct_map_*() functions
x86/alternatives: Remove the return value of text_poke_*()
x86/jump-label: Remove support for custom text poker
x86/modules: Avoid breaking W^X while loading modules
x86/kprobes: Set instruction page as executable
x86/ftrace: Set trampoline pages as executable
x86/kgdb: Avoid redundant comparison of patched code
x86/alternatives: Use temporary mm for text poking
x86/alternatives: Initialize temporary mm for patching
fork: Provide a function for copying init_mm
uprobes: Initialize uprobes earlier
x86/mm: Save debug registers when loading a temporary mm
...
|
|
|
|
a65c88e16f |
x86/mm: Remove in_nmi() warning from 64-bit implementation of vmalloc_fault()
In-NMI warnings have been added to vmalloc_fault() via: |
|
|
|
ea2f8d6060 |
x86/fault: Make fault messages more succinct
So we are going to be staring at those in the next years, let's make them more succinct. In particular: - change "address = " to "address: " - "-privileged" reads funny. It should be simply "kernel" or "user" - "from kernel code" reads funny too. "kernel mode" or "user mode" is more natural. An actual example says more than 1000 words, of course: [ 0.248370] BUG: kernel NULL pointer dereference, address: 00000000000005b8 [ 0.249120] #PF: supervisor write access in kernel mode [ 0.249717] #PF: error_code(0x0002) - not-present page Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave.hansen@linux.intel.com Cc: luto@kernel.org Cc: riel@surriel.com Cc: sean.j.christopherson@intel.com Cc: yu-cheng.yu@intel.com Link: http://lkml.kernel.org/r/20190421183524.GC6048@zn.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
18ea35c5ed |
x86/fault: Decode and print #PF oops in human readable form
Linus pointed out that deciphering the raw #PF error code and printing
a more human readable message are two different things, and also that
printing the negative cases is mostly just noise[1]. For example, the
USER bit doesn't mean the fault originated in user code and stating
that an oops wasn't due to a protection keys violation isn't interesting
since an oops on a keys violation is a one-in-a-million scenario.
Remove the per-bit decoding of the error code and instead print:
- the raw error code
- why the fault occurred
- the effective privilege level of the access
- the type of access
- whether the fault originated in user code or kernel code
This provides the user with the information needed to triage 99.9% of
oopses without polluting the log with useless information or conflating
the error_code with the CPL.
Sample output:
BUG: kernel NULL pointer dereference, address = 0000000000000008
#PF: supervisor-privileged instruction fetch from kernel code
#PF: error_code(0x0010) - not-present page
BUG: unable to handle page fault for address = ffffbeef00000000
#PF: supervisor-privileged instruction fetch from kernel code
#PF: error_code(0x0010) - not-present page
BUG: unable to handle page fault for address = ffffc90000230000
#PF: supervisor-privileged write access from kernel code
#PF: error_code(0x000b) - reserved bit violation
[1] https://lkml.kernel.org/r/CAHk-=whk_fsnxVMvF1T2fFCaP2WrvSybABrLQCWLJyCvHw6NKA@mail.gmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/20181221213657.27628-3-sean.j.christopherson@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
|
|
f28b11a2ab |
x86/fault: Reword initial BUG message for unhandled page faults
Reword the NULL pointer dereference case to simply state that a NULL
pointer was dereferenced, i.e. drop "unable to handle" as that implies
that there are instances where the kernel actual does handle NULL
pointer dereferences, which is not true barring funky exception fixup.
For the non-NULL case, replace "kernel paging request" with "page fault"
as the kernel can technically oops on faults that originated in user
code. Dropping "kernel" also allows future patches to provide detailed
information on where the fault occurred, e.g. user vs. kernel, without
conflicting with the initial BUG message.
In both cases, replace "at address=" with wording more appropriate to
the oops, as "at" may be interpreted as stating that the address is the
RIP of the instruction that faulted.
Last, and probably least, further qualify the NULL-pointer path by
checking that the fault actually originated in kernel code. It's
technically possible for userspace to map address 0, and not printing
a super specific message is the least of our worries if the kernel does
manage to oops on an actual NULL pointer dereference from userspace.
Before:
BUG: unable to handle kernel NULL pointer dereference at ffffbeef00000000
BUG: unable to handle kernel paging request at ffffbeef00000000
After:
BUG: kernel NULL pointer dereference, address = 0000000000000008
BUG: unable to handle page fault for address = ffffbeef00000000
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/20181221213657.27628-2-sean.j.christopherson@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|