In preparation for implementing in-place hugepage promotion, various
functions will need to be called from zap_collapsible_spte_range, which
has the const qualifier on its memslot argument. Propagate the const
qualifier to the various functions which will be needed. This just serves
to simplify the following patch.
No functional change intended.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211115234603.2908381-11-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The vCPU argument to mmu_try_to_unsync_pages is now only used to get a
pointer to the associated struct kvm, so pass in the kvm pointer from
the beginning to remove the need for a vCPU when calling the function.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211115234603.2908381-7-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_slot_page_track_is_active only uses its vCPU argument to get a
pointer to the associated struct kvm, so just pass in the struct kvm to
remove the need for a vCPU pointer.
No functional change intended.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211115234603.2908381-6-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce a memslots gfn upper bound operation and use it to optimize
kvm_zap_gfn_range().
This way this handler can do a quick lookup for intersecting gfns and won't
have to do a linear scan of the whole memslot set.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <ef242146a87a335ee93b441dcf01665cb847c902.1638817641.git.maciej.szmigiero@oracle.com>
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of memslots.
Because the memslot array that is currently in use cannot be modified,
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make a copy of the
memslot that is being modified; copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should also be kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
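For illustration, a minimal sketch of the kind of gfn-keyed rbtree lookup
described above, using the kernel's <linux/rbtree.h> API; the structure and
field names are hypothetical, not the actual KVM memslot layout.

    #include <linux/rbtree.h>
    #include <linux/types.h>

    /* Hypothetical slot layout, for illustration only. */
    struct demo_memslot {
            struct rb_node gfn_node;
            u64 base_gfn;
            u64 npages;
    };

    /*
     * Memslots never overlap in the guest address space, so an ordinary
     * rbtree keyed by base_gfn is enough to quickly find the slot that
     * contains a given gfn.
     */
    static struct demo_memslot *demo_slot_find(struct rb_root *root, u64 gfn)
    {
            struct rb_node *node = root->rb_node;

            while (node) {
                    struct demo_memslot *slot =
                            rb_entry(node, struct demo_memslot, gfn_node);

                    if (gfn < slot->base_gfn)
                            node = node->rb_left;
                    else if (gfn >= slot->base_gfn + slot->npages)
                            node = node->rb_right;
                    else
                            return slot;
            }
            return NULL;
    }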
There is no point in recalculating from scratch the total number of pages
in all memslots each time a memslot is created or deleted. Use KVM's
cached nr_memslot_pages to compute the default max number of MMU pages.
Note that even with nr_memslot_pages capped at ULONG_MAX we can't safely
multiply it by KVM_PERMILLE_MMU_PAGES (20) since this operation can
possibly overflow an unsigned long variable.
Write this "* 20 / 1000" operation as "/ 50" instead to avoid such
overflow.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
[sean: use common KVM field and rework changelog accordingly]
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <d14c5a24535269606675437d5602b7dac4ad8c0e.1638817640.git.maciej.szmigiero@oracle.com>
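As a small illustration of the arithmetic (the helper name is made up, not
KVM's):

    /*
     * Illustrative only: with nr_pages close to ULONG_MAX,
     * "nr_pages * 20 / 1000" overflows before the division, whereas the
     * algebraically equivalent "nr_pages / 50" cannot overflow.
     */
    static unsigned long demo_default_mmu_pages(unsigned long nr_pages)
    {
            return nr_pages / 50;   /* == nr_pages * 20 / 1000, overflow-free */
    }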
Bail from the page fault handler if the root shadow page was obsoleted by
a memslot update. Do the check _after_ acquiring mmu_lock, as the TDP MMU
doesn't rely on the memslot/MMU generation, and instead relies on the
root being explicitly marked invalid by kvm_mmu_zap_all_fast(), which takes
mmu_lock for write.
For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if
kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has
moved past the gfn associated with the SP.
For other MMUs, the resulting behavior is far more convoluted, though
unlikely to be truly problematic. Installing SPs/SPTEs into the obsolete
root isn't directly problematic, as the obsolete root will be unloaded
and dropped before the vCPU re-enters the guest. But because the legacy
MMU tracks shadow pages by their role, any SP created by the fault can
be reused in the new post-reload root. Again, that _shouldn't_ be
problematic as any leaf child SPTEs will be created for the current/valid
memslot generation, and kvm_mmu_get_page() will not reuse child SPs from
the old generation as they will be flagged as obsolete. But, given that
continuing with the fault is pointless (the root will be unloaded), apply
the check to all MMUs.
Fixes: b7cccd397f ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120045046.3940942-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Account for the '0' being a default, "let KVM choose" period, when
determining whether or not the recovery worker needs to be awakened in
response to userspace reducing the period. Failure to do so results in
the worker not being awakened properly, e.g. when changing the period
from '0' to any small-ish value.
Fixes: 4dfe4f40d8 ("kvm: x86: mmu: Make NX huge page recovery period configurable")
Cc: stable@vger.kernel.org
Cc: Junaid Shahid <junaids@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120015706.3830341-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
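A rough sketch of the decision, with hypothetical helper names; the
derivation of the default period is an assumption modelled on the "derive a
period from the ratio so a page stays ~1 hour" behavior described elsewhere
in this log.

    /* Resolve a period of 0 ("let KVM choose") to the derived default. */
    static unsigned int demo_effective_period_ms(unsigned int period_ms,
                                                 unsigned int ratio)
    {
            if (!period_ms && ratio)
                    period_ms = 60 * 60 * 1000 / ratio;  /* ~1h per full cycle */
            return period_ms;
    }

    /* Wake the recovery worker only if the effective period got shorter. */
    static bool demo_need_wakeup(unsigned int old_ms, unsigned int new_ms,
                                 unsigned int ratio)
    {
            return demo_effective_period_ms(new_ms, ratio) <
                   demo_effective_period_ms(old_ms, ratio);
    }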
Initialize the mask for PKU permissions as if CR4.PKE=0, avoiding
incorrect interpretations of the nested hypervisor's page tables.
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the "flush" param and return values to/from the TDP MMU's helper for
zapping collapsible SPTEs. Because the helper runs with mmu_lock held
for read, not write, it uses tdp_mmu_zap_spte_atomic(), and the atomic
zap handles the necessary remote TLB flush.
Similarly, because mmu_lock is dropped and re-acquired between zapping
legacy MMUs and zapping TDP MMUs, kvm_mmu_zap_collapsible_sptes() must
handle remote TLB flushes from the legacy MMU before calling into the TDP
MMU.
Fixes: e2209710cc ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120045046.3940942-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
INVLPG operates on guest virtual addresses, which are represented by
vcpu->arch.walk_mmu. In nested virtualization scenarios,
kvm_mmu_invlpg() was using the wrong MMU structure; if L2's invlpg were
emulated by L0 (in practice, it hardly ever happens) when nested two-dimensional
paging is enabled, the call to ->tlb_flush_gva() would be skipped and
the hardware TLB entry would not be invalidated.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-5-jiangshanlai@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If there is an L1 with nNPT in 32-bit mode, the shadow walk starts with
pae_root.
Fixes: a717a780fc ("KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 63f5a1909f ("KVM: x86: Alert userspace that KVM_SET_CPUID{,2}
after KVM_RUN is broken") officially deprecated KVM_SET_CPUID{,2} ioctls
after the first successful KVM_RUN and promised to make this sequence forbidden
in 5.16. It's time to fulfil the promise.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211122175818.608220-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
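A minimal sketch of such a rejection; the "has the vCPU already run" marker
and the error code are assumptions, not necessarily the exact check used.

    /* Reject CPUID updates once the vCPU has entered the guest at least once. */
    static int demo_cpuid_update_allowed(struct kvm_vcpu *vcpu)
    {
            if (vcpu->arch.last_vmentry_cpu != -1)  /* assumed "already ran" marker */
                    return -ENXIO;
            return 0;
    }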
Since the TLB flush has already been done for the legacy MMU before
kvm_tdp_mmu_zap_collapsible_sptes(), the parameter flush
should be false for kvm_tdp_mmu_zap_collapsible_sptes().
Fixes: e2209710cc ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated")
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Message-Id: <21453a1d2533afb6e59fb6c729af89e771ff2e76.1637140154.git.houwenlong93@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the parameter flush is set, zap_gfn_range() flushes remote TLBs when
yielding, so no TLB flush is needed by the caller. Use the return
value of zap_gfn_range() directly instead of ORing it into flush in
kvm_unmap_gfn_range() and kvm_tdp_mmu_unmap_gfn_range().
Fixes: 3039bcc744 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Message-Id: <5e16546e228877a4d974f8c0e448a93d52c7a5a9.1637140154.git.houwenlong93@linux.alibaba.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Fixes for Xen emulation
* Kill kvm_map_gfn() / kvm_unmap_gfn() and broken gfn_to_pfn_cache
* Fixes for migration of 32-bit nested guests on 64-bit hypervisor
* Compilation fixes
* More SEV cleanups
Incorporate EFER.LMA into kvm_mmu_extended_role, as it is used to compute
the guest root level and is not reflected in kvm_mmu_page_role.level when TDP
is in use. When simply running the guest, it is impossible for EFER.LMA
and kvm_mmu.root_level to get out of sync, as the guest cannot transition
from PAE paging to 64-bit paging without toggling CR0.PG, i.e. without
first bouncing through a different MMU context. And stuffing guest state
via KVM_SET_SREGS{,2} also ensures a full MMU context reset.
However, if KVM_SET_SREGS{,2} is followed by KVM_SET_NESTED_STATE, e.g. to
set guest state when migrating the VM while L2 is active, the vCPU state
will reflect L2, not L1. If L1 is using TDP for L2, then root_mmu will
have been configured using L2's state, despite not being used for L2. If
L2.EFER.LMA != L1.EFER.LMA, and L2 is using PAE paging, then root_mmu will
be configured for guest PAE paging, but will match the mmu_role for 64-bit
paging and cause KVM to not reconfigure root_mmu on the next nested VM-Exit.
Alternatively, the root_mmu's role could be invalidated after a successful
KVM_SET_NESTED_STATE that yields vcpu->arch.mmu != vcpu->arch.root_mmu,
i.e. that switches the active mmu to guest_mmu, but doing so is unnecessarily
tricky, and not even needed if L1 and L2 do have the same role (e.g., they
are both 64-bit guests and run with the same CR4).
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211115131837.195527-3-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Fix misuse of gfn-to-pfn cache when recording guest steal time / preempted status
* Fix selftests on APICv machines
* Fix sparse warnings
* Fix detection of KVM features in CPUID
* Cleanups for bogus writes to MSR_KVM_PV_EOI_EN
* Fixes and cleanups for MSR bitmap handling
* Cleanups for INVPCID
* Make x86 KVM_SOFT_MAX_VCPUS consistent with other architectures
The fast page fault path bails out on write faults to huge pages in
order to accommodate dirty logging. This change adds a check to do that
only when dirty logging is actually enabled, so that access tracking for
huge pages can still use the fast path for write faults in the common
case.
Signed-off-by: Junaid Shahid <junaids@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211104003359.2201967-1-junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
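A sketch of the refined condition; the helper names are taken from the
description above but should be treated as assumptions.

    /*
     * Bail from the fast path for a huge-page write fault only when the
     * backing memslot is actually being dirty-logged.
     */
    static bool demo_skip_fast_path(struct kvm_mmu_page *sp,
                                    const struct kvm_memory_slot *slot)
    {
            return sp->role.level > PG_LEVEL_4K &&
                   kvm_slot_dirty_track_enabled(slot);
    }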
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- More progress on the protected VM front, now with the full fixed
feature set as well as the limitation of some hypercalls after
initialisation.
- Cleanup of the RAZ/WI sysreg handling, which was pointlessly
complicated
- Fixes for the vgic placement in the IPA space, together with a
bunch of selftests
- More memcg accounting of the memory allocated on behalf of a guest
- Timer and vgic selftests
- Workarounds for the Apple M1 broken vgic implementation
- KConfig cleanups
- New kvmarm.mode=none option, for those who really dislike us
RISC-V:
- New KVM port.
x86:
- New API to control TSC offset from userspace
- TSC scaling for nested hypervisors on SVM
- Switch masterclock protection from raw_spin_lock to seqcount
- Clean up function prototypes in the page fault code and avoid
repeated memslot lookups
- Convey the exit reason to userspace on emulation failure
- Configure time between NX page recovery iterations
- Expose Predictive Store Forwarding Disable CPUID leaf
- Allocate page tracking data structures lazily (if the i915 KVM-GT
functionality is not compiled in)
- Cleanups, fixes and optimizations for the shadow MMU code
s390:
- SIGP Fixes
- initial preparations for lazy destroy of secure VMs
- storage key improvements/fixes
- Log the guest CPNC
Starting from this release, KVM-PPC patches will come from Michael
Ellerman's PPC tree"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
RISC-V: KVM: fix boolreturn.cocci warnings
RISC-V: KVM: remove unneeded semicolon
RISC-V: KVM: Fix GPA passed to __kvm_riscv_hfence_gvma_xyz() functions
RISC-V: KVM: Factor-out FP virtualization into separate sources
KVM: s390: add debug statement for diag 318 CPNC data
KVM: s390: pv: properly handle page flags for protected guests
KVM: s390: Fix handle_sske page fault handling
KVM: x86: SGX must obey the KVM_INTERNAL_ERROR_EMULATION protocol
KVM: x86: On emulation failure, convey the exit reason, etc. to userspace
KVM: x86: Get exit_reason as part of kvm_x86_ops.get_exit_info
KVM: x86: Clarify the kvm_run.emulation_failure structure layout
KVM: s390: Add a routine for setting userspace CPU state
KVM: s390: Simplify SIGP Set Arch handling
KVM: s390: pv: avoid stalls when making pages secure
KVM: s390: pv: avoid stalls for kvm_s390_pv_init_vm
KVM: s390: pv: avoid double free of sida page
KVM: s390: pv: add macros for UVC CC values
s390/mm: optimize reset_guest_reference_bit()
s390/mm: optimize set_guest_storage_key()
s390/mm: no need for pte_alloc_map_lock() if we know the pmd is present
...
Extract the zapping of rmaps, a.k.a. legacy MMU, for a gfn range to a
separate helper to clean up the unholy mess that kvm_zap_gfn_range() has
become. In addition to deep nesting, the rmaps zapping spreads out the
declaration of several variables and is generally a mess. Clean up the
mess now so that future work to improve the memslots implementation
doesn't need to deal with it.
Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211022010005.1454978-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove an unnecessary remote TLB flush in kvm_zap_gfn_range() now that
said function holds mmu_lock for write for its entire duration. The
flush was added by the now-reverted commit to allow TDP MMU to flush while
holding mmu_lock for read, as the transition from write=>read required
dropping the lock and thus a pending flush needed to be serviced.
Fixes: 5a324c24b6 ("Revert "KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock"")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211022010005.1454978-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A recent commit to fix the calls to kvm_flush_remote_tlbs_with_address()
in kvm_zap_gfn_range() inadvertently added yet another flush instead of
fixing the existing flush. Drop the redundant flush, and fix the params
for the existing flush.
Cc: stable@vger.kernel.org
Fixes: 2822da4466 ("KVM: x86/mmu: fix parameters to kvm_flush_remote_tlbs_with_address")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211022010005.1454978-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_mmu_unload() destroys all the PGD caches. Use the lighter
kvm_mmu_sync_roots() and kvm_mmu_sync_prev_roots() instead.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211019110154.4091-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The commit 578e1c4db2 ("kvm: x86: Avoid taking MMU lock
in kvm_mmu_sync_roots if no sync is needed") added smp_wmb() in
mmu_try_to_unsync_pages(), but the corresponding smp_load_acquire() isn't
used on the load of SPTE.W. smp_load_acquire() orders _subsequent_
loads after sp->is_unsync; it does not order _earlier_ loads before
the load of sp->is_unsync.
This has no functional change; smp_rmb() is a NOP on x86, and no
compiler barrier is required because there is a VMEXIT between the
load of SPTE.W and kvm_mmu_sync_roots.
Cc: Junaid Shahid <junaids@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211019110154.4091-4-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the NX huge page recovery thread wakes up every minute and
zaps 1/nx_huge_pages_recovery_ratio of the total number of split NX
huge pages at a time. This is intended to ensure that only a
relatively small number of pages get zapped at a time. But for very
large VMs (or more specifically, VMs with a large number of
executable pages), a period of 1 minute could still result in this
number being too high (unless the ratio is changed significantly,
but that can result in split pages lingering on for too long).
This change makes the period configurable instead of fixing it at
1 minute. Users of large VMs can then adjust the period and/or the
ratio to reduce the number of pages zapped at one time while still
maintaining the same overall duration for cycling through the
entire list. By default, KVM derives a period from the ratio such
that a page will remain on the list for 1 hour on average.
Signed-off-by: Junaid Shahid <junaids@google.com>
Message-Id: <20211020010627.305925-1-junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
slot_handle_leaf is a misnomer because it only operates on 4K SPTEs
whereas "leaf" is used to describe any valid terminal SPTE (4K or
large page). Rename slot_handle_leaf to slot_handle_level_4k to
avoid confusion.
Making this change makes it more obvious there is a benign discrepancy
between the legacy MMU and the TDP MMU when it comes to dirty logging.
The legacy MMU only iterates through 4K SPTEs when zapping for
collapsing and when clearing D-bits. The TDP MMU, on the other hand,
iterates through SPTEs on all levels.
The TDP MMU behavior of zapping SPTEs at all levels is technically
overkill for its current dirty logging implementation, which always
demotes to 4K SPTEs, but both the TDP MMU and legacy MMU zap if and only
if the SPTE can be replaced by a larger page, i.e. will not spuriously
zap 2m (or larger) SPTEs. Opportunistically add comments to explain this
discrepancy in the code.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20211019162223.3935109-1-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
"prefetch", "prefault" and "speculative" are used throughout KVM to mean
the same thing. Use a single name, standardizing on "prefetch" which
is already used by various functions such as direct_pte_prefetch,
FNAME(prefetch_gpte), FNAME(pte_prefetch), etc.
Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unify the flags for rmaps and page tracking data, using a
single flag in struct kvm_arch and a single loop to go
over all the address spaces and memslots. This avoids
code duplication between alloc_all_memslots_rmaps and
kvm_page_track_enable_mmu_write_tracking.
Signed-off-by: David Stevens <stevensd@chromium.org>
[This patch is the delta between David's v2 and v3, with conflicts
fixed and my own commit message. - Paolo]
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When updating mmu->pkru_mask, bits can only be added, but the mask isn't
reset in advance. This leaves mmu->pkru_mask holding stale data.
Fix this issue.
Fixes: 2d344105f5 ("KVM, pkeys: introduce pkru_mask to cache conditions")
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20211021071022.1140-1-chenyi.qiang@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
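The gist of the fix, as a sketch; the surrounding recomputation is omitted
and the function name is illustrative.

    static void demo_update_pkru_bitmask(struct kvm_mmu *mmu, bool cr4_pke)
    {
            mmu->pkru_mask = 0;     /* drop stale bits before recomputing */
            if (!cr4_pke)
                    return;         /* PKU disabled: leave the mask empty */
            /* ... OR in the per-permission conditions here ... */
    }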
This looks like a typo in 8f32d5e563. That change was not intended to make
any functional changes.
The problem was caught by gVisor tests.
Fixes: 8f32d5e563 ("KVM: x86/mmu: allow kvm_faultin_pfn to return page fault handling code")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Message-Id: <20211015163221.472508-1-avagin@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Avoid allocating the gfn_track arrays if nothing needs them. If there
are no external to KVM users of the API (i.e. no GVT-g), then page
tracking is only needed for shadow page tables. This means that when TDP
is enabled and there are no external users, the gfn_track arrays
can be lazily allocated when the shadow MMU is actually used. This avoids
allocations equal to .05% of guest memory when nested virtualization is
not used, if the kernel is compiled without GVT-g.
Signed-off-by: David Stevens <stevensd@chromium.org>
Message-Id: <20210922045859.2011227-3-stevensd@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
mmu_try_to_unsync_pages checks if page tracking is active for the given
gfn, which requires knowing the memslot. We can pass down the memslot
via make_spte to avoid this lookup.
The memslot is also handy for make_spte's marking of the gfn as dirty:
we can test whether dirty page tracking is enabled, and if so ensure that
pages are mapped as writable with 4K granularity. Apart from the warning,
no functional change is intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Avoid the memslot lookup in rmap_add, by passing it down from the fault
handling code to mmu_set_spte and then to rmap_add.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-6-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
mmu_set_spte is called for either PTE prefetching or page faults. The
three boolean arguments write_fault, speculative and host_writable are
always respectively false/true/true for prefetching and coming from
a struct kvm_page_fault for page faults.
Let mmu_set_spte distinguish these two situations by accepting a
possibly NULL struct kvm_page_fault argument.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
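For illustration, how a NULL fault pointer can encode the prefetch case
described above (a sketch, not the exact code):

    static void demo_spte_args(const struct kvm_page_fault *fault,
                               bool *write_fault, bool *prefetch,
                               bool *host_writable)
    {
            *prefetch      = (fault == NULL);            /* prefetch path */
            *write_fault   = fault && fault->write;      /* false when prefetching */
            *host_writable = !fault || fault->map_writable;
    }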
The level and A/D bit support of the new SPTE can be found in the role,
which is stored in the kvm_mmu_page struct. This merges two arguments
into one.
For the TDP MMU, the kvm_mmu_page was not used (kvm_tdp_mmu_map does
not use it if the SPTE is already present) so we fetch it just before
calling make_spte.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The level of the new SPTE can be found in the kvm_mmu_page struct; there
is no need to pass it down.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that make_spte is called directly by the shadow MMU (rather than
wrapped by set_spte), it only has to return one boolean value.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since the two callers of set_spte do different things with the results,
inlining it actually makes the code simpler to reason about. For example,
FNAME(sync_page) already has a struct kvm_mmu_page *, but set_spte had to
fish it back out of sptep's private page data.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since the two callers of set_spte do different things with the results,
inlining it actually makes the code simpler to reason about. For example,
mmu_set_spte looks quite like tdp_mmu_map_handle_target_level, but the
similarity is hidden by set_spte.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that kvm_page_fault has a pointer to the memslot it can be passed
down to the page tracking code to avoid a redundant slot lookup.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The memslot for the faulting gfn is used throughout the page fault
handling code, so capture it in kvm_page_fault as soon as we know the
gfn and use it in the page fault handling code that has direct access
to the kvm_page_fault struct. Replace various tests using is_noslot_pfn
with more direct tests on fault->slot being NULL.
This, in combination with the subsequent patch, improves "Populate
memory time" in dirty_log_perf_test by 5% when using the legacy MMU.
There is no discernible improvement to the performance of the TDP MMU.
No functional change intended.
Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
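A one-liner sketch of what the direct test looks like (wrapper name is
illustrative):

    /* A missing backing slot now shows up as a NULL pointer. */
    static bool demo_is_noslot_fault(const struct kvm_page_fault *fault)
    {
            return fault->slot == NULL;     /* replaces is_noslot_pfn() checks */
    }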
This simplifies set_spte, which we want to remove, and unifies code
between the shadow MMU and the TDP MMU. The warning will be added
back later to make_spte as well.
There is a small disadvantage in the TDP MMU; it may unnecessarily mark
a page as dirty twice if two vCPUs end up mapping the same page twice.
However, this is a very small cost for a case that is already rare.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Consolidate rmap_recycle and rmap_add into a single function since they
are only ever called together (and only from one place). This has a nice
side effect of eliminating an extra kvm_vcpu_gfn_to_memslot(). In
addition it makes mmu_set_spte(), which is a very long function, a
little shorter.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN and bail if the shadow walk for faulting in a SPTE terminates early,
i.e. doesn't reach the expected level because the walk encountered a
terminal SPTE. The shadow walks for page faults are subtle in that they
install non-leaf SPTEs (zapping leaf SPTEs if necessary!) in the loop
body, and consume the newly created non-leaf SPTE in the loop control,
e.g. __shadow_walk_next(). In other words, the walks guarantee that the
walk will stop if and only if the target level is reached by installing
non-leaf SPTEs to guarantee the walk remains valid.
Opportunistically use fault->goal_level instead of it.level in
FNAME(fetch) to further clarify that KVM always installs the leaf SPTE at
the target level.
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210906122547.263316-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to tracepoints instead of extracting the
arguments from the struct. This also lets the kvm_mmu_spte_requested
tracepoint pick the gfn directly from fault->gfn, instead of using
the address.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to disallowed_hugepage_adjust() instead of
extracting the arguments from the struct. Tweak a bit the conditions
to avoid long lines.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to kvm_mmu_hugepage_adjust() instead of
extracting the arguments from the struct; the results are also stored
in the struct, so the callers are adjusted consequently.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to fast_page_fault() instead of
extracting the arguments from the struct.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to kvm_tdp_mmu_map() instead of
extracting the arguments from the struct.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to __direct_map() instead of
extracting the arguments from the struct.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to handle_abnormal_pfn() instead of
extracting the arguments from the struct.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add fields to struct kvm_page_fault corresponding to outputs of
kvm_faultin_pfn(). For now they have to be extracted again from struct
kvm_page_fault in the subsequent steps, but this is temporary until
other functions in the chain are switched over as well.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add fields to struct kvm_page_fault corresponding to the arguments
of page_fault_handle_page_track(). The fields are initialized in the
callers, and page_fault_handle_page_track() receives a struct
kvm_page_fault instead of having to extract the arguments out of it.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add fields to struct kvm_page_fault corresponding to
the arguments of direct_page_fault(). The fields are
initialized in the callers, and direct_page_fault()
receives a struct kvm_page_fault instead of having to
extract the arguments out of it.
Also adjust FNAME(page_fault) to store the max_level in
struct kvm_page_fault, to keep it similar to the direct
map path.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to mmu->page_fault() instead of
extracting the arguments from the struct. FNAME(page_fault) can use
the precomputed bools from the error code.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not bother removing the low bits of the gpa. This masking dates back
to the very first commit of KVM but it is unnecessary, as exemplified
by the other call in kvm_tdp_page_fault.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
So far, the loop bodies already ensure the PTE is present before calling
__shadow_walk_next(): Some loop bodies simply exit with a !PRESENT
directly and some other loop bodies, i.e. FNAME(fetch) and __direct_map()
do not currently guard their walks with is_shadow_present_pte, but only
because they install present non-leaf SPTEs in the loop itself.
But checking pte present in __shadow_walk_next() (which is called from
shadow_walk_okay()) is more prudent; walking past a !PRESENT SPTE
would lead to attempting to read the next-level SPTE from a garbage
iter->shadow_addr. It also allows removing the is_shadow_present_pte()
checks from the loop bodies.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210906122547.263316-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
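A rough sketch of the iterator-level guard; it mirrors the shape of the real
shadow walk iterator but is illustrative rather than the exact patch.

    static void demo_shadow_walk_next(struct kvm_shadow_walk_iterator *it,
                                      u64 spte)
    {
            if (!is_shadow_present_pte(spte) || is_last_spte(spte, it->level)) {
                    it->level = 0;  /* terminate: nothing valid to descend into */
                    return;
            }
            it->shadow_addr = spte & PT64_BASE_ADDR_MASK;
            --it->level;
    }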
Only unsync the pagetable when there has actually been a write fault on a
level-1 pagetable.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-10-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In mmu_sync_children(), the invalid list can be zapped after remote TLB
flushing. Emptying the invalid list as soon as possible might help avoid
a remote TLB flush in some cases.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-8-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently kvm_sync_page() returns true when there is any present spte.
But the return value is ignored in the callers.
Changing kvm_sync_page() to return true when remote flush is needed and
changing mmu->sync_page() not to directly flush can combine and reduce
remote flush requests.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-7-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Because local_flush is useless, kvm_mmu_flush_or_zap() can be removed
and kvm_mmu_remote_flush_or_zap is used instead.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-6-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After any shadow page modification, flushing the TLB only on the current
vCPU is weird, since other vCPUs' TLBs might still be stale.
In other words, if there is any mandatory tlb-flushing after shadow page
modification, SET_SPTE_NEED_REMOTE_TLB_FLUSH or remote_flush should be
set and the TLBs of all vCPUs should be flushed. There is no point in
flushing only the current TLB, except when the request comes from the
vCPU's or pCPU's own activities.
If there were any bug where mandatory tlb-flushing is required but
SET_SPTE_NEED_REMOTE_TLB_FLUSH/remote_flush fails to be set, this patch
would expose the bug in a more destructive way. The related code paths
have been checked and no missing SET_SPTE_NEED_REMOTE_TLB_FLUSH has been
found yet.
There is no longer any optional tlb-flushing now that the sync-page
related code has been changed to flush the TLB in a timely manner, so the
local flushing code can simply be removed.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make a final call to direct_pte_prefetch_many() if there are "trailing"
SPTEs to prefetch, i.e. SPTEs for GFNs following the faulting GFN. The
call to direct_pte_prefetch_many() in the loop only handles the case
where there are !PRESENT SPTEs preceding a PRESENT SPTE.
E.g. if the faulting GFN is a multiple of 8 (the prefetch size) and all
SPTEs for the following GFNs are !PRESENT, the loop will terminate with
"start = sptep+1" and not prefetch any SPTEs.
Prefetching trailing SPTEs as intended can drastically reduce the number
of guest page faults, e.g. accessing the first byte of every 4kb page in
a 6gb chunk of virtual memory, in a VM with 8gb of preallocated memory,
the number of pf_fixed events observed in L0 drops from ~1.75M to <0.27M.
Note, this only affects memory that is backed by 4kb pages as KVM doesn't
prefetch when installing hugepages. Shadow paging prefetching is not
affected as it does not batch the prefetches due to the need to process
the corresponding guest PTE. The TDP MMU is not affected because it
doesn't have prefetching, yet...
Fixes: 957ed9effd ("KVM: MMU: prefetch ptes when intercepted guest #PF")
Cc: Sergey Senozhatsky <senozhatsky@google.com>
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210818235615.2047588-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
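A sketch of the loop structure and the added trailing call; this is
condensed from the description above and should be read as illustrative,
not as the patch itself.

    static void demo_pte_prefetch(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
                                  u64 *first_spte)
    {
            u64 *spte = first_spte, *start = NULL;
            int i;

            for (i = 0; i < PTE_PREFETCH_NUM; i++, spte++) {
                    if (is_shadow_present_pte(*spte)) {
                            if (start)      /* flush the preceding !PRESENT run */
                                    direct_pte_prefetch_many(vcpu, sp, start, spte);
                            start = NULL;
                    } else if (!start) {
                            start = spte;   /* begin a new !PRESENT run */
                    }
            }
            if (start)      /* the fix: also prefetch a trailing !PRESENT run */
                    direct_pte_prefetch_many(vcpu, sp, start, spte);
    }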
If a gpte is changed from non-present to present, the guest doesn't need
to flush the TLB per the SDM. So the host must synchronize the sp before
linking it. Otherwise the guest might use a wrong mapping.
For example: the guest first changes a level-1 pagetable, and then
links its parent to a new place where the original gpte is non-present.
Finally the guest can access the remapped area without flushing
the TLB. The guest's behavior should be allowed per the SDM, but the host
KVM MMU gets it wrong.
Fixes: 4731d4c7a0 ("KVM: MMU: out of sync shadow core")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is reasonable for these functions to be used only in some configurations,
for example only if the host is 64-bits (and therefore supports 64-bit
guests). It is also reasonable to keep the role_regs and role accessors
in sync even though some of the accessors may be used only for one of the
two sets (as is the case currently for CR4.LA57).
Because clang reports warnings for unused inlines declared in a .c file,
mark both sets of accessors as __maybe_unused.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Revert a misguided illegal GPA check when "translating" a non-nested GPA.
The check is woefully incomplete as it does not fill in @exception as
expected by all callers, which leads to KVM attempting to inject a bogus
exception, potentially exposing kernel stack information in the process.
WARNING: CPU: 0 PID: 8469 at arch/x86/kvm/x86.c:525 exception_type+0x98/0xb0 arch/x86/kvm/x86.c:525
CPU: 1 PID: 8469 Comm: syz-executor531 Not tainted 5.14.0-rc7-syzkaller #0
RIP: 0010:exception_type+0x98/0xb0 arch/x86/kvm/x86.c:525
Call Trace:
x86_emulate_instruction+0xef6/0x1460 arch/x86/kvm/x86.c:7853
kvm_mmu_page_fault+0x2f0/0x1810 arch/x86/kvm/mmu/mmu.c:5199
handle_ept_misconfig+0xdf/0x3e0 arch/x86/kvm/vmx/vmx.c:5336
__vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6021 [inline]
vmx_handle_exit+0x336/0x1800 arch/x86/kvm/vmx/vmx.c:6038
vcpu_enter_guest+0x2a1c/0x4430 arch/x86/kvm/x86.c:9712
vcpu_run arch/x86/kvm/x86.c:9779 [inline]
kvm_arch_vcpu_ioctl_run+0x47d/0x1b20 arch/x86/kvm/x86.c:10010
kvm_vcpu_ioctl+0x49e/0xe50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3652
The bug has escaped notice because practically speaking the GPA check is
useless. The GPA check in question only comes into play when KVM is
walking guest page tables (or "translating" CR3), and KVM already handles
illegal GPA checks by setting reserved bits in rsvd_bits_mask for each
PxE, or in the case of CR3 for loading PTDPTRs, manually checks for an
illegal CR3. This particular failure doesn't hit the existing reserved
bits checks because syzbot sets guest.MAXPHYADDR=1, and IA32 architecture
simply doesn't allow for such an absurd MAXPHYADDR, e.g. 32-bit paging
doesn't define any reserved PA bits checks, which KVM emulates by only
incorporating the reserved PA bits into the "high" bits, i.e. bits 63:32.
Simply remove the bogus check. There is zero meaningful value and no
architectural justification for supporting guest.MAXPHYADDR < 32, and
properly filling the exception would introduce non-trivial complexity.
This reverts commit ec7771ab47.
Fixes: ec7771ab47 ("KVM: x86: mmu: Add guest physical address check in translate_gpa()")
Cc: stable@vger.kernel.org
Reported-by: syzbot+200c08e88ae818f849ce@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210831164224.1119728-2-seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Include pml5_root in the set of special roots if and only if the host,
and thus NPT, is using 5-level paging. mmu_alloc_special_roots() expects
special roots to be allocated as a bundle, i.e. they're either all valid
or all NULL. But for pml5_root, that expectation only holds true if the
host uses 5-level paging, which causes KVM to WARN about pml5_root being
NULL when the other special roots are valid.
The silver lining of 4-level vs. 5-level NPT being tied to the host
kernel's paging level is that KVM's shadow root level is constant; unlike
VMX's EPT, KVM can't choose 4-level NPT based on guest.MAXPHYADDR. That
means KVM can still expect pml5_root to be bundled with the other special
roots, it just needs to be conditioned on the shadow root level.
Fixes: cb0f722aff ("KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host")
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210824005824.205536-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the 5-level page table CPU flag is set in the host, but the guest
has CR4.LA57=0 (including the case of a 32-bit guest), the top level of
the shadow NPT page tables will be fixed, consisting of one pointer to
a lower-level table and 511 non-present entries. Extend the existing
code that creates the fixed PML4 or PDP table, to provide a fixed PML5
table if needed.
This is not needed on EPT because the number of layers in the tables
is specified in the EPTP instead of depending on the host CR4.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210818165549.3771014-3-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Future AMD CPUs will require a 5-level NPT if host CR4.LA57 is set.
To prevent kvm_mmu_get_tdp_level() from incorrectly changing NPT level
on behalf of CPUs, add a new parameter in kvm_configure_mmu() to force
a fixed TDP level.
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210818165549.3771014-2-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This change started as a way to make kvm_mmu_hugepage_adjust a bit simpler,
but it does fix two bugs as well.
One bug is in zapping collapsible PTEs. If some large page sizes are
disallowed but not all of them, kvm_mmu_max_mapping_level will return the
host mapping level and the small PTEs will be zapped up to that level.
However, if e.g. 1GB pages are prohibited, we can still zap 4KB mappings and
preserve the 2MB ones. This can happen for example when NX huge pages
are in use.
The second would happen when userspace backs guest memory
with a 1GB hugepage but only assigns a subset of the page to
the guest. 1GB pages would be disallowed by the memslot, but
not 2MB. kvm_mmu_max_mapping_level() would fall through to the
host_pfn_mapping_level() logic, see the 1gb hugepage, and map the whole
thing into the guest.
Fixes: 2f57b7051f ("KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_level")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Existing KVM code tracks the number of large pages regardless of their
sizes. Therefore, when large pages of 1GB (or larger) are adopted, the
information becomes less useful because lpages counts a mix of 1G and 2M
pages.
So remove lpages, since it is easy for user space to aggregate the info.
Instead, provide comprehensive page stats for all sizes from 4K to 512G.
Suggested-by: Ben Gardon <bgardon@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Cc: Jing Zhang <jingzhangos@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Message-Id: <20210803044607.599629-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop an unnecessary is_shadow_present_pte() check when updating the rmaps
after installing a non-MMIO SPTE. set_spte() is used only to create
shadow-present SPTEs, e.g. MMIO SPTEs are handled early on, mmu_set_spte()
runs with mmu_lock held for write, i.e. the SPTE can't be zapped between
writing the SPTE and updating the rmaps.
Opportunistically combine the "new SPTE" logic for large pages and rmaps.
No functional change intended.
Suggested-by: Ben Gardon <bgardon@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20210803044607.599629-2-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On AMD, APIC virtualization needs to dynamically inhibit the AVIC in
response to some events, and this is problematic and inefficient to do by
enabling/disabling the memslot that covers the APIC's MMIO range.
Plus, due to SRCU locking, it is more complex to request AVIC inhibition
that way.
Instead, the APIC memslot will always be enabled, but will be invisible
to the guest: while the AVIC is inhibited, the MMU code will not install
an SPTE for it and will instead jump straight to emulating the access.
When inhibiting the AVIC, this SPTE will be zapped.
This code is based on a suggestion from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This will allow it to return RET_PF_EMULATE for APIC mmio
emulation.
This code is based on a patch from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210810205251.424103-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
try_async_pf is a misleading name for this function, since this code
is also used when asynchronous page faults are not enabled.
This code is based on a patch from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210810205251.424103-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This, together with the previous patch, ensures that
kvm_zap_gfn_range doesn't race with a page fault
running on another vCPU, and makes the page fault code
retry instead.
This is based on a patch suggested by Sean Christopherson:
https://lkml.org/lkml/2021/7/22/1025
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This comment makes it clear that the range of gfns that this
function receives is non-inclusive.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_flush_remote_tlbs_with_address expects (start gfn, number of pages),
and not (start gfn, end gfn).
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
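Illustrative call site with the argument converted from an end gfn to a
page count:

    /* Pass a number of pages, not an end gfn. */
    kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end - gfn_start);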
This together with the next patch will fix a future race between
kvm_zap_gfn_range and the page fault handler, which will happen
when AVIC memslot is going to be only partially disabled.
The performance impact is minimal since kvm_zap_gfn_range is only
called by two users, update_mtrr() and kvm_post_set_cr0().
Both only use it if the guest has non-coherent DMA, in order to
honor the guest's UC memtype.
MTRR and CD setup only happens at boot, and generally in an area
where the page tables should be small (for CD) or should not
include the affected GFNs at all (for MTRRs).
This is based on a patch suggested by Sean Christopherson:
https://lkml.org/lkml/2021/7/22/1025
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use this file to dump rmap statistics. The statistics are computed by
calculating the rmap count for each page, and the result is log-2-based.
An example output looks like this (idle 6GB guest, right after Linux boots):
Rmap_Count: 0 1 2-3 4-7 8-15 16-31 32-63 64-127 128-255 256-511 512-1023
Level=4K: 3086676 53045 12330 1272 502 121 76 2 0 0 0
Level=2M: 5947 231 0 0 0 0 0 0 0 0 0
Level=1G: 32 0 0 0 0 0 0 0 0 0 0
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-5-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
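For illustration, one way to compute the log-2 bucket index used by such a
histogram (a sketch, not the actual statistics code):

    /* Map an rmap length to buckets 0, 1, 2-3, 4-7, ..., 512-1023. */
    static int demo_rmap_bucket(unsigned int count)
    {
            if (!count)
                    return 0;
            if (count > 1023)
                    count = 1023;           /* clamp into the last bucket */
            return fls(count);              /* 1 -> 1, 2-3 -> 2, 4-7 -> 3, ... */
    }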
Add yet another spinlock for the TDP MMU and take it when marking indirect
shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with
nested TDP, KVM may encounter shadow pages for the TDP entries managed by
L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic
is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
which runs with mmu_lock held for read, not write.
Lack of a critical section manifests most visibly as an underflow of
unsync_children in clear_unsync_child_bit() due to unsync_children being
corrupted when multiple CPUs write it without a critical section and
without atomic operations. But underflow is the best case scenario. The
worst case scenario is that unsync_children prematurely hits '0' and
leads to guest memory corruption due to KVM neglecting to properly sync
shadow pages.
Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
would functionally be ok. Usurping the lock could degrade performance when
building upper level page tables on different vCPUs, especially since the
unsync flow could hold the lock for a comparatively long time depending on
the number of indirect shadow pages and the depth of the paging tree.
For simplicity, take the lock for all MMUs, even though KVM could fairly
easily know that mmu_lock is held for write. If mmu_lock is held for
write, there cannot be contention for the inner spinlock, and marking
shadow pages unsync across multiple vCPUs will be slow enough that
bouncing the kvm_arch cacheline should be in the noise.
Note, even though L2 could theoretically be given access to its own EPT
entries, a nested MMU must hold mmu_lock for write and thus cannot race
against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to
be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
that is running with the TDP MMU enabled. Holding mmu_lock for read also
prevents the indirect shadow page from being freed. But as above, keep
it simple and always take the lock.
Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
effectively disable unsync behavior for nested TDP. Write protecting leaf
shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
VMMs typically don't modify TDP entries, but the same may not hold true for
non-standard use cases and/or VMMs that are migrating physical pages (from
L1's perspective).
Alternative #2, the unsync logic could be made thread safe. In theory,
simply converting all relevant kvm_mmu_page fields to atomics and using
atomic bitops for the bitmap would suffice. However, (a) an in-depth audit
would be required, (b) the code churn would be substantial, and (c) legacy
shadow paging would incur additional atomic operations in performance
sensitive paths for no benefit (to legacy shadow paging).
Fixes: a2855afc7e ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181815.3378104-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
gfn_to_rmap was removed in the previous patch so there is no need to
retain the double underscore on __gfn_to_rmap.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
rmap_add() and rmap_recycle() both run in the context of the vCPU and
thus we can use kvm_vcpu_gfn_to_memslot() to look up the memslot. This
enables rmap_add() and rmap_recycle() to take advantage of
vcpu->last_used_slot and avoid expensive memslot searching.
This change improves the performance of "Populate memory time" in
dirty_log_perf_test with tdp_mmu=N. In addition to improving the
performance, "Populate memory time" no longer scales with the number
of memslots in the VM.
Command | Before | After
------------------------------- | ---------------- | -------------
./dirty_log_perf_test -v64 -x1 | 15.18001570s | 14.99469366s
./dirty_log_perf_test -v64 -x64 | 18.71336392s | 14.98675076s
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-6-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Take a signed 'long' instead of an 'unsigned long' for the number of
pages to add/subtract to the total number of pages used by the MMU. This
fixes a zero-extension bug on 32-bit kernels that effectively corrupts
the per-cpu counter used by the shrinker.
Per-cpu counters take a signed 64-bit value on both 32-bit and 64-bit
kernels, whereas kvm_mod_used_mmu_pages() takes an unsigned long and thus
an unsigned 32-bit value on 32-bit kernels. As a result, the value used
to adjust the per-cpu counter is zero-extended (unsigned -> signed), not
sign-extended (signed -> signed), and so KVM's intended -1 gets morphed to
4294967295 and effectively corrupts the counter.
This was found by a staggering amount of sheer dumb luck when running
kvm-unit-tests on a 32-bit KVM build. The shrinker just happened to kick
in while running tests and do_shrink_slab() logged an error about trying
to free a negative number of objects. The truly lucky part is that the
kernel just happened to be a slightly stale build, as the shrinker no
longer yells about negative objects as of commit 18bb473e50 ("mm:
vmscan: shrink deferred objects proportional to priority").
vmscan: shrink_slab: mmu_shrink_scan+0x0/0x210 [kvm] negative objects to delete nr=-858993460
Fixes: bc8a3d8925 ("kvm: mmu: Fix overflow on kvm mmu page limit calculation")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210804214609.1096003-1-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be
slow. The easy way is to traverse the rmap list, collecting the a/d bits and
freeing the slots along the way.
Provide a pte_list_destroy() and do exactly that.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220605.26377-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a counter field to pte_list_desc to simplify the add/remove/loop logic,
e.g. most operations no longer need to loop over the array at all.
This will make more sense after the array size is switched to a larger value;
otherwise the counter would be a waste.
Initially I wanted to store a tail pointer at the head of the list so we
wouldn't need to traverse the list just to push new entries (without the
counter we traverse both the list and the array). However, that would need
slightly more churn without much benefit, e.g. after growing the number of
entries per array, traversing the list is not so expensive.
So keep it simple and still get as much benefit as we can with just these few
extra lines of changes (not to mention the code reads more easily without the
array loops).
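For illustration, a standalone sketch of the idea (field names, sizes, and the push-at-head behavior are assumptions, not the exact kernel layout): keeping a count in the head descriptor means a push never has to scan the array for a free slot:
  #include <stdlib.h>

  #define PTE_LIST_EXT 15         /* value assumed for the sketch */

  struct pte_list_desc_sketch {
          struct pte_list_desc_sketch *more;
          unsigned int spte_count;        /* meaningful in the head descriptor */
          void *sptes[PTE_LIST_EXT];
  };

  /* Push into the head descriptor; allocate a new head when it is full. */
  static int pte_list_push(struct pte_list_desc_sketch **head, void *spte)
  {
          struct pte_list_desc_sketch *desc = *head;

          if (!desc || desc->spte_count == PTE_LIST_EXT) {
                  struct pte_list_desc_sketch *new = calloc(1, sizeof(*new));

                  if (!new)
                          return -1;
                  new->more = desc;
                  *head = desc = new;
          }
          desc->sptes[desc->spte_count++] = spte;
          return 0;
  }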
Using the same test case that forks 500 children and recycles them
("./rmap_fork 500" [1]), this patch further speeds up the total fork time by
about 4%, for a total of about 33% compared to the vanilla kernel:
Vanilla: 473.90 (+-5.93%)
3->15 slots: 366.10 (+-4.94%)
Add counter: 351.00 (+-3.70%)
[1] 825436f825
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220602.26327-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently each rmap array element only contains 3 entries. However, for EPT=N
there can be a lot of guest pages that have tens or even hundreds of rmap
entries. The rmap count statistics for an ordinary 6G guest (even if idle)
show this:
Rmap_Count:  0        1      2-3    4-7   8-15  16-31  32-63  64-127  128-255  256-511  512-1023
Level=4K:    3089171  49005  14016  1363  235   212    15     7       0        0        0
Level=2M:    5951     227    0      0     0     0      0      0       0        0        0
Level=1G:    32       0      0      0     0     0      0      0       0        0        0
With more forks, some pages will grow even larger rmap counts.
This patch makes PTE_LIST_EXT bigger so it is more efficient for the general
EPT=N use case, as we dereference the list less often and the loops over
PTE_LIST_EXT become slightly more efficient; but it is still not too large, so
less is wasted when the array is not full.
It should not affect EPT=Y since EPT normally has only zero or one rmap entry
for each page, so no array is even allocated.
With a test case that forks 500 children and recycles them ("./rmap_fork 500"
[1]), this patch speeds up the fork time by about 29%.
Before: 473.90 (+-5.93%)
After: 366.10 (+-4.94%)
[1] 825436f825
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-6-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As alluded to in commit f36f3f2846 ("KVM: add "new" argument to
kvm_arch_commit_memory_region"), a bunch of other places where struct
kvm_memory_slot is used need to be refactored to preserve the "const"-ness
of struct kvm_memory_slot across the board.
Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
Message-Id: <20210713023338.57108-1-someguy@effective-light.com>
[Do not touch body of slot_rmap_walk_init. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make fast_page_fault interoperate with the TDP MMU by leveraging
walk_shadow_page_lockless_{begin,end} to acquire the RCU read lock and
introducing a new helper function kvm_tdp_mmu_fast_pf_get_last_sptep to
grab the lowest level sptep.
Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210713220957.3493520-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acquire the RCU read lock in walk_shadow_page_lockless_begin and release
it in walk_shadow_page_lockless_end when the TDP MMU is enabled. This
should not introduce any functional changes but is used in the following
commit to make fast_page_fault interoperate with the TDP MMU.
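A simplified, kernel-style sketch of the pairing described above (the real helpers also handle the legacy MMU's vCPU-mode bookkeeping and memory barriers, which are omitted here):
  static void walk_shadow_page_lockless_begin_sketch(bool is_tdp_mmu)
  {
          if (is_tdp_mmu)
                  rcu_read_lock();        /* TDP MMU page tables are RCU-protected */
          else
                  local_irq_disable();    /* legacy lockless-walk protection, simplified */
  }

  static void walk_shadow_page_lockless_end_sketch(bool is_tdp_mmu)
  {
          if (is_tdp_mmu)
                  rcu_read_unlock();
          else
                  local_irq_enable();
  }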
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210713220957.3493520-4-dmatlack@google.com>
[Use if...else instead of if(){return;}]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
fast_page_fault is only called from direct_page_fault where we know the
address is a gpa.
Fixes: 736c291c9f ("KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM")
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210713220957.3493520-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a new statistic, max_mmu_rmap_size, which stores the maximum rmap size
observed for the VM.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210625153214.43106-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return the old SPTE when clearing a SPTE and push the "old SPTE present"
check to the caller. Private shadow page support will use the old SPTE
in rmap_remove() to determine whether or not there is a linked private
shadow page.
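A minimal sketch of the calling convention described above (names are illustrative and the atomics the real SPTE clearing needs are ignored):
  static unsigned long long spte_clear_return_old(unsigned long long *sptep)
  {
          unsigned long long old_spte = *sptep;

          *sptep = 0;
          return old_spte;        /* caller decides what "old SPTE present" implies */
  }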
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <b16bac1fd1357aaf39e425aab2177d3f89ee8318.1625186503.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Employ a 'continue' to reduce the indentation for linking a new shadow
page during __direct_map() in preparation for linking private pages.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <702419686d5700373123f6ea84e7a946c2cad8b4.1625186503.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- Allow again loading KVM on 32-bit non-PAE builds
- Fixes for host SMIs on AMD
- Fixes for guest SMIs on AMD
- Fixes for selftests on s390 and ARM
- Fix memory leak
- Enforce no-instrumentation area on vmentry when hardware breakpoints
are in use.
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
KVM: selftests: smm_test: Test SMM enter from L2
KVM: nSVM: Restore nested control upon leaving SMM
KVM: nSVM: Fix L1 state corruption upon return from SMM
KVM: nSVM: Introduce svm_copy_vmrun_state()
KVM: nSVM: Check that VM_HSAVE_PA MSR was set before VMRUN
KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA
KVM: SVM: Fix sev_pin_memory() error checks in SEV migration utilities
KVM: SVM: Return -EFAULT if copy_to_user() for SEV mig packet header fails
KVM: SVM: add module param to control the #SMI interception
KVM: SVM: remove INIT intercept handler
KVM: SVM: #SMI interception must not skip the instruction
KVM: VMX: Remove vmx_msr_index from vmx.h
KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()
KVM: selftests: Address extra memslot parameters in vm_vaddr_alloc
kvm: debugfs: fix memory leak in kvm_create_vm_debugfs
KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM
KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio
KVM: SVM: Revert clearing of C-bit on GPA in #NPF handler
KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs
KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR
...
Ignore "dynamic" host adjustments to the physical address mask when
generating the masks for guest PTEs, i.e. the guest PA masks. The host
physical address space and guest physical address space are two different
beasts, e.g. even though SEV's C-bit is the same bit location for both
host and guest, disabling SME in the host (which clears shadow_me_mask)
does not affect the guest PTE->GPA "translation".
For non-SEV guests, not dropping bits is the correct behavior. Assuming
KVM and userspace correctly enumerate/configure guest MAXPHYADDR, bits
that are lost as collateral damage from memory encryption are treated as
reserved bits, i.e. KVM will never get to the point where it attempts to
generate a gfn using the affected bits. And if userspace wants to create
a bogus vCPU, then userspace gets to deal with the fallout of hardware
doing odd things with bad GPAs.
For SEV guests, not dropping the C-bit is technically wrong, but it's a
moot point because KVM can't read SEV guest's page tables in any case
since they're always encrypted. Not to mention that the current KVM code
is also broken since sme_me_mask does not have to be non-zero for SEV to
be supported by KVM. The proper fix would be to teach all of KVM to
correctly handle guest private memory, but that's a task for the future.
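For illustration, a standalone sketch of the distinction: the guest PTE -> GFN mask below is derived from the guest's MAXPHYADDR alone, with no host memory-encryption adjustment applied (constants and names are assumptions; assumes maxphyaddr < 64):
  #include <stdint.h>

  #define PAGE_SHIFT 12

  static uint64_t gpte_pfn_mask(unsigned int guest_maxphyaddr)
  {
          uint64_t pa_mask = (1ULL << guest_maxphyaddr) - 1;

          return pa_mask & ~((1ULL << PAGE_SHIFT) - 1);   /* bits [maxphyaddr-1:12] */
  }

  static uint64_t gpte_to_gfn(uint64_t gpte, unsigned int guest_maxphyaddr)
  {
          return (gpte & gpte_pfn_mask(guest_maxphyaddr)) >> PAGE_SHIFT;
  }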
Fixes: d0ec49d4de ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210623230552.4027702-5-seanjc@google.com>
[Use a new header instead of adding header guards to paging_tmpl.h. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"This covers all architectures (except MIPS) so I don't expect any
other feature pull requests this merge window.
ARM:
- Add MTE support in guests, complete with tag save/restore interface
- Reduce the impact of CMOs by moving them in the page-table code
- Allow device block mappings at stage-2
- Reduce the footprint of the vmemmap in protected mode
- Support the vGIC on dumb systems such as the Apple M1
- Add selftest infrastructure to support multiple configuration and
apply that to PMU/non-PMU setups
- Add selftests for the debug architecture
- The usual crop of PMU fixes
PPC:
- Support for the H_RPT_INVALIDATE hypercall
- Conversion of Book3S entry/exit to C
- Bug fixes
S390:
- new HW facilities for guests
- make inline assembly more robust with KASAN and co
x86:
- Allow userspace to handle emulation errors (unknown instructions)
- Lazy allocation of the rmap (host physical -> guest physical
address)
- Support for virtualizing TSC scaling on VMX machines
- Optimizations to avoid shattering huge pages at the beginning of
live migration
- Support for initializing the PDPTRs without loading them from
memory
- Many TLB flushing cleanups
- Refuse to load if two-stage paging is available but NX is not (this
has been a requirement in practice for over a year)
- A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from
CR0/CR4/EFER, using the MMU mode everywhere once it is computed
from the CPU registers
- Use PM notifier to notify the guest about host suspend or hibernate
- Support for passing arguments to Hyper-V hypercalls using XMM
registers
- Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap
on AMD processors
- Hide Hyper-V hypercalls that are not included in the guest CPUID
- Fixes for live migration of virtual machines that use the Hyper-V
"enlightened VMCS" optimization of nested virtualization
- Bugfixes (not many)
Generic:
- Support for retrieving statistics without debugfs
- Cleanups for the KVM selftests API"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (314 commits)
KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled
kvm: x86: disable the narrow guest module parameter on unload
selftests: kvm: Allows userspace to handle emulation errors.
kvm: x86: Allow userspace to handle emulation errors
KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP is on
KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault
KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault
KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT
KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic
KVM: x86: Enhance comments for MMU roles and nested transition trickiness
KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE
KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMU
KVM: x86/mmu: Use MMU's role to determine PTTYPE
KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers
KVM: x86/mmu: Add a helper to calculate root from role_regs
KVM: x86/mmu: Add helper to update paging metadata
KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0
KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls
KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper
KVM: x86/mmu: Get nested MMU's root level from the MMU's role
...
Let the guest use 1g hugepages if TDP is enabled and the host supports
GBPAGES; KVM can't actively prevent the guest from using 1g pages in this
case since they can't be disabled in the hardware page walker. While
injecting a page fault if a bogus 1g page is encountered during a
software page walk is perfectly reasonable since KVM is simply honoring
userspace's vCPU model, doing so arguably doesn't provide any meaningful
value, and at worst will be horribly confusing as the guest will see
inconsistent behavior and seemingly spurious page faults.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-55-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the extra reset of shadow_zero_bits in the nested NPT flow now
that shadow_mmu_init_context computes the correct level for nested NPT.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-52-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the pre-computed last_nonleaf_level, which is arguably wrong and at
best confusing. Per the comment:
Can have large pages at levels 2..last_nonleaf_level-1.
the intent of the variable would appear to be to track what levels can
_legally_ have large pages, but that intent doesn't align with reality.
The computed value will be wrong for 5-level paging, or if 1gb pages are
not supported.
The flawed code is not a problem in practice, because except for 32-bit
PSE paging, bit 7 is reserved if large pages aren't supported at the
level. Take advantage of this invariant and simply omit the level magic
math for 64-bit page tables (including PAE).
For 32-bit paging (non-PAE), the adjustments are needed purely because
bit 7 is ignored if PSE=0. Retain that logic as is, but make
is_last_gpte() unique per PTTYPE so that the PSE check is avoided for
PAE and EPT paging. In the spirit of avoiding branches, bump the "last
nonleaf level" for 32-bit PSE paging by adding the PSE bit itself.
Note, bit 7 is ignored or has other meaning in CR3/EPTP, but despite
FNAME(walk_addr_generic) briefly grabbing CR3/EPTP in "pte", they are
not PTEs and will blow up all the other gpte helpers.
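For illustration, a standalone sketch limited to the 32-bit non-PAE case discussed above; in that format bit 7 of a level-2 entry only means "4 MB page" when CR4.PSE=1, which is why it needs the extra check:
  static int gpte_is_leaf_32bit(unsigned int gpte, int level, int cr4_pse)
  {
          if (level == 1)
                  return 1;                       /* a 4 KB PTE is always a leaf */

          return cr4_pse && (gpte & (1u << 7));   /* PSE bit => 4 MB page at level 2 */
  }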
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-51-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extract the reserved SPTE check and print helpers in get_mmio_spte() to
new helpers so that KVM can also WARN on reserved badness when making a
SPTE.
Tag the checking helper with __always_inline to improve the probability
of the compiler generating optimal code for the checking loop, e.g. gcc
appears to avoid using %rbp when the helper is tagged with a vanilla
"inline".
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-48-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the MMU's role instead of vCPU state or role_regs to determine the
PTTYPE, i.e. which helpers to wire up.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-47-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Skip paging32E_init_context() and paging64_init_context_common() and go
directly to paging64_init_context() (was the common version) now that
the relevant flows don't need to distinguish between 64-bit PAE and
32-bit PAE for other reasons.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-46-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a helper to calculate the level for non-EPT page tables from the
MMU's role_regs.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-45-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Consolidate MMU guest metadata updates into a common helper for TDP,
shadow, and nested MMUs.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-44-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't bother updating the bitmasks and last-leaf information if paging is
disabled as the metadata will never be used.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-43-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move calls to reset_rsvds_bits_mask() out of the various mode statements
and under a more generic CR0.PG=1 check. This will allow for additional
code consolidation in the future.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-42-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Get LA57 from the role_regs, which are initialized from the vCPU even
though TDP is enabled, instead of pulling the value directly from the
vCPU when computing the guest's root_level for TDP MMUs. Note, the check
is inside an is_long_mode() statement, so that requirement is not lost.
Use role_regs even though the MMU's role is available and arguably
"better". A future commit will consolidate the guest root level logic,
and it needs access to EFER.LMA, which is not tracked in the role (it
can't be toggled on VM-Exit, unlike LA57).
Drop is_la57_mode() as there are no remaining users, and to discourage
pulling MMU state from the vCPU (in the future).
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-41-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Initialize the MMU's (guest) root_level using its mmu_role instead of
redoing the calculations. The role_regs used to calculate the mmu_role
are initialized from the vCPU, i.e. this should be a complete nop.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-40-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_mmu.nx as there no consumers left.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-39-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Get the MMU's effective EFER.NX from its role instead of using the
one-off, dedicated flag. This will allow dropping said flag in a
future commit.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-38-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the MMU's role and role_regs to calculate the MMU's guest root level
and NX bit. For some flows, the vCPU state may not be correct (or
relevant), e.g. EPT doesn't interact with EFER.NX and nested NPT will
configure the guest_mmu with possibly-stale vCPU state.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-37-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the MMU's role to get CR4.PSE when determining the last level at
which the guest _cannot_ create a non-leaf PTE, i.e. cannot create a
huge page.
Note, the existing logic is arguably wrong when considering 5-level
paging and the case where 1gb pages aren't supported. In practice, the
logic is confusing but not broken, because except for 32-bit non-PAE
paging, bit 7 (_PAGE_PSE) is reserved when a huge page isn't supported at
that level. I.e. setting bit 7 will terminate the guest walk one way or
another. Furthermore, last_nonleaf_level is only consulted after KVM has
verified there are no reserved bits set.
All that confusion will be addressed in a future patch by dropping
last_nonleaf_level entirely. For now, massage the code to continue the
march toward using mmu_role for (almost) all MMU computations.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-35-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the MMU's role to calculate the Protection Keys (Restrict Userspace)
bitmask instead of pulling bits from current vCPU state. For some flows,
the vCPU state may not be correct (or relevant), e.g. EPT doesn't
interact with PKRU. Case in point, the "ept" param simply disappears.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-34-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the MMU's role to generate the permission bitmasks for the MMU.
For some flows, the vCPU state may not be correct (or relevant), e.g.
the nested NPT MMU can be initialized with incoherent vCPU state.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-33-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the vCPU param from __reset_rsvds_bits_mask() as it's now unused,
and ideally will remain unused in the future. Any information that's
needed by the low level helper should be explicitly provided as it's used
for both shadow/host MMUs and guest MMUs, i.e. vCPU state may be
meaningless or simply wrong.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-32-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the MMU's role to get CR4.PSE when calculating reserved bits for the
guest's PTEs. Practically speaking, this is a glorified nop as the role
always comes from vCPU state for the relevant flows, but converting to
the roles will provide consistency once everything else is converted, and
will Just Work if the "always comes from vCPU" behavior were ever to
change (unlikely).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-31-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unconditionally pass pse=false when calculating reserved bits for shadow
PTEs. CR4.PSE is only relevant for 32-bit non-PAE paging, which KVM does
not use for shadow paging (including nested NPT).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-30-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Refactor shadow MMU initialization to immediately set its new mmu_role
after verifying it differs from the old role, and so that all flavors
of MMU initialization share the same check-and-set pattern. Immediately
setting the role will allow future commits to use mmu_role to configure
the MMU without consuming stale state.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-29-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't set cr4_pke or cr4_la57 in the MMU role if long mode isn't active,
which is required for protection keys and 5-level paging to be fully
enabled. Ignoring the bit avoids unnecessary reconfiguration on reuse,
and also means consumers of mmu_role don't need to manually check for
long mode.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-28-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't set CR0/CR4/EFER bits in the MMU role if paging is disabled; paging
modifiers are irrelevant if there is no paging in the first place.
Somewhat arbitrarily clear gpte_is_8_bytes for shadow paging if paging is
disabled in the guest. Again, there are no guest PTEs to process, so the
size is meaningless.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-27-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add accessors via a builder macro for all mmu_role bits that track a CR0,
CR4, or EFER bit, abstracting whether the bits are in the base or the
extended role.
Future commits will switch to using mmu_role instead of vCPU state to
configure the MMU, i.e. there are about to be a large number of users.
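A standalone sketch of the builder-macro pattern (the layout and macro name are assumptions, not the kernel's exact definitions):
  struct mmu_role_sketch {
          struct { unsigned int cr0_wp:1, efer_nx:1; } base;
          struct { unsigned int cr4_pse:1, cr4_smep:1; } ext;
  };

  /* Generate one accessor per tracked register bit, hiding whether the
   * bit lives in the base or the extended part of the role. */
  #define BUILD_ROLE_ACCESSOR(where, reg, name)                              \
  static inline int is_##reg##_##name(const struct mmu_role_sketch *role)   \
  {                                                                          \
          return !!role->where.reg##_##name;                                 \
  }

  BUILD_ROLE_ACCESSOR(base, cr0, wp)
  BUILD_ROLE_ACCESSOR(base, efer, nx)
  BUILD_ROLE_ACCESSOR(ext,  cr4, pse)
  BUILD_ROLE_ACCESSOR(ext,  cr4, smep)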
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-26-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename "nxe" to "efer_nx" so that future macro magic can use the pattern
<reg>_<bit> for all CR0, CR4, and EFER bits that are included in the role.
Using "efer_nx" also makes it clear that the role bit reflects EFER.NX,
not the NX bit in the corresponding PTE.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-25-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the provided role_regs to calculate the mmu_role instead of pulling
bits from current vCPU state. For some flows, e.g. nested TDP, the vCPU
state may not be correct (or relevant).
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-24-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not incorporate CR0/CR4 bits into the role for the nested EPT MMU, as
EPT behavior is not influenced by CR0/CR4. Note, this is the guest_mmu
(L1's EPT), not nested_mmu (L2's IA32 paging); the nested_mmu does need
CR0/CR4, and is initialized in a separate flow.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-23-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Consolidate the MMU metadata update calls to deduplicate code, and to
prep for future cleanup.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce "struct kvm_mmu_role_regs" to hold the register state that is
incorporated into the mmu_role. For nested TDP, the register state that
is factored into the MMU isn't vCPU state; the dedicated struct will be
used to propagate the correct state throughout the flows without having
to pass multiple params, and also provides helpers for the various flag
accessors.
Intentionally make the new helpers cumbersome/ugly by prepending four
underscores. In the not-too-distant future, it will be preferable to use
the mmu_role to query bits as the mmu_role can drop irrelevant bits
without creating contradictions, e.g. clearing CR4 bits when CR0.PG=0.
Reserve the clean helper names (no underscores) for the mmu_role.
Add a helper for vCPU conversion, which is the common case.
No functional change intended.
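A standalone sketch of the shape described above (the bit positions are architectural; struct and helper names are illustrative, including the intentionally ugly four underscores):
  struct mmu_role_regs_sketch {
          unsigned long cr0;
          unsigned long cr4;
          unsigned long long efer;
  };

  #define ROLE_REGS_FLAG(reg, flag, bit)                                            \
  static inline int ____is_##reg##_##flag(const struct mmu_role_regs_sketch *regs) \
  {                                                                                 \
          return !!(regs->reg & (bit));                                             \
  }

  ROLE_REGS_FLAG(cr0, pg,  1UL << 31)     /* CR0.PG */
  ROLE_REGS_FLAG(cr4, pae, 1UL << 5)      /* CR4.PAE */
  ROLE_REGS_FLAG(efer, nx, 1ULL << 11)    /* EFER.NX */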
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the mmu_role to initialize shadow root level instead of assuming the
level of KVM's shadow root (host) is the same as that of the guest root,
or in the case of 32-bit non-PAE paging where KVM forces PAE paging.
For nested NPT, the shadow root level cannot be adapted to L1's NPT root
level and is instead always the TDP root level because NPT uses the
current host CR0/CR4/EFER, e.g. 64-bit KVM can't drop into 32-bit PAE to
shadow L1's NPT.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move nested NPT's invocation of reset_shadow_zero_bits_mask() into the
MMU proper and unexport said function. Aside from dropping an export,
this is a baby step toward eliminating the call entirely by fixing the
shadow_root_level confusion.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Grab all CR0/CR4 MMU role bits from current vCPU state when initializing
a non-nested shadow MMU. Extract the masks from kvm_post_set_cr{0,4}(),
as the CR0/CR4 update masks must exactly match the mmu_role bits, with
one exception (see below). The "full" CR0/CR4 will be used by future
commits to initialize the MMU and its role, as opposed to the current
approach of pulling everything from vCPU, which is incorrect for certain
flows, e.g. nested NPT.
CR4.LA57 is an exception, as it can be toggled on VM-Exit (for L1's MMU)
but can't be toggled via MOV CR4 while long mode is active. I.e. LA57
needs to be in the mmu_role, but technically doesn't need to be checked
by kvm_post_set_cr4(). However, the extra check is completely benign as
the hardware restrictions simply mean LA57 will never be _the_ cause of
a MMU reset during MOV CR4.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the smep_andnot_wp role check from the "uses NX" calculation now
that all non-nested shadow MMUs treat NX as used via the !TDP check.
The shadow MMU for nested NPT, which shares the helper, does not need to
deal with SMEP (or WP) as NPT walks are always "user" accesses and WP is
explicitly noted as being ignored:
Table walks for guest page tables are always treated as user writes at
the nested page table level.
A table walk for the guest page itself is always treated as a user
access at the nested page table level
The host hCR0.WP bit is ignored under nested paging.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When configuring KVM's MMU, pass CR0 and CR4 as unsigned longs, and EFER
as a u64 in various flows (mostly MMU). Passing the params as u32s is
functionally ok since all of the affected registers reserve bits 63:32 to
zero (enforced by KVM), but it's technically wrong.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename mmu_need_write_protect() to mmu_try_to_unsync_pages() and update
a variety of related, stale comments. Add several new comments to call
out subtle details, e.g. that upper-level shadow pages are write-tracked,
and that can_unsync is false iff KVM is in the process of synchronizing
pages.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the kvm_unlink_unsync_page() call out of kvm_sync_page() and into
its sole caller, and fold __kvm_sync_page() into kvm_sync_page() since
the latter becomes a pure pass-through. There really should be no reason
for code to do a complete sync of a shadow page outside of the full
kvm_mmu_sync_roots(), e.g. the one use case that crept in turned out to
be flawed and counter-productive.
Drop the stale comment about @sp->gfn needing to be write-protected, as
it directly contradicts the kvm_mmu_get_page() usage.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explain the usage of sync_page() in kvm_mmu_get_page(), which is
subtle in how and why it differs from mmu_sync_children().
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Split out of a different patch by Sean. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When synchronizing a shadow page, WARN and zap the page if its mmu role
isn't compatible with the current MMU context, where "compatible" is an
exact match sans the bits that have no meaning in the overall MMU context
or will be explicitly overwritten during the sync. Many of the helpers
used by sync_page() are specific to the current context, updating a SMM
vs. non-SMM shadow page would use the wrong memslots, updating L1 vs. L2
PTEs might work but would be extremely bizarre, and so on and so forth.
Drop the guard with respect to 8-byte vs. 4-byte PTEs in
__kvm_sync_page(); it was made useless when kvm_mmu_get_page() stopped
trying to sync shadow pages irrespective of the current MMU context.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Originally, __kvm_sync_page used to check the cr4_pae bit in the role
to avoid zapping 4-byte kvm_mmu_pages when the guest page size is 8 bytes
or the other way round. However, in commit 47c42e6b41 ("KVM: x86: fix
handling of role.cr4_pae and rename it to 'gpte_size'", 2019-03-28) it
was observed that this did not work for nested EPT, where the page table
size would be 8 bytes even if CR4.PAE=0. (Note that the check still
has to be done for nested *NPT*, so it is not possible to use tdp_enabled
or similar).
Therefore, a hack was introduced to identify nested EPT shadow pages
and unconditionally call __kvm_sync_page() on them. However, it is
possible to do without the hack to identify nested EPT shadow pages:
if EPT is active, there will be no shadow pages in non-EPT format,
and all of them will have gpte_is_8_bytes set to true; we can just
check the MMU role directly, and the test will always be true.
Even for non-EPT shadow MMUs, this test should really always be true
now that __kvm_sync_page() is called if and only if the role is an
exact match (kvm_mmu_get_page()) or is part of the current MMU context
(kvm_mmu_sync_roots()). A future commit will convert the likely-pointless
check into a meaningful WARN to enforce that the mmu_roles of the current
context and the shadow page are compatible.
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When creating a new upper-level shadow page, zap unsync shadow pages at
the same target gfn instead of attempting to sync the pages. This fixes
a bug where an unsync shadow page could be sync'd with an incompatible
context, e.g. wrong smm, is_guest, etc... flags. In practice, the bug is
relatively benign as sync_page() is all but guaranteed to fail its check
that the guest's desired gfn (for the to-be-sync'd page) matches the
current gfn associated with the shadow page. I.e. kvm_sync_page() would
end up zapping the page anyways.
Alternatively, __kvm_sync_page() could be modified to explicitly verify
the mmu_role of the unsync shadow page is compatible with the current MMU
context. But, except for this specific case, __kvm_sync_page() is called
iff the page is compatible, e.g. the transient sync in kvm_mmu_get_page()
requires an exact role match, and the call from kvm_sync_mmu_roots() is
only synchronizing shadow pages from the current MMU (which better be
compatible or KVM has problems). And as described above, attempting to
sync shadow pages when creating an upper-level shadow page is unlikely
to succeed, e.g. zero successful syncs were observed when running Linux
guests despite over a million attempts.
Fixes: 9f1a122f97 ("KVM: MMU: allow more page become unsync at getting sp time")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-10-seanjc@google.com>
[Remove WARN_ON after __kvm_sync_page. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop MAXPHYADDR from mmu_role now that all MMUs have their role
invalidated after a CPUID update. Invalidating the role forces all MMUs
to re-evaluate the guest's MAXPHYADDR, and the guest's MAXPHYADDR can
be changed only through a CPUID update.
This reverts commit de3ccd26fa.
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest
instability. Initialize last_vmentry_cpu to -1 and use it to detect if
the vCPU has been run at least once when its CPUID model is changed.
KVM does not correctly handle changes to paging related settings in the
guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc... KVM
could theoretically zap all shadow pages, but actually making that happen
is a mess due to lock inversion (vcpu->mutex is held). And even then,
updating paging settings on the fly would only work if all vCPUs are
stopped, updated in concert with identical settings, then restarted.
To support running vCPUs with different vCPU models (that affect paging),
KVM would need to track all relevant information in kvm_mmu_page_role.
Note, that's the _page_ role, not the full mmu_role. Updating mmu_role
isn't sufficient as a vCPU can reuse a shadow page translation that was
created by a vCPU with different settings and thus completely skip the
reserved bit checks (that are tied to CPUID).
Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as
it would require doubling gfn_track from a u16 to a u32, i.e. would
increase KVM's memory footprint by 2 bytes for every 4kb of guest memory.
E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT
would all need to be tracked.
In practice, there is no remotely sane use case for changing any paging
related CPUID entries on the fly, so just sweep it under the rug (after
yelling at userspace).
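For illustration only, a standalone sketch of the "has this vCPU ever entered the guest" check (field, function, and message are assumptions, not the kernel code):
  #include <stdio.h>

  struct vcpu_sketch {
          int last_vmentry_cpu;   /* initialized to -1, set on every VM-entry */
  };

  static void set_cpuid_sketch(struct vcpu_sketch *vcpu)
  {
          if (vcpu->last_vmentry_cpu != -1)
                  fprintf(stderr,
                          "changing CPUID after KVM_RUN may cause guest instability\n");

          /* apply the new CPUID model here */
  }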
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Invalidate all MMUs' roles after a CPUID update to force reinitialization
of the MMU context/helpers. Despite the efforts of commit de3ccd26fa
("KVM: MMU: record maximum physical address width in kvm_mmu_extended_role"),
there are still a handful of CPUID-based properties that affect MMU
behavior but are not incorporated into mmu_role. E.g. 1gb hugepage
support, AMD vs. Intel handling of bit 8, and SEV's C-Bit location all
factor into the guest's reserved PTE bits.
The obvious alternative would be to add all such properties to mmu_role,
but doing so provides no benefit over simply forcing a reinitialization
on every CPUID update, as setting guest CPUID is a rare operation.
Note, reinitializing all MMUs after a CPUID update does not fix all of
KVM's woes. Specifically, kvm_mmu_page_role doesn't track the CPUID
properties, which means that a vCPU can reuse shadow pages that should
not exist for the new vCPU model, e.g. that map GPAs that are now illegal
(due to MAXPHYADDR changes) or that set bits that are now reserved
(PAGE_SIZE for 1gb pages), etc...
Tracking the relevant CPUID properties in kvm_mmu_page_role would address
the majority of problems, but fully tracking that much state in the
shadow page role comes with an unpalatable cost as it would require a
non-trivial increase in KVM's memory footprint. The GBPAGES case is even
worse, as neither Intel nor AMD provides a way to disable 1gb hugepage
support in the hardware page walker, i.e. it's a virtualization hole that
can't be closed when using TDP.
In other words, resetting the MMU after a CPUID update is largely a
superficial fix. But, it will allow reverting the tracking of MAXPHYADDR
in the mmu_role, and that case in particular needs to mostly work because
KVM's shadow_root_level depends on guest MAXPHYADDR when 5-level paging
is supported. For cases where KVM botches guest behavior, the damage is
limited to that guest. But for the shadow_root_level, a misconfigured
MMU can cause KVM to incorrectly access memory, e.g. due to walking off
the end of its shadow page tables.
Fixes: 7dcd575520 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Restore CR4.LA57 to the mmu_role to fix an amusing edge case with nested
virtualization. When KVM (L0) is using TDP, CR4.LA57 is not reflected in
mmu_role.base.level because that tracks the shadow root level, i.e. TDP
level. Normally, this is not an issue because LA57 can't be toggled
while long mode is active, i.e. the guest has to first disable paging,
then toggle LA57, then re-enable paging, thus ensuring an MMU
reinitialization.
But if L1 is crafty, it can load a new CR4 on VM-Exit and toggle LA57
without having to bounce through an unpaged section. L1 can also load a
new CR3 on exit, i.e. it doesn't even need to play crazy paging games; a
single-entry PML5 is sufficient. Such shenanigans are only problematic
if L0 and L1 use TDP, otherwise L1 and L2 share an MMU that gets
reinitialized on nested VM-Enter/VM-Exit due to mmu_role.base.guest_mode.
Note, in the L2 case with nested TDP, even though L1 can switch between
L2s with different LA57 settings, thus bypassing the paging requirement,
in that case KVM's nested_mmu will track LA57 in base.level.
This reverts commit 8053f924ca.
Fixes: 8053f924ca ("KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Mark NX as being used for all non-nested shadow MMUs, as KVM will set the
NX bit for huge SPTEs if the iTLB multi-hit mitigation is enabled.
Checking the mitigation itself is not sufficient as it can be toggled on
at any time and KVM doesn't reset MMU contexts when that happens. KVM
could reset the contexts, but that would require purging all SPTEs in all
MMUs, for no real benefit. And, KVM already forces EFER.NX=1 when TDP is
disabled (for WP=0, SMEP=1, NX=0), so technically NX is never reserved
for shadow MMUs.
Fixes: b8e8c8303f ("kvm: mmu: ITLB_MULTIHIT mitigation")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the case where kvm_memslots_have_rmaps(kvm) is false, the boolean
variable flush is never assigned and remains uninitialized. If
is_tdp_mmu_enabled(kvm) is true, the uninitialized value of flush is then
passed to kvm_tdp_mmu_zap_collapsible_sptes. Fix this by initializing
flush to false.
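A kernel-style sketch of the fix described above; the rmap-side helper name is a placeholder, while the shape of the kvm_tdp_mmu_zap_collapsible_sptes() call follows the description in this message:
  static bool zap_collapsible_sptes_sketch(struct kvm *kvm,
                                           const struct kvm_memory_slot *slot)
  {
          bool flush = false;     /* the fix: defined even if the rmap pass is skipped */

          if (kvm_memslots_have_rmaps(kvm))
                  flush = zap_collapsible_sptes_via_rmaps(kvm, slot);     /* placeholder */

          if (is_tdp_mmu_enabled(kvm))
                  flush = kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot, flush);

          return flush;
  }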
Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: e2209710cc ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622150912.23429-1-colin.king@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The root_hpa checks below the top-level check in kvm_mmu_page_fault are
theoretically redundant since there is no longer a way for the root_hpa
to be reset during a page fault. The details of why are described in
commit ddce620821 ("KVM: x86/mmu: Move root_hpa validity checks to top
of page fault handler")
__direct_map, kvm_tdp_mmu_map, and get_mmio_spte are all only reachable
through kvm_mmu_page_fault, therefore their root_hpa checks are
redundant.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210617231948.2591431-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>