mirror of https://github.com/torvalds/linux.git
171 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
de8e8ebb1a |
KVM TDX changes for 6.19:
- Overhaul the TDX code to address systemic races where KVM (acting on behalf
of userspace) could inadvertantly trigger lock contention in the TDX-Module,
which KVM was either working around in weird, ugly ways, or was simply
oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever
selftests).
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if
creating said vCPU failed partway through.
- Fix a few sparse warnings (bad annotation, 0 != NULL).
- Use struct_size() to simplify copying capabilities to userspace.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmkmVkAACgkQOlYIJqCj
N/18Ow//cWPmXAdJcM0fRtnSGwzIZszGSD63htgdh5UDeJIFVyUGKH7uGhndQUwK
Uo8jCJ4ikwMxDdCijv+e4eqCCMZjb7HQhFKaauPVCJZOhmZn0br3EB5xX24Qgp8R
YN5gTheiTCHHVaxAMl9grgi1xTRi6pJRufRebOmtyGKNQkclctXcuSdtw7IEhqdM
wKM3eyb7qUhUrmt5tBkSyFAioGcPJIHE3vqLjImqDgduinbXJdQa1sek4Br0sX45
rfISZ2geXDj/Sh7EPrPU1ne5LQbtgzp1WTG6MRCidYfP86riMQUlEMY6odEYAgIX
kCd+z248OJShF5EYcEmjc894YLHJ0vVXIXKx/qh0+Jiobz3bujk+whaxTNa26rj0
3qLPGzFpYugtxkGqBYH4q90oUTovEk4922+jPsQ9GKQ26f0q3XzvriEUSOgrvo0Z
O26OyK7BezqSM5WMMSf/EGI1ESuli5lbBLYDOaNZS35di2YcDEgtaikRETpWwy82
TGxrjyeW9Pu6M3iTtQsOVHNxA4hU//Qd5HcDj5rcXOg1rgiPV9n2OaCEMwc6qi+V
VytbGm4IlMsz6AVHqyv3SUIt1Z4LNAZ/FwK8oeBRVd6LNfm6nfyrW6eQFQVLoIpA
1nyi9XjMg7xj6ubiSEQSTSl9gto8FzVWwLKwZ8dLH7SPvqlz+zY=
=qGpA
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-tdx-6.19' of https://github.com/kvm-x86/linux into HEAD
KVM TDX changes for 6.19:
- Overhaul the TDX code to address systemic races where KVM (acting on behalf
of userspace) could inadvertantly trigger lock contention in the TDX-Module,
which KVM was either working around in weird, ugly ways, or was simply
oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever
selftests).
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if
creating said vCPU failed partway through.
- Fix a few sparse warnings (bad annotation, 0 != NULL).
- Use struct_size() to simplify copying capabilities to userspace.
|
|
|
|
c09816f2af |
KVM: x86: Remove unused declaration kvm_mmu_may_ignore_guest_pat()
Commit
|
|
|
|
fe7413e398 |
Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU"
Remove the helper and exports that were added to allow TDX code to reuse
kvm_tdp_map_page() for its gmem post-populate flow now that a dedicated
TDP MMU API is provided to install a mapping given a gfn+pfn pair.
This reverts commit
|
|
|
|
3ab3283dbb |
KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
Add and use a new API for mapping a private pfn from guest_memfd into the TDP MMU from TDX's post-populate hook instead of partially open-coding the functionality into the TDX code. Sharing code with the pre-fault path sounded good on paper, but it's fatally flawed as simulating a fault loses the pfn, and calling back into gmem to re-retrieve the pfn creates locking problems, e.g. kvm_gmem_populate() already holds the gmem invalidation lock. Providing a dedicated API will also removing several MMU exports that ideally would not be exposed outside of the MMU, let alone to vendor code. On that topic, opportunistically drop the kvm_mmu_load() export. Leave kvm_tdp_mmu_gpa_is_mapped() alone for now; the entire commit that added kvm_tdp_mmu_gpa_is_mapped() will be removed in the near future. Gate the API on CONFIG_KVM_GUEST_MEMFD=y as private memory _must_ be backed by guest_memfd. Add a lockdep-only assert to that the incoming pfn is indeed backed by guest_memfd, and that the gmem instance's invalidate lock is held (which, combined with slots_lock being held, obviates the need to check for a stale "fault"). Cc: Michael Roth <michael.roth@amd.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Vishal Annapurve <vannapurve@google.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/all/20250709232103.zwmufocd3l7sqk7y@amd.com Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
296599346c |
KVM: x86/mmu: WARN on attempt to check permissions for Shadow Stack #PF
Add PFERR_SS_MASK, a.k.a. Shadow Stack access, and WARN if KVM attempts to check permissions for a Shadow Stack access as KVM hasn't been taught to understand the magic Writable=0,Dirty=1 combination that is required for Shadow Stack accesses, and likely will never learn. There are no plans to support Shadow Stacks with the Shadow MMU, and the emulator rejects all instructions that affect Shadow Stacks, i.e. it should be impossible for KVM to observe a #PF due to a shadow stack access. Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
85502b2214 |
LoongArch KVM changes for v6.16
1. Don't flush tlb if HW PTW supported. 2. Add LoongArch KVM selftests support. -----BEGIN PGP SIGNATURE----- iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmgsdO8WHGNoZW5odWFj YWlAa2VybmVsLm9yZwAKCRAChivD8uImejUWD/9zNaBSqTpqeAptRQ6qTKdrxYtN ZKJR9a8AQF5vMPD9dWoRr6iLaNt061GqBOKbhF5RGUVq6uIDfCkZbSbX1h6Ptgcz OwkJHbrZAu+Z31NSoYqbgPZnurSJ9oOUGsghqp3ecKEf0LptTVaKw4WDeKkKNOlq eJaUC1WJ1P4sSXhTlHKAl79Ds/1pza9iAgtKXothrh09PL48LYaWFpmiS0uQmKOD qwYVU/OJIUWn4ZOxyGdk6ZR0B+mJOwwoO1ILptRIeSk4oTKiE1HiIEqnhdnMrJFr DkRdwQ/jq7iH/CFkVybNzxgLqAqpGLJgPj5VakYPmp2scuWSESej9/o8wYrmn0y0 QuDQah0LiRIcbUYRHLDkfKRKCndfJ5KCrpOiD5mZ8bMd2LF9hPnH5toCXb/ZEpsK plu9qUrQgXF8rkX/zCIvQrOp0kYdU8DMbhZsXymSVpEs1fwrCNp2/Z2PrMYLKGt+ JT+65jVRJ67d1OdKw3DrWQA8Au0Ma1rgzX3oLDu8wnqAG7ULAJRVIkDMcxG7a5SQ P+DlIbEHC1U8Dw8hW+PFhfOl13M9p5s3EP5fy8q85UJ/5fJ41fCPBUOm/rfMQcci 6/e+xwBKKkJ7oQ/fFKvJ+n6GbV6suV5FUxicYQjC43iiCKklRhG77uek/DK3c5um ZiTu+YapX9qbG54Q9w== =YM/u -----END PGP SIGNATURE----- Merge tag 'loongarch-kvm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.16 1. Don't flush tlb if HW PTW supported. 2. Add LoongArch KVM selftests support. |
|
|
|
20a6cff3b2 |
KVM: x86/mmu: Check and free obsolete roots in kvm_mmu_reload()
Check request KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to free obsolete roots in
kvm_mmu_reload() to prevent kvm_mmu_reload() from seeing a stale obsolete
root.
Since kvm_mmu_reload() can be called outside the
vcpu_enter_guest() path (e.g., kvm_arch_vcpu_pre_fault_memory()), it may be
invoked after a root has been marked obsolete and before vcpu_enter_guest()
is invoked to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS and set root.hpa to
invalid. This causes kvm_mmu_reload() to fail to load a new root, which
can lead to kvm_arch_vcpu_pre_fault_memory() being stuck in the while
loop in kvm_tdp_map_page() since RET_PF_RETRY is always returned due to
is_page_fault_stale().
Keep the existing check of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS in
vcpu_enter_guest() since the cost of kvm_check_request() is negligible,
especially a check that's guarded by kvm_request_pending().
Export symbol of kvm_mmu_free_obsolete_roots() as kvm_mmu_reload() is
inline and may be called outside of kvm.ko.
Fixes:
|
|
|
|
c9c1e20b4c |
KVM: x86: Introduce Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have KVM ignore guest PAT when this quirk is enabled. On AMD platforms, KVM always honors guest PAT. On Intel however there are two issues. First, KVM *cannot* honor guest PAT if CPU feature self-snoop is not supported. Second, UC access on certain Intel platforms can be very slow[1] and honoring guest PAT on those platforms may break some old guests that accidentally specify video RAM as UC. Those old guests may never expect the slowness since KVM always forces WB previously. See [2]. So, introduce a quirk that KVM can enable by default on all Intel platforms to avoid breaking old unmodifiable guests. Newer userspace can disable this quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail if self-snoop is not supported, i.e. if KVM cannot obey the wish. The quirk is a no-op on AMD and also if any assigned devices have non-coherent DMA. This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is another example of a quirk that is sometimes automatically disabled. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Sean Christopherson <seanjc@google.com> Cc: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com # [1] Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com # [2] Message-ID: <20250224070946.31482-1-yan.y.zhao@intel.com> [Use supported_quirks/inapplicable_quirks to support both AMD and no-self-snoop cases, as well as to remove the shadow_memtype_mask check from kvm_mmu_may_ignore_guest_pat(). - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
5a46fd48d8 |
KVM: x86/mmu: Add setter for shadow_mmio_value
Future changes will want to set shadow_mmio_value from TDX code. Add a helper to setter with a name that makes more sense from that context. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [split into new patch] Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241112073730.22200-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
2608f10576 |
KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU
Export a function to walk down the TDP without modifying it and simply check if a GPA is mapped. Future changes will support pre-populating TDX private memory. In order to implement this KVM will need to check if a given GFN is already pre-populated in the mirrored EPT. [1] There is already a TDP MMU walker, kvm_tdp_mmu_get_walk() for use within the KVM MMU that almost does what is required. However, to make sense of the results, MMU internal PTE helpers are needed. Refactor the code to provide a helper that can be used outside of the KVM MMU code. Refactoring the KVM page fault handler to support this lookup usage was also considered, but it was an awkward fit. kvm_tdp_mmu_gpa_is_mapped() is based on a diff by Paolo Bonzini. Link: https://lore.kernel.org/kvm/ZfBkle1eZFfjPI8l@google.com/ [1] Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241112073457.22011-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
86eb1aef72 |
Merge branch 'kvm-mirror-page-tables' into HEAD
As part of enabling TDX virtual machines, support support separation of
private/shared EPT into separate roots.
Confidential computing solutions almost invariably have concepts of
private and shared memory, but they may different a lot in the details.
In SEV, for example, the bit is handled more like a permission bit as
far as the page tables are concerned: the private/shared bit is not
included in the physical address.
For TDX, instead, the bit is more like a physical address bit, with
the host mapping private memory in one half of the address space and
shared in another. Furthermore, the two halves are mapped by different
EPT roots and only the shared half is managed by KVM; the private half
(also called Secure EPT in Intel documentation) gets managed by the
privileged TDX Module via SEAMCALLs.
As a result, the operations that actually change the private half of
the EPT are limited and relatively slow compared to reading a PTE. For
this reason the design for KVM is to keep a mirror of the private EPT in
host memory. This allows KVM to quickly walk the EPT and only perform the
slower private EPT operations when it needs to actually modify mid-level
private PTEs.
There are thus three sets of EPT page tables: external, mirror and
direct. In the case of TDX (the only user of this framework) the
first two cover private memory, whereas the third manages shared
memory:
external EPT - Hidden within the TDX module, modified via TDX module
calls.
mirror EPT - Bookkeeping tree used as an optimization by KVM, not
used by the processor.
direct EPT - Normal EPT that maps unencrypted shared memory.
Managed like the EPT of a normal VM.
Modifying external EPT
----------------------
Modifications to the mirrored page tables need to also perform the
same operations to the private page tables, which will be handled via
kvm_x86_ops. Although this prep series does not interact with the TDX
module at all to actually configure the private EPT, it does lay the
ground work for doing this.
In some ways updating the private EPT is as simple as plumbing PTE
modifications through to also call into the TDX module; however, the
locking is more complicated because inserting a single PTE cannot anymore
be done atomically with a single CMPXCHG. For this reason, the existing
FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the
private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides
protecting operation of KVM, it limits the set of cases in which the
TDX module will encounter contention on its own PTE locks.
Zapping external EPT
--------------------
While the framework tries to be relatively generic, and to be
understandable without knowing TDX much in detail, some requirements of
TDX sometimes leak; for example the private page tables also cannot be
zapped while the range has anything mapped, so the mirrored/private page
tables need to be protected from KVM operations that zap any non-leaf
PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().
For normal VMs, guest memory is zapped for several reasons: user
memory getting paged out by the guest, memslots getting deleted,
passthrough of devices with non-coherent DMA. Confidential computing
adds to these the conversion of memory between shared and privates. These
operations must not zap any private memory that is in use by the guest.
This is possible because the only zapping that is out of the control
of KVM/userspace is paging out userspace memory, which cannot apply to
guestmemfd operations. Thus a TDX VM will only zap private memory from
memslot deletion and from conversion between private and shared memory
which is triggered by the guest.
To avoid zapping too much memory, enums are introduced so that operations
can choose to target only private or shared memory, and thus only
direct or mirror EPT. For example:
Memslot deletion - Private and shared
MMU notifier based zapping - Shared only
Conversion to shared - Private only
Conversion to private - Shared only
Other cases of zapping will not be supported for KVM, for example
APICv update or non-coherent DMA status update; for the latter, TDX will
simply require that the CPU supports self-snoop and honor guest PAT
unconditionally for shared memory.
|
|
|
|
2c3412e999 |
KVM: x86/mmu: Prevent aliased memslot GFNs
Add a few sanity checks to prevent memslot GFNs from ever having alias bits set. Like other Coco technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly though calls into a protected runtime environment called the TDX module, where the shared half is managed within KVM in normal page tables. For TDX, the shared half will be mapped in the higher alias, with a "shared bit" set in the GPA. However, KVM will still manage it with the same memslots as the private half. This means memslot looks ups and zapping operations will be provided with a GFN without the shared bit set. If these memslot GFNs ever had the bit that selects between the two aliases it could lead to unexpected behavior in the complicated code that directs faulting or zapping operations between the roots that map the two aliases. As a safety measure, prevent memslots from being set at a GFN range that contains the alias bit. Also, check in the kvm_faultin_pfn() for the fault path. This later check does less today, as the alias bits are specifically stripped from the GFN being checked, however future code could possibly call in to the fault handler in a way that skips this stripping. Since kvm_faultin_pfn() now has many references to vcpu->kvm, extract it to local variable. Link: https://lore.kernel.org/kvm/ZpbKqG_ZhCWxl-Fc@google.com/ Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-19-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
fabaa76501 |
KVM: x86/tdp_mmu: Support mirror root for TDP MMU
Add the ability for the TDP MMU to maintain a mirror of a separate
mapping.
Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.
In order to handle both shared and private memory, KVM needs to learn to
handle faults and other operations on the correct root for the operation.
KVM could learn the concept of private roots, and operate on them by
calling out to operations that call into the TDX module. But there are two
problems with that:
1. Calls into the TDX module are relatively slow compared to the simple
accesses required to read a PTE managed directly by KVM.
2. Other Coco technologies deal with private memory completely differently
and it will make the code confusing when being read from their
perspective. Special operations added for TDX that set private or zap
private memory will have nothing to do with these other private memory
technologies. (SEV, etc).
To handle these, instead teach the TDP MMU about a new concept "mirror
roots". Such roots maintain page tables that are not actually mapped,
and are just used to traverse quickly to determine if the mid level page
tables need to be installed. When the memory be mirrored needs to actually
be changed, calls can be made to via x86_ops.
private KVM page fault |
| |
V |
private GPA | CPU protected EPTP
| | |
V | V
mirror PT root | external PT root
| | |
V | V
mirror PT --hook to propagate-->external PT
| | |
\--------------------+------\ |
| | |
| V V
| private guest page
|
|
non-encrypted memory | encrypted memory
|
Leave calling out to actually update the private page tables that are being
mirrored for later changes. Just implement the handling of MMU operations
on to mirrored roots.
In order to direct operations to correct root, add root types
KVM_DIRECT_ROOTS and KVM_MIRROR_ROOTS. Tie the usage of mirrored/direct
roots to private/shared with conditionals. It could also be implemented by
making the kvm_tdp_mmu_root_types and kvm_gfn_range_filter enum bits line
up such that conversion could be a direct assignment with a case. Don't do
this because the mapping of private to mirrored is confusing enough. So it
is worth not hiding the logic in type casting.
Cleanup the mirror root in kvm_mmu_destroy() instead of the normal place
in kvm_mmu_free_roots(), because the private root that is being cannot be
rebuilt like a normal root. It needs to persist for the lifetime of the VM.
The TDX module will also need to be provided with page tables to use for
the actual mapping being mirrored by the mirrored page tables. Allocate
these in the mapping path using the recently added
kvm_mmu_alloc_external_spt().
Don't support 2M page for now. This is avoided by forcing 4k pages in the
fault. Add a KVM_BUG_ON() to verify.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-13-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
3fc3f71851 |
KVM: x86/mmu: Support GFN direct bits
Teach the MMU to map guest GFNs at a massaged position on the TDP, to aid in implementing TDX shared memory. Like other Coco technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly through calls into a protected runtime environment called the TDX module, where the shared half is managed within KVM in normal page tables. For TDX, the shared half will be mapped in the higher alias, with a "shared bit" set in the GPA. However, KVM will still manage it with the same memslots as the private half. This means memslot looks ups and zapping operations will be provided with a GFN without the shared bit set. So KVM will either need to apply or strip the shared bit before mapping or zapping the shared EPT. Having GFNs sometimes have the shared bit and sometimes not would make the code confusing. So instead arrange the code such that GFNs never have shared bit set. Create a concept of "direct bits", that is stripped from the fault address when setting fault->gfn, and applied within the TDP MMU iterator. Calling code will behave as if it is operating on the PTE mapping the GFN (without shared bits) but within the iterator, the actual mappings will be shifted using bits specific for the root. SPs will have the GFN set without the shared bit. In the end the TDP MMU will behave like it is mapping things at the GFN without the shared bit but with a strange page table format where everything is offset by the shared bit. Since TDX only needs to shift the mapping like this for the shared bit, which is mapped as the normal TDP root, add a "gfn_direct_bits" field to the kvm_arch structure for each VM with a default value of 0. It will have the bit set at the position of the GPA shared bit in GFN through TD specific initialization code. Keep TDX specific concepts out of the MMU code by not naming it "shared". Ranged TLB flushes (i.e. flush_remote_tlbs_range()) target specific GFN ranges. In convention established above, these would need to target the shifted GFN range. It won't matter functionally, since the actual implementation will always result in a full flush for the only planned user (TDX). For correctness reasons, future changes can provide a TDX x86_ops.flush_remote_tlbs_range implementation to return -EOPNOTSUPP and force the full flush for TDs. This leaves one problem. Some operations use a concept of max GFN (i.e. kvm_mmu_max_gfn()), to iterate over the whole TDP range. When applying the direct mask to the start of the range, the iterator would end up skipping iterating over the range not covered by the direct mask bit. For safety, make sure the __tdp_mmu_zap_root() operation iterates over the full GFN range supported by the underlying TDP format. Add a new iterator helper, for_each_tdp_pte_min_level_all(), that iterates the entire TDP GFN range, regardless of root. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-9-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
3a4eb364a4 |
KVM: x86/mmu: Add an external pointer to struct kvm_mmu_page
Add an external pointer to struct kvm_mmu_page for TDX's private page table
and add helper functions to allocate/initialize/free a private page table
page. TDX will only be supported with the TDP MMU. Because KVM TDP MMU
doesn't use unsync_children and write_flooding_count, pack them to have
room for a pointer and use a union to avoid memory overhead.
For private GPA, CPU refers to a private page table whose contents are
encrypted. The dedicated APIs to operate on it (e.g. updating/reading its
PTE entry) are used, and their cost is expensive.
When KVM resolves the KVM page fault, it walks the page tables. To reuse
the existing KVM MMU code and mitigate the heavy cost of directly walking
the private page table allocate two sets of page tables for the private
half of the GPA space.
For the page tables that KVM will walk, allocate them like normal and refer
to them as mirror page tables. Additionally allocate one more page for the
page tables the CPU will walk, and call them external page tables. Resolve
the KVM page fault with the existing code, and do additional operations
necessary for modifying the external page table in future patches.
The relationship of the types of page tables in this scheme is depicted
below:
KVM page fault |
| |
V |
-------------+---------- |
| | |
V V |
shared GPA private GPA |
| | |
V V |
shared PT root mirror PT root | private PT root
| | | |
V V | V
shared PT mirror PT --propagate--> external PT
| | | |
| \-----------------+------\ |
| | | |
V | V V
shared guest page | private guest page
|
non-encrypted memory | encrypted memory
|
PT - Page table
Shared PT - Visible to KVM, and the CPU uses it for shared mappings.
External PT - The CPU uses it, but it is invisible to KVM. TDX module
updates this table to map private guest pages.
Mirror PT - It is visible to KVM, but the CPU doesn't use it. KVM uses
it to propagate PT change to the actual private PT.
Add a helper kvm_has_mirrored_tdp() to trigger this behavior and wire it
to the TDX vm type.
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20240718211230.1492011-5-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
2c5e168e5c |
KVM: x86: Rename "governed features" helpers to use "guest_cpu_cap"
As the first step toward replacing KVM's so-called "governed features" framework with a more comprehensive, less poorly named implementation, replace the "kvm_governed_feature" function prefix with "guest_cpu_cap" and rename guest_can_use() to guest_cpu_cap_has(). The "guest_cpu_cap" naming scheme mirrors that of "kvm_cpu_cap", and provides a more clear distinction between guest capabilities, which are KVM controlled (heh, or one might say "governed"), and guest CPUID, which with few exceptions is fully userspace controlled. Opportunistically rewrite the comment about XSS passthrough for SEV-ES guests to avoid referencing so many functions, as such comments are prone to becoming stale (case in point...). No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20241128013424.4096668-40-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
e52ad1ddd0 |
KVM: x86: drop x86.h include from cpuid.h
Drop x86.h include from cpuid.h to allow the x86.h to include the cpuid.h instead. Also fix various places where x86.h was implicitly included via cpuid.h Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906221824.491834-2-mlevitsk@redhat.com [sean: fixup a missed include in mtrr.c] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
4ca077f26d |
KVM: x86: Remove some unused declarations
Commit
|
|
|
|
896046474f |
KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops
Introduces kvm_x86_call(), to streamline the usage of static calls of kvm_x86_ops. The current implementation of these calls is verbose and could lead to alignment challenges. This makes the code susceptible to exceeding the "80 columns per single line of code" limit as defined in the coding-style document. Another issue with the existing implementation is that the addition of kvm_x86_ prefix to hooks at the static_call sites hinders code readability and navigation. kvm_x86_call() is added to improve code readability and maintainability, while adhering to the coding style guidelines. Signed-off-by: Wei Wang <wei.w.wang@intel.com> Link: https://lore.kernel.org/r/20240507133103.15052-3-wei.w.wang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
5c5ddf7107 |
KVM x86 MTRR virtualization removal
Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD hack, and instead always honor guest PAT on CPUs that support self-snoop. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuwAACgkQOlYIJqCj N/32Gg/+Nnnz6TCRno2vursPJme7gvtLdqSxjazAj3u2ZO8IApGYWMyfVpS+ymC9 Wdpj6gRe2ukSxgTsUI2CYoy5V2NxDaA9YgdTPZUVQvqwujVrqZCJ7L393iPYYnC9 No3LXZ+SOYRmomiCzknjC6GOlT2hAZHzQsyaXDlEYok7NAA2L6XybbLonEdA4RYi V1mS62W5PaA4tUesuxkJjPujXo1nXRWD/aXOruJWjPESdSFSALlx7reFAf2Nwn7K Uw8yZqhq6vWAZSph0Nz8OrZOS/kULKA3q2zl1B/qJJ0ToAt2VdXS6abXky52RExf KvP+jBAWMO5kHbIqaMRtCHjbIkbhH8RdUIYNJQEUQ5DdydM5+/RDa+KprmLPcmUn qvJq+3uyH0MEENtneGegs8uxR+sn6fT32cGMIw790yIywddh562+IJ4Z+C3BuYJi yszD71odqKT8+knUd2CaZjE9UZyoQNDfj2OCCTzzZOC/6TuJWCh9CYQ1csssHbQR KcvZCKE6ht8tWwi+2HWj0laOdg1reX2kV869k3xH4uCwEaFIj2Wk+/Bw/lg2Tn5h 5uTnQ01dx5XhAV1klr6IY3VXJ/A8G8895wRfkZEelsA9Wj8qZvNgXhsoXReIUIrn aR0ppsFcbqHzC50qE2JT4juTD1EPx95LL9zKT8pI9mGKwxCAxUM= =yb10 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-mtrrs-6.11' of https://github.com/kvm-x86/linux into HEAD KVM x86 MTRR virtualization removal Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD hack, and instead always honor guest PAT on CPUs that support self-snoop. |
|
|
|
5dcc1e7614 |
KVM x86 misc changes for 6.11
- Add a global struct to consolidate tracking of host values, e.g. EFER, and
move "shadow_phys_bits" into the structure as "maxphyaddr".
- Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
bus frequency, because TDX.
- Print the name of the APICv/AVIC inhibits in the relevant tracepoint.
- Clean up KVM's handling of vendor specific emulation to consistently act on
"compatible with Intel/AMD", versus checking for a specific vendor.
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRub0ACgkQOlYIJqCj
N/2LMxAArGzhcWZ6Qdo2aMRaMIPtSBJHmbEgEuHvHMumgsTZQzDcn9cxDi/hNSrc
l8ODOwAM2qNcq95YfwjU7F0ae3E+HRzGvKcBnmZWuQeCDp2HhVEoCphFu1sHst+t
XEJTL02b6OgyJUEU3h40mYk12eiq2S4FCnFYXPCqijwwuL6Y5KQvvTqek3c2/SDn
c+VneutYGax/S0GiiCkYh4wrwWh9g7qm0IX70ycBwJbW5qBFKgyglvHxvL8JLJC9
Nkkw/p2657wcOdraH+fOBuRy2dMwE5fv++1tOjWwB5WAAhSOJPZh0BGYvgA2yfN7
OE+k7APKUQd9Xxtud8H3LrTPoyMA4hz2sdDFyqrrWK9yjpBY7zXNyN50Fxi7VVsm
T8nTIiKAGyRbjotY+m7krXQPXjfZYhVqrJ/jtxESOZLZ93q2gSWU2p/ZXpUPVHnH
+YOBAI1owP3wepaYlrthtI4LQx9lF422dnmeSflztfKFGabRbQZxg3uHMCCxIaGc
lJ6CD546+D45f/uBXRDMqk//qFTqXhKUbDk9sutmU/C2oWufMwW0R8kOyItGPyvk
9PP1vd8vSsIHj+tpwg+i04jBqYDaAcPBOcTZaHm9SYYP+1e11Uu5Vjep37JL1bkA
xJWxnDZOCGcfKQi2jkh51HJ/dOAHXY1GQKMfyAoPQOSonYHvGVY=
=Cf2R
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.11
- Add a global struct to consolidate tracking of host values, e.g. EFER, and
move "shadow_phys_bits" into the structure as "maxphyaddr".
- Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
bus frequency, because TDX.
- Print the name of the APICv/AVIC inhibits in the relevant tracepoint.
- Clean up KVM's handling of vendor specific emulation to consistently act on
"compatible with Intel/AMD", versus checking for a specific vendor.
- Misc cleanups
|
|
|
|
0a7b73559b |
KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR adds no value, negatively impacts guest performance, and is a maintenance burden due to it's complexity and oddities. KVM's approach to virtualizating MTRRs make no sense, at all. KVM *only* honors guest MTRR memtypes if EPT is enabled *and* the guest has a device that may perform non-coherent DMA access. From a hardware virtualization perspective of guest MTRRs, there is _nothing_ special about EPT. Legacy shadowing paging doesn't magically account for guest MTRRs, nor does NPT. Unwinding and deciphering KVM's murky history, the MTRR virtualization code appears to be the result of misdiagnosed issues when EPT + VT-d with passthrough devices was enabled years and years ago. And importantly, the underlying bugs that were fudged around by honoring guest MTRR memtypes have since been fixed (though rather poorly in some cases). The zapping GFNs logic in the MTRR virtualization code came from: commit |
|
|
|
82897db912 |
KVM: x86: Move shadow_phys_bits into "kvm_host", as "maxphyaddr"
Move shadow_phys_bits into "struct kvm_host_values", i.e. into KVM's global "kvm_host" variable, so that it is automatically exported for use in vendor modules. Rename the variable/field to maxphyaddr to more clearly capture what value it holds, now that it's used outside of the MMU (and because the "shadow" part is more than a bit misleading as the variable is not at all unique to shadow paging). Recomputing the raw/true host.MAXPHYADDR on every use can be subtly expensive, e.g. it will incur a VM-Exit on the CPUID if KVM is running as a nested hypervisor. Vendor code already has access to the information, e.g. by directly doing CPUID or by invoking kvm_get_shadow_phys_bits(), so there's no tangible benefit to making it MMU-only. Link: https://lore.kernel.org/r/20240423221521.2923759-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
c043eaaa6b |
KVM: x86/mmu: Snapshot shadow_phys_bits when kvm.ko is loaded
Snapshot shadow_phys_bits when kvm.ko is loaded, not when a vendor module is loaded, to guard against usage of shadow_phys_bits before it is initialized. The computation isn't vendor specific in any way, i.e. there there is no reason to wait to snapshot the value until a vendor module is loaded, nor is there any reason to recompute the value every time a vendor module is loaded. Opportunistically convert it from "read mostly" to "read-only after init". Link: https://lore.kernel.org/r/20240423221521.2923759-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
c63cf135cc |
KVM: SEV: Add support to handle RMP nested page faults
When SEV-SNP is enabled in the guest, the hardware places restrictions on all memory accesses based on the contents of the RMP table. When hardware encounters RMP check failure caused by the guest memory access it raises the #NPF. The error code contains additional information on the access type. See the APM volume 2 for additional information. When using gmem, RMP faults resulting from mismatches between the state in the RMP table vs. what the guest expects via its page table result in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This means the only expected case that needs to be handled in the kernel is when the page size of the entry in the RMP table is larger than the mapping in the nested page table, in which case a PSMASH instruction needs to be issued to split the large RMP entry into individual 4K entries so that subsequent accesses can succeed. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Message-ID: <20240501085210.2213060-12-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
7d41e24da2 |
KVM x86 misc changes for 6.10:
- Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which
is unused by hardware, so that KVM can communicate its inability to map GPAs
that set bits 51:48 due to lack of 5-level paging. Guest firmware is
expected to use the information to safely remap BARs in the uppermost GPA
space, i.e to avoid placing a BAR at a legal, but unmappable, GPA.
- Use vfree() instead of kvfree() for allocations that always use vcalloc()
or __vcalloc().
- Don't completely ignore same-value writes to immutable feature MSRs, as
doing so results in KVM failing to reject accesses to MSR that aren't
supposed to exist given the vCPU model and/or KVM configuration.
- Don't mark APICv as being inhibited due to ABSENT if APICv is disabled
KVM-wide to avoid confusing debuggers (KVM will never bother clearing the
ABSENT inhibit, even if userspace enables in-kernel local APIC).
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmY+rlEACgkQOlYIJqCj
N/3/xQ/7BvNl1aCJSIQy+yanCKK4wV0wWoY/hD+1wVge3zoaLZqLNHeR7fEa3vo+
OSS/pOz+PT6DbkokZYjjVaGs6+pFqaYg5YvRE7SPbj903phm81H7v5ZLtwgOBcXx
dG9cSLTaRhos0PxqoiLfmiGK5IDKmWuZyJzhw+nPh2YmxoRDO/4exsLA9xWWhQSh
BjPf32cq69fn39Mo/KeANdLR1FEjvKItEty7St5r/OZFxejP8VPe1xuFxHPJn4U+
FBbDe0DMXAPfoAQImBBhHUpm5Rp7Hwbh90tM8xY6rf3hvRZWmMCAX/Hx8C562M2b
k6jB13gsoVesatT6lgKs2I0KGL7TSC0jLYG8aeREdBz6AEo5bkBegB5965MZYfGv
T43i/zk+Ha5VIEURqE/CtocKF8AEjnUWLaIyL7VsDqaMslmaMdWzr8RouaO1snMT
N/mfilzx9/rzltTV67TI8FSykPNxehwNoc9P8l+ulbW1KKIzpZCWxtIpQnT2TGdn
89zAJ7LUbEAOnO+jMsJjld0fcNEmUqiqu9tezHuu0rVYErYqtfVhrWIf52r0AHDK
HRY5FNcZzCE+8FFAVDNl92Of+mPeF47RELXNMLAT+1lm91ug4k62GF4UDw7hsbFo
6+ductlj2DZlwxZVGKxKhBDxFg+AfsNCC1fZvYq+D/6ZE51eABo=
=9RXP
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.10' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.10:
- Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which
is unused by hardware, so that KVM can communicate its inability to map GPAs
that set bits 51:48 due to lack of 5-level paging. Guest firmware is
expected to use the information to safely remap BARs in the uppermost GPA
space, i.e to avoid placing a BAR at a legal, but unmappable, GPA.
- Use vfree() instead of kvfree() for allocations that always use vcalloc()
or __vcalloc().
- Don't completely ignore same-value writes to immutable feature MSRs, as
doing so results in KVM failing to reject accesses to MSR that aren't
supposed to exist given the vCPU model and/or KVM configuration.
- Don't mark APICv as being inhibited due to ABSENT if APICv is disabled
KVM-wide to avoid confusing debuggers (KVM will never bother clearing the
ABSENT inhibit, even if userspace enables in-kernel local APIC).
|
|
|
|
63b6206e2f |
KVM: x86: Remove separate "bit" defines for page fault error code masks
Open code the bit number directly in the PFERR_* masks and drop the intermediate PFERR_*_BIT defines, as having to bounce through two macros just to see which flag corresponds to which bit is quite annoying, as is having to define two macros just to add recognition of a new flag. Use ternary operator to derive the bit in permission_fault(), the one function that actually needs the bit number as part of clever shifting to avoid conditional branches. Generally the compiler is able to turn it into a conditional move, and if not it's not really a big deal. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240228024147.41573-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
b628cb523c |
KVM: x86: Advertise max mappable GPA in CPUID.0x80000008.GuestPhysBits
Use the GuestPhysBits field in CPUID.0x80000008 to communicate the max mappable GPA to userspace, i.e. the max GPA that is addressable by the CPU itself. Typically this is identical to the max effective GPA, except in the case where the CPU supports MAXPHYADDR > 48 but does not support 5-level TDP (the CPU consults bits 51:48 of the GPA only when walking the fifth level TDP page table entry). Enumerating the max mappable GPA via CPUID will allow guest firmware to map resources like PCI bars in the highest possible address space, while ensuring that the GPA is addressable by the CPU. Without precise knowledge about the max mappable GPA, the guest must assume that 5-level paging is unsupported and thus restrict its mappings to the lower 48 bits. Advertise the max mappable GPA via KVM_GET_SUPPORTED_CPUID as userspace doesn't have easy access to whether or not 5-level paging is supported, and to play nice with userspace VMMs that reflect the supported CPUID directly into the guest. AMD's APM (3.35) defines GuestPhysBits (EAX[23:16]) as: Maximum guest physical address size in bits. This number applies only to guests using nested paging. When this field is zero, refer to the PhysAddrSize field for the maximum guest physical address size. Tom Lendacky confirmed that the purpose of GuestPhysBits is software use and KVM can use it as described above. Real hardware always returns zero. Leave GuestPhysBits as '0' when TDP is disabled in order to comply with the APM's statement that GuestPhysBits "applies only to guest using nested paging". As above, guest firmware will likely create suboptimal mappings, but that is a very minor issue and not a functional concern. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20240313125844.912415-3-kraxel@redhat.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
183bdd161c |
KVM: x86: Use KVM-governed feature framework to track "LAM enabled"
Use the governed feature framework to track if Linear Address Masking (LAM) is "enabled", i.e. if LAM can be used by the guest. Using the framework to avoid the relative expensive call guest_cpuid_has() during cr3 and vmexit handling paths for LAM. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-14-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
3098e6eca8 |
KVM: x86: Virtualize LAM for user pointer
Add support to allow guests to set the new CR3 control bits for Linear Address Masking (LAM) and add implementation to get untagged address for user pointers. LAM modifies the canonical check for 64-bit linear addresses, allowing software to use the masked/ignored address bits for metadata. Hardware masks off the metadata bits before using the linear addresses to access memory. LAM uses two new CR3 non-address bits, LAM_U48 (bit 62) and LAM_U57 (bit 61), to configure LAM for user pointers. LAM also changes VMENTER to allow both bits to be set in VMCS's HOST_CR3 and GUEST_CR3 for virtualization. When EPT is on, CR3 is not trapped by KVM and it's up to the guest to set any of the two LAM control bits. However, when EPT is off, the actual CR3 used by the guest is generated from the shadow MMU root which is different from the CR3 that is *set* by the guest, and KVM needs to manually apply any active control bits to VMCS's GUEST_CR3 based on the cached CR3 *seen* by the guest. KVM manually checks guest's CR3 to make sure it points to a valid guest physical address (i.e. to support smaller MAXPHYSADDR in the guest). Extend this check to allow the two LAM control bits to be set. After check, LAM bits of guest CR3 will be stripped off to extract guest physical address. In case of nested, for a guest which supports LAM, both VMCS12's HOST_CR3 and GUEST_CR3 are allowed to have the new LAM control bits set, i.e. when L0 enters L1 to emulate a VMEXIT from L2 to L1 or when L0 enters L2 directly. KVM also manually checks VMCS12's HOST_CR3 and GUEST_CR3 being valid physical address. Extend such check to allow the new LAM control bits too. Note, LAM doesn't have a global control bit to turn on/off LAM completely, but purely depends on hardware's CPUID to determine it can be enabled or not. That means, when EPT is on, even when KVM doesn't expose LAM to guest, the guest can still set LAM control bits in CR3 w/o causing problem. This is an unfortunate virtualization hole. KVM could choose to intercept CR3 in this case and inject fault but this would hurt performance when running a normal VM w/o LAM support. This is undesirable. Just choose to let the guest do such illegal thing as the worst case is guest being killed when KVM eventually find out such illegal behaviour and that the guest is misbehaving. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-12-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
1affe455d6 |
KVM: x86/mmu: Add helpers to return if KVM honors guest MTRRs
Add helpers to check if KVM honors guest MTRRs instead of open coding the logic in kvm_tdp_page_fault(). Future fixes and cleanups will also need to determine if KVM should honor guest MTRRs, e.g. for CR0.CD toggling and and non-coherent DMA transitions. Provide an inner helper, __kvm_mmu_honors_guest_mtrrs(), so that KVM can check if guest MTRRs were honored when stopping non-coherent DMA. Note, there is no need to explicitly check that TDP is enabled, KVM clears shadow_memtype_mask when TDP is disabled, i.e. it's non-zero if and only if EPT is enabled. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230714065006.20201-1-yan.y.zhao@intel.com Link: https://lore.kernel.org/r/20230714065043.20258-1-yan.y.zhao@intel.com [sean: squash into a one patch, drop explicit TDP check massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
932844462a |
KVM: x86/mmu: Don't bounce through page-track mechanism for guest PTEs
Don't use the generic page-track mechanism to handle writes to guest PTEs in KVM's MMU. KVM's MMU needs access to information that should not be exposed to external page-track users, e.g. KVM needs (for some definitions of "need") the vCPU to query the current paging mode, whereas external users, i.e. KVMGT, have no ties to the current vCPU and so should never need the vCPU. Moving away from the page-track mechanism will allow dropping use of the page-track mechanism for KVM's own MMU, and will also allow simplifying and cleaning up the page-track APIs. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
cf9f4c0eb1 |
KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faults
Refresh the MMU's snapshot of the vCPU's CR0.WP prior to checking for
permission faults when emulating a guest memory access and CR0.WP may be
guest owned. If the guest toggles only CR0.WP and triggers emulation of
a supervisor write, e.g. when KVM is emulating UMIP, KVM may consume a
stale CR0.WP, i.e. use stale protection bits metadata.
Note, KVM passes through CR0.WP if and only if EPT is enabled as CR0.WP
is part of the MMU role for legacy shadow paging, and SVM (NPT) doesn't
support per-bit interception controls for CR0. Don't bother checking for
EPT vs. NPT as the "old == new" check will always be true under NPT, i.e.
the only cost is the read of vcpu->arch.cr4 (SVM unconditionally grabs CR0
from the VMCB on VM-Exit).
Reported-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lkml.kernel.org/r/677169b4-051f-fcae-756b-9a3e1bb9f8fe%40grsecurity.net
Fixes:
|
|
|
|
607475cfa0 |
KVM: x86: Add helpers to query individual CR0/CR4 bits
Add helpers to check if a specific CR0/CR4 bit is set to avoid a plethora of implicit casts from the "unsigned long" return of kvm_read_cr*_bits(), and to make each caller's intent more obvious. Defer converting helpers that do truly ugly casts from "unsigned long" to "int", e.g. is_pse(), to a future commit so that their conversion is more isolated. Opportunistically drop the superfluous pcid_enabled from kvm_set_cr3(); the local variable is used only once, immediately after its declaration. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20230322045824.22970-2-binbin.wu@linux.intel.com [sean: move "obvious" conversions to this commit, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
1f98f2bd8e |
KVM: x86/mmu: Change tdp_mmu to a read-only parameter
Change tdp_mmu to a read-only parameter and drop the per-vm tdp_mmu_enabled. For 32-bit KVM, make tdp_mmu_enabled a macro that is always false so that the compiler can continue omitting cals to the TDP MMU. The TDP MMU was introduced in 5.10 and has been enabled by default since 5.15. At this point there are no known functionality gaps between the TDP MMU and the shadow MMU, and the TDP MMU uses less memory and scales better with the number of vCPUs. In other words, there is no good reason to disable the TDP MMU on a live system. Purposely do not drop tdp_mmu=N support (i.e. do not force 64-bit KVM to always use the TDP MMU) since tdp_mmu=N is still used to get test coverage of KVM's shadow MMU TDP support, which is used in 32-bit KVM. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
0c29397ac1 |
KVM: SVM: Disable SEV-ES support if MMIO caching is disable
Disable SEV-ES if MMIO caching is disabled as SEV-ES relies on MMIO SPTEs
generating #NPF(RSVD), which are reflected by the CPU into the guest as
a #VC. With SEV-ES, the untrusted host, a.k.a. KVM, doesn't have access
to the guest instruction stream or register state and so can't directly
emulate in response to a #NPF on an emulated MMIO GPA. Disabling MMIO
caching means guest accesses to emulated MMIO ranges cause #NPF(!PRESENT),
and those flavors of #NPF cause automatic VM-Exits, not #VC.
Adjust KVM's MMIO masks to account for the C-bit location prior to doing
SEV(-ES) setup, and document that dependency between adjusting the MMIO
SPTE mask and SEV(-ES) setup.
Fixes:
|
|
|
|
2ca3129e80 |
KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEs
Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs (PT64). SPTE and PT64 are _mostly_ the same, but the few differences are quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is very much specific to SPTEs. Opportunistically (and temporarily) move most guest macros into paging.h to clearly associate them with shadow paging, and to ensure that they're not used as of this commit. A future patch will eliminate them entirely. Sadly, PT32_LEVEL_BITS is left behind in mmu_internal.h because it's needed for the quadrant calculation in kvm_mmu_get_page(). The quadrant calculation is hot enough (when using shadow paging with 32-bit guests) that adding a per-context helper is undesirable, and burying the computation in paging_tmpl.h with a forward declaration isn't exactly an improvement. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
42c88ff893 |
KVM: x86/mmu: Dedup macros for computing various page table masks
Provide common helper macros to generate various masks, shifts, etc... for 32-bit vs. 64-bit page tables. Only the inputs differ, the actual calculations are identical. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
b3fcdb04a9 |
KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.h
Move a handful of one-off macros and helpers for 32-bit PSE paging into paging_tmpl.h and hide them behind "PTTYPE == 32". Under no circumstance should anything but 32-bit shadow paging care about PSE paging. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
3c5c32457d |
KVM: VMX: Include MKTME KeyID bits in shadow_zero_check
Intel MKTME KeyID bits (including Intel TDX private KeyID bits) should never be set to SPTE. Set shadow_me_value to 0 and shadow_me_mask to include all MKTME KeyID bits to include them to shadow_zero_check. Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <27bc10e97a3c0b58a4105ff9107448c190328239.1650363789.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
e54f1ff244 |
KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask
Intel Multi-Key Total Memory Encryption (MKTME) repurposes couple of high bits of physical address bits as 'KeyID' bits. Intel Trust Domain Extentions (TDX) further steals part of MKTME KeyID bits as TDX private KeyID bits. TDX private KeyID bits cannot be set in any mapping in the host kernel since they can only be accessed by software running inside a new CPU isolated mode. And unlike to AMD's SME, host kernel doesn't set any legacy MKTME KeyID bits to any mapping either. Therefore, it's not legitimate for KVM to set any KeyID bits in SPTE which maps guest memory. KVM maintains shadow_zero_check bits to represent which bits must be zero for SPTE which maps guest memory. MKTME KeyID bits should be set to shadow_zero_check. Currently, shadow_me_mask is used by AMD to set the sme_me_mask to SPTE, and shadow_me_shadow is excluded from shadow_zero_check. So initializing shadow_me_mask to represent all MKTME keyID bits doesn't work for VMX (as oppositely, they must be set to shadow_zero_check). Introduce a new 'shadow_me_value' to replace existing shadow_me_mask, and repurpose shadow_me_mask as 'all possible memory encryption bits'. The new schematic of them will be: - shadow_me_value: the memory encryption bit(s) that will be set to the SPTE (the original shadow_me_mask). - shadow_me_mask: all possible memory encryption bits (which is a super set of shadow_me_value). - For now, shadow_me_value is supposed to be set by SVM and VMX respectively, and it is a constant during KVM's life time. This perhaps doesn't fit MKTME but for now host kernel doesn't support it (and perhaps will never do). - Bits in shadow_me_mask are set to shadow_zero_check, except the bits in shadow_me_value. Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them. Replace shadow_me_mask with shadow_me_value in almost all code paths, except the one in PT64_PERM_MASK, which is used by need_remote_flush() to determine whether remote TLB flush is needed. This should still use shadow_me_mask as any encryption bit change should need a TLB flush. And for AMD, move initializing shadow_me_value/shadow_me_mask from kvm_mmu_reset_all_pte_masks() to svm_hardware_setup(). Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
8a009d5bca |
KVM: x86/mmu: Make all page fault handlers internal to the MMU
Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all of the page fault handling collateral that was in mmu.h purely for the async #PF handler into mmu_internal.h, where it belongs. This will allow kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to expose those enums outside of the MMU. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
a972e29c1d |
KVM: x86/mmu: replace shadow_root_level with root_role.level
root_role.level is always the same value as shadow_level: - it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu - it's the level argument when going through kvm_init_shadow_ept_mmu - it's assigned directly from new_role.base.level when going through shadow_mmu_init_context Remove the duplication and get the level directly from the role. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
86931ff720 |
KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
guest, possibly with help from userspace, manages to coerce KVM into
creating a SPTE for an "impossible" gfn, KVM will leak the associated
shadow pages (page tables):
WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Modules linked in: kvm_intel kvm irqbypass
CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2d0 [kvm]
kvm_vm_release+0x1d/0x30 [kvm]
__fput+0x82/0x240
task_work_run+0x5b/0x90
exit_to_user_mode_prepare+0xd2/0xe0
syscall_exit_to_user_mode+0x1d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
On bare metal, encountering an impossible gpa in the page fault path is
well and truly impossible, barring CPU bugs, as the CPU will signal #PF
during the gva=>gpa translation (or a similar failure when stuffing a
physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM
itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
of the underlying hardware, in which case the hardware will not fault on
the illegal-from-KVM's-perspective gpa.
Alternatively, KVM could continue allowing the dodgy behavior and simply
zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's
a (minor) waste of cycles, and more importantly, KVM can't reasonably
support impossible memslots when running on bare metal (or with an
accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if
KVM is running as a guest is not a safe option as the host isn't required
to announce itself to the guest in any way, e.g. doesn't need to set the
HYPERVISOR CPUID bit.
A second alternative to disallowing the memslot behavior would be to
disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That
restriction is undesirable as there are legitimate use cases for doing
so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
systems so that VMs can be migrated between hosts with different
MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
any reserved physical address bits).
The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
The memslot and TDP MMU code want an exclusive value, but the name
implies the returned value is inclusive, and the MMIO path needs an
inclusive check.
Fixes:
|
|
|
|
4f4aa80e3b |
KVM: X86: Handle implicit supervisor access with SMAP
There are two kinds of implicit supervisor access implicit supervisor access when CPL = 3 implicit supervisor access when CPL < 3 Current permission_fault() handles only the first kind for SMAP. But if the access is implicit when SMAP is on, data may not be read nor write from any user-mode address regardless the current CPL. So the second kind should be also supported. The first kind can be detect via CPL and access mode: if it is supervisor access and CPL = 3, it must be implicit supervisor access. But it is not possible to detect the second kind without extra information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS into @access. This extra information also works for the first kind, so the logic is changed to use this information for both cases. The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48 which is in the most significant 16 bits of u64 and less likely to be forced to change due to future hardware uses it. This patch removes the call to ->get_cpl() for access mode is determined by @access. Not only does it reduce a function call, but also remove confusions when the permission is checked for nested TDP. The nested TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing on it. The original code works just because it is always user walk for NPT and SMAP fault is not set for EPT in update_permission_bitmask. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
8873c1434f |
KVM: X86: Rename variable smap to not_smap in permission_fault()
Comments above the variable says the bit is set when SMAP is overridden or the same meaning in update_permission_bitmask(): it is not subjected to SMAP restriction. Renaming it to reflect the negative implication and make the code better readability. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
5b22bbe717 |
KVM: X86: Change the type of access u32 to u64
Change the type of access u32 to u64 for FNAME(walk_addr) and ->gva_to_gpa(). The kinds of accesses are usually combinations of UWX, and VMX/SVM's nested paging adds a new factor of access: is it an access for a guest page table or for a final guest physical address. And SMAP relies a factor for supervisor access: explicit or implicit. So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include all these information to do the walk. Although @access(u32) has enough bits to encode all the kinds, this patch extends it to u64: o Extra bits will be in the higher 32 bits, so that we can easily obtain the traditional access mode (UWX) by converting it to u32. o Reuse the value for the access kind defined by SVM's nested paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as @error_code in kvm_handle_page_fault(). Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
527d5cd7ee |
KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped
Zap only obsolete roots when responding to zapping a single root shadow page. Because KVM keeps root_count elevated when stuffing a previous root into its PGD cache, shadowing a 64-bit guest means that zapping any root causes all vCPUs to reload all roots, even if their current root is not affected by the zap. For many kernels, zapping a single root is a frequent operation, e.g. in Linux it happens whenever an mm is dropped, e.g. process exits, etc... Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220225182248.3812651-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
b9e5603c2a |
KVM: x86: use struct kvm_mmu_root_info for mmu->root
The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info. Use the struct to have more consistency between mmu->root and mmu->prev_roots. The patch is entirely search and replace except for cached_root_available, which does not need a temporary struct kvm_mmu_root_info anymore. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
d617429936 |
KVM: x86: Reinitialize context if host userspace toggles EFER.LME
While the guest runs, EFER.LME cannot change unless CR0.PG is clear, and
therefore EFER.NX is the only bit that can affect the MMU role. However,
set_efer accepts a host-initiated change to EFER.LME even with CR0.PG=1.
In that case, the MMU has to be reset.
Fixes:
|