Commit Graph

235 Commits

Author SHA1 Message Date
Paolo Bonzini 4e02d4f973 KVM SVM changes for 6.16:
- Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
    fix a race between AP destroy and VMRUN.
 
  - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
 
  - Add support for ALLOWED_SEV_FEATURES.
 
  - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
 
  - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
    that utilize those bits.
 
  - Don't account temporary allocations in sev_send_update_data().
 
  - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmwAACgkQOlYIJqCj
 N/1pHw//edW/x838POMeeCN8j39NBKErW9yZoQLhMbzogttRvfoba+xYY9zXyRFx
 8AXB8+2iLtb7pXUohc0eYN0mNqgD0SnoMLqGfn7nrkJafJSUAJHAoZn1Mdom1M1y
 jHvBPbHCMMsgdLV8wpDRqCNWTH+d5W0kcN5WjKwOswVLj1rybVfK7bSLMhvkk1e5
 RrOR4Ewf95/Ag2b36L4SvS1yG9fTClmKeGArMXhEXjy2INVSpBYyZMjVtjHiNzU9
 TjtB2RSM45O+Zl0T2fZdVW8LFhA6kVeL1v+Oo433CjOQE0LQff3Vl14GCANIlPJU
 PiWN/RIKdWkuxStIP3vw02eHzONCcg2GnNHzEyKQ1xW8lmrwzVRdXZzVsc2Dmowb
 7qGykBQ+wzoE0sMeZPA0k/QOSqg2vGxUQHjR7720loLV9m9Tu/mJnS9e179GJKgI
 e1ArSLwKmHpjwKZqU44IQVTZaxSC4Sg2kI670i21ChPgx8+oVkA6I0LFQXymx7uS
 2lbH+ovTlJSlP9fbaJhMwAU2wpSHAyXif/HPjdw2LTH3NdgXzfEnZfTlAWiP65LQ
 hnz5HvmUalW3x9kmzRmeDIAkDnAXhyt3ZQMvbNzqlO5AfS+Tqh4Ed5EFP3IrQAzK
 HQ+Gi0ip+B84t9Tbi6rfQwzTZEbSSOfYksC7TXqRGhNo/DvHumE=
 =k6rK
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-svm-6.16' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.16:

 - Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
   fix a race between AP destroy and VMRUN.

 - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.

 - Add support for ALLOWED_SEV_FEATURES.

 - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.

 - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
   that utilize those bits.

 - Don't account temporary allocations in sev_send_update_data().

 - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
2025-05-27 12:15:49 -04:00
Manali Shukla 89f9edf4c6 KVM: SVM: Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs
Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs with Bus Lock
Threshold, which is close enough to VMX's Bus Lock Detection VM-Exit to
allow reusing KVM_CAP_X86_BUS_LOCK_EXIT.

The biggest difference between the two features is that Threshold is
fault-like, whereas Detection is trap-like.  To allow the guest to make
forward progress, Threshold provides a per-VMCB counter which is
decremented every time a bus lock occurs, and a VM-Exit is triggered if
and only if the counter is '0'.

To provide Detection-like semantics, initialize the counter to '0', i.e.
exit on every bus lock, and when re-executing the guilty instruction, set
the counter to '1' to effectively step past the instruction.

Note, in the unlikely scenario that re-executing the instruction doesn't
trigger a bus lock, e.g. because the guest has changed memory types or
patched the guilty instruction, the bus lock counter will be left at '1',
i.e. the guest will be able to do a bus lock on a different instruction.
In a perfect world, KVM would ensure the counter is '0' if the guest has
made forward progress, e.g. if RIP has changed.  But trying to close that
hole would incur non-trivial complexity, for marginal benefit; the intent
of KVM_CAP_X86_BUS_LOCK_EXIT is to allow userspace rate-limit bus locks,
not to allow for precise detection of problematic guest code.  And, it's
simply not feasible to fully close the hole, e.g. if an interrupt arrives
before the original instruction can re-execute, the guest could step past
a different bus lock.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
Link: https://lore.kernel.org/r/20250502050346.14274-5-manali.shukla@amd.com
[sean: fix typo in comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-19 11:05:10 -07:00
Yosry Ahmed 656d9624bd KVM: x86: Generalize IBRS virtualization on emulated VM-exit
Commit 2e7eab8142 ("KVM: VMX: Execute IBPB on emulated VM-exit when
guest has IBRS") added an IBPB in the emulated VM-exit path on Intel to
properly virtualize IBRS by providing separate predictor modes for L1
and L2.

AMD requires similar handling, except when IbrsSameMode is enumerated by
the host CPU (which is the case on most/all AMD CPUs). With
IbrsSameMode, hardware IBRS is sufficient and no extra handling is
needed from KVM.

Generalize the handling in nested_vmx_vmexit() by moving it into a
generic function, add the AMD handling, and use it in
nested_svm_vmexit() too. The main reason for using a generic function is
to have a single place to park the huge comment about virtualizing IBRS.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250221163352.3818347-4-yosry.ahmed@linux.dev
[sean: use kvm_nested_vmexit_handle_spec_ctrl() for the helper]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 11:18:32 -07:00
Paolo Bonzini 4d9a677596 KVM x86 misc changes for 6.15:
- Fix a bug in PIC emulation that caused KVM to emit a spurious KVM_REQ_EVENT.
 
  - Add a helper to consolidate handling of mp_state transitions, and use it to
    clear pv_unhalted whenever a vCPU is made RUNNABLE.
 
  - Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
    coalesce updates when multiple pieces of vCPU state are changing, e.g. as
    part of a nested transition.
 
  - Fix a variety of nested emulation bugs, and add VMX support for synthesizing
    nested VM-Exit on interception (instead of injecting #UD into L2).
 
  - Drop "support" for PV Async #PF with proctected guests without SEND_ALWAYS,
    as KVM can't get the current CPL.
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZhoYACgkQOlYIJqCj
 N/1oSBAAlhsZzv4m6rHtWACjGsSBlAE5WM7HrpnwjOyMW+Desc0WLL4L6qAlxgT7
 4ZkBvQ4zsiyUIdLv1XARvkBYHTUkcgG8fQSSJQ6grZit5OcFcMgWafvQrJYI6428
 TyzF/x6t0gj8CjDluA6jM/eL5HhWZZDOIg0Wma+pyYpW+7y2tkphXyYyOQPBGwv3
 geRUetbDHHXjf042k/8f1j1vjzrNNvAg3YyNyx1YbdU9XKsn5D+SeUW2eVfYk8G7
 5QsCOGvUYcbbjrR8kbCZKexvoH6Np9J6YKDe4R9R2yDzgs/96qz6xkYTGCVkHA1y
 uursKqRHgbXBxzxa+ban073laT7Qt3S01Gd9bJW3IO7hzG89gl4qfX7fap8T9Yc2
 yeBTYIgInpyx+NCdZ2Z/++BzPagBGfa77gFX/eIkmsVA9LWYi9CI3FSjtr/czvWm
 a4tfMPvTVBjsBQQ7t/lNksrq0O51lbb3iqqv3ToQpDOOqCWuMEU5xcihhPRr5NSZ
 dX4o/jIDhCV8EyXdtASyqMlYBXcuC45ojEZn1elh0QogzYAdSGQ2bIDyxuBtA//k
 kSbi+E4GB64jVfBWUyK2QeLOBnBkH7mh6Cg5UYr1Ln9Sm6l8vrcxhcbnchiWxXMI
 WCK7BJwI2HojBVpEZ04jMkjHvg36uSfjOzmMLT5yPXfFNebsGmA=
 =8SGF
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.15' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.15:

 - Fix a bug in PIC emulation that caused KVM to emit a spurious KVM_REQ_EVENT.

 - Add a helper to consolidate handling of mp_state transitions, and use it to
   clear pv_unhalted whenever a vCPU is made RUNNABLE.

 - Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
   coalesce updates when multiple pieces of vCPU state are changing, e.g. as
   part of a nested transition.

 - Fix a variety of nested emulation bugs, and add VMX support for synthesizing
   nested VM-Exit on interception (instead of injecting #UD into L2).

 - Drop "support" for PV Async #PF with proctected guests without SEND_ALWAYS,
   as KVM can't get the current CPL.

 - Misc cleanups
2025-03-19 09:04:48 -04:00
Jim Mattson c9e5f3fa90 KVM: x86: Introduce kvm_set_mp_state()
Replace all open-coded assignments to vcpu->arch.mp_state with calls
to a new helper, kvm_set_mp_state(), to centralize all changes to
mp_state.

No functional change intended.

Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250113200150.487409-2-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:16:27 -08:00
Sean Christopherson 46d6c6f3ef KVM: nSVM: Enter guest mode before initializing nested NPT MMU
When preparing vmcb02 for nested VMRUN (or state restore), "enter" guest
mode prior to initializing the MMU for nested NPT so that guest_mode is
set in the MMU's role.  KVM's model is that all L2 MMUs are tagged with
guest_mode, as the behavior of hypervisor MMUs tends to be significantly
different than kernel MMUs.

Practically speaking, the bug is relatively benign, as KVM only directly
queries role.guest_mode in kvm_mmu_free_guest_mode_roots() and
kvm_mmu_page_ad_need_write_protect(), which SVM doesn't use, and in paths
that are optimizations (mmu_page_zap_pte() and
shadow_mmu_try_split_huge_pages()).

And while the role is incorprated into shadow page usage, because nested
NPT requires KVM to be using NPT for L1, reusing shadow pages across L1
and L2 is impossible as L1 MMUs will always have direct=1, while L2 MMUs
will have direct=0.

Hoist the TLB processing and setting of HF_GUEST_MASK to the beginning
of the flow instead of forcing guest_mode in the MMU, as nothing in
nested_vmcb02_prepare_control() between the old and new locations touches
TLB flush requests or HF_GUEST_MASK, i.e. there's no reason to present
inconsistent vCPU state to the MMU.

Fixes: 69cb877487 ("KVM: nSVM: move MMU setup to nested_prepare_vmcb_control")
Cc: stable@vger.kernel.org
Reported-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/r/20250130010825.220346-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 08:57:55 -08:00
Sean Christopherson 2c5e168e5c KVM: x86: Rename "governed features" helpers to use "guest_cpu_cap"
As the first step toward replacing KVM's so-called "governed features"
framework with a more comprehensive, less poorly named implementation,
replace the "kvm_governed_feature" function prefix with "guest_cpu_cap"
and rename guest_can_use() to guest_cpu_cap_has().

The "guest_cpu_cap" naming scheme mirrors that of "kvm_cpu_cap", and
provides a more clear distinction between guest capabilities, which are
KVM controlled (heh, or one might say "governed"), and guest CPUID, which
with few exceptions is fully userspace controlled.

Opportunistically rewrite the comment about XSS passthrough for SEV-ES
guests to avoid referencing so many functions, as such comments are prone
to becoming stale (case in point...).

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:03 -08:00
Sean Christopherson 365e319208 KVM: Pass in write/dirty to kvm_vcpu_map(), not kvm_vcpu_unmap()
Now that all kvm_vcpu_{,un}map() users pass "true" for @dirty, have them
pass "true" as a @writable param to kvm_vcpu_map(), and thus create a
read-only mapping when possible.

Note, creating read-only mappings can be theoretically slower, as they
don't play nice with fast GUP due to the need to break CoW before mapping
the underlying PFN.  But practically speaking, creating a mapping isn't
a super hot path, and getting a writable mapping for reading is weird and
confusing.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-34-seanjc@google.com>
2024-10-25 12:59:07 -04:00
Sean Christopherson f559b2e9c5 KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory
Ignore nCR3[4:0] when loading PDPTEs from memory for nested SVM, as bits
4:0 of CR3 are ignored when PAE paging is used, and thus VMRUN doesn't
enforce 32-byte alignment of nCR3.

In the absolute worst case scenario, failure to ignore bits 4:0 can result
in an out-of-bounds read, e.g. if the target page is at the end of a
memslot, and the VMM isn't using guard pages.

Per the APM:

  The CR3 register points to the base address of the page-directory-pointer
  table. The page-directory-pointer table is aligned on a 32-byte boundary,
  with the low 5 address bits 4:0 assumed to be 0.

And the SDM's much more explicit:

  4:0    Ignored

Note, KVM gets this right when loading PDPTRs, it's only the nSVM flow
that is broken.

Fixes: e4e517b4be ("KVM: MMU: Do not unconditionally read PDPTE from guest memory")
Reported-by: Kirk Swidowski <swidowski@google.com>
Cc: Andy Nguyen <theflow@google.com>
Cc: 3pvd <3pvd@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241009140838.1036226-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20 07:31:06 -04:00
Yongqiang Liu c501062bb2 KVM: SVM: Remove unnecessary GFP_KERNEL_ACCOUNT in svm_set_nested_state()
The fixed size temporary variables vmcb_control_area and vmcb_save_area
allocated in svm_set_nested_state() are released when the function exits.
Meanwhile, svm_set_nested_state() also have vcpu mutex held to avoid
massive concurrency allocation, so we don't need to set GFP_KERNEL_ACCOUNT.

Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Link: https://lore.kernel.org/r/20240821112737.3649937-1-liuyongqiang13@huawei.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:35:09 -07:00
Li RongQing f51af34686 KVM: SVM: remove useless input parameter in snp_safe_alloc_page
The input parameter 'vcpu' in snp_safe_alloc_page is not used.
Therefore, remove it.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240520120858.13117-2-lirongqing@baidu.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 16:14:11 -07:00
Brijesh Singh 75253db41a KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
Implement a workaround for an SNP erratum where the CPU will incorrectly
signal an RMP violation #PF if a hugepage (2MB or 1GB) collides with the
RMP entry of a VMCB, VMSA or AVIC backing page.

When SEV-SNP is globally enabled, the CPU marks the VMCB, VMSA, and AVIC
backing pages as "in-use" via a reserved bit in the corresponding RMP
entry after a successful VMRUN. This is done for _all_ VMs, not just
SNP-Active VMs.

If the hypervisor accesses an in-use page through a writable
translation, the CPU will throw an RMP violation #PF. On early SNP
hardware, if an in-use page is 2MB-aligned and software accesses any
part of the associated 2MB region with a hugepage, the CPU will
incorrectly treat the entire 2MB region as in-use and signal a an RMP
violation #PF.

To avoid this, the recommendation is to not use a 2MB-aligned page for
the VMCB, VMSA or AVIC pages. Add a generic allocator that will ensure
that the page returned is not 2MB-aligned and is safe to be used when
SEV-SNP is enabled. Also implement similar handling for the VMCB/VMSA
pages of nested guests.

  [ mdr: Squash in nested guest handling from Ashish, commit msg fixups. ]

Reported-by: Alper Gun <alpergun@google.com> # for nested VMSA case
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Marc Orr <marcorr@google.com>
Signed-off-by: Marc Orr <marcorr@google.com>
Co-developed-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20240126041126.1927228-22-michael.roth@amd.com
2024-01-29 20:34:19 +01:00
Paolo Bonzini 8c9244af4d KVM SVM changes for 6.8:
- Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL.
 
  - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always
    flushes on nested transitions, i.e. always satisfies flush requests.  This
    allows running bleeding edge versions of VMware Workstation on top of KVM.
 
  - Sanity check that the CPU supports flush-by-ASID when enabling SEV support.
 
  - Fix a benign NMI virtualization bug where KVM would unnecessarily intercept
    IRET when manually injecting an NMI, e.g. when KVM pends an NMI and injects
    a second, "simultaneous" NMI.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW/9ESHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL50PcP/Rbdf/68/g1m4JQYl8rf2h7BD4PGE5yw
 ZpeXSkeZmzyRYPiJjJaZLcvvezyusIPoGRfmsKgj2nI7LCSVyHDmaHVp2h854Xz8
 kSWmK5znBYDx+vUqhIKEN2nwFNYSUaSqcRZWvoXi0BzalWlwCgK2yu8xeRDUhn4B
 +gDKlqZuJMYY1J3V8e64ZkvdxRHsw0WyvD0Ns4EgCe/2v5V9gc08a7vuSq80EtaE
 yf0cZmubDwuV96LfZnDkZnZpm4C1GNeLxAN1wlj7J6fAvrCAggetDtkJtWCd8yd0
 0ZtfjBOMVsCDWQsYXbwGGKdeynzATxc354k6yHBIO863z+M5MtEMKlFNCclrakMO
 RHfofZHhL+hn3ACESJPcse3ei0VbV28cL2NFdstUEukvZQoacIH9fz7+1GuWqBpv
 Vb9UJDde029HHsGf+n8LtfQsqV7/8aLV+/4bpiPOHQU+tzAJVxni/H9nJ+7V0lxd
 NfhWME1lEsQWxpBpcXcVB7D7+ri1Wd9eB4IR9xc/VqgLE1Nj4kIZqtOJF9lbY3wk
 +H/Ze/MNNg6E9yIErSIv7sWdrvoOPYWZdGCT8Fhm4OILAsDEO96z7WoVF0eWCdJ1
 xDIFGXNFuyIpVOqk/JZE/Lv5U1C4xhyFQCmk6gXDgepnTn4d8gx3S79iUfXD32gE
 GqAjV9Wwmz+o
 =mXEf
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-svm-6.8' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.8:

 - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL.

 - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always
   flushes on nested transitions, i.e. always satisfies flush requests.  This
   allows running bleeding edge versions of VMware Workstation on top of KVM.

 - Sanity check that the CPU supports flush-by-ASID when enabling SEV support.

 - Fix a benign NMI virtualization bug where KVM would unnecessarily intercept
   IRET when manually injecting an NMI, e.g. when KVM pends an NMI and injects
   a second, "simultaneous" NMI.
2024-01-08 08:10:16 -05:00
Paolo Bonzini 8ecb10bcbf KVM x86 support for virtualizing Linear Address Masking (LAM)
Add KVM support for Linear Address Masking (LAM).  LAM tweaks the canonicality
 checks for most virtual address usage in 64-bit mode, such that only the most
 significant bit of the untranslated address bits must match the polarity of the
 last translated address bit.  This allows software to use ignored, untranslated
 address bits for metadata, e.g. to efficiently tag pointers for address
 sanitization.
 
 LAM can be enabled separately for user pointers and supervisor pointers, and
 for userspace LAM can be select between 48-bit and 57-bit masking
 
  - 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15.
  - 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6.
 
 For user pointers, LAM enabling utilizes two previously-reserved high bits from
 CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and
 61 respectively.  Note, if LAM_57 is set, LAM_U48 is ignored, i.e.:
 
  - CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers
  - CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers
  - CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers
 
 For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with
 the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.:
 
  - CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers
  - CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers
  - CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers
 
 The modified LAM canonicality checks:
  - LAM_S48                : [ 1 ][ metadata ][ 1 ]
                               63               47
  - LAM_U48                : [ 0 ][ metadata ][ 0 ]
                               63               47
  - LAM_S57                : [ 1 ][ metadata ][ 1 ]
                               63               56
  - LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
                               63               56
  - LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
                               63               56..47
 
 The bulk of KVM support for LAM is to emulate LAM's modified canonicality
 checks.  The approach taken by KVM is to "fill" the metadata bits using the
 highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended
 to bits 62:48.  The most significant bit, 63, is *not* modified, i.e. its value
 from the raw, untagged virtual address is kept for the canonicality check. This
 untagging allows
 
 Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM
 touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26],
 enabling via CR3 and CR4 bits, etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW+k4SHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5KygQAKTSEmfdox6MSYzGVzAVHBD/8oSTZAGf
 4l96Np3sZiX0ujWP7aW1GaIdGL27Yf1bQrKIrODR4xepaosVPpoZZbnLFQ4Jm16D
 OuwEQL06LV91Lv5XuPkNdq3nMVi1X3wjiKLvP451oCGv8JdxsjXSlFr8ZmDoCfmS
 NCjkPyitdK+/xOMY5WcrkHD/6VMMiM+5A+CrG7DkaTaqBJQSUXG1NvTKhhxey6Rq
 OZv0GPv7QVMhHv1NX0Y3LyoiGyWXAoFRnbk/N3yVBOnXcpJ+HBwWiNLRpxmZOQj/
 CTo0VvUH/ZkN6zGvAb75/9puFHNliA/QCW1hp+ShXnNdn1eNdS7nhhPrzVqtCTy2
 QeNWM/z5v9Wa1norPqDxzqWlh2bWW8JU0soX7Q+quN0d7YjVvmmUluL3Lw/V2zmb
 gFM2ZY43QHlmLVic4sSraK1LEcYFzjexzpTLhee2gNp+l2y0D0c1/hXukCk6YNUM
 gad9DH8P9d7By7Eyr0ZaPHSJbuBW1PqZhot5gCg9nCn4pnT2/y7wXsLj6VAw8gdr
 dWNu2MZWDuH0/d4aKfw2veAECbHUK2daok4ufPDj5nYLVVWCs4HU0U7HlYL2CX7/
 TdWOCwtpFtKoN1NHz8mpET7xldxLPnFkByL+SxypTZurAZXoSnEG71IbO5pJ2iIf
 wHQkXgM+XimA
 =qUZ2
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEAD

KVM x86 support for virtualizing Linear Address Masking (LAM)

Add KVM support for Linear Address Masking (LAM).  LAM tweaks the canonicality
checks for most virtual address usage in 64-bit mode, such that only the most
significant bit of the untranslated address bits must match the polarity of the
last translated address bit.  This allows software to use ignored, untranslated
address bits for metadata, e.g. to efficiently tag pointers for address
sanitization.

LAM can be enabled separately for user pointers and supervisor pointers, and
for userspace LAM can be select between 48-bit and 57-bit masking

 - 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15.
 - 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6.

For user pointers, LAM enabling utilizes two previously-reserved high bits from
CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and
61 respectively.  Note, if LAM_57 is set, LAM_U48 is ignored, i.e.:

 - CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers
 - CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers
 - CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers

For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with
the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.:

 - CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers
 - CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers
 - CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers

The modified LAM canonicality checks:
 - LAM_S48                : [ 1 ][ metadata ][ 1 ]
                              63               47
 - LAM_U48                : [ 0 ][ metadata ][ 0 ]
                              63               47
 - LAM_S57                : [ 1 ][ metadata ][ 1 ]
                              63               56
 - LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
                              63               56
 - LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
                              63               56..47

The bulk of KVM support for LAM is to emulate LAM's modified canonicality
checks.  The approach taken by KVM is to "fill" the metadata bits using the
highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended
to bits 62:48.  The most significant bit, 63, is *not* modified, i.e. its value
from the raw, untagged virtual address is kept for the canonicality check. This
untagging allows

Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM
touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26],
enabling via CR3 and CR4 bits, etc.
2024-01-08 08:10:12 -05:00
Vitaly Kuznetsov 017a99a966 KVM: nSVM: Hide more stuff under CONFIG_KVM_HYPERV/CONFIG_HYPERV
'struct hv_vmcb_enlightenments' in VMCB only make sense when either
CONFIG_KVM_HYPERV or CONFIG_HYPERV is enabled.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-17-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:35:26 -08:00
Vitaly Kuznetsov af9d544a45 KVM: x86: Introduce helper to handle Hyper-V paravirt TLB flush requests
As a preparation to making Hyper-V emulation optional, introduce a helper
to handle pending KVM_REQ_HV_TLB_FLUSH requests.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-8-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:23 -08:00
Sean Christopherson a484755ab2 Revert "nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB"
Revert KVM's made-up consistency check on SVM's TLB control.  The APM says
that unsupported encodings are reserved, but the APM doesn't state that
VMRUN checks for a supported encoding.  Unless something is called out
in "Canonicalization and Consistency Checks" or listed as MBZ (Must Be
Zero), AMD behavior is typically to let software shoot itself in the foot.

This reverts commit 174a921b69.

Fixes: 174a921b69 ("nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB")
Reported-by: Stefan Sterz <s.sterz@proxmox.com>
Closes: https://lkml.kernel.org/r/b9915c9c-4cf6-051a-2d91-44cc6380f455%40proxmox.com
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231018194104.1896415-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:47:43 -08:00
Binbin Wu 2c49db455e KVM: x86: Add & use kvm_vcpu_is_legal_cr3() to check CR3's legality
Add and use kvm_vcpu_is_legal_cr3() to check CR3's legality to provide
a clear distinction between CR3 and GPA checks.  This will allow exempting
bits from kvm_vcpu_is_legal_cr3() without affecting general GPA checks,
e.g. for upcoming features that will use high bits in CR3 for feature
enabling.

No functional change intended.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-7-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28 17:54:04 -08:00
Maxim Levitsky 3fdc6087df x86: KVM: SVM: refresh AVIC inhibition in svm_leave_nested()
svm_leave_nested() similar to a nested VM exit, get the vCPU out of nested
mode and thus should end the local inhibition of AVIC on this vCPU.

Failure to do so, can lead to hangs on guest reboot.

Raise the KVM_REQ_APICV_UPDATE request to refresh the AVIC state of the
current vCPU in this case.

Fixes: f44509f849 ("KVM: x86: SVM: allow AVIC to co-exist with a nested guest running")
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230928173354.217464-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12 11:09:00 -04:00
Sean Christopherson b89456aee7 KVM: nSVM: Use KVM-governed feature framework to track "vGIF enabled"
Track "virtual GIF exposed to L1" via a governed feature flag instead of
using a dedicated bit/flag in vcpu_svm.

Note, checking KVM's capabilities instead of the "vgif" param means that
the code isn't strictly equivalent, as vgif_enabled could have been set
if nested=false where as that the governed feature cannot.  But that's a
glorified nop as the feature/flag is consumed only by paths that are

Link: https://lore.kernel.org/r/20230815203653.519297-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:43:31 -07:00
Sean Christopherson 59d67fc1f0 KVM: nSVM: Use KVM-governed feature framework to track "Pause Filter enabled"
Track "Pause Filtering is exposed to L1" via governed feature flags
instead of using dedicated bits/flags in vcpu_svm.

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:43:30 -07:00
Sean Christopherson e183d17ac3 KVM: nSVM: Use KVM-governed feature framework to track "LBRv enabled"
Track "LBR virtualization exposed to L1" via a governed feature flag
instead of using a dedicated bit/flag in vcpu_svm.

Note, checking KVM's capabilities instead of the "lbrv" param means that
the code isn't strictly equivalent, as lbrv_enabled could have been set
if nested=false where as that the governed feature cannot.  But that's a
glorified nop as the feature/flag is consumed only by paths that are
gated by nSVM being enabled.

Link: https://lore.kernel.org/r/20230815203653.519297-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:43:30 -07:00
Sean Christopherson 4d2a1560ff KVM: nSVM: Use KVM-governed feature framework to track "vVM{SAVE,LOAD} enabled"
Track "virtual VMSAVE/VMLOAD exposed to L1" via a governed feature flag
instead of using a dedicated bit/flag in vcpu_svm.

Opportunistically add a comment explaining why KVM disallows virtual
VMLOAD/VMSAVE when the vCPU model is Intel.

No functional change intended.

Link: https://lore.kernel.org/r/20230815203653.519297-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:43:29 -07:00
Sean Christopherson 4365a45571 KVM: nSVM: Use KVM-governed feature framework to track "TSC scaling enabled"
Track "TSC scaling exposed to L1" via a governed feature flag instead of
using a dedicated bit/flag in vcpu_svm.

Note, this fixes a benign bug where KVM would mark TSC scaling as exposed
to L1 even if overall nested SVM supported is disabled, i.e. KVM would let
L1 write MSR_AMD64_TSC_RATIO even when KVM didn't advertise TSCRATEMSR
support to userspace.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:43:28 -07:00
Sean Christopherson 7a6a6a3bf5 KVM: nSVM: Use KVM-governed feature framework to track "NRIPS enabled"
Track "NRIPS exposed to L1" via a governed feature flag instead of using
a dedicated bit/flag in vcpu_svm.

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:43:28 -07:00
Sean Christopherson 2d63699099 KVM: x86: Always write vCPU's current TSC offset/ratio in vendor hooks
Drop the @offset and @multiplier params from the kvm_x86_ops hooks for
propagating TSC offsets/multipliers into hardware, and instead have the
vendor implementations pull the information directly from the vCPU
structure.  The respective vCPU fields _must_ be written at the same
time in order to maintain consistent state, i.e. it's not random luck
that the value passed in by all callers is grabbed from the vCPU.

Explicitly grabbing the value from the vCPU field in SVM's implementation
in particular will allow for additional cleanup without introducing even
more subtle dependencies.  Specifically, SVM can skip the WRMSR if guest
state isn't loaded, i.e. svm_prepare_switch_to_guest() will load the
correct value for the vCPU prior to entering the guest.

This also reconciles KVM's handling of related values that are stored in
the vCPU, as svm_write_tsc_offset() already assumes/requires the caller
to have updated l1_tsc_offset.

Link: https://lore.kernel.org/r/20230729011608.1065019-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 17:16:29 -07:00
Sean Christopherson c0dc39bd2c KVM: nSVM: Use the "outer" helper for writing multiplier to MSR_AMD64_TSC_RATIO
When emulating nested SVM transitions, use the outer helper for writing
the TSC multiplier for L2.  Using the inner helper only for one-off cases,
i.e. for paths where KVM is NOT emulating or modifying vCPU state, will
allow for multiple cleanups:

 - Explicitly disabling preemption only in the outer helper
 - Getting the multiplier from the vCPU field in the outer helper
 - Skipping the WRMSR in the outer helper if guest state isn't loaded

Opportunistically delete an extra newline.

No functional change intended.

Link: https://lore.kernel.org/r/20230729011608.1065019-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 17:16:28 -07:00
Sean Christopherson 0c94e24684 KVM: nSVM: Load L1's TSC multiplier based on L1 state, not L2 state
When emulating nested VM-Exit, load L1's TSC multiplier if L1's desired
ratio doesn't match the current ratio, not if the ratio L1 is using for
L2 diverges from the default.  Functionally, the end result is the same
as KVM will run L2 with L1's multiplier if L2's multiplier is the default,
i.e. checking that L1's multiplier is loaded is equivalent to checking if
L2 has a non-default multiplier.

However, the assertion that TSC scaling is exposed to L1 is flawed, as
userspace can trigger the WARN at will by writing the MSR and then
updating guest CPUID to hide the feature (modifying guest CPUID is
allowed anytime before KVM_RUN).  E.g. hacking KVM's state_test
selftest to do

                vcpu_set_msr(vcpu, MSR_AMD64_TSC_RATIO, 0);
                vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_TSCRATEMSR);

after restoring state in a new VM+vCPU yields an endless supply of:

  ------------[ cut here ]------------
  WARNING: CPU: 10 PID: 206939 at arch/x86/kvm/svm/nested.c:1105
           nested_svm_vmexit+0x6af/0x720 [kvm_amd]
  Call Trace:
   nested_svm_exit_handled+0x102/0x1f0 [kvm_amd]
   svm_handle_exit+0xb9/0x180 [kvm_amd]
   kvm_arch_vcpu_ioctl_run+0x1eab/0x2570 [kvm]
   kvm_vcpu_ioctl+0x4c9/0x5b0 [kvm]
   ? trace_hardirqs_off+0x4d/0xa0
   __se_sys_ioctl+0x7a/0xc0
   __x64_sys_ioctl+0x21/0x30
   do_syscall_64+0x41/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

Unlike the nested VMRUN path, hoisting the svm->tsc_scaling_enabled check
into the if-statement is wrong as KVM needs to ensure L1's multiplier is
loaded in the above scenario.   Alternatively, the WARN_ON() could simply
be deleted, but that would make KVM's behavior even more subtle, e.g. it's
not immediately obvious why it's safe to write MSR_AMD64_TSC_RATIO when
checking only tsc_ratio_msr.

Fixes: 5228eb96a4 ("KVM: x86: nSVM: implement nested TSC scaling")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230729011608.1065019-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 17:16:27 -07:00
Sean Christopherson 7cafe9b8e2 KVM: nSVM: Check instead of asserting on nested TSC scaling support
Check for nested TSC scaling support on nested SVM VMRUN instead of
asserting that TSC scaling is exposed to L1 if L1's MSR_AMD64_TSC_RATIO
has diverged from KVM's default.  Userspace can trigger the WARN at will
by writing the MSR and then updating guest CPUID to hide the feature
(modifying guest CPUID is allowed anytime before KVM_RUN).  E.g. hacking
KVM's state_test selftest to do

		vcpu_set_msr(vcpu, MSR_AMD64_TSC_RATIO, 0);
		vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_TSCRATEMSR);

after restoring state in a new VM+vCPU yields an endless supply of:

  ------------[ cut here ]------------
  WARNING: CPU: 164 PID: 62565 at arch/x86/kvm/svm/nested.c:699
           nested_vmcb02_prepare_control+0x3d6/0x3f0 [kvm_amd]
  Call Trace:
   <TASK>
   enter_svm_guest_mode+0x114/0x560 [kvm_amd]
   nested_svm_vmrun+0x260/0x330 [kvm_amd]
   vmrun_interception+0x29/0x30 [kvm_amd]
   svm_invoke_exit_handler+0x35/0x100 [kvm_amd]
   svm_handle_exit+0xe7/0x180 [kvm_amd]
   kvm_arch_vcpu_ioctl_run+0x1eab/0x2570 [kvm]
   kvm_vcpu_ioctl+0x4c9/0x5b0 [kvm]
   __se_sys_ioctl+0x7a/0xc0
   __x64_sys_ioctl+0x21/0x30
   do_syscall_64+0x41/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd
  RIP: 0033:0x45ca1b

Note, the nested #VMEXIT path has the same flaw, but needs a different
fix and will be handled separately.

Fixes: 5228eb96a4 ("KVM: x86: nSVM: implement nested TSC scaling")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230729011608.1065019-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 17:16:27 -07:00
Santosh Shukla 0977cfac6e KVM: nSVM: Implement support for nested VNMI
Allow L1 to use vNMI to accelerate its injection of NMI to L2 by
propagating vNMI int_ctl bits from/to vmcb12 to/from vmcb02.

To handle both the case where vNMI is enabled for L1 and L2, and where
vNMI is enabled for L1 but _not_ L2, move pending L1 vNMIs to nmi_pending
on nested VM-Entry and raise KVM_REQ_EVENT, i.e. rely on existing code to
route the NMI to the correct domain.

On nested VM-Exit, reverse the process and set/clear V_NMI_PENDING for L1
based one whether nmi_pending is zero or non-zero.  There is no need to
consider vmcb02 in this case, as V_NMI_PENDING can be set in vmcb02 if
vNMI is disabled for L2, and if vNMI is enabled for L2, then L1 and L2
have different NMI contexts.

Co-developed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Santosh Shukla <santosh.shukla@amd.com>
Link: https://lore.kernel.org/r/20230227084016.3368-12-santosh.shukla@amd.com
[sean: massage changelog to match the code]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-22 17:43:45 -07:00
Maxim Levitsky 5d1ec45652 KVM: nSVM: Raise event on nested VM exit if L1 doesn't intercept IRQs
If L1 doesn't intercept interrupts, then KVM will use vmcb02's V_IRQ
to detect an interrupt window for L1 IRQs.  On a subsequent nested
VM-Exit, KVM might need to copy the current V_IRQ from vmcb02 to vmcb01
to continue waiting for an interrupt window, i.e. if there is still a
pending IRQ for L1.

Raise KVM_REQ_EVENT on nested exit if L1 isn't intercepting IRQs to ensure
that KVM will re-enable interrupt window detection if needed.

Note that this is a theoretical bug because KVM already raises
KVM_REQ_EVENT on each nested VM exit, because the nested VM exit resets
RFLAGS and kvm_set_rflags() raises the KVM_REQ_EVENT unconditionally.

Explicitly raise KVM_REQ_EVENT for the interrupt window case to avoid
having an unnecessary dependency on kvm_set_rflags(), and to document
the scenario.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[santosh: reworded description as per Sean's v2 comment]
Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20230227084016.3368-4-santosh.shukla@amd.com
[sean: further massage changelog and comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-22 12:33:58 -07:00
Santosh Shukla 7334ede457 KVM: nSVM: Disable intercept of VINTR if saved L1 host RFLAGS.IF is 0
Disable intercept of virtual interrupts (used to detect interrupt windows)
if the saved host (L1) RFLAGS.IF is '0', as the effective RFLAGS.IF for L1
interrupts will never be set while L2 is running (L2's RFLAGS.IF doesn't
affect L1 IRQs when virtual interrupts are enabled).

Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lkml.kernel.org/r/Y9hybI65So5X2LFg%40google.com
Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20230227084016.3368-3-santosh.shukla@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-22 12:19:40 -07:00
Santosh Shukla 5faaffab5b KVM: nSVM: Don't sync vmcb02 V_IRQ back to vmcb12 if KVM (L0) is intercepting VINTR
Don't sync vmcb02 V_IRQ back to vmcb12 if KVM (L0) is intercepting
virtual interrupts in order to request an interrupt window, as KVM
has usurped vmcb02's int_ctl.  If an interrupt window opens before
the next VM-Exit, svm_clear_vintr() will restore vmcb12's int_ctl.
If no window opens, V_IRQ will be correctly preserved in vmcb12's
int_ctl (because it was never recognized while L2 was running).

Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lkml.kernel.org/r/Y9hybI65So5X2LFg%40google.com
Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20230227084016.3368-2-santosh.shukla@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-22 12:19:10 -07:00
Maxim Levitsky 8957cbcfed KVM: nSVM: Don't sync tlb_ctl back to vmcb12 on nested VM-Exit
Don't sync the TLB control field from vmcb02 to vmcs12 on nested VM-Exit.
Per AMD's APM, the field is not modified by hardware:

  The VMRUN instruction reads, but does not change, the value of the
  TLB_CONTROL field

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20221129193717.513824-2-mlevitsk@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-31 12:56:26 -08:00
Paolo Bonzini f15a87c006 Merge branch 'kvm-lapic-fix-and-cleanup' into HEAD
The first half or so patches fix semi-urgent, real-world relevant APICv
and AVIC bugs.

The second half fixes a variety of AVIC and optimized APIC map bugs
where KVM doesn't play nice with various edge cases that are
architecturally legal(ish), but are unlikely to occur in most real world
scenarios

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-24 06:08:01 -05:00
Paolo Bonzini dc7c31e922 Merge branch 'kvm-v6.2-rc4-fixes' into HEAD
ARM:

* Fix the PMCR_EL0 reset value after the PMU rework

* Correctly handle S2 fault triggered by a S1 page table walk
  by not always classifying it as a write, as this breaks on
  R/O memslots

* Document why we cannot exit with KVM_EXIT_MMIO when taking
  a write fault from a S1 PTW on a R/O memslot

* Put the Apple M2 on the naughty list for not being able to
  correctly implement the vgic SEIS feature, just like the M1
  before it

* Reviewer updates: Alex is stepping down, replaced by Zenghui

x86:

* Fix various rare locking issues in Xen emulation and teach lockdep
  to detect them

* Documentation improvements

* Do not return host topology information from KVM_GET_SUPPORTED_CPUID
2023-01-24 06:05:23 -05:00
Sean Christopherson 2008fab345 KVM: x86: Inhibit APIC memslot if x2APIC and AVIC are enabled
Free the APIC access page memslot if any vCPU enables x2APIC and SVM's
AVIC is enabled to prevent accesses to the virtual APIC on vCPUs with
x2APIC enabled.  On AMD, if its "hybrid" mode is enabled (AVIC is enabled
when x2APIC is enabled even without x2AVIC support), keeping the APIC
access page memslot results in the guest being able to access the virtual
APIC page as x2APIC is fully emulated by KVM.  I.e. hardware isn't aware
that the guest is operating in x2APIC mode.

Exempt nested SVM's update of APICv state from the new logic as x2APIC
can't be toggled on VM-Exit.  In practice, invoking the x2APIC logic
should be harmless precisely because it should be a glorified nop, but
play it safe to avoid latent bugs, e.g. with dropping the vCPU's SRCU
lock.

Intel doesn't suffer from the same issue as APICv has fully independent
VMCS controls for xAPIC vs. x2APIC virtualization.  Technically, KVM
should provide bus error semantics and not memory semantics for the APIC
page when x2APIC is enabled, but KVM already provides memory semantics in
other scenarios, e.g. if APICv/AVIC is enabled and the APIC is hardware
disabled (via APIC_BASE MSR).

Note, checking apic_access_memslot_enabled without taking locks relies
it being set during vCPU creation (before kvm_vcpu_reset()).  vCPUs can
race to set the inhibit and delete the memslot, i.e. can get false
positives, but can't get false negatives as apic_access_memslot_enabled
can't be toggled "on" once any vCPU reaches KVM_RUN.

Opportunistically drop the "can" while updating avic_activate_vmcb()'s
comment, i.e. to state that KVM _does_ support the hybrid mode.  Move
the "Note:" down a line to conform to preferred kernel/KVM multi-line
comment style.

Opportunistically update the apicv_update_lock comment, as it isn't
actually used to protect apic_access_memslot_enabled (which is protected
by slots_lock).

Fixes: 0e311d33bf ("KVM: SVM: Introduce hybrid-AVIC mode")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20230106011306.85230-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:25 -05:00
Paolo Bonzini 74905e3de8 KVM: nSVM: clarify recalc_intercepts() wrt CR8
The mysterious comment "We only want the cr8 intercept bits of L1"
dates back to basically the introduction of nested SVM, back when
the handling of "less typical" hypervisors was very haphazard.
With the development of kvm-unit-tests for interrupt handling,
the same code grew another vmcb_clr_intercept for the interrupt
window (VINTR) vmexit, this time with a comment that is at least
decent.

It turns out however that the same comment applies to the CR8 write
intercept, which is also a "recheck if an interrupt should be
injected" intercept.  The CR8 read intercept instead has not
been used by KVM for 14 years (commit 649d68643e, "KVM: SVM:
sync TPR value to V_TPR field in the VMCB"), so do not bother
clearing it and let one comment describe both CR8 write and VINTR
handling.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-09 05:35:21 -05:00
Sean Christopherson 8d20bd6381 KVM: x86: Unify pr_fmt to use module name for all KVM modules
Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks
use consistent formatting across common x86, Intel, and AMD code.  In
addition to providing consistent print formatting, using KBUILD_MODNAME,
e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and
SGX and ...) as technologies without generating weird messages, and
without causing naming conflicts with other kernel code, e.g. "SEV: ",
"tdx: ", "sgx: " etc.. are all used by the kernel for non-KVM subsystems.

Opportunistically move away from printk() for prints that need to be
modified anyways, e.g. to drop a manual "kvm: " prefix.

Opportunistically convert a few SGX WARNs that are similarly modified to
WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good
that they would fire repeatedly and spam the kernel log without providing
unique information in each print.

Note, defining pr_fmt yields undesirable results for code that uses KVM's
printk wrappers, e.g. vcpu_unimpl().  But, that's a pre-existing problem
as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's
wrappers is relatively limited in KVM x86 code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Message-Id: <20221130230934.1014142-35-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:47:35 -05:00
Vitaly Kuznetsov 3f4a812edf KVM: nSVM: hyper-v: Enable L2 TLB flush
Implement Hyper-V L2 TLB flush for nSVM. The feature needs to be enabled
both in extended 'nested controls' in VMCB and VP assist page.
According to Hyper-V TLFS, synthetic vmexit to L1 is performed with
- HV_SVM_EXITCODE_ENL exit_code.
- HV_SVM_ENL_EXITCODE_TRAP_AFTER_FLUSH exit_info_1.

Note: VP assist page is cached in 'struct kvm_vcpu_hv' so
recalc_intercepts() doesn't need to read from guest's memory. KVM
needs to update the case upon each VMRUN and after svm_set_nested_state
(svm_get_nested_state_pages()) to handle the case when the guest got
migrated while L2 was running.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20221101145426.251680-29-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-18 12:59:18 -05:00
Vitaly Kuznetsov b0c9c25e46 KVM: x86: Introduce .hv_inject_synthetic_vmexit_post_tlb_flush() nested hook
Hyper-V supports injecting synthetic L2->L1 exit after performing
L2 TLB flush operation but the procedure is vendor specific. Introduce
.hv_inject_synthetic_vmexit_post_tlb_flush nested hook for it.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20221101145426.251680-22-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-18 12:59:13 -05:00
Vitaly Kuznetsov e45aa2444d KVM: nSVM: Keep track of Hyper-V hv_vm_id/hv_vp_id
Similar to nSVM, KVM needs to know L2's VM_ID/VP_ID and Partition
assist page address to handle L2 TLB flush requests.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20221101145426.251680-21-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-18 12:59:12 -05:00
Sean Christopherson 26b516bb39 x86/hyperv: KVM: Rename "hv_enlightenments" to "hv_vmcb_enlightenments"
Now that KVM isn't littered with "struct hv_enlightenments" casts, rename
the struct to "hv_vmcb_enlightenments" to highlight the fact that the
struct is specifically for SVM's VMCB.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20221101145426.251680-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-18 12:58:59 -05:00
Sean Christopherson 68ae7c7bc5 KVM: SVM: Add a proper field for Hyper-V VMCB enlightenments
Add a union to provide hv_enlightenments side-by-side with the sw_reserved
bytes that Hyper-V's enlightenments overlay.  Casting sw_reserved
everywhere is messy, confusing, and unnecessarily unsafe.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20221101145426.251680-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-18 12:58:58 -05:00
Sean Christopherson 089fe572a2 x86/hyperv: Move VMCB enlightenment definitions to hyperv-tlfs.h
Move Hyper-V's VMCB enlightenment definitions to the TLFS header; the
definitions come directly from the TLFS[*], not from KVM.

No functional change intended.

[*] https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/datatypes/hv_svm_enlightened_vmcb_fields

[vitaly: rename VMCB_HV_ -> HV_VMCB_ to match the rest of
hyperv-tlfs.h, keep svm/hyperv.h]

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20221101145426.251680-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-18 12:58:57 -05:00
Paolo Bonzini 771a579c6e Merge branch 'kvm-svm-harden' into HEAD
This fixes three issues in nested SVM:

1) in the shutdown_interception() vmexit handler we call kvm_vcpu_reset().
However, if running nested and L1 doesn't intercept shutdown, the function
resets vcpu->arch.hflags without properly leaving the nested state.
This leaves the vCPU in inconsistent state and later triggers a kernel
panic in SVM code.  The same bug can likely be triggered by sending INIT
via local apic to a vCPU which runs a nested guest.

On VMX we are lucky that the issue can't happen because VMX always
intercepts triple faults, thus triple fault in L2 will always be
redirected to L1.  Plus, handle_triple_fault() doesn't reset the vCPU.
INIT IPI can't happen on VMX either because INIT events are masked while
in VMX mode.

Secondarily, KVM doesn't honour SHUTDOWN intercept bit of L1 on SVM.
A normal hypervisor should always intercept SHUTDOWN, a unit test on
the other hand might want to not do so.

Finally, the guest can trigger a kernel non rate limited printk on SVM
from the guest, which is fixed as well.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17 11:51:09 -05:00
Maxim Levitsky 92e7d5c83a KVM: x86: allow L1 to not intercept triple fault
This is SVM correctness fix - although a sane L1 would intercept
SHUTDOWN event, it doesn't have to, so we have to honour this.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221103141351.50662-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17 11:39:59 -05:00
Maxim Levitsky f9697df251 KVM: x86: add kvm_leave_nested
add kvm_leave_nested which wraps a call to nested_ops->leave_nested
into a function.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221103141351.50662-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17 11:39:56 -05:00
Maxim Levitsky 16ae56d7e0 KVM: x86: nSVM: harden svm_free_nested against freeing vmcb02 while still in use
Make sure that KVM uses vmcb01 before freeing nested state, and warn if
that is not the case.

This is a minimal fix for CVE-2022-3344 making the kernel print a warning
instead of a kernel panic.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221103141351.50662-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17 11:39:54 -05:00
Paolo Bonzini 31e83e21cf KVM: x86: compile out vendor-specific code if SMM is disabled
Vendor-specific code that deals with SMI injection and saving/restoring
SMM state is not needed if CONFIG_KVM_SMM is disabled, so remove the
four callbacks smi_allowed, enter_smm, leave_smm and enable_smi_window.
The users in svm/nested.c and x86.c also have to be compiled out; the
amount of #ifdef'ed code is small and it's not worth moving it to
smm.c.

enter_smm is now used only within #ifdef CONFIG_KVM_SMM, and the stub
can therefore be removed.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220929172016.319443-7-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09 12:31:19 -05:00