mirror of https://github.com/torvalds/linux.git
1405 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
73d7cf0710 |
ARM:
- Remove the last leftovers of the ill-fated FPSIMD host state
mapping at EL2 stage-1
- Fix unexpected advertisement to the guest of unimplemented S2 base
granule sizes
- Gracefully fail initialising pKVM if the interrupt controller isn't
GICv3
- Also gracefully fail initialising pKVM if the carveout allocation
fails
- Fix the computing of the minimum MMIO range required for the host on
stage-2 fault
- Fix the generation of the GICv3 Maintenance Interrupt in nested mode
x86:
- Reject SEV{-ES} intra-host migration if one or more vCPUs are actively
being created, so as not to create a non-SEV{-ES} vCPU in an SEV{-ES} VM.
- Use a pre-allocated, per-vCPU buffer for handling de-sparsification of
vCPU masks in Hyper-V hypercalls; fixes a "stack frame too large" issue.
- Allow out-of-range/invalid Xen event channel ports when configuring IRQ
routing, to avoid dictating a specific ioctl() ordering to userspace.
- Conditionally reschedule when setting memory attributes to avoid soft
lockups when userspace converts huge swaths of memory to/from private.
- Add back MWAIT as a required feature for the MONITOR/MWAIT selftest.
- Add a missing field in struct sev_data_snp_launch_start that resulted in
the guest-visible workarounds field being filled at the wrong offset.
- Skip non-canonical address when processing Hyper-V PV TLB flushes to avoid
VM-Fail on INVVPID.
- Advertise supported TDX TDVMCALLs to userspace.
- Pass SetupEventNotifyInterrupt arguments to userspace.
- Fix TSC frequency underflow.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmhurKgUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNxHggApTP4vw+oOzfN7UoNmgR9XZMI1p2a
R8AzQ1zDyVbEVWq3xTKvXtld+dKeO0yKB/XeI/1JLck1OiHxY57I3X6k5AnsurEr
CBzeAhAjXivF8woMgmlP+30aqpomcPACdQm0gRnWkRDDJfXqSUas/iE/s9Ct1dT4
4w3PtFLsSsU8vX/RttR+CqF1AQ6SeV/NRvA8hzPGMGZoQ2um74j4ZsM/3xh77Kdw
Z2vOnZOIA4dk0074JjO/Yb9l00Ib4hn+MWG5jVJ+6i2HRRYd2knnB29apVS/ARdL
X20j+LvtYj/jrPPdYwqjvxbIXyLbJrLCZyjKhfueN+rnisPNvzR+7YE4ZQ==
=NduO
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"Many patches, pretty much all of them small, that accumulated while I
was on vacation.
ARM:
- Remove the last leftovers of the ill-fated FPSIMD host state
mapping at EL2 stage-1
- Fix unexpected advertisement to the guest of unimplemented S2 base
granule sizes
- Gracefully fail initialising pKVM if the interrupt controller isn't
GICv3
- Also gracefully fail initialising pKVM if the carveout allocation
fails
- Fix the computing of the minimum MMIO range required for the host
on stage-2 fault
- Fix the generation of the GICv3 Maintenance Interrupt in nested
mode
x86:
- Reject SEV{-ES} intra-host migration if one or more vCPUs are
actively being created, so as not to create a non-SEV{-ES} vCPU in
an SEV{-ES} VM
- Use a pre-allocated, per-vCPU buffer for handling de-sparsification
of vCPU masks in Hyper-V hypercalls; fixes a "stack frame too
large" issue
- Allow out-of-range/invalid Xen event channel ports when configuring
IRQ routing, to avoid dictating a specific ioctl() ordering to
userspace
- Conditionally reschedule when setting memory attributes to avoid
soft lockups when userspace converts huge swaths of memory to/from
private
- Add back MWAIT as a required feature for the MONITOR/MWAIT selftest
- Add a missing field in struct sev_data_snp_launch_start that
resulted in the guest-visible workarounds field being filled at the
wrong offset
- Skip non-canonical address when processing Hyper-V PV TLB flushes
to avoid VM-Fail on INVVPID
- Advertise supported TDX TDVMCALLs to userspace
- Pass SetupEventNotifyInterrupt arguments to userspace
- Fix TSC frequency underflow"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: avoid underflow when scaling TSC frequency
KVM: arm64: Remove kvm_arch_vcpu_run_map_fp()
KVM: arm64: Fix handling of FEAT_GTG for unimplemented granule sizes
KVM: arm64: Don't free hyp pages with pKVM on GICv2
KVM: arm64: Fix error path in init_hyp_mode()
KVM: arm64: Adjust range correctly during host stage-2 faults
KVM: arm64: nv: Fix MI line level calculation in vgic_v3_nested_update_mi()
KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush
KVM: SVM: Add missing member in SNP_LAUNCH_START command structure
Documentation: KVM: Fix unexpected unindent warnings
KVM: selftests: Add back the missing check of MONITOR/MWAIT availability
KVM: Allow CPU to reschedule while setting per-page memory attributes
KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table.
KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks
KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULL
KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight
KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
|
|
|
|
6e9128ff9d |
Add the mitigation logic for Transient Scheduler Attacks (TSA)
TSA are new aspeculative side channel attacks related to the execution timing of instructions under specific microarchitectural conditions. In some cases, an attacker may be able to use this timing information to infer data from other contexts, resulting in information leakage. Add the usual controls of the mitigation and integrate it into the existing speculation bugs infrastructure in the kernel. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhSsvQACgkQEsHwGGHe VUrWNw//V+ZabYq3Nnvh4jEe6Altobnpn8bOIWmcBx6I3xuuArb9bLqcbKerDIcC POVVW6zrdNigDe/U4aqaJXE7qCRX55uTYbhp8OLH0zzqX3Pjl/hUnEXWtMtlXj/G CIM5mqjqEFp5JRGXetdjjuvjG1IPf+CbjKqj2WXbi//T6F3LiAFxkzdUhd+clBF/ ztWchjwUmqU0WJd6+Smb8ZnvWrLoZuOFldjhFad820B7fqkdJhzjHMmwBHJKUEZu oABv8B0/4IALrx6LenCspWS4OuTOGG7DKyIgzitByXygXXb4L3ZUKpuqkxBU7hFx bscwtOP7e5HIYAekx6ZSLZoZpYQXr1iH0aRGrjwapi3ASIpUwI0UA9ck2PdGo0IY 0GvmN0vbybskewBQyG819BM+DCau5pOLWuL7cYmaD2eTNoOHOknMDNlO8VzXqJxa NnignSuEWFm2vNV1FXEav2YbVjlanV6JleiPDGBe5Xd9dnxZTvg9HuP2NkYio4dZ mb/kEU/kTcN8nWh0Q96tX45kmj0vCbBgrSQkmUpyAugp38n69D1tp3ii9D/hyQFH hKGcFC9m+rYVx1NLyAxhTGxaEqF801d5Qawwud8HsnQudTpCdSXD9fcBg9aCbWEa FymtDpIeUQrFAjDpVEp6Syh3odKvLXsGEzL+DVvqKDuA8r6DxFo= =2cLl -----END PGP SIGNATURE----- Merge tag 'tsa_x86_bugs_for_6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU speculation fixes from Borislav Petkov: "Add the mitigation logic for Transient Scheduler Attacks (TSA) TSA are new aspeculative side channel attacks related to the execution timing of instructions under specific microarchitectural conditions. In some cases, an attacker may be able to use this timing information to infer data from other contexts, resulting in information leakage. Add the usual controls of the mitigation and integrate it into the existing speculation bugs infrastructure in the kernel" * tag 'tsa_x86_bugs_for_6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/process: Move the buffer clearing before MONITOR x86/microcode/AMD: Add TSA microcode SHAs KVM: SVM: Advertise TSA CPUID bits to guests x86/bugs: Add a Transient Scheduler Attacks mitigation x86/bugs: Rename MDS machinery to something more generic |
|
|
|
fa7d0f83c5 |
x86/traps: Initialize DR7 by writing its architectural reset value
Initialize DR7 by writing its architectural reset value to always set bit 10, which is reserved to '1', when "clearing" DR7 so as not to trigger unanticipated behavior if said bit is ever unreserved, e.g. as a feature enabling flag with inverted polarity. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Sean Christopherson <seanjc@google.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250620231504.2676902-3-xin%40zytor.com |
|
|
|
0b6f4a5f08 |
KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks
Use a preallocated per-vCPU bitmap for tracking the unpacked set of vCPUs
being targeted for Hyper-V's paravirt TLB flushing. If KVM_MAX_NR_VCPUS
is set to 4096 (which is allowed even for MAXSMP=n builds), putting the
vCPU mask on-stack pushes kvm_hv_flush_tlb() past the default FRAME_WARN
limit.
arch/x86/kvm/hyperv.c:2001:12: error: stack frame size (1288) exceeds limit (1024)
in 'kvm_hv_flush_tlb' [-Werror,-Wframe-larger-than]
2001 | static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
| ^
1 error generated.
Note, sparse_banks was given the same treatment by commit
|
|
|
|
31272abd59 |
KVM: SVM: Advertise TSA CPUID bits to guests
Synthesize the TSA CPUID feature bits for guests. Set TSA_{SQ,L1}_NO on
unaffected machines.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
|
|
|
|
7f9039c524 |
Generic:
* Clean up locking of all vCPUs for a VM by using the *_nest_lock()
family of functions, and move duplicated code to virt/kvm/.
kernel/ patches acked by Peter Zijlstra.
* Add MGLRU support to the access tracking perf test.
ARM fixes:
* Make the irqbypass hooks resilient to changes in the GSI<->MSI
routing, avoiding behind stale vLPI mappings being left behind. The
fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
and nuking the vLPI mapping upon a routing change.
* Close another VGIC race where vCPU creation races with VGIC
creation, leading to in-flight vCPUs entering the kernel w/o private
IRQs allocated.
* Fix a build issue triggered by the recently added workaround for
Ampere's AC04_CPU_23 erratum.
* Correctly sign-extend the VA when emulating a TLBI instruction
potentially targeting a VNCR mapping.
* Avoid dereferencing a NULL pointer in the VGIC debug code, which can
happen if the device doesn't have any mapping yet.
s390:
* Fix interaction between some filesystems and Secure Execution
* Some cleanups and refactorings, preparing for an upcoming big series
x86:
* Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
fix a race between AP destroy and VMRUN.
* Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
* Refine and harden handling of spurious faults.
* Add support for ALLOWED_SEV_FEATURES.
* Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
* Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
that utilize those bits.
* Don't account temporary allocations in sev_send_update_data().
* Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
* Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
SVM and VMX.
* Advertise support to userspace for WRMSRNS and PREFETCHI.
* Rescan I/O APIC routes after handling EOI that needed to be intercepted due
to the old/previous routing, but not the new/current routing.
* Add a module param to control and enumerate support for device posted
interrupts.
* Fix a potential overflow with nested virt on Intel systems running 32-bit kernels.
* Flush shadow VMCSes on emergency reboot.
* Add support for SNP to the various SEV selftests.
* Add a selftest to verify fastops instructions via forced emulation.
* Refine and optimize KVM's software processing of the posted interrupt bitmap, and share
the harvesting code between KVM and the kernel's Posted MSI handler
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmg9TjwUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOUxQf7B7nnWqIKd7jSkGzSD6YsSX9TXktr
2tJIOfWM3zNYg5GRCidg+m4Y5+DqQWd3Hi5hH2P9wUw7RNuOjOFsDe+y0VBr8ysE
ve39t/yp+mYalNmHVFl8s3dBDgrIeGKiz+Wgw3zCQIBZ18rJE1dREhv37RlYZ3a2
wSvuObe8sVpCTyKIowDs1xUi7qJUBoopMSuqfleSHawRrcgCpV99U8/KNFF5plLH
7fXOBAHHniVCVc+mqQN2wxtVJDhST+U3TaU4GwlKy9Yevr+iibdOXffveeIgNEU4
D6q1F2zKp6UdV3+p8hxyaTTbiCVDqsp9WOgY/0I/f+CddYn0WVZgOlR+ow==
=mYFL
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm updates from Paolo Bonzini:
Generic:
- Clean up locking of all vCPUs for a VM by using the *_nest_lock()
family of functions, and move duplicated code to virt/kvm/. kernel/
patches acked by Peter Zijlstra
- Add MGLRU support to the access tracking perf test
ARM fixes:
- Make the irqbypass hooks resilient to changes in the GSI<->MSI
routing, avoiding behind stale vLPI mappings being left behind. The
fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
and nuking the vLPI mapping upon a routing change
- Close another VGIC race where vCPU creation races with VGIC
creation, leading to in-flight vCPUs entering the kernel w/o
private IRQs allocated
- Fix a build issue triggered by the recently added workaround for
Ampere's AC04_CPU_23 erratum
- Correctly sign-extend the VA when emulating a TLBI instruction
potentially targeting a VNCR mapping
- Avoid dereferencing a NULL pointer in the VGIC debug code, which
can happen if the device doesn't have any mapping yet
s390:
- Fix interaction between some filesystems and Secure Execution
- Some cleanups and refactorings, preparing for an upcoming big
series
x86:
- Wait for target vCPU to ack KVM_REQ_UPDATE_PROTECTED_GUEST_STATE
to fix a race between AP destroy and VMRUN
- Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for
the VM
- Refine and harden handling of spurious faults
- Add support for ALLOWED_SEV_FEATURES
- Add #VMGEXIT to the set of handlers special cased for
CONFIG_RETPOLINE=y
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing
features that utilize those bits
- Don't account temporary allocations in sev_send_update_data()
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock
Threshold
- Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU
IBPB, between SVM and VMX
- Advertise support to userspace for WRMSRNS and PREFETCHI
- Rescan I/O APIC routes after handling EOI that needed to be
intercepted due to the old/previous routing, but not the
new/current routing
- Add a module param to control and enumerate support for device
posted interrupts
- Fix a potential overflow with nested virt on Intel systems running
32-bit kernels
- Flush shadow VMCSes on emergency reboot
- Add support for SNP to the various SEV selftests
- Add a selftest to verify fastops instructions via forced emulation
- Refine and optimize KVM's software processing of the posted
interrupt bitmap, and share the harvesting code between KVM and the
kernel's Posted MSI handler"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
rtmutex_api: provide correct extern functions
KVM: arm64: vgic-debug: Avoid dereferencing NULL ITE pointer
KVM: arm64: vgic-init: Plug vCPU vs. VGIC creation race
KVM: arm64: Unmap vLPIs affected by changes to GSI routing information
KVM: arm64: Resolve vLPI by host IRQ in vgic_v4_unset_forwarding()
KVM: arm64: Protect vLPI translation with vgic_irq::irq_lock
KVM: arm64: Use lock guard in vgic_v4_set_forwarding()
KVM: arm64: Mask out non-VA bits from TLBI VA* on VNCR invalidation
arm64: sysreg: Drag linux/kconfig.h to work around vdso build issue
KVM: s390: Simplify and move pv code
KVM: s390: Refactor and split some gmap helpers
KVM: s390: Remove unneeded srcu lock
s390: Remove unneeded includes
s390/uv: Improve splitting of large folios that cannot be split while dirty
s390/uv: Always return 0 from s390_wiggle_split_folio() if successful
s390/uv: Don't return 0 from make_hva_secure() if the operation was not successful
rust: add helper for mutex_trylock
RISC-V: KVM: use kvm_trylock_all_vcpus when locking all vCPUs
KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs
x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementation
...
|
|
|
|
43db111107 |
ARM:
* Add large stage-2 mapping (THP) support for non-protected guests when pKVM is enabled, clawing back some performance. * Enable nested virtualisation support on systems that support it, though it is disabled by default. * Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and protected modes. * Large rework of the way KVM tracks architecture features and links them with the effects of control bits. While this has no functional impact, it ensures correctness of emulation (the data is automatically extracted from the published JSON files), and helps dealing with the evolution of the architecture. * Significant changes to the way pKVM tracks ownership of pages, avoiding page table walks by storing the state in the hypervisor's vmemmap. This in turn enables the THP support described above. * New selftest checking the pKVM ownership transition rules * Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests even if the host didn't have it. * Fixes for the address translation emulation, which happened to be rather buggy in some specific contexts. * Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N from the number of counters exposed to a guest and addressing a number of issues in the process. * Add a new selftest for the SVE host state being corrupted by a guest. * Keep HCR_EL2.xMO set at all times for systems running with the kernel at EL2, ensuring that the window for interrupts is slightly bigger, and avoiding a pretty bad erratum on the AmpereOne HW. * Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers from a pretty bad case of TLB corruption unless accesses to HCR_EL2 are heavily synchronised. * Add a per-VM, per-ITS debugfs entry to dump the state of the ITS tables in a human-friendly fashion. * and the usual random cleanups. LoongArch: * Don't flush tlb if the host supports hardware page table walks. * Add KVM selftests support. RISC-V: * Add vector registers to get-reg-list selftest * VCPU reset related improvements * Remove scounteren initialization from VCPU reset * Support VCPU reset from userspace using set_mpstate() ioctl x86: * Initial support for TDX in KVM. This finally makes it possible to use the TDX module to run confidential guests on Intel processors. This is quite a large series, including support for private page tables (managed by the TDX module and mirrored in KVM for efficiency), forwarding some TDVMCALLs to userspace, and handling several special VM exits from the TDX module. This has been in the works for literally years and it's not really possible to describe everything here, so I'll defer to the various merge commits up to and including commit |
|
|
|
4e02d4f973 |
KVM SVM changes for 6.16:
- Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
fix a race between AP destroy and VMRUN.
- Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
- Add support for ALLOWED_SEV_FEATURES.
- Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
that utilize those bits.
- Don't account temporary allocations in sev_send_update_data().
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmwAACgkQOlYIJqCj
N/1pHw//edW/x838POMeeCN8j39NBKErW9yZoQLhMbzogttRvfoba+xYY9zXyRFx
8AXB8+2iLtb7pXUohc0eYN0mNqgD0SnoMLqGfn7nrkJafJSUAJHAoZn1Mdom1M1y
jHvBPbHCMMsgdLV8wpDRqCNWTH+d5W0kcN5WjKwOswVLj1rybVfK7bSLMhvkk1e5
RrOR4Ewf95/Ag2b36L4SvS1yG9fTClmKeGArMXhEXjy2INVSpBYyZMjVtjHiNzU9
TjtB2RSM45O+Zl0T2fZdVW8LFhA6kVeL1v+Oo433CjOQE0LQff3Vl14GCANIlPJU
PiWN/RIKdWkuxStIP3vw02eHzONCcg2GnNHzEyKQ1xW8lmrwzVRdXZzVsc2Dmowb
7qGykBQ+wzoE0sMeZPA0k/QOSqg2vGxUQHjR7720loLV9m9Tu/mJnS9e179GJKgI
e1ArSLwKmHpjwKZqU44IQVTZaxSC4Sg2kI670i21ChPgx8+oVkA6I0LFQXymx7uS
2lbH+ovTlJSlP9fbaJhMwAU2wpSHAyXif/HPjdw2LTH3NdgXzfEnZfTlAWiP65LQ
hnz5HvmUalW3x9kmzRmeDIAkDnAXhyt3ZQMvbNzqlO5AfS+Tqh4Ed5EFP3IrQAzK
HQ+Gi0ip+B84t9Tbi6rfQwzTZEbSSOfYksC7TXqRGhNo/DvHumE=
=k6rK
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.16' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.16:
- Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
fix a race between AP destroy and VMRUN.
- Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
- Add support for ALLOWED_SEV_FEATURES.
- Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
that utilize those bits.
- Don't account temporary allocations in sev_send_update_data().
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
|
|
|
|
e9628b011b |
KVM: x86: Make kvm_pio_request.linear_rip a common field for user exits
Move and rename kvm_pio_request.linear_rip to kvm_vcpu_arch.cui_linear_rip so that the field can be used by other userspace exit completion flows that need to take action if and only if userspace has not modified RIP. No functional changes intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Manali Shukla <manali.shukla@amd.com> Link: https://lore.kernel.org/r/20250502050346.14274-2-manali.shukla@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
efef7f184f |
x86/msr: Add explicit includes of <asm/msr.h>
For historic reasons there are some TSC-related functions in the
<asm/msr.h> header, even though there's an <asm/tsc.h> header.
To facilitate the relocation of rdtsc{,_ordered}() from <asm/msr.h>
to <asm/tsc.h> and to eventually eliminate the inclusion of
<asm/msr.h> in <asm/tsc.h>, add an explicit <asm/msr.h> dependency
to the source files that reference definitions from <asm/msr.h>.
[ mingo: Clarified the changelog. ]
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250501054241.1245648-1-xin@zytor.com
|
|
|
|
0c7b20b852 |
Linux 6.15-rc4
-----BEGIN PGP SIGNATURE----- iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmgOrWseHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGFyIH/AhXcuA8y8rk43mo t+0GO7JR4dnr4DIl74GgDjCXlXiKCT7EXMfD/ABdofTxV4Pbyv+pUODlg1E6eO9U C1WWM5PPNBGDDEVSQ3Yu756nr0UoiFhvW0R6pVdou5cezCWAtIF9LTN8DEUgis0u EUJD9+/cHAMzfkZwabjm/HNsa1SXv2X47MzYv/PdHKr0htEPcNHF4gqBrBRdACGy FJtaCKhuPf6TcDNXOFi5IEWMXrugReRQmOvrXqVYGa7rfUFkZgsAzRY6n/rUN5Z9 FAgle4Vlv9ohVYj9bXX8b6wWgqiKRpoN+t0PpRd6G6ict1AFBobNGo8LH3tYIKqZ b/dCGNg= =xDGd -----END PGP SIGNATURE----- Merge tag 'v6.15-rc4' into x86/msr, to pick up fixes and resolve conflicts Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
309d28576f |
KVM: SVM: Fix SNP AP destroy race with VMRUN
An AP destroy request for a target vCPU is typically followed by an
RMPADJUST to remove the VMSA attribute from the page currently being
used as the VMSA for the target vCPU. This can result in a vCPU that
is about to VMRUN to exit with #VMEXIT_INVALID.
This usually does not happen as APs are typically sitting in HLT when
being destroyed and therefore the vCPU thread is not running at the time.
However, if HLT is allowed inside the VM, then the vCPU could be about to
VMRUN when the VMSA attribute is removed from the VMSA page, resulting in
a #VMEXIT_INVALID when the vCPU actually issues the VMRUN and causing the
guest to crash. An RMPADJUST against an in-use (already running) VMSA
results in a #NPF for the vCPU issuing the RMPADJUST, so the VMSA
attribute cannot be changed until the VMRUN for target vCPU exits. The
Qemu command line option '-overcommit cpu-pm=on' is an example of allowing
HLT inside the guest.
Update the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event to include the
KVM_REQUEST_WAIT flag. The kvm_vcpu_kick() function will not wait for
requests to be honored, so create kvm_make_request_and_kick() that will
add a new event request and honor the KVM_REQUEST_WAIT flag. This will
ensure that the target vCPU sees the AP destroy request before returning
to the initiating vCPU should the target vCPU be in guest mode.
Fixes:
|
|
|
|
459074cff6 |
KVM: x86: Add module param to control and enumerate device posted IRQs
Add a module param to each KVM vendor module to allow disabling device
posted interrupts without having to sacrifice all of APICv/AVIC, and to
also effectively enumerate to userspace whether or not KVM may be
utilizing device posted IRQs. Disabling device posted interrupts is
very desirable for testing, and can even be desirable for production
environments, e.g. if the host kernel wants to interpose on device
interrupts.
Put the module param in kvm-{amd,intel}.ko instead of kvm.ko to match
the overall APICv/AVIC controls, and to avoid complications with said
controls. E.g. if the param is in kvm.ko, KVM needs to be snapshot the
original user-defined value to play nice with a vendor module being
reloaded with different enable_apicv settings.
Link: https://lore.kernel.org/r/20250401161804.842968-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
87e4951e25 |
KVM: x86: Rescan I/O APIC routes after EOI interception for old routing
Rescan I/O APIC routes for a vCPU after handling an intercepted I/O APIC EOI for an IRQ that is not targeting said vCPU, i.e. after handling what's effectively a stale EOI VM-Exit. If a level-triggered IRQ is in-flight when IRQ routing changes, e.g. because the guest changes routing from its IRQ handler, then KVM intercepts EOIs on both the new and old target vCPUs, so that the in-flight IRQ can be de-asserted when it's EOI'd. However, only the EOI for the in-flight IRQ needs to be intercepted, as IRQs on the same vector with the new routing are coincidental, i.e. occur only if the guest is reusing the vector for multiple interrupt sources. If the I/O APIC routes aren't rescanned, KVM will unnecessarily intercept EOIs for the vector and negative impact the vCPU's interrupt performance. Note, both commit |
|
|
|
45eb29140e |
Merge branch 'kvm-fixes-6.15-rc4' into HEAD
* Single fix for broken usage of 'multi-MIDR' infrastructure in PI code, adding an open-coded erratum check for Cavium ThunderX * Bugfixes from a planned posted interrupt rework * Do not use kvm_rip_read() unconditionally to cater for guests with inaccessible register state. |
|
|
|
5f9e169814 |
KVM: arm64, x86: make kvm_arch_has_irq_bypass() inline
kvm_arch_has_irq_bypass() is a small function and even though it does not appear in any *really* hot paths, it's also not entirely rare. Make it inline---it also works out nicely in preparation for using it in kvm-intel.ko and kvm-amd.ko, since the function is not currently exported. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
c435e608cf |
x86/msr: Rename 'rdmsrl()' to 'rdmsrq()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
fd02aa45bd |
Merge branch 'kvm-tdx-initial' into HEAD
This large commit contains the initial support for TDX in KVM. All x86
parts enable the host-side hypercalls that KVM uses to talk to the TDX
module, a software component that runs in a special CPU mode called SEAM
(Secure Arbitration Mode).
The series is in turn split into multiple sub-series, each with a separate
merge commit:
- Initialization: basic setup for using the TDX module from KVM, plus
ioctls to create TDX VMs and vCPUs.
- MMU: in TDX, private and shared halves of the address space are mapped by
different EPT roots, and the private half is managed by the TDX module.
Using the support that was added to the generic MMU code in 6.14,
add support for TDX's secure page tables to the Intel side of KVM.
Generic KVM code takes care of maintaining a mirror of the secure page
tables so that they can be queried efficiently, and ensuring that changes
are applied to both the mirror and the secure EPT.
- vCPU enter/exit: implement the callbacks that handle the entry of a TDX
vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore
of host state.
- Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to
userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits"
but are triggered through a different mechanism, similar to VMGEXIT for
SEV-ES and SEV-SNP.
- Interrupt handling: support for virtual interrupt injection as well as
handling VM-Exits that are caused by vectored events. Exclusive to
TDX are machine-check SMIs, which the kernel already knows how to
handle through the kernel machine check handler (commit
|
|
|
|
81d480fdf8 |
KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMU
Wrap the TDP MMU page counter in CONFIG_KVM_PROVE_MMU so that the sanity check is omitted from production builds, and more importantly to remove the atomic accesses to account pages. A one-off memory leak in production is relatively uninteresting, and a WARN_ON won't help mitigate a systemic issue; it's as much about helping triage memory leaks as it is about detecting them in the first place, and doesn't magically stop the leaks. I.e. production environments will be quite sad if a severe KVM bug escapes, regardless of whether or not KVM WARNs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250315023448.2358456-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
782f9feaa9 |
Merge branch 'kvm-pre-tdx' into HEAD
- Add common secure TSC infrastructure for use within SNP and in the future TDX - Block KVM_CAP_SYNC_REGS if guest state is protected. It does not make sense to use the capability if the relevant registers are not available for reading or writing. |
|
|
|
3ecf162a31 |
KVM Xen changes for 6.15
- Don't write to the Xen hypercall page on MSR writes that are initiated by
the host (userspace or KVM) to fix a class of bugs where KVM can write to
guest memory at unexpected times, e.g. during vCPU creation if userspace has
set the Xen hypercall MSR index to collide with an MSR that KVM emulates.
- Restrict the Xen hypercall MSR indx to the unofficial synthetic range to
reduce the set of possible collisions with MSRs that are emulated by KVM
(collisions can still happen as KVM emulates Hyper-V MSRs, which also reside
in the synthetic range).
- Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config.
- Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID
entries when updating PV clocks, as there is no guarantee PV clocks will be
updated between TSC frequency changes and CPUID emulation, and guest reads
of Xen TSC should be rare, i.e. are not a hot path.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZpO4ACgkQOlYIJqCj
N/3AMQ/+J4+yOslekq4DHYhZaTvJFf0MqhPgTuf2s6I5p449JWn9rebqK2w0M9Xj
fJy7/rboQA4QflBuhTiWcC3Dl1lYtxUqqtcCH9608eqKhbeay87OfV0/vgMwWBRs
FhcOcp1587esJj5gz5M5R9i3S5Yq7Q4fp1+DmS23X41Zz5nTb2q80MY5UklMgI9I
Ydaw1liB8rRHWbdt9yM4UsI8k4fMuj0PE8pEapoTSfsZm8J4cG9qHKrvuWjuFSCF
l18Hyl11nWq8eZ5Vg2E2UIz0EgtWIHKu1/fi4av20/JTuA8Mon15WC5q4BBmDDdD
keR9OJLYclVBh8KweiJSTUE6PcD9A8pWmoWyp6aGRiyyUVhbwysYTzT7uytwQz6w
RH/vVHe0o/m19SnD9rqsRVObc7dOGorFXScMcf4Qxoq9yQm2p0lJDvq6c9uECLMV
RIfZrXe9HS67RB9INybS+1fVlLcd0bLgGfG7q9lWLEABD45HpM5daQ4Mlf8+MIE0
V7egx9t69/WALbJka8pWNISeFRKkB1LRjite+XXasqJ0iFeneM8UKFVB4OMtXL9g
M0m8ovvySySMkoCq3yMlKxXh4rJ1/D556/bAaJBukMPWFWX9FQaP33U3FuzId7jH
ztZVugViQMNiIbQVgUSAcgpuJvgpttAciACODlaw2u2Bk1Txmn0=
=c3Wt
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-xen-6.15' of https://github.com/kvm-x86/linux into HEAD
KVM Xen changes for 6.15
- Don't write to the Xen hypercall page on MSR writes that are initiated by
the host (userspace or KVM) to fix a class of bugs where KVM can write to
guest memory at unexpected times, e.g. during vCPU creation if userspace has
set the Xen hypercall MSR index to collide with an MSR that KVM emulates.
- Restrict the Xen hypercall MSR indx to the unofficial synthetic range to
reduce the set of possible collisions with MSRs that are emulated by KVM
(collisions can still happen as KVM emulates Hyper-V MSRs, which also reside
in the synthetic range).
- Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config.
- Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID
entries when updating PV clocks, as there is no guarantee PV clocks will be
updated between TSC frequency changes and CPUID emulation, and guest reads
of Xen TSC should be rare, i.e. are not a hot path.
|
|
|
|
fcce7c1e7d |
KVM PV clock changes for 6.15:
- Don't take kvm->lock when iterating over vCPUs in the suspend notifier to
fix a largely theoretical deadlock.
- Use the vCPU's actual Xen PV clock information when starting the Xen timer,
as the cached state in arch.hv_clock can be stale/bogus.
- Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different
PV clocks.
- Restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend notifier only
accounts for kvmclock, and there's no evidence that the flag is actually
supported by Xen guests.
- Clean up the per-vCPU "cache" of its reference pvclock, and instead only
track the vCPU's TSC scaling (multipler+shift) metadata (which is moderately
expensive to compute, and rarely changes for modern setups).
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZnJsACgkQOlYIJqCj
N/3nAg//TuURhfTm56TB0PZ01DX9Fqxl+9b+fDSllk1F7O5BcfkkEd11Jv4qa/Zb
eKhSNZzWuDCTMky8izM2Ej4rfTmCg2xF+3hdpVi6yQ7SItgDo2E7e71lm1lNXSMO
oCkQxEwQk0cW2sxeEqPuREq0Zm/kw7jrEt2co2OX2FKlt9UoZiCy6RDBde50z1ut
5Z32k6QX9Alhu67kXvBE/+Xv6abx1dbADOnaTgE7s74smHKxS2WXrfpKnPXjy2y0
pWjX9k2ClSISKdaFbSu4Y0VqeLqE+57ZAWAPT8vndJxjNWOvZK1oBSlaOPchR9CZ
0VFLDWKV2FjEs0O0AkWCw8XTEmdJ4R1ekHpqbBZJ9TJYwVA/LDWOGgR1jcORkzsS
WMJkfMOmQeL8bPR6TBuAFXawbhalsXnYUSthZ3sn4kA7c1DTkIC5mzrDZ3ADPyJi
UpYwVHaWAMOqncEvSQEuUTvSoDeb5P4HMyB4QOAsh1GoKw4vVXpSWUPDy0JKPOnu
WblztX9h/CRB/ZNt/566s2Jh7sCeBO2qs3ffujI4GDosYDcIRRuQy5U08/oMrPRf
l3nPStjxLqsCdJe8IXvL5zwt6YOxJvJdG8XcfcvfQsUCPMAZOIv7PKrt8AFfrN6c
GU5v8x/IBBB46qJw1Jm5eE5S3P/PuaIf235JHpIabGPzJ+H1QGo=
=yv8C
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pvclock-6.15' of https://github.com/kvm-x86/linux into HEAD
KVM PV clock changes for 6.15:
- Don't take kvm->lock when iterating over vCPUs in the suspend notifier to
fix a largely theoretical deadlock.
- Use the vCPU's actual Xen PV clock information when starting the Xen timer,
as the cached state in arch.hv_clock can be stale/bogus.
- Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different
PV clocks.
- Restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend notifier only
accounts for kvmclock, and there's no evidence that the flag is actually
supported by Xen guests.
- Clean up the per-vCPU "cache" of its reference pvclock, and instead only
track the vCPU's TSC scaling (multipler+shift) metadata (which is moderately
expensive to compute, and rarely changes for modern setups).
|
|
|
|
4d9a677596 |
KVM x86 misc changes for 6.15:
- Fix a bug in PIC emulation that caused KVM to emit a spurious KVM_REQ_EVENT.
- Add a helper to consolidate handling of mp_state transitions, and use it to
clear pv_unhalted whenever a vCPU is made RUNNABLE.
- Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
coalesce updates when multiple pieces of vCPU state are changing, e.g. as
part of a nested transition.
- Fix a variety of nested emulation bugs, and add VMX support for synthesizing
nested VM-Exit on interception (instead of injecting #UD into L2).
- Drop "support" for PV Async #PF with proctected guests without SEND_ALWAYS,
as KVM can't get the current CPL.
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZhoYACgkQOlYIJqCj
N/1oSBAAlhsZzv4m6rHtWACjGsSBlAE5WM7HrpnwjOyMW+Desc0WLL4L6qAlxgT7
4ZkBvQ4zsiyUIdLv1XARvkBYHTUkcgG8fQSSJQ6grZit5OcFcMgWafvQrJYI6428
TyzF/x6t0gj8CjDluA6jM/eL5HhWZZDOIg0Wma+pyYpW+7y2tkphXyYyOQPBGwv3
geRUetbDHHXjf042k/8f1j1vjzrNNvAg3YyNyx1YbdU9XKsn5D+SeUW2eVfYk8G7
5QsCOGvUYcbbjrR8kbCZKexvoH6Np9J6YKDe4R9R2yDzgs/96qz6xkYTGCVkHA1y
uursKqRHgbXBxzxa+ban073laT7Qt3S01Gd9bJW3IO7hzG89gl4qfX7fap8T9Yc2
yeBTYIgInpyx+NCdZ2Z/++BzPagBGfa77gFX/eIkmsVA9LWYi9CI3FSjtr/czvWm
a4tfMPvTVBjsBQQ7t/lNksrq0O51lbb3iqqv3ToQpDOOqCWuMEU5xcihhPRr5NSZ
dX4o/jIDhCV8EyXdtASyqMlYBXcuC45ojEZn1elh0QogzYAdSGQ2bIDyxuBtA//k
kSbi+E4GB64jVfBWUyK2QeLOBnBkH7mh6Cg5UYr1Ln9Sm6l8vrcxhcbnchiWxXMI
WCK7BJwI2HojBVpEZ04jMkjHvg36uSfjOzmMLT5yPXfFNebsGmA=
=8SGF
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.15' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.15:
- Fix a bug in PIC emulation that caused KVM to emit a spurious KVM_REQ_EVENT.
- Add a helper to consolidate handling of mp_state transitions, and use it to
clear pv_unhalted whenever a vCPU is made RUNNABLE.
- Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
coalesce updates when multiple pieces of vCPU state are changing, e.g. as
part of a nested transition.
- Fix a variety of nested emulation bugs, and add VMX support for synthesizing
nested VM-Exit on interception (instead of injecting #UD into L2).
- Drop "support" for PV Async #PF with proctected guests without SEND_ALWAYS,
as KVM can't get the current CPL.
- Misc cleanups
|
|
|
|
4286a3ec25 |
KVM x86/mmu changes for 6.15
Add support for "fast" aging of SPTEs in both the TDP MMU and Shadow MMU, where "fast" means "without holding mmu_lock". Not taking mmu_lock allows multiple aging actions to run in parallel, and more importantly avoids stalling vCPUs, e.g. due to holding mmu_lock for an extended duration while a vCPU is faulting in memory. For the TDP MMU, protect aging via RCU; the page tables are RCU-protected and KVM doesn't need to access any metadata to age SPTEs. For the Shadow MMU, use bit 1 of rmap pointers (bit 0 is used to terminate a list of rmaps) to implement a per-rmap single-bit spinlock. When aging a gfn, acquire the rmap's spinlock with read-only permissions, which allows hardening and optimizing the locking and aging, e.g. locking an rmap for write requires mmu_lock to also be held. The lock is NOT a true R/W spinlock, i.e. multiple concurrent readers aren't supported. To avoid forcing all SPTE updates to use atomic operations (clearing the Accessed bit out of mmu_lock makes it inherently volatile), rework and rename spte_has_volatile_bits() to spte_needs_atomic_update() and deliberately exclude the Accessed bit. KVM (and mm/) already tolerates false positives/negatives for Accessed information, and all testing has shown that reducing the latency of aging is far more beneficial to overall system performance than providing "perfect" young/old information. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZizUACgkQOlYIJqCj N/1XkBAAwLxK3uKwkIIgzu+V6NPMiqPBsNtGiiQgRYMfvCwMW2vU6ztsBgNgs6zI eMnOCCo6fQaxPFvpKue8VN7TD33BcjzKZaPiuHZrzQIa/oYQeOlZ4oaN8lr9F9Ec 5l1Lg/p2+z1GGDhWc2opNpg48sCtX7IQ0Tx46LkoB3VSDFP33GwW4+Ht2r71rNeL ofKB+T0hU5HOry5j0w0nTVwOEoNzlm1sVFqOHzgnK18Lmqw2CfOPm+46K+w8nOh+ v+rwGuGa//1kcCjNCcGP1OuJdNAMgXBxND/l6LAkWcHfffIRbXlO07O05dAGqPeF rRn5JUl02OkI6lq99+935OmtEROe6mt+Bx0dhAzk4Z0CD6JY34ShZSAADSnltQlK 2a1E95t63v8a7ZM5dwED7os2HBhxODoyeWQAlIHpkVdmeTJstkyvjPhubJc13+Js oDL6ehs3hhZ171ePn2aXo0NobA5fe7xbl4wugP3hNmBXjLvu04D+llcDmC095nBk ICuzFqFXCXzdjEwgWwPzTseWOCoWTkoRqeJ9lch4UD3mMMcmK0MbK6joocGvCFto cL/eZdElnf1MZwWYdo44X+NEc1jItZVvktkRrllpwCtpRSDINO6RYZGcRf/g0Lha XmaU7jICfi3AKc4N3S2l4KIkd/AeJQySM+kGArxIOYoaqFCe2Mc= =Iy57 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-mmu-6.15' of https://github.com/kvm-x86/linux into HEAD KVM x86/mmu changes for 6.15 Add support for "fast" aging of SPTEs in both the TDP MMU and Shadow MMU, where "fast" means "without holding mmu_lock". Not taking mmu_lock allows multiple aging actions to run in parallel, and more importantly avoids stalling vCPUs, e.g. due to holding mmu_lock for an extended duration while a vCPU is faulting in memory. For the TDP MMU, protect aging via RCU; the page tables are RCU-protected and KVM doesn't need to access any metadata to age SPTEs. For the Shadow MMU, use bit 1 of rmap pointers (bit 0 is used to terminate a list of rmaps) to implement a per-rmap single-bit spinlock. When aging a gfn, acquire the rmap's spinlock with read-only permissions, which allows hardening and optimizing the locking and aging, e.g. locking an rmap for write requires mmu_lock to also be held. The lock is NOT a true R/W spinlock, i.e. multiple concurrent readers aren't supported. To avoid forcing all SPTE updates to use atomic operations (clearing the Accessed bit out of mmu_lock makes it inherently volatile), rework and rename spte_has_volatile_bits() to spte_needs_atomic_update() and deliberately exclude the Accessed bit. KVM (and mm/) already tolerates false positives/negatives for Accessed information, and all testing has shown that reducing the latency of aging is far more beneficial to overall system performance than providing "perfect" young/old information. |
|
|
|
c9c1e20b4c |
KVM: x86: Introduce Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have KVM ignore guest PAT when this quirk is enabled. On AMD platforms, KVM always honors guest PAT. On Intel however there are two issues. First, KVM *cannot* honor guest PAT if CPU feature self-snoop is not supported. Second, UC access on certain Intel platforms can be very slow[1] and honoring guest PAT on those platforms may break some old guests that accidentally specify video RAM as UC. Those old guests may never expect the slowness since KVM always forces WB previously. See [2]. So, introduce a quirk that KVM can enable by default on all Intel platforms to avoid breaking old unmodifiable guests. Newer userspace can disable this quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail if self-snoop is not supported, i.e. if KVM cannot obey the wish. The quirk is a no-op on AMD and also if any assigned devices have non-coherent DMA. This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is another example of a quirk that is sometimes automatically disabled. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Sean Christopherson <seanjc@google.com> Cc: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com # [1] Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com # [2] Message-ID: <20250224070946.31482-1-yan.y.zhao@intel.com> [Use supported_quirks/inapplicable_quirks to support both AMD and no-self-snoop cases, as well as to remove the shadow_memtype_mask check from kvm_mmu_may_ignore_guest_pat(). - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
a4dae7c7a4 |
KVM: x86: Allow vendor code to disable quirks
In some cases, the handling of quirks is split between platform-specific code and generic code, or it is done entirely in generic code, but the relevant bug does not trigger on some platforms; for example, this will be the case for "ignore guest PAT". Allow unaffected vendor modules to disable handling of a quirk for all VMs via a new entry in kvm_caps. Such quirks remain available in KVM_CAP_DISABLE_QUIRKS2, because that API tells userspace that KVM *knows* that some of its past behavior was bogus or just undesirable. In other words, it's plausible for userspace to refuse to run if a quirk is not listed by KVM_CAP_DISABLE_QUIRKS2, so preserve that and make it part of the API. As an example, mark KVM_X86_QUIRK_CD_NW_CLEARED as auto-disabled on Intel systems. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
90cfe144c8 |
KVM: TDX: Add support for find pending IRQ in a protected local APIC
Add flag and hook to KVM's local APIC management to support determining whether or not a TDX guest has a pending IRQ. For TDX vCPUs, the virtual APIC page is owned by the TDX module and cannot be accessed by KVM. As a result, registers that are virtualized by the CPU, e.g. PPR, cannot be read or written by KVM. To deliver interrupts for TDX guests, KVM must send an IRQ to the CPU on the posted interrupt notification vector. And to determine if TDX vCPU has a pending interrupt, KVM must check if there is an outstanding notification. Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is protected to short-circuit the various other flows that try to pull an IRQ out of the vAPIC, the only valid operation is querying _if_ an IRQ is pending, KVM can't do anything based on _which_ IRQ is pending. Intentionally omit sanity checks from other flows, e.g. PPR update, so as not to degrade non-TDX guests with unnecessary checks. A well-behaved KVM and userspace will never reach those flows for TDX guests, but reaching them is not fatal if something does go awry. For the TD exits not due to HLT TDCALL, skip checking RVI pending in tdx_protected_apic_has_interrupt(). Except for the guest being stupid (e.g., non-HLT TDCALL in an interrupt shadow), it's not even possible to have an interrupt in RVI that is fully unmasked. There is no any CPU flows that modify RVI in the middle of instruction execution. I.e. if RVI is non-zero, then either the interrupt has been pending since before the TD exit, or the instruction caused the TD exit is in an STI/SS shadow. KVM doesn't care about STI/SS shadows outside of the HALTED case. And if the interrupt was pending before TD exit, then it _must_ be blocked, otherwise the interrupt would have been serviced at the instruction boundary. For the HLT TDCALL case, it will be handled in a future patch when HLT TDCALL is supported. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-2-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
484612f1a7 |
KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior
Add a flag KVM_DEBUGREG_AUTO_SWITCH to skip saving/restoring guest DRs. TDX-SEAM unconditionally saves/restores guest DRs on TD exit/enter, and resets DRs to architectural INIT state on TD exit. Use the new flag KVM_DEBUGREG_AUTO_SWITCH to indicate that KVM doesn't need to save/restore guest DRs. KVM still needs to restore host DRs after TD exit if there are active breakpoints in the host, which is covered by the existing code. MOV-DR exiting is always cleared for TDX guests, so the handler for DR access is never called, and KVM_DEBUGREG_WONT_EXIT is never set. Add a warning if both KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_AUTO_SWITCH are set. Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT(). Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [binbin: rework changelog] Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20241210004946.3718496-2-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250129095902.16391-13-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
d3a6b6cfb8 |
KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr
Several MSRs are constant and only used in userspace(ring 3). But VMs may have different values. KVM uses kvm_set_user_return_msr() to switch to guest's values and leverages user return notifier to restore them when the kernel is to return to userspace. To eliminate unnecessary wrmsr, KVM also caches the value it wrote to an MSR last time. TDX module unconditionally resets some of these MSRs to architectural INIT state on TD exit. It makes the cached values in kvm_user_return_msrs are inconsistent with values in hardware. This inconsistency needs to be fixed. Otherwise, it may mislead kvm_on_user_return() to skip restoring some MSRs to the host's values. kvm_set_user_return_msr() can help correct this case, but it is not optimal as it always does a wrmsr. So, introduce a variation of kvm_set_user_return_msr() to update cached values and skip that wrmsr. Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250129095902.16391-9-adrian.hunter@intel.com> Reviewed-by: Xiayao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
fbb4adadea |
KVM: x86: Make cpu_dirty_log_size a per-VM value
Make cpu_dirty_log_size (CPU's dirty log buffer size) a per-VM value and set the per-VM cpu_dirty_log_size only for normal VMs when PML is enabled. Do not set it for TDs. Until now, cpu_dirty_log_size was a system-wide value that is used for all VMs and is set to the PML buffer size when PML was enabled in VMX. However, PML is not currently supported for TDs, though PML remains available for normal VMs as long as the feature is supported by hardware and enabled in VMX. Making cpu_dirty_log_size a per-VM value allows it to be ther PML buffer size for normal VMs and 0 for TDs. This allows functions like kvm_arch_sync_dirty_log() and kvm_mmu_update_cpu_dirty_logging() to determine if PML is supported, in order to kick off vCPUs or request them to update CPU dirty logging status (turn on/off PML in VMCS). This fixes an issue first reported in [1], where QEMU attaches an emulated VGA device to a TD; note that KVM_MEM_LOG_DIRTY_PAGES still works if the corresponding has no flag KVM_MEM_GUEST_MEMFD. KVM then invokes kvm_mmu_update_cpu_dirty_logging() and from there vmx_update_cpu_dirty_logging(), which incorrectly accesses a kvm_vmx struct for a TDX VM. Reported-by: ANAND NARSHINHA PATIL <Anand.N.Patil@ibm.com> Reported-by: Pedro Principeza <pedro.principeza@canonical.com> Reported-by: Farrah Chen <farrah.chen@intel.com> Closes: https://github.com/canonical/tdx/issues/202 Link: https://github.com/canonical/tdx/issues/202 [1] Suggested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
a50f673f25 |
KVM: TDX: Do TDX specific vcpu initialization
TD guest vcpu needs TDX specific initialization before running. Repurpose KVM_MEMORY_ENCRYPT_OP to vcpu-scope, add a new sub-command KVM_TDX_INIT_VCPU, and implement the callback for it. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com> Co-developed-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> --- - Fix comment: https://lore.kernel.org/kvm/Z36OYfRW9oPjW8be@google.com/ (Sean) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
8d032b683c |
KVM: TDX: create/destroy VM structure
Implement managing the TDX private KeyID to implement, create, destroy and free for a TDX guest. When creating at TDX guest, assign a TDX private KeyID for the TDX guest for memory encryption, and allocate pages for the guest. These are used for the Trust Domain Root (TDR) and Trust Domain Control Structure (TDCS). On destruction, free the allocated pages, and the KeyID. Before tearing down the private page tables, TDX requires the guest TD to be destroyed by reclaiming the KeyID. Do it in the vm_pre_destroy() kvm_x86_ops hook. The TDR control structures can be freed in the vm_destroy() hook, which runs last. Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> --- - Fix build issue in kvm-coco-queue - Init ret earlier to fix __tdx_td_init() error handling. (Chao) - Standardize -EAGAIN for __tdx_td_init() retry errors (Rick) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
adafea1106 |
KVM: x86: Add infrastructure for secure TSC
Add guest_tsc_protected member to struct kvm_arch_vcpu and prohibit changing TSC offset/multiplier when guest_tsc_protected is true. X86 confidential computing technology defines protected guest TSC so that the VMM can't change the TSC offset/multiplier once vCPU is initialized. SEV-SNP defines Secure TSC as optional, whereas TDX mandates it. KVM has common logic on x86 that tries to guess or adjust TSC offset/multiplier for better guest TSC and TSC interrupt latency at KVM vCPU creation (kvm_arch_vcpu_postcreate()), vCPU migration over pCPU (kvm_arch_vcpu_load()), vCPU TSC device attributes (kvm_arch_tsc_set_attr()) and guest/host writing to TSC or TSC adjust MSR (kvm_set_msr_common()). The current x86 KVM implementation conflicts with protected TSC because the VMM can't change the TSC offset/multiplier. Because KVM emulates the TSC timer or the TSC deadline timer with the TSC offset/multiplier, the TSC timer interrupts is injected to the guest at the wrong time if the KVM TSC offset is different from what the TDX module determined. Originally this issue was found by cyclic test of rt-test [1] as the latency in TDX case is worse than VMX value + TDX SEAMCALL overhead. It turned out that the KVM TSC offset is different from what the TDX module determines. Disable or ignore the KVM logic to change/adjust the TSC offset/multiplier somehow, thus keeping the KVM TSC offset/multiplier the same as the value of the TDX module. Writes to MSR_IA32_TSC are also blocked as they amount to a change in the TSC offset. [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git Reported-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <3a7444aec08042fe205666864b6858910e86aa98.1728719037.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
fb71c79593 |
KVM: x86: Snapshot the host's DEBUGCTL in common x86
Move KVM's snapshot of DEBUGCTL to kvm_vcpu_arch and take the snapshot in common x86, so that SVM can also use the snapshot. Opportunistically change the field to a u64. While bits 63:32 are reserved on AMD, not mentioned at all in Intel's SDM, and managed as an "unsigned long" by the kernel, DEBUGCTL is an MSR and therefore a 64-bit value. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
b50cb2b155 |
KVM: x86: Use a dedicated flow for queueing re-injected exceptions
Open code the filling of vcpu->arch.exception in kvm_requeue_exception() instead of bouncing through kvm_multiple_exception(), as re-injection doesn't actually share that much code with "normal" injection, e.g. the VM-Exit interception check, payload delivery, and nested exception code is all bypassed as those flows only apply during initial injection. When FRED comes along, the special casing will only get worse, as FRED explicitly tracks nested exceptions and essentially delivers the payload on the stack frame, i.e. re-injection will need more inputs, and normal injection will have yet more code that needs to be bypassed when KVM is re-injecting an exception. No functional change intended. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Tested-by: Shan Kang <shan.kang@intel.com> Link: https://lore.kernel.org/r/20241001050110.3643764-2-xin@zytor.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
4fa0efb43a |
KVM: x86: Rename and invert async #PF's send_user_only flag to send_always
Rename send_user_only to avoid "user", because KVM's ABI is to not inject page faults into CPL0, whereas "user" in x86 is specifically CPL3. Invert the polarity to keep the naming simple and unambiguous. E.g. while KVM often refers to CPL0 as "kernel", that terminology isn't ubiquitous, and "send_kernel" could be misconstrued as "send only to kernel". Link: https://lore.kernel.org/r/20250215010609.1199982-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
26e228ec16 |
KVM: x86/xen: Move kvm_xen_hvm_config field into kvm_xen
Now that all KVM usage of the Xen HVM config information is buried behind CONFIG_KVM_XEN=y, move the per-VM kvm_xen_hvm_config field out of kvm_arch and into kvm_xen. No functional change intended. Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250215011437.1203084-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
69e5a7dde9 |
KVM: x86/xen: Bury xen_hvm_config behind CONFIG_KVM_XEN=y
Now that all references to kvm_vcpu_arch.xen_hvm_config are wrapped with CONFIG_KVM_XEN #ifdefs, bury the field itself behind CONFIG_KVM_XEN=y. No functional change intended. Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250215011437.1203084-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
4834eaded9 |
KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
Steal another bit from rmap entries (which are word aligned pointers, i.e. have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use the bit to implement a *very* rudimentary per-rmap spinlock. The only anticipated usage of the lock outside of mmu_lock is for aging gfns, and collisions between aging and other MMU rmap operations are quite rare, e.g. unless userspace is being silly and aging a tiny range over and over in a tight loop, time between contention when aging an actively running VM is O(seconds). In short, a more sophisticated locking scheme shouldn't be necessary. Note, the lock only protects the rmap structure itself, SPTEs that are pointed at by a locked rmap can still be modified and zapped by another task (KVM drops/zaps SPTEs before deleting the rmap entries) Co-developed-by: James Houghton <jthoughton@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-10-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
b146a9b34a |
KVM: x86/mmu: Age TDP MMU SPTEs without holding mmu_lock
Walk the TDP MMU in an RCU read-side critical section without holding mmu_lock when harvesting and potentially updating age information on TDP MMU SPTEs. Add a new macro to do RCU-safe walking of TDP MMU roots, and do all SPTE aging with atomic updates; while clobbering Accessed information is ok, KVM must not corrupt other bits, e.g. must not drop a Dirty or Writable bit when making a SPTE young.. If updating a SPTE to mark it for access tracking fails, leave it as is and treat it as if it were young. If the spte is being actively modified, it is most likely young. Acquire and release mmu_lock for write when harvesting age information from the shadow MMU, as the shadow MMU doesn't yet support aging outside of mmu_lock. Suggested-by: Yu Zhao <yuzhao@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-5-jthoughton@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
39d61b46ad |
KVM: x86: Remove per-vCPU "cache" of its reference pvclock
Remove the per-vCPU "cache" of the reference pvclock and instead cache only the TSC shift+multiplier. All other fields in pvclock are fully recomputed by kvm_guest_time_update(), i.e. aren't actually persisted. In addition to shaving a few bytes, explicitly tracking the TSC shift/mul fields makes it easier to see that those fields are tied to hw_tsc_khz (they exist to avoid having to do expensive math in the common case). And conversely, not tracking the other fields makes it easier to see that things like the version number are pulled from the guest's copy, not from KVM's reference. Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
93da6af3ae |
KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation
Defer runtime CPUID updates until the next non-faulting CPUID emulation
or KVM_GET_CPUID2, which are the only paths in KVM that consume the
dynamic entries. Deferring the updates is especially beneficial to
nested VM-Enter/VM-Exit, as KVM will almost always detect multiple state
changes, not to mention the updates don't need to be realized while L2 is
active if CPUID is being intercepted by L1 (CPUID is a mandatory intercept
on Intel, but not AMD).
Deferring CPUID updates shaves several hundred cycles from nested VMX
roundtrips, as measured from L2 executing CPUID in a tight loop:
SKX 6850 => 6450
ICX 9000 => 8800
EMR 7900 => 7700
Alternatively, KVM could update only the CPUID leaves that are affected
by the state change, e.g. update XSAVE info only if XCR0 or XSS changes,
but that adds non-trivial complexity and doesn't solve the underlying
problem of nested transitions potentially changing both XCR0 and XSS, on
both nested VM-Enter and VM-Exit.
Skipping updates entirely if L2 is active and CPUID is being intercepted
by L1 could work for the common case. However, simply skipping updates if
L2 is active is *very* subtly dangerous and complex. Most KVM updates are
triggered by changes to the current vCPU state, which may be L2 state,
whereas performing updates only for L1 would requiring detecting changes
to L1 state. KVM would need to either track relevant L1 state, or defer
runtime CPUID updates until the next nested VM-Exit. The former is ugly
and complex, while the latter comes with similar dangers to deferring all
CPUID updates, and would only address the nested VM-Enter path.
To guard against using stale data, disallow querying dynamic CPUID feature
bits, i.e. features that KVM updates at runtime, via a compile-time
assertion in guest_cpu_cap_has(). Exempt MWAIT from the rule, as the
MISC_ENABLE_NO_MWAIT means that MWAIT is _conditionally_ a dynamic CPUID
feature.
Note, the rule could be enforced for MWAIT as well, e.g. by querying guest
CPUID in kvm_emulate_monitor_mwait, but there's no obvious advtantage to
doing so, and allowing MWAIT for guest_cpuid_has() opens up a different can
of worms. MONITOR/MWAIT can't be virtualized (for a reasonable definition),
and the nature of the MWAIT_NEVER_UD_FAULTS and MISC_ENABLE_NO_MWAIT quirks
means checking X86_FEATURE_MWAIT outside of kvm_emulate_monitor_mwait() is
wrong for other reasons.
Beyond the aforementioned feature bits, the only other dynamic CPUID
(sub)leaves are the XSAVE sizes, and similar to MWAIT, consuming those
CPUID entries in KVM is all but guaranteed to be a bug. The layout for an
actual XSAVE buffer depends on the format (compacted or not) and
potentially the features that are actually enabled. E.g. see the logic in
fpstate_clear_xstate_component() needed to poke into the guest's effective
XSAVE state to clear MPX state on INIT. KVM does consume
CPUID.0xD.0.{EAX,EDX} in kvm_check_cpuid() and cpuid_get_supported_xcr0(),
but not EBX, which is the only dynamic output register in the leaf.
Link: https://lore.kernel.org/r/20241211013302.1347853-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
c2fee09fc1 |
KVM: x86: Load DR6 with guest value only before entering .vcpu_run() loop
Move the conditional loading of hardware DR6 with the guest's DR6 value
out of the core .vcpu_run() loop to fix a bug where KVM can load hardware
with a stale vcpu->arch.dr6.
When the guest accesses a DR and host userspace isn't debugging the guest,
KVM disables DR interception and loads the guest's values into hardware on
VM-Enter and saves them on VM-Exit. This allows the guest to access DRs
at will, e.g. so that a sequence of DR accesses to configure a breakpoint
only generates one VM-Exit.
For DR0-DR3, the logic/behavior is identical between VMX and SVM, and also
identical between KVM_DEBUGREG_BP_ENABLED (userspace debugging the guest)
and KVM_DEBUGREG_WONT_EXIT (guest using DRs), and so KVM handles loading
DR0-DR3 in common code, _outside_ of the core kvm_x86_ops.vcpu_run() loop.
But for DR6, the guest's value doesn't need to be loaded into hardware for
KVM_DEBUGREG_BP_ENABLED, and SVM provides a dedicated VMCB field whereas
VMX requires software to manually load the guest value, and so loading the
guest's value into DR6 is handled by {svm,vmx}_vcpu_run(), i.e. is done
_inside_ the core run loop.
Unfortunately, saving the guest values on VM-Exit is initiated by common
x86, again outside of the core run loop. If the guest modifies DR6 (in
hardware, when DR interception is disabled), and then the next VM-Exit is
a fastpath VM-Exit, KVM will reload hardware DR6 with vcpu->arch.dr6 and
clobber the guest's actual value.
The bug shows up primarily with nested VMX because KVM handles the VMX
preemption timer in the fastpath, and the window between hardware DR6
being modified (in guest context) and DR6 being read by guest software is
orders of magnitude larger in a nested setup. E.g. in non-nested, the
VMX preemption timer would need to fire precisely between #DB injection
and the #DB handler's read of DR6, whereas with a KVM-on-KVM setup, the
window where hardware DR6 is "dirty" extends all the way from L1 writing
DR6 to VMRESUME (in L1).
L1's view:
==========
<L1 disables DR interception>
CPU 0/KVM-7289 [023] d.... 2925.640961: kvm_entry: vcpu 0
A: L1 Writes DR6
CPU 0/KVM-7289 [023] d.... 2925.640963: <hack>: Set DRs, DR6 = 0xffff0ff1
B: CPU 0/KVM-7289 [023] d.... 2925.640967: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT intr_info 0x800000ec
D: L1 reads DR6, arch.dr6 = 0
CPU 0/KVM-7289 [023] d.... 2925.640969: <hack>: Sync DRs, DR6 = 0xffff0ff0
CPU 0/KVM-7289 [023] d.... 2925.640976: kvm_entry: vcpu 0
L2 reads DR6, L1 disables DR interception
CPU 0/KVM-7289 [023] d.... 2925.640980: kvm_exit: vcpu 0 reason DR_ACCESS info1 0x0000000000000216
CPU 0/KVM-7289 [023] d.... 2925.640983: kvm_entry: vcpu 0
CPU 0/KVM-7289 [023] d.... 2925.640983: <hack>: Set DRs, DR6 = 0xffff0ff0
L2 detects failure
CPU 0/KVM-7289 [023] d.... 2925.640987: kvm_exit: vcpu 0 reason HLT
L1 reads DR6 (confirms failure)
CPU 0/KVM-7289 [023] d.... 2925.640990: <hack>: Sync DRs, DR6 = 0xffff0ff0
L0's view:
==========
L2 reads DR6, arch.dr6 = 0
CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216
CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216
L2 => L1 nested VM-Exit
CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit_inject: reason: DR_ACCESS ext_inf1: 0x0000000000000216
CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_entry: vcpu 23
CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_exit: vcpu 23 reason VMREAD
CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_entry: vcpu 23
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason VMREAD
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_entry: vcpu 23
L1 writes DR7, L0 disables DR interception
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000007
CPU 23/KVM-5046 [001] d.... 3410.005613: kvm_entry: vcpu 23
L0 writes DR6 = 0 (arch.dr6)
CPU 23/KVM-5046 [001] d.... 3410.005613: <hack>: Set DRs, DR6 = 0xffff0ff0
A: <L1 writes DR6 = 1, no interception, arch.dr6 is still '0'>
B: CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_exit: vcpu 23 reason PREEMPTION_TIMER
CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_entry: vcpu 23
C: L0 writes DR6 = 0 (arch.dr6)
CPU 23/KVM-5046 [001] d.... 3410.005614: <hack>: Set DRs, DR6 = 0xffff0ff0
L1 => L2 nested VM-Enter
CPU 23/KVM-5046 [001] d.... 3410.005616: kvm_exit: vcpu 23 reason VMRESUME
L0 reads DR6, arch.dr6 = 0
Reported-by: John Stultz <jstultz@google.com>
Closes: https://lkml.kernel.org/r/CANDhNCq5_F3HfFYABqFGCA1bPd_%2BxgNj-iDQhH4tDk%2Bwi8iZZg%40mail.gmail.com
Fixes:
|
|
|
|
dfcbcd864e |
KVM: x86: Remove unused iommu_domain and iommu_noncoherent from kvm_arch
Remove the "iommu_domain" and "iommu_noncoherent" fields from struct
kvm_arch, which are no longer used since commit
|
|
|
|
0f8e26b38d |
Loongarch:
* Clear LLBCTL if secondary mmu mapping changes.
* Add hypercall service support for usermode VMM.
x86:
* Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a
direct call to kvm_tdp_page_fault() when RETPOLINE is enabled.
* Ensure that all SEV code is compiled out when disabled in Kconfig, even
if building with less brilliant compilers.
* Remove a redundant TLB flush on AMD processors when guest CR4.PGE changes.
* Use str_enabled_disabled() to replace open coded strings.
* Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache
prior to every VM-Enter.
* Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
instead of just those where KVM needs to manage state and/or explicitly
enable the feature in hardware. Along the way, refactor the code to make
it easier to add features, and to make it more self-documenting how KVM
is handling each feature.
* Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
where KVM unintentionally puts the vCPU into infinite loops in some scenarios
(e.g. if emulation is triggered by the exit), and brings parity between VMX
and SVM.
* Add pending request and interrupt injection information to the kvm_exit and
kvm_entry tracepoints respectively.
* Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
loading guest/host PKRU, due to a refactoring of the kernel helpers that
didn't account for KVM's pre-checking of the need to do WRPKRU.
* Make the completion of hypercalls go through the complete_hypercall
function pointer argument, no matter if the hypercall exits to
userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE
specifically went to userspace, and all the others did not; the new code
need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
all whether there was an exit to userspace or not.
* As part of enabling TDX virtual machines, support support separation of
private/shared EPT into separate roots. When TDX will be enabled, operations
on private pages will need to go through the privileged TDX Module via SEAMCALLs;
as a result, they are limited and relatively slow compared to reading a PTE.
The patches included in 6.14 allow KVM to keep a mirror of the private EPT in
host memory, and define entries in kvm_x86_ops to operate on external page
tables such as the TDX private EPT.
* The recently introduced conversion of the NX-page reclamation kthread to
vhost_task moved the task under the main process. The task is created as
soon as KVM_CREATE_VM was invoked and this, of course, broke userspace that
didn't expect to see any child task of the VM process until it started
creating its own userspace threads. In particular crosvm refuses to fork()
if procfs shows any child task, so unbreak it by creating the task lazily.
This is arguably a userspace bug, as there can be other kinds of legitimate
worker tasks and they wouldn't impede fork(); but it's not like userspace
has a way to distinguish kernel worker tasks right now. Should they show
as "Kthread: 1" in proc/.../status?
x86 - Intel:
* Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit
while L2 is active, while ultimately results in a hardware-accelerated L1
EOI effectively being lost.
* Honor event priority when emulating Posted Interrupt delivery during nested
VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the
interrupt.
* Rework KVM's processing of the Page-Modification Logging buffer to reap
entries in the same order they were created, i.e. to mark gfns dirty in the
same order that hardware marked the page/PTE dirty.
* Misc cleanups.
Generic:
* Cleanup and harden kvm_set_memory_region(); add proper lockdep assertions when
setting memory regions and add a dedicated API for setting KVM-internal
memory regions. The API can then explicitly disallow all flags for
KVM-internal memory regions.
* Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug
where KVM would return a pointer to a vCPU prior to it being fully online,
and give kvm_for_each_vcpu() similar treatment to fix a similar flaw.
* Wait for a vCPU to come online prior to executing a vCPU ioctl, to fix a
bug where userspace could coerce KVM into handling the ioctl on a vCPU that
isn't yet onlined.
* Gracefully handle xarray insertion failures; even though such failures are
impossible in practice after xa_reserve(), reserving an entry is always followed
by xa_store() which does not know (or differentiate) whether there was an
xa_reserve() before or not.
RISC-V:
* Zabha, Svvptc, and Ziccrse extension support for guests. None of them
require anything in KVM except for detecting them and marking them
as supported; Zabha adds byte and halfword atomic operations, while the
others are markers for specific operation of the TLB and of LL/SC
instructions respectively.
* Virtualize SBI system suspend extension for Guest/VM
* Support firmware counters which can be used by the guests to collect
statistics about traps that occur in the host.
Selftests:
* Rework vcpu_get_reg() to return a value instead of using an out-param, and
update all affected arch code accordingly.
* Convert the max_guest_memory_test into a more generic mmu_stress_test.
The basic gist of the "conversion" is to have the test do mprotect() on
guest memory while vCPUs are accessing said memory, e.g. to verify KVM
and mmu_notifiers are working as intended.
* Play nice with treewrite builds of unsupported architectures, e.g. arm
(32-bit), as KVM selftests' Makefile doesn't do anything to ensure the
target architecture is actually one KVM selftests supports.
* Use the kernel's $(ARCH) definition instead of the target triple for arch
specific directories, e.g. arm64 instead of aarch64, mainly so as not to
be different from the rest of the kernel.
* Ensure that format strings for logging statements are checked by the
compiler even when the logging statement itself is disabled.
* Attempt to whack the last LLC references/misses mole in the Intel PMU
counters test by adding a data load and doing CLFLUSH{OPT} on the data
instead of the code being executed. It seems that modern Intel CPUs
have learned new code prefetching tricks that bypass the PMU counters.
* Fix a flaw in the Intel PMU counters test where it asserts that events
are counting correctly without actually knowing what the events count
given the underlying hardware; this can happen if Intel reuses a
formerly microarchitecture-specific event encoding as an architectural
event, as was the case for Top-Down Slots.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmeTuzoUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOkBwf8CRNExYaM3j9y2E7mmo6AiL2ug6+J
Uy5Hai1poY48pPwKC6ke3EWT8WVsgj/Py5pCeHvLojQchWNjCCYNfSQluJdkRxwG
DgP3QUljSxEJWBeSwyTRcKM+IySi5hZd1IFo3gePFRB829Jpnj05vjbvCyv8gIwU
y3HXxSYDsViaaFoNg4OlZFsIGis7mtknsZzk++QjuCXmxNa6UCbv3qvE/UkVLhVg
WH65RTRdjk+EsdwaOMHKuUvQoGa+iM4o39b6bqmw8+ZMK39+y33WeTX/y5RXsp1N
tUUBRfS+MuuYgC/6LmTr66EkMzoChxk3Dp3kKUaCBcfqRC8PxQag5reZhw==
=NEaO
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"Loongarch:
- Clear LLBCTL if secondary mmu mapping changes
- Add hypercall service support for usermode VMM
x86:
- Add a comment to kvm_mmu_do_page_fault() to explain why KVM
performs a direct call to kvm_tdp_page_fault() when RETPOLINE is
enabled
- Ensure that all SEV code is compiled out when disabled in Kconfig,
even if building with less brilliant compilers
- Remove a redundant TLB flush on AMD processors when guest CR4.PGE
changes
- Use str_enabled_disabled() to replace open coded strings
- Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's
APICv cache prior to every VM-Enter
- Overhaul KVM's CPUID feature infrastructure to track all vCPU
capabilities instead of just those where KVM needs to manage state
and/or explicitly enable the feature in hardware. Along the way,
refactor the code to make it easier to add features, and to make it
more self-documenting how KVM is handling each feature
- Rework KVM's handling of VM-Exits during event vectoring; this
plugs holes where KVM unintentionally puts the vCPU into infinite
loops in some scenarios (e.g. if emulation is triggered by the
exit), and brings parity between VMX and SVM
- Add pending request and interrupt injection information to the
kvm_exit and kvm_entry tracepoints respectively
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU
when loading guest/host PKRU, due to a refactoring of the kernel
helpers that didn't account for KVM's pre-checking of the need to
do WRPKRU
- Make the completion of hypercalls go through the complete_hypercall
function pointer argument, no matter if the hypercall exits to
userspace or not.
Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically
went to userspace, and all the others did not; the new code need
not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
all whether there was an exit to userspace or not
- As part of enabling TDX virtual machines, support support
separation of private/shared EPT into separate roots.
When TDX will be enabled, operations on private pages will need to
go through the privileged TDX Module via SEAMCALLs; as a result,
they are limited and relatively slow compared to reading a PTE.
The patches included in 6.14 allow KVM to keep a mirror of the
private EPT in host memory, and define entries in kvm_x86_ops to
operate on external page tables such as the TDX private EPT
- The recently introduced conversion of the NX-page reclamation
kthread to vhost_task moved the task under the main process. The
task is created as soon as KVM_CREATE_VM was invoked and this, of
course, broke userspace that didn't expect to see any child task of
the VM process until it started creating its own userspace threads.
In particular crosvm refuses to fork() if procfs shows any child
task, so unbreak it by creating the task lazily. This is arguably a
userspace bug, as there can be other kinds of legitimate worker
tasks and they wouldn't impede fork(); but it's not like userspace
has a way to distinguish kernel worker tasks right now. Should they
show as "Kthread: 1" in proc/.../status?
x86 - Intel:
- Fix a bug where KVM updates hardware's APICv cache of the highest
ISR bit while L2 is active, while ultimately results in a
hardware-accelerated L1 EOI effectively being lost
- Honor event priority when emulating Posted Interrupt delivery
during nested VM-Enter by queueing KVM_REQ_EVENT instead of
immediately handling the interrupt
- Rework KVM's processing of the Page-Modification Logging buffer to
reap entries in the same order they were created, i.e. to mark gfns
dirty in the same order that hardware marked the page/PTE dirty
- Misc cleanups
Generic:
- Cleanup and harden kvm_set_memory_region(); add proper lockdep
assertions when setting memory regions and add a dedicated API for
setting KVM-internal memory regions. The API can then explicitly
disallow all flags for KVM-internal memory regions
- Explicitly verify the target vCPU is online in kvm_get_vcpu() to
fix a bug where KVM would return a pointer to a vCPU prior to it
being fully online, and give kvm_for_each_vcpu() similar treatment
to fix a similar flaw
- Wait for a vCPU to come online prior to executing a vCPU ioctl, to
fix a bug where userspace could coerce KVM into handling the ioctl
on a vCPU that isn't yet onlined
- Gracefully handle xarray insertion failures; even though such
failures are impossible in practice after xa_reserve(), reserving
an entry is always followed by xa_store() which does not know (or
differentiate) whether there was an xa_reserve() before or not
RISC-V:
- Zabha, Svvptc, and Ziccrse extension support for guests. None of
them require anything in KVM except for detecting them and marking
them as supported; Zabha adds byte and halfword atomic operations,
while the others are markers for specific operation of the TLB and
of LL/SC instructions respectively
- Virtualize SBI system suspend extension for Guest/VM
- Support firmware counters which can be used by the guests to
collect statistics about traps that occur in the host
Selftests:
- Rework vcpu_get_reg() to return a value instead of using an
out-param, and update all affected arch code accordingly
- Convert the max_guest_memory_test into a more generic
mmu_stress_test. The basic gist of the "conversion" is to have the
test do mprotect() on guest memory while vCPUs are accessing said
memory, e.g. to verify KVM and mmu_notifiers are working as
intended
- Play nice with treewrite builds of unsupported architectures, e.g.
arm (32-bit), as KVM selftests' Makefile doesn't do anything to
ensure the target architecture is actually one KVM selftests
supports
- Use the kernel's $(ARCH) definition instead of the target triple
for arch specific directories, e.g. arm64 instead of aarch64,
mainly so as not to be different from the rest of the kernel
- Ensure that format strings for logging statements are checked by
the compiler even when the logging statement itself is disabled
- Attempt to whack the last LLC references/misses mole in the Intel
PMU counters test by adding a data load and doing CLFLUSH{OPT} on
the data instead of the code being executed. It seems that modern
Intel CPUs have learned new code prefetching tricks that bypass the
PMU counters
- Fix a flaw in the Intel PMU counters test where it asserts that
events are counting correctly without actually knowing what the
events count given the underlying hardware; this can happen if
Intel reuses a formerly microarchitecture-specific event encoding
as an architectural event, as was the case for Top-Down Slots"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (151 commits)
kvm: defer huge page recovery vhost task to later
KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
KVM: Disallow all flags for KVM-internal memslots
KVM: x86: Drop double-underscores from __kvm_set_memory_region()
KVM: Add a dedicated API for setting KVM-internal memslots
KVM: Assert slots_lock is held when setting memory regions
KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API)
LoongArch: KVM: Add hypercall service support for usermode VMM
LoongArch: KVM: Clear LLBCTL if secondary mmu mapping is changed
KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup()
KVM: VMX: read the PML log in the same order as it was written
KVM: VMX: refactor PML terminology
KVM: VMX: Fix comment of handle_vmx_instruction()
KVM: VMX: Reinstate __exit attribute for vmx_exit()
KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup()
KVM: x86: Avoid double RDPKRU when loading host/guest PKRU
KVM: x86: Use LVT_TIMER instead of an open coded literal
RISC-V: KVM: Add new exit statstics for redirected traps
RISC-V: KVM: Update firmware counters for various events
RISC-V: KVM: Redirect instruction access fault trap to guest
...
|
|
|
|
931656b9e2 |
kvm: defer huge page recovery vhost task to later
Some libraries want to ensure they are single threaded before forking,
so making the kernel's kvm huge page recovery process a vhost task of
the user process breaks those. The minijail library used by crosvm is
one such affected application.
Defer the task to after the first VM_RUN call, which occurs after the
parent process has forked all its jailed processes. This needs to happen
only once for the kvm instance, so introduce some general-purpose
infrastructure for that, too. It's similar in concept to pthread_once;
except it is actually usable, because the callback takes a parameter.
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250123153543.2769928-1-kbusch@meta.com>
[Move call_once API to include/linux. - Paolo]
Cc: stable@vger.kernel.org
Fixes:
|
|
|
|
86eb1aef72 |
Merge branch 'kvm-mirror-page-tables' into HEAD
As part of enabling TDX virtual machines, support support separation of
private/shared EPT into separate roots.
Confidential computing solutions almost invariably have concepts of
private and shared memory, but they may different a lot in the details.
In SEV, for example, the bit is handled more like a permission bit as
far as the page tables are concerned: the private/shared bit is not
included in the physical address.
For TDX, instead, the bit is more like a physical address bit, with
the host mapping private memory in one half of the address space and
shared in another. Furthermore, the two halves are mapped by different
EPT roots and only the shared half is managed by KVM; the private half
(also called Secure EPT in Intel documentation) gets managed by the
privileged TDX Module via SEAMCALLs.
As a result, the operations that actually change the private half of
the EPT are limited and relatively slow compared to reading a PTE. For
this reason the design for KVM is to keep a mirror of the private EPT in
host memory. This allows KVM to quickly walk the EPT and only perform the
slower private EPT operations when it needs to actually modify mid-level
private PTEs.
There are thus three sets of EPT page tables: external, mirror and
direct. In the case of TDX (the only user of this framework) the
first two cover private memory, whereas the third manages shared
memory:
external EPT - Hidden within the TDX module, modified via TDX module
calls.
mirror EPT - Bookkeeping tree used as an optimization by KVM, not
used by the processor.
direct EPT - Normal EPT that maps unencrypted shared memory.
Managed like the EPT of a normal VM.
Modifying external EPT
----------------------
Modifications to the mirrored page tables need to also perform the
same operations to the private page tables, which will be handled via
kvm_x86_ops. Although this prep series does not interact with the TDX
module at all to actually configure the private EPT, it does lay the
ground work for doing this.
In some ways updating the private EPT is as simple as plumbing PTE
modifications through to also call into the TDX module; however, the
locking is more complicated because inserting a single PTE cannot anymore
be done atomically with a single CMPXCHG. For this reason, the existing
FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the
private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides
protecting operation of KVM, it limits the set of cases in which the
TDX module will encounter contention on its own PTE locks.
Zapping external EPT
--------------------
While the framework tries to be relatively generic, and to be
understandable without knowing TDX much in detail, some requirements of
TDX sometimes leak; for example the private page tables also cannot be
zapped while the range has anything mapped, so the mirrored/private page
tables need to be protected from KVM operations that zap any non-leaf
PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().
For normal VMs, guest memory is zapped for several reasons: user
memory getting paged out by the guest, memslots getting deleted,
passthrough of devices with non-coherent DMA. Confidential computing
adds to these the conversion of memory between shared and privates. These
operations must not zap any private memory that is in use by the guest.
This is possible because the only zapping that is out of the control
of KVM/userspace is paging out userspace memory, which cannot apply to
guestmemfd operations. Thus a TDX VM will only zap private memory from
memslot deletion and from conversion between private and shared memory
which is triggered by the guest.
To avoid zapping too much memory, enums are introduced so that operations
can choose to target only private or shared memory, and thus only
direct or mirror EPT. For example:
Memslot deletion - Private and shared
MMU notifier based zapping - Shared only
Conversion to shared - Private only
Conversion to private - Shared only
Other cases of zapping will not be supported for KVM, for example
APICv update or non-coherent DMA status update; for the latter, TDX will
simply require that the CPU supports self-snoop and honor guest PAT
unconditionally for shared memory.
|
|
|
|
3eba032bb7 |
Merge branch 'kvm-userspace-hypercall' into HEAD
Make the completion of hypercalls go through the complete_hypercall function pointer argument, no matter if the hypercall exits to userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically went to userspace, and all the others did not; the new code need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at all whether there was an exit to userspace or not. |
|
|
|
4f7ff70c05 |
KVM x86 misc changes for 6.14:
- Overhaul KVM's CPUID feature infrastructure to replace "governed" features
with per-vCPU tracking of the vCPU's capabailities for all features. Along
the way, refactor the code to make it easier to add/modify features, and
add a variety of self-documenting macro types to again simplify adding new
features and to help readers understand KVM's handling of existing features.
- Rework KVM's handling of VM-Exits during event vectoring to plug holes where
KVM unintentionally puts the vCPU into infinite loops in some scenarios,
e.g. if emulation is triggered by the exit, and to bring parity between VMX
and SVM.
- Add pending request and interrupt injection information to the kvm_exit and
kvm_entry tracepoints respectively.
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
loading guest/host PKRU due to a refactoring of the kernel helpers that
didn't account for KVM's pre-checking of the need to do WRPKRU.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeJngsACgkQOlYIJqCj
N/1dfA/+NIZmnd8OV9Zvc6HGxrzgt4QsM9pmsUmrfkDWefxMYIAMeaW8Vn4CJfRf
zY/UcqyNI7JYxSiuVTckz+Tf54HhqYaLrUwILGCQ49koirZx+aQT1OUfjLroVMlh
ffX1i6GOoLNtxjb9MXM/heLVdUbvmzQMSFkd/AkOH+nrOtDNOiPlZfjHsewj9zrf
BNJGhzvT4M6vc/AsScC7tc0yFD5KKFRv8tVwJ6Zf1nWKyUDOSpMTWkVnq6geKJPZ
iGBZPPNg55Oy1g6uj6VYWmqYTD8Qioz5jtEJ/8pPHdAyIFo21s81bfJc548d+QLh
KfrL1K7TrCOhSAGC3Cb3lTLeq2immmGHaiTBLwGABG4MhpiX4NVpMMdOyFbVLMOS
HIYuwXwDckm1pfU7/w+PgPaakCyPrXQntm+3Y2pvDOoY6e2JbwodK4j8BvvQda35
8TrYKEGFvq5aij7Iw1O9TUoLAocDM/sHIHE6BCazHyzKBIv9xLRFeabiCQ+A1pwv
gZk5u0+j+DPpLdeLhbMYhIXUtr3bvyMYvc+tRkG716f8ubAE3+Kn5BEDo4Ot2DcT
vc+NTRYYWN6zavHiJH3Ddt153yj256JCZhLwCdfbryCQdz3Mpy16m36tgkDRd3lR
QT4IkPQo1Vl/aU0yiE/dhnJgh1rTO26YQjZoHs5Oj16d0HRrKyc=
=32mM
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.14:
- Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
instead of just those where KVM needs to manage state and/or explicitly
enable the feature in hardware. Along the way, refactor the code to make
it easier to add features, and to make it more self-documenting how KVM
is handling each feature.
- Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
where KVM unintentionally puts the vCPU into infinite loops in some scenarios
(e.g. if emulation is triggered by the exit), and brings parity between VMX
and SVM.
- Add pending request and interrupt injection information to the kvm_exit and
kvm_entry tracepoints respectively.
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
loading guest/host PKRU, due to a refactoring of the kernel helpers that
didn't account for KVM's pre-checking of the need to do WRPKRU.
|
|
|
|
ef5a3c92a8 |
hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h
Switch to using hvhdk.h everywhere in the kernel. This header includes all the new Hyper-V headers in include/hyperv, which form a superset of the definitions found in hyperv-tlfs.h. This makes it easier to add new Hyper-V interfaces without being restricted to those in the TLFS doc (reflected in hyperv-tlfs.h). To be more consistent with the original Hyper-V code, the names of some definitions are changed slightly. Update those where needed. Update comments in mshyperv.h files to point to include/hyperv for adding new definitions. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com> Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com> Link: https://lore.kernel.org/r/1732577084-2122-5-git-send-email-nunodasneves@linux.microsoft.com Link: https://lore.kernel.org/r/20250108222138.1623703-3-romank@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> |