mirror of https://github.com/torvalds/linux.git
574 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
db44dcbdf8 |
KVM x86 posted interrupt changes for 6.16:
Refine and optimize KVM's software processing of the PIR, and ultimately share PIR harvesting code between KVM and the kernel's Posted MSI handler -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmWcACgkQOlYIJqCj N/3mUw/9HN4OLRqFytu+GjEocl8I7JelJdwCsNMsUwZRnNVnYGDqsjvw8rzqeFmx RoQ8uNqMd1PqZOgAdN6suLES949ItErbnG2+UlBvZeNgR63K8fyNJaPUzSXh0Kyd vNNzGschI0txZXNEtMHcIsCuQknU/arlE6v+HOAokb1jxaIZH2h06vrBAj6pLAHO hbcZPkaQEaFoQhqCbYm015ecJQRPv3IZoW7H1cK5nC4q6QdNo3LPfGqUJwgHV3Wq hbfS+2J78nTqLhSn7HHE/y5z3R5+ZyPwFQwbqfvjjap5/DW5w8Tltg2Oif597lf2 klBukBkJyfzSdhjaPKb3V23kCNabNyyX7KUDZnW5HCiEu62Lnl0MexXCvFvSvtmy YDSsXMg3KdtlESwUOaxGjd2J81tx36L3ZvWRaopDLzA2A6KVyVQCSANGOGkKrRzq Qq3R/frzp1uUVpVDtdyDIO1AujoXkRecdOj1uAIr2XQBg8jx0kveAUyrkXFbQVjK oNbfRlOiu6/vnXkWqwZ2w/Q0kRRrK7M+vensOZlculqDqxPH+BLWB+dfPqjGikb/ cL01KPu6n/GQJpwAxIbGU4eUIQPAVOcHm3iRaIlRqEoDCs7C8fTRIyDx+cD1vW8O O9j/r05EV/Ck5XF2ks6bHIK+C3wemNrCvoeFbnO1uicqtdO+Tqw= =dU1G -----END PGP SIGNATURE----- Merge tag 'kvm-x86-pir-6.16' of https://github.com/kvm-x86/linux into HEAD KVM x86 posted interrupt changes for 6.16: Refine and optimize KVM's software processing of the PIR, and ultimately share PIR harvesting code between KVM and the kernel's Posted MSI handler |
|
|
|
edaf3eded3 |
x86/irq: KVM: Add helper for harvesting PIR to deduplicate KVM and posted MSIs
Now that posted MSI and KVM harvesting of PIR is identical, extract the code (and posted MSI's wonderful comment) to a common helper. No functional change intended. Link: https://lore.kernel.org/r/20250401163447.846608-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
baf68a0e3b |
KVM: VMX: Use arch_xchg() when processing PIR to avoid instrumentation
Use arch_xchg() when moving IRQs from the PIR to the vIRR, purely to avoid instrumentation so that KVM is compatible with the needs of posted MSI. This will allow extracting the core PIR logic to common code and sharing it between KVM and posted MSI handling. Link: https://lore.kernel.org/r/20250401163447.846608-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
b41f8638b9 |
KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR
Rework KVM's processing of the PIR to use the same algorithm as posted
MSIs, i.e. to do READ(x4) => XCHG(x4) instead of (READ+XCHG)(x4). Given
KVM's long-standing, sub-optimal use of 32-bit accesses to the PIR, it's
safe to say far more thought and investigation was put into handling the
PIR for posted MSIs, i.e. there's no reason to assume KVM's existing
logic is meaningful, let alone superior.
Matching the processing done by posted MSIs will also allow deduplicating
the code between KVM and posted MSIs.
See the comment for handle_pending_pir() added by commit
|
|
|
|
06b4d0ea22 |
KVM: VMX: Process PIR using 64-bit accesses on 64-bit kernels
Process the PIR at the natural kernel width, i.e. in 64-bit chunks on 64-bit kernels, so that the worst case of having a posted IRQ in each chunk of the vIRR only requires 4 loads and xchgs from/to the PIR, not 8. Deliberately use a "continue" to skip empty entries so that the code is a carbon copy of handle_pending_pir(), in anticipation of deduplicating KVM and posted MSI logic. Suggested-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250401163447.846608-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
f1459315f4 |
x86/irq: KVM: Track PIR bitmap as an "unsigned long" array
Track the PIR bitmap in posted interrupt descriptor structures as an array of unsigned longs instead of using unionized arrays for KVM (u32s) versus IRQ management (u64s). In practice, because the non-KVM usage is (sanely) restricted to 64-bit kernels, all existing usage of the u64 variant is already working with unsigned longs. Using "unsigned long" for the array will allow reworking KVM's processing of the bitmap to read/write in 64-bit chunks on 64-bit kernels, i.e. will allow optimizing KVM by reducing the number of atomic accesses to PIR. Opportunstically replace the open coded literals in the posted MSIs code with the appropriate macro. Deliberately don't use ARRAY_SIZE() in the for-loops, even though it would be cleaner from a certain perspective, in anticipation of decoupling the processing from the array declaration. No functional change intended. Link: https://lore.kernel.org/r/20250401163447.846608-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
6433fc01f9 |
KVM: VMX: Ensure vIRR isn't reloaded at odd times when sync'ing PIR
Read each vIRR exactly once when shuffling IRQs from the PIR to the vAPIC to ensure getting the highest priority IRQ from the chunk doesn't reload from the vIRR. In practice, a reload is functionally benign as vcpu->mutex is held and so IRQs can be consumed, i.e. new IRQs can appear, but existing IRQs can't disappear. Link: https://lore.kernel.org/r/20250401163447.846608-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
87e4951e25 |
KVM: x86: Rescan I/O APIC routes after EOI interception for old routing
Rescan I/O APIC routes for a vCPU after handling an intercepted I/O APIC EOI for an IRQ that is not targeting said vCPU, i.e. after handling what's effectively a stale EOI VM-Exit. If a level-triggered IRQ is in-flight when IRQ routing changes, e.g. because the guest changes routing from its IRQ handler, then KVM intercepts EOIs on both the new and old target vCPUs, so that the in-flight IRQ can be de-asserted when it's EOI'd. However, only the EOI for the in-flight IRQ needs to be intercepted, as IRQs on the same vector with the new routing are coincidental, i.e. occur only if the guest is reusing the vector for multiple interrupt sources. If the I/O APIC routes aren't rescanned, KVM will unnecessarily intercept EOIs for the vector and negative impact the vCPU's interrupt performance. Note, both commit |
|
|
|
fd02aa45bd |
Merge branch 'kvm-tdx-initial' into HEAD
This large commit contains the initial support for TDX in KVM. All x86
parts enable the host-side hypercalls that KVM uses to talk to the TDX
module, a software component that runs in a special CPU mode called SEAM
(Secure Arbitration Mode).
The series is in turn split into multiple sub-series, each with a separate
merge commit:
- Initialization: basic setup for using the TDX module from KVM, plus
ioctls to create TDX VMs and vCPUs.
- MMU: in TDX, private and shared halves of the address space are mapped by
different EPT roots, and the private half is managed by the TDX module.
Using the support that was added to the generic MMU code in 6.14,
add support for TDX's secure page tables to the Intel side of KVM.
Generic KVM code takes care of maintaining a mirror of the secure page
tables so that they can be queried efficiently, and ensuring that changes
are applied to both the mirror and the secure EPT.
- vCPU enter/exit: implement the callbacks that handle the entry of a TDX
vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore
of host state.
- Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to
userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits"
but are triggered through a different mechanism, similar to VMGEXIT for
SEV-ES and SEV-SNP.
- Interrupt handling: support for virtual interrupt injection as well as
handling VM-Exits that are caused by vectored events. Exclusive to
TDX are machine-check SMIs, which the kernel already knows how to
handle through the kernel machine check handler (commit
|
|
|
|
edb0e8f6e2 |
ARM:
* Nested virtualization support for VGICv3, giving the nested
hypervisor control of the VGIC hardware when running an L2 VM
* Removal of 'late' nested virtualization feature register masking,
making the supported feature set directly visible to userspace
* Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers
* Paravirtual interface for discovering the set of CPU implementations
where a VM may run, addressing a longstanding issue of guest CPU
errata awareness in big-little systems and cross-implementation VM
migration
* Userspace control of the registers responsible for identifying a
particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
allowing VMs to be migrated cross-implementation
* pKVM updates, including support for tracking stage-2 page table
allocations in the protected hypervisor in the 'SecPageTable' stat
* Fixes to vPMU, ensuring that userspace updates to the vPMU after
KVM_RUN are reflected into the backing perf events
LoongArch:
* Remove unnecessary header include path
* Assume constant PGD during VM context switch
* Add perf events support for guest VM
RISC-V:
* Disable the kernel perf counter during configure
* KVM selftests improvements for PMU
* Fix warning at the time of KVM module removal
x86:
* Add support for aging of SPTEs without holding mmu_lock. Not taking mmu_lock
allows multiple aging actions to run in parallel, and more importantly avoids
stalling vCPUs. This includes an implementation of per-rmap-entry locking;
aging the gfn is done with only a per-rmap single-bin spinlock taken, whereas
locking an rmap for write requires taking both the per-rmap spinlock and
the mmu_lock.
Note that this decreases slightly the accuracy of accessed-page information,
because changes to the SPTE outside aging might not use atomic operations
even if they could race against a clear of the Accessed bit. This is
deliberate because KVM and mm/ tolerate false positives/negatives for
accessed information, and testing has shown that reducing the latency of
aging is far more beneficial to overall system performance than providing
"perfect" young/old information.
* Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
coalesce updates when multiple pieces of vCPU state are changing, e.g. as
part of a nested transition.
* Fix a variety of nested emulation bugs, and add VMX support for synthesizing
nested VM-Exit on interception (instead of injecting #UD into L2).
* Drop "support" for async page faults for protected guests that do not set
SEND_ALWAYS (i.e. that only want async page faults at CPL3)
* Bring a bit of sanity to x86's VM teardown code, which has accumulated
a lot of cruft over the years. Particularly, destroy vCPUs before
the MMU, despite the latter being a VM-wide operation.
* Add common secure TSC infrastructure for use within SNP and in the
future TDX
* Block KVM_CAP_SYNC_REGS if guest state is protected. It does not make
sense to use the capability if the relevant registers are not
available for reading or writing.
* Don't take kvm->lock when iterating over vCPUs in the suspend notifier to
fix a largely theoretical deadlock.
* Use the vCPU's actual Xen PV clock information when starting the Xen timer,
as the cached state in arch.hv_clock can be stale/bogus.
* Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different
PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend
notifier only accounts for kvmclock, and there's no evidence that the
flag is actually supported by Xen guests.
* Clean up the per-vCPU "cache" of its reference pvclock, and instead only
track the vCPU's TSC scaling (multipler+shift) metadata (which is moderately
expensive to compute, and rarely changes for modern setups).
* Don't write to the Xen hypercall page on MSR writes that are initiated by
the host (userspace or KVM) to fix a class of bugs where KVM can write to
guest memory at unexpected times, e.g. during vCPU creation if userspace has
set the Xen hypercall MSR index to collide with an MSR that KVM emulates.
* Restrict the Xen hypercall MSR index to the unofficial synthetic range to
reduce the set of possible collisions with MSRs that are emulated by KVM
(collisions can still happen as KVM emulates Hyper-V MSRs, which also reside
in the synthetic range).
* Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config.
* Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID
entries when updating PV clocks; there is no guarantee PV clocks will be
updated between TSC frequency changes and CPUID emulation, and guest reads
of the TSC leaves should be rare, i.e. are not a hot path.
x86 (Intel):
* Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus
modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1.
* Pass XFD_ERR as the payload when injecting #NM, as a preparatory step
for upcoming FRED virtualization support.
* Decouple the EPT entry RWX protection bit macros from the EPT Violation
bits, both as a general cleanup and in anticipation of adding support for
emulating Mode-Based Execution Control (MBEC).
* Reject KVM_RUN if userspace manages to gain control and stuff invalid guest
state while KVM is in the middle of emulating nested VM-Enter.
* Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs
in anticipation of adding sanity checks for secondary exit controls (the
primary field is out of bits).
x86 (AMD):
* Ensure the PSP driver is initialized when both the PSP and KVM modules are
built-in (the initcall framework doesn't handle dependencies).
* Use long-term pins when registering encrypted memory regions, so that the
pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and don't lead to
excessive fragmentation.
* Add macros and helpers for setting GHCB return/error codes.
* Add support for Idle HLT interception, which elides interception if the vCPU
has a pending, unmasked virtual IRQ when HLT is executed.
* Fix a bug in INVPCID emulation where KVM fails to check for a non-canonical
address.
* Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is invalid, e.g.
because the vCPU was "destroyed" via SNP's AP Creation hypercall.
* Reject SNP AP Creation if the requested SEV features for the vCPU don't
match the VM's configured set of features.
Selftests:
* Fix again the Intel PMU counters test; add a data load and do CLFLUSH{OPT} on the data
instead of executing code. The theory is that modern Intel CPUs have
learned new code prefetching tricks that bypass the PMU counters.
* Fix a flaw in the Intel PMU counters test where it asserts that an event is
counting correctly without actually knowing what the event counts on the
underlying hardware.
* Fix a variety of flaws, bugs, and false failures/passes dirty_log_test, and
improve its coverage by collecting all dirty entries on each iteration.
* Fix a few minor bugs related to handling of stats FDs.
* Add infrastructure to make vCPU and VM stats FDs available to tests by
default (open the FDs during VM/vCPU creation).
* Relax an assertion on the number of HLT exits in the xAPIC IPI test when
running on a CPU that supports AMD's Idle HLT (which elides interception of
HLT if a virtual IRQ is pending and unmasked).
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmfcTkEUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMnQAf/cPx72hJOdNy4Qrm8M33YLXVRVV00
yEZ8eN8TWdOclr0ltE/w/ELGh/qS4CU8pjURAk0A6lPioU+mdcTn3dPEqMDMVYom
uOQ2lusEHw0UuSnGZSEjvZJsE/Ro2NSAsHIB6PWRqig1ZBPJzyu0frce34pMpeQH
diwriJL9lKPAhBWXnUQ9BKoi1R0P5OLW9ahX4SOWk7cAFg4DLlDE66Nqf6nKqViw
DwEucTiUEg5+a3d93gihdD4JNl+fb3vI2erxrMxjFjkacl0qgqRu3ei3DG0MfdHU
wNcFSG5B1n0OECKxr80lr1Ip1KTVNNij0Ks+w6Gc6lSg9c4PptnNkfLK3A==
=nnCN
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Nested virtualization support for VGICv3, giving the nested
hypervisor control of the VGIC hardware when running an L2 VM
- Removal of 'late' nested virtualization feature register masking,
making the supported feature set directly visible to userspace
- Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers
- Paravirtual interface for discovering the set of CPU
implementations where a VM may run, addressing a longstanding issue
of guest CPU errata awareness in big-little systems and
cross-implementation VM migration
- Userspace control of the registers responsible for identifying a
particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
allowing VMs to be migrated cross-implementation
- pKVM updates, including support for tracking stage-2 page table
allocations in the protected hypervisor in the 'SecPageTable' stat
- Fixes to vPMU, ensuring that userspace updates to the vPMU after
KVM_RUN are reflected into the backing perf events
LoongArch:
- Remove unnecessary header include path
- Assume constant PGD during VM context switch
- Add perf events support for guest VM
RISC-V:
- Disable the kernel perf counter during configure
- KVM selftests improvements for PMU
- Fix warning at the time of KVM module removal
x86:
- Add support for aging of SPTEs without holding mmu_lock.
Not taking mmu_lock allows multiple aging actions to run in
parallel, and more importantly avoids stalling vCPUs. This includes
an implementation of per-rmap-entry locking; aging the gfn is done
with only a per-rmap single-bin spinlock taken, whereas locking an
rmap for write requires taking both the per-rmap spinlock and the
mmu_lock.
Note that this decreases slightly the accuracy of accessed-page
information, because changes to the SPTE outside aging might not
use atomic operations even if they could race against a clear of
the Accessed bit.
This is deliberate because KVM and mm/ tolerate false
positives/negatives for accessed information, and testing has shown
that reducing the latency of aging is far more beneficial to
overall system performance than providing "perfect" young/old
information.
- Defer runtime CPUID updates until KVM emulates a CPUID instruction,
to coalesce updates when multiple pieces of vCPU state are
changing, e.g. as part of a nested transition
- Fix a variety of nested emulation bugs, and add VMX support for
synthesizing nested VM-Exit on interception (instead of injecting
#UD into L2)
- Drop "support" for async page faults for protected guests that do
not set SEND_ALWAYS (i.e. that only want async page faults at CPL3)
- Bring a bit of sanity to x86's VM teardown code, which has
accumulated a lot of cruft over the years. Particularly, destroy
vCPUs before the MMU, despite the latter being a VM-wide operation
- Add common secure TSC infrastructure for use within SNP and in the
future TDX
- Block KVM_CAP_SYNC_REGS if guest state is protected. It does not
make sense to use the capability if the relevant registers are not
available for reading or writing
- Don't take kvm->lock when iterating over vCPUs in the suspend
notifier to fix a largely theoretical deadlock
- Use the vCPU's actual Xen PV clock information when starting the
Xen timer, as the cached state in arch.hv_clock can be stale/bogus
- Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across
different PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as
KVM's suspend notifier only accounts for kvmclock, and there's no
evidence that the flag is actually supported by Xen guests
- Clean up the per-vCPU "cache" of its reference pvclock, and instead
only track the vCPU's TSC scaling (multipler+shift) metadata (which
is moderately expensive to compute, and rarely changes for modern
setups)
- Don't write to the Xen hypercall page on MSR writes that are
initiated by the host (userspace or KVM) to fix a class of bugs
where KVM can write to guest memory at unexpected times, e.g.
during vCPU creation if userspace has set the Xen hypercall MSR
index to collide with an MSR that KVM emulates
- Restrict the Xen hypercall MSR index to the unofficial synthetic
range to reduce the set of possible collisions with MSRs that are
emulated by KVM (collisions can still happen as KVM emulates
Hyper-V MSRs, which also reside in the synthetic range)
- Clean up and optimize KVM's handling of Xen MSR writes and
xen_hvm_config
- Update Xen TSC leaves during CPUID emulation instead of modifying
the CPUID entries when updating PV clocks; there is no guarantee PV
clocks will be updated between TSC frequency changes and CPUID
emulation, and guest reads of the TSC leaves should be rare, i.e.
are not a hot path
x86 (Intel):
- Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and
thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1
- Pass XFD_ERR as the payload when injecting #NM, as a preparatory
step for upcoming FRED virtualization support
- Decouple the EPT entry RWX protection bit macros from the EPT
Violation bits, both as a general cleanup and in anticipation of
adding support for emulating Mode-Based Execution Control (MBEC)
- Reject KVM_RUN if userspace manages to gain control and stuff
invalid guest state while KVM is in the middle of emulating nested
VM-Enter
- Add a macro to handle KVM's sanity checks on entry/exit VMCS
control pairs in anticipation of adding sanity checks for secondary
exit controls (the primary field is out of bits)
x86 (AMD):
- Ensure the PSP driver is initialized when both the PSP and KVM
modules are built-in (the initcall framework doesn't handle
dependencies)
- Use long-term pins when registering encrypted memory regions, so
that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and
don't lead to excessive fragmentation
- Add macros and helpers for setting GHCB return/error codes
- Add support for Idle HLT interception, which elides interception if
the vCPU has a pending, unmasked virtual IRQ when HLT is executed
- Fix a bug in INVPCID emulation where KVM fails to check for a
non-canonical address
- Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is
invalid, e.g. because the vCPU was "destroyed" via SNP's AP
Creation hypercall
- Reject SNP AP Creation if the requested SEV features for the vCPU
don't match the VM's configured set of features
Selftests:
- Fix again the Intel PMU counters test; add a data load and do
CLFLUSH{OPT} on the data instead of executing code. The theory is
that modern Intel CPUs have learned new code prefetching tricks
that bypass the PMU counters
- Fix a flaw in the Intel PMU counters test where it asserts that an
event is counting correctly without actually knowing what the event
counts on the underlying hardware
- Fix a variety of flaws, bugs, and false failures/passes
dirty_log_test, and improve its coverage by collecting all dirty
entries on each iteration
- Fix a few minor bugs related to handling of stats FDs
- Add infrastructure to make vCPU and VM stats FDs available to tests
by default (open the FDs during VM/vCPU creation)
- Relax an assertion on the number of HLT exits in the xAPIC IPI test
when running on a CPU that supports AMD's Idle HLT (which elides
interception of HLT if a virtual IRQ is pending and unmasked)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (216 commits)
RISC-V: KVM: Optimize comments in kvm_riscv_vcpu_isa_disable_allowed
RISC-V: KVM: Teardown riscv specific bits after kvm_exit
LoongArch: KVM: Register perf callbacks for guest
LoongArch: KVM: Implement arch-specific functions for guest perf
LoongArch: KVM: Add stub for kvm_arch_vcpu_preempted_in_kernel()
LoongArch: KVM: Remove PGD saving during VM context switch
LoongArch: KVM: Remove unnecessary header include path
KVM: arm64: Tear down vGIC on failed vCPU creation
KVM: arm64: PMU: Reload when resetting
KVM: arm64: PMU: Reload when user modifies registers
KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs
KVM: arm64: PMU: Assume PMU presence in pmu-emul.c
KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu
KVM: arm64: Factor out pKVM hyp vcpu creation to separate function
KVM: arm64: Initialize HCRX_EL2 traps in pKVM
KVM: arm64: Factor out setting HCRX_EL2 traps into separate function
KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected
KVM: x86: Add infrastructure for secure TSC
KVM: x86: Push down setting vcpu.arch.user_set_tsc
...
|
|
|
|
14aecf2a5b |
KVM: x86: Assume timer IRQ was injected if APIC state is protected
If APIC state is protected, i.e. the vCPU is a TDX guest, assume a timer IRQ was injected when deciding whether or not to busy wait in the "timer advanced" path. The "real" vIRR is not readable/writable, so trying to query for a pending timer IRQ will return garbage. Note, TDX can scour the PIR if it wants to be more precise and skip the "wait" call entirely. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-6-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
90cfe144c8 |
KVM: TDX: Add support for find pending IRQ in a protected local APIC
Add flag and hook to KVM's local APIC management to support determining whether or not a TDX guest has a pending IRQ. For TDX vCPUs, the virtual APIC page is owned by the TDX module and cannot be accessed by KVM. As a result, registers that are virtualized by the CPU, e.g. PPR, cannot be read or written by KVM. To deliver interrupts for TDX guests, KVM must send an IRQ to the CPU on the posted interrupt notification vector. And to determine if TDX vCPU has a pending interrupt, KVM must check if there is an outstanding notification. Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is protected to short-circuit the various other flows that try to pull an IRQ out of the vAPIC, the only valid operation is querying _if_ an IRQ is pending, KVM can't do anything based on _which_ IRQ is pending. Intentionally omit sanity checks from other flows, e.g. PPR update, so as not to degrade non-TDX guests with unnecessary checks. A well-behaved KVM and userspace will never reach those flows for TDX guests, but reaching them is not fatal if something does go awry. For the TD exits not due to HLT TDCALL, skip checking RVI pending in tdx_protected_apic_has_interrupt(). Except for the guest being stupid (e.g., non-HLT TDCALL in an interrupt shadow), it's not even possible to have an interrupt in RVI that is fully unmasked. There is no any CPU flows that modify RVI in the middle of instruction execution. I.e. if RVI is non-zero, then either the interrupt has been pending since before the TD exit, or the instruction caused the TD exit is in an STI/SS shadow. KVM doesn't care about STI/SS shadows outside of the HALTED case. And if the interrupt was pending before TD exit, then it _must_ be blocked, otherwise the interrupt would have been serviced at the instruction boundary. For the HLT TDCALL case, it will be handled in a future patch when HLT TDCALL is supported. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-2-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
a50f673f25 |
KVM: TDX: Do TDX specific vcpu initialization
TD guest vcpu needs TDX specific initialization before running. Repurpose KVM_MEMORY_ENCRYPT_OP to vcpu-scope, add a new sub-command KVM_TDX_INIT_VCPU, and implement the callback for it. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com> Co-developed-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> --- - Fix comment: https://lore.kernel.org/kvm/Z36OYfRW9oPjW8be@google.com/ (Sean) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
7764b9dd17 |
KVM: x86: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/all/5051cfe7ed48ef9913bf2583eeca6795cb53d6ae.1738746821.git.namcao@linutronix.de |
|
|
|
93da6af3ae |
KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation
Defer runtime CPUID updates until the next non-faulting CPUID emulation
or KVM_GET_CPUID2, which are the only paths in KVM that consume the
dynamic entries. Deferring the updates is especially beneficial to
nested VM-Enter/VM-Exit, as KVM will almost always detect multiple state
changes, not to mention the updates don't need to be realized while L2 is
active if CPUID is being intercepted by L1 (CPUID is a mandatory intercept
on Intel, but not AMD).
Deferring CPUID updates shaves several hundred cycles from nested VMX
roundtrips, as measured from L2 executing CPUID in a tight loop:
SKX 6850 => 6450
ICX 9000 => 8800
EMR 7900 => 7700
Alternatively, KVM could update only the CPUID leaves that are affected
by the state change, e.g. update XSAVE info only if XCR0 or XSS changes,
but that adds non-trivial complexity and doesn't solve the underlying
problem of nested transitions potentially changing both XCR0 and XSS, on
both nested VM-Enter and VM-Exit.
Skipping updates entirely if L2 is active and CPUID is being intercepted
by L1 could work for the common case. However, simply skipping updates if
L2 is active is *very* subtly dangerous and complex. Most KVM updates are
triggered by changes to the current vCPU state, which may be L2 state,
whereas performing updates only for L1 would requiring detecting changes
to L1 state. KVM would need to either track relevant L1 state, or defer
runtime CPUID updates until the next nested VM-Exit. The former is ugly
and complex, while the latter comes with similar dangers to deferring all
CPUID updates, and would only address the nested VM-Enter path.
To guard against using stale data, disallow querying dynamic CPUID feature
bits, i.e. features that KVM updates at runtime, via a compile-time
assertion in guest_cpu_cap_has(). Exempt MWAIT from the rule, as the
MISC_ENABLE_NO_MWAIT means that MWAIT is _conditionally_ a dynamic CPUID
feature.
Note, the rule could be enforced for MWAIT as well, e.g. by querying guest
CPUID in kvm_emulate_monitor_mwait, but there's no obvious advtantage to
doing so, and allowing MWAIT for guest_cpuid_has() opens up a different can
of worms. MONITOR/MWAIT can't be virtualized (for a reasonable definition),
and the nature of the MWAIT_NEVER_UD_FAULTS and MISC_ENABLE_NO_MWAIT quirks
means checking X86_FEATURE_MWAIT outside of kvm_emulate_monitor_mwait() is
wrong for other reasons.
Beyond the aforementioned feature bits, the only other dynamic CPUID
(sub)leaves are the XSAVE sizes, and similar to MWAIT, consuming those
CPUID entries in KVM is all but guaranteed to be a bug. The layout for an
actual XSAVE buffer depends on the format (compacted or not) and
potentially the features that are actually enabled. E.g. see the logic in
fpstate_clear_xstate_component() needed to poke into the guest's effective
XSAVE state to clear MPX state on INIT. KVM does consume
CPUID.0xD.0.{EAX,EDX} in kvm_check_cpuid() and cpuid_get_supported_xcr0(),
but not EBX, which is the only dynamic output register in the leaf.
Link: https://lore.kernel.org/r/20241211013302.1347853-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
c9e5f3fa90 |
KVM: x86: Introduce kvm_set_mp_state()
Replace all open-coded assignments to vcpu->arch.mp_state with calls to a new helper, kvm_set_mp_state(), to centralize all changes to mp_state. No functional change intended. Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250113200150.487409-2-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
82c470121c |
KVM: x86: Use kvfree_rcu() to free old optimized APIC map
Use kvfree_rcu() to free the old optimized APIC instead of open coding a rough equivalent via call_rcu() and a callback function. Note, there is a subtle function change as rcu_barrier() doesn't wait on kvfree_rcu(), but does wait on call_rcu(). Not forcing rcu_barrier() to wait is safe and desirable in this case, as KVM doesn't care when an old map is actually freed. In fact, using kvfree_rcu() fixes a largely theoretical use-after-free. Because KVM _doesn't_ do rcu_barrier() to wait for kvm_apic_map_free() to complete, if KVM-the-module is unloaded in the RCU grace period before kvm_apic_map_free() is invoked, KVM's callback could run after module unload. Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Link: https://lore.kernel.org/r/20250122073456.2950-1-lirongqing@baidu.com [sean: rework changelog, call out rcu_barrier() interaction] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
4f7ff70c05 |
KVM x86 misc changes for 6.14:
- Overhaul KVM's CPUID feature infrastructure to replace "governed" features
with per-vCPU tracking of the vCPU's capabailities for all features. Along
the way, refactor the code to make it easier to add/modify features, and
add a variety of self-documenting macro types to again simplify adding new
features and to help readers understand KVM's handling of existing features.
- Rework KVM's handling of VM-Exits during event vectoring to plug holes where
KVM unintentionally puts the vCPU into infinite loops in some scenarios,
e.g. if emulation is triggered by the exit, and to bring parity between VMX
and SVM.
- Add pending request and interrupt injection information to the kvm_exit and
kvm_entry tracepoints respectively.
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
loading guest/host PKRU due to a refactoring of the kernel helpers that
didn't account for KVM's pre-checking of the need to do WRPKRU.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeJngsACgkQOlYIJqCj
N/1dfA/+NIZmnd8OV9Zvc6HGxrzgt4QsM9pmsUmrfkDWefxMYIAMeaW8Vn4CJfRf
zY/UcqyNI7JYxSiuVTckz+Tf54HhqYaLrUwILGCQ49koirZx+aQT1OUfjLroVMlh
ffX1i6GOoLNtxjb9MXM/heLVdUbvmzQMSFkd/AkOH+nrOtDNOiPlZfjHsewj9zrf
BNJGhzvT4M6vc/AsScC7tc0yFD5KKFRv8tVwJ6Zf1nWKyUDOSpMTWkVnq6geKJPZ
iGBZPPNg55Oy1g6uj6VYWmqYTD8Qioz5jtEJ/8pPHdAyIFo21s81bfJc548d+QLh
KfrL1K7TrCOhSAGC3Cb3lTLeq2immmGHaiTBLwGABG4MhpiX4NVpMMdOyFbVLMOS
HIYuwXwDckm1pfU7/w+PgPaakCyPrXQntm+3Y2pvDOoY6e2JbwodK4j8BvvQda35
8TrYKEGFvq5aij7Iw1O9TUoLAocDM/sHIHE6BCazHyzKBIv9xLRFeabiCQ+A1pwv
gZk5u0+j+DPpLdeLhbMYhIXUtr3bvyMYvc+tRkG716f8ubAE3+Kn5BEDo4Ot2DcT
vc+NTRYYWN6zavHiJH3Ddt153yj256JCZhLwCdfbryCQdz3Mpy16m36tgkDRd3lR
QT4IkPQo1Vl/aU0yiE/dhnJgh1rTO26YQjZoHs5Oj16d0HRrKyc=
=32mM
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.14:
- Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
instead of just those where KVM needs to manage state and/or explicitly
enable the feature in hardware. Along the way, refactor the code to make
it easier to add features, and to make it more self-documenting how KVM
is handling each feature.
- Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
where KVM unintentionally puts the vCPU into infinite loops in some scenarios
(e.g. if emulation is triggered by the exit), and brings parity between VMX
and SVM.
- Add pending request and interrupt injection information to the kvm_exit and
kvm_entry tracepoints respectively.
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
loading guest/host PKRU, due to a refactoring of the kernel helpers that
didn't account for KVM's pre-checking of the need to do WRPKRU.
|
|
|
|
d6470627f5 |
KVM: x86: Use LVT_TIMER instead of an open coded literal
Use LVT_TIMER instead of the literal '0' to clean up the apic_lvt_mask lookup when emulating handling writes to APIC_LVTT. No functional change intended. Signed-off-by: Liam Ni <zhiguangni01@gmail.com> [sean: manually regenerate patch (whitespace damaged), massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
ca0245d131 |
KVM: x86: Remove hwapic_irr_update() from kvm_x86_ops
Remove the redundant .hwapic_irr_update() ops. If a vCPU has APICv enabled, KVM updates its RVI before VM-enter to L1 in vmx_sync_pir_to_irr(). This guarantees RVI is up-to-date and aligned with the vIRR in the virtual APIC. So, no need to update RVI every time the vIRR changes. Note that KVM never updates vmcs02 RVI in .hwapic_irr_update() or vmx_sync_pir_to_irr(). So, removing .hwapic_irr_update() has no impact to the nested case. Signed-off-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20241111085947.432645-1-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
04bc93cf49 |
KVM: nVMX: Defer SVI update to vmcs01 on EOI when L2 is active w/o VID
If KVM emulates an EOI for L1's virtual APIC while L2 is active, defer
updating GUEST_INTERUPT_STATUS.SVI, i.e. the VMCS's cache of the highest
in-service IRQ, until L1 is active, as vmcs01, not vmcs02, needs to track
vISR. The missed SVI update for vmcs01 can result in L1 interrupts being
incorrectly blocked, e.g. if there is a pending interrupt with lower
priority than the interrupt that was EOI'd.
This bug only affects use cases where L1's vAPIC is effectively passed
through to L2, e.g. in a pKVM scenario where L2 is L1's depriveleged host,
as KVM will only emulate an EOI for L1's vAPIC if Virtual Interrupt
Delivery (VID) is disabled in vmc12, and L1 isn't intercepting L2 accesses
to its (virtual) APIC page (or if x2APIC is enabled, the EOI MSR).
WARN() if KVM updates L1's ISR while L2 is active with VID enabled, as an
EOI from L2 is supposed to affect L2's vAPIC, but still defer the update,
to try to keep L1 alive. Specifically, KVM forwards all APICv-related
VM-Exits to L1 via nested_vmx_l1_wants_exit():
case EXIT_REASON_APIC_ACCESS:
case EXIT_REASON_APIC_WRITE:
case EXIT_REASON_EOI_INDUCED:
/*
* The controls for "virtualize APIC accesses," "APIC-
* register virtualization," and "virtual-interrupt
* delivery" only come from vmcs12.
*/
return true;
Fixes:
|
|
|
|
8f2a27752e |
KVM: x86: Replace (almost) all guest CPUID feature queries with cpu_caps
Switch all queries (except XSAVES) of guest features from guest CPUID to guest capabilities, i.e. replace all calls to guest_cpuid_has() with calls to guest_cpu_cap_has(). Keep guest_cpuid_has() around for XSAVES, but subsume its helper guest_cpuid_get_register() and add a compile-time assertion to prevent using guest_cpuid_has() for any other feature. Add yet another comment for XSAVE to explain why KVM is allowed to query its raw guest CPUID. Opportunistically drop the unused guest_cpuid_clear(), as there should be no circumstance in which KVM needs to _clear_ a guest CPUID feature now that everything is tracked via cpu_caps. E.g. KVM may need to _change_ a feature to emulate dynamic CPUID flags, but KVM should never need to clear a feature in guest CPUID to prevent it from being used by the guest. Delete the last remnants of the governed features framework, as the lone holdout was vmx_adjust_secondary_exec_control()'s divergent behavior for governed vs. ungoverned features. Note, replacing guest_cpuid_has() checks with guest_cpu_cap_has() when computing reserved CR4 bits is a nop when viewed as a whole, as KVM's capabilities are already incorporated into the calculation, i.e. if a feature is present in guest CPUID but unsupported by KVM, its CR4 bit was already being marked as reserved, checking guest_cpu_cap_has() simply double-stamps that it's a reserved bit. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20241128013424.4096668-51-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
76bce9f101 |
KVM: x86: Plumb in the vCPU to kvm_x86_ops.hwapic_isr_update()
Pass the target vCPU to the hwapic_isr_update() vendor hook so that VMX
can defer the update until after nested VM-Exit if an EOI for L1's vAPIC
occurs while L2 is active.
Note, commit
|
|
|
|
2e9a2c624e |
Merge branch 'kvm-docs-6.13' into HEAD
- Drop obsolete references to PPC970 KVM, which was removed 10 years ago. - Fix incorrect references to non-existing ioctls - List registers supported by KVM_GET/SET_ONE_REG on s390 - Use rST internal links - Reorganize the introduction to the API document |
|
|
|
bb4409a9e7 |
KVM x86 misc changes for 6.13
- Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.
- Quirk KVM's misguided behavior of initialized certain feature MSRs to
their maximum supported feature set, which can result in KVM creating
invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero
value results in the vCPU having invalid state if userspace hides PDCM
from the guest, which can lead to save/restore failures.
- Fix KVM's handling of non-canonical checks for vCPUs that support LA57
to better follow the "architecture", in quotes because the actual
behavior is poorly documented. E.g. most MSR writes and descriptor
table loads ignore CR4.LA57 and operate purely on whether the CPU
supports LA57.
- Bypass the register cache when querying CPL from kvm_sched_out(), as
filling the cache from IRQ context is generally unsafe, and harden the
cache accessors to try to prevent similar issues from occuring in the
future.
- Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
over-advertises SPEC_CTRL when trying to support cross-vendor VMs.
- Minor cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmczowUACgkQOlYIJqCj
N/3G8w//RYIslQkHXZXovQvhHKM9RBxdg6FjU0Do2KLN/xnR+JdjrBAI3HnG0TVu
TEsA6slaTdvAFudSBZ27rORPARJ3XCSnDT+NflDR2UqtmGYFXxixcs4LQRUC3I2L
tS/e847Qfp7/+kXFYQuH6YmMftCf7SQNbUPU3EwSXY8seUJB6ZhO89WXgrBtaMH+
94b7EdkP86L4dqiEMGr/q/46/ewFpPlB4WWFNIAY+SoZebQ+C8gcAsGDzfG6giNV
HxdehWNp5nJ34k9Tudt2FseL/IclA3nXZZIl2dU6PWLkjgvDqL2kXLHpAQTpKuXf
ZzBM4wjdVqZRQHscLgX6+0/uOJDf9/iJs/dwD9PbzLhAJnHF3SWRAy/grzpChLFR
Yil5zagtdhjqSKDf2FsUCJ7lwaST0fhHvmZZx4loIhcmN0/rvvrhhfLsJgCWv/Kf
RWMkiSQGhxAZUXjOJDQB+wTHv7ZoJy51ov7/rYjr49jCJrCVO6I6yQ6lNKDCYClR
vDS3yK6fiEXbC09iudm74FBl2KO+BwJKhnidekzzKGv34RHRAux5Oo54J5jJMhBA
tXFxALGwKr1KAI7vLK8ByZzwZjN5pDmKHUuOAlJNy9XQi+b4zbbREkXlcqZ5VgxA
xCibpnLzJFoHS8y7c76wfz4mCkWSWzuC9Rzy4sTkHMzgJESmZiE=
=dESL
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.13' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.13
- Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.
- Quirk KVM's misguided behavior of initialized certain feature MSRs to
their maximum supported feature set, which can result in KVM creating
invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero
value results in the vCPU having invalid state if userspace hides PDCM
from the guest, which can lead to save/restore failures.
- Fix KVM's handling of non-canonical checks for vCPUs that support LA57
to better follow the "architecture", in quotes because the actual
behavior is poorly documented. E.g. most MSR writes and descriptor
table loads ignore CR4.LA57 and operate purely on whether the CPU
supports LA57.
- Bypass the register cache when querying CPL from kvm_sched_out(), as
filling the cache from IRQ context is generally unsafe, and harden the
cache accessors to try to prevent similar issues from occuring in the
future.
- Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
over-advertises SPEC_CTRL when trying to support cross-vendor VMs.
- Minor cleanups
|
|
|
|
d3ddef46f2 |
KVM: x86: Unconditionally set irr_pending when updating APICv state
Always set irr_pending (to true) when updating APICv status to fix a bug where KVM fails to set irr_pending when userspace sets APIC state and APICv is disabled, which ultimate results in KVM failing to inject the pending interrupt(s) that userspace stuffed into the vIRR, until another interrupt happens to be emulated by KVM. Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to be true if APICv is enabled, because not all vIRR updates will be visible to KVM. Hit the bug with a big hammer, even though strictly speaking KVM can scan the vIRR and set/clear irr_pending as appropriate for this specific case. The bug was introduced by commit |
|
|
|
a75b7bb46a |
KVM: x86: Short-circuit all of kvm_apic_set_base() if MSR value is unchanged
Do nothing in all of kvm_apic_set_base(), not just __kvm_apic_set_base(), if the incoming MSR value is the same as the current value. Validating the mode transitions is obviously unnecessary, and rejecting the write is pointless if the vCPU already has an invalid value, e.g. if userspace is doing weird things and modified guest CPUID after setting MSR_IA32_APICBASE. Bailing early avoids kvm_recalculate_apic_map()'s slow path in the rare scenario where the map is DIRTY due to some other vCPU dirtying the map, in which case it's the other vCPU/task's responsibility to recalculate the map. Note, kvm_lapic_reset() calls __kvm_apic_set_base() only when emulating RESET, in which case the old value is guaranteed to be zero, and the new value is guaranteed to be non-zero. I.e. all callers of __kvm_apic_set_base() effectively pre-check for the MSR value actually changing. Don't bother keeping the check in __kvm_apic_set_base(), as no additional callers are expected, and implying that the MSR might already be non-zero at the time of kvm_lapic_reset() could confuse readers. Link: https://lore.kernel.org/r/20241101183555.1794700-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
c9155eb012 |
KVM: x86: Unpack msr_data structure prior to calling kvm_apic_set_base()
Pass in the new value and "host initiated" as separate parameters to kvm_apic_set_base(), as forcing the KVM_SET_SREGS path to declare and fill an msr_data structure is awkward and kludgy, e.g. __set_sregs_common() doesn't even bother to set the proper MSR index. No functional change intended. Suggested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20241101183555.1794700-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
ff6ce56e1d |
KVM: x86: Make kvm_recalculate_apic_map() local to lapic.c
Make kvm_recalculate_apic_map() local to lapic.c now that all external callers are gone. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-8-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
7d1cb7cee9 |
KVM: x86: Rename APIC base setters to better capture their relationship
Rename kvm_set_apic_base() and kvm_lapic_set_base() to kvm_apic_set_base() and __kvm_apic_set_base() respectively to capture that the underscores version is a "special" variant (it exists purely to avoid recalculating the optimized map multiple times when stuffing the RESET value). Opportunistically add a comment explaining why kvm_lapic_reset() uses the inner helper. Note, KVM deliberately invokes kvm_arch_vcpu_create() while kvm->lock is NOT held so that vCPU setup isn't serialized if userspace is creating multiple/all vCPUs in parallel. I.e. triggering an extra recalculation is not limited to theoretical/rare edge cases, and so is worth avoiding. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-7-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
c9c9acfcd5 |
KVM: x86: Move kvm_set_apic_base() implementation to lapic.c (from x86.c)
Move kvm_set_apic_base() to lapic.c so that the bulk of KVM's local APIC code resides in lapic.c, regardless of whether or not KVM is emulating the local APIC in-kernel. This will also allow making various helpers visible only to lapic.c. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-6-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
8166d25579 |
KVM: x86: Drop superfluous kvm_lapic_set_base() call when setting APIC state
Now that kvm_lapic_set_base() does nothing if the "new" APIC base MSR is the same as the current value, drop the kvm_lapic_set_base() call in the KVM_SET_LAPIC flow that passes in the current value, as it too does nothing. Note, the purpose of invoking kvm_lapic_set_base() was purely to set apic->base_address (see commit |
|
|
|
d7d770bed9 |
KVM: x86: Short-circuit all kvm_lapic_set_base() if MSR value isn't changing
Do nothing in kvm_lapic_set_base() if the APIC base MSR value is the same as the current value. All flows except the handling of the base address explicitly take effect if and only if relevant bits are changing. For the base address, invoking kvm_lapic_set_base() before KVM initializes the base to APIC_DEFAULT_PHYS_BASE during vCPU RESET would be a KVM bug, i.e. KVM _must_ initialize apic->base_address before exposing the vCPU (to userspace or KVM at-large). Note, the inhibit is intended to be set if the base address is _changed_ from the default, i.e. is also covered by the RESET behavior. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-2-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
fcd366b95e |
KVM: x86: Don't fault-in APIC access page during initial allocation
Drop the gfn_to_page() lookup when installing KVM's internal memslot for the APIC access page, as KVM doesn't need to immediately fault-in the page now that the page isn't pinned. In the extremely unlikely event the kernel can't allocate a 4KiB page, KVM can just as easily return -EFAULT on the future page fault. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-37-seanjc@google.com> |
|
|
|
037bc38b29 |
KVM: Drop KVM_ERR_PTR_BAD_PAGE and instead return NULL to indicate an error
Remove KVM_ERR_PTR_BAD_PAGE and instead return NULL, as "bad page" is just
a leftover bit of weirdness from days of old when KVM stuffed a "bad" page
into the guest instead of actually handling missing pages. See commit
|
|
|
|
3f8df62852 |
Merge tag 'kvm-x86-vmx-6.12' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.12: - Set FINAL/PAGE in the page fault error code for EPT Violations if and only if the GVA is valid. If the GVA is NOT valid, there is no guest-side page table walk and so stuffing paging related metadata is nonsensical. - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of emulating posted interrupt delivery to L2. - Add a lockdep assertion to detect unsafe accesses of vmcs12 structures. - Harden eVMCS loading against an impossible NULL pointer deref (really truly should be impossible). - Minor SGX fix and a cleanup. |
|
|
|
aa9477966a |
KVM: x86: Fold kvm_get_apic_interrupt() into kvm_cpu_get_interrupt()
Fold kvm_get_apic_interrupt() into kvm_cpu_get_interrupt() now that nVMX essentially open codes kvm_get_apic_interrupt() in order to correctly emulate nested posted interrupts. Opportunistically stop exporting kvm_cpu_get_interrupt(), as the aforementioned nVMX flow was the only user in vendor code. Link: https://lore.kernel.org/r/20240906043413.1049633-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
a194a3a13c |
KVM: x86: Move "ack" phase of local APIC IRQ delivery to separate API
Split the "ack" phase, i.e. the movement of an interrupt from IRR=>ISR, out of kvm_get_apic_interrupt() and into a separate API so that nested VMX can acknowledge a specific interrupt _after_ emulating a VM-Exit from L2 to L1. To correctly emulate nested posted interrupts while APICv is active, KVM must: 1. find the highest pending interrupt. 2. check if that IRQ is L2's notification vector 3. emulate VM-Exit if the IRQ is NOT the notification vector 4. ACK the IRQ in L1 _after_ VM-Exit When APICv is active, the process of moving the IRQ from the IRR to the ISR also requires a VMWRITE to update vmcs01.GUEST_INTERRUPT_STATUS.SVI, and so acknowledging the interrupt before switching to vmcs01 would result in marking the IRQ as in-service in the wrong VMCS. KVM currently fudges around this issue by doing kvm_get_apic_interrupt() smack dab in the middle of emulating VM-Exit, but that hack doesn't play nice with nested posted interrupts, as notification vector IRQs don't trigger a VM-Exit in the first place. Cc: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20240906043413.1049633-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
73b42dc69b |
KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC)
Re-introduce the "split" x2APIC ICR storage that KVM used prior to Intel's
IPI virtualization support, but only for AMD. While not stated anywhere
in the APM, despite stating the ICR is a single 64-bit register, AMD CPUs
store the 64-bit ICR as two separate 32-bit values in ICR and ICR2. When
IPI virtualization (IPIv on Intel, all AVIC flavors on AMD) is enabled,
KVM needs to match CPU behavior as some ICR ICR writes will be handled by
the CPU, not by KVM.
Add a kvm_x86_ops knob to control the underlying format used by the CPU to
store the x2APIC ICR, and tune it to AMD vs. Intel regardless of whether
or not x2AVIC is enabled. If KVM is handling all ICR writes, the storage
format for x2APIC mode doesn't matter, and having the behavior follow AMD
versus Intel will provide better test coverage and ease debugging.
Fixes:
|
|
|
|
d33234342f |
KVM: x86: Move x2APIC ICR helper above kvm_apic_write_nodecode()
Hoist kvm_x2apic_icr_write() above kvm_apic_write_nodecode() so that a local helper to _read_ the x2APIC ICR can be added and used in the nodecode path without needing a forward declaration. No functional change intended. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240719235107.3023592-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
71bf395a27 |
KVM: x86: Enforce x2APIC's must-be-zero reserved ICR bits
Inject a #GP on a WRMSR(ICR) that attempts to set any reserved bits that
are must-be-zero on both Intel and AMD, i.e. any reserved bits other than
the BUSY bit, which Intel ignores and basically says is undefined.
KVM's xapic_state_test selftest has been fudging the bug since commit
|
|
|
|
1448d4a935 |
KVM: x86: Optimize local variable in start_sw_tscdeadline()
Change the data type of the local variable this_tsc_khz to u32 because virtual_tsc_khz is also declared as u32. Since do_div() casts the divisor to u32 anyway, changing the data type of this_tsc_khz to u32 also removes the following Coccinelle/coccicheck warning reported by do_div.cocci: WARNING: do_div() does a 64-by-32 division, please consider using div64_ul instead Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Link: https://lore.kernel.org/r/20240814203345.2234-2-thorsten.blum@toblux.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
4b7c3f6d04 |
KVM: x86: Make x2APIC ID 100% readonly
Ignore the userspace provided x2APIC ID when fixing up APIC state for
KVM_SET_LAPIC, i.e. make the x2APIC fully readonly in KVM. Commit
|
|
|
|
0005ca2076 |
KVM: x86: Eliminate log spam from limited APIC timer periods
SAP's vSMP MemoryONE continuously requests a local APIC timer period less than 500 us, resulting in the following kernel log spam: kvm: vcpu 15: requested 70240 ns lapic timer period limited to 500000 ns kvm: vcpu 19: requested 52848 ns lapic timer period limited to 500000 ns kvm: vcpu 15: requested 70256 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70256 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70208 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 387520 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70160 ns lapic timer period limited to 500000 ns kvm: vcpu 66: requested 205744 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70224 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70256 ns lapic timer period limited to 500000 ns limit_periodic_timer_frequency: 7569 callbacks suppressed ... To eliminate this spam, change the pr_info_ratelimited() in limit_periodic_timer_frequency() to pr_info_once(). Reported-by: James Houghton <jthoughton@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Message-ID: <20240724190640.2449291-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
896046474f |
KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops
Introduces kvm_x86_call(), to streamline the usage of static calls of kvm_x86_ops. The current implementation of these calls is verbose and could lead to alignment challenges. This makes the code susceptible to exceeding the "80 columns per single line of code" limit as defined in the coding-style document. Another issue with the existing implementation is that the addition of kvm_x86_ prefix to hooks at the static_call sites hinders code readability and navigation. kvm_x86_call() is added to improve code readability and maintainability, while adhering to the coding style guidelines. Signed-off-by: Wei Wang <wei.w.wang@intel.com> Link: https://lore.kernel.org/r/20240507133103.15052-3-wei.w.wang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
f4854bf741 |
KVM: x86: Replace static_call_cond() with static_call()
The use of static_call_cond() is essentially the same as static_call() on x86 (e.g. static_call() now handles a NULL pointer as a NOP), so replace it with static_call() to simplify the code. Link: https://lore.kernel.org/all/3916caa1dcd114301a49beafa5030eca396745c1.1679456900.git.jpoimboe@kernel.org/ Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Wei Wang <wei.w.wang@intel.com> Link: https://lore.kernel.org/r/20240507133103.15052-2-wei.w.wang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
5dcc1e7614 |
KVM x86 misc changes for 6.11
- Add a global struct to consolidate tracking of host values, e.g. EFER, and
move "shadow_phys_bits" into the structure as "maxphyaddr".
- Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
bus frequency, because TDX.
- Print the name of the APICv/AVIC inhibits in the relevant tracepoint.
- Clean up KVM's handling of vendor specific emulation to consistently act on
"compatible with Intel/AMD", versus checking for a specific vendor.
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRub0ACgkQOlYIJqCj
N/2LMxAArGzhcWZ6Qdo2aMRaMIPtSBJHmbEgEuHvHMumgsTZQzDcn9cxDi/hNSrc
l8ODOwAM2qNcq95YfwjU7F0ae3E+HRzGvKcBnmZWuQeCDp2HhVEoCphFu1sHst+t
XEJTL02b6OgyJUEU3h40mYk12eiq2S4FCnFYXPCqijwwuL6Y5KQvvTqek3c2/SDn
c+VneutYGax/S0GiiCkYh4wrwWh9g7qm0IX70ycBwJbW5qBFKgyglvHxvL8JLJC9
Nkkw/p2657wcOdraH+fOBuRy2dMwE5fv++1tOjWwB5WAAhSOJPZh0BGYvgA2yfN7
OE+k7APKUQd9Xxtud8H3LrTPoyMA4hz2sdDFyqrrWK9yjpBY7zXNyN50Fxi7VVsm
T8nTIiKAGyRbjotY+m7krXQPXjfZYhVqrJ/jtxESOZLZ93q2gSWU2p/ZXpUPVHnH
+YOBAI1owP3wepaYlrthtI4LQx9lF422dnmeSflztfKFGabRbQZxg3uHMCCxIaGc
lJ6CD546+D45f/uBXRDMqk//qFTqXhKUbDk9sutmU/C2oWufMwW0R8kOyItGPyvk
9PP1vd8vSsIHj+tpwg+i04jBqYDaAcPBOcTZaHm9SYYP+1e11Uu5Vjep37JL1bkA
xJWxnDZOCGcfKQi2jkh51HJ/dOAHXY1GQKMfyAoPQOSonYHvGVY=
=Cf2R
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.11
- Add a global struct to consolidate tracking of host values, e.g. EFER, and
move "shadow_phys_bits" into the structure as "maxphyaddr".
- Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
bus frequency, because TDX.
- Print the name of the APICv/AVIC inhibits in the relevant tracepoint.
- Clean up KVM's handling of vendor specific emulation to consistently act on
"compatible with Intel/AMD", versus checking for a specific vendor.
- Misc cleanups
|
|
|
|
b460256b16 |
KVM: x86: Make nanoseconds per APIC bus cycle a VM variable
Introduce the VM variable "nanoseconds per APIC bus cycle" in
preparation to make the APIC bus frequency configurable.
The TDX architecture hard-codes the core crystal clock frequency to
25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture
does not allow the VMM to override the value.
In addition, per Intel SDM:
"The APIC timer frequency will be the processor’s bus clock or core
crystal clock frequency (when TSC/core crystal clock ratio is
enumerated in CPUID leaf 0x15) divided by the value specified in
the divide configuration register."
The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded
APIC bus frequency of 1GHz.
Introduce the VM variable "nanoseconds per APIC bus cycle" to prepare
for allowing userspace to tell KVM to use the frequency that TDX mandates
instead of the default 1Ghz. Doing so ensures that the guest doesn't have
a conflicting view of the APIC bus frequency.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
[reinette: rework changelog]
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/ae75ce37c6c38bb4efd10a0a41932984c40b24ac.1714081726.git.reinette.chatre@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
89a58812c4 |
KVM: x86: Drop support for hand tuning APIC timer advancement from userspace
Remove support for specifying a static local APIC timer advancement value, and instead present a read-only boolean parameter to let userspace enable or disable KVM's dynamic APIC timer advancement. Realistically, it's all but impossible for userspace to specify an advancement that is more precise than what KVM's adaptive tuning can provide. E.g. a static value needs to be tuned for the exact hardware and kernel, and if KVM is using hrtimers, likely requires additional tuning for the exact configuration of the entire system. Dropping support for a userspace provided value also fixes several flaws in the interface. E.g. KVM interprets a negative value other than -1 as a large advancement, toggling between a negative and positive value yields unpredictable behavior as vCPUs will switch from dynamic to static advancement, changing the advancement in the middle of VM creation can result in different values for vCPUs within a VM, etc. Those flaws are mostly fixable, but there's almost no justification for taking on yet more complexity (it's minimal complexity, but still non-zero). The only arguments against using KVM's adaptive tuning is if a setup needs a higher maximum, or if the adjustments are too reactive, but those are arguments for letting userspace control the absolute max advancement and the granularity of each adjustment, e.g. similar to how KVM provides knobs for halt polling. Link: https://lore.kernel.org/all/20240520115334.852510-1-zhoushuling@huawei.com Cc: Shuling Zhou <zhoushuling@huawei.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240522010304.1650603-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
49ff3b4aec |
KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms
On AMD and Hygon platforms, the local APIC does not automatically set
the mask bit of the LVTPC register when handling a PMI and there is
no need to clear it in the kernel's PMI handler.
For guests, the mask bit is currently set by kvm_apic_local_deliver()
and unless it is cleared by the guest kernel's PMI handler, PMIs stop
arriving and break use-cases like sampling with perf record.
This does not affect non-PerfMonV2 guests because PMIs are handled in
the guest kernel by x86_pmu_handle_irq() which always clears the LVTPC
mask bit irrespective of the vendor.
Before:
$ perf record -e cycles:u true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.001 MB perf.data (1 samples) ]
After:
$ perf record -e cycles:u true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.002 MB perf.data (19 samples) ]
Fixes:
|