mirror of https://github.com/torvalds/linux.git
3098 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
32f55e475c |
KVM: nVMX: Request immediate exit iff pending nested event needs injection
When requesting an immediate exit from L2 in order to inject a pending event, do so only if the pending event actually requires manual injection, i.e. if and only if KVM actually needs to regain control in order to deliver the event. Avoiding the "immediate exit" isn't simply an optimization, it's necessary to make forward progress, as the "already expired" VMX preemption timer trick that KVM uses to force a VM-Exit has higher priority than events that aren't directly injected. At present time, this is a glorified nop as all events processed by vmx_has_nested_events() require injection, but that will not hold true in the future, e.g. if there's a pending virtual interrupt in vmcs02.RVI. I.e. if KVM is trying to deliver a virtual interrupt to L2, the expired VMX preemption timer will trigger VM-Exit before the virtual interrupt is delivered, and KVM will effectively hang the vCPU in an endless loop of forced immediate VM-Exits (because the pending virtual interrupt never goes away). Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
02b0d3b9d4 | Merge branch 'kvm-6.10-fixes' into HEAD | |
|
|
f3ced000a2 |
KVM: x86: Always sync PIR to IRR prior to scanning I/O APIC routes
Sync pending posted interrupts to the IRR prior to re-scanning I/O APIC
routes, irrespective of whether the I/O APIC is emulated by userspace or
by KVM. If a level-triggered interrupt routed through the I/O APIC is
pending or in-service for a vCPU, KVM needs to intercept EOIs on said
vCPU even if the vCPU isn't the destination for the new routing, e.g. if
servicing an interrupt using the old routing races with I/O APIC
reconfiguration.
Commit
|
|
|
|
a6816314af |
KVM: Introduce vcpu->wants_to_run
Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit was not set. Replace all references to vcpu->run->immediate_exit with !vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a malicious userspace could invoked KVM_RUN with immediate_exit=true and then after KVM reads it to set wants_to_run=false, flip it to false. This would result in the vCPU running in KVM_RUN with wants_to_run=false. This wouldn't cause any real bugs today but is a dangerous landmine. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240503181734.1467938-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
d29bf2ca14 |
KVM: x86: Prevent excluding the BSP on setting max_vcpu_ids
If the BSP vCPU ID was already set, ensure it doesn't get excluded when limiting vCPU IDs via KVM_CAP_MAX_VCPU_ID. [mks: provide commit message, code by Sean] Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240614202859.3597745-4-minipli@grsecurity.net Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
7c305d5118 |
KVM: x86: Limit check IDs for KVM_SET_BOOT_CPU_ID
Do not accept IDs which are definitely invalid by limit checking the
passed value against KVM_MAX_VCPU_IDS and 'max_vcpu_ids' if it was
already set.
This ensures invalid values, especially on 64-bit systems, don't go
unnoticed and lead to a valid id by chance when truncated by the final
assignment.
Fixes:
|
|
|
|
3dee3b1874 |
KVM: x86: Drop now-superflous setting of l1tf_flush_l1d in vcpu_run()
Now that KVM unconditionally sets l1tf_flush_l1d in kvm_arch_vcpu_load(), drop the redundant store from vcpu_run(). The flag is cleared only when VM-Enter is imminent, deep below vcpu_run(), i.e. barring a KVM bug, it's impossible for l1tf_flush_l1d to be cleared between loading the vCPU and calling vcpu_run(). Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
ef2e18ef37 |
KVM: x86: Unconditionally set l1tf_flush_l1d during vCPU load
Always set l1tf_flush_l1d during kvm_arch_vcpu_load() instead of setting it only when the vCPU is being scheduled back in. The flag is processed only when VM-Enter is imminent, and KVM obviously needs to load the vCPU before VM-Enter, so attempting to precisely set l1tf_flush_l1d provides no meaningful value. I.e. the flag _will_ be set either way, it's simply a matter of when. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
2a27c43140 |
KVM: Delete the now unused kvm_arch_sched_in()
Delete kvm_arch_sched_in() now that all implementations are nops. Reviewed-by: Bibo Mao <maobibo@loongson.cn> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
8fbb696a8f |
KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()
Fold the guts of kvm_arch_sched_in() into kvm_arch_vcpu_load(), keying off the recently added kvm_vcpu.scheduled_out as appropriate. Note, there is a very slight functional change, as PLE shrink updates will now happen after blasting WBINVD, but that is quite uninteresting as the two operations do not interact in any way. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
e3c89f5dd1 |
KVM: x86: Don't re-setup empty IRQ routing when KVM_CAP_SPLIT_IRQCHIP
Now that KVM sets up empty IRQ routing during VM creation, don't recreate empty routing during KVM_CAP_SPLIT_IRQCHIP. Setting IRQ routes during KVM_CAP_SPLIT_IRQCHIP can result in 20+ milliseconds of delay due to the synchronize_srcu_expedited() call in kvm_set_irq_routing(). Note, the empty routing is guaranteed to be intact as KVM x86 only allows changing the IRQ routing after an in-kernel IRQCHIP has been created, and KVM_CAP_SPLIT_IRQCHIP is disallowed after creating an IRQCHIP. Signed-off-by: Yi Wang <foxywang@tencent.com> Link: https://lore.kernel.org/r/20240506101751.3145407-3-foxywang@tencent.com [sean: massage changelog, remove unused empty_routing array] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
85542adb65 |
KVM: x86: Add KVM_RUN_X86_GUEST_MODE kvm_run flag
When a vCPU is interrupted by a signal while running a nested guest, KVM will exit to userspace with L2 state. However, userspace has no way to know whether it sees L1 or L2 state (besides calling KVM_GET_STATS_FD, which does not have a stable ABI). This causes multiple problems: The simplest one is L2 state corruption when userspace marks the sregs as dirty. See this mailing list thread [1] for a complete discussion. Another problem is that if userspace decides to continue by emulating instructions, it will unknowingly emulate with L2 state as if L1 doesn't exist, which can be considered a weird guest escape. Introduce a new flag, KVM_RUN_X86_GUEST_MODE, in the kvm_run data structure, which is set when the vCPU exited while running a nested guest. Also introduce a new capability, KVM_CAP_X86_GUEST_MODE, to advertise the functionality to userspace. [1] https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#m280aadcb2e10ae02c191a7dc4ed4b711a74b1f55 Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de> Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de> Link: https://lore.kernel.org/r/20240508132502.184428-1-julian.stecklina@cyberus-technology.de Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
d99e4cb2ae |
KVM: x86: Use "is Intel compatible" helper to emulate SYSCALL in !64-bit
Use guest_cpuid_is_intel_compatible() to determine whether SYSCALL in 32-bit Protected Mode (including Compatibility Mode) should #UD or succeed. The existing code already does the exact equivalent of guest_cpuid_is_intel_compatible(), just in a rather roundabout way. No functional change intended. Link: https://lore.kernel.org/r/20240405235603.1173076-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
c092fc879f |
KVM: x86: Inhibit code #DBs in MOV-SS shadow for all Intel compat vCPUs
Treat code #DBs as inhibited in MOV/POP-SS shadows for vCPU models that
are Intel compatible, not just strictly vCPUs with vendor==Intel. The
behavior is explicitly called out in the SDM, and thus architectural, i.e.
applies to all CPUs that implement Intel's architecture, and isn't a quirk
that is unique to CPUs manufactured by Intel:
However, if an instruction breakpoint is placed on an instruction located
immediately after a POP SS/MOV SS instruction, the breakpoint will be
suppressed as if EFLAGS.RF were 1.
Applying the behavior strictly to Intel wasn't intentional, KVM simply
didn't have a concept of "Intel compatible" as of commit
|
|
|
|
6463e5e418 |
KVM: x86: Apply Intel's TSC_AUX reserved-bit behavior to Intel compat vCPUs
Extend Intel's check on MSR_TSC_AUX[63:32] to all vCPU models that are
Intel compatible, i.e. aren't AMD or Hygon in KVM's world, as the behavior
is architectural, i.e. applies to any CPU that is compatible with Intel's
architecture. Applying the behavior strictly to Intel wasn't intentional,
KVM simply didn't have a concept of "Intel compatible" as of commit
|
|
|
|
65a4de0ffd |
KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path
Ensure a full memory barrier is emitted in the VM-Exit path, as a full barrier is required on Intel CPUs to evict WC buffers. This will allow unconditionally honoring guest PAT on Intel CPUs that support self-snoop. As srcu_read_lock() is always called in the VM-Exit path and it internally has a smp_mb(), call smp_mb__after_srcu_read_lock() to avoid adding a second fence and make sure smp_mb() is called without dependency on implementation details of srcu_read_lock(). Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> [sean: massage changelog] Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
e1548088ff |
KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1
Drop KVM's emulation of CR0.CD=1 on Intel CPUs now that KVM no longer honors guest MTRR memtypes, as forcing UC memory for VMs with non-coherent DMA only makes sense if the guest is using something other than PAT to configure the memtype for the DMA region. Furthermore, KVM has forced WB memory for CR0.CD=1 since commit |
|
|
|
0a7b73559b |
KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes
Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR adds no value, negatively impacts guest performance, and is a maintenance burden due to it's complexity and oddities. KVM's approach to virtualizating MTRRs make no sense, at all. KVM *only* honors guest MTRR memtypes if EPT is enabled *and* the guest has a device that may perform non-coherent DMA access. From a hardware virtualization perspective of guest MTRRs, there is _nothing_ special about EPT. Legacy shadowing paging doesn't magically account for guest MTRRs, nor does NPT. Unwinding and deciphering KVM's murky history, the MTRR virtualization code appears to be the result of misdiagnosed issues when EPT + VT-d with passthrough devices was enabled years and years ago. And importantly, the underlying bugs that were fudged around by honoring guest MTRR memtypes have since been fixed (though rather poorly in some cases). The zapping GFNs logic in the MTRR virtualization code came from: commit |
|
|
|
f992572120 |
KVM: x86: Keep consistent naming for APICv/AVIC inhibit reasons
Keep kvm_apicv_inhibit enum naming consistent with the current pattern by renaming the reason/enumerator defined as APICV_INHIBIT_REASON_DISABLE to APICV_INHIBIT_REASON_DISABLED. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Link: https://lore.kernel.org/r/20240506225321.3440701-3-alejandro.j.jimenez@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
69148ccec6 |
KVM: x86: Print names of apicv inhibit reasons in traces
Use the tracing infrastructure helper __print_flags() for printing flag bitfields, to enhance the trace output by displaying a string describing each of the inhibit reasons set. The kvm_apicv_inhibit_changed tracepoint currently shows the raw bitmap value, requiring the user to consult the source file where the inhibit reasons are defined to decode the trace output. Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Link: https://lore.kernel.org/r/20240506225321.3440701-2-alejandro.j.jimenez@oracle.com Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
6fef518594 |
KVM: x86: Add a capability to configure bus frequency for APIC timer
Add KVM_CAP_X86_APIC_BUS_CYCLES_NS capability to configure the APIC
bus clock frequency for APIC timer emulation.
Allow KVM_ENABLE_CAPABILITY(KVM_CAP_X86_APIC_BUS_CYCLES_NS) to set the
frequency in nanoseconds. When using this capability, the user space
VMM should configure CPUID leaf 0x15 to advertise the frequency.
Vishal reported that the TDX guest kernel expects a 25MHz APIC bus
frequency but ends up getting interrupts at a significantly higher rate.
The TDX architecture hard-codes the core crystal clock frequency to
25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture
does not allow the VMM to override the value.
In addition, per Intel SDM:
"The APIC timer frequency will be the processor’s bus clock or core
crystal clock frequency (when TSC/core crystal clock ratio is
enumerated in CPUID leaf 0x15) divided by the value specified in
the divide configuration register."
The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded
APIC bus frequency of 1GHz.
The KVM doesn't enumerate CPUID leaf 0x15 to the guest unless the user
space VMM sets it using KVM_SET_CPUID. If the CPUID leaf 0x15 is
enumerated, the guest kernel uses it as the APIC bus frequency. If not,
the guest kernel measures the frequency based on other known timers like
the ACPI timer or the legacy PIT. As reported by Vishal the TDX guest
kernel expects a 25MHz timer frequency but gets timer interrupt more
frequently due to the 1GHz frequency used by KVM.
To ensure that the guest doesn't have a conflicting view of the APIC bus
frequency, allow the userspace to tell KVM to use the same frequency that
TDX mandates instead of the default 1Ghz.
Reported-by: Vishal Annapurve <vannapurve@google.com>
Closes: https://lore.kernel.org/lkml/20231006011255.4163884-1-vannapurve@google.com
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/6748a4c12269e756f0c48680da8ccc5367c31ce7.1714081726.git.reinette.chatre@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
b460256b16 |
KVM: x86: Make nanoseconds per APIC bus cycle a VM variable
Introduce the VM variable "nanoseconds per APIC bus cycle" in
preparation to make the APIC bus frequency configurable.
The TDX architecture hard-codes the core crystal clock frequency to
25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture
does not allow the VMM to override the value.
In addition, per Intel SDM:
"The APIC timer frequency will be the processor’s bus clock or core
crystal clock frequency (when TSC/core crystal clock ratio is
enumerated in CPUID leaf 0x15) divided by the value specified in
the divide configuration register."
The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded
APIC bus frequency of 1GHz.
Introduce the VM variable "nanoseconds per APIC bus cycle" to prepare
for allowing userspace to tell KVM to use the frequency that TDX mandates
instead of the default 1Ghz. Doing so ensures that the guest doesn't have
a conflicting view of the APIC bus frequency.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
[reinette: rework changelog]
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/ae75ce37c6c38bb4efd10a0a41932984c40b24ac.1714081726.git.reinette.chatre@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
ab978c62e7 |
Merge branch 'kvm-6.11-sev-snp' into HEAD
Pull base x86 KVM support for running SEV-SNP guests from Michael Roth: * add some basic infrastructure and introduces a new KVM_X86_SNP_VM vm_type to handle differences versus the existing KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types. * implement the KVM API to handle the creation of a cryptographic launch context, encrypt/measure the initial image into guest memory, and finalize it before launching it. * implement handling for various guest-generated events such as page state changes, onlining of additional vCPUs, etc. * implement the gmem/mmu hooks needed to prepare gmem-allocated pages before mapping them into guest private memory ranges as well as cleaning them up prior to returning them to the host for use as normal memory. Because those cleanup hooks supplant certain activities like issuing WBINVDs during KVM MMU invalidations, avoid duplicating that work to avoid unecessary overhead. This merge leaves out support support for attestation guest requests and for loading the signing keys to be used for attestation requests. |
|
|
|
89a58812c4 |
KVM: x86: Drop support for hand tuning APIC timer advancement from userspace
Remove support for specifying a static local APIC timer advancement value, and instead present a read-only boolean parameter to let userspace enable or disable KVM's dynamic APIC timer advancement. Realistically, it's all but impossible for userspace to specify an advancement that is more precise than what KVM's adaptive tuning can provide. E.g. a static value needs to be tuned for the exact hardware and kernel, and if KVM is using hrtimers, likely requires additional tuning for the exact configuration of the entire system. Dropping support for a userspace provided value also fixes several flaws in the interface. E.g. KVM interprets a negative value other than -1 as a large advancement, toggling between a negative and positive value yields unpredictable behavior as vCPUs will switch from dynamic to static advancement, changing the advancement in the middle of VM creation can result in different values for vCPUs within a VM, etc. Those flaws are mostly fixable, but there's almost no justification for taking on yet more complexity (it's minimal complexity, but still non-zero). The only arguments against using KVM's adaptive tuning is if a setup needs a higher maximum, or if the adjustments are too reactive, but those are arguments for letting userspace control the absolute max advancement and the granularity of each adjustment, e.g. similar to how KVM provides knobs for halt polling. Link: https://lore.kernel.org/all/20240520115334.852510-1-zhoushuling@huawei.com Cc: Shuling Zhou <zhoushuling@huawei.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240522010304.1650603-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
ea19f7d0bf |
KVM: x86: Remove IA32_PERF_GLOBAL_OVF_CTRL from KVM_GET_MSR_INDEX_LIST
This MSR reads as 0, and any host-initiated writes are ignored, so there's no reason to enumerate it in KVM_GET_MSR_INDEX_LIST. Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20231113184854.2344416-1-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
7974c0643e |
KVM: x86: Add a struct to consolidate host values, e.g. EFER, XCR0, etc...
Add "struct kvm_host_values kvm_host" to hold the various host values that KVM snapshots during initialization. Bundling the host values into a single struct simplifies adding new MSRs and other features with host state/values that KVM cares about, and provides a one-stop shop. E.g. adding a new value requires one line, whereas tracking each value individual often requires three: declaration, definition, and export. No functional change intended. Link: https://lore.kernel.org/r/20240423221521.2923759-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
4f2e7aa1cf |
KVM: SEV: Implement gmem hook for initializing private pages
This will handle the RMP table updates needed to put a page into a private state before mapping it into an SEV-SNP guest. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240501085210.2213060-14-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
e366f92ea9 |
KVM: SEV: Support SEV-SNP AP Creation NAE event
Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP guests to alter the register state of the APs on their own. This allows the guest a way of simulating INIT-SIPI. A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used so as to avoid updating the VMSA pointer while the vCPU is running. For CREATE The guest supplies the GPA of the VMSA to be used for the vCPU with the specified APIC ID. The GPA is saved in the svm struct of the target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added to the vCPU and then the vCPU is kicked. For CREATE_ON_INIT: The guest supplies the GPA of the VMSA to be used for the vCPU with the specified APIC ID the next time an INIT is performed. The GPA is saved in the svm struct of the target vCPU. For DESTROY: The guest indicates it wishes to stop the vCPU. The GPA is cleared from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added to vCPU and then the vCPU is kicked. The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked as a result of the event or as a result of an INIT. If a new VMSA is to be installed, the VMSA guest page is set as the VMSA in the vCPU VMCB and the vCPU state is set to KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is cleared in the vCPU VMCB and the vCPU state is set to KVM_MP_STATE_HALTED to prevent it from being run. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Message-ID: <20240501085210.2213060-13-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
c63cf135cc |
KVM: SEV: Add support to handle RMP nested page faults
When SEV-SNP is enabled in the guest, the hardware places restrictions on all memory accesses based on the contents of the RMP table. When hardware encounters RMP check failure caused by the guest memory access it raises the #NPF. The error code contains additional information on the access type. See the APM volume 2 for additional information. When using gmem, RMP faults resulting from mismatches between the state in the RMP table vs. what the guest expects via its page table result in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This means the only expected case that needs to be handled in the kernel is when the page size of the entry in the RMP table is larger than the mapping in the nested page table, in which case a PSMASH instruction needs to be issued to split the large RMP entry into individual 4K entries so that subsequent accesses can succeed. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Message-ID: <20240501085210.2213060-12-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
7323260373 |
Merge branch 'kvm-coco-hooks' into HEAD
Common patches for the target-independent functionality and hooks that are needed by SEV-SNP and TDX. |
|
|
|
7d41e24da2 |
KVM x86 misc changes for 6.10:
- Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which
is unused by hardware, so that KVM can communicate its inability to map GPAs
that set bits 51:48 due to lack of 5-level paging. Guest firmware is
expected to use the information to safely remap BARs in the uppermost GPA
space, i.e to avoid placing a BAR at a legal, but unmappable, GPA.
- Use vfree() instead of kvfree() for allocations that always use vcalloc()
or __vcalloc().
- Don't completely ignore same-value writes to immutable feature MSRs, as
doing so results in KVM failing to reject accesses to MSR that aren't
supposed to exist given the vCPU model and/or KVM configuration.
- Don't mark APICv as being inhibited due to ABSENT if APICv is disabled
KVM-wide to avoid confusing debuggers (KVM will never bother clearing the
ABSENT inhibit, even if userspace enables in-kernel local APIC).
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmY+rlEACgkQOlYIJqCj
N/3/xQ/7BvNl1aCJSIQy+yanCKK4wV0wWoY/hD+1wVge3zoaLZqLNHeR7fEa3vo+
OSS/pOz+PT6DbkokZYjjVaGs6+pFqaYg5YvRE7SPbj903phm81H7v5ZLtwgOBcXx
dG9cSLTaRhos0PxqoiLfmiGK5IDKmWuZyJzhw+nPh2YmxoRDO/4exsLA9xWWhQSh
BjPf32cq69fn39Mo/KeANdLR1FEjvKItEty7St5r/OZFxejP8VPe1xuFxHPJn4U+
FBbDe0DMXAPfoAQImBBhHUpm5Rp7Hwbh90tM8xY6rf3hvRZWmMCAX/Hx8C562M2b
k6jB13gsoVesatT6lgKs2I0KGL7TSC0jLYG8aeREdBz6AEo5bkBegB5965MZYfGv
T43i/zk+Ha5VIEURqE/CtocKF8AEjnUWLaIyL7VsDqaMslmaMdWzr8RouaO1snMT
N/mfilzx9/rzltTV67TI8FSykPNxehwNoc9P8l+ulbW1KKIzpZCWxtIpQnT2TGdn
89zAJ7LUbEAOnO+jMsJjld0fcNEmUqiqu9tezHuu0rVYErYqtfVhrWIf52r0AHDK
HRY5FNcZzCE+8FFAVDNl92Of+mPeF47RELXNMLAT+1lm91ug4k62GF4UDw7hsbFo
6+ductlj2DZlwxZVGKxKhBDxFg+AfsNCC1fZvYq+D/6ZE51eABo=
=9RXP
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.10' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.10:
- Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which
is unused by hardware, so that KVM can communicate its inability to map GPAs
that set bits 51:48 due to lack of 5-level paging. Guest firmware is
expected to use the information to safely remap BARs in the uppermost GPA
space, i.e to avoid placing a BAR at a legal, but unmappable, GPA.
- Use vfree() instead of kvfree() for allocations that always use vcalloc()
or __vcalloc().
- Don't completely ignore same-value writes to immutable feature MSRs, as
doing so results in KVM failing to reject accesses to MSR that aren't
supposed to exist given the vCPU model and/or KVM configuration.
- Don't mark APICv as being inhibited due to ABSENT if APICv is disabled
KVM-wide to avoid confusing debuggers (KVM will never bother clearing the
ABSENT inhibit, even if userspace enables in-kernel local APIC).
|
|
|
|
4232da23d7 |
Merge tag 'loongarch-kvm-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.10 1. Add ParaVirt IPI support. 2. Add software breakpoint support. 3. Add mmio trace events support. |
|
|
|
a90764f0e4 |
KVM: guest_memfd: Add hook for invalidating memory
In some cases, like with SEV-SNP, guest memory needs to be updated in a platform-specific manner before it can be safely freed back to the host. Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to allow for special handling of this sort when freeing memory in response to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go ahead and define an arch-specific hook for x86 since it will be needed for handling memory used for SEV-SNP guests. Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20231230172351.574091-6-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
3bb2531e20 |
KVM: guest_memfd: Add hook for initializing memory
guest_memfd pages are generally expected to be in some arch-defined initial state prior to using them for guest memory. For SEV-SNP this initial state is 'private', or 'guest-owned', and requires additional operations to move these pages into a 'private' state by updating the corresponding entries the RMP table. Allow for an arch-defined hook to handle updates of this sort, and go ahead and implement one for x86 so KVM implementations like AMD SVM can register a kvm_x86_ops callback to handle these updates for SEV-SNP guests. The preparation callback is always called when allocating/grabbing folios via gmem, and it is up to the architecture to keep track of whether or not the pages are already in the expected state (e.g. the RMP table in the case of SEV-SNP). In some cases, it is necessary to defer the preparation of the pages to handle things like in-place encryption of initial guest memory payloads before marking these pages as 'private'/'guest-owned'. Add an argument (always true for now) to kvm_gmem_get_folio() that allows for the preparation callback to be bypassed. To detect possible issues in the way userspace initializes memory, it is only possible to add an unprepared page if it is not already included in the filemap. Link: https://lore.kernel.org/lkml/ZLqVdvsF11Ddo7Dq@google.com/ Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20231230172351.574091-5-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
40269c03fd |
KVM: x86: Explicitly zero kvm_caps during vendor module load
Zero out all of kvm_caps when loading a new vendor module to ensure that KVM can't inadvertently rely on global initialization of a field, and add a comment above the definition of kvm_caps to call out that all fields needs to be explicitly computed during vendor module load. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
555485bd86 |
KVM: x86: Fully re-initialize supported_mce_cap on vendor module load
Effectively reset supported_mce_cap on vendor module load to ensure that
capabilities aren't unintentionally preserved across module reload, e.g.
if kvm-intel.ko added a module param to control LMCE support, or if
someone somehow managed to load a vendor module that doesn't support LMCE
after loading and unloading kvm-intel.ko.
Practically speaking, this bug is a non-issue as kvm-intel.ko doesn't have
a module param for LMCE, and there is no system in the world that supports
both kvm-intel.ko and kvm-amd.ko.
Fixes:
|
|
|
|
c43ad19045 |
KVM: x86: Fully re-initialize supported_vm_types on vendor module load
Recompute the entire set of supported VM types when a vendor module is
loaded, as preserving supported_vm_types across vendor module unload and
reload can result in VM types being incorrectly treated as supported.
E.g. if a vendor module is loaded with TDP enabled, unloaded, and then
reloaded with TDP disabled, KVM_X86_SW_PROTECTED_VM will be incorrectly
retained. Ditto for SEV_VM and SEV_ES_VM and their respective module
params in kvm-amd.ko.
Fixes:
|
|
|
|
6982b34c21 |
KVM: x86: Only set APICV_INHIBIT_REASON_ABSENT if APICv is enabled
Use the APICv enablement status to determine if APICV_INHIBIT_REASON_ABSENT
needs to be set, instead of unconditionally setting the reason during
initialization.
Specifically, in cases where AVIC is disabled via module parameter or lack
of hardware support, unconditionally setting an inhibit reason due to the
absence of an in-kernel local APIC can lead to a scenario where the reason
incorrectly remains set after a local APIC has been created by either
KVM_CREATE_IRQCHIP or the enabling of KVM_CAP_IRQCHIP_SPLIT. This is
because the helpers in charge of removing the inhibit return early if
enable_apicv is not true, and therefore the bit remains set.
This leads to confusion as to the cause why APICv is not active, since an
incorrect reason will be reported by tracepoints and/or a debugging tool
that examines the currently set inhibit reasons.
Fixes:
|
|
|
|
1d294dfaba |
KVM: x86: Allow, don't ignore, same-value writes to immutable MSRs
When handling userspace writes to immutable feature MSRs for a vCPU that
has already run, fall through into the normal code to set the MSR instead
of immediately returning '0'. I.e. allow such writes, instead of ignoring
such writes. This fixes a bug where KVM incorrectly allows writes to the
VMX MSRs that enumerate which CR{0,4} can be set, but only if the vCPU has
already run.
The intent of returning '0' and thus ignoring the write, was to avoid any
side effects, e.g. refreshing the PMU and thus doing weird things with
perf events while the vCPU is running. That approach sounds nice in
theory, but in practice it makes it all but impossible to maintain a sane
ABI, e.g. all VMX MSRs return -EBUSY if the CPU is post-VMXON, and the VMX
MSRs for fixed-1 CR bits are never writable, etc.
As for refreshing the PMU, kvm_set_msr_common() explicitly skips the PMU
refresh if MSR_IA32_PERF_CAPABILITIES is being written with the current
value, specifically to avoid unwanted side effects. And if necessary,
adding similar logic for other MSRs is not difficult.
Fixes:
|
|
|
|
817772266d |
* Clean up SVM's enter/exit assembly code so that it can be compiled
without OBJECT_FILES_NON_STANDARD. This fixes a warning
"Unpatched return thunk in use. This should not happen!" when running
KVM selftests.
* Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM
would allow userspace to refresh the cache with a bogus GPA. The bug has
existed for quite some time, but was exposed by a new sanity check added in
6.9 (to ensure a cache is either GPA-based or HVA-based).
* Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left
behind during a 6.9 cleanup.
* Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that
results in an array overflow (detected by KASAN).
* Fix a bug where KVM incorrectly clears root_role.direct when userspace sets
guest CPUID.
* Fix a dirty logging bug in the where KVM fails to write-protect SPTEs used
by a nested guest, if KVM is using Page-Modification Logging and the nested
hypervisor is NOT using EPT.
x86 PMU:
* Drop support for virtualizing adaptive PEBS, as KVM's implementation is
architecturally broken without an obvious/easy path forward, and because
exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak
host kernel addresses to the guest.
* Set the enable bits for general purpose counters in PERF_GLOBAL_CTRL at
RESET time, as done by both Intel and AMD processors.
* Disable LBR virtualization on CPUs that don't support LBR callstacks, as
KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the
perf event, and would fail on such CPUs.
Tests:
* Fix a flaw in the max_guest_memory selftest that results in it exhausting
the supply of ucall structures when run with more than 256 vCPUs.
* Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmYjdqcUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPNRAgAh1AdKBAWnq9bFN2Np1kSAcRAk3bs
REDq/0iD1T9TvIwEmE1lHaRuqvCSO15WW+DKvbs7TS8zA0DyY7X/x8sIIy5YzZ5C
bQ+JXiqk55OAj0sPskBpCvE5qEreuU8qAit57+8OseKWs57EICvJjrfsRnHlmIub
pgGas3I42LjIgsuZRr2kjv+GrvaiikW+wWK6sq3CvPzTtHV196d26AK5l4NOoLkY
0FTbBIYUSJ7wxs92xuTed5mZ7JFZdsa5DVMXF5MRZ9W6g2vZCLbqCNRddRhSAsl0
gKmqZkuPTB7AnGQbJ2h/aKFT0ydsguzqbbKq62sK7ft5f1CUlbp9luDC9w==
=99rq
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"This is a bit on the large side, mostly due to two changes:
- Changes to disable some broken PMU virtualization (see below for
details under "x86 PMU")
- Clean up SVM's enter/exit assembly code so that it can be compiled
without OBJECT_FILES_NON_STANDARD. This fixes a warning "Unpatched
return thunk in use. This should not happen!" when running KVM
selftests.
Everything else is small bugfixes and selftest changes:
- Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure
where KVM would allow userspace to refresh the cache with a bogus
GPA. The bug has existed for quite some time, but was exposed by a
new sanity check added in 6.9 (to ensure a cache is either
GPA-based or HVA-based).
- Drop an unused param from gfn_to_pfn_cache_invalidate_start() that
got left behind during a 6.9 cleanup.
- Fix a math goof in x86's hugepage logic for
KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow
(detected by KASAN).
- Fix a bug where KVM incorrectly clears root_role.direct when
userspace sets guest CPUID.
- Fix a dirty logging bug in the where KVM fails to write-protect
SPTEs used by a nested guest, if KVM is using Page-Modification
Logging and the nested hypervisor is NOT using EPT.
x86 PMU:
- Drop support for virtualizing adaptive PEBS, as KVM's
implementation is architecturally broken without an obvious/easy
path forward, and because exposing adaptive PEBS can leak host LBRs
to the guest, i.e. can leak host kernel addresses to the guest.
- Set the enable bits for general purpose counters in
PERF_GLOBAL_CTRL at RESET time, as done by both Intel and AMD
processors.
- Disable LBR virtualization on CPUs that don't support LBR
callstacks, as KVM unconditionally uses
PERF_SAMPLE_BRANCH_CALL_STACK when creating the perf event, and
would fail on such CPUs.
Tests:
- Fix a flaw in the max_guest_memory selftest that results in it
exhausting the supply of ucall structures when run with more than
256 vCPUs.
- Mark KVM_MEM_READONLY as supported for RISC-V in
set_memory_region_test"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (30 commits)
KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start()
KVM: selftests: Add coverage of EPT-disabled to vmx_dirty_log_test
KVM: x86/mmu: Fix and clarify comments about clearing D-bit vs. write-protecting
KVM: x86/mmu: Remove function comments above clear_dirty_{gfn_range,pt_masked}()
KVM: x86/mmu: Write-protect L2 SPTEs in TDP MMU when clearing dirty status
KVM: x86/mmu: Precisely invalidate MMU root_role during CPUID update
KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacks
perf/x86/intel: Expose existence of callback support to KVM
KVM: VMX: Snapshot LBR capabilities during module initialization
KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms
KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible
KVM: x86: Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD
KVM: SVM: Create a stack frame in __svm_sev_es_vcpu_run()
KVM: SVM: Save/restore args across SEV-ES VMRUN via host save area
KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area
KVM: SVM: Clobber RAX instead of RBX when discarding spec_ctrl_intercepted
KVM: SVM: Drop 32-bit "support" from __svm_sev_es_vcpu_run()
KVM: SVM: Wrap __svm_sev_es_vcpu_run() with #ifdef CONFIG_KVM_AMD_SEV
KVM: SVM: Create a stack frame in __svm_vcpu_run() for unwinding
KVM: SVM: Remove a useless zeroing of allocated memory
...
|
|
|
|
e913ef159f |
KVM: x86: Split core of hypercall emulation to helper function
By necessity, TDX will use a different register ABI for hypercalls. Break out the core functionality so that it may be reused for TDX. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <5134caa55ac3dec33fb2addb5545b52b3b52db02.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
2a955c4db1 |
KVM: x86: Add supported_vm_types to kvm_caps
This simplifies the implementation of KVM_CHECK_EXTENSION(KVM_CAP_VM_TYPES), and also allows the vendor module to specify which VM types are supported. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-9-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
517987e3fb |
KVM: x86: add fields to struct kvm_arch for CoCo features
Some VM types have characteristics in common; in fact, the only use of VM types right now is kvm_arch_has_private_mem and it assumes that _all_ nonzero VM types have private memory. We will soon introduce a VM type for SEV and SEV-ES VMs, and at that point we will have two special characteristics of confidential VMs that depend on the VM type: not just if memory is private, but also whether guest state is protected. For the latter we have kvm->arch.guest_state_protected, which is only set on a fully initialized VM. For VM types with protected guest state, we can actually fix a problem in the SEV-ES implementation, where ioctls to set registers do not cause an error even if the VM has been initialized and the guest state encrypted. Make sure that when using VM types that will become an error. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20240209183743.22030-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <20240404121327.3107131-8-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
546d714b08 |
KVM: introduce new vendor op for KVM_GET_DEVICE_ATTR
Allow vendor modules to provide their own attributes on /dev/kvm. To avoid proliferation of vendor ops, implement KVM_HAS_DEVICE_ATTR and KVM_GET_DEVICE_ATTR in terms of the same function. You're not supposed to use KVM_GET_DEVICE_ATTR to do complicated computations, especially on /dev/kvm. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <20240404121327.3107131-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
8d2aec3b2d |
KVM: x86: use u64_to_user_ptr()
There is no danger to the kernel if 32-bit userspace provides a 64-bit value that has the high bits set, but for whatever reason happens to resolve to an address that has something mapped there. KVM uses the checked version of get_user() and put_user(), so any faults are caught properly. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-4-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
fd706c9b16 |
KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible
Add kvm_vcpu_arch.is_amd_compatible to cache if a vCPU's vendor model is compatible with AMD, i.e. if the vCPU vendor is AMD or Hygon, along with helpers to check if a vCPU is compatible AMD vs. Intel. To handle Intel vs. AMD behavior related to masking the LVTPC entry, KVM will need to check for vendor compatibility on every PMI injection, i.e. querying for AMD will soon be a moderately hot path. Note! This subtly (or maybe not-so-subtly) makes "Intel compatible" KVM's default behavior, both if userspace omits (or never sets) CPUID 0x0 and if userspace sets a completely unknown vendor. One could argue that KVM should treat such vCPUs as not being compatible with Intel *or* AMD, but that would add useless complexity to KVM. KVM needs to do *something* in the face of vendor specific behavior, and so unless KVM conjured up a magic third option, choosing to treat unknown vendors as neither Intel nor AMD means that checks on AMD compatibility would yield Intel behavior, and checks for Intel compatibility would yield AMD behavior. And that's far worse as it would effectively yield random behavior depending on whether KVM checked for AMD vs. Intel vs. !AMD vs. !Intel. And practically speaking, all x86 CPUs follow either Intel or AMD architecture, i.e. "supporting" an unknown third architecture adds no value. Deliberately don't convert any of the existing guest_cpuid_is_intel() checks, as the Intel side of things is messier due to some flows explicitly checking for exactly vendor==Intel, versus some flows assuming anything that isn't "AMD compatible" gets Intel behavior. The Intel code will be cleaned up in the future. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240405235603.1173076-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
a952d608f0 |
KVM: Use vfree for memory allocated by vcalloc()/__vcalloc()
commit 37b2a6510a48("KVM: use __vcalloc for very large allocations")
replaced kvzalloc()/kvcalloc() with vcalloc(), but didn't replace kvfree()
with vfree().
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/r/20240131012357.53563-1-lirongqing@baidu.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
ed2e8d49b5 |
KVM: x86: Add BHI_NO
Intel processors that aren't vulnerable to BHI will set MSR_IA32_ARCH_CAPABILITIES[BHI_NO] = 1;. Guests may use this BHI_NO bit to determine if they need to implement BHI mitigations or not. Allow this bit to be passed to the guests. Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> |
|
|
|
4f712ee0cb |
S390:
* Changes to FPU handling came in via the main s390 pull request
* Only deliver to the guest the SCLP events that userspace has
requested.
* More virtual vs physical address fixes (only a cleanup since
virtual and physical address spaces are currently the same).
* Fix selftests undefined behavior.
x86:
* Fix a restriction that the guest can't program a PMU event whose
encoding matches an architectural event that isn't included in the
guest CPUID. The enumeration of an architectural event only says
that if a CPU supports an architectural event, then the event can be
programmed *using the architectural encoding*. The enumeration does
NOT say anything about the encoding when the CPU doesn't report support
the event *in general*. It might support it, and it might support it
using the same encoding that made it into the architectural PMU spec.
* Fix a variety of bugs in KVM's emulation of RDPMC (more details on
individual commits) and add a selftest to verify KVM correctly emulates
RDMPC, counter availability, and a variety of other PMC-related
behaviors that depend on guest CPUID and therefore are easier to
validate with selftests than with custom guests (aka kvm-unit-tests).
* Zero out PMU state on AMD if the virtual PMU is disabled, it does not
cause any bug but it wastes time in various cases where KVM would check
if a PMC event needs to be synthesized.
* Optimize triggering of emulated events, with a nice ~10% performance
improvement in VM-Exit microbenchmarks when a vPMU is exposed to the
guest.
* Tighten the check for "PMI in guest" to reduce false positives if an NMI
arrives in the host while KVM is handling an IRQ VM-Exit.
* Fix a bug where KVM would report stale/bogus exit qualification information
when exiting to userspace with an internal error exit code.
* Add a VMX flag in /proc/cpuinfo to report 5-level EPT support.
* Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
* Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM
doesn't support yielding in the middle of processing a zap, and 1GiB
granularity resulted in multi-millisecond lags that are quite impolite
for CONFIG_PREEMPT kernels.
* Allocate write-tracking metadata on-demand to avoid the memory overhead when
a kernel is built with i915 virtualization support but the workloads use
neither shadow paging nor i915 virtualization.
* Explicitly initialize a variety of on-stack variables in the emulator that
triggered KMSAN false positives.
* Fix the debugregs ABI for 32-bit KVM.
* Rework the "force immediate exit" code so that vendor code ultimately decides
how and when to force the exit, which allowed some optimization for both
Intel and AMD.
* Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
vCPU creation ultimately failed, causing extra unnecessary work.
* Cleanup the logic for checking if the currently loaded vCPU is in-kernel.
* Harden against underflowing the active mmu_notifier invalidation
count, so that "bad" invalidations (usually due to bugs elsehwere in the
kernel) are detected earlier and are less likely to hang the kernel.
x86 Xen emulation:
* Overlay pages can now be cached based on host virtual address,
instead of guest physical addresses. This removes the need to
reconfigure and invalidate the cache if the guest changes the
gpa but the underlying host virtual address remains the same.
* When possible, use a single host TSC value when computing the deadline for
Xen timers in order to improve the accuracy of the timer emulation.
* Inject pending upcall events when the vCPU software-enables its APIC to fix
a bug where an upcall can be lost (and to follow Xen's behavior).
* Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
events fails, e.g. if the guest has aliased xAPIC IDs.
RISC-V:
* Support exception and interrupt handling in selftests
* New self test for RISC-V architectural timer (Sstc extension)
* New extension support (Ztso, Zacas)
* Support userspace emulation of random number seed CSRs.
ARM:
* Infrastructure for building KVM's trap configuration based on the
architectural features (or lack thereof) advertised in the VM's ID
registers
* Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
x86's WC) at stage-2, improving the performance of interacting with
assigned devices that can tolerate it
* Conversion of KVM's representation of LPIs to an xarray, utilized to
address serialization some of the serialization on the LPI injection
path
* Support for _architectural_ VHE-only systems, advertised through the
absence of FEAT_E2H0 in the CPU's ID register
* Miscellaneous cleanups, fixes, and spelling corrections to KVM and
selftests
LoongArch:
* Set reserved bits as zero in CPUCFG.
* Start SW timer only when vcpu is blocking.
* Do not restart SW timer when it is expired.
* Remove unnecessary CSR register saving during enter guest.
* Misc cleanups and fixes as usual.
Generic:
* cleanup Kconfig by removing CONFIG_HAVE_KVM, which was basically always
true on all architectures except MIPS (where Kconfig determines the
available depending on CPU capabilities). It is replaced either by
an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM)
everywhere else.
* Factor common "select" statements in common code instead of requiring
each architecture to specify it
* Remove thoroughly obsolete APIs from the uapi headers.
* Move architecture-dependent stuff to uapi/asm/kvm.h
* Always flush the async page fault workqueue when a work item is being
removed, especially during vCPU destruction, to ensure that there are no
workers running in KVM code when all references to KVM-the-module are gone,
i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded.
* Grab a reference to the VM's mm_struct in the async #PF worker itself instead
of gifting the worker a reference, so that there's no need to remember
to *conditionally* clean up after the worker.
Selftests:
* Reduce boilerplate especially when utilize selftest TAP infrastructure.
* Add basic smoke tests for SEV and SEV-ES, along with a pile of library
support for handling private/encrypted/protected memory.
* Fix benign bugs where tests neglect to close() guest_memfd files.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmX0iP8UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroND7wf+JZoNvwZ+bmwWe/4jn/YwNoYi/C5z
eypn8M1gsWEccpCpqPBwznVm9T29rF4uOlcMvqLEkHfTpaL1EKUUjP1lXPz/ileP
6a2RdOGxAhyTiFC9fjy+wkkjtLbn1kZf6YsS0hjphP9+w0chNbdn0w81dFVnXryd
j7XYI8R/bFAthNsJOuZXSEjCfIHxvTTG74OrTf1B1FEBB+arPmrgUeJftMVhffQK
Sowgg8L/Ii/x6fgV5NZQVSIyVf1rp8z7c6UaHT4Fwb0+RAMW8p9pYv9Qp1YkKp8y
5j0V9UzOHP7FRaYimZ5BtwQoqiZXYylQ+VuU/Y2f4X85cvlLzSqxaEMAPA==
=mqOV
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"S390:
- Changes to FPU handling came in via the main s390 pull request
- Only deliver to the guest the SCLP events that userspace has
requested
- More virtual vs physical address fixes (only a cleanup since
virtual and physical address spaces are currently the same)
- Fix selftests undefined behavior
x86:
- Fix a restriction that the guest can't program a PMU event whose
encoding matches an architectural event that isn't included in the
guest CPUID. The enumeration of an architectural event only says
that if a CPU supports an architectural event, then the event can
be programmed *using the architectural encoding*. The enumeration
does NOT say anything about the encoding when the CPU doesn't
report support the event *in general*. It might support it, and it
might support it using the same encoding that made it into the
architectural PMU spec
- Fix a variety of bugs in KVM's emulation of RDPMC (more details on
individual commits) and add a selftest to verify KVM correctly
emulates RDMPC, counter availability, and a variety of other
PMC-related behaviors that depend on guest CPUID and therefore are
easier to validate with selftests than with custom guests (aka
kvm-unit-tests)
- Zero out PMU state on AMD if the virtual PMU is disabled, it does
not cause any bug but it wastes time in various cases where KVM
would check if a PMC event needs to be synthesized
- Optimize triggering of emulated events, with a nice ~10%
performance improvement in VM-Exit microbenchmarks when a vPMU is
exposed to the guest
- Tighten the check for "PMI in guest" to reduce false positives if
an NMI arrives in the host while KVM is handling an IRQ VM-Exit
- Fix a bug where KVM would report stale/bogus exit qualification
information when exiting to userspace with an internal error exit
code
- Add a VMX flag in /proc/cpuinfo to report 5-level EPT support
- Rework TDP MMU root unload, free, and alloc to run with mmu_lock
held for read, e.g. to avoid serializing vCPUs when userspace
deletes a memslot
- Tear down TDP MMU page tables at 4KiB granularity (used to be
1GiB). KVM doesn't support yielding in the middle of processing a
zap, and 1GiB granularity resulted in multi-millisecond lags that
are quite impolite for CONFIG_PREEMPT kernels
- Allocate write-tracking metadata on-demand to avoid the memory
overhead when a kernel is built with i915 virtualization support
but the workloads use neither shadow paging nor i915 virtualization
- Explicitly initialize a variety of on-stack variables in the
emulator that triggered KMSAN false positives
- Fix the debugregs ABI for 32-bit KVM
- Rework the "force immediate exit" code so that vendor code
ultimately decides how and when to force the exit, which allowed
some optimization for both Intel and AMD
- Fix a long-standing bug where kvm_has_noapic_vcpu could be left
elevated if vCPU creation ultimately failed, causing extra
unnecessary work
- Cleanup the logic for checking if the currently loaded vCPU is
in-kernel
- Harden against underflowing the active mmu_notifier invalidation
count, so that "bad" invalidations (usually due to bugs elsehwere
in the kernel) are detected earlier and are less likely to hang the
kernel
x86 Xen emulation:
- Overlay pages can now be cached based on host virtual address,
instead of guest physical addresses. This removes the need to
reconfigure and invalidate the cache if the guest changes the gpa
but the underlying host virtual address remains the same
- When possible, use a single host TSC value when computing the
deadline for Xen timers in order to improve the accuracy of the
timer emulation
- Inject pending upcall events when the vCPU software-enables its
APIC to fix a bug where an upcall can be lost (and to follow Xen's
behavior)
- Fall back to the slow path instead of warning if "fast" IRQ
delivery of Xen events fails, e.g. if the guest has aliased xAPIC
IDs
RISC-V:
- Support exception and interrupt handling in selftests
- New self test for RISC-V architectural timer (Sstc extension)
- New extension support (Ztso, Zacas)
- Support userspace emulation of random number seed CSRs
ARM:
- Infrastructure for building KVM's trap configuration based on the
architectural features (or lack thereof) advertised in the VM's ID
registers
- Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
x86's WC) at stage-2, improving the performance of interacting with
assigned devices that can tolerate it
- Conversion of KVM's representation of LPIs to an xarray, utilized
to address serialization some of the serialization on the LPI
injection path
- Support for _architectural_ VHE-only systems, advertised through
the absence of FEAT_E2H0 in the CPU's ID register
- Miscellaneous cleanups, fixes, and spelling corrections to KVM and
selftests
LoongArch:
- Set reserved bits as zero in CPUCFG
- Start SW timer only when vcpu is blocking
- Do not restart SW timer when it is expired
- Remove unnecessary CSR register saving during enter guest
- Misc cleanups and fixes as usual
Generic:
- Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically
always true on all architectures except MIPS (where Kconfig
determines the available depending on CPU capabilities). It is
replaced either by an architecture-dependent symbol for MIPS, and
IS_ENABLED(CONFIG_KVM) everywhere else
- Factor common "select" statements in common code instead of
requiring each architecture to specify it
- Remove thoroughly obsolete APIs from the uapi headers
- Move architecture-dependent stuff to uapi/asm/kvm.h
- Always flush the async page fault workqueue when a work item is
being removed, especially during vCPU destruction, to ensure that
there are no workers running in KVM code when all references to
KVM-the-module are gone, i.e. to prevent a very unlikely
use-after-free if kvm.ko is unloaded
- Grab a reference to the VM's mm_struct in the async #PF worker
itself instead of gifting the worker a reference, so that there's
no need to remember to *conditionally* clean up after the worker
Selftests:
- Reduce boilerplate especially when utilize selftest TAP
infrastructure
- Add basic smoke tests for SEV and SEV-ES, along with a pile of
library support for handling private/encrypted/protected memory
- Fix benign bugs where tests neglect to close() guest_memfd files"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
selftests: kvm: remove meaningless assignments in Makefiles
KVM: riscv: selftests: Add Zacas extension to get-reg-list test
RISC-V: KVM: Allow Zacas extension for Guest/VM
KVM: riscv: selftests: Add Ztso extension to get-reg-list test
RISC-V: KVM: Allow Ztso extension for Guest/VM
RISC-V: KVM: Forward SEED CSR access to user space
KVM: riscv: selftests: Add sstc timer test
KVM: riscv: selftests: Change vcpu_has_ext to a common function
KVM: riscv: selftests: Add guest helper to get vcpu id
KVM: riscv: selftests: Add exception handling support
LoongArch: KVM: Remove unnecessary CSR register saving during enter guest
LoongArch: KVM: Do not restart SW timer when it is expired
LoongArch: KVM: Start SW timer only when vcpu is blocking
LoongArch: KVM: Set reserved bits as zero in CPUCFG
KVM: selftests: Explicitly close guest_memfd files in some gmem tests
KVM: x86/xen: fix recursive deadlock in timer injection
KVM: pfncache: simplify locking and make more self-contained
KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
KVM: x86/xen: improve accuracy of Xen timers
...
|
|
|
|
0e33cf955f |
* Mitigate RFDS vulnerability
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmXvZgoACgkQaDWVMHDJ krC2Eg//aZKBp97/DSzRqXKDwJzVUr0sGJ9cii0gVT1sI+1U6ZZCh/roVH4xOT5/ HqtOOnQ+X0mwUx2VG3Yv2VPI7VW68sJ3/y9D8R4tnMEsyQ4CmDw96Pre3NyKr/Av jmW7SK94fOkpNFJOMk3zpk7GtRUlCsVkS1P61dOmMYduguhel/V20rWlx83BgnAY Rf/c3rBjqe8Ri3rzBP5icY/d6OgwoafuhME31DD/j6oKOh+EoQBvA4urj46yMTMX /mrK7hCm/wqwuOOvgGbo7sfZNBLCYy3SZ3EyF4beDERhPF1DaSvCwOULpGVJroqu SelFsKXAtEbYrDgsan+MYlx3bQv43q7PbHska1gjkH91plO4nAsssPr5VsusUKmT sq8jyBaauZb40oLOSgooL4RqAHrfs8q5695Ouwh/DB/XovMezUI1N/BkpGFmqpJI o2xH9P5q520pkB8pFhN9TbRuFSGe/dbWC24QTq1DUajo3M3RwcwX6ua9hoAKLtDF pCV5DNcVcXHD3Cxp0M5dQ5JEAiCnW+ZpUWgxPQamGDNW5PEvjDmFwql2uWw/qOuW lkheOIffq8ejUBQFbN8VXfIzzeeKQNFiIcViaqGITjIwhqdHAzVi28OuIGwtdh3g ywLzSC8yvyzgKrNBgtFMr3ucKN0FoPxpBro253xt2H7w8srXW64= =5V9t -----END PGP SIGNATURE----- Merge tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RFDS mitigation from Dave Hansen: "RFDS is a CPU vulnerability that may allow a malicious userspace to infer stale register values from kernel space. Kernel registers can have all kinds of secrets in them so the mitigation is basically to wait until the kernel is about to return to userspace and has user values in the registers. At that point there is little chance of kernel secrets ending up in the registers and the microarchitectural state can be cleared. This leverages some recent robustness fixes for the existing MDS vulnerability. Both MDS and RFDS use the VERW instruction for mitigation" * tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests x86/rfds: Mitigate Register File Data Sampling (RFDS) Documentation/hw-vuln: Add documentation for RFDS x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set |
|
|
|
2a0180129d |
KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests
Mitigation for RFDS requires RFDS_CLEAR capability which is enumerated by MSR_IA32_ARCH_CAPABILITIES bit 27. If the host has it set, export it to guests so that they can deploy the mitigation. RFDS_NO indicates that the system is not vulnerable to RFDS, export it to guests so that they don't deploy the mitigation unnecessarily. When the host is not affected by X86_BUG_RFDS, but has RFDS_NO=0, synthesize RFDS_NO to the guest. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> |
|
|
|
e9a2bba476 |
KVM Xen and pfncache changes for 6.9:
- Rip out the half-baked support for using gfn_to_pfn caches to manage pages
that are "mapped" into guests via physical addresses.
- Add support for using gfn_to_pfn caches with only a host virtual address,
i.e. to bypass the "gfn" stage of the cache. The primary use case is
overlay pages, where the guest may change the gfn used to reference the
overlay page, but the backing hva+pfn remains the same.
- Add an ioctl() to allow mapping Xen's shared_info page using an hva instead
of a gpa, so that userspace doesn't need to reconfigure and invalidate the
cache/mapping if the guest changes the gpa (but userspace keeps the resolved
hva the same).
- When possible, use a single host TSC value when computing the deadline for
Xen timers in order to improve the accuracy of the timer emulation.
- Inject pending upcall events when the vCPU software-enables its APIC to fix
a bug where an upcall can be lost (and to follow Xen's behavior).
- Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
events fails, e.g. if the guest has aliased xAPIC IDs.
- Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to
refresh), and drop a now-redundant acquisition of xen_lock (that was
protecting the shared_info cache) to fix a deadlock due to recursively
acquiring xen_lock.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrblYACgkQOlYIJqCj
N/3K4Q/+KZ8lrnNXvdHNCQdosA5DDXpqUcRzhlTUp82fncpdJ0LqrSMzMots2Eh9
KC0jSPo8EkivF+Epug0+bpQBEaLXzTWhRcS1grePCDz2lBnxoHFSWjvaK2p14KlC
LvxCJZjxyfLKHwKHpSndvO9hVFElCY3mvvE9KRcKeQAmrz1cz+DDMKelo1MuV8D+
GfymhYc+UXpY41+6hQdznx+WoGoXKRameo3iGYuBoJjvKOyl4Wxkx9WSXIxxxuqG
kHxjiWTR/jF1ITJl6PeMrFcGl3cuGKM/UfTOM6W2h6Wi3mhLpXveoVLnqR1kipIj
btSzSVHL7C4WTPwOcyhwPzap+dJmm31c6N0uPScT7r9yhs+q5BDj26vcVcyPZUHo
efIwmsnO2eQvuw+f8C6QqWCPaxvw46N0zxzwgc5uA3jvAC93y0l4v+xlAQsC0wzV
0+BwU00cutH/3t3c/WPD5QcmRLH726VoFuTlaDufpoMU7gBVJ8rzjcusxR+5BKT+
GJcAgZxZhEgvnzmTKd4Ec/mt+xZ2Erd+kV3MKCHvDPyj8jqy8FQ4DAWKGBR+h3WR
rqAs2k8NPHyh3i1a3FL1opmxEGsRS+Cnc6Bi77cj9DxTr22JkgDJEuFR+Ues1z6/
SpE889kt3w5zTo34+lNxNPlIKmO0ICwwhDL6pxJTWU7iWQnKypU=
=GliW
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM Xen and pfncache changes for 6.9:
- Rip out the half-baked support for using gfn_to_pfn caches to manage pages
that are "mapped" into guests via physical addresses.
- Add support for using gfn_to_pfn caches with only a host virtual address,
i.e. to bypass the "gfn" stage of the cache. The primary use case is
overlay pages, where the guest may change the gfn used to reference the
overlay page, but the backing hva+pfn remains the same.
- Add an ioctl() to allow mapping Xen's shared_info page using an hva instead
of a gpa, so that userspace doesn't need to reconfigure and invalidate the
cache/mapping if the guest changes the gpa (but userspace keeps the resolved
hva the same).
- When possible, use a single host TSC value when computing the deadline for
Xen timers in order to improve the accuracy of the timer emulation.
- Inject pending upcall events when the vCPU software-enables its APIC to fix
a bug where an upcall can be lost (and to follow Xen's behavior).
- Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
events fails, e.g. if the guest has aliased xAPIC IDs.
- Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to
refresh), and drop a now-redundant acquisition of xen_lock (that was
protecting the shared_info cache) to fix a deadlock due to recursively
acquiring xen_lock.
|
|
|
|
e9025cdd8c |
KVM x86 PMU changes for 6.9:
- Fix several bugs where KVM speciously prevents the guest from utilizing
fixed counters and architectural event encodings based on whether or not
guest CPUID reports support for the _architectural_ encoding.
- Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads,
priority of VMX interception vs #GP, PMC types in architectural PMUs, etc.
- Add a selftest to verify KVM correctly emulates RDMPC, counter availability,
and a variety of other PMC-related behaviors that depend on guest CPUID,
i.e. are difficult to validate via KVM-Unit-Tests.
- Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting
cycles, e.g. when checking if a PMC event needs to be synthesized when
skipping an instruction.
- Optimize triggering of emulated events, e.g. for "count instructions" events
when skipping an instruction, which yields a ~10% performance improvement in
VM-Exit microbenchmarks when a vPMU is exposed to the guest.
- Tighten the check for "PMI in guest" to reduce false positives if an NMI
arrives in the host while KVM is handling an IRQ VM-Exit.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrUFQACgkQOlYIJqCj
N/11dhAAnr9e6mPmXvaH4YKcvOGgTmwIQdi5W4IBzGm27ErEb0Vyskx3UATRhRm+
gZyp3wNgEA9LeifICDNu4ypn7HZcl2VtRql6FYcB8Bcu8OiHfU8PhWL0/qrpY20e
zffUj2tDweq2ft9Iks1SQJD0sxFkcXIcSKOffP7pRZJHFTKLltGORXwxzd9HJHPY
nc4nERKegK2yH4A4gY6nZ0oV5L3OMUNHx815db5Y+HxXOIjBCjTQiNNd6mUdyX1N
C5sIiElXLdvRTSDvirHfA32LqNwnajDGox4QKZkB3wszCxJ3kRd4OCkTEKMYKHxd
KoKCJQnAdJFFW9xqbT8nNKXZ+hg2+ZQuoSaBuwKryf7jWi0e6a7jcV0OH+cQSZw7
UNudKhs3r4ambfvnFp2IVZlZREMDB+LAjo2So48Jn/JGCAzqte3XqwVKskn9pS9S
qeauXCdOLioZALYtTBl8RM1rEY5mbwQrpPv9CzbeU09qQ/hpXV14W9GmbyeOZcI1
T1cYgEqlLuifRluwT/hxrY321+4noF116gSK1yb07x/sJU8/lhRooEk9V562066E
qo6nIvc7Bv9gTGLwo6VReKSPcTT/6t3HwgPsRjqe+evso3EFN9f9hG+uPxtO6TUj
pdPm3mkj2KfxDdJLf+Ys16gyGdiwI0ZImIkA0uLdM0zftNsrb4Y=
=vayI
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.9:
- Fix several bugs where KVM speciously prevents the guest from utilizing
fixed counters and architectural event encodings based on whether or not
guest CPUID reports support for the _architectural_ encoding.
- Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads,
priority of VMX interception vs #GP, PMC types in architectural PMUs, etc.
- Add a selftest to verify KVM correctly emulates RDMPC, counter availability,
and a variety of other PMC-related behaviors that depend on guest CPUID,
i.e. are difficult to validate via KVM-Unit-Tests.
- Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting
cycles, e.g. when checking if a PMC event needs to be synthesized when
skipping an instruction.
- Optimize triggering of emulated events, e.g. for "count instructions" events
when skipping an instruction, which yields a ~10% performance improvement in
VM-Exit microbenchmarks when a vPMU is exposed to the guest.
- Tighten the check for "PMI in guest" to reduce false positives if an NMI
arrives in the host while KVM is handling an IRQ VM-Exit.
|
|
|
|
41ebae2ecd |
KVM x86 MMU changes for 6.9:
- Clean up code related to unprotecting shadow pages when retrying a guest
instruction after failed #PF-induced emulation.
- Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
a reschedule is needed, e.g. if a high priority task needs to run. Because
KVM doesn't support yielding in the middle of processing a zapped non-leaf
SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
attempting to schedule in a high priority.
- Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
- Allocate write-tracking metadata on-demand to avoid the memory overhead when
running kernels built with KVMGT support (external write-tracking enabled),
but for workloads that don't use nested virtualization (shadow paging) or
KVMGT.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrTH4ACgkQOlYIJqCj
N/1q3xAAh3wpUDzRfkNkgGUbulhuJmQ72PiaW3NRoMo/3Rowegsdgt1N3/ec+fcJ
Awx0KUM8Cju8O2Zqp6NzKwUkddCni8dHmOa55NJQuK2M1OpnE0RjBB94n+AFJZki
mm8wKSKNgjlVeJDG87+RLPnbaeEvqYPp22oNKJyAPsimTbxvmhIqtg8qdyujGPXA
Jke7LXgtVGav+nEzXiLh86VU/agoBJc/zt+hiuLvamU5Y8so+zReqFbrDtvsgtpV
ryvMbDZxcPXKrsBP+B7syqUAbODcmh/wkzOCZ4Tby5yurEaw1rwpZIH0BRKRgGx2
F2JqWayYsCOsrJ4DwQre8RfLMtbEKB2BBWkZlYyblAy0++1LcTP9pSk5YC5lSL71
5Oszql9DKi10Vq5IfR/ehsr6mHXFr3AB7C7QefiXpytGbObQs8/f/OxinxaEajcs
ERBgh+rcQ5p3kfdiHzuQjn7y45J7z21CKVhka4iKJtTxypBK4ZvkDOVqHuHppb5O
aw6rC5HR1EKhSW4jz7QWrDExtDZ2X5HeYl8TgfHncSSJRc7urKYcSCHhXJsB6BPs
iQf0xbHaIOyH9jmoqLZjz0QZmXB9fydQ/zAlFVXZsrNHvomayVjqrpl8UFTMdhuI
zll9ynfRRHMUkIi1YubUlmFMgBeqOXGkfBFh8QUH3+YiI7Cwzh4=
=SgFo
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mmu-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.9:
- Clean up code related to unprotecting shadow pages when retrying a guest
instruction after failed #PF-induced emulation.
- Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
a reschedule is needed, e.g. if a high priority task needs to run. Because
KVM doesn't support yielding in the middle of processing a zapped non-leaf
SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
attempting to schedule in a high priority.
- Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
- Allocate write-tracking metadata on-demand to avoid the memory overhead when
running kernels built with KVMGT support (external write-tracking enabled),
but for workloads that don't use nested virtualization (shadow paging) or
KVMGT.
|
|
|
|
c9cd0beae9 |
KVM x86 misc changes for 6.9:
- Explicitly initialize a variety of on-stack variables in the emulator that
triggered KMSAN false positives (though in fairness in KMSAN, it's comically
difficult to see that the uninitialized memory is never truly consumed).
- Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading
DR6 and DR7.
- Rework the "force immediate exit" code so that vendor code ultimately
decides how and when to force the exit. This allows VMX to further optimize
handling preemption timer exits, and allows SVM to avoid sending a duplicate
IPI (SVM also has a need to force an exit).
- Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
vCPU creation ultimately failed, and add WARN to guard against similar bugs.
- Provide a dedicated arch hook for checking if a different vCPU was in-kernel
(for directed yield), and simplify the logic for checking if the currently
loaded vCPU is in-kernel.
- Misc cleanups and fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrRjQACgkQOlYIJqCj
N/2Dzw//b+ptSBAl1kGBRmk/DqsX7J9ZkQYCQOTeh1vXiUM+XRTSQoArN0Oo1roy
3wcEnQ0beVw7jMuzZ8UUuTfU8WUMja/kwltnqXYNHwLnb6yH0I/BIengXWdUdAMc
FmgPZ4qJR2IzKYzvDsc3eEQ515O8UHWakyVDnmLBtiakAeBcUTYceHpEEPpzE5y5
ODASTQKM9o/h8R8JwKFTJ8/mrOLNcsu5SycwFdnmubLJCrNWtJWTijA6y1lh6shn
hbEJex+ESoC2v8p7IP53u1SGJubVlPajt+RkYJtlEI3WVsevp024eYcF4nb1OjXi
qS2Y3W7DQGWvyCBoSzoMY+9nRMgyOOpHYetdiz+9oZOmnjiYWY0ku59U7Gv+Aotj
AUbCn4Ry/OpqsuZ7Oo7i3IT8R7uzsTeNNdxhYBn1OQquBEZ0KBYXlZkGfTk9K0t0
Fhka/5Zu6fBlg5J+zCyaXUGmsGWBo/9HxsC5z1JuKo8fatro5qyqYE5KiM01dkqc
6FET6gL+fFprC5c67JGRPdEtk6F9Emb+6oiTTA8/8q8JQQAKiJKk95Nlq7KzPfVS
A5RQPTuTJ7acE/5CY4zB1DdxCjqgnonBEA2ULnA/J10Rk8orHJRnGJcEwKEyDrZh
HpsxIIqt++i8KffORpCym6zSAVYuQjn1mu7MGth+zuCqhcEpBfc=
=GX0O
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.9:
- Explicitly initialize a variety of on-stack variables in the emulator that
triggered KMSAN false positives (though in fairness in KMSAN, it's comically
difficult to see that the uninitialized memory is never truly consumed).
- Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading
DR6 and DR7.
- Rework the "force immediate exit" code so that vendor code ultimately
decides how and when to force the exit. This allows VMX to further optimize
handling preemption timer exits, and allows SVM to avoid sending a duplicate
IPI (SVM also has a need to force an exit).
- Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
vCPU creation ultimately failed, and add WARN to guard against similar bugs.
- Provide a dedicated arch hook for checking if a different vCPU was in-kernel
(for directed yield), and simplify the logic for checking if the currently
loaded vCPU is in-kernel.
- Misc cleanups and fixes.
|
|
|
|
39fee313fd |
Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating ABI that KVM can't sanely support. - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely a development and testing vehicle, and come with zero guarantees. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged. |
|
|
|
451a707813 |
KVM: x86/xen: improve accuracy of Xen timers
A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time, giving incorrect sleep times:
root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)
The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). The
guest then converts that guest TSC value into nanoseconds using the
mul/shift conversion given to it by the KVM pvclock information.
But get_kvmclock_ns() performs only a single conversion directly from
host TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.
It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each
vCPU could be running at a *different* frequency. But this patch is
needed anyway because...
The other issue with Xen timers was that the code would snapshot the
host CLOCK_MONOTONIC at some point in time, and then... after a few
interrupts may have occurred, some preemption perhaps... would also read
the guest's kvmclock. Then it would proceed under the false assumption
that those two happened at the *same* time. Any time which *actually*
elapsed between reading the two clocks was introduced as inaccuracies
in the time at which the timer fired.
Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
host TSC just *once*, then use the returned TSC value to calculate the
kvmclock (making sure to do that the way the guest would instead of
making the same mistake get_kvmclock_ns() does).
Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
timers still have to use CLOCK_MONOTONIC. In practice the difference
between the two won't matter over the timescales involved, as the
*absolute* values don't matter; just the delta.
This does mean a new variant of kvm_get_time_and_clockread() is needed;
called kvm_get_monotonic_and_clockread() because that's what it does.
Fixes:
|
|
|
|
a1176ef5c9 |
KVM: x86/mmu: Restrict KVM_SW_PROTECTED_VM to the TDP MMU
Advertise and support software-protected VMs if and only if the TDP MMU is
enabled, i.e. disallow KVM_SW_PROTECTED_VM if TDP is enabled for KVM's
legacy/shadow MMU. TDP support for the shadow MMU is maintenance-only,
e.g. support for TDX and SNP will also be restricted to the TDP MMU.
Fixes:
|
|
|
|
322d79f1db |
KVM: x86: Clean up directed yield API for "has pending interrupt"
Directly return the boolean result of whether or not a vCPU has a pending interrupt instead of effectively doing: if (true) return true; return false; Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240110003938.490206-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
9b8615c5d3 |
KVM: x86: Rely solely on preempted_in_kernel flag for directed yield
Snapshot preempted_in_kernel using kvm_arch_vcpu_in_kernel() so that the flag is "accurate" (or rather, consistent and deterministic within KVM) for guests with protected state, and explicitly use preempted_in_kernel when checking if a vCPU was preempted in kernel mode instead of bouncing through kvm_arch_vcpu_in_kernel(). Drop the gnarly logic in kvm_arch_vcpu_in_kernel() that redirects to preempted_in_kernel if the target vCPU is not the "running", i.e. loaded, vCPU, as the only reason that code existed was for the directed yield case where KVM wants to check the CPL of a vCPU that may or may not be loaded on the current pCPU. Cc: Like Xu <like.xu.linux@gmail.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240110003938.490206-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
77bcd9e623 |
KVM: Add dedicated arch hook for querying if vCPU was preempted in-kernel
Plumb in a dedicated hook for querying whether or not a vCPU was preempted in-kernel. Unlike literally every other architecture, x86's VMX can check if a vCPU is in kernel context if and only if the vCPU is loaded on the current pCPU. x86's kvm_arch_vcpu_in_kernel() works around the limitation by querying kvm_get_running_vcpu() and redirecting to vcpu->arch.preempted_in_kernel as needed. But that's unnecessary, confusing, and fragile, e.g. x86 has had at least one bug where KVM incorrectly used a stale preempted_in_kernel. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240110003938.490206-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
fc3c94142b |
KVM: x86: Sanity check that kvm_has_noapic_vcpu is zero at module_exit()
WARN if kvm.ko is unloaded with an elevated kvm_has_noapic_vcpu to guard against incorrect management of the key, e.g. to detect if KVM fails to decrement the key in error paths. Because kvm_has_noapic_vcpu is purely an optimization, in all likelihood KVM could completely botch handling of kvm_has_noapic_vcpu and no one would notice (which is a good argument for deleting the key entirely, but that's a problem for another day). Note, ideally the sanity check would be performance when kvm_usage_count goes to zero, but adding an arch callback just for this sanity check isn't at all worth doing. Link: https://lore.kernel.org/r/20240209222047.394389-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
a78d904669 |
KVM: x86: Move "KVM no-APIC vCPU" key management into local APIC code
Move incrementing and decrementing of kvm_has_noapic_vcpu into kvm_create_lapic() and kvm_free_lapic() respectively to fix a benign bug where KVM fails to decrement the count if vCPU creation ultimately fails, e.g. due to a memory allocation failing. Note, the bug is benign as kvm_has_noapic_vcpu is used purely to optimize lapic_in_kernel() checks, and that optimization is quite dubious. That, and practically speaking no setup that cares at all about performance runs with a userspace local APIC. Reported-by: Li RongQing <lirongqing@baidu.com> Cc: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com> Link: https://lore.kernel.org/r/20240209222047.394389-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
0ec3d6d1f1 |
KVM: x86: Fully defer to vendor code to decide how to force immediate exit
Now that vmx->req_immediate_exit is used only in the scope of vmx_vcpu_run(), use force_immediate_exit to detect that KVM should usurp the VMX preemption to force a VM-Exit and let vendor code fully handle forcing a VM-Exit. Opportunsitically drop __kvm_request_immediate_exit() and just have vendor code call smp_send_reschedule() directly. SVM already does this when injecting an event while also trying to single-step an IRET, i.e. it's not exactly secret knowledge that KVM uses a reschedule IPI to force an exit. Link: https://lore.kernel.org/r/20240110012705.506918-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
9c9025ea00 |
KVM: x86: Plumb "force_immediate_exit" into kvm_entry() tracepoint
Annotate the kvm_entry() tracepoint with "immediate exit" when KVM is forcing a VM-Exit immediately after VM-Enter, e.g. when KVM wants to inject an event but needs to first complete some other operation. Knowing that KVM is (or isn't) forcing an exit is useful information when debugging issues related to event injection. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240110012705.506918-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
dfeef3d3f3 |
KVM: x86: Drop superfluous check on direct MMU vs. WRITE_PF_TO_SP flag
Remove reexecute_instruction()'s final check on the MMU being direct, as
EMULTYPE_WRITE_PF_TO_SP is only ever set if the MMU is indirect, i.e. is a
shadow MMU. Prior to commit
|
|
|
|
515c18a64e |
KVM: x86: Drop dedicated logic for direct MMUs in reexecute_instruction()
Now that KVM doesn't pointlessly acquire mmu_lock for direct MMUs, drop the dedicated path entirely and always query indirect_shadow_pages when deciding whether or not to try unprotecting the gfn. For indirect, a.k.a. shadow MMUs, checking indirect_shadow_pages is harmless; unless *every* shadow page was somehow zapped while KVM was attempting to emulate the instruction, indirect_shadow_pages is guaranteed to be non-zero. Well, unless the instruction used a direct hugepage with 2-level paging for its code page, but in that case, there's obviously nothing to unprotect. And in the extremely unlikely case all shadow pages were zapped, there's again obviously nothing to unprotect. Link: https://lore.kernel.org/r/20240203002343.383056-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
474b99ed70 |
KVM: x86/mmu: Don't acquire mmu_lock when using indirect_shadow_pages as a heuristic
Drop KVM's completely pointless acquisition of mmu_lock when deciding whether or not to unprotect any shadow pages residing at the gfn before resuming the guest to let it retry an instruction that KVM failed to emulated. In this case, indirect_shadow_pages is used as a coarse-grained heuristic to check if there is any chance of there being a relevant shadow page to unprotected. But acquiring mmu_lock largely defeats any benefit to the heuristic, as taking mmu_lock for write is likely far more costly to the VM as a whole than unnecessarily walking mmu_page_hash. Furthermore, the current code is already prone to false negatives and false positives, as it drops mmu_lock before checking the flag and unprotecting shadow pages. And as evidenced by the lack of bug reports, neither false positives nor false negatives are problematic. A false positive simply means that KVM will try to unprotect shadow pages that have already been zapped. And a false negative means that KVM will resume the guest without unprotecting the gfn, i.e. if a shadow page was _just_ created, the vCPU will hit the same page fault and do the whole dance all over again, and detect and unprotect the shadow page the second time around (or not, if something else zaps it first). Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> [sean: drop READ_ONCE() and comment change, rewrite changelog] Link: https://lore.kernel.org/r/20240203002343.383056-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
2a5f091ce1 |
KVM: x86: Open code all direct reads to guest DR6 and DR7
Bite the bullet, and open code all direct reads of DR6 and DR7. KVM currently has a mix of open coded accesses and calls to kvm_get_dr(), which is confusing and ugly because there's no rhyme or reason as to why any particular chunk of code uses kvm_get_dr(). The obvious alternative is to force all accesses through kvm_get_dr(), but it's not at all clear that doing so would be a net positive, e.g. even if KVM ends up wanting/needing to force all reads through a common helper, e.g. to play caching games, the cost of reverting this change is likely lower than the ongoing cost of maintaining weird, arbitrary code. No functional change intended. Cc: Mathias Krause <minipli@grsecurity.net> Reviewed-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240209220752.388160-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
fc5375dd8c |
KVM: x86: Make kvm_get_dr() return a value, not use an out parameter
Convert kvm_get_dr()'s output parameter to a return value, and clean up most of the mess that was created by forcing callers to provide a pointer. No functional change intended. Acked-by: Mathias Krause <minipli@grsecurity.net> Reviewed-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240209220752.388160-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
615451d8cb |
KVM: x86/xen: advertize the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA capability
Now that all relevant kernel changes and selftests are in place, enable the new capability. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-17-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
a4bff3df51 |
KVM: pfncache: remove KVM_GUEST_USES_PFN usage
As noted in [1] the KVM_GUEST_USES_PFN usage flag is never set by any callers of kvm_gpc_init(), and for good reason: the implementation is incomplete/broken. And it's not clear that there will ever be a user of KVM_GUEST_USES_PFN, as coordinating vCPUs with mmu_notifier events is non-trivial. Remove KVM_GUEST_USES_PFN and all related code, e.g. dropping KVM_GUEST_USES_PFN also makes the 'vcpu' argument redundant, to avoid having to reason about broken code as __kvm_gpc_refresh() evolves. Moreover, all existing callers specify KVM_HOST_USES_PFN so the usage check in hva_to_pfn_retry() and hence the 'usage' argument to kvm_gpc_init() are also redundant. [1] https://lore.kernel.org/all/ZQiR8IpqOZrOpzHC@google.com Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-6-paul@xen.org [sean: explicitly call out that guest usage is incomplete] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
78b74638eb |
KVM: pfncache: add a mark-dirty helper
At the moment pages are marked dirty by open-coded calls to mark_page_dirty_in_slot(), directly deferefencing the gpa and memslot from the cache. After a subsequent patch these may not always be set so add a helper now so that caller will protected from the need to know about this detail. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-5-paul@xen.org [sean: decrease indentation, use gpa_to_gfn()] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
910c57dfa4 |
KVM: x86: Mark target gfn of emulated atomic instruction as dirty
When emulating an atomic access on behalf of the guest, mark the target
gfn dirty if the CMPXCHG by KVM is attempted and doesn't fault. This
fixes a bug where KVM effectively corrupts guest memory during live
migration by writing to guest memory without informing userspace that the
page is dirty.
Marking the page dirty got unintentionally dropped when KVM's emulated
CMPXCHG was converted to do a user access. Before that, KVM explicitly
mapped the guest page into kernel memory, and marked the page dirty during
the unmap phase.
Mark the page dirty even if the CMPXCHG fails, as the old data is written
back on failure, i.e. the page is still written. The value written is
guaranteed to be the same because the operation is atomic, but KVM's ABI
is that all writes are dirty logged regardless of the value written. And
more importantly, that's what KVM did before the buggy commit.
Huge kudos to the folks on the Cc list (and many others), who did all the
actual work of triaging and debugging.
Fixes:
|
|
|
|
2f8ebe43a0 |
KVM selftests fixes/cleanups (and one KVM x86 cleanup) for 6.8:
- Remove redundant newlines from error messages.
- Delete an unused variable in the AMX test (which causes build failures when
compiling with -Werror).
- Fail instead of skipping tests if open(), e.g. of /dev/kvm, fails with an
error code other than ENOENT (a Hyper-V selftest bug resulted in an EMFILE,
and the test eventually got skipped).
- Fix TSC related bugs in several Hyper-V selftests.
- Fix a bug in the dirty ring logging test where a sem_post() could be left
pending across multiple runs, resulting in incorrect synchronization between
the main thread and the vCPU worker thread.
- Relax the dirty log split test's assertions on 4KiB mappings to fix false
positives due to the number of mappings for memslot 0 (used for code and
data that is NOT being dirty logged) changing, e.g. due to NUMA balancing.
- Have KVM's gtod_is_based_on_tsc() return "bool" instead of an "int" (the
function generates boolean values, and all callers treat the return value as
a bool).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmXKupQSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5DiQP/RNSgLrE9+/3oyqo9zpbhio2dKqz4dIk
8Ga1ZE4R89dyMB9jGKtWn3rEkyma3TsB+neVpG9ohHV6j25JJ0vNAkxQu3Gt+gkl
uM1lh/IfXPnAKyuy6dW9tpgZYE1v2/KfdWjeEzzxfPjzY/LX3yFiiCKEnUmfjjzZ
sSz91nV4KYS4b4xLWTIcBgNJuyLJuL05htTLmCu7t8DKOBHwHxXjSn8qqG8OvAjs
FOhf0zgGJKBFdKOw2Y8XeDdKO0RTEyEPHaFILcLEsuhoVIbY5OUmLe32pAFzzMbG
hPawUZ5CzC++e339gUgGkRNY80iSnGcYVcZa+ohxOsNBdOWko9z/eGWZUV7qkYDK
dkPHMoDnSzUCE2eSYbEB1eR/KOfziJCWMS9SAIJbJxIGb1HYajikwAEZ6FNp3R+u
MyCuNlV9TfsGgt4Dx8RctMeH2ROpORRu7h3WPFUBgG2/jOzPk/OR6U8hSzvmhTvL
MykZ8IaLmUIYoK/nCY2iwy50lQRxtZ/htqWn3sidCBGY0DXdNlMhvd3Vk9jtUvY5
Fgof0b564eYfk/qO3cMIDd2WFaDejP28JVSn0CNm6z9i54ubCKkSBEb4kTYXXnVK
YBHvbZ21Vjg52trudvK5UPt599sxxNBNiSV32ckLFKHS4ZVGSFSBSbsAWiQF157i
CbYntmtJhM+D
=infW
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-selftests-6.8-rcN' of https://github.com/kvm-x86/linux into HEAD
KVM selftests fixes/cleanups (and one KVM x86 cleanup) for 6.8:
- Remove redundant newlines from error messages.
- Delete an unused variable in the AMX test (which causes build failures when
compiling with -Werror).
- Fail instead of skipping tests if open(), e.g. of /dev/kvm, fails with an
error code other than ENOENT (a Hyper-V selftest bug resulted in an EMFILE,
and the test eventually got skipped).
- Fix TSC related bugs in several Hyper-V selftests.
- Fix a bug in the dirty ring logging test where a sem_post() could be left
pending across multiple runs, resulting in incorrect synchronization between
the main thread and the vCPU worker thread.
- Relax the dirty log split test's assertions on 4KiB mappings to fix false
positives due to the number of mappings for memslot 0 (used for code and
data that is NOT being dirty logged) changing, e.g. due to NUMA balancing.
- Have KVM's gtod_is_based_on_tsc() return "bool" instead of an "int" (the
function generates boolean values, and all callers treat the return value as
a bool).
|
|
|
|
22d0bc0721 |
KVM x86 fixes for 6.8:
- Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only
if the incoming events->nmi.pending is non-zero. If the target vCPU is in
the UNITIALIZED state, the spurious request will result in KVM exiting to
userspace, which in turn causes QEMU to constantly acquire and release
QEMU's global mutex, to the point where the BSP is unable to make forward
progress.
- Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being
incorrectly truncated, and ultimately causes KVM to think a fixed counter
has already been disabled (KVM thinks the old value is '0').
- Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace
that is ultimately ignored due to ignore_msrs=true doesn't zero the output
as intended.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmXKt90SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5e5wP/jU3Zuul2e7fb4E6RN/GPhAFSTzG7Cwe
4lVSSSPmOQsEXTKwCOMj7fgwF9qVSLzLRi62MKziTJY/1FDsTcI3xlM7nM2wwQC2
26evIzI3qB54rHQdviuh1jwh6scZH7xLw7kANE+8x4skkm6AZB1IUnj3utR3fEPj
mIUA5kGQxEAEDrn0TFzrRgIw4JngKjrCwmpT+vbmR37flC+Rwv8jr4JY1E3cBAT3
KEilv3Fg07gbvagWGZNSSUNqQos5MsnLifdryKbA/vuIJf+j/01CMo5KtLKshiaX
t4gXPldVZDXdxjH6im0wRAX4s/FpZg3vVje2OxPbzwMVb5+XvLewzjzagQ1lFA3I
gsNXF8uGdYn0fb8T/wQG4ulWBw6A844PSmGONCwLDA+GZuL9xjMIK5d1litvb/im
bEP1Ahv6UcnDNKHqRzuFXQENiS2uQdJNLs7p291oDNkTm/CGjDUgFXPuaCehWrUf
ZZf1dxmIPM/Xt2j19mS/HnTHD114A8t1GTx799kBXbG4x0ScVQclkhRk6yFG3ObA
14uXxxAdEBoZGBJ2yr5FbddvRLswbWugFoxKbtCZ/CHMopOUQcRRmRb7Lm1NHLtg
Ae/sHO6gQ1xcrbwpMCq+6RjFK57yW+n1TB8ZTmAE2RQynGqzReSTlUNtfn3yMg4v
hz+2zGzezoeN
=92ae
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-fixes-6.8-rcN' of https://github.com/kvm-x86/linux into HEAD
KVM x86 fixes for 6.8:
- Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only
if the incoming events->nmi.pending is non-zero. If the target vCPU is in
the UNITIALIZED state, the spurious request will result in KVM exiting to
userspace, which in turn causes QEMU to constantly acquire and release
QEMU's global mutex, to the point where the BSP is unable to make forward
progress.
- Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being
incorrectly truncated, and ultimately causes KVM to think a fixed counter
has already been disabled (KVM thinks the old value is '0').
- Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace
that is ultimately ignored due to ignore_msrs=true doesn't zero the output
as intended.
|
|
|
|
e1dda3afe2 |
KVM: x86: Fix broken debugregs ABI for 32 bit kernels
The ioctl()s to get and set KVM's debug registers are broken for 32 bit
kernels as they'd only copy half of the user register state because of a
UAPI and in-kernel type mismatch (__u64 vs. unsigned long; 8 vs. 4
bytes).
This makes it impossible for userland to set anything but DR0 without
resorting to bit folding tricks.
Switch to a loop for copying debug registers that'll implicitly do the
type conversion for us, if needed.
There are likely no users (left) for 32bit KVM, fix the bug nonetheless.
Fixes:
|
|
|
|
3376ca3f1a |
KVM: x86: Fix KVM_GET_MSRS stack info leak
Commit |
|
|
|
f19063b1ca |
KVM: x86/pmu: Snapshot event selectors that KVM emulates in software
Snapshot the event selectors for the events that KVM emulates in software, which is currently instructions retired and branch instructions retired. The event selectors a tied to the underlying CPU, i.e. are constant for a given platform even though perf doesn't manage the mappings as such. Getting the event selectors from perf isn't exactly cheap, especially if mitigations are enabled, as at least one indirect call is involved. Snapshot the values in KVM instead of optimizing perf as working with the raw event selectors will be required if KVM ever wants to emulate events that aren't part of perf's uABI, i.e. that don't have an "enum perf_hw_id" entry. Link: https://lore.kernel.org/r/20231110022857.1273836-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
9e62797fd7 |
KVM: x86: Make gtod_is_based_on_tsc() return 'bool'
gtod_is_based_on_tsc() is boolean in nature, i.e. it returns '1' for good clocksources and '0' otherwise. Moreover, its result is used raw by kvm_get_time_and_clockread()/kvm_get_walltime_and_clockread() which are 'bool'. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20240109141121.1619463-6-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
d52734d00b |
KVM: x86: Give a hint when Win2016 might fail to boot due to XSAVES erratum
Since commit |
|
|
|
9e05d9b067 |
KVM: x86: Check irqchip mode before create PIT
As the kvm api(https://docs.kernel.org/virt/kvm/api.html) reads, KVM_CREATE_PIT2 call is only valid after enabling in-kernel irqchip support via KVM_CREATE_IRQCHIP. Without this check, I can create PIT first and enable irqchip-split then, which may cause the PIT invalid because of lacking of in-kernel PIC to inject the interrupt. Signed-off-by: Tengfei Yu <moehanabichan@gmail.com> Message-Id: <20240125050823.4893-1-moehanabichan@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
6231c9e1a9 |
KVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpu
kvm_vcpu_ioctl_x86_set_vcpu_events() routine makes 'KVM_REQ_NMI'
request for a vcpu even when its 'events->nmi.pending' is zero.
Ex:
qemu_thread_start
kvm_vcpu_thread_fn
qemu_wait_io_event
qemu_wait_io_event_common
process_queued_cpu_work
do_kvm_cpu_synchronize_post_init/_reset
kvm_arch_put_registers
kvm_put_vcpu_events (cpu, level=[2|3])
This leads vCPU threads in QEMU to constantly acquire & release the
global mutex lock, delaying the guest boot due to lock contention.
Add check to make KVM_REQ_NMI request only if vcpu has NMI pending.
Fixes:
|
|
|
|
7bb7fce136 |
KVM: x86/pmu: Prioritize VMX interception over #GP on RDPMC due to bad index
Apply the pre-intercepts RDPMC validity check only to AMD, and rename all
relevant functions to make it as clear as possible that the check is not a
standard PMC index check. On Intel, the basic rule is that only invalid
opcodes and privilege/permission/mode checks have priority over VM-Exit,
i.e. RDPMC with an invalid index should VM-Exit, not #GP. While the SDM
doesn't explicitly call out RDPMC, it _does_ explicitly use RDMSR of a
non-existent MSR as an example where VM-Exit has priority over #GP, and
RDPMC is effectively just a variation of RDMSR.
Manually testing on various Intel CPUs confirms this behavior, and the
inverted priority was introduced for SVM compatibility, i.e. was not an
intentional change for Intel PMUs. On AMD, *all* exceptions on RDPMC have
priority over VM-Exit.
Check for a NULL kvm_pmu_ops.check_rdpmc_early instead of using a RET0
static call so as to provide a convenient location to document the
difference between Intel and AMD, and to again try to make it as obvious
as possible that the early check is a one-off thing, not a generic "is
this PMC valid?" helper.
Fixes:
|
|
|
|
955997e880 |
KVM: x86: Use mutex guards to eliminate __kvm_x86_vendor_init()
Use the recently introduced guard(mutex) infrastructure acquire and automatically release vendor_module_lock when the guard goes out of scope. Drop the inner __kvm_x86_vendor_init(), its sole purpose was to simplify releasing vendor_module_lock in error paths. No functional change intended. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20231030141728.1406118-1-nik.borisov@suse.com [sean: rewrite changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
09d1c6a80f |
Generic:
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
- Clean up Kconfigs that all KVM architectures were selecting
- New functionality around "guest_memfd", a new userspace API that
creates an anonymous file and returns a file descriptor that refers
to it. guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be resized.
guest_memfd files do however support PUNCH_HOLE, which can be used to
switch a memory area between guest_memfd and regular anonymous memory.
- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
per-page attributes for a given page of guest memory; right now the
only attribute is whether the guest expects to access memory via
guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees
confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM).
x86:
- Support for "software-protected VMs" that can use the new guest_memfd
and page attributes infrastructure. This is mostly useful for testing,
since there is no pKVM-like infrastructure to provide a meaningfully
reduced TCB.
- Fix a relatively benign off-by-one error when splitting huge pages during
CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf
TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE.
- Use more generic lockdep assertions in paths that don't actually care
about whether the caller is a reader or a writer.
- let Xen guests opt out of having PV clock reported as "based on a stable TSC",
because some of them don't expect the "TSC stable" bit (added to the pvclock
ABI by KVM, but never set by Xen) to be set.
- Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM always
flushes on nested transitions, i.e. always satisfies flush requests. This
allows running bleeding edge versions of VMware Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV support.
- On AMD machines with vNMI, always rely on hardware instead of intercepting
IRET in some cases to detect unmasking of NMIs
- Support for virtualizing Linear Address Masking (LAM)
- Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state
prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events using a
dedicated field instead of snapshotting the "previous" counter. If the
hardware PMC count triggers overflow that is recognized in the same VM-Exit
that KVM manually bumps an event count, KVM would pend PMIs for both the
hardware-triggered overflow and for KVM-triggered overflow.
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertantly enabled by non-KVM developers, which can be problematic for
subsystems that require no regressions for W=1 builds.
- Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL
"features".
- Don't force a masterclock update when a vCPU synchronizes to the current TSC
generation, as updating the masterclock can cause kvmclock's time to "jump"
unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths,
partly as a super minor optimization, but mostly to make KVM play nice with
position independent executable builds.
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation"
at build time.
ARM64:
- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB
base granule sizes. Branch shared with the arm64 tree.
- Large Fine-Grained Trap rework, bringing some sanity to the
feature, although there is more to come. This comes with
a prefix branch shared with the arm64 tree.
- Some additional Nested Virtualization groundwork, mostly
introducing the NV2 VNCR support and retargetting the NV
support to that version of the architecture.
- A small set of vgic fixes and associated cleanups.
Loongarch:
- Optimization for memslot hugepage checking
- Cleanup and fix some HW/SW timer issues
- Add LSX/LASX (128bit/256bit SIMD) support
RISC-V:
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list selftest
- Support for reporting steal time along with selftest
s390:
- Bugfixes
Selftests:
- Fix an annoying goof where the NX hugepage test prints out garbage
instead of the magic token needed to run the test.
- Fix build errors when a header is delete/moved due to a missing flag
in the Makefile.
- Detect if KVM bugged/killed a selftest's VM and print out a helpful
message instead of complaining that a random ioctl() failed.
- Annotate the guest printf/assert helpers with __printf(), and fix the
various bugs that were lurking due to lack of said annotation.
There are two non-KVM patches buried in the middle of guest_memfd support:
fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
The first is small and mostly suggested-by Christian Brauner; the second
a bit less so but it was written by an mm person (Vlastimil Babka).
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmWcMWkUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroO15gf/WLmmg3SET6Uzw9iEq2xo28831ZA+
6kpILfIDGKozV5safDmMvcInlc/PTnqOFrsKyyN4kDZ+rIJiafJdg/loE0kPXBML
wdR+2ix5kYI1FucCDaGTahskBDz8Lb/xTpwGg9BFLYFNmuUeHc74o6GoNvr1uliE
4kLZL2K6w0cSMPybUD+HqGaET80ZqPwecv+s1JL+Ia0kYZJONJifoHnvOUJ7DpEi
rgudVdgzt3EPjG0y1z6MjvDBXTCOLDjXajErlYuZD3Ej8N8s59Dh2TxOiDNTLdP4
a4zjRvDmgyr6H6sz+upvwc7f4M4p+DBvf+TkWF54mbeObHUYliStqURIoA==
=66Ws
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"Generic:
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
architectures.
- Clean up Kconfigs that all KVM architectures were selecting
- New functionality around "guest_memfd", a new userspace API that
creates an anonymous file and returns a file descriptor that refers
to it. guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be
resized. guest_memfd files do however support PUNCH_HOLE, which can
be used to switch a memory area between guest_memfd and regular
anonymous memory.
- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
per-page attributes for a given page of guest memory; right now the
only attribute is whether the guest expects to access memory via
guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
TDX or ARM64 pKVM is checked by firmware or hypervisor that
guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
the case of pKVM).
x86:
- Support for "software-protected VMs" that can use the new
guest_memfd and page attributes infrastructure. This is mostly
useful for testing, since there is no pKVM-like infrastructure to
provide a meaningfully reduced TCB.
- Fix a relatively benign off-by-one error when splitting huge pages
during CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in
non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
a non-huge SPTE.
- Use more generic lockdep assertions in paths that don't actually
care about whether the caller is a reader or a writer.
- let Xen guests opt out of having PV clock reported as "based on a
stable TSC", because some of them don't expect the "TSC stable" bit
(added to the pvclock ABI by KVM, but never set by Xen) to be set.
- Revert a bogus, made-up nested SVM consistency check for
TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM
always flushes on nested transitions, i.e. always satisfies flush
requests. This allows running bleeding edge versions of VMware
Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV
support.
- On AMD machines with vNMI, always rely on hardware instead of
intercepting IRET in some cases to detect unmasking of NMIs
- Support for virtualizing Linear Address Masking (LAM)
- Fix a variety of vPMU bugs where KVM fail to stop/reset counters
and other state prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events
using a dedicated field instead of snapshotting the "previous"
counter. If the hardware PMC count triggers overflow that is
recognized in the same VM-Exit that KVM manually bumps an event
count, KVM would pend PMIs for both the hardware-triggered overflow
and for KVM-triggered overflow.
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertantly enabled by non-KVM developers, which can be
problematic for subsystems that require no regressions for W=1
builds.
- Advertise all of the host-supported CPUID bits that enumerate
IA32_SPEC_CTRL "features".
- Don't force a masterclock update when a vCPU synchronizes to the
current TSC generation, as updating the masterclock can cause
kvmclock's time to "jump" unexpectedly, e.g. when userspace
hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter
fault paths, partly as a super minor optimization, but mostly to
make KVM play nice with position independent executable builds.
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the
code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
"emulation" at build time.
ARM64:
- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
granule sizes. Branch shared with the arm64 tree.
- Large Fine-Grained Trap rework, bringing some sanity to the
feature, although there is more to come. This comes with a prefix
branch shared with the arm64 tree.
- Some additional Nested Virtualization groundwork, mostly
introducing the NV2 VNCR support and retargetting the NV support to
that version of the architecture.
- A small set of vgic fixes and associated cleanups.
Loongarch:
- Optimization for memslot hugepage checking
- Cleanup and fix some HW/SW timer issues
- Add LSX/LASX (128bit/256bit SIMD) support
RISC-V:
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list
selftest
- Support for reporting steal time along with selftest
s390:
- Bugfixes
Selftests:
- Fix an annoying goof where the NX hugepage test prints out garbage
instead of the magic token needed to run the test.
- Fix build errors when a header is delete/moved due to a missing
flag in the Makefile.
- Detect if KVM bugged/killed a selftest's VM and print out a helpful
message instead of complaining that a random ioctl() failed.
- Annotate the guest printf/assert helpers with __printf(), and fix
the various bugs that were lurking due to lack of said annotation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
x86/kvm: Do not try to disable kvmclock if it was not enabled
KVM: x86: add missing "depends on KVM"
KVM: fix direction of dependency on MMU notifiers
KVM: introduce CONFIG_KVM_COMMON
KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
RISC-V: KVM: selftests: Add get-reg-list test for STA registers
RISC-V: KVM: selftests: Add steal_time test support
RISC-V: KVM: selftests: Add guest_sbi_probe_extension
RISC-V: KVM: selftests: Move sbi_ecall to processor.c
RISC-V: KVM: Implement SBI STA extension
RISC-V: KVM: Add support for SBI STA registers
RISC-V: KVM: Add support for SBI extension registers
RISC-V: KVM: Add SBI STA info to vcpu_arch
RISC-V: KVM: Add steal-update vcpu request
RISC-V: KVM: Add SBI STA extension skeleton
RISC-V: paravirt: Implement steal-time support
RISC-V: Add SBI STA extension definitions
RISC-V: paravirt: Add skeleton for pv-time support
RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
...
|
|
|
|
b51cc5d028 |
x86/cleanups changes for v6.8:
- A micro-optimization got misplaced as a cleanup:
- Micro-optimize the asm code in secondary_startup_64_no_verify()
- Change global variables to local
- Add missing kernel-doc function parameter descriptions
- Remove unused parameter from a macro
- Remove obsolete Kconfig entry
- Fix comments
- Fix typos, mostly scripted, manually reviewed
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWb2i8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iFIQ//RjqKWmEBfv0UVCNgtRgkUKOvYVkfhC1R
FykHWbSE+/oDODS7B+gbWqzl9Fq2Oxx9re4KZuMfnojE96KZ6H1flQn7z3UVRUrf
pfMx13E+uyf7qbVZktqH38lUS4s/AHdX2PKCiXlU/0hIkiBdjbAl3ylyqMv7ytIL
Fi2N9iYJN+eLlMkc3A5IK83xNiU8rb0gO6Uywn3nUbqadY/YX2gDpND5kfzRIneR
lTKy4rX3+E65qYB2Ly1wDr7e0Q0rgaTzPctx6twFrxQXK+MsHiartJhM5juND/tU
DEjSW9ISOHlitKEJI/zbdrvJlr5AKDNy2zHYmQQuqY6+YHRamCKqwIjLIPkKj52g
lAbosNwvp/o8W3zUHgUfVZR5hVxN863zV2qa/ehoQ3b/9kNjQC8actILjYEgIVu9
av1sd+nETbjCUABIF9H9uAoRbgc+wQs2nupJZrjvginFz8+WVhgaBdJDMYCNAmjc
fNMjGtRS7YXiIMj09ZAXFThVW302FdbTgggDh/qlQlDOXFu5HRbyuWR+USr4/jkP
qs2G6m/BHDs9HxDRo/no+ccSrUBV5phfhZbO7qwjTf2NJJvPHW+cxGpT00zU2v8A
lgfVI7SDkxwbyi1gacJ054GqEhsWuEdi40ikqxjhL8Oq4xwwsey/PiaIxjkDQx92
Gj3XUSDnGEs=
=kUav
-----END PGP SIGNATURE-----
Merge tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
- Change global variables to local
- Add missing kernel-doc function parameter descriptions
- Remove unused parameter from a macro
- Remove obsolete Kconfig entry
- Fix comments
- Fix typos, mostly scripted, manually reviewed
and a micro-optimization got misplaced as a cleanup:
- Micro-optimize the asm code in secondary_startup_64_no_verify()
* tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arch/x86: Fix typos
x86/head_64: Use TESTB instead of TESTL in secondary_startup_64_no_verify()
x86/docs: Remove reference to syscall trampoline in PTI
x86/Kconfig: Remove obsolete config X86_32_SMP
x86/io: Remove the unused 'bw' parameter from the BUILDIO() macro
x86/mtrr: Document missing function parameters in kernel-doc
x86/setup: Make relocated_ramdisk a local variable of relocate_initrd()
|
|
|
|
3115d2de39 |
KVM Xen change for 6.8:
To workaround Xen guests that don't expect Xen PV clocks to be marked as being based on a stable TSC, add a Xen config knob to allow userspace to opt out of KVM setting the "TSC stable" bit in Xen PV clocks. Note, the "TSC stable" bit was added to the PVCLOCK ABI by KVM without an ack from Xen, i.e. KVM isn't entirely blameless for the buggy guest behavior. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWXASsSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5R54P/iQPQBs4dJmNkPiA6uSq1O5/8hN4P59z aapJNgDiny/D9/zPbOxGWR31W7lvCgiES/lp3KcHZmwbeAwJpdT6a0cJWGRlGuov gccK8AoYcnwSU98sPisnFv7dJ66ogJfXVkPKKaWo+zVW53XUq2XpIie4eWaOweBt QsXpTGYpGajv1Bf/MgRtNtlkVAo1w8XL1L0NWRugzCk2CAYezz8IT1874GNZoJbd GJfVP+76FdNw+4/CxiaBwxP0gHfBIiAsJzGqbmMPhGG2xJn+KGs5FTEf37Pta8cl aMHAq6/JAoabJfP39MexVkopMaFlPbDwIWfkLWf6wSP86KHei+t9kLC0E4/R2NJ+ GKlrBB6Gj+gzFR4fZ75hIwS/4REMt6zVCbS7uSRrCduqrlEFcY5ED2NesoL9wZrB WMDIxIGIVDdRxc9WLypKmBj7KTgL0qXBxnsAcPiDRf1sk6SGajkesWxA1C1Nzo/H yNfqq0gjdPZVB2RIGN6DpWQFu3d+ZQnG2ToKIBW7OkvJ5USYiDSo4VozhESgYHRZ UJDhJ73QYESynClP6ST+9cxNof3FXCEPDeKr5NcmjVZxlJcdeUDNRqv0LUxQ56BI FvHMHtSs4WLYHZZVzsdh+Yhnc9rEGfoL0NwDPBCcOXjuNMvNQmuzSldc/VDGm/qt sCtxYMms5n7u =3v8F -----END PGP SIGNATURE----- Merge tag 'kvm-x86-xen-6.8' of https://github.com/kvm-x86/linux into HEAD KVM Xen change for 6.8: To workaround Xen guests that don't expect Xen PV clocks to be marked as being based on a stable TSC, add a Xen config knob to allow userspace to opt out of KVM setting the "TSC stable" bit in Xen PV clocks. Note, the "TSC stable" bit was added to the PVCLOCK ABI by KVM without an ack from Xen, i.e. KVM isn't entirely blameless for the buggy guest behavior. |
|
|
|
8ecb10bcbf |
KVM x86 support for virtualizing Linear Address Masking (LAM)
Add KVM support for Linear Address Masking (LAM). LAM tweaks the canonicality
checks for most virtual address usage in 64-bit mode, such that only the most
significant bit of the untranslated address bits must match the polarity of the
last translated address bit. This allows software to use ignored, untranslated
address bits for metadata, e.g. to efficiently tag pointers for address
sanitization.
LAM can be enabled separately for user pointers and supervisor pointers, and
for userspace LAM can be select between 48-bit and 57-bit masking
- 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15.
- 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6.
For user pointers, LAM enabling utilizes two previously-reserved high bits from
CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and
61 respectively. Note, if LAM_57 is set, LAM_U48 is ignored, i.e.:
- CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers
- CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers
- CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers
For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with
the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.:
- CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers
The modified LAM canonicality checks:
- LAM_S48 : [ 1 ][ metadata ][ 1 ]
63 47
- LAM_U48 : [ 0 ][ metadata ][ 0 ]
63 47
- LAM_S57 : [ 1 ][ metadata ][ 1 ]
63 56
- LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
63 56
- LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
63 56..47
The bulk of KVM support for LAM is to emulate LAM's modified canonicality
checks. The approach taken by KVM is to "fill" the metadata bits using the
highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended
to bits 62:48. The most significant bit, 63, is *not* modified, i.e. its value
from the raw, untagged virtual address is kept for the canonicality check. This
untagging allows
Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM
touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26],
enabling via CR3 and CR4 bits, etc.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW+k4SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5KygQAKTSEmfdox6MSYzGVzAVHBD/8oSTZAGf
4l96Np3sZiX0ujWP7aW1GaIdGL27Yf1bQrKIrODR4xepaosVPpoZZbnLFQ4Jm16D
OuwEQL06LV91Lv5XuPkNdq3nMVi1X3wjiKLvP451oCGv8JdxsjXSlFr8ZmDoCfmS
NCjkPyitdK+/xOMY5WcrkHD/6VMMiM+5A+CrG7DkaTaqBJQSUXG1NvTKhhxey6Rq
OZv0GPv7QVMhHv1NX0Y3LyoiGyWXAoFRnbk/N3yVBOnXcpJ+HBwWiNLRpxmZOQj/
CTo0VvUH/ZkN6zGvAb75/9puFHNliA/QCW1hp+ShXnNdn1eNdS7nhhPrzVqtCTy2
QeNWM/z5v9Wa1norPqDxzqWlh2bWW8JU0soX7Q+quN0d7YjVvmmUluL3Lw/V2zmb
gFM2ZY43QHlmLVic4sSraK1LEcYFzjexzpTLhee2gNp+l2y0D0c1/hXukCk6YNUM
gad9DH8P9d7By7Eyr0ZaPHSJbuBW1PqZhot5gCg9nCn4pnT2/y7wXsLj6VAw8gdr
dWNu2MZWDuH0/d4aKfw2veAECbHUK2daok4ufPDj5nYLVVWCs4HU0U7HlYL2CX7/
TdWOCwtpFtKoN1NHz8mpET7xldxLPnFkByL+SxypTZurAZXoSnEG71IbO5pJ2iIf
wHQkXgM+XimA
=qUZ2
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 support for virtualizing Linear Address Masking (LAM)
Add KVM support for Linear Address Masking (LAM). LAM tweaks the canonicality
checks for most virtual address usage in 64-bit mode, such that only the most
significant bit of the untranslated address bits must match the polarity of the
last translated address bit. This allows software to use ignored, untranslated
address bits for metadata, e.g. to efficiently tag pointers for address
sanitization.
LAM can be enabled separately for user pointers and supervisor pointers, and
for userspace LAM can be select between 48-bit and 57-bit masking
- 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15.
- 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6.
For user pointers, LAM enabling utilizes two previously-reserved high bits from
CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and
61 respectively. Note, if LAM_57 is set, LAM_U48 is ignored, i.e.:
- CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers
- CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers
- CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers
For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with
the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.:
- CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers
The modified LAM canonicality checks:
- LAM_S48 : [ 1 ][ metadata ][ 1 ]
63 47
- LAM_U48 : [ 0 ][ metadata ][ 0 ]
63 47
- LAM_S57 : [ 1 ][ metadata ][ 1 ]
63 56
- LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
63 56
- LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
63 56..47
The bulk of KVM support for LAM is to emulate LAM's modified canonicality
checks. The approach taken by KVM is to "fill" the metadata bits using the
highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended
to bits 62:48. The most significant bit, 63, is *not* modified, i.e. its value
from the raw, untagged virtual address is kept for the canonicality check. This
untagging allows
Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM
touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26],
enabling via CR3 and CR4 bits, etc.
|
|
|
|
01edb1cfbd |
KVM x86 PMU changes for 6.8:
- Fix a variety of bugs where KVM fail to stop/reset counters and other state
prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events using a
dedicated field instead of snapshotting the "previous" counter. If the
hardware PMC count triggers overflow that is recognized in the same VM-Exit
that KVM manually bumps an event count, KVM would pend PMIs for both the
hardware-triggered overflow and for KVM-triggered overflow.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW/rsSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5Q8gQAJc4y9NOd09kYXpI+DhkTVe6v07dmYds
NzBI2uViqxXFwA5pTs5VTVVYAl1FEmK6NvIVnJdc3epSYRSqyaeN/Z2NoulNxekj
/jLA/aA4+dTeJf2lfMFeH65IIuSJhuhyGeZV31RfW3NzEmlglcsb74QkHnJB8rLQ
RFJXZcOxSSap72AWxKmxk0alRaI6ONZ9NyqOWFWjZdQuAE7id9Ae5OixKUrlJkmR
6CbY8ra51MFIXQEsomVlcl5b1DNiv0drPPf5YaC9T4CERtt5yZxpvZeTPhq70evm
OutoZpzfi69cF1fFCxqN5cWZSt1C/Bu3xp8+ILI1+bZkMCV/ty85DU6hfMZQZzcV
JeJkRg/AAgOrG4dtHskwg9LDMs867kgbaqZ8l8K7Dt8rGmcLc5/rZ1ZdjTStFj6V
ukmVKMAVgkmh88u62wQ5HjrN1IE1oE6nmDp3zivfPuohEr49A8mAT02A2x9AVxAr
HvmwfDMA92xOGSRAN9Gt0mbOA+G0WZe4A36XgPEXloYeskYZgHzgW2hT6VWTd86O
ydU9s4L8g+Fy4jcObAiKsT8YwFgAMfVXZKTXvuTME4m/WUNBCrYCwqEOp/NM5qrk
qYWVXxOMMjZo71tQfvSPu1TWCtW/4ckvmqMrdQosgwLFy5pSqgXEwTruDvbJ1KWU
KhIWVbUfmgFA
=+Emh
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.8:
- Fix a variety of bugs where KVM fail to stop/reset counters and other state
prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events using a
dedicated field instead of snapshotting the "previous" counter. If the
hardware PMC count triggers overflow that is recognized in the same VM-Exit
that KVM manually bumps an event count, KVM would pend PMIs for both the
hardware-triggered overflow and for KVM-triggered overflow.
|
|
|
|
33d0403fda |
KVM x86 misc changes for 6.8:
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertantly enabled by non-KVM developers, which can be problematic for
subsystems that require no regressions for W=1 builds.
- Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL
"features".
- Don't force a masterclock update when a vCPU synchronizes to the current TSC
generation, as updating the masterclock can cause kvmclock's time to "jump"
unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths,
partly as a super minor optimization, but mostly to make KVM play nice with
position independent executable builds.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW+7sSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5/pwQAL8jIapIWP54VWxWlcTZFtCptGSobGlv
cBS4L091/bYuMB/jO0pPtD+apzsYt3WmJ+tRsNA7Yctzh9BDE3XxbV7pKVIUpz9P
TLCtYU2hPzp3vC6WCryjtU0OHxEnYMGHE1RCB7/bRblz+q6td7+MLZHcEUdwv83l
3pVM5+tNyQBog40frEVf+z7wrXzz2FgnauJn70X1UUs40VuiTzi6FqfLn6QK95xQ
8QPpjGFep7wQ6RgC4cPKiWSaP5PypCCpr4lMSKrKAf4iaKJdO1CYxEPeu0LcyFhR
DUM3zb+AZ/FVrisRWUnjke4Epb87ikoMQBlflrI9+o4cNJQaxEHAzTMGO+u4oucy
KwnXtNYM3lKGvDEvoUSBDphNayzcchn+0qk8YKB+XvClYSOtGi+NsWUB4x+M6crM
960cidF/CzYZL/IDj9GW2Tb+IiPJarmazdbqDmMpQiAKz0KE3tezGiysB6d6VJs1
V+KWOaSzAT9GsBKvGnPDHQaZ20vK+YsGB/TMWvpg3rFLTyV5QFM17UNdXyJlX0g8
G0v+gf7j3MKm156H2yYW0XhIAfhstc1Xb8fTDQjJ3pZn6us2NAtFgnrIpbL31Z7E
yaSgZuxetswbNwVSECUGlH4/zAtQudBfAt837Nu4eSCjMrJE4SPrrwpbTqp0SPXd
1VZbGc70QFf7
=O4hV
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.8:
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertantly enabled by non-KVM developers, which can be problematic for
subsystems that require no regressions for W=1 builds.
- Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL
"features".
- Don't force a masterclock update when a vCPU synchronizes to the current TSC
generation, as updating the masterclock can cause kvmclock's time to "jump"
unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths,
partly as a super minor optimization, but mostly to make KVM play nice with
position independent executable builds.
|
|
|
|
0afdfd85e3 |
KVM x86 Hyper-V changes for 6.8:
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation"
at build time.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW8gYSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5sGUP/iadHMz7Up1X29IDGtq58LRORNVXp2Ln
2dqoj8IKZeSr+mPMw2GvZyuiLqVPMs4Et21WJfCO7HgKd/NPMDORwRndhJYweFRY
yk+5NJLvXYuo8UR3b2QYy8XUghEqP+j5eYyon6UdCiPACcBGTpgoj4pU7SLM7l4T
EOge42ya5YxD/1oWr5vyifNrOJCPNTBYcC0as5//+RdnmQYqYZ26Z73b0B8Pdct4
XMWwgoKlmLTmei0YntXtGaDGimCvTYP8EPM4tOWgiBSWMhQXWbAh/0biDfd3eZVO
Hoe4HvstdjUNbpO3h3Zo78Ob7ehk4kx/6r0nlQnz5JxzGnuDjYCDIVUlYn0mw5Yi
nu4ztr8M3VRksDbpmAjSO9XFEKIYxlYQfzZ1UuTy8ehdBYTDl/3lPAbh2ApUYE72
Tt2PXmFGz2j1sjG38Gh94s48Za5OxHoVlfq8iGhU4v7UjuxnMNHfExOWd66SwZgx
5tZkr4rj/pWt21wr7jaVqFGzuftIC5G4ZEBhh7JcW89oamFrykgQUu5z4dhBMO75
G7DAVh9eSH2SKkmJH1ClXriveazTK7fqMx8sZzzRnusMz09qH7SIdjSzmp7H5utw
pWBfatft0n0FTI1r+hxGueiJt7dFlrIz0Q4hHyBN4saoVH121bZioc0pq1ob6MIk
Y2Ou4xJBt14F
=bjfs
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-hyperv-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 Hyper-V changes for 6.8:
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation"
at build time.
|
|
|
|
54aa699e80 |
arch/x86: Fix typos
Fix typos, most reported by "codespell arch/x86". Only touches comments, no code changes. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org |
|
|
|
136292522e |
LoongArch KVM changes for v6.8
1. Optimization for memslot hugepage checking. 2. Cleanup and fix some HW/SW timer issues. 3. Add LSX/LASX (128bit/256bit SIMD) support. -----BEGIN PGP SIGNATURE----- iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmWGu+0WHGNoZW5odWFj YWlAa2VybmVsLm9yZwAKCRAChivD8uImesO7D/wOdYP96R+mRzpLBeuTtFxU8e4A 3n2luxOeP8v1WYtQ9H8M01Wgly+9u6cJ2pgAlv79BQHfmCfC0aWQLmpnCZmk/mYW wtQ75ASA3Qg6zOBWEksCkA0LUdPDHfQuaaUXT7RYZ7QtHKSNkkhsw2nMCq6fgrXU RnZjGctjuxgYSqQtwzfYO2AjSBAfAq1MjSzCTULJ0KkE8o5Bg0KOoGj8ijC1U+ua QWBnqTNzeKmYmqAFfhXoiiFYcuBUq7DEk5RtwDU7SeqqJEV3a8AbbsrWfz+wMemG gri95uRxvnhpPZ+6/PrVjIezqexPJmQ9+tjY6mxh/bPRnS5ICFygjV3lt050JUK8 xIaJEFvl7g88RIz5mnTeM9tU4ibIsCLgA9zj33ps2H7QP5NazUm1dzk1YGAgqPdw m5hjwtTFQEujQM6cz1DLfhoi15VDNcYUonJIvGFZMhl7InitDpB3u9sI+AVGIVUG yKzBkqGB1L1vbJGnuWmspEqSUo7Z9iYzuVGbOnjc9LKQ/8OpLxj0brymYheA+CKG CIdULximQFVEHc2lbE+H+bW4hnrFP4sN9hlTng7KN7ommCIg+FltisM8Nt5NLWID 9ywLj4Qa0Qrc5vB3FJ8+ksuDe2nD83uVLj247R7B0wxQcYw4ocyW/YU+gayF4EjY 6azutwllW5ZB+I3hyw== =phol -----END PGP SIGNATURE----- Merge tag 'loongarch-kvm-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.8 1. Optimization for memslot hugepage checking. 2. Cleanup and fix some HW/SW timer issues. 3. Add LSX/LASX (128bit/256bit SIMD) support. |
|
|
|
6254eebad4 |
KVM fixes for 6.7-rcN:
- When checking if a _running_ vCPU is "in-kernel", i.e. running at CPL0,
get the CPL directly instead of relying on preempted_in_kernel, which
is valid if and only if the vCPU was preempted, i.e. NOT running.
- Set .owner for various KVM file_operations so that files refcount the
KVM module until KVM is done executing _all_ code, including the last
few instructions of kvm_put_kvm(). And then revert the misguided
attempt to rely on "struct kvm" refcounts to pin KVM-the-module.
- Fix a benign "return void" that was recently introduced.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmVyeV0SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5aJIP/izKZivi/kZjuuKp2c1W2XM+mBZlM+Yj
qYcdV0rZygQJOZXTpMaEVg7iUtvbwAT495nm8sXr/IxXw+omMcP+qyLRCZ6JafYy
B19buCnAt2DymlJOurFzlIEeWtunkxk/gFLMB/BnSrok88cKz5PMxAVFPPBXsTms
ZqSFlDhzG0G4Mxhr8t0elyjd4HrCbNjCn1MhJg+uzFHKakfOvbST5jO02LkTeIM2
VFrqWZo1C6uPDrA8TzWzik54qOrDFrodNv/XvIJ0szgVOc+7Iwxy80A/v7o7jBET
igH+6F3cbST5uoKFrFn7pPJdTOfX2u18DXcpxiYIu+24ToKyqdE1Np9M5W4ZMX/9
Im5ilykfylHpRYAL4tECD6Jzd/Q/xIvpe8Uk6HTfFAtb/UdMY35/1keBnnkI2oj8
/4USM7AHNiqoAs4+OE4kZslrFG8ttv3vIOr7Mtk2UjGyGp8TH8sRFYPJKToXsQIJ
Gs96rsbiU+oo/IDp3UiRhWtwpwfKGbkDLp4r/3X6UOx6Re5u1ITVIoM14qFQaw3W
CKdHKN/MoreYLS5gasjaGRSyQNJPonaS10l8SqzWflrUBZYfyjCNKliihjKood2g
JykH4p69IFTWADT2VbrGCQVKfY1GJCxGwpGePFmChTsPUiQ2P+AbHCmaefnIRRbK
8UR/OmsDtFRZ
=gWp0
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-fixes-6.7-rcN' of https://github.com/kvm-x86/linux into kvm-master
KVM fixes for 6.7-rcN:
- When checking if a _running_ vCPU is "in-kernel", i.e. running at CPL0,
get the CPL directly instead of relying on preempted_in_kernel, which
is valid if and only if the vCPU was preempted, i.e. NOT running.
- Set .owner for various KVM file_operations so that files refcount the
KVM module until KVM is done executing _all_ code, including the last
few instructions of kvm_put_kvm(). And then revert the misguided
attempt to rely on "struct kvm" refcounts to pin KVM-the-module.
- Fix a benign "return void" that was recently introduced.
|
|
|
|
6d72283526 |
KVM x86/xen: add an override for PVCLOCK_TSC_STABLE_BIT
Unless explicitly told to do so (by passing 'clocksource=tsc' and 'tsc=stable:socket', and then jumping through some hoops concerning potential CPU hotplug) Xen will never use TSC as its clocksource. Hence, by default, a Xen guest will not see PVCLOCK_TSC_STABLE_BIT set in either the primary or secondary pvclock memory areas. This has led to bugs in some guest kernels which only become evident if PVCLOCK_TSC_STABLE_BIT *is* set in the pvclocks. Hence, to support such guests, give the VMM a new Xen HVM config flag to tell KVM to forcibly clear the bit in the Xen pvclocks. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20231102162128.2353459-1-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
b4f69df0f6 |
KVM: x86: Make Hyper-V emulation optional
Hyper-V emulation in KVM is a fairly big chunk and in some cases it may be desirable to not compile it in to reduce module sizes as well as the attack surface. Introduce CONFIG_KVM_HYPERV option to make it possible. Note, there's room for further nVMX/nSVM code optimizations when !CONFIG_KVM_HYPERV, this will be done in follow-up patches. Reorganize Makefile a bit so all CONFIG_HYPERV and CONFIG_KVM_HYPERV files are grouped together. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Link: https://lore.kernel.org/r/20231205103630.1391318-13-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
cfef5af3cb |
KVM: x86: Move Hyper-V partition assist page out of Hyper-V emulation context
Hyper-V partition assist page is used when KVM runs on top of Hyper-V and is not used for Windows/Hyper-V guests on KVM, this means that 'hv_pa_pg' placement in 'struct kvm_hv' is unfortunate. As a preparation to making Hyper-V emulation optional, move 'hv_pa_pg' to 'struct kvm_arch' and put it under CONFIG_HYPERV. While on it, introduce hv_get_partition_assist_page() helper to allocate partition assist page. Move the comment explaining why we use a single page for all vCPUs from VMX and expand it a bit. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20231205103630.1391318-3-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
ef8d89033c |
KVM: x86: Remove 'return void' expression for 'void function'
The requested info will be stored in 'guest_xsave->region' referenced by
the incoming pointer "struct kvm_xsave *guest_xsave", thus there is no need
to explicitly use return void expression for a void function "static void
kvm_vcpu_ioctl_x86_get_xsave(...)". The issue is caught with [-Wpedantic].
Fixes: 2d287ec65e79 ("x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer")
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231007064019.17472-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
f2f63f7ec6 |
KVM: x86/pmu: Stop calling kvm_pmu_reset() at RESET (it's redundant)
Drop kvm_vcpu_reset()'s call to kvm_pmu_reset(), the call is performed only for RESET, which is really just the same thing as vCPU creation, and kvm_arch_vcpu_create() *just* called kvm_pmu_init(), i.e. there can't possibly be any work to do. Unlike Intel, AMD's amd_pmu_refresh() does fill all_valid_pmc_idx even if guest CPUID is empty, but everything that is at all dynamic is guaranteed to be '0'/NULL, e.g. it should be impossible for KVM to have already created a perf event. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20231103230541.352265-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
c52ffadc65 |
KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, e.g. when userspace hotplugs a pre-created vCPU into the VM. Unnecessarily updating the masterclock is undesirable as it can cause kvmclock's time to jump, which is particularly painful on systems with a stable TSC as kvmclock _should_ be fully reliable on such systems. The unexpected time jumps are due to differences in the TSC=>nanoseconds conversion algorithms between kvmclock and the host's CLOCK_MONOTONIC_RAW (the pvclock algorithm is inherently lossy). When updating the masterclock, KVM refreshes the "base", i.e. moves the elapsed time since the last update from the kvmclock/pvclock algorithm to the CLOCK_MONOTONIC_RAW algorithm. Synchronizing kvmclock with CLOCK_MONOTONIC_RAW is the lesser of evils when the TSC is unstable, but adds no real value when the TSC is stable. Prior to commit |
|
|
|
547c91929f |
KVM: x86: Get CPL directly when checking if loaded vCPU is in kernel mode
When querying whether or not a vCPU "is" running in kernel mode, directly
get the CPL if the vCPU is the currently loaded vCPU. In scenarios where
a guest is profiled via perf-kvm, querying vcpu->arch.preempted_in_kernel
from kvm_guest_state() is wrong if vCPU is actively running, i.e. isn't
scheduled out due to being preempted and so preempted_in_kernel is stale.
This affects perf/core's ability to accurately tag guest RIP with
PERF_RECORD_MISC_GUEST_{KERNEL|USER} and record it in the sample. This
causes perf/tool to fail to connect the vCPU RIPs to the guest kernel
space symbols when parsing these samples due to incorrect PERF_RECORD_MISC
flags:
Before (perf-report of a cpu-cycles sample):
1.23% :58945 [unknown] [u] 0xffffffff818012e0
After:
1.35% :60703 [kernel.vmlinux] [g] asm_exc_page_fault
Note, checking preempted_in_kernel in kvm_arch_vcpu_in_kernel() is awful
as nothing in the API's suggests that it's safe to use if and only if the
vCPU was preempted. That can be cleaned up in the future, for now just
fix the glaring correctness bug.
Note #2, checking vcpu->preempted is NOT safe, as getting the CPL on VMX
requires VMREAD, i.e. is correct if and only if the vCPU is loaded. If
the target vCPU *was* preempted, then it can be scheduled back in after
the check on vcpu->preempted in kvm_vcpu_on_spin(), i.e. KVM could end up
trying to do VMREAD on a VMCS that isn't loaded on the current pCPU.
Signed-off-by: Like Xu <likexu@tencent.com>
Fixes:
|
|
|
|
b39bd520a6 |
KVM: x86: Untag addresses for LAM emulation where applicable
Stub in vmx_get_untagged_addr() and wire up calls from the emulator (via
get_untagged_addr()) and "direct" calls from various VM-Exit handlers in
VMX where LAM untagging is supposed to be applied. Defer implementing
the guts of vmx_get_untagged_addr() to future patches purely to make the
changes easier to consume.
LAM is active only for 64-bit linear addresses and several types of
accesses are exempted.
- Cases need to untag address (handled in get_vmx_mem_address())
Operand(s) of VMX instructions and INVPCID.
Operand(s) of SGX ENCLS.
- Cases LAM doesn't apply to (no change needed)
Operand of INVLPG.
Linear address in INVPCID descriptor.
Linear address in INVVPID descriptor.
BASEADDR specified in SECS of ECREATE.
Note:
- LAM doesn't apply to write to control registers or MSRs
- LAM masking is applied before walking page tables, i.e. the faulting
linear address in CR2 doesn't contain the metadata.
- The guest linear address saved in VMCS doesn't contain metadata.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-10-binbin.wu@linux.intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
37a41847b7 |
KVM: x86: Introduce get_untagged_addr() in kvm_x86_ops and call it in emulator
Introduce a new interface get_untagged_addr() to kvm_x86_ops to untag the metadata from linear address. Call the interface in linearization of instruction emulator for 64-bit mode. When enabled feature like Intel Linear Address Masking (LAM) or AMD Upper Address Ignore (UAI), linear addresses may be tagged with metadata that needs to be dropped prior to canonicality checks, i.e. the metadata is ignored. Introduce get_untagged_addr() to kvm_x86_ops to hide the vendor specific code, as sadly LAM and UAI have different semantics. Pass the emulator flags to allow vendor specific implementation to precisely identify the access type (LAM doesn't untag certain accesses). Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-9-binbin.wu@linux.intel.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
2c49db455e |
KVM: x86: Add & use kvm_vcpu_is_legal_cr3() to check CR3's legality
Add and use kvm_vcpu_is_legal_cr3() to check CR3's legality to provide a clear distinction between CR3 and GPA checks. This will allow exempting bits from kvm_vcpu_is_legal_cr3() without affecting general GPA checks, e.g. for upcoming features that will use high bits in CR3 for feature enabling. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-7-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
89ea60c2c7 |
KVM: x86: Add support for "protected VMs" that can utilize private memory
Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development and testing vehicle for Confidential (CoCo) VMs, and potentially to even become a "real" product in the distant future, e.g. a la pKVM. The private memory support in KVM x86 is aimed at AMD's SEV-SNP and Intel's TDX, but those technologies are extremely complex (understatement), difficult to debug, don't support running as nested guests, and require hardware that's isn't universally accessible. I.e. relying SEV-SNP or TDX for maintaining guest private memory isn't a realistic option. At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of selftests for guest_memfd and private memory support without requiring unique hardware. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20231027182217.3615211-24-seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
eed52e434b |
KVM: Allow arch code to track number of memslot address spaces per VM
Let x86 track the number of address spaces on a per-VM basis so that KVM can disallow SMM memslots for confidential VMs. Confidentials VMs are fundamentally incompatible with emulating SMM, which as the name suggests requires being able to read and write guest memory and register state. Disallowing SMM will simplify support for guest private memory, as KVM will not need to worry about tracking memory attributes for multiple address spaces (SMM is the only "non-default" address space across all architectures). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
90b4fe1798 |
KVM: x86: Disallow hugepages when memory attributes are mixed
Disallow creating hugepages with mixed memory attributes, e.g. shared versus private, as mapping a hugepage in this case would allow the guest to access memory with the wrong attributes, e.g. overlaying private memory with a shared hugepage. Tracking whether or not attributes are mixed via the existing disallow_lpage field, but use the most significant bit in 'disallow_lpage' to indicate a hugepage has mixed attributes instead using the normal refcounting. Whether or not attributes are mixed is binary; either they are or they aren't. Attempting to squeeze that info into the refcount is unnecessarily complex as it would require knowing the previous state of the mixed count when updating attributes. Using a flag means KVM just needs to ensure the current status is reflected in the memslots. Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20231027182217.3615211-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
ee605e3156 |
KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce the probability of exiting to userspace with a stale run->exit_reason that *appears* to be valid. To support fd-based guest memory (guest memory without a corresponding userspace virtual address), KVM will exit to userspace for various memory related errors, which userspace *may* be able to resolve, instead of using e.g. BUS_MCEERR_AR. And in the more distant future, KVM will also likely utilize the same functionality to let userspace "intercept" and handle memory faults when the userspace mapping is missing, i.e. when fast gup() fails. Because many of KVM's internal APIs related to guest memory use '0' to indicate "success, continue on" and not "exit to userspace", reporting memory faults/errors to userspace will set run->exit_reason and corresponding fields in the run structure fields in conjunction with a a non-zero, negative return code, e.g. -EFAULT or -EHWPOISON. And because KVM already returns -EFAULT in many paths, there's a relatively high probability that KVM could return -EFAULT without setting run->exit_reason, in which case reporting KVM_EXIT_UNKNOWN is much better than reporting whatever exit reason happened to be in the run structure. Note, KVM must wait until after run->immediate_exit is serviced to sanitize run->exit_reason as KVM's ABI is that run->exit_reason is preserved across KVM_RUN when run->immediate_exit is true. Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoorthy@google.com Link: https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-19-seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
16f95f3b95 |
KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).
KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory. With guest private memory,
there will be two kind of memory conversions:
- explicit conversion: happens when the guest explicitly calls into KVM
to map a range (as private or shared)
- implicit conversion: happens when the guest attempts to access a gfn
that is configured in the "wrong" state (private vs. shared)
On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesriable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.
KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.
Note! To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'! Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.
Report the gpa+size instead of a single gfn even though the initial usage
is expected to always report single pages. It's entirely possible, likely
even, that KVM will someday support sub-page granularity faults, e.g.
Intel's sub-page protection feature allows for additional protections at
128-byte granularity.
Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
Link: https://lore.kernel.org/all/ZQ3AmLO2SYv3DszH@google.com
Cc: Anish Moorthy <amoorthy@google.com>
Cc: David Matlack <dmatlack@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20231027182217.3615211-10-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
bb58b90b1a |
KVM: Introduce KVM_SET_USER_MEMORY_REGION2
Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional information can be supplied without setting userspace up to fail. The padding in the new kvm_userspace_memory_region2 structure will be used to pass a file descriptor in addition to the userspace_addr, i.e. allow userspace to point at a file descriptor and map memory into a guest that is NOT mapped into host userspace. Alternatively, KVM could simply add "struct kvm_userspace_memory_region2" without a new ioctl(), but as Paolo pointed out, adding a new ioctl() makes detection of bad flags a bit more robust, e.g. if the new fd field is guarded only by a flag and not a new ioctl(), then a userspace bug (setting a "bad" flag) would generate out-of-bounds access instead of an -EINVAL error. Cc: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-9-seanjc@google.com> Acked-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
be47941980 |
KVM SVM changes for 6.7:
- Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while
running an SEV-ES guest.
- Clean up handling "failures" when KVM detects it can't emulate the "skip"
action for an instruction that has already been partially emulated. Drop a
hack in the SVM code that was fudging around the emulator code not giving
SVM enough information to do the right thing.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8GHYSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5hwkQAIR8l1gWz/caz29biBzmRnDS+aZOXcYM
8V8WBJqJgMKE9egibF4sADAlhInXzg19Xr7bQs6VfuvmdXrCn0UJ/nLorX+H85A2
pph6iNlWO6tyQAjvk/AieaeUyZOqpCFmKOgxfN2Fr/Lrn7u3AdjXC20qPeFJSLXr
YOTCQ704yvjjJp4yVA8JlclAQu38hanKiO5SZdlLzbuhUgWwQk4DVP2ZsYnhX+RO
F6exxORvMnYF/LJe/kR2/DMLf2JWvyUmjRrGWoeRoksOw5BlXMc5HyTPHSJ2jDac
lJaNtmZkTY1bDVWZk7N03ze5aFJa4DaqJdIFLtgujrFW8thog0P48aH6vmKi4UAA
bXme9GFYbmJTkemaGRnrzidFV12uPNvvanS+1PDOw4sn4HpscoMSpZw5PeH2kBwV
6uKNCJCwLtk8oe50yroKD7rJ/ASB7CeoqzbIL9s2TA0HSAskIf65T4eZp01uniyd
Q98yCdrG2mudsg5aU5yMfe0LwZby5BB5kUCqIe4hyRC68GJR8wkAzhaFRgCn4aJE
yaTyjnT2V3PGMEEJOPFdSF3VQGztljzQiXlEvBVj3zvMGQNTo2NhmS3ka4W+wW5G
avRYv8dITlGRs6J2gV1vp8Eb5LzDrwRpRURSmzeP5rR58saKdljTZgNfOzfLeFr1
WhLzonLz52IS
=U0fq
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.7:
- Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while
running an SEV-ES guest.
- Clean up handling "failures" when KVM detects it can't emulate the "skip"
action for an instruction that has already been partially emulated. Drop a
hack in the SVM code that was fudging around the emulator code not giving
SVM enough information to do the right thing.
|
|
|
|
d5cde2e0b3 |
KVM PMU change for 6.7:
- Handle NMI/SMI requests after PMU/PMI requests so that a PMI=>NMI doesn't
require redoing the entire run loop due to the NMI not being detected until
the final kvm_vcpu_exit_request() check before entering the guest.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8G/sSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5/FQP/1B0tk5TMe/Xfe/q4ng+J2eMr10TpbH5
uWRpxN6seRmH7cqfZwsNH86FubNRf3h9U/jOK3C9Q9dIhrq9MB1dZePDjF/xmZcz
4lhM76fHTeRNxJ1o+j2ApiK9U2dDAbBTLA8iGi+OTs/sAuvbNUELY7d3Ht2TqJjb
e9tGT+SavbTsg0UHEmteFHepMCe577AchL2T6jPbUaaVB05N7uD/qvIGDOLQvyaC
KHWqY5f+eFN+3JdGEefCiS4XCAWXBPSs7Ybq5SduxS7rnB7m96Vkidwk1DLjnyUt
+KNtb8JXBsMMuyaYZHrl4mPZyvOfmZxXOz9CzCYXzcQlsnkJqIyy3CiZFVEAqdq2
kXtOhNEqByAKVCWvcoJvfO/VGd/w3KP5XYP3GHXJ8gsS3sDORnL5PYWIvPNfjdlu
x7nsnk7PbaGdspSPqfKblwUvET1fePs1yjKECUMl4iJ6Wfr+QfKEpPUXQ6f79r+h
DrhPE9DIWyMMbre0p8E7uTFsteVerUx/GVDh7jtn6LCUKwWmKAZ43sKR2d35GAvG
x7ZKCcKl5U9vmC8c6q/eAZUE7CeNy1QBGXhYX6oP28NGxl5AzZ/Q8aYMsv0uqhyF
cwYbVKA5Wl5fovMrnjs8wwkKqa9cHdzy7JmhyhBV5k5ggfSUeD7mG0UM5eRxr6ZM
TOa/97QeXa7v
=mjsy
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM PMU change for 6.7:
- Handle NMI/SMI requests after PMU/PMI requests so that a PMI=>NMI doesn't
require redoing the entire run loop due to the NMI not being detected until
the final kvm_vcpu_exit_request() check before entering the guest.
|
|
|
|
e122d7a100 |
KVM x86 Xen changes for 6.7:
- Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.
- Use the fast path directly from the timer callback when delivering Xen timer
events. Avoid the problematic races with using the fast path by ensuring
the hrtimer isn't running when (re)starting the timer or saving the timer
information (for userspace).
- Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future flag.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8He8SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5KyQP+wUH3n6hhJGScsSCpWXK6r8q+Y2ZBftY
ecXuoTfeBJmsoTbnExF7K600DtbxHY5jjxt3ROmoUCertCFRCoq6pi5v4rbRDDQ1
fmGkht43A6zAuHQ0Ntvkq4rNEmISAbzLP4EXOxZJ/Hxld91T8IutMFo7NN/YfOSx
nb+qgb7B25T7ODGvzahRjxnoevCHBN/TdKeDrvsoWeMpVw+CDYqquQOcLfHMaBAN
DqGwZzpdVqRQqg3TOuBGCiv5IcvskjkFUh0y6cEYkCR/MruLoT6CygoLImEV2naW
RU0ZU9Y4cjf+BV/faQEdP6mDQwwCUHWLxDpXUVn03KQYQHlA7q6UgRKxy35ixZ5w
Euxvg4m2ZGgJjsVLqTTMUlbLSNxD6wWZAVxGH7w8XghKrNmoj1IoajPZS+1rwyO2
5rUynMKf3HMT6oeqqZH95aChlUMiAvaPYPc+ogku8Bt1zJQVv/xnk/6T95Vw6C/t
KfYsV80rmJd/EL/fUXYX3mCMcZGHyv80QlOEc0uR4f25HGszCG8qHiSaUtnvQUjQ
xaguSuO1Cf7sdhHPWj4p/US+Jerrgd8nzoQGvKUOkdLsQzU71xwjvTZNlmmBYKKO
zgGIXZfaXa4JibAqnRrC+V8UdDPOwKvOEzmH0joLEzkTISnIG2LycvZ6tG7sTcMU
0sIg2dvhJx/G
=Z2eM
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-xen-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM x86 Xen changes for 6.7:
- Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.
- Use the fast path directly from the timer callback when delivering Xen timer
events. Avoid the problematic races with using the fast path by ensuring
the hrtimer isn't running when (re)starting the timer or saving the timer
information (for userspace).
- Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future flag.
|
|
|
|
f0f59d069e |
KVM x86 MMU changes for 6.7:
- Clean up code that deals with honoring guest MTRRs when the VM has
non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled.
- Zap EPT entries when non-coherent DMA assignment stops/start to prevent
using stale entries with the wrong memtype.
- Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y, as
there's zero reason to ignore guest PAT if the effective MTRR memtype is WB.
This will also allow for future optimizations of handling guest MTRR updates
for VMs with non-coherent DMA and the quirk enabled.
- Harden the fast page fault path to guard against encountering an invalid
root when walking SPTEs.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8FG8SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5tYMP+gJd3raPnpmai4NyaFaZNP6/5YsXuUMj
XBvHH7hBGHmjd1sV+O62fhUvNk4+M/1f1rERutP4s7yXEXxQfC9G/MQFgLBfyiW8
xR+RQkNrz8HsG8mHFBZ0Ei6OofhP+BRTYDRU7kbctKDh/4Hp5AOZAxYHs/ZhOho1
Lw6upZbQLCkdt72eEKbfocg6Tf400hWEyarBRXFe4KJzWq7KMjAPgqA/3Vx0lF6u
zX73Zr6tV0mcf3QXd58Q4CUwOuwMo1aTangmOhEeC09JplF2okLV36h6WrCF8qqO
gvmDrMA450Yc215peOJGBJzoZJrNjMIHZ2m+4Ifag6Z/jJoam4vjzUZmmrzx+Gbj
Ot5lmXCVRXCdHmUNdYQ6yR27WaVP3C3ItkxwNZGMPoh2G08NGyLLY1kwzRyITEH4
M9jYTRBZaeue57ad5Ms9FaneBLWwPxajTX90rWZbl2kzfd8PG5cF1VroESBLoa0f
I2kDcd7988xLTOMl1sfO8ci21Ve7rQc0hA6WlOXrDxb26OvYrftYXeXOCowN6kqP
czXIu5ZPmLI1btimZQXGMdxKkw5wwe3wDC3y5gKrm+rTfORUXoOUDoITIpmPCnAp
Dzfr5la3RI1GjHhzR80x4vXQC9BgJ9WrEwJub/RqVfE3T3ohw+NZl+AeM1xB9eT1
2mJWm6GFEm9Y
=Zfbr
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mmu-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.7:
- Clean up code that deals with honoring guest MTRRs when the VM has
non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled.
- Zap EPT entries when non-coherent DMA assignment stops/start to prevent
using stale entries with the wrong memtype.
- Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y, as
there's zero reason to ignore guest PAT if the effective MTRR memtype is WB.
This will also allow for future optimizations of handling guest MTRR updates
for VMs with non-coherent DMA and the quirk enabled.
- Harden the fast page fault path to guard against encountering an invalid
root when walking SPTEs.
|
|
|
|
f292dc8aad |
KVM x86 misc changes for 6.7:
- Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without
forcing more common use cases to eat the extra memory overhead.
- Add IBPB and SBPB virtualization support.
- Fix a bug where restoring a vCPU snapshot that was taken within 1 second of
creating the original vCPU would cause KVM to try to synchronize the vCPU's
TSC and thus clobber the correct TSC being set by userspace.
- Compute guest wall clock using a single TSC read to avoid generating an
inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads.
- "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain
about a "Firmware Bug" if the bit isn't set for select F/M/S combos.
- Don't apply side effects to Hyper-V's synthetic timer on writes from
userspace to fix an issue where the auto-enable behavior can trigger
spurious interrupts, i.e. do auto-enabling only for guest writes.
- Remove an unnecessary kick of all vCPUs when synchronizing the dirty log
without PML enabled.
- Advertise "support" for non-serializing FS/GS base MSR writes as appropriate.
- Use octal notation for file permissions through KVM x86.
- Fix a handful of typo fixes and warts.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8EugSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5xS0P+gPTDO81CUZO70LrO2W4E7toRBf/F9x1
/v5D/76p9hG32Z6+BJs/xxDxJFagw75MtoR5oKivtXiip3TxbfOyDOlaQkIRo85E
/d95il/LRidL3Mv3TXRj1lykXnxSSz9tigAGEZti1Y9Fn9fXEIwurJH7dU5cBI1E
fin5bsDaTNRjG4jjTiEUbnKPRTlD/S7CQJn4CaYvZhMv/eJkYDLyBBVy4VLoLzvD
ctL6VJQLGPVxbxr9mEmulaqMrSuDIQQLkRVQJAViKyerBInTEc5d/GPCHuE8O3zi
0r/QSJbMS9titWLz07NhJ1UH4VJNyaEhRlyJPSFhBW4h6dzUb3EXdUe0Hwa+JH/S
H2cVqsANItTCIhvDtuEGIRDahu0eD+63h90InJ0gEVL1kSJS+UWZHB71PkUEQgAV
2OsuT1D26fuxrv+0b9ioBZURycqKw++zGsrwyVhe77eBgqBJ12tbL4TAD+QNjaQ5
HZTCe6YV83gZoOMeVkoTGSf96s9lGORgxsaAIXmFuLB9RVCVXhVh0ph2HZsnV8Hw
ZXEXpBEFo7GUhb0NIvsk2W73QL87A3fLv15yITWc8KuC7/dXP9z6KpSKjFySS69X
uWD1MVx6shhvbg97UzoJlXc3/z0aVzmdZJudE5d0gcFvAjIItqp6ICPOoKxfj8pT
tqRZu3kVHd61
=sfp8
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.7:
- Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without
forcing more common use cases to eat the extra memory overhead.
- Add IBPB and SBPB virtualization support.
- Fix a bug where restoring a vCPU snapshot that was taken within 1 second of
creating the original vCPU would cause KVM to try to synchronize the vCPU's
TSC and thus clobber the correct TSC being set by userspace.
- Compute guest wall clock using a single TSC read to avoid generating an
inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads.
- "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain
about a "Firmware Bug" if the bit isn't set for select F/M/S combos.
- Don't apply side effects to Hyper-V's synthetic timer on writes from
userspace to fix an issue where the auto-enable behavior can trigger
spurious interrupts, i.e. do auto-enabling only for guest writes.
- Remove an unnecessary kick of all vCPUs when synchronizing the dirty log
without PML enabled.
- Advertise "support" for non-serializing FS/GS base MSR writes as appropriate.
- Use octal notation for file permissions through KVM x86.
- Fix a handful of typo fixes and warts.
|
|
|
|
fad505b2cb |
KVM: x86: Service NMI requests after PMI requests in VM-Enter path
Service NMI and SMI requests after PMI requests in vcpu_enter_guest() so
that KVM does not need to cancel and redo the VM-Enter if the guest
configures its PMIs to be delivered as NMIs (likely) or SMIs (unlikely).
Because APIC emulation "injects" NMIs via KVM_REQ_NMI, handling PMI
requests after NMI requests (the likely case) means KVM won't detect the
pending NMI request until the final check for outstanding requests.
Detecting requests at the final stage is costly as KVM has already loaded
guest state, potentially queued events for injection, disabled IRQs,
dropped SRCU, etc., most of which needs to be unwound.
Note that changing the order of request processing doesn't change the end
result, as KVM's final check for outstanding requests prevents entering
the guest until all requests are serviced. I.e. KVM will ultimately
coalesce events (or not) regardless of the ordering.
Using SPEC2017 benchmark programs running along with Intel vtune in a VM
demonstrates that the following code change reduces 800~1500 canceled
VM-Enters per second.
Some glory details:
Probe the invocation to vmx_cancel_injection():
$ perf probe -a vmx_cancel_injection
$ perf stat -a -e probe:vmx_cancel_injection -I 10000 # per 10 seconds
Partial results when SPEC2017 with Intel vtune are running in the VM:
On kernel without the change:
10.010018010 14254 probe:vmx_cancel_injection
20.037646388 15207 probe:vmx_cancel_injection
30.078739816 15261 probe:vmx_cancel_injection
40.114033258 15085 probe:vmx_cancel_injection
50.149297460 15112 probe:vmx_cancel_injection
60.185103088 15104 probe:vmx_cancel_injection
On kernel with the change:
10.003595390 40 probe:vmx_cancel_injection
20.017855682 31 probe:vmx_cancel_injection
30.028355883 34 probe:vmx_cancel_injection
40.038686298 31 probe:vmx_cancel_injection
50.048795162 20 probe:vmx_cancel_injection
60.069057747 19 probe:vmx_cancel_injection
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20231002040839.2630027-1-mizhang@google.com
[sean: hoist PMU/PMI above SMI too, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
2770d47220 |
KVM: x86: Ignore MSR_AMD64_TW_CFG access
Hyper-V enabled Windows Server 2022 KVM VM cannot be started on Zen1 Ryzen
since it crashes at boot with SYSTEM_THREAD_EXCEPTION_NOT_HANDLED +
STATUS_PRIVILEGED_INSTRUCTION (in other words, because of an unexpected #GP
in the guest kernel).
This is because Windows tries to set bit 8 in MSR_AMD64_TW_CFG and can't
handle receiving a #GP when doing so.
Give this MSR the same treatment that commit
|
|
|
|
122ae01c51 |
KVM: x86: remove the unused assigned_dev_head from kvm_arch
Legacy device assignment was dropped years ago. This field is not used anymore. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Link: https://lore.kernel.org/r/20231019043336.8998-1-liangchen.linux@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
2081a8450e |
KVM: x86: remove always-false condition in kvmclock_sync_fn
The 'kvmclock_periodic_sync' is a readonly param that cannot change after bootup. The kvm_arch_vcpu_postcreate() is not going to schedule the kvmclock_sync_work if kvmclock_periodic_sync == false. As a result, the "if (!kvmclock_periodic_sync)" can never be true if the kvmclock_sync_work = kvmclock_sync_fn() is scheduled. Link: https://lore.kernel.org/kvm/a461bf3f-c17e-9c3f-56aa-726225e8391d@oracle.com Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lore.kernel.org/r/20231001213637.76686-1-dongli.zhang@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
3d30bfcbdc |
KVM: x86/mmu: Stop kicking vCPUs to sync the dirty log when PML is disabled
Stop kicking vCPUs in kvm_arch_sync_dirty_log() when PML is disabled. Kicking vCPUs when PML is disabled serves no purpose and could negatively impact guest performance. This restores KVM's behavior to prior to 5.12 commit |
|
|
|
26951ec862 |
KVM: x86: Use octal for file permission
Convert all module params to octal permissions to improve code readability
and to make checkpatch happy:
WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using
octal permissions '0444'.
Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Link: https://lore.kernel.org/r/20231013113020.77523-1-flyingpeng@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
88e4cd893f |
KVM x86/pmu fixes for 6.6:
- Truncate writes to PMU counters to the counter's width to avoid spurious
overflows when emulating counter events in software.
- Set the LVTPC entry mask bit when handling a PMI (to match Intel-defined
architectural behavior).
- Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work to
kick the guest out of emulated halt.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmUp1FESHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5IRsQAIsk+UwTP+q+ZzkpkSOJ+ocmKU97/GbW
snB+F5FwNXnWEPzHIV+Ldv+WUpmHilTrylk2t5jLyew783TPxTnLmNAa+D3iSSBP
jSGzCIqR2uRHOxhuJgkKvdOkfuS7vob1KcKrfOwKCSss78VhKGkMGIi66/81RTxo
zxpzva+F2YtbCwKWXewOvR4CsWhjVqOGRTCmjF6t8PpFDGqwZdu0ornBHC2gvkUI
iDHWVBg5Rz/akqxjEVL94SP5qdFSaVG+F3Z8xpnn+tfPncEK/xPFdGHGKwOy5Jvt
4dQLc6TGmS2+NGPU3eAJOr+GZKryQth1CI+5RDlnoKQXjQ3laJwjmgyCRbUYLoZh
/R7f5YJrhGheUvCCmagY1g2x41qp/CTG1RnX1SVTIGH9h+5LSVcCukCL9Tx2/B4v
eU8nrzhUuijSqG6TiyAV5hvFqMQf3LWWcjSSW58kIWmXLpqdb/Xp6wiFHjOM7wZM
c1br+6AwKZwKNdqn3/cnlBnLc+1jq/PWFnuF9svjKn5JTOyg8kddmyWUkDqiLOeZ
/jqqwRJQUZppy4DxFHdkuQxnTsrztNzs/vhQtF6MIgFRULrs4FaiTUxuAs72skqm
Fv/IIuyHWjST9HY8dgTx8PLqUevEc7zekmhN1Cj5KwhlHxKYWSZfew80CO7h2qhJ
IvAC70QC+BsW
=g8g3
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.6-fixes' of https://github.com/kvm-x86/linux into HEAD
KVM x86/pmu fixes for 6.6:
- Truncate writes to PMU counters to the counter's width to avoid spurious
overflows when emulating counter events in software.
- Set the LVTPC entry mask bit when handling a PMI (to match Intel-defined
architectural behavior).
- Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work to
kick the guest out of emulated halt.
|
|
|
|
8647c52e95 |
KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2}
Mask off xfeatures that aren't exposed to the guest only when saving guest
state via KVM_GET_XSAVE{2} instead of modifying user_xfeatures directly.
Preserving the maximal set of xfeatures in user_xfeatures restores KVM's
ABI for KVM_SET_XSAVE, which prior to commit
|
|
|
|
18164f66e6 |
x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer
Plumb an xfeatures mask into __copy_xstate_to_uabi_buf() so that KVM can
constrain which xfeatures are saved into the userspace buffer without
having to modify the user_xfeatures field in KVM's guest_fpu state.
KVM's ABI for KVM_GET_XSAVE{2} is that features that are not exposed to
guest must not show up in the effective xstate_bv field of the buffer.
Saving only the guest-supported xfeatures allows userspace to load the
saved state on a different host with a fewer xfeatures, so long as the
target host supports the xfeatures that are exposed to the guest.
KVM currently sets user_xfeatures directly to restrict KVM_GET_XSAVE{2} to
the set of guest-supported xfeatures, but doing so broke KVM's historical
ABI for KVM_SET_XSAVE, which allows userspace to load any xfeatures that
are supported by the *host*.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230928001956.924301-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
362ff6dca5 |
KVM: x86/mmu: Zap KVM TDP when noncoherent DMA assignment starts/stops
Zap KVM TDP when noncoherent DMA assignment starts (noncoherent dma count transitions from 0 to 1) or stops (noncoherent dma count transitions from 1 to 0). Before the zap, test if guest MTRR is to be honored after the assignment starts or was honored before the assignment stops. When there's no noncoherent DMA device, EPT memory type is ((MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT) When there're noncoherent DMA devices, EPT memory type needs to honor guest CR0.CD and MTRR settings. So, if noncoherent DMA count transitions between 0 and 1, EPT leaf entries need to be zapped to clear stale memory type. This issue might be hidden when the device is statically assigned with VFIO adding/removing MMIO regions of the noncoherent DMA devices for several times during guest boot, and current KVM MMU will call kvm_mmu_zap_all_fast() on the memslot removal. But if the device is hot-plugged, or if the guest has mmio_always_on for the device, the MMIO regions of it may only be added for once, then there's no path to do the EPT entries zapping to clear stale memory type. Therefore do the EPT zapping when noncoherent assignment starts/stops to ensure stale entries cleaned away. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230714065223.20432-1-yan.y.zhao@intel.com [sean: fix misspelled words in comment and changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
bf328e22e4 |
KVM: x86: Don't sync user-written TSC against startup values
The legacy API for setting the TSC is fundamentally broken, and only allows userspace to set a TSC "now", without any way to account for time lost between the calculation of the value, and the kernel eventually handling the ioctl. To work around this, KVM has a hack which, if a TSC is set with a value which is within a second's worth of the last TSC "written" to any vCPU in the VM, assumes that userspace actually intended the two TSC values to be in sync and adjusts the newly-written TSC value accordingly. Thus, when a VMM restores a guest after suspend or migration using the legacy API, the TSCs aren't necessarily *right*, but at least they're in sync. This trick falls down when restoring a guest which genuinely has been running for less time than the 1 second of imprecision KVM allows for in in the legacy API. On *creation*, the first vCPU starts its TSC counting from zero, and the subsequent vCPUs synchronize to that. But then when the VMM tries to restore a vCPU's intended TSC, because the VM has been alive for less than 1 second and KVM's default TSC value for new vCPU's is '0', the intended TSC is within a second of the last "written" TSC and KVM incorrectly adjusts the intended TSC in an attempt to synchronize. But further hacks can be piled onto KVM's existing hackish ABI, and declare that the *first* value written by *userspace* (on any vCPU) should not be subject to this "correction", i.e. KVM can assume that the first write from userspace is not an attempt to sync up with TSC values that only come from the kernel's default vCPU creation. To that end: Add a flag, kvm->arch.user_set_tsc, protected by kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in the VM *has* been set by userspace, and make the 1-second slop hack only trigger if user_set_tsc is already set. Note that userspace can explicitly request a *synchronization* of the TSC by writing zero. For the purpose of user_set_tsc, an explicit synchronization counts as "setting" the TSC, i.e. if userspace then subsequently writes an explicit non-zero value which happens to be within 1 second of the previous value, the new value will be "corrected". This behavior is deliberate, as treating explicit synchronization as "setting" the TSC preserves KVM's existing behaviour inasmuch as possible (KVM always applied the 1-second "correction" regardless of whether the write came from userspace vs. the kernel). Reported-by: Yong He <alexyonghe@tencent.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423 Suggested-by: Oliver Upton <oliver.upton@linux.dev> Original-by: Oliver Upton <oliver.upton@linux.dev> Original-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Tested-by: Yong He <alexyonghe@tencent.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20231008025335.7419-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
7a18c7c2b6 |
KVM: x86/mmu: Zap SPTEs when CR0.CD is toggled iff guest MTRRs are honored
Zap SPTEs when CR0.CD is toggled if and only if KVM's MMU is honoring guest MTRRs, which is the only time that KVM incorporates the guest's CR0.CD into the final memtype. Suggested-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230714065122.20315-1-yan.y.zhao@intel.com [sean: rephrase shortlog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
8b0e00fba9 |
KVM: x86: Virtualize HWCR.TscFreqSel[bit 24]
On certain CPUs, Linux guests expect HWCR.TscFreqSel[bit 24] to be set. If it isn't set, they complain: [Firmware Bug]: TSC doesn't count with P0 frequency! Allow userspace (and the guest) to set this bit in the virtual HWCR to eliminate the above complaint. Allow the guest to write the bit even though its is R/O on *some* CPUs. Like many bits in HWRC, TscFreqSel is not architectural at all. On Family 10h[1], it was R/W and powered on as 0. In Family 15h, one of the "changes relative to Family 10H Revision D processors[2] was: • MSRC001_0015 [Hardware Configuration (HWCR)]: • Dropped TscFreqSel; TSC can no longer be selected to run at NB P0-state. Despite the "Dropped" above, that same document later describes HWCR[bit 24] as follows: TscFreqSel: TSC frequency select. Read-only. Reset: 1. 1=The TSC increments at the P0 frequency If the guest clears the bit, the worst case scenario is the guest will be no worse off than it is today, e.g. the whining may return after a guest clears the bit and kexec()'s into a new kernel. [1] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/programmer-references/31116.pdf [2] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/programmer-references/42301_15h_Mod_00h-0Fh_BKDG.pdf, Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20230929230246.1954854-3-jmattson@google.com [sean: elaborate on why the bit is writable by the guest] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
598a790fc2 |
KVM: x86: Allow HWCR.McStatusWrEn to be cleared once set
When HWCR is set to 0, store 0 in vcpu->arch.msr_hwcr.
Fixes:
|
|
|
|
5d6d6a7d7e |
KVM: x86: Refine calculation of guest wall clock to use a single TSC read
When populating the guest's PV wall clock information, KVM currently does a simple 'kvm_get_real_ns() - get_kvmclock_ns(kvm)'. This is an antipattern which should be avoided; when working with the relationship between two clocks, it's never correct to obtain one of them "now" and then the other at a slightly different "now" after an unspecified period of preemption (which might not even be under the control of the kernel, if this is an L1 hosting an L2 guest under nested virtualization). Add a kvm_get_wall_clock_epoch() function to return the guest wall clock epoch in nanoseconds using the same method as __get_kvmclock() — by using kvm_get_walltime_and_clockread() to calculate both the wall clock and KVM clock time from a *single* TSC reading. The condition using get_cpu_tsc_khz() is equivalent to the version in __get_kvmclock() which separately checks for the CONSTANT_TSC feature or the per-CPU cpu_tsc_khz. Which is what get_cpu_tsc_khz() does anyway. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/bfc6d3d7cfb88c47481eabbf5a30a264c58c7789.camel@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
e47d86083c |
KVM: x86: Add SBPB support
Add support for the AMD Selective Branch Predictor Barrier (SBPB) by advertising the CPUID bit and handling PRED_CMD writes accordingly. Note, like SRSO_NO and IBPB_BRTYPE before it, advertise support for SBPB even if it's not enumerated by in the raw CPUID. Some CPUs that gained support via a uCode patch don't report SBPB via CPUID (the kernel forces the flag). Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/a4ab1e7fe50096d50fde33e739ed2da40b41ea6a.1692919072.git.jpoimboe@kernel.org Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
0068299540 |
KVM: SVM: Treat all "skip" emulation for SEV guests as outright failures
Treat EMULTYPE_SKIP failures on SEV guests as unhandleable emulation instead of simply resuming the guest, and drop the hack-a-fix which effects that behavior for the INT3/INTO injection path. If KVM can't skip an instruction for which KVM has already done partial emulation, resuming the guest is undesirable as doing so may corrupt guest state. Link: https://lore.kernel.org/r/20230825013621.2845700-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
aeb904f6b9 |
KVM: x86: Refactor can_emulate_instruction() return to be more expressive
Refactor and rename can_emulate_instruction() to allow vendor code to return more than true/false, e.g. to explicitly differentiate between "retry", "fault", and "unhandleable". For now, just do the plumbing, a future patch will expand SVM's implementation to signal outright failure if KVM attempts EMULTYPE_SKIP on an SEV guest. No functional change intended (or rather, none that are visible to the guest or userspace). Link: https://lore.kernel.org/r/20230825013621.2845700-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
ee11ab6bb0 |
KVM: X86: Reduce size of kvm_vcpu_arch structure when CONFIG_KVM_XEN=n
When CONFIG_KVM_XEN=n, the size of kvm_vcpu_arch can be reduced from 5100+ to 4400+ by adding macro control. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Link: https://lore.kernel.org/all/CAPm50aKwbZGeXPK5uig18Br8CF1hOS71CE2j_dLX+ub7oJdpGg@mail.gmail.com [sean: fix whitespace damage] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
4346db6e6e |
KVM: x86: Force TLB flush on userspace changes to special registers
Userspace can directly modify the content of vCPU's CR0, CR3, and CR4 via
KVM_SYNC_X86_SREGS and KVM_SET_SREGS{,2}. Make sure that KVM flushes guest
TLB entries and paging-structure caches if a (partial) guest TLB flush is
architecturally required based on the CRn changes. To keep things simple,
flush whenever KVM resets the MMU context, i.e. if any bits in CR0, CR3,
CR4, or EFER are modified. This is extreme overkill, but stuffing state
from userspace is not such a hot path that preserving guest TLB state is a
priority.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230814222358.707877-3-mhal@rbox.co
[sean: call out that the flushing on MMU context resets is for simplicity]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
9dbb029b9c |
KVM: x86: Remove redundant vcpu->arch.cr0 assignments
Drop the vcpu->arch.cr0 assignment after static_call(kvm_x86_set_cr0).
CR0 was already set by {vmx,svm}_set_cr0().
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230814222358.707877-2-mhal@rbox.co
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
73554b29bd |
KVM: x86/pmu: Synthesize at most one PMI per VM-exit
When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
VM-exit that also invokes __kvm_perf_overflow() as a result of
instruction emulation, kvm_pmu_deliver_pmi() will be called twice
before the next VM-entry.
Calling kvm_pmu_deliver_pmi() twice is unlikely to be problematic now that
KVM sets the LVTPC mask bit when delivering a PMI. But using IRQ work to
trigger the PMI is still broken, albeit very theoretically.
E.g. if the self-IPI to trigger IRQ work is be delayed long enough for the
vCPU to be migrated to a different pCPU, then it's possible for
kvm_pmi_trigger_fn() to race with the kvm_pmu_deliver_pmi() from
KVM_REQ_PMI and still generate two PMIs.
KVM could set the mask bit using an atomic operation, but that'd just be
piling on unnecessary code to workaround what is effectively a hack. The
*only* reason KVM uses IRQ work is to ensure the PMI is treated as a wake
event, e.g. if the vCPU just executed HLT.
Remove the irq_work callback for synthesizing a PMI, and all of the
logic for invoking it. Instead, to prevent a vcpu from leaving C0 with
a PMI pending, add a check for KVM_REQ_PMI to kvm_vcpu_has_events().
Fixes:
|
|
|
|
0df9dab891 |
KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously
Stop zapping invalidate TDP MMU roots via work queue now that KVM preserves TDP MMU roots until they are explicitly invalidated. Zapping roots asynchronously was effectively a workaround to avoid stalling a vCPU for an extended during if a vCPU unloaded a root, which at the time happened whenever the guest toggled CR0.WP (a frequent operation for some guest kernels). While a clever hack, zapping roots via an unbound worker had subtle, unintended consequences on host scheduling, especially when zapping multiple roots, e.g. as part of a memslot. Because the work of zapping a root is no longer bound to the task that initiated the zap, things like the CPU affinity and priority of the original task get lost. Losing the affinity and priority can be especially problematic if unbound workqueues aren't affined to a small number of CPUs, as zapping multiple roots can cause KVM to heavily utilize the majority of CPUs in the system, *beyond* the CPUs KVM is already using to run vCPUs. When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root zap can result in KVM occupying all logical CPUs for ~8ms, and result in high priority tasks not being scheduled in in a timely manner. In v5.15, which doesn't preserve unloaded roots, the issues were even more noticeable as KVM would zap roots more frequently and could occupy all CPUs for 50ms+. Consuming all CPUs for an extended duration can lead to significant jitter throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots is a semi-frequent operation as memslots are deleted and recreated with different host virtual addresses to react to host GPU drivers allocating and freeing GPU blobs. On ChromeOS, the jitter manifests as audio blips during games due to the audio server's tasks not getting scheduled in promptly, despite the tasks having a high realtime priority. Deleting memslots isn't exactly a fast path and should be avoided when possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the memslot shenanigans, but KVM is squarely in the wrong. Not to mention that removing the async zapping eliminates a non-trivial amount of complexity. Note, one of the subtle behaviors hidden behind the async zapping is that KVM would zap invalidated roots only once (ignoring partial zaps from things like mmu_notifier events). Preserve this behavior by adding a flag to identify roots that are scheduled to be zapped versus roots that have already been zapped but not yet freed. Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can encounter invalid roots, as it's not at all obvious why zapping invalidated roots shouldn't simply zap all invalid roots. Reported-by: Pattara Teerapong <pteerapong@google.com> Cc: David Stevens <stevensd@google.com> Cc: Yiwei Zhang<zzyiwei@google.com> Cc: Paul Hsia <paulhsia@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230916003916.2545000-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
58ea7cf700 |
KVM: x86/mmu: Move KVM-only page-track declarations to internal header
Bury the declaration of the page-track helpers that are intended only for internal KVM use in a "private" header. In addition to guarding against unwanted usage of the internal-only helpers, dropping their definitions avoids exposing other structures that should be KVM-internal, e.g. for memslots. This is a baby step toward making kvm_host.h a KVM-internal header in the very distant future. Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
b83ab124de |
KVM: x86: Add a new page-track hook to handle memslot deletion
Add a new page-track hook, track_remove_region(), that is called when a memslot DELETE operation is about to be committed. The "remove" hook will be used by KVMGT and will effectively replace the existing track_flush_slot() altogether now that KVM itself doesn't rely on the "flush" hook either. The "flush" hook is flawed as it's invoked before the memslot operation is guaranteed to succeed, i.e. KVM might ultimately keep the existing memslot without notifying external page track users, a.k.a. KVMGT. In practice, this can't currently happen on x86, but there are no guarantees that won't change in the future, not to mention that "flush" does a very poor job of describing what is happening. Pass in the gfn+nr_pages instead of the slot itself so external users, i.e. KVMGT, don't need to exposed to KVM internals (memslots). This will help set the stage for additional cleanups to the page-track APIs. Opportunistically align the existing srcu_read_lock_held() usage so that the new case doesn't stand out like a sore thumb (and not aligning the new code makes bots unhappy). Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230729013535.1070024-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
c70934e0ab |
KVM: x86: Reject memslot MOVE operations if KVMGT is attached
Disallow moving memslots if the VM has external page-track users, i.e. if
KVMGT is being used to expose a virtual GPU to the guest, as KVMGT doesn't
correctly handle moving memory regions.
Note, this is potential ABI breakage! E.g. userspace could move regions
that aren't shadowed by KVMGT without harming the guest. However, the
only known user of KVMGT is QEMU, and QEMU doesn't move generic memory
regions. KVM's own support for moving memory regions was also broken for
multiple years (albeit for an edge case, but arguably moving RAM is
itself an edge case), e.g. see commit
|
|
|
|
db0d70e610 |
KVM: x86/mmu: Move kvm_arch_flush_shadow_{all,memslot}() to mmu.c
Move x86's implementation of kvm_arch_flush_shadow_{all,memslot}() into
mmu.c, and make kvm_mmu_zap_all() static as it was globally visible only
for kvm_arch_flush_shadow_all(). This will allow refactoring
kvm_arch_flush_shadow_memslot() to call kvm_mmu_zap_all() directly without
having to expose kvm_mmu_zap_all_fast() outside of mmu.c. Keeping
everything in mmu.c will also likely simplify supporting TDX, which
intends to do zap only relevant SPTEs on memslot updates.
No functional change intended.
Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
6d5e3c318a |
KVM x86 changes for 6.6:
- Misc cleanups
- Retry APIC optimized recalculation if a vCPU is added/enabled
- Overhaul emergency reboot code to bring SVM up to par with VMX, tie the
"emergency disabling" behavior to KVM actually being loaded, and move all of
the logic within KVM
- Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC
ratio MSR can diverge from the default iff TSC scaling is enabled, and clean
up related code
- Add a framework to allow "caching" feature flags so that KVM can check if
the guest can use a feature without needing to search guest CPUID
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmTueMwSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5hp4P/i/UmIJEJupryUrD/ZXcSjqmupCtv4JS
Z2o1KIAPbM5GUX4iyF1cnZrI4Ac5zMtULN8Tp3ATOp3AqKy72AqB1Z82e+v6SKis
KfSXlDFCPFisrwv3Ys7JEu9vIS8oqITHmSBk8OAmElwujdQ5jYLZjwGbCXbM9qas
yCFGLqD4fjX8XqkZLmXggjT99MPSgiTPoKL592Wq4JR8mY4hyQqJzBepDjb94sT7
wrsAv1B+BchGDguk0+nOdmHM4emGrZU7fVqi3OFPofSlwAAdkqZObleb422KB058
5bcpNow+9VH5pzgq8XSAU7DLNgH9aXH0PcVU8ASU6P0D9fceKoOFuL47nnFbwz0t
vKafcXNWFs8xHE4iyzvAAsZK/X8GR0ngNByPnamATMsjt2tTmsa5BOyAPkIN+GpT
DzZCIk27SbdGC3lGYlSV+5ob/+sOr6m384DkvSZnU6JiiFLlZiTxURj1/9Zvfka8
2co2wnf8cJxnKFUThFfuxs9XpKgvhkOE8LauwCSo4MAQM95Pen+NAK960RBWj0xl
wof5kIGmKbwmMXyg2Sr+EKqe5KRPba22Yi3x24tURAXafKK/AW7T8dgEEXOll7dp
pKmTPAevwUk9wYIGultjhEBXKYgMOeD2BVoTa5je5h1Da28onrSJ7aLQUixHHs0J
gLdtzs8M9K9t
=yGM1
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.6' of https://github.com/kvm-x86/linux into HEAD
KVM x86 changes for 6.6:
- Misc cleanups
- Retry APIC optimized recalculation if a vCPU is added/enabled
- Overhaul emergency reboot code to bring SVM up to par with VMX, tie the
"emergency disabling" behavior to KVM actually being loaded, and move all of
the logic within KVM
- Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC
ratio MSR can diverge from the default iff TSC scaling is enabled, and clean
up related code
- Add a framework to allow "caching" feature flags so that KVM can check if
the guest can use a feature without needing to search guest CPUID
|
|
|
|
1814db83c0 |
KVM: x86: Selftests changes for 6.6:
- Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs
- Add support for printf() in guest code and covert all guest asserts to use
printf-based reporting
- Clean up the PMU event filter test and add new testcases
- Include x86 selftests in the KVM x86 MAINTAINERS entry
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmTueu4SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5wvIQAK8jWhb1Y4CzrJmcZyYYIR6apgtXl4vB
KbhFIFHi5ZeZXlpXA2o/FW8Q9LNmcRLtxoapb09t/eyb0+ODllDPt/aSG7p6Y4p9
rNb1g6Hj77LTaG5gMy7/lbk9ERzf61+MKUuucU7WzjlY8oyd+lm+y2cx2O3+S/89
C5cp2CGnqK2NMbUnzYN8izMrdvtwDvgQvm3H7Ah8yrGXJkcemVggXibuh+2coTfo
p2RKrY+A4Syw/edNe0GVZYoSVJdwPEif8o0gAz5PwC2LTjpf9Iobt89KEx08BkVw
ms0MFbwLS66MoSYIVoZkBdy/Tri5aCKxHGqu7taEWhogjbzrPvktA6PNYihO4zGa
OSjA/oyAPvFJ4cLuBlrVh/xPWVoGX/6Sx3dBP5TI3zyR0FAqZkoAPDivWhflOpTt
q3aoHr6THGRzqHOCYuX7nwzhqBFSSHUF1zy/P7rThSzieSzUiJiANUwBjTeB9Wsr
5Cn+KQ8XOZw1LVcoeI9y97xcHh9HeP3seO+MFie8OH9QK4nUqgqEbF8sp7WF0rB6
6rZ1lht9a2Qx4xdtqSMBkQdgnnaiCZ7jBtEFMK6kSQ67zvorlCwkOue3TrtorJ4H
1XI/DGAzltEfCLMAq+4FkHkkEr84S3gRjaLlI9aHWlVrSk1wxM87R16jgVfJp74R
gTNAzCys2KwM
=dHTQ
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-selftests-6.6' of https://github.com/kvm-x86/linux into HEAD
KVM: x86: Selftests changes for 6.6:
- Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs
- Add support for printf() in guest code and covert all guest asserts to use
printf-based reporting
- Clean up the PMU event filter test and add new testcases
- Include x86 selftests in the KVM x86 MAINTAINERS entry
|
|
|
|
e0fb12c673 |
KVM/arm64 updates for Linux 6.6
- Add support for TLB range invalidation of Stage-2 page tables, avoiding unnecessary invalidations. Systems that do not implement range invalidation still rely on a full invalidation when dealing with large ranges. - Add infrastructure for forwarding traps taken from a L2 guest to the L1 guest, with L0 acting as the dispatcher, another baby step towards the full nested support. - Simplify the way we deal with the (long deprecated) 'CPU target', resulting in a much needed cleanup. - Fix another set of PMU bugs, both on the guest and host sides, as we seem to never have any shortage of those... - Relax the alignment requirements of EL2 VA allocations for non-stack allocations, as we were otherwise wasting a lot of that precious VA space. - The usual set of non-functional cleanups, although I note the lack of spelling fixes... -----BEGIN PGP SIGNATURE----- iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmTsXrUPHG1hekBrZXJu ZWwub3JnAAoJECPQ0LrRPXpDZpIQAJUM1rNEOJ8ExYRfoG1LaTfcOm5TD6D1IWlO uCUx4xLMBudw/55HusmUSdiomQ3Xg5UdRaU7vX5OYwPbdoWebjEUfgdP3jCA/TiW mZTMv3x9hOvp+EOS/UnS469cERvg1/KfwcdOQsWL0HsCFZnu2XmQHWPD++vovLNp F1892ij875mC6C6mOR60H2nyjIiCuqWh/8eKBkp65CARCbFDYxWhqBnmcmTvoquh E87pQDPdtgXc0KlOWCABh5bYOu1WGVEXE5f3ixtdY9cQakkSI3NkFKw27/mIWS4q TCsagByNnPFDXTglb1dJopNdluLMFi1iXhRJX78R/PYaHxf4uFafWcQk1U7eDdLg 1kPANggwYe4KNAQZUvRhH7lIPWHCH0r4c1qHV+FsiOZVoDOSKHo4RW1ZFtirJSNW LNJMdk+8xyae0S7z164EpZB/tpFttX4gl3YvUT/T+4gH8+CRFAaoAlK39CoGDPpk f+P2GE1Z5YupF16YjpZtBnan55KkU1b6eORl5zpnAtoaz5WGXqj1t4qo0Q6e9WB9 X4rdDVhH7vRUmhjmSP6PuEygb84hnITLdGpkH2BmWj/4uYuCN+p+U2B2o/QdMJoo cPxdflLOU/+1gfAFYPtHVjVKCqzhwbw3iLXQpO12gzRYqE13rUnAr7RuGDf5fBVC LW7Pv81o =DKhx -----END PGP SIGNATURE----- Merge tag 'kvmarm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for Linux 6.6 - Add support for TLB range invalidation of Stage-2 page tables, avoiding unnecessary invalidations. Systems that do not implement range invalidation still rely on a full invalidation when dealing with large ranges. - Add infrastructure for forwarding traps taken from a L2 guest to the L1 guest, with L0 acting as the dispatcher, another baby step towards the full nested support. - Simplify the way we deal with the (long deprecated) 'CPU target', resulting in a much needed cleanup. - Fix another set of PMU bugs, both on the guest and host sides, as we seem to never have any shortage of those... - Relax the alignment requirements of EL2 VA allocations for non-stack allocations, as we were otherwise wasting a lot of that precious VA space. - The usual set of non-functional cleanups, although I note the lack of spelling fixes... |
|
|
|
fe60e8f65f |
KVM: x86: Use KVM-governed feature framework to track "XSAVES enabled"
Use the governed feature framework to track if XSAVES is "enabled", i.e. if XSAVES can be used by the guest. Add a comment in the SVM code to explain the very unintuitive logic of deliberately NOT checking if XSAVES is enumerated in the guest CPUID model. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20230815203653.519297-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
392a532462 |
x86: kvm: x86: Remove unnecessary initial values of variables
bitmap and khz is assigned first, so it does not need to initialize the assignment. Signed-off-by: Li zeming <zeming@nfschina.com> Link: https://lore.kernel.org/r/20230817002631.2885-1-zeming@nfschina.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
7b0151caf7 |
KVM: x86: Remove WARN sanity check on hypervisor timer vs. UNINITIALIZED vCPU
Drop the WARN in KVM_RUN that asserts that KVM isn't using the hypervisor timer, a.k.a. the VMX preemption timer, for a vCPU that is in the UNINITIALIZIED activity state. The intent of the WARN is to sanity check that KVM won't drop a timer interrupt due to an unexpected transition to UNINITIALIZED, but unfortunately userspace can use various ioctl()s to force the unexpected state. Drop the sanity check instead of switching from the hypervisor timer to a software based timer, as the only reason to switch to a software timer when a vCPU is blocking is to ensure the timer interrupt wakes the vCPU, but said interrupt isn't a valid wake event for vCPUs in UNINITIALIZED state *and* the interrupt will be dropped in the end. Reported-by: Yikebaer Aizezi <yikebaer61@gmail.com> Closes: https://lore.kernel.org/all/CALcu4rbFrU4go8sBHk3FreP+qjgtZCGcYNpSiEXOLm==qFv7iQ@mail.gmail.com Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230808232057.2498287-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
765da7fe0e |
KVM: x86: Remove break statements that will never be executed
Fix compiler warnings when compiling KVM with [-Wunreachable-code-break]. No functional change intended. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230807094243.32516-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
619b507244 |
KVM: Move kvm_arch_flush_remote_tlbs_memslot() to common code
Move kvm_arch_flush_remote_tlbs_memslot() to common code and drop "arch_" from the name. kvm_arch_flush_remote_tlbs_memslot() is just a range-based TLB invalidation where the range is defined by the memslot. Now that kvm_flush_remote_tlbs_range() can be called from common code we can just use that and drop a bunch of duplicate code from the arch directories. Note this adds a lockdep assertion for slots_lock being held when calling kvm_flush_remote_tlbs_memslot(), which was previously only asserted on x86. MIPS has calls to kvm_flush_remote_tlbs_memslot(), but they all hold the slots_lock, so the lockdep assertion continues to hold true. Also drop the CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT ifdef gating kvm_flush_remote_tlbs_memslot(), since it is no longer necessary. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Acked-by: Anup Patel <anup@brainfault.org> Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-7-rananta@google.com |
|
|
|
eb3515dc99 |
x86: Move gds_ucode_mitigated() declaration to header
The declaration got placed in the .c file of the caller, but that
causes a warning for the definition:
arch/x86/kernel/cpu/bugs.c:682:6: error: no previous prototype for 'gds_ucode_mitigated' [-Werror=missing-prototypes]
Move it to a header where both sides can observe it instead.
Fixes:
|
|
|
|
64094e7e31 |
Mitigate Gather Data Sampling issue
* Add Base GDS mitigation * Support GDS_NO under KVM * Fix a documentation typo -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTJh5YACgkQaDWVMHDJ krAzAw/8DzjhAYEa7a1AodCBMNg8uNOPnLNoRPPNhaN5Iw6W3zXYDBDKT9PyjAIx RoIM0aHx/oY9nCpK441o25oCWAAyzk6E5/+q9hMa7B4aHUGKqiDUC6L9dC8UiiSN yvoBv4g7F81QnmyazwYI64S6vnbr4Cqe7K/mvVqQ/vbJiugD25zY8mflRV9YAuMk Oe7Ff/mCA+I/kqyKhJE3cf3qNhZ61FsFI886fOSvIE7g4THKqo5eGPpIQxR4mXiU Ri2JWffTaeHr2m0sAfFeLH4VTZxfAgBkNQUEWeG6f2kDGTEKibXFRsU4+zxjn3gl xug+9jfnKN1ceKyNlVeJJZKAfr2TiyUtrlSE5d+subIRKKBaAGgnCQDasaFAluzd aZkOYz30PCebhN+KTrR84FySHCaxnev04jqdtVGAQEDbTvyNagFUdZFGhWijJShV l2l4A0gFSYJmPfPVuuAwOJnnZtA1sRH9oz/Sny3+z9BKloZh+Nc/+Cu9zC8SLjaU BF3Qv2gU9HKTJ+MSy2JrGS52cONfpO5ngFHoOMilZ1KBHrfSb1eiy32PDT+vK60Y PFEmI8SWl7bmrO1snVUCfGaHBsHJSu5KMqwBGmM4xSRzJpyvRe493xC7+nFvqNLY vFOFc4jGeusOXgiLPpfGduppkTGcM7sy75UMLwTSLcQbDK99mus= =ZAPY -----END PGP SIGNATURE----- Merge tag 'gds-for-linus-2023-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/gds fixes from Dave Hansen: "Mitigate Gather Data Sampling issue: - Add Base GDS mitigation - Support GDS_NO under KVM - Fix a documentation typo" * tag 'gds-for-linus-2023-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation/x86: Fix backwards on/off logic about YMM support KVM: Add GDS_NO support to KVM x86/speculation: Add Kconfig option for GDS x86/speculation: Add force option to GDS mitigation x86/speculation: Add Gather Data Sampling mitigation |
|
|
|
2d63699099 |
KVM: x86: Always write vCPU's current TSC offset/ratio in vendor hooks
Drop the @offset and @multiplier params from the kvm_x86_ops hooks for propagating TSC offsets/multipliers into hardware, and instead have the vendor implementations pull the information directly from the vCPU structure. The respective vCPU fields _must_ be written at the same time in order to maintain consistent state, i.e. it's not random luck that the value passed in by all callers is grabbed from the vCPU. Explicitly grabbing the value from the vCPU field in SVM's implementation in particular will allow for additional cleanup without introducing even more subtle dependencies. Specifically, SVM can skip the WRMSR if guest state isn't loaded, i.e. svm_prepare_switch_to_guest() will load the correct value for the vCPU prior to entering the guest. This also reconciles KVM's handling of related values that are stored in the vCPU, as svm_write_tsc_offset() already assumes/requires the caller to have updated l1_tsc_offset. Link: https://lore.kernel.org/r/20230729011608.1065019-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
a2fd5d02ba |
KVM: x86: Snapshot host's MSR_IA32_ARCH_CAPABILITIES
Snapshot the host's MSR_IA32_ARCH_CAPABILITIES, if it's supported, instead of reading the MSR every time KVM wants to query the host state, e.g. when initializing the default value during vCPU creation. The paths that query ARCH_CAPABILITIES aren't particularly performance sensitive, but creating vCPUs is a frequent enough operation that burning 8 bytes is a good trade-off. Alternatively, KVM could add a field in kvm_caps and thus skip the on-demand calculations entirely, but a pure snapshot isn't possible due to the way KVM handles the l1tf_vmx_mitigation module param. And unlike the other "supported" fields in kvm_caps, KVM doesn't enforce the "supported" value, i.e. KVM treats ARCH_CAPABILITIES like a CPUID leaf and lets userspace advertise whatever it wants. Those problems are solvable, but it's not clear there is real benefit versus snapshotting the host value, and grabbing the host value will allow additional cleanup of KVM's FB_CLEAR_CTRL code. Link: https://lore.kernel.org/all/20230524061634.54141-2-chao.gao@intel.com Cc: Chao Gao <chao.gao@intel.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230607004311.1420507-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
7f717f5484 |
KVM: x86: Remove x86_emulate_ops::guest_has_long_mode
Remove x86_emulate_ops::guest_has_long_mode along with its implementation,
emulator_guest_has_long_mode(). It has been unused since commit
|
|
|
|
0d033770d4 |
KVM: x86: Fix KVM_CAP_SYNC_REGS's sync_regs() TOCTOU issues
In a spirit of using a sledgehammer to crack a nut, make sync_regs() feed
__set_sregs() and kvm_vcpu_ioctl_x86_set_vcpu_events() with kernel's own
copy of data.
Both __set_sregs() and kvm_vcpu_ioctl_x86_set_vcpu_events() assume they
have exclusive rights to structs they operate on. While this is true when
coming from an ioctl handler (caller makes a local copy of user's data),
sync_regs() breaks this contract; a pointer to a user-modifiable memory
(vcpu->run->s.regs) is provided. This can lead to a situation when incoming
data is checked and/or sanitized only to be re-set by a user thread running
in parallel.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Fixes:
|
|
|
|
26a0652cb4 |
KVM: x86: Disallow KVM_SET_SREGS{2} if incoming CR0 is invalid
Reject KVM_SET_SREGS{2} with -EINVAL if the incoming CR0 is invalid,
e.g. due to setting bits 63:32, illegal combinations, or to a value that
isn't allowed in VMX (non-)root mode. The VMX checks in particular are
"fun" as failure to disallow Real Mode for an L2 that is configured with
unrestricted guest disabled, when KVM itself has unrestricted guest
enabled, will result in KVM forcing VM86 mode to virtual Real Mode for
L2, but then fail to unwind the related metadata when synthesizing a
nested VM-Exit back to L1 (which has unrestricted guest enabled).
Opportunistically fix a benign typo in the prototype for is_valid_cr4().
Cc: stable@vger.kernel.org
Reported-by: syzbot+5feef0b9ee9c8e9e5689@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000f316b705fdf6e2b4@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230613203037.1968489-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
3f2739bd1e |
KVM: x86: Acquire SRCU read lock when handling fastpath MSR writes
Temporarily acquire kvm->srcu for read when potentially emulating WRMSR in the VM-Exit fastpath handler, as several of the common helpers used during emulation expect the caller to provide SRCU protection. E.g. if the guest is counting instructions retired, KVM will query the PMU event filter when stepping over the WRMSR. dump_stack+0x85/0xdf lockdep_rcu_suspicious+0x109/0x120 pmc_event_is_allowed+0x165/0x170 kvm_pmu_trigger_event+0xa5/0x190 handle_fastpath_set_msr_irqoff+0xca/0x1e0 svm_vcpu_run+0x5c3/0x7b0 [kvm_amd] vcpu_enter_guest+0x2108/0x2580 Alternatively, check_pmu_event_filter() could acquire kvm->srcu, but this isn't the first bug of this nature, e.g. see commit |
|
|
|
5e1fe4a21c |
KVM: x86/irq: Conditionally register IRQ bypass consumer again
As was attempted commit
|
|
|
|
bf672720e8 |
KVM: x86: check the kvm_cpu_get_interrupt result before using it
The code was blindly assuming that kvm_cpu_get_interrupt never returns -1 when there is a pending interrupt. While this should be true, a bug in KVM can still cause this. If -1 is returned, the code before this patch was converting it to 0xFF, and 0xFF interrupt was injected to the guest, which results in an issue which was hard to debug. Add WARN_ON_ONCE to catch this case and skip the injection if this happens again. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230726135945.260841-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
81ac7e5d74 |
KVM: Add GDS_NO support to KVM
Gather Data Sampling (GDS) is a transient execution attack using gather instructions from the AVX2 and AVX512 extensions. This attack allows malicious code to infer data that was previously stored in vector registers. Systems that are not vulnerable to GDS will set the GDS_NO bit of the IA32_ARCH_CAPABILITIES MSR. This is useful for VM guests that may think they are on vulnerable systems that are, in fact, not affected. Guests that are running on affected hosts where the mitigation is enabled are protected as if they were running on an unaffected system. On all hosts that are not affected or that are mitigated, set the GDS_NO bit. Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> |
|
|
|
e8069f5a8e |
ARM64:
* Eager page splitting optimization for dirty logging, optionally
allowing for a VM to avoid the cost of hugepage splitting in the stage-2
fault path.
* Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact with
services that live in the Secure world. pKVM intervenes on FF-A calls
to guarantee the host doesn't misuse memory donated to the hyp or a
pKVM guest.
* Support for running the split hypervisor with VHE enabled, known as
'hVHE' mode. This is extremely useful for testing the split
hypervisor on VHE-only systems, and paves the way for new use cases
that depend on having two TTBRs available at EL2.
* Generalized framework for configurable ID registers from userspace.
KVM/arm64 currently prevents arbitrary CPU feature set configuration
from userspace, but the intent is to relax this limitation and allow
userspace to select a feature set consistent with the CPU.
* Enable the use of Branch Target Identification (FEAT_BTI) in the
hypervisor.
* Use a separate set of pointer authentication keys for the hypervisor
when running in protected mode, as the host is untrusted at runtime.
* Ensure timer IRQs are consistently released in the init failure
paths.
* Avoid trapping CTR_EL0 on systems with Enhanced Virtualization Traps
(FEAT_EVT), as it is a register commonly read from userspace.
* Erratum workaround for the upcoming AmpereOne part, which has broken
hardware A/D state management.
RISC-V:
* Redirect AMO load/store misaligned traps to KVM guest
* Trap-n-emulate AIA in-kernel irqchip for KVM guest
* Svnapot support for KVM Guest
s390:
* New uvdevice secret API
* CMM selftest and fixes
* fix racy access to target CPU for diag 9c
x86:
* Fix missing/incorrect #GP checks on ENCLS
* Use standard mmu_notifier hooks for handling APIC access page
* Drop now unnecessary TR/TSS load after VM-Exit on AMD
* Print more descriptive information about the status of SEV and SEV-ES during
module load
* Add a test for splitting and reconstituting hugepages during and after
dirty logging
* Add support for CPU pinning in demand paging test
* Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes
included along the way
* Add a "nx_huge_pages=never" option to effectively avoid creating NX hugepage
recovery threads (because nx_huge_pages=off can be toggled at runtime)
* Move handling of PAT out of MTRR code and dedup SVM+VMX code
* Fix output of PIC poll command emulation when there's an interrupt
* Add a maintainer's handbook to document KVM x86 processes, preferred coding
style, testing expectations, etc.
* Misc cleanups, fixes and comments
Generic:
* Miscellaneous bugfixes and cleanups
Selftests:
* Generate dependency files so that partial rebuilds work as expected
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmSgHrIUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroORcAf+KkBlXwQMf+Q0Hy6Mfe0OtkKmh0Ae
6HJ6dsuMfOHhWv5kgukh+qvuGUGzHq+gpVKmZg2yP3h3cLHOLUAYMCDm+rjXyjsk
F4DbnJLfxq43Pe9PHRKFxxSecRcRYCNox0GD5UYL4PLKcH0FyfQrV+HVBK+GI8L3
FDzUcyJkR12Lcj1qf++7fsbzfOshL0AJPmidQCoc6wkLJpUEr/nYUqlI1Kx3YNuQ
LKmxFHS4l4/O/px3GKNDrLWDbrVlwciGIa3GZLS52PZdW3mAqT+cqcPcYK6SW71P
m1vE80VbNELX5q3YSRoOXtedoZ3Pk97LEmz/xQAsJ/jri0Z5Syk0Ok0m/Q==
=AMXp
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM64:
- Eager page splitting optimization for dirty logging, optionally
allowing for a VM to avoid the cost of hugepage splitting in the
stage-2 fault path.
- Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact
with services that live in the Secure world. pKVM intervenes on
FF-A calls to guarantee the host doesn't misuse memory donated to
the hyp or a pKVM guest.
- Support for running the split hypervisor with VHE enabled, known as
'hVHE' mode. This is extremely useful for testing the split
hypervisor on VHE-only systems, and paves the way for new use cases
that depend on having two TTBRs available at EL2.
- Generalized framework for configurable ID registers from userspace.
KVM/arm64 currently prevents arbitrary CPU feature set
configuration from userspace, but the intent is to relax this
limitation and allow userspace to select a feature set consistent
with the CPU.
- Enable the use of Branch Target Identification (FEAT_BTI) in the
hypervisor.
- Use a separate set of pointer authentication keys for the
hypervisor when running in protected mode, as the host is untrusted
at runtime.
- Ensure timer IRQs are consistently released in the init failure
paths.
- Avoid trapping CTR_EL0 on systems with Enhanced Virtualization
Traps (FEAT_EVT), as it is a register commonly read from userspace.
- Erratum workaround for the upcoming AmpereOne part, which has
broken hardware A/D state management.
RISC-V:
- Redirect AMO load/store misaligned traps to KVM guest
- Trap-n-emulate AIA in-kernel irqchip for KVM guest
- Svnapot support for KVM Guest
s390:
- New uvdevice secret API
- CMM selftest and fixes
- fix racy access to target CPU for diag 9c
x86:
- Fix missing/incorrect #GP checks on ENCLS
- Use standard mmu_notifier hooks for handling APIC access page
- Drop now unnecessary TR/TSS load after VM-Exit on AMD
- Print more descriptive information about the status of SEV and
SEV-ES during module load
- Add a test for splitting and reconstituting hugepages during and
after dirty logging
- Add support for CPU pinning in demand paging test
- Add support for AMD PerfMonV2, with a variety of cleanups and minor
fixes included along the way
- Add a "nx_huge_pages=never" option to effectively avoid creating NX
hugepage recovery threads (because nx_huge_pages=off can be toggled
at runtime)
- Move handling of PAT out of MTRR code and dedup SVM+VMX code
- Fix output of PIC poll command emulation when there's an interrupt
- Add a maintainer's handbook to document KVM x86 processes,
preferred coding style, testing expectations, etc.
- Misc cleanups, fixes and comments
Generic:
- Miscellaneous bugfixes and cleanups
Selftests:
- Generate dependency files so that partial rebuilds work as
expected"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (153 commits)
Documentation/process: Add a maintainer handbook for KVM x86
Documentation/process: Add a label for the tip tree handbook's coding style
KVM: arm64: Fix misuse of KVM_ARM_VCPU_POWER_OFF bit index
RISC-V: KVM: Remove unneeded semicolon
RISC-V: KVM: Allow Svnapot extension for Guest/VM
riscv: kvm: define vcpu_sbi_ext_pmu in header
RISC-V: KVM: Expose IMSIC registers as attributes of AIA irqchip
RISC-V: KVM: Add in-kernel virtualization of AIA IMSIC
RISC-V: KVM: Expose APLIC registers as attributes of AIA irqchip
RISC-V: KVM: Add in-kernel emulation of AIA APLIC
RISC-V: KVM: Implement device interface for AIA irqchip
RISC-V: KVM: Skeletal in-kernel AIA irqchip support
RISC-V: KVM: Set kvm_riscv_aia_nr_hgei to zero
RISC-V: KVM: Add APLIC related defines
RISC-V: KVM: Add IMSIC related defines
RISC-V: KVM: Implement guest external interrupt line management
KVM: x86: Remove PRIx* definitions as they are solely for user space
s390/uv: Update query for secret-UVCs
s390/uv: replace scnprintf with sysfs_emit
s390/uvdevice: Add 'Lock Secret Store' UVC
...
|
|
|
|
255006adb3 |
KVM VMX changes for 6.5:
- Fix missing/incorrect #GP checks on ENCLS - Use standard mmu_notifier hooks for handling APIC access page - Misc cleanups -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaLDYSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5ovYP/ib86UG9QXwoEKx0mIyLQ5q1jD+StvxH 18SIH62+MXAtmz2E+EmXIySW76diOKCngApJ11WTERPwpZYEpcITh2D2Jp/vwgk5 xUPK+WKYQs1SGpJu3wXhLE1u6mB7X9p7EaXRSKG67P7YK09gTaOik1/3h6oNrGO+ KI06reCQN1PstKTfrZXxYpRlfDc761YaAmSZ79Bg+bK9PisFqme7TJ2mAqNZPFPd E7ho/UOEyWRSyd5VMsuOUB760pMQ9edKrs+38xNDp5N+0Fh0ItTjuAcd2KVWMZyW Fk+CJq4kCqTlEik5OwcEHsTGJGBFscGPSO+T0YtVfSZDdtN/rHN7l8RGquOebVTG Ldm5bg4agu4lXsqqzMxn8J9SkbNg3xno79mMSc2185jS2HLt5Hu6PzQnQ2tEtHJQ IuovmssHOVKDoYODOg0tq8UMydgT3hAvC7YJCouubCjxUUw+22nhN3EDuAhbJhtT DgQNGT7GmsrKIWLEjbm6EpLLOdJdB7/U1MrEshLS015a/DUz4b3ZGYApneifJL8h nGE2Wu+36xGUVNLgDMdvd+R17WdyQa+f+9KjUGy71KelFV4vI4A3JwvH0aIsTyHZ LGlQBZqelc66GYwMiqVC0GYGRtrdgygQopfstvZJ3rYiHZV/mdhB5A0T4J2Xvh2Q bnDNzsSFdsH5 =PjYj -----END PGP SIGNATURE----- Merge tag 'kvm-x86-vmx-6.5' of https://github.com/kvm-x86/linux into HEAD KVM VMX changes for 6.5: - Fix missing/incorrect #GP checks on ENCLS - Use standard mmu_notifier hooks for handling APIC access page - Misc cleanups |
|
|
|
751d77fefa |
KVM x86/pmu changes for 6.5:
- Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes
included along the way
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaHFgSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5twMP/15ZJFqZVigVQoATJeeR9tWUuyJe95xM
lyfnTel91Sg8XOamdwBGi7jLpaDgj34Jm0cfM7/4LbJk2/taeaCLYmJd5w9FXvaw
EkytQGO85hVNe2XuY+h+XxSIxpflKxgFuUnOwcDk2QbKgASzNSG/mJ9ZBx8PNVXD
FnyOqpbbYDFspWWvUOAI/RkHnr/dALjXJsSUMvuh3nz5e1NTyubjCAZg+/bse2nR
s8FrcSh4B0Lg0h4r2fdJ4sAiM/qWhcCIhq5svyTAcUG0T4rMS40LrosJOw3wkBRM
dyZYXy6GEENeCFJPhenF1mTE1embFyZp89PV/FCNRZXODbnM4kheJFT9gucAjlKi
ZafRcutrkYIVf4lZCMofDfQGLX/GCEJnwUPKyGygIsPoDRrdR7OLrFycON5bxocr
9NBNG+2teQFbnt5irB/bBGojtIZtu3OEylkuRjQUQ3lJYQ5r6LddarI9acIu1SHt
4rRfh8QN5qmMvVblaQzggOr6BPtmPr8QqMEMFncaUMCsV/82hRAEfvj2rifGFJNo
Axz1ajMfirxyM45WzredUkzzsbphiiegPBELCLRZfHmaEhJ8P7t7wvri0bXt9YdI
vjSfX+6ulOgDC+xAazE0gEJO4Uh5+g3Y+1e0fr43ltWzUOWdCQskzD3LE9DkqIXj
KAaCuHYbYpIZ
=MwqV
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.5' of https://github.com/kvm-x86/linux into HEAD
KVM x86/pmu changes for 6.5:
- Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes
included along the way
|
|
|
|
36b68d360a |
KVM x86 changes for 6.5:
- Move handling of PAT out of MTRR code and dedup SVM+VMX code
- Fix output of PIC poll command emulation when there's an interrupt
- Fix a longstanding bug in the reporting of the number of entries returned by
KVM_GET_CPUID2
- Add a maintainer's handbook to document KVM x86 processes, preferred coding
style, testing expectations, etc.
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaGMMSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5iDIP/0PwY3J5odTEUTnAyuDFPimd5PBt9k/O
B414wdpSKVgzq+0An4qM9mKRnklVIh2p8QqQTvDhcBUg3xb6CX9xZ4ery7hp/T5O
tr5bAXs2AYX6jpxvsopt+w+E9j6fvkJhcJCRU9im3QbrqwUE+ecyU5OHvmv2n/GO
syVZJbPOYuoLPKDjlSMrScE6fWEl9UOvHc5BK/vafTeyisMG3vv1BSmJj6GuiNNk
TS1RRIg//cOZghQyDfdXt0azTmakNZyNn35xnoX9x8SRmdRykyUjQeHmeqWxPDso
kiGO+CGancfS57S6ZtCkJjqEWZ1o/zKdOxr8MMf/3nJhv4kY7/5XtlVoACv5soW9
bZEmNiXIaSbvKNMwAlLJxHFbLa1sMdSCb345CIuMdt5QiWJ53ZiTyIAJX6+eL+Zf
8nkeekgPf5VUs6Zt0RdRPyvo+W7Vp9BtI87yDXm1nQKpbys2pt6CD3YB/oF4QViG
a5cyGoFuqRQbS3nmbshIlR7EanTuxbhLZKrNrFnolZ5e624h3Cnk2hVsfTznVGiX
vNHWM80phk1CWB9McErrZVkGfjlyVyBL13CBB2XF7Dl6PfF6/N22a9bOuTJD3tvk
PlNx4hvZm3esvvyGpjfbSajTKYE8O7rxiE1KrF0BpZ5IUl5WSiTr6XCy/yI/mIeM
hay2IWhPOF2z
=D0BH
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.5' of https://github.com/kvm-x86/linux into HEAD
KVM x86 changes for 6.5:
* Move handling of PAT out of MTRR code and dedup SVM+VMX code
* Fix output of PIC poll command emulation when there's an interrupt
* Add a maintainer's handbook to document KVM x86 processes, preferred coding
style, testing expectations, etc.
* Misc cleanups
|
|
|
|
bc6cb4d5bc |
Locking changes for v6.5:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double().
The cmpxchg128() family of functions is basically & functionally
the same as cmpxchg_double(), but with a saner interface: instead
of a 6-parameter horror that forced u128 - u64/u64-halves layout
details on the interface and exposed users to complexity,
fragility & bugs, use a natural 3-parameter interface with u128 types.
- Restructure the generated atomic headers, and add
kerneldoc comments for all of the generic atomic{,64,_long}_t
operations. Generated definitions are much cleaner now,
and come with documentation.
- Implement lock_set_cmp_fn() on lockdep, for defining an ordering
when taking multiple locks of the same type. This gets rid of
one use of lockdep_set_novalidate_class() in the bcache code.
- Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended
variable shadowing generating garbage code on Clang on certain
ARM builds.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSav3wRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gDyxAAjCHQjpolrre7fRpyiTDwqzIKT27H04vQ
zrQVlVc42WBnn9pe8LthGy43/RvYvqlZvLoLONA4fMkuYriM6nSMsoZjeUmE+6Rs
QAElQC74P5YvEBOa67VNY3/M7sj22ftDe7ODtVV8OrnPjMk1sQNRvaK025Cs3yig
8MAI//hHGNmyVAp1dPYZMJNqxGCvluReLZ4SaUJFCMrg7YgUXgCBj/5Gi07TlKxn
sT8BFCssoEW/B9FXkh59B1t6FBCZoSy4XSZfsZe0uVAUJ4XDEOO+zBgaWFCedNQT
wP323ryBgMrkzUKA8j2/o5d3QnMA1GcBfHNNlvAl/fOfrxWXzDZnOEY26YcaLMa0
YIuRF/JNbPZlt6DCUVBUEvMPpfNYi18dFN0rat1a6xL2L4w+tm55y3mFtSsg76Ka
r7L2nWlRrAGXnuA+VEPqkqbSWRUSWOv5hT2Mcyb5BqqZRsxBETn6G8GVAzIO6j6v
giyfUdA8Z9wmMZ7NtB6usxe3p1lXtnZ/shCE7ZHXm6xstyZrSXaHgOSgAnB9DcuJ
7KpGIhhSODQSwC/h/J0KEpb9Pr/5jCWmXAQ2DWnZK6ndt1jUfFi8pfK58wm0AuAM
o9t8Mx3o8wZjbMdt6up9OIM1HyFiMx2BSaZK+8f/bWemHQ0xwez5g4k5O5AwVOaC
x9Nt+Tp0Ze4=
=DsYj
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()
The cmpxchg128() family of functions is basically & functionally the
same as cmpxchg_double(), but with a saner interface.
Instead of a 6-parameter horror that forced u128 - u64/u64-halves
layout details on the interface and exposed users to complexity,
fragility & bugs, use a natural 3-parameter interface with u128
types.
- Restructure the generated atomic headers, and add kerneldoc comments
for all of the generic atomic{,64,_long}_t operations.
The generated definitions are much cleaner now, and come with
documentation.
- Implement lock_set_cmp_fn() on lockdep, for defining an ordering when
taking multiple locks of the same type.
This gets rid of one use of lockdep_set_novalidate_class() in the
bcache code.
- Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable
shadowing generating garbage code on Clang on certain ARM builds.
* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc
percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg()
locking/atomic: treewide: delete arch_atomic_*() kerneldoc
locking/atomic: docs: Add atomic operations to the driver basic API documentation
locking/atomic: scripts: generate kerneldoc comments
docs: scripts: kernel-doc: accept bitwise negation like ~@var
locking/atomic: scripts: simplify raw_atomic*() definitions
locking/atomic: scripts: simplify raw_atomic_long*() definitions
locking/atomic: scripts: split pfx/name/sfx/order
locking/atomic: scripts: restructure fallback ifdeffery
locking/atomic: scripts: build raw_atomic_long*() directly
locking/atomic: treewide: use raw_atomic*_<op>()
locking/atomic: scripts: add trivial raw_atomic*_<op>()
locking/atomic: scripts: factor out order template generation
locking/atomic: scripts: remove leftover "${mult}"
locking/atomic: scripts: remove bogus order parameter
locking/atomic: xtensa: add preprocessor symbols
locking/atomic: x86: add preprocessor symbols
locking/atomic: sparc: add preprocessor symbols
locking/atomic: sh: add preprocessor symbols
...
|
|
|
|
ed3b7923a8 |
Scheduler changes for v6.5:
- Scheduler SMP load-balancer improvements:
- Avoid unnecessary migrations within SMT domains on hybrid systems.
Problem:
On hybrid CPU systems, (processors with a mixture of higher-frequency
SMT cores and lower-frequency non-SMT cores), under the old code
lower-priority CPUs pulled tasks from the higher-priority cores if
more than one SMT sibling was busy - resulting in many unnecessary
task migrations.
Solution:
The new code improves the load balancer to recognize SMT cores with more
than one busy sibling and allows lower-priority CPUs to pull tasks, which
avoids superfluous migrations and lets lower-priority cores inspect all SMT
siblings for the busiest queue.
- Implement the 'runnable boosting' feature in the EAS balancer: consider CPU
contention in frequency, EAS max util & load-balance busiest CPU selection.
This improves CPU utilization for certain workloads, while leaves other key
workloads unchanged.
- Scheduler infrastructure improvements:
- Rewrite the scheduler topology setup code by consolidating it
into the build_sched_topology() helper function and building
it dynamically on the fly.
- Resolve the local_clock() vs. noinstr complications by rewriting
the code: provide separate sched_clock_noinstr() and
local_clock_noinstr() functions to be used in instrumentation code,
and make sure it is all instrumentation-safe.
- Fixes:
- Fix a kthread_park() race with wait_woken()
- Fix misc wait_task_inactive() bugs unearthed by the -rt merge:
- Fix UP PREEMPT bug by unifying the SMP and UP implementations.
- Fix task_struct::saved_state handling.
- Fix various rq clock update bugs, unearthed by turning on the rq clock
debugging code.
- Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger by
creating enough cgroups, by removing the warnign and restricting
window size triggers to PSI file write-permission or CAP_SYS_RESOURCE.
- Propagate SMT flags in the topology when removing degenerate domain
- Fix grub_reclaim() calculation bug in the deadline scheduler code
- Avoid resetting the min update period when it is unnecessary, in
psi_trigger_destroy().
- Don't balance a task to its current running CPU in load_balance(),
which was possible on certain NUMA topologies with overlapping
groups.
- Fix the sched-debug printing of rq->nr_uninterruptible
- Cleanups:
- Address various -Wmissing-prototype warnings, as a preparation
to (maybe) enable this warning in the future.
- Remove unused code
- Mark more functions __init
- Fix shadow-variable warnings
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSatWQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j62xAAuGOx1LcDfRGC6WGQzp1zOdlsVQtnDvlS
qL58zYSHgizprpVQ3j87SBaG4CHCdvd2Bo36yW0lNZS4nd203qdq7fkrMb3hPP/w
egUQUzMegf5fF6BWldKeMjuHSt+twFQz/ZAKK8iSbAir6CHNAqbNst1oL0i/+Tyk
o33hBs1hT5tnbFb1NSVZkX4k+qT3LzTW4K2QgjjGtkScr6yHh2BdEVefyigWOjdo
9s02d00ll9a2r+F5txlN7Dnw6TN7rmTXGMOJU5bZvBE90/anNiAorMXHJdEKCyUR
u9+JtBdJWiCplGa/tSRcxT16ZW1VdtTnd9q66TDhXREd2UNDFqBEyg5Wl77K4Tlf
vKFajmj/to+cTbuv6m6TVR+zyXpdEpdL6F04P44U3qiJvDobBqeDNKHHIqpmbHXl
AXUXcPWTVAzXX1Ce5M+BeAgTBQ1T7C5tELILrTNQHJvO1s9VVBRFZ/l65Ps4vu7T
wIZ781IFuopk0zWqHovNvgKrJ7oFmOQQZFttQEe8n6nafkjI7u+IZ8FayiGaUMRr
4GawFGUCEdYh8z9qyslGKe8Q/Rphfk6hxMFRYUJpDmubQ0PkMeDjDGq77jDGl1PF
VqwSDEyOaBJs7Gqf/mem00JtzBmXhkhm1SEjggHMI2IQbr/eeBXoLQOn3CDapO/N
PiDbtX760ic=
=EWQA
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Scheduler SMP load-balancer improvements:
- Avoid unnecessary migrations within SMT domains on hybrid systems.
Problem:
On hybrid CPU systems, (processors with a mixture of
higher-frequency SMT cores and lower-frequency non-SMT cores),
under the old code lower-priority CPUs pulled tasks from the
higher-priority cores if more than one SMT sibling was busy -
resulting in many unnecessary task migrations.
Solution:
The new code improves the load balancer to recognize SMT cores
with more than one busy sibling and allows lower-priority CPUs
to pull tasks, which avoids superfluous migrations and lets
lower-priority cores inspect all SMT siblings for the busiest
queue.
- Implement the 'runnable boosting' feature in the EAS balancer:
consider CPU contention in frequency, EAS max util & load-balance
busiest CPU selection.
This improves CPU utilization for certain workloads, while leaves
other key workloads unchanged.
Scheduler infrastructure improvements:
- Rewrite the scheduler topology setup code by consolidating it into
the build_sched_topology() helper function and building it
dynamically on the fly.
- Resolve the local_clock() vs. noinstr complications by rewriting
the code: provide separate sched_clock_noinstr() and
local_clock_noinstr() functions to be used in instrumentation code,
and make sure it is all instrumentation-safe.
Fixes:
- Fix a kthread_park() race with wait_woken()
- Fix misc wait_task_inactive() bugs unearthed by the -rt merge:
- Fix UP PREEMPT bug by unifying the SMP and UP implementations
- Fix task_struct::saved_state handling
- Fix various rq clock update bugs, unearthed by turning on the rq
clock debugging code.
- Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger
by creating enough cgroups, by removing the warnign and restricting
window size triggers to PSI file write-permission or
CAP_SYS_RESOURCE.
- Propagate SMT flags in the topology when removing degenerate domain
- Fix grub_reclaim() calculation bug in the deadline scheduler code
- Avoid resetting the min update period when it is unnecessary, in
psi_trigger_destroy().
- Don't balance a task to its current running CPU in load_balance(),
which was possible on certain NUMA topologies with overlapping
groups.
- Fix the sched-debug printing of rq->nr_uninterruptible
Cleanups:
- Address various -Wmissing-prototype warnings, as a preparation to
(maybe) enable this warning in the future.
- Remove unused code
- Mark more functions __init
- Fix shadow-variable warnings"
* tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()
sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()
sched/core: Fixed missing rq clock update before calling set_rq_offline()
sched/deadline: Update GRUB description in the documentation
sched/deadline: Fix bandwidth reclaim equation in GRUB
sched/wait: Fix a kthread_park race with wait_woken()
sched/topology: Mark set_sched_topology() __init
sched/fair: Rename variable cpu_util eff_util
arm64/arch_timer: Fix MMIO byteswap
sched/fair, cpufreq: Introduce 'runnable boosting'
sched/fair: Refactor CPU utilization functions
cpuidle: Use local_clock_noinstr()
sched/clock: Provide local_clock_noinstr()
x86/tsc: Provide sched_clock_noinstr()
clocksource: hyper-v: Provide noinstr sched_clock()
clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX
x86/vdso: Fix gettimeofday masking
math64: Always inline u128 version of mul_u64_u64_shr()
s390/time: Provide sched_clock_noinstr()
loongarch: Provide noinstr sched_clock_read()
...
|
|
|
|
a306425708 |
KVM: x86: Update comments about MSR lists exposed to userspace
Refresh comments about msrs_to_save, emulated_msrs, and msr_based_features
to remove stale references left behind by commit
|
|
|
|
4a2771895c |
KVM: x86/svm/pmu: Add AMD PerfMonV2 support
If AMD Performance Monitoring Version 2 (PerfMonV2) is detected by the guest, it can use a new scheme to manage the Core PMCs using the new global control and status registers. In addition to benefiting from the PerfMonV2 functionality in the same way as the host (higher precision), the guest also can reduce the number of vm-exits by lowering the total number of MSRs accesses. In terms of implementation details, amd_is_valid_msr() is resurrected since three newly added MSRs could not be mapped to one vPMC. The possibility of emulating PerfMonV2 on the mainframe has also been eliminated for reasons of precision. Co-developed-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: drop "Based on the observed HW." comments] Link: https://lore.kernel.org/r/20230603011058.1038821-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
e12fa4b92a |
KVM: x86: Clean up: remove redundant bool conversions
As test_bit() returns bool, explicitly converting result to bool is unnecessary. Get rid of '!!'. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230605200158.118109-1-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
056b9919a1 |
KVM: x86: Use cpu_feature_enabled() for PKU instead of #ifdef
Replace an #ifdef on CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS with a cpu_feature_enabled() check on X86_FEATURE_PKU. The macro magic of DISABLED_MASK_BIT_SET() means that cpu_feature_enabled() provides the same end result (no code generated) when PKU is disabled by Kconfig. No functional change intended. Cc: Jon Kohler <jon@nutanix.com> Reviewed-by: Jon Kohler <jon@nutanix.com> Link: https://lore.kernel.org/r/20230602010550.785722-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
0a8a5f2c8c |
KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access page
Now that KVM honors past and in-progress mmu_notifier invalidations when reloading the APIC-access page, use KVM's "standard" invalidation hooks to trigger a reload and delete the one-off usage of invalidate_range(). Aside from eliminating one-off code in KVM, dropping KVM's use of invalidate_range() will allow common mmu_notifier to redefine the API to be more strictly focused on invalidating secondary TLBs that share the primary MMU's page tables. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230602011518.787006-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
02f1b0b736 |
KVM: x86: Correct the name for skipping VMENTER l1d flush
There is no VMENTER_L1D_FLUSH_NESTED_VM. It should be ARCH_CAP_SKIP_VMENTRY_L1DFLUSH. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230524061634.54141-3-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
9397fa2ea3 |
clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX
Currently hv_read_tsc_page_tsc() (ab)uses the (valid) time value of U64_MAX as an error return. This breaks the clean wrap-around of the clock. Modify the function signature to return a boolean state and provide another u64 pointer to store the actual time on success. This obviates the need to steal one time value and restores the full counter width. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.775630881@infradead.org |
|
|
|
0f613bfa82 |
locking/atomic: treewide: use raw_atomic*_<op>()
Now that we have raw_atomic*_<op>() definitions, there's no need to use arch_atomic*_<op>() definitions outside of the low-level atomic definitions. Move treewide users of arch_atomic*_<op>() over to the equivalent raw_atomic*_<op>(). There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230605070124.3741859-19-mark.rutland@arm.com |
|
|
|
8b703a49c9 |
KVM: x86: Account fastpath-only VM-Exits in vCPU stats
Increment vcpu->stat.exits when handling a fastpath VM-Exit without
going through any part of the "slow" path. Not bumping the exits stat
can result in wildly misleading exit counts, e.g. if the primary reason
the guest is exiting is to program the TSC deadline timer.
Fixes:
|
|
|
|
dee321977a |
KVM: x86: Move common handling of PAT MSR writes to kvm_set_msr_common()
Move the common check-and-set handling of PAT MSR writes out of vendor code and into kvm_set_msr_common(). This aligns writes with reads, which are already handled in common code, i.e. makes the handling of reads and writes symmetrical in common code. Alternatively, the common handling in kvm_get_msr_common() could be moved to vendor code, but duplicating code is generally undesirable (even though the duplicatated code is trivial in this case), and guest writes to PAT should be rare, i.e. the overhead of the extra function call is a non-issue in practice. Suggested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
bc7fe2f0b7 |
KVM: x86: Move PAT MSR handling out of mtrr.c
Drop handling of MSR_IA32_CR_PAT from mtrr.c now that SVM and VMX handle writes without bouncing through kvm_set_msr_common(). PAT isn't truly an MTRR even though it affects memory types, and more importantly KVM enables hardware virtualization of guest PAT (by NOT setting "ignore guest PAT") when a guest has non-coherent DMA, i.e. KVM doesn't need to zap SPTEs when the guest PAT changes. The read path is and always has been trivial, i.e. burying it in the MTRR code does more harm than good. WARN and continue for the PAT case in kvm_set_msr_common(), as that code is _currently_ reached if and only if KVM is buggy. Defer cleaning up the lack of symmetry between the read and write paths to a future patch. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
34a83deac3 |
KVM: x86: Use MTRR macros to define possible MTRR MSR ranges
Use the MTRR macros to identify the ranges of possible MTRR MSRs instead of bounding the ranges with a mismash of open coded values and unrelated MSR indices. Carving out the gap for the machine check MSRs in particular is confusing, as it's easy to incorrectly think the case statement handles MCE MSRs instead of skipping them. Drop the range-based funneling of MSRs between the end of the MCE MSRs and MTRR_DEF_TYPE, i.e. 0x2A0-0x2FF, and instead handle MTTR_DEF_TYPE as the one-off case that it is. Extract PAT (0x277) as well in anticipation of dropping PAT "handling" from the MTRR code. Keep the range-based handling for the variable+fixed MTRRs even though capturing unknown MSRs 0x214-0x24F is arguably "wrong". There is a gap in the fixed MTRRs, 0x260-0x267, i.e. the MTRR code needs to filter out unknown MSRs anyways, and using a single range generates marginally better code for the big switch statement. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
b9846a698c |
KVM: VMX: add MSR_IA32_TSX_CTRL into msrs_to_save
Add MSR_IA32_TSX_CTRL into msrs_to_save[] to explicitly tell userspace to
save/restore the register value during migration. Missing this may cause
userspace that relies on KVM ioctl(KVM_GET_MSR_INDEX_LIST) fail to port the
value to the target VM.
In addition, there is no need to add MSR_IA32_TSX_CTRL when
ARCH_CAP_TSX_CTRL_MSR is not supported in kvm_get_arch_capabilities(). So
add the checking in kvm_probe_msr_to_save().
Fixes:
|
|
|
|
c8c655c34e |
s390:
* More phys_to_virt conversions
* Improvement of AP management for VSIE (nested virtualization)
ARM64:
* Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
* New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features
being moved to VMMs rather than be implemented in the kernel.
* Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one.
This last part allows the NV timer code to be implemented on
top.
* A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
* The usual selftest fixes and improvements.
KVM x86 changes for 6.4:
* Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
and by giving the guest control of CR0.WP when EPT is enabled on VMX
(VMX-only because SVM doesn't support per-bit controls)
* Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
as a bool
* Move AMD_PSFD to cpufeatures.h and purge KVM's definition
* Avoid unnecessary writes+flushes when the guest is only adding new PTEs
* Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations
when emulating invalidations
* Clean up the range-based flushing APIs
* Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
changed SPTE" overhead associated with writing the entire entry
* Track the number of "tail" entries in a pte_list_desc to avoid having
to walk (potentially) all descriptors during insertion and deletion,
which gets quite expensive if the guest is spamming fork()
* Disallow virtualizing legacy LBRs if architectural LBRs are available,
the two are mutually exclusive in hardware
* Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
after KVM_RUN, similar to CPUID features
* Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES
* Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
x86 AMD:
* Add support for virtual NMIs
* Fixes for edge cases related to virtual interrupts
x86 Intel:
* Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is
not being reported due to userspace not opting in via prctl()
* Fix a bug in emulation of ENCLS in compatibility mode
* Allow emulation of NOP and PAUSE for L2
* AMX selftests improvements
* Misc cleanups
MIPS:
* Constify MIPS's internal callbacks (a leftover from the hardware enabling
rework that landed in 6.3)
Generic:
* Drop unnecessary casts from "void *" throughout kvm_main.c
* Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct
size by 8 bytes on 64-bit kernels by utilizing a padding hole
Documentation:
* Fix goof introduced by the conversion to rST
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmRNExkUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNyjwf+MkzDael9y9AsOZoqhEZ5OsfQYJ32
Im5ZVYsPRU2K5TuoWql6meIihgclCj1iIU32qYHa2F1WYt2rZ72rJp+HoY8b+TaI
WvF0pvNtqQyg3iEKUBKPA4xQ6mj7RpQBw86qqiCHmlfNt0zxluEGEPxH8xrWcfhC
huDQ+NUOdU7fmJ3rqGitCvkUbCuZNkw3aNPR8dhU8RAWrwRzP2hBOmdxIeo81WWY
XMEpJSijbGpXL9CvM0Jz9nOuMJwZwCCBGxg1vSQq0xTfLySNMxzvWZC2GFaBjucb
j0UOQ7yE0drIZDVhd3sdNslubXXU6FcSEzacGQb9aigMUon3Tem9SHi7Kw==
=S2Hq
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"s390:
- More phys_to_virt conversions
- Improvement of AP management for VSIE (nested virtualization)
ARM64:
- Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features being
moved to VMMs rather than be implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one. This
last part allows the NV timer code to be implemented on top.
- A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
x86:
- Optimize CR0.WP toggling by avoiding an MMU reload when TDP is
enabled, and by giving the guest control of CR0.WP when EPT is
enabled on VMX (VMX-only because SVM doesn't support per-bit
controls)
- Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long"
return as a bool
- Move AMD_PSFD to cpufeatures.h and purge KVM's definition
- Avoid unnecessary writes+flushes when the guest is only adding new
PTEs
- Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s
optimizations when emulating invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a
single A/D bit using a LOCK AND instead of XCHG, and skip all of
the "handle changed SPTE" overhead associated with writing the
entire entry
- Track the number of "tail" entries in a pte_list_desc to avoid
having to walk (potentially) all descriptors during insertion and
deletion, which gets quite expensive if the guest is spamming
fork()
- Disallow virtualizing legacy LBRs if architectural LBRs are
available, the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably
PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features
- Overhaul the vmx_pmu_caps selftest to better validate
PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
- AMD SVM:
- Add support for virtual NMIs
- Fixes for edge cases related to virtual interrupts
- Intel AMX:
- Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if
XTILE_DATA is not being reported due to userspace not opting in
via prctl()
- Fix a bug in emulation of ENCLS in compatibility mode
- Allow emulation of NOP and PAUSE for L2
- AMX selftests improvements
- Misc cleanups
MIPS:
- Constify MIPS's internal callbacks (a leftover from the hardware
enabling rework that landed in 6.3)
Generic:
- Drop unnecessary casts from "void *" throughout kvm_main.c
- Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the
struct size by 8 bytes on 64-bit kernels by utilizing a padding
hole
Documentation:
- Fix goof introduced by the conversion to rST"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
KVM: s390: pci: fix virtual-physical confusion on module unload/load
KVM: s390: vsie: clarifications on setting the APCB
KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
KVM: selftests: Test the PMU event "Instructions retired"
KVM: selftests: Copy full counter values from guest in PMU event filter test
KVM: selftests: Use error codes to signal errors in PMU event filter test
KVM: selftests: Print detailed info in PMU event filter asserts
KVM: selftests: Add helpers for PMC asserts in PMU event filter test
KVM: selftests: Add a common helper for the PMU event filter guest code
KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
KVM: arm64: vhe: Drop extra isb() on guest exit
KVM: arm64: vhe: Synchronise with page table walker on MMU update
KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
KVM: arm64: nvhe: Synchronise with page table walker on TLBI
KVM: arm64: Handle 32bit CNTPCTSS traps
KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
KVM: arm64: vgic: Don't acquire its_lock before config_lock
KVM: selftests: Add test to verify KVM's supported XCR0
...
|
|
|
|
f20730efbd |
SMP cross-CPU function-call updates for v6.4:
- Remove diagnostics and adjust config for CSD lock diagnostics
- Add a generic IPI-sending tracepoint, as currently there's no easy
way to instrument IPI origins: it's arch dependent and for some
major architectures it's not even consistently available.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRK438RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jJ5Q/5AZ0HGpyqwdFK8GmGznyu5qjP5HwV9pPq
gZQScqSy4tZEeza4TFMi83CoXSg9uJ7GlYJqqQMKm78LGEPomnZtXXC7oWvTA9M5
M/jAvzytmvZloSCXV6kK7jzSejMHhag97J/BjTYhZYQpJ9T+hNC87XO6J6COsKr9
lPIYqkFrIkQNr6B0U11AQfFejRYP1ics2fnbnZL86G/zZAc6x8EveM3KgSer2iHl
KbrO+xcYyGY8Ef9P2F72HhEGFfM3WslpT1yzqR3sm4Y+fuMG0oW3qOQuMJx0ZhxT
AloterY0uo6gJwI0P9k/K4klWgz81Tf/zLb0eBAtY2uJV9Fo3YhPHuZC7jGPGAy3
JusW2yNYqc8erHVEMAKDUsl/1KN4TE2uKlkZy98wno+KOoMufK5MA2e2kPPqXvUi
Jk9RvFolnWUsexaPmCftti0OCv3YFiviVAJ/t0pchfmvvJA2da0VC9hzmEXpLJVF
25nBTV/1uAOrWvOpCyo3ElrC2CkQVkFmK5rXMDdvf6ib0Nid4vFcCkCSLVfu+ePB
11mi7QYro+CcnOug1K+yKogUDmsZgV/u1kUwgQzTIpZ05Kkb49gUiXw9L2RGcBJh
yoDoiI66KPR7PWQ2qBdQoXug4zfEEtWG0O9HNLB0FFRC3hu7I+HHyiUkBWs9jasK
PA5+V7HcQRk=
=Wp7f
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP cross-CPU function-call updates from Ingo Molnar:
- Remove diagnostics and adjust config for CSD lock diagnostics
- Add a generic IPI-sending tracepoint, as currently there's no easy
way to instrument IPI origins: it's arch dependent and for some major
architectures it's not even consistently available.
* tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
trace,smp: Trace all smp_function_call*() invocations
trace: Add trace_ipi_send_cpu()
sched, smp: Trace smp callback causing an IPI
smp: reword smp call IPI comment
treewide: Trace IPIs sent via smp_send_reschedule()
irq_work: Trace self-IPIs sent via arch_irq_work_raise()
smp: Trace IPIs sent via arch_send_call_function_ipi_mask()
sched, smp: Trace IPIs sent via send_call_function_single_ipi()
trace: Add trace_ipi_send_cpumask()
kernel/smp: Make csdlock_debug= resettable
locking/csd_lock: Remove per-CPU data indirection from CSD lock debugging
locking/csd_lock: Remove added data from CSD lock debugging
locking/csd_lock: Add Kconfig option for csd_debug default
|
|
|
|
4a5fd41995 |
KVM SVM changes for 6.4:
- Add support for virtual NMIs - Fixes for edge cases related to virtual interrupts -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGuLISHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5NOMQAKy1Od54yzQsIKyAZZJVfOEm7N5VLQgz +jLilXgHd8dm/g0g/KVCDPFoZ/ut2Tf5Dn4WwyoPWOpgGsOyTwdDIJabf9rustkA goZFcfUXz+P1nangTidrj6CFYgGmVS13Uu//H19X4bSzT+YifVevJ4QkRVElj9Mh VBUeXppC/gMGBZ9tKEzl+AU3FwJ58cB88q4boovBFYiDdciv/fF86t02Lc+dCIX1 6hTcOAnjAcp3eJY0wPQJUAEScufDKcMf6tSrsB/yWXv9KB9ANXFNXry8/+lW/Ux/ oOUmUVdRXrrsRUqtYk9+KuMoIN7CL1SBV0RCm5ApqwqwnTVdHS+odHU3c2s7E/uU QXIW4vwSne3W9Y4YApDgFjwDwmzY85dvblWlWBnR2LW2I3Or48xK+S8LpWG+lj6l EDf7RzeqAipJ1qUq6qDYJlyg/YsyYlcoErtra423skg38HBWxQXdqkVIz3SYdKjA 0OcBQIRI28KzJDn1gU6P3Q0Wr/cKsx9EGy6+jWBhf4Yf3eHP7+3WUTrg/Up0q8ny 0j/+cbe5kBb6k2T9y2X6jm6TVbPV5FyMBOF/UxmqEbRLmxXjBe8tMnFwV+qN871I gk5HTSIkX39GU9kNA3h5HoWjdNeRfhazKR9ZVrELVc1zjHnGLthXBPZbIAUsPPMx vgM6jf8NwLXZ =9xNX -----END PGP SIGNATURE----- Merge tag 'kvm-x86-svm-6.4' of https://github.com/kvm-x86/linux into HEAD KVM SVM changes for 6.4: - Add support for virtual NMIs - Fixes for edge cases related to virtual interrupts |
|
|
|
c21775ae02 |
KVM selftests, and an AMX/XCR0 bugfix, for 6.4:
- Don't advertisze XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is
not being reported due to userspace not opting in via prctl()
- Overhaul the AMX selftests to improve coverage and cleanup the test
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGt50SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5MskP/2PhSrdgHxCwfpqpdVe/q5OWwFuhn3wG
f5QKMpEBg4wJFeIE3eGJEaDlg776nWtWDNgUmqdjoZ8vyyadkPX9CV2Y2Hq0M7Tw
d0gKPjQrz2BavyDYoPNfs4pfshs4EvDTswBkhdAt8KTZhGZosJOywQIp61V3ePqr
1rDP6C4+CmwTRAK0f7egslyJ2pZXiUcvhITvzx8XhIAQh6nEK4gUZ/l3hLmg38kD
Af23kiLnP8lHUUx4BQtRAnTw0SZXJ8DcKtoFkzEH8mdj4g6EqXpxy48zuyZcqWVi
4XIFr+WECPsV5gdqWN9rMDqIG2ib+2heKDmcdUptcVuvr1ktv0reQybmgVck4CKX
fTAdu86/LBaQmIHwNHaNFPwdUby4QQZ8ajafPC62oc+B6N1lQg8bbCwnvO6KGlGl
FaQTnzaZq7ft4tfQRXOMu1AbLZLK7dIqJHHhxR3MkBkd4MAcZ1bVKkvlJLqsOKNw
TEsreXErY7AsegZK73Rn4IN/CJGBof5bZ2NIchmiN+0UfMsd9zGn66Als6oRNh4E
tRUhFONPIEmydy9UB50qe6b98ElB6R++opZbvkVW2hy8lMy3iJrCvUbOs1nx3wbn
cxvIuTfw/dAFf70S03/zudf7lYHs2wKV1rrIAebyTd4NnvWdVB8OaSHgZswMgVjb
UzzQfnQ+u9so
=BY10
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-selftests-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM selftests, and an AMX/XCR0 bugfix, for 6.4:
- Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is
not being reported due to userspace not opting in via prctl()
- Overhaul the AMX selftests to improve coverage and cleanup the test
- Misc cleanups
|
|
|
|
48b1893ae3 |
KVM x86 PMU changes for 6.4:
- Disallow virtualizing legacy LBRs if architectural LBRs are available,
the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
after KVM_RUN, and overhaul the vmx_pmu_caps selftest to better
validate PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
- Misc cleanups and fixes
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGtd4SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5Z9kP/i3WZ40hevvQvB/5cEpxxmxYDwCYnnjM
hiQgK5jT4SrMTmVjLgkNdI2PogQoS4CX+GC7lcA9bvse84hjuPvgOflb2B+p2UQi
Ytbr9g/tfKNIpnKIk9mcPcSObN9vm2Kgt7n28rtPrHWj89eQzgc66eijqdpKBLxA
c3crVR8krwYAQK0tmzHq1+H6hB369YbHAHyTTRRI/bNWnqKblnvUbt0NL2aBusa9
rNMaOdRtinLpy2dmuX/b3japRB8QTnlf7zpPIF4cBEhbYXy5woClZpf1D2fCA6Er
XFbEoYawMVd9UeJYbW4z5yErLT83eYoGp4U0eFXWp6fvh8nZlgCGvBKE9g4mmqwj
aSLaTR5eVN2qlw6jXVeg3unCo8Eyl36AwYwve2L6sFmBvZvNV5iz2eQ7rrOe4oE3
dnTUaLQ8I2SVg04MbYmCq5W+frTL/I7kqNpbccL1Z3R5WO4y5gz63mug6NfLIvhR
t45TAIaifxBfcXQsBZM3v2KUK/xQrD3AbJmFKh54L2CKqiGaNWsMLX+6NZ7LZWgf
8rEqsVkkQDgF7z8eXai4TR26nYfSX6g9gDqtOH73L87aJ7PJk5cRoDWQ1sWs1e/l
4HA/L0Bo/3pnKAa0ZWxJOixmzqY49gNQf3dj8gt3jk3y2ijbAivshiSpPBmIxn0u
QLeOf/LGvipl
=m18F
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.4:
- Disallow virtualizing legacy LBRs if architectural LBRs are available,
the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
after KVM_RUN, and overhaul the vmx_pmu_caps selftest to better
validate PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
- Misc cleanups and fixes
|
|
|
|
807b758496 |
KVM x86 MMU changes for 6.4:
- Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the
guest is only adding new PTEs
- Overhaul .sync_page() and .invlpg() to share the .sync_page()
implementation, i.e. utilize .sync_page()'s optimizations when emulating
invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
changed SPTE" overhead associated with writing the entire entry
- Track the number of "tail" entries in a pte_list_desc to avoid having
to walk (potentially) all descriptors during insertion and deletion,
which gets quite expensive if the guest is spamming fork()
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGsvASHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5XnoP/0D8rQmrA0xPHK81zYS1E71tsR/itO/T
CQMSB4PhEqvcRUaWOuhLBRUW+noWzaOkjkMYK2uoPTdtme7v9+Ar7EtfrWYHrBWD
IxHCAymo3a5dQPUc3Nb77u6HjRAOokPSqSz5jE4qAjlniW09feruro2Phi+BTme4
JjxTc/7Oh0Fu26+mK7mJHiw3fV1x3YznnnRPrKGrVQes5L6ozNICkUZ6nvuJUVMk
lTNHNQbG8PqJZnfWG7VIKRn1vdfXwEfnvyucGVEqFfPLkOXqJHyqMVmIOtvsH7C5
l8j36+lBZwtFh2jk2EsXOTb6sS7l1MSvyHLlbaJaqqffP+77Hf1n0fROur0k9Yse
jJJejJWxZ/SvjMt/bOA+4ybGafZH0lt20DsDWnat5GSQ1EVT1CInN2p8OY8pdecR
QOJBqnNUOykC7/Pyad+IxTxwrOSNCYh+5aYG8AdGquZvNUEwjffVJqrmxDvklY8Z
DTYwGKgNY7NsP/dV0WYYElsAuHiKwiDZL15KftiQebO1fPcZDpTzDo83/8UMfGxh
yegngcNX9Qi7lWtLkUMy8A99UvejM0QrS/Zt8v1zjlQ8PjreZLLBWsNpe0ufIMRk
31ZAC2OS4Koi3wZ54tA7Z1Kh11meGhAk5Ti7sNke0rDqB9UMmj6UKw121cSRvW7q
W6O4U3YeGpKx
=zb4u
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mmu-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.4:
- Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the
guest is only adding new PTEs
- Overhaul .sync_page() and .invlpg() to share the .sync_page()
implementation, i.e. utilize .sync_page()'s optimizations when emulating
invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
changed SPTE" overhead associated with writing the entire entry
- Track the number of "tail" entries in a pte_list_desc to avoid having
to walk (potentially) all descriptors during insertion and deletion,
which gets quite expensive if the guest is spamming fork()
- Misc cleanups
|
|
|
|
a1c288f87d |
KVM x86 changes for 6.4:
- Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
and by giving the guest control of CR0.WP when EPT is enabled on VMX
(VMX-only because SVM doesn't support per-bit controls)
- Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
as a bool
- Move AMD_PSFD to cpufeatures.h and purge KVM's definition
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGr2sSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5b80P/2ayACpc7iV2DysXkrxOdn1JmMu9BeHd
3oMb7bydf79LMNAO+NKPqVjo74yZ/Lh8UyufJGgF3HnSCdumx5Iklyx6/2PUHu/I
8xT1H7VlIGQMcNy0G4hMus34ZcafJl4y+BXgMEqEErLcy3n598UvFGJ+C0/4lnux
2Gk7dLASHq/mVVKReBM/kD4RhCVy5Venz6zkk9KbwDLHAmfejVK5bSqDYAnO1WtV
IBWetxlVyMZCnfPV2drhzgNVwiHvYvCaMBW+cUk5cH8Z2r0VZVDERmc1D4/rd04t
xs9lMk6CdNU7REQfblA0xMgeO/dNAXq5Fs4FfcM8OTBZU32KKafPhgW1uj2Sv+9l
nbb1XxZ7C0EcBhKVbUD6zRl05vjHwxlRgoi0yWUqERthFKNXHV42JJgaNn4fxDYS
tOBKBNkM9z6tCGN2aZv6GwhsEyY2y7oLdbZUGK9/FM3mF1VBASms1BTwokJXTxCD
pkOpAGeN5hxOlC4/wl6iHJTrz9oaJUj5E5kMD1oK6oQJgnnfqH0kVTG/ui/OUtJg
8N3amYO/d7InFvuE0f9R6TqZVhTN2QefHmNJaEldsmYp1NMI8Ep8JIhQKRA2LZVE
CGRxyrPj5CESerAItAI6tshEre5W8aScEzhpmd6HgHmahhQJsCEj+3q/J8FPWLG/
iQ3GnggrklfU
=qj7D
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM x86 changes for 6.4:
- Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
and by giving the guest control of CR0.WP when EPT is enabled on VMX
(VMX-only because SVM doesn't support per-bit controls)
- Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
as a bool
- Move AMD_PSFD to cpufeatures.h and purge KVM's definition
- Misc cleanups
|
|
|
|
4f382a79a6 |
KVM/arm64 updates for 6.4
- Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. - New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. - Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. - A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. - The usual selftest fixes and improvements. -----BEGIN PGP SIGNATURE----- iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmRCZIwPHG1hekBrZXJu ZWwub3JnAAoJECPQ0LrRPXpDoZ8P/ioXAdDbAE4hTuyD2YdKJ3IGWN3pg52Z7xc2 rBXXFrbK9+n9FEc3AVdHoGsRPDP0Ynl+apj+aB0Klr/Fl0KKqac+W0ARX9rn1mI1 HjeygFPaGnXjMUp0BjeSLS+g3b0gebELJ6R1QEe1/MIPb8Se7M1y3ZpMWdhe0PPL vyzw3LZq2OAlLgWKZhAfhh03qdr2kqJxypYs6nMrcexfn8dXT78dsYKW1nXmqKcE 61Gg23MDPUoexYpUhm+ym5t8hltoI1di8faPmxEpaFzpSDyAg8V5vo6LiW9jn3cf RX0Sikk1laiRAhVbbIFCKC148vFyKxum3scpKyb91Qc+sK1kmIcxvEqlc6SfG9je +5ndZwAfXtW6SMSOyX8y5fXbee7M0sx3n3le9BNgwXfmLWg/GHXJ544dJgVIlf/e 0Z+8QnP1IUDfARR/b2FlW7A7XLzNHQzO379ekcAdUptbGwlS9CrW6SJ83QR7K6fB bh0aSSELKsD7pX8wnNyNACvmz2zL12ITlDKdZWUr8MSxyTjgVy7s0BDsQT3sbrA1 1sH++RvUWfC2k7tVT3vjZFzUDlPw3bnZmo5YMWRTMbXEdr1V5rDw5F5IXit13KeT 8bk0hnJgnLmyoX2A17v5dkFMIKD7p13tqDRdfFcn0ru63HIKxgkS3ITkDmsAQELK DHT7RBE0 =Bhta -----END PGP SIGNATURE----- Merge tag 'kvmarm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.4 - Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. - New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. - Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. - A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. - The usual selftest fixes and improvements. |
|
|
|
6be3ae45f5 |
KVM: x86: Add a helper to handle filtering of unpermitted XCR0 features
Add a helper, kvm_get_filtered_xcr0(), to dedup code that needs to account for XCR0 features that require explicit opt-in on a per-process basis. In addition to documenting when KVM should/shouldn't consult xstate_get_guest_group_perm(), the helper will also allow sanitizing the filtered XCR0 to avoid enumerating architecturally illegal XCR0 values, e.g. XTILE_CFG without XTILE_DATA. No functional changes intended. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> [sean: rename helper, move to x86.h, massage changelog] Reviewed-by: Aaron Lewis <aaronlewis@google.com> Tested-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20230405004520.421768-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
3a6de51a43 |
KVM: x86/pmu: WARN and bug the VM if PMU is refreshed after vCPU has run
Now that KVM disallows changing feature MSRs, i.e. PERF_CAPABILITIES,
after running a vCPU, WARN and bug the VM if the PMU is refreshed after
the vCPU has run.
Note, KVM has disallowed CPUID updates after running a vCPU since commit
|
|
|
|
0094f62c7e |
KVM: x86: Disallow writes to immutable feature MSRs after KVM_RUN
Disallow writes to feature MSRs after KVM_RUN to prevent userspace from changing the vCPU model after running the vCPU. Similar to guest CPUID, KVM uses feature MSRs to configure intercepts, determine what operations are/aren't allowed, etc. Changing the capabilities while the vCPU is active will at best yield unpredictable guest behavior, and at worst could be dangerous to KVM. Allow writing the current value, e.g. so that userspace can blindly set all MSRs when emulating RESET, and unconditionally allow writes to MSR_IA32_UCODE_REV so that userspace can emulate patch loads. Special case the VMX MSRs to keep the generic list small, i.e. so that KVM can do a linear walk of the generic list without incurring meaningful overhead. Cc: Like Xu <like.xu.linux@gmail.com> Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230311004618.920745-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
9eb6ba31db |
KVM: x86: Generate set of VMX feature MSRs using first/last definitions
Add VMX MSRs to the runtime list of feature MSRs by iterating over the range of emulated MSRs instead of manually defining each MSR in the "all" list. Using the range definition reduces the cost of emulating a new VMX MSR, e.g. prevents forgetting to add an MSR to the list. Extracting the VMX MSRs from the "all" list, which is a compile-time constant, also shrinks the list to the point where the compiler can heavily optimize code that iterates over the list. No functional change intended. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230311004618.920745-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
b1932c5c19 |
KVM: x86: Rename kvm_init_msr_list() to clarify it inits multiple lists
Rename kvm_init_msr_list() to kvm_init_msr_lists() to clarify that it initializes multiple lists: MSRs to save, emulated MSRs, and feature MSRs. No functional change intended. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230311004618.920745-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
da3db168fb |
KVM: x86: Virtualize FLUSH_L1D and passthrough MSR_IA32_FLUSH_CMD
Virtualize FLUSH_L1D so that the guest can use the performant L1D flush if one of the many mitigations might require a flush in the guest, e.g. Linux provides an option to flush the L1D when switching mms. Passthrough MSR_IA32_FLUSH_CMD for write when it's supported in hardware and exposed to the guest, i.e. always let the guest write it directly if FLUSH_L1D is fully supported. Forward writes to hardware in host context on the off chance that KVM ends up emulating a WRMSR, or in the really unlikely scenario where userspace wants to force a flush. Restrict these forwarded WRMSRs to the known command out of an abundance of caution. Passing through the MSR means the guest can throw any and all values at hardware, but doing so in host context is arguably a bit more dangerous. Link: https://lkml.kernel.org/r/CALMp9eTt3xzAEoQ038bJQ9LN0ZOXrSWsN7xnNUD%2B0SS%3DWwF7Pg%40mail.gmail.com Link: https://lore.kernel.org/all/20230201132905.549148-2-eesposit@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230322011440.2195485-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
903358c7ed |
KVM: x86: Move MSR_IA32_PRED_CMD WRMSR emulation to common code
Dedup the handling of MSR_IA32_PRED_CMD across VMX and SVM by moving the logic to kvm_set_msr_common(). Now that the MSR interception toggling is handled as part of setting guest CPUID, the VMX and SVM paths are identical. Opportunistically massage the code to make it a wee bit denser. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20230322011440.2195485-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
944a8dad8b |
KVM: x86: set "mitigate_smt_rsb" storage-class-specifier to static
smatch reports arch/x86/kvm/x86.c:199:20: warning: symbol 'mitigate_smt_rsb' was not declared. Should it be static? This variable is only used in one file so it should be static. Signed-off-by: Tom Rix <trix@redhat.com> Link: https://lore.kernel.org/r/20230404010141.1913667-1-trix@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
e65733b5c5 |
KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL
The 'longmode' field is a bit annoying as it blows an entire __u32 to represent a boolean value. Since other architectures are looking to add support for KVM_EXIT_HYPERCALL, now is probably a good time to clean it up. Redefine the field (and the remaining padding) as a set of flags. Preserve the existing ABI by using bit 0 to indicate if the guest was in long mode and requiring that the remaining 31 bits must be zero. Cc: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230404154050.2270077-2-oliver.upton@linux.dev |
|
|
|
52882b9c7a |
KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent
When introduced, IRQFD resampling worked on POWER8 with XICS. However
KVM on POWER9 has never implemented it - the compatibility mode code
("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native
XIVE mode does not handle INTx in KVM at all.
This moved the capability support advertising to platforms and stops
advertising it on XIVE, i.e. POWER9 and later.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Anup Patel <anup@brainfault.org>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20220504074807.3616813-1-aik@ozlabs.ru>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
6c41468c7c |
KVM: x86: Clear "has_error_code", not "error_code", for RM exception injection
When injecting an exception into a vCPU in Real Mode, suppress the error
code by clearing the flag that tracks whether the error code is valid, not
by clearing the error code itself. The "typo" was introduced by recent
fix for SVM's funky Paged Real Mode.
Opportunistically hoist the logic above the tracepoint so that the trace
is coherent with respect to what is actually injected (this was also the
behavior prior to the buggy commit).
Fixes:
|
|
|
|
0dc902267c |
KVM: x86: Suppress pending MMIO write exits if emulator detects exception
Clear vcpu->mmio_needed when injecting an exception from the emulator to squash a (legitimate) warning about vcpu->mmio_needed being true at the start of KVM_RUN without a callback being registered to complete the userspace MMIO exit. Suppressing the MMIO write exit is inarguably wrong from an architectural perspective, but it is the least awful hack-a-fix due to shortcomings in KVM's uAPI, not to mention that KVM already suppresses MMIO writes in this scenario. Outside of REP string instructions, KVM doesn't provide a way to resume an instruction at the exact point where it was "interrupted" if said instruction partially completed before encountering an MMIO access. For MMIO reads, KVM immediately exits to userspace upon detecting MMIO as userspace provides the to-be-read value in a buffer, and so KVM can safely (more or less) restart the instruction from the beginning. When the emulator re-encounters the MMIO read, KVM will service the MMIO by getting the value from the buffer instead of exiting to userspace, i.e. KVM won't put the vCPU into an infinite loop. On an emulated MMIO write, KVM finishes the instruction before exiting to userspace, as exiting immediately would ultimately hang the vCPU due to the aforementioned shortcoming of KVM not being able to resume emulation in the middle of an instruction. For the vast majority of _emulated_ instructions, deferring the userspace exit doesn't cause problems as very few x86 instructions (again ignoring string operations) generate multiple writes. But for instructions that generate multiple writes, e.g. PUSHA (multiple pushes onto the stack), deferring the exit effectively results in only the final write triggering an exit to userspace. KVM does support multiple MMIO "fragments", but only for page splits; if an instruction performs multiple distinct MMIO writes, the number of fragments gets reset when the next MMIO write comes along and any previous MMIO writes are dropped. Circling back to the warning, if a deferred MMIO write coincides with an exception, e.g. in this case a #SS due to PUSHA underflowing the stack after queueing a write to an MMIO page on a previous push, KVM injects the exceptions and leaves the deferred MMIO pending without registering a callback, thus triggering the splat. Sweep the problem under the proverbial rug as dropping MMIO writes is not unique to the exception scenario (see above), i.e. instructions like PUSHA are fundamentally broken with respect to MMIO, and have been since KVM's inception. Reported-by: zhangjianguo <zhangjianguo18@huawei.com> Reported-by: syzbot+760a73552f47a8cd0fd9@syzkaller.appspotmail.com Reported-by: syzbot+8accb43ddc6bd1f5713a@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230322141220.2206241-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
4c8c3c7f70 |
treewide: Trace IPIs sent via smp_send_reschedule()
To be able to trace invocations of smp_send_reschedule(), rename the
arch-specific definitions of it to arch_smp_send_reschedule() and wrap it
into an smp_send_reschedule() that contains a tracepoint.
Changes to include the declaration of the tracepoint were driven by the
following coccinelle script:
@func_use@
@@
smp_send_reschedule(...);
@include@
@@
#include <trace/events/ipi.h>
@no_include depends on func_use && !include@
@@
#include <...>
+
+ #include <trace/events/ipi.h>
[csky bits]
[riscv bits]
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/20230307143558.294354-6-vschneid@redhat.com
|
|
|
|
99b3086980 |
KVM: x86: Remove a redundant guest cpuid check in kvm_set_cr4()
If !guest_cpuid_has(vcpu, X86_FEATURE_PCID), CR4.PCIDE would have been in
vcpu->arch.cr4_guest_rsvd_bits and failed earlier kvm_is_valid_cr4() check.
Remove this meaningless check.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Fixes:
|
|
|
|
fa4c027a79 |
KVM: x86: Add support for SVM's Virtual NMI
Add support for SVM's Virtual NMIs implementation, which adds proper tracking of virtual NMI blocking, and an intr_ctrl flag that software can set to mark a virtual NMI as pending. Pending virtual NMIs are serviced by hardware if/when virtual NMIs become unblocked, i.e. act more or less like real NMIs. Introduce two new kvm_x86_ops callbacks so to support SVM's vNMI, as KVM needs to treat a pending vNMI as partially injected. Specifically, if two NMIs (for L1) arrive concurrently in KVM's software model, KVM's ABI is to inject one and pend the other. Without vNMI, KVM manually tracks the pending NMI and uses NMI windows to detect when the NMI should be injected. With vNMI, the pending NMI is simply stuffed into the VMCB and handed off to hardware. This means that KVM needs to be able to set a vNMI pending on-demand, and also query if a vNMI is pending, e.g. to honor the "at most one NMI pending" rule and to preserve all NMIs across save and restore. Warn if KVM attempts to open an NMI window when vNMI is fully enabled, as the above logic should prevent KVM from ever getting to kvm_check_and_inject_events() with two NMIs pending _in software_, and the "at most one NMI pending" logic should prevent having an NMI pending in hardware and an NMI pending in software if NMIs are also blocked, i.e. if KVM can't immediately inject the second NMI. Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com> Co-developed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20230227084016.3368-11-santosh.shukla@amd.com [sean: rewrite shortlog and changelog, massage code comments] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
bdedff2631 |
KVM: x86: Route pending NMIs from userspace through process_nmi()
Use the asynchronous NMI queue to handle pending NMIs coming in from userspace during KVM_SET_VCPU_EVENTS so that all of KVM's logic for handling multiple NMIs goes through process_nmi(). This will simplify supporting SVM's upcoming "virtual NMI" functionality, which will need changes KVM manages pending NMIs. Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
ab2ee212a5 |
KVM: x86: Save/restore all NMIs when multiple NMIs are pending
Save all pending NMIs in KVM_GET_VCPU_EVENTS, and queue KVM_REQ_NMI if one
or more NMIs are pending after KVM_SET_VCPU_EVENTS in order to re-evaluate
pending NMIs with respect to NMI blocking.
KVM allows multiple NMIs to be pending in order to faithfully emulate bare
metal handling of simultaneous NMIs (on bare metal, truly simultaneous
NMIs are impossible, i.e. one will always arrive first and be consumed).
Support for simultaneous NMIs botched the save/restore though. KVM only
saves one pending NMI, but allows userspace to restore 255 pending NMIs
as kvm_vcpu_events.nmi.pending is a u8, and KVM's internal state is stored
in an unsigned int.
Fixes:
|
|
|
|
400fee8c9b |
KVM: x86: Tweak the code and comment related to handling concurrent NMIs
Tweak the code and comment that deals with concurrent NMIs to explicitly call out that x86 allows exactly one pending NMI, but that KVM needs to temporarily allow two pending NMIs in order to workaround the fact that the target vCPU cannot immediately recognize an incoming NMI, unlike bare metal. No functional change intended. Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20230227084016.3368-7-santosh.shukla@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
2cb9317377 |
KVM: x86: Raise an event request when processing NMIs if an NMI is pending
Don't raise KVM_REQ_EVENT if no NMIs are pending at the end of process_nmi(). Finishing process_nmi() without a pending NMI will become much more likely when KVM gains support for AMD's vNMI, which allows pending vNMIs in hardware, i.e. doesn't require explicit injection. Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20230227084016.3368-6-santosh.shukla@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
607475cfa0 |
KVM: x86: Add helpers to query individual CR0/CR4 bits
Add helpers to check if a specific CR0/CR4 bit is set to avoid a plethora of implicit casts from the "unsigned long" return of kvm_read_cr*_bits(), and to make each caller's intent more obvious. Defer converting helpers that do truly ugly casts from "unsigned long" to "int", e.g. is_pse(), to a future commit so that their conversion is more isolated. Opportunistically drop the superfluous pcid_enabled from kvm_set_cr3(); the local variable is used only once, immediately after its declaration. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20230322045824.22970-2-binbin.wu@linux.intel.com [sean: move "obvious" conversions to this commit, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
e40bcf9f3a |
KVM: x86: Ignore CR0.WP toggles in non-paging mode
If paging is disabled, there are no permission bits to emulate. Micro-optimize this case to avoid unnecessary work. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230322013731.102955-4-minipli@grsecurity.net Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
01b31714bd |
KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
There is no need to unload the MMU roots with TDP enabled when only
CR0.WP has changed -- the paging structures are still valid, only the
permission bitmap needs to be updated.
One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
implement kernel W^X.
The optimization brings a huge performance gain for this case as the
following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
grsecurity L1 VM shows (runtime in seconds, lower is better):
legacy TDP shadow
kvm-x86/next@d8708b 8.43s 9.45s 70.3s
+patch 5.39s 5.63s 70.2s
For legacy MMU this is ~36% faster, for TDP MMU even ~40% faster. Also
TDP and legacy MMU now both have a similar runtime which vanishes the
need to disable TDP MMU for grsecurity.
Shadow MMU sees no measurable difference and is still slow, as expected.
[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-3-minipli@grsecurity.net
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
cd42853e95 |
kvm: x86/mmu: Use KVM_MMU_ROOT_XXX for kvm_mmu_invalidate_addr()
The @root_hpa for kvm_mmu_invalidate_addr() is called with @mmu->root.hpa or INVALID_PAGE where @mmu->root.hpa is to invalidate gva for the current root (the same meaning as KVM_MMU_ROOT_CURRENT) and INVALID_PAGE is to invalidate gva for all roots (the same meaning as KVM_MMU_ROOTS_ALL). Change the argument type of kvm_mmu_invalidate_addr() and use KVM_MMU_ROOT_XXX instead so that we can reuse the function for kvm_mmu_invpcid_gva() and nested_ept_invalidate_addr() for invalidating gva for different set of roots. No fuctionalities changed. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-9-jiangshanlai@gmail.com [sean: massage comment slightly] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
753b43c9d1 |
KVM: x86/mmu: Use 64-bit address to invalidate to fix a subtle bug
FNAME(invlpg)() and kvm_mmu_invalidate_gva() take a gva_t, i.e. unsigned long, as the type of the address to invalidate. On 32-bit kernels, the upper 32 bits of the GPA will get dropped when an L2 GPA address is invalidated in the shadowed nested TDP MMU. Convert it to u64 to fix the problem. Reported-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-2-jiangshanlai@gmail.com [sean: tweak changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
d8708b80fa |
KVM: Change return type of kvm_arch_vm_ioctl() to "int"
All kvm_arch_vm_ioctl() implementations now only deal with "int" types as return values, so we can change the return type of these functions to use "int" instead of "long". Signed-off-by: Thomas Huth <thuth@redhat.com> Acked-by: Anup Patel <anup@brainfault.org> Message-Id: <20230208140105.655814-7-thuth@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
c5edd753a0 |
KVM: x86: Remove the KVM_GET_NR_MMU_PAGES ioctl
The KVM_GET_NR_MMU_PAGES ioctl is quite questionable on 64-bit hosts since it fails to return the full 64 bits of the value that can be set with the corresponding KVM_SET_NR_MMU_PAGES call. Its "long" return value is truncated into an "int" in the kvm_arch_vm_ioctl() function. Since this ioctl also never has been used by userspace applications (QEMU, Google's internal VMM, kvmtool and CrosVM have been checked), it's likely the best if we remove this badly designed ioctl before anybody really tries to use it. Signed-off-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230208140105.655814-4-thuth@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
258d985f6e |
KVM: x86/mmu: Use EMULTYPE flag to track write #PFs to shadow pages
Use a new EMULTYPE flag, EMULTYPE_WRITE_PF_TO_SP, to track page faults on self-changing writes to shadowed page tables instead of propagating that information to the emulator via a semi-persistent vCPU flag. Using a flag in "struct kvm_vcpu_arch" is confusing, especially as implemented, as it's not at all obvious that clearing the flag only when emulation actually occurs is correct. E.g. if KVM sets the flag and then retries the fault without ever getting to the emulator, the flag will be left set for future calls into the emulator. But because the flag is consumed if and only if both EMULTYPE_PF and EMULTYPE_ALLOW_RETRY_PF are set, and because EMULTYPE_ALLOW_RETRY_PF is deliberately not set for direct MMUs, emulated MMIO, or while L2 is active, KVM avoids false positives on a stale flag since FNAME(page_fault) is guaranteed to be run and refresh the flag before it's ultimately consumed by the tail end of reexecute_instruction(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230202182817.407394-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
49d5759268 |
ARM:
- Provide a virtual cache topology to the guest to avoid
inconsistencies with migration on heterogenous systems. Non secure
software has no practical need to traverse the caches by set/way in
the first place.
- Add support for taking stage-2 access faults in parallel. This was an
accidental omission in the original parallel faults implementation,
but should provide a marginal improvement to machines w/o FEAT_HAFDBS
(such as hardware from the fruit company).
- A preamble to adding support for nested virtualization to KVM,
including vEL2 register state, rudimentary nested exception handling
and masking unsupported features for nested guests.
- Fixes to the PSCI relay that avoid an unexpected host SVE trap when
resuming a CPU when running pKVM.
- VGIC maintenance interrupt support for the AIC
- Improvements to the arch timer emulation, primarily aimed at reducing
the trap overhead of running nested.
- Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
interest of CI systems.
- Avoid VM-wide stop-the-world operations when a vCPU accesses its own
redistributor.
- Serialize when toggling CPACR_EL1.SMEN to avoid unexpected exceptions
in the host.
- Aesthetic and comment/kerneldoc fixes
- Drop the vestiges of the old Columbia mailing list and add [Oliver]
as co-maintainer
This also drags in arm64's 'for-next/sme2' branch, because both it and
the PSCI relay changes touch the EL2 initialization code.
RISC-V:
- Fix wrong usage of PGDIR_SIZE instead of PUD_SIZE
- Correctly place the guest in S-mode after redirecting a trap to the guest
- Redirect illegal instruction traps to guest
- SBI PMU support for guest
s390:
- Two patches sorting out confusion between virtual and physical
addresses, which currently are the same on s390.
- A new ioctl that performs cmpxchg on guest memory
- A few fixes
x86:
- Change tdp_mmu to a read-only parameter
- Separate TDP and shadow MMU page fault paths
- Enable Hyper-V invariant TSC control
- Fix a variety of APICv and AVIC bugs, some of them real-world,
some of them affecting architecurally legal but unlikely to
happen in practice
- Mark APIC timer as expired if its in one-shot mode and the count
underflows while the vCPU task was being migrated
- Advertise support for Intel's new fast REP string features
- Fix a double-shootdown issue in the emergency reboot code
- Ensure GIF=1 and disable SVM during an emergency reboot, i.e. give SVM
similar treatment to VMX
- Update Xen's TSC info CPUID sub-leaves as appropriate
- Add support for Hyper-V's extended hypercalls, where "support" at this
point is just forwarding the hypercalls to userspace
- Clean up the kvm->lock vs. kvm->srcu sequences when updating the PMU and
MSR filters
- One-off fixes and cleanups
- Fix and cleanup the range-based TLB flushing code, used when KVM is
running on Hyper-V
- Add support for filtering PMU events using a mask. If userspace
wants to restrict heavily what events the guest can use, it can now
do so without needing an absurd number of filter entries
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel Sapphire Rapids
- Fix a mostly benign overflow bug in SEV's send|receive_update_data()
- Move several SVM-specific flags into vcpu_svm
x86 Intel:
- Handle NMI VM-Exits before leaving the noinstr region
- A few trivial cleanups in the VM-Enter flows
- Stop enabling VMFUNC for L1 purely to document that KVM doesn't support
EPTP switching (or any other VM function) for L1
- Fix a crash when using eVMCS's enlighted MSR bitmaps
Generic:
- Clean up the hardware enable and initialization flow, which was
scattered around multiple arch-specific hooks. Instead, just
let the arch code call into generic code. Both x86 and ARM should
benefit from not having to fight common KVM code's notion of how
to do initialization.
- Account allocations in generic kvm_arch_alloc_vm()
- Fix a memory leak if coalesced MMIO unregistration fails
selftests:
- On x86, cache the CPU vendor (AMD vs. Intel) and use the info to emit
the correct hypercall instruction instead of relying on KVM to patch
in VMMCALL
- Use TAP interface for kvm_binary_stats_test and tsc_msrs_test
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmP2YA0UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPg/Qf+J6nT+TkIa+8Ei+fN1oMTDp4YuIOx
mXvJ9mRK9sQ+tAUVwvDz3qN/fK5mjsYbRHIDlVc5p2Q3bCrVGDDqXPFfCcLx1u+O
9U9xjkO4JxD2LS9pc70FYOyzVNeJ8VMGOBbC2b0lkdYZ4KnUc6e/WWFKJs96bK+H
duo+RIVyaMthnvbTwSv1K3qQb61n6lSJXplywS8KWFK6NZAmBiEFDAWGRYQE9lLs
VcVcG0iDJNL/BQJ5InKCcvXVGskcCm9erDszPo7w4Bypa4S9AMS42DHUaRZrBJwV
/WqdH7ckIz7+OSV0W1j+bKTHAFVTCjXYOM7wQykgjawjICzMSnnG9Gpskw==
=goe1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Provide a virtual cache topology to the guest to avoid
inconsistencies with migration on heterogenous systems. Non secure
software has no practical need to traverse the caches by set/way in
the first place
- Add support for taking stage-2 access faults in parallel. This was
an accidental omission in the original parallel faults
implementation, but should provide a marginal improvement to
machines w/o FEAT_HAFDBS (such as hardware from the fruit company)
- A preamble to adding support for nested virtualization to KVM,
including vEL2 register state, rudimentary nested exception
handling and masking unsupported features for nested guests
- Fixes to the PSCI relay that avoid an unexpected host SVE trap when
resuming a CPU when running pKVM
- VGIC maintenance interrupt support for the AIC
- Improvements to the arch timer emulation, primarily aimed at
reducing the trap overhead of running nested
- Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
interest of CI systems
- Avoid VM-wide stop-the-world operations when a vCPU accesses its
own redistributor
- Serialize when toggling CPACR_EL1.SMEN to avoid unexpected
exceptions in the host
- Aesthetic and comment/kerneldoc fixes
- Drop the vestiges of the old Columbia mailing list and add [Oliver]
as co-maintainer
RISC-V:
- Fix wrong usage of PGDIR_SIZE instead of PUD_SIZE
- Correctly place the guest in S-mode after redirecting a trap to the
guest
- Redirect illegal instruction traps to guest
- SBI PMU support for guest
s390:
- Sort out confusion between virtual and physical addresses, which
currently are the same on s390
- A new ioctl that performs cmpxchg on guest memory
- A few fixes
x86:
- Change tdp_mmu to a read-only parameter
- Separate TDP and shadow MMU page fault paths
- Enable Hyper-V invariant TSC control
- Fix a variety of APICv and AVIC bugs, some of them real-world, some
of them affecting architecurally legal but unlikely to happen in
practice
- Mark APIC timer as expired if its in one-shot mode and the count
underflows while the vCPU task was being migrated
- Advertise support for Intel's new fast REP string features
- Fix a double-shootdown issue in the emergency reboot code
- Ensure GIF=1 and disable SVM during an emergency reboot, i.e. give
SVM similar treatment to VMX
- Update Xen's TSC info CPUID sub-leaves as appropriate
- Add support for Hyper-V's extended hypercalls, where "support" at
this point is just forwarding the hypercalls to userspace
- Clean up the kvm->lock vs. kvm->srcu sequences when updating the
PMU and MSR filters
- One-off fixes and cleanups
- Fix and cleanup the range-based TLB flushing code, used when KVM is
running on Hyper-V
- Add support for filtering PMU events using a mask. If userspace
wants to restrict heavily what events the guest can use, it can now
do so without needing an absurd number of filter entries
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel Sapphire Rapids
- Fix a mostly benign overflow bug in SEV's
send|receive_update_data()
- Move several SVM-specific flags into vcpu_svm
x86 Intel:
- Handle NMI VM-Exits before leaving the noinstr region
- A few trivial cleanups in the VM-Enter flows
- Stop enabling VMFUNC for L1 purely to document that KVM doesn't
support EPTP switching (or any other VM function) for L1
- Fix a crash when using eVMCS's enlighted MSR bitmaps
Generic:
- Clean up the hardware enable and initialization flow, which was
scattered around multiple arch-specific hooks. Instead, just let
the arch code call into generic code. Both x86 and ARM should
benefit from not having to fight common KVM code's notion of how to
do initialization
- Account allocations in generic kvm_arch_alloc_vm()
- Fix a memory leak if coalesced MMIO unregistration fails
selftests:
- On x86, cache the CPU vendor (AMD vs. Intel) and use the info to
emit the correct hypercall instruction instead of relying on KVM to
patch in VMMCALL
- Use TAP interface for kvm_binary_stats_test and tsc_msrs_test"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (325 commits)
KVM: SVM: hyper-v: placate modpost section mismatch error
KVM: x86/mmu: Make tdp_mmu_allowed static
KVM: arm64: nv: Use reg_to_encoding() to get sysreg ID
KVM: arm64: nv: Only toggle cache for virtual EL2 when SCTLR_EL2 changes
KVM: arm64: nv: Filter out unsupported features from ID regs
KVM: arm64: nv: Emulate EL12 register accesses from the virtual EL2
KVM: arm64: nv: Allow a sysreg to be hidden from userspace only
KVM: arm64: nv: Emulate PSTATE.M for a guest hypervisor
KVM: arm64: nv: Add accessors for SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2
KVM: arm64: nv: Handle SMCs taken from virtual EL2
KVM: arm64: nv: Handle trapped ERET from virtual EL2
KVM: arm64: nv: Inject HVC exceptions to the virtual EL2
KVM: arm64: nv: Support virtual EL2 exceptions
KVM: arm64: nv: Handle HCR_EL2.NV system register traps
KVM: arm64: nv: Add nested virt VCPU primitives for vEL2 VCPU state
KVM: arm64: nv: Add EL2 system registers to vcpu context
KVM: arm64: nv: Allow userspace to set PSR_MODE_EL2x
KVM: arm64: nv: Reset VCPU to EL2 registers if VCPU nested virt is set
KVM: arm64: nv: Introduce nested virtualization VCPU feature
KVM: arm64: Use the S2 MMU context to iterate over S2 table
...
|
|
|
|
877934769e |
- Cache the AMD debug registers in per-CPU variables to avoid MSR writes
where possible, when supporting a debug registers swap feature for SEV-ES guests - Add support for AMD's version of eIBRS called Automatic IBRS which is a set-and-forget control of indirect branch restriction speculation resources on privilege change - Add support for a new x86 instruction - LKGS - Load kernel GS which is part of the FRED infrastructure - Reset SPEC_CTRL upon init to accomodate use cases like kexec which rediscover - Other smaller fixes and cleanups -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmP1RDIACgkQEsHwGGHe VUohBw//ZB9ZRqsrKdm6D9YaP2x4Zb+kqKqo6rjYeWaYqyPyCwDujPwh+pb3Oq1t aj62muDv1t/wEJc8mKNkfXkjEEtBVAOcpb5YIpKreoEvNKyevol83Ih0u5iJcTRE E5qf8HDS8b/JZrcazJJLl6WQmQNH5RiKSu5bbCpRhoeOcyo5pRYR5MztK9vNmAQk GMdwHsUSU+jN8uiE4HnpaOb/luhgFindRwZVTpdjJegQWLABS8cl3CKeTv4+PW45 isvv37XnQP248wsptIEVRHeG6g3g/HtvwRx7DikUw06QwUyUK7H9hJssOoSP8TL9 u4psRwfWnJ1OxU6klL+s0Ii+pjQ97wXmK/oqK7QkdUwhWqR/mQAW2e9kWHAngyDn A6mKbzSM6HFAeSXQpB9cMb6uvYRD44SngDFe3WXtEK8jiiQ70ikUm4E28I5KJOPg s+RyioHk0NFRHYSOOBqNG1NKz6ED7L3GbgbbzxkgMh21AAyI3X351t+PtGoLV5ew eqOsM7lbg9Scg1LvPk1JcoALS8USWqgar397rz9qGUs+OkPWBtEBCmTdMz/Eb+2t g/WHdLS5/ajSs5gNhT99W3DeqZMPDEkgBRSeyBBmY3CUD3gBL2wXEktRXv504zBR RC4oyUPX3c9E2ib6GATLE3kBLbcz9hTWbMxF+X3lLJvTVd/Qc2o= =v/ZC -----END PGP SIGNATURE----- Merge tag 'x86_cpu_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpuid updates from Borislav Petkov: - Cache the AMD debug registers in per-CPU variables to avoid MSR writes where possible, when supporting a debug registers swap feature for SEV-ES guests - Add support for AMD's version of eIBRS called Automatic IBRS which is a set-and-forget control of indirect branch restriction speculation resources on privilege change - Add support for a new x86 instruction - LKGS - Load kernel GS which is part of the FRED infrastructure - Reset SPEC_CTRL upon init to accomodate use cases like kexec which rediscover - Other smaller fixes and cleanups * tag 'x86_cpu_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/amd: Cache debug register values in percpu variables KVM: x86: Propagate the AMD Automatic IBRS feature to the guest x86/cpu: Support AMD Automatic IBRS x86/cpu, kvm: Add the SMM_CTL MSR not present feature x86/cpu, kvm: Add the Null Selector Clears Base feature x86/cpu, kvm: Move X86_FEATURE_LFENCE_RDTSC to its native leaf x86/cpu, kvm: Add the NO_NESTED_DATA_BP feature KVM: x86: Move open-coded CPUID leaf 0x80000021 EAX bit propagation code x86/cpu, kvm: Add support for CPUID_80000021_EAX x86/gsseg: Add the new <asm/gsseg.h> header to <asm/asm-prototypes.h> x86/gsseg: Use the LKGS instruction if available for load_gs_index() x86/gsseg: Move load_gs_index() to its own new header file x86/gsseg: Make asm_load_gs_index() take an u16 x86/opcode: Add the LKGS instruction to x86-opcode-map x86/cpufeature: Add the CPU feature bit for LKGS x86/bugs: Reset speculation control settings on init x86/cpu: Remove redundant extern x86_read_arch_cap_msr() |
|
|
|
2c10b61421 |
kvm: initialize all of the kvm_debugregs structure before sending it to userspace
When calling the KVM_GET_DEBUGREGS ioctl, on some configurations, there might be some unitialized portions of the kvm_debugregs structure that could be copied to userspace. Prevent this as is done in the other kvm ioctls, by setting the whole structure to 0 before copying anything into it. Bonus is that this reduces the lines of code as the explicit flag setting and reserved space zeroing out can be removed. Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <x86@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: stable <stable@kernel.org> Reported-by: Xingyuan Mo <hdthky0@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Message-Id: <20230214103304.3689213-1-gregkh@linuxfoundation.org> Tested-by: Xingyuan Mo <hdthky0@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
4bc6dcaa15 |
KVM SVM changes for 6.3:
- Fix a mostly benign overflow bug in SEV's send|receive_update_data()
- Move the SVM-specific "host flags" into vcpu_svm (extracted from the
vNMI enabling series)
- A handful for fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmPsH7kSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5/dQQAJSVCYA7F7LcfJf+c1ULG0XCd8rHnXdR
EGTygTnWrzwTCRaunOBPE4AJRxKdkwTKy+yfnnQVRdYYfRe1SZKpUQ2XNEDGn0+v
zVfOmSFFcCWXmJeY8y5n1GlDH4ENO3G7nD1ncDQ0I9PazmOsmxChoVZ9afFJ5bpo
73hjcYVfUDYxGkeRLWSSSFtWIGguE8BkpRH3wZ8MGZi+ueoFUJPPBKHeDtxCV2/T
KcJLne8tQVTiWCdMO3EFwxgIvsjQDoT0gZYLYNHJ6KqD9Szc3jA9v2ryTm5IYlpb
akYUqePaD0SGrrfDBrwz3bLu3fehDu7eduXESRlzzb8S4xP7/qXeo9KeVN+b4MBb
nmBBFncvMWbC8Po5wB5OVAfAa7ACmGiXeBV8pfgGI6FTq1fpc4VNm2PevKkDvlqN
O2eZ1KuNkwBnbIPj3JVPPnsJcUjYXFjZyzfpMV1T+ExmL/IYceatX4S7zfgL5nUg
3qFi5mX2Cufk2EBvBu+Dkpt/H4lze+ysZRciMC+v7Q4LWAYZ8HW1a44pnBVUJMPM
bWiJ1/O8RIWM1tWIrlO38+ZZalbu3spIVMBXKzqEGXvpUwJ4UgZM1tFiWvISTVFe
2X6N3d7aT/DQ1PzZU6BsyVZWAFaodHBauMcr9FUkWqqGu3HOhqC4rSJZ9eRR7V5O
WSp1gTVY1JXy
=AVpx
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.3' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.3:
- Fix a mostly benign overflow bug in SEV's send|receive_update_data()
- Move the SVM-specific "host flags" into vcpu_svm (extracted from the
vNMI enabling series)
- A handful for fixes and cleanups
|
|
|
|
157ed9cb04 |
KVM x86 PMU changes for 6.3:
- Add support for created masked events for the PMU filter to allow
userspace to heavily restrict what events the guest can use without
needing to create an absurd number of events
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel SPR
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmPsFZ4SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5eKEP/0qeZsOQot53wkf+wNiGh1X6qDacBPFP
A8GAPC70fEisxAt776DeKEBwikHpARPglCt1Il9dFvkG+0jgYpvPu8UGF1LpouKX
cD/7itr2k8GZlXZBg2Rgu3TRyFBJEGHT6tAu7PBhZyL6yWQDUxao8FPFrRGfmJ7O
Z6eFMo1cERNHICQm+W/2TBd1xguiF+m4CXKlA70R4wzM37aPF9o5HvmIwAvPzyhU
w4WzcIQbjVPs1VpBTzwPqRmyZ8omSlDYo7VqmsDiRtJbucqgbhFI2wR+nyImFCa9
D2pI5TV3CFTt0fvd8SZpH19nR3S6cMLCXONOsijmvR2BmS3PhJMP4dMm5m4R06nF
RBtnTj9fkbeL1ghFEkMxHBZVTG3bBlO4ySOxIqNHCvPjqQ37mJ+xP4C8kcIC9p5F
+xL3AvZ7zenPv3A29SY9YH+QvZLBwyDJzAsveLeYkLFoJxoDT4glOY/Wpi1rkZ17
/zHDZWoF49l1Eu3Bql0hFetkCreUNFGpa4moUmEC0evYOvV2WCb+39TDXZ8CPCGD
+cDiRnD8MFQpBw47F03EnFheFHxiJoL0Clv5vvM3C+xOq2J9WVG9mqQWCk+4ta2B
Um4D++0a9lwvJhOImaR7uyiV3K7oVm+rU8+46x+nTNGaIP2bnE+vronY+b6KGeUx
7+xzTKlYygGe
=ev5v
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.3' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.3:
- Add support for created masked events for the PMU filter to allow
userspace to heavily restrict what events the guest can use without
needing to create an absurd number of events
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel SPR
|
|
|
|
6f0f2d5ef8 |
KVM: x86: Mitigate the cross-thread return address predictions bug
By default, KVM/SVM will intercept attempts by the guest to transition out of C0. However, the KVM_CAP_X86_DISABLE_EXITS capability can be used by a VMM to change this behavior. To mitigate the cross-thread return address predictions bug (X86_BUG_SMT_RSB), a VMM must not be allowed to override the default behavior to intercept C0 transitions. Use a module parameter to control the mitigation on processors that are vulnerable to X86_BUG_SMT_RSB. If the processor is vulnerable to the X86_BUG_SMT_RSB bug and the module parameter is set to mitigate the bug, KVM will not allow the disabling of the HLT, MWAIT and CSTATE exits. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <4019348b5e07148eb4d593380a5f6713b93c9a16.1675956146.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
e73ba25fdc |
KVM: x86: Simplify msr_io()
As of commit
|
|
|
|
4559e6cf45 |
KVM: x86: Remove unnecessary initialization in kvm_vm_ioctl_set_msr_filter()
Do not initialize the value of `r`, as it will be overwritten. Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230107001256.2365304-6-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
1fdefb8bd8 |
KVM: x86: Explicitly state lockdep condition of msr_filter update
Replace `1` with the actual mutex_is_locked() check. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230107001256.2365304-5-mhal@rbox.co [sean: delete the comment that explained the hardocded '1'] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
4d85cfcaa8 |
KVM: x86: Simplify msr_filter update
Replace srcu_dereference()+rcu_assign_pointer() sequence with a single rcu_replace_pointer(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230107001256.2365304-4-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
708f799d22 |
KVM: x86: Optimize kvm->lock and SRCU interaction (KVM_X86_SET_MSR_FILTER)
Reduce time spent holding kvm->lock: unlock mutex before calling synchronize_srcu(). There is no need to hold kvm->lock until all vCPUs have been kicked, KVM only needs to guarantee that all vCPUs will switch to the new filter before exiting to userspace. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230107001256.2365304-3-mhal@rbox.co [sean: expand changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
32e69f232d |
KVM: x86: Use emulator callbacks instead of duplicating "host flags"
Instead of re-defining the "host flags" bits, just expose dedicated helpers for each of the two remaining flags that are consumed by the emulator. The emulator never consumes both "is guest" and "is SMM" in close proximity, so there is no motivation to avoid additional indirect branches. Also while at it, garbage collect the recently removed host flags. No functional change is intended. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20221129193717.513824-6-mlevitsk@redhat.com [sean: fix CONFIG_KVM_SMM=n builds, tweak names of wrappers] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
2de154f541 |
KVM: x86/pmu: Provide "error" semantics for unsupported-but-known PMU MSRs
Provide "error" semantics (read zeros, drop writes) for userspace accesses to MSRs that are ultimately unsupported for whatever reason, but for which KVM told userspace to save and restore the MSR, i.e. for MSRs that KVM included in KVM_GET_MSR_INDEX_LIST. Previously, KVM special cased a few PMU MSRs that were problematic at one point or another. Extend the treatment to all PMU MSRs, e.g. to avoid spurious unsupported accesses. Note, the logic can also be used for non-PMU MSRs, but as of today only PMU MSRs can end up being unsupported after KVM told userspace to save and restore them. Link: https://lore.kernel.org/r/20230124234905.3774678-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
e33b6d79ac |
KVM: x86/pmu: Don't tell userspace to save MSRs for non-existent fixed PMCs
Limit the set of MSRs for fixed PMU counters based on the number of fixed counters actually supported by the host so that userspace doesn't waste time saving and restoring dummy values. Signed-off-by: Like Xu <likexu@tencent.com> [sean: split for !enable_pmu logic, drop min(), write changelog] Link: https://lore.kernel.org/r/20230124234905.3774678-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
c3531edc79 |
KVM: x86/pmu: Don't tell userspace to save PMU MSRs if PMU is disabled
Omit all PMU MSRs from the "MSRs to save" list if the PMU is disabled so that userspace doesn't waste time saving and restoring dummy values. KVM provides "error" semantics (read zeros, drop writes) for such known-but- unsupported MSRs, i.e. has fudged around this issue for quite some time. Keep the "error" semantics as-is for now, the logic will be cleaned up in a separate patch. Cc: Aaron Lewis <aaronlewis@google.com> Cc: Weijiang Yang <weijiang.yang@intel.com> Link: https://lore.kernel.org/r/20230124234905.3774678-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
2374b7310b |
KVM: x86/pmu: Use separate array for defining "PMU MSRs to save"
Move all potential to-be-saved PMU MSRs into a separate array so that a future patch can easily omit all PMU MSRs from the list when the PMU is disabled. No functional change intended. Link: https://lore.kernel.org/r/20230124234905.3774678-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
e76ae52747 |
KVM: x86/pmu: Gate all "unimplemented MSR" prints on report_ignored_msrs
Add helpers to print unimplemented MSR accesses and condition all such
prints on report_ignored_msrs, i.e. honor userspace's request to not
print unimplemented MSRs. Even though vcpu_unimpl() is ratelimited,
printing can still be problematic, e.g. if a print gets stalled when host
userspace is writing MSRs during live migration, an effective stall can
result in very noticeable disruption in the guest.
E.g. the profile below was taken while calling KVM_SET_MSRS on the PMU
counters while the PMU was disabled in KVM.
- 99.75% 0.00% [.] __ioctl
- __ioctl
- 99.74% entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_ioctl
- do_vfs_ioctl
- 92.48% kvm_vcpu_ioctl
- kvm_arch_vcpu_ioctl
- 85.12% kvm_set_msr_ignored_check
svm_set_msr
kvm_set_msr_common
printk
vprintk_func
vprintk_default
vprintk_emit
console_unlock
call_console_drivers
univ8250_console_write
serial8250_console_write
uart_console_write
Reported-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230124234905.3774678-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
|
|
8911ce6669 |
KVM: x86/pmu: Cap kvm_pmu_cap.num_counters_gp at KVM's internal max
Limit kvm_pmu_cap.num_counters_gp during kvm_init_pmu_capability() based on the vendor PMU capabilities so that consuming num_counters_gp naturally does the right thing. This fixes a mostly theoretical bug where KVM could over-report its PMU support in KVM_GET_SUPPORTED_CPUID for leaf 0xA, e.g. if the number of counters reported by perf is greater than KVM's hardcoded internal limit. Incorporating input from the AMD PMU also avoids over-reporting MSRs to save when running on AMD. Link: https://lore.kernel.org/r/20230124234905.3774678-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
8c19b6f257 |
KVM: x86: Propagate the AMD Automatic IBRS feature to the guest
Add the AMD Automatic IBRS feature bit to those being propagated to the guest, and enable the guest EFER bit. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230124163319.2277355-9-kim.phillips@amd.com |
|
|
|
2eb398df77 |
KVM: x86: Replace IS_ERR() with IS_ERR_VALUE()
Avoid type casts that are needed for IS_ERR() and use IS_ERR_VALUE() instead. Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Link: https://lore.kernel.org/r/202211161718436948912@zte.com.cn Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
14329b825f |
KVM: x86/pmu: Introduce masked events to the pmu event filter
When building a list of filter events, it can sometimes be a challenge to fit all the events needed to adequately restrict the guest into the limited space available in the pmu event filter. This stems from the fact that the pmu event filter requires each event (i.e. event select + unit mask) be listed, when the intention might be to restrict the event select all together, regardless of it's unit mask. Instead of increasing the number of filter events in the pmu event filter, add a new encoding that is able to do a more generalized match on the unit mask. Introduce masked events as another encoding the pmu event filter understands. Masked events has the fields: mask, match, and exclude. When filtering based on these events, the mask is applied to the guest's unit mask to see if it matches the match value (i.e. umask & mask == match). The exclude bit can then be used to exclude events from that match. E.g. for a given event select, if it's easier to say which unit mask values shouldn't be filtered, a masked event can be set up to match all possible unit mask values, then another masked event can be set up to match the unit mask values that shouldn't be filtered. Userspace can query to see if this feature exists by looking for the capability, KVM_CAP_PMU_EVENT_MASKED_EVENTS. This feature is enabled by setting the flags field in the pmu event filter to KVM_PMU_EVENT_FLAG_MASKED_EVENTS. Events can be encoded by using KVM_PMU_ENCODE_MASKED_ENTRY(). It is an error to have a bit set outside the valid bits for a masked event, and calls to KVM_SET_PMU_EVENT_FILTER will return -EINVAL in such cases, including the high bits of the event select (35:32) if called on Intel. With these updates the filter matching code has been updated to match on a common event. Masked events were flexible enough to handle both event types, so they were used as the common event. This changes how guest events get filtered because regardless of the type of event used in the uAPI, they will be converted to masked events. Because of this there could be a slight performance hit because instead of matching the filter event with a lookup on event select + unit mask, it does a lookup on event select then walks the unit masks to find the match. This shouldn't be a big problem because I would expect the set of common event selects to be small, and if they aren't the set can likely be reduced by using masked events to generalize the unit mask. Using one type of event when filtering guest events allows for a common code path to be used. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20221220161236.555143-5-aaronlewis@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
f422f853af |
KVM: x86/xen: update Xen CPUID Leaf 4 (tsc info) sub-leaves, if present
The scaling information in subleaf 1 should match the values set by KVM in the 'vcpu_info' sub-structure 'time_info' (a.k.a. pvclock_vcpu_time_info) which is shared with the guest, but is not directly available to the VMM. The offset values are not set since a TSC offset is already applied. The TSC frequency should also be set in sub-leaf 2. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20230106103600.528-3-pdurrant@amazon.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
ee661d8ea9 |
KVM: x86: Replace cpu_dirty_logging_count with nr_memslots_dirty_logging
Drop cpu_dirty_logging_count in favor of nr_memslots_dirty_logging. Both fields count the number of memslots that have dirty-logging enabled, with the only difference being that cpu_dirty_logging_count is only incremented when using PML. So while nr_memslots_dirty_logging is not a direct replacement for cpu_dirty_logging_count, it can be combined with enable_pml to get the same information. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230105214303.2919415-1-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
f15a87c006 |
Merge branch 'kvm-lapic-fix-and-cleanup' into HEAD
The first half or so patches fix semi-urgent, real-world relevant APICv and AVIC bugs. The second half fixes a variety of AVIC and optimized APIC map bugs where KVM doesn't play nice with various edge cases that are architecturally legal(ish), but are unlikely to occur in most real world scenarios Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
b3f257a846 |
KVM: x86: Track required APICv inhibits with variable, not callback
Track the per-vendor required APICv inhibits with a variable instead of calling into vendor code every time KVM wants to query the set of required inhibits. The required inhibits are a property of the vendor's virtualization architecture, i.e. are 100% static. Using a variable allows the compiler to inline the check, e.g. generate a single-uop TEST+Jcc, and thus eliminates any desire to avoid checking inhibits for performance reasons. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-32-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
2008fab345 |
KVM: x86: Inhibit APIC memslot if x2APIC and AVIC are enabled
Free the APIC access page memslot if any vCPU enables x2APIC and SVM's
AVIC is enabled to prevent accesses to the virtual APIC on vCPUs with
x2APIC enabled. On AMD, if its "hybrid" mode is enabled (AVIC is enabled
when x2APIC is enabled even without x2AVIC support), keeping the APIC
access page memslot results in the guest being able to access the virtual
APIC page as x2APIC is fully emulated by KVM. I.e. hardware isn't aware
that the guest is operating in x2APIC mode.
Exempt nested SVM's update of APICv state from the new logic as x2APIC
can't be toggled on VM-Exit. In practice, invoking the x2APIC logic
should be harmless precisely because it should be a glorified nop, but
play it safe to avoid latent bugs, e.g. with dropping the vCPU's SRCU
lock.
Intel doesn't suffer from the same issue as APICv has fully independent
VMCS controls for xAPIC vs. x2APIC virtualization. Technically, KVM
should provide bus error semantics and not memory semantics for the APIC
page when x2APIC is enabled, but KVM already provides memory semantics in
other scenarios, e.g. if APICv/AVIC is enabled and the APIC is hardware
disabled (via APIC_BASE MSR).
Note, checking apic_access_memslot_enabled without taking locks relies
it being set during vCPU creation (before kvm_vcpu_reset()). vCPUs can
race to set the inhibit and delete the memslot, i.e. can get false
positives, but can't get false negatives as apic_access_memslot_enabled
can't be toggled "on" once any vCPU reaches KVM_RUN.
Opportunistically drop the "can" while updating avic_activate_vmcb()'s
comment, i.e. to state that KVM _does_ support the hybrid mode. Move
the "Note:" down a line to conform to preferred kernel/KVM multi-line
comment style.
Opportunistically update the apicv_update_lock comment, as it isn't
actually used to protect apic_access_memslot_enabled (which is protected
by slots_lock).
Fixes:
|
|
|
|
e4aa7f88af |
KVM: Disable CPU hotplug during hardware enabling/disabling
Disable CPU hotplug when enabling/disabling hardware to prevent the
corner case where if the following sequence occurs:
1. A hotplugged CPU marks itself online in cpu_online_mask
2. The hotplugged CPU enables interrupt before invoking KVM's ONLINE
callback
3 hardware_{en,dis}able_all() is invoked on another CPU
the hotplugged CPU will be included in on_each_cpu() and thus get sent
through hardware_{en,dis}able_nolock() before kvm_online_cpu() is called.
start_secondary { ...
set_cpu_online(smp_processor_id(), true); <- 1
...
local_irq_enable(); <- 2
...
cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); <- 3
}
KVM currently fudges around this race by keeping track of which CPUs have
done hardware enabling (see commit
|
|
|
|
aaf12a7b43 |
KVM: Rename and move CPUHP_AP_KVM_STARTING to ONLINE section
The CPU STARTING section doesn't allow callbacks to fail. Move KVM's hotplug callback to ONLINE section so that it can abort onlining a CPU in certain cases to avoid potentially breaking VMs running on existing CPUs. For example, when KVM fails to enable hardware virtualization on the hotplugged CPU. Place KVM's hotplug state before CPUHP_AP_SCHED_WAIT_EMPTY as it ensures when offlining a CPU, all user tasks and non-pinned kernel tasks have left the CPU, i.e. there cannot be a vCPU task around. So, it is safe for KVM's CPU offline callback to disable hardware virtualization at that point. Likewise, KVM's online callback can enable hardware virtualization before any vCPU task gets a chance to run on hotplugged CPUs. Drop kvm_x86_check_processor_compatibility()'s WARN that IRQs are disabled, as the ONLINE section runs with IRQs disabled. The WARN wasn't intended to be a requirement, e.g. disabling preemption is sufficient, the IRQ thing was purely an aggressive sanity check since the helper was only ever invoked via SMP function call. Rename KVM's CPU hotplug callbacks accordingly. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> [sean: drop WARN that IRQs are disabled] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-42-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
c82a5c5c53 |
KVM: x86: Do compatibility checks when onlining CPU
Do compatibility checks when enabling hardware to effectively add compatibility checks when onlining a CPU. Abort enabling, i.e. the online process, if the (hotplugged) CPU is incompatible with the known good setup. At init time, KVM does compatibility checks to ensure that all online CPUs support hardware virtualization and a common set of features. But KVM uses hotplugged CPUs without such compatibility checks. On Intel CPUs, this leads to #GP if the hotplugged CPU doesn't support VMX, or VM-Entry failure if the hotplugged CPU doesn't support all features enabled by KVM. Note, this is little more than a NOP on SVM, as SVM already checks for full SVM support during hardware enabling. Opportunistically add a pr_err() if setup_vmcs_config() fails, and tweak all error messages to output which CPU failed. Signed-off-by: Chao Gao <chao.gao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Kai Huang <kai.huang@intel.com> Message-Id: <20221130230934.1014142-41-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
d83420c2d7 |
KVM: x86: Move CPU compat checks hook to kvm_x86_ops (from kvm_x86_init_ops)
Move the .check_processor_compatibility() callback from kvm_x86_init_ops to kvm_x86_ops to allow a future patch to do compatibility checks during CPU hotplug. Do kvm_ops_update() before compat checks so that static_call() can be used during compat checks. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20221130230934.1014142-40-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
d419313249 |
KVM: x86: Do VMX/SVM support checks directly in vendor code
Do basic VMX/SVM support checks directly in vendor code instead of implementing them via kvm_x86_ops hooks. Beyond the superficial benefit of providing common messages, which isn't even clearly a net positive since vendor code can provide more precise/detailed messages, there's zero advantage to bouncing through common x86 code. Consolidating the checks will also simplify performing the checks across all CPUs (in a future patch). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-37-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
8d20bd6381 |
KVM: x86: Unify pr_fmt to use module name for all KVM modules
Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks use consistent formatting across common x86, Intel, and AMD code. In addition to providing consistent print formatting, using KBUILD_MODNAME, e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and SGX and ...) as technologies without generating weird messages, and without causing naming conflicts with other kernel code, e.g. "SEV: ", "tdx: ", "sgx: " etc.. are all used by the kernel for non-KVM subsystems. Opportunistically move away from printk() for prints that need to be modified anyways, e.g. to drop a manual "kvm: " prefix. Opportunistically convert a few SGX WARNs that are similarly modified to WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good that they would fire repeatedly and spam the kernel log without providing unique information in each print. Note, defining pr_fmt yields undesirable results for code that uses KVM's printk wrappers, e.g. vcpu_unimpl(). But, that's a pre-existing problem as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's wrappers is relatively limited in KVM x86 code. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paul Durrant <paul@xen.org> Message-Id: <20221130230934.1014142-35-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
81a1cf9f89 |
KVM: Drop kvm_arch_check_processor_compat() hook
Drop kvm_arch_check_processor_compat() and its support code now that all architecture implementations are nops. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390 Acked-by: Anup Patel <anup@brainfault.org> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20221130230934.1014142-33-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
3045c483ee |
KVM: x86: Do CPU compatibility checks in x86 code
Move the CPU compatibility checks to pure x86 code, i.e. drop x86's use of the common kvm_x86_check_cpu_compat() arch hook. x86 is the only architecture that "needs" to do per-CPU compatibility checks, moving the logic to x86 will allow dropping the common code, and will also give x86 more control over when/how the compatibility checks are performed, e.g. TDX will need to enable hardware (do VMXON) in order to perform compatibility checks. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20221130230934.1014142-32-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |