mirror of https://github.com/torvalds/linux.git
2667 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
0e8863244e |
ARM:
* Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
stage-1 page tables) to align with the architecture. This avoids
possibly taking an SEA at EL2 on the page table walk or using an
architecturally UNKNOWN fault IPA.
* Use acquire/release semantics in the KVM FF-A proxy to avoid reading
a stale value for the FF-A version.
* Fix KVM guest driver to match PV CPUID hypercall ABI.
* Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
selftests, which is the only memory type for which atomic
instructions are architecturally guaranteed to work.
s390:
* Don't use %pK for debug printing and tracepoints.
x86:
* Use a separate subclass when acquiring KVM's per-CPU posted interrupts
wakeup lock in the scheduled out path, i.e. when adding a vCPU on
the list of vCPUs to wake, to workaround a false positive deadlock.
The schedule out code runs with a scheduler lock that the wakeup
handler takes in the opposite order; but it does so with IRQs disabled
and cannot run concurrently with a wakeup.
* Explicitly zero-initialize on-stack CPUID unions
* Allow building irqbypass.ko as as module when kvm.ko is a module
* Wrap relatively expensive sanity check with KVM_PROVE_MMU
* Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
selftests:
* Add more scenarios to the MONITOR/MWAIT test.
* Add option to rseq test to override /dev/cpu_dma_latency
* Bring list of exit reasons up to date
* Cleanup Makefile to list once tests that are valid on all architectures
Other:
* Documentation fixes
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmf083IUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroN1dgf/QwfpZcHoMNQSnrc1jMy2LHrArln2
XfmsOGZTU7kyoLQsLWGAPNocOveGdiemTDsj5ZXoNMnqV8hCBr+tZuv2gWI1rr/o
kiGerdIgSZ9piTjBlJkVAaOzbWhg2DUnr7qVVzEzFY9+rPNyQ81vgAfU7h56KhYB
optecozmBrHHAxvQZwmPeL9UyPWFjOF1BY/8LTMx7X+aVuCX6qx1JqO3a3ylAw4J
tGXv6qFJfuCnu1d1b4X0ILce0iMUTOjQzvTcIm+BKjYycecl+3j1aczC/BOorIgc
mf0+XeauhcTduK73pirnvx2b05eOxntgkOpwJytO2RP6pE0uK+2Th/C3Qg==
=ba/Y
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
stage-1 page tables) to align with the architecture. This avoids
possibly taking an SEA at EL2 on the page table walk or using an
architecturally UNKNOWN fault IPA
- Use acquire/release semantics in the KVM FF-A proxy to avoid
reading a stale value for the FF-A version
- Fix KVM guest driver to match PV CPUID hypercall ABI
- Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
selftests, which is the only memory type for which atomic
instructions are architecturally guaranteed to work
s390:
- Don't use %pK for debug printing and tracepoints
x86:
- Use a separate subclass when acquiring KVM's per-CPU posted
interrupts wakeup lock in the scheduled out path, i.e. when adding
a vCPU on the list of vCPUs to wake, to workaround a false positive
deadlock. The schedule out code runs with a scheduler lock that the
wakeup handler takes in the opposite order; but it does so with
IRQs disabled and cannot run concurrently with a wakeup
- Explicitly zero-initialize on-stack CPUID unions
- Allow building irqbypass.ko as as module when kvm.ko is a module
- Wrap relatively expensive sanity check with KVM_PROVE_MMU
- Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
selftests:
- Add more scenarios to the MONITOR/MWAIT test
- Add option to rseq test to override /dev/cpu_dma_latency
- Bring list of exit reasons up to date
- Cleanup Makefile to list once tests that are valid on all
architectures
Other:
- Documentation fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits)
KVM: arm64: Use acquire/release to communicate FF-A version negotiation
KVM: arm64: selftests: Explicitly set the page attrs to Inner-Shareable
KVM: arm64: selftests: Introduce and use hardware-definition macros
KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive
KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list
KVM: x86: Explicitly zero-initialize on-stack CPUID unions
KVM: Allow building irqbypass.ko as as module when kvm.ko is a module
KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMU
KVM: selftests: Add option to rseq test to override /dev/cpu_dma_latency
KVM: x86: Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
Documentation: kvm: remove KVM_CAP_MIPS_TE
Documentation: kvm: organize capabilities in the right section
Documentation: kvm: fix some definition lists
Documentation: kvm: drop "Capability" heading from capabilities
Documentation: kvm: give correct name for KVM_CAP_SPAPR_MULTITCE
Documentation: KVM: KVM_GET_SUPPORTED_CPUID now exposes TSC_DEADLINE
selftests: kvm: list once tests that are valid on all architectures
selftests: kvm: bring list of exit reasons up to date
selftests: kvm: revamp MONITOR/MWAIT tests
KVM: arm64: Don't translate FAR if invalid/unsafe
...
|
|
|
|
459a35111b |
KVM: Allow building irqbypass.ko as as module when kvm.ko is a module
Convert HAVE_KVM_IRQ_BYPASS into a tristate so that selecting
IRQ_BYPASS_MANAGER follows KVM={m,y}, i.e. doesn't force irqbypass.ko to
be built-in.
Note, PPC allows building KVM as a module, but selects HAVE_KVM_IRQ_BYPASS
from a boolean Kconfig, i.e. KVM PPC unnecessarily forces irqbpass.ko to
be built-in. But that flaw is a longstanding PPC specific issue.
Fixes:
|
|
|
|
edb0e8f6e2 |
ARM:
* Nested virtualization support for VGICv3, giving the nested
hypervisor control of the VGIC hardware when running an L2 VM
* Removal of 'late' nested virtualization feature register masking,
making the supported feature set directly visible to userspace
* Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers
* Paravirtual interface for discovering the set of CPU implementations
where a VM may run, addressing a longstanding issue of guest CPU
errata awareness in big-little systems and cross-implementation VM
migration
* Userspace control of the registers responsible for identifying a
particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
allowing VMs to be migrated cross-implementation
* pKVM updates, including support for tracking stage-2 page table
allocations in the protected hypervisor in the 'SecPageTable' stat
* Fixes to vPMU, ensuring that userspace updates to the vPMU after
KVM_RUN are reflected into the backing perf events
LoongArch:
* Remove unnecessary header include path
* Assume constant PGD during VM context switch
* Add perf events support for guest VM
RISC-V:
* Disable the kernel perf counter during configure
* KVM selftests improvements for PMU
* Fix warning at the time of KVM module removal
x86:
* Add support for aging of SPTEs without holding mmu_lock. Not taking mmu_lock
allows multiple aging actions to run in parallel, and more importantly avoids
stalling vCPUs. This includes an implementation of per-rmap-entry locking;
aging the gfn is done with only a per-rmap single-bin spinlock taken, whereas
locking an rmap for write requires taking both the per-rmap spinlock and
the mmu_lock.
Note that this decreases slightly the accuracy of accessed-page information,
because changes to the SPTE outside aging might not use atomic operations
even if they could race against a clear of the Accessed bit. This is
deliberate because KVM and mm/ tolerate false positives/negatives for
accessed information, and testing has shown that reducing the latency of
aging is far more beneficial to overall system performance than providing
"perfect" young/old information.
* Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
coalesce updates when multiple pieces of vCPU state are changing, e.g. as
part of a nested transition.
* Fix a variety of nested emulation bugs, and add VMX support for synthesizing
nested VM-Exit on interception (instead of injecting #UD into L2).
* Drop "support" for async page faults for protected guests that do not set
SEND_ALWAYS (i.e. that only want async page faults at CPL3)
* Bring a bit of sanity to x86's VM teardown code, which has accumulated
a lot of cruft over the years. Particularly, destroy vCPUs before
the MMU, despite the latter being a VM-wide operation.
* Add common secure TSC infrastructure for use within SNP and in the
future TDX
* Block KVM_CAP_SYNC_REGS if guest state is protected. It does not make
sense to use the capability if the relevant registers are not
available for reading or writing.
* Don't take kvm->lock when iterating over vCPUs in the suspend notifier to
fix a largely theoretical deadlock.
* Use the vCPU's actual Xen PV clock information when starting the Xen timer,
as the cached state in arch.hv_clock can be stale/bogus.
* Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different
PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend
notifier only accounts for kvmclock, and there's no evidence that the
flag is actually supported by Xen guests.
* Clean up the per-vCPU "cache" of its reference pvclock, and instead only
track the vCPU's TSC scaling (multipler+shift) metadata (which is moderately
expensive to compute, and rarely changes for modern setups).
* Don't write to the Xen hypercall page on MSR writes that are initiated by
the host (userspace or KVM) to fix a class of bugs where KVM can write to
guest memory at unexpected times, e.g. during vCPU creation if userspace has
set the Xen hypercall MSR index to collide with an MSR that KVM emulates.
* Restrict the Xen hypercall MSR index to the unofficial synthetic range to
reduce the set of possible collisions with MSRs that are emulated by KVM
(collisions can still happen as KVM emulates Hyper-V MSRs, which also reside
in the synthetic range).
* Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config.
* Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID
entries when updating PV clocks; there is no guarantee PV clocks will be
updated between TSC frequency changes and CPUID emulation, and guest reads
of the TSC leaves should be rare, i.e. are not a hot path.
x86 (Intel):
* Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus
modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1.
* Pass XFD_ERR as the payload when injecting #NM, as a preparatory step
for upcoming FRED virtualization support.
* Decouple the EPT entry RWX protection bit macros from the EPT Violation
bits, both as a general cleanup and in anticipation of adding support for
emulating Mode-Based Execution Control (MBEC).
* Reject KVM_RUN if userspace manages to gain control and stuff invalid guest
state while KVM is in the middle of emulating nested VM-Enter.
* Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs
in anticipation of adding sanity checks for secondary exit controls (the
primary field is out of bits).
x86 (AMD):
* Ensure the PSP driver is initialized when both the PSP and KVM modules are
built-in (the initcall framework doesn't handle dependencies).
* Use long-term pins when registering encrypted memory regions, so that the
pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and don't lead to
excessive fragmentation.
* Add macros and helpers for setting GHCB return/error codes.
* Add support for Idle HLT interception, which elides interception if the vCPU
has a pending, unmasked virtual IRQ when HLT is executed.
* Fix a bug in INVPCID emulation where KVM fails to check for a non-canonical
address.
* Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is invalid, e.g.
because the vCPU was "destroyed" via SNP's AP Creation hypercall.
* Reject SNP AP Creation if the requested SEV features for the vCPU don't
match the VM's configured set of features.
Selftests:
* Fix again the Intel PMU counters test; add a data load and do CLFLUSH{OPT} on the data
instead of executing code. The theory is that modern Intel CPUs have
learned new code prefetching tricks that bypass the PMU counters.
* Fix a flaw in the Intel PMU counters test where it asserts that an event is
counting correctly without actually knowing what the event counts on the
underlying hardware.
* Fix a variety of flaws, bugs, and false failures/passes dirty_log_test, and
improve its coverage by collecting all dirty entries on each iteration.
* Fix a few minor bugs related to handling of stats FDs.
* Add infrastructure to make vCPU and VM stats FDs available to tests by
default (open the FDs during VM/vCPU creation).
* Relax an assertion on the number of HLT exits in the xAPIC IPI test when
running on a CPU that supports AMD's Idle HLT (which elides interception of
HLT if a virtual IRQ is pending and unmasked).
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmfcTkEUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMnQAf/cPx72hJOdNy4Qrm8M33YLXVRVV00
yEZ8eN8TWdOclr0ltE/w/ELGh/qS4CU8pjURAk0A6lPioU+mdcTn3dPEqMDMVYom
uOQ2lusEHw0UuSnGZSEjvZJsE/Ro2NSAsHIB6PWRqig1ZBPJzyu0frce34pMpeQH
diwriJL9lKPAhBWXnUQ9BKoi1R0P5OLW9ahX4SOWk7cAFg4DLlDE66Nqf6nKqViw
DwEucTiUEg5+a3d93gihdD4JNl+fb3vI2erxrMxjFjkacl0qgqRu3ei3DG0MfdHU
wNcFSG5B1n0OECKxr80lr1Ip1KTVNNij0Ks+w6Gc6lSg9c4PptnNkfLK3A==
=nnCN
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Nested virtualization support for VGICv3, giving the nested
hypervisor control of the VGIC hardware when running an L2 VM
- Removal of 'late' nested virtualization feature register masking,
making the supported feature set directly visible to userspace
- Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers
- Paravirtual interface for discovering the set of CPU
implementations where a VM may run, addressing a longstanding issue
of guest CPU errata awareness in big-little systems and
cross-implementation VM migration
- Userspace control of the registers responsible for identifying a
particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
allowing VMs to be migrated cross-implementation
- pKVM updates, including support for tracking stage-2 page table
allocations in the protected hypervisor in the 'SecPageTable' stat
- Fixes to vPMU, ensuring that userspace updates to the vPMU after
KVM_RUN are reflected into the backing perf events
LoongArch:
- Remove unnecessary header include path
- Assume constant PGD during VM context switch
- Add perf events support for guest VM
RISC-V:
- Disable the kernel perf counter during configure
- KVM selftests improvements for PMU
- Fix warning at the time of KVM module removal
x86:
- Add support for aging of SPTEs without holding mmu_lock.
Not taking mmu_lock allows multiple aging actions to run in
parallel, and more importantly avoids stalling vCPUs. This includes
an implementation of per-rmap-entry locking; aging the gfn is done
with only a per-rmap single-bin spinlock taken, whereas locking an
rmap for write requires taking both the per-rmap spinlock and the
mmu_lock.
Note that this decreases slightly the accuracy of accessed-page
information, because changes to the SPTE outside aging might not
use atomic operations even if they could race against a clear of
the Accessed bit.
This is deliberate because KVM and mm/ tolerate false
positives/negatives for accessed information, and testing has shown
that reducing the latency of aging is far more beneficial to
overall system performance than providing "perfect" young/old
information.
- Defer runtime CPUID updates until KVM emulates a CPUID instruction,
to coalesce updates when multiple pieces of vCPU state are
changing, e.g. as part of a nested transition
- Fix a variety of nested emulation bugs, and add VMX support for
synthesizing nested VM-Exit on interception (instead of injecting
#UD into L2)
- Drop "support" for async page faults for protected guests that do
not set SEND_ALWAYS (i.e. that only want async page faults at CPL3)
- Bring a bit of sanity to x86's VM teardown code, which has
accumulated a lot of cruft over the years. Particularly, destroy
vCPUs before the MMU, despite the latter being a VM-wide operation
- Add common secure TSC infrastructure for use within SNP and in the
future TDX
- Block KVM_CAP_SYNC_REGS if guest state is protected. It does not
make sense to use the capability if the relevant registers are not
available for reading or writing
- Don't take kvm->lock when iterating over vCPUs in the suspend
notifier to fix a largely theoretical deadlock
- Use the vCPU's actual Xen PV clock information when starting the
Xen timer, as the cached state in arch.hv_clock can be stale/bogus
- Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across
different PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as
KVM's suspend notifier only accounts for kvmclock, and there's no
evidence that the flag is actually supported by Xen guests
- Clean up the per-vCPU "cache" of its reference pvclock, and instead
only track the vCPU's TSC scaling (multipler+shift) metadata (which
is moderately expensive to compute, and rarely changes for modern
setups)
- Don't write to the Xen hypercall page on MSR writes that are
initiated by the host (userspace or KVM) to fix a class of bugs
where KVM can write to guest memory at unexpected times, e.g.
during vCPU creation if userspace has set the Xen hypercall MSR
index to collide with an MSR that KVM emulates
- Restrict the Xen hypercall MSR index to the unofficial synthetic
range to reduce the set of possible collisions with MSRs that are
emulated by KVM (collisions can still happen as KVM emulates
Hyper-V MSRs, which also reside in the synthetic range)
- Clean up and optimize KVM's handling of Xen MSR writes and
xen_hvm_config
- Update Xen TSC leaves during CPUID emulation instead of modifying
the CPUID entries when updating PV clocks; there is no guarantee PV
clocks will be updated between TSC frequency changes and CPUID
emulation, and guest reads of the TSC leaves should be rare, i.e.
are not a hot path
x86 (Intel):
- Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and
thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1
- Pass XFD_ERR as the payload when injecting #NM, as a preparatory
step for upcoming FRED virtualization support
- Decouple the EPT entry RWX protection bit macros from the EPT
Violation bits, both as a general cleanup and in anticipation of
adding support for emulating Mode-Based Execution Control (MBEC)
- Reject KVM_RUN if userspace manages to gain control and stuff
invalid guest state while KVM is in the middle of emulating nested
VM-Enter
- Add a macro to handle KVM's sanity checks on entry/exit VMCS
control pairs in anticipation of adding sanity checks for secondary
exit controls (the primary field is out of bits)
x86 (AMD):
- Ensure the PSP driver is initialized when both the PSP and KVM
modules are built-in (the initcall framework doesn't handle
dependencies)
- Use long-term pins when registering encrypted memory regions, so
that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and
don't lead to excessive fragmentation
- Add macros and helpers for setting GHCB return/error codes
- Add support for Idle HLT interception, which elides interception if
the vCPU has a pending, unmasked virtual IRQ when HLT is executed
- Fix a bug in INVPCID emulation where KVM fails to check for a
non-canonical address
- Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is
invalid, e.g. because the vCPU was "destroyed" via SNP's AP
Creation hypercall
- Reject SNP AP Creation if the requested SEV features for the vCPU
don't match the VM's configured set of features
Selftests:
- Fix again the Intel PMU counters test; add a data load and do
CLFLUSH{OPT} on the data instead of executing code. The theory is
that modern Intel CPUs have learned new code prefetching tricks
that bypass the PMU counters
- Fix a flaw in the Intel PMU counters test where it asserts that an
event is counting correctly without actually knowing what the event
counts on the underlying hardware
- Fix a variety of flaws, bugs, and false failures/passes
dirty_log_test, and improve its coverage by collecting all dirty
entries on each iteration
- Fix a few minor bugs related to handling of stats FDs
- Add infrastructure to make vCPU and VM stats FDs available to tests
by default (open the FDs during VM/vCPU creation)
- Relax an assertion on the number of HLT exits in the xAPIC IPI test
when running on a CPU that supports AMD's Idle HLT (which elides
interception of HLT if a virtual IRQ is pending and unmasked)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (216 commits)
RISC-V: KVM: Optimize comments in kvm_riscv_vcpu_isa_disable_allowed
RISC-V: KVM: Teardown riscv specific bits after kvm_exit
LoongArch: KVM: Register perf callbacks for guest
LoongArch: KVM: Implement arch-specific functions for guest perf
LoongArch: KVM: Add stub for kvm_arch_vcpu_preempted_in_kernel()
LoongArch: KVM: Remove PGD saving during VM context switch
LoongArch: KVM: Remove unnecessary header include path
KVM: arm64: Tear down vGIC on failed vCPU creation
KVM: arm64: PMU: Reload when resetting
KVM: arm64: PMU: Reload when user modifies registers
KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs
KVM: arm64: PMU: Assume PMU presence in pmu-emul.c
KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu
KVM: arm64: Factor out pKVM hyp vcpu creation to separate function
KVM: arm64: Initialize HCRX_EL2 traps in pKVM
KVM: arm64: Factor out setting HCRX_EL2 traps into separate function
KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected
KVM: x86: Add infrastructure for secure TSC
KVM: x86: Push down setting vcpu.arch.user_set_tsc
...
|
|
|
|
99c21beaab |
vfs-6.15-rc1.misc
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90p4AAKCRCRxhvAZXjc
ojMIAP9atkG3u7+490+NGWLdulQlaHnD51Owa9MiW87UfKpsTQEArwi/NrJqXJNT
PFQ2xIa5TxG+9haChR89w3kjZ6b/hgs=
=iDkx
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Add CONFIG_DEBUG_VFS infrastucture:
- Catch invalid modes in open
- Use the new debug macros in inode_set_cached_link()
- Use debug-only asserts around fd allocation and install
- Place f_ref to 3rd cache line in struct file to resolve false
sharing
Cleanups:
- Start using anon_inode_getfile_fmode() helper in various places
- Don't take f_lock during SEEK_CUR if exclusion is guaranteed by
f_pos_lock
- Add unlikely() to kcmp()
- Remove legacy ->remount_fs method from ecryptfs after port to the
new mount api
- Remove invalidate_inodes() in favour of evict_inodes()
- Simplify ep_busy_loopER by removing unused argument
- Avoid mmap sem relocks when coredumping with many missing pages
- Inline getname()
- Inline new_inode_pseudo() and de-staticize alloc_inode()
- Dodge an atomic in putname if ref == 1
- Consistently deref the files table with rcu_dereference_raw()
- Dedup handling of struct filename init and refcounts bumps
- Use wq_has_sleeper() in end_dir_add()
- Drop the lock trip around I_NEW wake up in evict()
- Load the ->i_sb pointer once in inode_sb_list_{add,del}
- Predict not reaching the limit in alloc_empty_file()
- Tidy up do_sys_openat2() with likely/unlikely
- Call inode_sb_list_add() outside of inode hash lock
- Sort out fd allocation vs dup2 race commentary
- Turn page_offset() into a wrapper around folio_pos()
- Remove locking in exportfs around ->get_parent() call
- try_lookup_one_len() does not need any locks in autofs
- Fix return type of several functions from long to int in open
- Fix return type of several functions from long to int in ioctls
Fixes:
- Fix watch queue accounting mismatch"
* tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
fs: sort out fd allocation vs dup2 race commentary, take 2
fs: call inode_sb_list_add() outside of inode hash lock
fs: tidy up do_sys_openat2() with likely/unlikely
fs: predict not reaching the limit in alloc_empty_file()
fs: load the ->i_sb pointer once in inode_sb_list_{add,del}
fs: drop the lock trip around I_NEW wake up in evict()
fs: use wq_has_sleeper() in end_dir_add()
VFS/autofs: try_lookup_one_len() does not need any locks
fs: dedup handling of struct filename init and refcounts bumps
fs: consistently deref the files table with rcu_dereference_raw()
exportfs: remove locking around ->get_parent() call.
fs: use debug-only asserts around fd allocation and install
fs: dodge an atomic in putname if ref == 1
vfs: Remove invalidate_inodes()
ecryptfs: remove NULL remount_fs from super_operations
watch_queue: fix pipe accounting mismatch
fs: place f_ref to 3rd cache line in struct file to resolve false sharing
epoll: simplify ep_busy_loop by removing always 0 argument
fs: Turn page_offset() into a wrapper around folio_pos()
kcmp: improve performance adding an unlikely hint to task comparisons
...
|
|
|
|
361da275e5 |
Merge branch 'kvm-nvmx-and-vm-teardown' into HEAD
The immediate issue being fixed here is a nVMX bug where KVM fails to detect that, after nested VM-Exit, L1 has a pending IRQ (or NMI). However, checking for a pending interrupt accesses the legacy PIC, and x86's kvm_arch_destroy_vm() currently frees the PIC before destroying vCPUs, i.e. checking for IRQs during the forced nested VM-Exit results in a NULL pointer deref; that's a prerequisite for the nVMX fix. The remaining patches attempt to bring a bit of sanity to x86's VM teardown code, which has accumulated a lot of cruft over the years. E.g. KVM currently unloads each vCPU's MMUs in a separate operation from destroying vCPUs, all because when guest SMP support was added, KVM had a kludgy MMU teardown flow that broke when a VM had more than one 1 vCPU. And that oddity lived on, for 18 years... Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
b2aba529bf |
KVM: Drop kvm_arch_sync_events() now that all implementations are nops
Remove kvm_arch_sync_events() now that x86 no longer uses it (no other arch has ever used it). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Bibo Mao <maobibo@loongson.cn> Message-ID: <20250224235542.2562848-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
ed8f966331 |
KVM: Assert that a destroyed/freed vCPU is no longer visible
After freeing a vCPU, assert that it is no longer reachable, and that kvm_get_vcpu() doesn't return garbage or a pointer to some other vCPU. While KVM obviously shouldn't be attempting to access a freed vCPU, it's all too easy for KVM to make a VM-wide request, e.g. via KVM_BUG_ON() or kvm_flush_remote_tlbs(). Alternatively, KVM could short-circuit problematic paths if the VM's refcount has gone to zero, e.g. in kvm_make_all_cpus_request(), or KVM could try disallow making global requests during teardown. But given that deleting the vCPU from the array Just Works, adding logic to the requests path is unnecessary, and trying to make requests illegal during teardown would be a fool's errand. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250224235542.2562848-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
f9835fa147
|
make use of anon_inode_getfile_fmode()
["fallen through the cracks" misc stuff] A bunch of anon_inode_getfile() callers follow it with adjusting ->f_mode; we have a helper doing that now, so let's make use of it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20250118014434.GT1977892@ZenIV Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
aa34b81165 |
KVM: Allow lockless walk of SPTEs when handing aging mmu_notifier event
It is possible to correctly do aging without taking the KVM MMU lock, or while taking it for read; add a Kconfig to let architectures do so. Architectures that select KVM_MMU_LOCKLESS_AGING are responsible for correctness. Suggested-by: Yu Zhao <yuzhao@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-3-jthoughton@google.com [sean: massage shortlog+changelog, fix Kconfig goof and shorten name] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
374ccd6360 |
KVM: Rename kvm_handle_hva_range()
Rename kvm_handle_hva_range() to kvm_age_hva_range(), kvm_handle_hva_range_no_flush() to kvm_age_hva_range_no_flush(), and __kvm_handle_hva_range() to kvm_handle_hva_range(), as kvm_age_hva_range() will get more aging-specific functionality. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-2-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
6f61269495 |
KVM: remove kvm_arch_post_init_vm
The only statement in a kvm_arch_post_init_vm implementation can be moved into the x86 kvm_arch_init_vm. Do so and remove all traces from architecture-independent code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
66119f8ce1 |
KVM: Do not restrict the size of KVM-internal memory regions
Exempt KVM-internal memslots from the KVM_MEM_MAX_NR_PAGES restriction, as the limit on the number of pages exists purely to play nice with dirty bitmap operations, which use 32-bit values to index the bitmaps, and dirty logging isn't supported for KVM-internal memslots. Link: https://lore.kernel.org/all/20240802205003.353672-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20250123144627.312456-2-imbrenda@linux.ibm.com Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20250123144627.312456-2-imbrenda@linux.ibm.com> |
|
|
|
86eb1aef72 |
Merge branch 'kvm-mirror-page-tables' into HEAD
As part of enabling TDX virtual machines, support support separation of
private/shared EPT into separate roots.
Confidential computing solutions almost invariably have concepts of
private and shared memory, but they may different a lot in the details.
In SEV, for example, the bit is handled more like a permission bit as
far as the page tables are concerned: the private/shared bit is not
included in the physical address.
For TDX, instead, the bit is more like a physical address bit, with
the host mapping private memory in one half of the address space and
shared in another. Furthermore, the two halves are mapped by different
EPT roots and only the shared half is managed by KVM; the private half
(also called Secure EPT in Intel documentation) gets managed by the
privileged TDX Module via SEAMCALLs.
As a result, the operations that actually change the private half of
the EPT are limited and relatively slow compared to reading a PTE. For
this reason the design for KVM is to keep a mirror of the private EPT in
host memory. This allows KVM to quickly walk the EPT and only perform the
slower private EPT operations when it needs to actually modify mid-level
private PTEs.
There are thus three sets of EPT page tables: external, mirror and
direct. In the case of TDX (the only user of this framework) the
first two cover private memory, whereas the third manages shared
memory:
external EPT - Hidden within the TDX module, modified via TDX module
calls.
mirror EPT - Bookkeeping tree used as an optimization by KVM, not
used by the processor.
direct EPT - Normal EPT that maps unencrypted shared memory.
Managed like the EPT of a normal VM.
Modifying external EPT
----------------------
Modifications to the mirrored page tables need to also perform the
same operations to the private page tables, which will be handled via
kvm_x86_ops. Although this prep series does not interact with the TDX
module at all to actually configure the private EPT, it does lay the
ground work for doing this.
In some ways updating the private EPT is as simple as plumbing PTE
modifications through to also call into the TDX module; however, the
locking is more complicated because inserting a single PTE cannot anymore
be done atomically with a single CMPXCHG. For this reason, the existing
FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the
private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides
protecting operation of KVM, it limits the set of cases in which the
TDX module will encounter contention on its own PTE locks.
Zapping external EPT
--------------------
While the framework tries to be relatively generic, and to be
understandable without knowing TDX much in detail, some requirements of
TDX sometimes leak; for example the private page tables also cannot be
zapped while the range has anything mapped, so the mirrored/private page
tables need to be protected from KVM operations that zap any non-leaf
PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().
For normal VMs, guest memory is zapped for several reasons: user
memory getting paged out by the guest, memslots getting deleted,
passthrough of devices with non-coherent DMA. Confidential computing
adds to these the conversion of memory between shared and privates. These
operations must not zap any private memory that is in use by the guest.
This is possible because the only zapping that is out of the control
of KVM/userspace is paging out userspace memory, which cannot apply to
guestmemfd operations. Thus a TDX VM will only zap private memory from
memslot deletion and from conversion between private and shared memory
which is triggered by the guest.
To avoid zapping too much memory, enums are introduced so that operations
can choose to target only private or shared memory, and thus only
direct or mirror EPT. For example:
Memslot deletion - Private and shared
MMU notifier based zapping - Shared only
Conversion to shared - Private only
Conversion to private - Shared only
Other cases of zapping will not be supported for KVM, for example
APICv update or non-coherent DMA status update; for the latter, TDX will
simply require that the CPU supports self-snoop and honor guest PAT
unconditionally for shared memory.
|
|
|
|
7a9164dc69 |
KVM vcpu_array fixes and cleanups for 6.14:
- Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug
where KVM would return a pointer to a vCPU prior to it being fully online,
and give kvm_for_each_vcpu() similar treatment to fix a similar flaw.
- Wait for a vCPU to come online prior to executing a vCPU ioctl to fix a
bug where userspace could coerce KVM into handling the ioctl on a vCPU that
isn't yet onlined.
- Gracefully handle xa_insert() failures even though such failuires should be
impossible in practice.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeJpxQACgkQOlYIJqCj
N/0mrxAAia8fym/sz70XJQyOPta9v/PvvqPpmZ1SuAeoHBUxbyqF+HqSpi5n6TIA
GijHfM5/LC4krRcS4765Pa6UjsxOeoOqbR+PAO9XZ6A5b93pJt5kKQw5JJE7ZckE
GNJbxCtfnKUP04GTb60pwKr7oVizlUf9AP8x1vGkp2mnnUa/VJTkwziSjum0FIB0
5tSzNEQL7gegomkzoqsIZxk+F2fxo9uuGu+ACTts51oEf1AC1kSljroSp65m4LhB
0d0c+4n5bTP5BhvP3DJC1OAi+nouyKCyV6NoM48J5x3RqQGd53ilBNqvY+O1caLv
LNnlmebPcVmGaWGWq6ta41xyG7klqnvITQe7mVeukSP38/82zkcTryiW4vTbykGz
7doz7ZcYefUf3U0aFouTHMr53uiUy+ysWCQkFKJvddJCT9vyjHKroWRJvq4dKRGK
tLfMDKU2/qCnSalWwmSErdLSRDVVZI8tbAFzsk0s/P2kspyiYzgB3LEGy8UGuB6G
Cov6cCmFI6trpYNwVU4tEVGfPEtqhf4odTzbvwOo1ekZzBwa/6WSGZa5eMQjV4tA
ikAgaVQYbduUyikEY+A0LUBVqvoVqCUG9eUKaSUHQTIImZ3Hnc0DcUT/RfNdk27A
nYI+Jz2OFoz1rY8QTXUHEn0IDTZL9ipREeAejbYXg4dee5QTwNY=
=nx41
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-vcpu_array-6.14' of https://github.com/kvm-x86/linux into HEAD
KVM vcpu_array fixes and cleanups for 6.14:
- Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug
where KVM would return a pointer to a vCPU prior to it being fully online,
and give kvm_for_each_vcpu() similar treatment to fix a similar flaw.
- Wait for a vCPU to come online prior to executing a vCPU ioctl to fix a
bug where userspace could coerce KVM into handling the ioctl on a vCPU that
isn't yet onlined.
- Gracefully handle xa_insert() failures even though such failuires should be
impossible in practice.
|
|
|
|
0cc3cb2151 |
KVM: Disallow all flags for KVM-internal memslots
Disallow all flags for KVM-internal memslots as all existing flags require some amount of userspace interaction to have any meaning. In addition to guarding against KVM goofs, explicitly disallowing dirty logging of KVM- internal memslots will (hopefully) allow exempting KVM-internal memslots from the KVM_MEM_MAX_NR_PAGES limit, which appears to exist purely because the dirty bitmap operations use a 32-bit index. Cc: Xiaoyao Li <xiaoyao.li@intel.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
344315e93d |
KVM: x86: Drop double-underscores from __kvm_set_memory_region()
Now that there's no outer wrapper for __kvm_set_memory_region() and it's static, drop its double-underscore prefix. No functional change intended. Cc: Tao Su <tao1.su@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
156bffdb2b |
KVM: Add a dedicated API for setting KVM-internal memslots
Add a dedicated API for setting internal memslots, and have it explicitly disallow setting userspace memslots. Setting a userspace memslots without a direct command from userspace would result in all manner of issues. No functional change intended. Cc: Tao Su <tao1.su@linux.intel.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
d131f0042f |
KVM: Assert slots_lock is held when setting memory regions
Add proper lockdep assertions in __kvm_set_memory_region() and __x86_set_memory_region() instead of relying comments. Opportunistically delete __kvm_set_memory_region()'s entire function comment as the API doesn't allocate memory or select a gfn, and the "mostly for framebuffers" comment hasn't been true for a very long time. Cc: Tao Su <tao1.su@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
f81a6d12bf |
KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API)
Open code kvm_set_memory_region() into its sole caller in preparation for adding a dedicated API for setting internal memslots. Oppurtunistically use the fancy new guard(mutex) to avoid a local 'r' variable. Cc: Tao Su <tao1.su@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
dca6c88532 |
KVM: Add member to struct kvm_gfn_range to indicate private/shared
Add new members to strut kvm_gfn_range to indicate which mapping (private-vs-shared) to operate on: enum kvm_gfn_range_filter attr_filter. Update the core zapping operations to set them appropriately. TDX utilizes two GPA aliases for the same memslots, one for memory that is for private memory and one that is for shared. For private memory, KVM cannot always perform the same operations it does on memory for default VMs, such as zapping pages and having them be faulted back in, as this requires guest coordination. However, some operations such as guest driven conversion of memory between private and shared should zap private memory. Internally to the MMU, private and shared mappings are tracked on separate roots. Mapping and zapping operations will operate on the respective GFN alias for each root (private or shared). So zapping operations will by default zap both aliases. Add fields in struct kvm_gfn_range to allow callers to specify which aliases so they can only target the aliases appropriate for their specific operation. There was feedback that target aliases should be specified such that the default value (0) is to operate on both aliases. Several options were considered. Several variations of having separate bools defined such that the default behavior was to process both aliases. They either allowed nonsensical configurations, or were confusing for the caller. A simple enum was also explored and was close, but was hard to process in the caller. Instead, use an enum with the default value (0) reserved as a disallowed value. Catch ranges that didn't have the target aliases specified by looking for that specific value. Set target alias with enum appropriately for these MMU operations: - For KVM's mmu notifier callbacks, zap shared pages only because private pages won't have a userspace mapping - For setting memory attributes, kvm_arch_pre_set_memory_attributes() chooses the aliases based on the attribute. - For guest_memfd invalidations, zap private only. Link: https://lore.kernel.org/kvm/ZivIF9vjKcuGie3s@google.com/ Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-3-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
67b43038ce |
KVM: guest_memfd: Remove RCU-protected attribute from slot->gmem.file
Remove the RCU-protected attribute from slot->gmem.file. No need to use RCU
primitives rcu_assign_pointer()/synchronize_rcu() to update this pointer.
- slot->gmem.file is updated in 3 places:
kvm_gmem_bind(), kvm_gmem_unbind(), kvm_gmem_release().
All of them are protected by kvm->slots_lock.
- slot->gmem.file is read in 2 paths:
(1) kvm_gmem_populate
kvm_gmem_get_file
__kvm_gmem_get_pfn
(2) kvm_gmem_get_pfn
kvm_gmem_get_file
__kvm_gmem_get_pfn
Path (1) kvm_gmem_populate() requires holding kvm->slots_lock, so
slot->gmem.file is protected by the kvm->slots_lock in this path.
Path (2) kvm_gmem_get_pfn() does not require holding kvm->slots_lock.
However, it's also not guarded by rcu_read_lock() and rcu_read_unlock().
So synchronize_rcu() in kvm_gmem_unbind()/kvm_gmem_release() actually
will not wait for the readers in kvm_gmem_get_pfn() due to lack of RCU
read-side critical section.
The path (2) kvm_gmem_get_pfn() is safe without RCU protection because:
a) kvm_gmem_bind() is called on a new memslot, before the memslot is
visible to kvm_gmem_get_pfn().
b) kvm->srcu ensures that kvm_gmem_unbind() and freeing of a memslot
occur after the memslot is no longer visible to kvm_gmem_get_pfn().
c) get_file_active() ensures that kvm_gmem_get_pfn() will not access the
stale file if kvm_gmem_release() sets it to NULL. This is because if
kvm_gmem_release() occurs before kvm_gmem_get_pfn(), get_file_active()
will return NULL; if get_file_active() does not return NULL,
kvm_gmem_release() should not occur until after kvm_gmem_get_pfn()
releases the file reference.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241104084303.29909-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
01528db67f |
KVM: Drop hack that "manually" informs lockdep of kvm->lock vs. vcpu->mutex
Now that KVM takes vcpu->mutex inside kvm->lock when creating a vCPU, drop
the hack to manually inform lockdep of the kvm->lock => vcpu->mutex
ordering.
This effectively reverts commit
|
|
|
|
e53dc37f5a |
KVM: Don't BUG() the kernel if xa_insert() fails with -EBUSY
WARN once instead of triggering a BUG if xa_insert() fails because it encountered an existing entry. While KVM guarantees there should be no existing entry, there's no reason to BUG the kernel, as KVM needs to gracefully handle failure anyways. Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241009150455.1057573-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
d0831edcd8 |
Revert "KVM: Fix vcpu_array[0] races"
Now that KVM loads from vcpu_array if and only if the target index is
valid with respect to online_vcpus, i.e. now that it is safe to erase a
not-fully-onlined vCPU entry, revert to storing into vcpu_array before
success is guaranteed.
If xa_store() fails, which _should_ be impossible, then putting the vCPU's
reference to 'struct kvm' results in a refcounting bug as the vCPU fd has
been installed and owns the vCPU's reference.
This was found by inspection, but forcing the xa_store() to fail
confirms the problem:
| Unable to handle kernel paging request at virtual address ffff800080ecd960
| Call trace:
| _raw_spin_lock_irq+0x2c/0x70
| kvm_irqfd_release+0x24/0xa0
| kvm_vm_release+0x1c/0x38
| __fput+0x88/0x2ec
| ____fput+0x10/0x1c
| task_work_run+0xb0/0xd4
| do_exit+0x210/0x854
| do_group_exit+0x70/0x98
| get_signal+0x6b0/0x73c
| do_signal+0xa4/0x11e8
| do_notify_resume+0x60/0x12c
| el0_svc+0x64/0x68
| el0t_64_sync_handler+0x84/0xfc
| el0t_64_sync+0x190/0x194
| Code: b9000909 d503201f 2a1f03e1 52800028 (88e17c08)
Practically speaking, this is a non-issue as xa_store() can't fail, absent
a nasty kernel bug. But the code is visually jarring and technically
broken.
This reverts commit
|
|
|
|
6e2b2358b3 |
KVM: Grab vcpu->mutex across installing the vCPU's fd and bumping online_vcpus
During vCPU creation, acquire vcpu->mutex prior to exposing the vCPU to userspace, and hold the mutex until online_vcpus is bumped, i.e. until the vCPU is fully online from KVM's perspective. To ensure asynchronous vCPU ioctls also wait for the vCPU to come online, explicitly check online_vcpus at the start of kvm_vcpu_ioctl(), and take the vCPU's mutex to wait if necessary (having to wait for any ioctl should be exceedingly rare, i.e. not worth optimizing). Reported-by: Will Deacon <will@kernel.org> Reported-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/all/20240730155646.1687-1-will@kernel.org Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241009150455.1057573-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
4aca98a8a1 |
VFIO updates for v6.13
- Constify an unmodified structure used in linking vfio and kvm.
(Christophe JAILLET)
- Add ID for an additional hardware SKU supported by the nvgrace-gpu
vfio-pci variant driver. (Ankit Agrawal)
- Fix incorrect signed cast in QAT vfio-pci variant driver, negating
test in check_add_overflow(), though still caught by later tests.
(Giovanni Cabiddu)
- Additional debugfs attributes exposed in hisi_acc vfio-pci variant
driver for migration debugging. (Longfang Liu)
- Migration support is added to the virtio vfio-pci variant driver,
becoming the primary feature of the driver while retaining emulation
of virtio legacy support as a secondary option. (Yishai Hadas)
- Fixes to a few unwind flows in the mlx5 vfio-pci driver discovered
through reviews of the virtio variant driver. (Yishai Hadas)
- Fix an unlikely issue where a PCI device exposed to userspace with
an unknown capability at the base of the extended capability chain
can overflow an array index. (Avihai Horon)
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmdE2SEbHGFsZXgud2ls
bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiXa8P/ikuJ33L7sHnLJErYzHB
j2IPNY224LQrpXY+Rnfe4HVCcaSGO7Azeh95DYBFl7ZJ9QJxZbFhUt7Fl8jiKEOj
k5ag0e+SP4+5tMp2lZBehTa+xlZQLJ4QXMRxWF2kpfXyX7v6JaNKZhXWJ6lPvbrL
zco911Qr1Y5Kqc/kdgX6HGfNusoScj9d0leHNIrka2FFJnq3qZqGtmRKWe9V9zP3
Ke5idU1vYNNBDbOz51D6hZbxZLGxIkblG15sw7LNE3O1lhWznfG+gkJm7u7curlj
CrwR4XvXkgAtglsi8KOJHW84s4BO87UgAde3RUUXgXFcfkTQDSOGQuYVDVSKgFOs
eJCagrpz0p5jlS6LfrUyHU9FhK1sbDQdb8iJQRUUPVlR9U0kfxFbyv3HX7JmGoWw
csOr8Eh2dXmC4EWan9rscw2lxYdoeSmJW0qLhhcGylO7kUGxXRm8vP+MVenkfINX
9OPtsOsFhU7HDl54UsujBA5x8h03HIWmHz3rx8NllxL1E8cfhXivKUViuV8jCXB3
6rVT5mn2VHnXICiWZFXVmjZgrAK3mBfA+6ugi/nbWVdnn8VMomLuB/Df+62wSPSV
ICApuWFBhSuSVmQcJ6fsCX6a8x+E2bZDPw9xqZP7krPUdP1j5rJofgZ7wkdYToRv
HN0p5NcNwnoW2aM5chN9Ons1
=nTtY
-----END PGP SIGNATURE-----
Merge tag 'vfio-v6.13-rc1' of https://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Constify an unmodified structure used in linking vfio and kvm
(Christophe JAILLET)
- Add ID for an additional hardware SKU supported by the nvgrace-gpu
vfio-pci variant driver (Ankit Agrawal)
- Fix incorrect signed cast in QAT vfio-pci variant driver, negating
test in check_add_overflow(), though still caught by later tests
(Giovanni Cabiddu)
- Additional debugfs attributes exposed in hisi_acc vfio-pci variant
driver for migration debugging (Longfang Liu)
- Migration support is added to the virtio vfio-pci variant driver,
becoming the primary feature of the driver while retaining emulation
of virtio legacy support as a secondary option (Yishai Hadas)
- Fixes to a few unwind flows in the mlx5 vfio-pci driver discovered
through reviews of the virtio variant driver (Yishai Hadas)
- Fix an unlikely issue where a PCI device exposed to userspace with an
unknown capability at the base of the extended capability chain can
overflow an array index (Avihai Horon)
* tag 'vfio-v6.13-rc1' of https://github.com/awilliam/linux-vfio:
vfio/pci: Properly hide first-in-list PCIe extended capability
vfio/mlx5: Fix unwind flows in mlx5vf_pci_save/resume_device_data()
vfio/mlx5: Fix an unwind issue in mlx5vf_add_migration_pages()
vfio/virtio: Enable live migration once VIRTIO_PCI was configured
vfio/virtio: Add PRE_COPY support for live migration
vfio/virtio: Add support for the basic live migration functionality
virtio-pci: Introduce APIs to execute device parts admin commands
virtio: Manage device and driver capabilities via the admin commands
virtio: Extend the admin command to include the result size
virtio_pci: Introduce device parts access commands
Documentation: add debugfs description for hisi migration
hisi_acc_vfio_pci: register debugfs for hisilicon migration driver
hisi_acc_vfio_pci: create subfunction for data reading
hisi_acc_vfio_pci: extract public functions for container_of
vfio/qat: fix overflow check in qat_vf_resume_write()
vfio/nvgrace-gpu: Add a new GH200 SKU to the devid table
kvm/vfio: Constify struct kvm_device_ops
|
|
|
|
9f16d5e6f2 |
The biggest change here is eliminating the awful idea that KVM had, of
essentially guessing which pfns are refcounted pages. The reason to
do so was that KVM needs to map both non-refcounted pages (for example
BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP VMAs that contain
refcounted pages. However, the result was security issues in the past,
and more recently the inability to map VM_IO and VM_PFNMAP memory
that _is_ backed by struct page but is not refcounted. In particular
this broke virtio-gpu blob resources (which directly map host graphics
buffers into the guest as "vram" for the virtio-gpu device) with the
amdgpu driver, because amdgpu allocates non-compound higher order pages
and the tail pages could not be mapped into KVM.
This requires adjusting all uses of struct page in the per-architecture
code, to always work on the pfn whenever possible. The large series that
did this, from David Stevens and Sean Christopherson, also cleaned up
substantially the set of functions that provided arch code with the
pfn for a host virtual addresses. The previous maze of twisty little
passages, all different, is replaced by five functions (__gfn_to_page,
__kvm_faultin_pfn, the non-__ versions of these two, and kvm_prefetch_pages)
saving almost 200 lines of code.
ARM:
* Support for stage-1 permission indirection (FEAT_S1PIE) and
permission overlays (FEAT_S1POE), including nested virt + the
emulated page table walker
* Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
was introduced in PSCIv1.3 as a mechanism to request hibernation,
similar to the S4 state in ACPI
* Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
part of it, introduce trivial initialization of the host's MPAM
context so KVM can use the corresponding traps
* PMU support under nested virtualization, honoring the guest
hypervisor's trap configuration and event filtering when running a
nested guest
* Fixes to vgic ITS serialization where stale device/interrupt table
entries are not zeroed when the mapping is invalidated by the VM
* Avoid emulated MMIO completion if userspace has requested synchronous
external abort injection
* Various fixes and cleanups affecting pKVM, vCPU initialization, and
selftests
LoongArch:
* Add iocsr and mmio bus simulation in kernel.
* Add in-kernel interrupt controller emulation.
* Add support for virtualization extensions to the eiointc irqchip.
PPC:
* Drop lingering and utterly obsolete references to PPC970 KVM, which was
removed 10 years ago.
* Fix incorrect documentation references to non-existing ioctls
RISC-V:
* Accelerate KVM RISC-V when running as a guest
* Perf support to collect KVM guest statistics from host side
s390:
* New selftests: more ucontrol selftests and CPU model sanity checks
* Support for the gen17 CPU model
* List registers supported by KVM_GET/SET_ONE_REG in the documentation
x86:
* Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve
documentation, harden against unexpected changes. Even if the hardware
A/D tracking is disabled, it is possible to use the hardware-defined A/D
bits to track if a PFN is Accessed and/or Dirty, and that removes a lot
of special cases.
* Elide TLB flushes when aging secondary PTEs, as has been done in x86's
primary MMU for over 10 years.
* Recover huge pages in-place in the TDP MMU when dirty page logging is
toggled off, instead of zapping them and waiting until the page is
re-accessed to create a huge mapping. This reduces vCPU jitter.
* Batch TLB flushes when dirty page logging is toggled off. This reduces
the time it takes to disable dirty logging by ~3x.
* Remove the shrinker that was (poorly) attempting to reclaim shadow page
tables in low-memory situations.
* Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.
* Advertise CPUIDs for new instructions in Clearwater Forest
* Quirk KVM's misguided behavior of initialized certain feature MSRs to
their maximum supported feature set, which can result in KVM creating
invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero
value results in the vCPU having invalid state if userspace hides PDCM
from the guest, which in turn can lead to save/restore failures.
* Fix KVM's handling of non-canonical checks for vCPUs that support LA57
to better follow the "architecture", in quotes because the actual
behavior is poorly documented. E.g. most MSR writes and descriptor
table loads ignore CR4.LA57 and operate purely on whether the CPU
supports LA57.
* Bypass the register cache when querying CPL from kvm_sched_out(), as
filling the cache from IRQ context is generally unsafe; harden the
cache accessors to try to prevent similar issues from occuring in the
future. The issue that triggered this change was already fixed in 6.12,
but was still kinda latent.
* Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
over-advertises SPEC_CTRL when trying to support cross-vendor VMs.
* Minor cleanups
* Switch hugepage recovery thread to use vhost_task. These kthreads can
consume significant amounts of CPU time on behalf of a VM or in response
to how the VM behaves (for example how it accesses its memory); therefore
KVM tried to place the thread in the VM's cgroups and charge the CPU
time consumed by that work to the VM's container. However the kthreads
did not process SIGSTOP/SIGCONT, and therefore cgroups which had KVM
instances inside could not complete freezing. Fix this by replacing the
kthread with a PF_USER_WORKER thread, via the vhost_task abstraction.
Another 100+ lines removed, with generally better behavior too like
having these threads properly parented in the process tree.
* Revert a workaround for an old CPU erratum (Nehalem/Westmere) that didn't
really work; there was really nothing to work around anyway: the broken
patch was meant to fix nested virtualization, but the PERF_GLOBAL_CTRL
MSR is virtualized and therefore unaffected by the erratum.
* Fix 6.12 regression where CONFIG_KVM will be built as a module even
if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is 'y'.
x86 selftests:
* x86 selftests can now use AVX.
Documentation:
* Use rST internal links
* Reorganize the introduction to the API document
Generic:
* Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead
of RCU, so that running a vCPU on a different task doesn't encounter long
due to having to wait for all CPUs become quiescent. In general both reads
and writes are rare, but userspace that supports confidential computing is
introducing the use of "helper" vCPUs that may jump from one host processor
to another. Those will be very happy to trigger a synchronize_rcu(), and
the effect on performance is quite the disaster.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmc9MRYUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroP00QgArxqxBIGLCW5t7bw7vtNq63QYRyh4
dTiDguLiYQJ+AXmnRu11R6aPC7HgMAvlFCCmH+GEce4WEgt26hxCmncJr/aJOSwS
letCS7TrME16PeZvh25A1nhPBUw6mTF1qqzgcdHMrqXG8LuHoGcKYGSRVbkf3kfI
1ZoMq1r8ChXbVVmCx9DQ3gw1TVr5Dpjs2voLh8rDSE9Xpw0tVVabHu3/NhQEz/F+
t8/nRaqH777icCHIf9PCk5HnarHxLAOvhM2M0Yj09PuBcE5fFQxpxltw/qiKQqqW
ep4oquojGl87kZnhlDaac2UNtK90Ws+WxxvCwUmbvGN0ZJVaQwf4FvTwig==
=lWpE
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"The biggest change here is eliminating the awful idea that KVM had of
essentially guessing which pfns are refcounted pages.
The reason to do so was that KVM needs to map both non-refcounted
pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP
VMAs that contain refcounted pages.
However, the result was security issues in the past, and more recently
the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by
struct page but is not refcounted. In particular this broke virtio-gpu
blob resources (which directly map host graphics buffers into the
guest as "vram" for the virtio-gpu device) with the amdgpu driver,
because amdgpu allocates non-compound higher order pages and the tail
pages could not be mapped into KVM.
This requires adjusting all uses of struct page in the
per-architecture code, to always work on the pfn whenever possible.
The large series that did this, from David Stevens and Sean
Christopherson, also cleaned up substantially the set of functions
that provided arch code with the pfn for a host virtual addresses.
The previous maze of twisty little passages, all different, is
replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the
non-__ versions of these two, and kvm_prefetch_pages) saving almost
200 lines of code.
ARM:
- Support for stage-1 permission indirection (FEAT_S1PIE) and
permission overlays (FEAT_S1POE), including nested virt + the
emulated page table walker
- Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This
call was introduced in PSCIv1.3 as a mechanism to request
hibernation, similar to the S4 state in ACPI
- Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
part of it, introduce trivial initialization of the host's MPAM
context so KVM can use the corresponding traps
- PMU support under nested virtualization, honoring the guest
hypervisor's trap configuration and event filtering when running a
nested guest
- Fixes to vgic ITS serialization where stale device/interrupt table
entries are not zeroed when the mapping is invalidated by the VM
- Avoid emulated MMIO completion if userspace has requested
synchronous external abort injection
- Various fixes and cleanups affecting pKVM, vCPU initialization, and
selftests
LoongArch:
- Add iocsr and mmio bus simulation in kernel.
- Add in-kernel interrupt controller emulation.
- Add support for virtualization extensions to the eiointc irqchip.
PPC:
- Drop lingering and utterly obsolete references to PPC970 KVM, which
was removed 10 years ago.
- Fix incorrect documentation references to non-existing ioctls
RISC-V:
- Accelerate KVM RISC-V when running as a guest
- Perf support to collect KVM guest statistics from host side
s390:
- New selftests: more ucontrol selftests and CPU model sanity checks
- Support for the gen17 CPU model
- List registers supported by KVM_GET/SET_ONE_REG in the
documentation
x86:
- Cleanup KVM's handling of Accessed and Dirty bits to dedup code,
improve documentation, harden against unexpected changes.
Even if the hardware A/D tracking is disabled, it is possible to
use the hardware-defined A/D bits to track if a PFN is Accessed
and/or Dirty, and that removes a lot of special cases.
- Elide TLB flushes when aging secondary PTEs, as has been done in
x86's primary MMU for over 10 years.
- Recover huge pages in-place in the TDP MMU when dirty page logging
is toggled off, instead of zapping them and waiting until the page
is re-accessed to create a huge mapping. This reduces vCPU jitter.
- Batch TLB flushes when dirty page logging is toggled off. This
reduces the time it takes to disable dirty logging by ~3x.
- Remove the shrinker that was (poorly) attempting to reclaim shadow
page tables in low-memory situations.
- Clean up and optimize KVM's handling of writes to
MSR_IA32_APICBASE.
- Advertise CPUIDs for new instructions in Clearwater Forest
- Quirk KVM's misguided behavior of initialized certain feature MSRs
to their maximum supported feature set, which can result in KVM
creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to
a non-zero value results in the vCPU having invalid state if
userspace hides PDCM from the guest, which in turn can lead to
save/restore failures.
- Fix KVM's handling of non-canonical checks for vCPUs that support
LA57 to better follow the "architecture", in quotes because the
actual behavior is poorly documented. E.g. most MSR writes and
descriptor table loads ignore CR4.LA57 and operate purely on
whether the CPU supports LA57.
- Bypass the register cache when querying CPL from kvm_sched_out(),
as filling the cache from IRQ context is generally unsafe; harden
the cache accessors to try to prevent similar issues from occuring
in the future. The issue that triggered this change was already
fixed in 6.12, but was still kinda latent.
- Advertise AMD_IBPB_RET to userspace, and fix a related bug where
KVM over-advertises SPEC_CTRL when trying to support cross-vendor
VMs.
- Minor cleanups
- Switch hugepage recovery thread to use vhost_task.
These kthreads can consume significant amounts of CPU time on
behalf of a VM or in response to how the VM behaves (for example
how it accesses its memory); therefore KVM tried to place the
thread in the VM's cgroups and charge the CPU time consumed by that
work to the VM's container.
However the kthreads did not process SIGSTOP/SIGCONT, and therefore
cgroups which had KVM instances inside could not complete freezing.
Fix this by replacing the kthread with a PF_USER_WORKER thread, via
the vhost_task abstraction. Another 100+ lines removed, with
generally better behavior too like having these threads properly
parented in the process tree.
- Revert a workaround for an old CPU erratum (Nehalem/Westmere) that
didn't really work; there was really nothing to work around anyway:
the broken patch was meant to fix nested virtualization, but the
PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the
erratum.
- Fix 6.12 regression where CONFIG_KVM will be built as a module even
if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is
'y'.
x86 selftests:
- x86 selftests can now use AVX.
Documentation:
- Use rST internal links
- Reorganize the introduction to the API document
Generic:
- Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock
instead of RCU, so that running a vCPU on a different task doesn't
encounter long due to having to wait for all CPUs become quiescent.
In general both reads and writes are rare, but userspace that
supports confidential computing is introducing the use of "helper"
vCPUs that may jump from one host processor to another. Those will
be very happy to trigger a synchronize_rcu(), and the effect on
performance is quite the disaster"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits)
KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD
KVM: x86: add back X86_LOCAL_APIC dependency
Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()"
KVM: x86: switch hugepage recovery thread to vhost_task
KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR
x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest
Documentation: KVM: fix malformed table
irqchip/loongson-eiointc: Add virt extension support
LoongArch: KVM: Add irqfd support
LoongArch: KVM: Add PCHPIC user mode read and write functions
LoongArch: KVM: Add PCHPIC read and write functions
LoongArch: KVM: Add PCHPIC device support
LoongArch: KVM: Add EIOINTC user mode read and write functions
LoongArch: KVM: Add EIOINTC read and write functions
LoongArch: KVM: Add EIOINTC device support
LoongArch: KVM: Add IPI user mode read and write function
LoongArch: KVM: Add IPI read and write function
LoongArch: KVM: Add IPI device support
LoongArch: KVM: Add iocsr and mmio bus simulation in kernel
KVM: arm64: Pass on SVE mapping failures
...
|
|
|
|
0f25f0e4ef |
the bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same
scope where they'd been created, getting rid of reassignments
and passing them by reference, converting to CLASS(fd{,_pos,_raw}).
We are getting very close to having the memory safety of that stuff
trivial to verify.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdikAAKCRBZ7Krx/gZQ
69nJAQCmbQHK3TGUbQhOw6MJXOK9ezpyEDN3FZb4jsu38vTIdgEA6OxAYDO2m2g9
CN18glYmD3wRyU6Bwl4vGODouSJvDgA=
=gVH3
-----END PGP SIGNATURE-----
Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 'struct fd' class updates from Al Viro:
"The bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same scope
where they'd been created, getting rid of reassignments and passing
them by reference, converting to CLASS(fd{,_pos,_raw}).
We are getting very close to having the memory safety of that stuff
trivial to verify"
* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
deal with the last remaing boolean uses of fd_file()
css_set_fork(): switch to CLASS(fd_raw, ...)
memcg_write_event_control(): switch to CLASS(fd)
assorted variants of irqfd setup: convert to CLASS(fd)
do_pollfd(): convert to CLASS(fd)
convert do_select()
convert vfs_dedupe_file_range().
convert cifs_ioctl_copychunk()
convert media_request_get_by_fd()
convert spu_run(2)
switch spufs_calls_{get,put}() to CLASS() use
convert cachestat(2)
convert do_preadv()/do_pwritev()
fdget(), more trivial conversions
fdget(), trivial conversions
privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
o2hb_region_dev_store(): avoid goto around fdget()/fdput()
introduce "fd_pos" class, convert fdget_pos() users to it.
fdget_raw() users: switch to CLASS(fd_raw)
convert vmsplice() to CLASS(fd)
...
|
|
|
|
d96c77bd4e |
KVM: x86: switch hugepage recovery thread to vhost_task
kvm_vm_create_worker_thread() is meant to be used for kthreads that can consume significant amounts of CPU time on behalf of a VM or in response to how the VM behaves (for example how it accesses its memory). Therefore it wants to charge the CPU time consumed by that work to the VM's container. However, because of these threads, cgroups which have kvm instances inside never complete freezing. This can be trivially reproduced: root@test ~# mkdir /sys/fs/cgroup/test root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs root@test ~# qemu-system-x86_64 -nographic -enable-kvm and in another terminal: root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze root@test ~# cat /sys/fs/cgroup/test/cgroup.events populated 1 frozen 0 The cgroup freezing happens in the signal delivery path but kvm_nx_huge_page_recovery_worker, while joining non-root cgroups, never calls into the signal delivery path and thus never gets frozen. Because the cgroup freezer determines whether a given cgroup is frozen by comparing the number of frozen threads to the total number of threads in the cgroup, the cgroup never becomes frozen and users waiting for the state transition may hang indefinitely. Since the worker kthread is tied to a user process, it's better if it behaves similarly to user tasks as much as possible, including being able to send SIGSTOP and SIGCONT. In fact, vhost_task is all that kvm_vm_create_worker_thread() wanted to be and more: not only it inherits the userspace process's cgroups, it has other niceties like being parented properly in the process tree. Use it instead of the homegrown alternative. Incidentally, the new code is also better behaved when you flip recovery back and forth to disabled and back to enabled. If your recovery period is 1 minute, it will run the next recovery after 1 minute independent of how many times you flipped the parameter. (Commit message based on emails from Tejun). Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Luca Boccassi <bluca@debian.org> Acked-by: Tejun Heo <tj@kernel.org> Tested-by: Luca Boccassi <bluca@debian.org> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
c59de14133 |
KVM x86 MMU changes for 6.13
- Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve
documentation, harden against unexpected changes, and to simplify
A/D-disabled MMUs by using the hardware-defined A/D bits to track if a
PFN is Accessed and/or Dirty.
- Elide TLB flushes when aging SPTEs, as has been done in x86's primary
MMU for over 10 years.
- Batch TLB flushes when zapping collapsible TDP MMU SPTEs, i.e. when
dirty logging is toggled off, which reduces the time it takes to disable
dirty logging by ~3x.
- Recover huge pages in-place in the TDP MMU instead of zapping the SP
and waiting until the page is re-accessed to create a huge mapping.
Proactively installing huge pages can reduce vCPU jitter in extreme
scenarios.
- Remove support for (poorly) reclaiming page tables in shadow MMUs via
the primary MMU's shrinker interface.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmczpgQACgkQOlYIJqCj
N/0qTg//WMEFjC05VrtfwnI8+aCN3au1ALh1Lh1AQzgpVtJfINAmOBAVcIA8JEm9
kX+kApVW0y5y88VgHEb3J43Rbk0LuLQTePQhsfZ8vVJNhNjnxeraOR8wyLsjOjVS
Dpb7AtMxKM/9GqrP5YBFaPPf5YKu/FZFpZfNdKu4esf+Re+usWYpJcTOJOAkQeCL
v9grKLdVZCWEydO9QXrO6EnOW+EWc5Rmrx4521OBggqbeslAjbD6mAhE32T5Eg2C
fdwVAN/XcVv5g1lIof/fHyiXviiztSV+FJCxipxJYsn3HWPgSxBcRvrydsOLY5lT
ghvHn0YptZaxND9oyksP63lrF7LuAb8FI44kr9OOZIZeT0QFVNGBWOn2jv6k1xkr
hLJ1zqKgIwXMgSALlfk2VoywUDLRwv4OAoL31j7m+/nGMlFDtFfY3oARfpZ4kI7K
Taop67iQkHFIh8eX0t3HCPbIop/0oMKINKHC0VmVDgml5l+vIUApRvD6gsXYpnTD
Q+AdQgwYU0lM7J6U9GjebisSanim3zKpUggrG5FdAwGGNjns5HLtzLLyBu+KNHic
aTPs8HCu0jK51Re3fS7Q5IZBzsec15uv1tyCIDVhNuqtocgzXER6Y4YojC8qNPla
XR8H2d/SskJBNkdfbvWsAp+sd/nvtEgv8deEvTmO2SEg5pxhnN8=
=xTZ8
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mmu-6.13' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.13
- Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve
documentation, harden against unexpected changes, and to simplify
A/D-disabled MMUs by using the hardware-defined A/D bits to track if a
PFN is Accessed and/or Dirty.
- Elide TLB flushes when aging SPTEs, as has been done in x86's primary
MMU for over 10 years.
- Batch TLB flushes when zapping collapsible TDP MMU SPTEs, i.e. when
dirty logging is toggled off, which reduces the time it takes to disable
dirty logging by ~3x.
- Recover huge pages in-place in the TDP MMU instead of zapping the SP
and waiting until the page is re-accessed to create a huge mapping.
Proactively installing huge pages can reduce vCPU jitter in extreme
scenarios.
- Remove support for (poorly) reclaiming page tables in shadow MMUs via
the primary MMU's shrinker interface.
|
|
|
|
b39d1578d5 |
KVM generic changes for 6.13
- Rework kvm_vcpu_on_spin() to use a single for-loop instead of making two
partial poasses over "all" vCPUs. Opportunistically expand the comment
to better explain the motivation and logic.
- Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead
of RCU, so that running a vCPU on a different task doesn't encounter
long stalls due to having to wait for all CPUs become quiescent.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmczn8AACgkQOlYIJqCj
N/39KA/8CKNrGLbbRSn0jaOLW2Xq8mA2ce1CgzTGScIcL6R+bLHSTRPeeUXMPuC7
CbaRCrBtkP+qn0GJPp+Jo0luH5fxSQL16/iXC+rB8//Lw0ttEygWO3oY8MsV1YjF
opRhHjqHNIzT1gLEXRSzwm/lBv3jjZG8+5hjaz/LKKIABuzS4bsmacNvHFpIXqhx
tJDoZydiTFNGFA9+GD5telD/CxK43U5VIbwwnisFtSvNE+PMm1vT3cHIT8lUvuxB
eoeAyS78SuWnJn4KhYSjByleBYAeqgxaGNHZVow0B6CMKRLocZdppRQklXE3ap/2
3+V3LNdFl5LBjAP/FHWuMUoovGANUyVQumCHaflz9d8a4acMQInWJtirg/iqGLDN
JXAXJZBh+JybEBVsPbGvxotTaXPRuMYzKCF512dRcmSzNk73yUSDg8XfOI45rJ9k
sxFTAGk4blFhE7Htos5rQ/NmvY1K62va1TAdFXqThdx+b6Y5TjWmRE8dCIMouLoJ
UpEiL0CmbS1yAFXReeyIV/EcBI4Q8C3KXP13n9zyNUCnEQK7CQgc+OmakFL6d5MX
p8Dq98NBZ0GN+Ay1x9WjfpQCvaH8/7Hf9mltbkTzQ8Zrz/4nmuZ5YuRsIunLL/uO
dyAs72I9lw7tHAoHJtBd9W7ShKJwYHLtQl9i9umKPOnsc03iwN8=
=/hvr
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-generic-6.13' of https://github.com/kvm-x86/linux into HEAD
KVM generic changes for 6.13
- Rework kvm_vcpu_on_spin() to use a single for-loop instead of making two
partial poasses over "all" vCPUs. Opportunistically expand the comment
to better explain the motivation and logic.
- Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead
of RCU, so that running a vCPU on a different task doesn't encounter
long stalls due to having to wait for all CPUs become quiescent.
|
|
|
|
e3e0f9b7ae |
KVM/riscv changes for 6.13
- Accelerate KVM RISC-V when running as a guest - Perf support to collect KVM guest statistics from host side -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmct20AACgkQrUjsVaLH LAdyWw/+Os6fIFhThsur7z26E/+7Edxk/QEfUMEEvQPwKfwiRdHYLU1dWGTCw2nA xr8yR9MQFkUMTHZBPi+zki47uE6gW1qMb1tl8AuZ4cjZCMSg0pBObdB/axT5+517 boC74A1r04dlvW3sEvcMaHen0B75Hwm703jjOIiThiZFlB519HcattTPAJo8nTRD D4hhpTY/v8QyFlMg0m7KnOOa3MtmR0Q704LaAdEQ5S9kdMlUtMzljHvWIzYdTfjJ C7v7EAo+iRi7ecBdg9ZxhOkcKUB2xF+lmLYBRoiOeaECJ4RIaseY63SqLyzlGeNZ Dx7DuF0mljC3nmHPH/60pOxDKYWGxjx5IQF1ORBvcsux4uhwH8KkyjcxYody/m82 HXalAwmN4mA2asrt0xTpIoOUkwOGy7LEKx6083hbrSgzk9E/bnoi+YDe+Gm4z+YR ofqsj4G6x0fS85RCKB4iQGRcHgW2IPHYeKWXYYP0m1COe9kj9E3POBpAjj3qMxZr R+eveYhLZ7A3B/DAVYQQVoCcQWECXC0M4FOOVh4vEV2JeEB55dlI77/p6X5y7RGj YA7njv9eC5hrxOgUMOLFIvMqFKe7dcuYmE3V11+TD5lHBfqWfKE7vA1V1rOA8e8R Qf5DXLVBbsNj2yirbj1LI3x7jg5aXw5IhylJNn/c1vezqw8RUfg= =sI+4 -----END PGP SIGNATURE----- Merge tag 'kvm-riscv-6.13-1' of https://github.com/kvm-riscv/linux into HEAD KVM/riscv changes for 6.13 - Accelerate KVM RISC-V when running as a guest - Perf support to collect KVM guest statistics from host side |
|
|
|
66635b0776 |
assorted variants of irqfd setup: convert to CLASS(fd)
in all of those failure exits prior to fdget() are plain returns and the only thing done after fdput() is (on failure exits) a kfree(), which can be done before fdput() just fine. NOTE: in acrn_irqfd_assign() 'fail:' failure exit is wrong for eventfd_ctx_fileget() failure (we only want fdput() there) and once we stop doing that, it doesn't need to check if eventfd is NULL or ERR_PTR(...) there. NOTE: in privcmd we move fdget() up before the allocation - more to the point, before the copy_from_user() attempt. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
8152f82010 |
fdget(), more trivial conversions
all failure exits prior to fdget() leave the scope, all matching fdput() are immediately followed by leaving the scope. [xfs_ioc_commit_range() chunk moved here as well] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
6348be02ee |
fdget(), trivial conversions
fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
2ebbe0308c |
KVM: Allow arch code to elide TLB flushes when aging a young page
Add a Kconfig to allow architectures to opt-out of a TLB flush when a young page is aged, as invalidating TLB entries is not functionally required on most KVM-supported architectures. Stale TLB entries can result in false negatives and theoretically lead to suboptimal reclaim, but in practice all observations have been that the performance gained by skipping TLB flushes outweighs any performance lost by reclaiming hot pages. E.g. the primary MMUs for x86 RISC-V, s390, and PPC Book3S elide the TLB flush for ptep_clear_flush_young(), and arm64's MMU skips the trailing DSB that's required for ordering (presumably because there are optimizations related to eliding other TLB flushes when doing make-before-break). Link: https://lore.kernel.org/r/20241011021051.1557902-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
3e7f43188e |
KVM: Protect vCPU's "last run PID" with rwlock, not RCU
To avoid jitter on KVM_RUN due to synchronize_rcu(), use a rwlock instead of RCU to protect vcpu->pid, a.k.a. the pid of the task last used to a vCPU. When userspace is doing M:N scheduling of tasks to vCPUs, e.g. to run SEV migration helper vCPUs during post-copy, the synchronize_rcu() needed to change the PID associated with the vCPU can stall for hundreds of milliseconds, which is problematic for latency sensitive post-copy operations. In the directed yield path, do not acquire the lock if it's contended, i.e. if the associated PID is changing, as that means the vCPU's task is already running. Reported-by: Steve Rutherford <srutherford@google.com> Reviewed-by: Steve Rutherford <srutherford@google.com> Acked-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240802200136.329973-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
6cf9ef23d9 |
KVM: Return '0' directly when there's no task to yield to
Do "return 0" instead of initializing and returning a local variable in kvm_vcpu_yield_to(), e.g. so that it's more obvious what the function returns if there is no task. No functional change intended. Acked-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240802200136.329973-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
7e513617da |
KVM: Rework core loop of kvm_vcpu_on_spin() to use a single for-loop
Rework kvm_vcpu_on_spin() to use a single for-loop instead of making "two" passes over all vCPUs. Given N=kvm->last_boosted_vcpu, the logic is to iterate from vCPU[N+1]..vcpu[N-1], i.e. using two loops is just a kludgy way of handling the wrap from the last vCPU to vCPU0 when a boostable vCPU isn't found in vcpu[N+1]..vcpu[MAX]. Open code the xa_load() instead of using kvm_get_vcpu() to avoid reading online_vcpus in every loop, as well as the accompanying smp_rmb(), i.e. make it a custom kvm_for_each_vcpu(), for all intents and purposes. Oppurtunistically clean up the comment explaining the logic. Link: https://lore.kernel.org/r/20240802202121.341348-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
bbee049d8e |
kvm/vfio: Constify struct kvm_device_ops
'struct kvm_device_ops' is not modified in this driver. Constifying this structure moves some data to a read-only section, so increases overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig: Before: ====== text data bss dec hex filename 2605 169 16 2790 ae6 virt/kvm/vfio.o After: ===== text data bss dec hex filename 2685 89 16 2790 ae6 virt/kvm/vfio.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/e7361a1bb7defbb0f7056b884e83f8d75ac9fe21.1727517084.git.christophe.jaillet@wanadoo.fr Signed-off-by: Alex Williamson <alex.williamson@redhat.com> |
|
|
|
8b15c3764c |
KVM: Don't grab reference on VM_MIXEDMAP pfns that have a "struct page"
Now that KVM no longer relies on an ugly heuristic to find its struct page references, i.e. now that KVM can't get false positives on VM_MIXEDMAP pfns, remove KVM's hack to elevate the refcount for pfns that happen to have a valid struct page. In addition to removing a long-standing wart in KVM, this allows KVM to map non-refcounted struct page memory into the guest, e.g. for exposing GPU TTM buffers to KVM guests. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-86-seanjc@google.com> |
|
|
|
93b7da404f |
KVM: Drop APIs that manipulate "struct page" via pfns
Remove all kvm_{release,set}_pfn_*() APIs now that all users are gone.
No functional change intended.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-85-seanjc@google.com>
|
|
|
|
31fccdd212 |
KVM: Make kvm_follow_pfn.refcounted_page a required field
Now that the legacy gfn_to_pfn() APIs are gone, and all callers of hva_to_pfn() pass in a refcounted_page pointer, make it a required field to ensure all future usage in KVM plays nice. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-82-seanjc@google.com> |
|
|
|
06cdaff80e |
KVM: Drop gfn_to_pfn() APIs now that all users are gone
Drop gfn_to_pfn() and all its variants now that all users are gone. No functional change intended. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-80-seanjc@google.com> |
|
|
|
f42e289a20 |
KVM: Add support for read-only usage of gfn_to_page()
Rework gfn_to_page() to support read-only accesses so that it can be used by arm64 to get MTE tags out of guest memory. Opportunistically rewrite the comment to be even more stern about using gfn_to_page(), as there are very few scenarios where requiring a struct page is actually the right thing to do (though there are such scenarios). Add a FIXME to call out that KVM probably should be pinning pages, not just getting pages. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-77-seanjc@google.com> |
|
|
|
ce6bf70346 |
KVM: Convert gfn_to_page() to use kvm_follow_pfn()
Convert gfn_to_page() to the new kvm_follow_pfn() internal API, which will eventually allow removing gfn_to_pfn() and kvm_pfn_to_refcounted_page(). Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-76-seanjc@google.com> |
|
|
|
1fbee5b01a |
KVM: guest_memfd: Provide "struct page" as output from kvm_gmem_get_pfn()
Provide the "struct page" associated with a guest_memfd pfn as an output from __kvm_gmem_get_pfn() so that KVM guest page fault handlers can directly put the page instead of having to rely on kvm_pfn_to_refcounted_page(). Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-47-seanjc@google.com> |
|
|
|
4af18dc6a9 |
KVM: guest_memfd: Pass index, not gfn, to __kvm_gmem_get_pfn()
Refactor guest_memfd usage of __kvm_gmem_get_pfn() to pass the index into the guest_memfd file instead of the gfn, i.e. resolve the index based on the slot+gfn in the caller instead of in __kvm_gmem_get_pfn(). This will allow kvm_gmem_get_pfn() to retrieve and return the specific "struct page", which requires the index into the folio, without a redoing the index calculation multiple times (which isn't costly, just hard to follow). Opportunistically add a kvm_gmem_get_index() helper to make the copy+pasted code easier to understand. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-46-seanjc@google.com> |
|
|
|
1c7b627e93 |
KVM: Add kvm_faultin_pfn() to specifically service guest page faults
Add a new dedicated API, kvm_faultin_pfn(), for servicing guest page faults, i.e. for getting pages/pfns that will be mapped into the guest via an mmu_notifier-protected KVM MMU. Keep struct kvm_follow_pfn buried in internal code, as having __kvm_faultin_pfn() take "out" params is actually cleaner for several architectures, e.g. it allows the caller to have its own "page fault" structure without having to marshal data to/from kvm_follow_pfn. Long term, common KVM would ideally provide a kvm_page_fault structure, a la x86's struct of the same name. But all architectures need to be converted to a common API before that can happen. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-44-seanjc@google.com> |
|
|
|
68e51d0a43 |
KVM: Disallow direct access (w/o mmu_notifier) to unpinned pfn by default
Add an off-by-default module param to control whether or not KVM is allowed to map memory that isn't pinned, i.e. that KVM can't guarantee won't be freed while it is mapped into KVM and/or the guest. Don't remove the functionality entirely, as there are use cases where mapping unpinned memory is safe (as defined by the platform owner), e.g. when memory is hidden from the kernel and managed by userspace, in which case userspace is already fully trusted to not muck with guest memory mappings. But for more typical setups, mapping unpinned memory is wildly unsafe, and unnecessary. The APIs are used exclusively by x86's nested virtualization support, and there is no known (or sane) use case for mapping PFN-mapped memory a KVM guest _and_ letting the guest use it for virtualization structures. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-36-seanjc@google.com> |