Commit Graph

1261 Commits

Author SHA1 Message Date
Linus Torvalds edb0e8f6e2 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Nested virtualization support for VGICv3, giving the nested
     hypervisor control of the VGIC hardware when running an L2 VM

   - Removal of 'late' nested virtualization feature register masking,
     making the supported feature set directly visible to userspace

   - Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
     of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers

   - Paravirtual interface for discovering the set of CPU
     implementations where a VM may run, addressing a longstanding issue
     of guest CPU errata awareness in big-little systems and
     cross-implementation VM migration

   - Userspace control of the registers responsible for identifying a
     particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
     allowing VMs to be migrated cross-implementation

   - pKVM updates, including support for tracking stage-2 page table
     allocations in the protected hypervisor in the 'SecPageTable' stat

   - Fixes to vPMU, ensuring that userspace updates to the vPMU after
     KVM_RUN are reflected into the backing perf events

  LoongArch:

   - Remove unnecessary header include path

   - Assume constant PGD during VM context switch

   - Add perf events support for guest VM

  RISC-V:

   - Disable the kernel perf counter during configure

   - KVM selftests improvements for PMU

   - Fix warning at the time of KVM module removal

  x86:

   - Add support for aging of SPTEs without holding mmu_lock.

     Not taking mmu_lock allows multiple aging actions to run in
     parallel, and more importantly avoids stalling vCPUs. This includes
     an implementation of per-rmap-entry locking; aging the gfn is done
     with only a per-rmap single-bit spinlock taken, whereas locking an
     rmap for write requires taking both the per-rmap spinlock and the
     mmu_lock.

     Note that this decreases slightly the accuracy of accessed-page
     information, because changes to the SPTE outside aging might not
     use atomic operations even if they could race against a clear of
     the Accessed bit.

     This is deliberate because KVM and mm/ tolerate false
     positives/negatives for accessed information, and testing has shown
     that reducing the latency of aging is far more beneficial to
     overall system performance than providing "perfect" young/old
     information.

   - Defer runtime CPUID updates until KVM emulates a CPUID instruction,
     to coalesce updates when multiple pieces of vCPU state are
     changing, e.g. as part of a nested transition

   - Fix a variety of nested emulation bugs, and add VMX support for
     synthesizing nested VM-Exit on interception (instead of injecting
     #UD into L2)

   - Drop "support" for async page faults for protected guests that do
     not set SEND_ALWAYS (i.e. that only want async page faults at CPL3)

   - Bring a bit of sanity to x86's VM teardown code, which has
     accumulated a lot of cruft over the years. Particularly, destroy
     vCPUs before the MMU, despite the latter being a VM-wide operation

   - Add common secure TSC infrastructure for use within SNP and in the
     future TDX

   - Block KVM_CAP_SYNC_REGS if guest state is protected. It does not
     make sense to use the capability if the relevant registers are not
     available for reading or writing

   - Don't take kvm->lock when iterating over vCPUs in the suspend
     notifier to fix a largely theoretical deadlock

   - Use the vCPU's actual Xen PV clock information when starting the
     Xen timer, as the cached state in arch.hv_clock can be stale/bogus

   - Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across
     different PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as
     KVM's suspend notifier only accounts for kvmclock, and there's no
     evidence that the flag is actually supported by Xen guests

   - Clean up the per-vCPU "cache" of its reference pvclock, and instead
     only track the vCPU's TSC scaling (multiplier+shift) metadata (which
     is moderately expensive to compute, and rarely changes for modern
     setups)

   - Don't write to the Xen hypercall page on MSR writes that are
     initiated by the host (userspace or KVM) to fix a class of bugs
     where KVM can write to guest memory at unexpected times, e.g.
     during vCPU creation if userspace has set the Xen hypercall MSR
     index to collide with an MSR that KVM emulates

   - Restrict the Xen hypercall MSR index to the unofficial synthetic
     range to reduce the set of possible collisions with MSRs that are
     emulated by KVM (collisions can still happen as KVM emulates
     Hyper-V MSRs, which also reside in the synthetic range)

   - Clean up and optimize KVM's handling of Xen MSR writes and
     xen_hvm_config

   - Update Xen TSC leaves during CPUID emulation instead of modifying
     the CPUID entries when updating PV clocks; there is no guarantee PV
     clocks will be updated between TSC frequency changes and CPUID
     emulation, and guest reads of the TSC leaves should be rare, i.e.
     are not a hot path

  x86 (Intel):

   - Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and
     thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1

   - Pass XFD_ERR as the payload when injecting #NM, as a preparatory
     step for upcoming FRED virtualization support

   - Decouple the EPT entry RWX protection bit macros from the EPT
     Violation bits, both as a general cleanup and in anticipation of
     adding support for emulating Mode-Based Execution Control (MBEC)

   - Reject KVM_RUN if userspace manages to gain control and stuff
     invalid guest state while KVM is in the middle of emulating nested
     VM-Enter

   - Add a macro to handle KVM's sanity checks on entry/exit VMCS
     control pairs in anticipation of adding sanity checks for secondary
     exit controls (the primary field is out of bits)

  x86 (AMD):

   - Ensure the PSP driver is initialized when both the PSP and KVM
     modules are built-in (the initcall framework doesn't handle
     dependencies)

   - Use long-term pins when registering encrypted memory regions, so
     that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and
     don't lead to excessive fragmentation

   - Add macros and helpers for setting GHCB return/error codes

   - Add support for Idle HLT interception, which elides interception if
     the vCPU has a pending, unmasked virtual IRQ when HLT is executed

   - Fix a bug in INVPCID emulation where KVM fails to check for a
     non-canonical address

   - Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is
     invalid, e.g. because the vCPU was "destroyed" via SNP's AP
     Creation hypercall

   - Reject SNP AP Creation if the requested SEV features for the vCPU
     don't match the VM's configured set of features

  Selftests:

   - Fix again the Intel PMU counters test; add a data load and do
     CLFLUSH{OPT} on the data instead of executing code. The theory is
     that modern Intel CPUs have learned new code prefetching tricks
     that bypass the PMU counters

   - Fix a flaw in the Intel PMU counters test where it asserts that an
     event is counting correctly without actually knowing what the event
     counts on the underlying hardware

   - Fix a variety of flaws, bugs, and false failures/passes
     in dirty_log_test, and improve its coverage by collecting all dirty
     entries on each iteration

   - Fix a few minor bugs related to handling of stats FDs

   - Add infrastructure to make vCPU and VM stats FDs available to tests
     by default (open the FDs during VM/vCPU creation)

   - Relax an assertion on the number of HLT exits in the xAPIC IPI test
     when running on a CPU that supports AMD's Idle HLT (which elides
     interception of HLT if a virtual IRQ is pending and unmasked)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (216 commits)
  RISC-V: KVM: Optimize comments in kvm_riscv_vcpu_isa_disable_allowed
  RISC-V: KVM: Teardown riscv specific bits after kvm_exit
  LoongArch: KVM: Register perf callbacks for guest
  LoongArch: KVM: Implement arch-specific functions for guest perf
  LoongArch: KVM: Add stub for kvm_arch_vcpu_preempted_in_kernel()
  LoongArch: KVM: Remove PGD saving during VM context switch
  LoongArch: KVM: Remove unnecessary header include path
  KVM: arm64: Tear down vGIC on failed vCPU creation
  KVM: arm64: PMU: Reload when resetting
  KVM: arm64: PMU: Reload when user modifies registers
  KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs
  KVM: arm64: PMU: Assume PMU presence in pmu-emul.c
  KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
  KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu
  KVM: arm64: Factor out pKVM hyp vcpu creation to separate function
  KVM: arm64: Initialize HCRX_EL2 traps in pKVM
  KVM: arm64: Factor out setting HCRX_EL2 traps into separate function
  KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected
  KVM: x86: Add infrastructure for secure TSC
  KVM: x86: Push down setting vcpu.arch.user_set_tsc
  ...
2025-03-25 14:22:07 -07:00
Linus Torvalds 99c21beaab Merge tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Features:

   - Add CONFIG_DEBUG_VFS infrastructure:
      - Catch invalid modes in open
      - Use the new debug macros in inode_set_cached_link()
      - Use debug-only asserts around fd allocation and install

   - Place f_ref to 3rd cache line in struct file to resolve false
     sharing

  Cleanups:

   - Start using anon_inode_getfile_fmode() helper in various places

   - Don't take f_lock during SEEK_CUR if exclusion is guaranteed by
     f_pos_lock

   - Add unlikely() to kcmp()

   - Remove legacy ->remount_fs method from ecryptfs after port to the
     new mount api

   - Remove invalidate_inodes() in favour of evict_inodes()

   - Simplify ep_busy_loop() by removing an unused argument

   - Avoid mmap sem relocks when coredumping with many missing pages

   - Inline getname()

   - Inline new_inode_pseudo() and de-staticize alloc_inode()

   - Dodge an atomic in putname if ref == 1

   - Consistently deref the files table with rcu_dereference_raw()

   - Dedup handling of struct filename init and refcounts bumps

   - Use wq_has_sleeper() in end_dir_add()

   - Drop the lock trip around I_NEW wake up in evict()

   - Load the ->i_sb pointer once in inode_sb_list_{add,del}

   - Predict not reaching the limit in alloc_empty_file()

   - Tidy up do_sys_openat2() with likely/unlikely

   - Call inode_sb_list_add() outside of inode hash lock

   - Sort out fd allocation vs dup2 race commentary

   - Turn page_offset() into a wrapper around folio_pos()

   - Remove locking in exportfs around ->get_parent() call

   - try_lookup_one_len() does not need any locks in autofs

   - Fix return type of several functions from long to int in open

   - Fix return type of several functions from long to int in ioctls

  Fixes:

   - Fix watch queue accounting mismatch"

* tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
  fs: sort out fd allocation vs dup2 race commentary, take 2
  fs: call inode_sb_list_add() outside of inode hash lock
  fs: tidy up do_sys_openat2() with likely/unlikely
  fs: predict not reaching the limit in alloc_empty_file()
  fs: load the ->i_sb pointer once in inode_sb_list_{add,del}
  fs: drop the lock trip around I_NEW wake up in evict()
  fs: use wq_has_sleeper() in end_dir_add()
  VFS/autofs: try_lookup_one_len() does not need any locks
  fs: dedup handling of struct filename init and refcounts bumps
  fs: consistently deref the files table with rcu_dereference_raw()
  exportfs: remove locking around ->get_parent() call.
  fs: use debug-only asserts around fd allocation and install
  fs: dodge an atomic in putname if ref == 1
  vfs: Remove invalidate_inodes()
  ecryptfs: remove NULL remount_fs from super_operations
  watch_queue: fix pipe accounting mismatch
  fs: place f_ref to 3rd cache line in struct file to resolve false sharing
  epoll: simplify ep_busy_loop by removing always 0 argument
  fs: Turn page_offset() into a wrapper around folio_pos()
  kcmp: improve performance adding an unlikely hint to task comparisons
  ...
2025-03-24 09:13:50 -07:00
Paolo Bonzini 361da275e5 Merge branch 'kvm-nvmx-and-vm-teardown' into HEAD
The immediate issue being fixed here is a nVMX bug where KVM fails to
detect that, after nested VM-Exit, L1 has a pending IRQ (or NMI).
However, checking for a pending interrupt accesses the legacy PIC, and
x86's kvm_arch_destroy_vm() currently frees the PIC before destroying
vCPUs, i.e. checking for IRQs during the forced nested VM-Exit results
in a NULL pointer deref; fixing that teardown ordering is a prerequisite
for the nVMX fix.

The remaining patches attempt to bring a bit of sanity to x86's VM
teardown code, which has accumulated a lot of cruft over the years.  E.g.
KVM currently unloads each vCPU's MMUs in a separate operation from
destroying vCPUs, all because when guest SMP support was added, KVM had a
kludgy MMU teardown flow that broke when a VM had more than one vCPU.
And that oddity lived on, for 18 years...
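
A rough sketch of the reordering, using existing x86 teardown helpers but
with the surrounding flow simplified:

  static void kvm_arch_destroy_vm(struct kvm *kvm)
  {
          /* ... earlier teardown elided ... */

          /*
           * Destroy vCPUs before VM-wide state such as the in-kernel
           * PIC: forcing a nested VM-Exit during vCPU teardown can check
           * for pending IRQs, which dereferences the PIC.
           */
          kvm_destroy_vcpus(kvm);
          kvm_pic_destroy(kvm);
          kvm_ioapic_destroy(kvm);

          /* ... remaining teardown elided ... */
  }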

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-20 13:13:00 -04:00
Sean Christopherson b2aba529bf KVM: Drop kvm_arch_sync_events() now that all implementations are nops
Remove kvm_arch_sync_events() now that x86 no longer uses it (no other
arch has ever used it).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Message-ID: <20250224235542.2562848-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26 13:17:23 -05:00
Sean Christopherson ed8f966331 KVM: Assert that a destroyed/freed vCPU is no longer visible
After freeing a vCPU, assert that it is no longer reachable, and that
kvm_get_vcpu() doesn't return garbage or a pointer to some other vCPU.
While KVM obviously shouldn't be attempting to access a freed vCPU, it's
all too easy for KVM to make a VM-wide request, e.g. via KVM_BUG_ON() or
kvm_flush_remote_tlbs().

Alternatively, KVM could short-circuit problematic paths if the VM's
refcount has gone to zero, e.g. in kvm_make_all_cpus_request(), or KVM
could try disallow making global requests during teardown.  But given that
deleting the vCPU from the array Just Works, adding logic to the requests
path is unnecessary, and trying to make requests illegal during teardown
would be a fool's errand.
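
As a minimal sketch of the assertion at the erase site (the exact form in
kvm_destroy_vcpus() may differ):

  xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
  kvm_vcpu_destroy(vcpu);

  /* The freed vCPU must no longer be reachable via the xarray. */
  WARN_ON_ONCE(xa_load(&kvm->vcpu_array, vcpu->vcpu_idx));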

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250224235542.2562848-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26 13:17:23 -05:00
Al Viro f9835fa147 make use of anon_inode_getfile_fmode()
["fallen through the cracks" misc stuff]

A bunch of anon_inode_getfile() callers follow it with adjusting
->f_mode; we have a helper doing that now, so let's make use
of it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20250118014434.GT1977892@ZenIV
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-21 10:25:31 +01:00
James Houghton aa34b81165 KVM: Allow lockless walk of SPTEs when handing aging mmu_notifier event
It is possible to correctly do aging without taking the KVM MMU lock,
or while taking it for read; add a Kconfig to let architectures do so.
Architectures that select KVM_MMU_LOCKLESS_AGING are responsible for
correctness.
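
Schematically, a hedged sketch of how a caller might dispatch on the new
Kconfig; kvm_age_gfn() is the existing arch hook, but the locking split
shown here is illustrative rather than the actual code:

  /*
   * Architectures selecting KVM_MMU_LOCKLESS_AGING promise that their
   * kvm_age_gfn() implementation is safe without mmu_lock held for write.
   */
  if (IS_ENABLED(CONFIG_KVM_MMU_LOCKLESS_AGING)) {
          young = kvm_age_gfn(kvm, range);
  } else {
          write_lock(&kvm->mmu_lock);
          young = kvm_age_gfn(kvm, range);
          write_unlock(&kvm->mmu_lock);
  }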

Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-3-jthoughton@google.com
[sean: massage shortlog+changelog, fix Kconfig goof and shorten name]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14 07:16:35 -08:00
James Houghton 374ccd6360 KVM: Rename kvm_handle_hva_range()
Rename kvm_handle_hva_range() to kvm_age_hva_range(),
kvm_handle_hva_range_no_flush() to kvm_age_hva_range_no_flush(), and
__kvm_handle_hva_range() to kvm_handle_hva_range(), as
kvm_age_hva_range() will get more aging-specific functionality.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-2-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 11:09:30 -08:00
Paolo Bonzini 6f61269495 KVM: remove kvm_arch_post_init_vm
The only statement in a kvm_arch_post_init_vm implementation
can be moved into the x86 kvm_arch_init_vm.  Do so and remove all
traces from architecture-independent code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-04 11:27:45 -05:00
Sean Christopherson 66119f8ce1 KVM: Do not restrict the size of KVM-internal memory regions
Exempt KVM-internal memslots from the KVM_MEM_MAX_NR_PAGES restriction, as
the limit on the number of pages exists purely to play nice with dirty
bitmap operations, which use 32-bit values to index the bitmaps, and dirty
logging isn't supported for KVM-internal memslots.

Link: https://lore.kernel.org/all/20240802205003.353672-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250123144627.312456-2-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-2-imbrenda@linux.ibm.com>
2025-01-31 12:03:52 +01:00
Paolo Bonzini 86eb1aef72 Merge branch 'kvm-mirror-page-tables' into HEAD
As part of enabling TDX virtual machines, support separation of
private/shared EPT into separate roots.

Confidential computing solutions almost invariably have concepts of
private and shared memory, but they may differ a lot in the details.
In SEV, for example, the bit is handled more like a permission bit as
far as the page tables are concerned: the private/shared bit is not
included in the physical address.

For TDX, instead, the bit is more like a physical address bit, with
the host mapping private memory in one half of the address space and
shared in another.  Furthermore, the two halves are mapped by different
EPT roots and only the shared half is managed by KVM; the private half
(also called Secure EPT in Intel documentation) gets managed by the
privileged TDX Module via SEAMCALLs.

As a result, the operations that actually change the private half of
the EPT are limited and relatively slow compared to reading a PTE. For
this reason the design for KVM is to keep a mirror of the private EPT in
host memory.  This allows KVM to quickly walk the EPT and only perform the
slower private EPT operations when it needs to actually modify mid-level
private PTEs.

There are thus three sets of EPT page tables: external, mirror and
direct.  In the case of TDX (the only user of this framework) the
first two cover private memory, whereas the third manages shared
memory:

  external EPT - Hidden within the TDX module, modified via TDX module
                 calls.

  mirror EPT   - Bookkeeping tree used as an optimization by KVM, not
                 used by the processor.

  direct EPT   - Normal EPT that maps unencrypted shared memory.
                 Managed like the EPT of a normal VM.

Modifying external EPT
----------------------

Modifications to the mirrored page tables need to also perform the
same operations to the private page tables, which will be handled via
kvm_x86_ops.  Although this prep series does not interact with the TDX
module at all to actually configure the private EPT, it does lay the
ground work for doing this.

In some ways updating the private EPT is as simple as plumbing PTE
modifications through to also call into the TDX module; however, the
locking is more complicated because inserting a single PTE cannot anymore
be done atomically with a single CMPXCHG.  For this reason, the existing
FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the
private EPT.  FROZEN_SPTE acts basically as a spinlock on a PTE.  Besides
protecting operation of KVM, it limits the set of cases in which the
TDX module will encounter contention on its own PTE locks.

Zapping external EPT
--------------------
While the framework tries to be relatively generic, and to be
understandable without knowing TDX much in detail, some requirements of
TDX sometimes leak; for example the private page tables also cannot be
zapped while the range has anything mapped, so the mirrored/private page
tables need to be protected from KVM operations that zap any non-leaf
PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().

For normal VMs, guest memory is zapped for several reasons: user
memory getting paged out by the guest, memslots getting deleted,
passthrough of devices with non-coherent DMA.  Confidential computing
adds to these the conversion of memory between shared and privates. These
operations must not zap any private memory that is in use by the guest.

This is possible because the only zapping that is out of the control
of KVM/userspace is paging out userspace memory, which cannot apply to
guestmemfd operations.  Thus a TDX VM will only zap private memory from
memslot deletion and from conversion between private and shared memory
which is triggered by the guest.

To avoid zapping too much memory, enums are introduced so that operations
can choose to target only private or shared memory, and thus only
direct or mirror EPT.  For example:

  Memslot deletion           - Private and shared
  MMU notifier based zapping - Shared only
  Conversion to shared       - Private only
  Conversion to private      - Shared only

Other cases of zapping will not be supported for KVM, for example
APICv update or non-coherent DMA status update; for the latter, TDX will
simply require that the CPU supports self-snoop and honor guest PAT
unconditionally for shared memory.
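
A hedged sketch of the FROZEN_SPTE handshake described above;
tdx_set_private_spte() stands in for the real kvm_x86_ops plumbing and
the details are approximate:

  /* Freeze the SPTE so concurrent walkers see a locked, non-present value. */
  if (!try_cmpxchg64(sptep, &old_spte, FROZEN_SPTE))
          return -EBUSY;                          /* raced, caller retries */

  /* Slow path: have the TDX module update the external (private) EPT. */
  ret = tdx_set_private_spte(kvm, gfn, level, new_spte);

  /* Unfreeze: publish the new SPTE, or restore the old one on failure. */
  WRITE_ONCE(*sptep, ret ? old_spte : new_spte);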
2025-01-20 07:15:58 -05:00
Paolo Bonzini 7a9164dc69 Merge tag 'kvm-x86-vcpu_array-6.14' of https://github.com/kvm-x86/linux into HEAD

KVM vcpu_array fixes and cleanups for 6.14:

 - Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug
   where KVM would return a pointer to a vCPU prior to it being fully online,
   and give kvm_for_each_vcpu() similar treatment to fix a similar flaw.

 - Wait for a vCPU to come online prior to executing a vCPU ioctl to fix a
   bug where userspace could coerce KVM into handling the ioctl on a vCPU that
   isn't yet onlined.

 - Gracefully handle xa_insert() failures even though such failures should be
   impossible in practice.
2025-01-20 06:36:40 -05:00
Sean Christopherson 0cc3cb2151 KVM: Disallow all flags for KVM-internal memslots
Disallow all flags for KVM-internal memslots as all existing flags require
some amount of userspace interaction to have any meaning.  In addition to
guarding against KVM goofs, explicitly disallowing dirty logging of KVM-
internal memslots will (hopefully) allow exempting KVM-internal memslots
from the KVM_MEM_MAX_NR_PAGES limit, which appears to exist purely because
the dirty bitmap operations use a 32-bit index.

Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-14 17:36:16 -08:00
Sean Christopherson 344315e93d KVM: x86: Drop double-underscores from __kvm_set_memory_region()
Now that there's no outer wrapper for __kvm_set_memory_region() and it's
static, drop its double-underscore prefix.

No functional change intended.

Cc: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-14 17:36:16 -08:00
Sean Christopherson 156bffdb2b KVM: Add a dedicated API for setting KVM-internal memslots
Add a dedicated API for setting internal memslots, and have it explicitly
disallow setting userspace memslots.  Setting a userspace memslots without
a direct command from userspace would result in all manner of issues.

No functional change intended.
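
A minimal sketch of such a setter; the guard shown is an assumption about
how the real kvm_set_internal_memslot() rejects userspace slots:

  int kvm_set_internal_memslot(struct kvm *kvm,
                               const struct kvm_userspace_memory_region2 *region)
  {
          /* Internal memslots live above the userspace-visible range. */
          if (WARN_ON_ONCE(region->slot < KVM_USER_MEM_SLOTS))
                  return -EINVAL;

          return kvm_set_memory_region(kvm, region);
  }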

Cc: Tao Su <tao1.su@linux.intel.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-14 17:36:15 -08:00
Sean Christopherson d131f0042f KVM: Assert slots_lock is held when setting memory regions
Add proper lockdep assertions in __kvm_set_memory_region() and
__x86_set_memory_region() instead of relying on comments.

Opportunistically delete __kvm_set_memory_region()'s entire function
comment as the API doesn't allocate memory or select a gfn, and the
"mostly for framebuffers" comment hasn't been true for a very long time.

Cc: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-14 17:36:15 -08:00
Sean Christopherson f81a6d12bf KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API)
Open code kvm_set_memory_region() into its sole caller in preparation for
adding a dedicated API for setting internal memslots.

Opportunistically use the fancy new guard(mutex) to avoid a local 'r'
variable.

Cc: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-14 17:36:15 -08:00
Isaku Yamahata dca6c88532 KVM: Add member to struct kvm_gfn_range to indicate private/shared
Add new members to struct kvm_gfn_range to indicate which mapping
(private-vs-shared) to operate on: enum kvm_gfn_range_filter
attr_filter. Update the core zapping operations to set them appropriately.

TDX utilizes two GPA aliases for the same memslots, one for memory that is
for private memory and one that is for shared. For private memory, KVM
cannot always perform the same operations it does on memory for default
VMs, such as zapping pages and having them be faulted back in, as this
requires guest coordination. However, some operations such as guest driven
conversion of memory between private and shared should zap private memory.

Internally to the MMU, private and shared mappings are tracked on separate
roots. Mapping and zapping operations will operate on the respective GFN
alias for each root (private or shared). So zapping operations will by
default zap both aliases. Add fields in struct kvm_gfn_range to allow
callers to specify which aliases so they can only target the aliases
appropriate for their specific operation.

There was feedback that target aliases should be specified such that the
default value (0) is to operate on both aliases. Several options were
considered, including variations with separate bools whose default
behavior was to process both aliases; these either allowed nonsensical
configurations or were confusing for the caller. A simple
enum was also explored and was close, but was hard to process in the
caller. Instead, use an enum with the default value (0) reserved as a
disallowed value. Catch ranges that didn't have the target aliases
specified by looking for that specific value.

Set target alias with enum appropriately for these MMU operations:
 - For KVM's mmu notifier callbacks, zap shared pages only because private
   pages won't have a userspace mapping
 - For setting memory attributes, kvm_arch_pre_set_memory_attributes()
   chooses the aliases based on the attribute.
 - For guest_memfd invalidations, zap private only.
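
For reference, a sketch of the filter enum and the new member (field
layout approximate):

  /* 0 is deliberately invalid: callers must always pick a target. */
  enum kvm_gfn_range_filter {
          KVM_FILTER_SHARED  = BIT(0),
          KVM_FILTER_PRIVATE = BIT(1),
  };

  struct kvm_gfn_range {
          struct kvm_memory_slot *slot;
          gfn_t start;
          gfn_t end;
          union kvm_mmu_notifier_arg arg;
          enum kvm_gfn_range_filter attr_filter;
          bool may_block;
  };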

Link: https://lore.kernel.org/kvm/ZivIF9vjKcuGie3s@google.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-3-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23 08:28:55 -05:00
Sean Christopherson 01528db67f KVM: Drop hack that "manually" informs lockdep of kvm->lock vs. vcpu->mutex
Now that KVM takes vcpu->mutex inside kvm->lock when creating a vCPU, drop
the hack to manually inform lockdep of the kvm->lock => vcpu->mutex
ordering.

This effectively reverts commit 42a90008f8 ("KVM: Ensure lockdep knows
about kvm->lock vs. vcpu->mutex ordering rule").

Cc: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241009150455.1057573-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 14:37:30 -08:00
Sean Christopherson e53dc37f5a KVM: Don't BUG() the kernel if xa_insert() fails with -EBUSY
WARN once instead of triggering a BUG if xa_insert() fails because it
encountered an existing entry.  While KVM guarantees there should be no
existing entry, there's no reason to BUG the kernel, as KVM needs to
gracefully handle failure anyways.
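
A minimal sketch of the change at the insertion site, with the unwinding
shown schematically:

  r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
  WARN_ON_ONCE(r == -EBUSY);   /* was BUG_ON(); a duplicate entry is a KVM bug */
  if (r)
          goto unlock_vcpu_destroy;       /* gracefully unwind vCPU creation */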

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241009150455.1057573-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 14:37:30 -08:00
Sean Christopherson d0831edcd8 Revert "KVM: Fix vcpu_array[0] races"
Now that KVM loads from vcpu_array if and only if the target index is
valid with respect to online_vcpus, i.e. now that it is safe to erase a
not-fully-onlined vCPU entry, revert to storing into vcpu_array before
success is guaranteed.

If xa_store() fails, which _should_ be impossible, then putting the vCPU's
reference to 'struct kvm' results in a refcounting bug as the vCPU fd has
been installed and owns the vCPU's reference.

This was found by inspection, but forcing the xa_store() to fail
confirms the problem:

 | Unable to handle kernel paging request at virtual address ffff800080ecd960
 | Call trace:
 |  _raw_spin_lock_irq+0x2c/0x70
 |  kvm_irqfd_release+0x24/0xa0
 |  kvm_vm_release+0x1c/0x38
 |  __fput+0x88/0x2ec
 |  ____fput+0x10/0x1c
 |  task_work_run+0xb0/0xd4
 |  do_exit+0x210/0x854
 |  do_group_exit+0x70/0x98
 |  get_signal+0x6b0/0x73c
 |  do_signal+0xa4/0x11e8
 |  do_notify_resume+0x60/0x12c
 |  el0_svc+0x64/0x68
 |  el0t_64_sync_handler+0x84/0xfc
 |  el0t_64_sync+0x190/0x194
 | Code: b9000909 d503201f 2a1f03e1 52800028 (88e17c08)

Practically speaking, this is a non-issue as xa_store() can't fail, absent
a nasty kernel bug.  But the code is visually jarring and technically
broken.

This reverts commit afb2acb2e3.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Michal Luczaj <mhal@rbox.co>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Reported-by: Will Deacon <will@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241009150455.1057573-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 14:37:30 -08:00
Sean Christopherson 6e2b2358b3 KVM: Grab vcpu->mutex across installing the vCPU's fd and bumping online_vcpus
During vCPU creation, acquire vcpu->mutex prior to exposing the vCPU to
userspace, and hold the mutex until online_vcpus is bumped, i.e. until the
vCPU is fully online from KVM's perspective.

To ensure asynchronous vCPU ioctls also wait for the vCPU to come online,
explicitly check online_vcpus at the start of kvm_vcpu_ioctl(), and take
the vCPU's mutex to wait if necessary (having to wait for any ioctl should
be exceedingly rare, i.e. not worth optimizing).
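
A hedged sketch of the wait at the top of kvm_vcpu_ioctl(); treat the
exact shape as approximate:

  /*
   * If the vCPU isn't fully online, wait for the creating task, which
   * holds vcpu->mutex until online_vcpus is bumped.
   */
  if (vcpu->vcpu_idx >= atomic_read(&vcpu->kvm->online_vcpus)) {
          if (mutex_lock_killable(&vcpu->mutex))
                  return -EINTR;
          mutex_unlock(&vcpu->mutex);
  }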

Reported-by: Will Deacon <will@kernel.org>
Reported-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/all/20240730155646.1687-1-will@kernel.org
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241009150455.1057573-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 14:37:30 -08:00
Paolo Bonzini d96c77bd4e KVM: x86: switch hugepage recovery thread to vhost_task
kvm_vm_create_worker_thread() is meant to be used for kthreads that
can consume significant amounts of CPU time on behalf of a VM or in
response to how the VM behaves (for example how it accesses its memory).
Therefore it wants to charge the CPU time consumed by that work to
the VM's container.

However, because of these threads, cgroups which have kvm instances
inside never complete freezing.  This can be trivially reproduced:

  root@test ~# mkdir /sys/fs/cgroup/test
  root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs
  root@test ~# qemu-system-x86_64 -nographic -enable-kvm

and in another terminal:

  root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
  root@test ~# cat /sys/fs/cgroup/test/cgroup.events
  populated 1
  frozen 0

The cgroup freezing happens in the signal delivery path but
kvm_nx_huge_page_recovery_worker, while joining non-root cgroups, never
calls into the signal delivery path and thus never gets frozen. Because
the cgroup freezer determines whether a given cgroup is frozen by
comparing the number of frozen threads to the total number of threads
in the cgroup, the cgroup never becomes frozen and users waiting for
the state transition may hang indefinitely.

Since the worker kthread is tied to a user process, it's better if
it behaves similarly to user tasks as much as possible, including
being able to send SIGSTOP and SIGCONT.  In fact, vhost_task is all
that kvm_vm_create_worker_thread() wanted to be and more: not only does it
inherit the userspace process's cgroups, it has other niceties like
being parented properly in the process tree.  Use it instead of the
homegrown alternative.

Incidentally, the new code is also better behaved when you flip recovery
back and forth to disabled and back to enabled.  If your recovery period
is 1 minute, it will run the next recovery after 1 minute independent
of how many times you flipped the parameter.

(Commit message based on emails from Tejun).
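
A sketch of the replacement, hedged because the vhost_task_create()
signature has varied across kernel releases:

  /*
   * Run NX huge page recovery as a vhost_task so that it inherits the
   * owning process's cgroups and is parented properly in the process
   * tree.  Treat the exact call shape as an assumption.
   */
  kvm->arch.nx_huge_page_recovery_thread =
          vhost_task_create(kvm_nx_huge_page_recovery_worker, kvm,
                            "kvm-nx-lpage-recovery");
  if (kvm->arch.nx_huge_page_recovery_thread)
          vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);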

Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Luca Boccassi <bluca@debian.org>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Luca Boccassi <bluca@debian.org>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-14 13:20:04 -05:00
Paolo Bonzini c59de14133 Merge tag 'kvm-x86-mmu-6.13' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.13

 - Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve
   documentation, harden against unexpected changes, and to simplify
   A/D-disabled MMUs by using the hardware-defined A/D bits to track if a
   PFN is Accessed and/or Dirty.

 - Elide TLB flushes when aging SPTEs, as has been done in x86's primary
   MMU for over 10 years.

 - Batch TLB flushes when zapping collapsible TDP MMU SPTEs, i.e. when
   dirty logging is toggled off, which reduces the time it takes to disable
   dirty logging by ~3x.

 - Recover huge pages in-place in the TDP MMU instead of zapping the SP
   and waiting until the page is re-accessed to create a huge mapping.
   Proactively installing huge pages can reduce vCPU jitter in extreme
   scenarios.

 - Remove support for (poorly) reclaiming page tables in shadow MMUs via
   the primary MMU's shrinker interface.
2024-11-13 06:31:54 -05:00
Paolo Bonzini b39d1578d5 Merge tag 'kvm-x86-generic-6.13' of https://github.com/kvm-x86/linux into HEAD

KVM generic changes for 6.13

 - Rework kvm_vcpu_on_spin() to use a single for-loop instead of making two
   partial passes over "all" vCPUs.  Opportunistically expand the comment
   to better explain the motivation and logic.

 - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead
   of RCU, so that running a vCPU on a different task doesn't encounter
   long stalls due to having to wait for all CPUs to become quiescent.
2024-11-13 06:24:19 -05:00
Paolo Bonzini e3e0f9b7ae Merge tag 'kvm-riscv-6.13-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv changes for 6.13

- Accelerate KVM RISC-V when running as a guest
- Perf support to collect KVM guest statistics from host side
2024-11-08 12:13:48 -05:00
Sean Christopherson 2ebbe0308c KVM: Allow arch code to elide TLB flushes when aging a young page
Add a Kconfig to allow architectures to opt-out of a TLB flush when a
young page is aged, as invalidating TLB entries is not functionally
required on most KVM-supported architectures.  Stale TLB entries can
result in false negatives and theoretically lead to suboptimal reclaim,
but in practice all observations have been that the performance gained by
skipping TLB flushes outweighs any performance lost by reclaiming hot
pages.

E.g. the primary MMUs for x86, RISC-V, s390, and PPC Book3S elide the TLB
flush for ptep_clear_flush_young(), and arm64's MMU skips the trailing DSB
that's required for ordering (presumably because there are optimizations
related to eliding other TLB flushes when doing make-before-break).
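
Schematically, with the Kconfig name taken from this series:

  /* Sketch: skip the TLB flush for a young page when the arch opts out. */
  if (young && !IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG))
          kvm_flush_remote_tlbs(kvm);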

Link: https://lore.kernel.org/r/20241011021051.1557902-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 15:25:40 -07:00
Sean Christopherson 3e7f43188e KVM: Protect vCPU's "last run PID" with rwlock, not RCU
To avoid jitter on KVM_RUN due to synchronize_rcu(), use a rwlock instead
of RCU to protect vcpu->pid, a.k.a. the pid of the task last used to run a
vCPU.  When userspace is doing M:N scheduling of tasks to vCPUs, e.g. to
run SEV migration helper vCPUs during post-copy, the synchronize_rcu()
needed to change the PID associated with the vCPU can stall for hundreds
of milliseconds, which is problematic for latency sensitive post-copy
operations.

In the directed yield path, do not acquire the lock if it's contended,
i.e. if the associated PID is changing, as that means the vCPU's task is
already running.
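
A minimal sketch of the directed-yield fast path, assuming the new lock is
named pid_lock:

  if (!read_trylock(&vcpu->pid_lock))
          return 0;     /* contended => PID is changing => task is running */

  if (vcpu->pid)
          task = get_pid_task(vcpu->pid, PIDTYPE_PID);

  read_unlock(&vcpu->pid_lock);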

Reported-by: Steve Rutherford <srutherford@google.com>
Reviewed-by: Steve Rutherford <srutherford@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240802200136.329973-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:41:22 -07:00
Sean Christopherson 6cf9ef23d9 KVM: Return '0' directly when there's no task to yield to
Do "return 0" instead of initializing and returning a local variable in
kvm_vcpu_yield_to(), e.g. so that it's more obvious what the function
returns if there is no task.

No functional change intended.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240802200136.329973-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:29:31 -07:00
Sean Christopherson 7e513617da KVM: Rework core loop of kvm_vcpu_on_spin() to use a single for-loop
Rework kvm_vcpu_on_spin() to use a single for-loop instead of making "two"
passes over all vCPUs.  Given N=kvm->last_boosted_vcpu, the logic is to
iterate from vCPU[N+1]..vcpu[N-1], i.e. using two loops is just a kludgy
way of handling the wrap from the last vCPU to vCPU0 when a boostable vCPU
isn't found in vcpu[N+1]..vcpu[MAX].

Open code the xa_load() instead of using kvm_get_vcpu() to avoid reading
online_vcpus in every loop, as well as the accompanying smp_rmb(), i.e.
make it a custom kvm_for_each_vcpu(), for all intents and purposes.

Opportunistically clean up the comment explaining the logic.
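
A sketch of the wrap-around iteration, with names assumed from the
description above:

  int nr = atomic_read(&kvm->online_vcpus);
  int start = READ_ONCE(kvm->last_boosted_vcpu);
  int i, idx;

  for (i = 0; i < nr; i++) {
          idx = (start + i + 1) % nr;     /* vCPU[N+1] .. wrap .. vCPU[N] */
          vcpu = xa_load(&kvm->vcpu_array, idx);
          if (!vcpu || !READ_ONCE(vcpu->ready))
                  continue;
          /* ... boostability checks and kvm_vcpu_yield_to() elided ... */
  }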

Link: https://lore.kernel.org/r/20240802202121.341348-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:27:51 -07:00
Sean Christopherson 8b15c3764c KVM: Don't grab reference on VM_MIXEDMAP pfns that have a "struct page"
Now that KVM no longer relies on an ugly heuristic to find its struct page
references, i.e. now that KVM can't get false positives on VM_MIXEDMAP
pfns, remove KVM's hack to elevate the refcount for pfns that happen to
have a valid struct page.  In addition to removing a long-standing wart
in KVM, this allows KVM to map non-refcounted struct page memory into the
guest, e.g. for exposing GPU TTM buffers to KVM guests.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-86-seanjc@google.com>
2024-10-25 13:01:35 -04:00
Sean Christopherson 93b7da404f KVM: Drop APIs that manipulate "struct page" via pfns
Remove all kvm_{release,set}_pfn_*() APIs now that all users are gone.

No functional change intended.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-85-seanjc@google.com>
2024-10-25 13:01:35 -04:00
Sean Christopherson 31fccdd212 KVM: Make kvm_follow_pfn.refcounted_page a required field
Now that the legacy gfn_to_pfn() APIs are gone, and all callers of
hva_to_pfn() pass in a refcounted_page pointer, make it a required field
to ensure all future usage in KVM plays nice.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-82-seanjc@google.com>
2024-10-25 13:01:35 -04:00
Sean Christopherson 06cdaff80e KVM: Drop gfn_to_pfn() APIs now that all users are gone
Drop gfn_to_pfn() and all its variants now that all users are gone.

No functional change intended.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-80-seanjc@google.com>
2024-10-25 13:01:35 -04:00
Sean Christopherson f42e289a20 KVM: Add support for read-only usage of gfn_to_page()
Rework gfn_to_page() to support read-only accesses so that it can be used
by arm64 to get MTE tags out of guest memory.

Opportunistically rewrite the comment to be even more stern about using
gfn_to_page(), as there are very few scenarios where requiring a struct
page is actually the right thing to do (though there are such scenarios).
Add a FIXME to call out that KVM probably should be pinning pages, not
just getting pages.
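
A read-only user (e.g. arm64's MTE tag code) would then look something
like this (illustrative sketch, assuming a bool @write param):

  struct page *page = gfn_to_page(kvm, gfn, false /* read-only */);

  if (!page)
      return -EFAULT;
  ...
  kvm_release_page_clean(page);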

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-77-seanjc@google.com>
2024-10-25 13:00:50 -04:00
Sean Christopherson ce6bf70346 KVM: Convert gfn_to_page() to use kvm_follow_pfn()
Convert gfn_to_page() to the new kvm_follow_pfn() internal API, which will
eventually allow removing gfn_to_pfn() and kvm_pfn_to_refcounted_page().
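
The conversion boils down to something like (approximated, not the
exact diff):

  struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
  {
      struct page *refcounted_page = NULL;
      struct kvm_follow_pfn kfp = {
          .slot = gfn_to_memslot(kvm, gfn),
          .gfn = gfn,
          .flags = FOLL_WRITE,
          .refcounted_page = &refcounted_page,
      };

      (void)kvm_follow_pfn(&kfp);
      return refcounted_page;
  }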

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-76-seanjc@google.com>
2024-10-25 13:00:49 -04:00
Sean Christopherson 1c7b627e93 KVM: Add kvm_faultin_pfn() to specifically service guest page faults
Add a new dedicated API, kvm_faultin_pfn(), for servicing guest page
faults, i.e. for getting pages/pfns that will be mapped into the guest via
an mmu_notifier-protected KVM MMU.  Keep struct kvm_follow_pfn buried in
internal code, as having __kvm_faultin_pfn() take "out" params is actually
cleaner for several architectures, e.g. it allows the caller to have its
own "page fault" structure without having to marshal data to/from
kvm_follow_pfn.
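
The rough shape of the new API, with the "out" params (approximate):

  kvm_pfn_t kvm_faultin_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, bool write,
                            bool *writable, struct page **refcounted_page);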

Long term, common KVM would ideally provide a kvm_page_fault structure, a
la x86's struct of the same name.  But all architectures need to be
converted to a common API before that can happen.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-44-seanjc@google.com>
2024-10-25 13:00:47 -04:00
Sean Christopherson 68e51d0a43 KVM: Disallow direct access (w/o mmu_notifier) to unpinned pfn by default
Add an off-by-default module param to control whether or not KVM is allowed
to map memory that isn't pinned, i.e. that KVM can't guarantee won't be
freed while it is mapped into KVM and/or the guest.  Don't remove the
functionality entirely, as there are use cases where mapping unpinned
memory is safe (as defined by the platform owner), e.g. when memory is
hidden from the kernel and managed by userspace, in which case userspace
is already fully trusted to not muck with guest memory mappings.

But for more typical setups, mapping unpinned memory is wildly unsafe and
unnecessary.  The APIs are used exclusively by x86's nested virtualization
support, and there is no known (or sane) use case for mapping PFN-mapped
memory into a KVM guest _and_ letting the guest use it for virtualization
structures.
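
The knob itself is a plain module param, along the lines of:

  /* Off by default; opt in only if the platform owner deems it safe. */
  static bool allow_unsafe_mappings;
  module_param(allow_unsafe_mappings, bool, 0444);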

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-36-seanjc@google.com>
2024-10-25 12:59:08 -04:00
Sean Christopherson 2e5fdf60a9 KVM: Get writable mapping for __kvm_vcpu_map() only when necessary
When creating a memory map for read, don't request a writable pfn from the
primary MMU.  Creating read-only mappings can be theoretically slower, as
they don't play nice with fast GUP due to the need to break CoW before
mapping the underlying PFN.  But practically speaking, creating a mapping
isn't a super hot path, and getting a writable mapping for reading is weird
and confusing.
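
I.e. the map path now selects FOLL_WRITE based on the caller's intent,
roughly:

  kfp.flags = writable ? FOLL_WRITE : 0;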

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-35-seanjc@google.com>
2024-10-25 12:59:08 -04:00
Sean Christopherson 365e319208 KVM: Pass in write/dirty to kvm_vcpu_map(), not kvm_vcpu_unmap()
Now that all kvm_vcpu_{,un}map() users pass "true" for @dirty, have them
pass "true" as a @writable param to kvm_vcpu_map() instead, so that the
API can create a read-only mapping when a caller doesn't need to write.
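
The resulting pair of wrappers looks roughly like (sketch; exact
signatures per the series):

  static inline int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn,
                                 struct kvm_host_map *map)
  {
      return __kvm_vcpu_map(vcpu, gfn, map, true);
  }

  static inline int kvm_vcpu_map_readonly(struct kvm_vcpu *vcpu, gfn_t gfn,
                                          struct kvm_host_map *map)
  {
      return __kvm_vcpu_map(vcpu, gfn, map, false);
  }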

Note, creating read-only mappings can be theoretically slower, as they
don't play nice with fast GUP due to the need to break CoW before mapping
the underlying PFN.  But practically speaking, creating a mapping isn't
a super hot path, and getting a writable mapping for reading is weird and
confusing.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-34-seanjc@google.com>
2024-10-25 12:59:07 -04:00
Sean Christopherson 2bcb52a360 KVM: Pin (as in FOLL_PIN) pages during kvm_vcpu_map()
Pin, as in FOLL_PIN, pages when mapping them for direct access by KVM.
As per Documentation/core-api/pin_user_pages.rst, writing to a page that
was gotten via FOLL_GET is explicitly disallowed.

  Correct (uses FOLL_PIN calls):
      pin_user_pages()
      write to the data within the pages
      unpin_user_pages()

  INCORRECT (uses FOLL_GET calls):
      get_user_pages()
      write to the data within the pages
      put_page()

Unfortunately, FOLL_PIN is a "private" flag, and so kvm_follow_pfn must
use a one-off bool instead of piggybacking on the "flags" field.
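
I.e. struct kvm_follow_pfn grows something like (sketch):

  unsigned int flags;  /* FOLL_* flags; FOLL_PIN can't be passed here */
  bool pin;            /* pin (FOLL_PIN) instead of get (FOLL_GET) */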

Link: https://lwn.net/Articles/930667
Link: https://lore.kernel.org/all/cover.1683044162.git.lstoakes@gmail.com
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-32-seanjc@google.com>
2024-10-25 12:58:00 -04:00
David Stevens 2ff072ba7a KVM: Migrate kvm_vcpu_map() to kvm_follow_pfn()
Migrate kvm_vcpu_map() to kvm_follow_pfn(), and have it track whether or
not the map holds a refcounted struct page.  Precisely tracking struct
page references will eventually allow removing kvm_pfn_to_refcounted_page()
and its various wrappers.

Signed-off-by: David Stevens <stevensd@chromium.org>
[sean: use a pointer instead of a boolean]
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-31-seanjc@google.com>
2024-10-25 12:57:59 -04:00
Sean Christopherson 775e3ff7bf KVM: Move kvm_{set,release}_page_{clean,dirty}() helpers up in kvm_main.c
Hoist the kvm_{set,release}_page_{clean,dirty}() APIs further up in
kvm_main.c so that they can be used by the kvm_follow_pfn family of APIs.

No functional change intended.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-29-seanjc@google.com>
2024-10-25 12:57:59 -04:00
Sean Christopherson 3dd48ecfac KVM: Provide refcounted page as output field in struct kvm_follow_pfn
Add kvm_follow_pfn.refcounted_page as an output for the "to pfn" APIs to
"return" the struct page that is associated with the returned pfn (if KVM
acquired a reference to the page).  This will eventually allow removing
KVM's hacky kvm_pfn_to_refcounted_page() code, which is error prone and
can't detect pfns that are valid, but aren't (currently) refcounted.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-28-seanjc@google.com>
2024-10-25 12:57:59 -04:00
Sean Christopherson 12fac89950 KVM: Use plain "struct page" pointer instead of single-entry array
Use a single pointer instead of a single-entry array for the struct page
pointer in hva_to_pfn_fast().  Using an array makes the code unnecessarily
annoying to read and update.
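
I.e., roughly:

  Before:
      struct page *page[1];

      if (get_user_page_fast_only(addr, FOLL_WRITE, page))
          *pfn = page_to_pfn(page[0]);

  After:
      struct page *page;

      if (get_user_page_fast_only(addr, FOLL_WRITE, &page))
          *pfn = page_to_pfn(page);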

No functional change intended.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-27-seanjc@google.com>
2024-10-25 12:57:59 -04:00
Sean Christopherson ef7db98e47 KVM: Use NULL for struct page pointer to indicate mremapped memory
Drop yet another unnecessary magic page value from KVM, as there's zero
reason to use a poisoned pointer to indicate "no page".  If KVM uses a
NULL page pointer, the kernel will explode just as quickly as if KVM uses
a poisoned pointer.  Never mind the fact that such usage would be a
blatant and egregious KVM bug.
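
For reference, the dropped sentinel was defined along these lines:

  #define KVM_UNMAPPED_PAGE ((void *)0x500 + POISON_POINTER_DELTA)

whereas "no page" is now simply map->page == NULL.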

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-23-seanjc@google.com>
2024-10-25 12:57:59 -04:00
Sean Christopherson 5488499f9c KVM: Explicitly initialize all fields at the start of kvm_vcpu_map()
Explicitly initialize the entire kvm_host_map structure when mapping a
pfn, as some callers declare their struct on the stack, i.e. don't
zero-initialize the struct, which makes the map->hva check in
kvm_vcpu_unmap() *very* suspect.
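
I.e. fill in every field up front, roughly (names approximated):

  map->page = NULL;
  map->hva = NULL;
  map->gfn = gfn;
  map->pfn = kvm_vcpu_gfn_to_pfn(vcpu, gfn);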

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-22-seanjc@google.com>
2024-10-25 12:57:59 -04:00
Sean Christopherson 0b139b877b KVM: Remove pointless sanity check on @map param to kvm_vcpu_(un)map()
Drop kvm_vcpu_{,un}map()'s useless checks on @map being non-NULL.  The map
is 100% kernel-controlled; any caller that passes a NULL pointer is broken
and needs to be fixed, i.e. a crash due to a NULL pointer dereference is
desirable (though obviously not as desirable as not having a bug in the
first place).

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-21-seanjc@google.com>
2024-10-25 12:57:59 -04:00
David Stevens c0461f2063 KVM: Introduce kvm_follow_pfn() to eventually replace "gfn_to_pfn" APIs
Introduce kvm_follow_pfn() to eventually supplant the various "gfn_to_pfn"
APIs, albeit by adding more wrappers.  The primary motivation of the new
helper is to pass a structure instead of an ever-changing set of parameters,
e.g. so that tweaking the behavior, inputs, and/or outputs of the "to pfn"
helpers doesn't require churning half of KVM.

In the more distant future, the APIs exposed to arch code could also
follow suit, e.g. by adding something akin to x86's "struct kvm_page_fault"
when faulting in guest memory.  But for now, the goal is purely to clean
up KVM's "internal" MMU code.

As part of the conversion, replace the write_fault, interruptible, and
no-wait boolean flags with FOLL_WRITE, FOLL_INTERRUPTIBLE, and FOLL_NOWAIT
respectively.  Collecting the various FOLL_* flags into a single field
will again ease the pain of passing new flags.
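
The structure's initial shape, approximately:

  struct kvm_follow_pfn {
      const struct kvm_memory_slot *slot;
      const gfn_t gfn;

      /* FOLL_WRITE, FOLL_INTERRUPTIBLE, FOLL_NOWAIT, etc. */
      unsigned int flags;

      /* If non-NULL, try for a writable mapping and report the result. */
      bool *map_writable;
  };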

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: David Stevens <stevensd@chromium.org>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-20-seanjc@google.com>
2024-10-25 12:57:59 -04:00
Sean Christopherson cccefb0a0d KVM: Drop unused "hva" pointer from __gfn_to_pfn_memslot()
Drop @hva from __gfn_to_pfn_memslot() now that all callers pass NULL.

No functional change intended.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-19-seanjc@google.com>
2024-10-25 12:57:58 -04:00