Commit Graph

2732 Commits

Author SHA1 Message Date
Linus Torvalds 2215336295 hyperv-next for v6.18

Merge tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

 - Unify guest entry code for KVM and MSHV (Sean Christopherson)

 - Switch Hyper-V MSI domain to use msi_create_parent_irq_domain()
   (Nam Cao)

 - Add CONFIG_HYPERV_VMBUS and limit the semantics of CONFIG_HYPERV
   (Mukesh Rathor)

 - Add kexec/kdump support on Azure CVMs (Vitaly Kuznetsov)

 - Deprecate hyperv_fb in favor of Hyper-V DRM driver (Prasanna
   Kumar T S M)

 - Miscellaneous enhancements, fixes and cleanups (Abhishek Tiwari,
   Alok Tiwari, Nuno Das Neves, Wei Liu, Roman Kisel, Michael Kelley)

* tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hyperv: Remove the spurious null directive line
  MAINTAINERS: Mark hyperv_fb driver Obsolete
  fbdev/hyperv_fb: deprecate this in favor of Hyper-V DRM driver
  Drivers: hv: Make CONFIG_HYPERV bool
  Drivers: hv: Add CONFIG_HYPERV_VMBUS option
  Drivers: hv: vmbus: Fix typos in vmbus_drv.c
  Drivers: hv: vmbus: Fix sysfs output format for ring buffer index
  Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store()
  x86/hyperv: Switch to msi_create_parent_irq_domain()
  mshv: Use common "entry virt" APIs to do work in root before running guest
  entry: Rename "kvm" entry code assets to "virt" to genericize APIs
  entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper
  mshv: Handle NEED_RESCHED_LAZY before transferring to guest
  x86/hyperv: Add kexec/kdump support on Azure CVMs
  Drivers: hv: Simplify data structures for VMBus channel close message
  Drivers: hv: util: Cosmetic changes for hv_utils_transport.c
  mshv: Add support for a new parent partition configuration
  clocksource: hyper-v: Skip unnecessary checks for the root partition
  hyperv: Add missing field to hv_output_map_device_interrupt
2025-10-07 08:40:15 -07:00
Sean Christopherson 9be7e1e320 entry: Rename "kvm" entry code assets to "virt" to genericize APIs
Rename the "kvm" entry code files and Kconfigs to use generic "virt"
nomenclature so that the code can be reused by other hypervisors (or
rather, their root/dom0 partition drivers), without incorrectly suggesting
the code somehow relies on and/or involves KVM.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30 22:50:18 +00:00
Sean Christopherson 20c4892058 KVM: Export KVM-internal symbols for sub-modules only
Rework the vast majority of KVM's exports to expose symbols only to KVM
submodules, i.e. to x86's kvm-{amd,intel}.ko and PPC's kvm-{pr,hv}.ko.
With few exceptions, KVM's exported APIs are intended (and safe) for KVM-
internal usage only.

Keep kvm_get_kvm(), kvm_get_kvm_safe(), and kvm_put_kvm() as normal
exports, as they are needed by VFIO, and are generally safe for external
usage (though ideally even the get/put APIs would be KVM-internal, and
VFIO would pin a VM by grabbing a reference to its associated file).

Implement a framework in kvm_types.h in anticipation of providing a macro
to restrict KVM-specific kernel exports, i.e. to provide symbol exports
for KVM if and only if KVM is built as one or more modules.
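
A sketch of the kind of wrapper such a framework enables (the macro name, the
config check, and the EXPORT_SYMBOL_GPL_FOR_MODULES() helper used here are
illustrative assumptions, not the exact upstream implementation):

  /* Assumed names, for illustration only. */
  #if IS_MODULE(CONFIG_KVM)
  #define EXPORT_SYMBOL_FOR_KVM_INTERNAL(sym) \
          EXPORT_SYMBOL_GPL_FOR_MODULES(sym, "kvm-amd,kvm-intel")
  #else
  #define EXPORT_SYMBOL_FOR_KVM_INTERNAL(sym)
  #endif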

Link: https://lore.kernel.org/r/20250919003303.1355064-3-seanjc@google.com
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-09-30 13:40:02 -04:00
Paolo Bonzini a104e0a305 KVM SVM changes for 6.18

Merge tag 'kvm-x86-svm-6.18' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.18

 - Require a minimum GHCB version of 2 when starting SEV-SNP guests via
   KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors
   instead of latent guest failures.

 - Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted
   host from tampering with the guest's TSC frequency, while still allowing
   the VMM to configure the guest's TSC frequency prior to launch.

 - Mitigate the potential for TOCTOU bugs when accessing GHCB fields by
   wrapping all accesses via READ_ONCE().

 - Validate the XCR0 provided by the guest (via the GHCB) to avoid tracking a
   bogus XCR0 value in KVM's software model.

 - Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to
   avoid leaving behind stale state (thankfully not consumed in KVM).

 - Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE
   instead of subtly relying on guest_memfd to do the "heavy" lifting.

 - Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's
   desired TSC_AUX, to fix a bug where KVM could clobber a different vCPU's
   TSC_AUX due to hardware not matching the value cached in the user-return MSR
   infrastructure.

 - Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported,
   and clean up the AVIC initialization code along the way.
2025-09-30 13:34:12 -04:00
Paolo Bonzini 5b0d0d8542 KVM x86 MMU changes for 6.18

Merge tag 'kvm-x86-mmu-6.18' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.18

 - Recover possible NX huge pages within the TDP MMU under read lock to
   reduce guest jitter when restoring NX huge pages.

 - Return -EAGAIN during prefault if userspace concurrently deletes/moves the
   relevant memslot to fix an issue where prefaulting could deadlock with the
   memslot update.

 - Don't retry in TDX's anti-zero-step mitigation if the target memslot is
   invalid, i.e. is being deleted or moved, to fix a deadlock scenario similar
   to the aforementioned prefaulting case.
2025-09-30 13:32:27 -04:00
Paolo Bonzini 99cab80208 KVM common changes for 6.18

Merge tag 'kvm-x86-generic-6.18' of https://github.com/kvm-x86/linux into HEAD

KVM common changes for 6.18

Remove a redundant __GFP_NOWARN from kvm_setup_async_pf() as __GFP_NOWARN is
now included in GFP_NOWAIT.
2025-09-30 13:27:59 -04:00
Paolo Bonzini 924ebaefce KVM/arm64 updates for 6.18

Merge tag 'kvmarm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 6.18

- Add support for FF-A 1.2 as the secure memory conduit for pKVM,
  allowing more registers to be used as part of the message payload.

- Change the way pKVM allocates its VM handles, making sure that the
  privileged hypervisor is never tricked into using uninitialised
  data.

- Speed up MMIO range registration by avoiding unnecessary RCU
  synchronisation, which results in VMs starting much quicker.

- Add the dump of the instruction stream when panicking in the EL2
  payload, just like the rest of the kernel has always done. This will
  hopefully help debugging non-VHE setups.

- Add 52-bit PA support to the stage-1 page-table walker, and make use
  of it to populate the fault level reported to the guest on failing
  to translate a stage-1 walk.

- Add NV support to the GICv3-on-GICv5 emulation code, ensuring
  feature parity for guests, irrespective of the host platform.

- Fix some really ugly architecture problems when dealing with debug
  in a nested VM. This has some bad performance impacts, but is at
  least correct.

- Add enough infrastructure to be able to disable EL2 features and
  give effective values to the EL2 control registers. This then allows
  a bunch of features to be turned off, which helps cross-host
  migration.

- Large rework of the selftest infrastructure to allow most tests to
  transparently run at EL2. This is the first step towards enabling
  NV testing.

- Various fixes and improvements all over the map, including one BE
  fix, just in time for the removal of the feature.
2025-09-30 13:23:28 -04:00
Sean Christopherson 5b66e335ea KVM: SEV: Reject non-positive effective lengths during LAUNCH_UPDATE
Check for an invalid length during LAUNCH_UPDATE at the start of
snp_launch_update() instead of subtly relying on kvm_gmem_populate() to
detect the bad state.  Code that directly handles userspace input
absolutely should sanitize those inputs; failure to do so is asking for
bugs where KVM consumes an invalid "npages".

Keep the check in gmem, but wrap it in a WARN to flag any bad usage by
the caller.

Note, this is technically an ABI change as KVM would previously allow a
length of '0'.  But allowing a length of '0' is nonsensical and creates
pointless conundrums in KVM.  E.g. an empty range is arguably neither
private nor shared, but LAUNCH_UPDATE will fail if the starting gpa can't
be made private.  In practice, no known or well-behaved VMM passes a
length of '0'.

Note #2, the PAGE_ALIGNED(params.len) check ensures that lengths between
1 and 4095 (inclusive) are also rejected, i.e. that KVM won't end up with
npages=0 when doing "npages = params.len / PAGE_SIZE".
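
The shape of the added check, roughly (a sketch, not the exact upstream diff):

  /* snp_launch_update(), sketch: sanitize userspace input up front. */
  if (!params.len || !PAGE_ALIGNED(params.len))
          return -EINVAL;

  npages = params.len / PAGE_SIZE;  /* now guaranteed to be >= 1 */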

Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Michael Roth <michael.roth@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250919211649.1575654-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 08:55:20 -07:00
Keir Fraser 7d9a0273c4 KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
Device MMIO registration may happen quite frequently during VM boot,
and the SRCU synchronization each time has a measurable effect
on VM startup time. In our experiments it can account for around 25%
of a VM's startup time.

Replace the synchronization with a deferred free of the old kvm_io_bus
structure.
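
A sketch of the deferred-free pattern (the rcu_head field and the callback
name are illustrative, not necessarily the upstream identifiers):

  static void kvm_io_bus_free_rcu(struct rcu_head *rcu)
  {
          kfree(container_of(rcu, struct kvm_io_bus, rcu));
  }

  /* In kvm_io_bus_register_dev(), in place of synchronize_srcu() + kfree(): */
  rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
  call_srcu(&kvm->srcu, &bus->rcu, kvm_io_bus_free_rcu);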

Tested-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Keir Fraser <keirf@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-09-15 10:55:23 +01:00
Keir Fraser 7788255aba KVM: Implement barriers before accessing kvm->buses[] on SRCU read paths
This ensures that, if a VCPU has "observed" that an IO registration has
occurred, the instruction currently being trapped or emulated will also
observe the IO registration.

At the same time, enforce that kvm_get_bus() is used only on the
update side, ensuring that a long-term reference cannot be obtained by
an SRCU reader.

Signed-off-by: Keir Fraser <keirf@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-09-15 10:55:23 +01:00
Sean Christopherson 2bc2694fe2 KVM: TDX: Do not retry locally when the retry is caused by invalid memslot
Avoid local retries within the TDX EPT violation handler if a retry is
triggered by faulting in an invalid memslot, indicating that the memslot
is undergoing a removal process.  Faulting in a GPA from an invalid
memslot will never succeed, and holding SRCU prevents memslot deletion
from succeeding, i.e. retrying when the memslot is being actively deleted
will lead to (breakable) deadlock.

Opportunistically export kvm_vcpu_gfn_to_memslot() to allow for a per-vCPU
lookup (which, strictly speaking, is unnecessary since TDX doesn't support
SMM, but aligns the TDX code with the MMU code).
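
Roughly, the added bail-out looks like this (a sketch, not the exact code):

  /* In the local retry loop of the TDX EPT violation handler. */
  slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
  if (slot && slot->flags & KVM_MEMSLOT_INVALID)
          break;  /* stop retrying locally; the memslot is being deleted/moved */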

Fixes: b0327bb2e7 ("KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY")
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@intel.com
[Yan: Wrote patch log, comment, fixed a minor error, function export]
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250822070523.26495-1-yan.y.zhao@intel.com
[sean: massage changelog, relocate and tweak comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-10 12:06:35 -07:00
Fuad Tabba 3d3a04fad2 KVM: Allow and advertise support for host mmap() on guest_memfd files
Now that all the x86 and arm64 plumbing for mmap() on guest_memfd is in
place, allow userspace to set GUEST_MEMFD_FLAG_MMAP and advertise support
via a new capability, KVM_CAP_GUEST_MEMFD_MMAP.

The availability of this capability is determined per architecture, and
its enablement for a specific guest_memfd instance is controlled by the
GUEST_MEMFD_FLAG_MMAP flag at creation time.

Update the KVM API documentation to detail the KVM_CAP_GUEST_MEMFD_MMAP
capability, the associated GUEST_MEMFD_FLAG_MMAP, and provide essential
information regarding support for mmap in guest_memfd.
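
A rough userspace usage sketch (error handling omitted; vm_fd and memory_size
are placeholders):

  if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GUEST_MEMFD_MMAP) > 0) {
          struct kvm_create_guest_memfd gmem = {
                  .size  = memory_size,
                  .flags = GUEST_MEMFD_FLAG_MMAP,
          };
          int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
          void *hva = mmap(NULL, memory_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, gmem_fd, 0);
          /* hva can now back the memslot's userspace_addr */
  }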

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:37:03 -04:00
Fuad Tabba 576d035e2a KVM: guest_memfd: Track guest_memfd mmap support in memslot
Add a new internal flag, KVM_MEMSLOT_GMEM_ONLY, to the top half of
memslot->flags (which makes it strictly for KVM's internal use). This
flag tracks when a guest_memfd-backed memory slot supports host
userspace mmap operations, which implies that all memory, not just
private memory for CoCo VMs, is consumed through guest_memfd: "gmem
only".

This optimization avoids repeatedly checking the underlying guest_memfd
file for mmap support, which would otherwise require taking and
releasing a reference on the file for each check. By caching this
information directly in the memslot, we reduce overhead and simplify the
logic involved in handling guest_memfd-backed pages for host mappings.
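
A sketch of the cached check (the bit position and helper name are
illustrative, not necessarily the upstream values):

  /* Internal-only flag, carved out of the upper half of memslot->flags. */
  #define KVM_MEMSLOT_GMEM_ONLY   BIT(31)

  static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
  {
          return slot->flags & KVM_MEMSLOT_GMEM_ONLY;
  }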

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba a12578e147 KVM: guest_memfd: Add plumbing to host to map guest_memfd pages
Introduce the core infrastructure to enable host userspace to mmap()
guest_memfd-backed memory. This is needed for several evolving KVM use
cases:

* Non-CoCo VM backing: Allows VMMs like Firecracker to run guests
  entirely backed by guest_memfd, even for non-CoCo VMs [1]. This
  provides a unified memory management model and simplifies guest memory
  handling.

* Direct map removal for enhanced security: This is an important step
  for direct map removal of guest memory [2]. By allowing host userspace
  to fault in guest_memfd pages directly, we can avoid maintaining host
  kernel direct maps of guest memory. This provides additional hardening
  against Spectre-like transient execution attacks by removing a
  potential attack surface within the kernel.

* Future guest_memfd features: This also lays the groundwork for future
  enhancements to guest_memfd, such as supporting huge pages and
  enabling in-place sharing of guest memory with the host for CoCo
  platforms that permit it [3].

Enable the basic mmap and fault handling logic within guest_memfd, but
hold off on allowing userspace to actually do mmap() until the architecture
support is also in place.

[1] https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
[2] https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@amazon.com
[3] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com/T/#u

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba d1e54dd08f KVM: x86: Enable KVM_GUEST_MEMFD for all 64-bit builds
Enable KVM_GUEST_MEMFD for all KVM x86 64-bit builds, i.e. for "default"
VM types when running on 64-bit KVM.  This will allow using guest_memfd
to back non-private memory for all VM shapes, by supporting mmap() on
guest_memfd.

Opportunistically clean up various conditionals that become tautologies
once x86 selects KVM_GUEST_MEMFD more broadly.  Specifically, because
SW protected VMs, SEV, and TDX are all 64-bit only, private memory no
longer needs to take explicit dependencies on KVM_GUEST_MEMFD, because
it is effectively a prerequisite.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba 69116e01f6 KVM: Fix comments that refer to slots_lock
Fix comments so that they refer to slots_lock instead of slots_locks
(remove trailing s).

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba 923310be23 KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem()
Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() to improve
clarity and accurately reflect its purpose.

The function kvm_slot_can_be_private() was previously used to check if a
given kvm_memory_slot is backed by guest_memfd. However, its name
implied that the memory in such a slot was exclusively "private".

As guest_memfd support expands to include non-private memory (e.g.,
shared host mappings), it's important to remove this association. The
new name, kvm_slot_has_gmem(), states that the slot is backed by
guest_memfd without making assumptions about the memory's privacy
attributes.
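
For reference, the helper remains a simple flag test, e.g. (sketch):

  static inline bool kvm_slot_has_gmem(const struct kvm_memory_slot *slot)
  {
          return slot && (slot->flags & KVM_MEM_GUEST_MEMFD);
  }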

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba 36cf63bb5d KVM: Rename CONFIG_KVM_GENERIC_PRIVATE_MEM to CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE
The original name was vague regarding its functionality. This Kconfig
option specifically enables and gates the kvm_gmem_populate() function,
which is responsible for populating a GPA range with guest data.

The new name, HAVE_KVM_ARCH_GMEM_POPULATE, describes the purpose of the
option: to enable arch-specific guest_memfd population mechanisms. It
also follows the same pattern as the other HAVE_KVM_ARCH_* configuration
options.

This improves clarity for developers and ensures the name accurately
reflects the functionality it controls, especially as guest_memfd
support expands beyond purely "private" memory scenarios.

Temporarily keep KVM_GENERIC_PRIVATE_MEM as an x86-only config so as to
minimize churn, and to hopefully make it easier to see what features
require HAVE_KVM_ARCH_GMEM_POPULATE.  On that note, omit GMEM_POPULATE
for KVM_X86_SW_PROTECTED_VM, as regular ol' memset() suffices for
software-protected VMs.

As for KVM_GENERIC_PRIVATE_MEM, a future change will select KVM_GUEST_MEMFD
for all 64-bit KVM builds, at which point the intermediate config will
become obsolete and can/will be dropped.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba 19a9a1ab5c KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GUEST_MEMFD
Rename the Kconfig option CONFIG_KVM_PRIVATE_MEM to
CONFIG_KVM_GUEST_MEMFD. The original name implied that the feature only
supported "private" memory. However, CONFIG_KVM_PRIVATE_MEM enables
guest_memfd in general, which is not exclusively for private memory.
Subsequent patches in this series will add guest_memfd support for
non-CoCo VMs, whose memory is not private.

Renaming the Kconfig option to CONFIG_KVM_GUEST_MEMFD more accurately
reflects its broader scope as the main Kconfig option for all
guest_memfd-backed memory. This provides clearer semantics for the
option and avoids confusion as new features are introduced.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:34:59 -04:00
Qianfeng Rong cf6a8401b6 KVM: remove redundant __GFP_NOWARN
Commit 16f5dfbc85 ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made
GFP_NOWAIT implicitly include __GFP_NOWARN.

Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g.,
`GFP_NOWAIT | __GFP_NOWARN`) is now redundant.  Let's clean up these
redundant flags across subsystems.
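
In kvm_setup_async_pf() the cleanup amounts to a one-liner along these lines
(illustrative, not the exact diff):

  /* before */
  work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT | __GFP_NOWARN);
  /* after: __GFP_NOWARN is already implied by GFP_NOWAIT */
  work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);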

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Link: https://lore.kernel.org/r/20250807144943.581663-1-rongqianfeng@vivo.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19 11:51:13 -07:00
Paolo Bonzini 9de13951d5 Merge tag 'kvm-x86-no_assignment-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM VFIO device assignment cleanups for 6.17

Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking now
that KVM no longer uses assigned_device_count as a bad heuristic for "VM has
an irqbypass producer" or for "VM has access to host MMIO".
2025-07-29 08:36:42 -04:00
Paolo Bonzini cc5a1021aa Merge tag 'kvm-x86-dirty_ring-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM Dirty Ring changes for 6.17

Fix issues with dirty ring harvesting where KVM doesn't bound the processing
of entries in any way, which allows userspace to keep KVM in a tight loop
indefinitely.  Clean up code and comments along the way.
2025-07-29 08:36:41 -04:00
Paolo Bonzini d284562862 Merge tag 'kvm-x86-generic-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM generic changes for 6.17

 - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues related
   to private <=> shared memory conversions.

 - Drop guest_memfd's .getattr() implementation as the VFS layer will call
   generic_fillattr() if inode_operations.getattr is NULL.
2025-07-29 08:36:41 -04:00
Paolo Bonzini f02b1bcc73 Merge tag 'kvm-x86-irqs-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM IRQ changes for 6.17

 - Rework irqbypass to track/match producers and consumers via an xarray
   instead of a linked list.  Using a linked list leads to O(n^2) insertion
   times, which is hugely problematic for use cases that create large numbers
   of VMs.  Such use cases typically don't actually use irqbypass, but
   eliminating the pointless registration is a future problem to solve as it
   likely requires new uAPI.

 - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *",
   to avoid making a simple concept unnecessarily difficult to understand.

 - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC,
   and PIT emulation at compile time.

 - Drop x86's irq_comm.c, and move a pile of IRQ related code into irq.c.

 - Fix a variety of flaws and bugs in the AVIC device posted IRQ code.

 - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
   supports) instead of rejecting vCPU creation.

 - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning
   clear in the vCPU's physical ID table entry.

 - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by
   erratum #1235, to allow (safely) enabling AVIC on such CPUs.

 - Dedup x86's device posted IRQ code, as the vast majority of functionality
   can be shared verbatim between SVM and VMX.

 - Harden the device posted IRQ code against bugs and runtime errors.

 - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1)
   instead of O(n).

 - Generate GA Log interrupts if and only if the target vCPU is blocking, i.e.
   only if KVM needs a notification in order to wake the vCPU.

 - Decouple device posted IRQs from VFIO device assignment, as binding a VM to
   a VFIO group is not a requirement for enabling device posted IRQs.

 - Clean up and document/comment the irqfd assignment code.

 - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e.
   ensure an eventfd is bound to at most one irqfd through the entire host,
   and add a selftest to verify eventfd:irqfd bindings are globally unique.
2025-07-29 08:35:46 -04:00
Shivank Garg 87d4fbf4a3 KVM: guest_memfd: Remove redundant kvm_gmem_getattr implementation
Remove the redundant kvm_gmem_getattr() implementation that simply calls
generic_fillattr() without any special handling. The VFS layer
(vfs_getattr_nosec()) will call generic_fillattr() by default when no
custom getattr operation is provided in the inode_operations structure.

This is a cleanup with no functional change.

Signed-off-by: Shivank Garg <shivankg@amd.com>
Reviewed-By: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250602172317.10601-1-shivankg@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 13:42:33 -07:00
Sean Christopherson bbc13ae593 VFIO: KVM: x86: Drop kvm_arch_{start,end}_assignment()
Drop kvm_arch_{start,end}_assignment() and all associated code now that
KVM x86 no longer consumes assigned_device_count.  Tracking whether or not
a VFIO-assigned device is formally associated with a VM is fundamentally
flawed, as such an association is optional for general usage, i.e. is prone
to false negatives.  E.g. prior to commit 2edd9cb79f ("kvm: detect
assigned device via irqbypass manager"), device passthrough via VFIO would
fail to enable IRQ bypass if userspace omitted the formal VFIO<=>KVM
binding.

And device drivers that *need* the VFIO<=>KVM connection, e.g. KVM-GT,
shouldn't be relying on generic x86 tracking infrastructure.

Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 09:51:33 -07:00
Liam Merwick 47bb584237 KVM: Allow CPU to reschedule while setting per-page memory attributes
When running an SEV-SNP guest with a sufficiently large amount of memory (1TB+),
the host can experience CPU soft lockups while kvm_vm_set_mem_attributes()
sets memory attributes on the whole range of guest memory.

watchdog: BUG: soft lockup - CPU#8 stuck for 26s! [qemu-kvm:6372]
CPU: 8 UID: 0 PID: 6372 Comm: qemu-kvm Kdump: loaded Not tainted 6.15.0-rc7.20250520.el9uek.rc1.x86_64 #1 PREEMPT(voluntary)
Hardware name: Oracle Corporation ORACLE SERVER E4-2c/Asm,MB Tray,2U,E4-2c, BIOS 78016600 11/13/2024
RIP: 0010:xas_create+0x78/0x1f0
Code: 00 00 00 41 80 fc 01 0f 84 82 00 00 00 ba 06 00 00 00 bd 06 00 00 00 49 8b 45 08 4d 8d 65 08 41 39 d6 73 20 83 ed 06 48 85 c0 <74> 67 48 89 c2 83 e2 03 48 83 fa 02 75 0c 48 3d 00 10 00 00 0f 87
RSP: 0018:ffffad890a34b940 EFLAGS: 00000286
RAX: ffff96f30b261daa RBX: ffffad890a34b9c8 RCX: 0000000000000000
RDX: 000000000000001e RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffad890a356868
R13: ffffad890a356860 R14: 0000000000000000 R15: ffffad890a356868
FS:  00007f5578a2a400(0000) GS:ffff97ed317e1000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f015c70fb18 CR3: 00000001109fd006 CR4: 0000000000f70ef0
PKRU: 55555554
Call Trace:
 <TASK>
 xas_store+0x58/0x630
 __xa_store+0xa5/0x130
 xa_store+0x2c/0x50
 kvm_vm_set_mem_attributes+0x343/0x710 [kvm]
 kvm_vm_ioctl+0x796/0xab0 [kvm]
 __x64_sys_ioctl+0xa3/0xd0
 do_syscall_64+0x8c/0x7a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f5578d031bb
Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d 4c 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe0a742b88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000000004020aed2 RCX: 00007f5578d031bb
RDX: 00007ffe0a742c80 RSI: 000000004020aed2 RDI: 000000000000000b
RBP: 0000010000000000 R08: 0000010000000000 R09: 0000017680000000
R10: 0000000000000080 R11: 0000000000000246 R12: 00005575e5f95120
R13: 00007ffe0a742c80 R14: 0000000000000008 R15: 00005575e5f961e0

While looping through the range of memory setting the attributes,
call cond_resched() to give the scheduler a chance to run a higher
priority task on the runqueue if necessary and avoid staying in
kernel mode long enough to trigger the lockup.
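
The shape of the fix, roughly (a sketch of the loop in
kvm_vm_set_mem_attributes(), not the exact upstream code):

  for (i = start; i < end; i++) {
          r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
                              GFP_KERNEL_ACCOUNT));
          KVM_BUG_ON(r, kvm);
          cond_resched();  /* yield so a 1TB+ update can't trigger a soft lockup */
  }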

Fixes: 5a475554db ("KVM: Introduce per-page memory attributes")
Cc: stable@vger.kernel.org # 6.12.x
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250609091121.2497429-2-liam.merwick@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 12:20:17 -07:00
Sean Christopherson b599d44a71 KVM: Drop sanity check that per-VM list of irqfds is unique
Now that the eventfd's waitqueue ensures it has at most one priority
waiter, i.e. prevents KVM from binding multiple irqfds to one eventfd,
drop KVM's sanity check that eventfds are unique for a single VM.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:59 -07:00
Sean Christopherson 2cdd64cbf9 KVM: Disallow binding multiple irqfds to an eventfd with a priority waiter
Disallow binding an irqfd to an eventfd that already has a priority waiter,
i.e. to an eventfd that already has an attached irqfd.  KVM always
operates in exclusive mode for EPOLL_IN (unconditionally returns '1'),
i.e. only the first waiter will be notified.

KVM already disallows binding multiple irqfds to an eventfd in a single
VM, but doesn't guard against multiple VMs binding to an eventfd.  Adding
the extra protection reduces the pain of a userspace VMM bug, e.g. if
userspace fails to de-assign before re-assigning when transferring state
for intra-host migration, then the migration will explicitly fail as
opposed to dropping IRQs on the destination VM.

Temporarily keep KVM's manual check on irqfds.items, but add a WARN, e.g.
to allow sanity checking the waitqueue enforcement.

Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: David Matlack <dmatlack@google.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:59 -07:00
Sean Christopherson 867347bb21 sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()
Drop the setting of WQ_FLAG_EXCLUSIVE from add_wait_queue_priority() and
instead have callers manually add the flag prior to adding their structure
to the queue.  Blindly setting WQ_FLAG_EXCLUSIVE is flawed, as the nature
of exclusive, priority waiters means that only the first waiter added will
ever receive notifications.

Pushing the flawed behavior to callers will allow fixing the problem one
hypervisor at a time (KVM added the flawed API, and then KVM's code was
copy+pasted nearly verbatim by Xen and Hyper-V), and will also allow for
adding an API that provides true exclusivity, i.e. that guarantees at most
one priority waiter is in the queue.
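
For a KVM-style caller the result looks roughly like this (a sketch;
irqfd_wakeup() is KVM's existing waitqueue callback):

  init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
  irqfd->wait.flags |= WQ_FLAG_EXCLUSIVE;  /* previously set by add_wait_queue_priority() */
  add_wait_queue_priority(wqh, &irqfd->wait);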

Opportunistically add a comment in Hyper-V to call out the mess.  Xen
privcmd's irqfd_wakeup() doesn't actually operate in exclusive mode, i.e.
can be "fixed" simply by dropping WQ_FLAG_EXCLUSIVE.  And KVM is primed to
switch to the aforementioned fully exclusive API, i.e. won't be carrying
the flawed code for long.

No functional change intended.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:57 -07:00
Sean Christopherson 86e00cd162 KVM: Add irqfd to eventfd's waitqueue while holding irqfds.lock
Add an irqfd to its target eventfd's waitqueue while holding irqfds.lock,
which is mildly terrifying but functionally safe.  irqfds.lock is taken
inside the waitqueue's lock, but if and only if the eventfd is being
released, i.e. that path is mutually exclusive with registration as KVM
holds a reference to the eventfd (and obviously must do so to avoid UAF).

This will allow using the eventfd's waitqueue to enforce KVM's requirement
that eventfd is assigned to at most one irqfd, without introducing races.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:56 -07:00
Sean Christopherson 5f8ca05ea9 KVM: Add irqfd to KVM's list via the vfs_poll() callback
Add the irqfd structure to KVM's list of irqfds in kvm_irqfd_register(),
i.e. via the vfs_poll() callback.  This will allow taking irqfds.lock
across the entire registration sequence (add to waitqueue, add to list),
and more importantly will allow inserting into KVM's list if and only if
adding to the waitqueue succeeds (spoiler alert), without needing to
juggle return codes in weird ways.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:55 -07:00
Sean Christopherson b5c543518a KVM: Initialize irqfd waitqueue callback when adding to the queue
Initialize the irqfd waitqueue callback immediately prior to inserting the
irqfd into the eventfd's waitqueue.  Pre-initializing the state in a
completely different context is all kinds of confusing, and incorrectly
suggests that the waitqueue function needs to be initialized prior to
vfs_poll().

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:55 -07:00
Sean Christopherson 140768a7bf KVM: Acquire SRCU lock outside of irqfds.lock during assignment
Acquire SRCU outside of irqfds.lock so that the locking is symmetrical,
and add a comment explaining why on earth KVM holds SRCU for so long.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:54 -07:00
Sean Christopherson 283ed5001d KVM: Use a local struct to do the initial vfs_poll() on an irqfd
Use a function-local struct for the poll_table passed to vfs_poll(), as
nothing in the vfs_poll() callchain grabs a long-term reference to the
structure, i.e. its lifetime doesn't need to be tied to the irqfd.  Using
a local structure will also allow propagating failures out of the polling
callback without further polluting kvm_kernel_irqfd.

Opportunistically rename irqfd_ptable_queue_proc() to kvm_irqfd_register()
to capture what it actually does.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250522235223.3178519-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:53 -07:00
Sean Christopherson 77bb184ab8 KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing()
Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing().
Calling arch code to know whether or not to call arch code is absurd.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250611224604.313496-35-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:33 -07:00
Sean Christopherson b33252b9d1 KVM: Don't WARN if updating IRQ bypass route fails
Don't bother WARNing if updating an IRTE route fails now that vendor code
provides much more precise WARNs.  The generic WARN doesn't provide enough
information to actually debug the problem, and has obviously done nothing
to surface the myriad bugs in KVM x86's implementation.

Drop all of the associated return code plumbing that existed just so that
common KVM could WARN.

Link: https://lore.kernel.org/r/20250611224604.313496-34-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:32 -07:00
Liam Merwick aa006b2e51 KVM: fix typo in kvm_vm_set_mem_attributes() comment
It should be 'has' in the sentence and not 'as'.

Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250609091121.2497429-4-liam.merwick@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:55:12 -07:00
Liam Merwick 741e595f02 KVM: Add trace_kvm_vm_set_mem_attributes()
Add a tracing function that, for a guest memory range, displays
the start and end addresses plus the per-page attributes being set.

Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250609091121.2497429-3-liam.merwick@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:55:12 -07:00
Sean Christopherson cb21073767 KVM: Pass new routing entries and irqfd when updating IRTEs
When updating IRTEs in response to a GSI routing or IRQ bypass change,
pass the new/current routing information along with the associated irqfd.
This will allow KVM x86 to harden, simplify, and deduplicate its code.

Since adding/removing a bypass producer is now conveniently protected with
irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(),
use the routing information cached in the irqfd instead of looking up
the information in the current GSI routing tables.

Opportunistically convert an existing printk() to pr_info() and put its
string onto a single line (old code that strictly adhered to 80 chars).

Link: https://lore.kernel.org/r/20250611224604.313496-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:53 -07:00
Sean Christopherson e295d2e7fb KVM: x86: Trigger I/O APIC route rescan in kvm_arch_irq_routing_update()
Trigger the I/O APIC route rescan that's performed for a split IRQ chip
after userspace updates IRQ routes in kvm_arch_irq_routing_update(), i.e.
before dropping kvm->irq_lock.  Calling kvm_make_all_cpus_request() under
a mutex is perfectly safe, and the smp_wmb()+smp_mb__after_atomic() pair
in __kvm_make_request()+kvm_check_request() ensures the new routing is
visible to vCPUs prior to the request being visible to vCPUs.

In all likelihood, commit b053b2aef2 ("KVM: x86: Add EOI exit bitmap
inference") somewhat arbitrarily made the request outside of irq_lock to
avoid holding irq_lock any longer than is strictly necessary.  And then
commit abdb080f7a ("kvm/irqchip: kvm_arch_irq_routing_update renaming
split") took the easy route of adding another arch hook instead of risking
a functional change.

Note, the call to synchronize_srcu_expedited() does NOT provide ordering
guarantees with respect to vCPUs scanning the new routing; as above, the
request infrastructure provides the necessary ordering.  I.e. there's no
need to wait for kvm_scan_ioapic_routes() to complete if it's actively
running, because regardless of whether it grabs the old or new table, the
vCPU will have another KVM_REQ_SCAN_IOAPIC pending, i.e. will rescan again
and see the new mappings.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:41 -07:00
Sean Christopherson 23b54381ce irqbypass: Require producers to pass in Linux IRQ number during registration
Pass in the Linux IRQ associated with an IRQ bypass producer instead of
relying on the caller to set the field prior to registration, as there's
no benefit to relying on callers to do the right thing.

Take care to set producer->irq before __connect(), as KVM expects the IRQ
to be valid as soon as a connection is possible.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:41 -07:00
Sean Christopherson 8394b32fae irqbypass: Use xarray to track producers and consumers
Track IRQ bypass producers and consumers using an xarray to avoid the O(2n)
insertion time associated with walking a list to check for duplicate
entries, and to search for a partner.

At low (tens or few hundreds) total producer/consumer counts, using a list
is faster due to the need to allocate backing storage for xarray.  But as
count creeps into the thousands, xarray wins easily, and can provide
several orders of magnitude better latency at high counts.  E.g. hundreds
of nanoseconds vs. hundreds of milliseconds.
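
The core of the change, sketched (identifiers are illustrative, not the
upstream names): key both sides by the eventfd context so duplicate detection
and partner lookup each become a single xarray operation.

  static DEFINE_XARRAY(producers);
  static DEFINE_XARRAY(consumers);

  static int register_producer(struct irq_bypass_producer *prod,
                               struct eventfd_ctx *eventfd)
  {
          struct irq_bypass_consumer *con;
          int ret;

          ret = xa_insert(&producers, (unsigned long)eventfd, prod, GFP_KERNEL);
          if (ret)
                  return ret;  /* -EBUSY on duplicate registration */

          con = xa_load(&consumers, (unsigned long)eventfd);
          if (con)
                  connect(prod, con);  /* O(1) partner lookup and pairing */
          return 0;
  }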

Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: David Matlack <dmatlack@google.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Reported-by: Yong He <alexyonghe@tencent.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217379
Link: https://lore.kernel.org/all/20230801115646.33990-1-likexu@tencent.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:40 -07:00
Sean Christopherson 46a4bfd0ae irqbypass: Use guard(mutex) in lieu of manual lock+unlock
Use guard(mutex) to clean up irqbypass's error handling.
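
Sketch of the pattern (irqbypass's global mutex is simply named "lock"):

  guard(mutex)(&lock);
  if (something_failed)
          return -EBUSY;  /* no explicit mutex_unlock() on any return path */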

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:40 -07:00
Sean Christopherson 5d7dbdce38 irqbypass: Use paired consumer/producer to disconnect during unregister
Use the paired consumer/producer information to disconnect IRQ bypass
producers/consumers in O(1) time (ignoring the cost of __disconnect()).

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:39 -07:00
Sean Christopherson add57f493e irqbypass: Explicitly track producer and consumer bindings
Explicitly track IRQ bypass producer:consumer bindings.  This will allow
making removal an O(1) operation; searching through the list to find
information that is trivially tracked (and useful for debug) is wasteful.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:38 -07:00
Sean Christopherson 2b521d86ee irqbypass: Take ownership of producer/consumer token tracking
Move ownership of IRQ bypass token tracking into irqbypass.ko, and
explicitly require callers to pass an eventfd_ctx structure instead of a
completely opaque token.  Relying on producers and consumers to set the
token appropriately is error prone, and hiding the fact that the token must
be an eventfd_ctx pointer (for all intents and purposes) unnecessarily
obfuscates the code and makes it more brittle.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:38 -07:00
Sean Christopherson 07fbc83c01 irqbypass: Drop superfluous might_sleep() annotations
Drop superfluous might_sleep() annotations from irqbypass; mutex_lock()
provides all of the necessary tracking.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:37 -07:00
Sean Christopherson fa079a0616 irqbypass: Drop pointless and misleading THIS_MODULE get/put
Drop irqbypass.ko's superfluous and misleading get/put calls on
THIS_MODULE.  A module taking a reference to itself is useless; no amount
of checks will prevent doom and destruction if the caller hasn't already
guaranteed the liveness of the module (this goes for any module).  E.g.
if try_module_get() fails because irqbypass.ko is being unloaded, then the
kernel has already hit a use-after-free by virtue of executing code whose
lifecycle is tied to irqbypass.ko.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:52:36 -07:00
Sean Christopherson 614fb9d147 KVM: Assert that slots_lock is held when resetting per-vCPU dirty rings
Assert that slots_lock is held in kvm_dirty_ring_reset() and add a comment
to explain _why_ slots_lock needs to be held for the duration of the reset.
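
The assertion itself is the standard lockdep idiom (sketch):

  /* kvm_dirty_ring_reset() */
  lockdep_assert_held(&kvm->slots_lock);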

Link: https://lore.kernel.org/all/aCSns6Q5oTkdXUEe@google.com
Suggested-by: James Houghton <jthoughton@google.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20250516213540.2546077-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20 13:41:04 -07:00