mirror of https://github.com/torvalds/linux.git
2762 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
208eed95fc |
soc: driver updates for 6.19
This is the first half of the driver changes:
- A treewide interface change to the "syscore" operations for
power management, as a preparation for future Tegra specific
changes.
- Reset controller updates with added drivers for LAN969x, eic770
and RZ/G3S SoCs.
- Protection of system controller registers on Renesas and Google SoCs,
to prevent trivially triggering a system crash from e.g. debugfs
access.
- soc_device identification updates on Nvidia, Exynos and Mediatek
- debugfs support in the ST STM32 firewall driver
- Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI
- Cleanups for memory controller support on Nvidia and Renesas
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmky/8gACgkQmmx57+YA
GNlqohAApPTLM6Q4gf1cIcsTVaP0uxx9CBgupCGuT5ORrOMKBghVWjTOTSxeEAab
UQF465QwYUUu602GH34UmRaY9CKW2bMIsfmkgmxNB4Y4Qd7yCgQNJ/h/TnN0rBH+
qTeEsRH/hax4miSNsh0oOZfVkZkg+23VF02d1VL0CcaX7y4oT45RPBQugrNx/gNS
fHfVwgIq8vJ8WyrmM1h2nv1i1vgSzEy50B3kY674BBw83FcJTafNLvD7N5DSgD1H
/I/2xeyEpb+oL1VfeHcXZaX/jf04O+cmvSzBi+MOH1tI3MpdxJib1vEYBdggoOWN
K/FFGgsOY+DNmJPpSnPTTu8UpzksS8SxGBP7M9Q8roKZwA2c9wLotxySvjki5yv8
2zvabRdzbrSaoYwsH9QnZdQ2hVkJ9W8MESu8PevD3yMNuFUzledPDWW0N1SbGm78
0ZdB6NPdaBZYHMNMRdFhN8P275/Mx5e0XWN9oYMQqjPooH7YkyT7hJWz6ao2PCJP
8mDmnW1RzL+LWf7mJ25ZEtS+YjmKA/PVmogRrGurKCadvdxXqCF09KNljICHhmmu
t0KB4dqw02OXLPvBk21qCi0zL56w1JDgqtS8suFvDYo9sCceeAbAcmpyoUOFj2N+
Upn976tb4iqFrr9mFswpmCJWPpqJkU+A+KnKsIRPU7N4kSrP35I=
=HvlN
-----END PGP SIGNATURE-----
Merge tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC driver updates from Arnd Bergmann:
"This is the first half of the driver changes:
- A treewide interface change to the "syscore" operations for power
management, as a preparation for future Tegra specific changes
- Reset controller updates with added drivers for LAN969x, eic770 and
RZ/G3S SoCs
- Protection of system controller registers on Renesas and Google
SoCs, to prevent trivially triggering a system crash from e.g.
debugfs access
- soc_device identification updates on Nvidia, Exynos and Mediatek
- debugfs support in the ST STM32 firewall driver
- Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI
- Cleanups for memory controller support on Nvidia and Renesas"
* tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (114 commits)
memory: tegra186-emc: Fix missing put_bpmp
Documentation: reset: Remove reset_controller_add_lookup()
reset: fix BIT macro reference
reset: rzg2l-usbphy-ctrl: Fix a NULL vs IS_ERR() bug in probe
reset: th1520: Support reset controllers in more subsystems
reset: th1520: Prepare for supporting multiple controllers
dt-bindings: reset: thead,th1520-reset: Add controllers for more subsys
dt-bindings: reset: thead,th1520-reset: Remove non-VO-subsystem resets
reset: remove legacy reset lookup code
clk: davinci: psc: drop unused reset lookup
reset: rzg2l-usbphy-ctrl: Add support for RZ/G3S SoC
reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY
dt-bindings: reset: renesas,rzg2l-usbphy-ctrl: Document RZ/G3S support
reset: eswin: Add eic7700 reset driver
dt-bindings: reset: eswin: Documentation for eic7700 SoC
reset: sparx5: add LAN969x support
dt-bindings: reset: microchip: Add LAN969x support
soc: rockchip: grf: Add select correct PWM implementation on RK3368
soc/tegra: pmc: Add USB wake events for Tegra234
amba: tegra-ahb: Fix device leak on SMMU enable
...
|
|
|
|
51d90a15fe |
ARM:
- Support for userspace handling of synchronous external aborts (SEAs),
allowing the VMM to potentially handle the abort in a non-fatal
manner.
- Large rework of the VGIC's list register handling with the goal of
supporting more active/pending IRQs than available list registers in
hardware. In addition, the VGIC now supports EOImode==1 style
deactivations for IRQs which may occur on a separate vCPU than the
one that acked the IRQ.
- Support for FEAT_XNX (user / privileged execute permissions) and
FEAT_HAF (hardware update to the Access Flag) in the software page
table walkers and shadow MMU.
- Allow page table destruction to reschedule, fixing long need_resched
latencies observed when destroying a large VM.
- Minor fixes to KVM and selftests
Loongarch:
- Get VM PMU capability from HW GCFG register.
- Add AVEC basic support.
- Use 64-bit register definition for EIOINTC.
- Add KVM timer test cases for tools/selftests.
RISC/V:
- SBI message passing (MPXY) support for KVM guest
- Give a new, more specific error subcode for the case when in-kernel
AIA virtualization fails to allocate IMSIC VS-file
- Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
in small chunks
- Fix guest page fault within HLV* instructions
- Flush VS-stage TLB after VCPU migration for Andes cores
s390:
- Always allocate ESCA (Extended System Control Area), instead of
starting with the basic SCA and converting to ESCA with the
addition of the 65th vCPU. The price is increased number of
exits (and worse performance) on z10 and earlier processor;
ESCA was introduced by z114/z196 in 2010.
- VIRT_XFER_TO_GUEST_WORK support
- Operation exception forwarding support
- Cleanups
x86:
- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO SPTE
caching is disabled, as there can't be any relevant SPTEs to zap.
- Relocate a misplaced export.
- Fix an async #PF bug where KVM would clear the completion queue when the
guest transitioned in and out of paging mode, e.g. when handling an SMI and
then returning to paged mode via RSM.
- Leave KVM's user-return notifier registered even when disabling
virtualization, as long as kvm.ko is loaded. On reboot/shutdown, keeping
the notifier registered is ok; the kernel does not use the MSRs and the
callback will run cleanly and restore host MSRs if the CPU manages to
return to userspace before the system goes down.
- Use the checked version of {get,put}_user().
- Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC
timers can result in a hard lockup in the host.
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NTP corrections.
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter
behind CONFIG_CPU_MITIGATIONS.
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fast path;
the only reason they were handled in the fast path was to paper of a bug
in the core #MC code, and that has long since been fixed.
- Add emulator support for AVX MOV instructions, to play nice with emulated
devices whose guest drivers like to access PCI BARs with large multi-byte
instructions.
x86 (AMD):
- Fix a few missing "VMCB dirty" bugs.
- Fix the worst of KVM's lack of EFER.LMSLE emulation.
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
- Fix incorrect handling of selective CR0 writes when checking intercepts
during emulation of L2 instructions.
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on
VMRUN and #VMEXIT.
- Fix a bug where KVM corrupt the guest code stream when re-injecting a soft
interrupt if the guest patched the underlying code after the VM-Exit, e.g.
when Linux patches code with a temporary INT3.
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to
userspace, and extend KVM "support" to all policy bits that don't require
any actual support from KVM.
x86 (Intel):
- Use the root role from kvm_mmu_page to construct EPTPs instead of the
current vCPU state, partly as worthwhile cleanup, but mostly to pave the
way for tracking per-root TLB flushes, and elide EPT flushes on pCPU
migration if the root is clean from a previous flush.
- Add a few missing nested consistency checks.
- Rip out support for doing "early" consistency checks via hardware as the
functionality hasn't been used in years and is no longer useful in general;
replace it with an off-by-default module param to WARN if hardware fails
a check that KVM does not perform.
- Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32]
on VM-Enter.
- Misc cleanups.
- Overhaul the TDX code to address systemic races where KVM (acting on behalf
of userspace) could inadvertantly trigger lock contention in the TDX-Module;
KVM was either working around these in weird, ugly ways, or was simply
oblivious to them (though even Yan's devilish selftests could only break
individual VMs, not the host kernel)
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a TDX vCPU,
if creating said vCPU failed partway through.
- Fix a few sparse warnings (bad annotation, 0 != NULL).
- Use struct_size() to simplify copying TDX capabilities to userspace.
- Fix a bug where TDX would effectively corrupt user-return MSR values if the
TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected.
Selftests:
- Fix a math goof in mmu_stress_test when running on a single-CPU system/VM.
- Forcefully override ARCH from x86_64 to x86 to play nice with specifying
ARCH=x86_64 on the command line.
- Extend a bunch of nested VMX to validate nested SVM as well.
- Add support for LA57 in the core VM_MODE_xxx macro, and add a test to
verify KVM can save/restore nested VMX state when L1 is using 5-level
paging, but L2 is not.
- Clean up the guest paging code in anticipation of sharing the core logic for
nested EPT and nested NPT.
guest_memfd:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety of
rough edges in guest_memfd along the way.
- Define a CLASS to automatically handle get+put when grabbing a guest_memfd
from a memslot to make it harder to leak references.
- Enhance KVM selftests to make it easer to develop and debug selftests like
those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs
often result in hard-to-debug SIGBUS errors.
- Misc cleanups.
Generic:
- Use the recently-added WQ_PERCPU when creating the per-CPU workqueue for
irqfd cleanup.
- Fix a goof in the dirty ring documentation.
- Fix choice of target for directed yield across different calls to
kvm_vcpu_on_spin(); the function was always starting from the first
vCPU instead of continuing the round-robin search.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCgAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmkvMa8UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMlFwf+Ow7zOYUuELSQ+Jn+hOYXiCNrdBDx
ZamvMU8kLPr7XX0Zog6HgcMm//qyA6k5nSfqCjfsQZrIhRA/gWJ61jz1OX/Jxq18
pJ9Vz6epnEPYiOtBwz+v8OS8MqDqVNzj2i6W1/cLPQE50c1Hhw64HWS5CSxDQiHW
A7PVfl5YU12lW1vG3uE0sNESDt4Eh/spNM17iddXdF4ZUOGublserjDGjbc17E7H
8BX3DkC2plqkJKwtjg0ae62hREkITZZc7RqsnftUkEhn0N0H9+rb6NKUyzIVh9NZ
bCtCjtrKN9zfZ0Mujnms3ugBOVqNIputu/DtPnnFKXtXWSrHrgGSNv5ewA==
=PEcw
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- Support for userspace handling of synchronous external aborts
(SEAs), allowing the VMM to potentially handle the abort in a
non-fatal manner
- Large rework of the VGIC's list register handling with the goal of
supporting more active/pending IRQs than available list registers
in hardware. In addition, the VGIC now supports EOImode==1 style
deactivations for IRQs which may occur on a separate vCPU than the
one that acked the IRQ
- Support for FEAT_XNX (user / privileged execute permissions) and
FEAT_HAF (hardware update to the Access Flag) in the software page
table walkers and shadow MMU
- Allow page table destruction to reschedule, fixing long
need_resched latencies observed when destroying a large VM
- Minor fixes to KVM and selftests
Loongarch:
- Get VM PMU capability from HW GCFG register
- Add AVEC basic support
- Use 64-bit register definition for EIOINTC
- Add KVM timer test cases for tools/selftests
RISC/V:
- SBI message passing (MPXY) support for KVM guest
- Give a new, more specific error subcode for the case when in-kernel
AIA virtualization fails to allocate IMSIC VS-file
- Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
in small chunks
- Fix guest page fault within HLV* instructions
- Flush VS-stage TLB after VCPU migration for Andes cores
s390:
- Always allocate ESCA (Extended System Control Area), instead of
starting with the basic SCA and converting to ESCA with the
addition of the 65th vCPU. The price is increased number of exits
(and worse performance) on z10 and earlier processor; ESCA was
introduced by z114/z196 in 2010
- VIRT_XFER_TO_GUEST_WORK support
- Operation exception forwarding support
- Cleanups
x86:
- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO
SPTE caching is disabled, as there can't be any relevant SPTEs to
zap
- Relocate a misplaced export
- Fix an async #PF bug where KVM would clear the completion queue
when the guest transitioned in and out of paging mode, e.g. when
handling an SMI and then returning to paged mode via RSM
- Leave KVM's user-return notifier registered even when disabling
virtualization, as long as kvm.ko is loaded. On reboot/shutdown,
keeping the notifier registered is ok; the kernel does not use the
MSRs and the callback will run cleanly and restore host MSRs if the
CPU manages to return to userspace before the system goes down
- Use the checked version of {get,put}_user()
- Fix a long-lurking bug where KVM's lack of catch-up logic for
periodic APIC timers can result in a hard lockup in the host
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NTP corrections
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the
latter behind CONFIG_CPU_MITIGATIONS
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fast
path; the only reason they were handled in the fast path was to
paper of a bug in the core #MC code, and that has long since been
fixed
- Add emulator support for AVX MOV instructions, to play nice with
emulated devices whose guest drivers like to access PCI BARs with
large multi-byte instructions
x86 (AMD):
- Fix a few missing "VMCB dirty" bugs
- Fix the worst of KVM's lack of EFER.LMSLE emulation
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode
- Fix incorrect handling of selective CR0 writes when checking
intercepts during emulation of L2 instructions
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32]
on VMRUN and #VMEXIT
- Fix a bug where KVM corrupt the guest code stream when re-injecting
a soft interrupt if the guest patched the underlying code after the
VM-Exit, e.g. when Linux patches code with a temporary INT3
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits
to userspace, and extend KVM "support" to all policy bits that
don't require any actual support from KVM
x86 (Intel):
- Use the root role from kvm_mmu_page to construct EPTPs instead of
the current vCPU state, partly as worthwhile cleanup, but mostly to
pave the way for tracking per-root TLB flushes, and elide EPT
flushes on pCPU migration if the root is clean from a previous
flush
- Add a few missing nested consistency checks
- Rip out support for doing "early" consistency checks via hardware
as the functionality hasn't been used in years and is no longer
useful in general; replace it with an off-by-default module param
to WARN if hardware fails a check that KVM does not perform
- Fix a currently-benign bug where KVM would drop the guest's
SPEC_CTRL[63:32] on VM-Enter
- Misc cleanups
- Overhaul the TDX code to address systemic races where KVM (acting
on behalf of userspace) could inadvertantly trigger lock contention
in the TDX-Module; KVM was either working around these in weird,
ugly ways, or was simply oblivious to them (though even Yan's
devilish selftests could only break individual VMs, not the host
kernel)
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a
TDX vCPU, if creating said vCPU failed partway through
- Fix a few sparse warnings (bad annotation, 0 != NULL)
- Use struct_size() to simplify copying TDX capabilities to userspace
- Fix a bug where TDX would effectively corrupt user-return MSR
values if the TDX Module rejects VP.ENTER and thus doesn't clobber
host MSRs as expected
Selftests:
- Fix a math goof in mmu_stress_test when running on a single-CPU
system/VM
- Forcefully override ARCH from x86_64 to x86 to play nice with
specifying ARCH=x86_64 on the command line
- Extend a bunch of nested VMX to validate nested SVM as well
- Add support for LA57 in the core VM_MODE_xxx macro, and add a test
to verify KVM can save/restore nested VMX state when L1 is using
5-level paging, but L2 is not
- Clean up the guest paging code in anticipation of sharing the core
logic for nested EPT and nested NPT
guest_memfd:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety
of rough edges in guest_memfd along the way
- Define a CLASS to automatically handle get+put when grabbing a
guest_memfd from a memslot to make it harder to leak references
- Enhance KVM selftests to make it easer to develop and debug
selftests like those added for guest_memfd NUMA support, e.g. where
test and/or KVM bugs often result in hard-to-debug SIGBUS errors
- Misc cleanups
Generic:
- Use the recently-added WQ_PERCPU when creating the per-CPU
workqueue for irqfd cleanup
- Fix a goof in the dirty ring documentation
- Fix choice of target for directed yield across different calls to
kvm_vcpu_on_spin(); the function was always starting from the first
vCPU instead of continuing the round-robin search"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits)
KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
KVM: arm64: selftests: Add test for AT emulation
KVM: arm64: nv: Expose hardware access flag management to NV guests
KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
KVM: arm64: Implement HW access flag management in stage-1 SW PTW
KVM: arm64: Propagate PTW errors up to AT emulation
KVM: arm64: Add helper for swapping guest descriptor
KVM: arm64: nv: Use pgtable definitions in stage-2 walk
KVM: arm64: Handle endianness in read helper for emulated PTW
KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
KVM: arm64: Call helper for reading descriptors directly
KVM: arm64: nv: Advertise support for FEAT_XNX
KVM: arm64: Teach ptdump about FEAT_XNX permissions
KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions
...
|
|
|
|
2b09f480f0 |
A large overhaul of the restartable sequences and CID management:
The recent enablement of RSEQ in glibc resulted in regressions which are
caused by the related overhead. It turned out that the decision to invoke
the exit to user work was not really a decision. More or less each
context switch caused that. There is a long list of small issues which
sums up nicely and results in a 3-4% regression in I/O benchmarks.
The other detail which caused issues due to extra work in context switch
and task migration is the CID (memory context ID) management. It also
requires to use a task work to consolidate the CID space, which is
executed in the context of an arbitrary task and results in sporadic
uncontrolled exit latencies.
The rewrite addresses this by:
- Removing deprecated and long unsupported functionality
- Moving the related data into dedicated data structures which are
optimized for fast path processing.
- Caching values so actual decisions can be made
- Replacing the current implementation with a optimized inlined variant.
- Separating fast and slow path for architectures which use the generic
entry code, so that only fault and error handling goes into the
TIF_NOTIFY_RESUME handler.
- Rewriting the CID management so that it becomes mostly invisible in the
context switch path. That moves the work of switching modes into the
fork/exit path, which is a reasonable tradeoff. That work is only
required when a process creates more threads than the cpuset it is
allowed to run on or when enough threads exit after that. An artificial
thread pool benchmarks which triggers this did not degrade, it actually
improved significantly.
The main effect in migration heavy scenarios is that runqueue lock held
time and therefore contention goes down significantly.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmksaRYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoencEADA5he8PAFPmSRRPo6+2G5mHzWe8kIU
5ZViQStWFNAA0qqy8VXryWiJ6qqrO6la9o7K4YOXASUtlkVjquRp1DF7PabqGwuy
zshbRCXNlT51J8uqanN8VrGVjlf+bMdHDbGoI1SLkUTxG8b+kDD5PXUQE1ARelPP
Slbg9u+EMrxj6D5MDTPbuW6TqryJEkPtiNScyOz43emp9ww9+WVxenOcRqU4D+Th
mjWmrGIzkroSf4XReMoD/wg9TPTpUjXnNCwl2viY9JvBpkMfYtU4tJAGK3aNFOWy
zsAN0O9CaFGrUEFne7qUmtwhNLdtnjx5HN5pe7yZd1EhdTuQKq4jPiiQnwwm8w72
c0o6m45FNPmPoSyfaZWCkLjbTEUXonT9JF61iN35JVxim8gBDDJjHFKnLxDmLrH3
X0eESE48ReY2EneDV6Y8RJRo6oG14Fccvc39aTf/2Rw3trpmtt2agvConQzupQIg
DzANw4jhUUzFRrHrMHACNsqKFXh9ratue/S9DM3xxTpGO/bKdeK7jGIgzNf8O34M
J0O6Hvk5jMdcWlIJTx21GoGzoSkkXnR49g/71aCcp+MwdY4x9zFz5SWi8LWQRmkx
xRo6tY27Bma8/SEwMJjPpAUXDTpq6v+j3cPisybL1yGsyt9lh+p8LX7VUtwcoEqe
6ZelC5Kgw/+/kg==
=n5KT
-----END PGP SIGNATURE-----
Merge tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rseq updates from Thomas Gleixner:
"A large overhaul of the restartable sequences and CID management:
The recent enablement of RSEQ in glibc resulted in regressions which
are caused by the related overhead. It turned out that the decision to
invoke the exit to user work was not really a decision. More or less
each context switch caused that. There is a long list of small issues
which sums up nicely and results in a 3-4% regression in I/O
benchmarks.
The other detail which caused issues due to extra work in context
switch and task migration is the CID (memory context ID) management.
It also requires to use a task work to consolidate the CID space,
which is executed in the context of an arbitrary task and results in
sporadic uncontrolled exit latencies.
The rewrite addresses this by:
- Removing deprecated and long unsupported functionality
- Moving the related data into dedicated data structures which are
optimized for fast path processing.
- Caching values so actual decisions can be made
- Replacing the current implementation with a optimized inlined
variant.
- Separating fast and slow path for architectures which use the
generic entry code, so that only fault and error handling goes into
the TIF_NOTIFY_RESUME handler.
- Rewriting the CID management so that it becomes mostly invisible in
the context switch path. That moves the work of switching modes
into the fork/exit path, which is a reasonable tradeoff. That work
is only required when a process creates more threads than the
cpuset it is allowed to run on or when enough threads exit after
that. An artificial thread pool benchmarks which triggers this did
not degrade, it actually improved significantly.
The main effect in migration heavy scenarios is that runqueue lock
held time and therefore contention goes down significantly"
* tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
sched/mmcid: Switch over to the new mechanism
sched/mmcid: Implement deferred mode change
irqwork: Move data struct to a types header
sched/mmcid: Provide CID ownership mode fixup functions
sched/mmcid: Provide new scheduler CID mechanism
sched/mmcid: Introduce per task/CPU ownership infrastructure
sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex
sched/mmcid: Provide precomputed maximal value
sched/mmcid: Move initialization out of line
signal: Move MMCID exit out of sighand lock
sched/mmcid: Convert mm CID mask to a bitmap
cpumask: Cache num_possible_cpus()
sched/mmcid: Use cpumask_weighted_or()
cpumask: Introduce cpumask_weighted_or()
sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()
sched/mmcid: Move scheduler code out of global header
sched: Fixup whitespace damage
sched/mmcid: Cacheline align MM CID storage
sched/mmcid: Use proper data structures
sched/mmcid: Revert the complex CID management
...
|
|
|
|
de8e8ebb1a |
KVM TDX changes for 6.19:
- Overhaul the TDX code to address systemic races where KVM (acting on behalf
of userspace) could inadvertantly trigger lock contention in the TDX-Module,
which KVM was either working around in weird, ugly ways, or was simply
oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever
selftests).
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if
creating said vCPU failed partway through.
- Fix a few sparse warnings (bad annotation, 0 != NULL).
- Use struct_size() to simplify copying capabilities to userspace.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmkmVkAACgkQOlYIJqCj
N/18Ow//cWPmXAdJcM0fRtnSGwzIZszGSD63htgdh5UDeJIFVyUGKH7uGhndQUwK
Uo8jCJ4ikwMxDdCijv+e4eqCCMZjb7HQhFKaauPVCJZOhmZn0br3EB5xX24Qgp8R
YN5gTheiTCHHVaxAMl9grgi1xTRi6pJRufRebOmtyGKNQkclctXcuSdtw7IEhqdM
wKM3eyb7qUhUrmt5tBkSyFAioGcPJIHE3vqLjImqDgduinbXJdQa1sek4Br0sX45
rfISZ2geXDj/Sh7EPrPU1ne5LQbtgzp1WTG6MRCidYfP86riMQUlEMY6odEYAgIX
kCd+z248OJShF5EYcEmjc894YLHJ0vVXIXKx/qh0+Jiobz3bujk+whaxTNa26rj0
3qLPGzFpYugtxkGqBYH4q90oUTovEk4922+jPsQ9GKQ26f0q3XzvriEUSOgrvo0Z
O26OyK7BezqSM5WMMSf/EGI1ESuli5lbBLYDOaNZS35di2YcDEgtaikRETpWwy82
TGxrjyeW9Pu6M3iTtQsOVHNxA4hU//Qd5HcDj5rcXOg1rgiPV9n2OaCEMwc6qi+V
VytbGm4IlMsz6AVHqyv3SUIt1Z4LNAZ/FwK8oeBRVd6LNfm6nfyrW6eQFQVLoIpA
1nyi9XjMg7xj6ubiSEQSTSl9gto8FzVWwLKwZ8dLH7SPvqlz+zY=
=qGpA
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-tdx-6.19' of https://github.com/kvm-x86/linux into HEAD
KVM TDX changes for 6.19:
- Overhaul the TDX code to address systemic races where KVM (acting on behalf
of userspace) could inadvertantly trigger lock contention in the TDX-Module,
which KVM was either working around in weird, ugly ways, or was simply
oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever
selftests).
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if
creating said vCPU failed partway through.
- Fix a few sparse warnings (bad annotation, 0 != NULL).
- Use struct_size() to simplify copying capabilities to userspace.
|
|
|
|
236831743c |
KVM guest_memfd changes for 6.19:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety of
rough edges in guest_memfd along the way.
- Define a CLASS to automatically handle get+put when grabbing a guest_memfd
from a memslot to make it harder to leak references.
- Enhance KVM selftests to make it easer to develop and debug selftests like
those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs
often result in hard-to-debug SIGBUS errors.
- Misc cleanups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmkmSgcACgkQOlYIJqCj
N/0MBQ/+KzI/6q6AR2m9jYCbz8APcJpmAEJ4Ma0+Q4XrJfkymAmt9P4M46bRZl5G
Aznaqq9vak1weIzlaFsOzYqEWvSN54P/EfZKkqh0kCDKXGzl1HkSCC7FeThyqZz0
+uuWUiRtWI/dyNFEpIXB/G06DwqhMIKAlk421Zt84iBI/wz2oZeAgWoFCjPWca4a
/L/ClmpzM6LnP/Hg2DyoZtfwAXIy65pb9h0IhKbvGcgSrS4sesZPiSV20KKvKVSp
4+WVLHuNbjk9vWKkmV8IZH+BXAO2J2+y2JYckbx4DvUKQcauXUJjFjp8+wZ4gMrC
SK3SuWTTc3oe7fhaZZ98KY3BO9dq68iCbxpmdkhYrEbNqcDbQA5GMhIUZ2TgqDX7
KQ58s9zyHZvsX4cnL3XY9igsdl6tvqnFqVhGyBaxUrWNJDqAs3NPJr8Td3cnrMKg
MRB2AaJN8xX6DK7JAxKQ4LC0Nb7w3hxF0t0XL2XzBw+6nsu67NjrM7EflIhG+mHJ
YWhrnNh83Yut95cym5mpUU0kFqfHNB5SxRGGokDcQ1NAXBwwE2Tq+YopR9OKRfiZ
52/hkC195D7Xaeo9raf6iKN+YfiZ0uj3TvxuxIHs/EIqmmjhsc8VSwkfFW5OIeYf
Tulmh/ohkq1675By/dapyHvW97PSgd0IqbDzjDrlxmbZbuuS1F8=
=j3On
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-gmem-6.19' of https://github.com/kvm-x86/linux into HEAD
KVM guest_memfd changes for 6.19:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety of
rough edges in guest_memfd along the way.
- Define a CLASS to automatically handle get+put when grabbing a guest_memfd
from a memslot to make it harder to leak references.
- Enhance KVM selftests to make it easer to develop and debug selftests like
those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs
often result in hard-to-debug SIGBUS errors.
- Misc cleanups.
|
|
|
|
9aca52b552 |
KVM generic changes for 6.19:
- Use the recently-added WQ_PERCPU when creating the per-CPU workqueue for
irqfd cleanup.
- Fix a goof in the dirty ring documentation.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmkmSKsACgkQOlYIJqCj
N/00fA/6A15+hw7l/McGvaiYU1u7BZP2ST9ySK6j+opLJaly1CzJF17164+0sxRr
igwVZrS4pzjlucC8T308AjqMAqkOPkOeBW3YBftdRWUO8gd0ns8XZuYtUxEXcCM1
s6kIQrtuYmMplW0Gb6zCcXuVCFiVkTbV/ZBsk/URgf5tA1hX8W5Tj/VtufEEPqzv
40N6HRvUz2xg5wQWEx9hOG1WQtvPhjOhbRXwDk/Tz/BYoHPK5lh3HEfNAGgyosoW
nOtpm41iA0S/AFCnucuhK9Sv0JzIvcArp1znjXo/f742k8LJZ8UwNjk7iccAYspk
4QzHpXi4MkUoIpUQ/is6ODgu4E4g6/f3QUOmbvX2snXXLluvwfRUrEaoDRaJtaXG
3W5Wz2f5j+J6oNRNWT96qDBbbI9s16tHEFhds1cyk964JYUs5mfVx8d0iNfsmXa0
SDDPvqoFWx3yXtEHg2rgGGQm/oQ02D/azn2XZ/ZMy6EcUP33nVAm4zkeJZXFjofE
Uk7t0kfE4tRweHIxZuOsp9GmSu8Ps5asZpNZuaJWBivPiHxCE7WCSNjM9USUbqFt
NShsZjPMOejGmDfP5e+xbwMrIDr1Ywwp6+QZXDyfj9+O+u3HJ/mg7QyfwWCjuFLy
zD9OCwYnZzvmTFA8zB6kE9DRZY1TLDZWhdmJEWHwIrqrAu0kXv8=
=NRH/
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-generic-6.19' of https://github.com/kvm-x86/linux into HEAD
KVM generic changes for 6.19:
- Use the recently-added WQ_PERCPU when creating the per-CPU workqueue for
irqfd cleanup.
- Fix a goof in the dirty ring documentation.
|
|
|
|
32bd348be3 |
KVM: Fix last_boosted_vcpu index assignment bug
In kvm_vcpu_on_spin(), the loop counter 'i' is incorrectly written to last_boosted_vcpu instead of the actual vCPU index 'idx'. This causes last_boosted_vcpu to store the loop iteration count rather than the vCPU index, leading to incorrect round-robin behavior in subsequent directed yield operations. Fix this by using 'idx' instead of 'i' in the assignment. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-ID: <20251110033232.12538-7-kernellwp@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
a97fbc3ee3 |
syscore: Pass context data to callbacks
Several drivers can benefit from registering per-instance data along with the syscore operations. To achieve this, move the modifiable fields out of the syscore_ops structure and into a separate struct syscore that can be registered with the framework. Add a void * driver data field for drivers to store contextual data that will be passed to the syscore ops. Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Signed-off-by: Thierry Reding <treding@nvidia.com> |
|
|
|
50efc2340a |
KVM: Rename kvm_arch_vcpu_async_ioctl() to kvm_arch_vcpu_unlocked_ioctl()
Rename the "async" ioctl API to "unlocked" so that upcoming usage in x86's TDX code doesn't result in a massive misnomer. To avoid having to retry SEAMCALLs, TDX needs to acquire kvm->lock *and* all vcpu->mutex locks, and acquiring all of those locks after/inside the current vCPU's mutex is a non-starter. However, TDX also needs to acquire the vCPU's mutex and load the vCPU, i.e. the handling is very much not async to the vCPU. No functional change intended. Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
0a0da3f921 |
KVM: Make support for kvm_arch_vcpu_async_ioctl() mandatory
Implement kvm_arch_vcpu_async_ioctl() "natively" in x86 and arm64 instead of relying on an #ifdef'd stub, and drop HAVE_KVM_VCPU_ASYNC_IOCTL in anticipation of using the API on x86. Once x86 uses the API, providing a stub for one architecture and having all other architectures opt-in requires more code than simply implementing the API in the lone holdout. Eliminating the Kconfig will also reduce churn if the API is renamed in the future (spoiler alert). No functional change intended. Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251030200951.3402865-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
ae431059e7 |
KVM: guest_memfd: Remove bindings on memslot deletion when gmem is dying
When unbinding a memslot from a guest_memfd instance, remove the bindings
even if the guest_memfd file is dying, i.e. even if its file refcount has
gone to zero. If the memslot is freed before the file is fully released,
nullifying the memslot side of the binding in kvm_gmem_release() will
write to freed memory, as detected by syzbot+KASAN:
==================================================================
BUG: KASAN: slab-use-after-free in kvm_gmem_release+0x176/0x440 virt/kvm/guest_memfd.c:353
Write of size 8 at addr ffff88807befa508 by task syz.0.17/6022
CPU: 0 UID: 0 PID: 6022 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xca/0x240 mm/kasan/report.c:482
kasan_report+0x118/0x150 mm/kasan/report.c:595
kvm_gmem_release+0x176/0x440 virt/kvm/guest_memfd.c:353
__fput+0x44c/0xa70 fs/file_table.c:468
task_work_run+0x1d4/0x260 kernel/task_work.c:227
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
exit_to_user_mode_loop+0xe9/0x130 kernel/entry/common.c:43
exit_to_user_mode_prepare include/linux/irq-entry-common.h:225 [inline]
syscall_exit_to_user_mode_work include/linux/entry-common.h:175 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:210 [inline]
do_syscall_64+0x2bd/0xfa0 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fbeeff8efc9
</TASK>
Allocated by task 6023:
kasan_save_stack mm/kasan/common.c:56 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
poison_kmalloc_redzone mm/kasan/common.c:397 [inline]
__kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:414
kasan_kmalloc include/linux/kasan.h:262 [inline]
__kmalloc_cache_noprof+0x3e2/0x700 mm/slub.c:5758
kmalloc_noprof include/linux/slab.h:957 [inline]
kzalloc_noprof include/linux/slab.h:1094 [inline]
kvm_set_memory_region+0x747/0xb90 virt/kvm/kvm_main.c:2104
kvm_vm_ioctl_set_memory_region+0x6f/0xd0 virt/kvm/kvm_main.c:2154
kvm_vm_ioctl+0x957/0xc60 virt/kvm/kvm_main.c:5201
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 6023:
kasan_save_stack mm/kasan/common.c:56 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584
poison_slab_object mm/kasan/common.c:252 [inline]
__kasan_slab_free+0x5c/0x80 mm/kasan/common.c:284
kasan_slab_free include/linux/kasan.h:234 [inline]
slab_free_hook mm/slub.c:2533 [inline]
slab_free mm/slub.c:6622 [inline]
kfree+0x19a/0x6d0 mm/slub.c:6829
kvm_set_memory_region+0x9c4/0xb90 virt/kvm/kvm_main.c:2130
kvm_vm_ioctl_set_memory_region+0x6f/0xd0 virt/kvm/kvm_main.c:2154
kvm_vm_ioctl+0x957/0xc60 virt/kvm/kvm_main.c:5201
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Deliberately don't acquire filemap invalid lock when the file is dying as
the lifecycle of f_mapping is outside the purview of KVM. Dereferencing
the mapping is *probably* fine, but there's no need to invalidate anything
as memslot deletion is responsible for zapping SPTEs, and the only code
that can access the dying file is kvm_gmem_release(), whose core code is
mutually exclusive with unbinding.
Note, the mutual exclusivity is also what makes it safe to access the
bindings on a dying gmem instance. Unbinding either runs with slots_lock
held, or after the last reference to the owning "struct kvm" is put, and
kvm_gmem_release() nullifies the slot pointer under slots_lock, and puts
its reference to the VM after that is done.
Reported-by: syzbot+2479e53d0db9b32ae2aa@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68fa7a22.a70a0220.3bf6c6.008b.GAE@google.com
Tested-by: syzbot+2479e53d0db9b32ae2aa@syzkaller.appspotmail.com
Fixes:
|
|
|
|
83409986f4 |
rseq, virt: Retrigger RSEQ after vcpu_run()
Hypervisors invoke resume_user_mode_work() before entering the guest, which clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user space context available to them, so the rseq notify handler skips inspecting the critical section, but updates the CPU/MM CID values unconditionally so that the eventual pending rseq event is not lost on the way to user space. This is a pointless exercise as the task might be rescheduled before actually returning to user space and it creates unnecessary work in the vcpu_run() loops. It's way more efficient to ignore that invocation based on @regs == NULL and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the vcpu_run() loop before returning from the ioctl(). This ensures that a pending RSEQ update is not lost and the IDs are updated before returning to user space. Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into a NOOP. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20251027084306.399495855@linutronix.de |
|
|
|
0bb4d9c39b |
KVM: guest_memfd: Define a CLASS to get+put guest_memfd file from a memslot
Add a CLASS to handle getting and putting a guest_memfd file given a memslot to reduce the amount of related boilerplate, and more importantly to minimize the chances of forgetting to put the file (thankfully the bug that prompted this didn't escape initial testing). Define a CLASS instead of using __free(fput) as _free() comes with subtle caveats related to FILO ordering (objects are freed in the order in which they are declared), and the recommended solution/workaround (declare file pointers exactly when they are initialized) is visually jarring relative to KVM's (and the kernel's) overall strict adherence to not mixing declarations and code. E.g. the use in kvm_gmem_populate() would be: slot = gfn_to_memslot(kvm, start_gfn); if (!kvm_slot_has_gmem(slot)) return -EINVAL; struct file *file __free(fput) = kvm_gmem_get_file(slot; if (!file) return -EFAULT; filemap_invalidate_lock(file->f_mapping); Note, using CLASS() still declares variables in the middle of code, but the syntactic sugar obfuscates the declaration, i.e. hides the anomaly to a large extent. No functional change intended. Link: https://lore.kernel.org/r/20251007222356.348349-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
e66438bb81 |
KVM: guest_memfd: Add gmem_inode.flags field instead of using i_private
Track a guest_memfd instance's flags in gmem_inode instead of burying them in i_private. Burning an extra 8 bytes per inode is well worth the added clarity provided by explicit tracking. Reviewed-by: Shivank Garg <shivankg@amd.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lore.kernel.org/r/20251016172853.52451-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
ed1ffa810b |
KVM: guest_memfd: Enforce NUMA mempolicy using shared policy
Previously, guest-memfd allocations followed local NUMA node id in absence of process mempolicy, resulting in arbitrary memory allocation. Moreover, mbind() couldn't be used by the VMM as guest memory wasn't mapped into userspace when allocation occurred. Enable NUMA policy support by implementing vm_ops for guest-memfd mmap operation. This allows the VMM to use mmap()+mbind() to set the desired NUMA policy for a range of memory, and provides fine-grained control over guest memory allocation across NUMA nodes. Note, using mmap()+mbind() works even for PRIVATE memory, as mbind() doesn't require the memory to be faulted in. However, get_mempolicy() and other paths that require the userspace page tables to be populated may return incorrect information for PRIVATE memory (though under the hood, KVM+guest_memfd will still behave correctly). Store the policy in the inode structure, gmem_inode, as a shared memory policy, so that the policy is a property of the physical memory itself, i.e. not bound to the VMA. In guest_memfd, KVM is the primary MMU and any VMAs are secondary, i.e. using mbind() on a VMA to set policy is a means to an end, e.g. to avoid having to add a file-based equivalent to mbind(). Similarly, retrieve the policy via mpol_shared_policy_lookup(), not get_vma_policy(), even when allocating to fault in memory for userspace mappings, so that the policy stored in gmem_inode is always the source of true. Apply policy changes only to future allocations, i.e. do not migrate existing memory in the guest_memfd instance. This matches mbind(2)'s default behavior, which affects only new allocations unless overridden with MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags (which are not supported by guest_memfd as guest_memfd memory is unmovable). Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Shivank Garg <shivankg@amd.com> Tested-by: Ashish Kalra <ashish.kalra@amd.com> Link: https://lore.kernel.org/all/e9d43abc-bcdb-4f9f-9ad7-5644f714de19@amd.com [sean: fold in fixup (see Link above), massage changelog] Link: https://lore.kernel.org/r/20251016172853.52451-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
f609e89ae8 |
KVM: guest_memfd: Add slab-allocated inode cache
Add a dedicated gmem_inode structure and a slab-allocated inode cache for guest memory backing, similar to how shmem handles inodes. This adds the necessary allocation/destruction functions and prepares for upcoming guest_memfd NUMA policy support changes. Using a dedicated structure will also allow for additional cleanups, e.g. to track flags in gmem_inode instead of i_private. Signed-off-by: Shivank Garg <shivankg@amd.com> Tested-by: Ashish Kalra <ashish.kalra@amd.com> [sean: s/kvm_gmem_inode_info/gmem_inode, name init_once()] Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com> Link: https://lore.kernel.org/r/20251016172853.52451-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
a63ca4236e |
KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes
guest_memfd's inode represents memory the guest_memfd is providing. guest_memfd's file represents a struct kvm's view of that memory. Using a custom inode allows customization of the inode teardown process via callbacks. For example, ->evict_inode() allows customization of the truncation process on file close, and ->destroy_inode() and ->free_inode() allow customization of the inode freeing process. Customizing the truncation process allows flexibility in management of guest_memfd memory and customization of the inode freeing process allows proper cleanup of memory metadata stored on the inode. Memory metadata is more appropriately stored on the inode (as opposed to the file), since the metadata is for the memory and is not unique to a specific binding and struct kvm. Acked-by: David Hildenbrand <david@redhat.com> Co-developed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Shivank Garg <shivankg@amd.com> Tested-by: Ashish Kalra <ashish.kalra@amd.com> [sean: drop helpers, open code logic in __kvm_gmem_create()] Link: https://lore.kernel.org/r/20251016172853.52451-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
392dd9d948 |
KVM: guest_memfd: Add macro to iterate over gmem_files for a mapping/inode
Add a kvm_gmem_for_each_file() to make it more obvious that KVM is iterating over guest_memfd _files_, not guest_memfd instances, as could be assumed given the name "gmem_list". No functional change intended. Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lore.kernel.org/r/20251016172853.52451-3-seanjc@google.com [sean: drop .clang-format change] Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
497b1dfbca |
KVM: guest_memfd: Rename "struct kvm_gmem" to "struct gmem_file"
Rename the "kvm_gmem" structure to "gmem_file" in anticipation of using dedicated guest_memfd inodes instead of anonyomous inodes, at which point the "kvm_gmem" nomenclature becomes quite misleading. In guest_memfd, inodes are effectively the raw underlying physical storage, and will be used to track properties of the physical memory, while each gmem file is effectively a single VM's view of that storage, and is used to track assets specific to its associated VM, e.g. memslots=>gmem bindings. Using "kvm_gmem" suggests that the per-VM/per-file structures are _the_ guest_memfd instance, which almost the exact opposite of reality. Opportunistically rename local variables from "gmem" to "f", again to avoid confusion once guest_memfd specific inodes come along. No functional change intended. Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lore.kernel.org/r/20251016172853.52451-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
5f3e10797a |
KVM: guest_memfd: Drop a superfluous local var in kvm_gmem_fault_user_mapping()
Drop the local "int err" that's buried in the middle guest_memfd's user fault handler to avoid the potential for variable shadowing, e.g. if an "err" variable were also declared at function scope. No functional change intended. Link: https://lore.kernel.org/r/20251007222733.349460-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
765fcd7c07 |
KVM: guest_memfd: use folio_nr_pages() instead of shift operation
folio_nr_pages() is a faster helper function to get the number of pages when NR_PAGES_IN_LARGE_FOLIO is enabled. Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Link: https://lore.kernel.org/r/20251004030210.49080-1-pedrodemargomes@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
3f1078a445 |
KVM: guest_memfd: remove redundant gmem variable initialization
Remove redundant initialization of gmem in __kvm_gmem_get_pfn() as it is already initialized at the top of the function. No functional change intended. Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Shivank Garg <shivankg@amd.com> Link: https://lore.kernel.org/r/20251012071607.17646-2-shivankg@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
049e560d4f |
KVM: guest_memfd: move kvm_gmem_get_index() and use in kvm_gmem_prepare_folio()
Move kvm_gmem_get_index() to the top of the file so that it can be used in kvm_gmem_prepare_folio() to replace the open-coded calculation. No functional change intended. Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Shivank Garg <shivankg@amd.com> Link: https://lore.kernel.org/r/20251012071607.17646-1-shivankg@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
4361f5aa8b |
KVM x86 fixes for 6.18:
- Expand the KVM_PRE_FAULT_MEMORY selftest to add a regression test for the
bug fixed by commit
|
|
|
|
9259607ec7 |
KVM: Explicitly allocate/setup irqfd cleanup as per-CPU workqueue
Explicitly request the use of per-CPU queues for the irqfd cleanup
workqueue in preparation for changing the default behavior of
alloc_workqueue() from per-CPU to unbound, which will in turn allow for
the removal of WQ_UNBOUND. See commit
|
|
|
|
44c6cb9fe9 |
KVM: guest_memfd: Allow mmap() on guest_memfd for x86 VMs with private memory
Allow mmap() on guest_memfd instances for x86 VMs with private memory as
the need to track private vs. shared state in the guest_memfd instance is
only pertinent to INIT_SHARED. Doing mmap() on private memory isn't
terrible useful (yet!), but it's now possible, and will be desirable when
guest_memfd gains support for other VMA-based syscalls, e.g. mbind() to
set NUMA policy.
Lift the restriction now, before MMAP support is officially released, so
that KVM doesn't need to add another capability to enumerate support for
mmap() on private memory.
Fixes:
|
|
|
|
9aef71c892 |
KVM: Explicitly mark KVM_GUEST_MEMFD as depending on KVM_GENERIC_MMU_NOTIFIER
Add KVM_GENERIC_MMU_NOTIFIER as a dependency for selecting KVM_GUEST_MEMFD,
as guest_memfd relies on kvm_mmu_invalidate_{begin,end}(), which are
defined if and only if the generic mmu_notifier implementation is enabled.
The missing dependency is currently benign as s390 is the only KVM arch
that doesn't utilize the generic mmu_notifier infrastructure, and s390
doesn't currently support guest_memfd.
Fixes:
|
|
|
|
5d3341d684 |
KVM: guest_memfd: Invalidate SHARED GPAs if gmem supports INIT_SHARED
When invalidating gmem ranges, e.g. in response to PUNCH_HOLE, process all
possible range types (PRIVATE vs. SHARED) for the gmem instance. Since
since guest_memfd doesn't yet support in-place conversions, simply pivot
on INIT_SHARED as a gmem instance can currently only have private or shared
memory, not both.
Failure to mark shared GPAs for invalidation is benign in the current code
base, as only x86's TDX consumes KVM_FILTER_{PRIVATE,SHARED}, and TDX
doesn't yet support INIT_SHARED with guest_memfd. However, invalidating
only private GPAs is conceptually wrong and a lurking bug, e.g. could
result in missed invalidations if ARM starts filtering invalidations based
on attributes.
Fixes:
|
|
|
|
fe2bf6234e |
KVM: guest_memfd: Add INIT_SHARED flag, reject user page faults if not set
Add a guest_memfd flag to allow userspace to state that the underlying
memory should be configured to be initialized as shared, and reject user
page faults if the guest_memfd instance's memory isn't shared. Because
KVM doesn't yet support in-place private<=>shared conversions, all
guest_memfd memory effectively follows the initial state.
Alternatively, KVM could deduce the initial state based on MMAP, which for
all intents and purposes is what KVM currently does. However, implicitly
deriving the default state based on MMAP will result in a messy ABI when
support for in-place conversions is added.
For x86 CoCo VMs, which don't yet support MMAP, memory is currently private
by default (otherwise the memory would be unusable). If MMAP implies
memory is shared by default, then the default state for CoCo VMs will vary
based on MMAP, and from userspace's perspective, will change when in-place
conversion support is added. I.e. to maintain guest<=>host ABI, userspace
would need to immediately convert all memory from shared=>private, which
is both ugly and inefficient. The inefficiency could be avoided by adding
a flag to state that memory is _private_ by default, irrespective of MMAP,
but that would lead to an equally messy and hard to document ABI.
Bite the bullet and immediately add a flag to control the default state so
that the effective behavior is explicit and straightforward.
Fixes:
|
|
|
|
d2042d8f96 |
KVM: Rework KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS
Rework the not-yet-released KVM_CAP_GUEST_MEMFD_MMAP into a more generic KVM_CAP_GUEST_MEMFD_FLAGS capability so that adding new flags doesn't require a new capability, and so that developers aren't tempted to bundle multiple flags into a single capability. Note, kvm_vm_ioctl_check_extension_generic() can only return a 32-bit value, but that limitation can be easily circumvented by adding e.g. KVM_CAP_GUEST_MEMFD_FLAGS2 in the unlikely event guest_memfd supports more than 32 flags. Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20251003232606.4070510-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
2215336295 |
hyperv-next for v6.18
-----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmjkpakTHHdlaS5saXVA a2VybmVsLm9yZwAKCRB2FHBfkEGgXip5B/48MvTFJ1qwRGPVzevZQ8Z4SDogEREp 69VS/xRf1YCIzyXyanwqf1dXLq8NAqicSp6ewpJAmNA55/9O0cwT2EtohjeGCu61 krPIvS3KT7xI0uSEniBdhBtALYBscnQ0e3cAbLNzL7bwA6Q6OmvoIawpBADgE/cW aZNCK9jy+WUqtXc6lNtkJtST0HWGDn0h04o2hjqIkZ+7ewjuEEJBUUB/JZwJ41Od UxbID0PAcn9O4n/u/Y/GH65MX+ddrdCgPHEGCLAGAKT24lou3NzVv445OuCw0c4W ilALIRb9iea56ZLVBW5O82+7g9Ag41LGq+841MNlZjeRNONGykaUpTWZ =OR26 -----END PGP SIGNATURE----- Merge tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Unify guest entry code for KVM and MSHV (Sean Christopherson) - Switch Hyper-V MSI domain to use msi_create_parent_irq_domain() (Nam Cao) - Add CONFIG_HYPERV_VMBUS and limit the semantics of CONFIG_HYPERV (Mukesh Rathor) - Add kexec/kdump support on Azure CVMs (Vitaly Kuznetsov) - Deprecate hyperv_fb in favor of Hyper-V DRM driver (Prasanna Kumar T S M) - Miscellaneous enhancements, fixes and cleanups (Abhishek Tiwari, Alok Tiwari, Nuno Das Neves, Wei Liu, Roman Kisel, Michael Kelley) * tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: hyperv: Remove the spurious null directive line MAINTAINERS: Mark hyperv_fb driver Obsolete fbdev/hyperv_fb: deprecate this in favor of Hyper-V DRM driver Drivers: hv: Make CONFIG_HYPERV bool Drivers: hv: Add CONFIG_HYPERV_VMBUS option Drivers: hv: vmbus: Fix typos in vmbus_drv.c Drivers: hv: vmbus: Fix sysfs output format for ring buffer index Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store() x86/hyperv: Switch to msi_create_parent_irq_domain() mshv: Use common "entry virt" APIs to do work in root before running guest entry: Rename "kvm" entry code assets to "virt" to genericize APIs entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper mshv: Handle NEED_RESCHED_LAZY before transferring to guest x86/hyperv: Add kexec/kdump support on Azure CVMs Drivers: hv: Simplify data structures for VMBus channel close message Drivers: hv: util: Cosmetic changes for hv_utils_transport.c mshv: Add support for a new parent partition configuration clocksource: hyper-v: Skip unnecessary checks for the root partition hyperv: Add missing field to hv_output_map_device_interrupt |
|
|
|
9be7e1e320 |
entry: Rename "kvm" entry code assets to "virt" to genericize APIs
Rename the "kvm" entry code files and Kconfigs to use generic "virt" nomenclature so that the code can be reused by other hypervisors (or rather, their root/dom0 partition drivers), without incorrectly suggesting the code somehow relies on and/or involves KVM. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> |
|
|
|
20c4892058 |
KVM: Export KVM-internal symbols for sub-modules only
Rework the vast majority of KVM's exports to expose symbols only to KVM
submodules, i.e. to x86's kvm-{amd,intel}.ko and PPC's kvm-{pr,hv}.ko.
With few exceptions, KVM's exported APIs are intended (and safe) for KVM-
internal usage only.
Keep kvm_get_kvm(), kvm_get_kvm_safe(), and kvm_put_kvm() as normal
exports, as they are needed by VFIO, and are generally safe for external
usage (though ideally even the get/put APIs would be KVM-internal, and
VFIO would pin a VM by grabbing a reference to its associated file).
Implement a framework in kvm_types.h in anticipation of providing a macro
to restrict KVM-specific kernel exports, i.e. to provide symbol exports
for KVM if and only if KVM is built as one or more modules.
Link: https://lore.kernel.org/r/20250919003303.1355064-3-seanjc@google.com
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
a104e0a305 |
KVM SVM changes for 6.18
- Require a minimum GHCB version of 2 when starting SEV-SNP guests via
KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors
instead of latent guest failures.
- Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted
host from tampering with the guest's TSC frequency, while still allowing the
the VMM to configure the guest's TSC frequency prior to launch.
- Mitigate the potential for TOCTOU bugs when accessing GHCB fields by
wrapping all accesses via READ_ONCE().
- Validate the XCR0 provided by the guest (via the GHCB) to avoid tracking a
bogous XCR0 value in KVM's software model.
- Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to
avoid leaving behind stale state (thankfully not consumed in KVM).
- Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE
instead of subtly relying on guest_memfd to do the "heavy" lifting.
- Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's
desired TSC_AUX, to fix a bug where KVM could clobber a different vCPU's
TSC_AUX due to hardware not matching the value cached in the user-return MSR
infrastructure.
- Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported,
and clean up the AVIC initialization code along the way.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmjXH54ACgkQOlYIJqCj
N/0OCw//e+0o6jov6/PO8ljq6sXJySsXKxEFYnvQlWYzjqtlVs05Y2SY0GBTnMu3
g0ie2c4V3VD7cY5bGAWETWvrOMLqGXM3E7v9dVOuE4xU3xx0HkCAlXc/woOLUXoT
jo/komNXnpeiZ1QRO9FlGooHTJ6Y+jg6/mM7asStS2Pk3Mm//wYgQej9mSJDrypo
NB4+BCS9cyt8rndNtCUkyedFYMboVQ8AEvXh/jeydhw4rdbBh0/Ci2IKGcVI5DP1
be8GD/FsNTIUDtieHRYCR+LCKCMFj/hYzlg2nQ6UjxHZbvlDyQuh2Ld2LtZiGSef
ejNr9e+ro6vxWBgX6wplWtKRLxBYEnQ1h/rQ9A3g50TuhrtFJbxBxY7DPQ16hlBJ
EB/E1JFvVgkGVrYN0oPQCvvfhFtpkx43qnEBw4q0pbdAS79XOnG2GJFvI0hpZAP6
qwy19lbsJ5g3qLTlDPChxQJC08gThn3CbarCmZNNzBpPDQoLDUfYBfyN4prRPuiN
UByfaaEC0Fi6JSgmHsO0LsUB9K++k2ucWiIIW4YQhVgPUtCjTNLe9omgGJ1UYe0X
YITqgklewe3QtBJ46JE0APkPaHio7r6zd7QvO+RhRFkjwZfY6dlsrSImykKrpK3O
rPaZnW+UpAnA1XIqroMl1RVoczFCfGcP1Cat9JwScBVVxjJ1DlI=
=zd53
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.18' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.18
- Require a minimum GHCB version of 2 when starting SEV-SNP guests via
KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors
instead of latent guest failures.
- Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted
host from tampering with the guest's TSC frequency, while still allowing the
the VMM to configure the guest's TSC frequency prior to launch.
- Mitigate the potential for TOCTOU bugs when accessing GHCB fields by
wrapping all accesses via READ_ONCE().
- Validate the XCR0 provided by the guest (via the GHCB) to avoid tracking a
bogous XCR0 value in KVM's software model.
- Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to
avoid leaving behind stale state (thankfully not consumed in KVM).
- Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE
instead of subtly relying on guest_memfd to do the "heavy" lifting.
- Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's
desired TSC_AUX, to fix a bug where KVM could clobber a different vCPU's
TSC_AUX due to hardware not matching the value cached in the user-return MSR
infrastructure.
- Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported,
and clean up the AVIC initialization code along the way.
|
|
|
|
5b0d0d8542 |
KVM x86 MMU changes for 6.18
- Recover possible NX huge pages within the TDP MMU under read lock to
reduce guest jitter when restoring NX huge pages.
- Return -EAGAIN during prefault if userspace concurrently deletes/moves the
relevant memslot to fix an issue where prefaulting could deadlock with the
memslot update.
- Don't retry in TDX's anti-zero-step mitigation if the target memslot is
invalid, i.e. is being deleted or moved, to fix a deadlock scenario similar
to the aforementioned prefaulting case.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmjXHaEACgkQOlYIJqCj
N/1uDxAAxGMl1q1Hg0tpVPw7PdcourXlVYJjFzsrK6CdtZpL7n2GJPVhEFBDovud
oIM9IIiP5f2UDtWeRb6b/mm9INqwTB8lyswbJk/tO+CshBiBdE7PfDbzDzvj9lAv
Uecc6tQhv+CDpJcSf7t5OqgiRo5gEBTXZZj0l5GOdtiaOU09eq4ttZTME5S1jQgh
kBddFd3glWeMLv67cTNCxdHsOFnaVWIBoupfw7Fv7LVJ1k6cgKyHAhjfq8A9elEK
3CyDo8DZ8MG4aguhHzAUQuEM9ELMxOTyJG8xS2BWtFA/glbvUBnOfGeyTmHgo/nN
qKyjytlpmO0yIlehTd/5tLfpidL8l30VN7+nDpqwTjCDEz9bC39zC9zBmKni84Dt
wItfmELb6lbvprA+FOseiRwk7/2quLrgc4y21GI29Zqbf6wMoQEnRHF/moFZ3cqg
C/SP1Ev6N5ENM2BZG9mFSRWr8e2yyan8YWs+AUtsBEM82KaeJrMlZ4yqA1m33a5T
YK5eL3DablObdfvvz1YXCVxByQ7aIbVCpE3VVigeyHrqoR/EFwZMzYLouOI34jjN
Nj5+Qck6VMhI+OetUlcXS1D/DIHgpDgZFPcgeLURiwO0l62H/gYLHuoCek4YmkIi
30ZwVXubBWDg5TcxEi5oIbVfyZfHNi+MyeLMWLEy6hEdnFsTsZU=
=6qMx
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mmu-6.18' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.18
- Recover possible NX huge pages within the TDP MMU under read lock to
reduce guest jitter when restoring NX huge pages.
- Return -EAGAIN during prefault if userspace concurrently deletes/moves the
relevant memslot to fix an issue where prefaulting could deadlock with the
memslot update.
- Don't retry in TDX's anti-zero-step mitigation if the target memslot is
invalid, i.e. is being deleted or moved, to fix a deadlock scenario similar
to the aforementioned prefaulting case.
|
|
|
|
99cab80208 |
KVM common changes for 6.18
Remove a redundant __GFP_NOWARN from kvm_setup_async_pf() as __GFP_NOWARN is now included in GFP_NOWAIT. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmjXG7YACgkQOlYIJqCj N/3H9w//UvhvgWHdKPWbtiqWyzJK6dLgWTU9StitsuPqwxHMLTwtyiwSJGhnotis KmlTK3LNYMlgC3cBa3v58Sgdp15yUJN4Cz4b609cPnOTuF7sQJedhi/Rw3UoFS2R m0RehHiBR0lujUmMnn6QL1vh4d+D2w6JfSAyuhczglwAhkeRZ3SEd8FeOm8re3iy Kz89XhpwZ1jafCOWrkmP/ba4tfwQHHInvFBF5WrLmjOgJ59sDAILYT0UFiQm5tH4 tXLtLhW/NBmp7WSCuV4VsHDiMN31cMeTt1RXAabDe0vkPYU6yhbcxsNf+uljDvmc zMsTQDiwmKwCMLoE2OKAqN4z6LNtojTwHzwjOu496VIxlyusG5qen94Lsub8Az0n tsZ3VpifToL+95/Vgrkl7mQuje7aEOmTn0XTcGI030uO9ODpUVyrB260l5Y+vPIk icZtA64bInWo1j8HT8iTHXJvheWSZXM46cdMzyPBtyD1PQ2/FPgrCrf3r+xpkVrQ wy/2eRW64kungzZovhHYZjsQmtCbG+snPCsXsKZVWFOT8iT0lQZk40BUiBf++yMV 5JEXtaazgL3Ze2WCTETcViidV4WDrEpGdkUnoBunAqHqkUAAcSwDBpu2++vzdJwD sX0Ld8rOq3GwNWn/6Jkk9gwGH9ZPQM0AUWQ06VXlpcq/YelDgdM= =I3Dm -----END PGP SIGNATURE----- Merge tag 'kvm-x86-generic-6.18' of https://github.com/kvm-x86/linux into HEAD KVM common changes for 6.18 Remove a redundant __GFP_NOWARN from kvm_setup_async_pf() as __GFP_NOWARN is now included in GFP_NOWAIT. |
|
|
|
924ebaefce |
KVM/arm64 updates for 6.18
- Add support for FF-A 1.2 as the secure memory conduit for pKVM, allowing more registers to be used as part of the message payload. - Change the way pKVM allocates its VM handles, making sure that the privileged hypervisor is never tricked into using uninitialised data. - Speed up MMIO range registration by avoiding unnecessary RCU synchronisation, which results in VMs starting much quicker. - Add the dump of the instruction stream when panic-ing in the EL2 payload, just like the rest of the kernel has always done. This will hopefully help debugging non-VHE setups. - Add 52bit PA support to the stage-1 page-table walker, and make use of it to populate the fault level reported to the guest on failing to translate a stage-1 walk. - Add NV support to the GICv3-on-GICv5 emulation code, ensuring feature parity for guests, irrespective of the host platform. - Fix some really ugly architecture problems when dealing with debug in a nested VM. This has some bad performance impacts, but is at least correct. - Add enough infrastructure to be able to disable EL2 features and give effective values to the EL2 control registers. This then allows a bunch of features to be turned off, which helps cross-host migration. - Large rework of the selftest infrastructure to allow most tests to transparently run at EL2. This is the first step towards enabling NV testing. - Various fixes and improvements all over the map, including one BE fix, just in time for the removal of the feature. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmjVhSsACgkQI9DQutE9 ekOSIg//YdbA/zo17GrLnJnlGdUnS0e3/357n3e5Lypx1UFRDTmNacpVPw4VG/jt eVpQn7AgYwyvKfCq46eD+hBBqNv1XTn4DeXttv7CmVqhCRythsEvDkTBWSt7oYUZ xfYXCMKNhqUElH4AbYYx3y7nb2E9/KVGr+NBn6Vf5c14OZ3MGVc/fyp4jM1ih5dR kcV2onAYohlIGvFEyZMBtJ+jkYkqfIbfxfqCL0RAET0aEBFcmM1aXybWZj47hlLM f2j+E6cFQ0ZzUt+3pFhT75wo43lHGtIFDjVd60uishyU+NXTVvqRmXDTRU4k546W 18HHX1yijbzuXIatVhVRo2hIq3jKU37T9wtj46BejbDHRdAPENEyN/Qopm7rNS+X mCwOT7He6KR+H4rU6nFaTcsS7bNRCvIbZP9i9zb6NElbvXu5QnM8BUQsYFCDUa/n xtbtQlckbo/7zeoUsBDrGj2XmCf0d45FTHb7fdWOYEmMSmJhXYpUKdM4JcLyhKoQ DD0ox2S+pt2lwNw3XOSABdES0KJxCvDASAMIgn2h2sGpY8FxsBcVW/BufXopdafG UeInxWaILp2iCDM4tH2GLjKqlvMAOwcA+mAEZToXypxlJAnYA6J1pXCF8WEaM6+D BGrLli8Zd8JRs87byq6K7tp8oLNzZdliJp73j5jfOHTJA4MnvFI= =iAL3 -----END PGP SIGNATURE----- Merge tag 'kvmarm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.18 - Add support for FF-A 1.2 as the secure memory conduit for pKVM, allowing more registers to be used as part of the message payload. - Change the way pKVM allocates its VM handles, making sure that the privileged hypervisor is never tricked into using uninitialised data. - Speed up MMIO range registration by avoiding unnecessary RCU synchronisation, which results in VMs starting much quicker. - Add the dump of the instruction stream when panic-ing in the EL2 payload, just like the rest of the kernel has always done. This will hopefully help debugging non-VHE setups. - Add 52bit PA support to the stage-1 page-table walker, and make use of it to populate the fault level reported to the guest on failing to translate a stage-1 walk. - Add NV support to the GICv3-on-GICv5 emulation code, ensuring feature parity for guests, irrespective of the host platform. - Fix some really ugly architecture problems when dealing with debug in a nested VM. This has some bad performance impacts, but is at least correct. - Add enough infrastructure to be able to disable EL2 features and give effective values to the EL2 control registers. This then allows a bunch of features to be turned off, which helps cross-host migration. - Large rework of the selftest infrastructure to allow most tests to transparently run at EL2. This is the first step towards enabling NV testing. - Various fixes and improvements all over the map, including one BE fix, just in time for the removal of the feature. |
|
|
|
5b66e335ea |
KVM: SEV: Reject non-positive effective lengths during LAUNCH_UPDATE
Check for an invalid length during LAUNCH_UPDATE at the start of snp_launch_update() instead of subtly relying on kvm_gmem_populate() to detect the bad state. Code that directly handles userspace input absolutely should sanitize those inputs; failure to do so is asking for bugs where KVM consumes an invalid "npages". Keep the check in gmem, but wrap it in a WARN to flag any bad usage by the caller. Note, this is technically an ABI change as KVM would previously allow a length of '0'. But allowing a length of '0' is nonsensical and creates pointless conundrums in KVM. E.g. an empty range is arguably neither private nor shared, but LAUNCH_UPDATE will fail if the starting gpa can't be made private. In practice, no known or well-behaved VMM passes a length of '0'. Note #2, the PAGE_ALIGNED(params.len) check ensures that lengths between 1 and 4095 (inclusive) are also rejected, i.e. that KVM won't end up with npages=0 when doing "npages = params.len / PAGE_SIZE". Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Michael Roth <michael.roth@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250919211649.1575654-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
|
|
|
7d9a0273c4 |
KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
Device MMIO registration may happen quite frequently during VM boot, and the SRCU synchronization each time has a measurable effect on VM startup time. In our experiments it can account for around 25% of a VM's startup time. Replace the synchronization with a deferred free of the old kvm_io_bus structure. Tested-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Keir Fraser <keirf@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> |
|
|
|
7788255aba |
KVM: Implement barriers before accessing kvm->buses[] on SRCU read paths
This ensures that, if a VCPU has "observed" that an IO registration has occurred, the instruction currently being trapped or emulated will also observe the IO registration. At the same time, enforce that kvm_get_bus() is used only on the update side, ensuring that a long-term reference cannot be obtained by an SRCU reader. Signed-off-by: Keir Fraser <keirf@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> |
|
|
|
2bc2694fe2 |
KVM: TDX: Do not retry locally when the retry is caused by invalid memslot
Avoid local retries within the TDX EPT violation handler if a retry is
triggered by faulting in an invalid memslot, indicating that the memslot
is undergoing a removal process. Faulting in a GPA from an invalid
memslot will never succeed, and holding SRCU prevents memslot deletion
from succeeding, i.e. retrying when the memslot is being actively deleted
will lead to (breakable) deadlock.
Opportunistically export kvm_vcpu_gfn_to_memslot() to allow for a per-vCPU
lookup (which, strictly speaking, is unnecessary since TDX doesn't support
SMM, but aligns the TDX code with the MMU code).
Fixes:
|
|
|
|
3d3a04fad2 |
KVM: Allow and advertise support for host mmap() on guest_memfd files
Now that all the x86 and arm64 plumbing for mmap() on guest_memfd is in place, allow userspace to set GUEST_MEMFD_FLAG_MMAP and advertise support via a new capability, KVM_CAP_GUEST_MEMFD_MMAP. The availability of this capability is determined per architecture, and its enablement for a specific guest_memfd instance is controlled by the GUEST_MEMFD_FLAG_MMAP flag at creation time. Update the KVM API documentation to detail the KVM_CAP_GUEST_MEMFD_MMAP capability, the associated GUEST_MEMFD_FLAG_MMAP, and provide essential information regarding support for mmap in guest_memfd. Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-22-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
576d035e2a |
KVM: guest_memfd: Track guest_memfd mmap support in memslot
Add a new internal flag, KVM_MEMSLOT_GMEM_ONLY, to the top half of memslot->flags (which makes it strictly for KVM's internal use). This flag tracks when a guest_memfd-backed memory slot supports host userspace mmap operations, which implies that all memory, not just private memory for CoCo VMs, is consumed through guest_memfd: "gmem only". This optimization avoids repeatedly checking the underlying guest_memfd file for mmap support, which would otherwise require taking and releasing a reference on the file for each check. By caching this information directly in the memslot, we reduce overhead and simplify the logic involved in handling guest_memfd-backed pages for host mappings. Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
a12578e147 |
KVM: guest_memfd: Add plumbing to host to map guest_memfd pages
Introduce the core infrastructure to enable host userspace to mmap() guest_memfd-backed memory. This is needed for several evolving KVM use cases: * Non-CoCo VM backing: Allows VMMs like Firecracker to run guests entirely backed by guest_memfd, even for non-CoCo VMs [1]. This provides a unified memory management model and simplifies guest memory handling. * Direct map removal for enhanced security: This is an important step for direct map removal of guest memory [2]. By allowing host userspace to fault in guest_memfd pages directly, we can avoid maintaining host kernel direct maps of guest memory. This provides additional hardening against Spectre-like transient execution attacks by removing a potential attack surface within the kernel. * Future guest_memfd features: This also lays the groundwork for future enhancements to guest_memfd, such as supporting huge pages and enabling in-place sharing of guest memory with the host for CoCo platforms that permit it [3]. Enable the basic mmap and fault handling logic within guest_memfd, but hold off on allow userspace to actually do mmap() until the architecture support is also in place. [1] https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding [2] https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@amazon.com [3] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com/T/#u Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Co-developed-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
d1e54dd08f |
KVM: x86: Enable KVM_GUEST_MEMFD for all 64-bit builds
Enable KVM_GUEST_MEMFD for all KVM x86 64-bit builds, i.e. for "default" VM types when running on 64-bit KVM. This will allow using guest_memfd to back non-private memory for all VM shapes, by supporting mmap() on guest_memfd. Opportunistically clean up various conditionals that become tautologies once x86 selects KVM_GUEST_MEMFD more broadly. Specifically, because SW protected VMs, SEV, and TDX are all 64-bit only, private memory no longer needs to take explicit dependencies on KVM_GUEST_MEMFD, because it is effectively a prerequisite. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
69116e01f6 |
KVM: Fix comments that refer to slots_lock
Fix comments so that they refer to slots_lock instead of slots_locks (remove trailing s). Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
923310be23 |
KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem()
Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() to improve clarity and accurately reflect its purpose. The function kvm_slot_can_be_private() was previously used to check if a given kvm_memory_slot is backed by guest_memfd. However, its name implied that the memory in such a slot was exclusively "private". As guest_memfd support expands to include non-private memory (e.g., shared host mappings), it's important to remove this association. The new name, kvm_slot_has_gmem(), states that the slot is backed by guest_memfd without making assumptions about the memory's privacy attributes. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
36cf63bb5d |
KVM: Rename CONFIG_KVM_GENERIC_PRIVATE_MEM to CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE
The original name was vague regarding its functionality. This Kconfig option specifically enables and gates the kvm_gmem_populate() function, which is responsible for populating a GPA range with guest data. The new name, HAVE_KVM_ARCH_GMEM_POPULATE, describes the purpose of the option: to enable arch-specific guest_memfd population mechanisms. It also follows the same pattern as the other HAVE_KVM_ARCH_* configuration options. This improves clarity for developers and ensures the name accurately reflects the functionality it controls, especially as guest_memfd support expands beyond purely "private" memory scenarios. Temporarily keep KVM_GENERIC_PRIVATE_MEM as an x86-only config so as to minimize churn, and to hopefully make it easier to see what features require HAVE_KVM_ARCH_GMEM_POPULATE. On that note, omit GMEM_POPULATE for KVM_X86_SW_PROTECTED_VM, as regular ol' memset() suffices for software-protected VMs. As for KVM_GENERIC_PRIVATE_MEM, a future change will select KVM_GUEST_MEMFD for all 64-bit KVM builds, at which point the intermediate config will become obsolete and can/will be dropped. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
19a9a1ab5c |
KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GUEST_MEMFD
Rename the Kconfig option CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GUEST_MEMFD. The original name implied that the feature only supported "private" memory. However, CONFIG_KVM_PRIVATE_MEM enables guest_memfd in general, which is not exclusively for private memory. Subsequent patches in this series will add guest_memfd support for non-CoCo VMs, whose memory is not private. Renaming the Kconfig option to CONFIG_KVM_GUEST_MEMFD more accurately reflects its broader scope as the main Kconfig option for all guest_memfd-backed memory. This provides clearer semantics for the option and avoids confusion as new features are introduced. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
|
|
cf6a8401b6 |
KVM: remove redundant __GFP_NOWARN
Commit
|