Core & generic-arch updates:
- Add support for dynamic constraints and propagate it to
the Intel driver (Kan Liang)
- Fix & enhance driver-specific throttling support (Kan Liang)
- Record sample last_period before updating on the
x86 and PowerPC platforms (Mark Barnett)
- Make perf_pmu_unregister() usable (Peter Zijlstra)
- Unify perf_event_free_task() / perf_event_exit_task_context()
(Peter Zijlstra)
- Simplify perf_event_release_kernel() and perf_event_free_task()
(Peter Zijlstra)
- Allocate non-contiguous AUX pages by default (Yabin Cui)
Uprobes updates:
- Add support to emulate NOP instructions (Jiri Olsa)
- selftests/bpf: Add 5-byte NOP uprobe trigger benchmark (Jiri Olsa)
x86 Intel PMU enhancements:
- Support Intel Auto Counter Reload [ACR] (Kan Liang)
- Add PMU support for Clearwater Forest (Dapeng Mi)
- Arch-PEBS preparatory changes: (Dapeng Mi)
- Parse CPUID archPerfmonExt leaves for non-hybrid CPUs
- Decouple BTS initialization from PEBS initialization
- Introduce pairs of PEBS static calls
x86 AMD PMU enhancements:
- Use hrtimer for handling overflows in the AMD uncore driver
(Sandipan Das)
- Prevent UMC counters from saturating (Sandipan Das)
Fixes and cleanups:
- Fix put_ctx() ordering (Frederic Weisbecker)
- Fix irq work dereferencing garbage (Frederic Weisbecker)
- Misc fixes and cleanups (Changbin Du, Frederic Weisbecker,
Ian Rogers, Ingo Molnar, Kan Liang, Peter Zijlstra, Qing Wang,
Sandipan Das, Thorsten Blum)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgy4zoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j6QRAAvQ4GBPrdJLb8oXkLjCmWSp9PfM1h2IW0
reUrcV0BPRAwz4T60QEU2KyiEjvKxNghR6bNw4i3slAZ8EFwP9eWE/0ZYOo5+W/N
wv8vsopv/oZd2L2G5TgxDJf+tLPkqnTvp651LmGAbquPFONN1lsya9UHVPnt2qtv
fvFhjW6D828VoevRcUCsdoEUNlFDkUYQ2c3M1y5H2AI6ILDVxLsp5uYtuVUP+2lQ
7UI/elqRIIblTGT7G9LvTGiXZMm8T58fe1OOLekT6NdweJ3XEt1kMdFo/SCRYfzU
eDVVVLSextZfzBXNPtAEAlM3aSgd8+4m5sACiD1EeOUNjo5J9Sj1OOCa+bZGF/Rl
XNv5Kcp6Kh1T4N5lio8DE/NabmHDqDMbUGfud+VTS8uLLku4kuOWNMxJTD1nQ2Zz
BMfJhP89G9Vk07F9fOGuG1N6mKhIKNOgXh0S92tB7XDHcdJegueu2xh4ZszBL1QK
JVXa4DbnDj+y0LvnV+A5Z6VILr5RiCAipDb9ascByPja6BbN10Nf9Aj4nWwRTwbO
ut5OK/fDKmSjEHn1+a42d4iRxdIXIWhXCyxEhH+hJXEFx9htbQ3oAbXAEedeJTlT
g9QYGAjL96QEd0CqviorV8KyU59nVkEPoLVCumXBZ0WWhNwU6GdAmsW1hLfxQdLN
sp+XHhfxf8M=
=tPRs
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
"Core & generic-arch updates:
- Add support for dynamic constraints and propagate it to the Intel
driver (Kan Liang)
- Fix & enhance driver-specific throttling support (Kan Liang)
- Record sample last_period before updating on the x86 and PowerPC
platforms (Mark Barnett)
- Make perf_pmu_unregister() usable (Peter Zijlstra)
- Unify perf_event_free_task() / perf_event_exit_task_context()
(Peter Zijlstra)
- Simplify perf_event_release_kernel() and perf_event_free_task()
(Peter Zijlstra)
- Allocate non-contiguous AUX pages by default (Yabin Cui)
Uprobes updates:
- Add support to emulate NOP instructions (Jiri Olsa)
- selftests/bpf: Add 5-byte NOP uprobe trigger benchmark (Jiri Olsa)
x86 Intel PMU enhancements:
- Support Intel Auto Counter Reload [ACR] (Kan Liang)
- Add PMU support for Clearwater Forest (Dapeng Mi)
- Arch-PEBS preparatory changes: (Dapeng Mi)
- Parse CPUID archPerfmonExt leaves for non-hybrid CPUs
- Decouple BTS initialization from PEBS initialization
- Introduce pairs of PEBS static calls
x86 AMD PMU enhancements:
- Use hrtimer for handling overflows in the AMD uncore driver
(Sandipan Das)
- Prevent UMC counters from saturating (Sandipan Das)
Fixes and cleanups:
- Fix put_ctx() ordering (Frederic Weisbecker)
- Fix irq work dereferencing garbage (Frederic Weisbecker)
- Misc fixes and cleanups (Changbin Du, Frederic Weisbecker, Ian
Rogers, Ingo Molnar, Kan Liang, Peter Zijlstra, Qing Wang, Sandipan
Das, Thorsten Blum)"
* tag 'perf-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
perf/headers: Clean up <linux/perf_event.h> a bit
perf/uapi: Clean up <uapi/linux/perf_event.h> a bit
perf/uapi: Fix PERF_RECORD_SAMPLE comments in <uapi/linux/perf_event.h>
mips/perf: Remove driver-specific throttle support
xtensa/perf: Remove driver-specific throttle support
sparc/perf: Remove driver-specific throttle support
loongarch/perf: Remove driver-specific throttle support
csky/perf: Remove driver-specific throttle support
arc/perf: Remove driver-specific throttle support
alpha/perf: Remove driver-specific throttle support
perf/apple_m1: Remove driver-specific throttle support
perf/arm: Remove driver-specific throttle support
s390/perf: Remove driver-specific throttle support
powerpc/perf: Remove driver-specific throttle support
perf/x86/zhaoxin: Remove driver-specific throttle support
perf/x86/amd: Remove driver-specific throttle support
perf/x86/intel: Remove driver-specific throttle support
perf: Only dump the throttle log for the leader
perf: Fix the throttle logic for a group
perf/core: Add the is_event_in_freq_mode() helper to simplify the code
...
API:
- Fix memcpy_sglist to handle partially overlapping SG lists.
- Use memcpy_sglist to replace null skcipher.
- Rename CRYPTO_TESTS to CRYPTO_BENCHMARK.
- Flip CRYPTO_MANAGER_DISABLE_TEST into CRYPTO_SELFTESTS.
- Hide CRYPTO_MANAGER.
- Add delayed freeing of driver crypto_alg structures.
Compression:
- Allocate large buffers on first use instead of initialisation in scomp.
- Drop destination linearisation buffer in scomp.
- Move scomp stream allocation into acomp.
- Add acomp scatter-gather walker.
- Remove request chaining.
- Add optional async request allocation.
Hashing:
- Remove request chaining.
- Add optional async request allocation.
- Move partial block handling into API.
- Add ahash support to hmac.
- Fix shash documentation to disallow usage in hard IRQs.
Algorithms:
- Remove unnecessary SIMD fallback code on x86 and arm/arm64.
- Drop avx10_256 xts(aes)/ctr(aes) on x86.
- Improve avx-512 optimisations for xts(aes).
- Move chacha arch implementations into lib/crypto.
- Move poly1305 into lib/crypto and drop unused Crypto API algorithm.
- Disable powerpc/poly1305 as it has no SIMD fallback.
- Move sha256 arch implementations into lib/crypto.
- Convert deflate to acomp.
- Set block size correctly in cbcmac.
Drivers:
- Do not use sg_dma_len before mapping in sun8i-ss.
- Fix warm-reboot failure by making shutdown do more work in qat.
- Add locking in zynqmp-sha.
- Remove cavium/zip.
- Add support for PCI device 0x17D8 to ccp.
- Add qat_6xxx support in qat.
- Add support for RK3576 in rockchip-rng.
- Add support for i.MX8QM in caam.
Others:
- Fix irq_fpu_usable/kernel_fpu_begin inconsistency during CPU bring-up.
- Add new SEV/SNP platform shutdown API in ccp.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmgz47AACgkQxycdCkmx
i6fvKRAAr4Xa903L0r1Q1P1alQqoFFCqimUWeH72m68LiWynHWi0lUo0z/+tKweg
mnPStz7/Ha9HRHJjdNCMPnlJqXQDkuH3bIOuBJCwduDuhHo9VGOd46XGzmGMv3gb
HKuZhI0lk7pznK3CSyD/2nHmbDCHD+7feTZSBMoN9mm875+aSoM6fdxgak8uPFcq
KbB1L+hObTn2kAPSqRrNOR8/xG2N7hdH8eax7Li+LAtqYNVT5HvWVECsB/CKRPfB
sgAv3UTzcIFapSSHUHaONppSeoqPAIAeV7SdQhJvlT+EUUR/h/B6+D9OUQQqbphQ
LBalgTnqMKl0ymDEQFQ6QyYCat9ZfNmDft2WcXEsxc8PxImkgJI1W3B8O51sOjbG
78D8JqVQ96dleo4FsBhM2wfG0b41JM6zU4raC4vS7a3qsUS+Q1MpehvcS1iORicy
SpGdE8e7DLlxKhzWyW1xJnbrtMZDC7Sa2hUnxrvP0/xOvRhChKscRVtWcf0a5q7X
8JmuvwVSOJuSbQ3MeFbQvpo5lR9+0WsNjM6e9miiH6Y7vZUKmWcq2yDp377qVzeh
7NK6+OwGIQZZExrmtPw2BXwssT9Eg+ks6Y7g2Ne7yzvrjVNfEPY7Cws/5w7p8mRS
qhrcpbJNFlWgD7YYkmGZFTQ8DCN25ipP8lklO/hbcfchqLE/o1o=
=O8L5
-----END PGP SIGNATURE-----
Merge tag 'v6.16-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Fix memcpy_sglist to handle partially overlapping SG lists
- Use memcpy_sglist to replace null skcipher
- Rename CRYPTO_TESTS to CRYPTO_BENCHMARK
- Flip CRYPTO_MANAGER_DISABLE_TEST into CRYPTO_SELFTESTS
- Hide CRYPTO_MANAGER
- Add delayed freeing of driver crypto_alg structures
Compression:
- Allocate large buffers on first use instead of initialisation in scomp
- Drop destination linearisation buffer in scomp
- Move scomp stream allocation into acomp
- Add acomp scatter-gather walker
- Remove request chaining
- Add optional async request allocation
Hashing:
- Remove request chaining
- Add optional async request allocation
- Move partial block handling into API
- Add ahash support to hmac
- Fix shash documentation to disallow usage in hard IRQs
Algorithms:
- Remove unnecessary SIMD fallback code on x86 and arm/arm64
- Drop avx10_256 xts(aes)/ctr(aes) on x86
- Improve avx-512 optimisations for xts(aes)
- Move chacha arch implementations into lib/crypto
- Move poly1305 into lib/crypto and drop unused Crypto API algorithm
- Disable powerpc/poly1305 as it has no SIMD fallback
- Move sha256 arch implementations into lib/crypto
- Convert deflate to acomp
- Set block size correctly in cbcmac
Drivers:
- Do not use sg_dma_len before mapping in sun8i-ss
- Fix warm-reboot failure by making shutdown do more work in qat
- Add locking in zynqmp-sha
- Remove cavium/zip
- Add support for PCI device 0x17D8 to ccp
- Add qat_6xxx support in qat
- Add support for RK3576 in rockchip-rng
- Add support for i.MX8QM in caam
Others:
- Fix irq_fpu_usable/kernel_fpu_begin inconsistency during CPU bring-up
- Add new SEV/SNP platform shutdown API in ccp"
* tag 'v6.16-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (382 commits)
x86/fpu: Fix irq_fpu_usable() to return false during CPU onlining
crypto: qat - add missing header inclusion
crypto: api - Redo lookup on EEXIST
Revert "crypto: testmgr - Add hash export format testing"
crypto: marvell/cesa - Do not chain submitted requests
crypto: powerpc/poly1305 - add depends on BROKEN for now
Revert "crypto: powerpc/poly1305 - Add SIMD fallback"
crypto: ccp - Add missing tee info reg for teev2
crypto: ccp - Add missing bootloader info reg for pspv5
crypto: sun8i-ce - move fallback ahash_request to the end of the struct
crypto: octeontx2 - Use dynamic allocated memory region for lmtst
crypto: octeontx2 - Initialize cptlfs device info once
crypto: xts - Only add ecb if it is not already there
crypto: lrw - Only add ecb if it is not already there
crypto: testmgr - Add hash export format testing
crypto: testmgr - Use ahash for generic tfm
crypto: hmac - Add ahash support
crypto: testmgr - Ignore EEXIST on shash allocation
crypto: algapi - Add driver template support to crypto_inst_setname
crypto: shash - Set reqsize in shash_alg
...
* New features:
- Add large stage-2 mapping support for non-protected pKVM guests,
clawing back some performance.
- Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and
protected modes.
- Enable nested virtualisation support on systems that support it
(yes, it has been a long time coming), though it is disabled by
default.
* Improvements, fixes and cleanups:
- Large rework of the way KVM tracks architecture features and links
them with the effects of control bits. This ensures correctness of
emulation (the data is automatically extracted from the published
JSON files), and helps dealing with the evolution of the
architecture.
- Significant changes to the way pKVM tracks ownership of pages,
avoiding page table walks by storing the state in the hypervisor's
vmemmap. This in turn enables the THP support described above.
- New selftest checking the pKVM ownership transition rules
- Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
even if the host didn't have it.
- Fixes for the address translation emulation, which happened to be
rather buggy in some specific contexts.
- Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
from the number of counters exposed to a guest and addressing a
number of issues in the process.
- Add a new selftest for the SVE host state being corrupted by a
guest.
- Keep HCR_EL2.xMO set at all times for systems running with the
kernel at EL2, ensuring that the window for interrupts is slightly
bigger, and avoiding a pretty bad erratum on the AmpereOne HW.
- Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
from a pretty bad case of TLB corruption unless accesses to HCR_EL2
are heavily synchronised.
- Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
tables in a human-friendly fashion.
- and the usual random cleanups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmgwU7UACgkQI9DQutE9
ekN93g//fNnejxf01dBFIbuylzYEyHZSEH0iTGLeM+ES9zvntCzciTYVzb27oqNG
RDLShlQYp3w4rAe6ORzyePyHptOmKXCxfj/VXUFp3A7H9QYOxt1nacD3WxI9fCOo
LzaSLquvgwFBaeTdDE0KdeTUKQHluId+w1Azh0lnHGeUP+lOHNZ8FqoP1/la0q04
GvVL+l3wz/IhPP8r1YA0Q1bzJ5SLfSpjIw/0F5H/xgI4lyYdHzgFL8sKuSyFeCyM
2STQi+ZnTCsAs4bkXkw2Pp9CFYrfQgZi+sf7Om+noAKhbJo3vb7/RHpgjv+QCjJy
Kx4g9CbxHfaM03cH6uSLBoFzsACR1iAuUz8BCSRvvVNH4RVT6H+34nzjLZXLncrP
gm1uYs9aMTLr91caeAx0aYIMWGYa1uqV0rum3WxyIHezN9Q/NuQoZyfprUufr8oX
wCYE+ot4VT3DwG0UFZKKwj0BiCbYcbph9nBLVyZJsg8OKxpvspkCtPriFp1kb6BP
dTTGSXd9JJqwSgP9qJLxijcv6Nfgp2gT42TWwh/dJRZXhnTCvr9IyclFIhoIIq3G
Q2BkFCXOoEoNQhBA1tiWzJ9nDHf52P72Z2K1gPyyMZwF49HGa2BZBCJGkqX06wSs
Riolf1/cjFhDno1ThiHKsHT0sG1D4oc9k/1NLq5dyNAEGcgATIA=
=Jju3
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.16
* New features:
- Add large stage-2 mapping support for non-protected pKVM guests,
clawing back some performance.
- Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and
protected modes.
- Enable nested virtualisation support on systems that support it
(yes, it has been a long time coming), though it is disabled by
default.
* Improvements, fixes and cleanups:
- Large rework of the way KVM tracks architecture features and links
them with the effects of control bits. This ensures correctness of
emulation (the data is automatically extracted from the published
JSON files), and helps dealing with the evolution of the
architecture.
- Significant changes to the way pKVM tracks ownership of pages,
avoiding page table walks by storing the state in the hypervisor's
vmemmap. This in turn enables the THP support described above.
- New selftest checking the pKVM ownership transition rules
- Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
even if the host didn't have it.
- Fixes for the address translation emulation, which happened to be
rather buggy in some specific contexts.
- Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
from the number of counters exposed to a guest and addressing a
number of issues in the process.
- Add a new selftest for the SVE host state being corrupted by a
guest.
- Keep HCR_EL2.xMO set at all times for systems running with the
kernel at EL2, ensuring that the window for interrupts is slightly
bigger, and avoiding a pretty bad erratum on the AmpereOne HW.
- Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
from a pretty bad case of TLB corruption unless accesses to HCR_EL2
are heavily synchronised.
- Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
tables in a human-friendly fashion.
- and the usual random cleanups.
irq_fpu_usable() incorrectly returned true before the FPU is
initialized. The x86 CPU onlining code can call sha256() to checksum
AMD microcode images, before the FPU is initialized. Since sha256()
recently gained a kernel-mode FPU optimized code path, a crash occurred
in kernel_fpu_begin_mask() during hotplug CPU onlining.
(The crash did not occur during boot-time CPU onlining, since the
optimized sha256() code is not enabled until subsys_initcalls run.)
Fix this by making irq_fpu_usable() return false before fpu__init_cpu()
has run. To do this without adding any additional overhead to
irq_fpu_usable(), replace the existing per-CPU bool in_kernel_fpu with
kernel_fpu_allowed which tracks both initialization and usage rather
than just usage. The initial state is false; FPU initialization sets it
to true; kernel-mode FPU sections toggle it to false and then back to
true; and CPU offlining restores it to the initial state of false.
Fixes: 11d7956d52 ("crypto: x86/sha256 - implement library instead of shash")
Reported-by: Ayush Jain <Ayush.Jain3@amd.com>
Closes: https://lore.kernel.org/r/20250516112217.GBaCcf6Yoc6LkIIryP@fat_crate.local
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Ayush Jain <Ayush.Jain3@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
When starting APs, confidential guests and paravisor guests
need to know the CPU number, and the pattern of using the linear
search has emerged in several places. With N processors that leads
to the O(N^2) time complexity.
Provide the CPU number in the AP wake up callback so that one can
get the CPU number in constant time.
Suggested-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20250507182227.7421-3-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20250507182227.7421-3-romank@linux.microsoft.com>
* kvm-arm64/ubsan-el2:
: .
: Add UBSAN support to the EL2 portion of KVM, reusing most of the
: existing logic provided by CONFIG_IBSAN_TRAP.
:
: Patches courtesy of Mostafa Saleh.
: .
KVM: arm64: Handle UBSAN faults
KVM: arm64: Introduce CONFIG_UBSAN_KVM_EL2
ubsan: Remove regs from report_ubsan_failure()
arm64: Introduce esr_is_ubsan_brk()
Signed-off-by: Marc Zyngier <maz@kernel.org>
This adds an addition layer of protection for the saved copy of dm crypt
key. Trying to access the saved copy will cause page fault.
Link: https://lkml.kernel.org/r/20250502011246.99238-9-coxu@redhat.com
Signed-off-by: Coiby Xu <coxu@redhat.com>
Suggested-by: Pingfan Liu <kernelfans@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jan Pazdziora <jpazdziora@redhat.com>
Cc: Milan Broz <gmazyland@gmail.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
1st kernel will build up the kernel command parameter dmcryptkeys as
similar to elfcorehdr to pass the memory address of the stored info of dm
crypt key to kdump kernel.
Link: https://lkml.kernel.org/r/20250502011246.99238-8-coxu@redhat.com
Signed-off-by: Coiby Xu <coxu@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jan Pazdziora <jpazdziora@redhat.com>
Cc: Liu Pingfan <kernelfans@gmail.com>
Cc: Milan Broz <gmazyland@gmail.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit
480e803dac ("x86/bugs: Restructure spectre_v2 mitigation")
inadvertently changed the spectre-v2 mitigation default from eIBRS to IBRS on
Intel. While splitting the spectre_v2 mitigation in select/update/apply
functions, eIBRS and IBRS selection logic was separated in select and update.
This caused IBRS selection to not consider that eIBRS mitigation is already
selected, fix it.
Fixes: 480e803dac ("x86/bugs: Restructure spectre_v2 mitigation")
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250520-eibrs-fix-v1-1-91bacd35ed09@linux.intel.com
Restructure the ITS mitigation to use select/update/apply functions like
the other mitigations.
There is a particularly complex interaction between ITS and Retbleed as CDT
(Call Depth Tracking) is a mitigation for both, and either its=stuff or
retbleed=stuff will attempt to enable CDT.
retbleed_update_mitigation() runs first and will check the necessary
pre-conditions for CDT if either ITS or Retbleed stuffing is selected. If
checks pass and ITS stuffing is selected, it will select stuffing for
Retbleed as well.
its_update_mitigation() runs after and will either select stuffing if
retbleed stuffing was enabled, or fall back to the default (aligned thunks)
if stuffing could not be enabled.
Enablement of CDT is done exclusively in retbleed_apply_mitigation().
its_apply_mitigation() is only used to enable aligned thunks.
Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/20250516193212.128782-1-david.kaplan@amd.com
-----BEGIN PGP SIGNATURE-----
iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmgqSbkeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGr6sH/1ICAvlin1GuxffE
ISVNz3xhXQpXG2k8yl9r0umpdCfPQbGrxm30vZyuIDNutY/FuMvkIqfu+Z1NnLg0
GidZW015LtXrp7/puKtTnUD5CPSjdETMXig+Q7c1PrxkkmHwz8sBbbm173AIDbDB
t7wwqSEUQh2AIDouGwN+DXB+6bR2FoOXb/k/njmtappIwR3rBc2f1HQJnP095rKO
5AKw1c9DMv5Wq2cEdBOCP48e4CFZEIN1ycW0nvtjpnOmcPOJjLoEothRbntQolqF
udtj5UeTGdAJqmjigv7KHmlrmFNe+GqBq4+beHl5MRxhBaT2uGGaM9jCJiSxT3Jx
sHyYYr8=
=Ddma
-----END PGP SIGNATURE-----
Merge tag 'v6.15-rc7' into x86/core, to pick up fixes
Pick up build fixes from upstream to make this tree more testable.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Fix SEV-SNP kdump bugs
- Update the email address of Alexey Makhalov in MAINTAINERS
- Add the CPU feature flag for the Zen6 microarchitecture
- Fix typo in system message
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgoj3MRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hppg//S/eodSXrgxzTOvZLu0gFeYN4xyxUnfWl
0Dvc+FRasGCpBpQcD9sl3w9xKnTkaGY250NPP4/OKW2JgiizP6E3UcFYvaDnZ96I
TU3/y3acAI5zpvASOuOuDlwDt0w9xIk5L/K0gcVec9dYnGdAOmTE4jjZV6wDm0Q4
rto8k5E0RmSs5HQ4GcpU2sgzJSlaQlkkxZMo6HaUE6oJUiuodmPnxHkjoLgAQiU9
I0ALcrPVtyI1jap52DVxAIDcMsrOddazYley4IyDRqWezwrtrxkNaEzvNkMWO4ZV
iAnTYe/21HrppsQ40KuYa5VY5k0Dkv+QVzb23rGT2sZlPaXAiPIVUtt25z4VGtve
1z/kn1TszfcqC9sPodVcHQkzNrTktlaEKXd3u9GuFlfMkuj7iSnmYnGoPMo6x7T9
vcbBF6PUQ+uNi7QZXDvww8S0OMBVVlMDOjhuGjFBFzkmfVzkFtdyC1oGXppiXNzg
KG0LjiTDlOeI4B8bxG1Wwldwl7vLfwHJag2xWaw0uQR8mjstkCTLXibjdvz3QNwi
bM14hlG3TxmxJSsYl8QNFnF45DwzApWGKz9K81OPz/yZ2Z6KB1uQqrN2l8+blFt9
OUMEukY9sAcmUR1hkt3Rdynb1ri+jGMcJUGOn48w2ne+qiLoVicp8LEgWO6KoI3Z
vgLnVmqIa9o=
=cD7r
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2025-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix SEV-SNP kdump bugs
- Update the email address of Alexey Makhalov in MAINTAINERS
- Add the CPU feature flag for the Zen6 microarchitecture
- Fix typo in system message
* tag 'x86-urgent-2025-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Remove duplicated word in warning message
x86/CPU/AMD: Add X86_FEATURE_ZEN6
x86/sev: Make sure pages are not skipped during kdump
x86/sev: Do not touch VMSA pages during SNP guest memory kdump
MAINTAINERS: Update Alexey Makhalov's email address
x86/sev: Fix operator precedence in GHCB_MSR_VMPL_REQ_LEVEL macro
PARAVIRT_XXL is exclusively utilized by XEN_PV, which is only compatible
with 64-bit machines.
Clearly designate PARAVIRT_XXL as 64-bit only and remove ifdefs to
support CONFIG_PGTABLE_LEVELS < 5.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250516123306.3812286-5-kirill.shutemov@linux.intel.com
Both Intel and AMD CPUs support 5-level paging, which is expected to
become more widely adopted in the future. All major x86 Linux
distributions have the feature enabled.
Remove CONFIG_X86_5LEVEL and related #ifdeffery for it to make it more readable.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250516123306.3812286-4-kirill.shutemov@linux.intel.com
Dynamic memory layout is used by KASLR and 5-level paging.
CONFIG_X86_5LEVEL is going to be removed, making 5-level paging support
unconditional which requires unconditional support of dynamic memory
layout.
Remove CONFIG_DYNAMIC_MEMORY_LAYOUT.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250516123306.3812286-2-kirill.shutemov@linux.intel.com
of_node_to_fwnode() is irqdomain's reimplementation of the "officially"
defined of_fwnode_handle(). The former is in the process of being
removed, so use the latter instead.
[ tglx: Fix up subject prefix ]
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-11-jirislaby@kernel.org
Resctrl is a filesystem interface to hardware that provides cache
allocation policy and bandwidth control for groups of tasks or CPUs.
To support more than one architecture, resctrl needs to live in /fs/.
Move the code that is concerned with the filesystem interface to
/fs/resctrl.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-25-james.morse@arm.com
x86 has an array, rdt_resources_all[], of all possible resources.
The for-each-resource walkers depend on the rid field of all
resources being initialised.
If the array ever grows due to another architecture adding a resource
type that is not defined on x86, the for-each-resources walkers will
loop forever.
Initialise all the rid values in resctrl_arch_late_init() before
any for-each-resource walker can be called.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-24-james.morse@arm.com
checkpatch.pl identifies some direct #includes of asm headers that
can be satisfied by including the corresponding <linux/...> header
instead.
Fix them.
No intentional functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-23-james.morse@arm.com
rdt_init_fs_context() sizes a typed allocation using an explicit
sizeof(type) expression, which checkpatch.pl complains about.
Since this code is about to be factored out and made generic, this
is a good opportunity to fix the code to size the allocation based
on the target pointer instead, to reduce the chance of future mis-
maintenance.
Fix it.
No functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-22-james.morse@arm.com
checkpatch.pl complains about some whitespace anomalies in the
resctrl core code.
This doesn't matter, but since this code is about to be factored
out and made generic, this is a good opportunity to fix these
issues and so reduce future checkpatch fuzz.
Fix them.
No functional change.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-21-james.morse@arm.com
Once the filesystem parts of resctrl move to fs/resctrl, it cannot rely
on definitions in x86's internal.h.
Move definitions in internal.h that need to be shared between the
filesystem and architecture code to header files that fs/resctrl can
include.
Doing this separately means the filesystem code only moves between files
of the same name, instead of having these changes mixed in too.
Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-17-james.morse@arm.com
Add Makefile and Kconfig for fs/resctrl. Add ARCH_HAS_CPU_RESCTRL
for the common parts of the resctrl interface and make X86_CPU_RESCTRL
select this.
Adding an include of asm/resctrl.h to linux/resctrl.h allows the
/fs/resctrl files to switch over to using this header instead.
Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-16-james.morse@arm.com
In order to let all the APIs under <cpuid/api.h> have a shared "cpuid_"
namespace, rename hypervisor_cpuid_base() to cpuid_base_hypervisor().
To align with the new style, also rename:
for_each_possible_hypervisor_cpuid_base(function)
to:
for_each_possible_cpuid_base_hypervisor(function)
Adjust call-sites accordingly.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/aCZOi0Oohc7DpgTo@lx-t490
The CPUID(0x2) descriptors iterator has been renamed from:
for_each_leaf_0x2_entry()
to:
for_each_cpuid_0x2_desc()
since it iterates over CPUID(0x2) cache and TLB "descriptors", not
"entries".
In the macro's x86/cpu call-site, rename the parameter denoting the
parsed descriptor at each iteration from 'entry' to 'desc'.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250508150240.172915-8-darwi@linutronix.de
The CPUID(0x2) descriptors iterator has been renamed from:
for_each_leaf_0x2_entry()
to:
for_each_cpuid_0x2_desc()
since it iterates over CPUID(0x2) cache and TLB "descriptors", not
"entries".
In the macro's x86/cacheinfo call-site, rename the parameter denoting the
parsed descriptor at each iteration from 'entry' to 'desc'.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250508150240.172915-7-darwi@linutronix.de
Rename the CPUID(0x2) register accessor function:
cpuid_get_leaf_0x2_regs(regs)
to:
cpuid_leaf_0x2(regs)
for consistency with other <cpuid/api.h> accessors that return full CPUID
registers outputs like:
cpuid_leaf(regs)
cpuid_subleaf(regs)
In the same vein, rename the CPUID(0x2) iteration macro:
for_each_leaf_0x2_entry()
to:
for_each_cpuid_0x2_desc()
to include "cpuid" in the macro name, and since what is iterated upon is
CPUID(0x2) cache and TLB "descriptos", not "entries". Prefix an
underscore to that iterator macro parameters, so that the newly renamed
'desc' parameter do not get mixed with "union leaf_0x2_regs :: desc[]" in
the macro's implementation.
Adjust all the affected call-sites accordingly.
While at it, use "CPUID(0x2)" instead of "CPUID leaf 0x2" as this is the
recommended style.
No change in functionality intended.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250508150240.172915-6-darwi@linutronix.de
trace.h contains all the tracepoints. After the move to /fs/resctrl, some
of these will be left behind. All the pseudo_lock tracepoints remain part
of the architecture. The lone tracepoint in monitor.c moves to /fs/resctrl.
Split trace.h so that each C file includes a different trace header file.
This means the trace header files are not modified when they are moved.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-14-james.morse@arm.com
MPAM platforms retrieve the cache-id property from the ACPI PPTT table.
The cache-id field is 32 bits wide. Under resctrl, the cache-id becomes
the domain-id, and is packed into the mon_data_bits union bitfield.
The width of cache-id in this field is 14 bits.
Expanding the union would break 32bit x86 platforms as this union is
stored as the kernfs kn->priv pointer. This saved allocating memory
for the priv data storage.
The firmware on MPAM platforms have used the PPTT cache-id field to
expose the interconnect's id for the cache, which is sparse and uses
more than 14 bits. Use of this id is to enable PCIe direct cache
injection hints. Using this feature with VFIO means the value provided
by the ACPI table should be exposed to user-space.
To support cache-id values greater than 14 bits, convert the
mon_data_bits union to a structure. These are shared between control
and monitor groups, and are allocated on first use. The list of
allocated struct mon_data is free'd when the filesystem is umount()ed.
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-13-james.morse@arm.com
The resctrl_event_id enum gives names to the counter event numbers on x86.
These are used directly by resctrl.
To allow the MPAM driver to keep an array of these the size of the enum
needs to be known.
Add a 'num_events' enum entry which can be used to size an array. This is
added to the enum to reduce conflicts with another series, which in turn
requires get_arch_mbm_state() to have a default case.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-12-james.morse@arm.com
trace_pagefault_key is used to optimize the pagefault tracepoints when it
is disabled. However, tracepoints already have built-in static_key for this
exact purpose.
Remove this redundant key.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-trace-kernel@vger.kernel.org
Link: https://lore.kernel.org/r/827c7666d2989f08742a4fb869b1ed5bfaaf1dbf.1747046848.git.namcao@linutronix.de
is_mba_sc() is defined in core.c, but has no callers there. It does not access
any architecture private structures.
Move this to rdtgroup.c where the majority of callers are. This makes the move
of the filesystem code to /fs/ cleaner.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-11-james.morse@arm.com
Because ARM's MPAM controls are probed using MMIO, resctrl can't be
initialised until enough CPUs are online to have determined the system-wide
supported num_closid. Arm64 also supports 'late onlined secondaries', where
only a subset of CPUs are online during boot.
These two combine to mean the MPAM driver may not be able to initialise
resctrl until user-space has brought 'enough' CPUs online.
To allow MPAM to initialise resctrl after __init text has been free'd, remove
all the __init markings from resctrl.
The existing __exit markings cause these functions to be removed by the linker
as it has never been possible to build resctrl as a module. MPAM has an error
interrupt which causes the driver to reset and disable itself. Remove the
__exit markings to allow the MPAM driver to tear down resctrl when an error
occurs.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-10-james.morse@arm.com
resctrl_exit() was intended for use when the 'resctrl' module was unloaded.
resctrl can't be built as a module, and the kernfs helpers are not exported so
this is unlikely to change. MPAM has an error interrupt which indicates the
MPAM driver has gone haywire. Should this occur tasks could run with the wrong
control values, leading to bad performance for important tasks. In this
scenario the MPAM driver will reset the hardware, but it needs a way to tell
resctrl that no further configuration should be attempted.
In particular, moving tasks between control or monitor groups does not
interact with the architecture code, so there is no opportunity for the arch
code to indicate that the hardware is no-longer functioning.
Using resctrl_exit() for this leaves the system in a funny state as resctrl is
still mounted, but cannot be un-mounted because the sysfs directory that is
typically used has been removed. Dave Martin suggests this may cause systemd
trouble in the future as not all filesystems can be unmounted.
Add calls to remove all the files and directories in resctrl, and remove the
sysfs_remove_mount_point() call that leaves the system in a funny state. When
triggered, this causes all the resctrl files to disappear. resctrl can be
unmounted, but not mounted again.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-9-james.morse@arm.com
resctrl_exit() removes things like the resctrl mount point directory
and unregisters the filesystem prior to freeing data structures that
were allocated during resctrl_init().
This assumes that there are no online domains when resctrl_exit() is
called. If any domain were online, the limbo or overflow handler could
be scheduled to run.
Add a check for any online control or monitor domains, and document that
the architecture code is required to offline all monitor and control
domains before calling resctrl_exit().
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-8-james.morse@arm.com
resctrl_sched_in() loads the architecture specific CPU MSRs with the
CLOSID and RMID values. This function was named before resctrl was
split to have architecture specific code, and generic filesystem code.
This function is obviously architecture specific, but does not begin
with 'resctrl_arch_', making it the odd one out in the functions an
architecture needs to support to enable resctrl.
Rename it for consistency. This is purely cosmetic.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-7-james.morse@arm.com
Resctrl allocates and finds free CLOSID values using the bits of a u32.
This restricts the number of control groups that can be created by
user-space.
MPAM has an architectural limit of 2^16 CLOSID values, Intel x86 could
be extended beyond 32 values. There is at least one MPAM platform which
supports more than 32 CLOSID values.
Replace the fixed size bitmap with calls to the bitmap API to allocate
an array of a sufficient size.
ffs() returns '1' for bit 0, hence the existing code subtracts 1 from
the index to get the CLOSID value. find_first_bit() returns the bit
number which does not need adjusting.
[ morse: fixed the off-by-one in the allocator and the wrong not-found
value. Removed the limit. Rephrase the commit message. ]
Signed-off-by: Amit Singh Tomar <amitsinght@marvell.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-6-james.morse@arm.com
With the lack of cpumask_any_andnot_but(), cpumask_any_housekeeping()
has to abuse cpumask_nth() functions.
Update cpumask_any_housekeeping() to use the new cpumask_any_but()
and cpumask_any_andnot_but(). These two functions understand
RESCTRL_PICK_ANY_CPU, which simplifies cpumask_any_housekeeping()
significantly.
Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: James Morse <james.morse@arm.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-5-james.morse@arm.com
TL;DR: SGX page reclaim touches the page to copy its contents to
secondary storage. SGX instructions do not gracefully handle machine
checks. Despite this, the existing SGX code will try to reclaim pages
that it _knows_ are poisoned. Avoid even trying to reclaim poisoned pages.
The longer story:
Pages used by an enclave only get epc_page->poison set in
arch_memory_failure() but they currently stay on sgx_active_page_list until
sgx_encl_release(), with the SGX_EPC_PAGE_RECLAIMER_TRACKED flag untouched.
epc_page->poison is not checked in the reclaimer logic meaning that, if other
conditions are met, an attempt will be made to reclaim an EPC page that was
poisoned. This is bad because 1. we don't want that page to end up added
to another enclave and 2. it is likely to cause one core to shut down
and the kernel to panic.
Specifically, reclaiming uses microcode operations including "EWB" which
accesses the EPC page contents to encrypt and write them out to non-SGX
memory. Those operations cannot handle MCEs in their accesses other than
by putting the executing core into a special shutdown state (affecting
both threads with HT.) The kernel will subsequently panic on the
remaining cores seeing the core didn't enter MCE handler(s) in time.
Call sgx_unmark_page_reclaimable() to remove the affected EPC page from
sgx_active_page_list on memory error to stop it being considered for
reclaiming.
Testing epc_page->poison in sgx_reclaim_pages() would also work but I assume
it's better to add code in the less likely paths.
The affected EPC page is not added to &node->sgx_poison_page_list until
later in sgx_encl_release()->sgx_free_epc_page() when it is EREMOVEd.
Membership on other lists doesn't change to avoid changing any of the
lists' semantics except for sgx_active_page_list. There's a "TBD" comment
in arch_memory_failure() about pre-emptive actions, the goal here is not
to address everything that it may imply.
This also doesn't completely close the time window when a memory error
notification will be fatal (for a not previously poisoned EPC page) --
the MCE can happen after sgx_reclaim_pages() has selected its candidates
or even *inside* a microcode operation (actually easy to trigger due to
the amount of time spent in them.)
The spinlock in sgx_unmark_page_reclaimable() is safe because
memory_failure() runs in process context and no spinlocks are held,
explicitly noted in a mm/memory-failure.c comment.
Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: balrogg@gmail.com
Cc: linux-sgx@vger.kernel.org
Link: https://lore.kernel.org/r/20250508230429.456271-1-andrew.zaborowski@intel.com
In order to let all the APIs under <cpuid/api.h> have a shared "cpuid_"
namespace, rename have_cpuid_p() to cpuid_feature().
Adjust all call-sites accordingly.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250508150240.172915-4-darwi@linutronix.de
The main CPUID header <asm/cpuid.h> was originally a storefront for the
headers:
<asm/cpuid/api.h>
<asm/cpuid/leaf_0x2_api.h>
Now that the latter CPUID(0x2) header has been merged into the former,
there is no practical difference between <asm/cpuid.h> and
<asm/cpuid/api.h>.
Migrate all users to the <asm/cpuid/api.h> header, in preparation of
the removal of <asm/cpuid.h>.
Don't remove <asm/cpuid.h> just yet, in case some new code in -next
started using it.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250508150240.172915-3-darwi@linutronix.de
Expose certain 'struct cpuinfo_x86' fields via asm-offsets for x86_64
too, so that it will be possible to set CPU capabilities from 64-bit
asm code.
32-bit already used these fields, so simply move those offset exports into
the unified asm-offsets.c file.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250514104242.1275040-12-ardb+git@google.com
The global pseudo-constants 'page_offset_base', 'vmalloc_base' and
'vmemmap_base' are not used extremely early during the boot, and cannot be
used safely until after the KASLR memory randomization code in
kernel_randomize_memory() executes, which may update their values.
So there is no point in setting these variables extremely early, and it
can wait until after the kernel itself is mapped and running from its
permanent virtual mapping.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250513111157.717727-9-ardb+git@google.com
Fix several build errors when CONFIG_MODULES=n, including the following:
../arch/x86/kernel/alternative.c:195:25: error: incomplete definition of type 'struct module'
195 | for (int i = 0; i < mod->its_num_pages; i++) {
Fixes: 872df34d7c ("x86/its: Use dynamic thunks for indirect branches")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a synthetic feature flag for Zen6.
[ bp: Move the feature flag to a free slot and avoid future merge
conflicts from incoming stuff. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250513204857.3376577-1-yazen.ghannam@amd.com
1f4bb068b4 ("x86/bugs: Restructure SRSO mitigation") does this:
if (boot_cpu_data.x86 < 0x19 && !cpu_smt_possible()) {
setup_force_cpu_cap(X86_FEATURE_SRSO_NO);
srso_mitigation = SRSO_MITIGATION_NONE;
return;
}
and, in particular, sets srso_mitigation to NONE. This leads to
reporting
Speculative Return Stack Overflow: Vulnerable
on Zen2 machines.
There's a far bigger confusion with what SRSO_NO means and how it is
used in the code but this will be a matter of future fixes and
restructuring to how the SRSO mitigation gets determined.
Fix the reporting issue for now.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: David Kaplan <david.kaplan@amd.com>
Link: https://lore.kernel.org/20250513110405.15872-1-bp@kernel.org
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
KHO kernels are special and use only scratch memory for memblock
allocations, but memory below 1M is ignored by kernel after early boot and
cannot be naturally marked as scratch.
To allow allocation of the real-mode trampoline and a few (if any) other
very early allocations from below 1M forcibly mark the memory below 1M as
scratch.
After real mode trampoline is allocated, clear that scratch marking.
Link: https://lkml.kernel.org/r/20250509074635.3187114-13-changyuanl@google.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kexec handover (KHO) creates a metadata that the kernels pass between each
other during kexec. This metadata is stored in memory and kexec image
contains a (physical) pointer to that memory.
In addition, KHO keeps "scratch regions" available for kexec: physically
contiguous memory regions that are guaranteed to not have any memory that
KHO would preserve. The new kernel bootstraps itself using the scratch
regions and sets all handed over memory as in use. When subsystems that
support KHO initialize, they introspect the KHO metadata, restore
preserved memory regions, and retrieve their state stored in the preserved
memory.
Enlighten x86 kexec-file and boot path about the KHO metadata and make
sure it gets passed along to the next kernel.
Link: https://lkml.kernel.org/r/20250509074635.3187114-12-changyuanl@google.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
memblock_reserve() does not distinguish memory used by firmware from
memory used by kernel.
The distinction is nice to have for accounting of early memory allocations
and reservations, but it is essential for kexec handover (kho) to know how
much memory kernel consumes during boot.
Use memblock_reserve_kern() to reserve kernel memory, such as kernel
image, initrd and setup data.
Link: https://lkml.kernel.org/r/20250509074635.3187114-11-changyuanl@google.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When suspending, save_processor_state() calls mtrr_save_fixed_ranges()
to save fixed-range MTRRs.
On platforms without fixed-range MTRRs like the ACRN hypervisor which
has removed fixed-range MTRR emulation, accessing these MSRs will
trigger an unchecked MSR access error. Make sure fixed-range MTRRs are
supported before access to prevent such error.
Since mtrr_state.have_fixed is only set when MTRRs are present and
enabled, checking the CPU feature flag in mtrr_save_fixed_ranges() is
unnecessary.
Fixes: 3ebad59056 ("[PATCH] x86: Save and restore the fixed-range MTRRs of the BSP when suspending")
Signed-off-by: Jiaqing Zhao <jiaqing.zhao@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250509170633.3411169-2-jiaqing.zhao@linux.intel.com
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmgebIwACgkQaDWVMHDJ
krCGSA/+I+W/uqiz58Z2Zu4RrXMYFfKJxacF7My9wnOyRxaJduS3qrz1E5wHqBId
f6M8wDx9nS24UxDkBbi84NdtlG1zj8nV8djtszGKVeqHG2DcQMMOXBKZSjOmTo2b
GIZ3a3xEqXaFfnGQxXSZrvtHIwCmv10H2oyGHu0vBp/SJuWXNg72oivOGhbm0uWs
0/bdIK8+1sW7OAmhhKdvMVpmzL8TQJnkUHSkQilPB2Tsf9wWDfeY7kDkK5YwQpk2
ZK+hrmwCFXQZELY65F2+y/cFim/F38HiqVdvIkV1wFSVqVVE9hEKJ4BDZl1fXZKB
p4qpDFgxO27E/eMo9IZfxRH4TdSoK6YLWo9FGWHKBPnciJfAeO9EP/AwAIhEQRdx
YZlN9sGS6ja7O1Eh423BBw6cFj6ta0ck2T1PoYk32FXc6sgqCphsfvBD3+tJxz8/
xoZ3BzoErdPqSXbH5cSI972kQW0JLESiMTZa827qnJtT672t6uBcsnnmR0ZbJH1f
TJCC9qgwpBiEkiGW3gwv00SC7CkXo3o0FJw0pa3MkKHGd7csxBtGBHI1b6Jj+oB0
yWf1HxSqwrq2Yek8R7lWd4jIxyWfKriEMTu7xCMUUFlprKmR2RufsADvqclNyedQ
sGBCc4eu1cpZp2no/IFm+IvkuzUHnkS/WNL1LbZ9YI8h8unjZHE=
=UVgZ
-----END PGP SIGNATURE-----
Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 ITS mitigation from Dave Hansen:
"Mitigate Indirect Target Selection (ITS) issue.
I'd describe this one as a good old CPU bug where the behavior is
_obviously_ wrong, but since it just results in bad predictions it
wasn't wrong enough to notice. Well, the researchers noticed and also
realized that thus bug undermined a bunch of existing indirect branch
mitigations.
Thus the unusually wide impact on this one. Details:
ITS is a bug in some Intel CPUs that affects indirect branches
including RETs in the first half of a cacheline. Due to ITS such
branches may get wrongly predicted to a target of (direct or indirect)
branch that is located in the second half of a cacheline. Researchers
at VUSec found this behavior and reported to Intel.
Affected processors:
- Cascade Lake, Cooper Lake, Whiskey Lake V, Coffee Lake R, Comet
Lake, Ice Lake, Tiger Lake and Rocket Lake.
Scope of impact:
- Guest/host isolation:
When eIBRS is used for guest/host isolation, the indirect branches
in the VMM may still be predicted with targets corresponding to
direct branches in the guest.
- Intra-mode using cBPF:
cBPF can be used to poison the branch history to exploit ITS.
Realigning the indirect branches and RETs mitigates this attack
vector.
- User/kernel:
With eIBRS enabled user/kernel isolation is *not* impacted by ITS.
- Indirect Branch Prediction Barrier (IBPB):
Due to this bug indirect branches may be predicted with targets
corresponding to direct branches which were executed prior to IBPB.
This will be fixed in the microcode.
Mitigation:
As indirect branches in the first half of cacheline are affected, the
mitigation is to replace those indirect branches with a call to thunk that
is aligned to the second half of the cacheline.
RETs that take prediction from RSB are not affected, but they may be
affected by RSB-underflow condition. So, RETs in the first half of
cacheline are also patched to a return thunk that executes the RET aligned
to second half of cacheline"
* tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftest/x86/bugs: Add selftests for ITS
x86/its: FineIBT-paranoid vs ITS
x86/its: Use dynamic thunks for indirect branches
x86/ibt: Keep IBT disabled during alternative patching
mm/execmem: Unify early execmem_cache behaviour
x86/its: Align RETs in BHB clear sequence to avoid thunking
x86/its: Add support for RSB stuffing mitigation
x86/its: Add "vmexit" option to skip mitigation on some CPUs
x86/its: Enable Indirect Target Selection mitigation
x86/its: Add support for ITS-safe return thunk
x86/its: Add support for ITS-safe indirect thunk
x86/its: Enumerate Indirect Target Selection (ITS) bug
Documentation: x86/bugs/its: Add ITS documentation
programs
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmgaKIwACgkQaDWVMHDJ
krCAMQ/9EJiWaATpR24pkXYDDPIVgcXtph7kX7WTha3ZBoA5ab0aJEbvSKZWpBRi
lmIja0iqysgwAtxZ3qMfilmpPN8x/G+Y+q11PnKT8yXtWRM5CoMYXUzWXGSWKCl0
D3hIqXFTLIzutpWn56CCvxId1KOz2O+GiiH+4HZnJTdwB3lIaQpHocqc7I9dXSmK
P+hpXHAZV4o3m6/gdiUxy519Vuz+P+oRaZ2CqGEPP9g8b78ZOsivEJh4HLwgtFXr
iajcpBMN/NLMEXGPKWZEt1JFNpW9cV8i2L5cGf4w6FU1YUIfOX1OSHXL+TjvPm7y
XyU+eqquvI5Qv6Er4Nil1yat10RQhdiBWou6MQmURjwKtBiYHGlosIlnaS2A08CK
VaLOmJ+DaknPnF+c41YeO8+ERjuJ6c6iMOhRLNvhnnFuQUq8Ktxm+qIAmskiiz4Q
gvervpVHlC1I7BwbQSEQdOnBj20XqIP+HAwKzSiRt/PrjBgojuYjDx2U9/ci7DWD
EgVqOw17lXq/HMZVDlnZxwqj2neqh/Nu9RTMKQn8zDbLDjuT+eEXnZrhIPaf0EHc
LfzCXRU+JRUiNsvEdksnSUN7W/Si1073sZrlNje9pY0mjXrcB2A6ChPcSBJepdzc
MJ3SU8Owr6HIcpmQiAK+2+bGYNSM/287dIKRLEUN+aPJZYRSOss=
=b2Ix
-----END PGP SIGNATURE-----
Merge tag 'ibti-hisory-for-linus-2025-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 IBTI mitigation from Dave Hansen:
"Mitigate Intra-mode Branch History Injection via classic BFP programs
This adds the branch history clearing mitigation to cBPF programs for
x86. Intra-mode BHI attacks via cBPF a.k.a IBTI-History was reported
by researchers at VUSec.
For hardware that doesn't support BHI_DIS_S, the recommended
mitigation is to run the short software sequence followed by the IBHF
instruction after cBPF execution. On hardware that does support
BHI_DIS_S, enable BHI_DIS_S and execute the IBHF after cBPF execution.
The Indirect Branch History Fence (IBHF) is a new instruction that
prevents indirect branch target predictions after the barrier from
using branch history from before the barrier while BHI_DIS_S is
enabled. On older systems this will map to a NOP. It is recommended to
add this fence at the end of the cBPF program to support VM migration.
This instruction is required on newer parts with BHI_NO to fully
mitigate against these attacks.
The current code disables the mitigation for anything running with the
SYS_ADMIN capability bit set. The intention was not to waste time
mitigating a process that has access to anything it wants anyway"
* tag 'ibti-hisory-for-linus-2025-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bhi: Do not set BHI_DIS_S in 32-bit mode
x86/bpf: Add IBHF call at end of classic BPF
x86/bpf: Call branch history clearing sequence on exit
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmggVx8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gCsg/+LsBbk+wJyQpVADyYKyuYRx12mVY6Pb2F
Oee+/ivaFeE92Su30nIfzDPyXe3PYCNFW9D26/y8wfgRNhMic5AWW2I2PRy/iV1V
EjyTdP6ign8HUzHfrlZNPJOsextTYqU7uDhwRocgrPBKlgmK8Z9WUfgK+TnhvvMo
fotb6HF96rfGIOww/kqCbGItKCzZ4app8tnzTWlj07qoYkCgrq+J57Jaj2cvYJYI
vhmIsbaG/qrt+3Q9MYiA5HIkxxv6mvpd/MS18dhlaON070TCmhlQxgN0wD6JAfR9
5lkeqhMq50cK5/cr15KAbzAPuqyK0d8DY+6cmEXmNMe0GwOeyoVZPM+Ul3f3l/ku
93M1qd0r1oIJ0ltzFh1Pfw8c8EDMmHz0opiqJ40efaJYUPUsi3jrwmn3hI5GNXdE
gISlPow9EY8MQe4dH4E20zHOTApEOvgJzhgkrR5jzh8JlEdbmFEezfF6fwf5tKfC
m1HNeX0SkjumtUkvka7v++hgpD28UVBA0dau+nhfoRXtUBuithPapLvoacVhum/j
9QxlGMP6VqBmg3GP6AmiccuTcWNsfFBbIUIJ+KT2GI5E3wvPxb9Q9qJTr+mKtoJ9
n6krAi++WUqfInkd4tx0uEk+x8W2mk4gOS/xV/qV8So4R8cyioO1XLHn/r+kwx6a
wLbdafPaJzY=
=ibsn
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Ingo Molnar:
"Fix a boot regression on very old x86 CPUs without CPUID support"
* tag 'x86-urgent-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Consolidate the loader enablement checking
FineIBT-paranoid was using the retpoline bytes for the paranoid check,
disabling retpolines, because all parts that have IBT also have eIBRS
and thus don't need no stinking retpolines.
Except... ITS needs the retpolines for indirect calls must not be in
the first half of a cacheline :-/
So what was the paranoid call sequence:
<fineibt_paranoid_start>:
0: 41 ba 78 56 34 12 mov $0x12345678, %r10d
6: 45 3b 53 f7 cmp -0x9(%r11), %r10d
a: 4d 8d 5b <f0> lea -0x10(%r11), %r11
e: 75 fd jne d <fineibt_paranoid_start+0xd>
10: 41 ff d3 call *%r11
13: 90 nop
Now becomes:
<fineibt_paranoid_start>:
0: 41 ba 78 56 34 12 mov $0x12345678, %r10d
6: 45 3b 53 f7 cmp -0x9(%r11), %r10d
a: 4d 8d 5b f0 lea -0x10(%r11), %r11
e: 2e e8 XX XX XX XX cs call __x86_indirect_paranoid_thunk_r11
Where the paranoid_thunk looks like:
1d: <ea> (bad)
__x86_indirect_paranoid_thunk_r11:
1e: 75 fd jne 1d
__x86_indirect_its_thunk_r11:
20: 41 ff eb jmp *%r11
23: cc int3
[ dhansen: remove initialization to false ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
ITS mitigation moves the unsafe indirect branches to a safe thunk. This
could degrade the prediction accuracy as the source address of indirect
branches becomes same for different execution paths.
To improve the predictions, and hence the performance, assign a separate
thunk for each indirect callsite. This is also a defense-in-depth measure
to avoid indirect branches aliasing with each other.
As an example, 5000 dynamic thunks would utilize around 16 bits of the
address space, thereby gaining entropy. For a BTB that uses
32 bits for indexing, dynamic thunks could provide better prediction
accuracy over fixed thunks.
Have ITS thunks be variable sized and use EXECMEM_MODULE_TEXT such that
they are both more flexible (got to extend them later) and live in 2M TLBs,
just like kernel code, avoiding undue TLB pressure.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
cfi_rewrite_callers() updates the fineIBT hash matching at the caller side,
but except for paranoid-mode it relies on apply_retpoline() and friends for
any ENDBR relocation. This could temporarily cause an indirect branch to
land on a poisoned ENDBR.
For instance, with para-virtualization enabled, a simple wrmsrl() could
have an indirect branch pointing to native_write_msr() who's ENDBR has been
relocated due to fineIBT:
<wrmsrl>:
push %rbp
mov %rsp,%rbp
mov %esi,%eax
mov %rsi,%rdx
shr $0x20,%rdx
mov %edi,%edi
mov %rax,%rsi
call *0x21e65d0(%rip) # <pv_ops+0xb8>
^^^^^^^^^^^^^^^^^^^^^^^
Such an indirect call during the alternative patching could #CP if the
caller is not *yet* adjusted for the new target ENDBR. To prevent a false
#CP, keep CET-IBT disabled until all callers are patched.
Patching during the module load does not need to be guarded by IBT-disable
because the module code is not executed until the patching is complete.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
When retpoline mitigation is enabled for spectre-v2, enabling
call-depth-tracking and RSB stuffing also mitigates ITS. Add cmdline option
indirect_target_selection=stuff to allow enabling RSB stuffing mitigation.
When retpoline mitigation is not enabled, =stuff option is ignored, and
default mitigation for ITS is deployed.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Ice Lake generation CPUs are not affected by guest/host isolation part of
ITS. If a user is only concerned about KVM guests, they can now choose a
new cmdline option "vmexit" that will not deploy the ITS mitigation when
CPU is not affected by guest/host isolation. This saves the performance
overhead of ITS mitigation on Ice Lake gen CPUs.
When "vmexit" option selected, if the CPU is affected by ITS guest/host
isolation, the default ITS mitigation is deployed.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Indirect Target Selection (ITS) is a bug in some pre-ADL Intel CPUs with
eIBRS. It affects prediction of indirect branch and RETs in the
lower half of cacheline. Due to ITS such branches may get wrongly predicted
to a target of (direct or indirect) branch that is located in the upper
half of the cacheline.
Scope of impact
===============
Guest/host isolation
--------------------
When eIBRS is used for guest/host isolation, the indirect branches in the
VMM may still be predicted with targets corresponding to branches in the
guest.
Intra-mode
----------
cBPF or other native gadgets can be used for intra-mode training and
disclosure using ITS.
User/kernel isolation
---------------------
When eIBRS is enabled user/kernel isolation is not impacted.
Indirect Branch Prediction Barrier (IBPB)
-----------------------------------------
After an IBPB, indirect branches may be predicted with targets
corresponding to direct branches which were executed prior to IBPB. This is
mitigated by a microcode update.
Add cmdline parameter indirect_target_selection=off|on|force to control the
mitigation to relocate the affected branches to an ITS-safe thunk i.e.
located in the upper half of cacheline. Also add the sysfs reporting.
When retpoline mitigation is deployed, ITS safe-thunks are not needed,
because retpoline sequence is already ITS-safe. Similarly, when call depth
tracking (CDT) mitigation is deployed (retbleed=stuff), ITS safe return
thunk is not used, as CDT prevents RSB-underflow.
To not overcomplicate things, ITS mitigation is not supported with
spectre-v2 lfence;jmp mitigation. Moreover, it is less practical to deploy
lfence;jmp mitigation on ITS affected parts anyways.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
RETs in the lower half of cacheline may be affected by ITS bug,
specifically when the RSB-underflows. Use ITS-safe return thunk for such
RETs.
RETs that are not patched:
- RET in retpoline sequence does not need to be patched, because the
sequence itself fills an RSB before RET.
- RET in Call Depth Tracking (CDT) thunks __x86_indirect_{call|jump}_thunk
and call_depth_return_thunk are not patched because CDT by design
prevents RSB-underflow.
- RETs in .init section are not reachable after init.
- RETs that are explicitly marked safe with ANNOTATE_UNRET_SAFE.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Due to ITS, indirect branches in the lower half of a cacheline may be
vulnerable to branch target injection attack.
Introduce ITS-safe thunks to patch indirect branches in the lower half of
cacheline with the thunk. Also thunk any eBPF generated indirect branches
in emit_indirect_jump().
Below category of indirect branches are not mitigated:
- Indirect branches in the .init section are not mitigated because they are
discarded after boot.
- Indirect branches that are explicitly marked retpoline-safe.
Note that retpoline also mitigates the indirect branches against ITS. This
is because the retpoline sequence fills an RSB entry before RET, and it
does not suffer from RSB-underflow part of the ITS.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
ITS bug in some pre-Alderlake Intel CPUs may allow indirect branches in the
first half of a cache line get predicted to a target of a branch located in
the second half of the cache line.
Set X86_BUG_ITS on affected CPUs. Mitigation to follow in later commits.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
report_ubsan_failure() doesn't use argument regs, and soon it will
be called from the hypervisor context were regs are not available.
So, remove the unused argument.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Acked-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250430162713.1997569-3-smostafa@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
With the possibility of intra-mode BHI via cBPF, complete mitigation for
BHI is to use IBHF (history fence) instruction with BHI_DIS_S set. Since
this new instruction is only available in 64-bit mode, setting BHI_DIS_S in
32-bit mode is only a partial mitigation.
Do not set BHI_DIS_S in 32-bit mode so as to avoid reporting misleading
mitigated status. With this change IBHF won't be used in 32-bit mode, also
remove the CONFIG_X86_64 check from emit_spectre_bhb_barrier().
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Classic BPF programs can be run by unprivileged users, allowing
unprivileged code to execute inside the kernel. Attackers can use this to
craft branch history in kernel mode that can influence the target of
indirect branches.
BHI_DIS_S provides user-kernel isolation of branch history, but cBPF can be
used to bypass this protection by crafting branch history in kernel mode.
To stop intra-mode attacks via cBPF programs, Intel created a new
instruction Indirect Branch History Fence (IBHF). IBHF prevents the
predicted targets of subsequent indirect branches from being influenced by
branch history prior to the IBHF. IBHF is only effective while BHI_DIS_S is
enabled.
Add the IBHF instruction to cBPF jitted code's exit path. Add the new fence
when the hardware mitigation is enabled (i.e., X86_FEATURE_CLEAR_BHB_HW is
set) or after the software sequence (X86_FEATURE_CLEAR_BHB_LOOP) is being
used in a virtual machine. Note that X86_FEATURE_CLEAR_BHB_HW and
X86_FEATURE_CLEAR_BHB_LOOP are mutually exclusive, so the JIT compiler will
only emit the new fence, not the SW sequence, when X86_FEATURE_CLEAR_BHB_HW
is set.
Hardware that enumerates BHI_NO basically has BHI_DIS_S protections always
enabled, regardless of the value of BHI_DIS_S. Since BHI_DIS_S doesn't
protect against intra-mode attacks, enumerate BHI bug on BHI_NO hardware as
well.
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Remove @perm from the guest pseudo FPU container. The field is
initialized during allocation and never used later.
Rename fpu_init_guest_permissions() to show that its sole purpose is to
lock down guest permissions.
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mitchell Levy <levymitchell0@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Samuel Holland <samuel.holland@sifive.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/kvm/af972fe5981b9e7101b64de43c7be0a8cc165323.camel@redhat.com/
Link: https://lore.kernel.org/r/20250506093740.2864458-3-chao.gao@intel.com
When granting userspace or a KVM guest access to an xfeature, preserve the
entity's existing supervisor and software-defined permissions as tracked
by __state_perm, i.e. use __state_perm to track *all* permissions even
though all supported supervisor xfeatures are granted to all FPUs and
FPU_GUEST_PERM_LOCKED disallows changing permissions.
Effectively clobbering supervisor permissions results in inconsistent
behavior, as xstate_get_group_perm() will report supervisor features for
process that do NOT request access to dynamic user xfeatures, whereas any
and all supervisor features will be absent from the set of permissions for
any process that is granted access to one or more dynamic xfeatures (which
right now means AMX).
The inconsistency isn't problematic because fpu_xstate_prctl() already
strips out everything except user xfeatures:
case ARCH_GET_XCOMP_PERM:
/*
* Lockless snapshot as it can also change right after the
* dropping the lock.
*/
permitted = xstate_get_host_group_perm();
permitted &= XFEATURE_MASK_USER_SUPPORTED;
return put_user(permitted, uptr);
case ARCH_GET_XCOMP_GUEST_PERM:
permitted = xstate_get_guest_group_perm();
permitted &= XFEATURE_MASK_USER_SUPPORTED;
return put_user(permitted, uptr);
and similarly KVM doesn't apply the __state_perm to supervisor states
(kvm_get_filtered_xcr0() incorporates xstate_get_guest_group_perm()):
case 0xd: {
u64 permitted_xcr0 = kvm_get_filtered_xcr0();
u64 permitted_xss = kvm_caps.supported_xss;
But if KVM in particular were to ever change, dropping supervisor
permissions would result in subtle bugs in KVM's reporting of supported
CPUID settings. And the above behavior also means that having supervisor
xfeatures in __state_perm is correctly handled by all users.
Dropping supervisor permissions also creates another landmine for KVM. If
more dynamic user xfeatures are ever added, requesting access to multiple
xfeatures in separate ARCH_REQ_XCOMP_GUEST_PERM calls will result in the
second invocation of __xstate_request_perm() computing the wrong ksize, as
as the mask passed to xstate_calculate_size() would not contain *any*
supervisor features.
Commit 781c64bfcb ("x86/fpu/xstate: Handle supervisor states in XSTATE
permissions") fudged around the size issue for userspace FPUs, but for
reasons unknown skipped guest FPUs. Lack of a fix for KVM "works" only
because KVM doesn't yet support virtualizing features that have supervisor
xfeatures, i.e. as of today, KVM guest FPUs will never need the relevant
xfeatures.
Simply extending the hack-a-fix for guests would temporarily solve the
ksize issue, but wouldn't address the inconsistency issue and would leave
another lurking pitfall for KVM. KVM support for virtualizing CET will
likely add CET_KERNEL as a guest-only xfeature, i.e. CET_KERNEL will not
be set in xfeatures_mask_supervisor() and would again be dropped when
granting access to dynamic xfeatures.
Note, the existing clobbering behavior is rather subtle. The @permitted
parameter to __xstate_request_perm() comes from:
permitted = xstate_get_group_perm(guest);
which is either fpu->guest_perm.__state_perm or fpu->perm.__state_perm,
where __state_perm is initialized to:
fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
and copied to the guest side of things:
/* Same defaults for guests */
fpu->guest_perm = fpu->perm;
fpu_kernel_cfg.default_features contains everything except the dynamic
xfeatures, i.e. everything except XFEATURE_MASK_XTILE_DATA:
fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
When __xstate_request_perm() restricts the local "mask" variable to
compute the user state size:
mask &= XFEATURE_MASK_USER_SUPPORTED;
usize = xstate_calculate_size(mask, false);
it subtly overwrites the target __state_perm with "mask" containing only
user xfeatures:
perm = guest ? &fpu->guest_perm : &fpu->perm;
/* Pairs with the READ_ONCE() in xstate_get_group_perm() */
WRITE_ONCE(perm->__state_perm, mask);
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: John Allen <john.allen@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mitchell Levy <levymitchell0@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Samuel Holland <samuel.holland@sifive.com>
Cc: Sohil Mehta <sohil.mehta@intel.com>
Cc: Vignesh Balasubramanian <vigbalas@amd.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xin Li <xin3.li@intel.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/all/ZTqgzZl-reO1m01I@google.com
Link: https://lore.kernel.org/r/20250506093740.2864458-2-chao.gao@intel.com
CPUID(0x80000000).EAX returns the max extended CPUID leaf available. On
x86-32 machines without an extended CPUID range, a CPUID(0x80000000)
query will just repeat the output of the last valid standard CPUID leaf
on the CPU; i.e., a garbage values. Current tip:x86/cpu code protects against
this by doing:
eax = cpuid_eax(0x80000000);
c->extended_cpuid_level = eax;
if ((eax & 0xffff0000) == 0x80000000) {
// CPU has an extended CPUID range. Check for 0x80000001
if (eax >= 0x80000001) {
cpuid(0x80000001, ...);
}
}
This is correct so far. Afterwards though, the same possibly broken EAX
value is used to check the availability of other extended CPUID leaves:
if (c->extended_cpuid_level >= 0x80000007)
...
if (c->extended_cpuid_level >= 0x80000008)
...
if (c->extended_cpuid_level >= 0x8000000a)
...
if (c->extended_cpuid_level >= 0x8000001f)
...
which is invalid. Fix this by immediately setting the CPU's max extended
CPUID leaf to zero if CPUID(0x80000000).EAX doesn't indicate a valid
CPUID extended range.
While at it, add a comment, similar to kernel/head_32.S, clarifying the
CPUID(0x80000000) sanity check.
References: 8a50e5135a ("x86-32: Use symbolic constants, safer CPUID when enabling EFER.NX")
Fixes: 3da99c9776 ("x86: make (early)_identify_cpu more the same between 32bit and 64 bit")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250506050437.10264-3-darwi@linutronix.de
The following register contains bits that indicate the cause for the
previous reset.
PMx000000C0 (FCH::PM::S5_RESET_STATUS)
This is useful for debug. The reasons for reset are broken into 6 high level
categories. Decode it by category and print during boot.
Specifics within a category are split off into debugging documentation.
The register is accessed indirectly through a "PM" port in the FCH. Use
MMIO access in order to avoid restrictions with legacy port access.
Use a late_initcall() to ensure that MMIO has been set up before trying to
access the register.
This register was introduced with AMD Family 17h, so avoid access on older
families. There is no CPUID feature bit for this register.
[ bp: Simplify the reason dumping loop.
- merge a fix to not access an array element after the last one:
https://lore.kernel.org/r/20250505133609.83933-1-superm1@kernel.org
Reported-by: James Dutton <james.dutton@gmail.com>
]
[ mingo:
- Use consistent .rst formatting
- Fix 'Sleep' class field to 'ACPI-State'
- Standardize pin messages around the 'tripped' verbiage
- Remove reference to ring-buffer printing & simplify the wording
- Use curly braces for multi-line conditional statements ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Co-developed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250422234830.2840784-6-superm1@kernel.org
Borislav Petkov reported the following boot crash on x86-32,
with CONFIG_HARDENED_USERCOPY=y:
| usercopy: Kernel memory overwrite attempt detected to SLUB object 'task_struct' (offset 2112, size 160)!
| ...
| kernel BUG at mm/usercopy.c:102!
So the useroffset and usersize arguments are what control the allowed
window of copying in/out of the "task_struct" kmem cache:
/* create a slab on which task_structs can be allocated */
task_struct_whitelist(&useroffset, &usersize);
task_struct_cachep = kmem_cache_create_usercopy("task_struct",
arch_task_struct_size, align,
SLAB_PANIC|SLAB_ACCOUNT,
useroffset, usersize, NULL);
task_struct_whitelist() positions this window based on the location of
the thread_struct within task_struct, and gets the arch-specific details
via arch_thread_struct_whitelist(offset, size):
static void __init task_struct_whitelist(unsigned long *offset, unsigned long *size)
{
/* Fetch thread_struct whitelist for the architecture. */
arch_thread_struct_whitelist(offset, size);
/*
* Handle zero-sized whitelist or empty thread_struct, otherwise
* adjust offset to position of thread_struct in task_struct.
*/
if (unlikely(*size == 0))
*offset = 0;
else
*offset += offsetof(struct task_struct, thread);
}
Commit cb7ca40a38 ("x86/fpu: Make task_struct::thread constant size")
removed the logic for the window, leaving:
static inline void
arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
{
*offset = 0;
*size = 0;
}
So now there is no window that usercopy hardening will allow to be copied
in/out of task_struct.
But as reported above, there *is* a copy in copy_uabi_to_xstate(). (It
seems there are several, actually.)
int copy_sigframe_from_user_to_xstate(struct task_struct *tsk,
const void __user *ubuf)
{
return copy_uabi_to_xstate(x86_task_fpu(tsk)->fpstate, NULL, ubuf, &tsk->thread.pkru);
}
This appears to be writing into x86_task_fpu(tsk)->fpstate. With or
without CONFIG_X86_DEBUG_FPU, this resolves to:
((struct fpu *)((void *)(task) + sizeof(*(task))))
i.e. the memory "after task_struct" is cast to "struct fpu", and the
uses the "fpstate" pointer. How that pointer gets set looks to be
variable, but I think the one we care about here is:
fpu->fpstate = &fpu->__fpstate;
And struct fpu::__fpstate says:
struct fpstate __fpstate;
/*
* WARNING: '__fpstate' is dynamically-sized. Do not put
* anything after it here.
*/
So we're still dealing with a dynamically sized thing, even if it's not
within the literal struct task_struct -- it's still in the kmem cache,
though.
Looking at the kmem cache size, it has allocated "arch_task_struct_size"
bytes, which is calculated in fpu__init_task_struct_size():
int task_size = sizeof(struct task_struct);
task_size += sizeof(struct fpu);
/*
* Subtract off the static size of the register state.
* It potentially has a bunch of padding.
*/
task_size -= sizeof(union fpregs_state);
/*
* Add back the dynamically-calculated register state
* size.
*/
task_size += fpu_kernel_cfg.default_size;
/*
* We dynamically size 'struct fpu', so we require that
* 'state' be at the end of 'it:
*/
CHECK_MEMBER_AT_END_OF(struct fpu, __fpstate);
arch_task_struct_size = task_size;
So, this is still copying out of the kmem cache for task_struct, and the
window seems unchanged (still fpu regs). This is what the window was
before:
void fpu_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
{
*offset = offsetof(struct thread_struct, fpu.__fpstate.regs);
*size = fpu_kernel_cfg.default_size;
}
And the same commit I mentioned above removed it.
I think the misunderstanding is here:
| The fpu_thread_struct_whitelist() quirk to hardened usercopy can be removed,
| now that the FPU structure is not embedded in the task struct anymore, which
| reduces text footprint a bit.
Yes, FPU is no longer in task_struct, but it IS in the kmem cache named
"task_struct", since the fpstate is still being allocated there.
Partially revert the earlier mentioned commit, along with a
recalculation of the fpstate regs location.
Fixes: cb7ca40a38 ("x86/fpu: Make task_struct::thread constant size")
Reported-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/all/20250409211127.3544993-1-mingo@kernel.org/ # Discussion #1
Link: https://lore.kernel.org/r/202505041418.F47130C4C8@keescook # Discussion #2
Consolidate the whole logic which determines whether the microcode loader
should be enabled or not into a single function and call it everywhere.
Well, almost everywhere - not in mk_early_pgtbl_32() because there the kernel
is running without paging enabled and checking dis_ucode_ldr et al would
require physical addresses and uglification of the code.
But since this is 32-bit, the easier thing to do is to simply map the initrd
unconditionally especially since that mapping is getting removed later anyway
by zap_early_initrd_mapping() and avoid the uglification.
In doing so, address the issue of old 486er machines without CPUID
support, not booting current kernels.
[ mingo: Fix no previous prototype for ‘microcode_loader_disabled’ [-Wmissing-prototypes] ]
Fixes: 4c585af718 ("x86/boot/32: Temporarily map initrd for microcode loading")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/CANpbe9Wm3z8fy9HbgS8cuhoj0TREYEEkBipDuhgkWFvqX0UoVQ@mail.gmail.com
Add aliases for all the data objects that the startup code references -
this is needed so that this code can be moved into its own confined area
where it can only access symbols that have a __pi_ prefix.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250504095230.2932860-39-ardb+git@google.com
Move early_setup_gdt() out of the startup code that is callable from the
1:1 mapping - this is not needed, and instead, it is better to expose
the helper that does reside in __head directly.
This reduces the amount of code that needs special checks for 1:1
execution suitability. In particular, it avoids dealing with the GHCB
page (and its physical address) in startup code, which runs from the
1:1 mapping, making physical to virtual translations ambiguous.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250504095230.2932860-26-ardb+git@google.com
PF_KTHREAD|PF_USER_WORKER tasks should never clear TIF_NEED_FPU_LOAD,
so the TIF_NEED_FPU_LOAD check should equally filter them out.
And this way an exiting userspace task can avoid the unnecessary "fwait"
if it does context_switch() at least once on its way to exit_thread().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chang S . Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250503143856.GA9009@redhat.com
It makes no sense to copy the bytes after sizeof(struct task_struct),
FPU state will be initialized in fpu_clone().
A plain memcpy(dst, src, sizeof(struct task_struct)) should work too,
but "_and_pad" looks safer.
[ mingo: Simplify it a bit more. ]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S . Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250503143850.GA8997@redhat.com
It is not actually used after:
55bc30f2e3 ("x86/fpu: Remove the thread::fpu pointer")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chang S . Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250503143837.GA8985@redhat.com
Now that switch_fpu_finish() doesn't load the FPU state, it makes more
sense to fold it into switch_fpu_prepare() renamed to switch_fpu(), and
more importantly, use the "prev_p" task as a target for TIF_NEED_FPU_LOAD.
It doesn't make any sense to delay set_tsk_thread_flag(TIF_NEED_FPU_LOAD)
until "prev_p" is scheduled again.
There is no worry about the very first context switch, fpu_clone() must
always set TIF_NEED_FPU_LOAD.
Also, shift the test_tsk_thread_flag(TIF_NEED_FPU_LOAD) from the callers
to switch_fpu().
Note that the "PF_KTHREAD | PF_USER_WORKER" check can be removed but
this deserves a separate patch which can change more functions, say,
kernel_fpu_begin_mask().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chang S . Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250503143830.GA8982@redhat.com
The third argument in wrmsr(msr, low, 0) is unnecessary. Instead, use
wrmsrq(msr, low), which automatically sets the higher 32 bits of the
MSR value to 0.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-15-xin@zytor.com
An MSR value is represented as a 64-bit unsigned integer, with existing
MSR instructions storing it in EDX:EAX as two 32-bit segments.
The new immediate form MSR instructions, however, utilize a 64-bit
general-purpose register to store the MSR value. To unify the usage of
all MSR instructions, let the default MSR access APIs accept an MSR
value as a single 64-bit argument instead of two 32-bit segments.
The dual 32-bit APIs are still available as convenient wrappers over the
APIs that handle an MSR value as a single 64-bit argument.
The following illustrates the updated derivation of the MSR write APIs:
__wrmsrq(u32 msr, u64 val)
/ \
/ \
native_wrmsrq(msr, val) native_wrmsr(msr, low, high)
|
|
native_write_msr(msr, val)
/ \
/ \
wrmsrq(msr, val) wrmsr(msr, low, high)
When CONFIG_PARAVIRT is enabled, wrmsrq() and wrmsr() are defined on top
of paravirt_write_msr():
paravirt_write_msr(u32 msr, u64 val)
/ \
/ \
wrmsrq(msr, val) wrmsr(msr, low, high)
paravirt_write_msr() invokes cpu.write_msr(msr, val), an indirect layer
of pv_ops MSR write call:
If on native:
cpu.write_msr = native_write_msr
If on Xen:
cpu.write_msr = xen_write_msr
Therefore, refactor pv_cpu_ops.write_msr{_safe}() to accept an MSR value
in a single u64 argument, replacing the current dual u32 arguments.
No functional change intended.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-14-xin@zytor.com
__rdmsr() is the lowest level MSR write API, with native_rdmsr()
and native_rdmsrq() serving as higher-level wrappers around it.
#define native_rdmsr(msr, val1, val2) \
do { \
u64 __val = __rdmsr((msr)); \
(void)((val1) = (u32)__val); \
(void)((val2) = (u32)(__val >> 32)); \
} while (0)
static __always_inline u64 native_rdmsrq(u32 msr)
{
return __rdmsr(msr);
}
However, __rdmsr() continues to be utilized in various locations.
MSR APIs are designed for different scenarios, such as native or
pvops, with or without trace, and safe or non-safe. Unfortunately,
the current MSR API names do not adequately reflect these factors,
making it challenging to select the most appropriate API for
various situations.
To pave the way for improving MSR API names, convert __rdmsr()
uses to native_rdmsrq() to ensure consistent usage. Later, these
APIs can be renamed to better reflect their implications, such as
native or pvops, with or without trace, and safe or non-safe.
No functional change intended.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-10-xin@zytor.com
__wrmsr() is the lowest level MSR write API, with native_wrmsr()
and native_wrmsrq() serving as higher-level wrappers around it:
#define native_wrmsr(msr, low, high) \
__wrmsr(msr, low, high)
#define native_wrmsrl(msr, val) \
__wrmsr((msr), (u32)((u64)(val)), \
(u32)((u64)(val) >> 32))
However, __wrmsr() continues to be utilized in various locations.
MSR APIs are designed for different scenarios, such as native or
pvops, with or without trace, and safe or non-safe. Unfortunately,
the current MSR API names do not adequately reflect these factors,
making it challenging to select the most appropriate API for
various situations.
To pave the way for improving MSR API names, convert __wrmsr()
uses to native_wrmsr{,q}() to ensure consistent usage. Later,
these APIs can be renamed to better reflect their implications,
such as native or pvops, with or without trace, and safe or
non-safe.
No functional change intended.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250427092027.1598740-8-xin@zytor.com