mirror of https://github.com/torvalds/linux.git
687 Commits

**a77a94f862** x86/microcode: Default-disable late loading
It is dangerous and it should not be used anyway - there's a nice early loading already.

Requested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220525161232.14924-3-bp@alien8.de
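For context, late loading is the path a userspace sketch like the following exercises; after this change it is only available when the kernel opts in at build time, and early loading via the initrd remains the recommended mechanism. A minimal sketch, assuming the standard sysfs reload knob:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Trigger a late microcode load via sysfs. With late loading
 * default-disabled, this file is only functional on kernels built
 * with the late-loading option enabled.
 */
int main(void)
{
	int fd = open("/sys/devices/system/cpu/microcode/reload", O_WRONLY);

	if (fd < 0) {
		perror("open");	/* likely a kernel without late loading */
		return 1;
	}
	if (write(fd, "1", 1) != 1)
		perror("write");
	close(fd);
	return 0;
}
```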

**c5a3d3c01e** Merge tag 'x86_cpu_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 CPU feature updates from Borislav Petkov:

- Remove a bunch of chicken bit options to turn off CPU features which are not really needed anymore

- Misc fixes and cleanups

* tag 'x86_cpu_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/speculation: Add missing prototype for unpriv_ebpf_notify()
  x86/pm: Fix false positive kmemleak report in msr_build_context()
  x86/speculation/srbds: Do not try to turn mitigation off when not supported
  x86/cpu: Remove "noclflush"
  x86/cpu: Remove "noexec"
  x86/cpu: Remove "nosmep"
  x86/cpu: Remove CONFIG_X86_SMAP and "nosmap"
  x86/cpu: Remove "nosep"
  x86/cpu: Allow feature bit names from /proc/cpuinfo in clearcpuid=

**eb39e37d5c** Merge tag 'x86_sev_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull AMD SEV-SNP support from Borislav Petkov:
"The third AMD confidential computing feature, called Secure Nested Paging.

Add to confidential guests the necessary memory integrity protection against malicious hypervisor-based attacks like data replay, memory remapping and others, thus achieving a stronger isolation from the hypervisor.

At the core of the functionality is a new structure called a reverse map table (RMP), with which the guest has a say in which pages get assigned to it, and gets notified when a page it owns is accessed or modified under the covers, so that the guest can take appropriate action.

In addition, add support for the whole machinery needed to launch an SNP guest, details of which are properly explained in each patch.

And last but not least, the series refactors and improves parts of the previous SEV support so that the new code is accommodated properly and not just bolted on"

* tag 'x86_sev_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  x86/entry: Fixup objtool/ibt validation
  x86/sev: Mark the code returning to user space as syscall gap
  x86/sev: Annotate stack change in the #VC handler
  x86/sev: Remove duplicated assignment to variable info
  x86/sev: Fix address space sparse warning
  x86/sev: Get the AP jump table address from secrets page
  x86/sev: Add missing __init annotations to SEV init routines
  virt: sevguest: Rename the sevguest dir and files to sev-guest
  virt: sevguest: Change driver name to reflect generic SEV support
  x86/boot: Put globals that are accessed early into the .data section
  x86/boot: Add an efi.h header for the decompressor
  virt: sevguest: Fix bool function returning negative value
  virt: sevguest: Fix return value check in alloc_shared_pages()
  x86/sev-es: Replace open-coded hlt-loop with sev_es_terminate()
  virt: sevguest: Add documentation for SEV-SNP CPUID Enforcement
  virt: sevguest: Add support to get extended report
  virt: sevguest: Add support to derive key
  virt: Add SEV-SNP guest driver
  x86/sev: Register SEV-SNP guest request platform device
  x86/sev: Provide support for SNP guest request NAEs
  ...
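To make the RMP idea concrete, here is an illustrative model of what an RMP entry conveys per physical page. The field names follow the AMD APM's description, but the layout and the helper below are a sketch for explanation only, not the architectural format:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative (not architecturally exact) model of a reverse map
 * table (RMP) entry: one per physical page, maintained by hardware
 * and firmware, consulted on every guest access.
 */
struct rmp_entry {
	uint64_t assigned  : 1;	/* page assigned to a guest? */
	uint64_t pagesize  : 1;	/* 4K or 2M backing */
	uint64_t immutable : 1;	/* locked by firmware */
	uint64_t validated : 1;	/* guest ran PVALIDATE on it */
	uint32_t asid;		/* owning guest's ASID */
	uint64_t gpa;		/* guest physical address it backs */
};

/*
 * Conceptual ownership check: an access is only good if the page is
 * assigned to this ASID, backs the expected GPA and was validated by
 * the guest. Replay or remapping by the hypervisor breaks one of
 * these conditions, so the guest gets notified and can react.
 */
static bool rmp_access_ok(const struct rmp_entry *e,
			  uint32_t asid, uint64_t gpa)
{
	return e->assigned && e->validated &&
	       e->asid == asid && e->gpa == gpa;
}
```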

**400331f8ff** x86/tsx: Disable TSX development mode at boot
A microcode update on some Intel processors causes all TSX transactions to always abort by default[*]. Microcode also added functionality to re-enable TSX for development purposes. With this microcode loaded, if tsx=on was passed on the cmdline and TSX development mode was already enabled before the kernel boot, it may make the system vulnerable to TSX Asynchronous Abort (TAA). To be on the safer side, unconditionally disable TSX development mode during boot. If a viable use case appears, this can be revisited later.

[*]: Intel TSX Disable Update for Selected Processors, doc ID: 643557

[ bp: Drop unstable web link, massage heavily. ]

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/347bd844da3a333a9793c6687d4e4eb3b2419a3e.1646943780.git.pawan.kumar.gupta@linux.intel.com
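A condensed sketch of the disabling step, assuming the MSR and bit names from the microcode update documentation (MSR_IA32_MCU_OPT_CTRL, RTM_ALLOW); the real kernel gates this on additional presence checks, simplified here to one feature test:

```c
/* Sketch: clear the TSX development-mode override at boot. */
#define MSR_IA32_MCU_OPT_CTRL	0x00000123
#define RTM_ALLOW		BIT(1)	/* TSX development mode enable */

static void tsx_dev_mode_disable(void)
{
	u64 mcu_opt_ctrl;

	/* Only poke the MSR when it is known to be present. */
	if (!cpu_feature_enabled(X86_FEATURE_SRBDS_CTRL))
		return;

	rdmsrl(MSR_IA32_MCU_OPT_CTRL, mcu_opt_ctrl);

	if (mcu_opt_ctrl & RTM_ALLOW) {
		mcu_opt_ctrl &= ~RTM_ALLOW;
		wrmsrl(MSR_IA32_MCU_OPT_CTRL, mcu_opt_ctrl);
	}
}
```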

**95d33bfaa3** x86/sev: Register GHCB memory when SEV-SNP is active
The SEV-SNP guest is required by the GHCB spec to register the GHCB's Guest Physical Address (GPA). This is because the hypervisor may prefer that a guest uses a consistent and/or specific GPA for the GHCB associated with a vCPU. For more information, see the GHCB specification section "GHCB GPA Registration".

[ bp: Cleanup comments. ]

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-18-brijesh.singh@amd.com
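The registration itself is a short MSR-protocol handshake. A sketch, assuming the GHCB MSR protocol values from the spec (request info 0x012 and response info 0x013 in GHCBData[11:0]) and the guest's usual sev_es_{wr,rd}_ghcb_msr()/VMGEXIT() helpers; the terminate constants are placeholders:

```c
#define GHCB_MSR_REG_GPA_REQ	0x012ULL	/* GHCBData[11:0]: request */
#define GHCB_MSR_REG_GPA_RESP	0x013ULL	/* GHCBData[11:0]: response */
#define GHCB_MSR_INFO_MASK	0xfffULL

static void snp_register_ghcb_gpa(u64 gpa)
{
	u64 val;

	/* Ask the hypervisor to associate this GPA with the vCPU's GHCB. */
	sev_es_wr_ghcb_msr((gpa & PAGE_MASK) | GHCB_MSR_REG_GPA_REQ);
	VMGEXIT();

	/* The response must echo the GPA back; anything else is fatal. */
	val = sev_es_rd_ghcb_msr();
	if ((val & GHCB_MSR_INFO_MASK) != GHCB_MSR_REG_GPA_RESP ||
	    (val & PAGE_MASK) != (gpa & PAGE_MASK))
		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_REGISTER);
}
```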

**f8858b5eff** x86/cpu: Remove "noclflush"
Not really needed anymore and there's clearcpuid=.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127115626.14179-7-bp@alien8.de

**385d2ae0a1** x86/cpu: Remove "nosmep"
There should be no need to disable SMEP anymore.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127115626.14179-5-bp@alien8.de

**dbae0a934f** x86/cpu: Remove CONFIG_X86_SMAP and "nosmap"
Those were added as part of the SMAP enablement, but SMAP is now an integral part of the kernel proper and there's no need to disable it anymore. Rip out that functionality. Leave --uaccess default on for objtool, as this is what objtool should do by default anyway. If still needed - clearcpuid=smap.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127115626.14179-4-bp@alien8.de

**c949110ef4** x86/cpu: Remove "nosep"

That chicken bit was added by

**1625c833db** x86/cpu: Allow feature bit names from /proc/cpuinfo in clearcpuid=
Having to give the X86_FEATURE array indices in order to disable a feature bit for testing is not really user-friendly. So accept the feature bit names too.

Some feature bits don't have names, so for those the array indices are still accepted, of course.

Clearing CPUID flags is not something which should be done in production, so taint the kernel too.

An example cmdline would then be something like:

  clearcpuid=de,440,smca,succory,bmi1,3dnow

("succory" is wrong on purpose). And it says:

  [ ... ] Clearing CPUID bits: de 13:24 smca (unknown: succory) bmi1 3dnow

[ Fix CONFIG_X86_FEATURE_NAMES=n build error as reported by the 0day robot: https://lore.kernel.org/r/202203292206.ICsY2RKX-lkp@intel.com ]

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127115626.14179-2-bp@alien8.de
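A minimal userspace sketch of the accepted-token logic: each comma-separated token is taken as a numeric X86_FEATURE index if it parses as one, otherwise it is looked up by name. The tiny table here is a stand-in for the kernel's feature-name array built from /proc/cpuinfo names:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Tiny stand-in for the kernel's feature-name table. */
static const char *cap_flags[] = { "fpu", "de", "smca", "bmi1", "3dnow" };

static int find_by_name(const char *name)
{
	for (size_t i = 0; i < sizeof(cap_flags) / sizeof(cap_flags[0]); i++)
		if (!strcmp(cap_flags[i], name))
			return (int)i;
	return -1;
}

int main(void)
{
	char arg[] = "de,440,smca,succory,bmi1,3dnow";	/* example from the log */

	printf("Clearing CPUID bits:");
	for (char *tok = strtok(arg, ","); tok; tok = strtok(NULL, ",")) {
		char *end;
		long bit = strtol(tok, &end, 10);

		if (*end == '\0')			/* pure number: array index */
			printf(" %ld:%ld", bit / 32, bit % 32);
		else if (find_by_name(tok) >= 0)	/* known feature name */
			printf(" %s", tok);
		else					/* taint-worthy junk */
			printf(" (unknown: %s)", tok);
	}
	putchar('\n');
	return 0;
}
```

Running it prints the same "de 13:24 smca (unknown: succory) bmi1 3dnow" shape as the kernel log line quoted above (440 = word 13, bit 24).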

**9cea0d46f5** Merge branch 'x86/cpu' into x86/core, to resolve conflicts
Conflicts:
	arch/x86/include/asm/cpufeatures.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>

**fe379fa4d1** x86/ibt: Disable IBT around firmware
Assume firmware isn't IBT clean and disable it across calls.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.759989383@infradead.org

**af22700390** x86/ibt,kexec: Disable CET on kexec
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.641454603@infradead.org

**991625f3dd** x86/ibt: Add IBT feature, MSR and #CP handling
The bits required to make the hardware go. Of note is that, provided the syscall entry points are covered with ENDBR, #CP doesn't need to be an IST because we'll never hit the syscall gap.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.582331711@infradead.org
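A sketch of the enablement side, assuming the supervisor-CET MSR and bit names (MSR_IA32_S_CET, ENDBR enable in bit 2); the #CP exception wiring is elided:

```c
/* Sketch: turn on kernel IBT once the feature bit checks out. */
#define MSR_IA32_S_CET		0x000006a2
#define CET_ENDBR_EN		BIT(2)	/* enforce ENDBR at branch targets */

static void setup_ibt(void)
{
	if (!cpu_feature_enabled(X86_FEATURE_IBT))
		return;

	/*
	 * From here on, indirect branches must land on an ENDBR
	 * instruction, otherwise the CPU raises #CP (control-protection).
	 */
	wrmsrl(MSR_IA32_S_CET, CET_ENDBR_EN);
}
```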

**822ccfade5** x86/cpu: Read/save PPIN MSR during initialization
Currently, the PPIN (Protected Processor Inventory Number) MSR is read by every CPU that processes a machine check, CMCI, or just polls machine check banks from a periodic timer. This is not a "fast" MSR, so this adds to the overhead of processing errors.

Add a new "ppin" field to the cpuinfo_x86 structure. Read and save the PPIN during initialization. Use this copy in mce_setup() instead of reading the MSR.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220131230111.2004669-4-tony.luck@intel.com

**00a2f23eef** x86/cpu: X86_FEATURE_INTEL_PPIN finally has a CPUID bit
After nine generations of adding to the model-specific list of CPUs that support PPIN (Protected Processor Inventory Number), Intel allocated a CPUID bit to enumerate the MSRs. CPUID(EAX=7, ECX=1).EBX bit 0 enumerates the presence of MSR_PPIN_CTL and MSR_PPIN. Add it to the "scattered" CPUID bits and add an entry to the ppin_cpuids[] x86_match_cpu() array to catch Intel CPUs that implement it.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220131230111.2004669-3-tony.luck@intel.com
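A userspace sketch of the new enumeration check, using the compiler's cpuid.h helper; leaf 7, subleaf 1, EBX bit 0 is the bit the commit describes:

```c
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID(EAX=7, ECX=1): EBX bit 0 enumerates MSR_PPIN_CTL/MSR_PPIN. */
	if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
		return 1;

	printf("PPIN MSRs %senumerated\n", (ebx & 1) ? "" : "not ");
	return 0;
}
```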

**0dcab41d34** x86/cpu: Merge Intel and AMD ppin_init() functions
The code to decide whether a system supports the PPIN (Protected Processor Inventory Number) MSR was cloned from the Intel implementation. Apart from the X86_FEATURE bit and the MSR numbers it is identical.

Merge the two functions into common x86 code, but use x86_match_cpu() instead of the switch (c->x86_model) that was used by the old Intel code.

No functional change.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220131230111.2004669-2-tony.luck@intel.com
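Illustrative shape of the unified match table; the per-vendor driver data and the model entries below are placeholders (the real list also carries the full set of legacy Intel models predating the CPUID bit):

```c
/* Sketch: one table, matched by feature bit or by vendor/model. */
struct ppin_info {
	int	feature;	/* X86_FEATURE_* to force when matched */
	int	msr_ppin_ctl;	/* vendor-specific MSR numbers */
	int	msr_ppin;
};

static const struct x86_cpu_id ppin_cpuids[] = {
	/* CPUs which advertise the capability via a feature bit ... */
	X86_MATCH_FEATURE(X86_FEATURE_AMD_PPIN, &ppin_info[X86_VENDOR_AMD]),
	X86_MATCH_FEATURE(X86_FEATURE_INTEL_PPIN, &ppin_info[X86_VENDOR_INTEL]),
	/* ... plus legacy Intel models which predate the CPUID bit: */
	X86_MATCH_INTEL_FAM6_MODEL(IVYBRIDGE_X, &ppin_info[X86_VENDOR_INTEL]),
	X86_MATCH_INTEL_FAM6_MODEL(HASWELL_X, &ppin_info[X86_VENDOR_INTEL]),
	{}
};

static void ppin_init(struct cpuinfo_x86 *c)
{
	const struct x86_cpu_id *id = x86_match_cpu(ppin_cpuids);

	if (id) {
		/* probe MSR_PPIN_CTL, then set the feature bit, etc. */
	}
}
```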

**25f8c7785e** Merge tag 'x86_cpu_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpuid updates from Borislav Petkov:

- Enable the short string copies for CPUs which support them, in copy_user_enhanced_fast_string()

- Avoid writing MSR_CSTAR on Intel due to TDX guests raising a #VE trap

* tag 'x86_cpu_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/lib: Add fast-short-rep-movs check to copy_user_enhanced_fast_string()
  x86/cpu: Don't write CSTAR MSR on Intel CPUs

**b64dfcde1c** x86/mm: Prevent early boot triple-faults with instrumentation
Commit in Fixes added a global TLB flush on the early boot path, after the kernel switches off of the trampoline page table.

Compiler profiling options enabled with GCOV_PROFILE add additional measurement code on clang, and that code needs to be initialized prior to use. The global flush in x86_64_start_kernel() happens before those initializations can happen, leading to accessing invalid memory. GCOV_PROFILE builds with gcc are still ok, so this is clang-specific.

The second issue this fixes is with KASAN: for a similar reason, kasan_early_init() needs to have happened before KASAN-instrumented functions are called.

Therefore, reorder the flush to happen after the KASAN early init, and prevent the compilers from adding profiling instrumentation to native_write_cr4().

Fixes:

**9c7e2634f6** x86/cpu: Don't write CSTAR MSR on Intel CPUs
Intel CPUs do not support SYSCALL in 32-bit mode, but the kernel initializes MSR_CSTAR unconditionally. That MSR write is normally ignored by the CPU, but in a TDX guest it raises a #VE trap. Exclude Intel CPUs from the MSR_CSTAR initialization.

[ tglx: Fixed the subject line and removed the redundant comment. ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20211119035803.4012145-1-sathyanarayanan.kuppuswamy@linux.intel.com
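The shape of the fix is a small vendor-gated wrapper around the write; roughly:

```c
/* Sketch: skip the pointless (and, under TDX, #VE-raising) write on Intel. */
static inline void wrmsrl_cstar(unsigned long val)
{
	/*
	 * Intel CPUs ignore MSR_CSTAR because 32-bit SYSCALL is not
	 * supported there, and a TDX guest traps the write with #VE.
	 */
	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
		wrmsrl(MSR_CSTAR, val);
}
```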

**e0f4c59dc4** Merge tag 'x86_cpu_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu updates from Borislav Petkov:

- Start checking a CPUID bit on AMD Zen3 which states that the CPU clears the segment base when a null selector is written. Do the explicit detection on older CPUs, Zen2 and Hygon specifically, which have the functionality but do not advertise the CPUID bit. Factor in the presence of a hypervisor underneath the kernel and avoid doing the explicit check there, which the HV might have decided not to advertise for migration-safety reasons, or similar.

- Add support for a new x86 CPU vendor: Vortex. Needed for whitelisting those CPUs in the hardware vulnerabilities detection.

- Force the compiler to use rIP-relative addressing in the fallback path of static_cpu_has(), in order to avoid unnecessary register pressure.

* tag 'x86_cpu_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Fix migration safety with X86_BUG_NULL_SEL
  x86/CPU: Add support for Vortex CPUs
  x86/umip: Downgrade warning messages to debug loglevel
  x86/asm: Avoid adding register pressure for the init case in static_cpu_has()
  x86/asm: Add _ASM_RIP() macro for x86-64 (%rip) suffix

**8cb1ae19bf** Merge tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Thomas Gleixner:
- Cleanup of extable fixup handling to be more robust, which in turn
allows to make the FPU exception fixups more robust as well.
- Change the return code for signal frame related failures from
explicit error codes to a boolean fail/success, as that's all the
calling code evaluates.
- A large refactoring of the FPU code to prepare for adding AMX
support:
- Disentangle the public header maze and, in particular, remove the
misnamed kitchen-sink internal.h, which despite its name is
included all over the place.
- Add a proper abstraction for the register buffer storage (struct
fpstate) which allows to dynamically size the buffer at runtime
by flipping the pointer to the buffer container from the default
container which is embedded in task_struct::thread::fpu to a
dynamically allocated container with a larger register buffer.
- Convert the code over to the new fpstate mechanism.
- Consolidate the KVM FPU handling by moving the FPU related code
into the FPU core, which reduces the number of exports and avoids
adding even more exports when AMX has to be supported in KVM.
This also removes duplicated code which was, of course,
unnecessarily different and incomplete in the KVM copy.
- Simplify the KVM FPU buffer handling by utilizing the new
fpstate container and just switching the buffer pointer from the
user space buffer to the KVM guest buffer when entering
vcpu_run() and flipping it back when leaving the function. This
cuts the memory requirements of a vCPU for FPU buffers in half
and avoids pointless memory copy operations.
This also solves the so far unresolved problem of adding AMX
support because the current FPU buffer handling of KVM inflicted
a circular dependency between adding AMX support to the core and
to KVM. With the new scheme of switching fpstate AMX support can
be added to the core code without affecting KVM.
- Replace various variables with proper data structures so the
extra information required for adding dynamically enabled FPU
features (AMX) can be added in one place
- Add AMX (Advanced Matrix eXtensions) support (finally):
AMX is a large XSTATE component which is going to be available with
Sapphire Rapids Xeon CPUs. The feature comes with an extra MSR
(MSR_XFD) which allows trapping the (first) use of an AMX related
instruction, which has two benefits:
1) It allows the kernel to control access to the feature
2) It allows the kernel to dynamically allocate the large register
state buffer instead of burdening every task with the extra
8K or larger state storage.
It would have been great to gain this kind of control already with
AVX512.
The support comes with the following infrastructure components:
1) arch_prctl() to
- read the supported features (equivalent to XGETBV(0))
- read the permitted features for a task
- request permission for a dynamically enabled feature
Permission is granted per process, inherited on fork() and
cleared on exec(). The permission policy of the kernel is
restricted to sigaltstack size validation, but the syscall
obviously allows further restrictions via seccomp etc.
2) A stronger sigaltstack size validation for sys_sigaltstack(2)
which takes granted permissions and the potentially resulting
larger signal frame into account. This mechanism can also be used
to enforce factual sigaltstack validation independent of dynamic
features to help with finding potential victims of the 2K
sigaltstack size constant which is broken since AVX512 support
was added.
3) Exception handling for #NM traps to catch first use of an extended
feature via a new cause MSR. If the exception was caused by the
use of such a feature, the handler checks permission for that
feature. If permission has not been granted, the handler sends a
SIGILL as the #UD handler would do if the feature had been
disabled in XCR0. If permission has been granted, then a new
fpstate which fits the larger buffer requirement is allocated.
In the unlikely case that this allocation fails, the handler
sends SIGSEGV to the task. That's not elegant, but unavoidable as
the other discussed options of preallocation or full per task
permissions come with their own set of horrors for kernel and/or
userspace. So this is the lesser of the evils and SIGSEGV caused
by unexpected memory allocation failures is not a fundamentally
new concept either.
When allocation succeeds, the fpstate properties are filled in to
reflect the extended feature set and the resulting sizes, the
fpu::fpstate pointer is updated accordingly and the trap is
disarmed for this task permanently.
4) Enumeration and size calculations
5) Trap switching via MSR_XFD
The XFD (eXtended Feature Disable) MSR is context switched with
the same lifetime rules as the FPU register state itself. The
mechanism is keyed off with a static key which is default
disabled so !AMX equipped CPUs have zero overhead. On AMX enabled
CPUs the overhead is limited by comparing the task's XFD value
with a per CPU shadow variable to avoid redundant MSR writes. In
case of switching from an AMX-using task to a non-AMX-using task
or vice versa, the extra MSR write is obviously inevitable.
All other places which need to be aware of the variable feature
sets and resulting variable sizes are not affected at all because
they retrieve the information (feature set, sizes) unconditionally
from the fpstate properties.
6) Enable the new AMX states
Note, this is relatively new code despite the fact that AMX support
has been in the works for more than a year now.
The big refactoring of the FPU code which allowed a proper
integration was started exactly three weeks ago. Refactoring of the
existing FPU code and of the original AMX patches took a week and has
been subject to extensive review and testing. The only fallout which
has not been caught in review and testing right away was restricted
to AMX enabled systems, which is completely irrelevant for anyone
outside Intel and their early access program. There might be dragons
lurking as usual, but so far the fine-grained refactoring has held up,
and any as yet undetected fallout is bisectable and should be
easily addressable before the 5.16 release. Famous last words...
Many thanks to Chang Bae and Dave Hansen for working hard on this and
also to the various test teams at Intel who reserved extra capacity
to follow the rapid development of this closely which provides the
confidence level required to offer this rather large update for
inclusion into 5.16-rc1.
* tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
Documentation/x86: Add documentation for using dynamic XSTATE features
x86/fpu: Include vmalloc.h for vzalloc()
selftests/x86/amx: Add context switch test
selftests/x86/amx: Add test cases for AMX state management
x86/fpu/amx: Enable the AMX feature in 64-bit mode
x86/fpu: Add XFD handling for dynamic states
x86/fpu: Calculate the default sizes independently
x86/fpu/amx: Define AMX state components and have it used for boot-time checks
x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers
x86/fpu/xstate: Add fpstate_realloc()/free()
x86/fpu/xstate: Add XFD #NM handler
x86/fpu: Update XFD state where required
x86/fpu: Add sanity checks for XFD
x86/fpu: Add XFD state to fpstate
x86/msr-index: Add MSRs for XFD
x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit
x86/fpu: Reset permission and fpstate on exec()
x86/fpu: Prepare fpu_clone() for dynamically enabled features
x86/fpu/signal: Prepare for variable sigframe length
x86/signal: Use fpu::__state_user_size for sigalt stack validation
...
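The arch_prctl() permission handshake described above looks roughly like this from userspace. The request/permission codes and the tile-data feature number are the ones this series adds to the uapi headers; the example is plain, runnable C for an x86-64 kernel with this support:

```c
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* arch_prctl() codes and the AMX tile-data feature number from this series. */
#define ARCH_GET_XCOMP_SUPP	0x1021
#define ARCH_GET_XCOMP_PERM	0x1022
#define ARCH_REQ_XCOMP_PERM	0x1023
#define XFEATURE_XTILEDATA	18

int main(void)
{
	unsigned long supported = 0, permitted = 0;

	/* Equivalent to XGETBV(0) plus the dynamically-enabled features. */
	if (syscall(SYS_arch_prctl, ARCH_GET_XCOMP_SUPP, &supported))
		return 1;

	if (!(supported & (1UL << XFEATURE_XTILEDATA))) {
		puts("AMX tile data not supported");
		return 0;
	}

	/* Ask for AMX: granted per process, inherited on fork(), reset on exec(). */
	if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA))
		return 1;

	syscall(SYS_arch_prctl, ARCH_GET_XCOMP_PERM, &permitted);
	printf("permitted xfeatures: %#lx\n", permitted);
	return 0;
}
```

Only after the request succeeds will first use of an AMX instruction take the #NM path described in item 3 and get a suitably sized fpstate instead of a SIGILL.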

**9a7e0a90a4** Merge tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- Revert the printk format based wchan() symbol resolution as it can
leak the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kinds of unwinders by
enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to RT and DL scheduling classes and add
statistics for bandwidth burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset
and __sched_setscheduler() introduced a new lock dependency which is
now triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
sched/fair: Cleanup newidle_balance
sched/fair: Remove sysctl_sched_migration_cost condition
sched/fair: Wait before decaying max_newidle_lb_cost
sched/fair: Skip update_blocked_averages if we are defering load balance
sched/fair: Account update_blocked_averages in newidle_balance cost
x86: Fix __get_wchan() for !STACKTRACE
sched,x86: Fix L2 cache mask
sched/core: Remove rq_relock()
sched: Improve wake_up_all_idle_cpus() take #2
irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
sched: Add cluster scheduler level for x86
sched: Add cluster scheduler level in core and related Kconfig for ARM64
topology: Represent clusters of CPUs within a die
sched: Disable -Wunused-but-set-variable
sched: Add wrapper for get_wchan() to keep task blocked
x86: Fix get_wchan() to support the ORC unwinder
proc: Use task_is_running() for wchan in /proc/$pid/stat
...

**415de44076** x86/cpu: Fix migration safety with X86_BUG_NULL_SEL
Currently, Linux probes for X86_BUG_NULL_SEL unconditionally, which makes it unsafe to migrate in a virtualised environment as the properties across the migration pool might differ. To be specific, the case which goes wrong is:

1. Zen1 (or earlier) and Zen2 (or later) in a migration pool
2. Linux boots on Zen2, probes and finds the absence of X86_BUG_NULL_SEL
3. Linux is then migrated to Zen1

Linux is now running on an X86_BUG_NULL_SEL-impacted CPU while believing that the bug is fixed.

The only way to address the problem is to fully trust the "no longer affected" CPUID bit when virtualised, because in the above case it would be deliberately cleared to indicate the fact "you might migrate to somewhere which has this behaviour".

Zen3 adds the NullSelectorClearsBase CPUID bit to indicate that loading a NULL segment selector zeroes the base and limit fields, as well as just attributes. Zen2 also has this behaviour but doesn't have the NSCB bit.

[ bp: Minor touchups. ]

Signed-off-by: Jane Malalane <jane.malalane@citrix.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20211021104744.24126-1-jane.malalane@citrix.com
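A sketch of the resulting decision flow; the feature/bug names follow the kernel tree, while null_sel_clears_base() stands in for the existing explicit probe:

```c
static void detect_null_seg_behavior(struct cpuinfo_x86 *c)
{
	/* Zen3+ advertises the fixed behaviour via the NSCB CPUID bit. */
	if (cpu_has(c, X86_FEATURE_NULL_SEL_CLR_BASE))
		return;

	/*
	 * Under a hypervisor, trust the (absent) bit: the migration pool
	 * may contain older parts, so assume the bug instead of probing.
	 */
	if (cpu_has(c, X86_FEATURE_HYPERVISOR)) {
		set_cpu_bug(c, X86_BUG_NULL_SEL);
		return;
	}

	/* Bare metal: probe what a null selector write actually does. */
	if (!null_sel_clears_base(c))
		set_cpu_bug(c, X86_BUG_NULL_SEL);
}
```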

**639475d434** x86/CPU: Add support for Vortex CPUs
DM&P devices were not being properly identified, which resulted in unneeded Spectre/Meltdown mitigations being applied. The manufacturer states that these devices execute always in-order and don't support either speculative execution or branch prediction, so they are not vulnerable to this class of attack. [1] This is something I've personally tested by a simple timing analysis on my Vortex86MX CPU, and can confirm it is true.

Add identification for some devices that lack the CPUID product name call, so they appear properly on /proc/cpuinfo.

[1] https://www.ssv-embedded.de/doks/infos/DMP_Ann_180108_Meltdown.pdf

[ bp: Massage commit message. ]

Signed-off-by: Marcos Del Sol Vives <marcos@orca.pet>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211017094408.1512158-1-marcos@orca.pet
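Illustratively, vendor identification plugs into the cpu_dev machinery roughly like this; the c_ident string below is a placeholder, and the real patch also handles legacy parts lacking the CPUID vendor string:

```c
/* Sketch: a cpu_dev entry so Vortex parts are identified by vendor. */
static const struct cpu_dev vortex_cpu_dev = {
	.c_vendor	= "Vortex",
	.c_ident	= { "Vortex86 SoC" },	/* CPUID vendor string */
	.c_x86_vendor	= X86_VENDOR_VORTEX,
};

cpu_dev_register(vortex_cpu_dev);
```

Once the vendor is known, the hardware-vulnerability whitelist can exempt these CPUs from the Spectre/Meltdown mitigations mentioned above.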

**b56d2795b2** x86/fpu: Replace the includes of fpu/internal.h
Now that the file is empty, fixup all references with the proper includes and delete the former kitchen sink.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011540.001197214@linutronix.de

**66558b730f** sched: Add cluster scheduler level for x86
There are x86 CPU architectures (e.g. Jacobsville) where the L2 cache is shared among a cluster of cores instead of being exclusive to one single core. To prevent oversubscription of the L2 cache, load should be balanced between such L2 clusters, especially for tasks with no shared data.

On benchmarks such as the SPECrate mcf test, this change provides a boost to performance, especially on a medium-load system. On a Jacobsville that has 24 Atom cores arranged into 6 clusters of 4 cores each, the benchmark numbers are as follows:

Improvement over baseline kernel for mcf_r:

| copies | run time | base rate |
|---|---|---|
| 1 | -0.1% | -0.2% |
| 6 | 25.1% | 25.1% |
| 12 | 18.8% | 19.0% |
| 24 | 0.3% | 0.3% |

So this looks pretty good. In terms of the system's task distribution, some pretty bad clumping can be seen for the vanilla kernel without the L2 cluster domain for the 6 and 12 copies cases. With the extra domain for the cluster, the load does get evened out between the clusters.

Note this patch isn't a universal win, as spreading isn't necessarily a win, particularly for those workloads which can benefit from packing.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210924085104.44806-4-21cnbao@gmail.com
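A sketch of how such a level slots into the x86 sched-domain topology table, between SMT and the L3-wide MC level; names follow the upstream SCHED_CLUSTER work but treat the exact table as illustrative:

```c
/* Sketch: x86 sched-domain topology with a cluster (shared-L2) level. */
static struct sched_domain_topology_level x86_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_CLUSTER
	/* CPUs sharing an L2: balance here before spilling into the package. */
	{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
#endif
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};
```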

**3958b9c34c** x86/entry: Clear X86_FEATURE_SMAP when CONFIG_X86_SMAP=n
Commit

**9164d9493a** x86/cpu: Add get_llc_id() helper function
Factor out a helper function rather than export cpu_llc_id, which is needed in order to be able to build the AMD uncore driver as a module.

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210817221048.88063-7-kim.phillips@amd.com
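The helper is essentially a per-CPU accessor; roughly:

```c
/* Sketch: expose the last-level-cache ID without exporting cpu_llc_id. */
u32 get_llc_id(unsigned int cpu)
{
	return per_cpu(cpu_llc_id, cpu);
}
EXPORT_SYMBOL_GPL(get_llc_id);
```

Exporting one narrow accessor instead of the raw per-CPU variable keeps the modular uncore driver from depending on internal topology state.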

**1423e2660c** Merge tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Thomas Gleixner:
"Fixes and improvements for FPU handling on x86:
- Prevent sigaltstack out of bounds writes.
The kernel unconditionally writes the FPU state to the alternate
stack without checking whether the stack is large enough to
accommodate it.
Check the alternate stack size before doing so and in case it's too
small force a SIGSEGV instead of silently corrupting user space
data.
- MINSIGSTKSZ and SIGSTKSZ are constants in signal.h and have never
been updated despite the fact that the FPU state which is stored on
the signal stack has grown over time, which causes trouble in the
field when AVX512 is available on a CPU. The kernel does not expose
the minimum requirements for the alternate stack size depending on
the available and enabled CPU features.
ARM already added an aux vector AT_MINSIGSTKSZ for the same reason.
Add it to x86 as well.
- A major cleanup of the x86 FPU code. The recent discoveries of
XSTATE related issues unearthed quite some inconsistencies,
duplicated code and other issues.
The fine granular overhaul addresses this, makes the code more
robust and maintainable, which allows to integrate upcoming XSTATE
related features in sane ways"
* tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
x86/fpu/xstate: Clear xstate header in copy_xstate_to_uabi_buf() again
x86/fpu/signal: Let xrstor handle the features to init
x86/fpu/signal: Handle #PF in the direct restore path
x86/fpu: Return proper error codes from user access functions
x86/fpu/signal: Split out the direct restore code
x86/fpu/signal: Sanitize copy_user_to_fpregs_zeroing()
x86/fpu/signal: Sanitize the xstate check on sigframe
x86/fpu/signal: Remove the legacy alignment check
x86/fpu/signal: Move initial checks into fpu__restore_sig()
x86/fpu: Mark init_fpstate __ro_after_init
x86/pkru: Remove xstate fiddling from write_pkru()
x86/fpu: Don't store PKRU in xstate in fpu_reset_fpstate()
x86/fpu: Remove PKRU handling from switch_fpu_finish()
x86/fpu: Mask PKRU from kernel XRSTOR[S] operations
x86/fpu: Hook up PKRU into ptrace()
x86/fpu: Add PKRU storage outside of task XSAVE buffer
x86/fpu: Dont restore PKRU in fpregs_restore_userspace()
x86/fpu: Rename xfeatures_mask_user() to xfeatures_mask_uabi()
x86/fpu: Move FXSAVE_LEAK quirk info __copy_kernel_to_fpregs()
x86/fpu: Rename __fpregs_load_activate() to fpregs_restore_userregs()
...
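The AT_MINSIGSTKSZ aux vector mentioned above is consumed from userspace roughly like this; a minimal, runnable sketch (the fallback constant covers libc headers that predate the definition):

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/auxv.h>

#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ	51	/* older headers may lack the definition */
#endif

int main(void)
{
	/* Kernel-reported minimum, accounting for enabled XSAVE features. */
	unsigned long min = getauxval(AT_MINSIGSTKSZ);
	size_t sz = (min > SIGSTKSZ) ? min : SIGSTKSZ;
	stack_t ss = { .ss_size = sz, .ss_flags = 0 };

	ss.ss_sp = malloc(sz);
	if (!ss.ss_sp || sigaltstack(&ss, NULL)) {
		perror("sigaltstack");
		return 1;
	}
	printf("altstack of %zu bytes installed (AT_MINSIGSTKSZ=%lu)\n", sz, min);
	return 0;
}
```

Sizing the alternate stack this way avoids the silent corruption the static 2K SIGSTKSZ constant invites on AVX512-capable machines.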

**909489bf9f** Merge tag 'x86-asm-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
- Micro-optimize and standardize the do_syscall_64() calling convention
- Make syscall entry flags clearing more conservative
- Clean up syscall table handling
- Clean up & standardize assembly macros, in preparation of FRED
- Misc cleanups and fixes
* tag 'x86-asm-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Make <asm/asm.h> valid on cross-builds as well
x86/regs: Syscall_get_nr() returns -1 for a non-system call
x86/entry: Split PUSH_AND_CLEAR_REGS into two submacros
x86/syscall: Maximize MSR_SYSCALL_MASK
x86/syscall: Unconditionally prototype {ia32,x32}_sys_call_table[]
x86/entry: Reverse arguments to do_syscall_64()
x86/entry: Unify definitions from <asm/calling.h> and <asm/ptrace-abi.h>
x86/asm: Use _ASM_BYTES() in <asm/nops.h>
x86/asm: Add _ASM_BYTES() macro for a .byte ... opcode sequence
x86/asm: Have the __ASM_FORM macros handle commas in arguments

**fa8c84b77a** x86/cpu: Write the default PKRU value when enabling PKE
In preparation of making the PKRU management more independent from XSTATES, write the default PKRU value into the hardware right after enabling PKRU in CR4. This ensures that switch_to() and copy_thread() have the correct setting for the init task and the per-CPU idle threads right away.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.622983906@linutronix.de

**8a1dc55a3f** x86/cpu: Sanitize X86_FEATURE_OSPKE
X86_FEATURE_OSPKE is enabled first on the boot CPU and the feature flag is set. Secondary CPUs have to enable CR4.PKE as well and set their per-CPU feature flag. That's ineffective because all call sites have checks for boot_cpu_data.

Make it smarter and force the feature flag when PKU is enabled on the boot CPU, which then allows using cpu_feature_enabled(X86_FEATURE_OSPKE) all over the place. That either compiles the code out when PKEY support is disabled in Kconfig or uses a static_cpu_has() for the feature check, which makes a significant difference in hotpaths, e.g. context switch.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.305113644@linutronix.de

**ce38f038ed** x86/fpu: Get rid of fpu__get_supported_xfeatures_mask()
This function is really not doing what the comment advertises:

"Find supported xfeatures based on cpu features and command-line input. This must be called after fpu__init_parse_early_param() is called and xfeatures_mask is enumerated."

fpu__init_parse_early_param() does not exist anymore and the function just returns a constant. Remove it and fix the caller and get rid of further references to fpu__init_parse_early_param().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121451.816404717@linutronix.de

**b3607269ff** x86/pkeys: Revert a5eff72597 ("x86/pkeys: Add PKRU value to init_fpstate")
This cannot work and it's unclear how that ever made a difference. init_fpstate.xsave.header.xfeatures is always 0, so get_xsave_addr() will always return a NULL pointer, which will prevent storing the default PKRU value in init_fpstate.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121451.451391598@linutronix.de

**939ef71329** x86/signal: Introduce helpers to get the maximum signal frame size
Signal frames do not have a fixed format and can vary in size when a number of things change: supported XSAVE features, 32 vs. 64-bit apps, etc.

Add support for a runtime method for userspace to dynamically discover how large a signal stack needs to be. Introduce a new variable, max_frame_size, and helper functions for the calculation to be used in a new user interface. Set max_frame_size to a system-wide worst-case value, instead of storing multiple app-specific values.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H.J. Lu <hjl.tools@gmail.com>
Link: https://lkml.kernel.org/r/20210518200320.17239-3-chang.seok.bae@intel.com

**b1efd0ff4b** x86/cpu: Init AP exception handling from cpu_init_secondary()
SEV-ES guests require a properly set up task register with which the TSS descriptor in the GDT can be located, so that the IST-type #VC exception handler they need to function properly can be executed. This setup needs to happen before attempting to load microcode in ucode_cpu_init() on secondary CPUs, which can cause such #VC exceptions.

Simplify the machinery by running that exception setup from a new function cpu_init_secondary() and explicitly call cpu_init_exception_handling() for the boot CPU before cpu_init(). The latter prepares for fixing and simplifying the exception/IST setup on the boot CPU.

There should be no functional changes resulting from this patch.

[ tglx: Reworked it so cpu_init_exception_handling() stays separate ]

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lai Jiangshan <laijs@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/87k0o6gtvu.ffs@nanos.tec.linutronix.de

**6de4ac1d03** x86/syscall: Maximize MSR_SYSCALL_MASK
It is better to clear as many flags as possible when we do a system call entry, as opposed to the other way around. The fewer flags we keep, the lesser the possible interference between the kernel and user space.

The flags changed are:

- CF, PF, AF, ZF, SF, OF: these are arithmetic flags which affect branches, possibly speculatively. They should be cleared for the same reasons we now clear all GPRs on entry.

- RF: suppresses a code breakpoint on the subsequent instruction. It is probably impossible to enter the kernel with RF set, but if it is somehow not, it would break a kernel debugger setting a breakpoint on the entry point. Either way, user space should not be able to control kernel behavior here.

- ID: this flag has no direct effect (it is a scratch bit only). However, there is no reason to retain the user space value in the kernel, and the standard should be to clear unless needed, not the other way around.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510185316.3307264-5-hpa@zytor.com
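Concretely, the change amounts to widening the RFLAGS-clearing mask programmed during syscall setup. A sketch of the resulting write, using the flag macros as named in the kernel tree:

```c
/* Sketch: clear (almost) every clearable RFLAGS bit on SYSCALL entry. */
wrmsrl(MSR_SYSCALL_MASK,
       X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
       X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_TF |
       X86_EFLAGS_IF | X86_EFLAGS_DF | X86_EFLAGS_OF |
       X86_EFLAGS_IOPL | X86_EFLAGS_NT | X86_EFLAGS_RF |
       X86_EFLAGS_AC | X86_EFLAGS_ID);
```

The CPU ANDs RFLAGS with the complement of this mask on SYSCALL, so every listed bit enters the kernel cleared.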

**fc48a6d1fa** x86/cpu: Remove write_tsc() and write_rdtscp_aux() wrappers
Drop write_tsc() and write_rdtscp_aux(); the former has no users, and the latter has only a single user and is slightly misleading since the only in-kernel consumer of MSR_TSC_AUX is RDPID, not RDTSCP.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210504225632.1532621-3-seanjc@google.com

**b6b4fbd90b** x86/cpu: Initialize MSR_TSC_AUX if RDTSCP *or* RDPID is supported
Initialize MSR_TSC_AUX with CPU node information if RDTSCP or RDPID is
supported. This fixes a bug where vdso_read_cpunode() will read garbage
via RDPID if RDPID is supported but RDTSCP is not. While no known CPU
supports RDPID but not RDTSCP, both Intel's SDM and AMD's APM allow for
RDPID to exist without RDTSCP, e.g. it's technically a legal CPU model
for a virtual machine.
Note, technically MSR_TSC_AUX could be initialized if and only if RDPID
is supported since RDTSCP is currently not used to retrieve the CPU node.
But, the cost of the superfluous WRMSR is negligible, whereas leaving
MSR_TSC_AUX uninitialized is just asking for future breakage if someone
decides to utilize RDTSCP.
Fixes:
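For illustration, the cpu/node layout that vdso_read_cpunode() relies on can be read directly with RDPID in userspace (a sketch; it requires a CPU and assembler that know the instruction, and the bit split below follows the vDSO convention):

```c
#include <stdio.h>

/* CPU number lives in bits [11:0] of MSR_TSC_AUX, node in the bits above. */
static void rdpid_cpunode(unsigned int *cpu, unsigned int *node)
{
	unsigned long long v;

	asm volatile("rdpid %0" : "=r" (v));
	*cpu  = v & 0xfff;
	*node = v >> 12;
}

int main(void)
{
	unsigned int cpu, node;

	rdpid_cpunode(&cpu, &node);
	printf("running on cpu %u, node %u\n", cpu, node);
	return 0;
}
```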
|
|
|
|
c6536676c7 |
- turn the stack canary into a normal __percpu variable on 32-bit which
gets rid of the LAZY_GS stuff and a lot of code. - Add an insn_decode() API which all users of the instruction decoder should preferably use. Its goal is to keep the details of the instruction decoder away from its users and simplify and streamline how one decodes insns in the kernel. Convert its users to it. - kprobes improvements and fixes - Set the maximum DIE per package variable on Hygon - Rip out the dynamic NOP selection and simplify all the machinery around selecting NOPs. Use the simplified NOPs in objtool now too. - Add Xeon Sapphire Rapids to list of CPUs that support PPIN - Simplify the retpolines by folding the entire thing into an alternative now that objtool can handle alternatives with stack ops. Then, have objtool rewrite the call to the retpoline with the alternative which then will get patched at boot time. - Document Intel uarch per models in intel-family.h - Make Sub-NUMA Clustering topology the default and Cluster-on-Die the exception on Intel. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCHyJQACgkQEsHwGGHe VUpjiRAAwPZdwwp08ypZuMHR4EhLNru6gYhbAoALGgtYnQjLtn5onQhIeieK+R4L cmZpxHT9OFp5dXHk4kwygaQBsD4pPOiIpm60kye1dN3cSbOORRdkwEoQMpKMZ+5Y kvVsmn7lrwRbp600KdE4G6L5+N6gEgr0r6fMFWWGK3mgVAyCzPexVHgydcp131ch iYMo6/pPDcNkcV/hboVKgx7GISdQ7L356L1MAIW/Sxtw6uD/X4qGYW+kV2OQg9+t nQDaAo7a8Jqlop5W5TQUdMLKQZ1xK8SFOSX/nTS15DZIOBQOGgXR7Xjywn1chBH/ PHLwM5s4XF6NT5VlIA8tXNZjWIZTiBdldr1kJAmdDYacrtZVs2LWSOC0ilXsd08Z EWtvcpHfHEqcuYJlcdALuXY8xDWqf6Q2F7BeadEBAxwnnBg+pAEoLXI/1UwWcmsj wpaZTCorhJpYo2pxXckVdHz2z0LldDCNOXOjjaWU8tyaOBKEK6MgAaYU7e0yyENv mVc9n5+WuvXuivC6EdZ94Pcr/KQsd09ezpJYcVfMDGv58YZrb6XIEELAJIBTu2/B Ua8QApgRgetx+1FKb8X6eGjPl0p40qjD381TADb4rgETPb1AgKaQflmrSTIik+7p O+Eo/4x/GdIi9jFk3K+j4mIznRbUX0cheTJgXoiI4zXML9Jv94w= =bm4S -----END PGP SIGNATURE----- Merge tag 'x86_core_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 updates from Borislav Petkov: - Turn the stack canary into a normal __percpu variable on 32-bit which gets rid of the LAZY_GS stuff and a lot of code. - Add an insn_decode() API which all users of the instruction decoder should preferably use. Its goal is to keep the details of the instruction decoder away from its users and simplify and streamline how one decodes insns in the kernel. Convert its users to it. - kprobes improvements and fixes - Set the maximum DIE per package variable on Hygon - Rip out the dynamic NOP selection and simplify all the machinery around selecting NOPs. Use the simplified NOPs in objtool now too. - Add Xeon Sapphire Rapids to list of CPUs that support PPIN - Simplify the retpolines by folding the entire thing into an alternative now that objtool can handle alternatives with stack ops. Then, have objtool rewrite the call to the retpoline with the alternative which then will get patched at boot time. - Document Intel uarch per models in intel-family.h - Make Sub-NUMA Clustering topology the default and Cluster-on-Die the exception on Intel. 
* tag 'x86_core_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits) x86, sched: Treat Intel SNC topology as default, COD as exception x86/cpu: Comment Skylake server stepping too x86/cpu: Resort and comment Intel models objtool/x86: Rewrite retpoline thunk calls objtool: Skip magical retpoline .altinstr_replacement objtool: Cache instruction relocs objtool: Keep track of retpoline call sites objtool: Add elf_create_undef_symbol() objtool: Extract elf_symbol_add() objtool: Extract elf_strtab_concat() objtool: Create reloc sections implicitly objtool: Add elf_create_reloc() helper objtool: Rework the elf_rebuild_reloc_section() logic objtool: Fix static_call list generation objtool: Handle per arch retpoline naming objtool: Correctly handle retpoline thunk calls x86/retpoline: Simplify retpolines x86/alternatives: Optimize optimize_nops() x86: Add insn_decode_kernel() x86/kprobes: Move 'inline' to the beginning of the kprobe_is_ss() declaration ... |
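Of the items above, the insn_decode() API is the one most visible to other kernel code; a converted call site looks roughly like this (a sketch against the in-kernel decoder, with the kernel-mode buffer as an assumption):

```c
#include <asm/insn.h>

/* Decode one kernel-mode instruction and report its length. */
static int insn_len_at(const void *kaddr)
{
	struct insn insn;
	int ret;

	ret = insn_decode(&insn, kaddr, MAX_INSN_SIZE, INSN_MODE_KERN);
	if (ret < 0)
		return ret;	/* could not decode */

	return insn.length;
}
```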
|
|
|
64f8e73de0 |
Support for enhanced split lock detection:
Newer CPUs provide a second mechanism to detect operations with lock prefix which go across a cache line boundary. Such operations have to take a bus lock, which causes a system-wide performance degradation when these operations happen frequently. The new mechanism is not using the #AC exception. It triggers #DB and is restricted to operations in user space. Kernel side split lock access can only be detected by the #AC based variant. Contrary to the #AC based mechanism the #DB based variant triggers _after_ the instruction was executed. The mechanism is CPUID-enumerated and, contrary to the #AC version which is based on the magic TEST_CTRL_MSR and model/family based enumeration, is on the way to becoming architectural. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCGkr8THHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYodUKD/9tUXhInR7+1ykEHpMvdmSp48vqY3nc sKmT22pPl+OchnJ62mw3T8gKpBYVleJmcCaY2qVx7hfaVcWApLGJvX4tmfXmv422 XDSJ6b8Os6wfgx5FR//I17z8ZtXnnuKkPrTMoRsQUw2qLq31y6fdQv+GW/cc1Kpw mengjmPE+HnpaKbtuQfPdc4a+UvLjvzBMAlDZPTBPKYrP4FFqYVnUVwyTg5aLVDY gHz4V8+b502RS/zPfTAtE3J848od+NmcUPdFlcG9DVA+hR0Rl0thvruCTFiD2vVh i9DJ7INof5FoJDEzh0dGsD7x+MB6OY8GZyHdUMeGgIRPtWkqrG52feQQIn2YYlaL fB3DlpNv7NIJ/0JMlALvh8S0tEoOcYdHqH+M/3K/zbzecg/FAo+lVo8WciGLPqWs ykUG5/f/OnlTvgB8po1ebJu0h0jHnoK9heWWXk9zWIRVDPXHFOWKW3kSbTTb3icR 9hfjP/SNejpmt9Ju1OTwsgnV7NALIdVX+G5jyIEsjFl31Co1RZNYhHLFvi11FWlQ /ssvFK9O5ZkliocGCAN9+yuOnM26VqWSCE4fis6/2aSgD2Y4Gpvb//cP96SrcNAH u8eXNvGLlniJP3F3JImWIfIPQTrpvQhcU4eZ6NtviXqj/utQXX6c9PZ1PLYpcvUh 9AWF8rwhT8X4oA== =lmi8 -----END PGP SIGNATURE----- Merge tag 'x86-splitlock-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 bus lock detection updates from Thomas Gleixner: "Support for enhanced split lock detection: Newer CPUs provide a second mechanism to detect operations with lock prefix which go across a cache line boundary. Such operations have to take a bus lock, which causes a system-wide performance degradation when these operations happen frequently. The new mechanism is not using the #AC exception. It triggers #DB and is restricted to operations in user space. Kernel side split lock access can only be detected by the #AC based variant. Contrary to the #AC based mechanism the #DB based variant triggers _after_ the instruction was executed. The mechanism is CPUID-enumerated and, contrary to the #AC version which is based on the magic TEST_CTRL_MSR and model/family based enumeration, is on the way to becoming architectural" * tag 'x86-splitlock-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation/admin-guide: Change doc for split_lock_detect parameter x86/traps: Handle #DB for bus lock x86/cpufeatures: Enumerate #DB for bus lock detection |
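To see either detector fire, it is enough to run a lock-prefixed operation that straddles a cache line boundary; a minimal userspace trigger might look like this (illustrative only; the 64-byte line size is an assumption that holds on current x86 parts):

```c
#include <stdint.h>
#include <stdlib.h>

int main(void)
{
	/* 64-byte aligned buffer; put a 4-byte atomic at offset 62 so the
	   locked access crosses the cache line boundary. */
	char *buf = aligned_alloc(64, 128);
	if (!buf)
		return 1;

	/* Compiles to a LOCK-prefixed read-modify-write. With
	   split_lock_detect=warn the kernel warns once per task; with
	   fatal the process receives SIGBUS. */
	__atomic_fetch_add((uint32_t *)(buf + 62), 1, __ATOMIC_SEQ_CST);

	free(buf);
	return 0;
}
```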
|
|
|
ebb1064e7c |
x86/traps: Handle #DB for bus lock
Bus locks degrade performance for the whole system, not just for the CPU that requested the bus lock. Two CPU features "#AC for split lock" and "#DB for bus lock" provide hooks so that the operating system may choose one of several mitigation strategies. #AC for split lock is already implemented. Add code to use the #DB for bus lock feature to cover additional situations with new options to mitigate:

| split_lock_detect= | #AC for split lock | #DB for bus lock |
|---|---|---|
| off | Do nothing | Do nothing |
| warn | Kernel OOPs. Warn once per task and disable future checking | Warn once per task and continue to run. When both features are supported, warn in #AC |
| fatal | Kernel OOPs. Send SIGBUS to user | Send SIGBUS to user. When both features are supported, fatal in #AC |
| ratelimit:N | Do nothing | Limit bus lock rate to N per second in the current non-root user |

Default option is "warn". Hardware only generates #DB for bus lock detect when CPL>0 to avoid nested #DB from multiple bus locks while the first #DB is being handled, so there is no need to handle #DB for bus locks detected in the kernel. #DB for bus lock is enabled by bus lock detection bit 2 in the DEBUGCTL MSR, while #AC for split lock is enabled by split lock detection bit 29 in the TEST_CTRL MSR. Both a breakpoint and a bus lock in the same instruction can trigger one #DB. The bus lock is handled before the breakpoint in the #DB handler. Delivery of #DB for bus lock in userspace clears DR6[11], which is set by the #DB handler right after reading DR6. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210322135325.682257-3-fenghua.yu@intel.com |
|
|
|
1591584e2e |
x86/process/64: Move cpu_current_top_of_stack out of TSS
cpu_current_top_of_stack is currently stored in TSS.sp1. TSS is exposed through the cpu_entry_area which is visible with user CR3 when PTI is enabled and active. This makes it a coveted fruit for attackers. An attacker can fetch the kernel stack top from it and base the next steps of an attack on the kernel stack. But it actually does not need to be stored in the TSS. It is only accessed after the entry code switched to kernel CR3 and kernel GS_BASE, which means it can be in any regular percpu variable. The reason why it is in TSS is historical (pre PTI) because TSS is also used as scratch space in SYSCALL_64 and therefore cache hot. A syscall also needs the per CPU variable current_task and eventually __preempt_count, so placing cpu_current_top_of_stack next to them makes it likely that they end up in the same cache line which should avoid performance regressions. This is not enforced as the compiler is free to place these variables, so these entry relevant variables should move into a data structure to make this enforceable. The seccomp_benchmark doesn't show any performance loss in the "getpid native" test result. Actually, the result changes from 93ns before to 92ns with this change when KPTI is disabled. The test is very stable and although the test doesn't show a higher degree of precision it gives enough confidence that moving cpu_current_top_of_stack does not cause a regression. [ tglx: Removed unneeded export. Massaged changelog ] Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210125173444.22696-2-jiangshanlai@gmail.com |
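After the change, the stack top is an ordinary percpu variable; the accessor ends up looking roughly like this (a sketch of the idea, not the exact kernel code):

```c
/* Lives in regular percpu space instead of TSS.sp1, so it is no longer
 * reachable through the user-visible cpu_entry_area. */
DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);

static inline unsigned long current_top_of_stack(void)
{
	/* Only valid after the entry code has switched to kernel CR3 and
	 * kernel GSBASE. */
	return this_cpu_read_stable(cpu_current_top_of_stack);
}
```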
|
|
|
d9f6e12fb0 |
x86: Fix various typos in comments
Fix ~144 single-word typos in arch/x86/ code comments. Doing this in a single commit should reduce the churn. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-kernel@vger.kernel.org |
|
|
|
3fb0fdb3bb |
x86/stackprotector/32: Make the canary into a regular percpu variable
On 32-bit kernels, the stackprotector canary is quite nasty -- it is stored at %gs:(20), which is nasty because 32-bit kernels use %fs for percpu storage. It's even nastier because it means that whether %gs contains userspace state or kernel state while running kernel code depends on whether stackprotector is enabled (this is CONFIG_X86_32_LAZY_GS), and this setting radically changes the way that segment selectors work. Supporting both variants is a maintenance and testing mess. Merely rearranging so that percpu and the stack canary share the same segment would be messy as the 32-bit percpu address layout isn't currently compatible with putting a variable at a fixed offset. Fortunately, GCC 8.1 added options that allow the stack canary to be accessed as %fs:__stack_chk_guard, effectively turning it into an ordinary percpu variable. This lets us get rid of all of the code to manage the stack canary GDT descriptor and the CONFIG_X86_32_LAZY_GS mess. (That name is special. We could use any symbol we want for the %fs-relative mode, but for CONFIG_SMP=n, gcc refuses to let us use any name other than __stack_chk_guard.) Forcibly disable stackprotector on older compilers that don't support the new options and turn the stack canary into a percpu variable. The "lazy GS" approach is now used for all 32-bit configurations. Also makes load_gs_index() work on 32-bit kernels. On 64-bit kernels, it loads the GS selector and updates the user GSBASE accordingly. (This is unchanged.) On 32-bit kernels, it loads the GS selector and updates GSBASE, which is now always the user base. This means that the overall effect is the same on 32-bit and 64-bit, which avoids some ifdeffery. [ bp: Massage commit message. ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/c0ff7dba14041c7e5d1cae5d4df052f03759bef3.1613243844.git.luto@kernel.org |
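The mechanics reduce to a percpu variable plus three compiler options (a sketch; the option spellings follow GCC's x86 documentation, and get_random_canary() is the kernel's seeding helper):

```c
/* Built with:
 *   -mstack-protector-guard=tls
 *   -mstack-protector-guard-reg=fs
 *   -mstack-protector-guard-symbol=__stack_chk_guard
 * the compiler emits %fs:__stack_chk_guard checks against this variable. */
DECLARE_PER_CPU(unsigned long, __stack_chk_guard);

static inline void seed_stack_canary(void)
{
	this_cpu_write(__stack_chk_guard, get_random_canary());
}
```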
|
|
|
29c395c77a |
Rework of the X86 irq stack handling:
The irq stack switching was moved out of the ASM entry code in course of
the entry code consolidation. It ended up being suboptimal in various
ways.
- Make the stack switching inline so the stack pointer manipulation is no
longer at an easy-to-find place.
- Get rid of the unnecessary indirect call.
- Avoid the double stack switching in interrupt return and reuse the
interrupt stack for softirq handling.
- An objtool fix for CONFIG_FRAME_POINTER=y builds where it got confused
about the stack pointer manipulation.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmA21OcTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaX0D/9S0ud6oqbsIvI8LwhvYub63a2cjKP9
liHAJ7xwMYYVwzf0skwsPb/QE6+onCzdq0upJkgG/gEYm2KbiaMWZ4GgHdj0O7ER
qXKJONDd36AGxSEdaVzLY5kPuD/mkomGk5QdaZaTmjruthkNzg4y/N2wXUBIMZR0
FdpSpp5fGspSZCn/DXDx6FjClwpLI53VclvDs6DcZ2DIBA0K+F/cSLb1UQoDLE1U
hxGeuNa+GhKeeZ5C+q5giho1+ukbwtjMW9WnKHAVNiStjm0uzdqq7ERGi/REvkcB
LY62u5uOSW1zIBMmzUjDDQEqvypB0iFxFCpN8g9sieZjA0zkaUioRTQyR+YIQ8Cp
l8LLir0dVQivR1bHghHDKQJUpdw/4zvDj4mMH10XHqbcOtIxJDOJHC5D00ridsAz
OK0RlbAJBl9FTdLNfdVReBCoehYAO8oefeyMAG12nZeSh5XVUWl238rvzmzIYNhG
cEtkSx2wIUNEA+uSuI+xvfmwpxL7voTGvqmiRDCAFxyO7Bl/GBu9OEBFA1eOvHB+
+wTmPDMswRetQNh4QCRXzk1JzP1Wk5CobUL9iinCWFoTJmnsPPSOWlosN6ewaNXt
kYFpRLy5xt9EP7dlfgBSjiRlthDhTdMrFjD5bsy1vdm1w7HKUo82lHa4O8Hq3PHS
tinKICUqRsbjig==
=Sqr1
-----END PGP SIGNATURE-----
Merge tag 'x86-entry-2021-02-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 irq entry updates from Thomas Gleixner:
"The irq stack switching was moved out of the ASM entry code in course
of the entry code consolidation. It ended up being suboptimal in
various ways.
This reworks the X86 irq stack handling:
- Make the stack switching inline so the stack pointer manipulation is
no longer at an easy-to-find place.
- Get rid of the unnecessary indirect call.
- Avoid the double stack switching in interrupt return and reuse the
interrupt stack for softirq handling.
- An objtool fix for CONFIG_FRAME_POINTER=y builds where it got
confused about the stack pointer manipulation"
* tag 'x86-entry-2021-02-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix stack-swizzle for FRAME_POINTER=y
um: Enforce the usage of asm-generic/softirq_stack.h
x86/softirq/64: Inline do_softirq_own_stack()
softirq: Move do_softirq_own_stack() to generic asm header
softirq: Move __ARCH_HAS_DO_SOFTIRQ to Kconfig
x86: Select CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
x86/softirq: Remove indirection in do_softirq_own_stack()
x86/entry: Use run_sysvec_on_irqstack_cond() for XEN upcall
x86/entry: Convert device interrupts to inline stack switching
x86/entry: Convert system vectors to irq stack macro
x86/irq: Provide macro for inlining irq stack switching
x86/apic: Split out spurious handling code
x86/irq/64: Adjust the per CPU irq stack pointer by 8
x86/irq: Sanitize irq stack tracking
x86/entry: Fix instrumentation annotation
|
|
|
|
951c2a51ae |
x86/irq/64: Adjust the per CPU irq stack pointer by 8
The per CPU hardirq_stack_ptr contains the pointer to the irq stack in the
form that it is ready to be assigned to [ER]SP so that the first push ends
up on the top entry of the stack.
But the stack switching on 64 bit has the following rules:
1) Store the current stack pointer (RSP) in the top most stack entry
to allow the unwinder to link back to the previous stack
2) Set RSP to the top most stack entry
3) Invoke functions on the irq stack
4) Pop RSP from the top most stack entry (stored in #1) so it's back
to the original stack.
That requires all stack switching code to decrement the stored pointer by 8
in order to be able to store the current RSP and then set RSP to that
location. That's a pointless exercise.
Do the -8 adjustment right when storing the pointer and make the data type
a void pointer to avoid confusion vs. the struct irq_stack data type, which
on 64-bit is only used to declare the backing store. Move the definition
next to the inuse flag so they likely end up in the same cache
line. Sticking them into a struct to enforce it is a separate change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002512.354260928@linutronix.de
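The four rules map onto a short switch sequence; a sketch of the idea follows (the real kernel wraps this in the call_on_irqstack machinery, so treat the names and constraints here as illustrative):

```c
/* tos is the per-CPU hardirq_stack_ptr, already adjusted by -8. */
static void run_on_irqstack(void (*func)(void), void *tos)
{
	asm volatile("movq %%rsp, (%[tos])\n\t"	/* 1) save RSP in top entry */
		     "movq %[tos], %%rsp\n\t"	/* 2) RSP = top entry       */
		     "call *%[func]\n\t"	/* 3) run on the irq stack  */
		     "popq %%rsp"		/* 4) back to the old stack */
		     : : [tos] "r" (tos), [func] "r" (func)
		     : "rax", "rcx", "rdx", "rsi", "rdi",
		       "r8", "r9", "r10", "r11", "cc", "memory");
}
```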
|
|
|
|
e7f8900179 |
x86/irq: Sanitize irq stack tracking
The recursion protection for hard interrupt stacks is a per-CPU unsigned int variable named __irq_count, initialized to -1. The irq stack switching is only done when the variable is -1, which creates worse code than just checking for 0. When the stack switching happens it uses this_cpu_add/sub(1), but there is no reason to do so: it can simply use straight writes. This is a historical leftover from the low-level ASM code which used inc and jz to make a decision. Rename it to hardirq_stack_inuse, make it a bool and use plain stores. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210210002512.228830141@linutronix.de |
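The resulting pattern is plain per-CPU stores around the stack switch (a sketch; the flag name matches the commit, the surrounding logic is simplified):

```c
DECLARE_PER_CPU(bool, hardirq_stack_inuse);

/* Interrupts are off and the flag is strictly per-CPU, so straight
 * reads/writes suffice -- no add/sub, no atomics. */
static void handle_irq_on_irqstack(void (*func)(void))
{
	if (!this_cpu_read(hardirq_stack_inuse)) {
		this_cpu_write(hardirq_stack_inuse, true);
		func();		/* would run on the irq stack here */
		this_cpu_write(hardirq_stack_inuse, false);
	} else {
		func();		/* already on the irq stack: stay on it */
	}
}
```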
|
|
|
fb35d30fe5 |
x86/cpufeatures: Assign dedicated feature word for CPUID_0x8000001F[EAX]
Collect the scattered SME/SEV related feature flags into a dedicated word. There are now five recognized features in CPUID.0x8000001F.EAX, with at least one more on the horizon (SEV-SNP). Using a dedicated word allows KVM to use its automagic CPUID adjustment logic when reporting the set of supported features to userspace. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Brijesh Singh <brijesh.singh@amd.com> Link: https://lkml.kernel.org/r/20210122204047.2860075-2-seanjc@google.com |
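The leaf being collected can be probed with the kernel's cpuid() helper (a sketch; the bit positions follow AMD's APM and are assumptions here, e.g. bit 0 SME, bit 1 SEV, bit 3 SEV-ES, bit 4 SEV-SNP):

```c
#include <linux/printk.h>
#include <asm/processor.h>

static void report_sev_features(void)
{
	unsigned int eax, ebx, ecx, edx;

	cpuid(0x8000001f, &eax, &ebx, &ecx, &edx);
	pr_info("SME=%u SEV=%u SEV-ES=%u SEV-SNP=%u\n",
		eax & 1, (eax >> 1) & 1, (eax >> 3) & 1, (eax >> 4) & 1);
}
```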