Commit Graph

149 Commits

Author SHA1 Message Date
Linus Torvalds 51d90a15fe ARM:
- Support for userspace handling of synchronous external aborts (SEAs),
   allowing the VMM to potentially handle the abort in a non-fatal
   manner.
 
 - Large rework of the VGIC's list register handling with the goal of
   supporting more active/pending IRQs than available list registers in
   hardware. In addition, the VGIC now supports EOImode==1 style
   deactivations for IRQs which may occur on a separate vCPU than the
   one that acked the IRQ.
 
 - Support for FEAT_XNX (user / privileged execute permissions) and
   FEAT_HAF (hardware update to the Access Flag) in the software page
   table walkers and shadow MMU.
 
 - Allow page table destruction to reschedule, fixing long need_resched
   latencies observed when destroying a large VM.
 
 - Minor fixes to KVM and selftests
 
 Loongarch:
 
 - Get VM PMU capability from HW GCFG register.
 
 - Add AVEC basic support.
 
 - Use 64-bit register definition for EIOINTC.
 
 - Add KVM timer test cases for tools/selftests.
 
 RISC/V:
 
 - SBI message passing (MPXY) support for KVM guest
 
 - Give a new, more specific error subcode for the case when in-kernel
   AIA virtualization fails to allocate IMSIC VS-file
 
 - Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
   in small chunks
 
 - Fix guest page fault within HLV* instructions
 
 - Flush VS-stage TLB after VCPU migration for Andes cores
 
 s390:
 
 - Always allocate ESCA (Extended System Control Area), instead of
   starting with the basic SCA and converting to ESCA with the
   addition of the 65th vCPU.  The price is increased number of
   exits (and worse performance) on z10 and earlier processor;
   ESCA was introduced by z114/z196 in 2010.
 
 - VIRT_XFER_TO_GUEST_WORK support
 
 - Operation exception forwarding support
 
 - Cleanups
 
 x86:
 
 - Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO SPTE
   caching is disabled, as there can't be any relevant SPTEs to zap.
 
 - Relocate a misplaced export.
 
 - Fix an async #PF bug where KVM would clear the completion queue when the
   guest transitioned in and out of paging mode, e.g. when handling an SMI and
   then returning to paged mode via RSM.
 
 - Leave KVM's user-return notifier registered even when disabling
   virtualization, as long as kvm.ko is loaded.  On reboot/shutdown, keeping
   the notifier registered is ok; the kernel does not use the MSRs and the
   callback will run cleanly and restore host MSRs if the CPU manages to
   return to userspace before the system goes down.
 
 - Use the checked version of {get,put}_user().
 
 - Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC
   timers can result in a hard lockup in the host.
 
 - Revert the periodic kvmclock sync logic now that KVM doesn't use a
   clocksource that's subject to NTP corrections.
 
 - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter
   behind CONFIG_CPU_MITIGATIONS.
 
 - Context switch XCR0, XSS, and PKRU outside of the entry/exit fast path;
   the only reason they were handled in the fast path was to paper of a bug
   in the core #MC code, and that has long since been fixed.
 
 - Add emulator support for AVX MOV instructions, to play nice with emulated
   devices whose guest drivers like to access PCI BARs with large multi-byte
   instructions.
 
 x86 (AMD):
 
 - Fix a few missing "VMCB dirty" bugs.
 
 - Fix the worst of KVM's lack of EFER.LMSLE emulation.
 
 - Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
 
 - Fix incorrect handling of selective CR0 writes when checking intercepts
   during emulation of L2 instructions.
 
 - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on
   VMRUN and #VMEXIT.
 
 - Fix a bug where KVM corrupt the guest code stream when re-injecting a soft
   interrupt if the guest patched the underlying code after the VM-Exit, e.g.
   when Linux patches code with a temporary INT3.
 
 - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to
   userspace, and extend KVM "support" to all policy bits that don't require
   any actual support from KVM.
 
 x86 (Intel):
 
 - Use the root role from kvm_mmu_page to construct EPTPs instead of the
   current vCPU state, partly as worthwhile cleanup, but mostly to pave the
   way for tracking per-root TLB flushes, and elide EPT flushes on pCPU
   migration if the root is clean from a previous flush.
 
 - Add a few missing nested consistency checks.
 
 - Rip out support for doing "early" consistency checks via hardware as the
   functionality hasn't been used in years and is no longer useful in general;
   replace it with an off-by-default module param to WARN if hardware fails
   a check that KVM does not perform.
 
 - Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32]
   on VM-Enter.
 
 - Misc cleanups.
 
 - Overhaul the TDX code to address systemic races where KVM (acting on behalf
   of userspace) could inadvertantly trigger lock contention in the TDX-Module;
   KVM was either working around these in weird, ugly ways, or was simply
   oblivious to them (though even Yan's devilish selftests could only break
   individual VMs, not the host kernel)
 
 - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a TDX vCPU,
   if creating said vCPU failed partway through.
 
 - Fix a few sparse warnings (bad annotation, 0 != NULL).
 
 - Use struct_size() to simplify copying TDX capabilities to userspace.
 
 - Fix a bug where TDX would effectively corrupt user-return MSR values if the
   TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected.
 
 Selftests:
 
 - Fix a math goof in mmu_stress_test when running on a single-CPU system/VM.
 
 - Forcefully override ARCH from x86_64 to x86 to play nice with specifying
   ARCH=x86_64 on the command line.
 
 - Extend a bunch of nested VMX to validate nested SVM as well.
 
 - Add support for LA57 in the core VM_MODE_xxx macro, and add a test to
   verify KVM can save/restore nested VMX state when L1 is using 5-level
   paging, but L2 is not.
 
 - Clean up the guest paging code in anticipation of sharing the core logic for
   nested EPT and nested NPT.
 
 guest_memfd:
 
 - Add NUMA mempolicy support for guest_memfd, and clean up a variety of
   rough edges in guest_memfd along the way.
 
 - Define a CLASS to automatically handle get+put when grabbing a guest_memfd
   from a memslot to make it harder to leak references.
 
 - Enhance KVM selftests to make it easer to develop and debug selftests like
   those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs
   often result in hard-to-debug SIGBUS errors.
 
 - Misc cleanups.
 
 Generic:
 
 - Use the recently-added WQ_PERCPU when creating the per-CPU workqueue for
   irqfd cleanup.
 
 - Fix a goof in the dirty ring documentation.
 
 - Fix choice of target for directed yield across different calls to
   kvm_vcpu_on_spin(); the function was always starting from the first
   vCPU instead of continuing the round-robin search.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCgAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmkvMa8UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMlFwf+Ow7zOYUuELSQ+Jn+hOYXiCNrdBDx
 ZamvMU8kLPr7XX0Zog6HgcMm//qyA6k5nSfqCjfsQZrIhRA/gWJ61jz1OX/Jxq18
 pJ9Vz6epnEPYiOtBwz+v8OS8MqDqVNzj2i6W1/cLPQE50c1Hhw64HWS5CSxDQiHW
 A7PVfl5YU12lW1vG3uE0sNESDt4Eh/spNM17iddXdF4ZUOGublserjDGjbc17E7H
 8BX3DkC2plqkJKwtjg0ae62hREkITZZc7RqsnftUkEhn0N0H9+rb6NKUyzIVh9NZ
 bCtCjtrKN9zfZ0Mujnms3ugBOVqNIputu/DtPnnFKXtXWSrHrgGSNv5ewA==
 =PEcw
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:

   - Support for userspace handling of synchronous external aborts
     (SEAs), allowing the VMM to potentially handle the abort in a
     non-fatal manner

   - Large rework of the VGIC's list register handling with the goal of
     supporting more active/pending IRQs than available list registers
     in hardware. In addition, the VGIC now supports EOImode==1 style
     deactivations for IRQs which may occur on a separate vCPU than the
     one that acked the IRQ

   - Support for FEAT_XNX (user / privileged execute permissions) and
     FEAT_HAF (hardware update to the Access Flag) in the software page
     table walkers and shadow MMU

   - Allow page table destruction to reschedule, fixing long
     need_resched latencies observed when destroying a large VM

   - Minor fixes to KVM and selftests

  Loongarch:

   - Get VM PMU capability from HW GCFG register

   - Add AVEC basic support

   - Use 64-bit register definition for EIOINTC

   - Add KVM timer test cases for tools/selftests

  RISC/V:

   - SBI message passing (MPXY) support for KVM guest

   - Give a new, more specific error subcode for the case when in-kernel
     AIA virtualization fails to allocate IMSIC VS-file

   - Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
     in small chunks

   - Fix guest page fault within HLV* instructions

   - Flush VS-stage TLB after VCPU migration for Andes cores

  s390:

   - Always allocate ESCA (Extended System Control Area), instead of
     starting with the basic SCA and converting to ESCA with the
     addition of the 65th vCPU. The price is increased number of exits
     (and worse performance) on z10 and earlier processor; ESCA was
     introduced by z114/z196 in 2010

   - VIRT_XFER_TO_GUEST_WORK support

   - Operation exception forwarding support

   - Cleanups

  x86:

   - Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO
     SPTE caching is disabled, as there can't be any relevant SPTEs to
     zap

   - Relocate a misplaced export

   - Fix an async #PF bug where KVM would clear the completion queue
     when the guest transitioned in and out of paging mode, e.g. when
     handling an SMI and then returning to paged mode via RSM

   - Leave KVM's user-return notifier registered even when disabling
     virtualization, as long as kvm.ko is loaded. On reboot/shutdown,
     keeping the notifier registered is ok; the kernel does not use the
     MSRs and the callback will run cleanly and restore host MSRs if the
     CPU manages to return to userspace before the system goes down

   - Use the checked version of {get,put}_user()

   - Fix a long-lurking bug where KVM's lack of catch-up logic for
     periodic APIC timers can result in a hard lockup in the host

   - Revert the periodic kvmclock sync logic now that KVM doesn't use a
     clocksource that's subject to NTP corrections

   - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the
     latter behind CONFIG_CPU_MITIGATIONS

   - Context switch XCR0, XSS, and PKRU outside of the entry/exit fast
     path; the only reason they were handled in the fast path was to
     paper of a bug in the core #MC code, and that has long since been
     fixed

   - Add emulator support for AVX MOV instructions, to play nice with
     emulated devices whose guest drivers like to access PCI BARs with
     large multi-byte instructions

  x86 (AMD):

   - Fix a few missing "VMCB dirty" bugs

   - Fix the worst of KVM's lack of EFER.LMSLE emulation

   - Add AVIC support for addressing 4k vCPUs in x2AVIC mode

   - Fix incorrect handling of selective CR0 writes when checking
     intercepts during emulation of L2 instructions

   - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32]
     on VMRUN and #VMEXIT

   - Fix a bug where KVM corrupt the guest code stream when re-injecting
     a soft interrupt if the guest patched the underlying code after the
     VM-Exit, e.g. when Linux patches code with a temporary INT3

   - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits
     to userspace, and extend KVM "support" to all policy bits that
     don't require any actual support from KVM

  x86 (Intel):

   - Use the root role from kvm_mmu_page to construct EPTPs instead of
     the current vCPU state, partly as worthwhile cleanup, but mostly to
     pave the way for tracking per-root TLB flushes, and elide EPT
     flushes on pCPU migration if the root is clean from a previous
     flush

   - Add a few missing nested consistency checks

   - Rip out support for doing "early" consistency checks via hardware
     as the functionality hasn't been used in years and is no longer
     useful in general; replace it with an off-by-default module param
     to WARN if hardware fails a check that KVM does not perform

   - Fix a currently-benign bug where KVM would drop the guest's
     SPEC_CTRL[63:32] on VM-Enter

   - Misc cleanups

   - Overhaul the TDX code to address systemic races where KVM (acting
     on behalf of userspace) could inadvertantly trigger lock contention
     in the TDX-Module; KVM was either working around these in weird,
     ugly ways, or was simply oblivious to them (though even Yan's
     devilish selftests could only break individual VMs, not the host
     kernel)

   - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a
     TDX vCPU, if creating said vCPU failed partway through

   - Fix a few sparse warnings (bad annotation, 0 != NULL)

   - Use struct_size() to simplify copying TDX capabilities to userspace

   - Fix a bug where TDX would effectively corrupt user-return MSR
     values if the TDX Module rejects VP.ENTER and thus doesn't clobber
     host MSRs as expected

  Selftests:

   - Fix a math goof in mmu_stress_test when running on a single-CPU
     system/VM

   - Forcefully override ARCH from x86_64 to x86 to play nice with
     specifying ARCH=x86_64 on the command line

   - Extend a bunch of nested VMX to validate nested SVM as well

   - Add support for LA57 in the core VM_MODE_xxx macro, and add a test
     to verify KVM can save/restore nested VMX state when L1 is using
     5-level paging, but L2 is not

   - Clean up the guest paging code in anticipation of sharing the core
     logic for nested EPT and nested NPT

  guest_memfd:

   - Add NUMA mempolicy support for guest_memfd, and clean up a variety
     of rough edges in guest_memfd along the way

   - Define a CLASS to automatically handle get+put when grabbing a
     guest_memfd from a memslot to make it harder to leak references

   - Enhance KVM selftests to make it easer to develop and debug
     selftests like those added for guest_memfd NUMA support, e.g. where
     test and/or KVM bugs often result in hard-to-debug SIGBUS errors

   - Misc cleanups

  Generic:

   - Use the recently-added WQ_PERCPU when creating the per-CPU
     workqueue for irqfd cleanup

   - Fix a goof in the dirty ring documentation

   - Fix choice of target for directed yield across different calls to
     kvm_vcpu_on_spin(); the function was always starting from the first
     vCPU instead of continuing the round-robin search"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits)
  KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
  KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
  KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
  KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
  KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
  KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
  KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
  KVM: arm64: selftests: Add test for AT emulation
  KVM: arm64: nv: Expose hardware access flag management to NV guests
  KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
  KVM: arm64: Implement HW access flag management in stage-1 SW PTW
  KVM: arm64: Propagate PTW errors up to AT emulation
  KVM: arm64: Add helper for swapping guest descriptor
  KVM: arm64: nv: Use pgtable definitions in stage-2 walk
  KVM: arm64: Handle endianness in read helper for emulated PTW
  KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
  KVM: arm64: Call helper for reading descriptors directly
  KVM: arm64: nv: Advertise support for FEAT_XNX
  KVM: arm64: Teach ptdump about FEAT_XNX permissions
  KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions
  ...
2025-12-05 17:01:20 -08:00
Heiko Carstens d0139059e3 KVM: s390: Enable and disable interrupts in entry code
Move enabling and disabling of interrupts around the SIE instruction to
entry code. Enabling interrupts only after the __TI_sie flag has been set
guarantees that the SIE instruction is not executed if an interrupt happens
between enabling interrupts and the execution of the SIE instruction.
Interrupt handlers and machine check handler forward the PSW to the
sie_exit label in such cases.

This is a prerequisite for VIRT_XFER_TO_GUEST_WORK to prevent that guest
context is entered when e.g. a scheduler IPI, indicating that a reschedule
is required, happens right before the SIE instruction, which could lead to
long delays.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2025-11-27 15:39:46 +01:00
Heiko Carstens f5730d44e0 s390: Add stackprotector support
Stackprotector support was previously unavailable on s390 because by
default compilers generate code which is not suitable for the kernel:
the canary value is accessed via thread local storage, where the address
of thread local storage is within access registers 0 and 1.

Using those registers also for the kernel would come with a significant
performance impact and more complicated kernel entry/exit code, since
access registers contents would have to be exchanged on every kernel entry
and exit.

With the upcoming gcc 16 release new compiler options will become available
which allow to generate code suitable for the kernel. [1]

Compiler option -mstack-protector-guard=global instructs gcc to generate
stackprotector code that refers to a global stackprotector canary value via
symbol __stack_chk_guard. Access to this value is guaranteed to occur via
larl and lgrl instructions.

Furthermore, compiler option -mstack-protector-guard-record generates a
section containing all code addresses that reference the canary value.

To allow for per task canary values the instructions which load the address
of __stack_chk_guard are patched so they access a lowcore field instead: a
per task canary value is available within the task_struct of each task, and
is written to the per-cpu lowcore location on each context switch.

Also add sanity checks and debugging option to be consistent with other
kernel code patching mechanisms.

Full debugging output can be enabled with the following kernel command line
options:

debug_stackprotector
bootdebug
ignore_loglevel
earlyprintk
dyndbg="file stackprotector.c +p"

Example debug output:

stackprot: 0000021e402d4eda: c010005a9ae3 -> c01f00070240

where "<insn address>: <old insn> -> <new insn>".

[1] gcc commit 0cd1f03939d5 ("s390: Support global stack protector")

Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-24 11:45:21 +01:00
Menglong Dong 35561bab76 arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c
The include/generated/asm-offsets.h is generated in Kbuild during
compiling from arch/SRCARCH/kernel/asm-offsets.c. When we want to
generate another similar offset header file, circular dependency can
happen.

For example, we want to generate a offset file include/generated/test.h,
which is included in include/sched/sched.h. If we generate asm-offsets.h
first, it will fail, as include/sched/sched.h is included in asm-offsets.c
and include/generated/test.h doesn't exist; If we generate test.h first,
it can't success neither, as include/generated/asm-offsets.h is included
by it.

In x86_64, the macro COMPILE_OFFSETS is used to avoid such circular
dependency. We can generate asm-offsets.h first, and if the
COMPILE_OFFSETS is defined, we don't include the "generated/test.h".

And we define the macro COMPILE_OFFSETS for all the asm-offsets.c for this
purpose.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-25 09:57:15 +02:00
Heiko Carstens 8b72f5a97b s390/mm: Reimplement lazy ASCE handling
Reduce system call overhead time (round trip time for invoking a
non-existent system call) by 25%.

With the removal of set_fs() [1] lazy control register handling was removed
in order to keep kernel entry and exit simple. However this made system
calls slower.

With the conversion to generic entry [2] and numerous follow up changes
which simplified the entry code significantly, adding support for lazy asce
handling doesn't add much complexity to the entry code anymore.

In particular this means:

- On kernel entry the primary asce is not modified and contains the user
  asce

- Kernel accesses which require secondary-space mode (for example futex
  operations) are surrounded by enable_sacf_uaccess() and
  disable_sacf_uaccess() calls. enable_sacf_uaccess() sets the primary asce
  to kernel asce so that the sacf instruction can be used to switch to
  secondary-space mode. The primary asce is changed back to user asce with
  disable_sacf_uaccess().

The state of the control register which contains the primary asce is
reflected with a new TIF_ASCE_PRIMARY bit. This is required on context
switch so that the correct asce is restored for the scheduled in process.

In result address spaces are now setup like this:

CPU running in               | %cr1 ASCE | %cr7 ASCE | %cr13 ASCE
-----------------------------|-----------|-----------|-----------
user space                   |  user     |  user     |  kernel
kernel (no sacf)             |  user     |  user     |  kernel
kernel (during sacf uaccess) |  kernel   |  user     |  kernel
kernel (kvm guest execution) |  guest    |  user     |  kernel

In result cr1 control register content is not changed except for:
- futex system calls
- legacy s390 PCI system calls
- the kvm specific cmpxchg_user_key() uaccess helper

This leads to faster system call execution.

[1] 87d5986345 ("s390/mm: remove set_fs / rework address space handling")
[2] 56e62a7370 ("s390: convert to generic entry")

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-04-14 11:23:21 +02:00
Heiko Carstens b9be1bee2f s390/asm-offsets: Remove ASM_OFFSETS_C
Remove ASM_OFFSETS_C which is used as guard in thread_info.h to decide if
asm-offsets can be included or not.

There is no reason to include asm-offsets.h in thread_info.h anymore.
Remove the define and the not needed include. Explicitly include
asm-offsets.h in all header files which require it, and where it used
to be included implicitly via thread_info.h.

This reduces header dependencies.

Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-31 12:20:39 +02:00
Heiko Carstens 5eeec56945 s390/asm-offsets: Include ftrace_regs.h instead of ftrace.h
Reduce header dependencies by including ftrace_regs.h and ptrace.h,
which does not include other header files, instead of ftrace.h which
pulls in various other header files.

This is sufficient for __FTRACE_REGS_PT_REGS and __FTRACE_REGS_SIZE.

Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-31 12:20:39 +02:00
Heiko Carstens d104937874 s390/kvm: Split kvm_host header file
In order to generate asm offsets into kvm_s390_sie_block linux/kvm_host.h
is included in asm-offsets.c. This causes quite often header dependency
problems, since linux/kvm_host.h pulls in a lot of other header files.

Solve this problem and split out the hardware structure declarations into a
separate header file. Include only the new header file into asm-offsets.c
instead of linux/kvm_host.h. This is sufficient to generate the two asm
offsets required for kvm (__SIE_PROG0C and __SIE_PROG20).

Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-31 12:20:39 +02:00
Heiko Carstens b10ac5d77c s390/boot: Pass pt_regs to program check handler
Setup a pt_regs structure on the stack, poplulate it in low level assembler
code, and pass it to print_pgm_check_info(). This way there is no need to
access then lowcore from print_pgm_check_info() anymore, and the function
looks like a normal program check handler function.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-04 17:25:22 +01:00
Heiko Carstens a0a8f2b219 s390/asm-offsets: Rename __LC_PGM_INT_CODE
Avoid confusion and rename __LC_PGM_INT_CODE since it correlates to the
pgm_code member of struct lowcore, and not the pgm_int_code member.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-04 17:25:22 +01:00
Heiko Carstens f931f67cfc s390/time: Convert MACHINE_HAS_SCC to machine_has_scc()
Use static branch(es) to implement and use machine_has_scc() instead
of a runtime check via MACHINE_HAS_SCC.

This comes with a cleanup of early time initialization: the initial
tod_clock_base value is now passed via the bootdata mechanism, instead
of using absolute lowcore as transport vehicle from the decompressor
to the kernel.

Also the early tod clock initialization is moved to the decompressor
which allows to use a static branch with machine_has_scc() within the
kernel.

Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-04 17:18:06 +01:00
Masami Hiramatsu (Google) a3ed4157b7 fgraph: Replace fgraph_ret_regs with ftrace_regs
Use ftrace_regs instead of fgraph_ret_regs for tracing return value
on function_graph tracer because of simplifying the callback interface.

The CONFIG_HAVE_FUNCTION_GRAPH_RETVAL is also replaced by
CONFIG_HAVE_FUNCTION_GRAPH_FREGS.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf <bpf@vger.kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/173518991508.391279.16635322774382197642.stgit@devnote2
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:50:02 -05:00
Linus Torvalds aad3a0d084 ftrace updates for v6.13:
- Merged tag ftrace-v6.12-rc4
 
   There was a fix to locking in register_ftrace_graph() for shadow stacks
   that was sent upstream. But this code was also being rewritten, and the
   locking fix was needed. Merging this fix was required to continue the
   work.
 
 - Restructure the function graph shadow stack to prepare it for use with
   kretprobes
 
   With the goal of merging the shadow stack logic of function graph and
   kretprobes, some more restructuring of the function shadow stack is
   required.
 
   Move out function graph specific fields from the fgraph infrastructure and
   store it on the new stack variables that can pass data from the entry
   callback to the exit callback.
 
   Hopefully, with this change, the merge of kretprobes to use fgraph shadow
   stacks will be ready by the next merge window.
 
 - Make shadow stack 4k instead of using PAGE_SIZE.
 
   Some architectures have very large PAGE_SIZE values which make its use for
   shadow stacks waste a lot of memory.
 
 - Give shadow stacks its own kmem cache.
 
   When function graph is started, every task on the system gets a shadow
   stack. In the future, shadow stacks may not be 4K in size. Have it have
   its own kmem cache so that whatever size it becomes will still be
   efficient in allocations.
 
 - Initialize profiler graph ops as it will be needed for new updates to fgraph
 
 - Convert to use guard(mutex) for several ftrace and fgraph functions
 
 - Add more comments and documentation
 
 - Show function return address in function graph tracer
 
   Add an option to show the caller of a function at each entry of the
   function graph tracer, similar to what the function tracer does.
 
 - Abstract out ftrace_regs from being used directly like pt_regs
 
   ftrace_regs was created to store a partial pt_regs. It holds only the
   registers and stack information to get to the function arguments and
   return values. On several archs, it is simply a wrapper around pt_regs.
   But some users would access ftrace_regs directly to get the pt_regs which
   will not work on all archs. Make ftrace_regs an abstract structure that
   requires all access to its fields be through accessor functions.
 
 - Show how long it takes to do function code modifications
 
   When code modification for function hooks happen, it always had the time
   recorded in how long it took to do the conversion. But this value was
   never exported. Recently the code was touched due to new ROX modification
   handling that caused a large slow down in doing the modifications and
   had a significant impact on boot times.
 
   Expose the timings in the dyn_ftrace_total_info file. This file was
   created a while ago to show information about memory usage and such to
   implement dynamic function tracing. It's also an appropriate file to store
   the timings of this modification as well. This will make it easier to see
   the impact of changes to code modification on boot up timings.
 
 - Other clean ups and small fixes
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZztrUxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qnnNAQD6w4q9VQ7oOE2qKLqtnj87h4c1GqKn
 SPkpEfC3n/ATEAD/fnYjT/eOSlHiGHuD/aTA+U/bETrT99bozGM/4mFKEgY=
 =6nCa
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace updates from Steven Rostedt:

 - Restructure the function graph shadow stack to prepare it for use
   with kretprobes

   With the goal of merging the shadow stack logic of function graph and
   kretprobes, some more restructuring of the function shadow stack is
   required.

   Move out function graph specific fields from the fgraph
   infrastructure and store it on the new stack variables that can pass
   data from the entry callback to the exit callback.

   Hopefully, with this change, the merge of kretprobes to use fgraph
   shadow stacks will be ready by the next merge window.

 - Make shadow stack 4k instead of using PAGE_SIZE.

   Some architectures have very large PAGE_SIZE values which make its
   use for shadow stacks waste a lot of memory.

 - Give shadow stacks its own kmem cache.

   When function graph is started, every task on the system gets a
   shadow stack. In the future, shadow stacks may not be 4K in size.
   Have it have its own kmem cache so that whatever size it becomes will
   still be efficient in allocations.

 - Initialize profiler graph ops as it will be needed for new updates to
   fgraph

 - Convert to use guard(mutex) for several ftrace and fgraph functions

 - Add more comments and documentation

 - Show function return address in function graph tracer

   Add an option to show the caller of a function at each entry of the
   function graph tracer, similar to what the function tracer does.

 - Abstract out ftrace_regs from being used directly like pt_regs

   ftrace_regs was created to store a partial pt_regs. It holds only the
   registers and stack information to get to the function arguments and
   return values. On several archs, it is simply a wrapper around
   pt_regs. But some users would access ftrace_regs directly to get the
   pt_regs which will not work on all archs. Make ftrace_regs an
   abstract structure that requires all access to its fields be through
   accessor functions.

 - Show how long it takes to do function code modifications

   When code modification for function hooks happen, it always had the
   time recorded in how long it took to do the conversion. But this
   value was never exported. Recently the code was touched due to new
   ROX modification handling that caused a large slow down in doing the
   modifications and had a significant impact on boot times.

   Expose the timings in the dyn_ftrace_total_info file. This file was
   created a while ago to show information about memory usage and such
   to implement dynamic function tracing. It's also an appropriate file
   to store the timings of this modification as well. This will make it
   easier to see the impact of changes to code modification on boot up
   timings.

 - Other clean ups and small fixes

* tag 'ftrace-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (22 commits)
  ftrace: Show timings of how long nop patching took
  ftrace: Use guard to take ftrace_lock in ftrace_graph_set_hash()
  ftrace: Use guard to take the ftrace_lock in release_probe()
  ftrace: Use guard to lock ftrace_lock in cache_mod()
  ftrace: Use guard for match_records()
  fgraph: Use guard(mutex)(&ftrace_lock) for unregister_ftrace_graph()
  fgraph: Give ret_stack its own kmem cache
  fgraph: Separate size of ret_stack from PAGE_SIZE
  ftrace: Rename ftrace_regs_return_value to ftrace_regs_get_return_value
  selftests/ftrace: Fix check of return value in fgraph-retval.tc test
  ftrace: Use arch_ftrace_regs() for ftrace_regs_*() macros
  ftrace: Consolidate ftrace_regs accessor functions for archs using pt_regs
  ftrace: Make ftrace_regs abstract from direct use
  fgragh: No need to invoke the function call_filter_check_discard()
  fgraph: Simplify return address printing in function graph tracer
  function_graph: Remove unnecessary initialization in ftrace_graph_ret_addr()
  function_graph: Support recording and printing the function return address
  ftrace: Have calltime be saved in the fgraph storage
  ftrace: Use a running sleeptime instead of saving on shadow stack
  fgraph: Use fgraph data to store subtime for profiler
  ...
2024-11-20 11:34:10 -08:00
Claudio Imbrenda e8d8d97218 s390: Remove gmap pointer from lowcore
Remove the gmap pointer from lowcore, since it is not used anymore.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20241022120601.167009-9-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:49:19 +01:00
Claudio Imbrenda f96cb0d61d s390/entry: Remove __GMAP_ASCE and use _PIF_GUEST_FAULT again
Now that the guest ASCE is passed as a parameter to __sie64a(),
_PIF_GUEST_FAULT can be used again to determine whether the fault was a
guest or host fault.

Since the guest ASCE will not be taken from the gmap pointer in lowcore
anymore, __GMAP_ASCE can be removed. For the same reason the guest
ASCE needs now to be saved into the cr1 save area unconditionally.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20241022120601.167009-2-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:49:18 +01:00
Steven Rostedt 7888af4166 ftrace: Make ftrace_regs abstract from direct use
ftrace_regs was created to hold registers that store information to save
function parameters, return value and stack. Since it is a subset of
pt_regs, it should only be used by its accessor functions. But because
pt_regs can easily be taken from ftrace_regs (on most archs), it is
tempting to use it directly. But when running on other architectures, it
may fail to build or worse, build but crash the kernel!

Instead, make struct ftrace_regs an empty structure and have the
architectures define __arch_ftrace_regs and all the accessor functions
will typecast to it to get to the actual fields. This will help avoid
usage of ftrace_regs directly.

Link: https://lore.kernel.org/all/20241007171027.629bdafd@gandalf.local.home/

Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
Cc: "x86@kernel.org" <x86@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Paul  Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas  Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav  Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/20241008230628.958778821@goodmis.org
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-10 20:18:01 -04:00
Sven Schnelle ee3daf7c05 s390/entry: Unify save_area_sync and save_area_async
In the past two save areas existed because interrupt handlers
and system call / program check handlers where entered with
interrupts enabled. To prevent a handler from overwriting the
save areas from the previous handler, interrupts used the async
save area, while system call and program check handler used the
sync save area.

Since the removal of critical section cleanup from entry.S, handlers are
entered with interrupts disabled. When the interrupts are re-enabled,
the save area is no longer need. Therefore merge both save areas into one.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29 22:56:34 +02:00
Heiko Carstens fc8eac33ad s390/entry: Move SIE indicator flag to thread info
CIF_SIE indicates if a thread is running in SIE context. This is the
state of a thread and not the CPU. Therefore move this indicator to
thread info.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Sven Schnelle d3604ffba1 s390: Move CIF flags to struct pcpu
To allow testing flags for offline CPUs, move the CIF flags
to struct pcpu. To avoid having to calculate the array index
for each access, add a pointer to the pcpu member for the current
cpu to lowcore.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-23 16:02:31 +02:00
Claudio Imbrenda 723ac2d6ba s390/entry: Pass the asce as parameter to sie64a()
Pass the guest ASCE explicitly as parameter, instead of having sie64a()
take it from lowcore.

This removes hidden state from lowcore, and makes things look cleaner.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Link: https://lore.kernel.org/r/20240703155900.103783-2-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-07-10 19:50:45 +02:00
Sven Schnelle fa2ae4a377 s390/idle: Rewrite psw_idle() in C
To ease maintenance and further enhancements, convert
the psw_idle() function to C.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 13:37:07 +02:00
Heiko Carstens 62b672c4ba s390/stackstrace: Detect vdso stack frames
Clear the backchain of the extra stack frame added by the vdso user wrapper
code. This allows the user stack walker to detect and skip the non-standard
stack frame. Without this an incorrect instruction pointer would be added
to stack traces, and stack frame walking would be continued with a more or
less random back chain.

Fixes: aa44433ac4 ("s390: add USER_STACKTRACE support")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 13:37:07 +02:00
Heiko Carstens be72ea09c1 s390/vdso: Introduce and use struct stack_frame_vdso_wrapper
Introduce and use struct stack_frame_vdso_wrapper within vdso user wrapper
code.  With this structure it is possible to automatically generate an
asm-offset define which can be used to save and restore the return address
of the calling function.

Also use STACK_FRAME_USER_OVERHEAD instead of STACK_FRAME_OVERHEAD to
document that the code works with user space stack frames with the standard
stack frame layout.

Fixes: aa44433ac4 ("s390: add USER_STACKTRACE support")
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-14 13:37:07 +02:00
Sven Schnelle e3123dfb53 s390/tracing: pass struct ftrace_regs to ftrace_trace_function
ftrace_trace_function expects a struct ftrace_regs, but the s390
architecure code passes struct pt_regs. This isn't a problem with the
current code because struct ftrace_regs contains only one member:
struct pt_regs. To avoid issues in the future this should be fixed.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-07-24 12:12:24 +02:00
Sven Schnelle 1256e70a08 s390/ftrace: enable HAVE_FUNCTION_GRAPH_RETVAL
Add support for tracing return values in the function graph tracer.
This requires return_to_handler() to record gpr2 and the frame pointer

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-07-24 12:12:22 +02:00
Sven Schnelle efccd4e0f3 s390/entry: remove mcck clock
In the past machine checks where accounted as irq time. With the conversion
to generic entry, it was decided to account machine checks to the current
context. The stckf at the beginning of the machine check handler and the
lowcore member is no longer required, therefore remove it.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-07-03 11:19:42 +02:00
Nico Boehr 6b33e68ab3 s390/entry: sort out physical vs virtual pointers usage in sie64a
Fix virtual vs physical address confusion (which currently are the
same).

sie_block is accessed in entry.S and passed it to hardware, which is why
both its physical and virtual address are needed. To avoid every caller
having to do the virtual-physical conversion, add a new function sie64a()
which converts the virtual address to physical.

Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20221020143159.294605-3-nrb@linux.ibm.com
Message-Id: <20221020143159.294605-3-nrb@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2022-10-26 14:27:41 +02:00
Heiko Carstens e0ffcf3fe1 s390/stack: add union to reflect kvm stack slot usages
Add a union which describes how the empty stack slots are being used
by kvm and perf. This should help to avoid another bug like the one
which was fixed with commit c9bfb460c3 ("s390/perf: obtain sie_block
from the right address").

Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Tested-by: Nico Boehr <nrb@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2022-06-01 12:03:17 +02:00
Heiko Carstens f037acb41d s390/stack: merge empty stack frame slots
Merge empty1 and empty2 arrays within the stack frame to one single
array. This is possible since with commit 42b01a553a ("s390: always
use the packed stack layout") the alternative stack frame layout is
gone.

Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2022-06-01 12:03:17 +02:00
Heiko Carstens 3384f135e9 s390: generate register offsets into pt_regs automatically
Use asm offsets method to generate register offsets into pt_regs,
instead of open-coding at several places.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2022-05-25 11:46:02 +02:00
Vasily Gorbik 4efd417f29 s390: raise minimum supported machine generation to z10
Machine generations up to z9 (released in May 2006) have been officially
out of service for several years now (z9 end of service - January 31, 2019).
No distributions build kernels supporting those old machine generations
anymore, except Debian, which seems to pick the oldest supported
generation. The team supporting Debian on s390 has been notified about
the change.

Raising minimum supported machine generation to z10 helps to reduce
maintenance cost and effectively remove code, which is not getting
enough testing coverage due to lack of older hardware and distributions
support. Besides that this unblocks some optimization opportunities and
allows to use wider instruction set in asm files for future features
implementation. Due to this change spectre mitigation and usercopy
implementations could be drastically simplified and many newer instructions
could be converted from ".insn" encoding to instruction names.

Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2022-03-10 15:58:17 +01:00
Heiko Carstens 50b7c4688d s390/asm-offsets: remove unused defines
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2022-03-08 00:33:01 +01:00
Alexander Gordeev dc306186a1 s390/dump: fix old lowcore virtual vs physical address confusion
Virtual addresses of vmcore_info and os_info members are
wrongly passed to copy_oldmem_kernel(), while the function
expects physical address of the source. Instead, __pa()
macro should have been applied.

Yet, use of __pa() macro could be somehow confusing, since
copy_oldmem_kernel() may treat the source as an offset, not
as a direct physical address (that depens from the oldmem
availability and location).

Fix the virtual vs physical address confusion and make the
way the old lowcore is read consistent across all sources.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2022-02-09 22:56:03 +01:00
Sven Schnelle 5ecb2da660 s390: support command lines longer than 896 bytes
Currently s390 supports a fixed maximum command line length of 896
bytes. This isn't enough as some installers are trying to pass all
configuration data via kernel command line, and even with zfcp alone
it is easy to generate really long command lines. Therefore extend
the command line to 4 kbytes.

In the parm area where the command line is stored there is no indication
of the maximum allowed length, so a new field which contains the maximum
length is added.

The parm area has always been initialized to zero, so with old kernels
this field would read zero. This is important because tools like zipl
could read this field. If it contains a number larger than zero zipl
knows the maximum length that can be stored in the parm area, otherwise
it must assume that it is booting a legacy kernel and only 896 bytes are
available.

The removing of trailing whitespace in head.S is also removed because
code to do this is already present in setup_boot_command_line().

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-10-26 15:21:31 +02:00
Sven Schnelle 3b051e89da s390: add support for BEAR enhancement facility
The Breaking-Event-Address-Register (BEAR) stores the address of the
last breaking event instruction. Breaking events are usually instructions
that change the program flow - for example branches, and instructions
that modify the address in the PSW like lpswe. This is useful for debugging
wild branches, because one could easily figure out where the wild branch
was originating from.

What is problematic is that lpswe is considered a breaking event, and
therefore overwrites BEAR on kernel exit. The BEAR enhancement facility
adds new instructions that allow to save/restore BEAR and also an lpswey
instruction that doesn't cause a breaking event. So we can save BEAR on
kernel entry and restore it on exit to user space.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-10-26 15:21:29 +02:00
Sven Schnelle 26c21aa485 s390: rename last_break to pgm_last_break
With the upcoming BEAR enhancements last_break isn't really
unique, so rename it to pgm_last_break. This way it should
be more obvious that this is the last_break value that is
written by the hardware when a program check occurs.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-10-26 15:21:28 +02:00
Heiko Carstens 3d487acf1b s390: make STACK_FRAME_OVERHEAD available via asm-offsets.h
Make STACK_FRAME_OVERHEAD available via asm-offsets.h. This allows to
add s390 specific asm code to e.g. ftrace samples, without requiring
to add random header files, which might cause all sort of problems on
other architectures. asm-offsets.h can be assumed to be non-problematic.

Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20211012133802.2460757-3-hca@linux.ibm.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-10-19 15:39:53 +02:00
Alexander Gordeev 915fea04f9 s390/smp: enable DAT before CPU restart callback is called
The restart interrupt is triggered whenever a secondary CPU is
brought online, a remote function call dispatched from another
CPU or a manual PSW restart is initiated and causes the system
to kdump. The handling routine is always called with DAT turned
off. It then initializes the stack frame and invokes a callback.

The existing callbacks handle DAT as follows:

  * __do_restart() and __machine_kexec() turn in on upon entry;
  * __ipl_run(), __reipl_run() and __dump_run() do not turn it
    right away, but all of them call diag308() - which turns DAT
    on, but only if kasan is enabled;

In addition to the described complexity all callbacks (and the
functions they call) should avoid kasan instrumentation while
DAT is off.

This update enables DAT in the assembler restart handler and
relieves any callbacks (which are mostly C functions) from
dealing with DAT altogether.

There are four types of CPU restart that initialize control
registers in different ways:

  1. Start of secondary CPU on boot - control registers are
     inherited from the IPL CPU;
  2. Restart of online CPU - control registers of the CPU being
     restarted are kept;
  3. Hotplug of offline CPU - control registers are inherited
     from the starting CPU;
  4. Start of offline CPU triggered by manual PSW restart -
     the control registers are read from the absolute lowcore
     and contain the boot time IPL CPU values updated with all
     follow-up calls of smp_ctl_set_bit() and smp_ctl_clear_bit()
     routines;

In first three cases contents of the control registers is the
most recent. In the latter case control registers are good
enough to facilitate successful completion of kdump operation.

Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2021-08-26 20:22:12 +02:00
Alexander Egorenkov 455cac5028 s390/setup: generate asm offsets from struct parmarea
To reduce duplication, replace error-prone and hard-coded parameter area
offsets with auto-generated ones.

Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2021-07-27 09:39:16 +02:00
Alexander Gordeev 5fa2ea0714 s390/mcck: move register validation to C code
This update partially reverts commit 3037a52f98 ("s390/nmi:
do register validation as early as possible").

Storage error checks and control registers validation are left
in the assembler code, since correct ASCEs and page tables are
required to enable DAT - which is done before the C handler is
entered.

System damage, kernel instruction address and PSW MWP checks
are left in the assembler code as well, since there is no way
to proceed if one of these checks is failed.

The getcpu vdso syscall reads CPU number from the programmable
field of the TOD clock. Disregard the TOD programmable register
validity bit and load the CPU number into the TOD programmable
field unconditionally.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-07-05 12:44:23 +02:00
Heiko Carstens f73c632d38 s390/ipl: make parameter area accessible via struct parmarea
Since commit 9a965ea95135 ("s390/kexec_file: Simplify parmarea
access") we have struct parmarea which describes the layout of the
kernel parameter area.

Make the kernel parameter area available as global variable parmarea
of type struct parmarea, which allows to easily access its members.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-06-07 17:06:59 +02:00
Sven Schnelle 17e89e1340 s390/facilities: move stfl information from lowcore to global data
With gcc-11, there are a lot of warnings because the facility functions
are accessing lowcore through a null pointer. Fix this by moving the
facility arrays away from lowcore.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-06-07 17:06:58 +02:00
Sven Schnelle af9ad82290 s390/entry: use assignment to read intcode / asm to copy gprs
arch/s390/kernel/syscall.c: In function __do_syscall:
arch/s390/kernel/syscall.c:147:9: warning: memcpy reading 64 bytes from a region of size 0 [-Wstringop-overread]
  147 |         memcpy(&regs->gprs[8], S390_lowcore.save_area_sync, 8 * sizeof(unsigned long));
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/s390/kernel/syscall.c:148:9: warning: memcpy reading 4 bytes from a region of size 0 [-Wstringop-overread]
  148 |         memcpy(&regs->int_code, &S390_lowcore.svc_ilc, sizeof(regs->int_code));
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix this by moving the gprs restore from C to assembly, and use a assignment
for int_code instead of memcpy.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-06-07 17:06:58 +02:00
Sven Schnelle b61b159512 s390: add stack for machine check handler
The previous code used the normal kernel stack for machine checks.
This is problematic when a machine check interrupts a system call
or interrupt handler right at the beginning where registers are set up.

Assume system_call is interrupted at the first instruction and a machine
check is triggered. The machine check handler is called, checks the PSW
to see whether it is coming from user space, notices that it is already
in kernel mode but %r15 still contains the user space stack. This would
lead to a kernel crash.

There are basically two ways of fixing that: Either using the 'critical
cleanup' approach which compares the address in the PSW to see whether
it is already at a point where the stack has been set up, or use an extra
stack for the machine check handler.

For simplicity, we will go with the second approach and allocate an extra
stack. This adds some memory overhead for large systems, but usually large
system have plenty of memory so this isn't really a concern. But it keeps
the mchk stack setup simple and less error prone.

Fixes: 0b0ed657fe ("s390: remove critical section cleanup from entry.S")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Cc: <stable@kernel.org> # v5.8+
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-02-13 17:17:53 +01:00
Sven Schnelle 56e62a7370 s390: convert to generic entry
This patch converts s390 to use the generic entry infrastructure from
kernel/entry/*.

There are a few special things on s390:

- PIF_PER_TRAP is moved to TIF_PER_TRAP as the generic code doesn't
  know about our PIF flags in exit_to_user_mode_loop().

- The old code had several ways to restart syscalls:

  a) PIF_SYSCALL_RESTART, which was only set during execve to force a
     restart after upgrading a process (usually qemu-kvm) to pgste page
     table extensions.

  b) PIF_SYSCALL, which is set by do_signal() to indicate that the
     current syscall should be restarted. This is changed so that
     do_signal() now also uses PIF_SYSCALL_RESTART. Continuing to use
     PIF_SYSCALL doesn't work with the generic code, and changing it
     to PIF_SYSCALL_RESTART makes PIF_SYSCALL and PIF_SYSCALL_RESTART
     more unique.

- On s390 calling sys_sigreturn or sys_rt_sigreturn is implemented by
executing a svc instruction on the process stack which causes a fault.
While handling that fault the fault code sets PIF_SYSCALL to hand over
processing to the syscall code on exit to usermode.

The patch introduces PIF_SYSCALL_RET_SET, which is set if ptrace sets
a return value for a syscall. The s390x ptrace ABI uses r2 both for the
syscall number and return value, so ptrace cannot set the syscall number +
return value at the same time. The flag makes handling that a bit easier.
do_syscall() will just skip executing the syscall if PIF_SYSCALL_RET_SET
is set.

CONFIG_DEBUG_ASCE was removd in favour of the generic CONFIG_DEBUG_ENTRY.
CR1/7/13 will be checked both on kernel entry and exit to contain the
correct asces.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-01-19 12:29:26 +01:00
Heiko Carstens 87d5986345 s390/mm: remove set_fs / rework address space handling
Remove set_fs support from s390. With doing this rework address space
handling and simplify it. As a result address spaces are now setup
like this:

CPU running in              | %cr1 ASCE | %cr7 ASCE | %cr13 ASCE
----------------------------|-----------|-----------|-----------
user space                  |  user     |  user     |  kernel
kernel, normal execution    |  kernel   |  user     |  kernel
kernel, kvm guest execution |  gmap     |  user     |  kernel

To achieve this the getcpu vdso syscall is removed in order to avoid
secondary address mode and a separate vdso address space in for user
space. The getcpu vdso syscall will be implemented differently with a
subsequent patch.

The kernel accesses user space always via secondary address space.
This happens in different ways:
- with mvcos in home space mode and directly read/write to secondary
  address space
- with mvcs/mvcp in primary space mode and copy from primary space to
  secondary space or vice versa
- with e.g. cs in secondary space mode and access secondary space

Switching translation modes happens with sacf before and after
instructions which access user space, like before.

Lazy handling of control register reloading is removed in the hope to
make everything simpler, but at the cost of making kernel entry and
exit a bit slower. That is: on kernel entry the primary asce is always
changed to contain the kernel asce, and on kernel exit the primary
asce is changed again so it contains the user asce.

In kernel mode there is only one exception to the primary asce: when
kvm guests are executed the primary asce contains the gmap asce (which
describes the guest address space). The primary asce is reset to
kernel asce whenever kvm guest execution is interrupted, so that this
doesn't has to be taken into account for any user space accesses.

Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2020-11-23 12:01:12 +01:00
Sven Schnelle 1179f170b6 s390: fix fpu restore in entry.S
We need to disable interrupts in load_fpu_regs(). Otherwise an
interrupt might come in after the registers are loaded, but before
CIF_FPU is cleared in load_fpu_regs(). When the interrupt returns,
CIF_FPU will be cleared and the registers will never be restored.

The entry.S code usually saves the interrupt state in __SF_EMPTY on the
stack when disabling/restoring interrupts. sie64a however saves the pointer
to the sie control block in __SF_SIE_CONTROL, which references the same
location.  This is non-obvious to the reader. To avoid thrashing the sie
control block pointer in load_fpu_regs(), move the __SIE_* offsets eight
bytes after __SF_EMPTY on the stack.

Cc: <stable@vger.kernel.org> # 5.8
Fixes: 0b0ed657fe ("s390: remove critical section cleanup from entry.S")
Reported-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2020-11-23 11:52:13 +01:00
Heiko Carstens cfef9aa69a s390/vdso: remove unused constants
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2020-11-03 15:12:16 +01:00
Sven Schnelle 4bff8cb545 s390: convert to GENERIC_VDSO
Convert s390 to generic vDSO. There are a few special things on s390:

- vDSO can be called without a stack frame - glibc did this in the past.
  So we need to allocate a stackframe on our own.

- The former assembly code used stcke to get the TOD clock and applied
  time steering to it. We need to do the same in the new code. This is done
  in the architecture specific __arch_get_hw_counter function. The steering
  information is stored in an architecure specific area in the vDSO data.

- CPUCLOCK_VIRT is now handled with a syscall fallback, which might
  be slower/less accurate than the old implementation.

The getcpu() function stays as an assembly function because there is no
generic implementation and the code is just a few lines.

Performance number from my system do 100 mio gettimeofday() calls:

Plain syscall: 8.6s
Generic VDSO:  1.3s
old ASM VDSO:  1s

So it's a bit slower but still much faster than syscalls.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-08-26 18:47:21 +02:00
Vincenzo Frascino 478237a595 s390/vdso: fix vDSO clock_getres()
clock_getres in the vDSO library has to preserve the same behaviour
of posix_get_hrtimer_res().

In particular, posix_get_hrtimer_res() does:
    sec = 0;
    ns = hrtimer_resolution;
and hrtimer_resolution depends on the enablement of the high
resolution timers that can happen either at compile or at run time.

Fix the s390 vdso implementation of clock_getres keeping a copy of
hrtimer_resolution in vdso data and using that directly.

Link: https://lkml.kernel.org/r/20200324121027.21665-1-vincenzo.frascino@arm.com
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
[heiko.carstens@de.ibm.com: use llgf for proper zero extension]
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-06-16 13:44:05 +02:00