Merge tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull rseq updates from Thomas Gleixner:
 "A large overhaul of the restartable sequences and CID management:

  The recent enablement of RSEQ in glibc resulted in regressions which
  are caused by the related overhead. It turned out that the decision to
  invoke the exit to user work was not really a decision. More or less
  each context switch caused that. There is a long list of small issues
  which add up nicely and result in a 3-4% regression in I/O
  benchmarks.

  The other detail which caused issues due to extra work in context
  switch and task migration is the CID (memory context ID) management.
  It also requires the use of task work to consolidate the CID space,
  which is executed in the context of an arbitrary task and results in
  sporadic uncontrolled exit latencies.

  The rewrite addresses this by:

   - Removing deprecated and long unsupported functionality

   - Moving the related data into dedicated data structures which are
     optimized for fast path processing.

   - Caching values so that actual decisions can be made (sketched below)

   - Replacing the current implementation with an optimized inlined
     variant.

   - Separating fast and slow path for architectures which use the
     generic entry code, so that only fault and error handling goes into
     the TIF_NOTIFY_RESUME handler.

   - Rewriting the CID management so that it becomes mostly invisible in
     the context switch path. That moves the work of switching modes
     into the fork/exit path, which is a reasonable tradeoff. That work
     is only required when a process creates more threads than there are
     CPUs in the cpuset it is allowed to run on, or when enough threads
     exit after that. An artificial thread pool benchmark which triggers
     this did not degrade; it actually improved significantly.

     The main effect in migration-heavy scenarios is that runqueue lock
     hold time, and therefore contention, go down significantly"
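
To make the "actual decision" concrete, here is a condensed sketch of the logic, distilled from the new rseq_sched_switch_event() helper visible in the include/linux/rseq.h hunk further down. The struct layout and the TIF handling are simplified stand-ins for illustration, not the exact kernel definitions.

#include <stdbool.h>

/* Simplified stand-in for the bitfield-based struct rseq_event in the diff */
struct rseq_event {
	bool has_rseq;		/* task has a registered rseq area */
	bool user_irq;		/* kernel was entered from user space by an interrupt */
	bool ids_changed;	/* CPU or MM CID changed since the last user update */
	bool sched_switch;	/* user-visible rseq fields need refreshing on exit */
};

/* Mirrors the decision made at context switch time by rseq_sched_switch_event() */
static inline bool rseq_needs_exit_work(struct rseq_event *ev)
{
	/*
	 * Only request the dedicated TIF_RSEQ exit work when the task
	 * actually uses rseq and either its IDs changed or it was
	 * interrupted while running in user space. A switch between two
	 * kernel-resident tasks no longer forces exit-to-user work.
	 */
	bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;

	if (raise)
		ev->sched_switch = true;
	return raise;	/* the caller would set TIF_RSEQ on the task */
}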

* tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  sched/mmcid: Switch over to the new mechanism
  sched/mmcid: Implement deferred mode change
  irqwork: Move data struct to a types header
  sched/mmcid: Provide CID ownership mode fixup functions
  sched/mmcid: Provide new scheduler CID mechanism
  sched/mmcid: Introduce per task/CPU ownership infrastructure
  sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex
  sched/mmcid: Provide precomputed maximal value
  sched/mmcid: Move initialization out of line
  signal: Move MMCID exit out of sighand lock
  sched/mmcid: Convert mm CID mask to a bitmap
  cpumask: Cache num_possible_cpus()
  sched/mmcid: Use cpumask_weighted_or()
  cpumask: Introduce cpumask_weighted_or()
  sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()
  sched/mmcid: Move scheduler code out of global header
  sched: Fixup whitespace damage
  sched/mmcid: Cacheline align MM CID storage
  sched/mmcid: Use proper data structures
  sched/mmcid: Revert the complex CID management
  ...
Linus Torvalds 2025-12-02 08:48:53 -08:00
commit 2b09f480f0
40 changed files with 2152 additions and 1517 deletions


@@ -6500,6 +6500,10 @@
 			Memory area to be used by remote processor image,
 			managed by CMA.
 
+	rseq_debug=	[KNL] Enable or disable restartable sequence
+			debug mode. Defaults to CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE.
+			Format: <bool>
+
 	rt_group_sched=	[KNL] Enable or disable SCHED_RR/FIFO group scheduling
 			when CONFIG_RT_GROUP_SCHED=y. Defaults to
 			!CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED.


@@ -100,7 +100,7 @@ static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs)
 static __always_inline void arm64_exit_to_user_mode(struct pt_regs *regs)
 {
 	local_irq_disable();
-	exit_to_user_mode_prepare(regs);
+	exit_to_user_mode_prepare_legacy(regs);
 	local_daif_mask();
 	mte_check_tfsr_exit();
 	exit_to_user_mode();


@@ -274,9 +274,10 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 	 * fetch EBP before invoking any of the syscall entry work
 	 * functions.
 	 */
-	syscall_enter_from_user_mode_prepare(regs);
+	enter_from_user_mode(regs);
 	instrumentation_begin();
+	local_irq_enable();
 
 	/* Fetch EBP from where the vDSO stashed it. */
 	if (IS_ENABLED(CONFIG_X86_64)) {
 		/*


@@ -187,12 +187,12 @@ convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs);
 extern void send_sigtrap(struct pt_regs *regs, int error_code, int si_code);
 
-static inline unsigned long regs_return_value(struct pt_regs *regs)
+static __always_inline unsigned long regs_return_value(struct pt_regs *regs)
 {
 	return regs->ax;
 }
 
-static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
+static __always_inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
 {
 	regs->ax = rc;
 }
 
@@ -277,34 +277,34 @@ static __always_inline bool ip_within_syscall_gap(struct pt_regs *regs)
 }
 #endif
 
-static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
+static __always_inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
 {
 	return regs->sp;
 }
 
-static inline unsigned long instruction_pointer(struct pt_regs *regs)
+static __always_inline unsigned long instruction_pointer(struct pt_regs *regs)
 {
 	return regs->ip;
 }
 
-static inline void instruction_pointer_set(struct pt_regs *regs,
-					   unsigned long val)
+static __always_inline
+void instruction_pointer_set(struct pt_regs *regs, unsigned long val)
 {
 	regs->ip = val;
 }
 
-static inline unsigned long frame_pointer(struct pt_regs *regs)
+static __always_inline unsigned long frame_pointer(struct pt_regs *regs)
 {
 	return regs->bp;
 }
 
-static inline unsigned long user_stack_pointer(struct pt_regs *regs)
+static __always_inline unsigned long user_stack_pointer(struct pt_regs *regs)
 {
 	return regs->sp;
 }
 
-static inline void user_stack_pointer_set(struct pt_regs *regs,
-					  unsigned long val)
+static __always_inline
+void user_stack_pointer_set(struct pt_regs *regs, unsigned long val)
 {
 	regs->sp = val;
 }


@@ -29,6 +29,7 @@
 #include <linux/crash_dump.h>
 #include <linux/panic_notifier.h>
 #include <linux/vmalloc.h>
+#include <linux/rseq.h>
 
 #include "mshv_eventfd.h"
 #include "mshv.h"
@@ -560,6 +561,8 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
 		}
 	} while (!vp->run.flags.intercept_suspend);
 
+	rseq_virt_userspace_exit();
+
 	return ret;
 }


@@ -46,7 +46,7 @@
 #include <linux/cred.h>
 #include <linux/dax.h>
 #include <linux/uaccess.h>
-#include <linux/rseq.h>
+#include <uapi/linux/rseq.h>
 
 #include <asm/param.h>
 #include <asm/page.h>


@@ -1774,7 +1774,7 @@ static int bprm_execve(struct linux_binprm *bprm)
 	force_fatal_sig(SIGSEGV);
 
 	sched_mm_cid_after_execve(current);
-	rseq_set_notify_resume(current);
+	rseq_force_update();
 	current->in_execve = 0;
 
 	return retval;


@@ -45,4 +45,7 @@
 # define _TIF_RESTORE_SIGMASK	BIT(TIF_RESTORE_SIGMASK)
 #endif
 
+#define TIF_RSEQ		11	// Run RSEQ fast path
+#define _TIF_RSEQ		BIT(TIF_RSEQ)
+
 #endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */


@@ -45,6 +45,7 @@ struct device;
 *  bitmap_copy(dst, src, nbits)                *dst = *src
 *  bitmap_and(dst, src1, src2, nbits)          *dst = *src1 & *src2
 *  bitmap_or(dst, src1, src2, nbits)           *dst = *src1 | *src2
+*  bitmap_weighted_or(dst, src1, src2, nbits)  *dst = *src1 | *src2. Returns Hamming Weight of dst
 *  bitmap_xor(dst, src1, src2, nbits)          *dst = *src1 ^ *src2
 *  bitmap_andnot(dst, src1, src2, nbits)       *dst = *src1 & ~(*src2)
 *  bitmap_complement(dst, src, nbits)          *dst = ~(*src)
@@ -165,6 +166,8 @@ bool __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
 		  const unsigned long *bitmap2, unsigned int nbits);
 void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
 		 const unsigned long *bitmap2, unsigned int nbits);
+unsigned int __bitmap_weighted_or(unsigned long *dst, const unsigned long *bitmap1,
+				  const unsigned long *bitmap2, unsigned int nbits);
 void __bitmap_xor(unsigned long *dst, const unsigned long *bitmap1,
 		  const unsigned long *bitmap2, unsigned int nbits);
 bool __bitmap_andnot(unsigned long *dst, const unsigned long *bitmap1,
@@ -337,6 +340,18 @@ void bitmap_or(unsigned long *dst, const unsigned long *src1,
 		__bitmap_or(dst, src1, src2, nbits);
 }
 
+static __always_inline
+unsigned int bitmap_weighted_or(unsigned long *dst, const unsigned long *src1,
+				const unsigned long *src2, unsigned int nbits)
+{
+	if (small_const_nbits(nbits)) {
+		*dst = *src1 | *src2;
+		return hweight_long(*dst & BITMAP_LAST_WORD_MASK(nbits));
+	} else {
+		return __bitmap_weighted_or(dst, src1, src2, nbits);
+	}
+}
+
 static __always_inline
 void bitmap_xor(unsigned long *dst, const unsigned long *src1,
 		const unsigned long *src2, unsigned int nbits)


@ -208,7 +208,7 @@
*/ */
#define DEFINE_FREE(_name, _type, _free) \ #define DEFINE_FREE(_name, _type, _free) \
static inline void __free_##_name(void *p) { _type _T = *(_type *)p; _free; } static __always_inline void __free_##_name(void *p) { _type _T = *(_type *)p; _free; }
#define __free(_name) __cleanup(__free_##_name) #define __free(_name) __cleanup(__free_##_name)
@ -220,7 +220,7 @@
__val; \ __val; \
}) })
static inline __must_check static __always_inline __must_check
const volatile void * __must_check_fn(const volatile void *val) const volatile void * __must_check_fn(const volatile void *val)
{ return val; } { return val; }
@ -278,16 +278,16 @@ const volatile void * __must_check_fn(const volatile void *val)
#define DEFINE_CLASS(_name, _type, _exit, _init, _init_args...) \ #define DEFINE_CLASS(_name, _type, _exit, _init, _init_args...) \
typedef _type class_##_name##_t; \ typedef _type class_##_name##_t; \
static inline void class_##_name##_destructor(_type *p) \ static __always_inline void class_##_name##_destructor(_type *p) \
{ _type _T = *p; _exit; } \ { _type _T = *p; _exit; } \
static inline _type class_##_name##_constructor(_init_args) \ static __always_inline _type class_##_name##_constructor(_init_args) \
{ _type t = _init; return t; } { _type t = _init; return t; }
#define EXTEND_CLASS(_name, ext, _init, _init_args...) \ #define EXTEND_CLASS(_name, ext, _init, _init_args...) \
typedef class_##_name##_t class_##_name##ext##_t; \ typedef class_##_name##_t class_##_name##ext##_t; \
static inline void class_##_name##ext##_destructor(class_##_name##_t *p)\ static __always_inline void class_##_name##ext##_destructor(class_##_name##_t *p) \
{ class_##_name##_destructor(p); } \ { class_##_name##_destructor(p); } \
static inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \ static __always_inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \
{ class_##_name##_t t = _init; return t; } { class_##_name##_t t = _init; return t; }
#define CLASS(_name, var) \ #define CLASS(_name, var) \
@ -360,7 +360,7 @@ static __maybe_unused const bool class_##_name##_is_conditional = _is_cond
}) })
#define __DEFINE_GUARD_LOCK_PTR(_name, _exp) \ #define __DEFINE_GUARD_LOCK_PTR(_name, _exp) \
static inline void *class_##_name##_lock_ptr(class_##_name##_t *_T) \ static __always_inline void *class_##_name##_lock_ptr(class_##_name##_t *_T) \
{ \ { \
void *_ptr = (void *)(__force unsigned long)*(_exp); \ void *_ptr = (void *)(__force unsigned long)*(_exp); \
if (IS_ERR(_ptr)) { \ if (IS_ERR(_ptr)) { \
@ -368,7 +368,7 @@ static __maybe_unused const bool class_##_name##_is_conditional = _is_cond
} \ } \
return _ptr; \ return _ptr; \
} \ } \
static inline int class_##_name##_lock_err(class_##_name##_t *_T) \ static __always_inline int class_##_name##_lock_err(class_##_name##_t *_T) \
{ \ { \
long _rc = (__force unsigned long)*(_exp); \ long _rc = (__force unsigned long)*(_exp); \
if (!_rc) { \ if (!_rc) { \
@ -397,9 +397,9 @@ static __maybe_unused const bool class_##_name##_is_conditional = _is_cond
EXTEND_CLASS(_name, _ext, \ EXTEND_CLASS(_name, _ext, \
({ void *_t = _T; int _RET = (_lock); if (_T && !(_cond)) _t = ERR_PTR(_RET); _t; }), \ ({ void *_t = _T; int _RET = (_lock); if (_T && !(_cond)) _t = ERR_PTR(_RET); _t; }), \
class_##_name##_t _T) \ class_##_name##_t _T) \
static inline void * class_##_name##_ext##_lock_ptr(class_##_name##_t *_T) \ static __always_inline void * class_##_name##_ext##_lock_ptr(class_##_name##_t *_T) \
{ return class_##_name##_lock_ptr(_T); } \ { return class_##_name##_lock_ptr(_T); } \
static inline int class_##_name##_ext##_lock_err(class_##_name##_t *_T) \ static __always_inline int class_##_name##_ext##_lock_err(class_##_name##_t *_T) \
{ return class_##_name##_lock_err(_T); } { return class_##_name##_lock_err(_T); }
/* /*
@ -479,7 +479,7 @@ typedef struct { \
__VA_ARGS__; \ __VA_ARGS__; \
} class_##_name##_t; \ } class_##_name##_t; \
\ \
static inline void class_##_name##_destructor(class_##_name##_t *_T) \ static __always_inline void class_##_name##_destructor(class_##_name##_t *_T) \
{ \ { \
if (!__GUARD_IS_ERR(_T->lock)) { _unlock; } \ if (!__GUARD_IS_ERR(_T->lock)) { _unlock; } \
} \ } \
@ -487,7 +487,7 @@ static inline void class_##_name##_destructor(class_##_name##_t *_T) \
__DEFINE_GUARD_LOCK_PTR(_name, &_T->lock) __DEFINE_GUARD_LOCK_PTR(_name, &_T->lock)
#define __DEFINE_LOCK_GUARD_1(_name, _type, _lock) \ #define __DEFINE_LOCK_GUARD_1(_name, _type, _lock) \
static inline class_##_name##_t class_##_name##_constructor(_type *l) \ static __always_inline class_##_name##_t class_##_name##_constructor(_type *l) \
{ \ { \
class_##_name##_t _t = { .lock = l }, *_T = &_t; \ class_##_name##_t _t = { .lock = l }, *_T = &_t; \
_lock; \ _lock; \
@ -495,7 +495,7 @@ static inline class_##_name##_t class_##_name##_constructor(_type *l) \
} }
#define __DEFINE_LOCK_GUARD_0(_name, _lock) \ #define __DEFINE_LOCK_GUARD_0(_name, _lock) \
static inline class_##_name##_t class_##_name##_constructor(void) \ static __always_inline class_##_name##_t class_##_name##_constructor(void) \
{ \ { \
class_##_name##_t _t = { .lock = (void*)1 }, \ class_##_name##_t _t = { .lock = (void*)1 }, \
*_T __maybe_unused = &_t; \ *_T __maybe_unused = &_t; \
@ -521,9 +521,9 @@ __DEFINE_LOCK_GUARD_0(_name, _lock)
if (_T->lock && !(_cond)) _T->lock = ERR_PTR(_RET);\ if (_T->lock && !(_cond)) _T->lock = ERR_PTR(_RET);\
_t; }), \ _t; }), \
typeof_member(class_##_name##_t, lock) l) \ typeof_member(class_##_name##_t, lock) l) \
static inline void * class_##_name##_ext##_lock_ptr(class_##_name##_t *_T) \ static __always_inline void * class_##_name##_ext##_lock_ptr(class_##_name##_t *_T) \
{ return class_##_name##_lock_ptr(_T); } \ { return class_##_name##_lock_ptr(_T); } \
static inline int class_##_name##_ext##_lock_err(class_##_name##_t *_T) \ static __always_inline int class_##_name##_ext##_lock_err(class_##_name##_t *_T) \
{ return class_##_name##_lock_err(_T); } { return class_##_name##_lock_err(_T); }
#define DEFINE_LOCK_GUARD_1_COND_3(_name, _ext, _lock) \ #define DEFINE_LOCK_GUARD_1_COND_3(_name, _ext, _lock) \


@@ -126,6 +126,7 @@ extern struct cpumask __cpu_dying_mask;
 #define cpu_dying_mask    ((const struct cpumask *)&__cpu_dying_mask)
 
 extern atomic_t __num_online_cpus;
+extern unsigned int __num_possible_cpus;
 
 extern cpumask_t cpus_booted_once_mask;
 
@@ -728,6 +729,22 @@ void cpumask_or(struct cpumask *dstp, const struct cpumask *src1p,
 				       cpumask_bits(src2p), small_cpumask_bits);
 }
 
+/**
+ * cpumask_weighted_or - *dstp = *src1p | *src2p and return the weight of the result
+ * @dstp: the cpumask result
+ * @src1p: the first input
+ * @src2p: the second input
+ *
+ * Return: The number of bits set in the resulting cpumask @dstp
+ */
+static __always_inline
+unsigned int cpumask_weighted_or(struct cpumask *dstp, const struct cpumask *src1p,
+				 const struct cpumask *src2p)
+{
+	return bitmap_weighted_or(cpumask_bits(dstp), cpumask_bits(src1p),
+				  cpumask_bits(src2p), small_cpumask_bits);
+}
+
 /**
  * cpumask_xor - *dstp = *src1p ^ *src2p
  * @dstp: the cpumask result
@@ -1136,13 +1153,13 @@ void init_cpu_possible(const struct cpumask *src);
 #define __assign_cpu(cpu, mask, val)	\
 	__assign_bit(cpumask_check(cpu), cpumask_bits(mask), (val))
 
-#define set_cpu_possible(cpu, possible)	assign_cpu((cpu), &__cpu_possible_mask, (possible))
 #define set_cpu_enabled(cpu, enabled)	assign_cpu((cpu), &__cpu_enabled_mask, (enabled))
 #define set_cpu_present(cpu, present)	assign_cpu((cpu), &__cpu_present_mask, (present))
 #define set_cpu_active(cpu, active)	assign_cpu((cpu), &__cpu_active_mask, (active))
 #define set_cpu_dying(cpu, dying)	assign_cpu((cpu), &__cpu_dying_mask, (dying))
 
 void set_cpu_online(unsigned int cpu, bool online);
+void set_cpu_possible(unsigned int cpu, bool possible);
 
 /**
  * to_cpumask - convert a NR_CPUS bitmap to a struct cpumask *
@@ -1195,7 +1212,12 @@ static __always_inline unsigned int num_online_cpus(void)
 {
 	return raw_atomic_read(&__num_online_cpus);
 }
-#define num_possible_cpus()	cpumask_weight(cpu_possible_mask)
+
+static __always_inline unsigned int num_possible_cpus(void)
+{
+	return __num_possible_cpus;
+}
+
 #define num_enabled_cpus()	cpumask_weight(cpu_enabled_mask)
 #define num_present_cpus()	cpumask_weight(cpu_present_mask)
 #define num_active_cpus()	cpumask_weight(cpu_active_mask)
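
The weighted OR exists so that callers which need both the union and the resulting bit count do not walk the bitmap twice. A hedged usage sketch, modeled on the mm_cpus_allowed() update that the removed mm_set_cpus_allowed() (visible in the mm_types.h hunk further down) performed as a cpumask_or() followed by a separate cpumask_weight(); the wrapper name below is made up for illustration.

#include <linux/cpumask.h>

/*
 * Illustrative only: merge one thread's allowed CPUs into a process-wide
 * mask and return the updated CPU count in a single pass, instead of a
 * cpumask_or() followed by a separate cpumask_weight().
 */
static unsigned int union_allowed_cpus(struct cpumask *proc_allowed,
				       const struct cpumask *thread_allowed)
{
	/* *proc_allowed |= *thread_allowed; returns the weight of the result */
	return cpumask_weighted_or(proc_allowed, proc_allowed, thread_allowed);
}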


@ -3,11 +3,11 @@
#define __LINUX_ENTRYCOMMON_H #define __LINUX_ENTRYCOMMON_H
#include <linux/irq-entry-common.h> #include <linux/irq-entry-common.h>
#include <linux/livepatch.h>
#include <linux/ptrace.h> #include <linux/ptrace.h>
#include <linux/resume_user_mode.h>
#include <linux/seccomp.h> #include <linux/seccomp.h>
#include <linux/sched.h> #include <linux/sched.h>
#include <linux/livepatch.h>
#include <linux/resume_user_mode.h>
#include <asm/entry-common.h> #include <asm/entry-common.h>
#include <asm/syscall.h> #include <asm/syscall.h>
@ -37,6 +37,7 @@
SYSCALL_WORK_SYSCALL_AUDIT | \ SYSCALL_WORK_SYSCALL_AUDIT | \
SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
ARCH_SYSCALL_WORK_ENTER) ARCH_SYSCALL_WORK_ENTER)
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \ #define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \ SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \ SYSCALL_WORK_SYSCALL_AUDIT | \
@ -44,25 +45,7 @@
SYSCALL_WORK_SYSCALL_EXIT_TRAP | \ SYSCALL_WORK_SYSCALL_EXIT_TRAP | \
ARCH_SYSCALL_WORK_EXIT) ARCH_SYSCALL_WORK_EXIT)
/** long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
* syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
* @regs: Pointer to currents pt_regs
*
* Invoked from architecture specific syscall entry code with interrupts
* disabled. The calling code has to be non-instrumentable. When the
* function returns all state is correct, interrupts are enabled and the
* subsequent functions can be instrumented.
*
* This handles lockdep, RCU (context tracking) and tracing state, i.e.
* the functionality provided by enter_from_user_mode().
*
* This is invoked when there is extra architecture specific functionality
* to be done between establishing state and handling user mode entry work.
*/
void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
long syscall_trace_enter(struct pt_regs *regs, long syscall,
unsigned long work);
/** /**
* syscall_enter_from_user_mode_work - Check and handle work before invoking * syscall_enter_from_user_mode_work - Check and handle work before invoking
@ -71,8 +54,8 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
* @syscall: The syscall number * @syscall: The syscall number
* *
* Invoked from architecture specific syscall entry code with interrupts * Invoked from architecture specific syscall entry code with interrupts
* enabled after invoking syscall_enter_from_user_mode_prepare() and extra * enabled after invoking enter_from_user_mode(), enabling interrupts and
* architecture specific work. * extra architecture specific work.
* *
* Returns: The original or a modified syscall number * Returns: The original or a modified syscall number
* *
@ -108,8 +91,9 @@ static __always_inline long syscall_enter_from_user_mode_work(struct pt_regs *re
* function returns all state is correct, interrupts are enabled and the * function returns all state is correct, interrupts are enabled and the
* subsequent functions can be instrumented. * subsequent functions can be instrumented.
* *
* This is combination of syscall_enter_from_user_mode_prepare() and * This is the combination of enter_from_user_mode() and
* syscall_enter_from_user_mode_work(). * syscall_enter_from_user_mode_work() to be used when there is no
* architecture specific work to be done between the two.
* *
* Returns: The original or a modified syscall number. See * Returns: The original or a modified syscall number. See
* syscall_enter_from_user_mode_work() for further explanation. * syscall_enter_from_user_mode_work() for further explanation.
@ -162,7 +146,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
local_irq_enable(); local_irq_enable();
} }
rseq_syscall(regs); rseq_debug_syscall_return(regs);
/* /*
* Do one-time syscall specific work. If these work items are * Do one-time syscall specific work. If these work items are
@ -172,7 +156,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
if (unlikely(work & SYSCALL_WORK_EXIT)) if (unlikely(work & SYSCALL_WORK_EXIT))
syscall_exit_work(regs, work); syscall_exit_work(regs, work);
local_irq_disable_exit_to_user(); local_irq_disable_exit_to_user();
exit_to_user_mode_prepare(regs); syscall_exit_to_user_mode_prepare(regs);
} }
/** /**


@ -2,11 +2,12 @@
#ifndef __LINUX_IRQENTRYCOMMON_H #ifndef __LINUX_IRQENTRYCOMMON_H
#define __LINUX_IRQENTRYCOMMON_H #define __LINUX_IRQENTRYCOMMON_H
#include <linux/context_tracking.h>
#include <linux/kmsan.h>
#include <linux/rseq_entry.h>
#include <linux/static_call_types.h> #include <linux/static_call_types.h>
#include <linux/syscalls.h> #include <linux/syscalls.h>
#include <linux/context_tracking.h>
#include <linux/tick.h> #include <linux/tick.h>
#include <linux/kmsan.h>
#include <linux/unwind_deferred.h> #include <linux/unwind_deferred.h>
#include <asm/entry-common.h> #include <asm/entry-common.h>
@ -29,7 +30,7 @@
#define EXIT_TO_USER_MODE_WORK \ #define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \ _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \
_TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \ _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ | \
ARCH_EXIT_TO_USER_MODE_WORK) ARCH_EXIT_TO_USER_MODE_WORK)
/** /**
@ -67,6 +68,7 @@ static __always_inline bool arch_in_rcu_eqs(void) { return false; }
/** /**
* enter_from_user_mode - Establish state when coming from user mode * enter_from_user_mode - Establish state when coming from user mode
* @regs: Pointer to currents pt_regs
* *
* Syscall/interrupt entry disables interrupts, but user mode is traced as * Syscall/interrupt entry disables interrupts, but user mode is traced as
* interrupts enabled. Also with NO_HZ_FULL RCU might be idle. * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
@ -195,14 +197,11 @@ static __always_inline void arch_exit_to_user_mode(void) { }
*/ */
void arch_do_signal_or_restart(struct pt_regs *regs); void arch_do_signal_or_restart(struct pt_regs *regs);
/** /* Handle pending TIF work */
* exit_to_user_mode_loop - do any pending work before leaving to user space unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
*/
unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work);
/** /**
* exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required * __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
* @regs: Pointer to pt_regs on entry stack * @regs: Pointer to pt_regs on entry stack
* *
* 1) check that interrupts are disabled * 1) check that interrupts are disabled
@ -210,8 +209,10 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
* 3) call exit_to_user_mode_loop() if any flags from * 3) call exit_to_user_mode_loop() if any flags from
* EXIT_TO_USER_MODE_WORK are set * EXIT_TO_USER_MODE_WORK are set
* 4) check that interrupts are still disabled * 4) check that interrupts are still disabled
*
* Don't invoke directly, use the syscall/irqentry_ prefixed variants below
*/ */
static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs) static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
{ {
unsigned long ti_work; unsigned long ti_work;
@ -225,13 +226,52 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
ti_work = exit_to_user_mode_loop(regs, ti_work); ti_work = exit_to_user_mode_loop(regs, ti_work);
arch_exit_to_user_mode_prepare(regs, ti_work); arch_exit_to_user_mode_prepare(regs, ti_work);
}
static __always_inline void __exit_to_user_mode_validate(void)
{
/* Ensure that kernel state is sane for a return to userspace */ /* Ensure that kernel state is sane for a return to userspace */
kmap_assert_nomap(); kmap_assert_nomap();
lockdep_assert_irqs_disabled(); lockdep_assert_irqs_disabled();
lockdep_sys_exit(); lockdep_sys_exit();
} }
/* Temporary workaround to keep ARM64 alive */
static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs)
{
__exit_to_user_mode_prepare(regs);
rseq_exit_to_user_mode_legacy();
__exit_to_user_mode_validate();
}
/**
* syscall_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
* @regs: Pointer to pt_regs on entry stack
*
* Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
* syscalls and interrupts.
*/
static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
{
__exit_to_user_mode_prepare(regs);
rseq_syscall_exit_to_user_mode();
__exit_to_user_mode_validate();
}
/**
* irqentry_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
* @regs: Pointer to pt_regs on entry stack
*
* Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
* syscalls and interrupts.
*/
static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
{
__exit_to_user_mode_prepare(regs);
rseq_irqentry_exit_to_user_mode();
__exit_to_user_mode_validate();
}
/** /**
* exit_to_user_mode - Fixup state when exiting to user mode * exit_to_user_mode - Fixup state when exiting to user mode
* *
@ -274,7 +314,11 @@ static __always_inline void exit_to_user_mode(void)
* *
* The function establishes state (lockdep, RCU (context tracking), tracing) * The function establishes state (lockdep, RCU (context tracking), tracing)
*/ */
void irqentry_enter_from_user_mode(struct pt_regs *regs); static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
{
enter_from_user_mode(regs);
rseq_note_user_irq_entry();
}
/** /**
* irqentry_exit_to_user_mode - Interrupt exit work * irqentry_exit_to_user_mode - Interrupt exit work
@ -289,7 +333,13 @@ void irqentry_enter_from_user_mode(struct pt_regs *regs);
* Interrupt exit is not invoking #1 which is the syscall specific one time * Interrupt exit is not invoking #1 which is the syscall specific one time
* work. * work.
*/ */
void irqentry_exit_to_user_mode(struct pt_regs *regs); static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
{
instrumentation_begin();
irqentry_exit_to_user_mode_prepare(regs);
instrumentation_end();
exit_to_user_mode();
}
#ifndef irqentry_state #ifndef irqentry_state
/** /**
@ -354,6 +404,7 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
* Conditional reschedule with additional sanity checks. * Conditional reschedule with additional sanity checks.
*/ */
void raw_irqentry_exit_cond_resched(void); void raw_irqentry_exit_cond_resched(void);
#ifdef CONFIG_PREEMPT_DYNAMIC #ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL) #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_resched #define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_resched


@@ -2,8 +2,9 @@
 #ifndef _LINUX_IRQ_WORK_H
 #define _LINUX_IRQ_WORK_H
 
-#include <linux/smp_types.h>
+#include <linux/irq_work_types.h>
 #include <linux/rcuwait.h>
+#include <linux/smp_types.h>
 
 /*
  * An entry can be in one of four states:
@@ -14,12 +15,6 @@
  * busy      NULL, 2 -> {free, claimed} : callback in progress, can be claimed
  */
 
-struct irq_work {
-	struct __call_single_node node;
-	void (*func)(struct irq_work *);
-	struct rcuwait irqwait;
-};
-
 #define __IRQ_WORK_INIT(_func, _flags) (struct irq_work){	\
 	.node = { .u_flags = (_flags), },			\
 	.func = (_func),					\


@ -0,0 +1,14 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_IRQ_WORK_TYPES_H
#define _LINUX_IRQ_WORK_TYPES_H
#include <linux/smp_types.h>
#include <linux/types.h>
struct irq_work {
struct __call_single_node node;
void (*func)(struct irq_work *);
struct rcuwait irqwait;
};
#endif


@@ -2408,31 +2408,6 @@ struct zap_details {
 /* Set in unmap_vmas() to indicate a final unmap call.  Only used by hugetlb */
 #define  ZAP_FLAG_UNMAP              ((__force zap_flags_t) BIT(1))
 
-#ifdef CONFIG_SCHED_MM_CID
-void sched_mm_cid_before_execve(struct task_struct *t);
-void sched_mm_cid_after_execve(struct task_struct *t);
-void sched_mm_cid_fork(struct task_struct *t);
-void sched_mm_cid_exit_signals(struct task_struct *t);
-static inline int task_mm_cid(struct task_struct *t)
-{
-	return t->mm_cid;
-}
-#else
-static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
-static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
-static inline void sched_mm_cid_fork(struct task_struct *t) { }
-static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
-static inline int task_mm_cid(struct task_struct *t)
-{
-	/*
-	 * Use the processor id as a fall-back when the mm cid feature is
-	 * disabled. This provides functional per-cpu data structure accesses
-	 * in user-space, althrough it won't provide the memory usage benefits.
-	 */
-	return raw_smp_processor_id();
-}
-#endif
-
 #ifdef CONFIG_MMU
 extern bool can_do_mlock(void);
 #else


@ -20,6 +20,7 @@
#include <linux/seqlock.h> #include <linux/seqlock.h>
#include <linux/percpu_counter.h> #include <linux/percpu_counter.h>
#include <linux/types.h> #include <linux/types.h>
#include <linux/rseq_types.h>
#include <linux/bitmap.h> #include <linux/bitmap.h>
#include <asm/mmu.h> #include <asm/mmu.h>
@ -922,14 +923,6 @@ struct vm_area_struct {
#define vma_policy(vma) NULL #define vma_policy(vma) NULL
#endif #endif
#ifdef CONFIG_SCHED_MM_CID
struct mm_cid {
u64 time;
int cid;
int recent_cid;
};
#endif
/* /*
* Opaque type representing current mm_struct flag state. Must be accessed via * Opaque type representing current mm_struct flag state. Must be accessed via
* mm_flags_xxx() helper functions. * mm_flags_xxx() helper functions.
@ -991,44 +984,9 @@ struct mm_struct {
*/ */
atomic_t mm_users; atomic_t mm_users;
#ifdef CONFIG_SCHED_MM_CID /* MM CID related storage */
/** struct mm_mm_cid mm_cid;
* @pcpu_cid: Per-cpu current cid.
*
* Keep track of the currently allocated mm_cid for each cpu.
* The per-cpu mm_cid values are serialized by their respective
* runqueue locks.
*/
struct mm_cid __percpu *pcpu_cid;
/*
* @mm_cid_next_scan: Next mm_cid scan (in jiffies).
*
* When the next mm_cid scan is due (in jiffies).
*/
unsigned long mm_cid_next_scan;
/**
* @nr_cpus_allowed: Number of CPUs allowed for mm.
*
* Number of CPUs allowed in the union of all mm's
* threads allowed CPUs.
*/
unsigned int nr_cpus_allowed;
/**
* @max_nr_cid: Maximum number of allowed concurrency
* IDs allocated.
*
* Track the highest number of allowed concurrency IDs
* allocated for the mm.
*/
atomic_t max_nr_cid;
/**
* @cpus_allowed_lock: Lock protecting mm cpus_allowed.
*
* Provide mutual exclusion for mm cpus_allowed and
* mm nr_cpus_allowed updates.
*/
raw_spinlock_t cpus_allowed_lock;
#endif
#ifdef CONFIG_MMU #ifdef CONFIG_MMU
atomic_long_t pgtables_bytes; /* size of all page tables */ atomic_long_t pgtables_bytes; /* size of all page tables */
#endif #endif
@ -1370,37 +1328,6 @@ static inline void vma_iter_init(struct vma_iterator *vmi,
} }
#ifdef CONFIG_SCHED_MM_CID #ifdef CONFIG_SCHED_MM_CID
enum mm_cid_state {
MM_CID_UNSET = -1U, /* Unset state has lazy_put flag set. */
MM_CID_LAZY_PUT = (1U << 31),
};
static inline bool mm_cid_is_unset(int cid)
{
return cid == MM_CID_UNSET;
}
static inline bool mm_cid_is_lazy_put(int cid)
{
return !mm_cid_is_unset(cid) && (cid & MM_CID_LAZY_PUT);
}
static inline bool mm_cid_is_valid(int cid)
{
return !(cid & MM_CID_LAZY_PUT);
}
static inline int mm_cid_set_lazy_put(int cid)
{
return cid | MM_CID_LAZY_PUT;
}
static inline int mm_cid_clear_lazy_put(int cid)
{
return cid & ~MM_CID_LAZY_PUT;
}
/* /*
* mm_cpus_allowed: Union of all mm's threads allowed CPUs. * mm_cpus_allowed: Union of all mm's threads allowed CPUs.
*/ */
@ -1415,37 +1342,21 @@ static inline cpumask_t *mm_cpus_allowed(struct mm_struct *mm)
} }
/* Accessor for struct mm_struct's cidmask. */ /* Accessor for struct mm_struct's cidmask. */
static inline cpumask_t *mm_cidmask(struct mm_struct *mm) static inline unsigned long *mm_cidmask(struct mm_struct *mm)
{ {
unsigned long cid_bitmap = (unsigned long)mm_cpus_allowed(mm); unsigned long cid_bitmap = (unsigned long)mm_cpus_allowed(mm);
/* Skip mm_cpus_allowed */ /* Skip mm_cpus_allowed */
cid_bitmap += cpumask_size(); cid_bitmap += cpumask_size();
return (struct cpumask *)cid_bitmap; return (unsigned long *)cid_bitmap;
} }
static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) void mm_init_cid(struct mm_struct *mm, struct task_struct *p);
{
int i;
for_each_possible_cpu(i) {
struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
pcpu_cid->cid = MM_CID_UNSET;
pcpu_cid->recent_cid = MM_CID_UNSET;
pcpu_cid->time = 0;
}
mm->nr_cpus_allowed = p->nr_cpus_allowed;
atomic_set(&mm->max_nr_cid, 0);
raw_spin_lock_init(&mm->cpus_allowed_lock);
cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask);
cpumask_clear(mm_cidmask(mm));
}
static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *p) static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *p)
{ {
mm->pcpu_cid = alloc_percpu_noprof(struct mm_cid); mm->mm_cid.pcpu = alloc_percpu_noprof(struct mm_cid_pcpu);
if (!mm->pcpu_cid) if (!mm->mm_cid.pcpu)
return -ENOMEM; return -ENOMEM;
mm_init_cid(mm, p); mm_init_cid(mm, p);
return 0; return 0;
@ -1454,37 +1365,24 @@ static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *
static inline void mm_destroy_cid(struct mm_struct *mm) static inline void mm_destroy_cid(struct mm_struct *mm)
{ {
free_percpu(mm->pcpu_cid); free_percpu(mm->mm_cid.pcpu);
mm->pcpu_cid = NULL; mm->mm_cid.pcpu = NULL;
} }
static inline unsigned int mm_cid_size(void) static inline unsigned int mm_cid_size(void)
{ {
return 2 * cpumask_size(); /* mm_cpus_allowed(), mm_cidmask(). */ /* mm_cpus_allowed(), mm_cidmask(). */
return cpumask_size() + bitmap_size(num_possible_cpus());
} }
static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
{
struct cpumask *mm_allowed = mm_cpus_allowed(mm);
if (!mm)
return;
/* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
raw_spin_lock(&mm->cpus_allowed_lock);
cpumask_or(mm_allowed, mm_allowed, cpumask);
WRITE_ONCE(mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
raw_spin_unlock(&mm->cpus_allowed_lock);
}
#else /* CONFIG_SCHED_MM_CID */ #else /* CONFIG_SCHED_MM_CID */
static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) { } static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) { }
static inline int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p) { return 0; } static inline int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p) { return 0; }
static inline void mm_destroy_cid(struct mm_struct *mm) { } static inline void mm_destroy_cid(struct mm_struct *mm) { }
static inline unsigned int mm_cid_size(void) static inline unsigned int mm_cid_size(void)
{ {
return 0; return 0;
} }
static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
#endif /* CONFIG_SCHED_MM_CID */ #endif /* CONFIG_SCHED_MM_CID */
struct mmu_gather; struct mmu_gather;


@@ -59,7 +59,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();
 
-	rseq_handle_notify_resume(NULL, regs);
+	rseq_handle_slowpath(regs);
 }
 
 #endif /* LINUX_RESUME_USER_MODE_H */
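
rseq_handle_slowpath() is now only the fault and error handling side of the split: it finishes what the inlined exit fast path could not. A rough sketch of how the two halves relate, inferred from the page fault comments in the new include/linux/rseq_entry.h further down; the fast path helper name is invented for illustration and the real code lives in the generic entry path.

#include <linux/ptrace.h>
#include <linux/sched.h>
#include <linux/uaccess.h>

/*
 * Sketch only; rseq_update_user_fields() is an invented stand-in for the
 * inlined fast path that writes the cached CPU/MM CID values and handles
 * the critical section via user accesses which may fault.
 */
static bool rseq_exit_fastpath(struct task_struct *t, struct pt_regs *regs)
{
	bool ok;

	/* Interrupts are off on the exit path, so faults cannot be handled here */
	pagefault_disable();
	ok = rseq_update_user_fields(t, regs);
	pagefault_enable();

	if (!ok && !t->rseq.event.fatal) {
		/*
		 * The user access faulted: defer to the preemptible
		 * TIF_NOTIFY_RESUME slow path, which can resolve the
		 * page fault and redo the update.
		 */
		t->rseq.event.slowpath = true;
		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
	}
	return ok;
}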


@ -3,134 +3,164 @@
#define _LINUX_RSEQ_H #define _LINUX_RSEQ_H
#ifdef CONFIG_RSEQ #ifdef CONFIG_RSEQ
#include <linux/preempt.h>
#include <linux/sched.h> #include <linux/sched.h>
#ifdef CONFIG_MEMBARRIER #include <uapi/linux/rseq.h>
# define RSEQ_EVENT_GUARD irq
#else void __rseq_handle_slowpath(struct pt_regs *regs);
# define RSEQ_EVENT_GUARD preempt
#endif /* Invoked from resume_user_mode_work() */
static inline void rseq_handle_slowpath(struct pt_regs *regs)
{
if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
if (current->rseq.event.slowpath)
__rseq_handle_slowpath(regs);
} else {
/* '&' is intentional to spare one conditional branch */
if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
__rseq_handle_slowpath(regs);
}
}
void __rseq_signal_deliver(int sig, struct pt_regs *regs);
/* /*
* Map the event mask on the user-space ABI enum rseq_cs_flags * Invoked from signal delivery to fixup based on the register context before
* for direct mask checks. * switching to the signal delivery context.
*/ */
enum rseq_event_mask_bits { static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
RSEQ_EVENT_PREEMPT_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
RSEQ_EVENT_SIGNAL_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
RSEQ_EVENT_MIGRATE_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
};
enum rseq_event_mask {
RSEQ_EVENT_PREEMPT = (1U << RSEQ_EVENT_PREEMPT_BIT),
RSEQ_EVENT_SIGNAL = (1U << RSEQ_EVENT_SIGNAL_BIT),
RSEQ_EVENT_MIGRATE = (1U << RSEQ_EVENT_MIGRATE_BIT),
};
static inline void rseq_set_notify_resume(struct task_struct *t)
{ {
if (t->rseq) if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
set_tsk_thread_flag(t, TIF_NOTIFY_RESUME); /* '&' is intentional to spare one conditional branch */
if (current->rseq.event.has_rseq & current->rseq.event.user_irq)
__rseq_signal_deliver(ksig->sig, regs);
} else {
if (current->rseq.event.has_rseq)
__rseq_signal_deliver(ksig->sig, regs);
}
} }
void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs); static inline void rseq_raise_notify_resume(struct task_struct *t)
static inline void rseq_handle_notify_resume(struct ksignal *ksig,
struct pt_regs *regs)
{ {
if (current->rseq) set_tsk_thread_flag(t, TIF_RSEQ);
__rseq_handle_notify_resume(ksig, regs);
} }
static inline void rseq_signal_deliver(struct ksignal *ksig, /* Invoked from context switch to force evaluation on exit to user */
struct pt_regs *regs) static __always_inline void rseq_sched_switch_event(struct task_struct *t)
{ {
scoped_guard(RSEQ_EVENT_GUARD) struct rseq_event *ev = &t->rseq.event;
__set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
rseq_handle_notify_resume(ksig, regs); if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
/*
* Avoid a boat load of conditionals by using simple logic
* to determine whether NOTIFY_RESUME needs to be raised.
*
* It's required when the CPU or MM CID has changed or
* the entry was from user space.
*/
bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
if (raise) {
ev->sched_switch = true;
rseq_raise_notify_resume(t);
}
} else {
if (ev->has_rseq) {
t->rseq.event.sched_switch = true;
rseq_raise_notify_resume(t);
}
}
} }
/* rseq_preempt() requires preemption to be disabled. */ /*
static inline void rseq_preempt(struct task_struct *t) * Invoked from __set_task_cpu() when a task migrates or from
* mm_cid_schedin() when the CID changes to enforce an IDs update.
*
* This does not raise TIF_NOTIFY_RESUME as that happens in
* rseq_sched_switch_event().
*/
static __always_inline void rseq_sched_set_ids_changed(struct task_struct *t)
{ {
__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); t->rseq.event.ids_changed = true;
rseq_set_notify_resume(t);
} }
/* rseq_migrate() requires preemption to be disabled. */ /* Enforce a full update after RSEQ registration and when execve() failed */
static inline void rseq_migrate(struct task_struct *t) static inline void rseq_force_update(void)
{ {
__set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask); if (current->rseq.event.has_rseq) {
rseq_set_notify_resume(t); current->rseq.event.ids_changed = true;
current->rseq.event.sched_switch = true;
rseq_raise_notify_resume(current);
}
}
/*
* KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
* which clears TIF_NOTIFY_RESUME on architectures that don't use the
* generic TIF bits and therefore can't provide a separate TIF_RSEQ flag.
*
* To avoid updating user space RSEQ in that case just to do it eventually
* again before returning to user space, because __rseq_handle_slowpath()
* does nothing when invoked with NULL register state.
*
* After returning from guest mode, before exiting to userspace, hypervisors
* must invoke this function to re-raise TIF_NOTIFY_RESUME if necessary.
*/
static inline void rseq_virt_userspace_exit(void)
{
/*
* The generic optimization for deferring RSEQ updates until the next
* exit relies on having a dedicated TIF_RSEQ.
*/
if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) &&
current->rseq.event.sched_switch)
rseq_raise_notify_resume(current);
}
static inline void rseq_reset(struct task_struct *t)
{
memset(&t->rseq, 0, sizeof(t->rseq));
t->rseq.ids.cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
}
static inline void rseq_execve(struct task_struct *t)
{
rseq_reset(t);
} }
/* /*
* If parent process has a registered restartable sequences area, the * If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set. * child inherits. Unregister rseq for a clone with CLONE_VM set.
*
* On fork, keep the IDs (CPU, MMCID) of the parent, which avoids a fault
* on the COW page on exit to user space, when the child stays on the same
* CPU as the parent. That's obviously not guaranteed, but in overcommit
* scenarios it is more likely and optimizes for the fork/exec case without
* taking the fault.
*/ */
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
{ {
if (clone_flags & CLONE_VM) { if (clone_flags & CLONE_VM)
t->rseq = NULL; rseq_reset(t);
t->rseq_len = 0; else
t->rseq_sig = 0;
t->rseq_event_mask = 0;
} else {
t->rseq = current->rseq; t->rseq = current->rseq;
t->rseq_len = current->rseq_len;
t->rseq_sig = current->rseq_sig;
t->rseq_event_mask = current->rseq_event_mask;
}
} }
static inline void rseq_execve(struct task_struct *t) #else /* CONFIG_RSEQ */
{ static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
t->rseq = NULL; static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
t->rseq_len = 0; static inline void rseq_sched_switch_event(struct task_struct *t) { }
t->rseq_sig = 0; static inline void rseq_sched_set_ids_changed(struct task_struct *t) { }
t->rseq_event_mask = 0; static inline void rseq_force_update(void) { }
} static inline void rseq_virt_userspace_exit(void) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
#else static inline void rseq_execve(struct task_struct *t) { }
#endif /* !CONFIG_RSEQ */
static inline void rseq_set_notify_resume(struct task_struct *t)
{
}
static inline void rseq_handle_notify_resume(struct ksignal *ksig,
struct pt_regs *regs)
{
}
static inline void rseq_signal_deliver(struct ksignal *ksig,
struct pt_regs *regs)
{
}
static inline void rseq_preempt(struct task_struct *t)
{
}
static inline void rseq_migrate(struct task_struct *t)
{
}
static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
{
}
static inline void rseq_execve(struct task_struct *t)
{
}
#endif
#ifdef CONFIG_DEBUG_RSEQ #ifdef CONFIG_DEBUG_RSEQ
void rseq_syscall(struct pt_regs *regs); void rseq_syscall(struct pt_regs *regs);
#else /* CONFIG_DEBUG_RSEQ */
#else static inline void rseq_syscall(struct pt_regs *regs) { }
#endif /* !CONFIG_DEBUG_RSEQ */
static inline void rseq_syscall(struct pt_regs *regs)
{
}
#endif
#endif /* _LINUX_RSEQ_H */ #endif /* _LINUX_RSEQ_H */

include/linux/rseq_entry.h (new file, 616 lines)

@ -0,0 +1,616 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_RSEQ_ENTRY_H
#define _LINUX_RSEQ_ENTRY_H
/* Must be outside the CONFIG_RSEQ guard to resolve the stubs */
#ifdef CONFIG_RSEQ_STATS
#include <linux/percpu.h>
struct rseq_stats {
unsigned long exit;
unsigned long signal;
unsigned long slowpath;
unsigned long fastpath;
unsigned long ids;
unsigned long cs;
unsigned long clear;
unsigned long fixup;
};
DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
/*
* Slow path has interrupts and preemption enabled, but the fast path
* runs with interrupts disabled so there is no point in having the
* preemption checks implied in __this_cpu_inc() for every operation.
*/
#ifdef RSEQ_BUILD_SLOW_PATH
#define rseq_stat_inc(which) this_cpu_inc((which))
#else
#define rseq_stat_inc(which) raw_cpu_inc((which))
#endif
#else /* CONFIG_RSEQ_STATS */
#define rseq_stat_inc(x) do { } while (0)
#endif /* !CONFIG_RSEQ_STATS */
#ifdef CONFIG_RSEQ
#include <linux/jump_label.h>
#include <linux/rseq.h>
#include <linux/uaccess.h>
#include <linux/tracepoint-defs.h>
#ifdef CONFIG_TRACEPOINTS
DECLARE_TRACEPOINT(rseq_update);
DECLARE_TRACEPOINT(rseq_ip_fixup);
void __rseq_trace_update(struct task_struct *t);
void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
unsigned long offset, unsigned long abort_ip);
static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids)
{
if (tracepoint_enabled(rseq_update) && ids)
__rseq_trace_update(t);
}
static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
unsigned long offset, unsigned long abort_ip)
{
if (tracepoint_enabled(rseq_ip_fixup))
__rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
}
#else /* CONFIG_TRACEPOINT */
static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids) { }
static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
unsigned long offset, unsigned long abort_ip) { }
#endif /* !CONFIG_TRACEPOINT */
DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
#ifdef RSEQ_BUILD_SLOW_PATH
#define rseq_inline
#else
#define rseq_inline __always_inline
#endif
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
bool rseq_debug_validate_ids(struct task_struct *t);
static __always_inline void rseq_note_user_irq_entry(void)
{
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
current->rseq.event.user_irq = true;
}
/*
* Check whether there is a valid critical section and whether the
* instruction pointer in @regs is inside the critical section.
*
* - If the critical section is invalid, terminate the task.
*
* - If valid and the instruction pointer is inside, set it to the abort IP.
*
* - If valid and the instruction pointer is outside, clear the critical
* section address.
*
* Returns true, if the section was valid and either fixup or clear was
* done, false otherwise.
*
* In the failure case task::rseq_event::fatal is set when a invalid
* section was found. It's clear when the failure was an unresolved page
* fault.
*
* If inlined into the exit to user path with interrupts disabled, the
* caller has to protect against page faults with pagefault_disable().
*
* In preemptible task context this would be counterproductive as the page
* faults could not be fully resolved. As a consequence unresolved page
* faults in task context are fatal too.
*/
#ifdef RSEQ_BUILD_SLOW_PATH
/*
* The debug version is put out of line, but kept here so the code stays
* together.
*
* @csaddr has already been checked by the caller to be in user space
*/
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs,
unsigned long csaddr)
{
struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
u64 start_ip, abort_ip, offset, cs_end, head, tasksize = TASK_SIZE;
unsigned long ip = instruction_pointer(regs);
u64 __user *uc_head = (u64 __user *) ucs;
u32 usig, __user *uc_sig;
scoped_user_rw_access(ucs, efault) {
/*
* Evaluate the user pile and exit if one of the conditions
* is not fulfilled.
*/
unsafe_get_user(start_ip, &ucs->start_ip, efault);
if (unlikely(start_ip >= tasksize))
goto die;
/* If outside, just clear the critical section. */
if (ip < start_ip)
goto clear;
unsafe_get_user(offset, &ucs->post_commit_offset, efault);
cs_end = start_ip + offset;
/* Check for overflow and wraparound */
if (unlikely(cs_end >= tasksize || cs_end < start_ip))
goto die;
/* If not inside, clear it. */
if (ip >= cs_end)
goto clear;
unsafe_get_user(abort_ip, &ucs->abort_ip, efault);
/* Ensure it's "valid" */
if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
goto die;
/* Validate that the abort IP is not in the critical section */
if (unlikely(abort_ip - start_ip < offset))
goto die;
/*
* Check version and flags for 0. No point in emitting
* deprecated warnings before dying. That could be done in
* the slow path eventually, but *shrug*.
*/
unsafe_get_user(head, uc_head, efault);
if (unlikely(head))
goto die;
/* abort_ip - 4 is >= 0. See abort_ip check above */
uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
unsafe_get_user(usig, uc_sig, efault);
if (unlikely(usig != t->rseq.sig))
goto die;
/* rseq_event.user_irq is only valid if CONFIG_GENERIC_IRQ_ENTRY=y */
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
/* If not in interrupt from user context, let it die */
if (unlikely(!t->rseq.event.user_irq))
goto die;
}
unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
instruction_pointer_set(regs, (unsigned long)abort_ip);
rseq_stat_inc(rseq_stats.fixup);
break;
clear:
unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
rseq_stat_inc(rseq_stats.clear);
abort_ip = 0ULL;
}
if (unlikely(abort_ip))
rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
return true;
die:
t->rseq.event.fatal = true;
efault:
return false;
}
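/*
 * Stand-alone user-space model of the checks above (illustrative only,
 * not part of the kernel sources). It mirrors the validation order of
 * rseq_debug_update_user_cs() on plain values: @usig stands in for the
 * word which the kernel reads from the four bytes preceding @abort_ip
 * and @task_size stands in for TASK_SIZE. Sample values are made up.
 */
#include <linux/rseq.h>
#include <stdint.h>
#include <stdio.h>

enum cs_action { CS_DIE, CS_CLEAR, CS_FIXUP };

static enum cs_action check_cs(const struct rseq_cs *cs, uint64_t ip,
			       uint32_t usig, uint32_t sig, uint64_t task_size)
{
	uint64_t cs_end;

	if (cs->start_ip >= task_size)
		return CS_DIE;
	if (ip < cs->start_ip)
		return CS_CLEAR;
	cs_end = cs->start_ip + cs->post_commit_offset;
	if (cs_end >= task_size || cs_end < cs->start_ip)	/* overflow/wraparound */
		return CS_DIE;
	if (ip >= cs_end)
		return CS_CLEAR;
	if (cs->abort_ip >= task_size || cs->abort_ip < sizeof(uint32_t))
		return CS_DIE;
	if (cs->abort_ip - cs->start_ip < cs->post_commit_offset)
		return CS_DIE;					/* abort IP inside the section */
	if (cs->version || cs->flags)
		return CS_DIE;					/* deprecated/unknown bits */
	if (usig != sig)
		return CS_DIE;					/* signature mismatch */
	return CS_FIXUP;
}

int main(void)
{
	struct rseq_cs cs = {
		.start_ip		= 0x1000,
		.post_commit_offset	= 0x10,
		.abort_ip		= 0x1100,
	};

	/* IP inside the section, matching signature: prints 2 == CS_FIXUP */
	printf("%d\n", check_cs(&cs, 0x1008, 0x53053053, 0x53053053, 1ULL << 47));
	return 0;
}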
/*
* On debug kernels validate that user space did not mess with it if the
* debug branch is enabled.
*/
bool rseq_debug_validate_ids(struct task_struct *t)
{
struct rseq __user *rseq = t->rseq.usrptr;
u32 cpu_id, uval, node_id;
/*
* On the first exit after registering the rseq region CPU ID is
* RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0!
*/
node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ?
cpu_to_node(t->rseq.ids.cpu_id) : 0;
scoped_user_read_access(rseq, efault) {
unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
if (cpu_id != t->rseq.ids.cpu_id)
goto die;
unsafe_get_user(uval, &rseq->cpu_id, efault);
if (uval != cpu_id)
goto die;
unsafe_get_user(uval, &rseq->node_id, efault);
if (uval != node_id)
goto die;
unsafe_get_user(uval, &rseq->mm_cid, efault);
if (uval != t->rseq.ids.mm_cid)
goto die;
}
return true;
die:
t->rseq.event.fatal = true;
efault:
return false;
}
#endif /* RSEQ_BUILD_SLOW_PATH */
/*
* This only ensures that abort_ip is in the user address space and
* validates that it is preceded by the signature.
*
* No other sanity checks are done here, that's what the debug code is for.
*/
static rseq_inline bool
rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr)
{
struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
unsigned long ip = instruction_pointer(regs);
unsigned long tasksize = TASK_SIZE;
u64 start_ip, abort_ip, offset;
u32 usig, __user *uc_sig;
rseq_stat_inc(rseq_stats.cs);
if (unlikely(csaddr >= tasksize)) {
t->rseq.event.fatal = true;
return false;
}
if (static_branch_unlikely(&rseq_debug_enabled))
return rseq_debug_update_user_cs(t, regs, csaddr);
scoped_user_rw_access(ucs, efault) {
unsafe_get_user(start_ip, &ucs->start_ip, efault);
unsafe_get_user(offset, &ucs->post_commit_offset, efault);
unsafe_get_user(abort_ip, &ucs->abort_ip, efault);
/*
* No sanity checks. If user space screwed it up, it can
* keep the pieces. That's what debug code is for.
*
* If outside, just clear the critical section.
*/
if (ip - start_ip >= offset)
goto clear;
/*
* Two requirements for @abort_ip:
* - Must be in user space as x86 IRET would happily return to
* the kernel.
* - The four bytes preceding the instruction at @abort_ip must
* contain the signature.
*
* The latter protects against the following attack vector:
*
* An attacker with limited abilities to write, creates a critical
* section descriptor, sets the abort IP to a library function or
* some other ROP gadget and stores the address of the descriptor
* in TLS::rseq::rseq_cs. An RSEQ abort would then evade ROP
* protection.
*/
if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
goto die;
/* The address is guaranteed to be >= 0 and < TASK_SIZE */
uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
unsafe_get_user(usig, uc_sig, efault);
if (unlikely(usig != t->rseq.sig))
goto die;
/* Invalidate the critical section */
unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
/* Update the instruction pointer */
instruction_pointer_set(regs, (unsigned long)abort_ip);
rseq_stat_inc(rseq_stats.fixup);
break;
clear:
unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
rseq_stat_inc(rseq_stats.clear);
abort_ip = 0ULL;
}
if (unlikely(abort_ip))
rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
return true;
die:
t->rseq.event.fatal = true;
efault:
return false;
}
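/*
 * The single unsigned comparison above covers both bounds at once: when
 * @ip is below @start_ip the subtraction wraps around to a huge value,
 * and when @ip is at or past the post-commit point the difference is
 * >= @offset. Stand-alone sketch with made up addresses:
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool ip_in_cs(uint64_t ip, uint64_t start_ip, uint64_t offset)
{
	/* Equivalent to: ip >= start_ip && ip < start_ip + offset */
	return ip - start_ip < offset;
}

int main(void)
{
	uint64_t start = 0x1000, offset = 0x40;

	printf("%d\n", ip_in_cs(0x0fff, start, offset));	/* 0: below start, wraps */
	printf("%d\n", ip_in_cs(0x1000, start, offset));	/* 1: first instruction */
	printf("%d\n", ip_in_cs(0x103f, start, offset));	/* 1: last IP before commit */
	printf("%d\n", ip_in_cs(0x1040, start, offset));	/* 0: post-commit IP */
	return 0;
}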
/*
* Updates CPU ID, Node ID and MM CID and reads the critical section
* address, when @csaddr != NULL. This allows putting the ID update and the
* read under the same uaccess region to spare a separate begin/end.
*
* As this is either invoked from a C wrapper with @csaddr = NULL or from
* the fast path code with a valid pointer, a clever compiler should be
* able to optimize the read out. Spares a duplicate implementation.
*
* Returns true, if the operation was successful, false otherwise.
*
* In the failure case task::rseq_event::fatal is set when invalid data
* was found on debug kernels. It's clear when the failure was an unresolved page
* fault.
*
* If inlined into the exit to user path with interrupts disabled, the
* caller has to protect against page faults with pagefault_disable().
*
* In preemptible task context this would be counterproductive as the page
* faults could not be fully resolved. As a consequence unresolved page
* faults in task context are fatal too.
*/
static rseq_inline
bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
u32 node_id, u64 *csaddr)
{
struct rseq __user *rseq = t->rseq.usrptr;
if (static_branch_unlikely(&rseq_debug_enabled)) {
if (!rseq_debug_validate_ids(t))
return false;
}
scoped_user_rw_access(rseq, efault) {
unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault);
unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault);
unsafe_put_user(node_id, &rseq->node_id, efault);
unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
if (csaddr)
unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
}
/* Cache the new values */
t->rseq.ids.cpu_cid = ids->cpu_cid;
rseq_stat_inc(rseq_stats.ids);
rseq_trace_update(t, ids);
return true;
efault:
return false;
}
/*
* Update user space with new IDs and conditionally check whether the task
* is in a critical section.
*/
static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *regs,
struct rseq_ids *ids, u32 node_id)
{
u64 csaddr;
if (!rseq_set_ids_get_csaddr(t, ids, node_id, &csaddr))
return false;
/*
* On architectures which utilize the generic entry code this
* allows skipping the critical section handling when the entry was
* not from a user space interrupt, unless debug mode is enabled.
*/
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
if (!static_branch_unlikely(&rseq_debug_enabled)) {
if (likely(!t->rseq.event.user_irq))
return true;
}
}
if (likely(!csaddr))
return true;
/* Sigh, this really needs to do work */
return rseq_update_user_cs(t, regs, csaddr);
}
/*
* If you want to use this then convert your architecture to the generic
* entry code. I'm tired of building workarounds for people who can't be
* bothered to make the maintenance of generic infrastructure less
* burdensome. Just sucking everything into the architecture code and
* thereby making others chase the horrible hacks and keep them working is
* neither acceptable nor sustainable.
*/
#ifdef CONFIG_GENERIC_ENTRY
/*
* This is inlined into the exit path because:
*
* 1) It's a one time comparison in the fast path when there is no event to
* handle
*
* 2) The access to the user space rseq memory (TLS) is unlikely to fault
* so the straight inline operation is:
*
* - Four 32-bit stores only if CPU ID/ MM CID need to be updated
* - One 64-bit load to retrieve the critical section address
*
* 3) In the unlikely case that the critical section address is != NULL:
*
* - One 64-bit load to retrieve the start IP
* - One 64-bit load to retrieve the offset for calculating the end
* - One 64-bit load to retrieve the abort IP
* - One 32-bit load to retrieve the signature
* - One store to clear the critical section address
*
* The non-debug case implements only the minimal required checking. It
* provides protection against a rogue abort IP in kernel space, which
* would be exploitable at least on x86, and also against a rogue CS
* descriptor by checking the signature at the abort IP. Any fallout from
* invalid critical section descriptors is a user space problem. The debug
* case provides the full set of checks and terminates the task if a
* condition is not met.
*
* In case of a fault or an invalid value, this sets TIF_NOTIFY_RESUME and
* tells the caller to loop back into exit_to_user_mode_loop(). The rseq
* slow path there will handle the failure.
*/
static __always_inline bool rseq_exit_user_update(struct pt_regs *regs, struct task_struct *t)
{
/*
* Page faults need to be disabled as this is called with
* interrupts disabled
*/
guard(pagefault)();
if (likely(!t->rseq.event.ids_changed)) {
struct rseq __user *rseq = t->rseq.usrptr;
/*
* If IDs have not changed rseq_event::user_irq must be true
* See rseq_sched_switch_event().
*/
u64 csaddr;
if (unlikely(get_user_inline(csaddr, &rseq->rseq_cs)))
return false;
if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
return false;
}
return true;
}
struct rseq_ids ids = {
.cpu_id = task_cpu(t),
.mm_cid = task_mm_cid(t),
};
u32 node_id = cpu_to_node(ids.cpu_id);
return rseq_update_usr(t, regs, &ids, node_id);
}
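/*
 * The cost enumeration above maps onto fields of the user-visible
 * struct rseq TLS area. A small stand-alone probe, assuming the
 * installed uapi header (linux/rseq.h) is recent enough to carry
 * node_id and mm_cid, prints where those fields live:
 */
#include <linux/rseq.h>
#include <stddef.h>
#include <stdio.h>

int main(void)
{
	/* The four 32-bit ID fields written when ids_changed is set ... */
	printf("cpu_id_start: %zu\n", offsetof(struct rseq, cpu_id_start));
	printf("cpu_id:       %zu\n", offsetof(struct rseq, cpu_id));
	printf("node_id:      %zu\n", offsetof(struct rseq, node_id));
	printf("mm_cid:       %zu\n", offsetof(struct rseq, mm_cid));
	/* ... and the critical section pointer read on the way out */
	printf("rseq_cs:      %zu\n", offsetof(struct rseq, rseq_cs));
	return 0;
}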
static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
{
struct task_struct *t = current;
/*
* If the task neither went through schedule nor had the flag enforced
* by the rseq syscall or execve, then there is nothing to do here.
*
* CPU ID and MM CID can only change when going through a context
* switch.
*
* rseq_sched_switch_event() sets the rseq_event::sched_switch bit
* only when rseq_event::has_rseq is true. That conditional is
* required to avoid setting the TIF bit if RSEQ is not registered
* for a task. rseq_event::sched_switch is cleared when RSEQ is
* unregistered by a task so it's sufficient to check for the
* sched_switch bit alone.
*
* A sane compiler requires three instructions for the nothing to do
* case including clearing the events, but your mileage might vary.
*/
if (unlikely((t->rseq.event.sched_switch))) {
rseq_stat_inc(rseq_stats.fastpath);
if (unlikely(!rseq_exit_user_update(regs, t)))
return true;
}
/* Clear state so next entry starts from a clean slate */
t->rseq.event.events = 0;
return false;
}
/* Required to allow conversion to GENERIC_ENTRY w/o GENERIC_TIF_BITS */
#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
static __always_inline bool test_tif_rseq(unsigned long ti_work)
{
return ti_work & _TIF_RSEQ;
}
static __always_inline void clear_tif_rseq(void)
{
static_assert(TIF_RSEQ != TIF_NOTIFY_RESUME);
clear_thread_flag(TIF_RSEQ);
}
#else
static __always_inline bool test_tif_rseq(unsigned long ti_work) { return true; }
static __always_inline void clear_tif_rseq(void) { }
#endif
static __always_inline bool
rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
{
if (likely(!test_tif_rseq(ti_work)))
return false;
if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
current->rseq.event.slowpath = true;
set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
return true;
}
clear_tif_rseq();
return false;
}
#else /* CONFIG_GENERIC_ENTRY */
static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
{
return false;
}
#endif /* !CONFIG_GENERIC_ENTRY */
static __always_inline void rseq_syscall_exit_to_user_mode(void)
{
struct rseq_event *ev = &current->rseq.event;
rseq_stat_inc(rseq_stats.exit);
/* Needed to remove the store for the !lockdep case */
if (IS_ENABLED(CONFIG_LOCKDEP)) {
WARN_ON_ONCE(ev->sched_switch);
ev->events = 0;
}
}
static __always_inline void rseq_irqentry_exit_to_user_mode(void)
{
struct rseq_event *ev = &current->rseq.event;
rseq_stat_inc(rseq_stats.exit);
lockdep_assert_once(!ev->sched_switch);
/*
* Ensure that event (especially user_irq) is cleared when the
* interrupt did not result in a schedule and therefore the
* rseq processing could not clear it.
*/
ev->events = 0;
}
/* Required to keep ARM64 working */
static __always_inline void rseq_exit_to_user_mode_legacy(void)
{
struct rseq_event *ev = &current->rseq.event;
rseq_stat_inc(rseq_stats.exit);
if (static_branch_unlikely(&rseq_debug_enabled))
WARN_ON_ONCE(ev->sched_switch);
/*
* Ensure that event (especially user_irq) is cleared when the
* interrupt did not result in a schedule and therefore the
* rseq processing did not clear it.
*/
ev->events = 0;
}
void __rseq_debug_syscall_return(struct pt_regs *regs);
static inline void rseq_debug_syscall_return(struct pt_regs *regs)
{
if (static_branch_unlikely(&rseq_debug_enabled))
__rseq_debug_syscall_return(regs);
}
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
{
return false;
}
static inline void rseq_syscall_exit_to_user_mode(void) { }
static inline void rseq_irqentry_exit_to_user_mode(void) { }
static inline void rseq_exit_to_user_mode_legacy(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */

include/linux/rseq_types.h (new file, 164 lines)
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_RSEQ_TYPES_H
#define _LINUX_RSEQ_TYPES_H
#include <linux/irq_work_types.h>
#include <linux/types.h>
#include <linux/workqueue_types.h>
#ifdef CONFIG_RSEQ
struct rseq;
/**
* struct rseq_event - Storage for rseq related event management
* @all: Compound to initialize and clear the data efficiently
* @events: Compound to access events with a single load/store
* @sched_switch: True if the task was scheduled and needs update on
* exit to user
* @ids_changed: Indicator that IDs need to be updated
* @user_irq: True on interrupt entry from user mode
* @has_rseq: True if the task has a rseq pointer installed
* @error: Compound error code for the slow path to analyze
* @fatal: User space data corrupted or invalid
* @slowpath: Indicator that slow path processing via TIF_NOTIFY_RESUME
* is required
*
* @sched_switch and @ids_changed must be adjacent and the combo must be
* 16bit aligned to allow a single store, when both are set at the same
* time in the scheduler.
*/
struct rseq_event {
union {
u64 all;
struct {
union {
u32 events;
struct {
u8 sched_switch;
u8 ids_changed;
u8 user_irq;
};
};
u8 has_rseq;
u8 __pad;
union {
u16 error;
struct {
u8 fatal;
u8 slowpath;
};
};
};
};
};
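/*
 * A trimmed user-space copy of the layout above (kept purely for
 * inspection; the fatal/slowpath split of @error is omitted) shows why
 * clearing @events is a single 32-bit store which leaves @has_rseq
 * untouched, and why @sched_switch and @ids_changed can be set together
 * with one 16-bit store:
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct event_copy {
	union {
		uint64_t all;
		struct {
			union {
				uint32_t events;
				struct {
					uint8_t sched_switch;
					uint8_t ids_changed;
					uint8_t user_irq;
				};
			};
			uint8_t has_rseq;
			uint8_t pad;
			uint16_t error;
		};
	};
};

static_assert(offsetof(struct event_copy, sched_switch) == 0, "layout");
static_assert(offsetof(struct event_copy, ids_changed) == 1, "layout");
static_assert(offsetof(struct event_copy, has_rseq) == 4, "layout");

int main(void)
{
	struct event_copy ev = { .sched_switch = 1, .user_irq = 1, .has_rseq = 1 };
	uint16_t both = 0x0101;

	/* One 32-bit store wipes all event bytes, has_rseq survives */
	ev.events = 0;
	printf("has_rseq after clearing events: %u\n", ev.has_rseq);

	/* sched_switch and ids_changed set together with one 16-bit store */
	memcpy(&ev, &both, sizeof(both));
	printf("sched_switch=%u ids_changed=%u\n", ev.sched_switch, ev.ids_changed);
	return 0;
}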
/**
* struct rseq_ids - Cache for ids, which need to be updated
* @cpu_cid: Compound of @cpu_id and @mm_cid to make the
* compiler emit a single compare on 64-bit
* @cpu_id: The CPU ID which was written last to user space
* @mm_cid: The MM CID which was written last to user space
*
* @cpu_id and @mm_cid are updated when the data is written to user space.
*/
struct rseq_ids {
union {
u64 cpu_cid;
struct {
u32 cpu_id;
u32 mm_cid;
};
};
};
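/*
 * Same idea in a stand-alone sketch: comparing the combined @cpu_cid
 * word checks the CPU ID and the MM CID with one 64-bit compare
 * (user-space copy of the layout, sample values made up):
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct ids_copy {
	union {
		uint64_t cpu_cid;
		struct {
			uint32_t cpu_id;
			uint32_t mm_cid;
		};
	};
};

static bool ids_changed(const struct ids_copy *cached, const struct ids_copy *cur)
{
	/* One compare instead of two */
	return cached->cpu_cid != cur->cpu_cid;
}

int main(void)
{
	struct ids_copy cached = { .cpu_id = 3, .mm_cid = 1 };
	struct ids_copy cur    = { .cpu_id = 3, .mm_cid = 2 };

	printf("changed: %d\n", ids_changed(&cached, &cur));
	return 0;
}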
/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
* @sig: Signature of critical section abort IPs
* @event: Storage for event management
* @ids: Storage for cached CPU ID and MM CID
*/
struct rseq_data {
struct rseq __user *usrptr;
u32 len;
u32 sig;
struct rseq_event event;
struct rseq_ids ids;
};
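/*
 * @usrptr, @len and @sig mirror what user space hands to sys_rseq(). A
 * minimal registration sketch, illustrative only: the signature below
 * is an arbitrary example value, and the syscall fails (-EINVAL/-EBUSY)
 * when the thread already has an rseq area registered, which a current
 * glibc typically has done at thread start.
 */
#include <linux/rseq.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define EXAMPLE_RSEQ_SIG	0x53053053	/* example signature, pick your own */

static __thread struct rseq rseq_area __attribute__((aligned(32)));

int main(void)
{
	memset(&rseq_area, 0, sizeof(rseq_area));

	/* rseq, rseq_len and sig end up in rseq_data::usrptr/len/sig */
	if (syscall(__NR_rseq, &rseq_area, 32, 0, EXAMPLE_RSEQ_SIG)) {
		perror("rseq");
		return 1;
	}
	printf("cpu_id after registration: %u\n", rseq_area.cpu_id);
	return 0;
}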
#else /* CONFIG_RSEQ */
struct rseq_data { };
#endif /* !CONFIG_RSEQ */
#ifdef CONFIG_SCHED_MM_CID
#define MM_CID_UNSET BIT(31)
#define MM_CID_ONCPU BIT(30)
#define MM_CID_TRANSIT BIT(29)
/**
* struct sched_mm_cid - Storage for per task MM CID data
* @active: MM CID is active for the task
* @cid: The CID associated to the task either permanently or
* borrowed from the CPU
*/
struct sched_mm_cid {
unsigned int active;
unsigned int cid;
};
/**
* struct mm_cid_pcpu - Storage for per CPU MM_CID data
* @cid: The CID associated to the CPU either permanently or
* while a task with a CID is running
*/
struct mm_cid_pcpu {
unsigned int cid;
}____cacheline_aligned_in_smp;
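/*
 * ____cacheline_aligned_in_smp pads each per CPU slot to a full cache
 * line so CPUs updating their own slot never false-share a line.
 * Stand-alone sketch of the effect, assuming a 64-byte cache line:
 */
#include <stddef.h>
#include <stdio.h>

struct pcpu_slot {
	unsigned int cid;
} __attribute__((aligned(64)));		/* stand-in for ____cacheline_aligned_in_smp */

int main(void)
{
	struct pcpu_slot slots[2];

	printf("payload: %zu bytes, padded slot: %zu bytes\n",
	       sizeof(unsigned int), sizeof(struct pcpu_slot));
	printf("slot stride: %zu bytes\n",
	       (size_t)((char *)&slots[1] - (char *)&slots[0]));
	return 0;
}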
/**
* struct mm_mm_cid - Storage for per MM CID data
* @pcpu: Per CPU storage for CIDs associated to a CPU
* @percpu: Set, when CIDs are in per CPU mode
* @transit: Set to MM_CID_TRANSIT during a mode change transition phase
* @max_cids: The exclusive maximum CID value for allocation and convergence
* @irq_work: irq_work to handle the affinity mode change case
* @work: Regular work to handle the affinity mode change case
* @lock: Spinlock to protect against affinity setting which can't take @mutex
* @mutex: Mutex to serialize forks and exits related to this mm
* @nr_cpus_allowed: The number of CPUs in the per MM allowed CPUs map. The map
* only ever grows.
* @users: The number of tasks sharing this MM. Separate from mm::mm_users
* as that is modified by mmget()/mmput() by other entities which
* do not actually share the MM.
* @pcpu_thrs: Threshold for switching back from per CPU mode
* @update_deferred: A deferred switch back to per task mode is pending.
*/
struct mm_mm_cid {
/* Hotpath read mostly members */
struct mm_cid_pcpu __percpu *pcpu;
unsigned int percpu;
unsigned int transit;
unsigned int max_cids;
/* Rarely used. Moves @lock and @mutex into the second cacheline */
struct irq_work irq_work;
struct work_struct work;
raw_spinlock_t lock;
struct mutex mutex;
/* Low frequency modified */
unsigned int nr_cpus_allowed;
unsigned int users;
unsigned int pcpu_thrs;
unsigned int update_deferred;
}____cacheline_aligned_in_smp;
#else /* CONFIG_SCHED_MM_CID */
struct mm_mm_cid { };
struct sched_mm_cid { };
#endif /* !CONFIG_SCHED_MM_CID */
#endif

(next file)

@@ -41,7 +41,7 @@
#include <linux/task_io_accounting.h>
#include <linux/posix-timers_types.h>
#include <linux/restart_block.h>
-#include <uapi/linux/rseq.h>
+#include <linux/rseq_types.h>
#include <linux/seqlock_types.h>
#include <linux/kcsan.h>
#include <linux/rv.h>
@@ -1406,33 +1406,8 @@ struct task_struct {
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */
-#ifdef CONFIG_RSEQ
-struct rseq __user *rseq;
-u32 rseq_len;
-u32 rseq_sig;
-/*
- * RmW on rseq_event_mask must be performed atomically
- * with respect to preemption.
- */
-unsigned long rseq_event_mask;
-# ifdef CONFIG_DEBUG_RSEQ
-/*
- * This is a place holder to save a copy of the rseq fields for
- * validation of read-only fields. The struct rseq has a
- * variable-length array at the end, so it cannot be used
- * directly. Reserve a size large enough for the known fields.
- */
-char rseq_fields[sizeof(struct rseq)];
-# endif
-#endif
-#ifdef CONFIG_SCHED_MM_CID
-int mm_cid; /* Current cid in mm */
-int last_mm_cid; /* Most recent cid in mm */
-int migrate_from_cpu;
-int mm_cid_active; /* Whether cid bitmap is active */
-struct callback_head cid_work;
-#endif
+struct rseq_data rseq;
+struct sched_mm_cid mm_cid;
struct tlbflush_unmap_batch tlb_ubc;
@@ -2325,6 +2300,32 @@ static __always_inline void alloc_tag_restore(struct alloc_tag *tag, struct allo
#define alloc_tag_restore(_tag, _old) do {} while (0)
#endif
/* Avoids recursive inclusion hell */
#ifdef CONFIG_SCHED_MM_CID
void sched_mm_cid_before_execve(struct task_struct *t);
void sched_mm_cid_after_execve(struct task_struct *t);
void sched_mm_cid_fork(struct task_struct *t);
void sched_mm_cid_exit(struct task_struct *t);
static __always_inline int task_mm_cid(struct task_struct *t)
{
return t->mm_cid.cid & ~(MM_CID_ONCPU | MM_CID_TRANSIT);
}
#else
static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
static inline void sched_mm_cid_fork(struct task_struct *t) { }
static inline void sched_mm_cid_exit(struct task_struct *t) { }
static __always_inline int task_mm_cid(struct task_struct *t)
{
/*
* Use the processor id as a fall-back when the mm cid feature is
* disabled. This provides functional per-cpu data structure accesses
* in user-space, although it won't provide the memory usage benefits.
*/
return task_cpu(t);
}
#endif
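/*
 * Stand-alone sketch of the masking in task_mm_cid() above: the mode
 * bits defined in rseq_types.h are stripped so only the plain CID is
 * handed to user space (sample value made up):
 */
#include <stdio.h>

#define MM_CID_UNSET	(1U << 31)
#define MM_CID_ONCPU	(1U << 30)
#define MM_CID_TRANSIT	(1U << 29)

static unsigned int plain_cid(unsigned int cid)
{
	/* Same expression as task_mm_cid() */
	return cid & ~(MM_CID_ONCPU | MM_CID_TRANSIT);
}

int main(void)
{
	unsigned int stored = 5U | MM_CID_ONCPU;	/* CID 5, currently bound to a CPU */

	printf("stored: %#x, user visible: %u\n", stored, plain_cid(stored));
	return 0;
}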
#ifndef MODULE
#ifndef COMPILE_OFFSETS

(next file)

@@ -67,6 +67,11 @@ enum syscall_work_bit {
#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
#endif
#ifndef TIF_RSEQ
# define TIF_RSEQ TIF_NOTIFY_RESUME
# define _TIF_RSEQ _TIF_NOTIFY_RESUME
#endif
#ifdef __KERNEL__
#ifndef arch_set_restart_data

(next file)

@@ -21,9 +21,9 @@ TRACE_EVENT(rseq_update,
),
TP_fast_assign(
-__entry->cpu_id = raw_smp_processor_id();
+__entry->cpu_id = t->rseq.ids.cpu_id;
__entry->node_id = cpu_to_node(__entry->cpu_id);
-__entry->mm_cid = task_mm_cid(t);
+__entry->mm_cid = t->rseq.ids.mm_cid;
),
TP_printk("cpu_id=%d node_id=%d mm_cid=%d", __entry->cpu_id,

(next file)

@@ -114,20 +114,13 @@ struct rseq {
/*
* Restartable sequences flags field.
*
-* This field should only be updated by the thread which
-* registered this data structure. Read by the kernel.
-* Mainly used for single-stepping through rseq critical sections
-* with debuggers.
+* This field was initially intended to allow event masking for
+* single-stepping through rseq critical sections with debuggers.
+* The kernel does not support this anymore and the relevant bits
+* are checked for being always false:
*
* - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
-* Inhibit instruction sequence block restart on preemption
-* for this thread.
* - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
-* Inhibit instruction sequence block restart on signal
-* delivery for this thread.
* - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
-* Inhibit instruction sequence block restart on migration for
-* this thread.
*/
__u32 flags;

(next file)

@@ -1913,10 +1913,36 @@ config RSEQ
If unsure, say Y.
config RSEQ_STATS
default n
bool "Enable lightweight statistics of restartable sequences" if EXPERT
depends on RSEQ && DEBUG_FS
help
Enable lightweight counters which expose information about the
frequency of RSEQ operations via debugfs. Mostly interesting for
kernel debugging or performance analysis. While lightweight it's
still adding code into the user/kernel mode transitions.
If unsure, say N.
config RSEQ_DEBUG_DEFAULT_ENABLE
default n
bool "Enable restartable sequences debug mode by default" if EXPERT
depends on RSEQ
help
This enables the static branch for debug mode of restartable
sequences.
This also can be controlled on the kernel command line via the
command line parameter "rseq_debug=0/1" and through debugfs.
If unsure, say N.
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
-depends on RSEQ && DEBUG_KERNEL
+depends on RSEQ && DEBUG_KERNEL && !GENERIC_ENTRY
select RSEQ_DEBUG_DEFAULT_ENABLE
help
Enable extra debugging checks for the rseq system call.

(next file)

@@ -250,6 +250,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
#ifdef CONFIG_SECCOMP_FILTER
.seccomp = { .filter_count = ATOMIC_INIT(0) },
#endif
#ifdef CONFIG_SCHED_MM_CID
.mm_cid = { .cid = MM_CID_UNSET, },
#endif
};
EXPORT_SYMBOL(init_task);

(next file)

@@ -3085,10 +3085,13 @@ EXPORT_SYMBOL(cpu_all_bits);
#ifdef CONFIG_INIT_ALL_POSSIBLE
struct cpumask __cpu_possible_mask __ro_after_init
= {CPU_BITS_ALL};
unsigned int __num_possible_cpus __ro_after_init = NR_CPUS;
#else
struct cpumask __cpu_possible_mask __ro_after_init;
unsigned int __num_possible_cpus __ro_after_init;
#endif
EXPORT_SYMBOL(__cpu_possible_mask);
EXPORT_SYMBOL(__num_possible_cpus);
struct cpumask __cpu_online_mask __read_mostly;
EXPORT_SYMBOL(__cpu_online_mask);
@@ -3116,6 +3119,7 @@ void init_cpu_present(const struct cpumask *src)
void init_cpu_possible(const struct cpumask *src)
{
cpumask_copy(&__cpu_possible_mask, src);
__num_possible_cpus = cpumask_weight(&__cpu_possible_mask);
}
void set_cpu_online(unsigned int cpu, bool online)
@@ -3139,6 +3143,21 @@ void set_cpu_online(unsigned int cpu, bool online)
}
}
/*
* This should be marked __init, but there is a boatload of call sites
* which need to be fixed up to do so. Sigh...
*/
void set_cpu_possible(unsigned int cpu, bool possible)
{
if (possible) {
if (!cpumask_test_and_set_cpu(cpu, &__cpu_possible_mask))
__num_possible_cpus++;
} else {
if (cpumask_test_and_clear_cpu(cpu, &__cpu_possible_mask))
__num_possible_cpus--;
}
}
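/*
 * Stand-alone sketch of the pattern above: keep a cached count next to
 * a bitmask so readers get the weight without recomputing it. Fixed
 * 64-bit mask for illustration only:
 */
#include <stdbool.h>
#include <stdio.h>

static unsigned long long possible_mask;
static unsigned int num_possible;

static void set_possible(unsigned int bit, bool possible)
{
	unsigned long long b = 1ULL << bit;

	if (possible) {
		if (!(possible_mask & b)) {	/* mirrors cpumask_test_and_set_cpu() */
			possible_mask |= b;
			num_possible++;
		}
	} else {
		if (possible_mask & b) {	/* mirrors cpumask_test_and_clear_cpu() */
			possible_mask &= ~b;
			num_possible--;
		}
	}
}

int main(void)
{
	set_possible(0, true);
	set_possible(3, true);
	set_possible(3, true);			/* already set: count unchanged */
	printf("possible bits: %u\n", num_possible);
	return 0;
}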
/*
* Activate the first processor.
*/

(next file)

@ -11,19 +11,20 @@
/* Workaround to allow gradual conversion of architecture code */ /* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal_or_restart(struct pt_regs *regs) { } void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
/** #ifdef CONFIG_HAVE_GENERIC_TIF_BITS
* exit_to_user_mode_loop - do any pending work before leaving to user space #define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK & ~_TIF_RSEQ)
* @regs: Pointer to pt_regs on entry stack #else
* @ti_work: TIF work flags as read by the caller #define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK)
*/ #endif
__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work) unsigned long ti_work)
{ {
/* /*
* Before returning to user space ensure that all pending work * Before returning to user space ensure that all pending work
* items have been completed. * items have been completed.
*/ */
while (ti_work & EXIT_TO_USER_MODE_WORK) { while (ti_work & EXIT_TO_USER_MODE_WORK_LOOP) {
local_irq_enable_exit_to_user(ti_work); local_irq_enable_exit_to_user(ti_work);
@ -62,17 +63,21 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
return ti_work; return ti_work;
} }
noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs) /**
* exit_to_user_mode_loop - do any pending work before leaving to user space
* @regs: Pointer to pt_regs on entry stack
* @ti_work: TIF work flags as read by the caller
*/
__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{ {
enter_from_user_mode(regs); for (;;) {
} ti_work = __exit_to_user_mode_loop(regs, ti_work);
noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs) if (likely(!rseq_exit_to_user_mode_restart(regs, ti_work)))
{ return ti_work;
instrumentation_begin(); ti_work = read_thread_flags();
exit_to_user_mode_prepare(regs); }
instrumentation_end();
exit_to_user_mode();
} }
noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)

(next file)

@@ -63,14 +63,6 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
return ret ? : syscall;
}
-noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
-{
-enter_from_user_mode(regs);
-instrumentation_begin();
-local_irq_enable();
-instrumentation_end();
-}
/*
* If SYSCALL_EMU is set, then the only reason to report is when
* SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP). This syscall

(next file)

@@ -911,6 +911,7 @@ void __noreturn do_exit(long code)
user_events_exit(tsk);
io_uring_files_cancel();
sched_mm_cid_exit(tsk);
exit_signals(tsk); /* sets PF_EXITING */
seccomp_filter_release(tsk);

(next file)

@@ -955,10 +955,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
#endif
#ifdef CONFIG_SCHED_MM_CID
-tsk->mm_cid = -1;
-tsk->last_mm_cid = -1;
-tsk->mm_cid_active = 0;
-tsk->migrate_from_cpu = -1;
+tsk->mm_cid.cid = MM_CID_UNSET;
+tsk->mm_cid.active = 0;
#endif
return tsk;
@@ -2456,6 +2454,7 @@ __latent_entropy struct task_struct *copy_process(
exit_nsproxy_namespaces(p);
bad_fork_cleanup_mm:
if (p->mm) {
sched_mm_cid_exit(p);
mm_clear_owner(p->mm, p);
mmput(p->mm);
}

(next file)

@@ -793,9 +793,9 @@ static long ptrace_get_rseq_configuration(struct task_struct *task,
unsigned long size, void __user *data)
{
struct ptrace_rseq_configuration conf = {
-.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
-.rseq_abi_size = task->rseq_len,
-.signature = task->rseq_sig,
+.rseq_abi_pointer = (u64)(uintptr_t)task->rseq.usrptr,
+.rseq_abi_size = task->rseq.len,
+.signature = task->rseq.sig,
.flags = 0,
};

(next file)

@ -8,98 +8,7 @@
* Mathieu Desnoyers <mathieu.desnoyers@efficios.com> * Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
*/ */
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/syscalls.h>
#include <linux/rseq.h>
#include <linux/types.h>
#include <linux/ratelimit.h>
#include <asm/ptrace.h>
#define CREATE_TRACE_POINTS
#include <trace/events/rseq.h>
/* The original rseq structure size (including padding) is 32 bytes. */
#define ORIG_RSEQ_SIZE 32
#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
#ifdef CONFIG_DEBUG_RSEQ
static struct rseq *rseq_kernel_fields(struct task_struct *t)
{
return (struct rseq *) t->rseq_fields;
}
static int rseq_validate_ro_fields(struct task_struct *t)
{
static DEFINE_RATELIMIT_STATE(_rs,
DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
u32 cpu_id_start, cpu_id, node_id, mm_cid;
struct rseq __user *rseq = t->rseq;
/* /*
* Validate fields which are required to be read-only by
* user-space.
*/
if (!user_read_access_begin(rseq, t->rseq_len))
goto efault;
unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end);
unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end);
unsafe_get_user(node_id, &rseq->node_id, efault_end);
unsafe_get_user(mm_cid, &rseq->mm_cid, efault_end);
user_read_access_end();
if ((cpu_id_start != rseq_kernel_fields(t)->cpu_id_start ||
cpu_id != rseq_kernel_fields(t)->cpu_id ||
node_id != rseq_kernel_fields(t)->node_id ||
mm_cid != rseq_kernel_fields(t)->mm_cid) && __ratelimit(&_rs)) {
pr_warn("Detected rseq corruption for pid: %d, name: %s\n"
"\tcpu_id_start: %u ?= %u\n"
"\tcpu_id: %u ?= %u\n"
"\tnode_id: %u ?= %u\n"
"\tmm_cid: %u ?= %u\n",
t->pid, t->comm,
cpu_id_start, rseq_kernel_fields(t)->cpu_id_start,
cpu_id, rseq_kernel_fields(t)->cpu_id,
node_id, rseq_kernel_fields(t)->node_id,
mm_cid, rseq_kernel_fields(t)->mm_cid);
}
/* For now, only print a console warning on mismatch. */
return 0;
efault_end:
user_read_access_end();
efault:
return -EFAULT;
}
/*
* Update an rseq field and its in-kernel copy in lock-step to keep a coherent
* state.
*/
#define rseq_unsafe_put_user(t, value, field, error_label) \
do { \
unsafe_put_user(value, &t->rseq->field, error_label); \
rseq_kernel_fields(t)->field = value; \
} while (0)
#else
static int rseq_validate_ro_fields(struct task_struct *t)
{
return 0;
}
#define rseq_unsafe_put_user(t, value, field, error_label) \
unsafe_put_user(value, &t->rseq->field, error_label)
#endif
/*
*
* Restartable sequences are a lightweight interface that allows * Restartable sequences are a lightweight interface that allows
* user-level code to be executed atomically relative to scheduler * user-level code to be executed atomically relative to scheduler
* preemption and signal delivery. Typically used for implementing * preemption and signal delivery. Typically used for implementing
@ -158,356 +67,356 @@ static int rseq_validate_ro_fields(struct task_struct *t)
* F1. <failure> * F1. <failure>
*/ */
static int rseq_update_cpu_node_id(struct task_struct *t) /* Required to select the proper per_cpu ops for rseq_stats_inc() */
#define RSEQ_BUILD_SLOW_PATH
#include <linux/debugfs.h>
#include <linux/ratelimit.h>
#include <linux/rseq_entry.h>
#include <linux/sched.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>
#include <linux/types.h>
#include <asm/ptrace.h>
#define CREATE_TRACE_POINTS
#include <trace/events/rseq.h>
DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
static inline void rseq_control_debug(bool on)
{ {
struct rseq __user *rseq = t->rseq; if (on)
u32 cpu_id = raw_smp_processor_id(); static_branch_enable(&rseq_debug_enabled);
u32 node_id = cpu_to_node(cpu_id); else
u32 mm_cid = task_mm_cid(t); static_branch_disable(&rseq_debug_enabled);
}
static int __init rseq_setup_debug(char *str)
{
bool on;
if (kstrtobool(str, &on))
return -EINVAL;
rseq_control_debug(on);
return 1;
}
__setup("rseq_debug=", rseq_setup_debug);
#ifdef CONFIG_TRACEPOINTS
/* /*
* Validate read-only rseq fields. * Out of line, so the actual update functions can be in a header to be
* inlined into the exit to user code.
*/ */
if (rseq_validate_ro_fields(t)) void __rseq_trace_update(struct task_struct *t)
goto efault; {
WARN_ON_ONCE((int) mm_cid < 0);
if (!user_write_access_begin(rseq, t->rseq_len))
goto efault;
rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end);
rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
rseq_unsafe_put_user(t, node_id, node_id, efault_end);
rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
/*
* Additional feature fields added after ORIG_RSEQ_SIZE
* need to be conditionally updated only if
* t->rseq_len != ORIG_RSEQ_SIZE.
*/
user_write_access_end();
trace_rseq_update(t); trace_rseq_update(t);
return 0;
efault_end:
user_write_access_end();
efault:
return -EFAULT;
} }
static int rseq_reset_rseq_cpu_node_id(struct task_struct *t) void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
unsigned long offset, unsigned long abort_ip)
{ {
struct rseq __user *rseq = t->rseq; trace_rseq_ip_fixup(ip, start_ip, offset, abort_ip);
u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0, }
mm_cid = 0; #endif /* CONFIG_TRACEPOINTS */
/* #ifdef CONFIG_DEBUG_FS
* Validate read-only rseq fields. #ifdef CONFIG_RSEQ_STATS
*/ DEFINE_PER_CPU(struct rseq_stats, rseq_stats);
if (rseq_validate_ro_fields(t))
goto efault;
if (!user_write_access_begin(rseq, t->rseq_len)) static int rseq_stats_show(struct seq_file *m, void *p)
goto efault; {
struct rseq_stats stats = { };
unsigned int cpu;
/* for_each_possible_cpu(cpu) {
* Reset all fields to their initial state. stats.exit += data_race(per_cpu(rseq_stats.exit, cpu));
* stats.signal += data_race(per_cpu(rseq_stats.signal, cpu));
* All fields have an initial state of 0 except cpu_id which is set to stats.slowpath += data_race(per_cpu(rseq_stats.slowpath, cpu));
* RSEQ_CPU_ID_UNINITIALIZED, so that any user coming in after stats.fastpath += data_race(per_cpu(rseq_stats.fastpath, cpu));
* unregistration can figure out that rseq needs to be registered stats.ids += data_race(per_cpu(rseq_stats.ids, cpu));
* again. stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
*/ stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
rseq_unsafe_put_user(t, cpu_id_start, cpu_id_start, efault_end); stats.fixup += data_race(per_cpu(rseq_stats.fixup, cpu));
rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
rseq_unsafe_put_user(t, node_id, node_id, efault_end);
rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
/*
* Additional feature fields added after ORIG_RSEQ_SIZE
* need to be conditionally reset only if
* t->rseq_len != ORIG_RSEQ_SIZE.
*/
user_write_access_end();
return 0;
efault_end:
user_write_access_end();
efault:
return -EFAULT;
} }
/* seq_printf(m, "exit: %16lu\n", stats.exit);
* Get the user-space pointer value stored in the 'rseq_cs' field. seq_printf(m, "signal: %16lu\n", stats.signal);
*/ seq_printf(m, "slowp: %16lu\n", stats.slowpath);
static int rseq_get_rseq_cs_ptr_val(struct rseq __user *rseq, u64 *rseq_cs) seq_printf(m, "fastp: %16lu\n", stats.fastpath);
{ seq_printf(m, "ids: %16lu\n", stats.ids);
if (!rseq_cs) seq_printf(m, "cs: %16lu\n", stats.cs);
return -EFAULT; seq_printf(m, "clear: %16lu\n", stats.clear);
seq_printf(m, "fixup: %16lu\n", stats.fixup);
return 0;
}
#ifdef CONFIG_64BIT static int rseq_stats_open(struct inode *inode, struct file *file)
if (get_user(*rseq_cs, &rseq->rseq_cs)) {
return -EFAULT; return single_open(file, rseq_stats_show, inode->i_private);
}
static const struct file_operations stat_ops = {
.open = rseq_stats_open,
.read = seq_read,
.llseek = seq_lseek,
.release = single_release,
};
static int __init rseq_stats_init(struct dentry *root_dir)
{
debugfs_create_file("stats", 0444, root_dir, NULL, &stat_ops);
return 0;
}
#else #else
if (copy_from_user(rseq_cs, &rseq->rseq_cs, sizeof(*rseq_cs))) static inline void rseq_stats_init(struct dentry *root_dir) { }
return -EFAULT; #endif /* CONFIG_RSEQ_STATS */
#endif
return 0; static int rseq_debug_show(struct seq_file *m, void *p)
}
/*
* If the rseq_cs field of 'struct rseq' contains a valid pointer to
* user-space, copy 'struct rseq_cs' from user-space and validate its fields.
*/
static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
{ {
struct rseq_cs __user *urseq_cs; bool on = static_branch_unlikely(&rseq_debug_enabled);
u64 ptr;
u32 __user *usig;
u32 sig;
int ret;
ret = rseq_get_rseq_cs_ptr_val(t->rseq, &ptr); seq_printf(m, "%d\n", on);
if (ret)
return ret;
/* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
if (!ptr) {
memset(rseq_cs, 0, sizeof(*rseq_cs));
return 0;
}
/* Check that the pointer value fits in the user-space process space. */
if (ptr >= TASK_SIZE)
return -EINVAL;
urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
return -EFAULT;
if (rseq_cs->start_ip >= TASK_SIZE ||
rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
rseq_cs->abort_ip >= TASK_SIZE ||
rseq_cs->version > 0)
return -EINVAL;
/* Check for overflow. */
if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
return -EINVAL;
/* Ensure that abort_ip is not in the critical section. */
if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
return -EINVAL;
usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
ret = get_user(sig, usig);
if (ret)
return ret;
if (current->rseq_sig != sig) {
printk_ratelimited(KERN_WARNING
"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
sig, current->rseq_sig, current->pid, usig);
return -EINVAL;
}
return 0; return 0;
} }
static bool rseq_warn_flags(const char *str, u32 flags) static ssize_t rseq_debug_write(struct file *file, const char __user *ubuf,
size_t count, loff_t *ppos)
{ {
u32 test_flags; bool on;
if (!flags) if (kstrtobool_from_user(ubuf, count, &on))
return false; return -EINVAL;
test_flags = flags & RSEQ_CS_NO_RESTART_FLAGS;
if (test_flags) rseq_control_debug(on);
pr_warn_once("Deprecated flags (%u) in %s ABI structure", test_flags, str); return count;
test_flags = flags & ~RSEQ_CS_NO_RESTART_FLAGS; }
if (test_flags)
pr_warn_once("Unknown flags (%u) in %s ABI structure", test_flags, str); static int rseq_debug_open(struct inode *inode, struct file *file)
{
return single_open(file, rseq_debug_show, inode->i_private);
}
static const struct file_operations debug_ops = {
.open = rseq_debug_open,
.read = seq_read,
.write = rseq_debug_write,
.llseek = seq_lseek,
.release = single_release,
};
static int __init rseq_debugfs_init(void)
{
struct dentry *root_dir = debugfs_create_dir("rseq", NULL);
debugfs_create_file("debug", 0644, root_dir, NULL, &debug_ops);
rseq_stats_init(root_dir);
return 0;
}
__initcall(rseq_debugfs_init);
#endif /* CONFIG_DEBUG_FS */
static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
{
return rseq_set_ids_get_csaddr(t, ids, node_id, NULL);
}
static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
{
struct rseq __user *urseq = t->rseq.usrptr;
u64 csaddr;
scoped_user_read_access(urseq, efault)
unsafe_get_user(csaddr, &urseq->rseq_cs, efault);
if (likely(!csaddr))
return true; return true;
return rseq_update_user_cs(t, regs, csaddr);
efault:
return false;
} }
static int rseq_need_restart(struct task_struct *t, u32 cs_flags) static void rseq_slowpath_update_usr(struct pt_regs *regs)
{
u32 flags, event_mask;
int ret;
if (rseq_warn_flags("rseq_cs", cs_flags))
return -EINVAL;
/* Get thread flags. */
ret = get_user(flags, &t->rseq->flags);
if (ret)
return ret;
if (rseq_warn_flags("rseq", flags))
return -EINVAL;
/*
* Load and clear event mask atomically with respect to
* scheduler preemption and membarrier IPIs.
*/
scoped_guard(RSEQ_EVENT_GUARD) {
event_mask = t->rseq_event_mask;
t->rseq_event_mask = 0;
}
return !!event_mask;
}
static int clear_rseq_cs(struct rseq __user *rseq)
{ {
/* /*
* The rseq_cs field is set to NULL on preemption or signal * Preserve rseq state and user_irq state. The generic entry code
* delivery on top of rseq assembly block, as well as on top * clears user_irq on the way out, the non-generic entry
* of code outside of the rseq assembly block. This performs * architectures are not having user_irq.
* a lazy clear of the rseq_cs field.
*
* Set rseq_cs to NULL.
*/ */
#ifdef CONFIG_64BIT const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
return put_user(0UL, &rseq->rseq_cs);
#else
if (clear_user(&rseq->rseq_cs, sizeof(rseq->rseq_cs)))
return -EFAULT;
return 0;
#endif
}
/*
* Unsigned comparison will be true when ip >= start_ip, and when
* ip < start_ip + post_commit_offset.
*/
static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
{
return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
}
static int rseq_ip_fixup(struct pt_regs *regs)
{
unsigned long ip = instruction_pointer(regs);
struct task_struct *t = current; struct task_struct *t = current;
struct rseq_cs rseq_cs; struct rseq_ids ids;
int ret; u32 node_id;
bool event;
ret = rseq_get_rseq_cs(t, &rseq_cs);
if (ret)
return ret;
/*
* Handle potentially not being within a critical section.
* If not nested over a rseq critical section, restart is useless.
* Clear the rseq_cs pointer and return.
*/
if (!in_rseq_cs(ip, &rseq_cs))
return clear_rseq_cs(t->rseq);
ret = rseq_need_restart(t, rseq_cs.flags);
if (ret <= 0)
return ret;
ret = clear_rseq_cs(t->rseq);
if (ret)
return ret;
trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
rseq_cs.abort_ip);
instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);
return 0;
}
/*
* This resume handler must always be executed between any of:
* - preemption,
* - signal delivery,
* and return to user-space.
*
* This is how we can ensure that the entire rseq critical section
* will issue the commit instruction only if executed atomically with
* respect to other threads scheduled on the same CPU, and with respect
* to signal handlers.
*/
void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
{
struct task_struct *t = current;
int ret, sig;
if (unlikely(t->flags & PF_EXITING)) if (unlikely(t->flags & PF_EXITING))
return; return;
rseq_stat_inc(rseq_stats.slowpath);
/* /*
* regs is NULL if and only if the caller is in a syscall path. Skip * Read and clear the event pending bit first. If the task
* fixup and leave rseq_cs as is so that rseq_sycall() will detect and * was not preempted or migrated or a signal is on the way,
* kill a misbehaving userspace on debug kernels. * there is no point in doing any of the heavy lifting here
* on production kernels. In that case TIF_NOTIFY_RESUME
* was raised by some other functionality.
*
* This is correct because the read/clear operation is
* guarded against scheduler preemption, which makes it CPU
* local atomic. If the task is preempted right after
* re-enabling preemption then TIF_NOTIFY_RESUME is set
* again and this function is invoked another time _before_
* the task is able to return to user mode.
*
* On a debug kernel, invoke the fixup code unconditionally
* with the result handed in to allow the detection of
* inconsistencies.
*/ */
if (regs) { scoped_guard(irq) {
ret = rseq_ip_fixup(regs); event = t->rseq.event.sched_switch;
if (unlikely(ret < 0)) t->rseq.event.all &= evt_mask.all;
goto error; ids.cpu_id = task_cpu(t);
ids.mm_cid = task_mm_cid(t);
} }
if (unlikely(rseq_update_cpu_node_id(t)))
goto error; if (!event)
return; return;
error: node_id = cpu_to_node(ids.cpu_id);
sig = ksig ? ksig->sig : 0;
force_sigsegv(sig); if (unlikely(!rseq_update_usr(t, regs, &ids, node_id))) {
/*
* Clear the errors just in case this might survive magically, but
* leave the rest intact.
*/
t->rseq.event.error = 0;
force_sig(SIGSEGV);
}
} }
#ifdef CONFIG_DEBUG_RSEQ void __rseq_handle_slowpath(struct pt_regs *regs)
{
/*
* If invoked from hypervisors before entering the guest via
* resume_user_mode_work(), then @regs is a NULL pointer.
*
* resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
* it before returning from the ioctl() to user space when
* rseq_event.sched_switch is set.
*
* So it's safe to ignore here instead of pointlessly updating it
* in the vcpu_run() loop.
*/
if (!regs)
return;
rseq_slowpath_update_usr(regs);
}
void __rseq_signal_deliver(int sig, struct pt_regs *regs)
{
rseq_stat_inc(rseq_stats.signal);
/*
* Don't update IDs, they are handled on exit to user if
* necessary. The important thing is to abort a critical section of
* the interrupted context as after this point the instruction
* pointer in @regs points to the signal handler.
*/
if (unlikely(!rseq_handle_cs(current, regs))) {
/*
* Clear the errors just in case this might survive
* magically, but leave the rest intact.
*/
current->rseq.event.error = 0;
force_sigsegv(sig);
}
}
/* /*
* Terminate the process if a syscall is issued within a restartable * Terminate the process if a syscall is issued within a restartable
* sequence. * sequence.
*/ */
void rseq_syscall(struct pt_regs *regs) void __rseq_debug_syscall_return(struct pt_regs *regs)
{ {
unsigned long ip = instruction_pointer(regs);
struct task_struct *t = current; struct task_struct *t = current;
struct rseq_cs rseq_cs; u64 csaddr;
if (!t->rseq) if (!t->rseq.event.has_rseq)
return; return;
if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs)) if (get_user(csaddr, &t->rseq.usrptr->rseq_cs))
goto fail;
if (likely(!csaddr))
return;
if (unlikely(csaddr >= TASK_SIZE))
goto fail;
if (rseq_debug_update_user_cs(t, regs, csaddr))
return;
fail:
force_sig(SIGSEGV); force_sig(SIGSEGV);
} }
#ifdef CONFIG_DEBUG_RSEQ
/* Kept around to keep GENERIC_ENTRY=n architectures supported. */
void rseq_syscall(struct pt_regs *regs)
{
__rseq_debug_syscall_return(regs);
}
#endif #endif
static bool rseq_reset_ids(void)
{
struct rseq_ids ids = {
.cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
.mm_cid = 0,
};
/*
* If this fails, terminate it because this leaves the kernel in
* stupid state as exit to user space will try to fixup the ids
* again.
*/
if (rseq_set_ids(current, &ids, 0))
return true;
force_sig(SIGSEGV);
return false;
}
/* The original rseq structure size (including padding) is 32 bytes. */
#define ORIG_RSEQ_SIZE 32
/* /*
* sys_rseq - setup restartable sequences for caller thread. * sys_rseq - setup restartable sequences for caller thread.
*/ */
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
int, flags, u32, sig)
{ {
int ret;
u64 rseq_cs;
if (flags & RSEQ_FLAG_UNREGISTER) { if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER) if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL; return -EINVAL;
/* Unregister rseq for current thread. */ /* Unregister rseq for current thread. */
if (current->rseq != rseq || !current->rseq) if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
return -EINVAL; return -EINVAL;
if (rseq_len != current->rseq_len) if (rseq_len != current->rseq.len)
return -EINVAL; return -EINVAL;
if (current->rseq_sig != sig) if (current->rseq.sig != sig)
return -EPERM; return -EPERM;
ret = rseq_reset_rseq_cpu_node_id(current); if (!rseq_reset_ids())
if (ret) return -EFAULT;
return ret; rseq_reset(current);
current->rseq = NULL;
current->rseq_sig = 0;
current->rseq_len = 0;
return 0; return 0;
} }
if (unlikely(flags)) if (unlikely(flags))
return -EINVAL; return -EINVAL;
if (current->rseq) { if (current->rseq.usrptr) {
/* /*
* If rseq is already registered, check whether * If rseq is already registered, check whether
* the provided address differs from the prior * the provided address differs from the prior
* one. * one.
*/ */
if (current->rseq != rseq || rseq_len != current->rseq_len) if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
return -EINVAL; return -EINVAL;
if (current->rseq_sig != sig) if (current->rseq.sig != sig)
return -EPERM; return -EPERM;
/* Already registered. */ /* Already registered. */
return -EBUSY; return -EBUSY;
@ -531,43 +440,39 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
if (!access_ok(rseq, rseq_len)) if (!access_ok(rseq, rseq_len))
return -EFAULT; return -EFAULT;
scoped_user_write_access(rseq, efault) {
/* /*
* If the rseq_cs pointer is non-NULL on registration, clear it to * If the rseq_cs pointer is non-NULL on registration, clear it to
* avoid a potential segfault on return to user-space. The proper thing * avoid a potential segfault on return to user-space. The proper thing
* to do would have been to fail the registration but this would break * to do would have been to fail the registration but this would break
* older libcs that reuse the rseq area for new threads without * older libcs that reuse the rseq area for new threads without
* clearing the fields. * clearing the fields. Don't bother reading it, just reset it.
*/ */
if (rseq_get_rseq_cs_ptr_val(rseq, &rseq_cs)) unsafe_put_user(0UL, &rseq->rseq_cs, efault);
return -EFAULT; /* Initialize IDs in user space */
if (rseq_cs && clear_rseq_cs(rseq)) unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
return -EFAULT; unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
unsafe_put_user(0U, &rseq->node_id, efault);
unsafe_put_user(0U, &rseq->mm_cid, efault);
}
#ifdef CONFIG_DEBUG_RSEQ
/*
* Initialize the in-kernel rseq fields copy for validation of
* read-only fields.
*/
if (get_user(rseq_kernel_fields(current)->cpu_id_start, &rseq->cpu_id_start) ||
get_user(rseq_kernel_fields(current)->cpu_id, &rseq->cpu_id) ||
get_user(rseq_kernel_fields(current)->node_id, &rseq->node_id) ||
get_user(rseq_kernel_fields(current)->mm_cid, &rseq->mm_cid))
return -EFAULT;
#endif
/* /*
* Activate the registration by setting the rseq area address, length * Activate the registration by setting the rseq area address, length
* and signature in the task struct. * and signature in the task struct.
*/ */
current->rseq = rseq; current->rseq.usrptr = rseq;
current->rseq_len = rseq_len; current->rseq.len = rseq_len;
current->rseq_sig = sig; current->rseq.sig = sig;
/* /*
* If rseq was previously inactive, and has just been * If rseq was previously inactive, and has just been
* registered, ensure the cpu_id_start and cpu_id fields * registered, ensure the cpu_id_start and cpu_id fields
* are updated before returning to user-space. * are updated before returning to user-space.
*/ */
rseq_set_notify_resume(current); current->rseq.event.has_rseq = true;
rseq_force_update();
return 0; return 0;
efault:
return -EFAULT;
} }

File diff suppressed because it is too large

(next file)

@@ -199,7 +199,7 @@ static void ipi_rseq(void *info)
* is negligible.
*/
smp_mb();
-rseq_preempt(current);
+rseq_sched_switch_event(current);
}
static void ipi_sync_rq_state(void *info)
@@ -407,9 +407,9 @@ static int membarrier_private_expedited(int flags, int cpu_id)
* membarrier, we will end up with some thread in the mm
* running without a core sync.
*
-* For RSEQ, don't rseq_preempt() the caller. User code
-* is not supposed to issue syscalls at all from inside an
-* rseq critical section.
+* For RSEQ, don't invoke rseq_sched_switch_event() on the
+* caller. User code is not supposed to issue syscalls at
+* all from inside an rseq critical section.
*/
if (flags != MEMBARRIER_FLAG_SYNC_CORE) {
preempt_disable();

(next file)

@@ -2223,6 +2223,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
smp_wmb();
WRITE_ONCE(task_thread_info(p)->cpu, cpu);
p->wake_cpu = cpu;
rseq_sched_set_ids_changed(p);
#endif /* CONFIG_SMP */
}
@ -3679,283 +3680,212 @@ extern const char *preempt_modes[];
#ifdef CONFIG_SCHED_MM_CID #ifdef CONFIG_SCHED_MM_CID
-#define SCHED_MM_CID_PERIOD_NS	(100ULL * 1000000)	/* 100ms */
-#define MM_CID_SCAN_DELAY	100			/* 100ms */
-
-extern raw_spinlock_t cid_lock;
-extern int use_cid_lock;
-
-extern void sched_mm_cid_migrate_from(struct task_struct *t);
-extern void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t);
-extern void task_tick_mm_cid(struct rq *rq, struct task_struct *curr);
-extern void init_sched_mm_cid(struct task_struct *t);
-
-static inline void __mm_cid_put(struct mm_struct *mm, int cid)
-{
-	if (cid < 0)
-		return;
-	cpumask_clear_cpu(cid, mm_cidmask(mm));
-}
-
-/*
- * The per-mm/cpu cid can have the MM_CID_LAZY_PUT flag set or transition to
- * the MM_CID_UNSET state without holding the rq lock, but the rq lock needs to
- * be held to transition to other states.
- *
- * State transitions synchronized with cmpxchg or try_cmpxchg need to be
- * consistent across CPUs, which prevents use of this_cpu_cmpxchg.
- */
-static inline void mm_cid_put_lazy(struct task_struct *t)
-{
-	struct mm_struct *mm = t->mm;
-	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
-	int cid;
-
-	lockdep_assert_irqs_disabled();
-	cid = __this_cpu_read(pcpu_cid->cid);
-	if (!mm_cid_is_lazy_put(cid) ||
-	    !try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
-		return;
-	__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
-}
-
-static inline int mm_cid_pcpu_unset(struct mm_struct *mm)
-{
-	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
-	int cid, res;
-
-	lockdep_assert_irqs_disabled();
-	cid = __this_cpu_read(pcpu_cid->cid);
-	for (;;) {
-		if (mm_cid_is_unset(cid))
-			return MM_CID_UNSET;
-		/*
-		 * Attempt transition from valid or lazy-put to unset.
-		 */
-		res = cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, cid, MM_CID_UNSET);
-		if (res == cid)
-			break;
-		cid = res;
-	}
-	return cid;
-}
-
-static inline void mm_cid_put(struct mm_struct *mm)
-{
-	int cid;
-
-	lockdep_assert_irqs_disabled();
-	cid = mm_cid_pcpu_unset(mm);
-	if (cid == MM_CID_UNSET)
-		return;
-	__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
-}
-
-static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
-{
-	struct cpumask *cidmask = mm_cidmask(mm);
-	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
-	int cid, max_nr_cid, allowed_max_nr_cid;
-
-	/*
-	 * After shrinking the number of threads or reducing the number
-	 * of allowed cpus, reduce the value of max_nr_cid so expansion
-	 * of cid allocation will preserve cache locality if the number
-	 * of threads or allowed cpus increase again.
-	 */
-	max_nr_cid = atomic_read(&mm->max_nr_cid);
-	while ((allowed_max_nr_cid = min_t(int, READ_ONCE(mm->nr_cpus_allowed),
-					   atomic_read(&mm->mm_users))),
-	       max_nr_cid > allowed_max_nr_cid) {
-		/* atomic_try_cmpxchg loads previous mm->max_nr_cid into max_nr_cid. */
-		if (atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, allowed_max_nr_cid)) {
-			max_nr_cid = allowed_max_nr_cid;
-			break;
-		}
-	}
-	/* Try to re-use recent cid. This improves cache locality. */
-	cid = __this_cpu_read(pcpu_cid->recent_cid);
-	if (!mm_cid_is_unset(cid) && cid < max_nr_cid &&
-	    !cpumask_test_and_set_cpu(cid, cidmask))
-		return cid;
-	/*
-	 * Expand cid allocation if the maximum number of concurrency
-	 * IDs allocated (max_nr_cid) is below the number cpus allowed
-	 * and number of threads. Expanding cid allocation as much as
-	 * possible improves cache locality.
-	 */
-	cid = max_nr_cid;
-	while (cid < READ_ONCE(mm->nr_cpus_allowed) && cid < atomic_read(&mm->mm_users)) {
-		/* atomic_try_cmpxchg loads previous mm->max_nr_cid into cid. */
-		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &cid, cid + 1))
-			continue;
-		if (!cpumask_test_and_set_cpu(cid, cidmask))
-			return cid;
-	}
-	/*
-	 * Find the first available concurrency id.
-	 * Retry finding first zero bit if the mask is temporarily
-	 * filled. This only happens during concurrent remote-clear
-	 * which owns a cid without holding a rq lock.
-	 */
-	for (;;) {
-		cid = cpumask_first_zero(cidmask);
-		if (cid < READ_ONCE(mm->nr_cpus_allowed))
-			break;
-		cpu_relax();
-	}
-	if (cpumask_test_and_set_cpu(cid, cidmask))
-		return -1;
-	return cid;
-}
-
-/*
- * Save a snapshot of the current runqueue time of this cpu
- * with the per-cpu cid value, allowing to estimate how recently it was used.
- */
-static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
-{
-	struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(rq));
-
-	lockdep_assert_rq_held(rq);
-	WRITE_ONCE(pcpu_cid->time, rq->clock);
-}
-
-static inline int __mm_cid_get(struct rq *rq, struct task_struct *t,
-			       struct mm_struct *mm)
-{
-	int cid;
-
-	/*
-	 * All allocations (even those using the cid_lock) are lock-free. If
-	 * use_cid_lock is set, hold the cid_lock to perform cid allocation to
-	 * guarantee forward progress.
-	 */
-	if (!READ_ONCE(use_cid_lock)) {
-		cid = __mm_cid_try_get(t, mm);
-		if (cid >= 0)
-			goto end;
-		raw_spin_lock(&cid_lock);
-	} else {
-		raw_spin_lock(&cid_lock);
-		cid = __mm_cid_try_get(t, mm);
-		if (cid >= 0)
-			goto unlock;
-	}
-
-	/*
-	 * cid concurrently allocated. Retry while forcing following
-	 * allocations to use the cid_lock to ensure forward progress.
-	 */
-	WRITE_ONCE(use_cid_lock, 1);
-	/*
-	 * Set use_cid_lock before allocation. Only care about program order
-	 * because this is only required for forward progress.
-	 */
-	barrier();
-	/*
-	 * Retry until it succeeds. It is guaranteed to eventually succeed once
-	 * all newcoming allocations observe the use_cid_lock flag set.
-	 */
-	do {
-		cid = __mm_cid_try_get(t, mm);
-		cpu_relax();
-	} while (cid < 0);
-	/*
-	 * Allocate before clearing use_cid_lock. Only care about
-	 * program order because this is for forward progress.
-	 */
-	barrier();
-	WRITE_ONCE(use_cid_lock, 0);
-unlock:
-	raw_spin_unlock(&cid_lock);
-end:
-	mm_cid_snapshot_time(rq, mm);
-	return cid;
-}
-
-static inline int mm_cid_get(struct rq *rq, struct task_struct *t,
-			     struct mm_struct *mm)
-{
-	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
-	int cid;
-
-	lockdep_assert_rq_held(rq);
-	cid = __this_cpu_read(pcpu_cid->cid);
-	if (mm_cid_is_valid(cid)) {
-		mm_cid_snapshot_time(rq, mm);
-		return cid;
-	}
-	if (mm_cid_is_lazy_put(cid)) {
-		if (try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
-			__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
-	}
-	cid = __mm_cid_get(rq, t, mm);
-	__this_cpu_write(pcpu_cid->cid, cid);
-	__this_cpu_write(pcpu_cid->recent_cid, cid);
-
-	return cid;
-}
-
-static inline void switch_mm_cid(struct rq *rq,
-				 struct task_struct *prev,
-				 struct task_struct *next)
-{
-	/*
-	 * Provide a memory barrier between rq->curr store and load of
-	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
-	 *
-	 * Should be adapted if context_switch() is modified.
-	 */
-	if (!next->mm) {				// to kernel
-		/*
-		 * user -> kernel transition does not guarantee a barrier, but
-		 * we can use the fact that it performs an atomic operation in
-		 * mmgrab().
-		 */
-		if (prev->mm)				// from user
-			smp_mb__after_mmgrab();
-		/*
-		 * kernel -> kernel transition does not change rq->curr->mm
-		 * state. It stays NULL.
-		 */
-	} else {					// to user
-		/*
-		 * kernel -> user transition does not provide a barrier
-		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
-		 * Provide it here.
-		 */
-		if (!prev->mm) {			// from kernel
-			smp_mb();
-		} else {				// from user
-			/*
-			 * user->user transition relies on an implicit
-			 * memory barrier in switch_mm() when
-			 * current->mm changes. If the architecture
-			 * switch_mm() does not have an implicit memory
-			 * barrier, it is emitted here. If current->mm
-			 * is unchanged, no barrier is needed.
-			 */
-			smp_mb__after_switch_mm();
-		}
-	}
-	if (prev->mm_cid_active) {
-		mm_cid_snapshot_time(rq, prev->mm);
-		mm_cid_put_lazy(prev);
-		prev->mm_cid = -1;
-	}
-	if (next->mm_cid_active)
-		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
-}
+static __always_inline bool cid_on_cpu(unsigned int cid)
+{
+	return cid & MM_CID_ONCPU;
+}
+
+static __always_inline bool cid_in_transit(unsigned int cid)
+{
+	return cid & MM_CID_TRANSIT;
+}
+
+static __always_inline unsigned int cpu_cid_to_cid(unsigned int cid)
+{
+	return cid & ~MM_CID_ONCPU;
+}
+
+static __always_inline unsigned int cid_to_cpu_cid(unsigned int cid)
+{
+	return cid | MM_CID_ONCPU;
+}
+
+static __always_inline unsigned int cid_to_transit_cid(unsigned int cid)
+{
+	return cid | MM_CID_TRANSIT;
+}
+
+static __always_inline unsigned int cid_from_transit_cid(unsigned int cid)
+{
+	return cid & ~MM_CID_TRANSIT;
+}
+
+static __always_inline bool cid_on_task(unsigned int cid)
+{
+	/* True if none of the MM_CID_ONCPU, MM_CID_TRANSIT, MM_CID_UNSET bits is set */
+	return cid < MM_CID_TRANSIT;
+}
+
+static __always_inline void mm_drop_cid(struct mm_struct *mm, unsigned int cid)
+{
+	clear_bit(cid, mm_cidmask(mm));
+}
+
+static __always_inline void mm_unset_cid_on_task(struct task_struct *t)
+{
+	unsigned int cid = t->mm_cid.cid;
+
+	t->mm_cid.cid = MM_CID_UNSET;
+	if (cid_on_task(cid))
+		mm_drop_cid(t->mm, cid);
+}
+
+static __always_inline void mm_drop_cid_on_cpu(struct mm_struct *mm, struct mm_cid_pcpu *pcp)
+{
+	/* Clear the ONCPU bit, but do not set UNSET in the per CPU storage */
+	pcp->cid = cpu_cid_to_cid(pcp->cid);
+	mm_drop_cid(mm, pcp->cid);
+}
+
+static inline unsigned int __mm_get_cid(struct mm_struct *mm, unsigned int max_cids)
+{
+	unsigned int cid = find_first_zero_bit(mm_cidmask(mm), max_cids);
+
+	if (cid >= max_cids)
+		return MM_CID_UNSET;
+	if (test_and_set_bit(cid, mm_cidmask(mm)))
+		return MM_CID_UNSET;
+	return cid;
+}
+
+static inline unsigned int mm_get_cid(struct mm_struct *mm)
+{
+	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
+
+	while (cid == MM_CID_UNSET) {
+		cpu_relax();
+		cid = __mm_get_cid(mm, num_possible_cpus());
+	}
+	return cid;
+}
+
+static inline unsigned int mm_cid_converge(struct mm_struct *mm, unsigned int orig_cid,
+					   unsigned int max_cids)
+{
+	unsigned int new_cid, cid = cpu_cid_to_cid(orig_cid);
+
+	/* Is it in the optimal CID space? */
+	if (likely(cid < max_cids))
+		return orig_cid;
+	/* Try to find one in the optimal space. Otherwise keep the provided. */
+	new_cid = __mm_get_cid(mm, max_cids);
+	if (new_cid != MM_CID_UNSET) {
+		mm_drop_cid(mm, cid);
+		/* Preserve the ONCPU mode of the original CID */
+		return new_cid | (orig_cid & MM_CID_ONCPU);
+	}
+	return orig_cid;
+}
+
+static __always_inline void mm_cid_update_task_cid(struct task_struct *t, unsigned int cid)
+{
+	if (t->mm_cid.cid != cid) {
+		t->mm_cid.cid = cid;
+		rseq_sched_set_ids_changed(t);
+	}
+}
+
+static __always_inline void mm_cid_update_pcpu_cid(struct mm_struct *mm, unsigned int cid)
+{
+	__this_cpu_write(mm->mm_cid.pcpu->cid, cid);
+}
+
+static __always_inline void mm_cid_from_cpu(struct task_struct *t, unsigned int cpu_cid)
+{
+	unsigned int max_cids, tcid = t->mm_cid.cid;
+	struct mm_struct *mm = t->mm;
+
+	max_cids = READ_ONCE(mm->mm_cid.max_cids);
+
+	/* Optimize for the common case where both have the ONCPU bit set */
+	if (likely(cid_on_cpu(cpu_cid & tcid))) {
+		if (likely(cpu_cid_to_cid(cpu_cid) < max_cids)) {
+			mm_cid_update_task_cid(t, cpu_cid);
+			return;
+		}
+		/* Try to converge into the optimal CID space */
+		cpu_cid = mm_cid_converge(mm, cpu_cid, max_cids);
+	} else {
+		/* Hand over or drop the task owned CID */
+		if (cid_on_task(tcid)) {
+			if (cid_on_cpu(cpu_cid))
+				mm_unset_cid_on_task(t);
+			else
+				cpu_cid = cid_to_cpu_cid(tcid);
+		}
+		/* Still nothing, allocate a new one */
+		if (!cid_on_cpu(cpu_cid))
+			cpu_cid = cid_to_cpu_cid(mm_get_cid(mm));
+	}
+	mm_cid_update_pcpu_cid(mm, cpu_cid);
+	mm_cid_update_task_cid(t, cpu_cid);
+}
+
+static __always_inline void mm_cid_from_task(struct task_struct *t, unsigned int cpu_cid)
+{
+	unsigned int max_cids, tcid = t->mm_cid.cid;
+	struct mm_struct *mm = t->mm;
+
+	max_cids = READ_ONCE(mm->mm_cid.max_cids);
+
+	/* Optimize for the common case, where both have the ONCPU bit clear */
+	if (likely(cid_on_task(tcid | cpu_cid))) {
+		if (likely(tcid < max_cids)) {
+			mm_cid_update_pcpu_cid(mm, tcid);
+			return;
+		}
+		/* Try to converge into the optimal CID space */
+		tcid = mm_cid_converge(mm, tcid, max_cids);
+	} else {
+		/* Hand over or drop the CPU owned CID */
+		if (cid_on_cpu(cpu_cid)) {
+			if (cid_on_task(tcid))
+				mm_drop_cid_on_cpu(mm, this_cpu_ptr(mm->mm_cid.pcpu));
+			else
+				tcid = cpu_cid_to_cid(cpu_cid);
+		}
+		/* Still nothing, allocate a new one */
+		if (!cid_on_task(tcid))
+			tcid = mm_get_cid(mm);
+		/* Set the transition mode flag if required */
+		tcid |= READ_ONCE(mm->mm_cid.transit);
+	}
+	mm_cid_update_pcpu_cid(mm, tcid);
+	mm_cid_update_task_cid(t, tcid);
+}
+
+static __always_inline void mm_cid_schedin(struct task_struct *next)
+{
+	struct mm_struct *mm = next->mm;
+	unsigned int cpu_cid;
+
+	if (!next->mm_cid.active)
+		return;
+
+	cpu_cid = __this_cpu_read(mm->mm_cid.pcpu->cid);
+	if (likely(!READ_ONCE(mm->mm_cid.percpu)))
+		mm_cid_from_task(next, cpu_cid);
+	else
+		mm_cid_from_cpu(next, cpu_cid);
+}
+
+static __always_inline void mm_cid_schedout(struct task_struct *prev)
+{
+	/* During mode transitions CIDs are temporary and need to be dropped */
+	if (likely(!cid_in_transit(prev->mm_cid.cid)))
+		return;
+
+	mm_drop_cid(prev->mm, cid_from_transit_cid(prev->mm_cid.cid));
+	prev->mm_cid.cid = MM_CID_UNSET;
+}
+
+static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct *next)
+{
+	mm_cid_schedout(prev);
+	mm_cid_schedin(next);
+}
 #else /* !CONFIG_SCHED_MM_CID: */
-static inline void switch_mm_cid(struct rq *rq, struct task_struct *prev, struct task_struct *next) { }
-static inline void sched_mm_cid_migrate_from(struct task_struct *t) { }
-static inline void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t) { }
-static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
-static inline void init_sched_mm_cid(struct task_struct *t) { }
+static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct *next) { }
 #endif /* !CONFIG_SCHED_MM_CID */
 
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
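
Since the rewritten helpers encode the ownership state (task owned, CPU owned, in
transit, unset) in the CID value itself, the fast paths in mm_cid_from_cpu() and
mm_cid_from_task() can test both the per-CPU and the per-task CID with a single bit
operation. The following is a minimal standalone sketch of that trick, not part of the
series; the MM_CID_* values are made up for illustration and only need to satisfy the
invariant documented above: all flag bits lie above any valid CID, with MM_CID_TRANSIT
being the lowest flag bit.

/* Standalone illustration; the MM_CID_* values below are hypothetical. */
#include <assert.h>

#define MM_CID_UNSET	0x80000000u	/* hypothetical value */
#define MM_CID_ONCPU	0x40000000u	/* hypothetical value */
#define MM_CID_TRANSIT	0x20000000u	/* hypothetical value */

static int cid_on_cpu(unsigned int cid)  { return cid & MM_CID_ONCPU; }
static int cid_on_task(unsigned int cid) { return cid < MM_CID_TRANSIT; }

int main(void)
{
	unsigned int tcid = 3;				/* task owns CID 3 */
	unsigned int cpu_cid = 5 | MM_CID_ONCPU;	/* CPU owns CID 5 */

	/* Mixed ownership: neither fast-path check fires */
	assert(!cid_on_cpu(cpu_cid & tcid));	/* "both ONCPU" test of mm_cid_from_cpu() */
	assert(!cid_on_task(tcid | cpu_cid));	/* "both task owned" test of mm_cid_from_task() */

	/* Once both sides agree on the ownership mode, the fast path hits */
	assert(cid_on_cpu(cpu_cid & (3 | MM_CID_ONCPU)));
	assert(cid_on_task(tcid | 5));
	return 0;
}

With that layout, cid_on_cpu(cpu_cid & tcid) is true only if both IDs carry the ONCPU
bit, and cid_on_task(tcid | cpu_cid) is true only if neither carries any flag bit,
which is exactly what the two common-case branches rely on.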


@@ -3125,7 +3125,6 @@ void exit_signals(struct task_struct *tsk)
 	cgroup_threadgroup_change_begin(tsk);
 
 	if (thread_group_empty(tsk) || (tsk->signal->flags & SIGNAL_GROUP_EXIT)) {
-		sched_mm_cid_exit_signals(tsk);
 		tsk->flags |= PF_EXITING;
 		cgroup_threadgroup_change_end(tsk);
 		return;
@@ -3136,7 +3135,6 @@ void exit_signals(struct task_struct *tsk)
 	 * From now this task is not visible for group-wide signals,
 	 * see wants_signal(), do_signal_stop().
 	 */
-	sched_mm_cid_exit_signals(tsk);
 	tsk->flags |= PF_EXITING;
 
 	cgroup_threadgroup_change_end(tsk);


@@ -355,6 +355,12 @@ unsigned int __bitmap_weight_andnot(const unsigned long *bitmap1,
 }
 EXPORT_SYMBOL(__bitmap_weight_andnot);
 
+unsigned int __bitmap_weighted_or(unsigned long *dst, const unsigned long *bitmap1,
+				  const unsigned long *bitmap2, unsigned int bits)
+{
+	return BITMAP_WEIGHT(({dst[idx] = bitmap1[idx] | bitmap2[idx]; dst[idx]; }), bits);
+}
+
 void __bitmap_set(unsigned long *map, unsigned int start, int len)
 {
 	unsigned long *p = map + BIT_WORD(start);


@@ -49,6 +49,7 @@
 #include <linux/lockdep.h>
 #include <linux/kthread.h>
 #include <linux/suspend.h>
+#include <linux/rseq.h>
 
 #include <asm/processor.h>
 #include <asm/ioctl.h>
@@ -4476,6 +4477,12 @@ static long kvm_vcpu_ioctl(struct file *filp,
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
 		vcpu->wants_to_run = false;
 
+		/*
+		 * FIXME: Remove this hack once all KVM architectures
+		 * support the generic TIF bits, i.e. a dedicated TIF_RSEQ.
+		 */
+		rseq_virt_userspace_exit();
+
 		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
 		break;
 	}