Restartable Sequences

From CRIU

Revision as of 14:09, 17 January 2026

Restartable sequences (rseq) are short, carefully defined sections of user-space code that enable efficient access to per-CPU data structures without relying on heavyweight synchronization primitives such as mutexes or atomic operations.

Support for rseq was introduced in Linux kernel version 4.18, allowing user-space programs to register critical code paths that the kernel can safely restart when a CPU migration or preemption occurs. This mechanism enables high-performance, scalable data access patterns while preserving correctness. The article "The 5-year journey to bring restartable sequences to Linux" provides more information about how restartable sequences work, their design, use cases, and kernel integration.

Linux kernel interface

The Linux kernel interface for rseq is intentionally minimal. It consists of a single system call: sys_rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig)

The full definition of the rseq data structures and related flags is provided in include/uapi/linux/rseq.h. A simplified version is shown here:

enum rseq_cs_flags {
	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT = (1U << RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT),
	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL = (1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE = (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
};

struct rseq_cs {
	__u32 version; /* always 0 at this moment */
	enum rseq_cs_flags flags;
	void *start_ip;
	/* Offset from start_ip. */
	intptr_t post_commit_offset;
	void *abort_ip;
};

struct rseq {
	__u32 cpu_id_start;
	__u32 cpu_id;
	struct rseq_cs *rseq_cs;
	enum rseq_cs_flags flags;
};

On the userspace side, we need to keep a struct rseq somewhere and register it with the kernel using the rseq syscall. Then, whenever we want to execute some code as an rseq critical section (rseq CS or just CS), we need to allocate a struct rseq_cs and fill it in. We have to specify the start address of the CS and the address of the abort handler (called when the CS is interrupted by a preemption, migration, or signal). Finally, we need to store a pointer to the struct rseq_cs in the (struct rseq).rseq_cs field.
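The descriptor setup described above can be sketched in plain C. This is only a model under stated assumptions: the demo_* structures and helpers are hypothetical, they mirror the simplified layout shown earlier with fixed-width fields, and no syscall is made. Real users typically take the start and abort addresses from labels in inline assembly and store the pointer as the first instruction of the critical section.

```c
#include <stdint.h>

/* Hypothetical mirror of the simplified layout above (not the uapi header). */
struct demo_rseq_cs {
	uint32_t version;            /* always 0 for now */
	uint32_t flags;
	uint64_t start_ip;
	uint64_t post_commit_offset; /* offset from start_ip */
	uint64_t abort_ip;
};

struct demo_rseq {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;            /* address of the active descriptor, or 0 */
	uint32_t flags;
};

/* Describe a critical section covering [start, end) with its abort handler. */
static void demo_describe_cs(struct demo_rseq_cs *cs, uint64_t start,
			     uint64_t end, uint64_t abort_ip)
{
	cs->version = 0;
	cs->flags = 0;
	cs->start_ip = start;
	cs->post_commit_offset = end - start;
	cs->abort_ip = abort_ip;
}

/* Publish the descriptor: the kernel finds the active CS through this field. */
static void demo_enter_cs(struct demo_rseq *area, const struct demo_rseq_cs *cs)
{
	area->rseq_cs = (uint64_t)(uintptr_t)cs;
}
```

Keeping the descriptor separate from the registered area is what lets one thread reuse a single struct rseq for many different critical sections.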

Handling of flags

Flags can be specified in either struct rseq or struct rseq_cs, using values from enum rseq_cs_flags. Regardless of where they are set, the kernel combines them when determining restart behavior:

static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
{
	u32 flags, event_mask;
	int ret;

	/* Get thread flags. */
	ret = get_user(flags, &t->rseq->flags);
	if (ret)
		return ret;

	/* Take critical section flags into account. */
	flags |= cs_flags; /* flags from struct rseq and struct rseq_cs are combined here */

The most common flags value is zero. In this case, the restartable sequence critical section is interrupted whenever a preemption, CPU migration, or signal occurs, and the instruction pointer (IP) will be redirected to the abort handler. In some scenarios, however, applications may prefer to allow a critical section to complete even when certain events occur, and can do so by explicitly setting the appropriate flags.

Note that RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL must be used in conjunction with both RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT and RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE. The kernel enforces this constraint to prevent inconsistent restart semantics:

	/*
	 * Restart on signal can only be inhibited when restart on
	 * preempt and restart on migrate are inhibited too. Otherwise,
	 * a preempted signal handler could fail to restart the prior
	 * execution context on sigreturn.
	 */
	if (unlikely((flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) &&
		     (flags & RSEQ_CS_PREEMPT_MIGRATE_FLAGS) !=
		     RSEQ_CS_PREEMPT_MIGRATE_FLAGS))
		return -EINVAL;
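The combination and the constraint can be modelled in user space as follows. This is a minimal sketch, not kernel code: the SK_* constants and sk_effective_flags are hypothetical names, and the bit positions are assumed to match those in include/uapi/linux/rseq.h (preempt = bit 0, signal = bit 1, migrate = bit 2).

```c
#include <errno.h>
#include <stdint.h>

/* Assumed to mirror the uapi bit positions; names are illustrative only. */
#define SK_NO_RESTART_ON_PREEMPT (1U << 0)
#define SK_NO_RESTART_ON_SIGNAL  (1U << 1)
#define SK_NO_RESTART_ON_MIGRATE (1U << 2)

/*
 * OR the per-thread flags with the per-CS flags, then reject the
 * combination where restart on signal is inhibited without also
 * inhibiting restart on both preemption and migration.
 */
static int sk_effective_flags(uint32_t thread_flags, uint32_t cs_flags,
			      uint32_t *out)
{
	const uint32_t preempt_migrate =
		SK_NO_RESTART_ON_PREEMPT | SK_NO_RESTART_ON_MIGRATE;
	uint32_t flags = thread_flags | cs_flags;

	if ((flags & SK_NO_RESTART_ON_SIGNAL) &&
	    (flags & preempt_migrate) != preempt_migrate)
		return -EINVAL;

	*out = flags;
	return 0;
}
```

Because the flags are ORed before the check, the three bits may be split between struct rseq and struct rseq_cs and still form a valid combination.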

How CRIU handles rseq

CRIU handles rseq differently depending on the state of the process at dump time. The cases are:

  1. the process is not inside the rseq critical section
  2. the process is inside the rseq CS
    1. flags is 0 or RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT or RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
    2. flags is RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
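The classification above can be written as a small decision function. This is a sketch under assumptions: struct case_cs, enum rseq_case, and classify_rseq_case are hypothetical names rather than CRIU's own, and the range test follows the "offset from start_ip" convention of post_commit_offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL (bit 1). */
#define CASE_NO_RESTART_ON_SIGNAL (1U << 1)

/* Hypothetical snapshot of a critical section descriptor. */
struct case_cs {
	uint32_t flags;
	uint64_t start_ip;
	uint64_t post_commit_offset;
	uint64_t abort_ip;
};

enum rseq_case {
	CASE_NOT_IN_CS,                  /* case 1 */
	CASE_IN_CS_ABORTABLE,            /* case 2.1 */
	CASE_IN_CS_NO_RESTART_ON_SIGNAL, /* case 2.2 */
};

static enum rseq_case classify_rseq_case(uint64_t ip, const struct case_cs *cs)
{
	/* Unsigned subtraction also rejects ip < start_ip (it wraps around). */
	if (cs == NULL || ip - cs->start_ip >= cs->post_commit_offset)
		return CASE_NOT_IN_CS;
	if (cs->flags & CASE_NO_RESTART_ON_SIGNAL)
		return CASE_IN_CS_NO_RESTART_ON_SIGNAL;
	return CASE_IN_CS_ABORTABLE;
}
```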

the process is not inside the rseq critical section

This is the simplest case. The process has a struct rseq registered in the kernel, but its instruction pointer (IP) is currently not inside a CS.

Dump

We only need to determine where the struct rseq is and dump its address, length, and signature. To achieve that we use the special ptrace request PTRACE_GET_RSEQ_CONFIGURATION (refer to the dump_thread_rseq function).

Restore

We need to take the data about the struct rseq from the image (see images/rseq.proto) and register it from the parasite context using the rseq syscall (see restore_rseq in criu/pie/restorer.c).

inside CS: flags is 0 or RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT or RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE

The process was caught with its IP inside the CS. Can we act as before, that is, dump the struct rseq address, restore it, and so on? No, we can't. CRIU saves the IP as it was at dump time, but rseq semantics require a jump to the abort handler whenever CS execution is interrupted. In this case flags is 0, RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT, or RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE. This means the kernel fixes up the IP of the process to the abort handler address when the CS is interrupted by preemption or migration (flags is 0), by migration (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT), or by preemption (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE). In all three variants a signal also aborts the CS, because RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL is not set.

If CRIU simply saved the IP as it was and restored it, the restored process would resume in the middle of an interrupted CS. That is a serious problem: it may break the user application or even crash it.

Let's look at the fixup_thread_rseq function:

	if (task_in_rseq(rseq_cs, TI_IP(core))) {
		struct pid *tid = &item->threads[i];

...

		pr_warn("The %d task is in rseq critical section. IP will be set to rseq abort handler addr\n",
				tid->real);

...

		if (!(rseq_cs->flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) {
			pr_warn("The %d task is in rseq critical section. IP will be set to rseq abort handler addr\n",
				tid->real);

			TI_IP(core) = rseq_cs->abort_ip;

			if (item->pid->real == tid->real) {
				compel_set_leader_ip(dmpi(item)->parasite_ctl, rseq_cs->abort_ip);
			} else {
				compel_set_thread_ip(dmpi(item)->thread_ctls[i], rseq_cs->abort_ip);
			}
		}
	}

It checks whether the process IP is inside the CS and, if so, fixes it up to the abort handler IP, just as the kernel does.
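Stripped of CRIU's data structures, the decision in the excerpt reduces to a pure function of the stopped IP and the descriptor. The sketch below is a simplified model: struct fix_cs and fix_saved_ip are hypothetical names, and the real function additionally updates the parasite or thread control context.

```c
#include <stdint.h>

/* Assumed to mirror RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL (bit 1). */
#define FIX_NO_RESTART_ON_SIGNAL (1U << 1)

/* Hypothetical snapshot of the descriptor read from the dumped task. */
struct fix_cs {
	uint32_t flags;
	uint64_t start_ip;
	uint64_t post_commit_offset;
	uint64_t abort_ip;
};

/* Return the IP that should end up in the image for a thread stopped at ip. */
static uint64_t fix_saved_ip(uint64_t ip, const struct fix_cs *cs)
{
	int in_cs = ip >= cs->start_ip &&
		    ip - cs->start_ip < cs->post_commit_offset;

	/* An abortable CS resumes at its abort handler, as after a kernel fixup. */
	if (in_cs && !(cs->flags & FIX_NO_RESTART_ON_SIGNAL))
		return cs->abort_ip;

	/* Outside a CS, or in a non-abortable one: keep the IP unchanged. */
	return ip;
}
```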

Dump

We need to determine where the struct rseq is and dump its address, length, and signature. To achieve that we use the special ptrace request PTRACE_GET_RSEQ_CONFIGURATION (refer to the dump_thread_rseq function).

We also have to fix up the IP so that it points to the abort handler.

Restore

We need to take the data about the struct rseq from the image (see images/rseq.proto) and register it from the parasite context using the rseq syscall (see restore_rseq in criu/pie/restorer.c).

No additional actions are needed here. The process will be restored and will continue execution from the abort handler (not within the rseq CS!).

inside CS: flags is RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL

This is a rare case, but we support it too. If the rseq CS has the RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL flag set, it is technically non-abortable. At first glance, it seems that we don't need to do anything special: save the rseq structure address and leave the IP alone. This is incorrect.

The kernel clears the (struct rseq).rseq_cs pointer once we jump into the parasite during the dump:

static int rseq_ip_fixup(struct pt_regs *regs)
{
...

	/*
	 * Handle potentially not being within a critical section.
	 * If not nested over a rseq critical section, restart is useless.
	 * Clear the rseq_cs pointer and return.
	 */
	if (!in_rseq_cs(ip, &rseq_cs))
		return clear_rseq_cs(t);

As a result, after the restore the process would continue executing the rseq CS from the same place (which is fine), but from the kernel's point of view it would no longer be within an rseq CS (which is bad), because the kernel determines the execution context from the (struct rseq).rseq_cs field.

Dump

We need to determine where the struct rseq is and dump its address, length, and signature. To achieve that we use the special ptrace request PTRACE_GET_RSEQ_CONFIGURATION (refer to the dump_thread_rseq function).

We save the IP as it was (no fixup), but we also have to save the (struct rseq).rseq_cs field into the CRIU image.

Restore

We need to take the data about the struct rseq from the image (see images/rseq.proto) and register it from the parasite context using the rseq syscall (see restore_rseq in criu/pie/restorer.c).

We need to restore the (struct rseq).rseq_cs memory externally using ptrace POKEAREA (see restore_rseq_cs).
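The effect of that last step can be illustrated with a plain-memory model. This is a sketch under loud assumptions: CRIU writes into the real task with ptrace, whereas here a byte buffer stands in for the task's memory, and struct poke_rseq and poke_rseq_cs are hypothetical names that mirror the simplified layout with fixed-width fields.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the registered area; rseq_cs sits at offset 8. */
struct poke_rseq {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;
	uint32_t flags;
};

/*
 * Write the saved rseq_cs pointer back into the task's struct rseq.
 * mem models task memory, and base is the task address of mem[0].
 * Returns 0 on success, -1 if the field lies outside the modelled range.
 */
static int poke_rseq_cs(uint8_t *mem, size_t mem_len, uint64_t base,
			uint64_t rseq_addr, uint64_t saved_rseq_cs)
{
	const size_t field_off = offsetof(struct poke_rseq, rseq_cs);
	uint64_t rel;

	if (rseq_addr < base)
		return -1;
	rel = rseq_addr - base;
	if (rel > mem_len || mem_len - rel < field_off + sizeof(saved_rseq_cs))
		return -1;

	memcpy(mem + rel + field_off, &saved_rseq_cs, sizeof(saved_rseq_cs));
	return 0;
}
```

Only the rseq_cs field is written. The rest of the area is left to the kernel, which maintains cpu_id_start and cpu_id once the area is registered.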

TODO

  • tests for all architectures (right now we have ZDTM tests only for x86_64)
  • improve support for built-in rseq in non-glibc libraries
  • pre-dump tests (?)
  • leave-running tests (?)
  • crfail test
  • threaded test

Useful links