| Line 74: |
Line 74: |
| | </pre> | | </pre> |
| | | | |
| − | == How CRIU handles rseq == | + | == Checkpoint/Restore of RSEQ == |
| | | | |
| − | CRIU handles the rseq differently depending on the particular case. Let's classify and cover all of them. | + | CRIU handles restartable sequences differently depending on the execution state of the process at checkpoint time. This can be categorized into the following cases: |
| | | | |
| − | # the process is not inside the rseq critical section | + | # Process is not executing inside an rseq critical section |
| − | # the process is inside the rseq CS | + | # Process is executing inside an rseq critical section |
| − | ## <code>flags</code> is <code>0</code> or <code>RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT</code> or <code>RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE</code> | + | ## <code>flags</code> is <code>0</code>, <code>RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT</code>, or <code>RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE</code> |
| − | ## <code>flags</code> is <code>RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL</code> | + | ## <code>flags</code> includes <code>RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL</code> |
| | | | |
| − | === the process is not inside the rseq critical section === | + | === Executing outside critical section === |
| | | | |
| − | Simplest case. Process just have <code>struct rseq</code> registered in the kernel but currently instruction pointer (IP) not inside CS.
| + | This is the simplest case. The process has a <code>struct rseq</code> registered with the kernel, but its instruction pointer (IP) is not currently executing within an RSEQ critical section. |
| | | | |
| − | ==== Dump ==== | + | ==== Checkpoint ==== |
| − | We need only to determine where the <code>struct rseq</code> is and dump its address length and signature.
| + | CRIU only needs to locate the <code>struct rseq</code> instance and record its address, length, and signature. This information is obtained using the ptrace request <code>PTRACE_GET_RSEQ_CONFIGURATION</code> (see the <code>dump_thread_rseq</code> function). |
| − | To achieve that we use special ptrace handle <code>PTRACE_GET_RSEQ_CONFIGURATION</code> (refer to the <code>dump_thread_rseq</code> function).
| |
| | | | |
| | ==== Restore ==== | | ==== Restore ==== |
| − | We need to take data about the <code>struct rseq</code> from the image (see images/rseq.proto) and register it from the parasite context using the <code>rseq</code> syscall (take a look on <code>restore_rseq</code> in criu/pie/restorer.c)
| + | During restore, CRIU retrieves the <code>struct rseq</code> information from the checkpoint image (see <code>images/rseq.proto</code>) and re-registers it from the restorer context using the <code>rseq</code> syscall (see <code>restore_rseq</code> in <code>criu/pie/restorer.c</code>). |
| | | | |
| − | === inside CS: <code>flags</code> is <code>0</code> or <code>RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT</code> or <code>RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE</code> === | + | === Executing inside critical section === |
| | | | |
| − | The process was caught with IP inside CS. Can we act as before? So, dump <code>struct rseq</code> address, restore it, and so on. No, we can't.
| + | When a process is being checkpointed while its instruction pointer is inside an RSEQ critical section, CRIU preserves the instruction pointer exactly as it was at checkpoint time. |
| − | The reason is that CRIU saves IP as it was during the dump. But the rseq semantic is to jump to abort handler if CS execution was interrupted.
| + | However, RSEQ semantics require that if execution of a critical section is interrupted, the kernel redirects execution to the associated abort handler. In particular, when the <code>flags</code> value is <code>0</code>, <code>RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT</code>, or <code>RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE</code>, the kernel moves the instruction pointer to the abort handler of that critical section. Restoring the process with its instruction pointer unchanged would therefore violate RSEQ semantics and could lead to incorrect behavior or application crashes. To address this, CRIU explicitly adjusts the instruction pointer to match the kernel's behavior. |
| − | In this particular case we have <code>flags</code> equal to <code>0</code> or <code>RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT</code> or <code>RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE</code> | |
| − | it means that if CS will be interrupted by the preeption, migration (<code>0</code>) or migration (<code>RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT</code>) or preemption (<code>RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE</code>)
| |
| − | the kernel will fixup IP of the process to the abort handler address. | |
| | | | |
| − | When we dump the process using CRIU it will just save IP as it was and restore it. That's a serious problem and this may break the user application (even cause crash!).
| + | The logic responsible for this is implemented in the <code>fixup_thread_rseq</code> function: |
| − | | |
| − | Lets see <code>fixup_thread_rseq</code> function:
| |
| | <pre> | | <pre> |
| − | if (task_in_rseq(rseq_cs, TI_IP(core))) {
| + | if (task_in_rseq(rseq_cs, TI_IP(core))) { |
| − | struct pid *tid = &item->threads[i];
| + | struct pid *tid = &item->threads[i]; |
| | | | |
| | ... | | ... |
| | | | |
| − | pr_warn("The %d task is in rseq critical section. IP will be set to rseq abort handler addr\n",
| + | pr_warn("The %d task is in rseq critical section. IP will be set to rseq abort handler addr\n", |
| − | tid->real);
| + | tid->real); |
| | | | |
| | ... | | ... |
| | | | |
| − | if (!(rseq_cs->flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) {
| + | if (!(rseq_cs->flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) { |
| − | pr_warn("The %d task is in rseq critical section. IP will be set to rseq abort handler addr\n",
| + | pr_warn("The %d task is in rseq critical section. IP will be set to rseq abort handler addr\n", |
| − | tid->real);
| + | tid->real); |
| | | | |
| − | TI_IP(core) = rseq_cs->abort_ip;
| + | TI_IP(core) = rseq_cs->abort_ip; |
| | | | |
| − | if (item->pid->real == tid->real) {
| + | if (item->pid->real == tid->real) { |
| − | compel_set_leader_ip(dmpi(item)->parasite_ctl, rseq_cs->abort_ip);
| + | compel_set_leader_ip(dmpi(item)->parasite_ctl, rseq_cs->abort_ip); |
| − | } else {
| + | } else { |
| − | compel_set_thread_ip(dmpi(item)->thread_ctls[i], rseq_cs->abort_ip);
| + | compel_set_thread_ip(dmpi(item)->thread_ctls[i], rseq_cs->abort_ip); |
| − | }
| |
| | } | | } |
| | } | | } |
| | + | } |
| | </pre> | | </pre> |
| | | | |
| − | It checks that process IP inside CS and fixes it up to the abort handler IP as the kernel does.
| + | This code detects when a thread's instruction pointer lies within an RSEQ critical section and, unless <code>RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL</code> is set, rewrites the instruction pointer to the abort handler address. By doing so, CRIU mirrors the kernel's rseq fixup behavior and ensures that the restored process resumes execution in a semantically correct state. |
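The containment check and the IP fixup can be modelled in isolation. The following is a minimal sketch, not CRIU's code: the struct mirrors the field layout of the kernel's <code>struct rseq_cs</code>, while <code>model_rseq_cs</code>, <code>model_task_in_rseq</code>, <code>model_fixup_ip</code>, and <code>MODEL_CS_FLAG_NO_RESTART_ON_SIGNAL</code> are hypothetical names introduced for illustration. It assumes the critical section covers the half-open range from <code>start_ip</code> to <code>start_ip + post_commit_offset</code>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; the value mirrors RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL (bit 1). */
#define MODEL_CS_FLAG_NO_RESTART_ON_SIGNAL (1U << 1)

/* Hypothetical stand-in with the same fields as the kernel's struct rseq_cs. */
struct model_rseq_cs {
	uint32_t version;
	uint32_t flags;
	uint64_t start_ip;           /* first instruction of the critical section */
	uint64_t post_commit_offset; /* length of the critical section in bytes */
	uint64_t abort_ip;           /* address of the abort handler */
};

/*
 * True when ip lies in [start_ip, start_ip + post_commit_offset).
 * The unsigned subtraction wraps for ip < start_ip, so one comparison
 * covers both bounds.
 */
static bool model_task_in_rseq(const struct model_rseq_cs *cs, uint64_t ip)
{
	assert(cs != NULL);
	return ip - cs->start_ip < cs->post_commit_offset;
}

/*
 * Returns the IP that should be stored for the thread: the abort handler
 * when the thread was caught inside an abortable critical section, and
 * the unchanged IP otherwise.
 */
static uint64_t model_fixup_ip(const struct model_rseq_cs *cs, uint64_t ip)
{
	if (!model_task_in_rseq(cs, ip))
		return ip; /* outside the critical section: nothing to do */

	if (cs->flags & MODEL_CS_FLAG_NO_RESTART_ON_SIGNAL)
		return ip; /* non-abortable section: keep the IP as it was */

	return cs->abort_ip;
}
```

For a descriptor with <code>start_ip = 0x1000</code>, <code>post_commit_offset = 0x20</code>, and <code>abort_ip = 0x2000</code>, an IP of <code>0x1010</code> is rewritten to <code>0x2000</code>, whereas <code>0x1020</code> (the first address past the section) is left untouched.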
| | | | |
| − | ==== Dump ==== | + | ==== Checkpoint ==== |
| − | We need to determine where the <code>struct rseq</code> is and dump its address length and signature.
| + | CRIU locates the <code>struct rseq</code> instance and records its address, length, and signature using the <code>PTRACE_GET_RSEQ_CONFIGURATION</code> ptrace request (see <code>dump_thread_rseq</code>). |
| − | To achieve that we use special ptrace handle <code>PTRACE_GET_RSEQ_CONFIGURATION</code> (refer to the <code>dump_thread_rseq</code> function).
| + | In addition, the instruction pointer is explicitly adjusted to point to the RSEQ abort handler. |
| − | | |
| − | We have to fix up IP to the abort handler.
| |
| | | | |
| | ==== Restore ==== | | ==== Restore ==== |
| − | We need to take data about the <code>struct rseq</code> from the image (see images/rseq.proto) and register it from the parasite context using the <code>rseq</code> syscall (take a look on <code>restore_rseq</code> in criu/pie/restorer.c)
| + | During restore, CRIU reads the <code>struct rseq</code> state from the checkpoint image (<code>images/rseq.proto</code>) and re-registers it from the restorer context using the <code>rseq</code> system call (see <code>restore_rseq</code> in <code>criu/pie/restorer.c</code>). No further action is required: the process resumes execution at the abort handler, outside of the RSEQ critical section. |
| − | | |
| − | No additional actions here. The process will be restored and will continue execution from the abort handler (not within the rseq CS!). | |
| | | | |
| − | === inside CS: <code>flags</code> is <code>RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL</code> === | + | === Executing inside non-abortable critical section === |
| | | | |
| − | Rare case, but we support it too. If the rseq CS has <code>RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL</code> flag it means that its technically
| + | This is a relatively rare case, but it is fully supported by CRIU. When an RSEQ critical section is marked with the <code>RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL</code> flag, it is effectively non-abortable. |
| − | non-abortable. So, from the first glance, it seems like we can just not do anything special: save rseq structure address, not fix up IP. | + | At first glance, this might suggest that no special handling is required as the RSEQ structure could simply be saved, and the instruction pointer left unchanged. However, this assumption is incorrect. |
| − | This is incorrect.
| |
| | | | |
| − | The kernel will clean up <code>(struct rseq).rseq_cs</code> pointer once we jump into the parasite on the dump:
| + | During checkpoint, when CRIU transfers execution to the parasite, the kernel clears the <code>(struct rseq).rseq_cs</code> pointer if it determines that execution is no longer within an rseq critical section: |
| | <pre> | | <pre> |
| | static int rseq_ip_fixup(struct pt_regs *regs) | | static int rseq_ip_fixup(struct pt_regs *regs) |
| Line 165: |
Line 154: |
| | </pre> | | </pre> |
| | | | |
| − | and after the restore process will continue the rseq CS execution from the same place (it's okay) but from the kernel point of view,
| + | As a result, after restore, the process resumes execution at the correct instruction pointer within the critical section, but from the kernel's perspective it is no longer executing inside an RSEQ critical section. This discrepancy is problematic, because the kernel relies on the <code>(struct rseq).rseq_cs</code> field to determine rseq execution context. |
| − | the process will continue this execution as not being within the rseq CS (that's bad!). Because the kernel determines execution context from the <code>(struct rseq).rseq_cs</code> field.
| |
| | | | |
| − | ==== Dump ==== | + | ==== Checkpoint ==== |
| − | We need to determine where the <code>struct rseq</code> is and dump its address length and signature.
| |
| − | To achieve that we use special ptrace handle <code>PTRACE_GET_RSEQ_CONFIGURATION</code> (refer to the <code>dump_thread_rseq</code> function).
| |
| | | | |
| − | We save IP as it was (not doing fixup), but we have to save <code>(struct rseq).rseq_cs</code> field into the CRIU image.
| + | CRIU locates the <code>struct rseq</code> instance and records its address, length, and signature using the <code>PTRACE_GET_RSEQ_CONFIGURATION</code> ptrace request. |
| | + | The instruction pointer is saved without modification, but the <code>(struct rseq).rseq_cs</code> field is also recorded in the CRIU image. |
| | | | |
| | ==== Restore ==== | | ==== Restore ==== |
| − | We need to take data about the <code>struct rseq</code> from the image (see images/rseq.proto) and register it from the parasite context using the <code>rseq</code> syscall (take a look on <code>restore_rseq</code> in criu/pie/restorer.c)
| |
| | | | |
| − | We need to restore <code>(struct rseq).rseq_cs</code> memory externaly using ptrace <code>POKEAREA</code> (see <code>restore_rseq_cs</code>).
| + | During restore, CRIU re-registers the <code>struct rseq</code> from the checkpoint image (<code>images/rseq.proto</code>) using the <code>rseq</code> system call from the restorer context (see <code>restore_rseq</code> in <code>criu/pie/restorer.c</code>). In addition, CRIU explicitly restores the <code>(struct rseq).rseq_cs</code> field by writing to the task's memory with ptrace (the <code>ptrace_poke_area</code> helper; see <code>restore_rseq_cs</code>), which reestablishes the correct rseq execution context for the kernel. |
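The ordering that matters in the non-abortable case can be shown with a small model. This is only a sketch under stated assumptions: <code>model_rseq</code>, <code>model_image</code>, and the three functions are hypothetical names, the structure is reduced to the single <code>rseq_cs</code> field, and the value is assumed to be captured before the kernel clears it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the user-space struct rseq. */
struct model_rseq {
	uint64_t rseq_cs; /* address of the active descriptor, 0 if none */
};

/* Hypothetical stand-in for the value kept in the checkpoint image. */
struct model_image {
	uint64_t rseq_cs;
};

/* Checkpoint: record the rseq_cs pointer before it can be cleared. */
static void model_checkpoint(const struct model_rseq *abi, struct model_image *img)
{
	assert(abi != NULL && img != NULL);
	img->rseq_cs = abi->rseq_cs;
}

/* Models the kernel clearing rseq_cs once the IP leaves the critical section. */
static void model_kernel_clear(struct model_rseq *abi)
{
	assert(abi != NULL);
	abi->rseq_cs = 0;
}

/* Restore: write the saved pointer back so the kernel sees the section again. */
static void model_restore(struct model_rseq *abi, const struct model_image *img)
{
	assert(abi != NULL && img != NULL);
	abi->rseq_cs = img->rseq_cs;
}
```

Calling <code>model_checkpoint</code>, then <code>model_kernel_clear</code>, then <code>model_restore</code> leaves <code>rseq_cs</code> at its original value. Skipping the save step would leave it at <code>0</code>, which is the discrepancy described in this section.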
| | | | |
| | == TODO == | | == TODO == |