Line 76: |
Line 76: |
| | | |
| | | |
| + | == How CRIU handles rseq == |
| + | |
| + | CRIU handles rseq differently depending on the state in which the process was caught. The cases are classified below and then covered one by one. |
| + | |
| + | # the process is not inside the rseq critical section |
| + | # the process is inside the rseq CS |
| + | ## <code>flags</code> is <code>0</code> or <code>RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT</code> or <code>RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE</code> |
| + | ## <code>flags</code> is <code>RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL</code> |
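| + | |
| + | The classification above can be sketched as a small decision function. This is only an illustration: the enum and function names are hypothetical and do not exist in CRIU, and the flag macros are local copies of the bit values from the kernel's <code>include/uapi/linux/rseq.h</code>. |

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the rseq_cs flag bits (see include/uapi/linux/rseq.h). */
#define SKETCH_CS_NO_RESTART_ON_PREEMPT (1U << 0)
#define SKETCH_CS_NO_RESTART_ON_SIGNAL  (1U << 1)
#define SKETCH_CS_NO_RESTART_ON_MIGRATE (1U << 2)

/* Hypothetical names for the cases listed above; not CRIU identifiers. */
enum rseq_dump_case {
    RSEQ_CASE_OUTSIDE_CS,        /* IP is not inside a critical section */
    RSEQ_CASE_ABORTABLE_CS,      /* inside CS, NO_RESTART_ON_SIGNAL not set */
    RSEQ_CASE_NON_ABORTABLE_CS,  /* inside CS, NO_RESTART_ON_SIGNAL set */
};

/* Pick the case from "is the IP inside the CS?" and the rseq_cs flags. */
static enum rseq_dump_case classify_rseq_case(bool in_cs, uint32_t cs_flags)
{
    if (!in_cs)
        return RSEQ_CASE_OUTSIDE_CS;
    if (cs_flags & SKETCH_CS_NO_RESTART_ON_SIGNAL)
        return RSEQ_CASE_NON_ABORTABLE_CS;
    /* flags == 0, NO_RESTART_ON_PREEMPT or NO_RESTART_ON_MIGRATE */
    return RSEQ_CASE_ABORTABLE_CS;
}
```

| + | Only the <code>RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL</code> bit separates the two "inside CS" cases, which matches the flag check in <code>fixup_thread_rseq</code>. |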
| + | |
| + | === the process is not inside the rseq critical section === |
| + | |
| + | This is the simplest case. The process has a <code>struct rseq</code> registered in the kernel, but its instruction pointer (IP) is currently not inside a critical section (CS). |
| + | |
| + | ==== Dump ==== |
| + | We only need to determine where the <code>struct rseq</code> is and dump its address, length, and signature. |
| + | To achieve that we use the dedicated ptrace request <code>PTRACE_GET_RSEQ_CONFIGURATION</code> (refer to the <code>dump_thread_rseq</code> function). |
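| + | |
| + | A minimal sketch of such a query is shown below. It is not the code of <code>dump_thread_rseq</code>: the function name and the structure name are hypothetical, the structure is a local mirror of the kernel's <code>struct ptrace_rseq_configuration</code>, and the fallback request number is only used when the libc headers do not provide it. The thread is assumed to be already seized and stopped by the caller. |

```c
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Fallback for older libc headers (request added in Linux 5.13). */
#ifndef PTRACE_GET_RSEQ_CONFIGURATION
#define PTRACE_GET_RSEQ_CONFIGURATION 0x420f
#endif

/* Local mirror of the kernel's struct ptrace_rseq_configuration. */
struct rseq_config_sketch {
    uint64_t rseq_abi_pointer; /* address of struct rseq in the tracee */
    uint32_t rseq_abi_size;    /* registered length */
    uint32_t signature;        /* registered signature */
    uint32_t flags;
    uint32_t pad;
};

/*
 * Hypothetical helper: read the rseq registration of a stopped tracee.
 * Returns 0 on success and -1 on failure (errno is set by ptrace).
 */
static int read_rseq_config(pid_t tid, struct rseq_config_sketch *cfg)
{
    /* addr carries the buffer size, data points to the buffer. */
    long ret = ptrace(PTRACE_GET_RSEQ_CONFIGURATION, tid,
                      (void *)sizeof(*cfg), cfg);

    /* On success the kernel returns the size of the structure. */
    if (ret != (long)sizeof(*cfg))
        return -1;
    return 0;
}
```

| + | The three values that have to be dumped are <code>rseq_abi_pointer</code>, <code>rseq_abi_size</code>, and <code>signature</code>. The call fails for a task that is not a stopped tracee of the caller. |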
| + | |
| + | ==== Restore ==== |
| + | We need to take the data about the <code>struct rseq</code> from the image (see images/rseq.proto) and register it from the parasite context using the <code>rseq</code> syscall (see <code>restore_rseq</code> in criu/pie/restorer.c). |
| + | |
| + | === inside CS: <code>flags</code> is <code>0</code> or <code>RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT</code> or <code>RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE</code> === |
| + | |
| + | The process was caught with its IP inside a CS. Can we act as before, that is, dump the <code>struct rseq</code> address, restore it, and so on? No, we can't. |
| + | The reason is that CRIU saves the IP as it was at dump time, whereas rseq semantics require a jump to the abort handler whenever CS execution is interrupted. |
| + | In this case <code>flags</code> is <code>0</code>, <code>RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT</code>, or <code>RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE</code>. |
| + | This means that if the CS is interrupted by preemption or migration (<code>0</code>), by migration (<code>RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT</code>), or by preemption (<code>RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE</code>), |
| + | the kernel fixes up the IP of the process to the abort handler address. In all three cases <code>RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL</code> is not set, so signal delivery aborts the CS as well. |
| + | |
| + | When CRIU dumps the process, it simply saves the IP as it was and later restores it. This is a serious problem: the process would resume in the middle of a CS that should have been aborted, which may break the user application or even crash it. |
| + | |
| + | Let's look at the <code>fixup_thread_rseq</code> function: |
| + | <pre> |
| + | if (task_in_rseq(rseq_cs, TI_IP(core))) { |
| + |     struct pid *tid = &item->threads[i]; |
| + | |
| + |     ... |
| + | |
| + |     pr_warn("The %d task is in rseq critical section. IP will be set to rseq abort handler addr\n", |
| + |             tid->real); |
| + | |
| + |     ... |
| + | |
| + |     if (!(rseq_cs->flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) { |
| + |         pr_warn("The %d task is in rseq critical section. IP will be set to rseq abort handler addr\n", |
| + |                 tid->real); |
| + | |
| + |         TI_IP(core) = rseq_cs->abort_ip; |
| + | |
| + |         if (item->pid->real == tid->real) { |
| + |             compel_set_leader_ip(dmpi(item)->parasite_ctl, rseq_cs->abort_ip); |
| + |         } else { |
| + |             compel_set_thread_ip(dmpi(item)->thread_ctls[i], rseq_cs->abort_ip); |
| + |         } |
| + |     } |
| + | } |
| + | </pre> |
| + | |
| + | It checks whether the process IP is inside the CS and, if so, fixes it up to the abort handler address, just as the kernel does. |
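| + | |
| + | The "is the IP inside the CS" test can be sketched as a single unsigned comparison, in the same spirit as the kernel's <code>in_rseq_cs</code> shown later on this page. The structure below is a local mirror of the <code>struct rseq_cs</code> descriptor, and the helper names <code>ip_in_cs</code> and <code>fixed_up_ip</code> are hypothetical. The sketch covers only the abortable case, where <code>RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL</code> is not set. |

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the rseq critical section descriptor (struct rseq_cs). */
struct rseq_cs_sketch {
    uint32_t version;
    uint32_t flags;
    uint64_t start_ip;           /* first instruction of the CS */
    uint64_t post_commit_offset; /* length of the CS in bytes */
    uint64_t abort_ip;           /* address of the abort handler */
};

/*
 * True when ip lies in [start_ip, start_ip + post_commit_offset).
 * The unsigned subtraction wraps around for ip < start_ip, so one
 * comparison covers both bounds.
 */
static bool ip_in_cs(const struct rseq_cs_sketch *cs, uint64_t ip)
{
    return ip - cs->start_ip < cs->post_commit_offset;
}

/* IP to store in the image for an abortable CS. */
static uint64_t fixed_up_ip(const struct rseq_cs_sketch *cs, uint64_t ip)
{
    return ip_in_cs(cs, ip) ? cs->abort_ip : ip;
}
```

| + | For a descriptor with <code>start_ip = 0x1000</code> and <code>post_commit_offset = 0x20</code>, the address <code>0x101f</code> is inside the CS, while <code>0x1020</code> and <code>0xfff</code> are outside. |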
| + | |
| + | ==== Dump ==== |
| + | We need to determine where the <code>struct rseq</code> is and dump its address, length, and signature. |
| + | To achieve that we use the dedicated ptrace request <code>PTRACE_GET_RSEQ_CONFIGURATION</code> (refer to the <code>dump_thread_rseq</code> function). |
| + | |
| + | In addition, we have to fix up the IP so that it points to the abort handler. |
| + | |
| + | ==== Restore ==== |
| + | We need to take the data about the <code>struct rseq</code> from the image (see images/rseq.proto) and register it from the parasite context using the <code>rseq</code> syscall (see <code>restore_rseq</code> in criu/pie/restorer.c). |
| + | |
| + | No additional actions are needed here. The process will be restored and will continue execution from the abort handler (not within the rseq CS!). |
| + | |
| + | === inside CS: <code>flags</code> is <code>RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL</code> === |
| + | |
| + | This is a rare case, but we support it too. If the rseq CS has the <code>RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL</code> flag, it is technically |
| + | non-abortable. So, at first glance, it seems that nothing special is needed: save the rseq structure address and do not fix up the IP. |
| + | This is incorrect. |
| + | |
| + | The kernel clears the <code>(struct rseq).rseq_cs</code> pointer once we jump into the parasite during the dump: |
| + | <pre> |
| + | static int rseq_ip_fixup(struct pt_regs *regs) |
| + | { |
| + |     ... |
| + | |
| + |     /* |
| + |      * Handle potentially not being within a critical section. |
| + |      * If not nested over a rseq critical section, restart is useless. |
| + |      * Clear the rseq_cs pointer and return. |
| + |      */ |
| + |     if (!in_rseq_cs(ip, &rseq_cs)) |
| + |         return clear_rseq_cs(t); |
| + | </pre> |
| + | |
| + | After the restore, the process would continue executing the rseq CS from the same place (that's okay), but from the kernel's point of view |
| + | it would no longer be within the rseq CS (that's bad!), because the kernel determines the execution context from the <code>(struct rseq).rseq_cs</code> field. |
| + | |
| + | ==== Dump ==== |
| + | We need to determine where the <code>struct rseq</code> is and dump its address, length, and signature. |
| + | To achieve that we use the dedicated ptrace request <code>PTRACE_GET_RSEQ_CONFIGURATION</code> (refer to the <code>dump_thread_rseq</code> function). |
| + | |
| + | We save the IP as it was (no fixup), but we have to save the <code>(struct rseq).rseq_cs</code> field into the CRIU image. |
| + | |
| + | ==== Restore ==== |
| + | We need to take the data about the <code>struct rseq</code> from the image (see images/rseq.proto) and register it from the parasite context using the <code>rseq</code> syscall (see <code>restore_rseq</code> in criu/pie/restorer.c). |
| + | |
| + | We also need to restore the <code>(struct rseq).rseq_cs</code> memory externally using ptrace <code>POKEAREA</code> (see <code>restore_rseq_cs</code>). |
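| + | |
| + | To illustrate what "restoring memory externally" means, here is a minimal sketch that writes a 64-bit value into a stopped tracee with the standard <code>PTRACE_POKEDATA</code> request, one machine word at a time. It is not the code of <code>restore_rseq_cs</code>; the helper name <code>poke_u64</code> is hypothetical, and the caller is assumed to have already attached to the thread and computed the address of the <code>rseq_cs</code> field inside its <code>struct rseq</code>. |

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/*
 * Hypothetical helper: write a 64-bit value at addr in the memory of
 * a stopped tracee. Returns 0 on success and -1 on failure.
 */
static int poke_u64(pid_t tid, uintptr_t addr, uint64_t value)
{
    unsigned char bytes[sizeof(value)];
    size_t off;

    memcpy(bytes, &value, sizeof(value));

    /* PTRACE_POKEDATA writes one word (sizeof(long) bytes) per call. */
    for (off = 0; off < sizeof(value); off += sizeof(long)) {
        long word;

        memcpy(&word, bytes + off, sizeof(word));
        if (ptrace(PTRACE_POKEDATA, tid, (void *)(addr + off),
                   (void *)word) == -1)
            return -1;
    }
    return 0;
}
```

| + | In this scheme the value passed in would be the <code>rseq_cs</code> pointer saved in the image at dump time, written back before the thread is released, so that the kernel again treats the thread as being inside the CS. |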
| + | |
| + | == TODO == |
| + | |
| + | * tests for all architectures (right now we have ZDTM tests only for x86_64) |
| + | * improve support of built-in rseq for non-glibc libraries |
| + | * pre-dump tests (?) |
| + | * leave-running tests (?) |
| + | * crfail test |
| + | * threaded test |
| | | |
| == Useful links == | | == Useful links == |
Line 82: |
Line 195: |
| * [2] https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/ | | * [2] https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/ |
| * [3] https://lwn.net/Articles/883104/ | | * [3] https://lwn.net/Articles/883104/ |
| + | * [4] https://patchwork.sourceware.org/project/glibc/list/?series=5530&state=* |
| | | |
| [[Category: Under the hood]] | | [[Category: Under the hood]] |
| [[Category: Editor help needed]] | | [[Category: Editor help needed]] |