Line 1: |
Line 1: |
− | == Basic design ==
| + | This page describes the overall design of how Checkpoint and Restore work in CRIU. |
| | | |
− | === Checkpoint ===
| + | == Checkpoint == |
| | | |
| The checkpoint procedure relies heavily on the '''/proc''' file system (the general place from which CRIU takes all the information it needs). | | The checkpoint procedure relies heavily on the '''/proc''' file system (the general place from which CRIU takes all the information it needs). |
Line 13: |
Line 13: |
| The process dumper (called the dumper from here on) performs the following steps during the checkpoint stage: | | The process dumper (called the dumper from here on) performs the following steps during the checkpoint stage: |
| | | |
− | ==== Collect process tree and freeze it ====
| + | === Collect process tree and freeze it === |
| The '''$pid''' of a process group leader is obtained from the command line (<code>--tree</code> option). Using this '''$pid''', the dumper walks through the '''/proc/$pid/task/''' directory to collect threads and through '''/proc/$pid/task/$tid/children''' to gather children recursively. While walking, tasks are stopped using <code>ptrace</code>'s <code>PTRACE_SEIZE</code> command. | | The '''$pid''' of a process group leader is obtained from the command line (<code>--tree</code> option). Using this '''$pid''', the dumper walks through the '''/proc/$pid/task/''' directory to collect threads and through '''/proc/$pid/task/$tid/children''' to gather children recursively. While walking, tasks are stopped using <code>ptrace</code>'s <code>PTRACE_SEIZE</code> command. |
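The walk over '''/proc''' described above can be sketched in a few lines of Python. This is only an illustration of the traversal, not CRIU's own code: the function names <code>list_threads</code>, <code>list_children</code> and <code>collect_tree</code> are made up for this example, and the sketch does not freeze anything (CRIU seizes each task with <code>ptrace</code> while it walks). It assumes a Linux system; the <code>children</code> file is present only on kernels built with <code>CONFIG_PROC_CHILDREN</code>, so a missing file is treated as "no children".

```python
import os


def list_threads(pid):
    """Return the thread IDs listed under /proc/<pid>/task/."""
    return sorted(int(tid) for tid in os.listdir(f"/proc/{pid}/task"))


def list_children(pid, tid):
    """Return the child PIDs recorded in /proc/<pid>/task/<tid>/children.

    A missing file (old kernel, or the thread has just exited) is
    treated as "no children".
    """
    try:
        with open(f"/proc/{pid}/task/{tid}/children") as f:
            return [int(p) for p in f.read().split()]
    except (FileNotFoundError, ProcessLookupError):
        return []


def collect_tree(root_pid):
    """Map every PID reachable from root_pid to the list of its thread IDs."""
    tree = {}
    pending = [root_pid]
    while pending:
        pid = pending.pop()
        if pid in tree:
            continue
        try:
            threads = list_threads(pid)
        except (FileNotFoundError, ProcessLookupError):
            continue  # the process exited while we were walking
        tree[pid] = threads
        # Children are recorded per thread, so every thread must be visited.
        for tid in threads:
            pending.extend(list_children(pid, tid))
    return tree
```

Because nothing is frozen here, the result can be stale by the time it is returned. That race is the reason the dumper stops each task as it discovers it.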
| | | |
− | ==== Collect tasks' resources and dump them ====
| + | === Collect tasks' resources and dump them === |
| At this step CRIU reads all the information it knows about the collected tasks and writes it to dump files. The resources are obtained via: | | At this step CRIU reads all the information it knows about the collected tasks and writes it to dump files. The resources are obtained via: |
| # VMAs are parsed from '''/proc/$pid/smaps''' and mapped files are read from the '''/proc/$pid/map_files''' links | | # VMAs are parsed from '''/proc/$pid/smaps''' and mapped files are read from the '''/proc/$pid/map_files''' links |
Line 29: |
Line 29: |
| | | |
| | | |
− | ==== Cleanup ====
| + | === Cleanup === |
| | | |
| After everything is dumped (such as memory pages, which can be written out only from inside the dumpee's address space), we use the ptrace facility again and cure the dumpee by dropping all our parasite code and restoring the original code. Then CRIU detaches from the tasks and they continue to operate. | | After everything is dumped (such as memory pages, which can be written out only from inside the dumpee's address space), we use the ptrace facility again and cure the dumpee by dropping all our parasite code and restoring the original code. Then CRIU detaches from the tasks and they continue to operate. |
| | | |
− | === Restore ===
| + | == Restore == |
| | | |
− | The restore procedure (aka restorer) proceed in the following steps | + | The restore procedure (aka restorer) is done by CRIU morphing itself into the tasks it restores. At the top level it consists of 4 steps: |
| | | |
− | # A process tree has been read from a file.
| + | === Resolve shared resources === |
− | # Every process started with saved (i.e. original) '''$pid''' (see [[Pid restore]]) via <code>clone()</code> call.
| + | |
− | # Files and pipes are restored (by restored it's meant - they are opened and positioned). | + | At this step CRIU reads in the image files and finds out which processes share which resources. Later, each shared resource is restored by one process, and all the others either inherit it at the 2nd stage (like a session) or obtain it in some other way. Examples of the latter are shared files, which are sent with SCM_RIGHTS messages via unix sockets, and shared memory areas, which are restored via a <code>memfd</code> file descriptor. |
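As an illustration of how one process can hand an already opened file to another, here is a minimal Python sketch of passing a file descriptor over a unix socket as ancillary data (the <code>SCM_RIGHTS</code> control message of the Linux socket API). The helpers <code>send_fd</code> and <code>recv_fd</code> are names invented for this example and are not CRIU functions; CRIU's own implementation is in C and is not shown here.

```python
import array
import socket


def send_fd(sock, fd):
    """Send one file descriptor over a connected unix socket.

    One byte of ordinary payload accompanies the control message,
    because the ancillary data has to travel with some regular data.
    """
    fds = array.array("i", [fd])
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock):
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(
        1, socket.CMSG_SPACE(fds.itemsize)
    )
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # The kernel installs a new descriptor in the receiver that
            # refers to the same open file as the sender's descriptor.
            fds.frombytes(data[: fds.itemsize])
    if not fds:
        raise RuntimeError("no file descriptor received")
    return fds[0]
```

The received descriptor shares the file offset and flags with the sender's one, which is what makes this mechanism suitable for files that were shared between tasks at checkpoint time.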
− | # A new memory map is created, filled with data the program had at checkpoint time.
| + | |
− | # Finally the program is kicked to start with rt_sigreturn system call.
| + | === Fork the process tree === |
| + | |
| + | At this step CRIU calls fork() many times to re-create the processes that need to be restored. Note that threads are not restored here, but at the 4th step. |
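The shape of this step can be sketched in Python as a recursive fork over a tree description. Everything here is an assumption made for illustration: the <code>(name, children)</code> tuple format, the function names <code>fork_subtree</code> and <code>restore_tree</code>, and the reporting pipe are not part of CRIU. The sketch also does not restore the original PIDs, which CRIU does.

```python
import os


def fork_subtree(node, report_fd):
    """Become `node` and fork one child per entry in its child list."""
    name, children = node
    # Short writes to a pipe are atomic, so names never interleave.
    os.write(report_fd, (name + "\n").encode())
    kids = []
    for child in children:
        pid = os.fork()
        if pid == 0:
            try:
                fork_subtree(child, report_fd)
            finally:
                os._exit(0)  # never return into the parent's code path
        kids.append(pid)
    for pid in kids:
        os.waitpid(pid, 0)


def restore_tree(tree):
    """Fork the whole tree and return the sorted names that came up."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(r)
        try:
            fork_subtree(tree, w)
        finally:
            os._exit(0)
    os.close(w)
    chunks = []
    while True:
        chunk = os.read(r, 4096)
        if not chunk:  # EOF: every forked process has exited
            break
        chunks.append(chunk)
    os.close(r)
    os.waitpid(pid, 0)
    return sorted(b"".join(chunks).decode().split())
```

Each forked process is single-threaded at this point, which matches the note above that threads are created only at the last step.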
| + | |
| + | === Restore basic tasks resources === |
| + | |
| + | Here CRIU restores all resources except |
| + |
| + | # the exact location of memory mappings |
| + | # timers |
| + | # credentials |
| + | # threads |
| + | |
| + | These four are delayed till the last stage for the reasons described below. At this stage CRIU opens files, prepares namespaces, maps private memory areas (and fills them with data), creates sockets, calls chdir() and chroot(), and does some more. |
| + | |
| + | === Switch to restorer context, restore the rest and continue === |
| + | |
| + | The reason for the restorer blob is simple. Since CRIU morphs into the target process, it has to unmap all of its own memory and put the target's memory back in place. While this happens, some code (the code doing the munmap and mmap) must still exist in memory. So we introduced the restorer blob: a small piece of code that intersects with neither the CRIU mappings nor the target mappings. At the end of the previous stage CRIU jumps into this blob and restores the memory maps. |
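The placement constraint on the blob (it must overlap neither set of mappings) can be illustrated with a small first-fit search over address ranges. This is a hypothetical sketch, not the algorithm CRIU uses: <code>find_hole</code>, its parameters and the half-open <code>(start, end)</code> range format are assumptions made for this example.

```python
def find_hole(mappings, size, lo, hi):
    """Return the lowest address in [lo, hi) where `size` bytes fit
    without touching any (start, end) range in `mappings`, or None.

    Ranges are half-open: a mapping covers start <= addr < end.
    """
    addr = lo
    for start, end in sorted(mappings):
        if start - addr >= size:
            break  # the gap in front of this mapping is large enough
        addr = max(addr, end)  # otherwise skip past the mapping
    return addr if addr + size <= hi else None


# Usage: the blob has to avoid both the current and the target mappings,
# so the two lists are simply searched together.
criu_maps = [(0x1000, 0x3000), (0x8000, 0x9000)]
target_maps = [(0x3000, 0x5000)]
blob_addr = find_hole(criu_maps + target_maps, 0x2000, 0x1000, 0x10000)
```

Searching the union of both lists is the whole point: a hole that is free only in the current layout would be overwritten when the target mappings are put in place, and a hole that is free only in the target layout is still occupied by the code that is running.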
| + | |
| + | Timers, credentials and threads are restored at the same place: timers so that they do not fire too early, credentials so that CRIU can still perform privileged operations (like fork-with-pid) beforehand, and threads so that they do not suffer from the sudden change of memory layout. |
| | | |
| [[Category:Under the hood]] | | [[Category:Under the hood]] |