Basic design
Checkpoint
The checkpoint procedure relies heavily on the /proc file system, which is the general place where CRIU gets the information it needs. This includes:
- File descriptor information (via /proc/$pid/fd and /proc/$pid/fdinfo).
- Pipe parameters.
- Memory maps (via /proc/$pid/maps and /proc/$pid/map_files/).
- etc.
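As an illustration of the kind of data available there, the following minimal Python sketch reads a process's descriptor table from /proc/$pid/fd and /proc/$pid/fdinfo. It is not CRIU's code (CRIU is written in C); the function name `read_fd_table` and the `proc_root` parameter are hypothetical and exist only for this example.

```python
import os


def read_fd_table(pid, proc_root="/proc"):
    """Return {fd: {"target": str or None, "fdinfo": dict}} for one process.

    proc_root is an illustrative parameter so the function can be pointed
    at a different directory layout; normally it is just "/proc".
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    table = {}
    for name in sorted(os.listdir(fd_dir), key=int):
        entry = {"target": None, "fdinfo": {}}
        try:
            # Each entry in fd/ is a symlink to the opened object.
            entry["target"] = os.readlink(os.path.join(fd_dir, name))
            # fdinfo/<fd> holds "key:<tab>value" lines such as pos and flags.
            info_path = os.path.join(proc_root, str(pid), "fdinfo", name)
            with open(info_path) as f:
                for line in f:
                    key, sep, value = line.partition(":")
                    if sep:
                        entry["fdinfo"][key.strip()] = value.strip()
        except OSError:
            # The descriptor may have been closed since listdir() ran.
            pass
        table[int(name)] = entry
    return table
```

On a Linux system, `read_fd_table(os.getpid())` returns the table for the calling process. Values are kept as strings exactly as the kernel prints them.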
The process dumper (called the dumper below) performs the following steps during the checkpoint stage.
Collect process tree and freeze it
The $pid of the process group leader is obtained from the command line (--tree option). Using this $pid, the dumper walks through the /proc/$pid/task/ directory to collect threads and through /proc/$pid/task/$tid/children to gather children recursively. While walking, tasks are stopped using ptrace's PTRACE_SEIZE command.
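The walk described above can be sketched in Python as follows. This is only an approximation under stated assumptions: it reads the tree but does not seize or freeze anything, so the result can go stale while it runs, which is exactly why the dumper stops tasks as it goes. The name `collect_tree` and the `proc_root` parameter are hypothetical, and the `children` file is only present on kernels built with support for it.

```python
import os


def collect_tree(pid, proc_root="/proc"):
    """Return {pid: {"threads": [tid, ...], "children": [pid, ...]}}
    for the given process and all of its descendants."""
    tree = {}
    pending = [pid]
    while pending:
        cur = pending.pop()
        if cur in tree:
            continue
        task_dir = os.path.join(proc_root, str(cur), "task")
        try:
            # One subdirectory per thread of the process.
            tids = sorted(int(t) for t in os.listdir(task_dir))
        except OSError:
            continue  # the process exited before we got to it
        children = []
        for tid in tids:
            try:
                # Space-separated pids of children created by this thread.
                with open(os.path.join(task_dir, str(tid), "children")) as f:
                    children.extend(int(c) for c in f.read().split())
            except OSError:
                pass  # thread exited, or the file is not available
        tree[cur] = {"threads": tids, "children": children}
        pending.extend(children)
    return tree
```

An explicit work list is used instead of recursion so that a deep process tree cannot exhaust the Python call stack.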
Collect tasks' resources and dump them
At this step CRIU reads all the information it knows about for the collected tasks and writes it to dump files. The resources are obtained as follows:
- VMAs are parsed from /proc/$pid/smaps, and mapped files are read from /proc/$pid/map_files links.
- File descriptor numbers are read via /proc/$pid/fd.
- Core parameters of a task (such as registers) are dumped via the ptrace interface and by parsing the /proc/$pid/stat entry.
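To make the VMA step more concrete, here is a minimal Python sketch that parses one VMA header line, the format shared by /proc/$pid/maps and the first line of each /proc/$pid/smaps record, and derives the matching entry name under /proc/$pid/map_files/. The names `Vma`, `parse_vma_line` and `map_files_name` are hypothetical, and the per-VMA statistics lines of smaps (Size, Rss and so on) are deliberately not handled.

```python
from typing import NamedTuple


class Vma(NamedTuple):
    start: int   # first address of the area
    end: int     # first address past the area
    perms: str   # e.g. "r-xp"
    offset: int  # offset into the mapped file
    dev: str     # "major:minor" of the backing device
    inode: int   # 0 for anonymous mappings
    path: str    # "" when the mapping has no name


def parse_vma_line(line):
    """Parse a header line such as
    '00400000-0040b000 r-xp 00000000 08:01 1234   /usr/bin/cat'."""
    # Split at most 5 times so a path containing spaces stays in one piece.
    fields = line.rstrip("\n").split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    path = fields[5] if len(fields) > 5 else ""
    start, end = (int(x, 16) for x in addr.split("-"))
    return Vma(start, end, perms, int(offset, 16), dev, int(inode), path)


def map_files_name(vma):
    """Entry name under /proc/$pid/map_files/ for a file-backed VMA:
    the start and end addresses in hex without leading zeros."""
    return f"{vma.start:x}-{vma.end:x}"
```

For example, `map_files_name(parse_vma_line(line))` gives the link name to look up for a file-backed area; anonymous areas have no such link.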
Then CRIU injects parasite code into the task via the ptrace interface. This is done in two steps. First, only a few bytes for an mmap syscall are injected at the CS:IP the task had at the moment of seizing. ptrace then lets us run the injected syscall, which allocates enough memory for the parasite code needed for dumping. After that, the parasite code is copied to its new place inside the dumpee's address space, and CS:IP is set to point to it.
From the parasite context CRIU collects more information, such as:
- Credentials
- Contents of memory
Cleanup
After everything has been dumped (including memory pages, which can be written out only from inside the dumpee's address space), CRIU uses ptrace again to cure the dumpee: it drops all the parasite code and restores the original code. CRIU then detaches from the tasks and they continue to run.
Restore
The restore procedure (aka the restorer) proceeds in the following steps:
- The process tree is read from a file.
- Every process is started with its saved (i.e. original) $pid (see Pid restore) via the clone() call.
- Files and pipes are restored (that is, they are opened and positioned).
- A new memory map is created and filled with the data the program had at checkpoint time.
- Finally, the program is kicked off with the rt_sigreturn system call.