Difference between revisions of "Checkpoint/Restore"

Latest revision as of 11:05, 12 May 2017

This page describes the overall design of how Checkpoint and Restore work in CRIU.

Checkpoint

The checkpoint procedure relies heavily on the /proc file system (it's the general place where criu takes all the information it needs). The information gathered from /proc includes:

  • File descriptor information (via /proc/$pid/fd and /proc/$pid/fdinfo).
  • Pipe parameters.
  • Memory maps (via /proc/$pid/maps and /proc/$pid/map_files/).
  • etc.
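
As an illustration of the kind of parsing involved, here is a minimal sketch that splits one line of /proc/$pid/maps into its fields. The `Vma` type, the `parse_maps_line` name and the sample line are made up for this example; this is not how criu itself implements it.

```python
from typing import NamedTuple


class Vma(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    dev: str
    inode: int
    path: str


def parse_maps_line(line: str) -> Vma:
    # Fields: address range, perms, offset, dev, inode, optional pathname.
    # maxsplit=5 keeps a pathname containing spaces in one piece.
    fields = line.strip().split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    path = fields[5] if len(fields) > 5 else ""  # anonymous mappings have no path
    start, end = (int(x, 16) for x in addr.split("-"))
    return Vma(start, end, perms, int(offset, 16), dev, int(inode), path)


# Hypothetical sample line in the /proc/$pid/maps format.
sample = "00400000-0040b000 r-xp 00000000 08:01 1234567 /bin/cat"
vma = parse_maps_line(sample)
```

On a live system the same function could be applied to each line of /proc/self/maps.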

The process dumper (called the dumper below) performs the following steps during the checkpoint stage.

Collect process tree and freeze it

The $pid of a process group leader is obtained from the command line (--tree option). Using this $pid, the dumper walks through the /proc/$pid/task/ directory to collect threads and through /proc/$pid/task/$tid/children to gather children recursively. While walking, tasks are stopped using ptrace's PTRACE_SEIZE command.
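
The walk described above can be sketched roughly as follows. This is only an illustration: the `collect_tree` name and the `proc_root` parameter are invented for the example, and it does not seize or stop anything, so on a live system the tree could change while it is being read.

```python
import os
from typing import Dict, List


def collect_tree(pid: int, proc_root: str = "/proc") -> Dict[int, List[int]]:
    """Return {pid: [thread ids]} for pid and all of its descendants."""
    tree: Dict[int, List[int]] = {}
    pending = [pid]
    while pending:
        cur = pending.pop()
        task_dir = os.path.join(proc_root, str(cur), "task")
        # Every entry in task/ is a thread id of this process.
        tids = sorted(int(name) for name in os.listdir(task_dir))
        tree[cur] = tids
        for tid in tids:
            # Each thread lists the pids of the children it has created.
            children_path = os.path.join(task_dir, str(tid), "children")
            with open(children_path) as f:
                pending.extend(int(c) for c in f.read().split())
    return tree
```

Taking the /proc location as a parameter is just a convenience so that the function can be tried against a fake directory tree.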

See also: Freezing the tree

Collect tasks' resources and dump them

At this step CRIU reads all the information it knows about the collected tasks and writes it to dump files. The resources are obtained as follows:

  1. VMAs are parsed from /proc/$pid/smaps, and mapped files are read from the /proc/$pid/map_files links.
  2. File descriptor numbers are read via /proc/$pid/fd.
  3. Core parameters of a task (such as registers and friends) are dumped via the ptrace interface and by parsing the /proc/$pid/stat entry.
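
Parsing /proc/$pid/stat has one well-known catch: the command name is wrapped in parentheses and may itself contain spaces or parentheses. The sketch below shows one way to deal with that. The `parse_stat` name and the sample string are made up for illustration, and only the first few fields are extracted.

```python
def parse_stat(text: str) -> dict:
    # Split around the first '(' and the last ')' so that a comm
    # containing spaces or ')' does not shift the remaining fields.
    lparen = text.index("(")
    rparen = text.rindex(")")
    rest = text[rparen + 1:].split()
    return {
        "pid": int(text[:lparen]),
        "comm": text[lparen + 1:rparen],
        "state": rest[0],
        "ppid": int(rest[1]),
        "pgrp": int(rest[2]),
        "session": int(rest[3]),
    }


# Hypothetical, truncated sample in the /proc/$pid/stat format.
sample = "4242 (my (odd) name) S 1 4242 4242 0 -1 4194304"
info = parse_stat(sample)
```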

Then CRIU injects parasite code into a task via the ptrace interface. This is done in two steps. First, we inject only a few bytes for an mmap syscall at the CS:IP the task has at the moment of seizing. Then ptrace allows us to run the injected syscall, and we allocate enough memory for the parasite code chunk we need for dumping. After that, the parasite code is copied into its new place inside the dumpee's address space and CS:IP is set to point to the parasite code.

From the parasite context CRIU collects more information, such as:

  1. Credentials
  2. Contents of memory

Cleanup

After everything is dumped (such as memory pages, which can be written out only from inside the dumpee's address space), we use the ptrace facility again and cure the dumpee by dropping all our parasite code and restoring the original code. Then CRIU detaches from the tasks and they continue to operate.

Restore

The restore procedure (aka the restorer) is done by CRIU morphing itself into the tasks it restores. At the top level it consists of 4 steps.

Resolve shared resources

At this step CRIU reads in the image files and finds out which processes share which resources. Later, each shared resource is restored by one process, and all the others either inherit it on the 2nd stage (like a session) or obtain it in some other way. Examples of the latter are shared files, which are sent with SCM_RIGHTS messages via unix sockets, and shared memory areas, which are restored via a memfd file descriptor.
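
The idea of this step can be sketched as a simple grouping problem. Everything in the example is an assumption made for illustration: the `resolve_shared` name, the string resource identifiers, and the rule that the lowest pid restores the resource. It does not reflect how criu actually picks the restoring process.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def resolve_shared(users: Dict[int, Iterable[str]]) -> Dict[str, Tuple[int, List[int]]]:
    """Map each resource used by 2+ processes to (restoring pid, receiving pids)."""
    by_resource = defaultdict(list)
    for pid in sorted(users):
        # dict.fromkeys drops duplicates while keeping the listed order.
        for res in dict.fromkeys(users[pid]):
            by_resource[res].append(pid)
    # Assumption for this sketch: the lowest pid restores, the rest receive.
    return {
        res: (pids[0], pids[1:])
        for res, pids in by_resource.items()
        if len(pids) > 1
    }


# Hypothetical input: which process uses which resource.
users = {
    10: ["file:7", "shmem:1"],
    11: ["file:7"],
    12: ["file:9", "shmem:1"],
}
shared = resolve_shared(users)
```

Resources used by a single process (here "file:9") do not appear in the result, since nothing has to be passed around for them.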

Fork the process tree

At this step CRIU calls fork() many times to re-create the processes that need to be restored. Note that threads are not restored here, but on the 4th step.

Restore basic tasks resources

Here CRIU restores all resources except:

  1. the exact location of memory mappings
  2. timers
  3. credentials
  4. threads

The restoration of these four types of resources is delayed until the last stage for the reasons described below. At this stage CRIU opens files, prepares namespaces, maps private memory areas (and fills them with data), creates sockets, calls chdir() and chroot(), and does some more.

Switch to restorer context, restore the rest and continue

The reason for the restorer blob is simple. Since criu morphs into the target process, it has to unmap all of its own memory and put back the target's. While doing so, some code must still exist in memory (the code doing the munmap and mmap). Therefore, the restorer blob is introduced. It's a small piece of code that intersects with neither the criu mappings nor the target mappings. At the end of the 3rd step criu jumps into this blob and restores the memory maps.
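
The requirement that the blob overlaps neither set of mappings boils down to finding a free hole in the union of both. A minimal sketch of such a search is shown below. The `find_hole` name, the half-open (start, end) range representation and the default `lo`/`hi` bounds are assumptions made for this example, not values taken from criu.

```python
from typing import Iterable, Optional, Tuple

Range = Tuple[int, int]  # (start, end), end exclusive


def find_hole(
    criu_maps: Iterable[Range],
    target_maps: Iterable[Range],
    size: int,
    lo: int = 0x10000,
    hi: int = 0x7FFFFFFFF000,
) -> Optional[int]:
    """Return the lowest address in [lo, hi) where `size` bytes fit
    without touching any criu or target mapping, or None."""
    cursor = lo
    for start, end in sorted([*criu_maps, *target_maps]):
        # Gap between the end of everything seen so far and this mapping.
        if min(start, hi) - cursor >= size:
            return cursor
        cursor = max(cursor, end)
    return cursor if hi - cursor >= size else None


# Hypothetical, page-aligned layouts.
criu = [(0x10000, 0x20000)]
target = [(0x20000, 0x30000), (0x31000, 0x40000)]
small = find_hole(criu, target, 0x1000)
large = find_hole(criu, target, 0x2000)
```

In this layout a one-page blob fits into the gap at 0x30000, while a two-page blob only fits after the last mapping.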

At the same place we restore timers, so that they do not fire too early; credentials, so that criu can still do privileged operations (like fork-with-pid) before that point; and threads, so that they do not suffer from a sudden memory layout change.

See also: restorer context, tree after restore.