Changes

693 bytes added, 19:37, 2 September 2016
→‎Restore: OK this is the way it works. Perhaps we need a picture here.
Line 20: Line 20:  
== Restore ==
 
== Restore ==
   −
Upon restore, CRIU already knows which mappings are shared, and the trick is to restore them as such.
+
During the restore, CRIU already knows which mappings are shared, so they need to be restored as shared.
For that, two different approaches are used, depending on the availability.
+
To restore file mappings, no tricks are needed: the files are opened and mmap()ed with the MAP_SHARED flag set.
   −
The common part is, between the processes sharing a mapping, the one with the lowest PID
+
Anonymous memory mappings, though, need some work to be restored as such. Here is how it is done.
among the group performs the actual <code>mmap()</code>, while all the others wait
  −
for the mapping to appear and, once it's available, use it.
     −
=== memfd ===
+
Among all the processes sharing a mapping, the one with the lowest PID
 +
(see [[postulates]]) is assigned to be the mapping creator. The creator's task is to obtain a mapping
 +
file descriptor, restore the mapping data, and signal all the other processes that it's ready.
 +
During this process, all the other processes are waiting.
   −
Linux kernel v3.17 adds a [http://man7.org/linux/man-pages/man2/memfd_create.2.html memfd_create()]
+
First, the creator needs to obtain a file descriptor for the mapping. To achieve this, two different
syscall. CRIU restore checks if it is available from the running kernel; if yes, it is used.
+
approaches are used, depending on the availability.
   −
FIXME how
+
In case the [http://man7.org/linux/man-pages/man2/memfd_create.2.html memfd_create()]
 +
syscall is available (Linux kernel v3.17+), it is used to obtain a file descriptor.
 +
Next, <code>ftruncate()</code> is called to set the proper size of the mapping.
   −
HOW: The memfd in question is created in the task with lowest PID (see [[postulates]]) among those having this shmem segment
+
If <code>memfd_create()</code> is not available, an alternative approach is used.
mapped, then criu waits for the others to get this file by opening the creator's /proc/pid/fd/ link.
+
First, mmap() is called to create a mapping. Next, a file in <code>/proc/self/map_files/</code>
Afterwards all the tasks just mmap() this descriptor into their address space.
+
is opened to get a file descriptor for the mapping. The limitation of this method is that,
 +
due to security concerns, /proc/$PID/map_files/ is not available for processes that
 +
live inside a user namespace, so it is impossible to use it if there
 +
are any user namespaces in the dump.
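The two ways of obtaining the file descriptor described above can be sketched as follows. This is a minimal illustration, not CRIU's actual code: the helper name <code>shmem_fd_create()</code> and the memfd name <code>"shmem-demo"</code> are hypothetical, the size is assumed to be a multiple of the page size, and the fallback is taken on any <code>memfd_create()</code> failure rather than after a separate availability check.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a CRIU function): returns a file descriptor
 * backing `size` bytes of shared memory, or -1 on failure.
 * Assumption: `size` is a non-zero multiple of the page size.
 */
int shmem_fd_create(size_t size)
{
	assert(size > 0);

	/* Preferred path: memfd_create() (Linux 3.17+), then ftruncate(). */
	int fd = (int)syscall(SYS_memfd_create, "shmem-demo", 0);
	if (fd >= 0) {
		if (ftruncate(fd, (off_t)size) == 0)
			return fd;
		close(fd);
		return -1;
	}

	/* Fallback: create the mapping first, then open its map_files entry. */
	void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return -1;

	/* Entries in map_files are named "<start>-<end>" in hexadecimal. */
	char path[64];
	snprintf(path, sizeof(path), "/proc/self/map_files/%lx-%lx",
		 (unsigned long)addr, (unsigned long)addr + (unsigned long)size);

	/* This open() fails without sufficient privileges. */
	fd = open(path, O_RDWR);

	/* If the fd was opened, it keeps the memory alive after munmap(). */
	munmap(addr, size);
	return fd;
}
```

Either way the caller ends up with a descriptor that can be passed to <code>mmap()</code> with <code>MAP_SHARED</code>.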
   −
=== /proc/$PID/map_files/ ===
+
Once the creator has the file descriptor, it mmap()s it and restores its content from
 
+
the dump (using memcpy()). The creator then unmaps the mapping (note that the file
This method is used if memfd is not available. The limitation is, /proc/$PID/map_files/ is not available
+
descriptor is still available). Next, it calls futex(FUTEX_WAKE) to signal all the
for users inside user namespaces (due to security concerns), so it's not possible to use it if there
+
waiting processes that the mapping file descriptor is ready.
are any user namespaces in the dump.
     −
FIXME how
+
All the other processes that need this mapping wait on futex(FUTEX_WAIT). Once the
 +
wait is over, they open the creator's /proc/$CREATOR_PID/fd/$FD file to get the
 +
mapping file descriptor.
   −
HOW: The same technique as with memfd is used, with two exceptions. First is that creator calls mmap()
+
Finally, all the processes (including the creator itself) call mmap() to create the
not memfd_create() and creates the shared memory at once. Then it waits for the others to open its
+
needed mapping (note that mmap() arguments such as length, offset and flags may
/proc/pid/map_files/ link. After opening "the others" mmap() one to their address space just as if
+
differ for different processes), and close() the mapping file descriptor as it is
they would have done it with memfd descriptor.
+
no longer needed.
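The creator/waiter handshake described above can be sketched with two processes. This is an illustration under stated assumptions, not CRIU's implementation: <code>struct shmem_ctl</code>, <code>demo_shared_restore()</code> and the payload are hypothetical, the parent simply plays the creator (no lowest-PID selection is done), only the <code>memfd_create()</code> path is shown, and the creator's own final <code>mmap()</code> is omitted for brevity.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical control block, visible to both creator and waiter. */
struct shmem_ctl {
	uint32_t ready; /* futex word: 0 = not ready, 1 = fd published */
	int fd;         /* creator's fd number for the mapping */
};

static long futex(uint32_t *uaddr, int op, uint32_t val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Returns 0 if the waiter saw the creator's data, -1 otherwise. */
int demo_shared_restore(void)
{
	const size_t size = 4096;
	const char payload[] = "restored";
	assert(sizeof(payload) <= size);

	struct shmem_ctl *ctl = mmap(NULL, sizeof(*ctl), PROT_READ | PROT_WRITE,
				     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (ctl == MAP_FAILED)
		return -1;

	pid_t creator = getpid();
	pid_t pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Waiter: sleep on the futex until the creator publishes the fd. */
		while (__atomic_load_n(&ctl->ready, __ATOMIC_ACQUIRE) == 0)
			futex(&ctl->ready, FUTEX_WAIT, 0);

		/* Get the mapping fd through the creator's /proc fd link. */
		char path[64];
		snprintf(path, sizeof(path), "/proc/%d/fd/%d", (int)creator, ctl->fd);
		int fd = open(path, O_RDWR);
		if (fd < 0)
			_exit(1);

		char *map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		close(fd); /* the fd is no longer needed once the mapping exists */
		if (map == MAP_FAILED)
			_exit(2);
		_exit(memcmp(map, payload, sizeof(payload)) ? 3 : 0);
	}

	/* Creator: obtain the fd, size it, fill in the data, then unmap. */
	int fd = (int)syscall(SYS_memfd_create, "shmem-demo", 0);
	if (fd >= 0 && ftruncate(fd, (off_t)size) == 0) {
		char *map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (map != MAP_FAILED) {
			memcpy(map, payload, sizeof(payload));
			munmap(map, size); /* fd stays open, so the data survives */
		}
	}

	/* Publish the fd number (or -1 on failure) and wake every waiter. */
	ctl->fd = fd;
	__atomic_store_n(&ctl->ready, 1, __ATOMIC_RELEASE);
	futex(&ctl->ready, FUTEX_WAKE, INT_MAX);

	int status = 0;
	waitpid(pid, &status, 0);
	if (fd >= 0)
		close(fd); /* kept open until the waiter has finished */
	munmap(ctl, sizeof(*ctl));
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

In this sketch the creator keeps its descriptor open until the waiter exits, because the waiter can only open <code>/proc/$CREATOR_PID/fd/$FD</code> while that descriptor still exists.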
    
== Changes tracking ==
 
== Changes tracking ==