Line 20: |
Line 20: |
| == Restore == | | == Restore == |
| | | |
− | Upon restore, CRIU already knows which mappings are shared, and the trick is to restore them as such.
| + | During the restore, CRIU already knows which mappings are shared, so they need to be restored as shared. |
− | For that, two different approaches are used, depending on the availability.
| + | To restore file mappings, no tricks are needed: they are opened and mmap()ed with the MAP_SHARED flag set. |
| | | |
− | The common part is, between the processes sharing a mapping, the one with the lowest PID
| + | Anonymous memory mappings, though, need some work to be restored as such. Here is how it is done. |
− | among the group performs the actual <code>mmap()</code>, while all the others wait
| |
− | for the mapping to appear and, once it's available, use it.
| |
| | | |
− | === memfd ===
| + | Among all the processes sharing a mapping, the one with the lowest PID |
| + | (see [[postulates]]) is assigned to be the mapping creator. The creator's task is to obtain a mapping |
| + | file descriptor, restore the mapping data, and signal all the other processes that it's ready. |
| + | During this process, all the other processes are waiting. |
| | | |
− | Linux kernel v3.17 adds a [http://man7.org/linux/man-pages/man2/memfd_create.2.html memfd_create()]
| + | First, the creator needs to obtain a file descriptor for the mapping. To achieve this, two different |
− | syscall. CRIU restore checks if it is available from the running kernel; if yes, it is used.
| + | approaches are used, depending on the availability. |
| | | |
− | FIXME how
| + | If the [http://man7.org/linux/man-pages/man2/memfd_create.2.html memfd_create()] |
| + | syscall is available (Linux kernel v3.17+), it is used to obtain a file descriptor. |
| + | Next, <code>ftruncate()</code> is called to set the proper size of the mapping. |
| | | |
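The memfd path described above can be sketched as follows. This is a minimal illustration, not CRIU's actual code: the helper name <code>shmem_fd_create()</code> and its parameters are hypothetical, and it assumes a glibc that provides the <code>memfd_create()</code> wrapper.

```c
#define _GNU_SOURCE              /* memfd_create() is a GNU extension in glibc */
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper (not a CRIU function): returns a descriptor for a
 * memory-backed file of `size` bytes, or -1 on error. */
static int shmem_fd_create(const char *name, off_t size)
{
    assert(size > 0);

    int fd = memfd_create(name, 0);  /* the name is only a debugging label */
    if (fd < 0)
        return -1;

    /* A fresh memfd is empty; give it the size of the mapping. */
    if (ftruncate(fd, size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The returned descriptor can then be passed to <code>mmap()</code> with <code>MAP_SHARED</code> like any regular file descriptor.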
− | HOW: The memfd in question is created in the task with lowest PID (see [[postulates]]) among those having this shmem segment
| + | If <code>memfd_create()</code> is not available, the alternative approach is used. |
− | mapped, then criu waits for the others to get this file by opening the creator's /proc/pid/fd/ link.
| + | First, mmap() is called to create a mapping. Next, a file in <code>/proc/self/map_files/</code> |
− | Afterwards all the files just mmap() this descriptor into their address space.
| + | is opened to get a file descriptor for the mapping. The limitation of this method is that, |
| + | due to security concerns, /proc/$PID/map_files/ is not available to processes that |
| + | live inside a user namespace, so it cannot be used if there |
| + | are any user namespaces in the dump. |
| | | |
− | === /proc/$PID/map_files/ ===
| + | Once the creator has the file descriptor, it mmap()s it and restores its content from |
− | + | the dump (using memcpy()). The creator then unmaps the mapping (note that the file |
− | This method is used if memfd is not available. The limitation is, /proc/$PID/map_files/ is not available
| + | descriptor is still available). Next, it calls futex(FUTEX_WAKE) to signal all the |
− | for users inside user namespaces (due to security concerns), so it's not possible to use it if there
| + | waiting processes that the mapping file descriptor is ready. |
− | are any user namespaces in the dump.
| |
| | | |
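The creator's step might look like the sketch below. It is an illustration under stated assumptions rather than CRIU's code: the function name and the <code>ready</code> futex word are hypothetical, the word is assumed to live in memory shared with the waiting processes, and glibc has no <code>futex()</code> wrapper, so the raw <code>syscall()</code> interface is used.

```c
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical creator step: copy `len` bytes of dumped data into the
 * file behind `fd`, then mark the mapping as ready and wake the waiters.
 * Returns 0 on success, -1 on error. */
static int creator_fill_and_wake(int fd, const void *data, size_t len,
                                 uint32_t *ready)
{
    assert(ready != NULL);

    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED)
        return -1;

    memcpy(addr, data, len);   /* restore the contents from the dump */
    munmap(addr, len);         /* fd stays open and keeps the data */

    /* Publish the flag (GCC/Clang builtin), then wake every sleeper. */
    __atomic_store_n(ready, 1, __ATOMIC_RELEASE);
    if (syscall(SYS_futex, ready, FUTEX_WAKE, INT_MAX, NULL, NULL, 0) < 0)
        return -1;
    return 0;
}
```

Setting the flag before the wake-up matters: a waiter that arrives late sees a non-zero word and never goes to sleep.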
− | FIXME how
| + | All the other processes that need this mapping wait on futex(FUTEX_WAIT). Once the |
| + | wait is over, they open the creator's /proc/$CREATOR_PID/fd/$FD file to get the |
| + | mapping file descriptor. |
| | | |
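The waiting side could be sketched like this. The function name, the <code>ready</code> futex word and the way the creator's PID and descriptor number reach the waiter are all assumptions made for illustration; CRIU's own bookkeeping is not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/futex.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical waiter step: sleep until `*ready` becomes non-zero, then
 * open the creator's descriptor through /proc. Returns a new descriptor
 * in the calling process, or -1 on error. */
static int waiter_open_fd(uint32_t *ready, pid_t creator_pid, int creator_fd)
{
    char path[64];

    assert(ready != NULL);

    /* FUTEX_WAIT sleeps only while the word still equals 0; re-check the
     * word after every return to handle spurious wake-ups and signals. */
    while (__atomic_load_n(ready, __ATOMIC_ACQUIRE) == 0) {
        long rc = syscall(SYS_futex, ready, FUTEX_WAIT, 0, NULL, NULL, 0);
        if (rc < 0 && errno != EAGAIN && errno != EINTR)
            return -1;
    }

    snprintf(path, sizeof(path), "/proc/%d/fd/%d",
             (int)creator_pid, creator_fd);
    return open(path, O_RDWR);
}
```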
− | HOW: The same technique as with memfd is used, with two exceptions. First is that creator calls mmap()
| + | Finally, all the processes (including the creator itself) call mmap() to create the |
− | not memfd_create() and creates the shared memory at once. Then it waits for the others to open its
| + | needed mapping (note that mmap() arguments such as length, offset and flags may |
− | /proc/pid/map_files/ link. After opening "the others" mmap() one to their address space just as if
| + | differ for different processes), and close() the mapping file descriptor as it is |
− | they would have done it with memfd descriptor.
| + | no longer needed. |
| | | |
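The final step, in which each process maps the shared file with its own arguments and then drops the descriptor, can be illustrated with the sketch below. Both function names are hypothetical. The demo simulates two processes with two <code>dup()</code>ed descriptors inside one process, and it maps at kernel-chosen addresses, which is a simplification made for the example.

```c
#define _GNU_SOURCE              /* for memfd_create() used in the demo */
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical final step: map [offset, offset + len) of the shared file
 * and close the descriptor. Returns MAP_FAILED on error. */
static void *shmem_map_and_close(int fd, size_t len, off_t offset, int prot)
{
    void *addr = mmap(NULL, len, prot, MAP_SHARED, fd, offset);

    close(fd);                   /* the mapping itself keeps the file alive */
    return addr;
}

/* Usage sketch: two views with different length, offset and protection
 * still share the same bytes. Returns 0 on success, -1 on error. */
static int demo_two_views(void)
{
    long pg = sysconf(_SC_PAGESIZE);
    int fd = memfd_create("demo", 0);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, 2 * pg) < 0) {
        close(fd);
        return -1;
    }

    char *whole = shmem_map_and_close(dup(fd), 2 * pg, 0,
                                      PROT_READ | PROT_WRITE);
    char *tail = shmem_map_and_close(dup(fd), pg, pg, PROT_READ);
    close(fd);
    if (whole == MAP_FAILED || tail == MAP_FAILED)
        return -1;

    whole[pg] = 'z';             /* write through the first view */
    int ok = (tail[0] == 'z');   /* read it back through the second view */

    munmap(whole, 2 * pg);
    munmap(tail, pg);
    return ok ? 0 : -1;
}
```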
| == Changes tracking == | | == Changes tracking == |