Difference between revisions of "Shared memory"

From CRIU
(Checkpoint: rephrased note about anon mappings)
(Restore: OK this is the way it works. Perhaps we need a picture here.)

Revision as of 19:37, 2 September 2016

This article describes some intricacies of handling shared memory mappings, i.e. mappings that are shared between several processes.

Checkpoint

Every process has one or more mmapped files. Some mappings (for example, those of shared libraries) are shared between several processes. During checkpointing, CRIU needs to identify all the shared mappings in order to dump them as such.

It does so by calling fstatat() on each entry in /proc/$PID/map_files/ and noting the device and inode fields of the returned structure. This information is collected and sorted. If two or more processes have a mapping with the same device and inode, the mapping is shared and should be dumped as such.
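The grouping step can be sketched as follows. This is a minimal illustration, not CRIU's actual code: the `map_rec` structure and the `mark_shared()` helper are hypothetical, and the records are assumed to have already been filled from fstatat() results.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical record: one per /proc/$PID/map_files/ entry. */
struct map_rec {
    pid_t pid;
    dev_t dev;
    ino_t ino;
    int shared; /* set by mark_shared() */
};

/* Order by (dev, ino), then by pid inside each group. */
static int cmp_rec(const void *a, const void *b)
{
    const struct map_rec *x = a, *y = b;

    if (x->dev != y->dev)
        return x->dev < y->dev ? -1 : 1;
    if (x->ino != y->ino)
        return x->ino < y->ino ? -1 : 1;
    if (x->pid != y->pid)
        return x->pid < y->pid ? -1 : 1;
    return 0;
}

/*
 * Sort the records and flag those whose (dev, ino) pair is mapped by
 * more than one process. Returns the number of flagged records.
 */
static size_t mark_shared(struct map_rec *recs, size_t n)
{
    size_t i, j, k, flagged = 0;

    assert(recs != NULL || n == 0);
    if (n == 0)
        return 0;
    qsort(recs, n, sizeof(*recs), cmp_rec);
    for (i = 0; i < n; i = j) {
        for (j = i + 1; j < n && recs[j].dev == recs[i].dev &&
                        recs[j].ino == recs[i].ino; j++)
            ;
        /* PIDs are sorted inside the group: first != last means 2+ processes. */
        int multi = recs[i].pid != recs[j - 1].pid;
        for (k = i; k < j; k++) {
            recs[k].shared = multi;
            flagged += (size_t)multi;
        }
    }
    return flagged;
}
```

Note that a file mapped twice by the same process is not flagged in this sketch, since only one PID is involved.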

It's important to note that the above mechanism works not only for file-based mappings but also for anonymous ones. For a shared anonymous mapping, the kernel actually creates a hidden tmpfs file, so fstatat() on the /proc/$PID/map_files/ entry works the same way as for other files. The tmpfs file itself is not visible from any tmpfs mount, but it can be opened via its map_files entry.

Restore

During restore, CRIU already knows which mappings are shared, so they need to be restored as shared. Restoring file mappings needs no tricks: the files are opened and mmapped with the MAP_SHARED flag set.

Anonymous memory mappings, though, need some work to be restored as such. Here is how it is done.

Among all the processes sharing a mapping, the one with the lowest PID (see postulates) is assigned to be the mapping creator. The creator's task is to obtain a mapping file descriptor, restore the mapping data, and signal all the other processes that it is ready. While this is happening, all the other processes wait.

First, the creator needs to obtain a file descriptor for the mapping. Two different approaches are used, depending on what the running kernel provides.

If the memfd_create() syscall is available (Linux kernel v3.17+), it is used to obtain a file descriptor. Next, ftruncate() is called to set the proper size of the mapping.
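A minimal sketch of this step, assuming a Linux system whose headers define SYS_memfd_create. The helper name `create_shmem_fd()` is hypothetical, and the raw syscall() is used here only to avoid depending on a particular glibc wrapper; it is not necessarily how CRIU invokes it.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create an anonymous memory-backed file of the given size.
 * Returns a file descriptor, or -1 on failure (for example, when the
 * kernel does not implement memfd_create()).
 */
static int create_shmem_fd(const char *name, off_t size)
{
    int fd;

    assert(name != NULL && size >= 0);
    fd = (int)syscall(SYS_memfd_create, name, 0u);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A return value of -1 is the point at which a caller would fall back to another method.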

If memfd_create() is not available, an alternative approach is used. First, mmap() is called to create a mapping. Next, the corresponding file in /proc/self/map_files/ is opened to get a file descriptor for the mapping. The limitation of this method is that, due to security concerns, /proc/$PID/map_files/ is not available to processes that live inside a user namespace, so the method cannot be used if there are any user namespaces in the dump.

Once the creator has the file descriptor, it mmap()s it and restores its content from the dump (using memcpy()). The creator then unmaps the mapping (note that the file descriptor is still available). Next, it calls futex(FUTEX_WAKE) to signal all the waiting processes that the mapping file descriptor is ready.
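The content restore step can be illustrated as below. This is a sketch under assumptions: `restore_content()` and `restore_demo()` are hypothetical names, the "dump" is just an in-memory buffer, and a memfd stands in for the mapping descriptor. The demo reads the data back through the descriptor after munmap() to show that the content outlives the temporary mapping.

```c
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Map the descriptor, copy the dumped data in, and drop the mapping. */
static int restore_content(int fd, const void *data, size_t len)
{
    void *p;

    assert(data != NULL && len > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    memcpy(p, data, len);
    return munmap(p, len);
}

/* Usage illustration: returns 0 if the data is readable via the fd. */
static int restore_demo(void)
{
    static const char img[] = "page contents from the dump";
    char back[sizeof(img)];
    int ok;
    int fd = (int)syscall(SYS_memfd_create, "restore-demo", 0u);

    if (fd < 0)
        return -1;
    ok = ftruncate(fd, sizeof(img)) == 0 &&
         restore_content(fd, img, sizeof(img)) == 0 &&
         pread(fd, back, sizeof(back), 0) == (ssize_t)sizeof(back) &&
         memcmp(back, img, sizeof(img)) == 0;
    close(fd);
    return ok ? 0 : -1;
}
```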

All the other processes that need this mapping wait on futex(FUTEX_WAIT). Once the wait is over, they open the creator's /proc/$CREATOR_PID/fd/$FD file to get the mapping file descriptor.

Finally, all the processes (including the creator itself) call mmap() to create the mapping they need (note that mmap() arguments such as length, offset and flags may differ between processes), and then close() the mapping file descriptor, as it is no longer needed.
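The point about differing mmap() arguments can be illustrated with two views of the same descriptor. For brevity this sketch creates both views in a single process, whereas in a real restore each view would belong to a different process. The function name is hypothetical and a memfd again stands in for the mapping descriptor.

```c
#include <assert.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Map the same two-page file twice: once in full (read-write) and once
 * covering only the second page (read-only). A write through the first
 * view must be visible through the second. Returns 0 on success.
 */
static int map_views_demo(void)
{
    long pg = sysconf(_SC_PAGESIZE);
    int ok = 0;
    char *whole, *second;
    int fd = (int)syscall(SYS_memfd_create, "views-demo", 0u);

    assert(pg > 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 2 * pg) < 0) {
        close(fd);
        return -1;
    }
    whole = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    second = mmap(NULL, pg, PROT_READ, MAP_SHARED, fd, pg);
    close(fd); /* the mappings stay valid after the descriptor is closed */

    if (whole != MAP_FAILED && second != MAP_FAILED) {
        whole[pg] = 'x'; /* first byte of the second page */
        ok = (second[0] == 'x');
    }
    if (whole != MAP_FAILED)
        munmap(whole, 2 * pg);
    if (second != MAP_FAILED)
        munmap(second, pg);
    return ok ? 0 : -1;
}
```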

Changes tracking

For iterative migration it's very useful to track changes in memory. Until 2.5, changes were tracked for anonymous memory only, but now CRIU does this for shared memory as well. To do so, CRIU scans the pagemap of every shmem segment owner (as it does for anon memory) and then ANDs the collected soft-dirty bits.
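As an illustration of what a pagemap scan looks at, the sketch below decodes the flag bits of 64-bit pagemap entries. The bit positions follow the kernel's pagemap documentation (bit 55 soft-dirty, bit 62 swapped, bit 63 present). The macro and function names are hypothetical, and the entries are assumed to have already been read from /proc/$PID/pagemap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits of a /proc/$PID/pagemap entry. */
#define PME_PRESENT    (1ULL << 63)
#define PME_SWAPPED    (1ULL << 62)
#define PME_SOFT_DIRTY (1ULL << 55)

static int pme_soft_dirty(uint64_t entry)
{
    return (entry & PME_SOFT_DIRTY) != 0;
}

/* Count the pages whose soft-dirty bit is set. */
static size_t count_soft_dirty(const uint64_t *entries, size_t n)
{
    size_t i, dirty = 0;

    assert(entries != NULL || n == 0);
    for (i = 0; i < n; i++)
        dirty += (size_t)pme_soft_dirty(entries[i]);
    return dirty;
}
```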

Change tracking also led the developers to implement Memory images deduplication for shmem segments.

Dumping present pages

When dumping the contents of shared memory, CRIU doesn't dump all the data. Instead, it determines which pages actually contain data and dumps only those. This is done similarly to how regular memory dumping and restoring works, i.e. by analyzing the owners' pagemap entries for the PRESENT or SWAPPED bits. But shmem dumps have one peculiarity: sometimes a shmem page can exist in the kernel without being mapped into any process. CRIU detects such pages by calling mincore() on the shmem segment, which reports the in-memory status of each page. The mincore bitmap is then AND-ed with the per-process ones.

See also

Memory dumping and restoring

Memory images deduplication