This article describes some intricacies of handling shared memory mappings, i.e. mappings that are shared between several processes.
Every process has one or more mmapped files. Some mappings (for example, those of shared libraries) are shared between several processes. During checkpointing, CRIU needs to figure out which mappings are shared in order to dump them as such.
It does so by calling fstatat() on each entry in /proc/$PID/map_files/ and noting the device and inode fields of the structure that fstatat() returns. This information is collected and sorted. If two or more processes have a mapping with the same device and inode, the mapping is a shared one and should be dumped as such.
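The collect-and-sort step can be sketched in C as follows. This is a minimal illustration and not CRIU's actual code: `struct map_id`, `collect_map_ids()` and `count_shared()` are hypothetical names. The sketch assumes that the caller is allowed to follow the /proc/$PID/map_files/ links, which typically requires elevated privileges.

```c
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* One record per (process, mapping) pair; the layout is hypothetical. */
struct map_id {
    dev_t dev;
    ino_t ino;
    pid_t pid;
};

/* Collect (dev, ino) of every /proc/<pid>/map_files entry.
 * Returns the number of records stored, or -1 if the directory
 * cannot be opened. Entries that cannot be stat-ed are skipped. */
static int collect_map_ids(pid_t pid, struct map_id *out, size_t max)
{
    char path[64];
    struct dirent *de;
    size_t n = 0;
    DIR *d;

    assert(out != NULL);
    snprintf(path, sizeof(path), "/proc/%d/map_files", (int)pid);
    d = opendir(path);
    if (!d)
        return -1;
    while (n < max && (de = readdir(d)) != NULL) {
        struct stat st;

        if (de->d_name[0] == '.')
            continue;
        /* flags == 0 follows the link, so st describes the mapped file */
        if (fstatat(dirfd(d), de->d_name, &st, 0) != 0)
            continue;
        out[n].dev = st.st_dev;
        out[n].ino = st.st_ino;
        out[n].pid = pid;
        n++;
    }
    closedir(d);
    return (int)n;
}

/* Order by device, then inode, then PID. */
static int map_id_cmp(const void *a, const void *b)
{
    const struct map_id *x = a, *y = b;

    if (x->dev != y->dev)
        return x->dev < y->dev ? -1 : 1;
    if (x->ino != y->ino)
        return x->ino < y->ino ? -1 : 1;
    if (x->pid != y->pid)
        return x->pid < y->pid ? -1 : 1;
    return 0;
}

/* Sort the records and count the (dev, ino) keys that occur
 * in more than one process, i.e. the shared mappings. */
static size_t count_shared(struct map_id *v, size_t n)
{
    size_t shared = 0, i = 0;

    qsort(v, n, sizeof(*v), map_id_cmp);
    while (i < n) {
        size_t j = i + 1;
        int multi = 0;

        while (j < n && v[j].dev == v[i].dev && v[j].ino == v[i].ino) {
            if (v[j].pid != v[i].pid)
                multi = 1;
            j++;
        }
        shared += multi;
        i = j;
    }
    return shared;
}
```

Because the records of one group end up sorted by PID, the first record of each group also identifies the process with the lowest PID in that group.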
It's important to note that the above mechanism works not just for file-based mappings, but also for anonymous ones. For a shared anonymous mapping, the kernel actually creates a hidden tmpfs file, so fstatat() on the /proc/$PID/map_files/ entry works the same way as for other files. The tmpfs file itself is not visible from any tmpfs mount, but it can be opened via its /proc/$PID/map_files/ entry.
Upon restore, CRIU already knows which mappings are shared, and the trick is to restore them as such. Two different approaches are used for that, depending on which one is available.
The common part is that, among the processes sharing a mapping, the one with the lowest PID performs the actual mmap(), while all the others wait for the mapping to appear and, once it's available, use it.
Linux kernel v3.17 added the memfd_create() syscall. On restore, CRIU checks whether it is available in the running kernel; if yes, it is used.
HOW: The memfd in question is created in the task with the lowest PID (see postulates) among those having this shmem segment mapped. CRIU then waits for the others to get this file by opening the creator's /proc/pid/fd/ link. Afterwards all the tasks just mmap() this descriptor into their address space.
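The memfd hand-off can be illustrated with two processes. The following is a minimal sketch under stated assumptions, not CRIU's restorer code: `memfd_share_demo()`, `SEG_SIZE` and the name "shmem-demo" are hypothetical, fork() is used only to obtain a second process, and the peer drops its inherited descriptor and mapping so that it really has to go through the creator's /proc/pid/fd/ link. memfd_create() is invoked via syscall() so that the sketch also builds with C libraries that lack the wrapper.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEG_SIZE 4096 /* hypothetical segment size */

/* Returns 0 if a peer that opened the creator's /proc/<pid>/fd/<n>
 * link ends up sharing memory with the creator, -1 otherwise. */
static int memfd_share_demo(void)
{
    pid_t creator = getpid();
    pid_t child;
    char *mine;
    int status, ok, fd;

    /* Creator side: create the memfd, size it and map it. */
    fd = (int)syscall(SYS_memfd_create, "shmem-demo", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, SEG_SIZE) != 0) {
        close(fd);
        return -1;
    }
    mine = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mine == MAP_FAILED) {
        close(fd);
        return -1;
    }

    child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        char path[64];
        char *theirs;
        int pfd;

        /* Peer side: forget everything inherited from the creator ... */
        munmap(mine, SEG_SIZE);
        close(fd);
        /* ... and reach the same file through the creator's fd link. */
        snprintf(path, sizeof(path), "/proc/%d/fd/%d", (int)creator, fd);
        pfd = open(path, O_RDWR);
        if (pfd < 0)
            _exit(1);
        theirs = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, pfd, 0);
        if (theirs == MAP_FAILED)
            _exit(2);
        strcpy(theirs, "hello");
        _exit(0);
    }

    /* Creator side: wait for the peer, then check what it wrote. */
    if (waitpid(child, &status, 0) != child)
        return -1;
    ok = WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
         strcmp(mine, "hello") == 0;
    munmap(mine, SEG_SIZE);
    close(fd);
    return ok ? 0 : -1;
}
```

The data written by the peer is visible to the creator because both mappings refer to the same memfd file, even though the peer obtained its descriptor through /proc.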
The /proc/$PID/map_files/ method is used if memfd is not available. Its limitation is that /proc/$PID/map_files/ is not available to users inside user namespaces (due to security concerns), so it cannot be used if there are any user namespaces in the dump.
HOW: The same technique as with memfd is used, with two exceptions. First, the creator calls mmap() rather than memfd_create() and so creates the shared memory at once. Second, it then waits for the others to open its /proc/pid/map_files/ link. After opening the link, the others mmap() it into their address space just as they would have done with a memfd descriptor.
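The peer side of the map_files approach might look like the sketch below. It is an illustration only: `map_files_path()` and `attach_via_map_files()` are hypothetical names, the creator's PID and the address range of its mapping are assumed to be known already, and opening the link is assumed to be permitted (it usually is not for unprivileged users). Entries in map_files are named after the mapping's start and end addresses in hexadecimal.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/proc/<pid>/map_files/<start>-<end>" (hex, no 0x prefix).
 * Returns 0 on success, -1 if the buffer is too small. */
static int map_files_path(char *buf, size_t len, pid_t pid,
                          unsigned long start, unsigned long end)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len, "/proc/%d/map_files/%lx-%lx",
                 (int)pid, start, end);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Peer side: open the creator's mapping through its map_files link
 * and map it. Returns MAP_FAILED if any step fails. */
static void *attach_via_map_files(pid_t creator,
                                  unsigned long start, unsigned long end)
{
    char path[96];
    void *p;
    int fd;

    if (map_files_path(path, sizeof(path), creator, start, end) != 0)
        return MAP_FAILED;
    fd = open(path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;
    p = mmap(NULL, end - start, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p;
}
```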
For iterative migration it's very useful to track changes in memory. Until version 2.5, changes were tracked for anonymous memory only, but now CRIU does this for shared memory as well. To do so, CRIU scans the pagemap of every owner of the shmem segment (as it does for anon memory) and then merges the collected soft-dirty bits.
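As a rough illustration of what scanning a pagemap for soft-dirty bits involves, the sketch below reads one 64-bit pagemap entry of the calling process and tests bit 55, which the kernel documents as the soft-dirty bit. The helper names `read_pme()` and `pme_soft_dirty()` are hypothetical, and the bit is only meaningful on kernels built with soft-dirty support. Per the kernel documentation, the bits are reset by writing "4" to /proc/PID/clear_refs.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PME_SOFT_DIRTY (1ULL << 55) /* pagemap bit 55: page is soft-dirty */

/* Non-zero if the pagemap entry has the soft-dirty bit set. */
static int pme_soft_dirty(uint64_t pme)
{
    return (pme & PME_SOFT_DIRTY) != 0;
}

/* Read the pagemap entry that covers addr in the calling process.
 * Returns 0 on success, -1 on failure. */
static int read_pme(const void *addr, uint64_t *pme)
{
    uintptr_t psz = (uintptr_t)sysconf(_SC_PAGESIZE);
    /* One 8-byte entry per virtual page, indexed by page number. */
    off_t off = (off_t)((uintptr_t)addr / psz * sizeof(uint64_t));
    ssize_t r;
    int fd;

    assert(pme != NULL);
    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;
    r = pread(fd, pme, sizeof(*pme), off);
    close(fd);
    return r == (ssize_t)sizeof(*pme) ? 0 : -1;
}
```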
Change tracking also led the developers to implement memory images deduplication for shmem segments.
Dumping present pages
When dumping the contents of shared memory, CRIU doesn't dump all the data. Instead, it determines which pages actually contain data and dumps only those. This is done similarly to how regular memory dumping and restoring works, i.e. by analyzing the owners' pagemap entries for the PRESENT or SWAPPED bits. Shmem dumps have one peculiarity, though: sometimes a shmem page exists in the kernel but is not mapped into any process. CRIU detects such pages by calling mincore() on the shmem segment, which reports the in-memory status of each page. The mincore bitmap is then merged with the per-process ones, so a page is dumped if either source reports it.