Shared memory

1,783 bytes added, 08:02, 10 January 2018
Checkpoint: twice 'the'
Every process has one or more memory mappings, i.e. regions of virtual memory it is allowed to use. Some of these mappings can be shared between several processes; they are called shared mappings. In this article, the term refers to shared '''anonymous (not file-based) memory mappings'''. The article describes some intricacies of handling such mappings.
== Checkpoint ==
During checkpointing, CRIU needs to find all the shared mappings in order to dump them as such.
It does so by calling <code>fstatat()</code> on each entry found in <code>/proc/$PID/map_files/</code> and noting the ''device:inode'' pair of the structure returned by <code>fstatat()</code>. If several processes have a mapping with the same ''device:inode'' pair, this mapping is marked as shared between these processes and dumped as such.
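The grouping step can be sketched roughly as follows. This is an illustration in Python, not CRIU's actual code: the function names <code>map_files_ids()</code> and <code>find_shared()</code> are hypothetical, and the sketch does not filter by mapping type, so file-backed mappings used by several processes would be reported too.

```python
import os
from collections import defaultdict


def map_files_ids(pid):
    """Collect (st_dev, st_ino) pairs for all entries of /proc/<pid>/map_files/.

    Hypothetical helper. Following these links normally needs elevated
    privileges, which is fine for a tool that runs as root.
    """
    base = f"/proc/{pid}/map_files"
    ids = set()
    for name in os.listdir(base):
        # os.stat() follows the link, like fstatat() without AT_SYMLINK_NOFOLLOW
        st = os.stat(os.path.join(base, name))
        ids.add((st.st_dev, st.st_ino))
    return ids


def find_shared(ids_by_pid):
    """Return {(dev, ino): [pids]} for pairs seen in two or more processes."""
    users = defaultdict(set)
    for pid, ids in ids_by_pid.items():
        for key in ids:
            users[key].add(pid)
    return {key: sorted(pids) for key, pids in users.items() if len(pids) > 1}
```

With the per-process sets collected, <code>find_shared({pid: map_files_ids(pid) for pid in pids})</code> yields the candidate shared mappings and the processes that use each of them.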
Note that <code>fstatat()</code> works for anonymous shared mappings because the kernel actually backs them with a hidden tmpfs-based file, which is not visible from any tmpfs mount but is accessible via its <code>/proc/$PID/map_files/</code> entry.

Dumping a mapping means two things:
* writing an entry into the process' mm.img file;
* storing the actual mapping data (contents).
For shared mappings, the contents are stored in a pair of image files: pagemap-shmem.img and pages.img. For details, see [[Memory dumps]].

Note that different processes can map different parts of a shared memory segment. In this case, CRIU first collects mapping offsets and lengths from all the processes to determine the total segment size, then reads the contents of all the parts from the respective processes.
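As a small illustration of the size calculation (a sketch only; <code>segment_size()</code> is a hypothetical name, and the assumption is that the segment must cover the furthest byte mapped by any process):

```python
def segment_size(parts):
    """Total segment size implied by (offset, length) pairs from all processes.

    Hypothetical helper: the segment has to cover the furthest end of any part.
    """
    return max((offset + length for offset, length in parts), default=0)
```

For example, one process mapping the first page and another mapping the third page of a segment (4096-byte pages) implies a segment of at least three pages.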
== Restore ==
During the restore, CRIU already knows which mappings are shared, and they need to be restored as such. Here is how it is done.

Among all the processes sharing a mapping, the one with the lowest PID in the group (see [[postulates]]) is assigned to be the mapping creator. The creator's task is to obtain a mapping file descriptor, restore the mapping data, and signal all the other processes that it is ready. Meanwhile, all the other processes wait.

First, the creator needs to obtain a file descriptor for the mapping. Two different approaches are used for this, depending on what the running kernel supports.
If the <code>memfd_create()</code> syscall is available (Linux kernel v3.17+), it is used to obtain a file descriptor. Next, <code>ftruncate()</code> is called to set the proper size of the mapping.
If <code>memfd_create()</code> is not available, the alternative approach is used. First, <code>mmap()</code> is called to create a mapping. Next, the corresponding file in <code>/proc/self/map_files/</code> is opened to get a file descriptor for the mapping. The limitation of this method is that, due to security concerns, <code>/proc/$PID/map_files/</code> is not available for processes that live inside a user namespace, so the method cannot be used if there are any user namespaces in the dump.
Once the creator has the file descriptor, it <code>mmap()</code>s it and restores its contents from the dump (using <code>memcpy()</code>). The creator then unmaps the mapping (note that the file descriptor is still available). Next, it calls <code>futex(FUTEX_WAKE)</code> to signal all the waiting processes that the mapping file descriptor is ready.
All the other processes that need this mapping wait on <code>futex(FUTEX_WAIT)</code>. Once the wait is over, they open the creator's <code>/proc/$CREATOR_PID/fd/$FD</code> file to get the mapping file descriptor.
Finally, all the processes (including the creator itself) call <code>mmap()</code> to create the needed mapping (note that <code>mmap()</code> arguments such as length, offset and flags may differ between processes), and <code>close()</code> the mapping file descriptor, as it is no longer needed.
== Changes tracking ==
For [[iterative migration]] it is very useful to track changes in memory. Until CRIU v2.5, changes were tracked for anonymous memory only, but now shared memory can be tracked as well. To achieve this, CRIU scans the pagemap of every owner of the shmem segment (as it does for anonymous memory) and then ANDs the collected soft-dirty bits.
Change tracking also led the developers to implement [[memory images deduplication]] for shmem segments.
== Dumping present pages ==
When dumping the contents of shared memory, CRIU does not dump all of the data. Instead, it determines which pages actually contain data and dumps only those pages. This is done similarly to how regular [[memory dumping and restoring]] works, i.e. by looking for the PRESENT or SWAPPED bits in the owners' pagemap entries.
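A rough sketch of the bit check, assuming the standard layout of <code>/proc/$PID/pagemap</code> (one 64-bit entry per page, bit 63 for present, bit 62 for swapped). The helper names are hypothetical and this is not CRIU's actual code.

```python
import mmap
import struct

PM_PRESENT = 1 << 63   # page is present in RAM
PM_SWAPPED = 1 << 62   # page is in swap


def read_pagemap(pid, vaddr, npages):
    """Read raw pagemap entries for npages pages starting at address vaddr."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // mmap.PAGESIZE) * 8)   # 8 bytes per page
        return f.read(npages * 8)


def parse_pagemap(raw):
    """Split raw pagemap bytes into native-endian 64-bit integers."""
    return [entry for (entry,) in struct.iter_unpack("=Q", raw)]


def pages_to_dump(entries):
    """Indices of pages that are present or swapped, i.e. hold data."""
    return [i for i, e in enumerate(entries) if e & (PM_PRESENT | PM_SWAPPED)]
```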
There is one particular feature of shared memory dumps worth mentioning. Sometimes a shared memory page exists in the kernel but is not mapped into any process. CRIU detects such pages by calling <code>mincore()</code> on the shmem segment, which reports the in-memory status of each page. The mincore bitmap is then ANDed with the per-process ones.
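To show what <code>mincore()</code> reports, here is a small sketch that calls it through <code>ctypes</code>, since Python has no direct wrapper. It assumes Linux with a libc that exports <code>mincore()</code>; the function name <code>resident_pages()</code> is hypothetical, and the sketch queries a mapping owned by the calling process rather than a segment being dumped.

```python
import ctypes
import mmap
import os

libc = ctypes.CDLL(None, use_errno=True)
libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                         ctypes.POINTER(ctypes.c_ubyte)]
libc.mincore.restype = ctypes.c_int


def resident_pages(mm):
    """Return one bool per page of the mmap object: True if resident in RAM."""
    npages = (len(mm) + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    vec = (ctypes.c_ubyte * npages)()
    buf = (ctypes.c_char * len(mm)).from_buffer(mm)  # gives us the address
    try:
        if libc.mincore(ctypes.addressof(buf), len(mm), vec) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        del buf  # release the buffer so the mapping can be closed later
    # Only the least significant bit of each byte is meaningful.
    return [bool(b & 1) for b in vec]
```

A page that has just been written to is expected to show up as resident in the returned list.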
== See also ==
* [[Memory dumping and restoring]]
* [[Memory images deduplication]]
[[Category:Under the hood]]
[[Category:Editor help needed]]