Difference between revisions of "Shared memory"

m (→‎Checkpoint: twice 'the')
 
(15 intermediate revisions by 2 users not shown)
Line 1: Line 1:
This articles describes some intricacies of handling shared memory mappings, i.e. mappings that are shared between a few processes.
+
Every process has one or more memory mappings, i.e. regions of virtual memory it allows to use.
 +
Some such mappings can be shared between a few processes, and they are called shared mappings.
 +
In other words, these are shared '''anonymous (not file-based) memory mappings'''.
 +
The article describes some intricacies of handling such mappings.  
  
 
== Checkpoint ==
 
== Checkpoint ==
  
Every process has one or more mmaped files. Some mappings (for example, ones of shared libraries)
+
During the checkpointing, CRIU needs to figure out all the shared mappings in order to dump them as such.
are shared between a few processes. During the checkpointing, CRIU need to figure out
 
all the mappings that are shared in order to dump them as such.
 
  
It does so by performing <code>fstatat()</code> for each entry in <code>/proc/$PID/map_files/</code>,
+
It does so by calling <code>fstatat()</code> on each entry found in the <code>/proc/$PID/map_files/</code>,
noting the ''device'' and ''inode'' fields of the structure returned by fstatat(). This information
+
noting the ''device:inode'' pair of the structure returned by <code>fstatat()</code>. Now, if some processes
is collected and sorted. Now, if any few processes have a mapping with same ''device'' and ''inode'',
+
have a mapping with the same ''device:inode'' pair, this mapping is marked as shared between these processes
this mapping is a shared one and should be dumped as such.
+
and dumped as such.
  
It's important to note that this works for both -- anonymous and file shared mappings, as for the
+
Note that <code>fstatat()</code> works because the kernel actually creates a hidden
former ones kernel creates an invisible through the VFS tree tmpfs-based file, and it's possible
+
tmpfs file, not visible from any tmpfs mounts, but accessible via its
to works with it just like with any other file (except that it cannot be opened via any path but
+
<code>/proc/$PID/map_files/</code> entry.
/proc/pid/map_files/address).
+
 
 +
Dumping a mapping means two things:
 +
* writing an entry into process' mm.img file;
 +
* storing the actual mapping data (contents).
 +
For shared mappings, the contents is stored into a pair of image files: pagemap-shmem.img and pages.img.
 +
For details, see [[Memory dumps]].
 +
 
 +
Note that different processes can map different parts of a shared memory segment.
 +
In this case, CRIU first collects mapping offsets and lengths from all the processes
 +
to determine the total segment size, then reads all the parts contents
 +
from the respective processes.
  
 
== Restore ==
 
== Restore ==
  
Upon restore, CRIU already knows which mappings are shared, and the trick is to restore them as such.
+
During the restore, CRIU already knows which mappings are shared, so they need to be
For that, two different approaches are used, depending on the availability.
+
restored as such. Here is how it is done.
  
The common part is, between the processes sharing a mapping, the one with the lowest PID
+
Among all the processes sharing a mapping, the one with the lowest PID among the group
among the group performs the actual <code>mmap()</code>, while all the others wait
+
(see [[postulates]]) is assigned to be a mapping creator. The creator task is to obtain a mapping
for the mapping to appear and, once it's available, use it.
+
file descriptor, restore the mapping data, and signal all the other process that it's ready.
 +
During this process, all the other processes are waiting.
  
=== memfd ===
+
First, the creator need to obtain a file descriptor for the mapping. To achieve it, two different
 +
approaches are used, depending on the availability.
  
Linux kernel v3.17 adds a [http://man7.org/linux/man-pages/man2/memfd_create.2.html memfd_create()]
+
In case [http://man7.org/linux/man-pages/man2/memfd_create.2.html memfd_create()]
syscall. CRIU restore checks if it is available from the running kernel; it yes, it is used.
+
syscall is available (Linux kernel v3.17+), it is used to obtain a file descriptor.
 +
Next, <code>ftruncate()</code> is called to set the proper size of mapping.
  
FIXME how
+
If <code>memfd_create()</code> is not available, the alternative approach is used.
 +
First, mmap() is called to create a mapping. Next, a file in <code>/proc/self/map_files/</code>
 +
is opened to get a file descriptor for the mapping. The limitation of this method is,
 +
due to security concerns, /proc/$PID/map_files/ is not available for processes that
 +
live inside a user namespace, so it is impossible to use it if there
 +
are any user namespaces in the dump.
  
HOW: The memfd in question is created in the task with lowest PID among those having this shmem segment
+
Once the creator have the file descriptor, it mmap()s it and restores its content from
mapped, then criu waits for the others to get this file by opening the creator's /proc/pid/fd/ link.
+
the dump (using memcpy()). The creator then unmaps the the mapping (note the file
Afterwards all the files just mmap() this descriptor into their address space.
+
descriptor is still available). Next, it calls futex(FUTEX_WAKE) to signal all the
 +
waiting processes that the mapping file descriptor is ready.
  
=== /proc/$PID/map_files/ ===
+
All the other processes that need this mapping wait on futex(FUTEX_WAIT). Once the
 +
wait is over, they open the creator's /proc/$CREATOR_PID/fd/$FD file to get the
 +
mapping file descriptor.
  
This method is used if memfd is not available. The limitation is, /proc/$PID/map_files/ is not available
+
Finally, all the processes (including the creator itself) call mmap() to create a
for users inside user namespaces (due to security concerns), so it's not possible to use it if there
+
needed mapping (note that mmap() arguments such as length, offset and flags may
are any user namespaces in the dump.
+
differ for different processes), and close() the mapping file descriptor as it is
 +
no longer needed.
 +
 
 +
== Changes tracking ==
 +
 
 +
For [[iterative migration]] it's very useful to track changes in memory. Until CRIU v2.5, changes were tracked for anonymous memory only, but now it is also shared memory can be tracked as well. To achieve it, CRIU scans all the shmem segment owners' pagemap (as it does for anonymous memory) and then ANDs the collected soft-dirty bits.
 +
 
 +
The changes tracking caused developers to implement [[memory images deduplication]] for shmem segments as well.
 +
 
 +
== Dumping present pages ==
 +
 
 +
When dumping the contents of shared memory, CRIU does not dump all of the data. Instead, it determines which pages contain
 +
it, and only dumps those pages. This is done similarly to how regular [[memory dumping and restoring]] works, i.e. by looking
 +
for PRESENT or SWAPPED bits in owners' pagemap entries.
  
FIXME how
+
There is one particular feature of shared memory dumps worth mentioning. Sometimes, a shared memory page
 +
can exist in the kernel, but it is not mapped to any process. CRIU detects such pages by calling mincore()
 +
on the shmem segment, which reports back the page in-memory status. The mincore bitmap is when ANDed with
 +
the per-process ones.
  
HOW: The same technique as with memfd is used, with two exceptions. First is that creator calls mmap()
+
== See also ==
not memfd_create() and creates the shared memory at once. Then it waits for the others to open its
 
/proc/pid/map_files/ link. After opening "the others" mmap() one to their address space just as if
 
they would have done it with memfd descriptor.
 
  
===
+
* [[Memory dumping and restoring]]
 +
* [[Memory images deduplication]]
  
 
[[Category:Memory]]
 
[[Category:Memory]]
 
[[Category:Under the hood]]
 
[[Category:Under the hood]]
[[Category:Empty articles]]
+
[[Category:Editor help needed]]

Latest revision as of 08:02, 10 January 2018

Every process has one or more memory mappings, i.e. regions of virtual memory it is allowed to use. Some of these mappings can be shared between several processes, and they are called shared mappings. In other words, these are shared anonymous (not file-based) memory mappings. This article describes some intricacies of handling such mappings.

Checkpoint

During the checkpointing, CRIU needs to figure out all the shared mappings in order to dump them as such.

It does so by calling fstatat() on each entry found in /proc/$PID/map_files/ and noting the device:inode pair from the structure returned by fstatat(). If several processes have a mapping with the same device:inode pair, this mapping is marked as shared between these processes and dumped as such.

Note that fstatat() works because the kernel actually creates a hidden tmpfs file, not visible from any tmpfs mounts, but accessible via its /proc/$PID/map_files/ entry.

Dumping a mapping means two things:

  • writing an entry into the process's mm.img file;
  • storing the actual mapping data (contents).

For shared mappings, the contents are stored in a pair of image files: pagemap-shmem.img and pages.img. For details, see Memory dumps.

Note that different processes can map different parts of a shared memory segment. In this case, CRIU first collects the mapping offsets and lengths from all the processes to determine the total segment size, then reads the contents of each part from the respective process.
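The size computation can be illustrated with a small C sketch. The struct shmem_part type and the shmem_segment_size() function are hypothetical names introduced for this example; the idea is simply that the segment has to be large enough to cover the furthest byte mapped by any process.

```c
#include <stddef.h>

/* One process's view of a shared segment: which part of it is mapped. */
struct shmem_part {
    unsigned long offset; /* offset of the mapping inside the segment, in bytes */
    unsigned long length; /* length of the mapping, in bytes */
};

/* The segment size is the largest offset + length seen in any process. */
unsigned long shmem_segment_size(const struct shmem_part *parts, size_t n)
{
    unsigned long size = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long end = parts[i].offset + parts[i].length;

        if (end > size)
            size = end;
    }
    return size;
}
```

For instance, three processes mapping one 4096-byte page each at offsets 0, 4096 and 8192 yield a segment size of 12288 bytes.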

Restore

During the restore, CRIU already knows which mappings are shared, so they need to be restored as such. Here is how it is done.

Among all the processes sharing a mapping, the one with the lowest PID (see postulates) is assigned to be the mapping creator. The creator's task is to obtain a mapping file descriptor, restore the mapping data, and signal all the other processes that it is ready. In the meantime, all the other processes wait.

First, the creator needs to obtain a file descriptor for the mapping. Two different approaches are used for this, depending on what the kernel provides.

If the memfd_create() syscall is available (Linux kernel v3.17+), it is used to obtain a file descriptor. Next, ftruncate() is called to set the proper size of the mapping.
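A minimal C sketch of this step is shown below. The function name shmem_fd_create() and the memfd name "shmem-sketch" are made up for the example, and the raw syscall() form is used here only so that the sketch does not depend on the libc providing a memfd_create() wrapper; it is not meant to mirror CRIU's implementation.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an anonymous memory-backed file of the given size and return
 * its file descriptor, or -1 on error. */
int shmem_fd_create(off_t size)
{
    /* The name is only a debugging label (hypothetical here). */
    int fd = (int)syscall(SYS_memfd_create, "shmem-sketch", 0U);

    if (fd < 0)
        return -1; /* e.g. ENOSYS on kernels older than v3.17 */

    /* A fresh memfd has size 0; grow it to the segment size. */
    if (ftruncate(fd, size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The returned descriptor can then be passed to mmap() with MAP_SHARED to obtain the actual mapping.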

If memfd_create() is not available, an alternative approach is used. First, mmap() is called to create a mapping. Next, the corresponding file in /proc/self/map_files/ is opened to get a file descriptor for the mapping. The limitation of this method is that, due to security concerns, /proc/$PID/map_files/ is not available to processes that live inside a user namespace, so it cannot be used if there are any user namespaces in the dump.

Once the creator has the file descriptor, it mmap()s it and restores its contents from the dump (using memcpy()). The creator then unmaps the mapping (note that the file descriptor is still available). Next, it calls futex(FUTEX_WAKE) to signal all the waiting processes that the mapping file descriptor is ready.
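As a rough C illustration of the map, copy and unmap sequence, consider the sketch below. The function name shmem_fd_fill() is hypothetical, the descriptor is assumed to have been created and sized beforehand, and the data is assumed to be already in memory rather than read from image files.

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Copy len bytes of saved contents into the file behind fd through a
 * temporary shared mapping. Returns 0 on success, -1 on error. */
int shmem_fd_fill(int fd, const void *data, size_t len)
{
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (addr == MAP_FAILED)
        return -1;

    memcpy(addr, data, len);

    /* Dropping the mapping does not discard the data: it stays in the
     * file, which remains reachable through the still-open descriptor. */
    return munmap(addr, len);
}
```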

All the other processes that need this mapping wait on futex(FUTEX_WAIT). Once the wait is over, they open the creator's /proc/$CREATOR_PID/fd/$FD file to get the mapping file descriptor.

Finally, all the processes (including the creator itself) call mmap() to create the needed mapping (note that mmap() arguments such as length, offset and flags may differ between processes), and close() the mapping file descriptor as it is no longer needed.
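The final step might look like the following C sketch. The function name shmem_map_and_close() is hypothetical; it assumes each process holds its own descriptor for the segment and passes its own offset, length and protection.

```c
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map one part of the shared segment behind fd, then close fd.
 * Returns the mapping address, or NULL on error. */
void *shmem_map_and_close(int fd, off_t offset, size_t len, int prot)
{
    void *addr = mmap(NULL, len, prot, MAP_SHARED, fd, offset);

    /* The mapping keeps the file alive, so the descriptor can go. */
    close(fd);
    return addr == MAP_FAILED ? NULL : addr;
}
```

Two callers mapping overlapping parts of the same segment see each other's writes, even though their offsets and lengths differ and both descriptors are already closed.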

Changes tracking

For iterative migration it is very useful to track changes in memory. Until CRIU v2.5, changes were tracked for anonymous memory only; now shared memory can be tracked as well. To achieve this, CRIU scans the pagemap of every shmem segment owner (as it does for anonymous memory) and then ANDs the collected soft-dirty bits.
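For reference, a soft-dirty bit can be read from a pagemap entry as in the C sketch below. The helper names are hypothetical and the sketch reads only the calling process's own pagemap; bit 55 is the soft-dirty flag in the 64-bit pagemap entry format documented by the kernel. How the bits of several owners are combined is not shown here.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

#define PME_SOFT_DIRTY (1ULL << 55) /* page written to since the last clear */

bool pme_soft_dirty(uint64_t pme)
{
    return (pme & PME_SOFT_DIRTY) != 0;
}

/* Read the pagemap entry that covers addr in the calling process.
 * Returns 0 on success, -1 on error. */
int pagemap_read_entry(const void *addr, uint64_t *pme)
{
    long page = sysconf(_SC_PAGESIZE);
    /* One 64-bit entry per virtual page, indexed by page number. */
    off_t off = (off_t)((uintptr_t)addr / (uintptr_t)page) * (off_t)sizeof(*pme);
    ssize_t n;
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0)
        return -1;
    n = pread(fd, pme, sizeof(*pme), off);
    close(fd);
    return n == (ssize_t)sizeof(*pme) ? 0 : -1;
}
```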

Change tracking also led the developers to implement memory images deduplication for shmem segments.

Dumping present pages

When dumping the contents of shared memory, CRIU does not dump the whole segment. Instead, it determines which pages actually contain data and dumps only those pages. This is done similarly to how regular memory dumping and restoring works, i.e. by looking for the PRESENT or SWAPPED bits in the owners' pagemap entries.
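The bit test itself is simple and can be sketched in C as below. The names are hypothetical; the bit positions (63 for present, 62 for swapped) come from the kernel's documented pagemap entry format, and the entries are assumed to have been read from a pagemap file already.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PME_PRESENT (1ULL << 63) /* page is in RAM */
#define PME_SWAPPED (1ULL << 62) /* page is in swap */

/* A page holds data worth dumping if it is either present or swapped. */
bool pme_has_data(uint64_t pme)
{
    return (pme & (PME_PRESENT | PME_SWAPPED)) != 0;
}

/* Count how many of the n pagemap entries describe pages with data. */
size_t count_pages_to_dump(const uint64_t *pmes, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (pme_has_data(pmes[i]))
            count++;
    return count;
}
```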

There is one particular feature of shared memory dumps worth mentioning. Sometimes a shared memory page exists in the kernel but is not mapped into any process. CRIU detects such pages by calling mincore() on the shmem segment, which reports the in-memory status of each page. The mincore bitmap is then ANDed with the per-process ones.

See also

  • Memory dumping and restoring
  • Memory images deduplication