Line 1: |
Line 1: |
− | Collected on this page is the design notes about supporting userfaultfd in CRIU
| + | This article describes usage of userfaultfd for lazy restore and lazy migration in CRIU. |
| + | |
| + | == Background == |
| + | The userfaultfd mechanism is designed to allow user-space paging. Its initial implementation, merged in Linux 4.3, was designed for the KVM/QEMU use case and lacked some functionality necessary for CRIU. In Linux 4.11, userfaultfd was extended with a so-called "non-cooperative" mode that allows, at least in theory, lazy (or post-copy) restore in CRIU. |
| | | |
| == Concepts == | | == Concepts == |
| | | |
− | * The <code>restore</code> action should accept yet another API switch: option <code>--lazy-pages</code> | + | * The <code>restore</code> action accepts yet another API switch: option <code>--lazy-pages</code>. In this mode, <code>restore</code> does not inject lazy pages into the processes' address space; instead, it registers the lazy memory areas with userfaultfd. |
− | * Remote pages drain would require synchronization between criu dump and criu restore. This is the [[P.Haul]] responsibility, so it looks like the best solution here would be to make it work via [[page server]] and setup connection with [[RPC]]'s <code>ps.port</code> option. | + | * The lazy pages are handled entirely by a dedicated <code>lazy-pages</code> daemon. The daemon receives userfault file descriptors from <code>restore</code> via a UNIX socket. The userfault file descriptors allow the daemon to receive page-fault and other events and to resolve them. |
| + | * For the migration case, the <code>dump</code> action also accepts an API switch: option <code>--lazy-pages</code>. When this option is used, <code>dump</code> keeps the memory pages and allows the <code>lazy-pages</code> daemon to request these pages over a TCP connection. |
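The hand-off of a file descriptor over a UNIX socket, as described in the bullets above, can be sketched as follows. This is only an illustration of the general SCM_RIGHTS mechanism, not CRIU's code: a pipe stands in for the userfault file descriptor, and the names <code>send_fd</code>, <code>recv_fd</code> and <code>pass_fd_demo</code> are hypothetical. It assumes Python 3.9+ on a Unix system.

```python
import os
import socket


def send_fd(sock, fd):
    # At least one byte of regular data has to accompany the ancillary
    # (SCM_RIGHTS) message that carries the descriptor.
    socket.send_fds(sock, [b"F"], [fd])


def recv_fd(sock):
    # recv_fds returns (data, fds, msg_flags, address); we expect one fd.
    _data, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    return fds[0]


def pass_fd_demo():
    # Hypothetical stand-ins: "restorer" plays the restore side and
    # "daemon" plays the lazy-pages daemon side of the UNIX socket.
    restorer, daemon = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()  # stand-in for a userfault fd

    send_fd(restorer, write_end)
    os.close(write_end)  # the sender closes its copy after the hand-off

    received = recv_fd(daemon)
    os.write(received, b"ok")  # the received fd still refers to the pipe
    data = os.read(read_end, 2)

    for fd in (received, read_end):
        os.close(fd)
    restorer.close()
    daemon.close()
    return data
```

The point of the sketch is that the descriptor stays usable on the receiving side even after the sender has closed its own copy, which is what lets the restored task drop the fd before resuming.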
| | | |
| [[File:Criu-memory-wflow.png]] | | [[File:Criu-memory-wflow.png]] |
Line 10: |
Line 14: |
| === Daemon === | | === Daemon === |
| | | |
− | Tasks after restore should have lazy VMAs being backed by userfaultfd, the fd itself should be sent before resume to CRIU (daemon?) and closed. This is CRIU who will monitor the UFFD events and repopulate the tasks address space. It should be able to get pages from both -- remote and local images. | + | After restore, tasks have their lazy VMAs registered with userfaultfd; the fd itself is sent to the <code>lazy-pages</code> daemon before resume and then closed. The daemon monitors the UFFD events and repopulates the tasks' address space. The <code>lazy-pages</code> daemon can get pages either from images (both local and remote) or directly from the remote side <code>dump</code>. |
| + | |
| + | When the restored task accesses a missing memory page, it causes a page fault. The <code>lazy-pages</code> daemon receives the page fault notification and resolves it by populating the faulting task's memory. If there were no page faults for some time, the daemon copies the task's remaining memory pages in the background. |
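The policy described above (faults are served first, remaining pages are copied in the background when the daemon is idle) can be modelled with a small simulation. This is a sketch under simplifying assumptions, not the daemon's implementation: "no page faults for some time" is reduced to "no fault currently pending", no real userfaultfd is involved, and the class name <code>LazyPagesSim</code> is hypothetical.

```python
from collections import deque

PAGE_SIZE = 4096  # assumed page size for the simulation


class LazyPagesSim:
    """Toy model of the fault-first, background-copy-when-idle policy."""

    def __init__(self, lazy_pages):
        self.remaining = set(lazy_pages)  # page addresses not yet populated
        self.faults = deque()             # pending page-fault addresses
        self.log = []                     # (reason, page) in service order

    def fault(self, addr):
        # Round the faulting address down to its page boundary.
        self.faults.append(addr & ~(PAGE_SIZE - 1))

    def step(self):
        if self.faults:
            # A pending fault always takes priority.
            page = self.faults.popleft()
            if page in self.remaining:
                self.remaining.discard(page)
                self.log.append(("fault", page))
            return True
        if self.remaining:
            # Idle: push one of the remaining pages in the background.
            page = min(self.remaining)
            self.remaining.discard(page)
            self.log.append(("background", page))
            return True
        return False  # everything has been populated

    def run(self):
        while self.step():
            pass
        return self.log
```

For example, with lazy pages at <code>0x1000</code>, <code>0x2000</code> and <code>0x3000</code> and a fault at <code>0x2abc</code>, the model serves page <code>0x2000</code> first and then copies the other two in the background.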
| + | |
| + | ==== Local images ==== |
| | | |
− | === Local images ===
| + | The daemon uses the local page-read engine to read pages from images. |
| | | |
− | The daemon should just use local page-read engine and read pages from images.
| + | ==== Remote images ==== |
| | | |
− | === Remote images ===
| + | * The [[page-server]] is run on the remote side with the <code>--lazy-pages</code> option. |
| + | * The lazy-pages daemon connects to the remote [[page server]] with the <code>--page-server</code> option. The <code>--address</code> and <code>--port</code> options set the IP address and port of the listening [[page server]]. |
| + | * The current protocol allows the lazy-pages daemon to request several contiguous pages. |
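Since several contiguous pages can be requested together, a client could group the page addresses it needs into runs before asking for them. The following is a minimal sketch of that grouping only. The function name <code>coalesce</code>, the <code>(start, nr_pages)</code> tuple shape and the 4096-byte page size are assumptions made for illustration and do not describe the actual wire format.

```python
def coalesce(pages, page_size=4096):
    """Group page addresses into (start_address, nr_pages) runs."""
    ranges = []
    for addr in sorted(set(pages)):
        if ranges and ranges[-1][0] + ranges[-1][1] * page_size == addr:
            # The page directly follows the previous run: extend it.
            start, count = ranges[-1]
            ranges[-1] = (start, count + 1)
        else:
            # A gap: start a new run of one page.
            ranges.append((addr, 1))
    return ranges
```

With this grouping, pages <code>0x1000</code>, <code>0x2000</code> and <code>0x3000</code> become a single three-page run, while an isolated page at <code>0x8000</code> stays a separate one-page run.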
| | | |
− | * The page-read engine should be patched to learn how to talk to the remote host ([[page server]] with --page-server option?) on the other end. | + | ==== Migration ==== |
− | * The source node should get pages from tasks dumped and send them out on the destination node. | + | * The <code>dump</code> collects the pages into pipes and starts the [[page-server]] in a mode that allows the <code>lazy-pages</code> daemon to connect to it and request the memory pages. |
− | * Protocol should include out-of-order pages and background pages pushing (sending them before demand from the process).
| + | * When the restored task accesses a missing memory page, the <code>lazy-pages</code> daemon requests the page from the [[page-server]] running on the dump side. |
| + | * After the page is received, the <code>lazy-pages</code> daemon injects it into the task's address space using userfaultfd. |
| | | |
− | == Known issues == | + | == Limitations == |
| | | |
− | * Only MAP_PRIVATE | MAP_ANONYMOUS will be supported in the 1st version due to kernel constraints. | + | * Currently only MAP_PRIVATE | MAP_ANONYMOUS is supported. Newer kernels (4.11+) allow userfaultfd for hugetlbfs and shared memory, but this is yet to be implemented in CRIU. |
| * Userfault is known not to map one page into two places. Thus -- COW-ed pages will get COW-ed. | | * Userfault is known not to map one page into two places. Thus -- COW-ed pages will get COW-ed. |
− | * Andrea (author) states that UFFDIO_REMAP might be slow as compared to UFFDIO_COPY. Probably it makes sense to copy data into tasks, not move. | + | * The [[Lazy migration]] use case might be racy because there is no means to synchronize between pending forks, remote page transfers and page faults. |
− | * Unmaps and mremaps can screw things up. Either we have to make uffd-s per VMA or add events about such things.
| |
− | * Forks are even worse -- kid will just populate its memory with zero pages :(
| |
| | | |
| == See also == | | == See also == |
Line 38: |
Line 47: |
| [[Category:Plans]] | | [[Category:Plans]] |
| [[Category:Development]] | | [[Category:Development]] |
| + | [[Category:Under the hood]] |