Memory pre dump

1,025 bytes added, 26 March
Currently, when pre-dumping memory from tasks, CRIU drains all the memory into pipes and then pushes it into images. This takes a large number of pipes and pins the whole memory in RAM. The proposal is to fix this by using the [http://man7.org/linux/man-pages/man2/process_vm_readv.2.html process_vm_readv()] system call, referred to below as <code>sys_read_process_vm()</code>.
If you look at the <code>cr_pre_dump_tasks()</code> function you will see that it
# calls <code>cr_pre_dump_one_task()</code> in a loop for each of the collected processes
# calls <code>cr_pre_dump_finish()</code> that unfreezes the tasks by calling <code>pstree_switch_state()</code> and then calls <code>page_xfer_dump_pages()</code> in a loop
This is where the problem sits. The first function (<code>cr_pre_dump_one_task()</code>) infects the task with the parasite and drains all the memory from it. Setting <code>mem_dump_ctl.pre_dump</code> to true makes the memory dumping code save all the pages in page-pipes associated with the task descriptor (the <code>pstree_item</code> object). After that, <code>cr_pre_dump_finish()</code> walks the items and flushes the memory into images.
What we need to do is to tune both calls.
First, <code>cr_pre_dump_one_task()</code> shouldn't infect victims and drain pages from them. Instead, it should just stop the tasks and get the list of VMAs from each of them. After that, <code>cr_pre_dump_finish()</code> should unfreeze the tasks (as it does now), then walk the collected VMAs and call <code>sys_read_process_vm()</code> for each entry, copying the page contents into the images. The data may go to the page-server instead, but the page-xfer code already handles that.
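The VMA walk described above could look roughly like the following. This is only a sketch under assumptions: <code>struct vma_range</code>, <code>write_all()</code> and <code>dump_vmas()</code> are hypothetical names that do not exist in CRIU, the image is a plain file descriptor rather than a page-xfer object, and any failed read simply aborts the dump. The only real API used for the copy is <code>process_vm_readv()</code>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical VMA descriptor: the half-open range [start, end). */
struct vma_range {
	uintptr_t start;
	uintptr_t end;
};

/* Write the whole buffer to fd, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len) {
		ssize_t w = write(fd, buf, len);

		if (w < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += w;
		len -= (size_t)w;
	}
	return 0;
}

/*
 * Copy every VMA of task pid into img_fd through a bounce buffer.
 * Returns 0 on success, -1 (with errno set) on the first failure.
 */
static int dump_vmas(pid_t pid, const struct vma_range *vmas, size_t nr, int img_fd)
{
	static char chunk[64 * 1024];

	for (size_t i = 0; i < nr; i++) {
		uintptr_t addr = vmas[i].start;

		assert(vmas[i].start <= vmas[i].end);
		while (addr < vmas[i].end) {
			size_t want = vmas[i].end - addr;

			if (want > sizeof(chunk))
				want = sizeof(chunk);

			struct iovec local = { .iov_base = chunk, .iov_len = want };
			struct iovec remote = { .iov_base = (void *)addr, .iov_len = want };
			ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);

			if (n <= 0)
				return -1;
			if (write_all(img_fd, chunk, (size_t)n))
				return -1;
			addr += (size_t)n; /* a short read just continues from here */
		}
	}
	return 0;
}
```

The bounce buffer keeps the amount of pinned memory bounded, which is the point of the proposal, at the price of one extra copy per chunk.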
What's tricky in the latter part is that <code>sys_read_process_vm()</code> will race with tasks changing their VMAs, since they are unfrozen. So you'll need to carefully handle potential partial reads or even errors and adjust the images accordingly.
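One possible way to cope with such races is sketched below. It assumes that a page which disappeared can be recorded as a hole: when a read makes no progress, the sketch retries with just the rest of the current page, and if that fails with <code>EFAULT</code> it zero-fills that page and moves on. The names <code>read_chunk()</code>, <code>read_range()</code> and <code>partial_read_demo()</code> are hypothetical, and how holes should be reflected in the real images is left open.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Thin wrapper: one local and one remote iovec of the same length. */
static ssize_t read_chunk(pid_t pid, uintptr_t addr, char *buf, size_t len)
{
	struct iovec local = { .iov_base = buf, .iov_len = len };
	struct iovec remote = { .iov_base = (void *)addr, .iov_len = len };

	return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}

/*
 * Read [addr, addr + len) from pid into buf, tolerating vanished pages.
 * Unreadable pages are zero-filled and counted in *skipped.
 * Returns the number of bytes really copied, or -1 on other errors.
 */
static ssize_t read_range(pid_t pid, uintptr_t addr, char *buf, size_t len,
			  size_t page, size_t *skipped)
{
	size_t done = 0;

	assert(page != 0 && skipped != NULL);
	*skipped = 0;
	while (done < len) {
		size_t want = len - done;
		ssize_t n = read_chunk(pid, addr + done, buf + done, want);

		if (n <= 0) {
			/* No progress: retry with the rest of this page only. */
			size_t in_page = page - (addr + done) % page;

			if (in_page > want)
				in_page = want;
			n = read_chunk(pid, addr + done, buf + done, in_page);
			if (n < 0 && errno != EFAULT)
				return -1;
			if (n <= 0) {
				/* The page is gone: record a zero-filled hole. */
				memset(buf + done, 0, in_page);
				*skipped += in_page;
				n = (ssize_t)in_page;
			}
		}
		done += (size_t)n;
	}
	return (ssize_t)(done - *skipped);
}

/*
 * Self-check on the caller's own memory: three pages with the middle one
 * unmapped. Returns 0 if the hole was handled as described, 1 if the
 * system call is unavailable here, -1 on anything unexpected.
 */
static int partial_read_demo(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t skipped = 0;
	char *dst = malloc(3 * page);
	char *src = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int ret = -1;

	if (!dst || src == MAP_FAILED)
		return -1;
	memset(src, 'A', 3 * page);
	munmap(src + page, page); /* simulate the task dropping a page */

	ssize_t n = read_range(getpid(), (uintptr_t)src, dst, 3 * page, page, &skipped);

	if (n < 0)
		ret = (errno == ENOSYS || errno == EPERM) ? 1 : -1;
	else if (n == (ssize_t)(2 * page) && skipped == page &&
		 dst[0] == 'A' && dst[page] == 0 && dst[2 * page] == 'A')
		ret = 0;

	munmap(src, page);
	munmap(src + 2 * page, page);
	free(dst);
	return ret;
}
```

Retrying with a single page before declaring a hole avoids skipping a good page when the failure was caused by a later page in the same request.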
Another headache is read-protected memory areas. You will need to check whether <code>sys_read_process_vm()</code> still lets the root user (CRIU runs as root) read those, and, if not, handle them too by skipping them. The current pre-dump read-unprotects the pages, grabs them, and then read-protects them back, but that is not an option for the new algorithm.
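If read-protected areas do turn out to be unreadable, one way to skip them is to look at the permission field of each <code>/proc/&lt;pid&gt;/maps</code> line before issuing any read. The sketch below assumes that approach; <code>struct maps_entry</code> and <code>parse_maps_line()</code> are hypothetical names, and CRIU's own VMA collection code may already carry the same information in another form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical result of parsing one /proc/<pid>/maps line. */
struct maps_entry {
	unsigned long start;
	unsigned long end;
	bool readable; /* first permission character is 'r' */
};

/*
 * Parse a line such as
 *   "00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/foo"
 * Returns 0 on success, -1 if the line does not look like a maps entry.
 */
static int parse_maps_line(const char *line, struct maps_entry *e)
{
	char perms[5];

	assert(line != NULL && e != NULL);
	if (sscanf(line, "%lx-%lx %4s", &e->start, &e->end, perms) != 3)
		return -1;
	e->readable = (perms[0] == 'r');
	return 0;
}
```

An entry with <code>readable</code> unset would then be left out of the list handed to the reading loop, so no system call is wasted on it.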
== Optimizations ==
 
The mentioned system call copies the memory, so we can do better. If we had a <code>sys_splice_process_vm</code> system call that behaved like the regular <code>splice</code> one but worked on another task's memory, we could replace memory copying with memory vm-splicing and then splice the data further into the image or page server socket file descriptor.
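Since <code>sys_splice_process_vm</code> does not exist, the data path it would enable can only be illustrated on the caller's own memory. The sketch below uses the existing <code>vmsplice()</code> and <code>splice()</code> calls to move pages into a pipe and from there into an output descriptor; <code>splice_own_memory()</code> is a hypothetical helper, and the proposed call would stand in for the <code>vmsplice()</code> step with another task's memory as the source.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Move len bytes starting at buf (the caller's own memory) to out_fd
 * via a pipe. Returns 0 on success, -1 on failure.
 */
static int splice_own_memory(void *buf, size_t len, int out_fd)
{
	char *pos = buf;
	int ret = -1;
	int p[2];

	assert(buf != NULL);
	if (pipe(p) < 0)
		return -1;

	while (len) {
		struct iovec iov = { .iov_base = pos, .iov_len = len };
		/* Attach our pages to the pipe; may take less than len. */
		ssize_t in = vmsplice(p[1], &iov, 1, 0);

		if (in <= 0)
			goto out;
		pos += in;
		len -= (size_t)in;

		/* Drain what was just queued into the output descriptor. */
		while (in) {
			ssize_t out = splice(p[0], NULL, out_fd, NULL, (size_t)in, 0);

			if (out <= 0)
				goto out;
			in -= out;
		}
	}
	ret = 0;
out:
	close(p[0]);
	close(p[1]);
	return ret;
}
```

Note that <code>vmsplice()</code> references the pages rather than copying them, so the source memory must not change until the data has been consumed from the pipe.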
 
== More info ==
* [https://github.com/avagin/criu/tree/process_vm_readv Andrey's attempt]
* [https://lore.kernel.org/patchwork/patch/871120/ The vmsplice proposal]
* [https://github.com/avagin/criu/commit/c346530b90a92fe5342fb814b88b735c211361fd Handling read-protected areas]
* [https://gist.github.com/rst0git/f5982eaddcf471068249ee8216920f86 Performance of splice]
[[Category:Memory]]