Changes

1,025 bytes added ,  09:54, 26 March 2019
no edit summary
Line 1: Line 1: −
Currently when pre-dumping the memory from tasks CRIU drains all the memory into pipes and then pushes it into images. This takes TONS of pipes and pins the whole memory into RAM. The proposal is to fix it by using sys_read_process_vm() system call.
+
Currently, when pre-dumping memory from tasks, CRIU drains all the memory into pipes and then pushes it into images. This takes a large number of pipes and pins the whole memory in RAM. The proposal is to fix this by using the [http://man7.org/linux/man-pages/man2/process_vm_readv.2.html process_vm_readv()] system call, referred to below as <code>sys_read_process_vm()</code>.
   −
If you look at the cr_pre_dump_tasks() function you will see that it
+
If you look at the <code>cr_pre_dump_tasks()</code> function you will see that it
   −
# calls cr_pre_dump_one_task() in a loop for each of the collected processes
+
# calls <code>cr_pre_dump_one_task()</code> in a loop for each of the collected processes
# calls cr_pre_dump_finish() that unfreezes the tasks by calling pstree_switch_state() and then calls page_xfer_dump_pages() in a loop
+
# calls <code>cr_pre_dump_finish()</code> that unfreezes the tasks by calling <code>pstree_switch_state()</code> and then calls <code>page_xfer_dump_pages()</code> in a loop
   −
This is where the problem sits. The first function (pre_dump_one_task) infects the task with parasite and drains all the memory from it. The mem_dump_ctl.pre_dump being set to true makes the memory dumping code save all the pages in page-pipes associated with the pstree_item. After that the pre_dump_finish() walks the items and flushes the memory into images.
+
This is where the problem sits. The first function, <code>cr_pre_dump_one_task()</code>, infects the task with the parasite and drains all the memory from it. With <code>mem_dump_ctl.pre_dump</code> set to true, the memory dumping code saves all the pages in page-pipes associated with the task descriptor (the <code>pstree_item</code> object). After that, <code>cr_pre_dump_finish()</code> walks the items and flushes the memory into images.
    
What we need to do is to tune both calls.
 
   −
First, the pre_dump_one_task shouldn't infect victims and drain pages from them. Instead, it should just stop the tasks and get the list of vma-s from it. After it the pre_dump_finish should unfreeze the tasks (as it does not) and then walk the vmas provided and call the sys_read_process_vm() for each entry met copying the pages contents into the images. Or to the page-server, but the page-xfer stuff handles it.
+
First, <code>cr_pre_dump_one_task()</code> shouldn't infect victims and drain pages from them. Instead, it should just stop the tasks and get the list of VMAs from each of them. After that, <code>cr_pre_dump_finish()</code> should unfreeze the tasks (as it does now), then walk the collected VMAs and call <code>sys_read_process_vm()</code> for each entry, copying the page contents into the images (or to the page-server, but the page-xfer code handles that).
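The copy step described above can be sketched roughly as follows. This is a minimal illustration, not CRIU code: <code>dump_range()</code> and the <code>CHUNK</code> bounce-buffer size are hypothetical names, the image is modelled as a plain file descriptor, and the real syscall used is <code>process_vm_readv(2)</code> (called <code>sys_read_process_vm()</code> in the text). It assumes the whole range is readable and gives up on the first error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define CHUNK (64 * 1024) /* bounce-buffer size; an arbitrary choice */

/*
 * Hypothetical helper, not a CRIU function: copy the bytes
 * [start, start + len) of task `pid` into `out_fd`.
 * Returns the number of bytes copied, or -1 on the first error.
 */
ssize_t dump_range(pid_t pid, unsigned long start, size_t len, int out_fd)
{
	static char buf[CHUNK];
	size_t off = 0;

	while (off < len) {
		size_t want = len - off < CHUNK ? len - off : CHUNK;
		struct iovec local = { .iov_base = buf, .iov_len = want };
		struct iovec remote = { .iov_base = (void *)(start + off), .iov_len = want };

		/* Copy from the other task's address space into our buffer. */
		ssize_t got = process_vm_readv(pid, &local, 1, &remote, 1, 0);
		if (got <= 0)
			return -1;

		/* Push what was read into the "image", handling short writes. */
		for (ssize_t done = 0; done < got;) {
			ssize_t w = write(out_fd, buf + done, (size_t)(got - done));
			if (w <= 0)
				return -1;
			done += w;
		}
		off += (size_t)got;
	}
	return (ssize_t)off;
}
```

A caller would invoke <code>dump_range()</code> once per VMA after the tasks have been unfrozen; the target no longer needs to be infected with the parasite for this step.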
   −
What's tricky in the latter part is that the sys_read_process_vm  will race with tasks fixing their vmas, since they are unfrozen. So you'll need to carefully handle the potential partial reads or even errors and tune-up the images respectively.
+
What's tricky in the latter part is that <code>sys_read_process_vm</code> will race with the tasks modifying their VMAs, since they are unfrozen. So you'll need to carefully handle potential partial reads or even errors, and adjust the images accordingly.
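One way to cope with partial reads is sketched below, assuming page-aligned ranges. <code>read_pages_tolerant()</code> and the <code>present</code> array are hypothetical, not part of CRIU. The helper first asks <code>process_vm_readv(2)</code> for the whole remaining range; if the call comes back short or fails with <code>EFAULT</code>, it probes the next page on its own and skips it only if that single-page read also fails. The caller can then use <code>present</code> to decide which pages to record in the image.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: read `npages` pages starting at the page-aligned
 * address `start` of task `pid` into `buf`. present[i] is set to 1 for
 * every page that was copied and 0 for every page that could not be read
 * (for example because the task unmapped it in the meantime).
 * Returns the number of pages copied, or -1 on an error other than EFAULT.
 */
long read_pages_tolerant(pid_t pid, unsigned long start, size_t npages,
			 char *buf, unsigned char *present)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t i = 0;
	long copied = 0;
	int probe = 0; /* 1: read a single page to find out whether it is readable */

	assert(start % page == 0);
	memset(present, 0, npages);

	while (i < npages) {
		size_t n = probe ? 1 : npages - i;
		struct iovec local = { .iov_base = buf + i * page, .iov_len = n * page };
		struct iovec remote = { .iov_base = (void *)(start + i * page), .iov_len = n * page };

		ssize_t got = process_vm_readv(pid, &local, 1, &remote, 1, 0);
		if (got < 0 && errno != EFAULT)
			return -1; /* e.g. ESRCH or EPERM: give up */

		/* Only count pages that were transferred completely. */
		size_t full = got > 0 ? (size_t)got / page : 0;
		memset(present + i, 1, full);
		copied += (long)full;
		i += full;

		if (full == n) {
			probe = 0;	/* everything requested arrived */
		} else if (full == 0 && n == 1) {
			i++;		/* this very page is unreadable: skip it */
			probe = 0;
		} else {
			probe = 1;	/* short read: check the next page alone */
		}
	}
	return copied;
}
```

Pages left with <code>present[i] == 0</code> are simply absent from this pre-dump iteration; whether they need to be picked up later is a policy decision outside this sketch.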
   −
Another headache is about read-protected memory areas. You will need to check whether the sys_read_process_vm allows root user (criu runs as root) still read those, and, if no, handle them too by skipping them. Now the pre-dump read-unprotects the pages, then grabs them, then read-protects back, but that's not the option for the new algo.
+
Another headache is read-protected memory areas. You will need to check whether <code>sys_read_process_vm</code> still lets the root user (CRIU runs as root) read them and, if not, handle them too, by skipping them. Currently the pre-dump read-unprotects the pages, grabs them, and then read-protects them back, but that's not an option for the new algorithm.
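If such areas do have to be skipped, the decision can be made from the permission field of each <code>/proc/&lt;pid&gt;/maps</code> line. The sketch below is illustrative only: <code>struct vma_range</code>, <code>parse_maps_line()</code> and <code>vma_is_readable()</code> are hypothetical names, and CRIU has its own VMA parsing code that a real patch would use instead.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, simplified VMA descriptor (not CRIU's own structure). */
struct vma_range {
	unsigned long start;
	unsigned long end;
	char perms[5]; /* e.g. "rw-p", NUL-terminated */
};

/*
 * Parse the leading "start-end perms" fields of one /proc/<pid>/maps line.
 * Returns 0 on success, -1 if the line does not have that shape.
 */
int parse_maps_line(const char *line, struct vma_range *v)
{
	if (sscanf(line, "%lx-%lx %4s", &v->start, &v->end, v->perms) != 3)
		return -1;
	return 0;
}

/* A VMA is read-protected when the first permission character is not 'r'. */
int vma_is_readable(const struct vma_range *v)
{
	return v->perms[0] == 'r';
}
```

For example, a line beginning with <code>00400000-00452000 r-xp</code> would be dumped, while one with <code>---p</code> permissions would be skipped.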
    +
== Optimizations ==
 +
 +
The system call mentioned above copies the memory, so we can do better. If we had a <code>sys_splice_process_vm</code> system call that behaved like the regular <code>splice</code> but worked on another task's memory, we could replace memory copying with vm-splicing the memory and then splicing it into the image or the page-server socket file descriptor.
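Since <code>sys_splice_process_vm</code> is only a proposal, the sketch below uses the existing <code>vmsplice(2)</code> and <code>splice(2)</code> calls on the caller's own memory to show the data path the hypothetical syscall would enable for another task. <code>splice_buffer_to_fd()</code> is an illustrative name, and the output descriptor stands in for an image file or page-server socket.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: send `len` bytes at `buf` to `out_fd` through a pipe
 * without an intermediate user-space bounce buffer. The buffer must not be
 * modified until the call returns, because the pipe references its pages.
 * Returns the number of bytes sent, or -1 on error.
 */
long splice_buffer_to_fd(const void *buf, size_t len, int out_fd)
{
	int p[2];
	size_t off = 0;

	if (pipe(p) < 0)
		return -1;

	while (off < len) {
		struct iovec iov = { .iov_base = (char *)buf + off, .iov_len = len - off };

		/* Map as many pages as fit into the (currently empty) pipe. */
		ssize_t in = vmsplice(p[1], &iov, 1, 0);
		if (in <= 0)
			goto err;
		off += (size_t)in;

		/* Drain the pipe into the destination before refilling it. */
		while (in > 0) {
			ssize_t out = splice(p[0], NULL, out_fd, NULL, (size_t)in, 0);
			if (out <= 0)
				goto err;
			in -= out;
		}
	}
	close(p[0]);
	close(p[1]);
	return (long)off;
err:
	close(p[0]);
	close(p[1]);
	return -1;
}
```

With the proposed syscall, the <code>vmsplice()</code> step would take pages from the dumped task instead of the caller, and the <code>splice()</code> step would stay the same.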
 +
 +
== More info ==
 +
* [https://github.com/avagin/criu/tree/process_vm_readv Andrey's attempt]
 +
* [https://lore.kernel.org/patchwork/patch/871120/ The vmsplice proposal]
 +
* [https://github.com/avagin/criu/commit/c346530b90a92fe5342fb814b88b735c211361fd Handling read-protected areas]
 +
* [https://gist.github.com/rst0git/f5982eaddcf471068249ee8216920f86 Performance of splice]
 
[[Category:Memory]]
 