Memory pre dump

Revision as of 09:54, 26 March 2019 by Xemul

Currently, when pre-dumping memory, CRIU drains all of the tasks' memory into pipes and only then pushes it into images. This consumes a huge number of pipes and pins the whole memory in RAM. The proposal is to fix this by using the process_vm_readv() system call instead.

If you look at the cr_pre_dump_tasks() function you will see that it

  1. calls cr_pre_dump_one_task() in a loop for each of the collected processes
  2. calls cr_pre_dump_finish() that unfreezes the tasks by calling pstree_switch_state() and then calls page_xfer_dump_pages() in a loop

This is where the problem sits. The first function, cr_pre_dump_one_task(), infects the task with the parasite and drains all of its memory. With mem_dump_ctl.pre_dump set to true, the memory dumping code saves all the pages in page-pipes attached to the task descriptor (the pstree_item object). After that, cr_pre_dump_finish() walks the items and flushes the memory into images.

What we need to do is to tune both calls.

First, cr_pre_dump_one_task() shouldn't infect victims and drain pages from them. Instead, it should just stop the tasks and fetch the list of VMAs from each. After that, cr_pre_dump_finish() should unfreeze the tasks (as it does now) and then walk the collected VMAs, calling process_vm_readv() for each entry and copying the page contents into the images (or to the page server; the page-xfer layer handles that).

The tricky part of the latter step is that process_vm_readv() will race with the tasks modifying their VMAs, since they are already unfrozen. So you'll need to carefully handle potential partial reads, or even outright errors, and adjust the images accordingly.

Another headache is read-protected memory areas. You will need to check whether process_vm_readv() lets the root user (CRIU runs as root) read those anyway and, if not, handle them too, e.g. by skipping them. Currently the pre-dump read-unprotects the pages, grabs them, then read-protects them back, but that's not an option for the new algorithm.

Optimizations

The mentioned system call copies the memory, so we can do better. If we had a sys_splice_process_vm system call that behaved like the regular splice() one but operated on another task's memory, we could replace memory copying with vm-splicing and then splice the pages further into the image or page-server socket file descriptor.

More info