Memory pre dump
Currently, when pre-dumping the memory of tasks, CRIU drains all the memory into pipes and then pushes it into images. This takes TONS of pipes and pins all of that memory in RAM. The proposal is to fix this by using the sys_read_process_vm() system call (available on Linux as process_vm_readv()).
If you look at the cr_pre_dump_tasks() function you will see that it first calls cr_pre_dump_one_task() in a loop for each of the collected processes, and then calls cr_pre_dump_finish(), which unfreezes the tasks by calling pstree_switch_state() and then calls page_xfer_dump_pages() in a loop.
This is where the problem sits. The first function, cr_pre_dump_one_task(), infects the task with the parasite and drains all the memory from it. Setting mem_dump_ctl.pre_dump to true makes the memory dumping code save all the pages in page-pipes associated with the task descriptor (the pstree_item object). After that, cr_pre_dump_finish() walks the items and flushes the memory into images.
What we need to do is to tune both calls. cr_pre_dump_one_task() shouldn't infect victims and drain pages from them. Instead, it should just stop the tasks and get the list of VMAs from them. After that, cr_pre_dump_finish() should unfreeze the tasks (as it does now) and then walk the VMAs provided and call sys_read_process_vm() for each entry, copying the page contents into the images (or to the page server, but the page-xfer code handles that).
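
The copying step of the proposed algorithm can be sketched as follows. This is only a minimal illustration, assuming the Linux process_vm_readv(2) call stands in for sys_read_process_vm(); struct vma_range and dump_range() are hypothetical names, not CRIU code, and a plain file descriptor stands in for the image or page-xfer target.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical record of one area collected while the task was stopped. */
struct vma_range {
	unsigned long start;	/* first byte in the target's address space */
	size_t len;		/* length in bytes */
};

/*
 * Copy one area of task `pid` into `out_fd` through a bounce buffer.
 * Returns 0 on success, -1 if the read or the write fails.
 */
int dump_range(pid_t pid, const struct vma_range *r, int out_fd)
{
	char buf[16 * 4096];
	size_t done = 0;

	assert(r != NULL);

	while (done < r->len) {
		size_t chunk = r->len - done;
		if (chunk > sizeof(buf))
			chunk = sizeof(buf);

		struct iovec local = { .iov_base = buf, .iov_len = chunk };
		struct iovec remote = { .iov_base = (void *)(r->start + done),
					.iov_len = chunk };

		/* The task keeps running, so this may fail or come up short. */
		ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
		if (n <= 0)
			return -1;

		/* write() may be short as well, so loop until the chunk is out. */
		size_t off = 0;
		while (off < (size_t)n) {
			ssize_t w = write(out_fd, buf + off, (size_t)n - off);
			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += (size_t)w;
		}
		done += (size_t)n;
	}
	return 0;
}
```

A caller would invoke dump_range() once per collected area after the tasks have been unfrozen. This sketch treats any failed read as fatal; a real pre-dump would have to tolerate such failures.
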
What's tricky in the latter part is that sys_read_process_vm() will race with tasks changing their VMAs, since they are unfrozen. So you'll need to carefully handle potential partial reads or even errors and adjust the images accordingly.
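
One possible way to cope with such races is sketched below, again assuming process_vm_readv(2) as the read primitive. The function read_range_tolerant() and its `present` map are hypothetical. The idea is to try the whole range first and, if that does not fully succeed, retry page by page and remember which pages are missing, so that the image can be adjusted to match.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Read `len` bytes at `addr` of task `pid` into `dst`, tolerating holes.
 * present[i] becomes 1 if page i was read and 0 if it was unreadable.
 * Returns the number of pages read, or -1 on an error other than EFAULT.
 */
long read_range_tolerant(pid_t pid, unsigned long addr, size_t len,
			 char *dst, unsigned char *present)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t npages = len / page;
	long got = 0;

	assert(addr % page == 0 && len % page == 0);

	/* Fast path: the whole range in a single call. */
	struct iovec local = { dst, len };
	struct iovec remote = { (void *)addr, len };
	ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
	if (n == (ssize_t)len) {
		memset(present, 1, npages);
		return (long)npages;
	}
	if (n < 0 && errno != EFAULT)
		return -1;

	/* Slow path: part of the range went away, so retry page by page. */
	memset(present, 0, npages);
	for (size_t i = 0; i < npages; i++) {
		struct iovec l = { dst + i * page, page };
		struct iovec r = { (void *)(addr + i * page), page };

		n = process_vm_readv(pid, &l, 1, &r, 1, 0);
		if (n == (ssize_t)page) {
			present[i] = 1;
			got++;
		} else if (n < 0 && errno != EFAULT) {
			return -1;
		}
	}
	return got;
}

/* Usage sketch: read our own memory after unmapping the middle page. */
long demo_hole(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char present[3];
	char *dst = malloc(3 * page);
	char *src = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (!dst || src == MAP_FAILED)
		return -1;

	memset(src, 0xab, 3 * page);
	munmap(src + page, page);	/* stands in for the task changing its VMAs */

	long got = read_range_tolerant(getpid(), (unsigned long)src,
				       3 * page, dst, present);
	int saved = errno;
	munmap(src, page);
	munmap(src + 2 * page, page);
	free(dst);
	errno = saved;
	return got;
}
```

Falling back to single-page reads keeps the logic independent of how the kernel reports a partial transfer, at the cost of extra system calls on the slow path.
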
Another headache is read-protected memory areas. You will need to check whether sys_read_process_vm() still allows the root user (CRIU runs as root) to read those and, if not, handle them too by skipping them. Currently the pre-dump read-unprotects the pages, grabs them, then read-protects them back, but that's not an option for the new algorithm, since it no longer runs parasite code inside the tasks.
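
The check itself can be done with a small probe like the one below. It is a sketch under the assumption that process_vm_readv(2) is the call in question, and can_read_prot_none() is a hypothetical helper. It does not assume either outcome: it reads a PROT_NONE page of the calling process and reports what the kernel did.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Returns 1 if a PROT_NONE page could be read with process_vm_readv(),
 * 0 if the kernel refused with EFAULT, and -1 on any other failure.
 */
int can_read_prot_none(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	assert(psz > 0);
	size_t page = (size_t)psz;

	/* Page 0 is the destination buffer, page 1 the protected source. */
	char *m = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (m == MAP_FAILED)
		return -1;

	m[page] = 'x';	/* populate the source page before protecting it */
	if (mprotect(m + page, page, PROT_NONE) != 0) {
		int saved = errno;
		munmap(m, 2 * page);
		errno = saved;
		return -1;
	}

	struct iovec local = { m, page };
	struct iovec remote = { m + page, page };
	ssize_t n = process_vm_readv(getpid(), &local, 1, &remote, 1, 0);
	int saved = errno;

	int ret;
	if (n == (ssize_t)page)
		ret = 1;
	else if (n < 0 && saved == EFAULT)
		ret = 0;
	else
		ret = -1;

	munmap(m, 2 * page);
	errno = saved;	/* keep the probe's errno visible to the caller */
	return ret;
}
```

Run it as root to mirror CRIU's situation. If it returns 0, VMAs without read permission would have to be skipped, as described above.
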
The mentioned system call copies the memory, so we can do better. If we had a sys_splice_process_vm() system call that behaves like the regular splice() one but works on another task's memory, we could replace memory copying with vm-splicing and then splice the data further into the image or page server socket file descriptor.
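
No such system call exists today, so the sketch below only shows the local analogue of the idea using the existing vmsplice(2) and splice(2) calls. Pages of the calling process are attached to a pipe without a copy through a bounce buffer and then spliced into the destination descriptor. The helper splice_own_memory() is hypothetical. A sys_splice_process_vm() would, under this proposal, play the role of vmsplice() for another task's memory.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Move `len` bytes of our OWN memory at `addr` to `out_fd` through a pipe.
 * Returns 0 on success, -1 on failure.
 */
int splice_own_memory(void *addr, size_t len, int out_fd)
{
	int p[2];
	int ret = -1;
	size_t done = 0;

	assert(addr != NULL);
	if (pipe(p) != 0)
		return -1;

	while (done < len) {
		struct iovec iov = { (char *)addr + done, len - done };

		/* Attach user pages to the pipe; may take less than requested. */
		ssize_t in = vmsplice(p[1], &iov, 1, 0);
		if (in <= 0)
			goto out;

		/* Drain what was attached into the destination descriptor. */
		size_t moved = 0;
		while (moved < (size_t)in) {
			ssize_t out = splice(p[0], NULL, out_fd, NULL,
					     (size_t)in - moved, 0);
			if (out <= 0)
				goto out;
			moved += (size_t)out;
		}
		done += (size_t)in;
	}
	ret = 0;
out:
	close(p[0]);
	close(p[1]);
	return ret;
}
```

Because the pipe refers to the pages rather than holding a private copy, the data that reaches out_fd reflects the memory contents at the time it is consumed.
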