Currently, when pre-dumping tasks' memory, CRIU drains all of it into pipes and then pushes it into images. This takes a huge number of pipes and pins the whole memory in RAM. The proposal is to fix this by using the process_vm_readv() system call.
If you look at the cr_pre_dump_tasks() function, you will see that it:

- calls cr_pre_dump_one_task() in a loop for each of the collected processes
- calls cr_pre_dump_finish(), which unfreezes the tasks by calling pstree_switch_state() and then calls page_xfer_dump_pages() in a loop
This is where the problem sits. The first function (the cr_pre_dump_one_task() one) infects the task with the parasite and drains all the memory from it. The mem_dump_ctl.pre_dump flag being set to true makes the memory dumping code save all the pages in page-pipes associated with the task descriptor (the pstree_item object). After that, cr_pre_dump_finish() walks the items and flushes the memory into images.
What we need to do is to tune both calls.
First, cr_pre_dump_one_task() shouldn't infect victims and drain pages from them. Instead, it should just stop the tasks and get the list of VMAs from each. After that, cr_pre_dump_finish() should unfreeze the tasks (as it does now) and then walk the collected VMAs, calling process_vm_readv() for each entry and copying the pages' contents into the images. Or to the page server, but the page-xfer layer handles that.
What's tricky in the latter part is that process_vm_readv() will race with the tasks modifying their VMAs, since they are unfrozen. So you'll need to carefully handle potential partial reads, or even errors, and adjust the images accordingly.
Another headache is read-protected memory areas. You will need to check whether process_vm_readv() still allows the root user (CRIU runs as root) to read those, and, if not, handle them too by skipping them. Currently, pre-dump read-unprotects the pages, grabs them, then read-protects them back, but that's not an option for the new algorithm.
Optimizations
The mentioned system call copies the memory, so we can do better. If we had a sys_splice_process_vm() system call that behaved like the regular splice() one but worked on another task's memory, we could replace the memory copying with vm-splicing and then further splice the data into the image or page-server socket file descriptor.