See [http://lwn.net/Articles/452184/ "Checkpoint/restart (mostly) in user space"]

== 15 Jul 2011: Pavel sent initial RFC and code ==

From http://lwn.net/Articles/451916/:

<pre>
From: Pavel Emelyanov
Subject: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
Date: Fri, 15 Jul 2011 17:45:10 +0400

Hi guys!

There have already been made many attempts to have the checkpoint/restore functionality
in Linux, but as far as I can see there's still no final solutions that suits most of
the interested people. The main concern about the previous approaches as I see it was
about - all that stuff was supposed to sit in the kernel thus creating various problems.

I'd like to bring this subject back again proposing the way of how to implement c/r
mostly in the userspace with the reasonable help of a kernel.

That said, I propose to start with very basic set of objects to c/r that can work with

* x86_64 tasks (subtree) which includes
  - registers
  - TLS
  - memory of all kinds (file and anon both shared and private)
* open regular files
* pipes (with data in it)

Core idea:

The core idea of the restore process is to implement the binary handler that can execve-ute
image files recreating the register and the memory state of a task. Restoring the process
tree and opening files is done completely in the user space, i.e. when restoring the subtree
of processes I first fork all the tasks in respective order, then open required files and
then call execve() to restore registers and memory.

The checkpointing process is quite simple - all we need about processes can be read from /proc
except for several things - registers and private memory. In current implementation to get
them I introduce the /proc/<pid>/dump file which produces the file that can be executed by the
described above binfmt. Additionally I introduce the /proc/<pid>/mfd/ dir with info about
mappings. It is populated with symbolic links with names equal to vma->vm_start and pointing to
mapped files (including anon shared which are tmpfs ones). Thus we can open some task's
/proc/<pid>/mfd/<address> link and find out the mapped file inode (to check for sharing) and
if required map one and read the contents of anon shared memory.

Other minor stuff is in patches and mostly tools. The set is for linux-2.6.39. The current
implementation is not yet well tested and has many other defects, but demonstrates the idea.

What do you think? Does the support from kernel of the proposed type suit us?

Thanks,
Pavel
</pre>
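The RFC proposes a new /proc/<pid>/mfd/ directory whose links are named after each mapping's vm_start. That interface is not assumed to exist here. To show roughly how much mapping information the standard /proc interface already exposes, the following minimal Python sketch parses /proc/<pid>/maps. It then groups file-backed mappings by (device, inode), which is the kind of sharing check the RFC describes. The names Vma, parse_maps_line, read_vmas and group_by_inode are hypothetical helpers for this example only. They are not part of any C/R tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Vma:
    start: int   # vm_start, the value the proposed mfd/ links would be named after
    end: int
    perms: str   # e.g. "r-xp": 'p' = private, 's' = shared
    offset: int
    dev: str     # "major:minor" of the backing file's device
    inode: int   # 0 for plain anonymous mappings
    path: str    # may be empty, or a pseudo-name such as "[heap]"


def parse_maps_line(line: str) -> Vma:
    """Parse one line of /proc/<pid>/maps (hypothetical helper)."""
    # Split at most 5 times so a path containing spaces stays in one field;
    # anonymous mappings have no path, leaving only 5 fields.
    fields = line.rstrip("\n").split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    path = fields[5] if len(fields) > 5 else ""
    start, end = (int(x, 16) for x in addr.split("-"))
    return Vma(start, end, perms, int(offset, 16), dev, int(inode), path)


def read_vmas(pid: str = "self") -> list:
    """Read all mappings of a task from the standard maps file (Linux only)."""
    with open(f"/proc/{pid}/maps") as f:
        return [parse_maps_line(line) for line in f]


def group_by_inode(vmas) -> dict:
    """Group inode-backed VMAs by (dev, inode); each group is a sharing candidate."""
    groups = defaultdict(list)
    for vma in vmas:
        if vma.inode != 0:
            groups[(vma.dev, vma.inode)].append(vma)
    return dict(groups)


if __name__ == "__main__":
    for vma in read_vmas():
        print(f"{vma.start:#x} {vma.perms} inode={vma.inode} {vma.path}")
```

On Linux, running the script prints the start address, permissions, inode and path of each mapping of the current process. Two mappings of the same inode only indicate possible sharing. Private ('p') mappings of one file receive copy-on-write pages, so a real dumper would also have to check the permission flags and, for private memory, dump the page contents. That remaining work is what the RFC's /proc/<pid>/dump file was meant to cover.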