This page explains the differences between CRIU and other checkpoint/restore (C/R) solutions.
DMTCP
DMTCP implements checkpoint/restore of a process at the library level. This means that to C/R an application you must launch it with the DMTCP library dynamically linked from the very beginning. When launched this way, the DMTCP library intercepts a number of library calls made by the application, builds a shadow database of information about the process's internals, and then forwards each request down to glibc and the kernel. The gathered information is later used to create an image of the application. With this approach, one can only dump applications that are known to run successfully with the DMTCP library, and the library does not provide proxies for all kernel APIs (for example, inotify is known to be unsupported). Another implication of this approach is a potential performance penalty caused by proxying the requests.
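The interception scheme described above can be illustrated with a minimal Python sketch. It is not DMTCP code: the `ShadowFdTable` class, its methods, and the `demo` helper are hypothetical names, and only `os.open` and `os.close` are wrapped, to show how a wrapper can record state in a shadow database before forwarding the call.

```python
import os
import tempfile


class ShadowFdTable:
    """Hypothetical shadow database of descriptors opened through the wrappers."""

    def __init__(self):
        self.entries = {}  # fd -> metadata needed to reopen the file later

    def open(self, path, flags, mode=0o600):
        fd = os.open(path, flags, mode)  # forward the request to the real call
        self.entries[fd] = {"path": path, "flags": flags}
        return fd

    def close(self, fd):
        os.close(fd)
        self.entries.pop(fd, None)

    def snapshot(self):
        # At checkpoint time, add the current offset of every tracked file.
        return {
            fd: dict(meta, offset=os.lseek(fd, 0, os.SEEK_CUR))
            for fd, meta in self.entries.items()
        }


def demo():
    table = ShadowFdTable()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")
        fd = table.open(path, os.O_CREAT | os.O_RDWR)
        os.write(fd, b"hello")
        image = table.snapshot()
        table.close(fd)
    return path, image, table.entries
```

A descriptor created behind the wrappers' back, for instance through a kernel API that has no proxy, never reaches `entries`, which is why such APIs cannot be dumped with this approach.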
Restoring a set of processes is also tricky, because it frequently requires restoring an object with a predefined ID, and for several kinds of objects the kernel provides no API to do so. For example, the kernel cannot fork a process with a desired PID. To work around this, DMTCP fools the process by intercepting the getpid() library call and returning a fake PID value to the application. Such behavior is dangerous, because the application might see the wrong files in the /proc filesystem if it tries to access them via its PID.
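The /proc hazard can be made concrete with a small Python sketch. The `PidTable` class and the PID numbers are hypothetical and serve only as an illustration; this is not how DMTCP stores its translation table.

```python
class PidTable:
    """Hypothetical virtual-to-real PID translation kept by an interposing library."""

    def __init__(self):
        self._real_to_virtual = {}

    def remap(self, real_pid, virtual_pid):
        # After restore, the kernel assigned real_pid, but the application
        # still expects the PID it had at checkpoint time (virtual_pid).
        self._real_to_virtual[real_pid] = virtual_pid

    def getpid(self, real_pid):
        # What an intercepted getpid() would return to the application.
        return self._real_to_virtual.get(real_pid, real_pid)


def proc_status_path(pid):
    return f"/proc/{pid}/status"


table = PidTable()
table.remap(real_pid=5678, virtual_pid=1234)  # example numbers only

seen_by_app = proc_status_path(table.getpid(5678))  # path built from the fake PID
actual = proc_status_path(5678)                     # path the kernel really exposes
```

Here the application would build `/proc/1234/status` while its real entry is `/proc/5678/status`, so it reads either nothing or the data of an unrelated process that happens to own PID 1234.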
CRIU, on the other hand, doesn't require any libraries to be preloaded. It will checkpoint and restore an arbitrary application, as long as the kernel provides all the needed facilities. Kernel support for some CRIU features was added only recently, which means that a recent kernel version might be required.
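As a rough illustration of the kernel-version dependency, the following Python sketch compares the running kernel release against a minimum version. The threshold `ASSUMED_MIN_KERNEL` and both function names are assumptions made for this example; the real requirement depends on which CRIU features are used.

```python
import platform
import re

# Assumption for illustration only; consult the CRIU documentation
# for the actual requirement of the features you need.
ASSUMED_MIN_KERNEL = (3, 11)


def parse_release(release):
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release, minimum=ASSUMED_MIN_KERNEL):
    return parse_release(release) >= minimum


if __name__ == "__main__":
    release = platform.release()
    print(release, "ok" if kernel_at_least(release) else "too old")
```

A version comparison is only a heuristic, because distributions backport features to older kernels. The `criu check` command, which probes the kernel for the facilities themselves, is likely the more reliable test.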
BLCR
Berkeley Lab Checkpoint/Restart (BLCR) is a part of the Scalable Systems Software Suite, developed by the Future Technologies Group at Lawrence Berkeley National Lab under SciDAC funding from the United States Department of Energy. It is an open-source, system-level checkpointer designed with High Performance Computing (HPC) applications in mind, in particular CPU- and memory-intensive batch-scheduled MPI jobs. BLCR is implemented as a GPL-licensed loadable kernel module for Linux 2.4.x and 2.6.x kernels on the x86, x86_64, PPC/PPC64, and ARM architectures, together with a small LGPL-licensed library.
CRIU, DMTCP, BLCR
“looks like yes/no” or “seems like yes/no” - I found only unconfirmed messages saying “yes” or “no”.
“not yet” - it is officially planned, or I found no reason why it can't be done.
Feature | CRIU | DMTCP | BLCR |
arch | x86_64, ARM | x86, x86_64, ARM | x86, x86_64, PPC/PPC64, ARM |
OS | | | |
modified kernel | yes, but only for some extra features; all necessary features are already in recent kernel versions | no | no, the module can simply be loaded with modprobe |
special libs | no | yes | yes |
root privileges | yes; otherwise it would be unsafe because of, for example, the parasite code | no | no |
need to modify programs | no | no | yes: there are some difficulties with statically linked applications, and LinuxThreads is not supported at all |
need to prepare tasks | no | yes: it preloads the DMTCP library, which runs before the main() routine and creates a second thread; this checkpoint thread then creates a socket to the DMTCP coordinator, registers itself, and also creates a signal handler | yes: processes are notified when a checkpoint is about to occur (before the kernel takes it) so that they can prepare themselves accordingly |
Does it change the behavior of C/R-ed programs? | no | yes, because of wrappers on system calls | yes, because of wrappers on system calls |
migration | yes, even if the kernel, libraries, etc. are newer | yes, if both kernels are recent | yes, but only if everything is the same |
Containers | yes: LXC and OpenVZ containers | looks like no: it doesn't support namespaces, so it probably can't dump containers | looks like no |
parallel/distributed computations | no | yes: Open MPI, MPICH2, OpenMP, and Cilk are already supported, and InfiniBand is in progress | yes: Cray MPI, Intel MPI, LAM/MPI, MPICH-V, MPICH2, MVAPICH, Open MPI, SGI MPT |
C/R of gdb together with the debugged application | no, because both use the same interface | yes | no |
X Window graphics programs (KDE, GNOME, etc.) | yes, by using VNC | yes, by using VNC | seems like no |
Solutions for invocation from custom software | not yet | yes: plugins and API | not yet |
Unix sockets | yes, all kinds | yes | no |
UDP sockets | yes, both IPv4 and IPv6 | not yet: the DMTCP developers have had no requests for this | not yet |
TCP sockets | yes | yes | not yet |
remote TCP connection | yes | not yet, but you can write a simple DMTCP plugin that tells DMTCP how you want to reconnect on restart | no |
InfiniBand | no | not yet: development is about halfway done | no |
multithread support | yes | yes | yes |
multiprocess | yes | yes | yes |
process groups | yes | yes | not yet |
zombies | yes | no | no |
namespaces | yes | no | no |
sessions | yes | yes | not yet |
Ptraced programs | no | yes | no |
System V IPC | yes | yes | no |
memory mappings | yes, all kinds | yes | yes, partially |
protected memory | yes | yes | yes |
pipes | yes | yes | not yet |
terminals | yes, only Unix98 PTYs | yes | yes |
non-POSIX files (inotify, signalfd, eventfd, etc.) | yes: inotify, epoll, etc. | yes: epoll, eventfd, and signalfd are already supported, and inotify will be supported in the future | looks like no |
timers | yes | no: any counter or timer active since the beginning of a process will consider the restarted process to be a new process | yes |
Shared resources (files, mm, etc.) | yes: files, memory, etc. | yes: System V shared memory (shmget, etc.), mmap-based shared memory, shared sockets, pipes, file descriptors | no, but support for shared mmap regions is planned |
block devices | looks like yes | looks like yes | no |
character devices | mostly no, but /dev/null, /dev/zero, etc. are supported | mostly no; it looks like /dev/null and /dev/zero are supported | mostly no, but /dev/null and /dev/zero are supported |
capture the contents of all open files | yes | looks like no | not yet |
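The "need to prepare tasks" row mentions that processes are notified before a checkpoint is taken so that they can prepare themselves. A minimal Python sketch of that idea follows. `CheckpointNotifier`, `flush_buffers`, and the sample state are hypothetical names used only for illustration and do not correspond to the BLCR API.

```python
class CheckpointNotifier:
    """Hypothetical registry of callbacks to run before a checkpoint is taken."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def checkpoint(self, take_image):
        # Give every registered task a chance to prepare first...
        for callback in self._callbacks:
            callback()
        # ...and only then capture the state.
        return take_image()


events = []
state = {"buffered": ["a", "b"], "flushed": []}


def flush_buffers():
    # Example preparation step: move pending data to a stable place.
    state["flushed"].extend(state["buffered"])
    state["buffered"].clear()
    events.append("prepared")


notifier = CheckpointNotifier()
notifier.register(flush_buffers)
image = notifier.checkpoint(lambda: {k: list(v) for k, v in state.items()})
```

Because the callbacks run before the image is captured, the image reflects the prepared state: the buffer is empty and its contents are already flushed.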
Sources
DMTCP:
- http://dmtcp.sourceforge.net/
- http://dmtcp.sourceforge.net/papers/dmtcp.pdf
- http://www.ccs.neu.edu/home/gene/papers/ccgrid06.pdf
- http://research.cs.wisc.edu/htcondor/CondorWeek2010/condor-presentations/cooperman-dmtcp.pdf
- http://dmtcp.sourceforge.net/papers/mtcp.pdf
BLCR:
- https://upc-bugs.lbl.gov/blcr/doc/html/
- https://ftg.lbl.gov/assets/projects/CheckpointRestart/Pubs/LBNL-49659.pdf
- https://ftg.lbl.gov/assets/projects/CheckpointRestart/Pubs/blcr.pdf
- https://ftg.lbl.gov/assets/projects/CheckpointRestart/Pubs/checkpointSurvey-020724b.pdf
- https://ftg.lbl.gov/assets/projects/CheckpointRestart/Pubs/lacsi-2003.pdf
- https://ftg.lbl.gov/assets/projects/CheckpointRestart/Pubs/LBNL-60520.pdf