Difference between revisions of "GPU Checkpointing"

(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
- This page explains how CRIU handles GPU-accelerated workloads, what vendor components are needed (NVIDIA & AMD), and how to use them for processes and containers. For more information, check out the [https://arxiv.org/abs/2502.16631 CRIUgpu paper].
+ This page explains how CRIU handles GPU-accelerated workloads, what vendor components are needed (NVIDIA & AMD), and how to use them for processes and containers. For more information, see the NVIDIA's [https://developer.nvidia.com/blog/checkpointing-cuda-applications-with-criu/ Checkpointing CUDA Applications with CRIU] blog post and the [https://arxiv.org/abs/2502.16631 CRIUgpu paper].
  
 
== How CRIU integrates with GPU checkpointing mechanisms? ==
Line 7: Line 8:
 
=== CUDA Plugin ===
  
- The checkpointing functionality for CUDA applications is enabled through a command-line utility called [http://github.com/NVIDIA/cuda-checkpoint cuda-checkpoint]. This utility allows to transparently checkpoint and restore the GPU state of a running Linux process.
+ The checkpointing functionality for CUDA applications is enabled through the [http://github.com/NVIDIA/cuda-checkpoint cuda-checkpoint] utility. This utility allows to transparently checkpoint and restore the CUDA state of a running Linux process. The CUDA plugin integrates this external utility with CRIU to safely pause, checkpoint, and restore processes and containers that use NVIDIA GPUs. It detects whether <code>cuda-checkpoint</code> is available and whether the system has an NVIDIA GPU; if not, the plugin disables itself.
  
 
=== AMDGPU Plugin ===

Latest revision as of 22:13, 10 November 2025

This page explains how CRIU handles GPU-accelerated workloads, which vendor components are needed (NVIDIA and AMD), and how to use them for processes and containers. For more information, see NVIDIA's Checkpointing CUDA Applications with CRIU blog post and the CRIUgpu paper.

How does CRIU integrate with GPU checkpointing mechanisms?

CRIU checkpoints Linux kernel resources (e.g., memory, threads, files, sockets). GPU state, such as device memory, contexts, and queues, lives outside the normal process address space and needs special handling, so CRIU relies on vendor-specific plugins.

CUDA Plugin

The checkpointing functionality for CUDA applications is provided by the cuda-checkpoint utility, which transparently checkpoints and restores the CUDA state of a running Linux process. The CUDA plugin integrates this external utility with CRIU to safely pause, checkpoint, and restore processes and containers that use NVIDIA GPUs. It detects whether cuda-checkpoint is available and whether the system has an NVIDIA GPU; if either is missing, the plugin disables itself.
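
The self-disabling check described above can be illustrated with a minimal Python sketch. It is not the plugin's actual code: it assumes the utility is looked up on PATH and uses the presence of the /dev/nvidiactl device node as a stand-in for "the system has an NVIDIA GPU". The function name cuda_plugin_enabled is hypothetical.

```python
import os
import shutil


def cuda_plugin_enabled(which=shutil.which, exists=os.path.exists):
    """Return True only if both prerequisites are present.

    Assumptions (not taken from the plugin source): the utility is found
    via PATH, and /dev/nvidiactl indicates that an NVIDIA GPU driver is
    loaded. The lookup functions are parameters so they can be faked.
    """
    has_utility = which("cuda-checkpoint") is not None
    has_gpu = exists("/dev/nvidiactl")
    # If either prerequisite is missing, the plugin would disable itself.
    return has_utility and has_gpu


if __name__ == "__main__":
    state = "enabled" if cuda_plugin_enabled() else "disabled"
    print(f"CUDA plugin would be {state} on this host")
```

Passing the lookup functions as parameters keeps the sketch testable on machines without a GPU.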

AMDGPU Plugin

Usage Examples
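
As a rough illustration, the following Python sketch assembles the command lines one might run to checkpoint and restore a CUDA process. The PID 1234 and the directory /tmp/ckpt are hypothetical, and the flags shown (criu dump/restore with --shell-job, --images-dir, and --tree; cuda-checkpoint with --toggle and --pid) reflect the upstream tools as commonly documented. Verify them against criu --help and cuda-checkpoint --help for the installed versions.

```python
import shlex


def dump_command(pid, images_dir):
    """Build a 'criu dump' command line for the process tree rooted at pid."""
    return ["criu", "dump", "--shell-job",
            "--images-dir", images_dir, "--tree", str(pid)]


def restore_command(images_dir):
    """Build the matching 'criu restore' command line."""
    return ["criu", "restore", "--shell-job", "--images-dir", images_dir]


def toggle_command(pid):
    """Build a standalone cuda-checkpoint call that toggles CUDA state.

    With the CUDA plugin installed, CRIU is expected to drive the utility
    itself, so this is only relevant when using the utility directly.
    """
    return ["cuda-checkpoint", "--toggle", "--pid", str(pid)]


if __name__ == "__main__":
    # Hypothetical PID and image directory, for illustration only.
    for cmd in (dump_command(1234, "/tmp/ckpt"), restore_command("/tmp/ckpt")):
        print(shlex.join(cmd))
```

The commands are built as argument lists so they can be passed to subprocess.run without shell quoting issues; CRIU normally needs root privileges to run them.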

Limitations