Difference between revisions of "GPU Checkpointing"

From CRIU
Line 1: Line 1:
 
This page explains how CRIU handles GPU-accelerated workloads, what vendor components are needed (NVIDIA & AMD), and how to use them for processes and containers. For more information, check out the [https://arxiv.org/abs/2502.16631 CRIUgpu paper], NVIDIA's [https://developer.nvidia.com/blog/checkpointing-cuda-applications-with-criu/ blog post], and AMD conference talks at  [https://lpc.events/event/11/contributions/891/ LPC] and [https://indico.freedesktop.org/event/1/contributions/18/ X.Org].
 
This page explains how CRIU handles GPU-accelerated workloads, what vendor components are needed (NVIDIA & AMD), and how to use them for processes and containers. For more information, check out the [https://arxiv.org/abs/2502.16631 CRIUgpu paper], NVIDIA's [https://developer.nvidia.com/blog/checkpointing-cuda-applications-with-criu/ blog post], and AMD conference talks at  [https://lpc.events/event/11/contributions/891/ LPC] and [https://indico.freedesktop.org/event/1/contributions/18/ X.Org].
  
CRIU checkpoints Linux-kernel resources (e.g., memory, threads, files, sockets). GPU state such as device memory, contexts, and queues that lives outside normal process address space needs special handling, so CRIU relies on vendor-specific [[plugins]].
+
== How CRIU integrates with GPU checkpointing mechanisms? ==
  
== NVIDIA CUDA ==
+
CRIU checkpoints Linux-kernel resources (e.g., memory, threads, files, sockets). GPU state such as device memory, contexts, and queues that lives outside normal process address space needs special handling, so CRIU relies on vendor-specific [[plugins]].
  
The checkpointing functionality for CUDA is enabled through a command-line utility called [http://github.com/NVIDIA/cuda-checkpoint cuda-checkpoint]. This utility allows to transparently checkpoint and restore the GPU state of a running Linux process.
+
=== CUDA Plugin ===
  
== AMD ROCm ==
+
The checkpointing functionality for CUDA applications is enabled through a command-line utility called [http://github.com/NVIDIA/cuda-checkpoint cuda-checkpoint]. This utility allows to transparently checkpoint and restore the GPU state of a running Linux process.
 +
 
 +
=== AMDGPU Plugin ===

Revision as of 18:13, 29 October 2025

This page explains how CRIU handles GPU-accelerated workloads, what vendor components are needed (NVIDIA & AMD), and how to use them for processes and containers. For more information, check out the CRIUgpu paper, NVIDIA's blog post, and AMD conference talks at LPC and X.Org.

How does CRIU integrate with GPU checkpointing mechanisms?

CRIU checkpoints Linux-kernel resources (e.g., memory, threads, files, sockets). GPU state such as device memory, contexts, and queues lives outside the normal process address space and needs special handling, so CRIU relies on vendor-specific plugins.

CUDA Plugin

Checkpointing for CUDA applications is enabled through a command-line utility called cuda-checkpoint. This utility transparently checkpoints and restores the GPU state of a running Linux process.
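
As a rough illustration of how the utility and CRIU fit together, the sketch below builds the manual command sequence described in the cuda-checkpoint README: toggle the CUDA state off the GPU, dump with CRIU, then restore and toggle again. The helper names (toggle_cmd, checkpoint_plan, and so on), the example PID, and the "demo" image directory are hypothetical and exist only for this example. When the CUDA plugin is installed, CRIU is expected to drive cuda-checkpoint itself, so these manual steps would normally not be needed.

```python
import shlex
import subprocess
from typing import List


def toggle_cmd(pid: int) -> List[str]:
    # `--toggle --pid` is the invocation shown in the cuda-checkpoint README.
    # It suspends CUDA state if the process is running, or resumes it if suspended.
    return ["cuda-checkpoint", "--toggle", "--pid", str(pid)]


def dump_cmd(pid: int, images_dir: str) -> List[str]:
    # Standard CRIU dump of the process tree rooted at `pid`.
    return ["criu", "dump", "--shell-job",
            "--images-dir", images_dir, "--tree", str(pid)]


def restore_cmd(images_dir: str) -> List[str]:
    # Standard CRIU restore from the image directory, detached from the caller.
    return ["criu", "restore", "--shell-job", "--restore-detached",
            "--images-dir", images_dir]


def checkpoint_plan(pid: int, images_dir: str) -> List[List[str]]:
    # GPU state must be moved into host memory before CRIU dumps the process.
    return [toggle_cmd(pid), dump_cmd(pid, images_dir)]


def restore_plan(pid: int, images_dir: str) -> List[List[str]]:
    # CRIU restores the process with its original PID; the second toggle
    # then brings the CUDA state back onto the GPU.
    return [restore_cmd(images_dir), toggle_cmd(pid)]


def run(plan: List[List[str]], dry_run: bool = True) -> None:
    # By default only print the commands; executing them assumes both tools
    # are installed and the caller has sufficient privileges.
    for argv in plan:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

Calling run(checkpoint_plan(1234, "demo")) prints the two checkpoint commands without executing them; passing dry_run=False runs them for real.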

AMDGPU Plugin