Difference between revisions of "GPU Checkpointing"
Revision as of 17:01, 29 October 2025
This page explains how CRIU handles GPU-accelerated workloads, what vendor components are needed (NVIDIA & AMD), and how to use them for processes and containers.
CRIU checkpoints Linux-kernel resources (e.g., memory, threads, files, sockets). GPU state, such as device memory, contexts, and queues, lives outside the normal process address space and needs special handling, so CRIU relies on vendor-specific plugins.
NVIDIA CUDA
Checkpointing for CUDA is provided by a command-line utility called cuda-checkpoint. This utility transparently checkpoints and restores the GPU state of a running Linux process.
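As an illustration of how the utility and CRIU can be combined by hand, the following Python sketch assembles the command sequence for one suspend, dump, restore, resume cycle. It rests on several assumptions that should be checked against the installed versions: that cuda-checkpoint accepts the `--toggle` and `--pid` options as described in its README, that the `criu dump` and `criu restore` options shown suit the target process, and that the restored process keeps its original PID. The helper names `build_cuda_checkpoint_cycle` and `run_cycle` are hypothetical and exist only for this example. When CRIU's CUDA plugin is installed, the toggle steps may be handled by the plugin instead.

```python
import subprocess
from typing import List


def build_cuda_checkpoint_cycle(pid: int, images_dir: str) -> List[List[str]]:
    """Return the argv lists for a manual checkpoint/restore cycle.

    Hypothetical helper: the flags are assumptions taken from the
    cuda-checkpoint README and the CRIU command line, not a definitive recipe.
    """
    toggle = ["cuda-checkpoint", "--toggle", "--pid", str(pid)]
    return [
        # 1. Suspend CUDA: GPU state is moved to host memory (assumed behaviour).
        list(toggle),
        # 2. Dump the now CPU-only process tree with CRIU.
        ["criu", "dump", "--shell-job", "--images-dir", images_dir,
         "--tree", str(pid)],
        # 3. Restore the process tree from the image directory.
        ["criu", "restore", "--shell-job", "--restore-detached",
         "--images-dir", images_dir],
        # 4. Resume CUDA: GPU state is moved back to the device.
        list(toggle),
    ]


def run_cycle(pid: int, images_dir: str, dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for argv in build_cuda_checkpoint_cycle(pid, images_dir):
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops the sequence at the first failing step.
            subprocess.run(argv, check=True)
```

Calling `run_cycle(pid, "demo")` with the default `dry_run=True` only prints the four commands, which makes it easy to review them before running anything with root privileges.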