GPU Container Checkpoint/Restore with CRIUgpu: Zero-Downtime Live Migration for ML Workloads

Debo Ray

Co-Founder, CEO

July 11, 2025 · 6 min read

GPU workloads represent the most expensive compute resources in modern data centers. A single NVIDIA H100 can cost $25,000-40,000, and ML inference containers often hold multi-gigabyte models in GPU memory. When these containers restart, the cost isn't just downtime—it's burning money while models reload and caches rebuild.

Traditional container checkpoint/restore with CRIU handles CPU workloads elegantly, but GPU state presents an entirely different challenge. GPU memory lives outside the normal process address space, CUDA contexts maintain complex driver state, and multi-GPU topologies add layers of complexity that standard tools can't handle.

The solution has arrived, though adopting it still takes care. Here's the current state of GPU container migration and what's coming next.

The GPU State Problem#

GPU workloads maintain state across multiple layers:

CUDA Runtime State:

  • Device memory allocations (often GBs of model weights)
  • CUDA contexts and streams
  • cuDNN handle states
  • Memory pool configurations

Driver-Level State:

  • GPU scheduling contexts
  • Memory management unit (MMU) mappings
  • PCIe configuration state
  • Multi-GPU communication channels

Container-Specific State:

  • GPU device assignments
  • Resource limits and quotas
  • Runtime configuration (nvidia-container-runtime)

A standard CRIU checkpoint captures none of this. The process might restore, but it wakes up to find its GPU resources gone.
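
The gap is visible in /proc/&lt;pid&gt;/maps, which is what CRIU walks during a dump: anonymous and file-backed pages can be read and restored, but device-backed mappings such as /dev/nvidia* expose no page contents; the real state lives on the GPU. A minimal sketch of that distinction (the classification rules are illustrative, not CRIU's actual logic):

```python
def classify_mapping(maps_line: str) -> str:
    """Classify one /proc/<pid>/maps entry for checkpoint purposes."""
    fields = maps_line.split()
    path = fields[5] if len(fields) >= 6 else ""
    if path.startswith("/dev/nvidia"):
        return "gpu-device"   # opaque to CRIU: state lives in driver/VRAM
    if path in ("", "[heap]", "[stack]"):
        return "anonymous"    # dumpable as ordinary memory pages
    return "file-backed"      # restorable by re-mapping the file

# Example lines in /proc/<pid>/maps format:
lines = [
    "7f0000000000-7f0000100000 rw-p 00000000 00:00 0",
    "7f0000200000-7f0000300000 rw-s 00000000 00:06 512 /dev/nvidia0",
    "7f0000400000-7f0000500000 r-xp 00000000 08:01 42 /usr/lib/libc.so.6",
]
print([classify_mapping(l) for l in lines])
# → ['anonymous', 'gpu-device', 'file-backed']
```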

Current Approaches: Why API Interception Fails#

Most existing solutions use API interception with device proxies, a fundamentally flawed approach that introduces significant challenges:

Challenge 1: Performance Overhead#

API interception sits in the critical path of every GPU operation. Research shows exponential overhead growth with training iterations - what starts as manageable latency becomes prohibitive for long-running workloads.

Challenge 2: Static vs Dynamic Linking#

CUDA defaults to static linking since version 5.5, but API interception requires dynamic linking. This forces recompilation of frameworks like PyTorch from source - often impractical in production environments.

Challenge 3: Complex GPU State Management#

GPUs maintain complex runtime state across streams, contexts, and memory hierarchies. Device proxies must reverse-engineer and replay this state, leading to reliability issues and non-deterministic behavior.

Challenge 4: Limited Ecosystem Support#

Solutions like Cricket work for simple workloads but break with real-world applications that use advanced features like CUDA graphs, multi-GPU communication, or complex memory management patterns.

The Emerging Solution: CRIUgpu#

The breakthrough came in 2025 with CRIUgpu, a research project that integrates NVIDIA's cuda-checkpoint with CRIU to achieve fully transparent GPU container checkpointing. Unlike previous approaches that rely on API interception, CRIUgpu creates unified CPU-GPU snapshots without performance overhead. It operates at the CUDA runtime level, capturing GPU memory and context state.

How CRIUgpu Works#

CRIUgpu leverages NVIDIA's cuda-checkpoint utility integrated with CRIU plugins:

The process:

  1. Lock: CUDA APIs are locked, preventing new GPU operations
  2. Complete: Active GPU work finishes (with configurable timeout)
  3. Checkpoint: GPU memory copied to host, unified with CPU state
  4. Release: GPU resources released, container becomes CPU-only

Restore process:

  1. Acquire: GPU resources re-acquired
  2. Restore: GPU memory and contexts restored at original addresses
  3. Unlock: CUDA APIs unlocked, application resumes
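
The step ordering above is strict: a process can't be dumped until GPU work has drained, and can't resume until memory is back at its original addresses. A tiny state-machine sketch of the protocol (class and method names are illustrative, not the CRIUgpu API):

```python
class GpuCheckpointProtocol:
    """Toy model of the lock/complete/checkpoint/release sequence."""

    def __init__(self):
        self.state = "running"
        self.steps = []

    def _advance(self, step, allowed_state, next_state):
        if self.state != allowed_state:
            raise RuntimeError(f"cannot {step} while {self.state}")
        self.steps.append(step)
        self.state = next_state

    def checkpoint(self):
        self._advance("lock", "running", "locked")        # block new CUDA calls
        self._advance("complete", "locked", "drained")    # in-flight work finishes
        self._advance("checkpoint", "drained", "copied")  # device memory -> host
        self._advance("release", "copied", "cpu-only")    # free GPU resources

    def restore(self):
        self._advance("acquire", "cpu-only", "acquired")  # re-acquire GPUs
        self._advance("restore", "acquired", "restored")  # memory at original addrs
        self._advance("unlock", "restored", "running")    # application resumes

p = GpuCheckpointProtocol()
p.checkpoint()   # container is now CPU-only and safe for CRIU to dump
p.restore()
print(p.state)   # → running
```

Calling `restore()` on a running process, or `checkpoint()` twice, raises immediately; the ordering invariant is what makes the restore deterministic.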

Key advantages:

  • No API interception overhead
  • Works with statically linked applications
  • Supports both CUDA and ROCm
  • Unified CPU-GPU snapshots
  • Deterministic restore behavior

What gets captured:

  • Device memory contents (copied to host during checkpoint)
  • CUDA contexts, streams, and events
  • GPU memory mappings (restored at original addresses)
  • CUDA driver state

Production Performance Results#

Recent research demonstrates CRIUgpu's production readiness with large-scale workloads:

Large Language Models#

LLaMA 3.1 (8B parameters) on H100:

  • Checkpoint time: 77 seconds
  • Restore time: 39 seconds
  • Checkpoint size: 56GB (97% GPU memory)

GPT-2 XL (1.5B parameters) on A100:

  • Checkpoint time: 131 seconds
  • Restore time: 145 seconds
  • Checkpoint size: 60GB (96% GPU memory)

Multi-GPU Scaling#

CRIUgpu scales linearly with GPU count:

  • 1x A100: 13 seconds checkpoint, 8 seconds restore
  • 2x A100: 26 seconds checkpoint, 17 seconds restore
  • 4x A100: 55 seconds checkpoint, 35 seconds restore
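
Quick arithmetic on those figures shows what "linear" means here: the per-GPU cost stays roughly constant as GPUs are added.

```python
# Scaling figures quoted above: gpus -> (checkpoint_s, restore_s)
results = {1: (13, 8), 2: (26, 17), 4: (55, 35)}

# Per-GPU checkpoint time barely moves as GPU count grows
per_gpu = {n: ckpt / n for n, (ckpt, _) in results.items()}
print(per_gpu)  # → {1: 13.0, 2: 13.0, 4: 13.75}
```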

Zero Runtime Overhead#

Unlike API interception approaches, CRIUgpu introduces no steady-state performance overhead. Applications run at native speed until checkpoint/restore operations.

Container Runtime Integration#

Custom runtime hooks can coordinate GPU and CPU state by driving cuda-checkpoint and CRIU in sequence during container checkpoint and restore.
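
A hypothetical hook along those lines: toggle GPU state into host memory with NVIDIA's cuda-checkpoint, then let CRIU dump the now CPU-only process tree. The command layout is a sketch; verify the flags against the cuda-checkpoint and CRIU versions you have installed.

```python
import subprocess

def build_checkpoint_commands(pid: int, images_dir: str) -> list[list[str]]:
    return [
        # Suspend CUDA state and copy device memory to the host
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Dump the full process tree with CRIU
        ["criu", "dump", "--tree", str(pid),
         "--images-dir", images_dir, "--shell-job"],
    ]

def run_hook(pid: int, images_dir: str, runner=subprocess.run) -> None:
    """Run each step, aborting the sequence if one fails."""
    for cmd in build_checkpoint_commands(pid, images_dir):
        runner(cmd, check=True)

print(build_checkpoint_commands(1234, "/tmp/ckpt")[0])
# → ['cuda-checkpoint', '--toggle', '--pid', '1234']
```

The restore side mirrors this: `criu restore` first, then a second cuda-checkpoint toggle to move memory back onto the GPU.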

Production Challenges#

Memory Transfer Overhead#

GPU memory dumps are massive, but specific timing depends on:

  • GPU memory size and utilization
  • Storage I/O bandwidth
  • Network transfer for cross-node migration
  • Memory access patterns during dump

Performance characteristics need measurement in your specific environment.
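
As a starting point before measuring, checkpoint duration is bounded below by device-memory size over the slowest link in the path. A back-of-envelope estimator (the bandwidth figures are illustrative, not measured):

```python
def estimate_dump_seconds(gpu_mem_gb: float, *bandwidths_gbps: float) -> float:
    """Lower bound: memory size divided by the slowest transfer link."""
    return gpu_mem_gb / min(bandwidths_gbps)

# 56 GB of device memory (the LLaMA 3.1 figure above), assuming
# ~2 GB/s sustained NVMe writes and ~1.25 GB/s (10 GbE) network
# for a cross-node migration:
print(f"{estimate_dump_seconds(56, 2.0, 1.25):.1f}s")  # → 44.8s
```

That lower bound is the same order of magnitude as the 77-second measured checkpoint, with the gap covered by driver work and memory copies the simple model ignores.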

CUDA Version Compatibility#

Checkpoint images are tied to the driver stack that produced them, so pin driver and CUDA versions across any hosts involved in a migration; the driver version requirements listed below apply on both the checkpoint and restore side.

Multi-GPU Topology Preservation#

Complex topologies don't restore cleanly. CRIUgpu expects the restore host to present the same GPU topology as the checkpoint source: matching device type, count, and memory size.
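
A pre-restore validation sketch of that constraint (`GpuInfo` and the comparison rule are illustrative, not a CRIUgpu interface):

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class GpuInfo:
    name: str
    memory_gb: int

def topology_matches(source: list[GpuInfo], target: list[GpuInfo]) -> bool:
    """True if the target host can restore a checkpoint taken on source:
    same device types, same count, same memory sizes."""
    return sorted(source) == sorted(target)

src = [GpuInfo("A100", 80)] * 4
print(topology_matches(src, [GpuInfo("A100", 80)] * 4))  # → True
print(topology_matches(src, [GpuInfo("A100", 40)] * 4))  # → False
print(topology_matches(src, [GpuInfo("A100", 80)] * 2))  # → False
```

Running a check like this before scheduling a restore fails fast on mismatched hosts instead of failing mid-restore.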

Current Limitations and Requirements#

Hardware Requirements:

  • Display driver 570+ (full feature set)
  • Display driver 550+ (basic functionality)
  • Linux x86_64 only
  • Same GPU topology for restore (type, count, memory size)

Current Limitations:

  • No UVM (Unified Virtual Memory) support
  • No GPU-to-GPU migration between different hardware
  • No NCCL support for multi-node distributed training
  • Multi-node checkpointing requires additional coordination

Container Integration:

  • Requires CRIU 4.0+
  • Podman support available
  • Container Device Interface (CDI) integration
  • NVIDIA Container Toolkit compatibility

Real-World Readiness#

CRIUgpu has been integrated into the upstream CRIU project (version 4.0+) and is available for production use. The technology has moved beyond research prototypes to provide a robust foundation for GPU container checkpointing in enterprise environments.

Container runtimes like Podman already support CRIUgpu through native CRIU integration, enabling transparent GPU container checkpointing without additional tooling or infrastructure changes.

When to Consider GPU Checkpointing#

Strong candidates:

  • Long-running ML training jobs
  • Inference services with expensive model loading
  • Multi-tenant GPU sharing with SLA requirements
  • Research workloads with checkpoint/restart patterns

Avoid for now:

  • Latency-sensitive real-time inference
  • Simple stateless GPU applications
  • Production systems requiring reliability guarantees

Building Toward Production#

If you're planning GPU checkpoint/restore:

  1. Start with application-level approaches
  2. Prototype with cuda-checkpoint in development
  3. Measure performance overhead carefully
  4. Plan for manual orchestration initially
  5. Monitor NVIDIA's roadmap for production-ready features

Conclusion#

GPU container checkpointing has evolved from experimental research to production-ready technology. CRIUgpu's breakthrough approach eliminates the fundamental flaws of API interception while delivering true zero-downtime GPU workload migration.

The technology is no longer "coming soon" - it's here, with production deployments already demonstrating its value. For organizations running GPU-intensive workloads, CRIUgpu offers:

  • Transparent checkpointing without application changes
  • Zero runtime overhead during normal operation
  • Unified CPU-GPU snapshots for complete state preservation
  • Linear scaling across multiple GPUs
  • Deterministic restore behavior

The business case is compelling: GPU resources are too expensive to waste on unnecessary restarts, and the technology now exists to eliminate them entirely. Early adopters are already gaining competitive advantages through more efficient GPU utilization and true zero-downtime operations.

For platform teams managing GPU infrastructure, the question is no longer whether to adopt GPU checkpointing, but how quickly you can integrate CRIUgpu into your container orchestration pipeline.
