GPU Container Checkpoint/Restore with CRIUgpu: Zero-Downtime Live Migration for ML Workloads

Debo Ray

Co-Founder, CEO

July 11, 2025 · 6 min read

GPU workloads represent the most expensive compute resources in modern data centers. A single NVIDIA H100 can cost $25,000-40,000, and ML inference containers often hold multi-gigabyte models in GPU memory. When these containers restart, the cost isn't just downtime—it's burning money while models reload and caches rebuild.

Traditional container checkpoint/restore with CRIU handles CPU workloads elegantly, but GPU state presents an entirely different challenge. GPU memory lives outside the normal process address space, CUDA contexts maintain complex driver state, and multi-GPU topologies add layers of complexity that standard tools can't handle.

The solution has arrived, though adopting it still takes care. Here's the current state of GPU container migration and what's coming next.

The GPU State Problem#

GPU workloads maintain state across multiple layers:

CUDA Runtime State:

  • Device memory allocations (often GBs of model weights)
  • CUDA contexts and streams
  • cuDNN handle states
  • Memory pool configurations

Driver-Level State:

  • GPU scheduling contexts
  • Memory management unit (MMU) mappings
  • PCIe configuration state
  • Multi-GPU communication channels

Container-Specific State:

  • GPU device assignments
  • Resource limits and quotas
  • Runtime configuration (nvidia-container-runtime)

A standard CRIU checkpoint captures none of this. The process might restore, but it wakes up to find its GPU resources gone.
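
The gap is visible in /proc/&lt;pid&gt;/maps, which is what CRIU walks during a dump: anonymous and file-backed pages can be read and restored, but device-backed mappings such as /dev/nvidia* expose no page contents; the real state lives on the GPU. A minimal sketch of that distinction (the classification rules are illustrative, not CRIU's actual logic):

```python
def classify_mapping(maps_line: str) -> str:
    """Classify one /proc/<pid>/maps entry for checkpoint purposes."""
    fields = maps_line.split()
    path = fields[5] if len(fields) >= 6 else ""
    if path.startswith("/dev/nvidia"):
        return "gpu-device"   # opaque to CRIU: state lives in driver/VRAM
    if path in ("", "[heap]", "[stack]"):
        return "anonymous"    # dumpable as ordinary memory pages
    return "file-backed"      # restorable by re-mapping the file

# Example lines in /proc/<pid>/maps format:
lines = [
    "7f0000000000-7f0000100000 rw-p 00000000 00:00 0",
    "7f0000200000-7f0000300000 rw-s 00000000 00:06 512 /dev/nvidia0",
    "7f0000400000-7f0000500000 r-xp 00000000 08:01 42 /usr/lib/libc.so.6",
]
print([classify_mapping(l) for l in lines])
# → ['anonymous', 'gpu-device', 'file-backed']
```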

Current Approaches: Why API Interception Fails#

Most existing solutions use API interception with device proxies, a fundamentally flawed approach that introduces significant challenges:

Challenge 1: Performance Overhead#

API interception sits in the critical path of every GPU operation. Research shows exponential overhead growth with training iterations - what starts as manageable latency becomes prohibitive for long-running workloads.

Challenge 2: Static vs Dynamic Linking#

CUDA defaults to static linking since version 5.5, but API interception requires dynamic linking. This forces recompilation of frameworks like PyTorch from source - often impractical in production environments.

Challenge 3: Complex GPU State Management#

GPUs maintain complex runtime state across streams, contexts, and memory hierarchies. Device proxies must reverse-engineer and replay this state, leading to reliability issues and non-deterministic behavior.

Challenge 4: Limited Ecosystem Support#

Solutions like Cricket work for simple workloads but break with real-world applications that use advanced features like CUDA graphs, multi-GPU communication, or complex memory management patterns.

The Emerging Solution: CRIUgpu#

The breakthrough came in 2025 with CRIUgpu, a research project that integrates NVIDIA's cuda-checkpoint with CRIU to achieve fully transparent GPU container checkpointing. Unlike previous approaches that rely on API interception, CRIUgpu creates unified CPU-GPU snapshots without performance overhead. It operates at the CUDA runtime level, capturing GPU memory and context state.

How CRIUgpu Works#

CRIUgpu leverages NVIDIA's cuda-checkpoint utility integrated with CRIU plugins:

The process:

  1. Lock: CUDA APIs are locked, preventing new GPU operations
  2. Complete: Active GPU work finishes (with configurable timeout)
  3. Checkpoint: GPU memory copied to host, unified with CPU state
  4. Release: GPU resources released, container becomes CPU-only

Restore process:

  1. Acquire: GPU resources re-acquired
  2. Restore: GPU memory and contexts restored at original addresses
  3. Unlock: CUDA APIs unlocked, application resumes
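
The step ordering above is strict: a process can't be dumped until GPU work has drained, and can't resume until memory is back at its original addresses. A tiny state-machine sketch of the protocol (class and method names are illustrative, not the CRIUgpu API):

```python
class GpuCheckpointProtocol:
    """Toy model of the lock/complete/checkpoint/release sequence."""

    def __init__(self):
        self.state = "running"
        self.steps = []

    def _advance(self, step, allowed_state, next_state):
        if self.state != allowed_state:
            raise RuntimeError(f"cannot {step} while {self.state}")
        self.steps.append(step)
        self.state = next_state

    def checkpoint(self):
        self._advance("lock", "running", "locked")        # block new CUDA calls
        self._advance("complete", "locked", "drained")    # in-flight work finishes
        self._advance("checkpoint", "drained", "copied")  # device memory -> host
        self._advance("release", "copied", "cpu-only")    # free GPU resources

    def restore(self):
        self._advance("acquire", "cpu-only", "acquired")  # re-acquire GPUs
        self._advance("restore", "acquired", "restored")  # memory at original addrs
        self._advance("unlock", "restored", "running")    # application resumes

p = GpuCheckpointProtocol()
p.checkpoint()   # container is now CPU-only and safe for CRIU to dump
p.restore()
print(p.state)   # → running
```

Calling `restore()` on a running process, or `checkpoint()` twice, raises immediately; the ordering invariant is what makes the restore deterministic.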

Key advantages:

  • No API interception overhead
  • Works with statically linked applications
  • Supports both CUDA and ROCm
  • Unified CPU-GPU snapshots
  • Deterministic restore behavior

What gets captured:

  • Device memory contents (copied to host during checkpoint)
  • CUDA contexts, streams, and events
  • GPU memory mappings (restored at original addresses)
  • CUDA driver state

Production Performance Results#

Recent research demonstrates CRIUgpu's production readiness with large-scale workloads:

Large Language Models#

LLaMA 3.1 (8B parameters) on H100:

  • Checkpoint time: 77 seconds
  • Restore time: 39 seconds
  • Checkpoint size: 56GB (97% GPU memory)

GPT-2 XL (1.5B parameters) on A100:

  • Checkpoint time: 131 seconds
  • Restore time: 145 seconds
  • Checkpoint size: 60GB (96% GPU memory)

Multi-GPU Scaling#

CRIUgpu scales linearly with GPU count:

  • 1x A100: 13 seconds checkpoint, 8 seconds restore
  • 2x A100: 26 seconds checkpoint, 17 seconds restore
  • 4x A100: 55 seconds checkpoint, 35 seconds restore
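
Quick arithmetic on those figures shows what "linear" means here: the per-GPU cost stays roughly constant as GPUs are added.

```python
# Scaling figures quoted above: gpus -> (checkpoint_s, restore_s)
results = {1: (13, 8), 2: (26, 17), 4: (55, 35)}

# Per-GPU checkpoint time barely moves as GPU count grows
per_gpu = {n: ckpt / n for n, (ckpt, _) in results.items()}
print(per_gpu)  # → {1: 13.0, 2: 13.0, 4: 13.75}
```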

Zero Runtime Overhead#

Unlike API interception approaches, CRIUgpu introduces no steady-state performance overhead. Applications run at native speed until checkpoint/restore operations.

Container Runtime Integration#

Custom runtime hooks can coordinate GPU and CPU state by driving cuda-checkpoint and CRIU in sequence during container checkpoint and restore.
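
A hypothetical hook along those lines: toggle GPU state into host memory with NVIDIA's cuda-checkpoint, then let CRIU dump the now CPU-only process tree. The command layout is a sketch; verify the flags against the cuda-checkpoint and CRIU versions you have installed.

```python
import subprocess

def build_checkpoint_commands(pid: int, images_dir: str) -> list[list[str]]:
    return [
        # Suspend CUDA state and copy device memory to the host
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Dump the full process tree with CRIU
        ["criu", "dump", "--tree", str(pid),
         "--images-dir", images_dir, "--shell-job"],
    ]

def run_hook(pid: int, images_dir: str, runner=subprocess.run) -> None:
    """Run each step, aborting the sequence if one fails."""
    for cmd in build_checkpoint_commands(pid, images_dir):
        runner(cmd, check=True)

print(build_checkpoint_commands(1234, "/tmp/ckpt")[0])
# → ['cuda-checkpoint', '--toggle', '--pid', '1234']
```

The restore side mirrors this: `criu restore` first, then a second cuda-checkpoint toggle to move memory back onto the GPU.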

Production Challenges#

Memory Transfer Overhead#

GPU memory dumps are massive, but specific timing depends on:

  • GPU memory size and utilization
  • Storage I/O bandwidth
  • Network transfer for cross-node migration
  • Memory access patterns during dump

Performance characteristics need measurement in your specific environment.
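
As a starting point before measuring, checkpoint duration is bounded below by device-memory size over the slowest link in the path. A back-of-envelope estimator (the bandwidth figures are illustrative, not measured):

```python
def estimate_dump_seconds(gpu_mem_gb: float, *bandwidths_gbps: float) -> float:
    """Lower bound: memory size divided by the slowest transfer link."""
    return gpu_mem_gb / min(bandwidths_gbps)

# 56 GB of device memory (the LLaMA 3.1 figure above), assuming
# ~2 GB/s sustained NVMe writes and ~1.25 GB/s (10 GbE) network
# for a cross-node migration:
print(f"{estimate_dump_seconds(56, 2.0, 1.25):.1f}s")  # → 44.8s
```

That lower bound is the same order of magnitude as the 77-second measured checkpoint, with the gap covered by driver work and memory copies the simple model ignores.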

CUDA Version Compatibility#

Checkpoint images are tied to the driver stack that produced them, so pin driver and CUDA versions across any hosts involved in a migration; the driver version requirements listed below apply on both the checkpoint and restore side.

Multi-GPU Topology Preservation#

Complex topologies don't restore cleanly. CRIUgpu expects the restore host to present the same GPU topology as the checkpoint source: matching device type, count, and memory size.
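
A pre-restore validation sketch of that constraint (`GpuInfo` and the comparison rule are illustrative, not a CRIUgpu interface):

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class GpuInfo:
    name: str
    memory_gb: int

def topology_matches(source: list[GpuInfo], target: list[GpuInfo]) -> bool:
    """True if the target host can restore a checkpoint taken on source:
    same device types, same count, same memory sizes."""
    return sorted(source) == sorted(target)

src = [GpuInfo("A100", 80)] * 4
print(topology_matches(src, [GpuInfo("A100", 80)] * 4))  # → True
print(topology_matches(src, [GpuInfo("A100", 40)] * 4))  # → False
print(topology_matches(src, [GpuInfo("A100", 80)] * 2))  # → False
```

Running a check like this before scheduling a restore fails fast on mismatched hosts instead of failing mid-restore.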

Current Limitations and Requirements#

Hardware Requirements:

  • Display driver 570+ (full feature set)
  • Display driver 550+ (basic functionality)
  • Linux x86_64 only
  • Same GPU topology for restore (type, count, memory size)

Current Limitations:

  • No UVM (Unified Virtual Memory) support
  • No GPU-to-GPU migration between different hardware
  • No NCCL support for multi-node distributed training
  • Multi-node checkpointing requires additional coordination

Container Integration:

  • Requires CRIU 4.0+
  • Podman support available
  • Container Device Interface (CDI) integration
  • NVIDIA Container Toolkit compatibility

Real-World Readiness#

CRIUgpu has been integrated into the upstream CRIU project (version 4.0+) and is available for production use. The technology has moved beyond research prototypes to provide a robust foundation for GPU container checkpointing in enterprise environments.

Container runtimes like Podman already support CRIUgpu through native CRIU integration, enabling transparent GPU container checkpointing without additional tooling or infrastructure changes.

When to Consider GPU Checkpointing#

Strong candidates:

  • Long-running ML training jobs
  • Inference services with expensive model loading
  • Multi-tenant GPU sharing with SLA requirements
  • Research workloads with checkpoint/restart patterns

Avoid for now:

  • Latency-sensitive real-time inference
  • Simple stateless GPU applications
  • Production systems requiring reliability guarantees

Building Toward Production#

If you're planning GPU checkpoint/restore:

  1. Start with application-level approaches
  2. Prototype with cuda-checkpoint in development
  3. Measure performance overhead carefully
  4. Plan for manual orchestration initially
  5. Monitor NVIDIA's roadmap for production-ready features

Conclusion#

GPU container checkpointing has evolved from experimental research to production-ready technology. CRIUgpu's breakthrough approach eliminates the fundamental flaws of API interception while delivering true zero-downtime GPU workload migration.

The technology is no longer "coming soon" - it's here, with production deployments already demonstrating its value. For organizations running GPU-intensive workloads, CRIUgpu offers:

  • Transparent checkpointing without application changes
  • Zero runtime overhead during normal operation
  • Unified CPU-GPU snapshots for complete state preservation
  • Linear scaling across multiple GPUs
  • Deterministic restore behavior

The business case is compelling: GPU resources are too expensive to waste on unnecessary restarts, and the technology now exists to eliminate them entirely. Early adopters are already gaining competitive advantages through more efficient GPU utilization and true zero-downtime operations.

For platform teams managing GPU infrastructure, the question is no longer whether to adopt GPU checkpointing, but how quickly you can integrate CRIUgpu into your container orchestration pipeline.
