The Complete Guide to Measuring and Fixing GPU Utilization in Kubernetes

GPU infrastructure is among the most expensive resources in modern cloud computing. Yet the average GPU utilization across Kubernetes clusters sits at just 15-25%. This guide covers how to measure GPU utilization accurately and implement strategies to improve it.
Why GPU Clusters Are Underutilized#
GPU underutilization stems from several systemic issues:
Coarse-Grained Allocation#
Kubernetes allocates GPUs as whole units. If a workload needs 20% of a GPU's compute capacity, it still gets an entire GPU. This all-or-nothing allocation model creates massive waste, especially for inference workloads that may only need a fraction of GPU resources.
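To make the constraint concrete: with the NVIDIA device plugin, GPUs are exposed as the extended resource nvidia.com/gpu, and extended resources can only be requested in whole integers. A minimal sketch of a pod spec (the image name is a placeholder):
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
  - name: server
    image: registry.example.com/inference:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1   # whole units only; a fractional value like 0.2 is rejected by the API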
Bursty Workload Patterns#
Training workloads alternate between compute-heavy forward/backward passes and I/O-heavy data loading. During data loading phases, GPUs sit idle. This bursty pattern means even "fully utilized" training jobs may only use GPUs 50-60% of the time.
Lack of Visibility#
Most teams rely on nvidia-smi for GPU monitoring, which only shows point-in-time snapshots on individual nodes. Without cluster-wide, time-series GPU metrics, teams cannot identify waste patterns or make data-driven optimization decisions.
Fear of Disruption#
GPU workloads (especially training) are often long-running and expensive to restart. Teams over-provision to avoid any risk of resource contention, preferring waste over the possibility of a failed training run.
Measuring GPU Utilization with DCGM#
NVIDIA Data Center GPU Manager (DCGM) provides comprehensive GPU monitoring that goes far beyond nvidia-smi.
Key Metrics to Track#
- DCGM_FI_DEV_GPU_UTIL — GPU utilization percentage: the fraction of time at least one kernel was executing (the same coarse metric nvidia-smi reports)
- DCGM_FI_DEV_MEM_COPY_UTIL — Memory bandwidth utilization
- DCGM_FI_DEV_FB_USED — Framebuffer memory used (MiB)
- DCGM_FI_DEV_FB_FREE — Framebuffer memory available (MiB)
- DCGM_FI_PROF_GR_ENGINE_ACTIVE — Graphics engine active time ratio
- DCGM_FI_PROF_DRAM_ACTIVE — Memory active time ratio
- DCGM_FI_PROF_PIPE_TENSOR_ACTIVE — Tensor core utilization (critical for AI workloads)
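Once these fields are scraped into Prometheus (the DaemonSet in the next section handles that), a handful of recording rules make fleet-wide trends easy to graph. This is a minimal sketch; the rule names are arbitrary and the Hostname label assumes dcgm-exporter's default labeling:
groups:
- name: gpu-utilization-recording
  rules:
  # Cluster-wide average GPU utilization (DCGM reports this field as 0-100)
  - record: cluster:gpu_util:avg
    expr: avg(DCGM_FI_DEV_GPU_UTIL)
  # Average tensor-core activity per node (0-1 ratio)
  - record: node:gpu_tensor_active:avg
    expr: avg by (Hostname) (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)
  # Fraction of framebuffer memory actually in use, per GPU
  - record: gpu:fb_used:ratio
    expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)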
Kubernetes Integration#
Deploy DCGM Exporter as a DaemonSet to expose GPU metrics as Prometheus endpoints:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter    # must match the selector above
    spec:
      # In practice, add a nodeSelector or affinity so this only lands on GPU nodes,
      # and pin the image to a specific release rather than :latest.
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:latest
        ports:
        - name: metrics
          containerPort: 9400
Multi-Dimensional Analysis#
True GPU utilization requires looking at multiple metrics simultaneously:
- Compute utilization (SM activity) tells you if the GPU cores are busy
- Memory utilization tells you if you're bottlenecked on data transfer
- Tensor core utilization tells you if AI workloads are using the specialized hardware
- Power consumption serves as a proxy for how much computational work is actually being done
A GPU can report 90% utilization while its tensor cores sit idle, meaning your AI workload isn't actually using the GPU's most valuable capability.
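One way to surface that pattern is an alert that fires when a GPU looks busy but its tensor pipes are not. A sketch over dcgm-exporter metrics in Prometheus (note the unit mismatch: DCGM_FI_DEV_GPU_UTIL is 0-100, while the PROF ratios are 0-1):
groups:
- name: gpu-efficiency
  rules:
  - alert: GPUBusyButTensorCoresIdle
    expr: |
      avg by (gpu, Hostname) (DCGM_FI_DEV_GPU_UTIL) > 80
        and
      avg by (gpu, Hostname) (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE) < 0.05
    for: 30m
    labels:
      severity: info
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} is busy but its tensor cores are idle"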
Optimization Strategies#
Checkpoint/Restore for Training#
Implement application-level checkpointing that saves training state to persistent storage at regular intervals. When combined with preemptible/spot instances, this allows you to:
- Use spot GPUs at 60-70% cost savings
- Resume training from the last checkpoint if interrupted
- Scale training across time periods when GPU prices are lowest
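The checkpointing itself lives in your training code; on the Kubernetes side, a Job that tolerates spot capacity and mounts durable storage for checkpoints is enough to make interruptions survivable. A sketch, where the image, flags, PVC name, and spot taint are placeholders for whatever your cloud and trainer actually use:
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
spec:
  backoffLimit: 10               # each retry resumes from the last checkpoint
  template:
    spec:
      restartPolicy: OnFailure
      tolerations:
      - key: spot                # placeholder; use your provider's spot/preemptible taint
        operator: Exists
        effect: NoSchedule
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest     # placeholder image
        args: ["--resume-from", "/checkpoints/latest"] # hypothetical flag in the training script
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: checkpoints
          mountPath: /checkpoints
      volumes:
      - name: checkpoints
        persistentVolumeClaim:
          claimName: training-checkpoints              # placeholder PVC on durable storage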
Right-Sizing GPU Memory#
Many inference workloads request GPUs with 80GB memory when they only use 8GB. Audit actual memory usage across your fleet and match workloads to appropriate GPU types:
- A100 80GB — Large model training, multi-model inference
- A100 40GB — Medium model training, single large model inference
- A10G 24GB — Small to medium model inference
- T4 16GB — Cost-effective inference for smaller models
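After the audit, pin workloads to the cheapest GPU that fits. One way to do this is a nodeSelector on the product label applied by NVIDIA GPU Feature Discovery; the exact label value depends on your hardware and driver, and the image is a placeholder:
apiVersion: v1
kind: Pod
metadata:
  name: small-model-inference
spec:
  nodeSelector:
    nvidia.com/gpu.product: Tesla-T4    # set by GPU Feature Discovery; value varies by GPU model
  containers:
  - name: inference
    image: registry.example.com/inference:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1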
NVIDIA Multi-Instance GPU (MIG)#
MIG allows a single A100 GPU to be partitioned into as many as seven isolated GPU instances, each with dedicated compute, memory, and bandwidth. Common layouts on the 40GB A100 include:
- 7 x 1g.5gb — Seven small instances for lightweight inference
- 3 x 2g.10gb — Three medium instances
- 2 x 3g.20gb — Two larger instances
- 1 x 7g.40gb — Single full instance
MIG is ideal for inference workloads where each request needs a fraction of GPU resources.
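Once nodes are partitioned (for example via the GPU Operator's MIG manager) and the device plugin runs with the mixed MIG strategy, each profile shows up as its own extended resource and pods request a slice instead of a whole card. A sketch, with a placeholder image:
apiVersion: v1
kind: Pod
metadata:
  name: light-inference
spec:
  containers:
  - name: inference
    image: registry.example.com/inference:latest   # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one 1g.5gb slice instead of a full A100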
Time-Based Autoscaling#
GPU workloads often follow predictable patterns. Training runs start in the morning and complete overnight. Inference traffic follows business hours. Implement time-based scaling policies that:
- Scale down GPU nodes during off-hours (nights, weekends)
- Pre-warm GPU capacity before expected demand
- Use spot instances for batch training during off-peak periods
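For workloads that scale at the pod level, KEDA's cron scaler is one simple way to express such a policy (a sketch, assuming KEDA is installed; the Deployment name and schedule are placeholders). Pair it with the cluster autoscaler so emptied GPU nodes are actually removed:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-business-hours
spec:
  scaleTargetRef:
    name: inference-deployment     # placeholder Deployment
  minReplicaCount: 0               # scale GPU pods to zero outside the window
  triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: 0 8 * * 1-5           # scale up at 08:00 on weekdays
      end: 0 20 * * 1-5            # scale back down at 20:00
      desiredReplicas: "4"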
GPU Sharing and Virtualization#
For inference workloads that don't need exclusive GPU access, consider:
- Time-slicing — Multiple pods share a GPU with time-multiplexed access
- MPS (Multi-Process Service) — CUDA contexts share a GPU with lower overhead
- vGPU — NVIDIA's virtual GPU software, which shares a physical GPU across virtual machines (requires separate NVIDIA licensing)
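As one example, the NVIDIA device plugin (standalone or via the GPU Operator) reads a time-slicing config like the one below and then advertises each physical GPU as several schedulable nvidia.com/gpu resources. Keep in mind that time-slicing provides no memory or fault isolation between the pods sharing a card. How the ConfigMap gets referenced (default config vs. per-node label) depends on your installation, so treat this as a sketch:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-                 # key name is referenced by the device plugin / GPU Operator config
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4     # each physical GPU is advertised as 4 shareable GPUs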
Implementation Phases#
Phase 1: Visibility (Week 1-2)#
Deploy DCGM Exporter across all GPU nodes. Set up Grafana dashboards showing per-GPU, per-namespace, and per-workload utilization. Establish baselines.
Phase 2: Quick Wins (Week 3-4)#
Right-size GPU memory allocations based on actual usage data. Identify and terminate idle GPU workloads. Implement time-based scaling for development GPU clusters.
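"Idle" is easy to define once DCGM data is flowing: a GPU that has reported near-zero utilization for hours is a candidate for reclamation. A sketch of a Prometheus alert for this:
groups:
- name: gpu-idle
  rules:
  - alert: GPUIdleForHours
    expr: max_over_time(DCGM_FI_DEV_GPU_UTIL[2h]) < 5
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} has stayed under 5% utilization for 2+ hours"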
Phase 3: Advanced Optimization (Month 2-3)#
Enable MIG for inference workloads. Implement checkpoint/restore for training on spot instances. Deploy GPU-aware bin packing.
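For the bin-packing piece, the kube-scheduler's NodeResourcesFit plugin can score nodes with a MostAllocated strategy so new GPU pods pack onto already-busy nodes and empty ones can be scaled away. A sketch of a second scheduler profile that pods opt into via schedulerName (the profile name and weights are placeholders):
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: gpu-binpack         # placeholder profile name
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated          # prefer nodes whose GPUs are already allocated
        resources:
        - name: nvidia.com/gpu
          weight: 5
        - name: cpu
          weight: 1
        - name: memory
          weight: 1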
Phase 4: Continuous Optimization (Ongoing)#
Monitor utilization trends. Iterate on MIG partition strategies. Expand spot instance usage for fault-tolerant workloads.
Get Started#
DevZero provides automated GPU optimization for Kubernetes clusters. Our platform identifies underutilized GPUs, recommends right-sizing changes, and implements optimizations without disrupting your AI workloads.
Start your free GPU assessment to see how much you could save.