Kubernetes and GPU: The Complete Guide to Running AI/ML Workloads at Scale
A comprehensive deep dive into GPU orchestration in Kubernetes — from device plugins and the GPU Operator to advanced sharing strategies like MIG, MPS, and time-slicing. Learn how to schedule, monitor, and optimize GPU workloads for AI/ML at scale.
1. GPU as the Engine of AI
Graphics Processing Units (GPUs) have become the cornerstone of modern AI and machine learning infrastructure. What started as specialized hardware for rendering graphics has evolved into the computational backbone powering everything from large language models to autonomous vehicles.
Kubernetes, the de facto standard for container orchestration, wasn't originally designed with GPUs in mind. It was built for CPU-centric workloads with predictable, preemptive scheduling. But as AI workloads exploded, Kubernetes had to adapt — and adapt it did.
According to the 2024 State of Production Kubernetes research by Spectro Cloud, over two-thirds of organizations consider Kubernetes key to taking full advantage of AI. The vast majority are either already running AI workloads in production on Kubernetes or plan to within the year.
This guide provides a comprehensive deep dive into running GPU workloads on Kubernetes — from understanding why GPUs are fundamentally different from CPUs, through device plugins and operators, to advanced scheduling strategies, GPU sharing techniques, and production best practices.
2. Why GPUs Are Different: The Scheduling Challenge
Before diving into Kubernetes specifics, it's essential to understand why GPUs present unique challenges that the traditional Kubernetes scheduler wasn't designed to handle.
2.1 GPUs Bypass the Linux Kernel
Unlike CPU workloads, GPU workloads largely bypass the Linux kernel's resource management. They don't obey cgroups, namespaces, or the standard Linux scheduler. GPUs run their own show through proprietary drivers and opaque memory management.
This has profound implications:
- No native resource isolation: The kernel can't enforce memory or compute limits on GPU processes the way it does for CPU/memory via cgroups.
- No preemption: The kernel can't preempt a GPU workload the way it can a CPU process. Once a CUDA kernel launches, it runs to completion.
- No native sharing: By default, GPUs are monolithic resources — you either have the whole GPU or nothing.
2.2 GPUs as Non-Overcommittable Resources
In Kubernetes, CPUs can be overcommitted — you can allocate more CPU requests than physically exist, relying on time-sharing. Memory can also be overcommitted in some scenarios.
GPUs are different. By default in Kubernetes, GPUs cannot be overcommitted, and workloads cannot request fractions of a GPU. If your node has 4 GPUs, you can run at most 4 pods requesting 1 GPU each — even if those pods only use 10% of each GPU's capacity.
This fundamental constraint is why GPU sharing strategies (MIG, MPS, time-slicing) have become so important for efficient utilization.
2.3 The GPU Memory Challenge
GPU memory (VRAM) is a precious, finite resource. Unlike system RAM, GPU memory:
- Cannot be swapped to disk
- Is shared between all processes on a GPU by default
- Results in immediate CUDA Out-Of-Memory errors when exhausted
A single runaway process can consume all GPU memory, starving other workloads and potentially crashing applications without warning.
3. The Kubernetes Device Plugin Framework
Kubernetes introduced the Device Plugin framework to address hardware accelerators like GPUs. This framework provides a standardized way to advertise and allocate specialized hardware resources to containers.
3.1 How Device Plugins Work
Device plugins run as DaemonSets on GPU-enabled nodes. They communicate with the kubelet through a gRPC interface to:
- Advertise Resources: Tell Kubernetes what devices are available (e.g., "this node has 4 nvidia.com/gpu devices")
- Allocate Resources: When a pod requests a GPU, provide the necessary device files and environment variables
- Monitor Health: Report device health status and handle failures
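Once a device plugin registers with the kubelet, the advertised devices appear in the node's capacity and allocatable fields. A trimmed node object might look like this (the node name, counts, and values are illustrative):

status:
  capacity:
    cpu: "64"
    memory: 512Gi
    nvidia.com/gpu: "4"      # advertised by the NVIDIA device plugin
  allocatable:
    cpu: "63"
    memory: 500Gi
    nvidia.com/gpu: "4"      # what the kube-scheduler can actually allocate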
3.2 The Container Device Interface (CDI)
To address the complexity of vendor-specific integrations, the community introduced the Container Device Interface (CDI). CDI standardizes how container runtimes interact with device plugins, decoupling device configuration from runtime-specific code.
With CDI, vendors can describe their devices in a JSON specification, and any CDI-compliant runtime (containerd, CRI-O) can consume them without vendor-specific modifications.
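As a rough sketch of what this looks like in practice, a CDI spec for a single GPU might resemble the following. It is deliberately simplified; real specs generated by the NVIDIA Container Toolkit list many more device nodes, mounts, and hooks, and the exact paths here are illustrative:

cdiVersion: "0.6.0"
kind: nvidia.com/gpu
devices:
- name: "0"                      # referenced by containers as nvidia.com/gpu=0
  containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
containerEdits:                  # edits applied to every container using this vendor's devices
  deviceNodes:
  - path: /dev/nvidiactl
  - path: /dev/nvidia-uvm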
3.3 GPU Resource Syntax in Kubernetes
Unlike CPU and memory, GPUs have unique resource request syntax:
- GPUs are specified under limits; Kubernetes automatically uses the limit as the request
- Requests and limits must be equal if both are specified
- GPUs must be integers — you cannot request 0.5 GPUs (without sharing strategies)
- You cannot specify GPU requests without specifying limits
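A minimal container-spec fragment that follows these rules looks like this:

resources:
  limits:
    nvidia.com/gpu: 1    # whole number; also treated as the request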
4. NVIDIA Device Plugin vs. GPU Operator
NVIDIA provides two approaches for GPU enablement in Kubernetes: the Device Plugin alone, or the comprehensive GPU Operator. Understanding when to use each is a fundamental architectural decision.
4.1 NVIDIA Device Plugin
The NVIDIA Device Plugin is a DaemonSet that exposes NVIDIA GPUs to the Kubernetes scheduler. It's lightweight and focused:
- What it does: Discovers GPUs on nodes, advertises them as nvidia.com/gpu resources, allocates them to pods
- Prerequisites: NVIDIA drivers must be pre-installed on nodes, nvidia-container-toolkit configured
- Best for: Managed cloud environments where nodes come with pre-installed GPU drivers (EKS, GKE, AKS GPU node pools)
4.2 NVIDIA GPU Operator
The NVIDIA GPU Operator is a comprehensive solution that automates the entire GPU software stack deployment. It doesn't replace the Device Plugin — it manages and deploys it as one of many components.
Components managed by the GPU Operator:
- NVIDIA Driver: Containerized driver installation — no host modification needed
- NVIDIA Container Toolkit: Enables GPU access from containers
- NVIDIA Device Plugin: Exposes GPUs to Kubernetes
- GPU Feature Discovery (GFD): Automatically labels nodes with GPU properties (model, memory, MIG capability)
- DCGM Exporter: Exposes GPU metrics to Prometheus
- MIG Manager: Manages Multi-Instance GPU partitioning
- Validator: Verifies all components are working correctly
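In practice the operator is driven by a ClusterPolicy custom resource that toggles each of these components. The fragment below is an illustrative sketch; exact field names can vary between operator versions, so treat it as a starting point rather than a reference:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: true          # containerized driver; set false if drivers are preinstalled on hosts
  toolkit:
    enabled: true          # NVIDIA Container Toolkit
  devicePlugin:
    enabled: true
  gfd:
    enabled: true          # GPU Feature Discovery labels
  dcgmExporter:
    enabled: true          # Prometheus metrics
  migManager:
    enabled: true          # only meaningful on MIG-capable GPUs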
4.3 Decision Matrix
Use Device Plugin alone when:
- Nodes come with pre-installed, managed GPU drivers
- You have a simple, homogeneous GPU environment
- Minimal operational overhead is the priority
Use GPU Operator when:
- Managing heterogeneous GPU fleets across environments
- You need MIG, time-slicing, or vGPU support
- Automated driver lifecycle management is required
- GPU monitoring and observability are priorities
- Running on bare metal or self-managed infrastructure
5. GPU Scheduling Mechanics in Kubernetes
5.1 Basic GPU Scheduling
When a pod requests a GPU, the scheduling and allocation flow is:
- The kube-scheduler filters nodes to find those with available nvidia.com/gpu resources
- It scores the remaining nodes based on standard criteria (affinity, taints, etc.)
- It selects the best node and binds the pod
- The kubelet calls the device plugin to allocate specific GPU(s)
- The container runtime (containerd/CRI-O) launches the container with GPU access
5.2 GPU Feature Discovery Labels
GPU Feature Discovery (GFD) automatically labels nodes with detailed GPU information, enabling precise workload placement:
nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
nvidia.com/gpu.memory=81920
nvidia.com/gpu.count=8
nvidia.com/gpu.family=ampere
nvidia.com/mig.capable=true
nvidia.com/cuda.driver.major=12

These labels enable node selectors and affinity rules for workload-specific GPU requirements.
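For example, a pod can require an Ampere-class GPU with more than roughly 40 GB of memory using node affinity on these labels. This is an illustrative pod-spec fragment:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nvidia.com/gpu.family
          operator: In
          values: ["ampere"]
        - key: nvidia.com/gpu.memory
          operator: Gt
          values: ["40000"]    # label value is in MiB, compared numerically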
5.3 Taints and Tolerations for GPU Nodes
GPU nodes are expensive. Best practice is to taint them to prevent non-GPU workloads from being scheduled:
# Taint GPU nodes
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

GPU workloads must then include a matching toleration to be scheduled on these nodes.
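A pod-spec fragment with the matching toleration (the key and effect mirror the taint above; the same pattern appears in the full examples in section 10):

tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule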
6. GPU Sharing Strategies: MIG, MPS, Time-Slicing, and vGPU
The inability to share GPUs natively in Kubernetes leads to massive underutilization. Industry surveys show that most organizations struggle not with GPU scarcity but with poor GPU utilization caused by limited multi-tenancy capabilities.
Four main strategies exist to share GPUs, each with distinct tradeoffs:
6.1 Multi-Instance GPU (MIG)
MIG is a hardware-based partitioning feature available on NVIDIA Ampere and newer data-center GPUs (A100, A30, H100, H200). It physically divides a GPU into up to 7 isolated instances, each with:
- Dedicated compute cores
- Dedicated memory bandwidth
- Dedicated L2 cache
- Hardware fault isolation
MIG Profiles (A100-80GB example):
- 1g.10gb — 1/7 of GPU compute, 10GB memory (up to 7 instances)
- 2g.20gb — 2/7 of GPU compute, 20GB memory (up to 3 instances)
- 3g.40gb — 3/7 of GPU compute, 40GB memory (up to 2 instances)
- 7g.80gb — Full GPU
Pros: True hardware isolation, guaranteed resources, fault separation, QoS guarantees
Cons: Only supported on Ampere+, fixed partition sizes, static configuration, some memory overhead
Best for: Multi-tenant environments, SLA-bound inference, workloads requiring predictable performance
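When the GPU Operator runs MIG in its mixed strategy, each profile is advertised as its own extended resource, and a pod requests a slice much like it would request a whole GPU. A minimal fragment, assuming 1g.10gb instances have already been configured on the node:

resources:
  limits:
    nvidia.com/mig-1g.10gb: 1    # one hardware-isolated 1g.10gb slice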
6.2 Multi-Process Service (MPS)
MPS is a software-based sharing mechanism that allows multiple CUDA processes to share a GPU through a unified server process. Instead of context-switching between processes, MPS interleaves their work.
How it works:
- A control daemon sets the GPU to EXCLUSIVE_PROCESS mode
- An MPS server acts as a proxy between CUDA applications and the GPU
- Multiple clients share the server's scheduling resources
- Context-switch overhead is eliminated
Pros: Higher throughput than time-slicing, memory limit enforcement, works on most NVIDIA GPUs
Cons: No fault isolation (one crash affects all), shared address space, not compatible with MIG
Best for: Trusted internal workloads, inference serving, latency-tolerant batch processing
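Recent versions of the NVIDIA device plugin (v0.15+) can enable MPS through the same configuration mechanism used for time-slicing. The sketch below is a hedged example and should be checked against the plugin version you run:

version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 4        # each GPU appears as 4 resources backed by a shared MPS server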
6.3 Time-Slicing
Time-slicing is the simplest sharing strategy. The GPU rapidly context-switches between processes, giving each a time slice of execution. This is temporal sharing — processes take turns using the GPU.
How it works in Kubernetes:
- Configure the device plugin to advertise "replicas" of each GPU
- A GPU with 4 replicas appears as 4 schedulable resources
- Pods requesting 1 replica get shared access to the underlying GPU
- The GPU's hardware scheduler time-slices between workloads
Pros: Works on all Pascal+ GPUs, simple to configure, flexible oversubscription
Cons: No memory isolation, no fault isolation, context-switch overhead, unpredictable latency
Best for: Development environments, light inference, bursty workloads, cost optimization
6.4 Virtual GPU (vGPU)
NVIDIA vGPU enables multiple virtual machines to share a physical GPU with hardware-accelerated graphics and compute. Each VM sees a virtualized GPU slice.
Requirements: Supported hypervisor (VMware, KVM, Hyper-V), NVIDIA AI Enterprise license, compatible GPU
Best for: VDI, VM-based AI workloads, enterprises with existing virtualization infrastructure
6.5 Comparison Matrix
Isolation: MIG > vGPU > MPS > Time-Slicing
Flexibility: Time-Slicing > MPS > vGPU > MIG
GPU Compatibility: Time-Slicing > MPS > vGPU > MIG (Ampere+ only)
Best Practice: Use MIG for production inference with SLAs. Use time-slicing for development and light workloads. Combine MIG + time-slicing for maximum density.
7. Gang Scheduling with Kueue and Volcano
Distributed training jobs (PyTorch DDP, Horovod, MPI) require gang scheduling — all workers must start together or not at all. The default Kubernetes scheduler doesn't support this.
7.1 The Gang Scheduling Problem
Imagine a 4-GPU training job that needs 4 pods. If Kubernetes schedules 3 pods but can't find resources for the 4th, you have:
- 3 GPUs allocated but idle (waiting for the 4th)
- Potential deadlock if multiple jobs partially schedule
- Wasted resources and blocked pipelines
7.2 Kueue
Kueue is a Kubernetes-native job queueing system designed for batch and AI workloads. It provides:
- Queue-based admission: Jobs wait in queues until resources are available
- Gang semantics: Entire workloads are admitted or queued atomically
- Resource quotas: Enforce per-team or per-project GPU limits
- Preemption: Higher-priority jobs can preempt lower-priority ones
- Fair sharing: Cohorts allow queues to borrow idle quota from peers
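A minimal sketch of the Kueue objects involved, assuming the v1beta1 API and a ResourceFlavor named gpu-a100 created separately; the names and quotas here are illustrative:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-cluster-queue
spec:
  namespaceSelector: {}               # admit workloads from any namespace with a matching LocalQueue
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: gpu-a100                  # assumes this ResourceFlavor exists
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ml-team-queue
  namespace: ml-team
spec:
  clusterQueue: ml-cluster-queue

Jobs then opt in with the kueue.x-k8s.io/queue-name label and are held until the ClusterQueue can admit them in full.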
7.3 Volcano
Volcano is a CNCF project providing advanced batch scheduling for Kubernetes. It offers:
- PodGroup: Define a group of pods that must be scheduled together
- minMember: Specify minimum pods required to start
- Queue: Hierarchical queuing with weights and priorities
- Actions: Enqueue, allocate, preempt, backfill
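As an illustrative sketch, a PodGroup for a 4-worker training job might look like the following; the worker pods reference it via the scheduling.k8s.io/group-name annotation and set schedulerName: volcano:

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pytorch-ddp
  namespace: ml-team
spec:
  minMember: 4                 # no pod starts until all 4 can be placed
  minResources:
    nvidia.com/gpu: "4"
  queue: default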
7.4 When to Use Which
Use Kueue when: You want Kubernetes-native integration, simple queue management, and compatibility with standard job controllers
Use Volcano when: You need strict gang semantics, complex scheduling policies, or integration with MPI/PyTorch/TensorFlow operators
Use both: Kueue for admission control and quota, Volcano for strict gang scheduling
8. Monitoring GPU Workloads
8.1 DCGM Exporter
The NVIDIA Data Center GPU Manager (DCGM) Exporter collects GPU metrics and exposes them to Prometheus. Key metrics include:
- DCGM_FI_DEV_GPU_UTIL: GPU utilization percentage
- DCGM_FI_DEV_MEM_COPY_UTIL: Memory bandwidth utilization
- DCGM_FI_DEV_FB_USED: GPU memory used (bytes)
- DCGM_FI_DEV_FB_FREE: GPU memory free (bytes)
- DCGM_FI_DEV_GPU_TEMP: GPU temperature
- DCGM_FI_DEV_POWER_USAGE: Power consumption (watts)
- DCGM_FI_DEV_SM_CLOCK: Streaming multiprocessor clock speed
8.2 Key Alerts to Configure
# High GPU Memory Usage
- alert: HighGPUMemoryUsage
  expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.9
  for: 5m

# Low GPU Utilization (wasted resources)
- alert: LowGPUUtilization
  expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 20
  for: 2h

# GPU Temperature Warning
- alert: HighGPUTemperature
  expr: DCGM_FI_DEV_GPU_TEMP > 85
  for: 5m
9. Best Practices for GPU in Kubernetes
9.1 Use NVIDIA Base Images
Always use official NVIDIA base images (nvcr.io/nvidia/cuda, nvcr.io/nvidia/pytorch, etc.) that include properly configured CUDA libraries and dependencies.
9.2 Taint GPU Nodes
Apply taints to GPU nodes to prevent non-GPU workloads from consuming these expensive resources. Use tolerations only on pods that actually need GPUs.
9.3 Right-Size GPU Resources
Match GPU type to workload requirements. Don't use an A100 for inference that would run fine on a T4. Consider GPU sharing for smaller workloads.
9.4 Implement Resource Quotas
Use ResourceQuotas to limit GPU consumption per namespace, preventing any single team from monopolizing GPU resources.
9.5 Enable GPU Sharing for Development
Use time-slicing or MIG for development environments where isolation isn't critical but cost efficiency matters.
9.6 Use Priority Classes
Define PriorityClasses to ensure production inference workloads take precedence over experimental training jobs.
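For instance, a pair of PriorityClasses can separate production inference from experimental training; pods opt in via priorityClassName, and the values below are arbitrary:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-inference
value: 100000                    # higher value wins during preemption
globalDefault: false
description: "Production inference; may preempt lower-priority training pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: experimental-training
value: 1000
globalDefault: false
description: "Best-effort training jobs"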
9.7 Implement Checkpointing
For long-running training jobs, implement checkpointing to survive preemption, spot instance termination, and failures.
9.8 Monitor Utilization Continuously
Track GPU utilization, memory usage, and temperature. Alert on both over-utilization (thermal throttling) and under-utilization (wasted spend).
10. Practical Examples with YAML
10.1 Basic GPU Pod
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

10.2 Multi-GPU Training Job
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.01-py3
        command: ["torchrun", "--nproc_per_node=4", "train.py"]
        resources:
          limits:
            nvidia.com/gpu: 4
            memory: "64Gi"
            cpu: "16"
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

10.3 Time-Slicing ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4    # each physical GPU appears as 4 schedulable resources

10.4 GPU ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    requests.nvidia.com/gpu: "8"    # max 8 GPUs for this team
    limits.nvidia.com/gpu: "8"
    requests.memory: "256Gi"
    requests.cpu: "128"

11. Cost Optimization Strategies
11.1 Right-Size GPU Selection
Match GPU type to workload. Don't use H100s for inference that works on T4s. Common sizing:
- T4/L4: Inference, light training, development
- A10G: Medium inference, fine-tuning
- A100: Large model training, high-throughput inference
- H100: LLM training, transformer-heavy workloads
11.2 Spot/Preemptible Instances
Use spot instances for fault-tolerant training jobs with checkpointing. Combine with Kueue's preemption policies for graceful handling.
11.3 GPU Sharing for Low-Utilization Workloads
If monitoring shows GPUs at <30% utilization, implement MIG or time-slicing. A single A100 can serve 7 small inference workloads via MIG.
11.4 Cluster Autoscaling
Configure GPU node pools with autoscaling. Scale to zero when no GPU workloads are pending. Use Karpenter for fine-grained GPU instance selection.
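A rough sketch of a GPU-only Karpenter NodePool, assuming the v1beta1 API and the AWS provider's well-known labels; field names shift between Karpenter releases, so verify this against your version before use:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      nodeClassRef:
        name: default                                   # assumes an EC2NodeClass named "default"
      requirements:
      - key: karpenter.k8s.aws/instance-gpu-manufacturer
        operator: In
        values: ["nvidia"]
      taints:
      - key: nvidia.com/gpu
        value: present
        effect: NoSchedule
  limits:
    nvidia.com/gpu: "16"                                # cap total provisioned GPUs
  disruption:
    consolidationPolicy: WhenEmpty                      # scale GPU capacity back down when idle
    consolidateAfter: 5m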
12. Conclusion and Future Directions
Kubernetes has evolved from a CPU-centric container orchestrator to a capable platform for GPU-intensive AI/ML workloads. The ecosystem now provides:
- Stable GPU scheduling via device plugins and operators
- Multiple sharing strategies (MIG, MPS, time-slicing) for efficiency
- Gang scheduling via Kueue and Volcano for distributed training
- Rich observability through DCGM and Prometheus integration
Future developments to watch:
- Dynamic Resource Allocation (DRA): More flexible GPU allocation APIs in Kubernetes 1.30+
- NVIDIA KAI Scheduler: Open-sourced enterprise-grade GPU scheduler
- HAMi: Fractional GPU sharing at scale
- Predictive scheduling: AI-driven GPU allocation based on workload patterns
The organizations winning with AI at scale have made a cultural shift: they treat GPUs as a shared, policy-driven substrate governed by queues — not as pets hand-assigned to projects. With the right combination of operators, schedulers, sharing strategies, and monitoring, Kubernetes is becoming the operating system of the AI era.