Kubernetes and GPU: The Complete Guide to Running AI/ML Workloads at Scale

A comprehensive deep dive into GPU orchestration in Kubernetes — from device plugins and the GPU Operator to advanced sharing strategies like MIG, MPS, and time-slicing. Learn how to schedule, monitor, and optimize GPU workloads for AI/ML at scale.

1. GPU as the Engine of AI

Graphics Processing Units (GPUs) have become the cornerstone of modern AI and machine learning infrastructure. What started as specialized hardware for rendering graphics has evolved into the computational backbone powering everything from large language models to autonomous vehicles.

Kubernetes, the de facto standard for container orchestration, wasn't originally designed with GPUs in mind. It was built for CPU-centric workloads with predictable, preemptive scheduling. But as AI workloads exploded, Kubernetes had to adapt — and adapt it did.

According to the 2024 State of Production Kubernetes research by Spectro Cloud, over two-thirds of organizations consider Kubernetes key to taking full advantage of AI. The vast majority are either already running AI workloads in production on Kubernetes or plan to within the year.

This guide provides a comprehensive deep dive into running GPU workloads on Kubernetes — from understanding why GPUs are fundamentally different from CPUs, through device plugins and operators, to advanced scheduling strategies, GPU sharing techniques, and production best practices.

2. Why GPUs Are Different: The Scheduling Challenge

Before diving into Kubernetes specifics, it's essential to understand why GPUs present unique challenges that the traditional Kubernetes scheduler wasn't designed to handle.

2.1 GPUs Bypass the Linux Kernel

Unlike CPU workloads, GPU execution largely bypasses the Linux kernel's resource controls. GPU work doesn't obey cgroups, namespaces, or the standard Linux scheduler. GPUs run their own show through proprietary drivers and opaque memory management.

This has profound implications:

  • No native resource isolation: The kernel can't enforce memory or compute limits on GPU processes the way it does for CPU/memory via cgroups.
  • No preemption: The kernel can't preempt a GPU workload the way it can a CPU process. Once a CUDA kernel launches, it runs to completion.
  • No native sharing: By default, GPUs are monolithic resources — you either have the whole GPU or nothing.

2.2 GPUs as Non-Overcommittable Resources

In Kubernetes, CPUs can be overcommitted — you can allocate more CPU requests than physically exist, relying on time-sharing. Memory can also be overcommitted in some scenarios.

GPUs are different. By default in Kubernetes, GPUs cannot be overcommitted, and workloads cannot request fractions of a GPU. If your node has 4 GPUs, you can run at most 4 pods requesting 1 GPU each — even if those pods only use 10% of each GPU's capacity.

This fundamental constraint is why GPU sharing strategies (MIG, MPS, time-slicing) have become so important for efficient utilization.

2.3 The GPU Memory Challenge

GPU memory (VRAM) is a precious, finite resource. Unlike system RAM, GPU memory:

  • Cannot be swapped to disk
  • Is shared between all processes on a GPU by default
  • Results in immediate CUDA Out-Of-Memory errors when exhausted

A single runaway process can consume all GPU memory, starving other workloads and potentially crashing applications without warning.

3. The Kubernetes Device Plugin Framework

Kubernetes introduced the Device Plugin framework to address hardware accelerators like GPUs. This framework provides a standardized way to advertise and allocate specialized hardware resources to containers.

3.1 How Device Plugins Work

Device plugins run as DaemonSets on GPU-enabled nodes. They communicate with the kubelet through a gRPC interface to:

  1. Advertise Resources: Tell Kubernetes what devices are available (e.g., "this node has 4 nvidia.com/gpu devices")
  2. Allocate Resources: When a pod requests a GPU, provide the necessary device files and environment variables
  3. Monitor Health: Report device health status and handle failures
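
Once a device plugin registers, its devices appear on the node as extended resources. An illustrative excerpt of the node object on a 4-GPU node:

# Node status excerpt after the NVIDIA device plugin registers (values illustrative)
status:
  capacity:
    nvidia.com/gpu: "4"
  allocatable:
    nvidia.com/gpu: "4"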

3.2 The Container Device Interface (CDI)

To address the complexity of vendor-specific integrations, the community introduced the Container Device Interface (CDI). CDI standardizes how container runtimes interact with device plugins, decoupling device configuration from runtime-specific code.

With CDI, vendors can describe their devices in a JSON specification, and any CDI-compliant runtime (containerd, CRI-O) can consume them without vendor-specific modifications.
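
As a rough sketch, a CDI specification (shown here in YAML, which CDI accepts alongside JSON) lists the device nodes and other edits a runtime must apply. Exact fields depend on the CDI version, and in practice the NVIDIA Container Toolkit generates these files for you:

# Hypothetical CDI spec sketch; real specs are generated by the NVIDIA Container Toolkit
cdiVersion: "0.6.0"
kind: nvidia.com/gpu
devices:
- name: gpu0
  containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
containerEdits:
  deviceNodes:
  - path: /dev/nvidiactl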

3.3 GPU Resource Syntax in Kubernetes

Unlike CPU and memory, GPUs have unique resource request syntax:

  • GPUs must be specified in limits only — Kubernetes uses the limit as the request automatically
  • Requests and limits must be equal if both are specified
  • GPUs must be integers — you cannot request 0.5 GPUs (without sharing strategies)
  • You cannot specify GPU requests without specifying limits
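
Concretely, a container that needs one whole GPU declares it under limits only:

# Requesting a single GPU; Kubernetes copies the limit into the request
resources:
  limits:
    nvidia.com/gpu: 1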

4. NVIDIA Device Plugin vs. GPU Operator

NVIDIA provides two approaches for GPU enablement in Kubernetes: the Device Plugin alone, or the comprehensive GPU Operator. Understanding when to use each is a fundamental architectural decision.

4.1 NVIDIA Device Plugin

The NVIDIA Device Plugin is a DaemonSet that exposes NVIDIA GPUs to the Kubernetes scheduler. It's lightweight and focused:

  • What it does: Discovers GPUs on nodes, advertises them as nvidia.com/gpu resources, allocates them to pods
  • Prerequisites: NVIDIA drivers must be pre-installed on nodes, nvidia-container-toolkit configured
  • Best for: Managed cloud environments where nodes come with pre-installed GPU drivers (EKS, GKE, AKS GPU node pools)

4.2 NVIDIA GPU Operator

The NVIDIA GPU Operator is a comprehensive solution that automates the entire GPU software stack deployment. It doesn't replace the Device Plugin — it manages and deploys it as one of many components.

Components managed by the GPU Operator:

  • NVIDIA Driver: Containerized driver installation — no host modification needed
  • NVIDIA Container Toolkit: Enables GPU access from containers
  • NVIDIA Device Plugin: Exposes GPUs to Kubernetes
  • GPU Feature Discovery (GFD): Automatically labels nodes with GPU properties (model, memory, MIG capability)
  • DCGM Exporter: Exposes GPU metrics to Prometheus
  • MIG Manager: Manages Multi-Instance GPU partitioning
  • Validator: Verifies all components are working correctly

4.3 Decision Matrix

Use Device Plugin alone when:

  • Nodes come with pre-installed, managed GPU drivers
  • You have a simple, homogeneous GPU environment
  • Minimal operational overhead is the priority

Use GPU Operator when:

  • Managing heterogeneous GPU fleets across environments
  • You need MIG, time-slicing, or vGPU support
  • Automated driver lifecycle management is required
  • GPU monitoring and observability are priorities
  • Running on bare metal or self-managed infrastructure

5. GPU Scheduling Mechanics in Kubernetes

5.1 Basic GPU Scheduling

When you request a GPU, the kube-scheduler:

  1. Filters nodes to find those with available nvidia.com/gpu resources
  2. Scores remaining nodes based on standard predicates (affinity, taints, etc.)
  3. Selects the best node and binds the pod
  4. kubelet calls the device plugin to allocate specific GPU(s)
  5. Container runtime (containerd/CRI-O) launches the container with GPU access

5.2 GPU Feature Discovery Labels

GPU Feature Discovery (GFD) automatically labels nodes with detailed GPU information, enabling precise workload placement:

nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
nvidia.com/gpu.memory=81920
nvidia.com/gpu.count=8
nvidia.com/gpu.family=ampere
nvidia.com/mig.capable=true
nvidia.com/cuda.driver.major=12

These labels enable node selectors and affinity rules for workload-specific GPU requirements.
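
For example, a pod that must run on an A100 node could pin itself with node affinity on the product label shown above (adjust the value to match your fleet):

# Node affinity requiring an A100-80GB node, using the GFD product label
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nvidia.com/gpu.product
          operator: In
          values:
          - NVIDIA-A100-SXM4-80GB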

5.3 Taints and Tolerations for GPU Nodes

GPU nodes are expensive. Best practice is to taint them to prevent non-GPU workloads from being scheduled:

# Taint GPU nodes
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

GPU workloads must then include tolerations to be scheduled on these nodes.
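
A matching toleration in the pod spec looks like this (the full examples in section 10 use the same pattern):

# Toleration for nodes tainted with nvidia.com/gpu=present:NoSchedule
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule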

6. GPU Sharing Strategies: MIG, MPS, Time-Slicing, and vGPU

The inability to share GPUs natively in Kubernetes leads to massive underutilization. Industry surveys show that most organizations struggle not with GPU scarcity but with poor GPU utilization caused by limited multi-tenancy capabilities.

Four main strategies exist to share GPUs, each with distinct tradeoffs:

6.1 Multi-Instance GPU (MIG)

MIG is a hardware-based partitioning feature available on NVIDIA Ampere and newer architectures (A100, A30, H100, H200). It physically divides a GPU into up to 7 isolated instances, each with:

  • Dedicated compute cores
  • Dedicated memory bandwidth
  • Dedicated L2 cache
  • Hardware fault isolation

MIG Profiles (A100-80GB example):

  • 1g.10gb — 1/7 of GPU compute, 10GB memory (up to 7 instances)
  • 2g.20gb — 2/7 of GPU compute, 20GB memory (up to 3 instances)
  • 3g.40gb — 3/7 of GPU compute, 40GB memory (up to 2 instances)
  • 7g.80gb — Full GPU

Pros: True hardware isolation, guaranteed resources, fault separation, QoS guarantees

Cons: Only supported on Ampere+, fixed partition sizes, static configuration, some memory overhead

Best for: Multi-tenant environments, SLA-bound inference, workloads requiring predictable performance
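
When the device plugin exposes MIG devices with the mixed strategy, each profile becomes its own extended resource that pods can request directly. A minimal sketch, assuming 1g.10gb instances have been configured:

# Pod requesting one 1g.10gb MIG slice (mixed strategy; resource name follows the configured profile)
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference
spec:
  restartPolicy: OnFailure
  containers:
  - name: server
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule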

6.2 Multi-Process Service (MPS)

MPS is a software-based sharing mechanism that allows multiple CUDA processes to share a GPU through a unified server process. Instead of context-switching between processes, MPS interleaves their work.

How it works:

  • A control daemon sets the GPU to EXCLUSIVE_PROCESS mode
  • An MPS server acts as a proxy between CUDA applications and the GPU
  • Multiple clients share the server's scheduling resources
  • Context-switch overhead is eliminated

Pros: Higher throughput than time-slicing, memory limit enforcement, works on most NVIDIA GPUs

Cons: No fault isolation (one crash affects all), shared address space, not compatible with MIG

Best for: Trusted internal workloads, inference serving, latency-tolerant batch processing
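
Recent releases of the NVIDIA device plugin (v0.15.0+) can enable MPS through the same sharing configuration used for time-slicing. A sketch, assuming the GPU Operator manages the device plugin and picks up this ConfigMap:

# Sharing config that runs an MPS server with 4 clients per GPU (device plugin v0.15.0+)
apiVersion: v1
kind: ConfigMap
metadata:
  name: mps-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 4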

6.3 Time-Slicing

Time-slicing is the simplest sharing strategy. The GPU rapidly context-switches between processes, giving each a time slice of execution. This is temporal sharing — processes take turns using the GPU.

How it works in Kubernetes:

  • Configure the device plugin to advertise "replicas" of each GPU
  • A GPU with 4 replicas appears as 4 schedulable resources
  • Pods requesting 1 replica get shared access to the underlying GPU
  • The GPU's hardware scheduler time-slices between workloads

Pros: Works on all Pascal+ GPUs, simple to configure, flexible oversubscription

Cons: No memory isolation, no fault isolation, context-switch overhead, unpredictable latency

Best for: Development environments, light inference, bursty workloads, cost optimization

6.4 Virtual GPU (vGPU)

NVIDIA vGPU enables multiple virtual machines to share a physical GPU with hardware-accelerated graphics and compute. Each VM sees a virtualized GPU slice.

Requirements: Supported hypervisor (VMware, KVM, Hyper-V), NVIDIA AI Enterprise license, compatible GPU

Best for: VDI, VM-based AI workloads, enterprises with existing virtualization infrastructure

6.5 Comparison Matrix

Isolation: MIG > vGPU > MPS > Time-Slicing

Flexibility: Time-Slicing > MPS > vGPU > MIG

GPU Compatibility: Time-Slicing > MPS > vGPU > MIG (Ampere+ only)

Best Practice: Use MIG for production inference with SLAs. Use time-slicing for development and light workloads. Combine MIG + time-slicing for maximum density.

7. Gang Scheduling with Kueue and Volcano

Distributed training jobs (PyTorch DDP, Horovod, MPI) require gang scheduling — all workers must start together or not at all. The default Kubernetes scheduler doesn't support this.

7.1 The Gang Scheduling Problem

Imagine a 4-GPU training job that needs 4 pods. If Kubernetes schedules 3 pods but can't find resources for the 4th, you have:

  • 3 GPUs allocated but idle (waiting for the 4th)
  • Potential deadlock if multiple jobs partially schedule
  • Wasted resources and blocked pipelines

7.2 Kueue

Kueue is a Kubernetes-native job queueing system designed for batch and AI workloads. It provides:

  • Queue-based admission: Jobs wait in queues until resources are available
  • Gang semantics: Entire workloads are admitted or queued atomically
  • Resource quotas: Enforce per-team or per-project GPU limits
  • Preemption: Higher-priority jobs can preempt lower-priority ones
  • Fair sharing: Cohorts allow queues to borrow idle quota from peers
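
A minimal sketch of the Kueue objects involved, assuming a single flavor and an ml-team namespace (all names are illustrative):

# ResourceFlavor + ClusterQueue + LocalQueue granting ml-team up to 8 GPUs
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ml-team-queue
  namespace: ml-team
spec:
  clusterQueue: gpu-cluster-queue

Jobs opt in by carrying the kueue.x-k8s.io/queue-name: ml-team-queue label; Kueue keeps them suspended until the queue admits them.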

7.3 Volcano

Volcano is a CNCF project providing advanced batch scheduling for Kubernetes. It offers:

  • PodGroup: Define a group of pods that must be scheduled together
  • minMember: Specify minimum pods required to start
  • Queue: Hierarchical queuing with weights and priorities
  • Actions: Enqueue, allocate, preempt, backfill
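
As a sketch, a Volcano Job for a 4-worker, 4-GPU training run with gang semantics might look like this (image and entrypoint are illustrative):

# Volcano Job: minAvailable = 4 means all workers start together or not at all
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-ddp
spec:
  schedulerName: volcano
  minAvailable: 4
  queue: default
  tasks:
  - name: worker
    replicas: 4
    template:
      spec:
        restartPolicy: OnFailure
        containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.01-py3
          resources:
            limits:
              nvidia.com/gpu: 1
        tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule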

7.4 When to Use Which

Use Kueue when: You want Kubernetes-native integration, simple queue management, and compatibility with standard job controllers

Use Volcano when: You need strict gang semantics, complex scheduling policies, or integration with MPI/PyTorch/TensorFlow operators

Use both: Kueue for admission control and quota, Volcano for strict gang scheduling

8. Monitoring GPU Workloads

8.1 DCGM Exporter

The NVIDIA Data Center GPU Manager (DCGM) Exporter collects GPU metrics and exposes them to Prometheus. Key metrics include:

  • DCGM_FI_DEV_GPU_UTIL: GPU utilization percentage
  • DCGM_FI_DEV_MEM_COPY_UTIL: Memory bandwidth utilization
  • DCGM_FI_DEV_FB_USED: GPU memory used (bytes)
  • DCGM_FI_DEV_FB_FREE: GPU memory free (bytes)
  • DCGM_FI_DEV_GPU_TEMP: GPU temperature
  • DCGM_FI_DEV_POWER_USAGE: Power consumption (watts)
  • DCGM_FI_DEV_SM_CLOCK: Streaming multiprocessor clock speed

8.2 Key Alerts to Configure

# High GPU Memory Usage
- alert: HighGPUMemoryUsage
  expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.9
  for: 5m


# Low GPU Utilization (wasted resources)
- alert: LowGPUUtilization
  expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 20
  for: 2h


# GPU Temperature Warning
- alert: HighGPUTemperature
  expr: DCGM_FI_DEV_GPU_TEMP > 85
  for: 5m

9. Best Practices for GPU in Kubernetes

9.1 Use NVIDIA Base Images

Always use official NVIDIA base images (nvcr.io/nvidia/cuda, nvcr.io/nvidia/pytorch, etc.) that include properly configured CUDA libraries and dependencies.

9.2 Taint GPU Nodes

Apply taints to GPU nodes to prevent non-GPU workloads from consuming these expensive resources. Use tolerations only on pods that actually need GPUs.

9.3 Right-Size GPU Resources

Match GPU type to workload requirements. Don't use an A100 for inference that would run fine on a T4. Consider GPU sharing for smaller workloads.

9.4 Implement Resource Quotas

Use ResourceQuotas to limit GPU consumption per namespace, preventing any single team from monopolizing GPU resources.

9.5 Enable GPU Sharing for Development

Use time-slicing or MIG for development environments where isolation isn't critical but cost efficiency matters.

9.6 Use Priority Classes

Define PriorityClasses to ensure production inference workloads take precedence over experimental training jobs.
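
A sketch of two PriorityClasses (names and values are illustrative); pods opt in via spec.priorityClassName:

# Higher value wins when the scheduler must preempt
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-inference-critical
value: 1000000
globalDefault: false
description: "Production inference; may preempt lower-priority GPU jobs"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-training-batch
value: 1000
globalDefault: false
description: "Experimental training; preemptible"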

9.7 Implement Checkpointing

For long-running training jobs, implement checkpointing to survive preemption, spot instance termination, and failures.

9.8 Monitor Utilization Continuously

Track GPU utilization, memory usage, and temperature. Alert on both over-utilization (thermal throttling) and under-utilization (wasted spend).

10. Practical Examples with YAML

10.1 Basic GPU Pod

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

10.2 Multi-GPU Training Job

apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.01-py3
        command: ["torchrun", "--nproc_per_node=4", "train.py"]
        resources:
          limits:
            nvidia.com/gpu: 4
            memory: "64Gi"
            cpu: "16"
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

10.3 Time-Slicing ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Each physical GPU appears as 4 resources
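
With the GPU Operator, this ConfigMap takes effect once the ClusterPolicy's device plugin section points at it; a sketch of the relevant fragment:

# ClusterPolicy fragment referencing the time-slicing ConfigMap above
spec:
  devicePlugin:
    config:
      name: time-slicing-config
      default: any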

10.4 GPU ResourceQuota

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # Max 8 GPUs for this team
    limits.nvidia.com/gpu: "8"
    requests.memory: "256Gi"
    requests.cpu: "128"

11. Cost Optimization Strategies

11.1 Right-Size GPU Selection

Match GPU type to workload. Don't use H100s for inference that works on T4s. Common sizing:

  • T4/L4: Inference, light training, development
  • A10G: Medium inference, fine-tuning
  • A100: Large model training, high-throughput inference
  • H100: LLM training, transformer-heavy workloads

11.2 Spot/Preemptible Instances

Use spot instances for fault-tolerant training jobs with checkpointing. Combine with Kueue's preemption policies for graceful handling.

11.3 GPU Sharing for Low-Utilization Workloads

If monitoring shows GPUs at <30% utilization, implement MIG or time-slicing. A single A100 can serve up to 7 small inference workloads via MIG.

11.4 Cluster Autoscaling

Configure GPU node pools with autoscaling. Scale to zero when no GPU workloads are pending. Use Karpenter for fine-grained GPU instance selection.

12. Conclusion and Future Directions

Kubernetes has evolved from a CPU-centric container orchestrator to a capable platform for GPU-intensive AI/ML workloads. The ecosystem now provides:

  1. Stable GPU scheduling via device plugins and operators
  2. Multiple sharing strategies (MIG, MPS, time-slicing) for efficiency
  3. Gang scheduling via Kueue and Volcano for distributed training
  4. Rich observability through DCGM and Prometheus integration

Future developments to watch:

  • Dynamic Resource Allocation (DRA): More flexible GPU allocation APIs in Kubernetes 1.30+
  • NVIDIA KAI Scheduler: Open-sourced enterprise-grade GPU scheduler
  • HAMi: Fractional GPU sharing at scale
  • Predictive scheduling: AI-driven GPU allocation based on workload patterns

The organizations winning with AI at scale have made a cultural shift: they treat GPUs as a shared, policy-driven substrate governed by queues — not as pets hand-assigned to projects. With the right combination of operators, schedulers, sharing strategies, and monitoring, Kubernetes is becoming the operating system of the AI era.