Kubernetes for AI/ML Workloads: What I Learned Shipping to Production

Kubernetes and AI/ML is one of those pairings that's either exactly right or a self-inflicted wound, and the difference is whether you actually needed it. I've run AI workloads both on plain managed compute and on Kubernetes, and the honest lesson is: k8s is a powerful answer to problems you should make sure you actually have. Here's a practical guide to when it's the right call, when it isn't, and how to do it without drowning in YAML.

First, the uncomfortable question: do you need it?

Kubernetes earns its complexity when you have several of these:

Many services that need to scale independently.
GPU workloads you need to schedule, share, and bin-pack efficiently.
A team that already knows k8s (or has a real reason to learn it).
Multi-cloud or on-prem requirements that rule out a single managed platform.
Complex batch/training pipelines alongside online serving.

If you have none of these — if you're a small team serving one or two AI features — Kubernetes is almost certainly overkill. A managed container service or serverless functions (the AWS path I wrote about) will serve you better with a fraction of the ops burden. Don't adopt k8s for the resume; adopt it for the problem.

With that caveat loudly stated — here's what actually matters when you do need it.

GPU scheduling is the heart of it

The single biggest reason to put ML on Kubernetes is GPU orchestration. This is where it genuinely shines and where most of the ML-specific complexity lives:

Device plugins. The NVIDIA device plugin exposes GPUs to the scheduler so pods can request them like any other resource (nvidia.com/gpu: 1).
Node pools by accelerator. Keep GPU nodes in their own pool, taint them, and use tolerations so only GPU workloads land there. You do not want a stateless web pod squatting on a GPU node.
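Bin-packing and sharing. GPUs are expensive; idle GPU is the cardinal sin. Time-slicing and MIG (multi-instance GPU) both let multiple smaller workloads share a card, with different trade-offs: time-slicing gives no memory isolation between the workloads, while MIG partitions the hardware but only exists on supported data-center GPUs. Use them when your workloads don't each saturate a full GPU.
Right-size requests. Over-requesting GPUs, CPU, or host memory strands capacity; under-requesting memory gets pods OOM-killed. The scheduler counts whole GPUs (or MIG slices), not GPU memory, so fitting a model onto a card is on you. This needs real measurement, not guesses.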
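To make the list above concrete, here is a minimal sketch of a GPU pod manifest, built as a Python dict and printed as JSON (kubectl accepts JSON as well as YAML). The pod name, image, the pool=gpu node label, the taint key, and the CPU/memory numbers are all placeholders I made up for illustration; only the nvidia.com/gpu resource name comes from the device plugin.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a Pod manifest that asks for whole GPUs and tolerates a GPU-node taint."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Hypothetical label: use whatever label your GPU node pool carries.
            "nodeSelector": {"pool": "gpu"},
            # Assumes GPU nodes are tainted with key nvidia.com/gpu, effect NoSchedule.
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": "inference",
                    "image": image,
                    "resources": {
                        # Extended resources such as GPUs go in limits, as whole units.
                        "limits": {"nvidia.com/gpu": str(gpus)},
                        # Placeholder values: replace with measured CPU and host memory.
                        "requests": {"cpu": "4", "memory": "16Gi"},
                    },
                }
            ],
        },
    }


manifest = gpu_pod_manifest("llm-server", "registry.example.com/llm-server:1.0")
print(json.dumps(manifest, indent=2))
```

Save the output to a file and apply it with kubectl apply -f. The toleration only lets the pod onto tainted GPU nodes; it is the node selector that steers it there.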

Autoscaling without surprises

Autoscaling AI workloads on k8s has more sharp edges than typical web services:

Horizontal Pod Autoscaler (HPA) on the right metric. CPU is often the wrong signal for inference — scale on queue depth, request concurrency, or GPU utilization via custom/external metrics instead.
Cluster Autoscaler / Karpenter for nodes. GPU nodes are slow and pricey to spin up, so reactive scaling lags. Keep a warm buffer for latency-sensitive serving.
Scale-to-zero for batch. Node pools for training and batch jobs should scale to zero when idle; that's where the savings are. Online serving usually can't, because cold model loads kill latency.
Mind the cold-start chain. New node → pull a multi-GB image → load model weights → ready. That can be minutes. Pre-pull images, cache weights on the node, and keep warm capacity for anything user-facing.
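To see why the metric choice matters, it helps to look at the arithmetic the HPA runs. The sketch below follows the documented formula (desired = ceil(current replicas × current metric / target metric), with a default tolerance of 0.1 around the target). The function name, the queue-depth numbers, and the replica bounds are mine, for illustration only, and the real controller also applies stabilization windows and readiness handling that this leaves out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    # Within tolerance of the target, the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4 pods, average queue depth of 30 per pod, target of 10 per pod.
print(desired_replicas(4, 30, 10, max_replicas=20))
# Queue has drained to 2 per pod: scale back toward the minimum.
print(desired_replicas(4, 2, 10))
```

With queue depth as the signal, a backlog triples the fleet in one step. With CPU as the signal, a GPU-bound inference pod can sit at modest CPU while requests pile up, and the same formula would do nothing.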

The cost traps

Kubernetes makes it easy to waste money on AI workloads if you're not watching:

Idle GPU nodes that never scaled down. The number one offender. Audit ruthlessly.
Over-provisioned requests that strand capacity across the cluster.
Forgotten experiments — a training job someone launched and never cleaned up, quietly billing GPUs for a week.
Cross-AZ data transfer from chatty pods spread across zones.

Wire up cost visibility (Kubecost or your cloud's cost tooling) and put GPU utilization on a dashboard someone actually looks at. The discipline here is the same as on raw cloud — see my AWS cost notes — just with more moving parts.
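A back-of-the-envelope version of that dashboard is a few lines of arithmetic. The pool names, GPU counts, utilization figures, and the $3.00 per GPU-hour price below are invented for illustration; plug in your own numbers from your metrics and your bill.

```python
HOURS_PER_WEEK = 168
PRICE_PER_GPU_HOUR = 3.0  # hypothetical price, not a real quote


def idle_gpu_cost(gpu_count, hourly_price, avg_utilization, hours=HOURS_PER_WEEK):
    """Dollars spent over the period on GPU time that did no work."""
    idle_fraction = 1.0 - avg_utilization
    return gpu_count * hourly_price * idle_fraction * hours


# Hypothetical pools with average GPU utilization over the last week.
pools = [
    {"name": "serving", "gpus": 4, "utilization": 0.5},
    {"name": "training", "gpus": 16, "utilization": 0.75},
    {"name": "forgotten-experiment", "gpus": 8, "utilization": 0.0},
]

# Rank pools by wasted spend, worst offender first.
report = sorted(
    ((p["name"], idle_gpu_cost(p["gpus"], PRICE_PER_GPU_HOUR, p["utilization"]))
     for p in pools),
    key=lambda row: row[1],
    reverse=True,
)

for name, wasted in report:
    print(f"{name:22s} ${wasted:,.0f} idle per week")
```

Under these made-up numbers the forgotten experiment costs more than the other two pools combined, which is usually how it goes.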

Patterns that keep it sane

The teams I've seen succeed with ML on Kubernetes share habits:

Managed control plane. Use EKS/GKE/AKS. Running your own control plane on top of running ML is two hard jobs; pick one.
GitOps. Declarative configs in git, applied by a controller (Argo CD or Flux). For ML, where reproducibility matters, this is a gift.
Purpose-built operators. For training and batch ML, tools like Kubeflow or Ray (run on Kubernetes through the KubeRay operator) handle the ML-specific orchestration so you're not reinventing it in raw YAML.
Separate serving from training. Different scaling profiles, different SLAs, different failure tolerance. Don't let a training job starve your serving pods.
Resource quotas and limits everywhere. Especially on GPU. One unbounded job shouldn't be able to take the cluster hostage.
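For the quota habit, here is a minimal sketch of a per-namespace GPU cap, again built as a Python dict and printed as JSON. The namespace name, quota name, and the limit of 8 are placeholders. The one detail worth knowing is that quotas on extended resources like GPUs only support the requests. prefix.

```python
import json


def gpu_quota(namespace, max_gpus):
    """Build a ResourceQuota that caps total GPUs requested in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        # Hypothetical quota name; pick your own convention.
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        "spec": {
            # Extended resources are quota'd via the "requests." prefix only.
            "hard": {"requests.nvidia.com/gpu": str(max_gpus)}
        },
    }


quota = gpu_quota("ml-training", 8)
print(json.dumps(quota, indent=2))
```

With this in place, a pod that would push the namespace past 8 GPUs is rejected at admission time instead of quietly draining the cluster.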

A pragmatic decision rule

Here's how I actually decide:

One or two AI features, small team? → Skip k8s. Managed containers or serverless. Revisit later.
Many services, real GPU fleet, team that knows k8s? → Kubernetes is a strong fit; invest in GPU scheduling and cost controls from day one.
Heavy training/batch ML? → Kubernetes + a purpose-built operator (Ray/Kubeflow), with scale-to-zero on the batch side.
Somewhere in between? → Start simpler. You can always migrate to Kubernetes; migrating away after over-adopting is the painful direction.

A note on networking and storage

Two areas that quietly cause the most production pain on ML clusters, and that the tutorials skip:

Storage for weights and datasets. Model weights are large and you don't want to re-download them on every pod start. Use a shared, fast volume (a CSI-backed filesystem or a node-local cache) and pre-warm it. For training data, mind the read throughput — a GPU starved waiting on slow storage is the most expensive idle there is.
Networking for distributed training. If you do multi-node training, inter-node bandwidth becomes the bottleneck and you'll care about node placement, high-throughput networking, and topology-aware scheduling. For single-node inference this doesn't matter; for distributed training it's everything.

Neither is glamorous, and both will bite you in week three if you ignore them in week one. Provision the fast path early.
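Both problems are easy to size before they bite. The sketch below is plain arithmetic with assumed numbers: a 26 GB set of weights (roughly a 13B-parameter model in 16-bit precision), 250 MB/s from remote storage, 2,000 MB/s from a node-local disk, and a training job reading 2,000 samples per second at 0.5 MB each. None of these figures are measurements; substitute your own.

```python
def load_seconds(size_mb, throughput_mb_per_s):
    """Time to read model weights at a sustained throughput."""
    return size_mb / throughput_mb_per_s


def required_read_mb_per_s(samples_per_s, mb_per_sample):
    """Read throughput the data loader must sustain to keep the GPU busy."""
    return samples_per_s * mb_per_sample


WEIGHTS_MB = 26_000  # assumed: ~13B parameters at 2 bytes each

remote = load_seconds(WEIGHTS_MB, 250)    # assumed remote-storage throughput
local = load_seconds(WEIGHTS_MB, 2_000)   # assumed node-local cache throughput
print(f"weights from remote storage: {remote:.0f}s, from node-local cache: {local:.0f}s")

needed = required_read_mb_per_s(2_000, 0.5)
storage_mb_per_s = 400                    # assumed throughput of the training volume
starved = needed > storage_mb_per_s
print(f"training needs {needed:.0f} MB/s, storage gives {storage_mb_per_s} MB/s, starved={starved}")
```

Under these assumptions the cache cuts weight loading from well over a minute to seconds on every pod start, and the training volume delivers less than half of what the GPU can consume.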

The takeaway

Kubernetes is a superb platform for AI/ML at the right scale — multiple services, a real GPU fleet, and a team equipped to run it. Its core value is GPU orchestration; its core risk is idle, expensive capacity and accidental complexity. If you genuinely have the problems it solves, lean in: managed control plane, GitOps, ML-aware operators, ruthless cost visibility, and serving separated from training. If you don't, the most senior move is to not adopt it yet. Match the tool to the problem, not to the trend.
