Platform · 27 April 2026

Your Kubernetes cluster was not designed for GPU workloads

Standard K8s clusters are optimised for stateless, CPU-bound workloads. AI inference breaks all of those assumptions.

Pattern · Kubernetes scheduling for GPU workloads

CPU workloads vs GPU workloads on Kubernetes

Your cluster was designed for the left column. AI needs the right.

Dimension	Standard workload	GPU / AI workload
Scheduling	Distribute evenly across nodes	Binary: has GPU or waits
Scaling	New node in seconds	New GPU node in 3-5 minutes
Memory	Predictable, OOM kills recover fast	Spikes 2x under load, slow GPU cleanup
Allocation	Fractional CPU, shared easily	Whole GPU by default; sharing needs MIG or time-slicing
Autoscaler	Watch CPU/memory, react in seconds	Needs predictive scaling, long cooldown
Node readiness	Pod starts, serves traffic	Load drivers + CUDA + pre-warm model

What the platform team needs to own

Separate GPU node pools
Custom autoscaler config
GPU memory monitoring
Model pre-warming
Request queuing with backpressure

Teams running AI inference on Kubernetes are discovering something uncomfortable: their cluster was not designed for this.

Standard Kubernetes clusters are optimised for stateless, CPU-bound, horizontally scalable workloads. Web servers. API services. Background workers. The scheduler distributes pods evenly. The autoscaler watches CPU and memory. Node pools are homogeneous.

AI inference workloads break all of these assumptions.

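GPU scheduling is fundamentally different from CPU scheduling. Kubernetes exposes GPUs as whole-number extended resources (for example nvidia.com/gpu: 1), so by default you cannot request a fraction of a GPU the way you can request a fraction of a CPU core. A pod either has the GPU or it does not. If your cluster has 4 single-GPU nodes and 5 pods each request a GPU, one waits. Sharing is possible through NVIDIA time-slicing or MIG partitioning, but it has to be configured deliberately, and time-slicing gives no memory isolation between the workloads sharing the device.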
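
To make the arithmetic concrete, here is a minimal Python sketch of whole-device allocation. It is a toy first-fit model, not the real kube-scheduler. The names Node and place_pods, and the node and pod names, are hypothetical, and each node is assumed to expose exactly one GPU.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_gpus: int  # GPUs are counted in whole devices, never fractions


def place_pods(pod_requests: dict[str, int], nodes: list[Node]) -> tuple[dict[str, str], list[str]]:
    """First-fit placement of pods that each request a whole number of GPUs.

    Returns (placements, pending). A pod that no node can fit is left
    pending rather than being given a partial share of a device.
    """
    placements: dict[str, str] = {}
    pending: list[str] = []
    for pod, wanted in pod_requests.items():
        for node in nodes:
            if node.free_gpus >= wanted:
                node.free_gpus -= wanted
                placements[pod] = node.name
                break
        else:
            # No node had enough free GPUs: the pod waits.
            pending.append(pod)
    return placements, pending


if __name__ == "__main__":
    nodes = [Node(f"gpu-node-{i}", free_gpus=1) for i in range(4)]
    pods = {f"inference-{i}": 1 for i in range(5)}
    placed, waiting = place_pods(pods, nodes)
    print(f"placed={len(placed)} pending={waiting}")
```

With four single-GPU nodes and five one-GPU pods, four pods are placed and the fifth stays pending until a device frees up or a new node joins.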

Memory requirements are unpredictable. A model that uses 8 GB of GPU memory during normal inference might spike to 16 GB under certain input patterns, such as unusually long inputs or large batches. OOM kills on GPU nodes are harder to recover from than CPU OOM kills because GPU state cleanup is slower and less reliable.
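
One way to act on this is to size for the spike rather than the average. The sketch below is a back-of-the-envelope headroom check in Python. The function name, the 2x spike factor, and the 10% safety margin are illustrative assumptions, not measured values.

```python
def fits_with_headroom(steady_gb: float, spike_factor: float,
                       gpu_total_gb: float, safety_margin: float = 0.10) -> bool:
    """Return True if the worst-case footprint fits on the device.

    worst case = steady-state usage * spike factor
    usable     = total device memory minus a reserved safety margin
    """
    worst_case_gb = steady_gb * spike_factor
    usable_gb = gpu_total_gb * (1.0 - safety_margin)
    return worst_case_gb <= usable_gb


if __name__ == "__main__":
    # The example above: 8 GB in steady state that can spike to 16 GB.
    print(fits_with_headroom(8, 2.0, gpu_total_gb=16))  # 16 GB needed, 14.4 GB usable
    print(fits_with_headroom(8, 2.0, gpu_total_gb=24))  # 16 GB needed, 21.6 GB usable
```

Under these assumptions the model does not safely fit on a 16 GB device even though its steady-state footprint is only half of it, while a 24 GB device leaves room for the spike.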

Autoscaling does not work the same way. Spinning up a new GPU node takes minutes, not seconds. The node needs GPU drivers, CUDA libraries, and often a model pre-loaded into GPU memory before it can serve traffic. By the time the node is ready, the traffic spike may have passed.
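
The lead-time problem can be sketched in a few lines: instead of sizing for the current load, size for the load expected by the time a new node would be ready. This is a naive linear extrapolation for illustration only. The function names, the per-replica throughput, and the sample numbers are assumptions, and a real forecaster would account for seasonality and known events.

```python
import math


def forecast_rps(current_rps: float, rps_growth_per_min: float, lead_time_min: float) -> float:
    """Linearly extrapolate the request rate to the moment a new node is ready."""
    return max(0.0, current_rps + rps_growth_per_min * lead_time_min)


def replicas_needed(rps: float, rps_per_replica: float) -> int:
    """Whole GPU replicas required to serve a given request rate."""
    return math.ceil(rps / rps_per_replica)


if __name__ == "__main__":
    current, growth, per_replica = 80.0, 10.0, 25.0
    lead_time = 5.0  # minutes until a fresh GPU node can serve traffic (assumed)

    reactive = replicas_needed(current, per_replica)
    predictive = replicas_needed(forecast_rps(current, growth, lead_time), per_replica)
    print(f"reactive={reactive} predictive={predictive}")
```

With these sample numbers a reactive autoscaler asks for 4 replicas while the lead-time-aware calculation asks for 6, which is the gap that otherwise shows up as queued or failed requests while the node boots.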

Running inference reliably on Kubernetes requires:
Separate GPU node pools with dedicated scheduling rules
Custom autoscaler configurations with longer cool-down periods and predictive scaling
GPU memory monitoring as a first-class metric, not an afterthought
Model pre-warming strategies so new nodes are ready to serve before they receive traffic
Request queuing with backpressure instead of dropping requests when GPU capacity is full
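
The last item is the easiest to show in code. Below is a minimal Python sketch of a bounded queue that pushes back on callers when GPU capacity is full instead of silently discarding work. InferenceQueue, Overloaded, and the retry hint are hypothetical names for illustration. In an HTTP service the Overloaded signal would typically be mapped to a 429 or 503 response with a Retry-After header.

```python
import queue


class Overloaded(Exception):
    """Tells the caller to slow down and retry later."""

    def __init__(self, retry_after_s: float):
        super().__init__(f"GPU queue full, retry after {retry_after_s}s")
        self.retry_after_s = retry_after_s


class InferenceQueue:
    def __init__(self, max_pending: int, retry_after_s: float = 1.0):
        # The bound is the point: an unbounded queue hides overload until
        # latency or memory blows up.
        self._q: queue.Queue = queue.Queue(maxsize=max_pending)
        self._retry_after_s = retry_after_s

    def submit(self, request: str, wait_s: float = 0.0) -> None:
        """Enqueue a request, waiting up to wait_s seconds for a free slot."""
        try:
            if wait_s > 0:
                self._q.put(request, timeout=wait_s)
            else:
                self._q.put_nowait(request)
        except queue.Full:
            # Explicit backpressure: the caller learns the system is saturated.
            raise Overloaded(self._retry_after_s) from None

    def next_request(self) -> str:
        """Called by a GPU worker when it has capacity for more work."""
        return self._q.get_nowait()


if __name__ == "__main__":
    q = InferenceQueue(max_pending=2)
    q.submit("req-1")
    q.submit("req-2")
    try:
        q.submit("req-3")
    except Overloaded as exc:
        print(f"rejected: retry after {exc.retry_after_s}s")
    q.next_request()   # a worker frees a slot
    q.submit("req-3")  # now accepted
```

The design choice is that saturation becomes a visible, retryable signal at the edge rather than a timeout deep inside the serving stack.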

The platform team needs to own this. If you leave it to the AI team, they will build workarounds that become permanent. If you leave it to nobody, you will discover the limits during a production incident.


Senna Semakula

Founder, Atruvo
