Platform · 27 April 2026

Your Kubernetes cluster was not designed for GPU workloads

Standard K8s clusters are optimised for stateless, CPU-bound workloads. AI inference breaks all of those assumptions.

CPU workloads vs GPU workloads on Kubernetes

Your cluster was designed for the left column. AI needs the right.

| Dimension | Standard workload | GPU / AI workload |
|---|---|---|
| Scheduling | Distribute evenly across nodes | Binary: has GPU or waits |
| Scaling | New node in seconds | New GPU node in 3-5 minutes |
| Memory | Predictable, OOM kills recover fast | Spikes 2x under load, slow GPU cleanup |
| Allocation | Fractional CPU, shared easily | Whole GPU or nothing |
| Autoscaler | Watch CPU/memory, react in seconds | Needs predictive scaling, long cooldown |
| Node readiness | Pod starts, serves traffic | Load drivers + CUDA + pre-warm model |

What the platform team needs to own

  • Separate GPU node pools
  • Custom autoscaler config
  • GPU memory monitoring
  • Model pre-warming
  • Request queuing with backpressure

Teams running AI inference on Kubernetes are discovering something uncomfortable: their cluster was not designed for this.

Standard Kubernetes clusters are optimised for stateless, CPU-bound, horizontally scalable workloads. Web servers. API services. Background workers. The scheduler distributes pods evenly. The autoscaler watches CPU and memory. Node pools are homogeneous.

AI inference workloads break all of these assumptions.

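GPU scheduling is fundamentally different from CPU scheduling. Kubernetes exposes GPUs as whole-number resources, so by default you cannot request a fraction of a GPU the way you can request a fraction of a CPU core. A pod either has the GPU or it does not. If your cluster has 4 single-GPU nodes and 5 GPU pods need scheduling, one waits. Sharing mechanisms such as NVIDIA time-slicing and MIG exist, but each has to be configured explicitly. Out of the box, there is no partial allocation.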
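To make the whole-GPU constraint concrete, here is a minimal Python sketch of the scenario above. It is a toy model, not the Kubernetes scheduler: `place_pods` and the node and pod names are hypothetical, and it assumes each node exposes one GPU and each pod requests one.

```python
from typing import Dict, List, Tuple


def place_pods(
    pods: Dict[str, int], nodes: Dict[str, int]
) -> Tuple[Dict[str, str], List[str]]:
    """Toy first-fit placement with whole-GPU requests.

    pods maps pod name -> GPUs requested; nodes maps node name -> free GPUs.
    Returns (placements, pending). Hypothetical helper, not a Kubernetes API.
    """
    free = dict(nodes)  # copy so the caller's dict is not mutated
    placements: Dict[str, str] = {}
    pending: List[str] = []
    for pod, wanted in pods.items():
        # A pod is placed only if a single node can satisfy the full request.
        target = next((n for n, g in free.items() if g >= wanted), None)
        if target is None:
            pending.append(pod)
        else:
            free[target] -= wanted
            placements[pod] = target
    return placements, pending


nodes = {f"gpu-node-{i}": 1 for i in range(4)}  # 4 nodes, 1 GPU each
pods = {f"inference-{i}": 1 for i in range(5)}  # 5 pods, 1 GPU each
placements, pending = place_pods(pods, nodes)
print(len(placements), "placed;", pending, "pending")
```

The fifth pod stays pending no matter how idle the four placed pods are, because a GPU that is in use cannot be partly handed to another pod in this model.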

Memory requirements are unpredictable. A model that uses 8 GB of GPU memory during normal inference might spike to 16 GB under certain input patterns. Out-of-memory (OOM) kills on GPU nodes are harder to recover from than CPU OOM kills because GPU state cleanup is slower and less reliable.
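One way to act on this is to size GPU memory against the peak rather than the steady state. The sketch below is illustrative arithmetic only: `fits_with_headroom`, the 10% reserve, and the 16 GB and 24 GB card sizes are assumptions, not measurements.

```python
def fits_with_headroom(
    peak_gb: float, gpu_total_gb: float, reserve: float = 0.10
) -> bool:
    """Return True if the footprint fits while keeping `reserve` of the GPU free."""
    usable_gb = gpu_total_gb * (1.0 - reserve)
    return peak_gb <= usable_gb


steady_gb, peak_gb = 8.0, 16.0  # figures from the example above

for gpu_total_gb in (16.0, 24.0):  # hypothetical card sizes
    ok_steady = fits_with_headroom(steady_gb, gpu_total_gb)
    ok_peak = fits_with_headroom(peak_gb, gpu_total_gb)
    print(f"{gpu_total_gb:.0f} GB GPU: steady fits={ok_steady}, peak fits={ok_peak}")
```

A card that looks comfortable at the steady-state figure can still fail at the peak, which is the case that ends in an OOM kill.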

Autoscaling does not work the same way. Spinning up a new GPU node takes minutes (typically 3-5), not seconds. The node needs GPU drivers, CUDA libraries, and often a model pre-loaded into GPU memory before it can serve traffic. By the time the node is ready, the traffic spike may have passed.
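Because capacity ordered now arrives minutes later, the scaling decision has to be made against the load expected at the end of the provisioning window, not the current load. A minimal sketch, assuming a naive linear trend and hypothetical names and numbers (`forecast_rps`, `replicas_for`, 20 requests per second per replica, a 4-minute provisioning time); a real forecaster would be more careful.

```python
import math
from typing import Sequence


def forecast_rps(samples: Sequence[float], interval_s: float, lead_s: float) -> float:
    """Extrapolate the last two samples linearly `lead_s` seconds ahead."""
    if len(samples) < 2:
        return samples[-1] if samples else 0.0
    delta = samples[-1] - samples[-2]  # change per sampling interval
    return max(0.0, samples[-1] + delta * (lead_s / interval_s))


def replicas_for(rps: float, per_replica_rps: float) -> int:
    """Whole replicas needed to serve `rps` (never fewer than one)."""
    return max(1, math.ceil(rps / per_replica_rps))


samples = [40.0, 50.0, 60.0]  # requests/s, sampled once a minute (made-up data)
lead_s = 4 * 60               # assumed GPU node provisioning time in seconds

now = replicas_for(samples[-1], per_replica_rps=20.0)
ahead = replicas_for(forecast_rps(samples, 60.0, lead_s), per_replica_rps=20.0)
print(f"replicas for current load: {now}, for forecast load: {ahead}")
```

Scaling to the forecast figure rather than the current one is what buys back the provisioning delay. The trade-off is over-provisioning when the trend does not continue, which is one reason to pair this with longer cool-down periods.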

Running inference reliably on Kubernetes takes:

  • Separate GPU node pools with dedicated scheduling rules
  • Custom autoscaler configurations with longer cool-down periods and predictive scaling
  • GPU memory monitoring as a first-class metric, not an afterthought
  • Model pre-warming strategies so new nodes are ready to serve before they receive traffic
  • Request queuing with backpressure instead of dropping requests when GPU capacity is full
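The last item, queuing with backpressure, can be sketched with a bounded queue: requests are accepted while there is room and rejected explicitly once the queue is full, so callers can retry or shed load. This is an illustration built on Python's standard `queue` module; `InferenceGateway`, the queue size, and the HTTP-style status codes are hypothetical choices, not part of any serving framework.

```python
import queue
from typing import Optional


class InferenceGateway:
    """Bounded admission queue in front of a fixed pool of GPU workers."""

    def __init__(self, max_pending: int) -> None:
        self._pending: "queue.Queue[str]" = queue.Queue(maxsize=max_pending)

    def submit(self, request_id: str) -> int:
        """Return 202 if queued, 429 if the queue is full (backpressure)."""
        try:
            self._pending.put_nowait(request_id)
        except queue.Full:
            return 429  # tell the caller to back off instead of dropping silently
        return 202

    def next_request(self) -> Optional[str]:
        """Called by a GPU worker when it is free; None if nothing is waiting."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None


gateway = InferenceGateway(max_pending=2)
statuses = [gateway.submit(f"req-{i}") for i in range(3)]
print(statuses)                  # the third request is rejected while the queue is full
served = gateway.next_request()  # a worker takes one request, freeing a slot
retry_status = gateway.submit("req-3")
print(served, retry_status)
```

The bound on the queue is the design choice that matters: an unbounded queue hides the capacity problem until latency collapses, while a bounded one surfaces it to the caller immediately.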

The platform team needs to own this. If you leave it to the AI team, they will build workarounds that become permanent. If you leave it to nobody, you will discover the limits during a production incident.


Senna Semakula


Founder, Atruvo
