Your Kubernetes cluster was not designed for GPU workloads
Standard K8s clusters are optimised for stateless, CPU-bound workloads. AI inference breaks all of those assumptions.
CPU workloads vs GPU workloads on Kubernetes
Your cluster was designed for the left column. AI needs the right.
| Dimension | Standard workload | GPU / AI workload |
|---|---|---|
| Scheduling | Distribute evenly across nodes | Binary: has GPU or waits |
| Scaling | New node in seconds | New GPU node in 3-5 minutes |
| Memory | Predictable, OOM kills recover fast | Spikes 2x under load, slow GPU cleanup |
| Allocation | Fractional CPU, shared easily | Whole GPUs by default; sharing needs MIG or time-slicing |
| Autoscaler | Watch CPU/memory, react in seconds | Needs predictive scaling, long cooldown |
| Node readiness | Pod starts, serves traffic | Load drivers + CUDA + pre-warm model |
What the platform team needs to own
Teams running AI inference on Kubernetes are discovering something uncomfortable: their cluster was not designed for this.
Standard Kubernetes clusters are optimised for stateless, CPU-bound, horizontally scalable workloads. Web servers. API services. Background workers. The scheduler distributes pods evenly. The autoscaler watches CPU and memory. Node pools are homogeneous.
AI inference workloads break all of these assumptions.
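GPU scheduling is fundamentally different from CPU scheduling. Kubernetes exposes GPUs as whole-device extended resources such as nvidia.com/gpu: a pod requests an integer number of GPUs, and the scheduler cannot split or overcommit them the way it does CPU. Out of the box, a pod either has the GPU or it does not. If your cluster has 4 single-GPU nodes and 5 pods that each need a GPU, one waits. Partial allocation exists only if you deliberately set it up, for example with NVIDIA time-slicing or MIG partitioning, and each option brings its own trade-offs.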
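To make the "one waits" arithmetic concrete, here is a minimal sketch of whole-GPU placement. It is a toy first-fit loop, not the real Kubernetes scheduler, and the node and pod names are made up for illustration.

```python
def schedule_gpu_pods(free_gpus_per_node, pod_requests):
    """Toy first-fit placement of whole-GPU requests.

    free_gpus_per_node: dict of node name -> free GPUs (whole numbers).
    pod_requests: dict of pod name -> GPUs requested (whole numbers).
    Returns (placements, pending): pod -> node, and the pods left waiting.
    """
    free = dict(free_gpus_per_node)  # copy so the caller's dict is untouched
    placements, pending = {}, []
    for pod, wanted in pod_requests.items():
        # Pick the first node with enough whole GPUs free; no fractions.
        node = next((n for n, g in free.items() if g >= wanted), None)
        if node is None:
            pending.append(pod)
        else:
            free[node] -= wanted
            placements[pod] = node
    return placements, pending


# Hypothetical cluster: 4 nodes with one GPU each, 5 pods wanting one GPU each.
nodes = {f"gpu-node-{i}": 1 for i in range(4)}
pods = {f"inference-{i}": 1 for i in range(5)}
placements, pending = schedule_gpu_pods(nodes, pods)
print(pending)  # ['inference-4']
```

The fifth pod stays pending until a GPU frees up or a new node joins, which is why the time to add a GPU node matters so much.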
Memory requirements are unpredictable. A model that uses 8 GB of GPU memory during normal inference might spike to 16 GB under certain input patterns. Out-of-memory failures on GPU nodes are harder to recover from than CPU OOM kills because GPU state cleanup is slower and less reliable.
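One way to act on this is to size for the spike rather than the steady state. The sketch below is a back-of-envelope check, assuming you have measured a steady-state footprint and a worst-case spike factor yourself. The function name and the 1 GiB reserve are illustrative assumptions, not recommendations.

```python
def fits_with_headroom(steady_gib, spike_factor, gpu_total_gib, reserve_gib=1.0):
    """Return True if the worst-case footprint fits on the GPU.

    steady_gib: memory used during normal inference.
    spike_factor: measured worst-case multiplier (2.0 means it can double).
    gpu_total_gib: memory available on the device.
    reserve_gib: assumed safety margin for runtime overhead (hypothetical value).
    """
    peak_gib = steady_gib * spike_factor
    return peak_gib + reserve_gib <= gpu_total_gib


# The example from the text: 8 GB steady state that can spike to 16 GB.
print(fits_with_headroom(8, 2.0, gpu_total_gib=16))  # False
print(fits_with_headroom(8, 2.0, gpu_total_gib=24))  # True
```

A model that looks comfortable on a 16 GB card at steady state fails the check once the spike is counted.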
Autoscaling does not work the same way. Spinning up a new GPU node takes minutes, not seconds. The node needs GPU drivers, CUDA libraries, and often a model pre-loaded into GPU memory before it can serve traffic. By the time the node is ready, the traffic spike may have passed.
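The practical consequence is that scaling decisions have to look ahead by at least the node's lead time. Here is a rough sketch of that idea, assuming you have some forecast of request rate. The forecast function, the 240-second lead time, and the per-replica throughput are all hypothetical numbers chosen for illustration.

```python
import math


def replicas_for(rps, per_replica_rps):
    """Replicas needed to serve a given request rate."""
    return math.ceil(rps / per_replica_rps)


def desired_replicas(forecast_rps, now_s, lead_time_s, per_replica_rps):
    """Scale for the load expected when a new node would actually be ready.

    forecast_rps: callable mapping a time in seconds to expected requests/sec.
    lead_time_s: time from scale-up decision to a node serving traffic.
    """
    current = replicas_for(forecast_rps(now_s), per_replica_rps)
    ahead = replicas_for(forecast_rps(now_s + lead_time_s), per_replica_rps)
    # Never go below what current traffic needs.
    return max(current, ahead)


def ramp(t):
    """Hypothetical traffic ramp: 100 rps now, growing by 0.5 rps each second."""
    return 100 + 0.5 * t


reactive = replicas_for(ramp(0), per_replica_rps=50)
predictive = desired_replicas(ramp, now_s=0, lead_time_s=240, per_replica_rps=50)
print(reactive, predictive)  # 2 5
```

A purely reactive autoscaler asks for 2 replicas and is already behind by the time the node arrives. Looking ahead by the lead time asks for 5.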
A cluster that is ready for AI inference needs:
- Separate GPU node pools with dedicated scheduling rules
- Custom autoscaler configurations with longer cool-down periods and predictive scaling
- GPU memory monitoring as a first-class metric, not an afterthought
- Model pre-warming strategies so new nodes are ready to serve before they receive traffic
- Request queuing with backpressure instead of dropping requests when GPU capacity is full
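As an illustration of the last item, here is a minimal sketch of a bounded request queue that pushes back on callers instead of silently dropping work. The class name, the response dictionaries, and the retry hint are assumptions made for the example. A real service would wire this into its own HTTP layer and worker pool.

```python
import queue


class InferenceQueue:
    """Bounded queue that signals backpressure when GPU capacity is full."""

    def __init__(self, max_pending):
        self._pending = queue.Queue(maxsize=max_pending)

    def submit(self, request):
        """Accept the request if there is room, otherwise tell the caller to retry."""
        try:
            self._pending.put_nowait(request)
        except queue.Full:
            # Explicit signal to slow down, rather than dropping or timing out.
            return {"status": 429, "retry_after_s": 2}
        return {"status": 202}

    def next_request(self):
        """Called by a GPU worker; raises queue.Empty if nothing is waiting."""
        return self._pending.get_nowait()


q = InferenceQueue(max_pending=2)
print(q.submit("a")["status"])  # 202
print(q.submit("b")["status"])  # 202
print(q.submit("c")["status"])  # 429
print(q.next_request())         # a
print(q.submit("c")["status"])  # 202
```

The queue depth is also a useful scaling signal in its own right, since it rises before GPU utilisation metrics show the problem.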
The platform team needs to own this. If you leave it to the AI team, they will build workarounds that become permanent. If you leave it to nobody, you will discover the limits during a production incident.
