Insights
Hard truths about platforms, cloud, and AI
Short, direct takes on the problems that cause enterprise systems to fail. No fluff. No theory. Just signal.
Every enterprise has an AI strategy. Almost none have an AI operations plan.
The board approved your AI strategy. But nobody planned how to run AI systems in production at 2am when the model starts returning garbage. That gap is where the next outage is hiding.
The real cost of AI is not the model. It is the data pipeline.
Every AI business case focuses on model costs, yet those are the minority of the total cost. The data pipeline is typically 60-70%.
AI rollbacks are harder than you think
Rolling back a model is not like rolling back code. The output distribution changes, and dependent state becomes inconsistent.
Most AI monitoring is just uptime monitoring with a new label
Your AI monitoring checks that the service is responding. It does not check that the service is correct. That gap is where incidents hide for weeks.
Your Kubernetes cluster was not designed for GPU workloads
Standard K8s clusters are optimised for stateless, CPU-bound workloads. AI inference breaks both of those assumptions.
Prompt injection is an infrastructure problem, not an AI problem
If your defence is a regex in your application code, you are playing a game you cannot win. AI security needs to live at the platform layer.
Platform engineers will own AI governance by 2027
The governance problems that cause production incidents are infrastructure problems. Platform teams already know how to solve them.
Why your AI fallback is more dangerous than the failure
When AI fails hard, someone gets paged. When it fails softly with a fallback, the system keeps serving bad results and nobody notices.
AI workloads are hiding in your cloud bill
Nobody knows what AI actually costs because inference runs on shared compute with no attribution. That is a platform architecture problem.
Your AI pipeline has no owner. That is the real risk.
The ML team built it. The data team feeds it. The platform team hosts it. Nobody owns the whole thing. That gap is where incidents hide.
The 3 infrastructure failures every RAG system hits
Vector DB scaling, embedding pipeline throughput, and retrieval quality degradation. The model is usually fine. The platform underneath is not.
Most AI guardrails only protect the demo
The guardrails most teams put around AI systems are tested against friendly inputs. Production inputs are anything but.
Field notes, in your inbox
One short, opinionated piece per fortnight on platform engineering, cloud, and making AI work in production. No spam. Unsubscribe anytime.