Insights

Hard truths about platforms, cloud, and AI

Short, direct takes on the problems that cause enterprise systems to fail. No fluff. No theory. Just signal.

AI Governance | Latest | 13 May 2026

Every enterprise has an AI strategy. Almost none have an AI operations plan.

The board approved your AI strategy. But nobody planned how to run AI systems in production at 2am when the model starts returning garbage. That gap is where the next outage is hiding.

Data | 28 May 2026

The real cost of AI is not the model. It is the data pipeline.

Every AI business case focuses on model costs, yet they are a minority of the total cost. The data pipeline is typically 60-70% of it.

Platform | 16 May 2026

AI rollbacks are harder than you think

Rolling back a model is not like rolling back code. The output distribution changes, and dependent state becomes inconsistent.

SRE | 27 April 2026

Most AI monitoring is just uptime monitoring with a new label

Your AI monitoring checks that the service is responding. It does not check that the service is correct. That gap is where incidents hide for weeks.

Platform | 27 April 2026

Your Kubernetes cluster was not designed for GPU workloads

Standard K8s clusters are optimised for stateless, CPU-bound workloads. AI inference breaks both of those assumptions.

AI Governance | 27 April 2026

Prompt injection is an infrastructure problem, not an AI problem

If your defence is a regex in your application code, you are playing a game you cannot win. AI security needs to live at the platform layer.

AI Governance | 25 April 2026

Platform engineers will own AI governance by 2027

The governance problems that cause production incidents are infrastructure problems. Platform teams already know how to solve them.

AI Governance | 11 April 2026

Why your AI fallback is more dangerous than the failure

When AI fails hard, someone gets paged. When it fails softly with a fallback, the system keeps serving bad results and nobody notices.

Cloud | 28 March 2026

AI workloads are hiding in your cloud bill

Nobody knows what AI actually costs because inference runs on shared compute with no attribution. That is a platform architecture problem.

AI Governance | 14 March 2026

Your AI pipeline has no owner. That is the real risk.

The ML team built it. The data team feeds it. The platform team hosts it. Nobody owns the whole thing. That gap is where incidents hide.

AI | 28 February 2026

The 3 infrastructure failures every RAG system hits

Vector DB scaling, embedding pipeline throughput, and retrieval quality degradation. The model is usually fine. The platform underneath is not.

AI Governance | 14 February 2026

Most AI guardrails only protect the demo

The guardrails most teams put around AI systems are tested against friendly inputs. Production inputs are rarely friendly.


Field notes, in your inbox

One short, opinionated piece per fortnight on platform engineering, cloud, and making AI work in production. No spam. Unsubscribe anytime.

See one of these problems in your system?

Bring your architecture diagram, cloud bill, or last incident. I will tell you what is actually breaking.