Insights

Hard truths about platforms, cloud, and AI

Short, direct takes on the problems that cause enterprise systems to fail. No fluff. No theory. Just signal.

AI Governance | Latest | 13 May 2026

Every enterprise has an AI strategy. Almost none have an AI operations plan.

The board approved your AI strategy. But nobody planned how to run AI systems in production at 2am when the model starts returning garbage. That gap is where the next outage is hiding.


Data | 28 May 2026

The real cost of AI is not the model. It is the data pipeline.

Every AI business case focuses on model costs. Those are the minority of the total. The data pipeline is typically 60-70% of it.

Platform | 16 May 2026

AI rollbacks are harder than you think

Rolling back a model is not like rolling back code. The output distribution changes, and dependent state becomes inconsistent.

SRE | 27 April 2026

Most AI monitoring is just uptime monitoring with a new label

Your AI monitoring checks that the service is responding. It does not check that the service is correct. That gap is where incidents hide for weeks.
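A rough sketch of the difference, not a definitive implementation: a correctness probe scores answers against known-good cases instead of only checking that the call returned. Everything named here (`GOLDEN_CASES`, `probe`, `garbage_model`) is hypothetical, and the substring match stands in for whatever quality check suits your system.

```python
from typing import Callable

# Hypothetical golden set: each prompt is paired with a substring
# that a correct answer is assumed to contain.
GOLDEN_CASES = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]


def probe(model: Callable[[str], str], min_pass_rate: float = 0.9) -> dict:
    """Report 'is it responding' and 'is it correct' as separate signals."""
    responded = 0
    correct = 0
    for prompt, expected in GOLDEN_CASES:
        try:
            answer = model(prompt)
        except Exception:
            continue  # no response: counts against both signals
        responded += 1
        if expected.lower() in answer.lower():
            correct += 1
    total = len(GOLDEN_CASES)
    up = responded == total
    pass_rate = correct / total
    return {"up": up, "pass_rate": pass_rate, "healthy": up and pass_rate >= min_pass_rate}


def garbage_model(prompt: str) -> str:
    # Stub that always responds but never answers the question.
    return "I am a helpful assistant."


if __name__ == "__main__":
    print(probe(garbage_model))
```

The stub passes an uptime check and fails the correctness check, which is the case a responder-only monitor never sees.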

Platform | 27 April 2026

Your Kubernetes cluster was not designed for GPU workloads

Standard K8s clusters are optimised for stateless, CPU-bound workloads. AI inference breaks both assumptions.

AI Governance | 27 April 2026

Prompt injection is an infrastructure problem, not an AI problem

If your defense is a regex in your application code, you are playing a game you cannot win. AI security needs to live at the platform layer.

AI Governance | 25 April 2026

Platform engineers will own AI governance by 2027

The governance problems that cause production incidents are infrastructure problems. Platform teams already know how to solve them.

AI Governance | 11 April 2026

Why your AI fallback is more dangerous than the failure

When AI fails hard, someone gets paged. When it fails softly with a fallback, the system keeps serving bad results and nobody notices.
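One way to make a soft failure loud again, sketched under assumptions: wrap the fallback path so that every use is recorded, and page when the recent fallback rate crosses a threshold. `FallbackTracker`, the window size, and the 5% threshold are illustrative choices, not recommendations.

```python
from collections import deque
from typing import Any, Callable


class FallbackTracker:
    """Serve the fallback on failure, but keep a record that it happened."""

    def __init__(self, window: int = 100, max_fallback_rate: float = 0.05):
        # Rolling record of recent calls: True means the fallback was served.
        self.events = deque(maxlen=window)
        self.max_fallback_rate = max_fallback_rate

    def call(self, primary: Callable[[Any], Any], fallback: Callable[[Any], Any], request: Any) -> Any:
        try:
            result = primary(request)
        except Exception:
            self.events.append(True)
            return fallback(request)
        self.events.append(False)
        return result

    def fallback_rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def should_page(self) -> bool:
        # The fallback keeps users served; this keeps humans informed.
        return self.fallback_rate() > self.max_fallback_rate
```

A real system would export the rate as a metric and alert on it there. The point of the sketch is only that the fallback path emits a signal at all.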

Cloud | 28 March 2026

AI workloads are hiding in your cloud bill

Nobody knows what AI actually costs because inference runs on shared compute with no attribution. That is a platform architecture problem.
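As a minimal illustration of attribution, assuming you can already measure GPU-seconds per owning team (a big assumption on shared compute), a shared bill can be split in proportion to usage. The function name, team names, and figures below are made up for the example.

```python
def attribute_cost(total_cost: float, gpu_seconds_by_owner: dict[str, float]) -> dict[str, float]:
    """Split a shared compute bill in proportion to measured GPU-seconds."""
    total_seconds = sum(gpu_seconds_by_owner.values())
    if total_seconds == 0:
        # Nothing measured, so nothing can be attributed.
        return {owner: 0.0 for owner in gpu_seconds_by_owner}
    return {
        owner: total_cost * seconds / total_seconds
        for owner, seconds in gpu_seconds_by_owner.items()
    }


if __name__ == "__main__":
    # Hypothetical usage figures for two teams sharing one cluster.
    print(attribute_cost(1000.0, {"search": 300.0, "chat": 700.0}))
```

Proportional splitting is the simplest possible policy. It ignores idle capacity and reserved-instance pricing, which a real chargeback model would need to handle.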

AI Governance | 14 March 2026

Your AI pipeline has no owner. That is the real risk.

The ML team built it. The data team feeds it. The platform team hosts it. Nobody owns the whole thing. That gap is where incidents hide.

AI | 28 February 2026

The 3 infrastructure failures every RAG system hits

Vector DB scaling, embedding pipeline throughput, and retrieval quality degradation. The model is usually fine. The platform underneath is not.

AI Governance | 14 February 2026

Most AI guardrails only protect the demo

The guardrails most teams put around AI systems are tested against friendly inputs. Production inputs are not friendly.

Field notes, in your inbox

One short, opinionated piece per fortnight on platform engineering, cloud, and making AI work in production. No spam. Unsubscribe anytime.

See one of these problems in your system?

Bring your architecture diagram, cloud bill, or last incident. I will tell you what is actually breaking.
