Every enterprise has an AI strategy. Almost none have an AI operations plan.
The board approved your AI strategy. Nobody planned what happens when the model starts returning garbage in production at 2am. That gap is where the next outage is hiding.
[Figure: What the board approved vs what production needs. 14 organisations reviewed: 14 had a strategy, 0 had an operations plan. Left panel, "The slide deck everyone approved": ready for launch, looks great in the board deck. Right panel, "The runbook nobody wrote": not ready for 2am, this is where the outage hides. Caption: Demos do not get paged at 2am. Production systems do.]
I have reviewed AI strategies at 14 organisations in the last 18 months. Every one had a strategy. A roadmap. Executive sponsorship. Approved budget. Not one had an operations plan.
An AI strategy tells you what to build. An AI operations plan tells you how to keep it running. Most organisations have the first. Almost none have the second.
Here is what I mean. Your AI strategy says you will deploy a customer-facing recommendation engine by Q3. It covers the business case, the vendor selection, the integration timeline. It does not cover what happens when the model starts recommending products that are out of stock. Or when inference latency spikes because a GPU node failed. Or when the training data pipeline breaks on a Friday night and nobody notices until Monday.
These are not edge cases. They are the normal operating conditions of AI systems in production. Every AI system I have seen in production has experienced at least one of these in its first 90 days.
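Two of these failures are cheap to check for in code. The sketch below is a minimal illustration, not a prescribed implementation: the function names, the six-hour staleness threshold, and the shape of the stock lookup are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: treat the pipeline as stale after 6 hours without a success.
MAX_PIPELINE_AGE = timedelta(hours=6)


def pipeline_is_stale(last_success: datetime, now: datetime,
                      max_age: timedelta = MAX_PIPELINE_AGE) -> bool:
    """Return True when the last successful pipeline run is older than max_age."""
    return now - last_success > max_age


def filter_in_stock(recommended_ids: list[str], stock: dict[str, int]) -> list[str]:
    """Drop recommended products with no stock, keeping the model's ranking order.

    Products missing from the (hypothetical) stock lookup are treated as out of stock.
    """
    return [pid for pid in recommended_ids if stock.get(pid, 0) > 0]


if __name__ == "__main__":
    friday_night = datetime(2024, 1, 5, 22, 0, tzinfo=timezone.utc)
    monday_morning = datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc)
    if pipeline_is_stale(friday_night, monday_morning):
        print("ALERT: training data pipeline has not succeeded recently")
    print(filter_in_stock(["sku-1", "sku-2"], {"sku-1": 4, "sku-2": 0}))
```

Run on a schedule and wired to a pager, a check like the first one turns "nobody notices until Monday" into an alert on Friday night.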
An AI operations plan answers five questions that your strategy does not.
- Who gets paged at 2am when the model degrades? Not the data scientist who trained it. The platform engineer who runs the infrastructure it sits on.
- What is the rollback plan? Not "retrain the model." A specific, tested procedure that reverts to the previous version in under 10 minutes.
- How do you detect quality degradation before customers do? Not uptime monitoring. Output quality tracking with automated regression alerts.
- What is the cost ceiling? Not the budget for the project. A real-time view of what each AI service costs per day, with alerts when it exceeds thresholds.
- What happens when the AI vendor has an outage? Not "we wait." A fallback path that degrades gracefully and tells users what is happening.
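The third and fourth questions above can be sketched as a rolling quality check and a daily spend check. This is one possible shape, under stated assumptions: the `QualityMonitor` class, the 0.0 to 1.0 score scale, the window size, and the thresholds are all hypothetical, and how a response gets scored is left to whatever evaluation you already trust.

```python
from collections import deque
from statistics import mean


class QualityMonitor:
    """Tracks a rolling window of per-response quality scores (assumed 0.0 to 1.0)."""

    def __init__(self, baseline: float, window: int = 100, max_drop: float = 0.10):
        self.baseline = baseline   # quality measured on the last known-good version
        self.max_drop = max_drop   # tolerated absolute drop before alerting
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> None:
        self.scores.append(score)

    def degraded(self) -> bool:
        # Wait for a full window so a single bad response does not page anyone.
        if len(self.scores) < self.scores.maxlen:
            return False
        return mean(self.scores) < self.baseline - self.max_drop


def over_cost_ceiling(spend_today: float, daily_ceiling: float) -> bool:
    """True when a service's spend for the day has passed its ceiling."""
    return spend_today > daily_ceiling


if __name__ == "__main__":
    monitor = QualityMonitor(baseline=0.90, window=5)
    for score in [0.6, 0.5, 0.55, 0.6, 0.5]:
        monitor.record(score)
    if monitor.degraded():
        print("ALERT: output quality regressed against baseline")
    if over_cost_ceiling(spend_today=120.0, daily_ceiling=100.0):
        print("ALERT: daily cost ceiling exceeded")
```

Comparing against a baseline from the previous version, instead of a fixed number, is a deliberate choice here: it ties the alert to the rollback question, because a regression alert is only useful if there is a known-good version to go back to.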
The organisations getting this right are the ones where the platform team is involved before the AI system goes live. They build the runbooks, the monitoring, the rollback procedures, and the cost controls alongside the model integration. Not after it.
The organisations getting it wrong are treating AI like a feature launch instead of an infrastructure deployment. They celebrate the go-live and then scramble when the first incident hits.
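One way to avoid that scramble is to have the fallback path in code before go-live. The following is a minimal sketch, assuming a recommendation service with a static fallback such as a bestseller list; `recommend_with_fallback`, its parameters, and the notice text are illustrative names, not any vendor's API.

```python
from typing import Callable

# Hypothetical user-facing message shown while the AI path is unavailable.
FALLBACK_NOTICE = "Personalised recommendations are temporarily unavailable."


def recommend_with_fallback(
    user_id: str,
    primary: Callable[[str], list[str]],
    fallback: Callable[[], list[str]],
) -> dict:
    """Try the AI-backed path; on any failure, degrade to a static fallback."""
    try:
        return {"items": primary(user_id), "degraded": False, "notice": None}
    except Exception:
        # A real system would also log the failure and raise an alert here.
        return {"items": fallback(), "degraded": True, "notice": FALLBACK_NOTICE}


if __name__ == "__main__":
    def vendor_down(user_id: str) -> list[str]:
        raise TimeoutError("vendor outage")

    def bestsellers() -> list[str]:
        return ["sku-100", "sku-200"]

    print(recommend_with_fallback("user-1", vendor_down, bestsellers))
```

The `degraded` flag and `notice` field are there so the caller can tell users what is happening instead of silently serving weaker results.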
If you have an AI strategy but no operations plan, you are not ready for production. You are ready for a demo. And demos do not get paged at 2am.
