AI rollbacks are harder than you think
Rolling back a model is not like rolling back code. The output distribution changes, and dependent state becomes inconsistent.
[Figure: Why “just roll it back” does not work for AI. Code rollback restores state (same input = same output; system restored). Model rollback creates inconsistency (system broken).]
[Figure: What AI rollback actually requires. Rolling back a model is an infrastructure operation, not a deployment operation.]
When a code deployment goes wrong, you roll back. The previous version was working, you redeploy it, and the system recovers. This mental model is so ingrained that teams apply it to AI deployments without thinking.
It does not work the same way.
When you deploy a new version of an API, the old version and the new version produce the same type of output for the same input. The contract is the same. Only the implementation changed. When you deploy a new model, the output distribution changes. A classification model trained on new data may assign different categories to the same inputs. An embedding model may place the same documents in different regions of vector space.
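One way to make the shift visible before it bites is to run both model versions over the same sample of inputs and count how often they disagree. The following is a minimal sketch, not a definitive implementation: `old_classifier` and `new_classifier` are hypothetical toy stand-ins for two model versions, and the sample texts are invented for illustration.

```python
from collections import Counter
from typing import Callable, Iterable


def disagreement_rate(old_model: Callable[[str], str],
                      new_model: Callable[[str], str],
                      inputs: Iterable[str]) -> float:
    """Fraction of inputs for which the two model versions return different labels."""
    items = list(inputs)
    if not items:
        return 0.0
    changed = sum(1 for x in items if old_model(x) != new_model(x))
    return changed / len(items)


def label_distribution(model: Callable[[str], str], inputs: Iterable[str]) -> Counter:
    """Count how many inputs land in each label under a given model."""
    return Counter(model(x) for x in inputs)


# Hypothetical toy models: the "new" one has learned that slowness is also urgent.
def old_classifier(text: str) -> str:
    return "urgent" if "outage" in text else "routine"


def new_classifier(text: str) -> str:
    return "urgent" if ("outage" in text or "slow" in text) else "routine"


samples = ["database outage", "page is slow", "password reset", "slow checkout"]

print(disagreement_rate(old_classifier, new_classifier, samples))
print(label_distribution(old_classifier, samples))
print(label_distribution(new_classifier, samples))
```

The interface is identical in both versions (text in, label out), which is exactly why a contract test would pass while the decisions themselves change.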
Rolling back a model has side effects that rolling back code does not.
If you updated your embedding model and then roll back, your vector database now contains embeddings from two different models. Similarity search becomes unreliable because the geometric relationships between vectors are inconsistent. You need to re-embed everything, which is an infrastructure operation, not a deployment operation.
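A mixed index is only detectable if every stored vector records which model produced it. The sketch below assumes that tagging exists; `StoredVector`, `model_version`, and the helper names are hypothetical and not taken from any particular vector database.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StoredVector:
    doc_id: str
    model_version: str          # which embedding model produced this vector
    values: Tuple[float, ...]


def versions_in_index(index: List[StoredVector]) -> Counter:
    """Count stored vectors per embedding model version."""
    return Counter(v.model_version for v in index)


def needs_reembedding(index: List[StoredVector], active_version: str) -> List[str]:
    """Document IDs whose vectors were produced by a model other than the active one."""
    return [v.doc_id for v in index if v.model_version != active_version]


def search_candidates(index: List[StoredVector], active_version: str) -> List[StoredVector]:
    """Restrict similarity search to vectors from the active model only."""
    return [v for v in index if v.model_version == active_version]


# Invented example: "b" was embedded while v2 was live, then the model was rolled back to v1.
index = [
    StoredVector("a", "v1", (0.1, 0.2)),
    StoredVector("b", "v2", (0.9, 0.1)),
    StoredVector("c", "v1", (0.3, 0.4)),
]

print(versions_in_index(index))
print(needs_reembedding(index, active_version="v1"))
```

Filtering search to the active version keeps results consistent while the re-embedding job runs, at the cost of temporarily hiding the documents still waiting to be re-embedded.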
If you updated a classification model and then roll back, any decisions made by the new model are now inconsistent with decisions made by the old model. If those decisions were stored, downstream systems may be operating on mixed logic.
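The same idea applies to stored decisions: if each one records the model version that made it, you can re-score the affected window with the restored model and list the conflicts. This is a sketch under that assumption; `Decision`, `find_conflicts`, and the toy `restored_model` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Decision:
    record_id: str
    label: str
    model_version: str   # which model version made this decision


def find_conflicts(decisions: List[Decision],
                   rolled_back_version: str,
                   restored_model: Callable[[str], str],
                   inputs_by_id: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (record_id, stored_label, restored_label) for each decision made by
    the rolled-back version that the restored model would have decided differently."""
    conflicts = []
    for d in decisions:
        if d.model_version != rolled_back_version:
            continue  # decided by the restored model already, nothing to reconcile
        restored_label = restored_model(inputs_by_id[d.record_id])
        if restored_label != d.label:
            conflicts.append((d.record_id, d.label, restored_label))
    return conflicts


# Hypothetical restored (old) model and invented stored decisions.
def restored_model(text: str) -> str:
    return "urgent" if "outage" in text else "routine"


decisions = [
    Decision("t1", "routine", "v1"),
    Decision("t2", "urgent", "v2"),
    Decision("t3", "routine", "v2"),
]
inputs_by_id = {
    "t1": "password reset",
    "t2": "page is slow",
    "t3": "invoice question",
}

print(find_conflicts(decisions, "v2", restored_model, inputs_by_id))
```

What to do with each conflict (overwrite, flag for review, leave as-is) is a business decision; the point is that the list can only be produced if the version was stored alongside the decision.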
A rollback that actually restores the system requires:
- Versioned model artifacts with associated embedding indices
- The ability to rebuild dependent state from a specific model version
- Dual-deployment infrastructure that can run old and new models simultaneously during validation
- Clear rollback runbooks that include data state, not just deployment state
- Automated consistency checks that verify system state after a rollback
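As a rough illustration of how the first and last items fit together, the sketch below pairs each model version with the index built from it and refuses to complete a rollback if that index is missing, empty, or contains vectors from another version. Everything here is hypothetical: `Release`, `check_consistency`, `roll_back`, and the index names are invented, and a real system would read index contents from its vector store rather than from a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Release:
    model_version: str
    embedding_index: str   # name of the index built with this model version


def check_consistency(release: Release, index_contents: Dict[str, Set[str]]) -> List[str]:
    """Return a list of problems; an empty list means data state matches the release.

    index_contents maps an index name to the set of model versions found in it.
    """
    found = index_contents.get(release.embedding_index)
    if found is None:
        return [f"index {release.embedding_index!r} is missing"]
    if not found:
        return [f"index {release.embedding_index!r} is empty"]
    stray = sorted(found - {release.model_version})
    if stray:
        return [f"index {release.embedding_index!r} holds vectors from {stray}"]
    return []


def roll_back(releases: Dict[str, Release],
              target_version: str,
              index_contents: Dict[str, Set[str]]) -> Release:
    """Select the target release only if its dependent data state is consistent."""
    release = releases[target_version]
    problems = check_consistency(release, index_contents)
    if problems:
        raise RuntimeError("rollback blocked: " + "; ".join(problems))
    return release


releases = {
    "v1": Release("v1", "docs-v1"),
    "v2": Release("v2", "docs-v2"),
}
index_contents = {"docs-v1": {"v1"}, "docs-v2": {"v2"}}

active = roll_back(releases, "v1", index_contents)
print(active.embedding_index)
```

Keeping one index per model version, rather than overwriting a single shared index, is what makes the rollback a pointer switch instead of a re-embedding job.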
Most teams discover these requirements during their first failed model deployment. Teams that prepare build this infrastructure in advance. Teams that do not end up spending 48 hours in an incident trying to figure out why rolling back the model did not fix the problem.
