The 3 infrastructure failures every RAG system hits
Vector DB scaling, embedding pipeline throughput, and retrieval quality degradation. The model is usually fine. The platform underneath is not.
Figure: Where RAG systems break in production.
Pipeline: Query (user input) -> Embed (vectorize) -> Vector DB (search) -> Retrieve (top-K docs) -> LLM (generate) -> Answer (response).
Vector DB latency: 20ms in dev, 500ms at 10M vectors.
Embedding pipeline lag: 6 hours of batch lag, so users query yesterday's data.
Retrieval precision: down 23%, silently, with no monitoring to catch it.
Every retrieval-augmented generation system I have seen in production hits the same three infrastructure failures. The model is usually fine. The platform underneath is not.
The first failure is vector database scaling. Teams prototype with a few thousand embeddings and everything works. In production, the index grows to millions of vectors. Query latency jumps from 20ms to 500ms. The retrieval step that was invisible in development becomes the bottleneck in production. Most teams do not shard their index or tune HNSW parameters until latency is already a problem.
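A back-of-envelope capacity estimate makes the scaling problem visible before it arrives. The sketch below uses a common rule of thumb for float32 HNSW indexes, roughly (dim * 4 + M * 2 * 4) bytes per vector, which is the figure given in the Faiss index-selection guidelines. The function names, the 768-dimension embeddings, M = 32, the 16 GiB shard size, and the 50% headroom are all assumptions for illustration. Real engines add overhead for metadata, payloads, and replicas, so treat the result as a floor.

```python
import math


def hnsw_bytes_per_vector(dim: int, m: int) -> int:
    """Approximate RAM per vector: float32 payload plus graph links.

    Rule of thumb only: dim * 4 bytes for the vector and
    m * 2 * 4 bytes for the neighbour links on the base layer.
    """
    return dim * 4 + m * 2 * 4


def shards_needed(
    n_vectors: int,
    dim: int,
    m: int,
    shard_ram_bytes: int,
    headroom: float = 0.5,
) -> int:
    """Number of shards so each index fits in the usable part of RAM.

    headroom is the fraction of each shard's RAM kept free for
    queries, rebuilds, and growth (an assumed policy, not a standard).
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = shard_ram_bytes * (1 - headroom)
    total = n_vectors * hnsw_bytes_per_vector(dim, m)
    return max(1, math.ceil(total / usable))


if __name__ == "__main__":
    n = 10_000_000
    total_gb = n * hnsw_bytes_per_vector(768, 32) / 1e9
    shards = shards_needed(n, 768, 32, shard_ram_bytes=16 * 2**30)
    print(f"{total_gb:.2f} GB of index, {shards} shards")
```

Under these assumptions, 10M vectors come to about 33 GB of index, which no longer fits on the single small node that served the prototype.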
The second failure is embedding pipeline throughput. New documents need to be embedded and indexed before the RAG system can use them. In a prototype, this happens in a notebook. In production, you need a pipeline that handles thousands of documents per hour, deduplicates, chunks intelligently, and updates the index without downtime. Most teams build this as a batch job. Then they realise their users need near-real-time data and the architecture cannot support it.
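The shape of an incremental ingest step can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the index is a plain dict, `embed` is a hypothetical callable standing in for whatever embedding model you use, and the chunker is a naive fixed-size word window rather than anything "intelligent". The point is that each document is processed on arrival and chunks already in the index are skipped, so the same function can sit behind a queue consumer instead of a nightly batch.

```python
import hashlib
from typing import Callable, Dict, List


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into word windows of `size` words, overlapping by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def chunk_id(chunk: str) -> str:
    """Content hash used as the dedup key (case and whitespace normalised)."""
    normalized = " ".join(chunk.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def ingest(
    doc: str,
    index: Dict[str, List[float]],
    embed: Callable[[str], List[float]],
    size: int = 200,
    overlap: int = 50,
) -> int:
    """Embed and store only chunks not yet in the index; return how many were added."""
    added = 0
    for chunk in chunk_text(doc, size, overlap):
        key = chunk_id(chunk)
        if key in index:
            continue  # duplicate content: skip the embedding call entirely
        index[key] = embed(chunk)
        added += 1
    return added
```

Keying on a content hash means re-delivered or lightly reformatted documents cost nothing to re-ingest, which matters once the source is an event stream with at-least-once delivery.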
The third failure is retrieval quality degradation over time. The embedding model was tuned on a snapshot of your data. Six months later, your data has drifted. New terminology, new products, new internal language. Retrieval precision drops silently. The system still returns results. They are just increasingly wrong.
Nobody notices because there is no monitoring for retrieval relevance, only for retrieval latency.
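One way to make relevance visible is to replay a small labelled set of queries on a schedule and track precision@k against a recorded baseline. The sketch below assumes you maintain such a "golden set" mapping each query to the document IDs a human judged relevant. The function names and the 10% relative-drop threshold are illustrative assumptions, not a standard.

```python
from typing import Dict, List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def mean_precision_at_k(
    results: Dict[str, List[str]],
    golden: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average precision@k over every query in the golden set.

    A query missing from `results` counts as zero precision.
    """
    scores = [
        precision_at_k(results.get(query, []), relevant, k)
        for query, relevant in golden.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


def has_degraded(
    current: float, baseline: float, max_relative_drop: float = 0.10
) -> bool:
    """True when precision fell more than the allowed fraction of baseline."""
    if baseline <= 0:
        return False
    return (baseline - current) / baseline > max_relative_drop
```

Emitting the mean as a metric next to retrieval latency puts a relevance regression on the same dashboard, and behind the same alerting, as an outage.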
The fix for all three is the same: treat RAG infrastructure like production infrastructure. Capacity-plan the vector database. Build the embedding pipeline as a streaming service. Monitor retrieval quality, not just availability. The teams that do this ship AI that works. The teams that skip it ship demos that break.
