Insights
AI, 28 February 2026

The 3 infrastructure failures every RAG system hits

Vector DB scaling, embedding pipeline throughput, and retrieval quality degradation. The model is usually fine. The platform underneath is not.

Figure: Where RAG systems break in production

Pipeline: Query (user input) -> Embed (vectorize) -> Vector DB (search) -> Retrieve (top-K docs) -> LLM (generate) -> Answer (response)

1. Vector DB scaling: 20ms in dev. 500ms at 10M vectors.

2. Stale embeddings: 6 hours of batch pipeline lag. Users query yesterday's data.

3. Retrieval drift: -23%. Precision drops silently. No monitoring catches it.

Every retrieval-augmented generation system I have seen in production hits the same three infrastructure failures. The model is usually fine. The platform underneath is not.

The first failure is vector database scaling. Teams prototype with a few thousand embeddings and everything works. In production, the index grows to millions of vectors. Query latency jumps from 20ms to 500ms. The retrieval step that was invisible in development becomes the bottleneck in production. Most teams do not shard their index or tune HNSW parameters until it is already a problem.
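To make the capacity-planning point concrete, here is a rough back-of-the-envelope sketch. It uses a common rule of thumb for an HNSW index over raw float32 vectors (about `dim * 4 + M * 2 * 4` bytes per vector), which ignores upper graph layers, metadata, and replication. The 10 million vector count is the figure used in this post; the 768 dimensions, `M = 16`, 16 GiB nodes, and 50% headroom are assumptions chosen only for illustration, and `plan_shards` is a hypothetical helper, not part of any vector database.

```python
import math


def hnsw_bytes_per_vector(dim: int, m: int) -> int:
    """Rule-of-thumb size: float32 vector plus ~2*M 4-byte neighbour ids."""
    return dim * 4 + m * 2 * 4


def plan_shards(num_vectors: int, dim: int, m: int,
                ram_budget_bytes: int, headroom: float = 0.5):
    """Return (estimated index bytes, shard count) for a per-node RAM budget.

    headroom is the fraction of node RAM we allow the index to occupy;
    the rest is left for the OS, query buffers, and growth (an assumption).
    """
    total = num_vectors * hnsw_bytes_per_vector(dim, m)
    usable = ram_budget_bytes * headroom
    shards = max(1, math.ceil(total / usable))
    return total, shards


# Assumed workload: 10M vectors, 768 dimensions, M=16, 16 GiB nodes.
total, shards = plan_shards(10_000_000, 768, 16, 16 * 1024**3)
print(f"{total / 1e9:.1f} GB across {shards} shard(s)")
```

Running the numbers before the index grows is the whole point: under these assumptions the index no longer fits on one node, which is a decision you want to make on paper rather than during an incident.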

The second failure is embedding pipeline throughput. New documents need to be embedded and indexed before the RAG system can use them. In a prototype, this happens in a notebook. In production, you need a pipeline that handles thousands of documents per hour, deduplicates, chunks intelligently, and updates the index without downtime. Most teams build this as a batch job. Then they realise their users need near-real-time data and the architecture cannot support it.
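A minimal sketch of the incremental alternative to a nightly batch job: each document is chunked, deduplicated by content hash, and upserted as it arrives. Everything here is illustrative. `embed` and `upsert` are hypothetical callables standing in for your embedding model and vector store, the fixed-size word window is a naive stand-in for real chunking logic, and the in-memory `seen` set would need to be a durable store in a real service.

```python
import hashlib


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


class IncrementalIndexer:
    """Embeds and upserts only chunks whose content has not been seen before."""

    def __init__(self, embed, upsert):
        self.embed = embed    # hypothetical: chunk text -> vector
        self.upsert = upsert  # hypothetical: (chunk_id, vector, metadata) -> None
        self.seen = set()     # content hashes already indexed (in-memory only)

    def ingest(self, doc_id: str, text: str) -> int:
        """Index one document; return how many new chunks were written."""
        written = 0
        for chunk in chunk_text(text):
            digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            if digest in self.seen:
                continue  # duplicate content, skip the embedding call
            self.seen.add(digest)
            self.upsert(digest, self.embed(chunk), {"doc_id": doc_id, "text": chunk})
            written += 1
        return written
```

Because `ingest` works on one document at a time, it can sit behind a queue consumer or a change-data-capture hook instead of a cron schedule, which is what closes the gap between a document changing and users being able to retrieve it.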

The third failure is retrieval quality degradation over time. The embedding model was tuned on a snapshot of your data. Six months later, your data has drifted. New terminology, new products, new internal language. Retrieval precision drops silently. The system still returns results. They are just increasingly wrong.

Nobody notices because there is no monitoring for retrieval relevance, only for retrieval latency.
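One way to close that monitoring gap is to score retrieval against a small hand-labelled "golden" set of queries on a schedule and alert when precision falls relative to a recorded baseline. The sketch below assumes you can maintain such a set; `retrieve` is a hypothetical callable returning ranked document ids, and the 10% relative-drop threshold is an arbitrary example, not a recommendation.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k


def mean_precision_at_k(retrieve, golden: dict[str, set[str]], k: int = 5) -> float:
    """Average precision@k over a golden set of {query: relevant doc ids}."""
    if not golden:
        raise ValueError("golden set is empty")
    scores = [precision_at_k(retrieve(query), relevant, k)
              for query, relevant in golden.items()]
    return sum(scores) / len(scores)


def relevance_alert(current: float, baseline: float,
                    max_relative_drop: float = 0.10) -> bool:
    """True when precision has fallen more than the allowed fraction of baseline."""
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    return (baseline - current) / baseline > max_relative_drop
```

Emitting `mean_precision_at_k` as a metric next to latency puts relevance on the same dashboard, so a slow decline shows up as a trend line rather than as user complaints.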

The fix for all three is the same: treat RAG infrastructure like production infrastructure. Capacity-plan the vector database. Build the embedding pipeline as a streaming service. Monitor retrieval quality, not just availability. The teams that do this ship AI that works. The teams that do not ship demos that break.


Senna Semakula


Founder, Atruvo
