
RAG Failure Modes in Production

After running a RAG system serving 500+ daily users for eight months, I've catalogued every way it can fail. Most of these don't show up in tutorials or benchmarks.

Chunking disasters are the most common. Your chunking strategy determines the ceiling of your retrieval quality. Too small and you lose context. Too large and you dilute relevance. The worst case is when a chunk boundary splits a key piece of information across two chunks, and neither is retrieved.
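One common mitigation is overlapping chunks, so that information near a boundary appears whole in at least one chunk. A minimal sketch (character-based for simplicity; real systems usually split on tokens or sentences):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks so that a fact straddling one
    chunk boundary still appears intact in the neighboring chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap doesn't eliminate boundary splits, it just makes them survivable: each boundary region is duplicated into the next chunk, at the cost of some index bloat.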

Embedding drift is subtle but real. If your knowledge base evolves over time (and it will), the distribution of your embeddings shifts. Queries that worked well six months ago may retrieve less relevant results today. Solution: periodic re-embedding and monitoring of retrieval quality metrics.
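One way to catch this is a fixed probe set of queries you re-run on a schedule, tracking mean top-k similarity over time. A sketch of that idea (the function name and the alerting threshold are my own; swap in your actual embedding vectors):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieval_health(query_vecs: list, doc_vecs: list, k: int = 3) -> float:
    """Mean top-k cosine similarity for a fixed probe set of queries.
    A sustained drop between scheduled runs is a signal to re-embed."""
    scores = []
    for q in query_vecs:
        sims = sorted((cosine(q, d) for d in doc_vecs), reverse=True)
        scores.append(sum(sims[:k]) / k)
    return sum(scores) / len(scores)
```

The absolute number matters less than its trend: compare each run against the previous few and alert on a consistent decline rather than a single noisy dip.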

The retrieval-generation gap is when your retriever finds the right documents but the generator doesn't use them correctly. The model might ignore relevant context, hallucinate despite having the answer in the retrieved text, or synthesize an answer from the wrong chunks.
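A cheap way to flag this gap in production is a grounding check: measure how much of the generated answer actually appears in the retrieved context. This token-overlap heuristic is crude (it misses paraphrase and rewards stopwords), but it's a useful first-pass signal, not a faithfulness guarantee:

```python
def grounding_score(answer: str, chunks: list[str]) -> float:
    """Fraction of answer tokens that appear in at least one retrieved
    chunk -- a rough proxy for whether the generator used the context.
    Low scores on answers with good retrieval suggest hallucination."""
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return 0.0
    context_tokens = set()
    for chunk in chunks:
        context_tokens.update(chunk.lower().split())
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```

Logging this score per response lets you separate retrieval failures (wrong chunks) from generation failures (right chunks, ungrounded answer), which need very different fixes.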

Metadata filtering is underrated. Pure semantic search is rarely sufficient. Adding structured metadata (date, source, category, access level) and supporting filtered queries dramatically improves precision for real-world use cases.
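The order of operations matters here: filter on metadata first, then rank the survivors semantically. Filtering after a top-k vector search can silently return fewer than k results. A minimal sketch (the `Doc` structure and precomputed scores are illustrative, not a particular vector DB's API):

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    text: str
    score: float               # precomputed semantic similarity to the query
    meta: dict = field(default_factory=dict)

def filtered_search(docs: list[Doc], filters: dict, k: int = 5) -> list[Doc]:
    """Apply structured metadata filters first, then rank the eligible
    documents by semantic score and take the top k."""
    eligible = [d for d in docs
                if all(d.meta.get(key) == val for key, val in filters.items())]
    return sorted(eligible, key=lambda d: d.score, reverse=True)[:k]
```

Most production vector stores support this pre-filtering natively; the point of the sketch is the sequencing, which you should verify your store actually honors.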

Monitor everything: retrieval precision, answer quality, user satisfaction signals, and especially the queries where users reformulate — those tell you exactly where your system is failing.
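Reformulations can be mined from session logs with something as simple as token-overlap similarity between consecutive queries: similar enough to be the same intent, different enough to show the first attempt failed. A sketch, with a threshold I picked for illustration:

```python
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def reformulations(session_queries: list[str], threshold: float = 0.4) -> list[tuple]:
    """Return consecutive query pairs within one session that look like
    reformulations: token overlap above the threshold but not identical."""
    pairs = []
    for prev, curr in zip(session_queries, session_queries[1:]):
        if threshold <= jaccard(prev, curr) < 1.0:
            pairs.append((prev, curr))
    return pairs
```

Clustering the first query of each flagged pair gives you a ranked list of the question types your system handles worst, which is about as actionable as monitoring gets.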