30 June 2026
RAG in production: the 5 mistakes I keep seeing everywhere
The five RAG mistakes I keep seeing in production: poor chunking, no reranking, no evaluation loop, treating RAG as a silver bullet, and skipping latency and cost budgets. Each one comes with a concrete fix.
RAG works better when we stop selling it as magic
I have seen many RAG systems start with the same promise: connect the company's documents to an LLM and get a reliable assistant with internal context. The demo is usually persuasive. You ask three prepared questions, the system finds the right PDF, quotes a relevant sentence, and everyone sees the potential. The problems start with real users, mixed document types, imperfect permissions, rising costs, and answers that need to be defended.
RAG is not a product. It is one possible architecture for reducing hallucination and injecting context. In production, its quality depends on very concrete decisions: how sources are chunked, how passages are selected, how errors are measured, which questions the system refuses, and how much time and money each answer is allowed to cost. Teams that accept that reality build useful systems. Teams that do not usually discover later that their assistant is just a fuzzy search engine with a polished conversational interface.
Mistake 1: chunking documents without a strategy
The first mistake is almost always chunking. Many teams split every document into fixed blocks of 500 or 1,000 tokens, add some overlap, index everything, and move on. It feels safe because it is easy to automate. It is also a very efficient way to break meaning. A legal clause cut in half, a procedure separated from its exceptions, a table detached from its heading, or a FAQ answer mixed with footer text can be enough to produce an incomplete but confident answer.
The right question is not which chunk size to pick. The right question is which unit actually carries meaning in this corpus. In product documentation, it may be a full section with its title, subsections, and version. In contracts, it may be a complete clause plus the definitions that precede it. In support tickets, it may be the problem, diagnosis, and resolution together, not each message separately. Chunking should respect business structure before it respects a token limit.
The fix is practical: build different chunking strategies per source type, enrich passages with useful metadata, keep the path back to the original document, and test retrieval on real questions. I would rather have a slower ingestion pipeline that can be explained than a huge index where nobody knows why a fragment appears first. If the team cannot explain why the right context should be retrieved, it is hoping.
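To make the idea of a meaning-carrying unit concrete, here is a minimal sketch of one chunking strategy for one source type. It assumes headings are lines that start with `#`, which will not hold for contracts or tickets; the `Chunk` class, the `chunk_by_heading` function, and the sample path are hypothetical names for illustration, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_heading(doc_text: str, source_path: str, doc_type: str) -> list[Chunk]:
    """Split a document on heading lines so each chunk is one whole section.

    Assumption: a heading is any line starting with '#'. Other source types
    need their own rule (clause numbers, ticket boundaries, and so on).
    """
    chunks: list[Chunk] = []
    title = "(preamble)"
    body: list[str] = []

    def flush() -> None:
        # Emit the section collected so far, keeping the path back to the source.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(
                text=f"{title}\n{text}",
                metadata={"source": source_path, "doc_type": doc_type, "section": title},
            ))

    for line in doc_text.splitlines():
        if line.startswith("#"):
            flush()
            title = line.lstrip("#").strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


# Made-up sample document: the exception stays with the rule it modifies.
doc = "# Refunds\nRefunds take 5 days.\nException: gift cards.\n# Shipping\nShips in 48h."
chunks = chunk_by_heading(doc, "handbook/policies.md", "policy")
```

The point of the design is that the section title travels with the text and the metadata, so a retrieved fragment can always be traced to its document and explained.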
Mistake 2: assuming vector search is enough
The second mistake is stopping at the top k results from vector search. Embeddings are powerful, but similarity scores do not always separate a passage that is vaguely close from the passage that actually decides the answer. They can favor similar wording, miss a recent constraint, or return five redundant chunks from the same document while the important exception sits somewhere else. In production, that nuance is exactly where expensive mistakes appear.
The fix is to add reranking and often hybrid search. Lexical search catches identifiers, product names, internal codes, and rare terms. The reranker reads the candidate passages again and orders them against the actual question. You can also deduplicate by source, penalize obsolete documents, filter by permission, or require a freshness threshold. This is not a luxury feature. It is often the difference between an answer that merely sounds right and an answer that uses the right evidence.
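One common way to combine a vector ranking with a lexical ranking is reciprocal rank fusion, followed by a cap on how many passages a single source may contribute. The sketch below is one possible approach under that assumption, not a prescribed method: the function names and document ids are hypothetical, and a real reranker (for example a cross-encoder) would still rescore the fused candidates afterwards.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids; k=60 is a commonly used default."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def cap_per_source(doc_ids: list[str], source_of: dict[str, str],
                   max_per_source: int = 2) -> list[str]:
    """Keep ranking order but limit redundant chunks from the same document."""
    kept: list[str] = []
    seen: dict[str, int] = {}
    for doc_id in doc_ids:
        src = source_of[doc_id]
        if seen.get(src, 0) < max_per_source:
            kept.append(doc_id)
            seen[src] = seen.get(src, 0) + 1
    return kept


# Made-up rankings: vector search over-represents document A.
vector_hits = ["a1", "a2", "a3", "b1"]
lexical_hits = ["b1", "c1", "a1"]
source_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "c1": "C"}

fused = reciprocal_rank_fusion([vector_hits, lexical_hits])
candidates = cap_per_source(fused, source_of)
print(candidates)  # ['a1', 'b1', 'a2', 'c1']
```

Passages found by both retrievers rise to the top, and the third chunk from document A is dropped so that other sources get a slot in the context.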
Mistake 3: shipping without an evaluation pipeline
The third mistake is the most dangerous one: not knowing whether the system is getting better. Teams test a few questions in the interface, tweak a prompt, change the number of chunks, switch the model, and judge the result by feel. That may be enough for a demo. In production, that approach is unmanageable. Every change can improve one family of questions and silently degrade another. Without evaluation, the RAG system becomes a black box nobody wants to touch.
A useful RAG evaluation has to separate at least three things: retrieval, generation, and user experience. Does the right document appear among the candidates? Does the passage sent to the model actually contain the answer? Does the final response respect the sources, cite correctly, refuse when context is insufficient, and remain useful for the business user? These dimensions should not be collapsed into one magical score.
The fix starts small. Create a set of 50 to 100 real questions with expected sources, acceptable answers, expected refusals, and trap cases. Replay that set after every meaningful change. Log retrieved passages, the rank of the correct passage, tokens, latency, model, prompt, and feedback. Add LLM judges only where their judgments are calibrated against human examples. Evaluation is not an administrative layer. It is the steering wheel of the system.
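The retrieval part of that replay can be very small. Here is a minimal sketch that scores a question set with hit rate at k and mean reciprocal rank, assuming each case lists its expected source ids. The `evaluate_retrieval` function, the case format, and the fake retriever are hypothetical; generation quality and user experience need their own checks.

```python
from typing import Callable


def evaluate_retrieval(golden: list[dict],
                       retrieve: Callable[[str], list[str]],
                       k: int = 5) -> dict[str, float]:
    """Replay a golden set and report hit rate@k and mean reciprocal rank."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = 0
    reciprocal_ranks = 0.0
    for case in golden:
        retrieved = retrieve(case["question"])[:k]
        # Rank (1-based) of the first retrieved id that is an expected source.
        rank = next(
            (i for i, doc_id in enumerate(retrieved, start=1)
             if doc_id in case["expected_sources"]),
            None,
        )
        if rank is not None:
            hits += 1
            reciprocal_ranks += 1.0 / rank
    n = len(golden)
    return {"hit_rate": hits / n, "mrr": reciprocal_ranks / n}


# Made-up golden set and a fake retriever standing in for the real pipeline.
golden = [
    {"question": "q1", "expected_sources": {"doc-a"}},
    {"question": "q2", "expected_sources": {"doc-b"}},
    {"question": "q3", "expected_sources": {"doc-c"}},
    {"question": "q4", "expected_sources": {"doc-d"}},
]
fake_results = {
    "q1": ["doc-a", "doc-x"],
    "q2": ["doc-x", "doc-b"],
    "q3": ["doc-x", "doc-y"],
    "q4": ["doc-x", "doc-y", "doc-z", "doc-d"],
}

report = evaluate_retrieval(golden, lambda q: fake_results[q], k=5)
print(report)  # {'hit_rate': 0.75, 'mrr': 0.4375}
```

Keeping the two numbers separate matters: hit rate says whether the right source is among the candidates at all, while the reciprocal rank shows how far down the list it sits.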
Mistake 4: treating RAG as a silver bullet
The fourth mistake is strategic: treating RAG as the answer to every LLM problem. A RAG system does not solve a poorly defined workflow, contradictory source data, unclear permissions, or a business decision that should remain human. It gives context to the model. It does not automatically turn that context into a safe, traceable, and acceptable decision for the organization.
The fix is to narrow the promise. Define the tasks the RAG system must do very well, the tasks where it should assist without deciding, and the tasks it must refuse. For some workflows, structured extraction, a business rule, a SQL query, a classic search screen, or a well-designed form will outperform a conversational RAG interface. The best systems I have seen do not try to cover everything. They combine retrieval, rules, human validation, business tools, and explicit refusals. It is less spectacular. It is much more reliable.
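Narrowing the promise can be written down as data rather than left in a slide. The sketch below is one hypothetical way to do it: the task types and the `mode_for` helper are invented for illustration, and how a question gets classified into a task type is out of scope here.

```python
from enum import Enum


class Mode(Enum):
    ANSWER = "answer"  # the system may answer, with citations
    ASSIST = "assist"  # the system drafts, a human decides
    REFUSE = "refuse"  # out of scope: decline and redirect


# Hypothetical task types; the real list should come from the business owner.
SCOPE = {
    "product_howto": Mode.ANSWER,
    "contract_review": Mode.ASSIST,
    "hr_decision": Mode.REFUSE,
}


def mode_for(task_type: str) -> Mode:
    """Unknown task types fall back to REFUSE instead of a best-effort answer."""
    return SCOPE.get(task_type, Mode.REFUSE)
```

Defaulting to refusal for anything unlisted is a deliberate choice in this sketch: extending the scope then becomes an explicit decision instead of an accident.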
Mistake 5: skipping latency and cost budgeting
The fifth mistake often appears too late, when users actually start relying on the system. A serious RAG flow may do a lot before answering: query rewriting, hybrid search, reranking, permission filtering, generation, citation formatting, and sometimes a final verification pass. Every step adds latency, tokens, infrastructure, and failure modes. If nobody has defined the budget, the product becomes slow or expensive precisely when it starts proving its value.
The fix is to budget from the design phase. What p95 latency is acceptable for this use case? How many documents can the system reread before the answer costs more than it is worth? Which paths should be fast and approximate, and which paths can be slower because the risk is higher? Measure each step, cache what can be cached, use different models for different tasks, and design a degraded mode. Performance is not just optimization. In RAG, it shapes the product.
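Measuring each step against a budget does not need heavy tooling to start. Here is a minimal sketch, assuming per-step latencies are collected in milliseconds and compared to a p95 target using the nearest-rank method. `StepTimer`, `p95`, `over_budget`, the step names, and the sample numbers are all hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepTimer:
    """Collects wall-clock durations (ms) per named pipeline step."""

    def __init__(self) -> None:
        self.samples: dict[str, list[float]] = defaultdict(list)

    @contextmanager
    def step(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples[name].append((time.perf_counter() - start) * 1000.0)


def p95(values: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def over_budget(samples: dict[str, list[float]],
                budget_ms: dict[str, float]) -> dict[str, float]:
    """Return the steps whose p95 latency exceeds their budget."""
    return {
        name: p95(values)
        for name, values in samples.items()
        if name in budget_ms and p95(values) > budget_ms[name]
    }


timer = StepTimer()
with timer.step("retrieve"):
    pass  # the real retrieval call would go here

# Made-up latencies: reranking is usually fast but has a slow tail.
observed = {"rerank": [100.0] * 18 + [900.0, 950.0], "generate": [500.0] * 20}
violations = over_budget(observed, {"rerank": 300.0, "generate": 1500.0})
print(violations)  # {'rerank': 900.0}
```

The average for the rerank step would look healthy here; the p95 is what shows the tail that users actually feel.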
RAG readiness checklist
Before putting RAG in production, I want clear answers to a few questions. Which corpora are included, excluded, and owned? Which chunking unit preserves meaning for each source? Which metadata filters language, version, freshness, access rights, and document type? What mix of vector search, lexical search, and reranking is used? Which cases should refuse to answer? Which logs let the team replay a bad response?
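The metadata question becomes easier to answer once the filter is written down. The following is a minimal sketch of a filter on language, access rights, and freshness, applied before any passage reaches the model. The `Passage` fields, group names, and dates are hypothetical, and a real system would usually push these filters into the search index rather than apply them in application code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Passage:
    doc_id: str
    language: str
    allowed_groups: frozenset[str]
    updated: date


def filter_passages(passages: list[Passage], user_groups: set[str],
                    language: str, not_before: date) -> list[Passage]:
    """Keep passages the user may read, in the right language, and fresh enough."""
    return [
        p for p in passages
        if p.language == language
        and p.allowed_groups & user_groups  # at least one shared group
        and p.updated >= not_before
    ]


# Made-up passages for illustration.
passages = [
    Passage("pricing-2025", "en", frozenset({"sales"}), date(2025, 11, 3)),
    Passage("pricing-2021", "en", frozenset({"sales"}), date(2021, 2, 1)),
    Passage("payroll", "en", frozenset({"hr"}), date(2026, 1, 15)),
    Passage("tarifs", "fr", frozenset({"sales"}), date(2026, 1, 15)),
]
visible = filter_passages(passages, {"sales"}, "en", date(2024, 1, 1))
print([p.doc_id for p in visible])  # ['pricing-2025']
```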
I also want a minimum operational foundation: real questions, expected sources, regression tests, latency tracking, cost tracking, examples of dangerous answers, a process to repair the index, and a business owner who can arbitrate ambiguity. If these pieces exist, RAG can become a real product. If they do not, you can still make a demo. But it is important to be honest: production has not started yet.