Retrieval-augmented generation (RAG) is a mouthful for a simple idea: when the model has to answer a question about something it doesn't know, you fetch the relevant material first and stuff it into the prompt. The model then has the source in front of it and can ground its answer in that source.
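The fetch-then-prompt loop can be sketched in a few lines. This is a minimal illustration, not a production retriever: the word-overlap scorer stands in for an embedding search, and `tokenize`, `retrieve`, `build_prompt`, and the sample documents are all hypothetical names invented for the example.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank docs by how many words they share with the query; keep the top k that match."""
    q = tokenize(query)
    scored = [(len(q & tokenize(d)), d) for d in docs]
    scored = [(s, d) for s, d in scored if s > 0]  # drop docs with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]

def build_prompt(query: str, passages: list[str]) -> str:
    """Place the retrieved passages ahead of the question."""
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {query}"

# Made-up corpus for illustration only.
docs = [
    "Refunds are issued within 14 days of purchase.",
    "The office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
question = "How many days do refunds take?"
prompt = build_prompt(question, retrieve(question, docs, k=1))
```

The resulting `prompt` string is what you would send to the model. Swapping the scorer for a real embedding search changes `retrieve` but leaves the rest of the flow alone.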
Use RAG when
- The answer lives in a corpus that changes (docs, knowledge base, internal wiki).
- You need attribution back to a source.
- Fine-tuning is overkill — you'd be re-training every time a doc changes.
Skip RAG when
- The corpus is small enough to drop into the prompt directly. If it fits in 50k tokens, you're often better off skipping the retrieval step.
- The task is reasoning, not lookup. RAG doesn't make the model think better; it gives it more to think about.
- Your real bottleneck is that the data isn't clean. Garbage in, retrieval out.
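The "small enough to drop into the prompt" check from the list above can be approximated before you build anything. The sketch below assumes roughly four characters per token, which is only a rule of thumb for English text; real counts depend on the model's tokenizer. `estimate_tokens` and `fits_in_prompt` are hypothetical helpers.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (assumption: ~4 characters per token)."""
    return int(len(text) / chars_per_token) + 1

def fits_in_prompt(docs: list[str], budget_tokens: int = 50_000) -> bool:
    """True if the whole corpus plausibly fits under the token budget."""
    return sum(estimate_tokens(d) for d in docs) <= budget_tokens
```

If `fits_in_prompt` comes back true with room to spare, trying the no-retrieval version first is usually the cheaper experiment.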
Where RAG goes wrong
- Chunk boundaries cut a sentence in half, and the model misses the qualifier. Use overlap and respect paragraph breaks.
- Top-K results all say the same thing. Diversity matters more than rank for "ground me in the corpus" use cases.
- The retriever returns nothing and the model invents an answer. Always check for empty retrieval and tell the model what to do (refuse, or escalate).
- Stale index. Recompute embeddings on a schedule, not just on initial load.
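Each of the four failure modes above has a small mechanical guard. The sketch below is one possible shape for them, under stated assumptions: hits are plain dicts with hypothetical `doc_id` and `text` keys, diversity is approximated by a simple per-document cap (a simpler swap-in than MMR-style reranking), and staleness is detected with a content hash that a scheduled job could use to decide what to re-embed. All function names are invented for illustration.

```python
import hashlib

def chunk_paragraphs(text: str, max_chars: int = 500, overlap: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks, repeating the last `overlap`
    paragraphs at the start of the next chunk. A single paragraph longer
    than max_chars still becomes its own oversized chunk."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for p in paras:
        if current and sum(len(x) for x in current) + len(p) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.append(p)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def cap_per_doc(hits: list[dict], k: int = 3, max_per_doc: int = 1) -> list[dict]:
    """Walk ranked hits in order, keeping at most `max_per_doc` per source document."""
    kept: list[dict] = []
    seen: dict[str, int] = {}
    for h in hits:
        n = seen.get(h["doc_id"], 0)
        if n < max_per_doc:
            kept.append(h)
            seen[h["doc_id"]] = n + 1
        if len(kept) == k:
            break
    return kept

def guarded_prompt(question: str, hits: list[dict]) -> str:
    """Give the model explicit instructions when retrieval comes back empty."""
    if not hits:
        return (
            "No sources were retrieved for this question. "
            "Do not guess; say you cannot answer and suggest escalating.\n\n"
            f"Question: {question}"
        )
    sources = "\n".join(f"[{h['doc_id']}] {h['text']}" for h in hits)
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {question}"

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_doc_ids(docs: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Ids whose current text no longer matches the hash stored at index time."""
    return [d for d, text in docs.items() if indexed_hashes.get(d) != content_hash(text)]
```

A scheduled job could call `stale_doc_ids` and re-embed only what it returns, which keeps the refresh cheap enough to run often.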