Fundamentals · 2026-05-06
Retrieval Recall vs. Precision: How to Measure RAG Quality Without Fooling Yourself
Recall@k is the metric that matters most in RAG retrieval — but measuring it correctly requires a held-out evaluation set most teams don't have. Here is how to build one and use it.
§1
Why Retrieval Quality Is the RAG Bottleneck
The most common source of poor RAG quality is not the generation model — it is retrieval. If the correct document is not retrieved, no amount of prompt engineering will produce a correct answer. The LLM will either hallucinate (generate from training knowledge) or refuse to answer. Both outcomes frustrate users. Only by measuring retrieval quality directly can you know whether to fix the retrieval layer or the generation layer.
Most teams discover retrieval failures by reading user complaints or by manually examining system outputs. This is reactive and slow. A proactive retrieval evaluation harness catches failures before they reach users, quantifies the severity, and gives you a metric to optimize against. Without this harness, every change to the chunking strategy, embedding model, or retrieval algorithm is a guess.
Retrieval quality has an asymmetric impact on system quality: a retrieval miss (false negative — the right document is not retrieved) is almost always catastrophic for answer quality. A retrieval false positive (the wrong document is retrieved alongside the right one) is less damaging — the LLM often ignores irrelevant context if the relevant context is also present. This asymmetry means recall is more important than precision for most RAG applications.
§2
Recall@k: Definition and Why k Matters
Recall@k is the fraction of queries for which the correct document appears in the top k retrieved results. Formally: for a query q with a known relevant document d, Recall@1 = 1 if d is the top result, 0 otherwise. Recall@5 = 1 if d is in the top 5 results, 0 otherwise. Averaged over all queries in your evaluation set, this gives you a retrieval quality score.
The choice of k should match your system's retrieval configuration. If you retrieve 5 documents and pass all 5 to the LLM, measure Recall@5. If you retrieve 10 documents and re-rank to 3, measure Recall@10 (before re-ranking) and then the fraction of relevant documents in the top 3 after re-ranking. Measuring Recall@100 to make your retrieval look good while only passing 5 documents to the LLM is misleading.
A common mistake is treating a single k as sufficient. Track Recall@1, Recall@3, Recall@5, and Recall@10 together. The shape of the recall curve tells you about retrieval confidence: a system where Recall@1 is 0.60 and Recall@5 is 0.90 has good recall but imprecise ranking. A system where Recall@1 is 0.85 and Recall@5 is 0.87 has strong first-position precision. The first system benefits from a re-ranker; the second may not need one.
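To make the computation concrete, here is a minimal sketch that reports recall at several k values at once. It assumes a single known relevant chunk per query and a `retrieve(query, k)` callable that returns a ranked list of chunk IDs; both names are placeholders for your own retriever and eval set.

```python
from typing import Callable

def recall_at_k(
    eval_pairs: list[tuple[str, str]],          # (query, relevant_chunk_id)
    retrieve: Callable[[str, int], list[str]],  # returns ranked chunk IDs
    ks: tuple[int, ...] = (1, 3, 5, 10),
) -> dict[int, float]:
    """Fraction of queries whose relevant chunk appears in the top k."""
    hits = {k: 0 for k in ks}
    for query, relevant_id in eval_pairs:
        ranked = retrieve(query, max(ks))
        for k in ks:
            if relevant_id in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / len(eval_pairs) for k in ks}
```

Running this over your eval set gives you the recall curve directly, so you can see at a glance whether the gap between Recall@1 and Recall@5 justifies a re-ranker.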
§3
Precision vs. Recall Tradeoff in Retrieval
Retrieval precision@k is the fraction of the top k results that are relevant. If you retrieve 5 documents and 3 are relevant to the query, precision@5 is 0.6. High precision means the LLM receives mostly relevant context, which reduces noise and improves generation quality. High recall means the relevant document is likely to be in the retrieved set, which is the prerequisite for correct generation.
The tradeoff: increasing k improves recall (more documents, more chances to include the relevant one) but reduces precision (more irrelevant documents in the context). For most RAG applications, recall is the binding constraint — the system fails completely when recall is 0, but a small amount of irrelevant context is tolerable. Set k based on the recall level you need, then use re-ranking to improve precision within that k.
Re-ranking changes the precision/recall tradeoff without changing k. A cross-encoder re-ranker takes all k retrieved documents and re-ranks them by query-document relevance, pushing the most relevant documents to the top. You then pass only the top n (n < k) to the LLM. This achieves high recall (retrieve k=10) with high precision (pass only top n=3 to the LLM). The computational cost of cross-encoder re-ranking is roughly 5–15ms per query for typical candidate counts (it scales with k, not with corpus size) — almost always worth the quality improvement.
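A minimal re-ranking sketch using the sentence-transformers CrossEncoder class; the checkpoint name is a commonly used public model, not a recommendation, and the `rerank` helper is hypothetical.

```python
from sentence_transformers import CrossEncoder

# Public MS MARCO cross-encoder checkpoint; substitute your own re-ranker.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Score each (query, document) pair and keep the top_n documents."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# Retrieve k=10 candidates upstream, then pass rerank(query, candidates) to the LLM.
```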
§4
MRR and NDCG: When to Use Ranked Metrics
Mean Reciprocal Rank (MRR) is the average of 1/rank for the first relevant result across all queries. If the relevant document is at position 1, MRR contribution is 1.0. At position 2, it is 0.5. At position 5, it is 0.2. MRR captures the quality of ranking, not just whether the document is retrieved — a system that consistently ranks the relevant document first is better than one that puts it anywhere in the top 10.
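A minimal MRR computation, reusing the same (query, relevant_chunk_id) pairs and hypothetical `retrieve` callable as the recall sketch above.

```python
from typing import Callable

def mean_reciprocal_rank(
    eval_pairs: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 10,
) -> float:
    """Average of 1/rank of the first relevant result; contributes 0 if not in the top k."""
    total = 0.0
    for query, relevant_id in eval_pairs:
        ranked = retrieve(query, k)
        if relevant_id in ranked:
            total += 1.0 / (ranked.index(relevant_id) + 1)
    return total / len(eval_pairs)
```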
NDCG (Normalized Discounted Cumulative Gain) is a ranked metric that accounts for multiple relevant documents and assigns higher scores to relevant documents at higher ranks. NDCG is the right metric when queries have multiple relevant documents (a multi-document retrieval task) or when you want to weight the rank position more finely than MRR allows.
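For graded, multi-document relevance, scikit-learn's `ndcg_score` is one off-the-shelf option; the relevance grades and retriever scores below are purely illustrative.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# True relevance grades for 6 candidate documents (2 = highly relevant, 0 = irrelevant)
true_relevance = np.asarray([[2, 1, 0, 0, 1, 0]])
# Scores your retriever assigned to the same 6 documents
retriever_scores = np.asarray([[0.9, 0.2, 0.7, 0.1, 0.6, 0.05]])

print(ndcg_score(true_relevance, retriever_scores, k=5))
```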
For most RAG applications with single-document relevance per query, MRR is the clearer metric — it is easy to interpret (average reciprocal rank of the first correct result) and correlates directly with whether the best document appears early in the context window. Use NDCG when you have multi-document relevance labels or when you are optimizing a re-ranker that needs a rank-sensitive loss signal.
§5
Building a Held-Out Retrieval Evaluation Set
A retrieval evaluation set requires (query, relevant_document_id) pairs. Building it correctly means the queries come from the same distribution as your production queries, and the relevance labels are accurate. Three practical approaches: human labeling (domain experts write queries and identify the relevant document — highest quality, high cost), production mining (sample real user queries, identify which document provided the answer using session analysis — realistic but requires production traffic), and synthetic generation (use an LLM to generate queries for each document — fast and scalable but introduces model bias).
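Whichever approach you choose, the output has the same simple shape, and these pairs can be fed straight into the recall and MRR sketches above. The example queries and chunk IDs below are hypothetical.

```python
eval_pairs = [
    # (query, relevant_chunk_id) pairs; queries and IDs are hypothetical examples
    ("How do I rotate an API key without downtime?", "docs/security/key-rotation#chunk-3"),
    ("What is the retention period for audit logs?", "docs/compliance/audit-logs#chunk-1"),
]
```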
For synthetic generation: prompt the LLM with the document chunk and ask for 3–5 questions that the chunk uniquely answers. Include a verification step — manually review 10–20% of the generated pairs to estimate how many are genuinely answerable from the document. Discard ambiguous pairs where the question could be answered by multiple documents (these are valid production scenarios but make retrieval eval harder to interpret).
Size matters: 200–500 query-document pairs is sufficient for a useful retrieval eval. Fewer than 100 pairs produces unstable estimates — a 5-percentage-point swing in Recall@5 might be noise rather than a real signal. More than 1,000 pairs starts to have diminishing returns for development iteration speed. Build to 200–300, ship, and grow the set as you discover new failure modes in production.
§6
Automating Retrieval Eval With Synthetic Queries
Automated retrieval eval generation uses an LLM to produce queries for each document chunk at scale. The workflow: for each chunk in your corpus (or a stratified sample of chunks), send the chunk to an LLM with a prompt that requests 3–5 questions the chunk could answer. Store (query, chunk_id) pairs. Filter out questions that are too generic (answerable by common knowledge rather than the specific chunk) and questions that are answerable by many chunks (not useful for measuring recall of a specific chunk).
A good synthetic query generation prompt: "You are an expert at generating retrieval evaluation questions. Given the following document chunk, generate exactly 3 questions that: (1) are answerable ONLY from information in this specific chunk, (2) use different vocabulary than the chunk to test semantic retrieval, (3) represent realistic queries a user might ask. Format as JSON array." The instruction to use different vocabulary is critical — it tests semantic retrieval rather than keyword overlap.
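A sketch of the generation loop, assuming the official OpenAI Python client (openai 1.x); the model name and `generate_queries` helper are placeholders, and any capable LLM works equally well.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are an expert at generating retrieval evaluation questions. "
    "Given the following document chunk, generate exactly 3 questions that: "
    "(1) are answerable ONLY from information in this specific chunk, "
    "(2) use different vocabulary than the chunk to test semantic retrieval, "
    "(3) represent realistic queries a user might ask. "
    "Format as a JSON array of strings.\n\nChunk:\n{chunk}"
)

def generate_queries(chunk_id: str, chunk_text: str) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever model you have access to
        messages=[{"role": "user", "content": PROMPT.format(chunk=chunk_text)}],
    )
    # In practice you may need to strip markdown fences before parsing the JSON.
    questions = json.loads(response.choices[0].message.content)
    return [{"query": q, "relevant_chunk_id": chunk_id} for q in questions]
```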
Validate your synthetic queries before running eval: embed a sample of queries and check that ANN search returns the source chunk in the top 3 at least 80% of the time. If your validation baseline is lower, the queries are too generic or ambiguous to be useful eval signals. Regenerate with a more specific prompt or switch to human-authored queries for the portion of your corpus that produces poor synthetic queries.
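One way to run that check, sketched with sentence-transformers and brute-force cosine similarity over the chunk embeddings; the model name is a placeholder, the `validation_hit_rate` helper is hypothetical, and a real system would use its production embedding model and ANN index.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def validation_hit_rate(pairs: list[dict], chunks: dict[str, str], top_k: int = 3) -> float:
    """Fraction of synthetic queries whose source chunk appears in the top_k neighbors."""
    chunk_ids = list(chunks.keys())
    chunk_vecs = model.encode([chunks[cid] for cid in chunk_ids], normalize_embeddings=True)
    query_vecs = model.encode([p["query"] for p in pairs], normalize_embeddings=True)

    hits = 0
    for pair, q_vec in zip(pairs, query_vecs):
        sims = chunk_vecs @ q_vec  # cosine similarity, since vectors are normalized
        top = [chunk_ids[i] for i in np.argsort(-sims)[:top_k]]
        if pair["relevant_chunk_id"] in top:
            hits += 1
    return hits / len(pairs)

# Aim for roughly 0.80 or better before trusting the synthetic set as an eval signal.
```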
§7
Interpreting Retrieval Metrics in Context
A Recall@5 of 0.75 means 25% of queries fail to retrieve the relevant document. Whether this is acceptable depends entirely on your application. For a customer support bot, 25% retrieval miss rate means 25% of support queries receive wrong or incomplete answers — probably unacceptable. For an exploratory research assistant, a 25% miss rate might be tolerable if the other 75% of queries produce high-quality results.
Compare metrics against baselines, not absolute targets. BM25 sparse retrieval is a strong baseline — if your dense embedding retrieval cannot beat BM25 on your corpus, the embedding model is not suitable for your domain. A well-implemented hybrid retrieval system (BM25 + dense) should beat either alone by 5–15 percentage points in Recall@5 on most real-world corpora.
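A BM25 baseline is cheap to stand up; the rank_bm25 package is one option. The whitespace tokenization below is a simplification (use a proper tokenizer for real corpora), and `build_bm25_retriever` is a hypothetical helper.

```python
from rank_bm25 import BM25Okapi

def build_bm25_retriever(chunks: dict[str, str]):
    """Return a retrieve(query, k) function backed by BM25 over the chunk texts."""
    chunk_ids = list(chunks.keys())
    tokenized = [chunks[cid].lower().split() for cid in chunk_ids]  # naive tokenization
    bm25 = BM25Okapi(tokenized)

    def retrieve(query: str, k: int) -> list[str]:
        scores = bm25.get_scores(query.lower().split())
        ranked = sorted(zip(chunk_ids, scores), key=lambda pair: pair[1], reverse=True)
        return [cid for cid, _ in ranked[:k]]

    return retrieve

# Feed this retrieve() into the recall_at_k sketch from §2 to get your BM25 baseline.
```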
Track metrics longitudinally, not just at a single point. A retrieval system that degrades from Recall@5 of 0.85 to 0.72 over three months is telling you something: documents are changing faster than re-ingestion keeps up, the query distribution has shifted, or the embedding model's behavior changed after a provider update. Longitudinal tracking catches silent degradation before it becomes a user-visible problem.
FAQ
Frequently asked questions
- What is recall@k in RAG?
- Recall@k measures what fraction of your queries have the correct document (or chunk) in the top k retrieved results. For example, if you retrieve 5 documents per query and 80 out of 100 test queries have the correct document somewhere in those 5, your Recall@5 is 0.80. This is the primary retrieval quality metric for RAG systems because it directly measures whether the retrieval layer is providing the LLM with the information it needs to generate correct answers. If the correct document is not retrieved, the LLM cannot generate a correct answer regardless of how good the generation model is.
- How do I measure retrieval precision?
- Precision@k is the fraction of your top k retrieved documents that are relevant to the query. To measure it, you need relevance labels for each retrieved result — not just whether the correct document is retrieved, but whether each of the k retrieved documents is relevant. This requires either human labelers or an LLM-as-judge that can assess document-query relevance. In practice, most teams measure Recall@k (did we retrieve the right document at all?) rather than Precision@k (are all retrieved documents relevant?) because retrieval misses are more damaging than irrelevant documents in the context window. Precision becomes important when you are optimizing a re-ranker to select the best n documents from k retrieved candidates.
- What k should I use for recall@k?
- Use the k that matches your actual retrieval configuration — the number of documents you retrieve and pass to the LLM. If your system retrieves 5 documents and uses all 5 as context, measure Recall@5. If you retrieve 10 and re-rank to 3, measure Recall@10 pre-re-ranking and a ranking quality metric post-re-ranking. Using a larger k than your system actually uses (e.g., measuring Recall@100 when you only pass 5 documents to the LLM) makes your retrieval look better than it is and obscures the real bottleneck. Track multiple values of k to understand the recall curve — this tells you whether you need better retrieval (higher recall at the k you already use) or a re-ranker (better precision within the k you already retrieve).
- How do I build a retrieval evaluation dataset?
- The fastest practical approach for most teams: use an LLM to generate synthetic queries from your document chunks. For each chunk, prompt an LLM (GPT-4o or Claude) to generate 3–5 questions that are answerable only from that chunk and use different vocabulary. Store (query, chunk_id) pairs. Manually verify 10–20% to estimate quality. Filter out overly generic questions. Target 200–500 pairs for a useful evaluation set. For higher quality, supplement with human-authored queries from domain experts or mined production queries (once you have traffic). The key requirement is that the evaluation set covers your actual query distribution — a synthetic set that only generates factual lookup queries will not capture reasoning or summarization queries that users actually ask.
- What is a good recall@5 for a production RAG system?
- For well-structured corpora (documentation, FAQs, knowledge bases with clear question-answer alignment), a production RAG system with good embedding and hybrid retrieval should achieve Recall@5 of 0.85–0.95. For unstructured corpora (long-form documents, PDFs with mixed content), 0.75–0.85 is more typical. Below 0.70 indicates a retrieval problem that needs to be addressed before optimizing generation quality. These are rough baselines — the acceptable level depends on your application's tolerance for incorrect answers. Measure your BM25-only baseline first: if hybrid retrieval cannot beat BM25 alone by at least 5 percentage points, your embedding model is not suited to your domain.