RAG Hallucinations: Why Your Retrieval, Not Prompts, Fails

Think of it as a cache miss the caller can't see. A function asks the cache for a key, gets nothing back, and instead of erroring it proceeds on whatever was already in scope. The caller has no idea the lookup failed.

A RAG pipeline has two stages. Retrieval pulls a handful of chunks — small slices of your documents — that look relevant to the query. Generation feeds those chunks to the model, which writes the answer. The model only ever sees what retrieval handed it. It has no view of your full corpus and no signal that a better chunk exists but didn't make the cut.

So when the chunk holding the answer isn't retrieved, the model does exactly what it's built to do: it answers from the chunks it did get, plus its own training. It fills the gap, and it sounds certain doing it. Nothing in the output flags "the source was missing."

The metric for this is recall — specifically recall@k: across your test queries, how often does the chunk that actually answers the question appear in the top-k results retrieval returns. (Top-k = the k highest-scoring chunks, often k=5 or k=10.) When you label a single gold chunk per query — as in the procedure below — recall@k is simply whether that chunk made the top-k; in that setup it's equivalent to what some tools call hit-rate@k. Recall is a retrieval-side number. It's distinct from whether the model then used the chunk well — that's faithfulness, a generation-side concern, and a separate issue. Recall comes first: a model can't ground an answer in a chunk it never received.

A concrete one (illustrative):

Your API documentation says the rate limit is 30 requests per second. A user asks the assistant, "what's the rate limit?" and it answers "100 requests per second." The number is right there in the docs. Retrieval simply didn't pull that chunk into the top-k, so the model guessed.

The trap is that your prompt looks innocent and your generation looks competent — like a test that passes because the assertion is fine, while the fixture quietly loaded the wrong data. To find it, instrument retrieval on its own:

Collect real failures. Pull 15–20 answers your system actually got wrong from your logs — not synthetic queries.
Label the gold chunk. For each, find the chunk in your corpus that should answer it. If none exists, that's a coverage gap, not a retrieval miss — set it aside.
Run retrieval only. No generation. Look at the raw top-k for each query.
Compute recall@k. Did the gold chunk make the top-k?

def recall_at_k(eval_set, retrieve, k=5):
    # eval_set: [{"query": str, "gold_id": str}, ...] — real failures, gold chunk labeled
    # retrieve(query, k): your retriever; returns ranked chunks, each with an "id"
    hits = sum(
        ex["gold_id"] in [c["id"] for c in retrieve(ex["query"], k)]
        for ex in eval_set
    )
    return hits / len(eval_set)

print(f"recall@5: {recall_at_k(eval_set, retrieve, k=5):.2f}")

Bucket every failure. Gold chunk missing from the top-k → retrieval miss; fix retrieval. Gold chunk present but the answer's still wrong → that's a generation problem, a different fix entirely.

Where "check recall first" stops helping:

Checking recall first isn't a cure-all, and treating it as one creates its own bugs — like chasing a null pointer at the crash site when the real fault is upstream in how the data loaded.

A few things recall@k won't tell you. Coverage gaps masquerade as retrieval misses: if no chunk in your corpus answers the question, no retriever can find one, and the fix is content, not search. Recall is easy to game with k: raise k high enough and recall climbs toward 1.0 — but now you're stuffing the model's context with mostly-irrelevant chunks, which degrades the answer in a different way (a future issue). A high recall@20 with a bad answer is not a win. Labeling gold chunks costs real time: 20 well-chosen failures beat 200 unlabeled queries. And the honest boundary — sometimes retrieval is fine and the model still answers wrong. The gold chunk was in the top-k, and generation ignored or contradicted it. That's a real failure mode; it's just not this one, and the prompt work you were tempted to do at the start is the right move there. Measuring recall first tells you which world you're in before you spend the afternoon.

What to do this week:

Build the smallest retrieval eval that would have caught your last bad answer — the way you'd write a failing test before a refactor.

Take 20 real failures from your logs. Label the gold chunk for each. Run retrieval-only and compute recall@k at your current k. You get one number. If it's low, you've been debugging the wrong half of the stack, and the fixes are ordered: raise k first (cheapest); add hybrid search — combine keyword search (exact term matching, e.g. BM25) with vector search (meaning-based matching) so a query that shares no words with the document still matches; then rewrite the query before it hits the index. Re-measure recall@k after each change. Track that one number every week. "Feels better" is not a metric; recall@k is.

What did you think of today's email?
Your feedback helps me create better emails for you! comment down 👇
Loved It 😊
It was ok 🙂
Could be better 🤔

Until next time - Teja Derangula,
Create while it’s easy

Your RAG isn't hallucinating

A concrete one (illustrative):

Where "check recall first" stops helping:

What to do this week:

Reply

Keep Reading

NextGen