Cache the Context, Not Just the Answer

RAG pipelines often rebuild context from scratch on every request: embed the query, hit the vector store, assemble chunks. For an FAQ-style workload where the same handful of questions arrive thousands of times a day, that is the same work repeated endlessly — including a paid embedding API call each time.

Before

A docs assistant gets "How do I reset my password?" 400 times a day, phrased a dozen ways. Each request embeds the query (one embedding API call) and runs a vector search before generation. You pay 400 embedding calls and 400 searches to retrieve essentially the same three chunks.

After

You add a cache keyed by a normalized query (lowercased, trimmed, punctuation stripped). On a hit you reuse the stored chunk IDs and skip both the embedding call and the search entirely; you only run generation.

key = normalize(query)              # "how do i reset my password"
ctx = cache.get(key)
if ctx is None:
    ctx = retrieve(query)           # embed + vector search
    cache.set(key, ctx, ttl=86400)
answer = generate(query, ctx)

Why it works

Real query traffic is heavily skewed: a small set of questions accounts for most volume. Retrieval for a given question is deterministic enough — a fixed embedding model over a static index returns the same neighbors — that the chunk set rarely changes between identical queries, so recomputing it is wasted work. Caching the retrieved context removes the embedding call and the vector search on every hit, which is where the repeat cost lives.

Be precise about what you save. This trims your embedding-API bill and retrieval compute, not the tokens you send to the model: the same context block still goes to the LLM on every hit, so your generation-side input-token cost is unchanged. This is also distinct from caching final answers — you still generate fresh text (useful when you personalize the response or inject live data); you just stop re-paying for retrieval. If you also want to cut the LLM's input-token cost on a stable context block, that's a separate technique (provider-side prompt/context caching), not what this tip does.

Practical notes: normalize keys so trivially different phrasings collide, and consider a small semantic cache (embed once, match near-duplicate queries by cosine threshold) to catch paraphrases — though note that re-embedding to compute the match spends part of what you were trying to save, so it pays off mainly when generation and search dominate. Set a TTL so the cache invalidates when docs change, or bust it on re-index. Keep it Beginner-simple to start: an in-memory dict or Redis with a normalized string key. Hedge your expectations, since savings track your cache hit rate directly. Measure hits versus misses for a week before assuming a big win, and watch for staleness if your knowledge base updates frequently.

Cache the Context, Not Just the Answer

Before

After

Why it works

Get a fresh tip every morning

More in Retrieval & RAG

Stop Pasting Whole Documents: Retrieve the 3 Chunks That Actually Answer the Question

Chunk on Structure, Not Character Count, So You Retrieve Fewer (and Smaller) Chunks

Filter by Metadata Before You Search