Semantic Caching for LLM APIs: GPTCache, Redis, and the Threshold That Matters
Users rephrase the same question in a dozen ways: "how do I reset my password", "password reset help", "I forgot my password". Exact-match caching misses all of them. Semantic caching embeds each prompt into a vector and returns a cached answer when cosine similarity to a previously seen prompt exceeds a threshold, intercepting the 25-40% of traffic that is redundant on typical support-chat and FAQ workloads. The whole game is picking the right embedding model, the right threshold, and the right cache layer in your stack.
The problem: exact-match caching misses rephrasings
Most LLM gateways support exact-match caching: if a user sends the identical prompt twice, return the stored answer. But real users rarely send byte-identical prompts. Support-chat traffic shows 10-30 surface variations of the same intent: "cancel my subscription", "how to cancel", "I want to unsubscribe", "end my plan". Exact-match caching treats each of these as a new prompt and catches none of them.
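To make the miss concrete, here is a minimal sketch of an exact-match key, assuming the key is a hash of the lightly normalized prompt. The function name `exact_match_key` and the normalization rule are illustrative assumptions, not any particular gateway's implementation.

```python
import hashlib

def exact_match_key(prompt: str) -> str:
    """Cache key for exact-match caching: a hash of the lightly normalized prompt."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

variants = [
    "cancel my subscription",
    "how to cancel",
    "I want to unsubscribe",
    "end my plan",
]
keys = {exact_match_key(v) for v in variants}

# Four phrasings of one intent produce four distinct keys, so none can hit another's entry.
distinct_keys = len(keys)

# Only trivial differences (case, extra whitespace) collapse to the same key.
same_key = exact_match_key("Cancel  my subscription") == exact_match_key("cancel my subscription")
```

Normalization recovers case and whitespace variants, but no amount of string normalization maps "end my plan" onto "cancel my subscription"; that needs a similarity measure over meaning.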
OpenAI's prompt caching (and the analogous features on Anthropic and Gemini) solves a related but different problem — it caches the prefix of a prompt to reduce prefill cost, but still runs the LLM for the divergent suffix. Semantic caching goes further: it skips the LLM call entirely when a similar past answer exists. The two techniques compose: prompt caching reduces prefill cost; semantic caching eliminates the call.
The hit rate is highest for workloads where many users ask variations of the same questions: customer support, FAQ bots, documentation chat, internal knowledge bases. It is lowest for one-off creative or analytical prompts where no two inputs are alike.
How it works: embed, search, threshold
The pipeline is simple. On every incoming prompt: (1) embed the prompt into a vector using a small embedding model, (2) search the vector store for the nearest cached prompt, (3) if cosine similarity exceeds your threshold, return the cached answer; otherwise call the LLM and store the new prompt+answer+embedding.
```python
# Semantic cache flow (LiteLLM with Redis backend)
import os
import litellm
from litellm.caching import Cache

# Assumed setup: LiteLLM's "redis-semantic" cache type; verify parameter names for your version
litellm.cache = Cache(
    type="redis-semantic",
    host="redis", port=6379, password=os.environ["REDIS_PASSWORD"],
    similarity_threshold=0.90,
    redis_semantic_cache_embedding_model="text-embedding-3-small",
)
# LiteLLM's semantic cache checks similarity before calling the LLM
response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "how do I reset my password?"}],
    caching=True,  # enable caching
)
# If "I forgot my password, help" was asked before at sim 0.92 → cached hit, $0
```
The embedding model is typically a fast, cheap one: OpenAI's text-embedding-3-small or a self-hosted all-MiniLM-L6-v2 (384-dim, runs on CPU). The vector store is Redis (via RedisVL), pgvector, Qdrant, or Milvus. With a locally hosted embedding model, the cache layer adds 5-15ms of latency on a miss and returns in 5-15ms on a hit, versus 500-3000ms for an actual LLM call. A hosted embedding API adds a network round trip on top of that.
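The embed, search, threshold pipeline can be sketched without any external service. The sketch below is an illustration under stated assumptions, not how any of the libraries named here implement it: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model (it does not capture paraphrases), the linear scan stands in for a vector store, and `SemanticCache` and `answer_with_cache` are made-up names.

```python
import math
import zlib
from typing import Callable, List, Optional, Tuple

Vector = List[float]

def toy_embed(text: str, dim: int = 256) -> Vector:
    """Stand-in embedder (hashed bag of words). Swap in a real embedding model in practice."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vec

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []  # (prompt embedding, cached answer)

    def lookup(self, prompt: str) -> Optional[str]:
        """Embed the prompt, find the nearest cached prompt, apply the threshold."""
        query = self.embed(prompt)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:  # linear scan; a vector store replaces this
            score = cosine(query, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def store(self, prompt: str, answer: str) -> None:
        self.entries.append((self.embed(prompt), answer))

def answer_with_cache(cache: SemanticCache, prompt: str, call_llm: Callable[[str], str]) -> str:
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached            # hit: the LLM is never called
    answer = call_llm(prompt)    # miss: pay for the call, then remember it
    cache.store(prompt, answer)
    return answer
```

Passing the embedder and the LLM call in as functions keeps the cache logic independent of the provider, so the same structure would work with a real embedding model and a real vector index swapped in.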
The similarity threshold: the single most important knob
The threshold decides everything. Set it too low and you return semantically-near but factually-wrong answers ("how do I delete my account" served with the cancel-subscription answer). Set it too high and you get few hits. The right value depends on your workload's tolerance for slight semantic drift.
| Threshold | Hit rate | Risk | Best for |
|---|---|---|---|
| 0.95+ | 10-15% | Very low — near-exact matches only | Medical, legal, high-stakes Q&A |
| 0.90-0.95 | 20-30% | Low — safe for most structured output | FAQ, docs chat, RAG user-layer |
| 0.85-0.90 | 35-45% | Moderate — occasional off-target hits | Casual support chat |
| <0.85 | 45%+ | High — false positives common | Not recommended |
Different embedding models produce different similarity distributions. A threshold of 0.90 on text-embedding-3-small is not equivalent to 0.90 on all-MiniLM-L6-v2. Always re-calibrate the threshold when you switch embedding models — sample 100 known-similar and 100 known-dissimilar pairs and find the decision boundary empirically.
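One way to find that decision boundary, sketched under assumptions: score each labeled pair with your embedding model, then pick the lowest threshold whose false-positive rate on the known-dissimilar pairs stays within a budget. The function name `calibrate_threshold`, the 0.70-1.00 sweep, and the sample scores below are all hypothetical; a real run would use your own 100 + 100 pairs.

```python
from typing import List, Tuple

def calibrate_threshold(
    similar_scores: List[float],
    dissimilar_scores: List[float],
    max_false_positive_rate: float = 0.03,
) -> Tuple[float, float, float]:
    """Return (threshold, recall_on_similar, false_positive_rate).

    Picks the lowest candidate threshold whose false-positive rate on the
    known-dissimilar pairs stays within the allowed budget.
    """
    candidates = [round(0.70 + 0.01 * i, 2) for i in range(31)]  # 0.70 .. 1.00
    for t in candidates:
        fp_rate = sum(s >= t for s in dissimilar_scores) / len(dissimilar_scores)
        if fp_rate <= max_false_positive_rate:
            recall = sum(s >= t for s in similar_scores) / len(similar_scores)
            return t, recall, fp_rate
    return 1.0, 0.0, 0.0  # nothing acceptable: effectively disable semantic hits

# Made-up cosine scores for illustration only.
similar = [0.97, 0.94, 0.93, 0.91, 0.88, 0.86]
dissimilar = [0.89, 0.84, 0.80, 0.72, 0.65, 0.60]
threshold, recall, fp_rate = calibrate_threshold(similar, dissimilar, max_false_positive_rate=0.0)
```

Choosing the lowest acceptable threshold maximizes hit rate for a given error budget. Rerunning the same script after a model switch shows how far the boundary moved.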
GPTCache vs RedisVL vs LiteLLM built-in
There are three common implementation choices, depending on what you already run.
| Option | Embedding backend | Vector store | Best when |
|---|---|---|---|
| GPTCache (OSS) | Pluggable (OpenAI, ONNX, HuggingFace) | Pluggable (Milvus, FAISS, pgvector) | Prototyping, custom pipelines |
| RedisVL / Redis Stack | Pluggable | Redis (native) | Already running Redis; want TTL eviction in one store |
| LiteLLM built-in | OpenAI or Cohere | Redis or in-memory | You use LiteLLM as your gateway (simplest integration) |
If you are already using LiteLLM as your gateway, the built-in semantic cache is the path of least resistance — it is a few lines of config and inherits LiteLLM's TTL, model routing, and observability. For greenfield or custom setups, Redis Stack is the most production-ready: it handles eviction, sharding, and vector search in a single store you probably already operate.
Eviction, TTL, and invalidation
A cache that never expires becomes a correctness liability — answers to "what is the latest GPT model" go stale as models update. Every semantic cache entry should have a TTL (typically 1-7 days for general chat; shorter for time-sensitive content). Redis makes this trivial with per-key TTL; GPTCache requires manual eviction logic.
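Per-entry expiry can be sketched with an in-memory stand-in. This is an assumption-laden illustration, not Redis code: `TTLStore` is a hypothetical class, and the injected `clock` exists only so expiry can be exercised without waiting. In Redis itself the equivalent is setting the key with an expiry (for example `SET` with the `EX` option).

```python
import time
from typing import Callable, Dict, Optional, Tuple

class TTLStore:
    """In-memory stand-in for a store with per-key expiry (what Redis does natively)."""

    def __init__(self, clock: Callable[[], float] = time.time):
        self.clock = clock
        self.data: Dict[str, Tuple[str, float]] = {}  # key -> (answer, expires_at)

    def set(self, key: str, answer: str, ttl_seconds: float) -> None:
        self.data[key] = (answer, self.clock() + ttl_seconds)

    def get(self, key: str) -> Optional[str]:
        item = self.data.get(key)
        if item is None:
            return None
        answer, expires_at = item
        if self.clock() >= expires_at:
            del self.data[key]  # lazily evict expired entries on read
            return None
        return answer

ONE_DAY = 24 * 60 * 60  # lower end of the 1-7 day range for general chat
```

Lazy eviction on read keeps the sketch short. A real store also needs background expiry or a size cap so entries that are never read again do not accumulate.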
The harder problem is semantic invalidation: if you update your docs, old cached answers about the old docs are now wrong. The pragmatic approach is to version your cache namespace by content version (e.g. cache:v3:...) and flip the version on doc updates, effectively nuking the old cache. Do not try to surgically invalidate individual entries — it is not worth the complexity.
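A minimal sketch of namespace versioning, assuming keys follow the `cache:v3:...` shape mentioned above. The helper name `cache_key` and the truncated SHA-256 digest are illustrative choices. For a vector index the same idea would apply to the index name or a version tag used as a filter, which is an assumption about your store rather than a documented API.

```python
import hashlib

def cache_key(content_version: int, prompt: str) -> str:
    """Build a namespaced key such as 'cache:v3:<hash>' (format is illustrative)."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
    return f"cache:v{content_version}:{digest}"

# Bumping the version after a docs update makes every old entry unreachable.
old_key = cache_key(3, "how do I reset my password")
new_key = cache_key(4, "how do I reset my password")
```

Old entries are never deleted explicitly; they simply stop being looked up and age out through their TTL.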
Measuring hit rate and quality
Two metrics define whether your semantic cache is working. Hit rate is the percentage of calls served from cache — 25-40% is healthy for support-chat workloads; below 15% suggests the threshold is too high or the workload is genuinely one-off. Hit quality is the fraction of cache hits that users find acceptable — measure this by tracking whether a user immediately re-asks the same question (a proxy for "the cached answer was wrong") or by spot-sampling hits for human review.
If your false-positive rate rises above 3-5%, raise the threshold. If your hit rate drops below 15%, lower it or check whether your embedding model is too weak to cluster paraphrases. These two knobs, threshold and embedding model, account for most semantic-cache outcomes.
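The two metrics and the tuning rule can be sketched over a simple call log. Everything here is a hypothetical illustration: `CallRecord`, `summarize`, and `suggest_action` are made-up names, the immediate re-ask flag is the proxy for a bad hit described above, and the defaults of 5% and 15% are taken from the ranges in the text.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CallRecord:
    cache_hit: bool
    reasked: bool = False  # user immediately re-asked: proxy for a bad cached answer

def summarize(records: List[CallRecord]) -> dict:
    hits = [r for r in records if r.cache_hit]
    hit_rate = len(hits) / len(records) if records else 0.0
    false_positive_rate = sum(r.reasked for r in hits) / len(hits) if hits else 0.0
    return {"hit_rate": hit_rate, "false_positive_rate": false_positive_rate}

def suggest_action(stats: dict, max_fp: float = 0.05, min_hit_rate: float = 0.15) -> str:
    # Correctness comes first: too many bad hits outweighs a good hit rate.
    if stats["false_positive_rate"] > max_fp:
        return "raise threshold"
    if stats["hit_rate"] < min_hit_rate:
        return "lower threshold or revisit embedding model"
    return "keep"
```

Checking the false-positive rate before the hit rate is a deliberate ordering: a cache that serves wrong answers should be tightened even if that costs hits.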
Composing with other cost optimizations
Semantic caching is one lever in the broader cost optimization playbook, and it composes with the others. After a cache miss, model routing can send the request to a cheaper model (GPT-4o-mini instead of GPT-4o) for simple prompts — see our model routing guide. Prompt compression trims the prompt before the LLM sees it. Each lever is independent; together they compound into 70-85% API spend reduction on the right workloads.
Semantic cache → prompt caching (prefix) → model routing → prompt compression. Check the cheapest lever (exact or semantic cache) first; on a miss, compress the prompt and fall through to the cheapest viable model. Each layer should be measurably cheaper than the one it falls through to.
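The compounding claim can be checked with simple arithmetic, assuming each lever removes a fraction of whatever spend the previous lever left behind. The per-lever fractions below are hypothetical inputs chosen for illustration, not measurements, and `compound_savings` is a made-up helper name.

```python
from typing import Iterable

def compound_savings(fractions_saved: Iterable[float]) -> float:
    """Combined saving when each lever removes a fraction of the spend left by the previous one."""
    remaining = 1.0
    for f in fractions_saved:
        remaining *= (1.0 - f)
    return 1.0 - remaining

# Hypothetical per-lever savings: semantic cache 30%, model routing 50%, compression 20%.
total = compound_savings([0.30, 0.50, 0.20])  # 1 - 0.7 * 0.5 * 0.8
```

Because the levers multiply rather than add, three moderate savings of 30%, 50%, and 20% combine to 72% rather than 100%, which is how totals in the 70-85% range arise.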
FAQ
What is semantic caching for LLM APIs?
Semantic caching embeds each prompt into a vector and returns a cached answer when a similar past prompt (cosine similarity above a threshold) exists. On typical support-chat and FAQ workloads it intercepts 25-40% of calls, cutting LLM API spend by roughly the same fraction, less the small cost of embedding each prompt. It goes beyond exact-match caching by catching rephrasings.
What similarity threshold should I use?
Start at 0.90 and tune. Use 0.92-0.95 for factual or structured output and 0.85-0.90 for casual chat. At 0.95+ you get few false-positive hits but a lower hit rate; below 0.85 you risk serving semantically near but factually wrong answers. Always recalibrate when switching embedding models, because similarity distributions differ.
GPTCache vs RedisVL: which should I use?
Use GPTCache for prototyping or custom pipelines. Use RedisVL (with Redis Stack) if you already run Redis; it unifies the vector index and TTL-based eviction in one store. If you use LiteLLM as your gateway, its built-in semantic cache (backed by Redis) is the simplest path and inherits routing and observability.
Does semantic caching work for agent and RAG workloads?
It works for the user-facing prompt layer of RAG (FAQ-style questions repeat heavily), but not for intermediate tool-call or retrieval layers where context is dynamic. For agents, cache the final user-facing answer, not the intermediate calls. Hit rate is highest for support chat, FAQ bots, and documentation assistants.
Related deep dives
- LLM Cost Optimization Playbook — the full 5-lever cost-reduction stack, of which semantic caching is one
- Model Routing: GPT-4o vs GPT-4o-mini — what to do after a cache miss: route to the cheapest viable model
- LiteLLM vs Portkey vs OpenRouter — which gateway has the best built-in semantic cache
- OpenAI vs Anthropic vs Gemini Pricing — the per-token savings a cache multiplies
Sources
- Antgroup / Zilliz, "GPTCache: Building Semantic Cache for LLM Applications," 2023-2024
- Redis Inc., "Semantic Caching with Redis Vector Library (RedisVL)," 2024-2025
- LiteLLM documentation, "Semantic Caching with Redis," 2025
- OpenAI, "Prompt Caching" documentation, 2024 (prefix caching — the complementary technique)
- LangChain blog, "Semantic Caching for LLMs: How, When, and Why," 2024
- BerriAI, "How We Cut Our OpenAI Bill by 40% with Semantic Caching," 2024
Hit rates and cost-reduction figures are workload-dependent. The 25-40% range reflects typical support-chat and FAQ workloads; creative or analytical workloads see much lower hit rates. Always measure on your own traffic before projecting savings.