Semantic Caching for LLM APIs: GPTCache, Redis, and the Threshold That Matters

Users rephrase the same question in a dozen ways: "how do I reset my password", "password reset help", "I forgot my password". Exact-match caching misses all of them. Semantic caching embeds each prompt into a vector and returns a cached answer when its cosine similarity to a previously answered prompt exceeds a threshold, which serves 25-40% of calls from cache on repetitive workloads such as support chat. The whole game is picking the right embedding model, the right threshold, and the right cache layer in your stack.

By the LLM Academy team · Reviewed July 2026 · Tested with GPTCache 0.1.x, Redis Stack 7.4+, LiteLLM v1.x

The problem: exact-match caching misses rephrasings

Every LLM gateway supports exact-match caching — if a user sends the identical prompt twice, return the stored answer. But real users rarely send byte-identical prompts. Support-chat traffic shows 10-30 surface variations of the same intent: "cancel my subscription", "how to cancel", "I want to unsubscribe", "end my plan". Exact-match caching catches none of them.

OpenAI's prompt caching (and the analogous features on Anthropic and Gemini) solves a related but different problem — it caches the prefix of a prompt to reduce prefill cost, but still runs the LLM for the divergent suffix. Semantic caching goes further: it skips the LLM call entirely when a similar past answer exists. The two techniques compose: prompt caching reduces prefill cost; semantic caching eliminates the call.

When semantic caching wins

The hit rate is highest for workloads where many users ask variations of the same questions: customer support, FAQ bots, documentation chat, internal knowledge bases. It is lowest for one-off creative or analytical prompts where no two inputs are alike.

How it works: embed, search, threshold

The pipeline is simple. On every incoming prompt: (1) embed the prompt into a vector using a small embedding model, (2) search the vector store for the nearest cached prompt, (3) if cosine similarity exceeds your threshold, return the cached answer; otherwise call the LLM and store the new prompt+answer+embedding.

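# Semantic cache flow (LiteLLM with a Redis semantic-cache backend)
# Parameter names follow LiteLLM's caching docs; check them against your installed version.
import os

import litellm
from litellm import completion
from litellm.caching.caching import Cache  # older releases: from litellm.caching import Cache

# Register the semantic cache globally; the similarity threshold is set here, not per call
litellm.cache = Cache(
    type="redis-semantic",
    host="redis",
    port="6379",
    password=os.environ["REDIS_PASSWORD"],
    similarity_threshold=0.90,
    redis_semantic_cache_embedding_model="text-embedding-3-small",
)

# LiteLLM's semantic cache checks similarity before calling the LLM
response = completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "how do I reset my password?"}],
    caching=True,  # enable caching for this call
)
# If "I forgot my password, help" was asked before at sim 0.92 → cached hit, no LLM call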
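
The embed, search, threshold loop can also be sketched without any library, which makes the control flow easier to see. This is a minimal illustration, not how any particular product implements it: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector store's nearest-neighbor search, and the class and function names are invented for this example. Scores from the toy embedding are not comparable to scores from a real model.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def lookup(self, prompt: str):
        """Return the answer of the nearest cached prompt, or None below threshold."""
        query = toy_embed(prompt)
        best_score, best_answer = 0.0, None
        for embedding, answer in self.entries:  # linear scan; a vector store would index this
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def store(self, prompt: str, answer: str) -> None:
        self.entries.append((toy_embed(prompt), answer))


def answer_with_cache(cache: SemanticCache, prompt: str, call_llm):
    """Return (answer, was_hit). call_llm is only invoked on a miss."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached, True
    answer = call_llm(prompt)
    cache.store(prompt, answer)
    return answer, False
```

Passing the LLM call in as a function keeps the cache logic testable without network access; in production the same shape wraps the real client call.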

The embedding model is typically a fast, cheap one: OpenAI's text-embedding-3-small or a self-hosted all-MiniLM-L6-v2 (384-dim, runs on CPU). The vector store is Redis (via RedisVL), pgvector, Qdrant, or Milvus. With a locally hosted embedding model, the cache layer adds 5-15ms of latency on a miss and returns in 5-15ms on a hit, versus 500-3000ms for an actual LLM call. A hosted embedding API adds its own network round trip to both paths.
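
To see what those latencies mean for the average request, a rough expected-value calculation helps. The sketch below assumes a 30% hit rate, a 10ms cache check, and a 1200ms LLM call; these are illustrative values picked from within the ranges quoted in this article, and the function name is hypothetical.

```python
def expected_latency_ms(hit_rate: float, cache_ms: float, llm_ms: float) -> float:
    """Average latency per request: hits pay only the cache check,
    misses pay the cache check plus the LLM call."""
    hit_cost = cache_ms
    miss_cost = cache_ms + llm_ms
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost


# Assumed inputs: 30% hit rate, 10ms cache check, 1200ms LLM call
average = expected_latency_ms(hit_rate=0.30, cache_ms=10.0, llm_ms=1200.0)
print(f"expected latency: {average:.0f} ms")  # about 850 ms, versus 1200 ms uncached
```

The cache check is paid on every request, so the layer only pays off when the hit rate times the LLM latency clearly exceeds the check overhead.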

The similarity threshold: the single most important knob

The threshold decides everything. Set it too low and you return semantically-near but factually-wrong answers ("how do I delete my account" served with the cancel-subscription answer). Set it too high and you get few hits. The right value depends on your workload's tolerance for slight semantic drift.

| Threshold | Hit rate | Risk | Best for |
| --- | --- | --- | --- |
| 0.95+ | 10-15% | Very low: near-exact matches only | Medical, legal, high-stakes Q&A |
| 0.90-0.95 | 20-30% | Low: safe for most structured output | FAQ, docs chat, RAG user-layer |
| 0.85-0.90 | 35-45% | Moderate: occasional off-target hits | Casual support chat |
| <0.85 | 45%+ | High: false positives common | Not recommended |

Threshold drift

Different embedding models produce different similarity distributions. A threshold of 0.90 on text-embedding-3-small is not equivalent to 0.90 on all-MiniLM-L6-v2. Always re-calibrate the threshold when you switch embedding models — sample 100 known-similar and 100 known-dissimilar pairs and find the decision boundary empirically.
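
One way to find that decision boundary is to score every labeled pair with the new embedding model and pick the cutoff that classifies the most pairs correctly. The sketch below assumes you already have the similarity scores as two plain lists; the function name and the sample scores are hypothetical, and maximizing accuracy is only one possible criterion (you may prefer to cap the false-positive rate instead).

```python
def calibrate_threshold(similar_scores, dissimilar_scores):
    """Return (threshold, accuracy) for the cutoff that best separates
    known-similar pairs (should be hits) from known-dissimilar pairs (should be misses)."""
    candidates = sorted(set(similar_scores) | set(dissimilar_scores))
    total = len(similar_scores) + len(dissimilar_scores)
    best_threshold, best_accuracy = None, -1.0
    for threshold in candidates:
        true_hits = sum(score >= threshold for score in similar_scores)
        true_misses = sum(score < threshold for score in dissimilar_scores)
        accuracy = (true_hits + true_misses) / total
        # ">=" lets ties go to the higher, more conservative threshold
        if accuracy >= best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy


# Made-up scores for illustration; use ~100 pairs per class in practice
similar = [0.95, 0.92, 0.91, 0.88]
dissimilar = [0.80, 0.84, 0.86, 0.89]
threshold, accuracy = calibrate_threshold(similar, dissimilar)
print(threshold, accuracy)
```

Breaking ties toward the higher threshold reflects the asymmetry described above: a missed hit costs one extra LLM call, while a false hit serves a wrong answer.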

GPTCache vs RedisVL vs LiteLLM built-in

There are three common implementation choices, depending on what you already run.

| Option | Embedding backend | Vector store | Best when |
| --- | --- | --- | --- |
| GPTCache (OSS) | Pluggable (OpenAI, ONNX, HuggingFace) | Pluggable (Milvus, FAISS, pgvector) | Prototyping, custom pipelines |
| RedisVL / Redis Stack | Pluggable | Redis (native) | Already running Redis; want TTL eviction in one store |
| LiteLLM built-in | OpenAI or Cohere | Redis or in-memory | You use LiteLLM as your gateway (simplest integration) |

If you are already using LiteLLM as your gateway, the built-in semantic cache is the path of least resistance — it is a few lines of config and inherits LiteLLM's TTL, model routing, and observability. For greenfield or custom setups, Redis Stack is the most production-ready: it handles eviction, sharding, and vector search in a single store you probably already operate.

Eviction, TTL, and invalidation

A cache that never expires becomes a correctness liability — answers to "what is the latest GPT model" go stale as models update. Every semantic cache entry should have a TTL (typically 1-7 days for general chat; shorter for time-sensitive content). Redis makes this trivial with per-key TTL; GPTCache requires manual eviction logic.

The harder problem is semantic invalidation: if you update your docs, old cached answers about the old docs are now wrong. The pragmatic approach is to version your cache namespace by content version (e.g. cache:v3:...) and flip the version on doc updates, effectively nuking the old cache. Do not try to surgically invalidate individual entries — it is not worth the complexity.
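
The versioned-namespace idea combined with a TTL can be sketched in a few lines. This is an in-memory illustration only: the class name, the `cache:<version>:<digest>` key layout, and the injected clock are assumptions made for the example, and it keys entries by a hash of the exact prompt to stay short. In a semantic cache the same version prefix would namespace the vector index, and in Redis the per-key TTL would replace the manual expiry check.

```python
import hashlib
import time


class VersionedTTLCache:
    def __init__(self, content_version: str, ttl_seconds: float, clock=time.monotonic):
        self.content_version = content_version
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}  # key -> (expires_at, answer)

    def _key(self, prompt: str) -> str:
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
        return f"cache:{self.content_version}:{digest}"

    def set(self, prompt: str, answer: str) -> None:
        self._store[self._key(prompt)] = (self.clock() + self.ttl_seconds, answer)

    def get(self, prompt: str):
        key = self._key(prompt)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, answer = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return answer

    def bump_version(self, new_version: str) -> None:
        """Flip the namespace on a docs update. Old entries become unreachable
        and age out through their TTL instead of being deleted one by one."""
        self.content_version = new_version
```

Because old keys are never looked up again after the version flips, no per-entry invalidation logic is needed; the TTL bounds how long the stale entries occupy memory.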

Measuring hit rate and quality

Two metrics define whether your semantic cache is working. Hit rate is the percentage of calls served from cache — 25-40% is healthy for support-chat workloads; below 15% suggests the threshold is too high or the workload is genuinely one-off. Hit quality is the fraction of cache hits that users find acceptable — measure this by tracking whether a user immediately re-asks the same question (a proxy for "the cached answer was wrong") or by spot-sampling hits for human review.

# Healthy semantic-cache metrics (support chat, threshold 0.90)
Hit rate:            28-35% of calls served from cache
Avg latency (hit):   ~10ms (vs ~1200ms LLM call)
Avg latency (miss):  ~1210ms (5-10ms cache check overhead)
Cost reduction:      ~30% of API spend eliminated
False-positive rate: <2% of hits (user re-asks signal)

If your false-positive rate rises above 3-5%, raise the threshold. If your hit rate drops below 15%, lower it or check whether your embedding model is too weak to cluster paraphrases. These two knobs, threshold and embedding model, drive most semantic-cache outcomes.
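
Both metrics and the tuning rule can be computed from a simple request log. The sketch below assumes each request is logged with two booleans: whether it was served from cache and whether the user immediately re-asked. The record type, function names, and the default cutoffs (3% false positives, taken from the low end of the range above, and 15% hit rate) are illustrative choices, not a standard.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    served_from_cache: bool
    user_reasked: bool = False  # proxy for "the cached answer was wrong"


def cache_metrics(records):
    """Return (hit_rate, false_positive_rate) for a list of CallRecord."""
    total = len(records)
    hits = [r for r in records if r.served_from_cache]
    hit_rate = len(hits) / total if total else 0.0
    false_positive_rate = sum(r.user_reasked for r in hits) / len(hits) if hits else 0.0
    return hit_rate, false_positive_rate


def tuning_hint(hit_rate, false_positive_rate, max_fp=0.03, min_hit=0.15):
    """Apply the rule of thumb: quality problems take priority over low hit rate."""
    if false_positive_rate > max_fp:
        return "raise threshold"
    if hit_rate < min_hit:
        return "lower threshold or check embedding model"
    return "ok"
```

Checking the false-positive rate first is deliberate: lowering the threshold to chase hit rate while quality is already poor would make wrong answers more common.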

Composing with other cost optimizations

Semantic caching is one lever in the broader cost optimization playbook, and it composes with the others. After a cache miss, model routing can send the request to a cheaper model (GPT-4o-mini instead of GPT-4o) for simple prompts — see our model routing guide. Prompt compression trims the prompt before the LLM sees it. Each lever is independent; together they compound into 70-85% API spend reduction on the right workloads.
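
The compounding works multiplicatively: each lever removes a fraction of whatever spend the previous levers left behind. The sketch below uses assumed per-lever savings (30% from the semantic cache, 50% from routing, 20% from compression); only the cache figure comes from the ranges in this article, the other two are placeholders to show the arithmetic.

```python
def combined_savings(*fractions: float) -> float:
    """Total fraction of spend removed when each lever saves its fraction
    of the spend remaining after the levers before it."""
    remaining = 1.0
    for fraction in fractions:
        remaining *= 1.0 - fraction
    return 1.0 - remaining


# Assumed savings: cache 30%, routing 50% of what is left, compression 20% of what is left
total = combined_savings(0.30, 0.50, 0.20)
print(f"combined savings: {total:.0%}")  # 1 - 0.7 * 0.5 * 0.8 = 72%
```

Note that the savings do not add up to 100%: three levers at 30%, 50%, and 20% leave 28% of the original spend, not zero.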

Order of operations

Semantic cache → prompt caching (prefix) → model routing → prompt compression. Check the cheapest lever (exact or semantic cache) first, and fall through to the LLM only on a miss, using the cheapest viable model and a compressed prompt. Each layer should cost measurably less than the one it falls through to.

FAQ

What is semantic caching for LLM APIs?

Semantic caching embeds each prompt into a vector and returns a cached answer when a similar past prompt (cosine similarity above a threshold) exists. On repetitive workloads such as support chat it serves 25-40% of calls from cache, cutting LLM API spend by roughly the same fraction, less the small cost of embedding each prompt. It goes beyond exact-match caching by catching rephrasings.

What similarity threshold should I use?

Start at 0.90 and tune. 0.92-0.95 for factual or structured output; 0.85-0.90 for casual chat. At 0.95+ you get few false-positive hits but lower hit rate; below 0.85 you risk serving semantically-near but factually-wrong answers. Always recalibrate when switching embedding models — similarity distributions differ.

GPTCache vs RedisVL: which should I use?

Use GPTCache for prototyping or custom pipelines. Use RedisVL (Redis Stack) if you already run Redis, since it unifies the vector index and TTL-based eviction in one store. If you use LiteLLM as your gateway, its built-in semantic cache (backed by Redis) is the simplest path and inherits routing and observability.

Does semantic caching work for agent and RAG workloads?

It works for the user-facing prompt layer of RAG (FAQ-style questions repeat heavily), but not for intermediate tool-call or retrieval layers where context is dynamic. For agents, cache the final user-facing answer, not the intermediate calls. Hit rate is highest for support chat, FAQ bots, and documentation assistants.

Sources

Hit rates and cost-reduction figures are workload-dependent. The 25-40% range reflects typical support-chat and FAQ workloads; creative or analytical workloads see much lower hit rates. Always measure on your own traffic before projecting savings.