Q: What similarity threshold should I use for semantic caching?

Start at 0.90 cosine similarity for general chat workloads and tune based on your error tolerance. At 0.95+ you get very few false-positive cache hits but a lower hit rate (10-15%); at 0.85-0.90 you get the highest usable hit rate (35-45%) but risk returning a slightly-off answer. For factual or structured-output workloads, use 0.92-0.95; for casual chat, 0.85-0.90 is acceptable.

Semantic Caching for LLM APIs: GPTCache, Redis, and the Threshold That Matters

Q: What is semantic caching for LLM APIs?

Semantic caching intercepts LLM API calls by embedding each prompt into a vector and checking whether a semantically similar past prompt (cosine similarity above a threshold, typically 0.85-0.95) has already been answered. If so, it returns the cached answer instead of calling the LLM. On typical support-chat and FAQ workloads, 25-40% of calls are redundant in this sense and can be served from cache, cutting LLM API spend by roughly the same fraction.

Q: GPTCache vs RedisVL: which should I use?

Use GPTCache (open-source, Python-native) for quick prototyping and when you want flexible embedding backends. Use Redis Vector Library (RedisVL) or Redis Stack when you already run Redis in production — it gives you a single store for both the vector index and TTL-based eviction, and scales horizontally. LiteLLM also has a built-in semantic cache backed by Redis that integrates cleanly with the gateway.

Q: Does semantic caching work for agent and RAG workloads?

It works well for the user-facing prompt layer of RAG (FAQ-style questions repeat heavily), but not for the retrieval or tool-call layer where context varies per call. For agents, cache the final answer to the user, not the intermediate LLM calls that depend on dynamic tool results. The hit rate is highest for support chat, FAQ bots, and any workload where users rephrase the same intent.

Users rephrase the same question in a dozen ways: "how do I reset my password", "password reset help", "I forgot my password". Exact-match caching misses all of them. Semantic caching embeds each prompt into a vector and returns a cached answer when cosine similarity exceeds a threshold, so the 25-40% of calls that are redundant rephrasings can be served from cache. The whole game is picking the right embedding model, the right threshold, and the right cache layer in your stack.

By the LLM Academy team · Reviewed July 2026 · Tested with GPTCache 0.1.x, Redis Stack 7.4+, LiteLLM v1.x

The problem: exact-match caching misses rephrasings

Every LLM gateway supports exact-match caching — if a user sends the identical prompt twice, return the stored answer. But real users rarely send byte-identical prompts. Support-chat traffic shows 10-30 surface variations of the same intent: "cancel my subscription", "how to cancel", "I want to unsubscribe", "end my plan". Exact-match caching catches none of them.
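To make the miss concrete, here is a minimal sketch of how an exact-match cache might derive its keys. The `exact_match_key` helper and its normalization rule are assumptions for illustration, not any particular gateway's implementation.

```python
import hashlib


def exact_match_key(prompt: str) -> str:
    """Hypothetical exact-match cache key: hash of the lightly normalized prompt."""
    normalized = " ".join(prompt.lower().split())  # lowercase, collapse whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


variants = [
    "cancel my subscription",
    "how to cancel",
    "I want to unsubscribe",
    "end my plan",
]

# Every rephrasing hashes to a different key, so none can reuse another's answer.
keys = {exact_match_key(v) for v in variants}
print(len(keys))  # 4
```

Normalizing case and whitespace catches trivial differences, but the key still depends on the exact words, so a different phrasing of the same intent always gets a different key.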

OpenAI's prompt caching (and the analogous features on Anthropic and Gemini) solves a related but different problem — it caches the prefix of a prompt to reduce prefill cost, but still runs the LLM for the divergent suffix. Semantic caching goes further: it skips the LLM call entirely when a similar past answer exists. The two techniques compose: prompt caching reduces prefill cost; semantic caching eliminates the call.

When semantic caching wins

The hit rate is highest for workloads where many users ask variations of the same questions: customer support, FAQ bots, documentation chat, internal knowledge bases. It is lowest for one-off creative or analytical prompts where no two inputs are alike.

How it works: embed, search, threshold

The pipeline is simple. On every incoming prompt: (1) embed the prompt into a vector using a small embedding model, (2) search the vector store for the nearest cached prompt, (3) if cosine similarity exceeds your threshold, return the cached answer; otherwise call the LLM and store the new prompt+answer+embedding.

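# Semantic cache flow (LiteLLM with Redis backend)
# Parameter names follow LiteLLM's semantic-caching docs; verify them against your version.
import os

import litellm
from litellm import completion
from litellm.caching import Cache

# Register a Redis-backed semantic cache once, at startup.
litellm.cache = Cache(
    type="redis-semantic",
    host="redis",
    port="6379",                             # passed as a string, as in the docs
    password=os.environ["REDIS_PASSWORD"],
    similarity_threshold=0.90,               # cosine-similarity cutoff for a hit
    redis_semantic_cache_embedding_model="text-embedding-3-small",
)

# LiteLLM's semantic cache checks similarity before calling the LLM
response = completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "how do I reset my password?"}],
    caching=True,                            # opt this call into the cache
)
# If "I forgot my password, help" was asked before at sim 0.92 → cache hit, no LLM call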

The embedding model is typically a fast, cheap one — OpenAI's text-embedding-3-small or a self-hosted all-MiniLM-L6-v2 (384-dim, runs on CPU). The vector store is Redis (via RedisVL), pgvector, Qdrant, or Milvus. The whole cache layer adds 5-15ms of latency on a miss, and returns in 5-15ms on a hit — versus 500-3000ms for an actual LLM call.
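The embed, search, threshold pipeline from this section can be sketched without any external services. This is a toy under stated assumptions: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, the linear scan stands in for a vector store, and `SemanticCache` and `answer_with_cache` are illustrative names rather than any library's API.

```python
import math
import re
from typing import Callable, List, Optional, Tuple

Vector = List[float]

# Hypothetical toy vocabulary; a real deployment would call an embedding model.
VOCAB = ["reset", "password", "forgot", "cancel", "subscription", "delete", "account"]


def toy_embed(text: str) -> Vector:
    """Count vocabulary words in the text (stand-in for a real embedding model)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [float(tokens.count(word)) for word in VOCAB]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.90):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []  # (prompt embedding, answer)

    def lookup(self, prompt: str) -> Optional[str]:
        """Steps 1-3: embed, find the nearest cached prompt, apply the threshold."""
        query = self.embed(prompt)
        best_score, best_answer = -1.0, None
        for vector, answer in self.entries:  # linear scan; a vector store does this step
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score > self.threshold else None

    def store(self, prompt: str, answer: str) -> None:
        self.entries.append((self.embed(prompt), answer))


def answer_with_cache(cache: SemanticCache, prompt: str, call_llm: Callable[[str], str]) -> str:
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached              # hit: skip the LLM entirely
    answer = call_llm(prompt)      # miss: pay for the call, then remember it
    cache.store(prompt, answer)
    return answer


llm_calls: List[str] = []


def fake_llm(prompt: str) -> str:
    llm_calls.append(prompt)
    return f"answer to: {prompt}"


cache = SemanticCache(embed=toy_embed, threshold=0.90)
first = answer_with_cache(cache, "how do I reset my password?", fake_llm)
second = answer_with_cache(cache, "please reset my password", fake_llm)  # similar intent
third = answer_with_cache(cache, "cancel my subscription", fake_llm)     # different intent
print(len(llm_calls))  # 2
```

With a real embedding model the similarity scores are less clear-cut than in this toy, which is why the threshold needs calibration.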

The similarity threshold: the single most important knob

The threshold decides everything. Set it too low and you return semantically-near but factually-wrong answers ("how do I delete my account" served with the cancel-subscription answer). Set it too high and you get few hits. The right value depends on your workload's tolerance for slight semantic drift.

Threshold	Hit rate	Risk	Best for
0.95+	10-15%	Very low — near-exact matches only	Medical, legal, high-stakes Q&A
0.90-0.95	20-30%	Low — safe for most structured output	FAQ, docs chat, RAG user-layer
0.85-0.90	35-45%	Moderate — occasional off-target hits	Casual support chat
<0.85	45%+	High — false positives common	Not recommended

Threshold drift

Different embedding models produce different similarity distributions. A threshold of 0.90 on text-embedding-3-small is not equivalent to 0.90 on all-MiniLM-L6-v2. Always re-calibrate the threshold when you switch embedding models — sample 100 known-similar and 100 known-dissimilar pairs and find the decision boundary empirically.
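One way to find that decision boundary, sketched under assumptions: you have already scored your labeled pairs with the new embedding model, and you pick the cutoff that classifies the most pairs correctly. The `calibrate_threshold` helper and the sample scores below are hypothetical.

```python
from typing import List, Tuple


def calibrate_threshold(similar: List[float], dissimilar: List[float]) -> Tuple[float, float]:
    """Return (threshold, accuracy) for the cutoff that best separates the two score sets.

    Candidate thresholds are midpoints between adjacent observed scores.
    On ties, the lowest candidate wins.
    """
    scores = sorted(set(similar) | set(dissimilar))
    if len(scores) < 2:
        raise ValueError("need at least two distinct similarity scores")
    candidates = [(low + high) / 2 for low, high in zip(scores, scores[1:])]
    total = len(similar) + len(dissimilar)

    best_threshold, best_accuracy = candidates[0], -1.0
    for t in candidates:
        correct_hits = sum(1 for s in similar if s > t)        # similar pairs that would hit
        correct_misses = sum(1 for s in dissimilar if s <= t)  # dissimilar pairs that would miss
        accuracy = (correct_hits + correct_misses) / total
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy


# Hypothetical cosine scores from a labeled sample (use ~100 pairs per class in practice).
similar_scores = [0.93, 0.91, 0.88, 0.95]
dissimilar_scores = [0.70, 0.82, 0.79, 0.86]

threshold, accuracy = calibrate_threshold(similar_scores, dissimilar_scores)
print(round(threshold, 2), accuracy)
```

Maximizing accuracy treats false positives and false negatives as equally costly. For high-stakes workloads you would likely weight false positives more heavily, which pushes the chosen threshold up.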

GPTCache vs RedisVL vs LiteLLM built-in

There are three common implementation choices, depending on what you already run.

Option	Embedding backend	Vector store	Best when
GPTCache (OSS)	Pluggable (OpenAI, ONNX, HuggingFace)	Pluggable (Milvus, FAISS, pgvector)	Prototyping, custom pipelines
RedisVL / Redis Stack	Pluggable	Redis (native)	Already running Redis; want TTL eviction in one store
LiteLLM built-in	OpenAI or Cohere	Redis or in-memory	You use LiteLLM as your gateway (simplest integration)

If you are already using LiteLLM as your gateway, the built-in semantic cache is the path of least resistance — it is a few lines of config and inherits LiteLLM's TTL, model routing, and observability. For greenfield or custom setups, Redis Stack is the most production-ready: it handles eviction, sharding, and vector search in a single store you probably already operate.

Eviction, TTL, and invalidation

A cache that never expires becomes a correctness liability — answers to "what is the latest GPT model" go stale as models update. Every semantic cache entry should have a TTL (typically 1-7 days for general chat; shorter for time-sensitive content). Redis makes this trivial with per-key TTL; GPTCache requires manual eviction logic.

The harder problem is semantic invalidation: if you update your docs, old cached answers about the old docs are now wrong. The pragmatic approach is to version your cache namespace by content version (e.g. cache:v3:...) and flip the version on doc updates, effectively nuking the old cache. Do not try to surgically invalidate individual entries — it is not worth the complexity.
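A minimal in-memory sketch of the two ideas in this section, TTL expiry and a versioned namespace. The `VersionedTTLCache` class is hypothetical and uses exact keys for brevity; in a real semantic cache the version would scope the vector index or key prefix, and Redis would handle the TTL for you.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class VersionedTTLCache:
    """Hypothetical cache with per-entry TTL and a content-version namespace."""

    def __init__(self, version: str, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.version = version
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, answer)

    def _key(self, prompt_id: str) -> str:
        # Namespace pattern from the text: cache:<version>:<id>
        return f"cache:{self.version}:{prompt_id}"

    def set(self, prompt_id: str, answer: str) -> None:
        self._store[self._key(prompt_id)] = (self.clock() + self.ttl_seconds, answer)

    def get(self, prompt_id: str) -> Optional[str]:
        key = self._key(prompt_id)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, answer = entry
        if self.clock() >= expires_at:
            del self._store[key]  # lazily drop expired entries
            return None
        return answer

    def bump_version(self, new_version: str) -> None:
        # Old entries are never read again; in Redis they would age out via TTL.
        self.version = new_version


now = [0.0]  # fake clock, in seconds
cache = VersionedTTLCache("v3", ttl_seconds=86400, clock=lambda: now[0])

cache.set("reset-password", "Go to Settings > Security.")
fresh = cache.get("reset-password")       # still within the 1-day TTL

now[0] = 86401.0
expired = cache.get("reset-password")     # TTL elapsed

cache.set("reset-password", "Go to Settings > Security.")
cache.bump_version("v4")                  # docs updated: flip the namespace
after_bump = cache.get("reset-password")  # v3 entry is invisible under v4
```

Flipping the version costs one cold-cache period after each docs update, which is usually cheaper than tracking which cached answers depend on which document.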

Measuring hit rate and quality

Two metrics define whether your semantic cache is working. Hit rate is the percentage of calls served from cache — 25-40% is healthy for support-chat workloads; below 15% suggests the threshold is too high or the workload is genuinely one-off. Hit quality is the fraction of cache hits that users find acceptable — measure this by tracking whether a user immediately re-asks the same question (a proxy for "the cached answer was wrong") or by spot-sampling hits for human review.

# Healthy semantic-cache metrics (support chat, threshold 0.90)
Hit rate:             28-35% of calls served from cache
Avg latency (hit):    ~10ms (vs ~1200ms LLM call)
Avg latency (miss):   ~1210ms (5-10ms cache check overhead)
Cost reduction:       ~30% of API spend eliminated
False-positive rate:  <2% of hits (user re-asks signal)

If your false-positive rate rises above 3-5%, raise the threshold. If your hit rate drops below 15%, lower it or check whether your embedding model is too weak to cluster paraphrases. These two knobs — threshold and embedding model — account for 90% of semantic-cache outcomes.
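These rules can be turned into a small monitoring check. The sketch below is illustrative: `CacheStats`, `tuning_advice`, and `expected_latency_ms` are hypothetical names, and the default cutoffs (3% false positives, 15% hit rate) are taken from the lower end of the ranges above.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    total_calls: int
    cache_hits: int
    reasked_after_hit: int  # hits followed by an immediate re-ask (false-positive proxy)

    @property
    def hit_rate(self) -> float:
        return self.cache_hits / self.total_calls if self.total_calls else 0.0

    @property
    def false_positive_rate(self) -> float:
        return self.reasked_after_hit / self.cache_hits if self.cache_hits else 0.0


def tuning_advice(stats: CacheStats, max_false_positive: float = 0.03,
                  min_hit_rate: float = 0.15) -> str:
    # Correctness first: a wrong cached answer is worse than a missed hit.
    if stats.false_positive_rate > max_false_positive:
        return "raise threshold"
    if stats.hit_rate < min_hit_rate:
        return "lower threshold or check embedding model"
    return "keep threshold"


def expected_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Average latency as a hit-rate-weighted mix of hit and miss latency."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


# Hypothetical day of traffic: 1000 calls, 300 hits, 5 re-asks after a hit.
stats = CacheStats(total_calls=1000, cache_hits=300, reasked_after_hit=5)
advice = tuning_advice(stats)
average_ms = expected_latency_ms(stats.hit_rate, hit_ms=10, miss_ms=1210)
print(advice, round(average_ms))
```

With a 30% hit rate, the ~10ms hit and ~1210ms miss latencies above average out to roughly 850ms per call.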

Composing with other cost optimizations

Semantic caching is one lever in the broader cost optimization playbook, and it composes with the others. After a cache miss, model routing can send the request to a cheaper model (GPT-4o-mini instead of GPT-4o) for simple prompts — see our model routing guide. Prompt compression trims the prompt before the LLM sees it. Each lever is independent; together they compound into 70-85% API spend reduction on the right workloads.
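The compounding is multiplicative: each lever only acts on the spend the previous levers left behind. A quick worked example follows, where the 30% cache figure echoes the numbers in this article and the routing and compression reductions are hypothetical placeholders.

```python
from typing import Iterable


def remaining_spend_fraction(reductions: Iterable[float]) -> float:
    """Fraction of the original spend left after applying independent reductions in turn."""
    remaining = 1.0
    for reduction in reductions:
        if not 0.0 <= reduction <= 1.0:
            raise ValueError("each reduction must be between 0 and 1")
        remaining *= 1.0 - reduction
    return remaining


# Hypothetical per-lever reductions, each applied to the spend that reaches it.
levers = {
    "semantic cache": 0.30,      # ~30% of calls never reach the LLM
    "model routing": 0.50,       # assumed: cheaper model halves the cost of the rest
    "prompt compression": 0.30,  # assumed: 30% fewer tokens on what is left
}

remaining = remaining_spend_fraction(levers.values())
total_savings = 1.0 - remaining
print(f"{total_savings:.1%}")  # 75.5%
```

Because the levers multiply rather than add, three moderate reductions land inside the 70-85% range without any single lever doing most of the work.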

Order of operations

Semantic cache → prompt caching (prefix) → model routing → prompt compression. Check the cheapest lever (exact or semantic cache) first; only fall through to the LLM with the cheapest viable model after compression. Each layer should measurably reduce the cost of whatever traffic reaches it.

FAQ

What is semantic caching for LLM APIs?

Semantic caching embeds each prompt into a vector and returns a cached answer when a similar past prompt (cosine similarity above a threshold) exists. On typical support-chat and FAQ workloads it serves 25-40% of calls from cache, cutting LLM API spend by roughly the same fraction. It goes beyond exact-match caching by catching rephrasings.

What similarity threshold should I use?

Start at 0.90 and tune. 0.92-0.95 for factual or structured output; 0.85-0.90 for casual chat. At 0.95+ you get few false-positive hits but lower hit rate; below 0.85 you risk serving semantically-near but factually-wrong answers. Always recalibrate when switching embedding models — similarity distributions differ.

GPTCache vs RedisVL: which should I use?

Use GPTCache for prototyping or custom pipelines. Use RedisVL (on Redis Stack) if you already run Redis, since it unifies the vector index and TTL-based eviction in one store. If you use LiteLLM as your gateway, its built-in semantic cache (backed by Redis) is the simplest path and inherits routing and observability.

Does semantic caching work for agent and RAG workloads?

It works for the user-facing prompt layer of RAG (FAQ-style questions repeat heavily), but not for intermediate tool-call or retrieval layers where context is dynamic. For agents, cache the final user-facing answer, not the intermediate calls. Hit rate is highest for support chat, FAQ bots, and documentation assistants.

Hit rates and cost-reduction figures are workload-dependent. The 25-40% range reflects typical support-chat and FAQ workloads; creative or analytical workloads see much lower hit rates. Always measure on your own traffic before projecting savings.