LLM Model Routing: When to Route to GPT-4o-mini vs GPT-4o

Not every prompt needs GPT-4o. On most production traffic, 60-80% of requests — classification, extraction, formatting, FAQ — are handled perfectly well by a model costing 10-30x less. Model routing is the gateway layer that decides which model each prompt goes to. Done well, it cuts API spend 50-70% with no perceptible quality loss. Done badly, it silently degrades output on the prompts that needed the strong model. Here is how to do it well.

By the LLM Academy team · Reviewed July 2026 · Patterns apply to LiteLLM, Portkey, OpenRouter, and custom gateways

The economic case

The price gap between frontier and budget models is enormous. At 2026 rates, GPT-4o costs roughly $2.50 / $10.00 per million input/output tokens, while GPT-4o-mini costs roughly $0.15 / $0.60, a ratio of about 17x on both input and output. If 70% of your traffic can be served acceptably by the mini model, your blended cost drops by roughly 66%. See our live pricing comparison for the current numbers across providers.

# Blended cost example (100M output tokens/month, prices per million tokens)
All GPT-4o:        100M × $10.00/M = $1,000/month
70% mini, 30% 4o:  70M × $0.60/M + 30M × $10.00/M = $42 + $300 = $342/month
Savings:           ~$658/month (66% reduction)
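
The same blended-cost arithmetic can be wrapped in a small helper so you can plug in your own traffic split and prices. This is a minimal sketch: the function names `blended_cost` and `savings_fraction` are illustrative, prices are assumed to be quoted in dollars per million tokens, and only one token type (here, output tokens) is counted.

```python
def blended_cost(tokens_millions: float, cheap_fraction: float,
                 cheap_price: float, strong_price: float) -> float:
    """Cost in dollars; prices are dollars per million tokens."""
    if not 0.0 <= cheap_fraction <= 1.0:
        raise ValueError("cheap_fraction must be between 0 and 1")
    cheap = tokens_millions * cheap_fraction * cheap_price
    strong = tokens_millions * (1.0 - cheap_fraction) * strong_price
    return cheap + strong


def savings_fraction(cheap_fraction: float, cheap_price: float,
                     strong_price: float) -> float:
    """Fraction saved versus sending all traffic to the strong model."""
    baseline = blended_cost(1.0, 0.0, cheap_price, strong_price)
    routed = blended_cost(1.0, cheap_fraction, cheap_price, strong_price)
    return 1.0 - routed / baseline


# Output-token prices quoted in the text: $0.60 (mini) and $10.00 (4o) per million.
print(round(savings_fraction(0.70, 0.60, 10.00), 3))  # 0.658
```

Note that the savings fraction depends only on the price ratio and the share of traffic you downgrade, not on total volume.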

The question is never "is routing worth it" — at this price gap it almost always is. The question is how to decide which prompts to downgrade without hurting quality.

Rule-based routing: the 80% solution

Most routing decisions can be made with simple, transparent rules. The signals that reliably predict "this prompt is simple" are: short input length, presence of formatting or extraction keywords, a known intent classification (from the application layer), and structured-output requests (JSON extraction, classification labels). Conversely, long inputs, open-ended creative asks, and reasoning-heavy prompts should go to the strong model.

| Signal | Route to mini | Route to 4o |
| --- | --- | --- |
| Prompt length | < 500 tokens | > 2000 tokens |
| Task type keyword | "summarize", "classify", "extract", "format" | "analyze", "reason", "design", "compare" |
| Output type | Short label, JSON field, yes/no | Multi-paragraph, open-ended |
| Intent (from app) | FAQ lookup, simple tool call | Complex planning, creative writing |
| Uncertainty | | Borderline → default to 4o |

Default to the strong model on ambiguity

The asymmetric risk is routing a hard prompt to the cheap model (silent quality loss) versus routing an easy prompt to the expensive model (wasted money). Always bias borderline cases to the strong model. You can tighten rules over time as you gather evidence about what the cheap model handles well.

# LiteLLM router config (YAML) — rule-based routing
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini

router_settings:
  routing_strategy: simple-shuffle   # default; use custom for rules

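# Custom rule: short prompts → mini, applied in your gateway code (Python, separate from the YAML above)
SIMPLE_KEYWORDS = ("summarize", "classify", "extract", "format", "is this")

def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); use a real tokenizer in production
    return len(text) // 4

def route_model(prompt: str) -> str:
    text = prompt.lower()
    if estimate_tokens(prompt) < 500 and any(k in text for k in SIMPLE_KEYWORDS):
        return "gpt-4o-mini"
    return "gpt-4o"   # default to strong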
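
To tighten rules over time, it helps to record which rule fired for each request. The sketch below extends the length-and-keyword idea with the intent signal and returns a reason string you can log. Everything here is an assumption for illustration: the `Request` fields, the intent labels, and the 4-characters-per-token estimate stand in for whatever your application layer actually provides.

```python
from dataclasses import dataclass
from typing import Optional

CHEAP, STRONG = "gpt-4o-mini", "gpt-4o"
SIMPLE_KEYWORDS = ("summarize", "classify", "extract", "format")
SIMPLE_INTENTS = {"faq_lookup", "simple_tool_call"}      # hypothetical labels
HARD_INTENTS = {"complex_planning", "creative_writing"}  # hypothetical labels


@dataclass
class Request:
    prompt: str
    intent: Optional[str] = None  # set by the application layer, if known


def route(req: Request) -> tuple[str, str]:
    """Return (model, reason). Anything not clearly simple goes to STRONG."""
    tokens = len(req.prompt) // 4  # rough heuristic; swap in a real tokenizer
    if req.intent in HARD_INTENTS:
        return STRONG, "hard_intent"
    if tokens > 2000:
        return STRONG, "long_prompt"
    if tokens < 500 and req.intent in SIMPLE_INTENTS:
        return CHEAP, "simple_intent"
    if tokens < 500 and any(k in req.prompt.lower() for k in SIMPLE_KEYWORDS):
        return CHEAP, "simple_keyword"
    return STRONG, "default_strong"
```

Checking the hard signals first means a keyword match can never override an intent the application has already flagged as complex, and the final `default_strong` branch keeps borderline traffic on the strong model.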

Classifier-based routing: when rules are not enough

Rules fail on prompts that look simple but are actually hard ("write a 3-line poem about grief") or look complex but are mechanical ("summarize this 5000-word contract in 3 bullets"). A classifier-based router trains a small model (often a fine-tuned BERT or the mini model itself) to predict whether the strong model will meaningfully outperform the cheap one on each prompt. RouteRouter, Unify AI, and the academic "RouteLLM" paper (Ong et al., 2024) all follow this pattern.

The trade-off is added latency and complexity. The classifier call adds 20-80ms and the classifier itself needs labeled training data (prompts where you know which model won). For most teams this is overkill — rules handle 80% of the savings. But if you run very high volume and your prompts are genuinely hard to classify by surface signals, a classifier can squeeze out another 10-15% savings on top of rules.
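
Whatever model produces the score, the routing decision itself can be reduced to a threshold on the predicted probability that the strong model wins. The sketch below picks that threshold from labeled data. It assumes you already have classifier scores and win labels for a held-out set; `route_by_score`, `pick_threshold`, and `max_regret` are illustrative names, not the API of any product or paper mentioned above.

```python
def route_by_score(p_strong_wins: float, threshold: float) -> str:
    """Send the prompt to the strong model when the score reaches the threshold."""
    return "gpt-4o" if p_strong_wins >= threshold else "gpt-4o-mini"


def pick_threshold(labeled: list[tuple[float, bool]], max_regret: float) -> float:
    """Highest observed score usable as a threshold within the regret budget.

    labeled: (classifier score, strong model actually won) pairs from a
    held-out set. Prompts scoring below the threshold are downgraded; the
    regret is the share of those downgraded prompts the strong model won.
    Returns 0.0 (route everything to the strong model) if nothing qualifies.
    """
    best = 0.0
    for t in sorted({score for score, _ in labeled}):
        downgraded = [won for score, won in labeled if score < t]
        if downgraded and sum(downgraded) / len(downgraded) <= max_regret:
            best = t
    return best


held_out = [(0.1, False), (0.2, False), (0.3, False), (0.6, True), (0.9, True)]
threshold = pick_threshold(held_out, max_regret=0.05)
print(threshold, route_by_score(0.3, threshold))  # 0.6 gpt-4o-mini
```

Falling back to a threshold of 0.0 when no candidate meets the budget keeps the router conservative: with too little evidence, everything goes to the strong model.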

Classifier maintenance burden

A classifier router is a model — it drifts as your traffic mix changes and needs retraining. Unless you have the eval infrastructure and traffic volume to justify it, the operational tax is not worth the marginal savings. Start with rules; only add a classifier when you have evidence that rules are mis-routing a meaningful fraction of traffic.

Cascade routing: try cheap, fall back to strong

A third pattern is the cascade: always try the cheap model first, then use a confidence or validation check to decide whether to fall back to the strong model. For structured-output tasks this works well — if the mini model returns valid JSON with all required fields, use it; otherwise retry with the strong model. For open-ended generation, confidence is hard to assess and cascades are less reliable.

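# Cascade pattern for structured extraction
# call_model, is_valid_json, and has_all_fields are your own helpers, not library calls
import json

def extract_with_cascade(prompt: str) -> dict:
    result = call_model("gpt-4o-mini", prompt)     # raw model text
    if is_valid_json(result) and has_all_fields(result):
        return json.loads(result)   # cheap model succeeded
    # Fallback: strong model (validate this output too before trusting it)
    return json.loads(call_model("gpt-4o", prompt))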

The downside of cascades is latency on the fallback path — the user waits for a failed cheap call plus the strong call. For interactive workloads, run the cascade asynchronously or set a tight timeout on the cheap model so fallback triggers quickly. For batch workloads, cascades are almost always worth it.
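
One way to bound the fallback latency is to put a deadline on the cheap call. The sketch below uses a thread pool from the standard library. The `cheap` and `strong` callables are placeholders for your own gateway client, and the pool size and default timeout are arbitrary illustration values.

```python
import json
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

_pool = ThreadPoolExecutor(max_workers=8)  # shared pool; size is illustrative


def cascade(prompt: str,
            cheap: Callable[[str], str],
            strong: Callable[[str], str],
            required: set[str],
            cheap_timeout_s: float = 2.0) -> dict:
    """Try the cheap model under a deadline; fall back to the strong model."""
    future = _pool.submit(cheap, prompt)
    try:
        raw = future.result(timeout=cheap_timeout_s)
        data = json.loads(raw)
        if isinstance(data, dict) and required.issubset(data):
            return data                   # cheap model succeeded
    except (FutureTimeout, json.JSONDecodeError):
        pass                              # too slow, or output was not valid JSON
    # Fallback path. The abandoned cheap call is not cancelled here: it keeps
    # running in its thread until it returns.
    return json.loads(strong(prompt))
```

Because the timed-out cheap call still runs to completion, a tight timeout trades some wasted cheap-model spend for lower tail latency. Other exceptions raised by the cheap call (for example, network errors) are deliberately not caught in this sketch.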

The eval harness: how you avoid silent quality loss

Model routing is only safe if you can detect when it goes wrong. The non-negotiable prerequisite is an eval harness: a labeled set of prompts with known-good outputs (or quality scores), run continuously against both the strong and the cheap model. The metric to track is downgrade regret — the fraction of prompts you routed to the cheap model where the strong model's output scored meaningfully higher on your eval.

If your downgrade-regret rate rises above a few percent, your routing rules are too aggressive — tighten them. If it sits near zero, your rules are conservative and you can afford to downgrade more aggressively. This feedback loop is the difference between routing that saves money and routing that quietly breaks your product. See our LLM-as-a-Judge guide for how to automate the quality scoring.
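
Downgrade regret is straightforward to compute once each eval prompt has a score for both models. A minimal sketch follows, assuming scores on a 0-1 scale and treating a gap above `margin` as "meaningfully higher". The `EvalRecord` shape and the 0.1 default margin are assumptions to adapt to your own eval.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    routed_to: str        # model the router picked for this prompt
    cheap_score: float    # eval score of the cheap model's output
    strong_score: float   # eval score of the strong model's output


def downgrade_regret(records: list[EvalRecord], margin: float = 0.1,
                     cheap_model: str = "gpt-4o-mini") -> float:
    """Share of cheap-routed prompts where the strong model won by more than margin."""
    downgraded = [r for r in records if r.routed_to == cheap_model]
    if not downgraded:
        return 0.0
    regrets = sum(1 for r in downgraded if r.strong_score - r.cheap_score > margin)
    return regrets / len(downgraded)


records = [
    EvalRecord("gpt-4o-mini", 0.80, 0.82),  # near tie: no regret
    EvalRecord("gpt-4o-mini", 0.50, 0.90),  # strong model clearly better
    EvalRecord("gpt-4o-mini", 0.95, 0.90),  # cheap model was fine
    EvalRecord("gpt-4o-mini", 0.70, 0.75),  # within margin
    EvalRecord("gpt-4o", 0.40, 0.90),       # not downgraded, so excluded
]
print(downgrade_regret(records))  # 0.25
```

Prompts routed to the strong model are excluded from the denominator, so the metric measures only the decisions that could have hurt quality.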

Sample, don't gate

In production, run a small fraction of traffic (1-5%) through both models and log the quality delta. This shadow-traffic approach gives you continuous eval data without gating every request on a quality check. Langfuse and LangSmith both support this pattern natively.
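
Choosing which requests enter the shadow sample can be done deterministically by hashing a request ID, so the same request is always in or out regardless of which server handles it. This is a sketch under assumptions: `in_shadow_sample` is an illustrative helper, not part of any observability SDK, and it presumes each request carries a stable string ID.

```python
import hashlib


def in_shadow_sample(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of requests for dual-model logging."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate
```

For a sampled request, you would serve the routed model's answer as usual and send the same prompt to the other model in the background, logging both outputs for later scoring.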

Composing with caching and other levers

Model routing sits in the middle of the cost-optimization stack. Semantic caching runs first — if a similar prompt was answered before, return the cached answer and skip routing entirely. After routing selects a model, prompt compression trims the prompt and batch APIs can further halve cost for non-time-sensitive workloads. Each layer is independent and they compound.

# Order of cost-optimization checks (per request)
1. Exact-match cache  → hit? return (free)
2. Semantic cache     → hit? return (~free)
3. Model routing      → pick cheapest viable model
4. Prompt compression → trim input tokens
5. Batch API          → defer if latency allows (50% off)
# A cache hit ends the request; steps 3-5 run only on a cache miss.
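
The cache-then-route ordering can be sketched as a single request handler. This is illustrative only: `route`, `call_model`, and `semantic_lookup` are caller-supplied placeholders, the in-memory dictionary stands in for a real cache, and prompt compression and batching are left out to keep the example short.

```python
import hashlib
from typing import Callable, Optional

_exact_cache: dict[str, str] = {}  # stand-in for a real cache service


def handle(prompt: str,
           route: Callable[[str], str],
           call_model: Callable[[str, str], str],
           semantic_lookup: Optional[Callable[[str], Optional[str]]] = None) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _exact_cache:                # 1. exact-match cache
        return _exact_cache[key]
    if semantic_lookup is not None:        # 2. semantic cache (caller-supplied)
        hit = semantic_lookup(prompt)
        if hit is not None:
            return hit
    model = route(prompt)                  # 3. routing runs only on a cache miss
    answer = call_model(model, prompt)
    _exact_cache[key] = answer             # populate the cache for next time
    return answer
```

Putting the caches first means a hit costs neither a routing decision nor a model call.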

Gateway support: LiteLLM, Portkey, OpenRouter

All three major gateways support model routing, with different ergonomics. LiteLLM offers a YAML-based router with fallback chains and custom strategies — the most flexible for rule-based routing. Portkey has a visual routing UI with conditional rules and A/B testing. OpenRouter auto-routes across providers for cost and uptime, with less granular control. See our gateway comparison for which fits your stack.

| Gateway | Rule routing | Classifier routing | Cascade/fallback |
| --- | --- | --- | --- |
| LiteLLM | ✓ YAML + custom code | ✓ Plug-in your own | ✓ Native |
| Portkey | ✓ Visual rule builder | Limited | ✓ Native |
| OpenRouter | Auto (cost/latency) | | ✓ Provider fallback |

FAQ

How much does model routing save?

Typically 50-70% of API cost, by sending the 60-80% of simple prompts to a cheap model (GPT-4o-mini, Claude Haiku) and reserving the expensive model for the prompts that genuinely need it. The exact savings depend on the price ratio and the fraction of traffic you can safely downgrade.

Should I use rule-based or classifier-based routing?

Start with rules (length, keywords, intent) — they are transparent, fast, and handle 80% of cases. Add a classifier only if you have the eval harness and traffic volume to justify the complexity and maintenance. Most teams never need the ML approach.

What is the latency cost of model routing?

Rule-based routing adds under 1ms. Classifier-based routing adds 20-80ms for the routing model call. For latency-critical workloads, prefer rules or run the classifier asynchronously on streaming traffic.

Can model routing hurt quality?

Yes — routing a genuinely hard prompt to the cheap model degrades output silently. The mitigation is a continuous eval harness that measures downgrade-regret rate, and a conservative default that routes borderline cases to the strong model. Monitor regret rate and tighten rules when it rises.

Sources

Savings figures depend on workload mix, model pricing (which changes), and the quality bar of your application. Always validate routing decisions against your own eval set before trusting projected savings.