How to Self-Host LiteLLM
A production-ready LiteLLM proxy in 15 minutes: Docker Compose with PostgreSQL for spend tracking, virtual team API keys with budgets, multi-provider routing, and rate limits. Every config is copy-pasteable — verify against the official docs for your LiteLLM version.
Why self-host LiteLLM?
Self-hosting LiteLLM gives you a single OpenAI-compatible endpoint that routes to 100+ providers — OpenAI, Anthropic, Google, AWS Bedrock, Azure, and your own self-hosted vLLM replicas — with zero per-request markup. You pay providers directly, and LiteLLM tracks every cent. For teams spending meaningful money on LLM APIs, the control and cost transparency justify the modest operational overhead of running one Docker container and a Postgres database.
If you want zero infrastructure, OpenRouter or Portkey's SaaS are faster to start. If you want cost control, data residency, and no markup, self-hosted LiteLLM wins. See our gateway comparison for the full tradeoff.
Prerequisites
You need a host with Docker and Docker Compose installed — any cloud VM or on-prem server works. You need provider API keys for whichever models you want to route to (at minimum, an OpenAI key). For production, allocate a PostgreSQL database (the Docker Compose below includes one) so LiteLLM can persist spend tracking, team keys, and usage logs across restarts.
Step 1 — Create config.yaml
The config.yaml is the heart of a LiteLLM deployment. It declares which models are available, which provider keys to use, and the master key for admin access. This minimal config exposes GPT-4o and GPT-4o-mini through OpenAI, and a self-hosted Llama through a vLLM endpoint — LiteLLM presents all three behind one unified API.
# config.yaml
model_list:
  - model_name: gpt-4o                      # the name your app calls
    litellm_params:
      model: openai/gpt-4o                  # the actual provider model
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: llama-self-hosted
    litellm_params:
      model: openai/meta-llama/Llama-3.1-70B-Instruct
      api_base: http://vllm:8000/v1         # your self-hosted vLLM; its OpenAI-compatible routes live under /v1
      api_key: os.environ/VLLM_KEY
router_settings:
  routing_strategy: usage-based-routing     # balance load across replicas
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY # admin access
  database_url: os.environ/DATABASE_URL     # Postgres for spend tracking
The os.environ/ prefix pulls values from environment variables, keeping secrets out of the config file. The router_settings block enables load balancing when you list multiple deployments under one model_name.
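Because every secret is an os.environ/ reference, a missing variable tends to surface only when the proxy starts or the first request fails. The sketch below is a hypothetical pre-flight check, not part of LiteLLM: it scans the config text for os.environ/NAME references and reports any that are unset. The function names are illustrative, and the regex also matches references inside YAML comments.

```python
import os
import re

# Matches LiteLLM-style references such as "os.environ/OPENAI_API_KEY".
ENV_REF = re.compile(r"os\.environ/([A-Za-z_][A-Za-z0-9_]*)")


def referenced_env_vars(config_text: str) -> list[str]:
    """Return the unique variable names referenced in the config, in order."""
    seen: list[str] = []
    for name in ENV_REF.findall(config_text):
        if name not in seen:
            seen.append(name)
    return seen


def missing_env_vars(config_text: str, env=None) -> list[str]:
    """Return referenced variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in referenced_env_vars(config_text) if not env.get(name)]
```

To use it, read config.yaml into a string, call missing_env_vars(text), and abort the deploy if the list is non-empty. Run it in the same environment the proxy will see: in the Compose setup used in this guide, DATABASE_URL is set inside the container rather than in your shell, so a check run on the host would report it as missing.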
Step 2 — Run with Docker Compose
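The official Docker quick start bundles LiteLLM with PostgreSQL. This docker-compose.yml brings up both, mounts your config, and wires the database connection. Team keys and spend persist because DATABASE_URL points LiteLLM at Postgres; without a database, that state cannot survive a restart. The STORE_MODEL_IN_DB flag additionally lets LiteLLM store models added through the admin UI or API in Postgres (confirm its exact behavior in the docs for your version).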
# docker-compose.yml
version: "3.9"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000" # proxy endpoint
    volumes:
      - ./config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - VLLM_KEY=${VLLM_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=postgresql://litellm:litellm@db:5432/litellm
      - STORE_MODEL_IN_DB=True
    command: --config /app/config.yaml --port 4000 --num_workers 4
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=litellm
      - POSTGRES_DB=litellm
    volumes:
      - litellm_db:/var/lib/postgresql/data
    ports:
      - "5432:5432"
volumes:
  litellm_db:
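Bring it up with docker-compose up -d (docker compose up -d on Compose v2). The TECHSY 2026 setup guide notes that this Compose file, with PostgreSQL, virtual team keys, budgets, cost tracking, and rate limits, is the production baseline; the in-memory quick start is only for local testing. Verify the proxy is healthy at http://localhost:4000/health. On recent versions this endpoint expects the master key as a bearer token and actively checks the configured models; /health/liveliness is the lightweight unauthenticated probe. Confirm both paths against the docs for your version.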
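In a deploy script it helps to wait for the proxy before creating keys, since the container can take a while to finish starting. The following is a minimal polling sketch using only the standard library. It assumes a liveness URL such as http://localhost:4000/health/liveliness, which you should verify for your LiteLLM version; the function names and defaults are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a non-2xx status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_until_ready(url, attempts=30, delay=2.0, probe=http_status, sleep=time.sleep):
    """Poll `url` until it answers 200; return True on success, False on timeout."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_until_ready("http://localhost:4000/health/liveliness") right after bringing the stack up gives the container about a minute to start with the defaults shown. The probe and sleep parameters exist only so the loop can be tested without a network.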
[IMAGE: Docker Compose topology — app → LiteLLM proxy → Postgres (spend DB) + providers]
Step 3 — Create virtual team keys with budgets
Instead of handing every team your raw OpenAI key, mint virtual keys per team with their own budgets and rate limits. LiteLLM tracks spend against each key and rejects requests once the budget is exhausted. Use the master key to authenticate the admin call.
# Create a team key capped at $50/month
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "marketing-team",
    "max_budget": 50.00,
    "budget_duration": "1mo",
    "rpm_limit": 100,
    "models": ["gpt-4o-mini", "gpt-4o"]
  }'
# Response: { "key": "sk-litellm-abc123...", ... }
The returned sk-litellm-... key is what the marketing team uses in their app. They can only call the models you allow, they are capped at $50/month, and rate-limited to 100 requests/minute. Every call is logged to Postgres with token counts and cost, so you get per-team spend dashboards for free — this is the budget governance lever from our cost optimization playbook.
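If you provision keys from a script rather than by hand, the same /key/generate call can be made from Python. This is a sketch using only the standard library; the field names mirror the curl example above, while the helper names build_key_request and create_team_key are hypothetical and the response is assumed to carry the key in a "key" field as shown there.

```python
import json
import urllib.request


def build_key_request(base_url, master_key, team_id, max_budget, models,
                      budget_duration="1mo", rpm_limit=100):
    """Build (but do not send) the POST /key/generate request."""
    payload = {
        "team_id": team_id,
        "max_budget": max_budget,
        "budget_duration": budget_duration,
        "rpm_limit": rpm_limit,
        "models": list(models),
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/key/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_team_key(request, opener=urllib.request.urlopen):
    """Send the request and return the "key" field of the JSON response."""
    with opener(request, timeout=30) as resp:
        return json.load(resp)["key"]
```

Separating request construction from sending keeps the payload easy to inspect or log (minus the Authorization header) before anything touches the proxy.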
Step 4 — Call the proxy from your app
LiteLLM exposes an OpenAI-compatible API, so any OpenAI SDK client works unchanged. Point base_url at your proxy and pass the virtual key.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",
    api_key="sk-litellm-abc123...",  # the virtual team key
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # routes via config.yaml
    messages=[{"role": "user", "content": "Summarize this article."}],
)
Behind the scenes, LiteLLM translates this to the provider's native format, tracks the token cost against the marketing team's budget, applies the rate limit, and logs the request. Your app code never changes when you swap providers or add fallbacks — that all lives in config.yaml.
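The token counts that LiteLLM bills against the budget also come back to the client in the standard OpenAI-format usage field, which is handy for client-side logging. A small sketch, assuming the response body has the usual chat-completions shape (choices, usage); summarize_usage is a hypothetical helper name.

```python
def summarize_usage(response: dict) -> dict:
    """Extract the reply text and token counts from a chat-completions response body."""
    usage = response.get("usage") or {}
    return {
        "model": response.get("model"),
        "text": response["choices"][0]["message"]["content"],
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
    }
```

With the OpenAI SDK, the response object can be converted first, for example summarize_usage(response.model_dump()). Dollar cost is not part of this body; it is computed by the proxy and stored in Postgres.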
Production checklist
- Health checks: wire a health endpoint into your load balancer. /health actively checks the configured models and reports which are unreachable; for frequent probes, the lighter /health/liveliness or /health/readiness endpoints are usually the better fit (confirm the paths for your version).
- Secrets: keep all keys in environment variables or a secret manager; never commit them in config.yaml.
- Backups: the Postgres database holds spend history and team keys, so back it up regularly.
- Fallbacks: configure fallback chains in config.yaml so a provider outage fails over automatically.
- Observability: LiteLLM logs support export to Langfuse or LangSmith for tracing; see our observability cluster (coming).
- Scaling: run multiple LiteLLM replicas behind a load balancer; because state lives in Postgres, the replicas themselves are stateless.
Common pitfalls
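- Spend resets on restart: the proxy is not reaching Postgres. Check that DATABASE_URL is set and correct, since team keys and budgets live in the database. STORE_MODEL_IN_DB=True only governs models added through the UI or API.
- 429 from providers: your traffic exceeds the provider's own rate limits. Set your rpm_limit values below the provider's quota, or configure fallbacks.
- Cost tracking shows zero: usually LiteLLM cannot resolve pricing for the model. Check that the model name in your call matches a model_name in config.yaml and that the underlying model is one LiteLLM has pricing for.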
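On the client side, occasional 429 responses are usually handled by retrying with exponential backoff. The sketch below is a generic version with no LiteLLM-specific API: RateLimited is a hypothetical exception that your own send function raises when it sees HTTP 429, and the delays are illustrative defaults.

```python
import random
import time


class RateLimited(Exception):
    """Raised by the caller-supplied function when the proxy answers HTTP 429."""


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, jitter=random.random):
    """Call `send()` and retry on RateLimited with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + jitter() * base_delay)
```

If you use the OpenAI SDK, it can also retry rate-limit errors itself via the client's max_retries setting, which may be enough on its own. Retries do not help once a key's budget is exhausted; that needs a budget change on the proxy.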
Related deep dives
- LiteLLM vs Portkey vs OpenRouter — confirm LiteLLM is the right gateway choice
- LLM Cost Optimization Playbook — the 5 levers LiteLLM enables
- Deploy vLLM in Production — pair self-hosted inference with LiteLLM routing
Sources
- LiteLLM Documentation, "Docker Quick Start" and "Deploy (Docker, Helm, Terraform)," 2026 (docs.litellm.ai)
- TECHSY, "LiteLLM Proxy: 1 API for 100+ LLMs (15-min Setup)," 2026
- tanyongsheng, "LiteLLM Proxy for High-Availability LLM Services," 2026
- BerriAI/litellm GitHub repository, accessed June 2026
LiteLLM ships frequent releases; config keys and Docker image tags change. Verify against docs.litellm.ai for your installed version before deploying these configs to production.