How to Deploy vLLM in Production
A complete 2026 walkthrough: run vLLM in Docker, configure tensor parallelism across GPUs, tune memory for your VRAM budget, and expose an OpenAI-compatible API. Every command is copy-pasteable; verify against the official docs for your vLLM version.
What you need before starting
vLLM requires an NVIDIA GPU with Compute Capability 7.0 or higher (Volta V100 and newer). For a 7B model in FP16 you need roughly 14 GB of VRAM for weights plus a KV-cache budget; a single 80 GB A100 or H100 comfortably serves Llama-3-8B with room for thousands of tokens of cache. For 70B models, plan for at least 2 GPUs via tensor parallelism — a single H100's 80 GB cannot hold FP16 70B weights alone. Install Docker with NVIDIA Container Toolkit so the container can access the GPU.
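The sizing figures above follow from a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter. The sketch below is a back-of-the-envelope check, not a vLLM API; the function names are illustrative, the parameter counts are nominal (7e9 and 70e9), and sizes are in decimal gigabytes.

```python
import math


def weight_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint in decimal GB (FP16/BF16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, gpu_gb: float = 80.0,
                         bytes_per_param: float = 2.0) -> int:
    """Smallest GPU count whose combined VRAM holds the weights alone (no KV cache)."""
    return math.ceil(weight_gb(num_params, bytes_per_param) / gpu_gb)


print(weight_gb(7e9))              # 14.0
print(weight_gb(70e9))             # 140.0
print(min_gpus_for_weights(70e9))  # 2
```

Two 80 GB GPUs hold FP16 70B weights with only about 20 GB to spare in total, which is why deployments that need a sizeable KV cache often use four.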
Use our KV Cache Calculator to size your cache budget before deploying. A 70B model serving 32k context for 50 concurrent users can need 60+ GB of KV cache alone, and several times more when many requests fill the full context.
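For a rough idea of where such numbers come from, KV-cache size can be estimated per token and multiplied by the tokens held at once. This is a sketch under stated assumptions: the model shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) is assumed to match Llama-3.1-70B, the cache is assumed to be FP16, and the function names are illustrative.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(tokens: int, bytes_per_token: int) -> float:
    """Total KV cache in GiB for the given number of cached tokens."""
    return tokens * bytes_per_token / 2**30


# Assumed Llama-3.1-70B-style shape with an FP16 cache.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
print(per_token)                            # 327680 bytes (320 KiB) per token

# Worst case: all 50 users fill a 32k (32768-token) context at the same time.
print(kv_cache_gib(50 * 32768, per_token))  # 500.0
# Lighter load: 50 users averaging 4096 cached tokens each.
print(kv_cache_gib(50 * 4096, per_token))   # 62.5
```

The gap between the two results shows why the average context length actually in flight matters more than the advertised maximum when sizing the cache.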
Step 1 — Run vLLM in Docker
The fastest path to a working server is the official vLLM Docker image. This single command pulls the image, exposes the OpenAI-compatible API on port 8000, and loads a model from HuggingFace. The --gpus all flag is mandatory — without it the container cannot see your GPU and vLLM will fail to initialize CUDA.
docker run --gpus all \
  -p 8000:8000 \
  --shm-size 8g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1
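The --shm-size 8g flag gives the container enough shared memory for communication between GPU worker processes (PyTorch and NCCL, the GPU communication library, both rely on it); without it, multi-GPU tensor parallelism can crash or hang with little diagnostic output. Passing --ipc=host instead lets the container use the host's shared memory directly. The volume mount caches the downloaded model on your host so you don't re-download on every restart.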
[IMAGE: Diagram of the Docker container exposing port 8000 with GPU access]
Step 2 — Configure tensor parallelism for multi-GPU
For models larger than a single GPU's VRAM, use tensor parallelism (TP) to split the model's weight matrices across GPUs. Set --tensor-parallel-size to the number of GPUs the model should be split across, which on a single node is usually every GPU; the model's attention-head count must be divisible by this value. On a 4-GPU node serving a 70B model, set TP to 4 so each GPU holds roughly a quarter of the weights.
# Serve a 70B model across 4 GPUs on one node
docker run --gpus all \
  -p 8000:8000 \
  --shm-size 16g \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4
According to the official vLLM parallelism docs, tensor parallelism is the right default for single-node multi-GPU because it balances simplicity and performance. For multi-node setups, combine tensor parallel with pipeline parallel (--pipeline-parallel-size), and ensure every node has an identical environment — same model path, same Python version, same NCCL version. A common gotcha: tensor parallelism on newer GPUs (sm_120 / RTX 5090) sometimes needs an NCCL update inside the container.
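Before launching, it can help to sanity-check a parallel layout. The helper below is a hypothetical pre-flight check, not part of vLLM: it assumes you intend to use every visible GPU (so tensor-parallel size times pipeline-parallel size should equal the GPU count) and that the attention-head count must divide evenly by the tensor-parallel size. The default of 64 heads is an assumption matching a Llama-3.1-70B-style model.

```python
def check_parallel_layout(num_gpus: int, tp: int, pp: int = 1,
                          num_attention_heads: int = 64) -> list[str]:
    """Return human-readable problems with a TP/PP layout; an empty list means none found."""
    problems = []
    if tp * pp != num_gpus:
        problems.append(f"layout uses {tp * pp} GPUs but {num_gpus} are visible")
    if num_attention_heads % tp != 0:
        problems.append(f"{num_attention_heads} attention heads not divisible by tp={tp}")
    return problems


print(check_parallel_layout(num_gpus=4, tp=4))        # []
print(check_parallel_layout(num_gpus=8, tp=4, pp=2))  # []
print(check_parallel_layout(num_gpus=4, tp=3))        # two problems reported
```

Running a check like this in a deployment script catches a mismatched layout before the container spends minutes loading weights.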
Step 3 — Tune GPU memory for your workload
Three flags control how vLLM uses VRAM. Get these wrong and you either OOM on startup or leave GPU memory idle. The defaults assume a single model on a single GPU; for production you should set them explicitly.
| Flag | What it controls | Production guidance |
|---|---|---|
| --gpu-memory-utilization | Fraction of VRAM vLLM may use (0.0–1.0) | 0.90 default; lower to 0.85 if you co-locate other processes |
| --max-model-len | Maximum sequence length (prompt + output) | Defaults to the model's configured context window (131072 tokens for Llama-3.1); set it lower, for example 32768, to save KV-cache VRAM |
| --max-num-seqs | Max concurrent requests in a batch | 256 default; raise for throughput, lower if you hit OOM under load |
The relationship is a zero-sum budget: the VRAM vLLM is allowed to use = weights + KV cache + activation memory. A longer --max-model-len needs more KV cache per sequence, and a higher --max-num-seqs lets more sequences compete for it. When the cache runs short under load, vLLM queues or preempts requests, and it refuses to start if the cache cannot hold even one full-length sequence. The 2026 Spheron production guide recommends starting with the defaults, then tuning --max-num-seqs upward until you see the first OOM under load, then backing off 20%.
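The budget arithmetic can be sketched in a few lines. Everything here is illustrative rather than a vLLM API: the 4 GB activation figure is a placeholder you would replace with a measured value, the 16 GB weight figure assumes an 8B model in FP16, and the 131072 bytes of KV cache per token assumes a Llama-3.1-8B-style shape (32 layers, 8 KV heads, head dimension 128, FP16 cache).

```python
def kv_budget_gb(total_vram_gb: float, gpu_memory_utilization: float,
                 weights_gb: float, activation_gb: float) -> float:
    """VRAM left for KV cache inside vLLM's allowance, in decimal GB."""
    return total_vram_gb * gpu_memory_utilization - weights_gb - activation_gb


def full_length_sequences(kv_gb: float, max_model_len: int, kv_bytes_per_token: int) -> int:
    """How many sequences of max_model_len tokens fit in the KV budget at once."""
    return int(kv_gb * 1e9) // (max_model_len * kv_bytes_per_token)


def back_off(first_oom_max_num_seqs: int, percent: int = 20) -> int:
    """Apply the 'back off 20%' rule to the --max-num-seqs value that first hit OOM."""
    return first_oom_max_num_seqs * (100 - percent) // 100


budget = kv_budget_gb(total_vram_gb=80, gpu_memory_utilization=0.90,
                      weights_gb=16, activation_gb=4)  # activation_gb is a placeholder
print(round(budget, 2))                                # 52.0
print(full_length_sequences(budget, max_model_len=32768, kv_bytes_per_token=131072))  # 12
print(back_off(320))                                   # 256
```

Only about a dozen full 32k sequences fit in this example, yet far more short requests can share the same cache, which is why the empirical tune-then-back-off approach is paired with the arithmetic rather than replaced by it.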
If you serve an AWQ or GPTQ INT4 model, weights shrink roughly 4x relative to FP16 and you free VRAM for a larger KV cache. See our KV Cache Quantization guide and Quantization Calculator to recompute your budget.
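To see what that freed memory buys, the sketch below converts the weight savings into extra cacheable tokens. It is an idealized estimate: it ignores the per-group scales and zero-points that real AWQ/GPTQ checkpoints store, uses a nominal 70 billion parameters, and assumes 327680 bytes of FP16 KV cache per token (a Llama-3.1-70B-style shape). The function name is illustrative.

```python
def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Idealized weight size in bytes, ignoring quantization metadata."""
    return num_params * bits_per_param // 8


PARAMS = 70_000_000_000          # nominal 70B parameter count (assumption)
KV_BYTES_PER_TOKEN = 327_680     # assumed FP16 KV cache per token for this model shape

fp16 = weight_bytes(PARAMS, 16)
int4 = weight_bytes(PARAMS, 4)
freed = fp16 - int4

print(fp16 // 10**9, int4 // 10**9, freed // 10**9)  # 140 35 105 (decimal GB)
print(freed // KV_BYTES_PER_TOKEN)                   # extra tokens of KV cache that now fit
```

In practice the gain is somewhat smaller than this estimate because of the quantization metadata and any layers left in higher precision.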
Step 4 — Call the OpenAI-compatible API
vLLM exposes an API that is wire-compatible with OpenAI's, so any OpenAI SDK client works unchanged. Point the base_url at your vLLM server and use the model name you passed to --model.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM ignores the key unless you set --api-key
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
For production, set --api-key to require authentication, and place a reverse proxy (nginx, Caddy, or a managed gateway like LiteLLM) in front for TLS termination, rate limiting, and request routing across multiple vLLM replicas.
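When the server runs with --api-key behind a proxy, clients send the key as a bearer token, the same way OpenAI clients do. The sketch below builds such a request with only the standard library so the wire format is visible; it does not send anything. The hostname llm.example.internal and the key value are placeholders, and the helper name is illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str,
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # must match the server's --api-key
        },
        method="POST",
    )


# Placeholder host and key; replace with your proxy's address and real secret.
req = build_chat_request("https://llm.example.internal/v1", "my-secret-key",
                         "meta-llama/Llama-3.1-8B-Instruct", "ping")
print(req.full_url)      # https://llm.example.internal/v1/chat/completions
print(req.get_method())  # POST
# To send it: urllib.request.urlopen(req, timeout=60)
```

With the OpenAI SDK, the equivalent is passing the same key as api_key and the proxy URL as base_url.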
Production checklist
- Health checks: vLLM exposes /health; wire it into your load balancer so traffic only routes to ready replicas.
- Persistence: mount the HuggingFace cache volume so container restarts don't re-download multi-gigabyte models.
- Observability: vLLM exports Prometheus metrics on /metrics; track vllm:num_requests_running, vllm:gpu_cache_usage_perc, and vllm:e2e_request_latency_seconds.
- Autoscaling: scale on GPU utilization (target 80-90%) rather than CPU; pair with the right engine choice for your workload shape.
- Graceful shutdown: send SIGTERM and let vLLM finish in-flight requests; a hard kill drops active generations.
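As one way to act on the observability item, the sketch below extracts the metrics named in the checklist from a Prometheus text payload. The sample text is made up for illustration and is not real vLLM output; exact metric names and labels can differ between vLLM versions, and the parser assumes samples carry no trailing timestamp.

```python
def parse_metrics(text: str, wanted: set[str]) -> dict[str, float]:
    """Sum sample values per metric name from Prometheus text format, ignoring labels."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in wanted:
            totals[name] = totals.get(name, 0.0) + float(value)
    return totals


# Illustrative sample only; fetch the real payload from http://<host>:8000/metrics.
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.42
vllm:e2e_request_latency_seconds_sum{model_name="demo"} 12.5
vllm:e2e_request_latency_seconds_count{model_name="demo"} 5.0
"""

m = parse_metrics(SAMPLE, {
    "vllm:num_requests_running",
    "vllm:gpu_cache_usage_perc",
    "vllm:e2e_request_latency_seconds_sum",
    "vllm:e2e_request_latency_seconds_count",
})
mean_latency = (m["vllm:e2e_request_latency_seconds_sum"]
                / m["vllm:e2e_request_latency_seconds_count"])
print(m["vllm:num_requests_running"], m["vllm:gpu_cache_usage_perc"], mean_latency)  # 3.0 0.42 2.5
```

A production setup would normally let Prometheus scrape /metrics directly; a small parser like this is mainly useful for smoke tests and ad hoc checks.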
Common pitfalls
OOM on startup usually means the weights plus the KV cache implied by --max-model-len and --max-num-seqs do not fit in the memory allowed by --gpu-memory-utilization. Lower one or both. Tensor parallelism hangs on multi-GPU usually mean a stale NCCL; rebuild the container or update NCCL. A slow first request is normal cold-start behavior (model load plus CUDA graph capture takes 30-90 seconds); keep a warm replica in production rather than scaling to zero.
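Because cold start can take a minute or more, deployment scripts often wait for the server before sending traffic. The sketch below polls the /health endpoint mentioned in the checklist using only the standard library. The function names, the 180-second deadline, and the 2-second interval are assumptions to adjust; the deadline counts only time spent sleeping, so it is approximate.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """True if a GET to url returns HTTP 200; False on connection errors or other statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url: str, deadline_s: float = 180.0, interval_s: float = 2.0,
                     probe: Callable[[str], bool] = http_ok,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll url until probe succeeds or roughly deadline_s of sleep time has elapsed."""
    waited = 0.0
    while waited <= deadline_s:
        if probe(url):
            return True
        sleep(interval_s)
        waited += interval_s
    return False


# Example (requires a running server):
# ready = wait_until_ready("http://localhost:8000/health")
```

The probe and sleep parameters are injectable so the loop can be unit-tested without a live server or real delays.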
Related deep dives
- vLLM vs SGLang vs TGI — confirm vLLM is the right choice for your workload
- Speculative Decoding — add 2-3x throughput on top of vLLM
- KV Cache Calculator — size your memory budget before deploying
- MLA Attention — DeepSeek's memory compression, deployable on vLLM
Sources
- vLLM Documentation, "Parallelism and Scaling," 2026 (docs.vllm.ai)
- Spheron, "vLLM Production Deployment 2026: Multi-GPU Tensor Parallel," 2026
- Inference.net, "vLLM Docker Deployment: Production-Ready Setup Guide," 2026
- vLLM project, "vLLM v0.6.0 performance update," 2024 (baseline for current defaults)
vLLM ships frequent releases; flags and defaults change. Always verify against docs.vllm.ai for your installed version before relying on these commands in production.