LLM Inference GPU Benchmark: H100 vs A100 vs L40S in 2026
LLM decoding is memory-bandwidth-bound, so a GPU's memory bandwidth, not its peak TFLOPs, is the number that best predicts tokens per second. The H100 has 3.35 TB/s, the A100 has 2.0 TB/s, and the L40S has 864 GB/s. But hourly price rises with bandwidth too, so the fastest card is not automatically the cheapest per token. Here is the cost-per-token breakdown across model sizes and quantization levels, with the decision rule for which GPU to rent or buy.
Why memory bandwidth is the bottleneck
During autoregressive decoding, every step that produces a token reads the full set of model weights and the KV cache from GPU memory. The arithmetic intensity (FLOPs per byte read) is very low, so the GPU's compute units spend most of their time waiting for memory. This makes LLM decoding a memory-bandwidth-bound workload: tokens per second scales roughly linearly with memory bandwidth, not with peak TFLOPs.
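As a rough illustration of that scaling, the sketch below computes a bandwidth-only ceiling on single-sequence decode speed: memory bandwidth divided by the bytes read per step. It assumes only the weights are read (KV cache reads and kernel overheads are ignored), so real numbers will be lower. The function name and the 16 GB figure for an 8B model in FP16 are illustrative assumptions, not measurements.

```python
def decode_ceiling_tokens_per_sec(bandwidth_gb_s: float, bytes_read_gb: float) -> float:
    """Upper bound on single-sequence decode speed if memory reads were the only cost."""
    return bandwidth_gb_s / bytes_read_gb

# Memory bandwidth in GB/s, using the figures quoted in this article.
BANDWIDTH_GB_S = {"H100": 3350.0, "A100": 2000.0, "L40S": 864.0}

# Assumption: an 8B-parameter model in FP16 is about 16 GB of weights,
# all of which are streamed from memory once per decode step.
WEIGHTS_GB = 16.0

for name, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = decode_ceiling_tokens_per_sec(bandwidth, WEIGHTS_GB)
    print(f"{name}: at most {ceiling:.1f} tokens/s per sequence")
```

Because the weights are read once per step regardless of how many sequences are in the batch, batching raises aggregate throughput well above this per-sequence ceiling until compute or KV cache traffic becomes the limit.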
This is why two GPUs with similar FLOPs can have very different LLM inference speeds, and why a GPU with lower peak compute but higher bandwidth can win. It is also why KV cache quantization and weight quantization (FP8, INT4) matter so much: they reduce the bytes read per token, so the same bandwidth yields more tokens per second.
For decode-heavy workloads, the metric that matters is bandwidth per dollar, not TFLOPs per dollar. Prefill-heavy workloads (long input, short output) are more compute-bound and benefit from higher TFLOPs — the mix matters.
Specs at a glance
| Spec | H100 SXM | A100 80GB SXM | L40S |
|---|---|---|---|
| Architecture | Hopper | Ampere | Ada Lovelace |
| VRAM | 80 GB HBM3 | 80 GB HBM2e | 48 GB GDDR6 |
| Memory bandwidth | 3.35 TB/s | 2.0 TB/s | 864 GB/s |
| FP16 Tensor TFLOPs | 989 | 312 | 362 |
| FP8 Tensor TFLOPs | 1979 | — (not supported) | 733 |
| NVLink | ✓ (900 GB/s) | ✓ (600 GB/s) | ✗ (PCIe only) |
| TDP (power) | 700W | 400W | 350W |
| Cloud hourly (mid-2026) | ~$2.50-3.50 | ~$1.20-1.80 | ~$0.80-1.20 |
Note the A100's lack of FP8 — it is an Ampere-generation card and predates the FP8 Tensor Cores introduced in Hopper. This is a real disadvantage in 2026, when FP8 quantization is table stakes for cost-efficient serving.
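One way to read the compute and bandwidth rows together is the roofline "ridge point": peak FLOP/s divided by memory bandwidth, which gives the arithmetic intensity (FLOPs per byte) above which a kernel stops being bandwidth-bound. The sketch below computes it from the FP16 and bandwidth figures in the table above. The names are illustrative, and real kernels reach neither peak, so treat the results as directional.

```python
# FP16 Tensor TFLOPs and memory bandwidth (TB/s), copied from the spec table.
SPECS = {
    "H100 SXM": {"fp16_tflops": 989.0, "bandwidth_tb_s": 3.35},
    "A100 80GB SXM": {"fp16_tflops": 312.0, "bandwidth_tb_s": 2.0},
    "L40S": {"fp16_tflops": 362.0, "bandwidth_tb_s": 0.864},
}

def ridge_point_flops_per_byte(fp16_tflops: float, bandwidth_tb_s: float) -> float:
    """Arithmetic intensity at which compute and memory time are equal."""
    # TFLOP/s divided by TB/s: the tera prefixes cancel, leaving FLOPs per byte.
    return fp16_tflops / bandwidth_tb_s

RIDGE = {name: ridge_point_flops_per_byte(**spec) for name, spec in SPECS.items()}

for name, ridge in RIDGE.items():
    print(f"{name}: ~{ridge:.0f} FLOPs per byte")
```

At batch size 1 in FP16, decode performs about 2 FLOPs per 2-byte weight read, or roughly 1 FLOP per byte, which is far below every ridge point here. Intensity grows roughly in proportion to batch size, so a higher ridge point means a card needs more batching before it stops waiting on memory.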
Throughput: tokens per second (Llama-3-70B, FP16)
For a 70B model in FP16, a memory-bandwidth-bound decode workload, throughput scales with memory bandwidth. The H100's 3.35 TB/s versus the A100's 2.0 TB/s translates to roughly 1.5-1.7x higher tokens/sec per GPU. The L40S, despite decent FP16 TFLOPs, is crippled by its 864 GB/s bandwidth: the bandwidth ratio alone (864 GB/s vs 2.0 TB/s, about 0.43x) caps it below half the A100's per-GPU throughput, and in practice it manages roughly one-third on this workload.
The L40S is effectively disqualified for 70B-in-FP16 work because even at tensor parallelism 2, two 48 GB cards total 96 GB, which cannot hold the ~140 GB of weights. For 70B, you either use 80 GB cards (H100/A100) or quantize to FP8/INT4 to fit the L40S.
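The fit arithmetic can be sketched as a small helper. This is a minimal sketch that counts weights only: it ignores KV cache, activations, and runtime overhead, so the GPU counts it returns are a floor, not a recommendation. The function names and the bytes-per-parameter table are illustrative assumptions.

```python
import math

# Approximate storage cost per parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (billions of params * bytes per param)."""
    return params_billion * BYTES_PER_PARAM[precision]

def min_gpus_for_weights(params_billion: float, precision: str, vram_gb: float) -> int:
    """Smallest GPU count whose combined VRAM holds the weights alone."""
    return math.ceil(weight_gb(params_billion, precision) / vram_gb)

for precision in ("fp16", "fp8", "int4"):
    print(
        f"70B {precision}: {weight_gb(70, precision):.0f} GB of weights,",
        f"80 GB cards >= {min_gpus_for_weights(70, precision, 80)},",
        f"48 GB cards >= {min_gpus_for_weights(70, precision, 48)}",
    )
```

Tensor-parallel degrees are usually powers of two, so a floor of 3 cards typically means deploying 4 in practice.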
Cost per million tokens
Throughput alone is misleading; what matters is throughput per dollar. Using the mid-2026 cloud rates and the relative throughput described above, here is the effective cost per million output tokens. This is the number that should drive GPU selection.
| Workload | H100 ($/M tok) | A100 ($/M tok) | L40S ($/M tok) |
|---|---|---|---|
| Llama-3-70B FP16 (2-GPU TP) | ~$1.40 | ~$1.30 | — (won't fit) |
| Llama-3-70B FP8 (fits on fewer GPUs) | ~$0.75 | — (no FP8) | ~$1.80 |
| Llama-3-8B FP16 | ~$0.40 | ~$0.45 | ~$0.35 |
| Llama-3-8B FP8 | ~$0.22 | — (no FP8) | ~$0.20 |
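For 7B-8B models, the L40S is the cost-per-token leader because its lower hourly rate outweighs its bandwidth disadvantage. A model this small leaves most of the VRAM free for large batches, and at large batch sizes decode shifts from bandwidth-bound toward compute-bound, where the L40S's FP16 TFLOPs per dollar are competitive. The H100 wins decisively only for 70B+, where its bandwidth and FP8 support are essential.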
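The conversion behind a cost-per-million-tokens figure is simple enough to re-run against your own provider's rates. The sketch below assumes the deployment runs flat out and that `tokens_per_sec` is the aggregate throughput of all GPUs in the deployment. The function names and the example inputs are hypothetical and are not the measurements behind the table above.

```python
def cost_per_million_tokens(hourly_rate_usd: float, num_gpus: int, tokens_per_sec: float) -> float:
    """USD per 1M output tokens for a deployment running at full load."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd * num_gpus / tokens_per_hour * 1_000_000

def implied_tokens_per_sec(hourly_rate_usd: float, num_gpus: int, usd_per_million: float) -> float:
    """Invert the formula: the aggregate throughput a quoted $/M figure implies."""
    return hourly_rate_usd * num_gpus * 1_000_000 / (usd_per_million * 3600)

# Hypothetical example: 2 GPUs at $3.00/hr each, sustaining 1,200 tokens/s in aggregate.
example_cost = cost_per_million_tokens(3.00, 2, 1200)
print(f"${example_cost:.2f} per million tokens")
```

Running the inverse function on a quoted price is a quick sanity check: if the implied throughput is far above what your own benchmark reaches, the quoted cost will not hold for your workload.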
The FP8 factor
FP8 quantization is the biggest single lever in this comparison, and it changes which GPU wins. Relative to FP16, FP8 halves the weight memory and doubles peak Tensor Core throughput, but only Hopper (H100) and Ada (L40S) support it natively. The A100, still common in legacy fleets, is locked out.
For a 70B model, FP8 means the weights drop from ~140 GB to ~70 GB, fitting on a single 80 GB H100 instead of two, though with only ~10 GB left over for KV cache. Halving the GPU count roughly halves the cost per token and is the reason most 2026 production deployments of 70B models run FP8 on H100. If you are on A100s and cannot migrate, INT4/FP8 KV cache quantization and weight-only INT4 are your fallbacks, but they are less accurate and more complex than native FP8.
Multi-GPU: NVLink vs PCIe
For models that span multiple GPUs (70B+ in FP16, or large-context batches), the inter-GPU link matters. The H100 and A100 have NVLink (900 GB/s and 600 GB/s respectively), which is essential for tensor parallelism to not become a bottleneck. The L40S has PCIe only (~64 GB/s), making multi-GPU tensor parallelism painful — a 70B model split across two L40S loses significant throughput to PCIe transfers.
This is why the L40S shines for single-GPU workloads (7B-13B models) but falls off for multi-GPU tensor-parallel serving. If your model fits on one card, the L40S's lack of NVLink is irrelevant; if it doesn't, the H100's NVLink fabric is essential.
Decision matrix by workload
| Your workload | Best GPU | Why |
|---|---|---|
| 7B-8B model, cost-minimized | L40S | Lowest hourly rate, not bandwidth-bound at this size |
| 7B-8B model, latency-minimized | H100 | Highest bandwidth, lowest single-stream latency |
| 70B model FP8 | H100 | Native FP8 + fits on one 80GB card + NVLink |
| 70B model FP16 | H100 or A100 (similar $/tok) | A100 competitive at FP16; H100 wins if you can use FP8 |
| 70B+ multi-GPU TP | H100 | NVLink essential for tensor parallelism |
| Legacy fleet, no migration budget | A100 | Still viable for FP16; use INT4 weight quant as FP8 substitute |
| Long-context (100k+ tokens) | H100 | KV cache grows huge; bandwidth + 80GB VRAM essential |
The utilization caveat
Cost-per-token numbers assume high GPU utilization. In practice, if your traffic is bursty and your GPUs sit idle 50% of the time, every effective cost-per-token figure doubles, and the cheaper-per-hour GPU (A100, L40S) often wins: when demand rather than capacity limits your token volume, a faster card simply idles more, so minimizing hourly cost matters more than maximizing peak throughput. Conversely, if you serve steady high traffic and can keep GPUs at 85%+ utilization, the H100's throughput advantage compounds.
This is also why serverless / spot pricing changes the calculus: if you can rent H100s on spot at A100 prices during off-peak, the H100 is the better buy for any workload that can tolerate preemption. Use a cost optimization strategy that routes traffic to spot capacity when available.
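The utilization effect can be sketched with two small helpers. The first scales a full-load cost figure by the fraction of time the GPU is busy; the second covers the demand-limited case, where one always-on GPU serves less traffic than it could handle. All names, rates, and demand figures here are hypothetical and chosen only to show the arithmetic.

```python
def effective_cost_per_million(cost_at_full_util_usd: float, utilization: float) -> float:
    """$/M tokens once idle time is paid for: full-load cost divided by busy fraction."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return cost_at_full_util_usd / utilization

def cost_when_demand_limited(hourly_rate_usd: float, demand_tokens_per_hour: float) -> float:
    """$/M tokens when one always-on GPU serves demand below its capacity."""
    return hourly_rate_usd / demand_tokens_per_hour * 1_000_000

# Hypothetical: a $0.40/M figure quoted at full load, served at 50% utilization.
print(f"${effective_cost_per_million(0.40, 0.5):.2f} per million tokens")

# Hypothetical: 1M tokens/hour of demand, which either card could serve alone.
for name, rate in {"H100": 3.00, "A100": 1.50}.items():
    print(f"{name}: ${cost_when_demand_limited(rate, 1_000_000):.2f} per million tokens")
```

In the demand-limited case throughput drops out of the formula entirely, which is why the hourly rate alone decides the winner there.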
GPU cloud pricing, model quantization support, and engine optimizations change fast. A decision made in Q1 may be wrong by Q3. The cost-per-token table above is a mid-2026 snapshot; re-run it against your provider's current rates before each capacity-planning cycle.
FAQ
Is the H100 worth it over the A100 for LLM inference?
For decode-heavy workloads, the H100's 3.35 TB/s bandwidth gives roughly 1.5-1.7x higher tokens/sec than the A100's 2.0 TB/s. At high utilization this wins on cost-per-token; at low utilization the A100's lower hourly rate is cheaper. The H100's FP8 support is the decisive factor for 70B models — it halves memory, letting one card do what two A100s do.
Is the L40S good for LLM inference?
For 7B-13B models, yes — the L40S is the cost-per-token leader because its low hourly rate outweighs its lower bandwidth at small model sizes. It struggles on 70B+ models (48 GB VRAM, 864 GB/s bandwidth, PCIe-only multi-GPU). Pick it for small-model cost optimization, not for large-model throughput.
How much VRAM do I need for LLM inference?
Model weights need ~2 bytes/parameter in FP16 (7B ≈ 14 GB, 70B ≈ 140 GB), plus KV cache that scales with concurrency and context length. For 7B with healthy concurrency, 24-48 GB suffices. For 70B FP16, you need 2-4×80 GB GPUs or FP8/INT4 quantization. Use our KV Cache Calculator to size precisely.
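For a first estimate of the KV cache term, the sketch below uses the standard per-token formula: one key and one value vector per layer per KV head. The Llama-3-70B shape used here (80 layers, 8 KV heads with grouped-query attention, head dimension 128) is an assumption taken from the published model configuration, and the function names are illustrative. Serving engines add their own overhead on top.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache per token: a key and a value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def kv_cache_gb(total_tokens: int, bytes_per_token: int) -> float:
    """KV cache in GB, where total_tokens = concurrent sequences * tokens per sequence."""
    return total_tokens * bytes_per_token / 1e9

# Assumed Llama-3-70B shape: 80 layers, 8 KV heads (GQA), head_dim 128.
PER_TOKEN_FP16 = kv_cache_bytes_per_token(80, 8, 128, bytes_per_value=2)
PER_TOKEN_FP8 = kv_cache_bytes_per_token(80, 8, 128, bytes_per_value=1)

print(f"FP16 KV cache: {PER_TOKEN_FP16} bytes per token")
print(f"One 100k-token context, FP16: {kv_cache_gb(100_000, PER_TOKEN_FP16):.1f} GB")
print(f"One 100k-token context, FP8:  {kv_cache_gb(100_000, PER_TOKEN_FP8):.1f} GB")
```

Multiply the per-token figure by concurrent sequences times context length, then add the weight footprint, to get a first estimate of total VRAM.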
Does FP8 change the GPU choice?
Yes. FP8 (supported natively on the H100 and L40S, not the A100) halves weight memory and doubles peak Tensor Core throughput relative to FP16. It extends the L40S's useful range to 13B FP8 models and makes the H100 dramatically more efficient for 70B. The A100's lack of native FP8 is a genuine disadvantage in 2026.
Related deep dives
- vLLM vs SGLang vs TGI — which inference engine extracts the most from each GPU
- TensorRT-LLM: When to Use It — the H100's compiled-kernel advantage
- KV Cache Quantization — FP8/INT4 cache techniques that extend GPU life
- KV Cache Calculator — size your VRAM needs before choosing a GPU
- LLM Cost Optimization Playbook — the full cost-reduction stack beyond GPU choice
Sources
- NVIDIA, "H100 Tensor Core GPU Architecture Whitepaper," 2024 (3.35 TB/s HBM3, 989 FP16 TFLOPs, FP8 specs)
- NVIDIA, "A100 Tensor Core GPU Architecture Whitepaper," 2020 (2.0 TB/s HBM2e, 312 FP16 TFLOPs)
- NVIDIA, "L40S GPU Data Sheet," 2024 (48 GB GDDR6, 864 GB/s, FP8 support)
- Phala Network / Spheron 2026 cloud GPU pricing indices
- RunPod, "H100 vs A100 vs L40S for LLM Inference — Cost Per Token," 2025-2026
- vLLM & TensorRT-LLM throughput benchmarks, 2026 community reports
Cloud pricing fluctuates and varies by provider, region, and commitment level. Throughput numbers are representative based on published benchmarks and are directional — always benchmark on your provider's actual instances.