LLM Inference GPU Benchmark: H100 vs A100 vs L40 in 2026

Q: Is the H100 worth it over the A100 for LLM inference?

For memory-bandwidth-bound LLM inference (most decode-heavy workloads), the H100's 3.35 TB/s bandwidth versus the A100's 2.0 TB/s delivers roughly 1.5-1.7x higher tokens/sec. Whether that justifies the ~1.6x higher hourly cost depends on your utilization: at high utilization the H100 wins on cost-per-token; at low utilization the A100 is cheaper because you pay for idle GPU either way.

Q: Is the L40 good for LLM inference?

The L40S (the L40 variant used in the tables below) is a strong budget choice for 7B-13B models. It has 48 GB VRAM (enough for a 7B model plus KV cache), FP8 support, and costs roughly one-third the hourly rate of an H100. Its weakness is memory bandwidth (864 GB/s, far below the H100's 3.35 TB/s), so it underperforms on 70B+ models and long-context decode.

Q: How much VRAM do I need for LLM inference?

A rough rule: model weights in FP16 need 2 bytes per parameter (a 7B model needs ~14 GB, a 70B model needs ~140 GB), plus KV cache which scales with concurrent sequences and context length. For a 7B model with healthy concurrency, 24-48 GB is sufficient. For 70B in FP16, you need 2-4 GPUs (80GB each) or aggressive quantization. Use a KV cache calculator to size precisely.

Q: Does FP8 change the GPU choice?

Yes. FP8 (supported on H100 and L40S but not the original A100) halves the memory footprint and doubles the effective throughput of Tensor Cores. This extends the useful life of 48 GB cards (L40S can serve a 13B FP8 model comfortably) and makes the H100 dramatically more efficient for 70B workloads. The A100's lack of native FP8 is a real disadvantage in 2026.

LLM decode is memory-bandwidth-bound, so the GPU's memory bandwidth, not its peak TFLOPs, is the number that best predicts tokens per second. The H100 has 3.35 TB/s, the A100 has 2.0 TB/s, and the L40S has 864 GB/s. Hourly cost rises with bandwidth too, so the comparison that matters is cost per token. Below is the cost-per-token breakdown across model sizes and quantization levels, with a decision rule for which GPU to rent or buy.

By the LLM Academy team · Reviewed July 2026 · Pricing as of mid-2026 cloud rates; specs from NVIDIA data sheets

Why memory bandwidth is the bottleneck

During autoregressive decoding, every decode step reads the entire model's weights and the active KV cache from GPU memory, then does so again for the next token. The arithmetic intensity (FLOPs per byte read) is very low, so the GPU's compute units spend most of their time waiting for memory. This makes LLM decode a memory-bandwidth-bound workload: tokens per second scales roughly linearly with memory bandwidth, not with peak TFLOPs.

This is why two GPUs with similar FLOPs can have very different LLM inference speeds, and why a GPU with lower peak compute but higher bandwidth can win. It is also why KV cache quantization and weight quantization (FP8, INT4) matter so much — they reduce the bytes that need to be read per token, directly translating bandwidth into throughput.
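As a rough illustration of the argument above, the sketch below estimates an upper bound on decode steps per second by dividing memory bandwidth by the bytes streamed per step. The function name and the 16 GB / 8 GB weight sizes (an 8B-parameter model at 2 and 1 bytes per parameter) are illustrative assumptions, and the bound ignores compute time, kernel overheads, and the gap between peak and achievable bandwidth.

```python
# Peak memory bandwidth in GB/s, as quoted in this article.
BANDWIDTH_GB_S = {"H100": 3350, "A100": 2000, "L40S": 864}


def decode_steps_per_sec(bandwidth_gb_s: float, weight_gb: float,
                         kv_cache_gb: float = 0.0) -> float:
    """Upper bound on decode steps/sec if each step streams weights + KV cache once."""
    return bandwidth_gb_s / (weight_gb + kv_cache_gb)


if __name__ == "__main__":
    for gpu, bandwidth in BANDWIDTH_GB_S.items():
        fp16 = decode_steps_per_sec(bandwidth, weight_gb=16.0)  # assumed 8B model, FP16
        fp8 = decode_steps_per_sec(bandwidth, weight_gb=8.0)    # same model, FP8 weights
        print(f"{gpu}: <= {fp16:.0f} steps/s at FP16, <= {fp8:.0f} steps/s at FP8")
```

Each step emits one token per sequence in the batch, so batching multiplies tokens per second until compute or KV cache reads become the limit.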

The key ratio

For decode-heavy workloads, the metric that matters is bandwidth per dollar, not TFLOPs per dollar. Prefill-heavy workloads (long input, short output) are more compute-bound and benefit from higher TFLOPs — the mix matters.

Specs at a glance

Spec	H100 SXM	A100 80GB SXM	L40S
Architecture	Hopper	Ampere	Ada Lovelace
VRAM	80 GB HBM3	80 GB HBM2e	48 GB GDDR6
Memory bandwidth	3.35 TB/s	2.0 TB/s	864 GB/s
FP16 Tensor TFLOPs (dense)	989	312	362
FP8 Tensor TFLOPs (dense)	1979	— (not supported)	733
NVLink	✓ (900 GB/s)	✓ (600 GB/s)	✗ (PCIe only)
TDP (power)	700W	400W	350W
Cloud hourly rate per GPU, USD (mid-2026)	~$2.50-3.50	~$1.20-1.80	~$0.80-1.20

Note the A100's lack of FP8 — it is an Ampere-generation card and predates the FP8 Tensor Cores introduced in Hopper. This is a real disadvantage in 2026, when FP8 quantization is table stakes for cost-efficient serving.
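The bandwidth-per-dollar and TFLOPs-per-dollar ratios discussed earlier can be computed directly from the specs table. The sketch below is a minimal example; using the midpoint of each quoted hourly range is an assumption made here for illustration, and the helper name `per_hourly_dollar` is hypothetical.

```python
# Values copied from the specs table; hourly_usd is the quoted (low, high) range.
SPECS = {
    "H100": {"bandwidth_gb_s": 3350, "fp16_tflops": 989, "hourly_usd": (2.50, 3.50)},
    "A100": {"bandwidth_gb_s": 2000, "fp16_tflops": 312, "hourly_usd": (1.20, 1.80)},
    "L40S": {"bandwidth_gb_s": 864, "fp16_tflops": 362, "hourly_usd": (0.80, 1.20)},
}


def per_hourly_dollar(gpu: str, metric: str) -> float:
    """Return the chosen metric divided by the midpoint hourly rate."""
    low, high = SPECS[gpu]["hourly_usd"]
    midpoint = (low + high) / 2  # assumption: midpoint of the quoted range
    return SPECS[gpu][metric] / midpoint


if __name__ == "__main__":
    for gpu in SPECS:
        bw = per_hourly_dollar(gpu, "bandwidth_gb_s")
        tf = per_hourly_dollar(gpu, "fp16_tflops")
        print(f"{gpu}: {bw:7.1f} GB/s per $/hr, {tf:6.1f} FP16 TFLOPs per $/hr")
```

On these midpoints the two ratios rank the cards differently, which is why the workload mix (decode-heavy versus prefill-heavy) has to be settled first. Neither ratio captures VRAM capacity or FP8 support.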

Throughput: tokens per second (Llama-3-70B, FP16)

For a 70B model in FP16, a memory-bandwidth-bound decode workload, throughput scales with memory bandwidth. The H100's 3.35 TB/s versus the A100's 2.0 TB/s translates to roughly 1.5-1.7x higher tokens/sec per GPU. The L40S, despite decent FP16 TFLOPs, is held back by its 864 GB/s bandwidth: it manages roughly one-third of the A100's throughput on this workload.

# Llama-3-70B FP16, per-GPU batched decode (representative)
# tokens/sec, excluding prefill
H100: ~3,800 tok/s  (3.35 TB/s bandwidth)
A100: ~2,400 tok/s  (2.0 TB/s bandwidth)
L40S: ~  850 tok/s  (864 GB/s, bandwidth-starved at 70B)
# 70B FP16 needs ~140 GB of weights → requires 2×80 GB GPUs (TP=2)
# L40S 48 GB cannot fit 70B FP16 without aggressive quantization

The L40S is effectively disqualified for 70B-in-FP16 work: two 48 GB cards provide only 96 GB, well short of the ~140 GB of weights, so tensor parallelism 2 is not enough. For 70B, you either use 80 GB cards (H100/A100) or quantize to fit the L40S: INT4 weights (~35 GB) fit on one card, while FP8 weights (~70 GB) still need two.
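The fit check above is simple arithmetic, sketched below under stated assumptions: weights take 2, 1, and 0.5 bytes per parameter for FP16, FP8, and INT4, and 10% of each card's VRAM is held back for KV cache and activations. The `reserve_fraction` value and the function names are illustrative, not taken from any serving engine.

```python
import math

# Approximate bytes per parameter for each weight precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB (billions of parameters x bytes per parameter)."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus_for_weights(params_billion: float, precision: str, vram_gb: float,
                         reserve_fraction: float = 0.1) -> int:
    """Smallest GPU count whose usable VRAM holds the weights.

    reserve_fraction is an assumed share of VRAM kept free for KV cache
    and activations; real deployments usually need more than this.
    """
    usable_gb = vram_gb * (1.0 - reserve_fraction)
    return math.ceil(weight_gb(params_billion, precision) / usable_gb)


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        on_80 = min_gpus_for_weights(70, precision, vram_gb=80)
        on_48 = min_gpus_for_weights(70, precision, vram_gb=48)
        print(f"70B {precision}: {on_80} x 80 GB card(s) or {on_48} x 48 GB card(s)")
```

This counts weights only, so it is a lower bound on GPU count; KV cache for the target concurrency and context length comes on top.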

Cost per million tokens

Throughput alone is misleading — what matters is throughput per dollar. Using the mid-2026 cloud rates and the throughput numbers above, here is the effective cost per million output tokens. This is the number that should drive GPU selection.

Workload	H100 ($/M tok)	A100 ($/M tok)	L40S ($/M tok)
Llama-3-70B FP16 (2-GPU TP)	~$1.40	~$1.30	— (won't fit)
Llama-3-70B FP8 (fits on fewer GPUs)	~$0.75	— (no FP8)	~$1.80
Llama-3-8B FP16	~$0.40	~$0.45	~$0.35
Llama-3-8B FP8	~$0.22	— (no FP8)	~$0.20
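Cost per million tokens is hourly cost divided by tokens produced per hour. The sketch below shows that relationship; the function and its `utilization` parameter are hypothetical helpers, not the method used to build the table. Feeding it the peak throughput figures quoted earlier gives lower numbers than the table, so the table presumably reflects sustained throughput below those peaks; this article does not document the exact derivation.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float,
                            num_gpus: int = 1, utilization: float = 1.0) -> float:
    """USD per million output tokens for a deployment.

    tokens_per_sec is the aggregate throughput of the whole deployment.
    utilization (0..1] is an assumed fraction of each hour spent serving.
    """
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return hourly_usd * num_gpus / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Assumed inputs: $3.00/hr (inside the H100 range) and the ~3,800 tok/s peak figure.
    full = cost_per_million_tokens(3.00, 3800)
    half = cost_per_million_tokens(3.00, 3800, utilization=0.5)
    print(f"fully utilized: ${full:.3f}/M tok, half idle: ${half:.3f}/M tok")
```

Re-running this with your provider's current hourly rate and your own measured throughput is the quickest way to refresh the table for your workload.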

The surprising winner for small models

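For 7B-8B models, the L40S is the cost-per-token leader because its lower hourly rate outweighs its bandwidth disadvantage. Weights this small leave most of the VRAM free for KV cache, so large batches amortize each weight read across many sequences and decode is far less bandwidth-limited than at 70B. The H100 wins decisively only at 70B+, where its bandwidth and FP8 support are essential.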

The FP8 factor

FP8 quantization is the biggest single lever in this comparison, and it changes which GPU wins. FP8 halves the weight memory and doubles the effective Tensor Core throughput, but only Hopper (H100) and Ada (L40S) support it natively. The A100 — still common in legacy fleets — is locked out.

For a 70B model, FP8 drops the weights from ~140 GB to ~70 GB, so they fit on a single 80 GB H100 instead of two (with roughly 10 GB left for KV cache, which constrains concurrency and context length). That alone roughly halves the cost-per-token and is the reason most 2026 production deployments of 70B models run FP8 on H100. If you are on A100s and cannot migrate, INT4/FP8 KV cache quantization and weight-only INT4 are your fallbacks, but they are less accurate and more complex than native FP8.

Multi-GPU: NVLink vs PCIe

For models that span multiple GPUs (70B+ in FP16, or large-context batches), the inter-GPU link matters. The H100 and A100 have NVLink (900 GB/s and 600 GB/s respectively), which keeps tensor-parallel communication from becoming the bottleneck. The L40S has PCIe only (~64 GB/s bidirectional on Gen4 x16), making multi-GPU tensor parallelism painful: a 70B model split across two L40S loses significant throughput to PCIe transfers.
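To get a feel for the size of that gap, the sketch below estimates the time spent on tensor-parallel all-reduces per decode step. It rests on several assumptions that are not from this article: a Megatron-style split with two all-reduces per transformer layer, Llama-3-70B's shape (80 layers, hidden size 8192), FP16 activations, ring all-reduce traffic of 2(N-1)/N times the payload per GPU, and link bandwidth taken at the quoted headline figures. It ignores per-message latency and compute/communication overlap, so treat the output as an order-of-magnitude comparison only.

```python
def allreduce_ms_per_step(batch_size: int, link_gb_s: float, num_gpus: int = 2,
                          layers: int = 80, hidden: int = 8192,
                          bytes_per_value: int = 2,
                          allreduces_per_layer: int = 2) -> float:
    """Estimated all-reduce time per decode step in milliseconds (bandwidth term only)."""
    payload_bytes = batch_size * hidden * bytes_per_value  # one all-reduce
    # Ring all-reduce: each GPU sends 2 * (N - 1) / N times the payload.
    traffic_per_gpu = payload_bytes * 2 * (num_gpus - 1) / num_gpus
    total_bytes = traffic_per_gpu * layers * allreduces_per_layer
    return total_bytes / (link_gb_s * 1e9) * 1e3


if __name__ == "__main__":
    for name, link in {"NVLink (H100)": 900, "NVLink (A100)": 600, "PCIe (L40S)": 64}.items():
        ms = allreduce_ms_per_step(batch_size=64, link_gb_s=link)
        print(f"{name}: ~{ms:.2f} ms of all-reduce traffic per decode step")
```

Because decode messages are small, real PCIe performance is often worse than this bandwidth-only estimate suggests; measure on your own hardware before committing to a multi-GPU L40S layout.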

This is why the L40S shines for single-GPU workloads (7B-13B models) but falls off for multi-GPU tensor-parallel serving. If your model fits on one card, the L40S's lack of NVLink is irrelevant; if it doesn't, the H100's NVLink fabric is essential.

Decision matrix by workload

Your workload	Best GPU	Why
7B-8B model, cost-minimized	L40S	Lowest hourly rate, not bandwidth-bound at this size
7B-8B model, latency-minimized	H100	Highest bandwidth, lowest single-stream latency
70B model FP8	H100	Native FP8 + fits on one 80GB card + NVLink
70B model FP16	H100 or A100 (similar $/tok)	A100 competitive at FP16; H100 wins if you can use FP8
70B+ multi-GPU TP	H100	NVLink essential for tensor parallelism
Legacy fleet, no migration budget	A100	Still viable for FP16; use INT4 weight quant as FP8 substitute
Long-context (100k+ tokens)	H100	KV cache grows huge; bandwidth + 80GB VRAM essential
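The long-context row follows from how the KV cache grows with the number of tokens held in memory. The sketch below computes that size; the default shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128) is an assumption based on Llama-3-70B's published configuration, and the function name is illustrative.

```python
def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size in GB for a given number of cached tokens.

    The leading factor of 2 covers the separate key and value tensors.
    """
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 1e9


if __name__ == "__main__":
    fp16 = kv_cache_gb(100_000)                     # FP16 cache entries
    fp8 = kv_cache_gb(100_000, bytes_per_value=1)   # FP8-quantized cache
    print(f"100k tokens: {fp16:.1f} GB at FP16, {fp8:.1f} GB with an FP8 KV cache")
```

The token count is the total across all concurrent sequences, so ten 10k-token conversations cost the same as one 100k-token request.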

The utilization caveat

Cost-per-token numbers assume high GPU utilization. In practice, if your traffic is bursty and your GPUs sit idle 50% of the time, the cheaper-per-hour GPU (A100, L40S) wins: you pay for the whole GPU-hour whether or not it is busy, so throughput you cannot fill is wasted money and minimizing hourly cost matters more than maximizing peak throughput. Conversely, if you serve steady high traffic and can keep GPUs at 85%+ utilization, the H100's throughput advantage compounds.
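A small model of this effect: fix the demand in tokens per second, provision the minimum number of GPUs that covers it, and divide the bill by the tokens actually served. The hourly rates used below ($2.50 for the H100 and $1.60 for the A100) are assumptions picked from inside the quoted ranges to match the ~1.6x price ratio mentioned in this article, and the throughputs are the representative per-GPU figures quoted earlier; the function name is hypothetical.

```python
import math


def cost_per_million_at_demand(demand_tok_s: float, gpu_tok_s: float,
                               hourly_usd: float) -> tuple[int, float, float]:
    """Return (gpus needed, utilization, USD per million tokens) at a fixed demand."""
    gpus = max(1, math.ceil(demand_tok_s / gpu_tok_s))
    utilization = demand_tok_s / (gpus * gpu_tok_s)
    usd_per_million = gpus * hourly_usd / (demand_tok_s * 3600) * 1_000_000
    return gpus, utilization, usd_per_million


if __name__ == "__main__":
    for demand in (1000, 3600):
        for name, tok_s, rate in (("H100", 3800, 2.50), ("A100", 2400, 1.60)):
            gpus, util, usd = cost_per_million_at_demand(demand, tok_s, rate)
            print(f"{demand} tok/s on {name}: {gpus} GPU(s), "
                  f"{util:.0%} utilized, ${usd:.3f}/M tok")
```

Under these assumed rates the A100 is cheaper at 1,000 tok/s, where both cards are mostly idle, and the H100 is cheaper at 3,600 tok/s, where one H100 replaces two A100s. The crossover moves with your actual rates and measured throughput.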

This is also why serverless / spot pricing changes the calculus: if you can rent H100s on spot at A100 prices during off-peak, the H100 is the better choice at any utilization, since it is at least as fast for the same money. Use a cost optimization strategy that routes traffic to spot capacity when available.

Re-benchmark every quarter

GPU cloud pricing, model quantization support, and engine optimizations change fast. A decision made in Q1 may be wrong by Q3. The cost-per-token table above is a mid-2026 snapshot; re-run it against your provider's current rates before each capacity-planning cycle.

FAQ

Is the H100 worth it over the A100 for LLM inference?

For decode-heavy workloads, the H100's 3.35 TB/s bandwidth gives roughly 1.5-1.7x higher tokens/sec than the A100's 2.0 TB/s. At high utilization this wins on cost-per-token; at low utilization the A100's lower hourly rate is cheaper. The H100's FP8 support is the decisive factor for 70B models — it halves memory, letting one card do what two A100s do.

Is the L40 good for LLM inference?

For 7B-13B models, yes — the L40S is the cost-per-token leader because its low hourly rate outweighs its lower bandwidth at small model sizes. It struggles on 70B+ models (48 GB VRAM, 864 GB/s bandwidth, PCIe-only multi-GPU). Pick it for small-model cost optimization, not for large-model throughput.

How much VRAM do I need for LLM inference?

Model weights need ~2 bytes/parameter in FP16 (7B ≈ 14 GB, 70B ≈ 140 GB), plus KV cache that scales with concurrency and context length. For 7B with healthy concurrency, 24-48 GB suffices. For 70B FP16, you need 2-4×80 GB GPUs or FP8/INT4 quantization. Use our KV Cache Calculator to size precisely.

Does FP8 change the GPU choice?

Yes. FP8 (H100, L40S — not A100) halves weight memory and doubles Tensor Core throughput. It extends the L40S's useful range to 13B FP8 models and makes the H100 dramatically more efficient for 70B. The A100's lack of native FP8 is a genuine disadvantage in 2026.

Related deep dives

vLLM vs SGLang vs TGI — which inference engine extracts the most from each GPU
TensorRT-LLM: When to Use It — the H100's compiled-kernel advantage
KV Cache Quantization — FP8/INT4 cache techniques that extend GPU life
KV Cache Calculator — size your VRAM needs before choosing a GPU
LLM Cost Optimization Playbook — the full cost-reduction stack beyond GPU choice

Sources

NVIDIA, "H100 Tensor Core GPU Architecture Whitepaper," 2024 (3.35 TB/s HBM3, 989 FP16 TFLOPs, FP8 specs)
NVIDIA, "A100 Tensor Core GPU Architecture Whitepaper," 2020 (2.0 TB/s HBM2e, 312 FP16 TFLOPs)
NVIDIA, "L40S GPU Data Sheet," 2024 (48 GB GDDR6, 864 GB/s, FP8 support)
Phala Network / Spheron 2026 cloud GPU pricing indices
RunPod, "H100 vs A100 vs L40S for LLM Inference — Cost Per Token," 2025-2026
vLLM & TensorRT-LLM throughput benchmarks, 2026 community reports

Cloud pricing fluctuates and varies by provider, region, and commitment level. Throughput numbers are representative based on published benchmarks and are directional — always benchmark on your provider's actual instances.