Question 1

How big is a 70B model in INT4?

Accepted Answer

A 70-billion-parameter model in INT4 quantization takes about 35 GB (70B × 0.5 bytes per parameter). In FP16 the same model is about 140 GB, so INT4 cuts the size by 4×. A 70B INT4 model fits comfortably on a single 80 GB A100 or H100, and can squeeze onto a 48 GB card, which leaves roughly 13 GB for KV cache and other runtime overhead.

Question 2

How is LLM model size calculated?

Accepted Answer

Model size in bytes equals parameter_count × bytes_per_parameter. FP32 uses 4 bytes per parameter, FP16/BF16 uses 2, INT8 uses 1, and INT4 uses 0.5. So a 7B model is 28 GB in FP32, 14 GB in FP16, 7 GB in INT8, and about 3.5 GB in INT4.
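
The formula above can be sketched in a few lines of Python. The names `BYTES_PER_PARAM` and `model_size_gb` are illustrative, not taken from any library. The sketch assumes decimal gigabytes (1 GB = 10^9 bytes) and counts weights only; real quantized checkpoints usually store some extra metadata such as scales, which is ignored here.

```python
# Bytes per parameter for the precisions named above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def model_size_gb(param_count: float, precision: str) -> float:
    """Weights-only size in decimal GB (assumption: 1 GB = 1e9 bytes)."""
    return param_count * BYTES_PER_PARAM[precision.lower()] / 1e9


if __name__ == "__main__":
    # Reproduce the 7B figures from the answer above.
    for precision in ("fp32", "fp16", "int8", "int4"):
        print(f"7B in {precision}: {model_size_gb(7e9, precision):.1f} GB")
```

Calling `model_size_gb(70e9, "int4")` gives the 35 GB figure quoted for a 70B model.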

Question 3

What's the difference between FP16 and INT4 quantization?

Accepted Answer

FP16 stores each weight as a 16-bit floating-point value (2 bytes): no quantization loss, but the larger size. INT4 packs each weight into 4 bits (0.5 bytes): 4× smaller, with a small quality loss that is usually acceptable for inference. INT4 is the go-to for running large models on consumer GPUs, while FP16 is preferred for training and when quality is critical.
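
To make "4 bits per weight" concrete, here is a minimal sketch of storing two 4-bit values in one byte. It shows only the storage idea. It is not the layout of any particular quantization library, and it omits the scale factors that real schemes use to map integers back to floating-point weights. The function names `pack_int4` and `unpack_int4` are hypothetical.

```python
def pack_int4(values):
    """Pack unsigned 4-bit integers (0..15) two per byte, low nibble first."""
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("each value must fit in 4 bits (0..15)")
    padded = list(values) + [0] * (len(values) % 2)  # pad to an even count
    return bytes(lo | (hi << 4) for lo, hi in zip(padded[0::2], padded[1::2]))


def unpack_int4(packed, count):
    """Recover `count` 4-bit values from bytes produced by pack_int4."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble
        out.append(byte >> 4)    # high nibble
    return out[:count]


if __name__ == "__main__":
    weights = [1, 2, 3, 4]
    packed = pack_int4(weights)
    print(len(weights), "values stored in", len(packed), "bytes")
```

Four values occupy two bytes, which is where the 0.5 bytes per parameter comes from.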

Question 4

Can I run Llama 3 70B on a single GPU?

Accepted Answer

In FP16, no: Llama 3 70B needs ~140 GB for the weights alone, which requires at least two 80 GB GPUs. In INT4 quantization (~35 GB) it fits on a single 80 GB A100/H100 with room for KV cache, or on a 48 GB RTX 6000 Ada with a tight context budget. On consumer cards like the RTX 4090 (24 GB), even INT4 does not fit, because the ~35 GB of weights exceed the card's memory; you'd need a 32B-class model (~16 GB in INT4) or smaller.
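
The fit check described above can be sketched as a simple comparison. The function name `fits_on_gpu` and the `overhead_gb` parameter are assumptions for illustration: the overhead stands in for KV cache, activations, and framework memory, whose real size depends on context length and runtime, so treat the result as a rough estimate only.

```python
def fits_on_gpu(param_count, bytes_per_param, vram_gb, overhead_gb=0.0):
    """True if weights plus an assumed overhead fit in VRAM (decimal GB)."""
    weights_gb = param_count * bytes_per_param / 1e9
    return weights_gb + overhead_gb <= vram_gb


if __name__ == "__main__":
    # 70B parameters in INT4 (0.5 bytes each), with a hypothetical 5 GB overhead.
    for name, vram_gb in (("80 GB A100/H100", 80), ("48 GB card", 48), ("24 GB RTX 4090", 24)):
        verdict = "fits" if fits_on_gpu(70e9, 0.5, vram_gb, overhead_gb=5) else "does not fit"
        print(f"{name}: {verdict}")
```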

LLM Model Size Calculator
