How to Run LLM Locally — Complete Guide

Last updated: June 23, 2026 · 15 min read

Running LLMs locally gives you complete privacy, zero API costs, and full control over your AI. With tools like Ollama and llama.cpp, you can run powerful models on your own hardware in minutes.

Why Run LLMs Locally?

Running LLMs on your own hardware offers several advantages over cloud APIs:

Privacy

Your data never leaves your machine. This is critical for sensitive applications like legal documents, medical records, personal journals, or proprietary code. No third party ever sees your prompts or responses.

No API Costs

Cloud API costs can add up quickly. A heavy user might spend $50-200/month on API calls. Running locally costs only electricity — typically a few dollars per month. The model itself is free (open-weight models).
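As a rough check on the "few dollars per month" figure, the arithmetic is power draw times hours of use times your electricity price. The sketch below uses hypothetical inputs (a 300 W system under load, 2 hours a day, $0.15 per kWh); substitute your own numbers.

```python
def monthly_electricity_cost(watts: float, hours_per_day: float,
                             price_per_kwh: float, days: int = 30) -> float:
    """Return the electricity cost for one month of use."""
    kwh = watts / 1000 * hours_per_day * days  # energy used over the month
    return kwh * price_per_kwh


# Hypothetical inputs, not measurements: adjust for your hardware and tariff.
cost = monthly_electricity_cost(watts=300, hours_per_day=2, price_per_kwh=0.15)
print(f"${cost:.2f} per month")
```

With these assumed inputs the machine uses 18 kWh a month, which lands in the range the paragraph describes.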

No Internet Required

Once downloaded, models work completely offline. This is useful for travel, unreliable internet, or air-gapped environments.

Customization

You can fine-tune models, modify system prompts, adjust parameters, and experiment freely without API restrictions or rate limits.

No Censorship

Open-weight models give you full control. You can run uncensored variants for research, creative writing, or other use cases where cloud providers might filter outputs.

Hardware Requirements

The hardware you need depends on the model size and quantization level.

Minimum (CPU Only)

RAM: 8GB for 3B models, 16GB for 7B models
CPU: Any modern x86_64 or ARM processor
Storage: 4-8GB per model
Speed: 5-15 tokens/second for 7B Q4 models

Recommended (GPU)

VRAM: 8GB for 7B models, 12GB for 13B models, 24GB for models up to about 33B (all at Q4)
GPU: NVIDIA RTX 3060 12GB or better (RTX 4090 ideal)
RAM: 32GB+ for comfortable operation
Speed: 30-80 tokens/second for 7B Q4 models

Apple Silicon

Apple Silicon Macs (M1, M2, M3, M4) are excellent for local LLMs because of their unified memory architecture. The GPU shares system RAM with the CPU (macOS keeps a portion reserved for the system), so a MacBook with 32GB of unified memory can run 13B and larger quantized models smoothly.

M1/M2 with 8GB: 3B-7B quantized models
M1/M2/M3 with 16GB: 7B-13B quantized models
M2/M3/M4 Pro/Max with 32-128GB: 13B-70B quantized models

Hardware	7B Q4 Speed	13B Q4 Speed	Max Model
CPU only (16GB RAM)	~10 tok/s	~5 tok/s	13B Q4
RTX 3060 12GB	~40 tok/s	~20 tok/s	13B Q4
RTX 4090 24GB	~80 tok/s	~50 tok/s	33B Q4
M3 Max 96GB	~60 tok/s	~40 tok/s	70B Q4

Ollama: The Easiest Option

Ollama is the simplest way to run LLMs locally. It handles downloading pre-quantized models, detecting your GPU, and serving a local API. If you're new to local LLMs, start here.

Installation

# macOS (with Homebrew)
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

Running Your First Model

# Download and run Llama 3 8B (automatic GPU detection)
ollama run llama3

# That's it! You're now chatting with a local LLM.

Popular Models

# General purpose
ollama run llama3          # Meta's Llama 3 8B
ollama run mistral         # Mistral 7B
ollama run gemma2          # Google's Gemma 2

# Code generation
ollama run codellama       # Code Llama 7B
ollama run deepseek-coder  # DeepSeek Coder

# Creative writing
ollama run mixtral          # Mixtral 8x7B (MoE)
ollama run dolphin-mixtral  # Uncensored Mixtral

Useful Commands

# List installed models
ollama list

# Show model details
ollama show llama3

# Remove a model
ollama rm llama3

# Pull without running (tags select size and quantization; check the Ollama library for exact tag names)
ollama pull llama3:70b

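# Run with specific parameters
# (ollama run has no --temperature or --num-ctx flags; set them inside the session)
ollama run llama3
>>> /set parameter temperature 0.7
>>> /set parameter num_ctx 4096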

Ollama API

Ollama runs a local API server compatible with the OpenAI format:

# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3","messages":[{"role":"user","content":"Hello!"}]}'

This means you can use Ollama with any tool that supports OpenAI's API format — just change the base URL to http://localhost:11434/v1.
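The same request can be made from Python with only the standard library. This is a minimal sketch, assuming an Ollama server is running on the default port and the llama3 model has been pulled; the function names build_chat_request and chat are hypothetical helpers, not part of any Ollama package.

```python
import json
import urllib.request

# Base URL taken from the text above.
BASE_URL = "http://localhost:11434/v1"


def build_chat_request(model: str, user_message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, user_message: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Usage (needs a running server, so it is left commented out):
# print(chat("llama3", "Hello!"))
```

Splitting request construction from sending keeps the payload easy to inspect before anything touches the network.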

llama.cpp: Maximum Control

llama.cpp is the foundational C++ library for running LLMs locally. It's what powers Ollama under the hood. Use llama.cpp directly when you need maximum control over inference parameters.

Installation

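# Clone and build (CPU only; current llama.cpp builds with CMake rather than make)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Or build with GPU support
cmake -B build -DGGML_CUDA=ON  # For NVIDIA GPUs
cmake --build build --config Release

# Binaries such as llama-cli and llama-server are written to build/bin/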

Basic Usage

# Interactive chat
./llama-cli -m models/llama-3-8b-q4_k_m.gguf \
  -c 4096 \
  --temp 0.7 \
  -i

# Single prompt
./llama-cli -m models/llama-3-8b-q4_k_m.gguf \
  -p "Explain quantum computing in simple terms:" \
  -n 500

# Start an OpenAI-compatible server
# (--host 0.0.0.0 listens on all network interfaces; use 127.0.0.1 for local-only access)
./llama-server -m models/llama-3-8b-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080

Key Parameters

-m: Path to GGUF model file
-c: Context length in tokens (the default depends on the llama.cpp version, so set it explicitly)
--temp: Temperature (0.0 = deterministic, 1.0 = creative)
-ngl: Number of layers to offload to GPU (0 = CPU only)
-t: Number of CPU threads
-n: Maximum tokens to generate
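When scripting llama.cpp, it can help to assemble these flags in one place. The sketch below builds an argument list from the parameters listed above; llama_cli_args is a hypothetical helper, and the binary path and default values are assumptions to adjust for your build.

```python
import shlex


def llama_cli_args(model_path, binary="./llama-cli", context=4096,
                   temperature=0.7, gpu_layers=0, threads=None,
                   max_tokens=None, prompt=None):
    """Return an argument list for llama-cli using the flags described above."""
    args = [
        binary,
        "-m", model_path,          # path to the GGUF model file
        "-c", str(context),        # context length
        "--temp", str(temperature),
        "-ngl", str(gpu_layers),   # 0 keeps everything on the CPU
    ]
    if threads is not None:
        args += ["-t", str(threads)]
    if max_tokens is not None:
        args += ["-n", str(max_tokens)]
    if prompt is not None:
        args += ["-p", prompt]
    return args


args = llama_cli_args(
    "models/llama-3-8b-q4_k_m.gguf",
    gpu_layers=99,
    max_tokens=500,
    prompt="Explain quantum computing in simple terms:",
)
# shlex.join quotes the prompt so the line can be pasted into a shell.
print(shlex.join(args))
```

The resulting list can be passed directly to subprocess.run(args) without going through a shell.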

GUI Applications

If you prefer a graphical interface over the command line, several excellent options exist:

LM Studio

A polished desktop application for running local LLMs. Features include model discovery, chat interface, local server, and parameter adjustment. Available for Mac, Windows, and Linux.

Open WebUI

A self-hosted web interface that connects to Ollama or OpenAI-compatible APIs. Features a ChatGPT-like interface, conversation history, and multi-model support. Runs in Docker:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

GPT4All

A lightweight desktop application focused on privacy. It runs models entirely locally with no data collection. Good for beginners who want a simple chat interface.

Jan

An open-source alternative to ChatGPT that runs locally. Features a clean UI, model management, and plugin system. Available for all major platforms.

Model Formats: GGUF Explained

GGUF (GPT-Generated Unified Format) is the standard format for running LLMs locally. Understanding it helps you choose the right model files.

What's in a GGUF File?

A GGUF file is self-contained — it includes:

Model weights: The neural network parameters (quantized)
Tokenizer: The vocabulary and token mapping
Metadata: Model architecture, context length, special tokens
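To see this structure for yourself, you can read the fixed-size header at the start of a file. The sketch below assumes GGUF version 2 or later, where the header is the 4-byte magic "GGUF", a 32-bit version, and two 64-bit counts (tensors and metadata key/value pairs), all little-endian; read_gguf_header is a hypothetical helper name.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (metadata count)


def read_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (assumes version 2 or later)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes of header data")
    magic, version, tensor_count, kv_count = struct.unpack(
        "<4sIQQ", data[:HEADER_SIZE]
    )
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file: unexpected magic bytes")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Usage with a real file (path is a placeholder):
# with open("models/llama-3-8b-q4_k_m.gguf", "rb") as f:
#     print(read_gguf_header(f.read(HEADER_SIZE)))
```

The metadata key/value pairs and tensor descriptions follow this header; parsing those takes more code and is better left to llama.cpp's own tooling.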

Quantization Levels

GGUF files come in different quantization levels. The naming convention tells you the method and precision:

Suffix	Bits	Quality	Size (7B)	Best For
Q8_0	8.5	Near-lossless	~7.5GB	Quality-critical tasks
Q6_K	6.6	Excellent	~5.8GB	Best quality/size balance
Q5_K_M	5.7	Very good	~5.0GB	Good balance
Q4_K_M	4.8	Good	~4.4GB	Most popular choice
Q3_K_M	3.9	Acceptable	~3.5GB	Memory-constrained
Q2_K	2.9	Poor	~2.8GB	Not recommended

Q4_K_M is the sweet spot for most users — good quality with reasonable size.
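The sizes in the table follow roughly from parameter count times bits per weight, divided by 8 bits per byte. The sketch below uses the bits-per-weight values from the table to estimate file size and pick the highest-precision level that fits a memory budget. The 1.5GB overhead for context and buffers is an assumed allowance rather than a measured value, and real files differ slightly because some tensors are stored at other precisions.

```python
# Bits per weight, copied from the table above.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 2.9,
}


def estimated_size_gb(params_billions: float, quant: str) -> float:
    """Approximate model size in GB: parameters x bits per weight / 8."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def best_quant_that_fits(params_billions: float, memory_gb: float,
                         overhead_gb: float = 1.5):
    """Return the highest-precision level that fits, or None if none does."""
    by_precision = sorted(BITS_PER_WEIGHT, key=BITS_PER_WEIGHT.get, reverse=True)
    for quant in by_precision:
        if estimated_size_gb(params_billions, quant) + overhead_gb <= memory_gb:
            return quant
    return None


for quant in BITS_PER_WEIGHT:
    print(f"7B {quant}: ~{estimated_size_gb(7, quant):.1f} GB")

print("7B in 8 GB:", best_quant_that_fits(7, 8))
print("13B in 12 GB:", best_quant_that_fits(13, 12))
```

A result of None means even Q2_K does not fit, so a smaller model is the better choice.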

Where to Download

Hugging Face: Search for "GGUF" in model names (e.g., "llama-3-8b-gguf")
Ollama Library: Pre-curated models optimized for Ollama
TheBloke: Popular quantizer with hundreds of GGUF models on Hugging Face

Choosing the Right Model

With hundreds of models available, choosing the right one can be overwhelming. Here's a practical guide:

For General Chat

Llama 3 8B: Best all-around 8B model, great quality and speed
Mistral 7B: Excellent for its size, strong reasoning
Gemma 2 9B: Google's offering, good at following instructions
Qwen 2.5 7B: Strong multilingual capabilities

For Coding

CodeLlama 7B/13B: Meta's code-specialized model
DeepSeek Coder 6.7B: Excellent code generation
StarCoder2 7B: Trained on The Stack v2

For Larger Models (if you have the hardware)

Llama 3 70B: Near-GPT-4 quality, needs 40GB+ VRAM
Mixtral 8x7B: MoE architecture, runs well on 32GB RAM
Qwen 2.5 72B: Excellent multilingual, strong reasoning

Start with Llama 3 8B (via Ollama). It's the best balance of quality, speed, and hardware requirements for most users. Upgrade to larger models only if you need better quality and have the hardware.

Performance Tips

Maximize Speed

Use GPU offloading: Set -ngl 99 to offload all layers to GPU
Match quantization to VRAM: If your model fits entirely in VRAM, it'll be much faster
Reduce context length: Shorter contexts use less memory and compute faster
Use flash attention: Enable with -fa flag in llama.cpp
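When a model does not fit entirely in VRAM, a lower -ngl value offloads only some layers. A rough starting point can be estimated by dividing usable VRAM by the average size of one layer. The sketch below assumes all layers are the same size and reserves 1.5GB for context and buffers; both are simplifications, and the layer count is something you would read from the model's metadata.

```python
import math


def layers_that_fit(model_size_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Hypothetical example: a 13B Q4 model (~7.8 GB, 40 layers) on an 8 GB card.
ngl = layers_that_fit(model_size_gb=7.8, n_layers=40, vram_gb=8.0)
print(f"-ngl {ngl}")
```

Treat the result as a first guess: if loading fails with an out-of-memory error, lower the value, and if VRAM is left over, raise it.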

Maximize Quality

Use higher-precision quantization: Q6_K or Q8_0 if you have the memory
Choose larger models: 13B > 7B, 70B > 13B (if hardware allows)
Good system prompts: Clear instructions improve output quality
Temperature tuning: Lower (0.3-0.5) for factual tasks, higher (0.7-1.0) for creative
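To see why temperature has this effect, recall that samplers divide the model's raw scores (logits) by the temperature before converting them to probabilities. The sketch below applies that to three made-up logits; real inference engines combine this step with other samplers such as top-k and top-p, which are not shown.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.3, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}: {[round(p, 3) for p in probs]}")
```

At lower temperatures the top-scoring token takes almost all of the probability, which makes output more predictable; at higher temperatures the alternatives get a larger share.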

Common Issues

Slow generation: Check if GPU is being used (nvidia-smi or Activity Monitor)
Out of memory: Use a lower-bit quantization (for example Q4_K_M instead of Q8_0) or a smaller model
Repetitive output: Increase temperature or raise the repeat penalty (--repeat-penalty in llama.cpp)
Poor quality: Try a different model or a higher-precision quantization level

Frequently Asked Questions

What hardware do I need to run LLMs locally?

Minimum: 8GB RAM for 3B models on CPU. Recommended: 16GB+ RAM and a GPU with 8GB+ VRAM (RTX 3060 or better). For larger models: 32GB+ RAM and 24GB+ VRAM (RTX 4090). Apple Silicon Macs with 16GB+ unified memory work very well due to shared CPU/GPU memory.

What is the easiest way to run LLMs locally?

Ollama is the easiest option. Install it with one command (brew install ollama on Mac, or the curl install script on Linux), then run 'ollama run llama3' to download the model and start chatting. It handles model downloading, GPU detection, and serving automatically, and the models in its library come pre-quantized.

What is GGUF format?

GGUF (GPT-Generated Unified Format) is a file format used by llama.cpp for storing quantized models. It's a single file that contains model weights, tokenizer, and metadata. GGUF supports various quantization levels (Q4_K_M, Q5_K_M, Q8_0, etc.) and is optimized for CPU and mixed CPU/GPU inference.

Can I run LLMs without a GPU?

Yes! Tools like llama.cpp and Ollama support CPU-only inference. With sufficient RAM (16GB+ for 7B models), you can run quantized models entirely on CPU. Performance is slower than on a GPU but still usable for interactive chat. Apple Silicon Macs perform well without a discrete GPU because their integrated GPU shares unified memory with the CPU.
