How to Run LLMs Locally: Complete Guide
Running LLMs locally gives you complete privacy, zero API costs, and full control over your AI. With tools like Ollama and llama.cpp, you can run powerful models on your own hardware in minutes.
Why Run LLMs Locally?
Running LLMs on your own hardware offers several advantages over cloud APIs:
Privacy
Your data never leaves your machine. This is critical for sensitive applications like legal documents, medical records, personal journals, or proprietary code. No third party ever sees your prompts or responses.
No API Costs
Cloud API costs can add up quickly. A heavy user might spend $50-200/month on API calls. Once you own suitable hardware, running locally costs only electricity, typically a few dollars per month. The models themselves are free to download (open-weight models).
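The "few dollars per month" figure can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the 300 W power draw, 2 hours of daily use, and $0.15 per kWh price are assumed values, not measurements, so substitute your own.

```python
def monthly_electricity_cost(watts: float, hours_per_day: float,
                             price_per_kwh: float, days: int = 30) -> float:
    """Return the electricity cost for one month of use."""
    kwh = watts / 1000 * hours_per_day * days  # energy used over the month
    return kwh * price_per_kwh


# Assumed figures: 300 W under load, 2 hours/day, $0.15 per kWh.
cost = monthly_electricity_cost(watts=300, hours_per_day=2, price_per_kwh=0.15)
print(f"${cost:.2f} per month")
```

With these assumptions the result is about $2.70 per month; a machine that runs inference all day would cost proportionally more.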
No Internet Required
Once downloaded, models work completely offline. This is useful for travel, unreliable internet, or air-gapped environments.
Customization
You can fine-tune models, modify system prompts, adjust parameters, and experiment freely without API restrictions or rate limits.
No Censorship
Open-weight models give you full control. You can run uncensored variants for research, creative writing, or other use cases where cloud providers might filter outputs.
Hardware Requirements
The hardware you need depends on the model size and quantization level.
Minimum (CPU Only)
- RAM: 8GB for 3B models, 16GB for 7B models
- CPU: Any modern x86_64 or ARM processor
- Storage: 4-8GB per model
- Speed: 5-15 tokens/second for 7B Q4 models
Recommended (GPU)
- VRAM: 8GB for 7B models, 12GB for 13B models, 24GB for ~30B models (all at Q4)
- GPU: NVIDIA RTX 3060 12GB or better (RTX 4090 ideal)
- RAM: 32GB+ for comfortable operation
- Speed: 30-80 tokens/second for 7B Q4 models
Apple Silicon
Apple Silicon Macs (M1, M2, M3, M4) are excellent for local LLMs because of their unified memory architecture. The GPU shares system RAM with the CPU and can use most of it (macOS reserves a portion for the system), so a MacBook with 32GB unified memory can run 13B+ models smoothly.
- M1/M2 with 8GB: 3B-7B quantized models
- M1/M2/M3 with 16GB: 7B-13B quantized models
- M2/M3/M4 Pro/Max with 32-128GB: 13B-70B quantized models
| Hardware | 7B Q4 Speed | 13B Q4 Speed | Max Model |
|---|---|---|---|
| CPU only (16GB RAM) | ~10 tok/s | ~5 tok/s | 13B Q4 |
| RTX 3060 12GB | ~40 tok/s | ~20 tok/s | 13B Q4 |
| RTX 4090 24GB | ~80 tok/s | ~50 tok/s | 33B Q4 |
| M3 Max 96GB | ~60 tok/s | ~40 tok/s | 70B Q4 |
Ollama: The Easiest Option
Ollama is the simplest way to run LLMs locally. It handles everything: downloading pre-quantized models, GPU detection, and serving. If you're new to local LLMs, start here.
Installation
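# macOS (with Homebrew)
brew install ollama
# The Homebrew formula installs the CLI; start the server before running models
brew services start ollama # or run "ollama serve" in another terminal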
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download
Running Your First Model
# Download and run Llama 3 8B (automatic GPU detection)
ollama run llama3
# That's it! You're now chatting with a local LLM.
Popular Models
# General purpose
ollama run llama3 # Meta's Llama 3 8B
ollama run mistral # Mistral 7B
ollama run gemma2 # Google's Gemma 2
# Code generation
ollama run codellama # Code Llama 7B
ollama run deepseek-coder # DeepSeek Coder
# Creative writing
ollama run mixtral # Mixtral 8x7B (MoE)
ollama run dolphin-mixtral # Uncensored Mixtral
Useful Commands
# List installed models
ollama list
# Show model details
ollama show llama3
# Remove a model
ollama rm llama3
# Pull without running
ollama pull llama3:70b # quantization-specific tags are listed on the model's library page
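# Set parameters inside an interactive session
# (ollama run has no --temperature or --num-ctx flags)
ollama run llama3
>>> /set parameter temperature 0.7
>>> /set parameter num_ctx 4096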
Ollama API
Ollama runs a local API server compatible with the OpenAI format:
# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama3","messages":[{"role":"user","content":"Hello!"}]}'
This means you can use Ollama with any tool that supports OpenAI's API format — just change the base URL to http://localhost:11434/v1.
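As a minimal sketch of what "change the base URL" looks like in practice, the following uses only the Python standard library to send the same chat request as the curl example. The helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the standard OpenAI chat-completion shape (`choices[0].message.content`).

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible base URL


def build_chat_request(model: str, prompt: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the Ollama server running and the model pulled, `print(chat("llama3", "Hello!"))` should print the model's reply. Pointing `base_url` at another OpenAI-compatible server is the only change needed to switch backends.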
llama.cpp: Maximum Control
llama.cpp is the foundational C++ library for running LLMs locally. It's what powers Ollama under the hood. Use llama.cpp directly when you need maximum control over inference parameters.
Installation
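# Clone and build (CPU only); current llama.cpp builds with CMake
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Or configure with GPU support (NVIDIA CUDA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
# Binaries (llama-cli, llama-server) are written to build/bin/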
Basic Usage
# Interactive chat
./llama-cli -m models/llama-3-8b-q4_k_m.gguf \
-c 4096 \
--temp 0.7 \
-i
# Single prompt
./llama-cli -m models/llama-3-8b-q4_k_m.gguf \
-p "Explain quantum computing in simple terms:" \
-n 500
# Start an OpenAI-compatible server
# (0.0.0.0 listens on all network interfaces; use 127.0.0.1 to keep the server local-only)
./llama-server -m models/llama-3-8b-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080
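When a client sends `"stream": true` to an OpenAI-compatible chat endpoint such as the one llama-server exposes, the reply arrives as server-sent events rather than one JSON body. The sketch below shows one way to pull the text out of those events. The function name `iter_stream_text` is hypothetical, and the sample lines are hand-written in the OpenAI streaming shape, not captured server output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # some chunks carry only usage information
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Hand-written sample events for illustration.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

In real use, the `lines` argument would be the decoded lines of the HTTP response body, consumed as they arrive so that text can be shown token by token.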
Key Parameters
- -m: Path to GGUF model file
- -c: Context length (the default depends on the llama.cpp version; check --help)
- --temp: Temperature (0.0 = deterministic, 1.0 = creative)
- -ngl: Number of layers to offload to GPU (0 = CPU only)
- -t: Number of CPU threads
- -n: Maximum tokens to generate
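If you launch llama.cpp from a script, it can help to assemble these flags in one place. The following is a small sketch, not part of llama.cpp: `llama_cli_args` is a hypothetical helper, the default values are arbitrary, and the binary path is an assumption that depends on where your build put `llama-cli`.

```python
import shlex
from typing import List, Optional


def llama_cli_args(model_path: str, ctx: int = 4096, temp: float = 0.7,
                   gpu_layers: int = 0, threads: Optional[int] = None,
                   max_tokens: Optional[int] = None,
                   binary: str = "./llama-cli") -> List[str]:
    """Build an argument list for llama-cli from the key parameters."""
    args = [binary, "-m", model_path, "-c", str(ctx),
            "--temp", str(temp), "-ngl", str(gpu_layers)]
    if threads is not None:
        args += ["-t", str(threads)]  # omit to let llama.cpp choose
    if max_tokens is not None:
        args += ["-n", str(max_tokens)]
    return args


cmd = llama_cli_args("models/llama-3-8b-q4_k_m.gguf",
                     gpu_layers=99, max_tokens=500)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The resulting list can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting problems with paths or prompts that contain spaces.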
GUI Applications
If you prefer a graphical interface over the command line, several excellent options exist:
LM Studio
A polished desktop application for running local LLMs. Features include model discovery, chat interface, local server, and parameter adjustment. Available for Mac, Windows, and Linux.
Open WebUI
A self-hosted web interface that connects to Ollama or OpenAI-compatible APIs. Features a ChatGPT-like interface, conversation history, and multi-model support. Runs in Docker:
docker run -d -p 3000:8080 \
-v open-webui:/app/backend/data \
ghcr.io/open-webui/open-webui:main
GPT4All
A lightweight desktop application focused on privacy. It runs models entirely locally with no data collection. Good for beginners who want a simple chat interface.
Jan
An open-source alternative to ChatGPT that runs locally. Features a clean UI, model management, and plugin system. Available for all major platforms.
Model Formats: GGUF Explained
GGUF (GPT-Generated Unified Format) is the standard format for running LLMs locally. Understanding it helps you choose the right model files.
What's in a GGUF File?
A GGUF file is self-contained — it includes:
- Model weights: The neural network parameters (quantized)
- Tokenizer: The vocabulary and token mapping
- Metadata: Model architecture, context length, special tokens
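To make the file layout a little more concrete, here is a minimal sketch that reads the fixed-size header at the very start of a GGUF file. It assumes a little-endian GGUF version 2 or 3 file, where the header is a 4-byte magic, a 32-bit version, and two 64-bit counts. `GGUFHeader` and `parse_gguf_header` are hypothetical names, and the sample bytes are synthetic.

```python
import struct
from typing import NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def parse_gguf_header(data: bytes) -> GGUFHeader:
    """Parse the fixed-size header at the start of a GGUF v2/v3 file."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    # Little-endian: 4-byte magic, uint32 version, uint64 tensor count,
    # uint64 metadata key/value count.
    magic, version, tensors, kv_pairs = struct.unpack("<4sIQQ", data[:24])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    if version < 2:
        raise ValueError("GGUF v1 uses a different header layout")
    return GGUFHeader(version, tensors, kv_pairs)


# Synthetic header for illustration (the counts are made up).
sample = struct.pack("<4sIQQ", b"GGUF", 3, 291, 19)
print(parse_gguf_header(sample))
```

For a real file, read the first 24 bytes with `open(path, "rb").read(24)` and pass them in. The metadata key/value pairs and tensor descriptions follow the header and need a fuller parser.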
Quantization Levels
GGUF files come in different quantization levels. The naming convention tells you the method and precision:
| Suffix | Bits per weight | Quality | File size (7B) | Best for |
|---|---|---|---|---|
| Q8_0 | 8.5 | Near-lossless | ~7.5GB | Quality-critical tasks |
| Q6_K | 6.6 | Excellent | ~5.8GB | High quality when memory allows |
| Q5_K_M | 5.7 | Very good | ~5.0GB | Good balance |
| Q4_K_M | 4.8 | Good | ~4.4GB | Most popular choice |
| Q3_K_M | 3.9 | Acceptable | ~3.5GB | Memory-constrained |
| Q2_K | 2.9 | Poor | ~2.8GB | Not recommended |
Q4_K_M is the sweet spot for most users — good quality with reasonable size.
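The sizes in the table follow roughly from parameter count times bits per weight. The sketch below uses that rule of thumb to estimate file size and to pick the highest-precision level that fits a memory budget. The bits-per-weight values are copied from the table above; the 1.5GB overhead allowance for context and runtime buffers is an assumption, and both function names are hypothetical.

```python
from typing import Optional

# Approximate bits per weight, taken from the table above.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 2.9,
}


def estimated_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


def best_quant_that_fits(params_billion: float, memory_gb: float,
                         overhead_gb: float = 1.5) -> Optional[str]:
    """Return the highest-precision suffix that fits, or None if none do."""
    by_precision = sorted(BITS_PER_WEIGHT.items(), key=lambda kv: -kv[1])
    for suffix, bits in by_precision:
        if estimated_size_gb(params_billion, bits) + overhead_gb <= memory_gb:
            return suffix
    return None


print(f"7B at Q4_K_M: ~{estimated_size_gb(7, 4.8):.1f} GB")
print("7B in 8 GB:", best_quant_that_fits(7, 8))
print("13B in 8 GB:", best_quant_that_fits(13, 8))
```

The estimate lands slightly under the table's figures for some levels, since real files also store the tokenizer and metadata and "7B" models often have somewhat more than 7 billion parameters. Treat it as a first filter, not a guarantee.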
Where to Download
- Hugging Face: Search for "GGUF" in model names (e.g., "llama-3-8b-gguf")
- Ollama Library: Pre-curated models optimized for Ollama
- TheBloke: Popular quantizer with hundreds of GGUF models on Hugging Face
Choosing the Right Model
With hundreds of models available, choosing the right one can be overwhelming. Here's a practical guide:
For General Chat
- Llama 3 8B: Best all-around 8B model, great quality and speed
- Mistral 7B: Excellent for its size, strong reasoning
- Gemma 2 9B: Google's offering, good at following instructions
- Qwen 2.5 7B: Strong multilingual capabilities
For Coding
- CodeLlama 7B/13B: Meta's code-specialized model
- DeepSeek Coder 6.7B: Excellent code generation
- StarCoder2 7B: Trained on The Stack v2
For Larger Models (if you have the hardware)
- Llama 3 70B: Near-GPT-4 quality, needs 40GB+ VRAM
- Mixtral 8x7B: MoE architecture, runs well on 32GB RAM
- Qwen 2.5 72B: Excellent multilingual, strong reasoning
Start with Llama 3 8B (via Ollama). It's the best balance of quality, speed, and hardware requirements for most users. Upgrade to larger models only if you need better quality and have the hardware.
Performance Tips
Maximize Speed
- Use GPU offloading: Set -ngl 99 to offload all layers to GPU
- Match quantization to VRAM: If your model fits entirely in VRAM, it'll be much faster
- Reduce context length: Shorter contexts use less memory and compute faster
- Use flash attention: Enable with the -fa flag in llama.cpp
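The memory effect of context length comes mostly from the key/value cache, which grows linearly with the number of tokens kept in context. A rough estimate, assuming a 16-bit cache and the published Llama 3 8B shape (32 layers, 8 key/value heads, head dimension 128), can be sketched as follows. The function name is hypothetical and other models will have different shapes.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: keys and values for every layer and token."""
    # Factor of 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Llama 3 8B shape; 2 bytes per value assumes a 16-bit cache.
for ctx in (2048, 4096, 8192):
    mib = kv_cache_bytes(32, 8, 128, ctx) / 2**20
    print(f"context {ctx}: {mib:.0f} MiB")
```

Under these assumptions the cache takes 256 MiB at 2048 tokens, 512 MiB at 4096, and 1024 MiB at 8192, so halving the context frees memory that can go toward offloading more layers.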
Maximize Quality
- Use higher-precision quantization: Q6_K or Q8_0 if you have the memory
- Choose larger models: 13B > 7B, 70B > 13B (if hardware allows)
- Good system prompts: Clear instructions improve output quality
- Temperature tuning: Lower (0.3-0.5) for factual tasks, higher (0.7-1.0) for creative
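To see why lower temperatures suit factual tasks, it helps to look at what temperature does to the token probabilities. The sketch below applies the usual temperature-scaled softmax to three made-up scores. It illustrates the general technique, not any particular tool's sampler, and the function name is hypothetical.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert raw scores to probabilities after dividing by temperature."""
    if temperature <= 0:
        # Real samplers usually treat 0 as "always pick the top token".
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.3, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, probability mass concentrates on the top-scoring token, which makes output more predictable; higher values spread it out and allow more varied word choices.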
Common Issues
- Slow generation: Check if GPU is being used (nvidia-smi or Activity Monitor)
- Out of memory: Use a smaller quantization or smaller model
- Repetitive output: Increase temperature or use repeat penalty
- Poor quality: Try a different model or higher quantization level
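A repeat penalty works by lowering the scores of tokens that appeared recently, before the next token is sampled. The sketch below shows a common formulation, in which positive scores are divided by the penalty and negative scores are multiplied by it. The function name and the example values are illustrative, and a given tool's exact behavior may differ.

```python
from typing import Iterable, List


def apply_repeat_penalty(logits: List[float], recent_tokens: Iterable[int],
                         penalty: float = 1.1) -> List[float]:
    """Return a copy of logits with recently used token ids made less likely."""
    out = list(logits)
    for token_id in set(recent_tokens):
        if out[token_id] > 0:
            out[token_id] /= penalty  # shrink a positive score
        else:
            out[token_id] *= penalty  # push a negative score further down
    return out


# Made-up scores indexed by token id; tokens 0 and 1 were just generated.
scores = [3.0, -1.0, 0.5]
print(apply_repeat_penalty(scores, recent_tokens=[0, 1, 0], penalty=1.5))
```

Here the two recently used tokens drop to 2.0 and -1.5 while the unused token keeps its score of 0.5. A penalty of 1.0 leaves everything unchanged, and values only slightly above 1.0 are usually enough to break loops.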
Frequently Asked Questions
What hardware do I need to run LLMs locally?
Minimum: 8GB RAM for 3B models on CPU. Recommended: 16GB+ RAM and a GPU with 8GB+ VRAM (RTX 3060 or better). For larger models: 32GB+ RAM and 24GB+ VRAM (RTX 4090). Apple Silicon Macs with 16GB+ unified memory work very well due to shared CPU/GPU memory.
What is the easiest way to run LLMs locally?
Ollama is the easiest option. Install it with one command (brew install ollama on Mac, or the curl install script on Linux), then run 'ollama run llama3' to download and start chatting. It handles downloading pre-quantized models, GPU detection, and serving automatically.
What is GGUF format?
GGUF (GPT-Generated Unified Format) is a file format used by llama.cpp for storing quantized models. It's a single file that contains model weights, tokenizer, and metadata. GGUF supports various quantization levels (Q4_K_M, Q5_K_M, Q8_0, etc.) and is optimized for CPU and mixed CPU/GPU inference.
Can I run LLMs without a GPU?
Yes! Tools like llama.cpp and Ollama support CPU-only inference. With sufficient RAM (16GB+ for 7B models), you can run quantized models entirely on CPU. Performance is slower than GPU but still usable for interactive chat. Apple Silicon Macs do particularly well without a discrete GPU because the integrated GPU shares the unified memory with the CPU.