How to Run LLMs Locally: A Complete Guide

Last updated: June 23, 2026

Running LLMs locally gives you complete privacy, zero API costs, and full control over your AI. With tools like Ollama and llama.cpp, you can run powerful models on your own hardware in minutes.

Why Run LLMs Locally?

Running LLMs on your own hardware offers several advantages over cloud APIs:

Privacy

Your data never leaves your machine. This is critical for sensitive applications like legal documents, medical records, personal journals, or proprietary code. No third party ever sees your prompts or responses.

No API Costs

Cloud API costs can add up quickly: a heavy user might spend $50-200/month on API calls. Once you own the hardware, running locally costs only electricity, typically a few dollars per month. Open-weight models themselves are free to download.
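To sanity-check the "few dollars per month" figure, here is a rough sketch of the arithmetic. The wattage, daily usage, and electricity price below are assumptions chosen for illustration, not measurements; substitute your own numbers.

```python
def monthly_electricity_cost(watts: float, hours_per_day: float,
                             price_per_kwh: float, days: int = 30) -> float:
    """Return the electricity cost of running a machine for one month."""
    kwh = watts / 1000 * hours_per_day * days  # energy used, in kilowatt-hours
    return kwh * price_per_kwh


# Assumed figures for illustration only: a 300 W desktop under load,
# 4 hours of inference per day, electricity at $0.15 per kWh.
cost = monthly_electricity_cost(watts=300, hours_per_day=4, price_per_kwh=0.15)
print(f"${cost:.2f} per month")  # prints "$5.40 per month"
```

This leaves out the up-front cost of the hardware, which is usually the larger expense.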

No Internet Required

Once downloaded, models work completely offline. This is useful for travel, unreliable internet, or air-gapped environments.

Customization

You can fine-tune models, modify system prompts, adjust parameters, and experiment freely without API restrictions or rate limits.

No Censorship

Open-weight models give you full control. You can run uncensored variants for research, creative writing, or other use cases where cloud providers might filter outputs.

Hardware Requirements

The hardware you need depends on the model size and quantization level.

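Minimum (CPU Only)

8GB RAM is enough for 3B models on CPU. Plan on 16GB+ for quantized 7B models.

Recommended (GPU)

16GB+ RAM and a GPU with 8GB+ VRAM (RTX 3060 or better). For larger models, 32GB+ RAM and 24GB+ VRAM (RTX 4090).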

Apple Silicon

Apple Silicon Macs (M1, M2, M3, M4) are excellent for local LLMs because of their unified memory architecture. The GPU shares system memory with the CPU (macOS reserves part of it for the system), so a MacBook with 32GB unified memory can run 13B+ models smoothly.

| Hardware | 7B Q4 Speed | 13B Q4 Speed | Max Model |
| --- | --- | --- | --- |
| CPU only (16GB RAM) | ~10 tok/s | ~5 tok/s | 13B Q4 |
| RTX 3060 12GB | ~40 tok/s | ~20 tok/s | 13B Q4 |
| RTX 4090 24GB | ~80 tok/s | ~50 tok/s | 33B Q4 |
| M3 Max 96GB | ~60 tok/s | ~40 tok/s | 70B Q4 |
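To see what these speeds mean in practice, the sketch below converts tokens per second into waiting time for a 500-token reply. The speeds are the approximate 7B Q4 figures from the table above. The 500-token reply length is an assumption, and the calculation ignores prompt processing time.

```python
# Approximate 7B Q4 generation speeds from the table above (tokens per second).
SPEEDS_7B_Q4 = {
    "CPU only (16GB RAM)": 10,
    "RTX 3060 12GB": 40,
    "RTX 4090 24GB": 80,
    "M3 Max 96GB": 60,
}


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady generation speed."""
    return num_tokens / tokens_per_second


for hardware, speed in SPEEDS_7B_Q4.items():
    seconds = generation_seconds(500, speed)
    print(f"{hardware}: {seconds:.1f} s for 500 tokens")
```

At these rates a 500-token answer takes roughly 50 seconds on CPU and about 6 seconds on an RTX 4090.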

Ollama: The Easiest Option

Ollama is the simplest way to run LLMs locally. It handles everything: downloading pre-quantized models, detecting your GPU, and serving a local API. If you're new to local LLMs, start here.

Installation

# macOS (with Homebrew)
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

Running Your First Model

# Download and run Llama 3 8B (automatic GPU detection)
ollama run llama3

# That's it! You're now chatting with a local LLM.

Popular Models

# General purpose
ollama run llama3          # Meta's Llama 3 8B
ollama run mistral         # Mistral 7B
ollama run gemma2          # Google's Gemma 2

# Code generation
ollama run codellama       # Code Llama 7B
ollama run deepseek-coder  # DeepSeek Coder

# Creative writing
ollama run mixtral          # Mixtral 8x7B (MoE)
ollama run dolphin-mixtral  # Uncensored Mixtral

Useful Commands

# List installed models
ollama list

# Show model details
ollama show llama3

# Remove a model
ollama rm llama3

# Pull without running (tag names vary per model; check the model's page on ollama.com)
ollama pull llama3:70b

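# Run with specific parameters
# (ollama run has no --temperature or --num-ctx flags; set them inside the session)
ollama run llama3
>>> /set parameter temperature 0.7
>>> /set parameter num_ctx 4096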

Ollama API

Ollama runs a local API server compatible with the OpenAI format:

# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3","messages":[{"role":"user","content":"Hello!"}]}'

This means you can use Ollama with any tool that supports OpenAI's API format — just change the base URL to http://localhost:11434/v1.
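The same request can be made from Python with only the standard library. This is a minimal sketch, assuming Ollama is listening on its default port and the llama3 model has been pulled. The helper names `build_chat_request` and `chat` are made up for this example and are not part of any Ollama client.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible base URL


def build_chat_request(prompt: str, model: str = "llama3",
                       temperature: float = 0.7,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # OpenAI-format responses put the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]


# Usage (needs a running Ollama server):
# print(chat("Hello!"))
```

Building the request separately from sending it makes it easy to inspect the payload, or to point `base_url` at another OpenAI-compatible server.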

llama.cpp: Maximum Control

llama.cpp is the foundational C++ library for running LLMs locally. It's what powers Ollama under the hood. Use llama.cpp directly when you need maximum control over inference parameters.

Installation

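# Clone and build (CPU); current llama.cpp builds with CMake rather than make
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Or enable CUDA for NVIDIA GPUs (GGML_CUDA replaces the older LLAMA_CUDA option)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release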

Basic Usage

# Binaries from the CMake build live in build/bin/

# Interactive chat
./build/bin/llama-cli -m models/llama-3-8b-q4_k_m.gguf \
  -c 4096 \
  --temp 0.7 \
  -i

# Single prompt
./build/bin/llama-cli -m models/llama-3-8b-q4_k_m.gguf \
  -p "Explain quantum computing in simple terms:" \
  -n 500

# Start an OpenAI-compatible server
# (127.0.0.1 keeps it local; use 0.0.0.0 only to expose it to your network)
./build/bin/llama-server -m models/llama-3-8b-q4_k_m.gguf \
  --host 127.0.0.1 \
  --port 8080

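Key Parameters

The flags used in the examples above:

-m: path to the GGUF model file
-c: context size in tokens
--temp: sampling temperature
-n: maximum number of tokens to generate
-i: interactive mode
--host, --port: address and port the server listens on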

GUI Applications

If you prefer a graphical interface over the command line, several excellent options exist:

LM Studio

A polished desktop application for running local LLMs. Features include model discovery, chat interface, local server, and parameter adjustment. Available for Mac, Windows, and Linux.

Open WebUI

A self-hosted web interface that connects to Ollama or OpenAI-compatible APIs. Features a ChatGPT-like interface, conversation history, and multi-model support. Runs in Docker:

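# --add-host lets the container reach an Ollama server running on the host
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main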

GPT4All

A lightweight desktop application focused on privacy. It runs models entirely locally with no data collection. Good for beginners who want a simple chat interface.

Jan

An open-source alternative to ChatGPT that runs locally. Features a clean UI, model management, and plugin system. Available for all major platforms.

Model Formats: GGUF Explained

GGUF (GPT-Generated Unified Format) is the standard format for running LLMs locally. Understanding it helps you choose the right model files.

What's in a GGUF File?

A GGUF file is self-contained: it includes the model weights, the tokenizer, and metadata in a single file.
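As an illustration of that layout, the sketch below parses the fixed-size header at the start of a GGUF file. It assumes the version 2 and later layout as I understand it from the GGUF specification: the 4-byte magic `GGUF`, then a little-endian 32-bit version, a 64-bit tensor count, and a 64-bit metadata key/value count. The function name `read_gguf_header` is hypothetical, and the sample bytes are synthetic rather than taken from a real model.

```python
import struct

HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor_count, metadata_kv_count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(data: bytes) -> dict:
    """Parse the GGUF header (assumed v2+ layout) from the first bytes of a file."""
    magic, version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration; the counts are made up.
sample = struct.pack(HEADER_FORMAT, b"GGUF", 3, 291, 24)
print(read_gguf_header(sample))
# prints {'version': 3, 'tensor_count': 291, 'metadata_kv_count': 24}
```

For a real file, read the first `HEADER_SIZE` bytes with `open(path, "rb").read(HEADER_SIZE)` and pass them in. The metadata entries and tensor descriptions follow the header.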

Quantization Levels

GGUF files come in different quantization levels. The naming convention tells you the method and precision:

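| Suffix | Bits | Quality | Size (7B) | Best For |
| --- | --- | --- | --- | --- |
| Q8_0 | 8.5 | Near-lossless | ~7.5GB | Quality-critical tasks |
| Q6_K | 6.6 | Excellent | ~5.8GB | Best quality/size balance |
| Q5_K_M | 5.7 | Very good | ~5.0GB | Good balance |
| Q4_K_M | 4.8 | Good | ~4.4GB | Most popular choice |
| Q3_K_M | 3.9 | Acceptable | ~3.5GB | Memory-constrained |
| Q2_K | 2.9 | Poor | ~2.8GB | Not recommended |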

Q4_K_M is the sweet spot for most users — good quality with reasonable size.
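The sizes in the table follow roughly from parameters × bits per weight ÷ 8. The sketch below applies that rule of thumb and picks the highest-precision level that fits a memory budget. The bits-per-weight values come from the table; the function names are hypothetical. Treat the results as estimates: real files come out slightly different, and you also need headroom for the context (KV cache).

```python
# Bits per weight for each quantization level, from the table above.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 2.9,
}


def estimated_size_gb(params_billions: float, quant: str) -> float:
    """Approximate file size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def best_quant_that_fits(params_billions: float, budget_gb: float):
    """Return the highest-precision level whose estimate fits the budget, or None."""
    for quant in sorted(BITS_PER_WEIGHT, key=BITS_PER_WEIGHT.get, reverse=True):
        if estimated_size_gb(params_billions, quant) <= budget_gb:
            return quant
    return None


print(round(estimated_size_gb(7, "Q4_K_M"), 1))  # prints 4.2
print(best_quant_that_fits(13, budget_gb=10))    # prints Q5_K_M
```

For a 13B model and a 10GB budget, Q6_K is estimated at about 10.7GB and does not fit, so the search settles on Q5_K_M at about 9.3GB.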

Where to Download

Choosing the Right Model

With hundreds of models available, choosing the right one can be overwhelming. Here's a practical guide:

For General Chat

For Coding

For Larger Models (if you have the hardware)

Start with Llama 3 8B (via Ollama). It's the best balance of quality, speed, and hardware requirements for most users. Upgrade to larger models only if you need better quality and have the hardware.

Performance Tips

Maximize Speed

Maximize Quality

Common Issues

Frequently Asked Questions

What hardware do I need to run LLMs locally?

Minimum: 8GB RAM for 3B models on CPU. Recommended: 16GB+ RAM and a GPU with 8GB+ VRAM (RTX 3060 or better). For larger models: 32GB+ RAM and 24GB+ VRAM (RTX 4090). Apple Silicon Macs with 16GB+ unified memory work very well due to shared CPU/GPU memory.

What is the easiest way to run LLMs locally?

Ollama is the easiest option. Install it with one command (brew install ollama on Mac, or the curl install script on Linux), then run 'ollama run llama3' to download a model and start chatting. It handles downloading pre-quantized models, GPU detection, and serving automatically.

What is GGUF format?

GGUF (GPT-Generated Unified Format) is a file format used by llama.cpp for storing quantized models. It's a single file that contains model weights, tokenizer, and metadata. GGUF supports various quantization levels (Q4_K_M, Q5_K_M, Q8_0, etc.) and is optimized for CPU and mixed CPU/GPU inference.

Can I run LLMs without a GPU?

Yes. Tools like llama.cpp and Ollama support CPU-only inference. With sufficient RAM (16GB+ for 7B models), you can run quantized models entirely on CPU. Performance is slower than on a GPU but still usable for interactive chat. Apple Silicon Macs do especially well without a discrete GPU because their integrated GPU shares unified memory with the CPU.