Ollama Guide — Run LLMs Locally in Minutes
Ollama is the easiest way to run open-source LLMs on your own machine. Install it, pick a model, and start chatting — no cloud API keys, no GPU cloud, no vendor lock-in.
What is Ollama?
Ollama is an open-source tool that makes it incredibly easy to run Large Language Models locally on your computer. Think of it as "Docker for LLMs" — it handles downloading model weights, managing different models, optimizing inference for your hardware, and exposing a local API, all through simple commands.
Before Ollama, running a local LLM meant wrestling with Python environments, CUDA drivers, model format conversions (GGUF, GPTQ, AWQ), and inference engines (llama.cpp, vLLM, text-generation-inference). Ollama eliminates all of that. One install, one command, and you are chatting with a local AI.
Why Run LLMs Locally?
Running models locally has several advantages over cloud APIs:
- Privacy: Your data never leaves your machine. No API calls, no logs on someone else's server. This is critical for sensitive documents, medical data, or proprietary code.
- Cost: After the one-time hardware cost, inference is free. No per-token charges, no monthly bills. Run as many queries as you want.
- Offline access: Works without internet. Great for travel, unreliable connections, or air-gapped environments.
- No rate limits: No waiting for API quotas. Run concurrent requests, fine-tune prompts, experiment freely.
- Customization: Fine-tune models, create custom model configurations, and experiment with different quantization levels.
The trade-off is that local models are generally smaller and less capable than frontier cloud models (GPT-4, Claude). For many tasks, a 7B or 13B local model is good enough. For complex reasoning or coding, you may still want a cloud API. For a comparison of options, see our LLM comparison guide.
Installation
Ollama supports macOS, Linux, and Windows. Installation is a single command or installer.
macOS
# Option 1: Download from ollama.com
# Visit https://ollama.com/download and download the .dmg
# Option 2: Install with Homebrew
brew install ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
This installs the Ollama binary and sets up a systemd service that starts automatically.
Windows
Download the installer from https://ollama.com/download. It installs as a background service with a system tray icon.
Verify Installation
ollama --version
# ollama version is 0.6.x
If you see a version number, Ollama is ready to use.
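The same check can be done over HTTP once the background service is up. This is a minimal sketch, assuming the default address http://localhost:11434 and Ollama's /api/version endpoint; the helper name ollama_is_running is hypothetical, not part of Ollama.

```python
import json
import urllib.error
import urllib.request


def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an Ollama server answers at base_url, else False."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            data = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as not running.
        return False
    # A healthy server is expected to reply with a JSON object containing "version".
    return isinstance(data, dict) and "version" in data
```

Calling ollama_is_running() before sending requests lets a script fail early with a clear message instead of a connection error.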
Running Your First Model
Running a model is a single command. Ollama downloads the model on first run; later runs skip the download and only need to load the model into memory.
Quick Start
# Run Llama 3 (8B, ~4.7GB download)
ollama run llama3
# That's it. You're now chatting with a local LLM.
What Happens Behind the Scenes
- Ollama checks if the model is already downloaded. If not, it downloads the model weights from the Ollama registry.
- The model is loaded into memory (RAM or VRAM).
- An interactive chat session starts in your terminal.
- When you exit, the model stays loaded for a few minutes for fast restarts, then unloads to free memory.
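The first step above (checking whether a model is already downloaded) can be reproduced against the local API. The sketch below assumes that GET /api/tags returns a JSON object with a "models" list whose entries carry a "name" field, and that a name without a tag resolves to ":latest". The helper names and the sample data are illustrative.

```python
def normalize_model_name(name):
    """Append the default ':latest' tag when no tag is given."""
    return name if ":" in name else f"{name}:latest"


def is_installed(name, tags_response):
    """Check a parsed /api/tags response for the given model name."""
    wanted = normalize_model_name(name)
    return any(m.get("name") == wanted for m in tags_response.get("models", []))


# Sample data shaped like the /api/tags response (values are illustrative).
sample = {"models": [{"name": "llama3:latest"}, {"name": "mistral:latest"}]}
print(is_installed("llama3", sample))      # True
print(is_installed("llama3:70b", sample))  # False
```

In a real script, tags_response would come from something like requests.get("http://localhost:11434/api/tags").json().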
Other Popular Models
# Mistral 7B — fast, good at instruction following
ollama run mistral
# Gemma 2 (Google) — strong for its size
ollama run gemma2
# Phi-3 (Microsoft) — surprisingly capable small model
ollama run phi3
# Code Llama — specialized for programming
ollama run codellama
# Qwen 2.5 — leading Chinese-English model
ollama run qwen2.5
Chat in the Terminal
Once the model is loaded, you can chat directly in the terminal:
>>> What is the capital of France?
The capital of France is Paris. It's also the largest city in France
and one of the most populous cities in Europe.
>>> /bye
Type /bye to exit the chat session.
Model Management
Ollama provides simple commands to manage your local models.
List Installed Models
ollama list
# NAME ID SIZE MODIFIED
# llama3:latest 365c0bd3c071 4.7 GB 2 hours ago
# mistral:latest f974a74358d6 4.1 GB 3 days ago
Pull a Model (Download Without Running)
ollama pull llama3:70b
Remove a Model
ollama rm mistral
Copy / Rename a Model
# Create a copy with a custom name
ollama cp llama3 my-custom-llama
Model Tags and Quantization
Models come in different sizes via quantization tags. The tag format is model:tag:
# Default (usually Q4 quantized, good balance)
ollama run llama3
# Larger, higher quality (needs more RAM)
ollama run llama3:70b
# Smaller, faster (less RAM, lower quality)
ollama run llama3:8b-instruct-q3_K_S
Quantization reduces model size by using fewer bits per weight. Q4 is the sweet spot for most users: good quality with reasonable memory usage. Q3 is smaller and faster but noticeably lower in quality. Q5 and Q8 are larger and usually only marginally better than Q4.
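The "fewer bits per weight" idea translates into simple arithmetic: the weights take roughly parameters × bits ÷ 8 bytes. The sketch below is a rough estimate only. It ignores the context cache and runtime overhead, and treats the bits-per-weight figure as an input rather than an exact property of any quantization format.

```python
def estimate_weights_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Compare an 8B model at several precisions (16 bits = unquantized half precision).
for bits in (3, 4, 5, 8, 16):
    print(f"8B model at {bits} bits/weight: ~{estimate_weights_gb(8, bits):.1f} GB")
```

Real model files tend to run somewhat larger than this estimate, which is consistent with the ~4.7 GB download quoted earlier for Llama 3 8B.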
Choosing the Right Model
The right model depends on your hardware and use case:
| Model | Size | RAM Needed | Best For |
|---|---|---|---|
| Phi-3 Mini | 3.8B | 4 GB | Low-end hardware, quick tasks |
| Llama 3 | 8B | 8 GB | General purpose, good balance |
| Mistral | 7B | 8 GB | Instruction following, fast |
| Gemma 2 | 9B | 10 GB | Reasoning, analysis |
| Qwen 2.5 | 14B | 16 GB | Multilingual, coding |
| Llama 3 | 70B | 48 GB | Best local quality (needs GPU) |
General Recommendations
- 8 GB RAM: Start with Llama 3 8B or Mistral 7B. These are capable enough for most everyday tasks.
- 16 GB RAM: Try Qwen 2.5 14B, or Llama 3 8B at a higher-precision quantization (Q5/Q6).
- 32+ GB RAM or GPU: Run Mixtral, or Llama 3 70B if you have the roughly 48 GB listed in the table above, for near-frontier quality.
- Coding: Use CodeLlama or Qwen 2.5 Coder. They are specifically trained on code.
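The table and recommendations above amount to a lookup by available memory. Here is a small sketch of that lookup, using the table's figures as data; the function name models_that_fit is hypothetical, and the RAM numbers are the guide's approximations, not measured values.

```python
# (model, parameter size, RAM needed in GB), copied from the table above.
MODELS = [
    ("Phi-3 Mini", "3.8B", 4),
    ("Llama 3", "8B", 8),
    ("Mistral", "7B", 8),
    ("Gemma 2", "9B", 10),
    ("Qwen 2.5", "14B", 16),
    ("Llama 3", "70B", 48),
]


def models_that_fit(ram_gb):
    """Return the table rows whose RAM requirement is within ram_gb."""
    return [(name, size) for name, size, need in MODELS if need <= ram_gb]


print(models_that_fit(8))
```

Keep in mind that the operating system and other applications also need memory, so treat the result as an upper bound on what will run comfortably.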
Using the Ollama API
Ollama exposes a local REST API on http://localhost:11434. This is the real power — you can integrate local LLMs into any application.
Chat Completion (OpenAI-Compatible)
Ollama's API is compatible with OpenAI's format, so any tool that works with OpenAI can use Ollama:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"messages": [
{"role": "user", "content": "Explain quantum computing in one paragraph."}
]
}'
Using with Python
import requests
response = requests.post("http://localhost:11434/api/chat", json={
"model": "llama3",
"messages": [
{"role": "user", "content": "What is 2 + 2?"}
],
"stream": False
})
print(response.json()["message"]["content"])
Using with LangChain
from langchain_community.llms import Ollama
llm = Ollama(model="llama3")
response = llm.invoke("Explain RAG in simple terms.")
print(response)
For a complete LangChain walkthrough, see our LangChain tutorial.
Using with the OpenAI Python SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
model="llama3",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Because the API is OpenAI-compatible, you can use any OpenAI-compatible tool by simply changing the base_url.
Generate Endpoint
For simpler use cases (no chat history), use the /api/generate endpoint:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3", "prompt": "Write a haiku about coding."}'
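By default this endpoint streams its answer as newline-delimited JSON: each line is an object with a "response" fragment, and the last one has "done" set to true. The sketch below joins those fragments into one string. The helper name join_stream and the sample lines are illustrative, not real model output.

```python
import json


def join_stream(lines):
    """Concatenate the 'response' fragments from /api/generate stream lines."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object carries done=true
    return "".join(parts)


# Illustrative lines shaped like a streamed reply.
sample = [
    '{"model": "llama3", "response": "Hello", "done": false}',
    '{"model": "llama3", "response": " world", "done": false}',
    '{"model": "llama3", "response": "", "done": true}',
]
print(join_stream(sample))  # Hello world
```

With the requests library, the lines could come from requests.post(url, json=payload, stream=True).iter_lines(). To receive a single JSON object instead, add "stream": false to the request body.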
Open WebUI — Chat Interface
If you want a ChatGPT-like web interface for your local models, Open WebUI (formerly Ollama WebUI) is the best option. It runs as a Docker container and connects to your local Ollama instance.
Install with Docker
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in your browser. Create an account (local only, no cloud), and you will see all your Ollama models in a clean chat interface.
Features
- Multi-model chat — switch between models mid-conversation
- Conversation history and organization
- Document upload and RAG
- Image generation integration
- Web search integration
- User management and permissions
Performance Tips
Get the most out of your local LLM setup:
- Use a GPU: If you have an NVIDIA or Apple Silicon GPU, Ollama will automatically use it. GPU inference is 5-10x faster than CPU.
- Choose the right quantization: Q4 (Q4_0 or Q4_K_M, depending on the model) is the usual default and the best balance. Use Q3 for less RAM, Q5 for slightly better quality.
- Keep models loaded: The first request after loading is slow. Subsequent requests are fast. Ollama keeps models in memory for 5 minutes by default. Adjust with OLLAMA_KEEP_ALIVE.
- Use streaming: For interactive applications, always use streaming. The first token appears quickly even if the full response takes time.
- Match model to task: Do not run a 70B model for simple tasks. Use a 7B model for quick Q&A and save the big model for complex reasoning.
- Set the number of threads: The num_thread model option (set in a Modelfile or in an API request's options) controls CPU threads. Set it to your physical core count (not the hyperthreaded count) for best performance.
- Context length: Longer context uses more memory. If you do not need long conversations, reduce the context window (the num_ctx option) to save RAM.
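Several of these tips (keep-alive, streaming, thread count, context length) can also be applied per request. The sketch below builds a body for POST /api/chat, assuming the keep_alive field and the num_ctx and num_thread entries under options from Ollama's API; the helper name and the default values are illustrative choices, not recommendations from Ollama.

```python
def build_chat_payload(model, prompt, keep_alive="5m", num_ctx=2048,
                       num_thread=None, stream=True):
    """Build a JSON body for POST /api/chat with tuning fields set."""
    options = {"num_ctx": num_ctx}  # smaller context window uses less memory
    if num_thread is not None:
        options["num_thread"] = num_thread  # e.g. the physical core count
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,          # stream tokens as they are generated
        "keep_alive": keep_alive,  # how long the model stays loaded afterwards
        "options": options,
    }


payload = build_chat_payload("llama3", "Hello!", keep_alive="30m", num_thread=8)
print(payload["options"])
```

The resulting dictionary can be passed as the json argument to requests.post("http://localhost:11434/api/chat", ...).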
Frequently Asked Questions
What is Ollama?
Ollama is an open-source tool that lets you run Large Language Models locally on your machine. It simplifies downloading, managing, and running open-source models like Llama 3, Mistral, and Gemma with a single command. It also exposes a local API that is compatible with the OpenAI format.
How much RAM do I need to run Ollama?
It depends on the model size. A 7B parameter model needs about 8GB of RAM (or VRAM). A 13B model needs about 16GB. A 70B model needs about 48GB. Ollama supports CPU-only inference, but it will be significantly slower without a GPU.
Is Ollama free?
Yes, Ollama is completely free and open-source under the MIT license. The models it runs are also free to download (from Meta, Mistral, Google, etc.), though each carries its own license, so check the terms before commercial use. You only pay for the hardware to run them.
Can I use Ollama with LangChain?
Yes, LangChain has a built-in Ollama integration (langchain-community). You can use Ollama models as drop-in replacements for OpenAI in any LangChain chain or agent. The Ollama API is also OpenAI-compatible, so any tool that works with OpenAI's API can point to Ollama's local endpoint.