Ollama Guide — Run LLMs Locally in Minutes

Last updated: June 23, 2026 · 10 min read

Ollama is the easiest way to run open-source LLMs on your own machine. Install it, pick a model, and start chatting — no cloud API keys, no GPU cloud, no vendor lock-in.

What is Ollama?

Ollama is an open-source tool that makes it incredibly easy to run Large Language Models locally on your computer. Think of it as "Docker for LLMs" — it handles downloading model weights, managing different models, optimizing inference for your hardware, and exposing a local API, all through simple commands.

Before Ollama, running a local LLM meant wrestling with Python environments, CUDA drivers, model format conversions (GGUF, GPTQ, AWQ), and inference engines (llama.cpp, vLLM, text-generation-inference). Ollama eliminates all of that. One install, one command, and you are chatting with a local AI.

Why Run LLMs Locally?

Running models locally has several advantages over cloud APIs:

Privacy: Your data never leaves your machine. No API calls, no logs on someone else's server. This is critical for sensitive documents, medical data, or proprietary code.
Cost: After the one-time hardware cost, inference is free. No per-token charges, no monthly bills. Run as many queries as you want.
Offline access: Works without internet. Great for travel, unreliable connections, or air-gapped environments.
No rate limits: No waiting for API quotas. Run concurrent requests, fine-tune prompts, experiment freely.
Customization: Fine-tune models, create custom model configurations, and experiment with different quantization levels.

The trade-off is that local models are generally smaller and less capable than frontier cloud models (GPT-4, Claude). For many tasks, a 7B or 13B local model is good enough. For complex reasoning or coding, you may still want a cloud API. For a comparison of options, see our LLM comparison guide.

Installation

Ollama supports macOS, Linux, and Windows. Installation is a single command or installer.

macOS

# Option 1: Download from ollama.com
# Visit https://ollama.com/download and download the .dmg

# Option 2: Install with Homebrew
brew install ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

This installs the Ollama binary and sets up a systemd service that starts automatically.

Windows

Download the installer from https://ollama.com/download. It installs as a background service with a system tray icon.

Verify Installation

ollama --version
# ollama version is 0.6.x

If you see a version number, Ollama is ready to use.

Running Your First Model

Running a model is a single command. Ollama downloads the model on first run; later runs skip the download and only need to load the model into memory.

Quick Start

# Run Llama 3 (8B, ~4.7GB download)
ollama run llama3

# That's it. You're now chatting with a local LLM.

What Happens Behind the Scenes

Ollama checks if the model is already downloaded. If not, it downloads the model weights from the Ollama registry.
The model is loaded into memory (RAM or VRAM).
An interactive chat session starts in your terminal.
When you exit, the model stays loaded for a few minutes for fast restarts, then unloads to free memory.

Other Popular Models

# Mistral 7B — fast, good at instruction following
ollama run mistral

# Gemma 2 (Google) — strong for its size
ollama run gemma2

# Phi-3 (Microsoft) — surprisingly capable small model
ollama run phi3

# Code Llama — specialized for programming
ollama run codellama

# Qwen 2.5 — leading Chinese-English model
ollama run qwen2.5

Chat in the Terminal

Once the model is loaded, you can chat directly in the terminal:

>>> What is the capital of France?
The capital of France is Paris. It's also the largest city in France
and one of the most populous cities in Europe.

>>> /bye

Type /bye to exit the chat session.

Model Management

Ollama provides simple commands to manage your local models.

List Installed Models

ollama list
# NAME          ID           SIZE    MODIFIED
# llama3:latest 365c0bd3c071  4.7 GB  2 hours ago
# mistral:latest f974a74358d6 4.1 GB  3 days ago
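The same listing is available over the local API at /api/tags, which can be handy inside scripts. The sketch below only formats a response that has already been fetched; the sample dictionary is illustrative data shaped like that response (a "models" list whose entries carry "name" and "size" in bytes), not real output.

```python
def format_models(tags_response):
    """Turn an /api/tags style response into 'name  size' rows."""
    rows = []
    for model in tags_response.get("models", []):
        size_gb = model["size"] / 1e9  # sizes are reported in bytes
        rows.append(f"{model['name']}  {size_gb:.1f} GB")
    return rows


# Illustrative sample only; the byte counts are made up to match the listing above.
sample = {
    "models": [
        {"name": "llama3:latest", "size": 4661224676},
        {"name": "mistral:latest", "size": 4109865159},
    ]
}

for row in format_models(sample):
    print(row)
```

To use it against a live server, fetch http://localhost:11434/api/tags, decode the JSON body, and pass the resulting dictionary to format_models.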

Pull a Model (Download Without Running)

ollama pull llama3:70b

Remove a Model

ollama rm mistral

Copy / Rename a Model

# Create a copy with a custom name
ollama cp llama3 my-custom-llama

Model Tags and Quantization

Each model is published under several tags that select the parameter count and the quantization level. The tag format is model:tag:

# Default (usually Q4 quantized, good balance)
ollama run llama3

# Larger, higher quality (needs more RAM)
ollama run llama3:70b

# Smaller, faster (less RAM, lower quality)
ollama run llama3:8b-instruct-q3_K_S

Quantization reduces model size by using fewer bits per weight. Q4 is the sweet spot for most users: good quality with reasonable memory usage. Q3 is smaller and needs less memory, but output quality drops noticeably. Q5 and Q8 are larger but only marginally better than Q4.
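As a rough worked example of "fewer bits per weight", the size of the weights is approximately parameters × bits per weight ÷ 8. The sketch below assumes the block layouts of the Q4_0 and Q8_0 formats (one 16-bit scale per 32 weights, so 4.5 and 8.5 bits per weight); the function and table names are made up for illustration. Real downloads tend to be somewhat larger, and running a model also needs memory for the context.

```python
# Approximate bits per weight for a few GGUF formats (assumed values).
# Q4_0 and Q8_0 add one 16-bit scale per block of 32 weights.
BITS_PER_WEIGHT = {"q4_0": 4.5, "q8_0": 8.5, "f16": 16.0}


def estimate_weights_gb(params_billion, quant):
    """Rough size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


for quant in BITS_PER_WEIGHT:
    print(f"8B model at {quant}: ~{estimate_weights_gb(8, quant):.1f} GB")
```

For an 8B model this gives about 4.5 GB at Q4_0, 8.5 GB at Q8_0, and 16 GB unquantized, which is why the Q4 default fits on an 8 GB machine while the 16-bit weights do not.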

Choosing the Right Model

The right model depends on your hardware and use case:

Model	Parameters	RAM Needed (approx.)	Best For
Phi-3 Mini	3.8B	4 GB	Low-end hardware, quick tasks
Llama 3	8B	8 GB	General purpose, good balance
Mistral	7B	8 GB	Instruction following, fast
Gemma 2	9B	10 GB	Reasoning, analysis
Qwen 2.5	14B	16 GB	Multilingual, coding
Llama 3	70B	48 GB	Best local quality (needs GPU)

General Recommendations

8 GB RAM: Start with Llama 3 8B or Mistral 7B. These are capable enough for most everyday tasks.
16 GB RAM: Try Qwen 2.5 14B or Llama 3 8B with a higher-precision quantization (Q5/Q6).
48+ GB RAM or a high-memory GPU: Run Llama 3 70B for near-frontier quality. Mixtral 8x7B is an alternative that needs somewhat less memory.
Coding: Use CodeLlama or Qwen 2.5 Coder. They are specifically trained on code.
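The recommendations above amount to a lookup: pick the largest model whose approximate RAM requirement fits your machine. A minimal sketch, using the RAM figures from the table; the tag qwen2.5:14b is an assumption about how that size is named in the Ollama library, and the function name is hypothetical.

```python
# (model tag, approximate RAM needed in GB), ordered from smallest to largest.
# Figures come from the table above; "qwen2.5:14b" is an assumed tag name.
MODELS = [
    ("phi3", 4),
    ("llama3", 8),
    ("gemma2", 10),
    ("qwen2.5:14b", 16),
    ("llama3:70b", 48),
]


def largest_model_that_fits(ram_gb):
    """Return the biggest listed model whose RAM estimate fits, or None."""
    fitting = [tag for tag, needed in MODELS if needed <= ram_gb]
    return fitting[-1] if fitting else None


for ram in (4, 8, 16, 64):
    print(f"{ram} GB -> {largest_model_that_fits(ram)}")
```

This ignores memory used by the operating system and other applications, so in practice it is worth leaving a few gigabytes of headroom.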

Using the Ollama API

Ollama exposes a local REST API on http://localhost:11434. This is the real power — you can integrate local LLMs into any application.

Chat Completion (OpenAI-Compatible)

Ollama also exposes OpenAI-compatible endpoints under /v1, so most tools built for OpenAI's chat API can talk to Ollama:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one paragraph."}
    ]
  }'

Using with Python

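import requests  # third-party: pip install requests

# "stream": False asks for one complete JSON reply instead of a line-by-line stream.
response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3",
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"}
    ],
    "stream": False
}, timeout=120)
response.raise_for_status()  # fail loudly if the server returned an error

print(response.json()["message"]["content"])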
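The chat endpoint is stateless, so a multi-turn conversation means sending the whole message history with every request. The sketch below shows one way to keep that history, using only the standard library; the helper names send_chat and ask are hypothetical, and the server address and model are the ones used above.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def send_chat(messages, model="llama3"):
    """POST the full message history and return the assistant's reply text."""
    payload = json.dumps({"model": model, "messages": messages, "stream": False})
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]


def ask(history, question, send=send_chat):
    """Append the question, fetch a reply, and record both in the history."""
    history.append({"role": "user", "content": question})
    reply = send(history)
    history.append({"role": "assistant", "content": reply})
    return reply


# Usage (requires a running Ollama server with llama3 pulled):
#   history = []
#   print(ask(history, "What is 2 + 2?"))
#   print(ask(history, "Now double that."))
```

Passing the sender as a parameter keeps the history logic separate from the network call, which makes it easy to swap in the OpenAI-compatible endpoint or a stub for testing.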

Using with LangChain

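# Requires: pip install langchain-community
# Newer LangChain releases provide the same integration in the langchain-ollama package (OllamaLLM).
from langchain_community.llms import Ollama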

llm = Ollama(model="llama3")
response = llm.invoke("Explain RAG in simple terms.")
print(response)

For a complete LangChain walkthrough, see our LangChain tutorial.

Using with the OpenAI Python SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Because these endpoints follow the OpenAI format, many OpenAI-compatible tools work after changing only the base_url. Features outside the chat and completion endpoints may not be supported.

Generate Endpoint

For simpler use cases (no chat history), use the /api/generate endpoint. The reply is streamed as one JSON object per line unless the request sets "stream": false:

curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Write a haiku about coding."}'
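When the reply is streamed, each line of the body is a JSON object carrying a fragment of text in "response" and a "done" flag on the final object. A minimal sketch of stitching those fragments together; the function name is hypothetical and the sample lines are made-up data in that shape, not real model output.

```python
import json


def collect_stream(lines):
    """Join the text fragments from a streamed /api/generate reply."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # ignore empty lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the reply
    return "".join(parts)


# Illustrative lines shaped like a streamed reply.
sample = [
    '{"model": "llama3", "response": "Bugs hide", "done": false}',
    '{"model": "llama3", "response": " in plain sight", "done": false}',
    '{"model": "llama3", "response": "", "done": true}',
]
print(collect_stream(sample))
```

In an interactive application you would print each fragment as it arrives instead of joining at the end; the parsing of each line stays the same.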

Open WebUI — Chat Interface

If you want a ChatGPT-like web interface for your local models, Open WebUI (formerly Ollama WebUI) is a popular choice. It runs as a Docker container and connects to your local Ollama instance.

Install with Docker

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in your browser. Create an account (local only, no cloud), and you will see all your Ollama models in a clean chat interface.

Features

Multi-model chat — switch between models mid-conversation
Conversation history and organization
Document upload and RAG
Image generation integration
Web search integration
User management and permissions

Performance Tips

Get the most out of your local LLM setup:

Use a GPU: Ollama automatically uses a supported GPU (NVIDIA, AMD, or Apple Silicon) when one is available. GPU inference is typically several times faster than CPU, often 5-10x.
Choose the right quantization: The default tag is usually a Q4 variant such as Q4_K_M, which is the best balance for most users. Use Q3 for less RAM, Q5 for slightly better quality.
Keep models loaded: The first request is slow because the model has to be loaded into memory. Subsequent requests are fast. Ollama keeps models in memory for 5 minutes by default. Adjust with OLLAMA_KEEP_ALIVE.
Use streaming: For interactive applications, always use streaming. The first token appears quickly even if the full response takes time.
Match model to task: Do not run a 70B model for simple tasks. Use a 7B model for quick Q&A and save the big model for complex reasoning.
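Set the number of threads: The num_thread option, set in the options field of an API request, controls CPU threads. Ollama picks a value automatically; if you override it, use your physical core count (not the hyperthreaded count) for best performance.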
Context length: Longer context uses more memory. If you do not need long conversations, reduce the context window to save RAM.
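Several of these tips can be applied per request rather than globally. The sketch below builds a chat request body that streams, keeps the model loaded longer, and caps the context window. It assumes the request fields keep_alive and options (with num_ctx and num_thread) from the Ollama API; the function name and default values are illustrative choices, not recommendations from this guide.

```python
import json


def build_chat_request(prompt, model="llama3", keep_alive="30m",
                       num_ctx=2048, num_thread=None):
    """Build an /api/chat request body with performance-related settings."""
    options = {"num_ctx": num_ctx}  # smaller context window uses less memory
    if num_thread is not None:
        options["num_thread"] = num_thread  # override the automatic thread count
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,            # first tokens arrive before the reply is complete
        "keep_alive": keep_alive,  # how long the model stays in memory afterwards
        "options": options,
    }


print(json.dumps(build_chat_request("Summarize this log line.", num_thread=8), indent=2))
```

Setting these per request lets a quick Q&A tool use a small context while a document-analysis tool on the same server asks for a larger one.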

Frequently Asked Questions

What is Ollama?

Ollama is an open-source tool that lets you run Large Language Models locally on your machine. It simplifies downloading, managing, and running open-source models like Llama 3, Mistral, and Gemma with a single command. It also exposes a local API that is compatible with the OpenAI format.

How much RAM do I need to run Ollama?

It depends on the model size and quantization. At the default Q4 quantization, a 7B parameter model needs about 8 GB of RAM (or VRAM), a 13B model about 16 GB, and a 70B model about 48 GB. Ollama supports CPU-only inference, but it is significantly slower than running on a GPU.

Is Ollama free?

Yes. Ollama is free and open-source under the MIT license. The models it runs are free to download, but each one carries its own license from its publisher (Meta, Mistral, Google, etc.), so check the terms before commercial use. You only pay for the hardware to run them.

Can I use Ollama with LangChain?

Yes, LangChain has a built-in Ollama integration (langchain-community). You can use Ollama models as drop-in replacements for OpenAI in any LangChain chain or agent. The Ollama API is also OpenAI-compatible, so any tool that works with OpenAI's API can point to Ollama's local endpoint.
