Ollama Guide — Run LLMs Locally in Minutes

Last updated: June 23, 2026 · 10 min read

Ollama is the easiest way to run open-source LLMs on your own machine. Install it, pick a model, and start chatting — no cloud API keys, no GPU cloud, no vendor lock-in.

What is Ollama?

Ollama is an open-source tool for running Large Language Models locally on your computer. Think of it as "Docker for LLMs": it handles downloading model weights, managing different models, optimizing inference for your hardware, and exposing a local API, all through simple commands.

Before Ollama, running a local LLM meant wrestling with Python environments, CUDA drivers, model format conversions (GGUF, GPTQ, AWQ), and inference engines (llama.cpp, vLLM, text-generation-inference). Ollama hides all of that behind a single tool. One install, one command, and you are chatting with a local AI.

Why Run LLMs Locally?

Running models locally has several advantages over cloud APIs: no API keys to manage, no rented cloud GPUs, and no vendor lock-in.

The trade-off is that local models are generally smaller and less capable than frontier cloud models (GPT-4, Claude). For many tasks, a 7B or 13B local model is good enough. For complex reasoning or coding, you may still want a cloud API. For a comparison of options, see our LLM comparison guide.

Installation

Ollama supports macOS, Linux, and Windows. Installation is a single command or installer.

macOS

# Option 1: Download from ollama.com
# Visit https://ollama.com/download and download the .dmg

# Option 2: Install with Homebrew
brew install ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

This installs the Ollama binary and sets up a systemd service that starts automatically.

Windows

Download the installer from https://ollama.com/download. It installs as a background service with a system tray icon.

Verify Installation

ollama --version
# ollama version is 0.6.x

If you see a version number, Ollama is ready to use.
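
The same check can be scripted against the local server rather than the CLI. The sketch below assumes the server is listening on the default http://localhost:11434 and that its /api/version endpoint returns JSON with a "version" field; the function names are illustrative, and only the Python standard library is used.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def parse_version(body: bytes) -> str:
    """Extract the version string from a /api/version JSON response body."""
    return json.loads(body)["version"]


def server_version(base_url: str = "http://localhost:11434",
                   timeout: float = 2.0) -> Optional[str]:
    """Return the running server's version, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError, ValueError, KeyError):
        # Connection refused, timeout, or an unexpected response shape.
        return None


if __name__ == "__main__":
    version = server_version()
    print(f"Ollama server version: {version}" if version else "Ollama server is not reachable")
```

A result of None usually means the background service is not running yet, even if the CLI itself is installed.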

Running Your First Model

Running a model is a single command. Ollama downloads the model on first run; subsequent runs skip the download and only need to load the model into memory.

Quick Start

# Run Llama 3 (8B, ~4.7GB download)
ollama run llama3

# That's it. You're now chatting with a local LLM.

What Happens Behind the Scenes

  1. Ollama checks if the model is already downloaded. If not, it downloads the model weights from the Ollama registry.
  2. The model is loaded into memory (RAM or VRAM).
  3. An interactive chat session starts in your terminal.
  4. When you exit, the model stays loaded for a few minutes (five by default) for fast restarts, then unloads to free memory.
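
To see step 4 in action, you can ask the server which models are still held in memory. This is a minimal sketch, assuming the default address and the /api/ps endpoint, whose response lists loaded models with a "name" and an "expires_at" timestamp; the helper names and the sample payload are illustrative, not real output.

```python
import json
import urllib.request


def loaded_models(ps_payload: dict) -> list[tuple[str, str]]:
    """Return (model name, expiry timestamp) pairs from a /api/ps response."""
    return [(m["name"], m.get("expires_at", "")) for m in ps_payload.get("models", [])]


def fetch_loaded(base_url: str = "http://localhost:11434") -> list[tuple[str, str]]:
    """Ask a running Ollama server which models are currently in memory."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return loaded_models(json.load(resp))


# Illustrative payload in the shape /api/ps returns; the values are made up.
sample = {"models": [{"name": "llama3:latest", "expires_at": "2026-06-23T10:05:00Z"}]}
print(loaded_models(sample))
```

Calling fetch_loaded() shortly after exiting a chat should still list the model; once the keep-alive window passes, the list comes back empty.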

Other Popular Models

# Mistral 7B — fast, good at instruction following
ollama run mistral

# Gemma 2 (Google) — strong for its size
ollama run gemma2

# Phi-3 (Microsoft) — surprisingly capable small model
ollama run phi3

# Code Llama — specialized for programming
ollama run codellama

# Qwen 2.5 — leading Chinese-English model
ollama run qwen2.5

Chat in the Terminal

Once the model is loaded, you can chat directly in the terminal:

>>> What is the capital of France?
The capital of France is Paris. It's also the largest city in France
and one of the most populous cities in Europe.

>>> /bye

Type /bye to exit the chat session.

Model Management

Ollama provides simple commands to manage your local models.

List Installed Models

ollama list
# NAME            ID            SIZE    MODIFIED
# llama3:latest   365c0bd3c071  4.7 GB  2 hours ago
# mistral:latest  f974a74358d6  4.1 GB  3 days ago
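
The same listing is available programmatically. The sketch below assumes the /api/tags endpoint on the default port, which returns a "models" array whose entries carry a "name" and a "size" in bytes; the function names and the sample payload are illustrative.

```python
import json
import urllib.request


def format_models(tags_payload: dict) -> list[str]:
    """Turn a /api/tags response into 'name  size' rows, sizes in decimal GB."""
    rows = []
    for model in tags_payload.get("models", []):
        rows.append(f"{model['name']}  {model['size'] / 1e9:.1f} GB")
    return rows


def fetch_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Fetch and format the installed models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return format_models(json.load(resp))


# Illustrative payload; the byte count is a made-up value near 4.7 GB.
sample = {"models": [{"name": "llama3:latest", "size": 4_700_000_000}]}
print(format_models(sample))
```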

Pull a Model (Download Without Running)

ollama pull llama3:70b
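
Large pulls take a while, so it can help to track progress from a script. This sketch assumes the /api/pull endpoint streams one JSON object per line, with a "status" field and, during downloads, "total" and "completed" byte counts; pull_progress and pull are hypothetical helper names.

```python
import json
import urllib.request


def pull_progress(line: str) -> str:
    """Summarize one streamed /api/pull status line as readable text."""
    event = json.loads(line)
    status = event.get("status", "")
    total, completed = event.get("total"), event.get("completed")
    if total and completed is not None:
        return f"{status}: {100 * completed / total:.0f}%"
    return status


def pull(model: str, base_url: str = "http://localhost:11434") -> None:
    """Start a pull on a running Ollama server and print progress as it streams."""
    request = urllib.request.Request(
        f"{base_url}/api/pull",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        for raw in resp:  # the response body is newline-delimited JSON
            if raw.strip():
                print(pull_progress(raw.decode("utf-8")))


# Illustrative status lines; the numbers are made up.
print(pull_progress('{"status": "pulling manifest"}'))
print(pull_progress('{"status": "downloading", "total": 200, "completed": 50}'))
```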

Remove a Model

ollama rm mistral

Copy / Rename a Model

# Create a copy with a custom name
ollama cp llama3 my-custom-llama

Model Tags and Quantization

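Each model is published under several tags, which select the parameter count and the quantization level. The tag format is model:tag, and omitting the tag selects the default (latest):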

# Default (usually Q4 quantized, good balance)
ollama run llama3

# Larger, higher quality (needs more RAM)
ollama run llama3:70b

# Smaller, faster (less RAM, lower quality)
ollama run llama3:8b-instruct-q3_K_S

Quantization reduces model size by using fewer bits per weight. Q4 is the sweet spot for most users, with good quality and reasonable memory usage. Q3 is smaller and faster but noticeably worse. Q5 and Q8 are larger and usually only marginally better than Q4.
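
The effect of bits per weight on size can be estimated with simple arithmetic: parameters times bits, divided by 8 bits per byte. The helper below is a rough back-of-the-envelope sketch, not how Ollama itself computes sizes; estimate_weight_gb is an illustrative name.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone, in GB: parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


# Compare an 8B-parameter model across common bit widths (16 = unquantized FP16).
for bits in (3, 4, 5, 8, 16):
    print(f"8B model at {bits}-bit: ~{estimate_weight_gb(8, bits):.1f} GB")
```

Real downloads run somewhat larger than this estimate (the default llama3 download is about 4.7 GB rather than 4.0 GB), because quantized formats store extra scaling data alongside the weights. Running the model also needs memory for the context on top of the weights.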

Choosing the Right Model

The right model depends on your hardware and use case:

Model        Size   RAM Needed   Best For
Phi-3 Mini   3.8B   4 GB         Low-end hardware, quick tasks
Llama 3      8B     8 GB         General purpose, good balance
Mistral      7B     8 GB         Instruction following, fast
Gemma 2      9B     10 GB        Reasoning, analysis
Qwen 2.5     14B    16 GB        Multilingual, coding
Llama 3      70B    48 GB        Best local quality (needs GPU)
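
As a quick way to apply the table above, the sketch below picks the most demanding entry that still fits a given amount of memory. The RAM figures are copied from the table; the selection rule (largest requirement that fits, first listed wins a tie) is an assumption made for illustration, not an official recommendation.

```python
from typing import Optional

# (model, parameter count, RAM needed in GB), copied from the table above.
MODELS = [
    ("Phi-3 Mini", "3.8B", 4),
    ("Llama 3", "8B", 8),
    ("Mistral", "7B", 8),
    ("Gemma 2", "9B", 10),
    ("Qwen 2.5", "14B", 16),
    ("Llama 3", "70B", 48),
]


def largest_fit(ram_gb: float) -> Optional[tuple[str, str, int]]:
    """Return the most memory-hungry table entry that still fits, or None."""
    fitting = [m for m in MODELS if m[2] <= ram_gb]
    # max() keeps the first entry on ties, so table order breaks them.
    return max(fitting, key=lambda m: m[2]) if fitting else None


for ram in (2, 8, 16, 64):
    print(f"{ram} GB available -> {largest_fit(ram)}")
```

In practice you would leave some headroom for the operating system and other applications rather than matching available RAM exactly.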

General Recommendations

Using the Ollama API

Ollama exposes a local REST API on http://localhost:11434. This is the real power — you can integrate local LLMs into any application.

Chat Completion (OpenAI-Compatible)

Ollama exposes OpenAI-compatible endpoints under /v1, so most tools that speak OpenAI's chat completions format can use Ollama:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one paragraph."}
    ]
  }'

Using with Python

import requests

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3",
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"}
    ],
    "stream": False
})

print(response.json()["message"]["content"])
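
With "stream" left at its default of true, /api/chat sends the reply as newline-delimited JSON chunks instead of one object. The sketch below shows one way to stitch those chunks back together, assuming each chunk carries a partial "message" and a "done" flag; collect_stream is a hypothetical helper, and the sample lines are illustrative rather than captured output.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Join the partial message contents of a streamed /api/chat response."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of the reply
    return "".join(parts)


# Illustrative chunks in the streamed shape; the text is made up.
sample_lines = [
    '{"message": {"role": "assistant", "content": "2 + 2 "}, "done": false}',
    '{"message": {"role": "assistant", "content": "is 4."}, "done": true}',
]
print(collect_stream(sample_lines))
```

With the requests library, the lines could come from requests.post(..., stream=True).iter_lines(decode_unicode=True), which lets you print tokens as they arrive instead of waiting for the full reply.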

Using with LangChain

# Requires the langchain-community package (pip install langchain-community)
from langchain_community.llms import Ollama

llm = Ollama(model="llama3")
response = llm.invoke("Explain RAG in simple terms.")
print(response)

For a complete LangChain walkthrough, see our LangChain tutorial.

Using with the OpenAI Python SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Because the API is OpenAI-compatible, you can use most OpenAI-compatible tools by changing the base_url. The SDK requires an api_key value, but Ollama ignores it.

Generate Endpoint

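For simpler use cases (no chat history), use the /api/generate endpoint. By default the response streams back as a series of JSON objects, one per line; add "stream": false to the request body to get a single JSON object instead: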

curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Write a haiku about coding."}'
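
The same call from Python might look like the sketch below. It uses only the standard library, sets "stream" to false so the server returns one JSON object, and assumes the generated text arrives in the "response" field; build_generate_request and generate are illustrative names.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str,
                           base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming POST request for the /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.load(resp)["response"]


if __name__ == "__main__":
    request = build_generate_request("llama3", "Write a haiku about coding.")
    print(request.get_method(), request.full_url)
```

Calling generate("llama3", "Write a haiku about coding.") requires the server to be running and the model to be pulled already.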

Open WebUI — Chat Interface

If you want a ChatGPT-like web interface for your local models, Open WebUI (formerly Ollama WebUI) is a popular option. It runs as a Docker container and connects to your local Ollama instance.

Install with Docker

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in your browser. Create an account (local only, no cloud), and you will see all your Ollama models in a clean chat interface.

Features

Performance Tips

Get the most out of your local LLM setup:

Frequently Asked Questions

What is Ollama?

Ollama is an open-source tool that lets you run Large Language Models locally on your machine. It simplifies downloading, managing, and running open-source models like Llama 3, Mistral, and Gemma with a single command. It also exposes a local API that is compatible with the OpenAI format.

How much RAM do I need to run Ollama?

It depends on the model size. A 7B parameter model needs about 8GB of RAM (or VRAM). A 13B model needs about 16GB. A 70B model needs about 48GB. Ollama supports CPU-only inference, but it will be significantly slower without a GPU.

Is Ollama free?

Yes, Ollama is completely free and open-source under the MIT license. The models it runs are also free to download (open-weight models from Meta, Mistral, Google, etc.), though each model carries its own license terms. You only pay for the hardware to run them.

Can I use Ollama with LangChain?

Yes. LangChain provides an Ollama integration through the langchain-community package, so you can use Ollama models in place of OpenAI models in LangChain chains and agents. Ollama also exposes OpenAI-compatible endpoints, so most tools that work with OpenAI's API can point to Ollama's local endpoint.