OpenClaw with Ollama: Running Local AI Models
Configure OpenClaw with Ollama to run AI models locally. Complete guide for privacy-first setups with no API costs — setup, model recommendations, and performance.
OpenClaw with Ollama enables a fully local AI agent: no API costs, no data sent to external servers, and no dependency on third-party service availability. Ollama runs large language models directly on your hardware and exposes an OpenAI-compatible API that OpenClaw connects to seamlessly. This guide covers setup, model selection, and realistic performance expectations.
Requirements
How capable a model you can run locally is determined mainly by available RAM. As a rough guide:
| RAM | Best Model | Quality Level |
|---|---|---|
| 8GB | Llama 3.2 3B, Phi-3.5 mini | Good for simple tasks |
| 16GB | Llama 3.1 8B, Mistral 7B | Very good general use |
| 32GB | Llama 3.1 70B Q4, Qwen 2.5 32B | Near GPT-3.5 quality |
| 64GB+ | Llama 3.1 70B full, Qwen 2.5 72B | Excellent for most tasks |
Apple Silicon (M2/M3/M4) unified memory makes Macs particularly efficient at local inference: a Mac Mini M4 Pro with 24GB runs Llama 3.1 8B very comfortably, and 64GB configurations can handle Llama 3.1 70B at 4-bit quantization (roughly 40GB of weights).
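If you are not sure how much memory a machine has, check before pulling a large model; a failed or unusable download wastes tens of gigabytes. These are standard system commands:
# macOS: total memory in bytes
sysctl -n hw.memsize
# Linux: human-readable memory summary
free -h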
Installing Ollama
# macOS or Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Windows: download installer from ollama.ai
# Verify installation
ollama --version
Pulling Models
# General purpose, excellent for most agent tasks
ollama pull llama3.1:8b
# Lightweight and fast, good for high-frequency tasks
ollama pull phi3.5
# Strong coding-focused model
ollama pull qwen2.5-coder:7b
# More capable, needs 32GB+ RAM
ollama pull llama3.1:70b
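After a pull finishes, it is worth confirming what is installed and running a quick smoke test; the prompt below is just an illustrative example:
# List downloaded models and their sizes
ollama list
# One-off prompt to confirm the model loads and responds
ollama run llama3.1:8b "Explain what a webhook is in one sentence."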
Running Ollama as a Service
# Start Ollama server (it listens on localhost:11434)
ollama serve
On macOS, Ollama starts automatically as a menu bar app. On Linux, enable the systemd service:
sudo systemctl enable ollama
sudo systemctl start ollama
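Before pointing OpenClaw at the server, confirm it is actually listening on port 11434. Ollama's HTTP API includes version and model-listing endpoints you can hit with curl:
# Should return the server version as JSON
curl http://localhost:11434/api/version
# Lists the models the server can serve
curl http://localhost:11434/api/tags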
Configuring OpenClaw for Ollama
{
  "llm": {
    "provider": "ollama",
    "base_url": "http://localhost:11434",
    "model": "llama3.1:8b",
    "temperature": 0.7,
    "max_tokens": 4096
  }
}
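OpenClaw talks to Ollama through its OpenAI-compatible endpoint, so you can reproduce roughly what the agent sends with a plain curl request. A minimal sketch using the same model and base_url as the config above (the prompt is illustrative):
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Reply with OK if you can read this."}],
    "temperature": 0.7,
    "max_tokens": 256
  }'
If this returns a completion, any remaining connection problem is in the OpenClaw config rather than in Ollama.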
Performance Expectations
Local models are slower than cloud APIs for generation. On a Mac Mini M4 16GB:
| Model | Tokens/second | Typical response time |
|---|---|---|
| Phi-3.5 mini | 60–80 tok/s | 1–3 seconds |
| Llama 3.1 8B | 40–55 tok/s | 3–8 seconds |
| Llama 3.1 70B Q4 | 8–15 tok/s | 20–60 seconds |
For an agent that processes a few messages per hour, these speeds are perfectly acceptable.
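To measure what your own hardware does rather than relying on the table, ollama run can print timing statistics; the eval rate line is the tokens-per-second figure:
# --verbose prints load time, prompt eval rate, and eval rate (tok/s)
ollama run llama3.1:8b --verbose "Write a haiku about local inference."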
Hybrid Setup: Local + Cloud Fallback
Many users combine Ollama for routine tasks with a cloud LLM for complex reasoning:
{
  "llm": {
    "provider": "ollama",
    "model": "llama3.1:8b",
    "fallback_provider": "anthropic",
    "fallback_model": "claude-3-5-sonnet-20241022",
    "fallback_api_key": "sk-ant-...",
    "fallback_on_complexity": true
  }
}
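How the complexity-based routing behaves is up to OpenClaw; the cloud credential itself can be sanity-checked independently with a direct call to the Anthropic Messages API (the key is read from an environment variable here rather than hard-coded):
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "claude-3-5-sonnet-20241022", "max_tokens": 64, "messages": [{"role": "user", "content": "ping"}]}'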
nacre.sh with Ollama
nacre.sh supports connecting to an external Ollama instance via the dashboard's LLM provider settings. Your Ollama endpoint must be publicly accessible or within the same network as your nacre.sh instance.
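By default Ollama binds only to localhost, so a remote nacre.sh instance cannot reach it. The bind address is controlled by the OLLAMA_HOST environment variable; a sketch for exposing it on a Linux host (handle firewalling and access control yourself, and avoid exposing an unauthenticated Ollama port to the open internet):
# One-off: listen on all interfaces instead of only localhost
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Persistent with systemd: add an override, then restart
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama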
Frequently Asked Questions
Which Ollama model is best for OpenClaw tasks?
For general agent use, Llama 3.1 8B is the best combination of quality and speed on 16GB+ systems. For coding tasks, Qwen 2.5 Coder 7B is excellent.
Can Ollama use my GPU?
Yes. Ollama automatically uses NVIDIA GPUs (via CUDA), AMD GPUs (via ROCm), and Apple Silicon's Metal framework. GPU acceleration significantly increases generation speed.
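To check where a loaded model is actually running, ollama ps reports the placement for each loaded model:
# Shows loaded models and whether they run on GPU, CPU, or a mix
ollama ps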
Is a locally running agent completely private?
Yes. Messages never leave your device when using Ollama with a local OpenClaw instance. This is the maximum privacy configuration available.
nacre.sh
Run OpenClaw without the server headaches
Dedicated instance, automatic TLS, nightly backups, and 290+ LLM integrations. Live in under 90 seconds from $12/month.
Deploy your agent →