Last week, a developer on Reddit posted their OpenClaw API bill: $47 in a single day. Their agent was running Claude Opus for everything — including tasks that a $0.15/M-token model handles just fine. The week before, another user complained that their local Llama 8B model kept stalling mid-task, forcing them to restart every third command.
Both problems have the same root cause: picking the wrong LLM for OpenClaw.
Unlike a simple chatbot where model choice barely matters, OpenClaw runs multi-step autonomous loops. Your agent might chain 8-12 tool calls in a single session — reading files, calling APIs, writing code, sending messages. If the model loses context at step 6 or fumbles a function call, the entire chain breaks. An overpowered model drains your API budget in minutes; an underpowered one fails mid-task.
This guide breaks down exactly which models to use for which tasks, based on real-world testing, community consensus, and current pricing data (March 2026). Whether you're optimizing for cost, capability, or privacy — you'll find your answer here.
- Best Overall: Claude Sonnet 4 — $3/$15 per M tokens, handles 80% of tasks
- Best for Coding: Claude Opus 4.5 — $15/$75, best multi-file debugging
- Best for Research: Gemini 3 Pro — $1.25/$10, 1M+ token context window
- Best Budget: GPT-4o-mini — $0.15/$0.60, 20x cheaper than Sonnet
- Best Free/Local: Qwen3.5 27B via Ollama — $0, matches GPT-5 Mini on SWE-bench
- Best for Privacy: Qwen3 Coder or Llama 3.3 70B — open-source, self-hostable
What Is OpenClaw (and Why Model Choice Matters)
OpenClaw (formerly Clawdbot) is a free and open-source AI agent developed by Austrian developer Peter Steinberger. It hit 100,000 GitHub stars in February 2026 — one of the fastest-growing open-source projects in AI history — the same month Steinberger joined OpenAI to continue his work on autonomous agents at a larger scale.
What makes OpenClaw different from a standard chatbot:
- Runs on your machine — Mac, Windows, or Linux. Your data stays local by default
- Any chat app — Telegram, WhatsApp, Discord, Slack, Signal, or iMessage
- Persistent memory — Remembers your preferences and context across sessions (via MEMORY.md)
- Full system access — Read/write files, run shell commands, execute scripts
- Browser control — Browse the web, fill forms, extract data
- Skills & plugins — Extend with community skills or build your own
The model powers everything. Every email it sends, every file it reads, every API call it makes goes through the LLM. A weak model at step 8 of a 12-step task means starting over. That's why model choice matters more in OpenClaw than in almost any other AI tool.
If you're new to OpenClaw, check out our OpenClaw trend analysis for a deeper look at why this project went viral.
What Makes a Model Work Well in OpenClaw
Now that you understand what OpenClaw does, let's look at what separates a model that works from one that doesn't.
Most AI benchmarks test single-turn responses. OpenClaw tasks are fundamentally different — a research agent might run 8-12 sequential tool calls, and the model needs to stay coherent through all of them.
Three capabilities matter most:
Tool-Calling Accuracy
OpenClaw's skills use structured function calls. The model must invoke shell commands and APIs with exact parameter shapes. If it fumbles the JSON schema or hallucinates a tool name, the agent stalls.
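To make "exact parameter shapes" concrete, here is a minimal sketch of the kind of validation that sits between the model and a skill. The schema and tool name are hypothetical stand-ins for illustration, not OpenClaw's actual wire format:

```python
import json

# Hypothetical schema for a ClawHub-style skill. The shape is illustrative;
# real skills define their own parameter schemas.
READ_FILE_SCHEMA = {"name": "read_file", "required": ["path"]}

def validate_tool_call(raw: str, schema: dict) -> bool:
    """Reject calls with malformed JSON, a wrong tool name, or missing params."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False  # model emitted broken JSON — the agent would stall here
    if call.get("name") != schema["name"]:
        return False  # hallucinated tool name
    return all(k in call.get("arguments", {}) for k in schema["required"])

print(validate_tool_call('{"name": "read_file", "arguments": {"path": "notes.md"}}', READ_FILE_SCHEMA))  # well-formed call
print(validate_tool_call('{"name": "read_files", "arguments": {}}', READ_FILE_SCHEMA))  # wrong name, missing path
```

A model that fails this kind of check even once in an 8-12 call chain breaks the whole session, which is why tool-calling accuracy outranks raw intelligence here.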
Context Retention
SOUL.md, AGENTS.md, USER.md, and MEMORY.md all load into context at startup. Add the conversation history and tool outputs, and you're easily at 10,000+ tokens before the agent does anything. The model needs to track all of this without losing the thread 50 messages in.
Instruction Adherence
SOUL.md sets behavioral rules — what the agent can and can't do, how it should respond, which tools to prefer. Weaker models drift from these rules mid-session, producing unpredictable behavior.
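For concreteness, here is a hypothetical fragment of the kind of rules a SOUL.md carries — the specific rules are invented for illustration, not taken from a real config:

```markdown
## Rules (hypothetical excerpt)
- Never send email without explicit user confirmation
- Prefer the web-search skill over raw browser control
- Keep chat replies under 200 words unless asked for detail
```

A strong model holds every one of these constraints across a 50-message session; a weak one starts ignoring them after a few tool calls.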
Price vs Capability vs Privacy — the tradeoffs
- Cloud APIs (Anthropic, OpenAI, Google) offer the best capability but your prompts hit external servers
- Open-source models via API providers (haimaker.ai) offer a middle ground — lower cost, better privacy compliance
- Self-hosted local models (Ollama) are free and fully private, but require hardware and tolerate higher latency
You can optimize for two of three: price, capability, privacy. Rarely all three. Most users should pick the two that matter most and accept the tradeoff on the third.
Best Models for OpenClaw by Use Case
With those criteria in mind, here's how the top models stack up across the most common OpenClaw workflows.
Best Overall: Claude Sonnet 4
Price: $3/$15 per million tokens (input/output)
Claude Sonnet 4 is the safest default for new OpenClaw setups. It handles SOUL.md instructions better than any other model at its price point.
In a 12-step research agent test comparing Sonnet and GPT-4o on the same task, Sonnet stayed within SOUL.md scope on 9 out of 12 runs. GPT-4o drifted on 3, pulling in sources that were explicitly excluded.
Sonnet excels at:
- Long SOUL.md files (5,000+ tokens) with many behavioral rules
- Research agents that synthesize structured reports from multiple sources
- Writing agents that maintain consistent tone across multi-step drafts
- General-purpose ClawHub skills from the marketplace
Strengths:
- Best instruction-following at mid-tier pricing
- Fast enough for real-time Telegram/WhatsApp chat
- Handles 80% of typical assistant tasks without breaking the bank
- Strong tool-calling reliability

Limitations:
- Not the cheapest option for simple, repetitive tasks
- Opus outperforms it on very complex multi-file coding
- Context window smaller than Gemini 3 Pro
Configuration:
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-sonnet-4-20250514"
      }
    }
  }
}
```
Best for Coding: Claude Opus 4.5
Price: $15/$75 per million tokens (input/output)
When the code needs to actually work — multi-file edits, complex debugging, architectural decisions — Opus 4.5 is worth the premium. It handles multi-step reasoning chains that Sonnet sometimes drops.
The cost-effective alternative: enable extended thinking on Sonnet 4. You pay more per reasoning token only when the task demands it, instead of paying Opus rates for everything.
Use Opus for complex debugging sessions, multi-file refactors, and architectural planning. For everything else, Sonnet with extended thinking gives you 80% of Opus capability at a fraction of the cost.
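One plausible shape for that Sonnet-with-thinking setup is to point the thinking slot at Sonnet itself — treat the field names below as illustrative rather than verified against the current schema:

```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-sonnet-4-20250514",
        "thinking": "anthropic/claude-sonnet-4-20250514"
      }
    }
  }
}
```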
Best for Research & Long Documents: Gemini 3 Pro
Price: ~$1.25/$10 per million tokens (input/output)
Gemini 3 Pro's killer feature is its 1M+ token context window. You can throw an entire codebase at it and ask it to find the bug. For long-document analysis, contract review, or codebase Q&A, nothing else comes close.
Gemini 3 Flash (~$0.075/$0.30) is the speed/cost option — cheap, fast, and surprisingly capable for simpler tasks. Google also offers a free tier for Flash.
Configuration for Gemini:
```json
{
  "models": {
    "providers": {
      "haimaker": {
        "models": [
          { "id": "google/gemini-3-pro", "name": "Gemini 3 Pro" },
          { "id": "google/gemini-3-flash", "name": "Gemini 3 Flash" }
        ]
      }
    }
  }
}
```
Best Budget Options
Not every task needs a $15/M-token model. For high-volume, simple tasks, lightweight models cut costs by 10-20x without meaningful quality loss.
| Model | Price (Input/Output per M tokens) | Best For |
|---|---|---|
| GPT-4o-mini | ~$0.15/$0.60 | Simple queries, template filling |
| Claude Haiku 3.5 | ~$0.25/$1.25 | Formatting, classification, tagging |
| MiniMax M2.5 | ~$0.10/$0.50 | High-volume simple automation |
| Gemini 3 Flash | ~$0.075/$0.30 | Speed-critical tasks, free tier available |
If your agent does something like: read a CSV row → apply a template → write an output file, a lightweight model handles it faster and cheaper. Save the premium models for tasks requiring judgment.
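That CSV-to-template pipeline is almost entirely mechanical — here is a sketch of the deterministic part, to show how little judgment the task actually requires (field names are invented for the example):

```python
import csv
import io

# The whole task is mechanical: read a row, fill a template, emit a line.
# A $0.15/M-token model handles the fuzzy parts; the rest needs no model at all.
TEMPLATE = "Invoice {id}: {customer} owes ${amount}"

def fill_rows(csv_text: str) -> list[str]:
    rows = csv.DictReader(io.StringIO(csv_text))
    return [TEMPLATE.format(**row) for row in rows]

data = "id,customer,amount\n7,Acme,120\n8,Globex,45\n"
print(fill_rows(data))
# → ['Invoice 7: Acme owes $120', 'Invoice 8: Globex owes $45']
```

If a task reduces to plumbing like this, routing it through Opus is pure waste.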
Best Free & Local Models for OpenClaw (Ollama)
Cloud APIs aren't the only option. If you have the hardware — or want zero API costs — local models have gotten surprisingly good.
Running models locally through Ollama costs nothing and keeps your data entirely on your machine. The tradeoff is hardware requirements and slightly lower capability on hard tasks.
Top Local Models Ranked
| Rank | Model | SWE-bench | Speed (RTX 4090) | VRAM Required |
|---|---|---|---|---|
| 1 | Qwen3.5 27B | 72.4% | ~40 t/s | 20-24GB |
| 2 | Qwen3.5 35B-A3B (MoE) | Lower | ~112 t/s | 8-16GB |
| 3 | Qwen3 Coder Plus | 70.6% | ~20 t/s | 48GB+ |
| 4 | Qwen3.5 9B | Basic | ~80 t/s | 8GB |
Qwen3.5 27B is the standout — its 72.4% SWE-bench score puts it in the same range as GPT-5 Mini, a cloud model you'd normally pay per token for. On a single consumer GPU or a 32GB M-series Mac, you get cloud-quality results for free.
The 35B-A3B is a mixture-of-experts model that only activates 3B parameters per forward pass. It runs at ~112 tokens/second on an RTX 4090 — fast enough to feel like a cloud API. Quality is lower on hard problems, but for boilerplate generation and simple edits, it's excellent.
Hardware Requirements
| Tier | VRAM | Hardware Examples | Recommended Models |
|---|---|---|---|
| Entry | 8-16GB | RTX 3070/4060, 16GB M1/M2 MacBook | Qwen3.5 9B, Qwen3.5 35B-A3B |
| Recommended | 20-24GB | RTX 4090, 32GB M2/M3 Pro/Max | Qwen3.5 27B |
| Premium | 48GB+ | 2x A6000, 64GB+ M2/M3 Ultra | Qwen3 Coder Plus, Llama 3.3 70B |
If you're on an Apple Silicon Mac, unified memory works well for LLM inference. Apple has been optimizing Metal for LLM workloads. A 32GB M3 Pro runs Qwen3.5 27B comfortably.
How to Set Up Ollama with OpenClaw
Step 1: Install Ollama and pull a model:
```shell
curl -fsSL https://ollama.com/install.sh | sh

ollama pull qwen3.5:27b      # Best quality, needs 20GB+ VRAM
# OR
ollama pull qwen3.5:35b-a3b  # Fast MoE model, runs on 16GB
# OR
ollama pull qwen3.5:9b       # Lightweight, runs on 8GB
```
Step 2: Configure OpenClaw:
Run the onboarding wizard:
```shell
openclaw onboard --auth-choice ollama
```
Or add Ollama manually in ~/.openclaw/openclaw.json:
```json
{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3.5:27b",
            "name": "Qwen3.5 27B",
            "reasoning": false,
            "contextWindow": 131072,
            "maxTokens": 8192
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/qwen3.5:27b"
      }
    }
  }
}
```
Step 3: Switch to your local model:
```
/model ollama/qwen3.5:27b
```
What Local Models Handle Well (and Where They Fall Short)
Strong at:
- Reading and summarizing code
- Boilerplate and CRUD code generation
- File operations and simple refactoring
- Agentic tool calling (Qwen3.5 27B BFCL-V4 score: 72.2)
Weak at:
- Multi-file refactors (5+ files across different contexts)
- Complex debugging across abstraction layers
- Speed on dense models (~40 t/s vs cloud API 80-150 t/s)
- Very long context (quality degrades past ~32K tokens on consumer hardware)
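One way to act on that context ceiling is to route by estimated prompt size. The four-characters-per-token estimate is a rough rule of thumb, and the model ids simply mirror this guide's examples:

```python
# Rough heuristic: ~4 characters per token for English text. The threshold
# mirrors the ~32K-token point where local quality starts to degrade.
LOCAL_LIMIT_TOKENS = 32_000

def estimate_tokens(text: str) -> int:
    return len(text) // 4

def pick_model(prompt: str) -> str:
    if estimate_tokens(prompt) < LOCAL_LIMIT_TOKENS:
        return "ollama/qwen3.5:27b"  # short prompts stay on the free local model
    return "google/gemini-3-pro"     # long-context work goes to a cloud model

print(pick_model("short prompt"))  # → ollama/qwen3.5:27b
print(pick_model("x" * 200_000))   # → google/gemini-3-pro
```

For accurate counts you'd use a real tokenizer, but even this crude cutoff keeps local models inside the range where they perform well.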
Best OpenAI Models for OpenClaw
We've covered Anthropic's lineup and local options. Here's where OpenAI fits in.
OpenAI's models offer solid general performance with fast response times. Here's how they stack up:
GPT-4o — The Coding & Tool-Calling Specialist
Price: Mid-tier (~$2.50/$10 per million tokens)
GPT-4o's function-calling accuracy on structured schemas is slightly higher than Claude's. It produces cleaner JSON outputs from ClawHub skills that return raw data, making it the go-to for coding agents and data extraction pipelines.
Best for:
- Code generation and debugging agents
- Structured data extraction (HTML tables, JSON transformations)
- Multi-tool orchestration with strict output schemas
- Tasks where response speed matters more than instruction adherence
GPT-4o-mini — The Budget Workhorse
Price: ~$0.15/$0.60 per million tokens
At 20x cheaper than Sonnet, GPT-4o-mini is the right choice for simple, high-volume tasks. Quality drops on anything requiring nuanced reasoning, but for template filling, classification, and formatting — it's hard to beat the value.
o3-mini — The Deep Reasoner
Price: Higher, with per-reasoning-token billing
For analytical agents that work through multi-step logic — financial analysis, scientific data interpretation, complex research synthesis — o3-mini at medium or high reasoning mode handles problems other models can't. It's slower (20-40 seconds per response) and more expensive, so use it only for specialized tasks, not as a daily driver.
The Hybrid Approach: Mix Cloud and Local
You don't have to pick one model and stick with it. In fact, the smartest approach is not to.
Most experienced OpenClaw users run a hybrid setup: local models for cheap stuff, cloud for hard stuff.
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/qwen3.5:27b",
        "thinking": "anthropic/claude-sonnet-4-20250514"
      }
    }
  }
}
```
The local model handles file reads, simple edits, and boilerplate — roughly 60-70% of a typical session. Sonnet handles debugging, architecture decisions, and multi-file work. Your daily API bill drops from $20-50 to around $5.
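A back-of-envelope check on those numbers, using this guide's Sonnet pricing ($3 in / $15 out per million tokens, local model at $0). The token counts are invented for illustration, and real savings scale with how much of a session the local model absorbs:

```python
# Sonnet 4 pricing from this guide; the local model contributes $0.
SONNET_IN, SONNET_OUT = 3.0, 15.0  # $ per million tokens

def session_cost(in_tok: int, out_tok: int, cloud_share: float) -> float:
    """cloud_share = fraction of tokens that still hit the cloud model."""
    cloud_in = in_tok * cloud_share
    cloud_out = out_tok * cloud_share
    return (cloud_in * SONNET_IN + cloud_out * SONNET_OUT) / 1_000_000

# A heavy day: 5M input tokens, 1M output tokens.
print(round(session_cost(5_000_000, 1_000_000, 1.0), 2))   # all-cloud → 30.0
print(round(session_cost(5_000_000, 1_000_000, 0.33), 2))  # hybrid → 9.9
```

Shift more of the routine work local and the cloud share — and the bill — keeps falling.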
Switch manually when you know a task needs more capability:
```
/model sonnet
```
Use a cheap model for simple tasks, a mid-tier model for daily work, and a premium model for the hard stuff. Start with Claude Sonnet 4 as your default, and switch to Opus or a local model as needed.
Provider Comparison
| Provider | Price Range (per M output tokens) | Best For | Privacy |
|---|---|---|---|
| Anthropic (Claude) | $3–$75 | Tool calling, instruction following | No training on API data by default |
| OpenAI (GPT) | $0.60–$15 | Coding, structured data, speed | Standard data processing |
| Google (Gemini) | $1.25–$10 | Long documents, massive context | Google Cloud data policies |
| Open-source via haimaker.ai | $0.10–$5 | Cost optimization, privacy compliance | Routes across GPU providers |
| Ollama (local) | Free | Full privacy, no API costs | Data never leaves your machine |
Community Rankings (March 2026)
Beyond our own testing, here's what the broader OpenClaw community is actually using.
The PricePerToken community leaderboard tracks real-world model preferences from OpenClaw developers. As of March 27, 2026:
1. Kimi K2.5 — Top community vote
2. Claude Opus 4.5 — Premium choice
3. GLM 4.7 — Strong contender from Zhipu
4. Gemini 3 Flash Preview — Speed + value
5. Claude Opus 4.6 — Latest premium
6. Claude Sonnet 4.5 — Balanced choice
7. GPT-5.2 — OpenAI's latest
8. DeepSeek V3 — Open-source value
9. MiniMax M2.1 — Budget champion
10. Mixtral 8x7B Instruct — Classic open-source
Reddit's r/LocalLLaMA consistently recommends Qwen3.5 27B as the best local model, with multiple threads reporting successful setups on consumer hardware.
Looking for alternatives to OpenClaw itself? See our best OpenClaw alternatives guide.
Quick Decision Tree
- "I just want something that works" → Claude Sonnet 4. Handles 80% of tasks, reasonable pricing
- "I'm writing production code" → Claude Opus 4.5. Worth the premium for complex debugging
- "I need to process long documents" → Gemini 3 Pro. 1M+ tokens of context
- "I need it free" → Qwen3.5 27B via Ollama, or Gemini Flash free tier
- "I need it cheap" → MiniMax M2.5 or GPT-4o-mini
- "Data privacy is critical" → Qwen3 Coder / Llama 3.3 70B via haimaker.ai, or self-host with Ollama
- "I use OpenClaw on Telegram" → Claude Sonnet 4 as default (any supported model works)
FAQ
What is the best model for OpenClaw beginners?
Claude Sonnet 4. It forgives imperfect SOUL.md files better than GPT-4o, and its instruction-following means agents are less likely to break on early configuration mistakes. Once you've dialed in your config, consider whether a lighter model fits your specific tasks.
Can I use different models for different agents in OpenClaw?
Not natively within a single OpenClaw instance. The model set in openclaw.json applies to all agents running through that gateway. The workaround is running separate gateway instances with different configs, or using the /model command to switch mid-session.
Why does my OpenClaw agent keep failing with local models?
Tool-calling accuracy is the most common cause. Smaller models like Llama 3.1 8B and Mistral 7B sometimes malform ClawHub skill calls, causing the agent to stall or retry indefinitely. Switching to Qwen3.5 27B or a cloud model like Claude Haiku resolves this in most cases.
Is Claude Opus worth the cost for OpenClaw?
For most users, no. Opus costs 5-10x more than Sonnet per session, and the practical performance difference in typical OpenClaw tasks is small. The advantage shows up in very long, complex reasoning chains — not in standard research or writing workflows.
What's the cheapest way to run OpenClaw?
Local models through Ollama cost nothing — Qwen3.5 27B runs on consumer hardware and matches cloud models on many tasks. For cloud APIs, Gemini 3 Flash ($0.075/$0.30 per M tokens) and GPT-4o-mini ($0.15/$0.60) are the cheapest capable options.
How do I switch models in OpenClaw?
Use the /model command mid-session: /model opus, /model haimaker/llama-3.3-70b, or /model ollama/qwen3.5:27b. To change the default, edit the model.primary field in ~/.openclaw/openclaw.json.
Does switching models affect my MEMORY.md files?
No. MEMORY.md is plain text that OpenClaw reads and injects into context regardless of which model is configured. Session memories carry over cleanly when you switch models.
Which model works best for OpenClaw on Telegram?
Any supported model works with Telegram — the channel and model are independent. Claude Sonnet 4 is the recommended default for Telegram because it balances speed, cost, and instruction-following for chat-based interactions. For budget Telegram setups, GPT-4o-mini handles simpler tasks well.
Can I use OpenClaw without an API key?
Yes, if you run local models through Ollama. You don't need any external API key — everything runs on your hardware. For cloud models, you'll need an API key from the respective provider (Anthropic, OpenAI, Google, or haimaker.ai).
What hardware do I need for local models?
At minimum: 8GB VRAM (RTX 3070 or 16GB M1 Mac) for Qwen3.5 9B. Recommended: 20-24GB VRAM (RTX 4090 or 32GB M-series Mac) for Qwen3.5 27B. Premium: 48GB+ VRAM for Qwen3 Coder Plus or Llama 3.3 70B.


