Last week, a developer on Reddit posted their OpenClaw API bill: $47 in a single day. Their agent was running Claude Opus for everything — including tasks that a $0.15/M-token model handles just fine. The week before, another user complained that their local Llama 8B model kept stalling mid-task, forcing them to restart every third command.
Both problems have the same root cause: picking the wrong LLM for OpenClaw.
Unlike a simple chatbot where model choice barely matters, OpenClaw runs multi-step autonomous loops. Your agent might chain 8-12 tool calls in a single session — reading files, calling APIs, writing code, sending messages. If the model loses context at step 6 or fumbles a function call, the entire chain breaks. An overpowered model drains your API budget in minutes; an underpowered one fails mid-task.
This guide breaks down exactly which models to use for which tasks, based on real-world testing, community consensus, and current pricing data (March 2026). Whether you're optimizing for cost, capability, or privacy — you'll find your answer here.
- Best Overall: Claude Sonnet 4 — $3/$15 per M tokens, handles 80% of tasks
- Best for Coding: Claude Opus 4.5 — $15/$75, best multi-file debugging
- Best for Research: Gemini 3 Pro — $1.25/$10, 1M+ token context window
- Best Budget: GPT-4o-mini — $0.15/$0.60, 20x cheaper than Sonnet
- Best Free/Local: Qwen3.5 27B via Ollama — $0, matches GPT-5 Mini on SWE-bench
- Best for Privacy: Qwen3 Coder or Llama 3.3 70B — open-source, self-hostable
What Is OpenClaw (and Why Model Choice Matters)
OpenClaw (formerly Clawdbot) is a free and open-source AI agent developed by Austrian developer Peter Steinberger. It hit 100,000 GitHub stars in February 2026 — one of the fastest-growing open-source projects in AI history — the same month Steinberger joined OpenAI to continue his work on autonomous agents at a larger scale.
What makes OpenClaw different from a standard chatbot:
- Runs on your machine — Mac, Windows, or Linux. Your data stays local by default
- Any chat app — Telegram, WhatsApp, Discord, Slack, Signal, or iMessage
- Persistent memory — Remembers your preferences and context across sessions (via MEMORY.md)
- Full system access — Read/write files, run shell commands, execute scripts
- Browser control — Browse the web, fill forms, extract data
- Skills & plugins — Extend with community skills or build your own
The model powers everything. Every email it sends, every file it reads, every API call it makes goes through the LLM. A weak model at step 8 of a 12-step task means starting over. That's why model choice matters more in OpenClaw than in almost any other AI tool.
If you're new to OpenClaw, check out our OpenClaw trend analysis for a deeper look at why this project went viral.
What Makes a Model Work Well in OpenClaw
Now that you understand what OpenClaw does, let's look at what separates a model that works from one that doesn't.
Most AI benchmarks test single-turn responses. OpenClaw tasks are fundamentally different — a research agent might run 8-12 sequential tool calls, and the model needs to stay coherent through all of them.
Three capabilities matter most:
Tool-Calling Accuracy
OpenClaw's skills use structured function calls. The model must invoke shell commands and APIs with exact parameter shapes. If it fumbles the JSON schema or hallucinates a tool name, the agent stalls.
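To make "exact parameter shapes" concrete, here is a minimal sketch of the kind of validation that sits between the model and a skill. The schema and tool name are hypothetical stand-ins for illustration, not OpenClaw's actual wire format:

```python
import json

# Hypothetical schema for a ClawHub-style skill. The shape is illustrative;
# real skills define their own parameter schemas.
READ_FILE_SCHEMA = {"name": "read_file", "required": ["path"]}

def validate_tool_call(raw: str, schema: dict) -> bool:
    """Reject calls with malformed JSON, a wrong tool name, or missing params."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False  # model emitted broken JSON — the agent would stall here
    if call.get("name") != schema["name"]:
        return False  # hallucinated tool name
    return all(k in call.get("arguments", {}) for k in schema["required"])

print(validate_tool_call('{"name": "read_file", "arguments": {"path": "notes.md"}}', READ_FILE_SCHEMA))  # well-formed call
print(validate_tool_call('{"name": "read_files", "arguments": {}}', READ_FILE_SCHEMA))  # wrong name, missing path
```

A model that fails this kind of check even once in an 8-12 call chain breaks the whole session, which is why tool-calling accuracy outranks raw intelligence here.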
Context Retention
SOUL.md, AGENTS.md, USER.md, and MEMORY.md all load into context at startup. Add the conversation history and tool outputs, and you're easily at 10,000+ tokens before the agent does anything. The model needs to track all of this without losing the thread 50 messages in.
Instruction Adherence
SOUL.md sets behavioral rules — what the agent can and can't do, how it should respond, which tools to prefer. Weaker models drift from these rules mid-session, producing unpredictable behavior.
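For concreteness, here is a hypothetical fragment of the kind of rules a SOUL.md carries — the specific rules are invented for illustration, not taken from a real config:

```markdown
## Rules (hypothetical excerpt)
- Never send email without explicit user confirmation
- Prefer the web-search skill over raw browser control
- Keep chat replies under 200 words unless asked for detail
```

A strong model holds every one of these constraints across a 50-message session; a weak one starts ignoring them after a few tool calls.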
Price vs Capability vs Privacy — the tradeoffs
- Cloud APIs (Anthropic, OpenAI, Google) offer the best capability but your prompts hit external servers
- Open-source models via API providers (haimaker.ai) offer a middle ground — lower cost, better privacy compliance
- Self-hosted local models (Ollama) are free and fully private, but require hardware and tolerate higher latency
You can optimize for two of three: price, capability, privacy. Rarely all three. Most users should pick the two that matter most and accept the tradeoff on the third.
Best Models for OpenClaw by Use Case
With those criteria in mind, here's how the top models stack up across the most common OpenClaw workflows.
Best Overall: Claude Sonnet 4
Price: $3/$15 per million tokens (input/output)
Claude Sonnet 4 is the safest default for new OpenClaw setups. It handles SOUL.md instructions better than any other model at its price point.
In a 12-step research agent test comparing Sonnet and GPT-4o on the same task, Sonnet stayed within SOUL.md scope on 9 out of 12 runs. GPT-4o drifted on 3, pulling in sources that were explicitly excluded.
Sonnet excels at:
- Long SOUL.md files (5,000+ tokens) with many behavioral rules
- Research agents that synthesize structured reports from multiple sources
- Writing agents that maintain consistent tone across multi-step drafts
- General-purpose ClawHub skills from the marketplace
Strengths:
- Best instruction-following at mid-tier pricing
- Fast enough for real-time Telegram/WhatsApp chat
- Handles 80% of typical assistant tasks without breaking the bank
- Strong tool-calling reliability

Limitations:
- Not the cheapest option for simple, repetitive tasks
- Opus outperforms it on very complex multi-file coding
- Context window smaller than Gemini 3 Pro
Configuration:
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-sonnet-4-20250514"
      }
    }
  }
}
```
Best for Coding: Claude Opus 4.5
Price: $15/$75 per million tokens (input/output)
When the code needs to actually work — multi-file edits, complex debugging, architectural decisions — Opus 4.5 is worth the premium. It handles multi-step reasoning chains that Sonnet sometimes drops.
The cost-effective alternative: enable extended thinking on Sonnet 4. You pay more per reasoning token only when the task demands it, instead of paying Opus rates for everything.
Use Opus for complex debugging sessions, multi-file refactors, and architectural planning. For everything else, Sonnet with extended thinking gives you 80% of Opus capability at a fraction of the cost.
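One plausible shape for that Sonnet-with-thinking setup is to point the thinking slot at Sonnet itself — treat the field names below as illustrative rather than verified against the current schema:

```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-sonnet-4-20250514",
        "thinking": "anthropic/claude-sonnet-4-20250514"
      }
    }
  }
}
```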
Best for Research & Long Documents: Gemini 3 Pro
Price: ~$1.25/$10 per million tokens (input/output)
Gemini 3 Pro's killer feature is its 1M+ token context window. You can throw an entire codebase at it and ask it to find the bug. For long-document analysis, contract review, or codebase Q&A, nothing else comes close.
Gemini 3 Flash (~$0.075/$0.30) is the speed/cost option — cheap, fast, and surprisingly capable for simpler tasks. Google also offers a free tier for Flash.
Configuration for Gemini:
```json
{
  "models": {
    "providers": {
      "haimaker": {
        "models": [
          { "id": "google/gemini-3-pro", "name": "Gemini 3 Pro" },
          { "id": "google/gemini-3-flash", "name": "Gemini 3 Flash" }
        ]
      }
    }
  }
}
```
Best Budget Options
Not every task needs a $15/M-token model. For high-volume, simple tasks, lightweight models cut costs by 10-20x without meaningful quality loss.
| Model | Price (Input/Output per M tokens) | Best For |
|---|---|---|
| GPT-4o-mini | ~$0.15/$0.60 | Simple queries, template filling |
| Claude Haiku 3.5 | ~$0.25/$1.25 | Formatting, classification, tagging |
| MiniMax M2.5 | ~$0.10/$0.50 | High-volume simple automation |
| Gemini 3 Flash | ~$0.075/$0.30 | Speed-critical tasks, free tier available |
If your agent does something like: read a CSV row → apply a template → write an output file, a lightweight model handles it faster and cheaper. Save the premium models for tasks requiring judgment.
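That CSV-to-template pipeline is almost entirely mechanical — here is a sketch of the deterministic part, to show how little judgment the task actually requires (field names are invented for the example):

```python
import csv
import io

# The whole task is mechanical: read a row, fill a template, emit a line.
# A $0.15/M-token model handles the fuzzy parts; the rest needs no model at all.
TEMPLATE = "Invoice {id}: {customer} owes ${amount}"

def fill_rows(csv_text: str) -> list[str]:
    rows = csv.DictReader(io.StringIO(csv_text))
    return [TEMPLATE.format(**row) for row in rows]

data = "id,customer,amount\n7,Acme,120\n8,Globex,45\n"
print(fill_rows(data))
# → ['Invoice 7: Acme owes $120', 'Invoice 8: Globex owes $45']
```

If a task reduces to plumbing like this, routing it through Opus is pure waste.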
Best Free & Local Models for OpenClaw (Ollama)
Cloud APIs aren't the only option. If you have the hardware — or want zero API costs — local models have gotten surprisingly good.
Running models locally through Ollama costs nothing and keeps your data entirely on your machine. The tradeoff is hardware requirements and slightly lower capability on hard tasks.
Top Local Models Ranked
| Rank | Model | SWE-bench | Speed (RTX 4090) | VRAM Required |
|---|---|---|---|---|
| 1 | Qwen3.5 27B | 72.4% | ~40 t/s | 20-24GB |
| 2 | Qwen3.5 35B-A3B (MoE) | Lower | ~112 t/s | 8-16GB |
| 3 | Qwen3 Coder Plus | 70.6% | ~20 t/s | 48GB+ |
| 4 | Qwen3.5 9B | Basic | ~80 t/s | 8GB |
Qwen3.5 27B is the standout — its 72.4% SWE-bench score puts it in the same range as GPT-5 Mini, a cloud model you'd normally pay per token for. On a single consumer GPU or a 32GB M-series Mac, you get cloud-quality results for free.
The 35B-A3B is a mixture-of-experts model that only activates 3B parameters per forward pass. It runs at ~112 tokens/second on an RTX 4090 — fast enough to feel like a cloud API. Quality is lower on hard problems, but for boilerplate generation and simple edits, it's excellent.
Hardware Requirements
| Tier | VRAM | Hardware Examples | Recommended Models |
|---|---|---|---|
| Entry | 8-16GB | RTX 3070/4060, 16GB M1/M2 MacBook | Qwen3.5 9B, Qwen3.5 35B-A3B |
| Recommended | 20-24GB | RTX 4090, 32GB M2/M3 Pro/Max | Qwen3.5 27B |
| Premium | 48GB+ | 2x A6000, 64GB+ M2/M3 Ultra | Qwen3 Coder Plus, Llama 3.3 70B |
If you're on an Apple Silicon Mac, unified memory works well for LLM inference. Apple has been optimizing Metal for LLM workloads. A 32GB M3 Pro runs Qwen3.5 27B comfortably.
How to Set Up Ollama with OpenClaw
Step 1: Install Ollama and pull a model:
```shell
curl -fsSL https://ollama.com/install.sh | sh

ollama pull qwen3.5:27b      # Best quality, needs 20GB+ VRAM
# OR
ollama pull qwen3.5:35b-a3b  # Fast MoE model, runs on 16GB
# OR
ollama pull qwen3.5:9b       # Lightweight, runs on 8GB
```
Step 2: Configure OpenClaw:
Run the onboarding wizard:
```shell
openclaw onboard --auth-choice ollama
```
Or add Ollama manually in ~/.openclaw/openclaw.json:
```json
{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3.5:27b",
            "name": "Qwen3.5 27B",
            "reasoning": false,
            "contextWindow": 131072,
            "maxTokens": 8192
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/qwen3.5:27b"
      }
    }
  }
}
```
Step 3: Switch to your local model:
```
/model ollama/qwen3.5:27b
```
What Local Models Handle Well (and Where They Fall Short)
Strong at:
- Reading and summarizing code
- Boilerplate and CRUD code generation
- File operations and simple refactoring
- Agentic tool calling (Qwen3.5 27B BFCL-V4 score: 72.2)
Weak at:
- Multi-file refactors (5+ files across different contexts)
- Complex debugging across abstraction layers
- Speed on dense models (~40 t/s vs cloud API 80-150 t/s)
- Very long context (quality degrades past ~32K tokens on consumer hardware)
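One way to act on that context ceiling is to route by estimated prompt size. The four-characters-per-token estimate is a rough rule of thumb, and the model ids simply mirror this guide's examples:

```python
# Rough heuristic: ~4 characters per token for English text. The threshold
# mirrors the ~32K-token point where local quality starts to degrade.
LOCAL_LIMIT_TOKENS = 32_000

def estimate_tokens(text: str) -> int:
    return len(text) // 4

def pick_model(prompt: str) -> str:
    if estimate_tokens(prompt) < LOCAL_LIMIT_TOKENS:
        return "ollama/qwen3.5:27b"  # short prompts stay on the free local model
    return "google/gemini-3-pro"     # long-context work goes to a cloud model

print(pick_model("short prompt"))  # → ollama/qwen3.5:27b
print(pick_model("x" * 200_000))   # → google/gemini-3-pro
```

For accurate counts you'd use a real tokenizer, but even this crude cutoff keeps local models inside the range where they perform well.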
Best OpenAI Models for OpenClaw
We've covered Anthropic's lineup and local options. Here's where OpenAI fits in.
OpenAI's models offer solid general performance with fast response times. Here's how they stack up:
GPT-4o — The Coding & Tool-Calling Specialist
Price: Mid-tier (~$2.50/$10 per million tokens)
GPT-4o's function-calling accuracy on structured schemas is slightly higher than Claude's. It produces cleaner JSON outputs from ClawHub skills that return raw data, making it the go-to for coding agents and data extraction pipelines.
Best for:
- Code generation and debugging agents
- Structured data extraction (HTML tables, JSON transformations)
- Multi-tool orchestration with strict output schemas
- Tasks where response speed matters more than instruction adherence
GPT-4o-mini — The Budget Workhorse
Price: ~$0.15/$0.60 per million tokens
At 20x cheaper than Sonnet, GPT-4o-mini is the right choice for simple, high-volume tasks. Quality drops on anything requiring nuanced reasoning, but for template filling, classification, and formatting — it's hard to beat the value.
o3-mini — The Deep Reasoner
Price: Higher, with per-reasoning-token billing
For analytical agents that work through multi-step logic — financial analysis, scientific data interpretation, complex research synthesis — o3-mini at medium or high reasoning mode handles problems other models can't. It's slower (20-40 seconds per response) and more expensive, so use it only for specialized tasks, not as a daily driver.
The Hybrid Approach: Mix Cloud and Local
You don't have to pick one model and stick with it. In fact, the smartest approach is not to.
Most experienced OpenClaw users run a hybrid setup: local models for cheap stuff, cloud for hard stuff.
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/qwen3.5:27b",
        "thinking": "anthropic/claude-sonnet-4-20250514"
      }
    }
  }
}
```
The local model handles file reads, simple edits, and boilerplate — roughly 60-70% of a typical session. Sonnet handles debugging, architecture decisions, and multi-file work. Your daily API bill drops from $20-50 to around $5.
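A back-of-envelope check on those numbers, using this guide's Sonnet pricing ($3 in / $15 out per million tokens, local model at $0). The token counts are invented for illustration, and real savings scale with how much of a session the local model absorbs:

```python
# Sonnet 4 pricing from this guide; the local model contributes $0.
SONNET_IN, SONNET_OUT = 3.0, 15.0  # $ per million tokens

def session_cost(in_tok: int, out_tok: int, cloud_share: float) -> float:
    """cloud_share = fraction of tokens that still hit the cloud model."""
    cloud_in = in_tok * cloud_share
    cloud_out = out_tok * cloud_share
    return (cloud_in * SONNET_IN + cloud_out * SONNET_OUT) / 1_000_000

# A heavy day: 5M input tokens, 1M output tokens.
print(round(session_cost(5_000_000, 1_000_000, 1.0), 2))   # all-cloud → 30.0
print(round(session_cost(5_000_000, 1_000_000, 0.33), 2))  # hybrid → 9.9
```

Shift more of the routine work local and the cloud share — and the bill — keeps falling.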
Switch manually when you know a task needs more capability:
```
/model sonnet
```
Use a cheap model for simple tasks, a mid-tier model for daily work, and a premium model for the hard stuff. Start with Claude Sonnet 4 as your default, and switch to Opus or a local model as needed.
Provider Comparison
| Provider | Price Range (per M output tokens) | Best For | Privacy |
|---|---|---|---|
| Anthropic (Claude) | $3–$75 | Tool calling, instruction following | No training on API data by default |
| OpenAI (GPT) | $0.60–$15 | Coding, structured data, speed | Standard data processing |
| Google (Gemini) | $1.25–$10 | Long documents, massive context | Google Cloud data policies |
| Open-source via haimaker.ai | $0.10–$5 | Cost optimization, privacy compliance | Routes across GPU providers |
| Ollama (local) | Free | Full privacy, no API costs | Data never leaves your machine |
Community Rankings (March 2026)
Beyond our own testing, here's what the broader OpenClaw community is actually using.
The PricePerToken community leaderboard tracks real-world model preferences from OpenClaw developers. As of March 27, 2026:
1. Kimi K2.5 — Top community vote
2. Claude Opus 4.5 — Premium choice
3. GLM 4.7 — Strong contender from Zhipu
4. Gemini 3 Flash Preview — Speed + value
5. Claude Opus 4.6 — Latest premium
6. Claude Sonnet 4.5 — Balanced choice
7. GPT-5.2 — OpenAI's latest
8. DeepSeek V3 — Open-source value
9. MiniMax M2.1 — Budget champion
10. Mixtral 8x7B Instruct — Classic open-source
Reddit's r/LocalLLaMA consistently recommends Qwen3.5 27B as the best local model, with multiple threads reporting successful setups on consumer hardware.
Looking for alternatives to OpenClaw itself? See our best OpenClaw alternatives guide.
Quick Decision Tree
- "I just want something that works" → Claude Sonnet 4. Handles 80% of tasks, reasonable pricing
- "I'm writing production code" → Claude Opus 4.5. Worth the premium for complex debugging
- "I need to process long documents" → Gemini 3 Pro. 1M+ tokens of context
- "I need it free" → Qwen3.5 27B via Ollama, or Gemini Flash free tier
- "I need it cheap" → MiniMax M2.5 or GPT-4o-mini
- "Data privacy is critical" → Qwen3 Coder / Llama 3.3 70B via haimaker.ai, or self-host with Ollama
- "I use OpenClaw on Telegram" → Claude Sonnet 4 as default (any supported model works)
FAQ
What is the best model for OpenClaw beginners?
Claude Sonnet 4. It forgives imperfect SOUL.md files better than GPT-4o, and its instruction-following means agents are less likely to break on early configuration mistakes. Once you've dialed in your config, consider whether a lighter model fits your specific tasks.
Can I use different models for different agents in OpenClaw?
Not natively within a single OpenClaw instance. The model set in openclaw.json applies to all agents running through that gateway. The workaround is running separate gateway instances with different configs, or using the /model command to switch mid-session.
Why does my OpenClaw agent keep failing with local models?
Tool-calling accuracy is the most common cause. Smaller models like Llama 3.1 8B and Mistral 7B sometimes malform ClawHub skill calls, causing the agent to stall or retry indefinitely. Switching to Qwen3.5 27B or a cloud model like Claude Haiku resolves this in most cases.
Is Claude Opus worth the cost for OpenClaw?
For most users, no. Opus costs 5-10x more than Sonnet per session, and the practical performance difference in typical OpenClaw tasks is small. The advantage shows up in very long, complex reasoning chains — not in standard research or writing workflows.
What's the cheapest way to run OpenClaw?
Local models through Ollama cost nothing — Qwen3.5 27B runs on consumer hardware and matches cloud models on many tasks. For cloud APIs, Gemini 3 Flash ($0.075/$0.30 per M tokens) and GPT-4o-mini ($0.15/$0.60) are the cheapest capable options.
How do I switch models in OpenClaw?
Use the /model command mid-session: /model opus, /model haimaker/llama-3.3-70b, or /model ollama/qwen3.5:27b. To change the default, edit the model.primary field in ~/.openclaw/openclaw.json.
Does switching models affect my MEMORY.md files?
No. MEMORY.md is plain text that OpenClaw reads and injects into context regardless of which model is configured. Session memories carry over cleanly when you switch models.
Which model works best for OpenClaw on Telegram?
Any supported model works with Telegram — the channel and model are independent. Claude Sonnet 4 is the recommended default for Telegram because it balances speed, cost, and instruction-following for chat-based interactions. For budget Telegram setups, GPT-4o-mini handles simpler tasks well.
Can I use OpenClaw without an API key?
Yes, if you run local models through Ollama. You don't need any external API key — everything runs on your hardware. For cloud models, you'll need an API key from the respective provider (Anthropic, OpenAI, Google, or haimaker.ai).
What hardware do I need for local models?
At minimum: 8GB VRAM (RTX 3070 or 16GB M1 Mac) for Qwen3.5 9B. Recommended: 20-24GB VRAM (RTX 4090 or 32GB M-series Mac) for Qwen3.5 27B. Premium: 48GB+ VRAM for Qwen3 Coder Plus or Llama 3.3 70B.


