Skip to main content

Running with Local Models

OpenClaw can run entirely offline using local LLMs, eliminating API costs and keeping all data on your machine. It supports seven categories of local model providers, all connecting through the same Gateway.


Supported Local Backends

BackendDefault URLBest ForSetup
LM Studio (recommended)http://localhost:1234/v1Quick start, GUI-based model managementLow
Ollamahttp://127.0.0.1:11434CLI-first, auto-discovery, consumer hardwareLow
vLLMhttp://127.0.0.1:8000/v1Production, GPU clusters, tensor parallelismMedium
SGLanghttp://127.0.0.1:30000/v1High-throughput serving, RadixAttentionMedium
MLXVariesApple Silicon optimizedMedium
LiteLLMVariesUnified proxy across 100+ providersMedium
CustomAny URLAny OpenAI-compatible serverMedium

All local providers are bundled as plugins and require a dummy API key for auto-discovery (e.g., VLLM_API_KEY='vllm-local').

info

LM Studio is the recommended local stack — it provides a GUI for model management, uses the Responses API, and has the best out-of-the-box experience with OpenClaw.


Setup with LM Studio

Install and Configure

  1. Download LM Studio and install
  2. Download a model (Qwen3 32B recommended for best balance)
  3. Start the local server (it runs on port 1234 by default)
~/.openclaw/openclaw.json
{
"models": {
"providers": {
"lmstudio": {
"baseUrl": "http://localhost:1234/v1",
"apiKey": "lm-studio-local",
"api": "openai-completions",
"models": {
"qwen3-32b": {
"contextWindow": 32768,
"maxTokens": 4096
}
}
}
}
}
}

Setup with Ollama

Install Ollama

# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | bash

# Pull recommended models
ollama pull qwen3:32b # Best balance of quality and speed
ollama pull qwen3:14b # Good for most tasks, less RAM
ollama pull llama3.3:70b # Best quality (needs 40GB+ VRAM)

Configure OpenClaw

OpenClaw supports three Ollama operating modes:

ModeDescriptionWhen To Use
Cloud + LocalBoth cloud and Ollama models availableDefault — best flexibility
Cloud onlyOllama disabledWhen you don't need local
Local onlyCloud providers disabledAir-gapped / privacy-first
~/.openclaw/openclaw.json
{
"brain": {
"provider": "local",
"local": {
"endpoint": "http://localhost:11434",
"model": "qwen3:32b",
"type": "ollama"
}
}
}

Restart the gateway:

openclaw gateway restart

Auto-Discovery

OpenClaw auto-discovers installed Ollama models via /api/tags and detects capabilities via /api/show:

  • Vision models: Detected automatically (e.g., llava, moondream)
  • Reasoning models: Identified by name heuristics (r1, reasoning, think)
  • Context window: Read from model metadata

The /v1 Warning

danger

Do NOT use Ollama's /v1 OpenAI-compatible endpoint. OpenClaw uses Ollama's native /api/chat endpoint instead because Ollama's /v1 streaming implementation does not properly emit tool_calls delta chunks. This causes tool calls to silently fail — the model returns empty content with finish_reason: 'stop' instead of tool call payloads.

This bug affected Mistral Small 3.2 24B, Qwen 2.5 7B, Llama 3.1 8B, and others across versions v2026.1.29 through v2026.2.15, and was fixed by switching to the native Ollama endpoint after v2026.3.2. The underlying Ollama /v1 streaming limitation persists as of mid-2026.

Context Window Tuning

Ollama has two separate context window settings:

SettingWherePurpose
contextWindowOpenClaw configControls how OpenClaw budgets context tokens
params.num_ctxOllama runtimeControls how much context Ollama actually allocates

Set both to match your needs:

~/.openclaw/openclaw.json
{
"models": {
"providers": {
"ollama": {
"models": {
"qwen3:32b": {
"contextWindow": 32768,
"params": {
"num_ctx": 32768
}
}
}
}
}
}
}

Runtime Parameters

Fine-tune Ollama inference behavior:

ParameterDefaultDescription
temperature0.7Randomness (0 = deterministic)
top_p0.9Nucleus sampling threshold
top_k40Top-K sampling
min_p0.0Minimum probability threshold
num_predict-1Max tokens to generate (-1 = unlimited)
num_batch512Batch size for prompt processing
num_threadautoCPU threads for inference
use_mmaptrueMemory-map model files
keep_alive5mHow long to keep model loaded after last request
{
"models": {
"providers": {
"ollama": {
"models": {
"qwen3:32b": {
"params": {
"temperature": 0.7,
"num_ctx": 32768,
"keep_alive": "-1"
}
}
}
}
}
}
}
tip

Set keep_alive to "-1" to prevent Ollama from unloading the model between requests. This avoids cold-start delays on each heartbeat cycle.

Multiple Ollama Hosts

Run different models on different machines with custom provider IDs:

~/.openclaw/openclaw.json
{
"models": {
"providers": {
"ollama-fast": {
"type": "ollama",
"host": "http://gpu-server-1:11434",
"models": {
"gemma4:12b": { "contextWindow": 32768 }
}
},
"ollama-large": {
"type": "ollama",
"host": "http://gpu-server-2:11434",
"models": {
"qwen3.5:27b": { "contextWindow": 65536 }
}
}
}
}
}

Each host has independent auth, timeout, and model configuration. You can set up fallback chains across hosts:

{
"brain": {
"model": "ollama-fast/gemma4:12b",
"fallbacks": ["ollama-large/qwen3.5:27b"]
}
}

Setup with vLLM

vLLM is recommended for production with dedicated GPU hardware — it supports tensor parallelism, continuous batching, and PagedAttention for efficient memory use.

pip install vllm

# Single GPU
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000

# Multi-GPU (tensor parallelism)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2
~/.openclaw/openclaw.json
{
"models": {
"providers": {
"vllm": {
"baseUrl": "http://localhost:8000/v1",
"apiKey": "vllm-local",
"api": "openai-completions",
"models": {
"meta-llama/Llama-3.1-70B-Instruct": {
"contextWindow": 131072,
"maxTokens": 4096
}
}
}
}
}
}

vLLM vs Ollama

FeatureOllamavLLM
SetupOne commandpip install + config
GUINoNo
Multi-GPULimitedFull tensor parallelism
BatchingSequentialContinuous batching
MemoryStandardPagedAttention (efficient)
QuantizationGGUF (Q4/Q5/Q8)AWQ, GPTQ, FP8
Best forDevelopment, consumer hardwareProduction, multi-user

Setup with SGLang

SGLang offers high-throughput serving with RadixAttention:

pip install sglang

python -m sglang.launch_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--port 30000
~/.openclaw/openclaw.json
{
"models": {
"providers": {
"sglang": {
"baseUrl": "http://localhost:30000/v1",
"apiKey": "sglang-local",
"api": "openai-completions"
}
}
}
}

Custom OpenAI-Compatible Providers

Any server that implements the OpenAI Chat Completions API works:

~/.openclaw/openclaw.json
{
"models": {
"providers": {
"my-local": {
"baseUrl": "http://localhost:5000/v1",
"apiKey": "dummy",
"api": "openai-completions",
"timeoutSeconds": 120,
"models": {
"my-model": {
"contextWindow": 32768,
"maxTokens": 8192
}
}
}
}
}
}
caution

The api field is mandatory. Omitting it produces a vague error: No API provider registered for api: undefined — with no indication of which field is missing.

Compatibility Flags

Some local backends need compatibility adjustments:

FlagPurposeWhen Needed
requiresStringContent: trueSend message content as plain string, not content-part arraysServers that reject structured content
strictMessageKeys: trueStrip messages to role and content onlyServers that reject extra fields
tool_choice: "required"Force tool use modeBackends whose parser only works in forced mode
{
"models": {
"providers": {
"my-local": {
"baseUrl": "http://localhost:5000/v1",
"apiKey": "dummy",
"api": "openai-completions",
"requiresStringContent": true,
"strictMessageKeys": true
}
}
}
}

The models.providers Config Structure

All local (and custom cloud) providers use the same JSON structure:

~/.openclaw/openclaw.json
{
"models": {
"mode": "merge",
"providers": {
"provider-id": {
"baseUrl": "http://...",
"apiKey": "...",
"api": "openai-completions",
"timeoutSeconds": 120,
"models": {
"model-name": {
"contextWindow": 200000,
"maxTokens": 8192,
"reasoning": false,
"inputTypes": ["text"],
"cost": { "input": 0, "output": 0 }
}
}
}
}
}
}
FieldDefaultDescription
baseUrlProvider-specificAPI endpoint URL
apiKeyAPI key (use dummy value for local)
apiRequired. openai-completions or anthropic-messages
timeoutSeconds60Request timeout
models.*.contextWindow200000Context window budget for OpenClaw
models.*.maxTokens8192Max output tokens
models.*.reasoningfalseWhether model supports extended thinking
models.*.costToken costs (for budget tracking)

Merge Mode

When adding local providers alongside cloud, use models.mode: "merge" to preserve hosted options:

{
"models": {
"mode": "merge",
"providers": {
"ollama": { ... }
}
}
}

Without merge mode, custom providers replace the defaults entirely.


Hardware Requirements

Model SizeRAMGPU VRAMQualityExample Models
7-8B8 GB6 GBBasic tasks, simple chatQwen3 8B, Llama 3.3 8B
13-14B16 GB10 GBGood general useQwen3 14B
30-34B32 GB24 GBStrong coding and reasoningQwen3 32B (Q4), DeepSeek-R1-Distill-32B
70B64 GB40 GB+Near cloud-qualityLlama 3.3 70B

Quantization

Quantized models reduce VRAM requirements with minimal quality loss:

QuantizationVRAM SavingsQuality ImpactBest For
Q4_K_M~75%Small loss on complex tasksConsumer GPUs (8-16 GB)
Q5_K_M~65%Minimal lossMid-range GPUs (16-24 GB)
Q8_0~50%Negligible lossHigh-end consumer (24 GB+)
FP160% (baseline)Full qualityData center GPUs (40 GB+)
# Pull a quantized model
ollama pull qwen3:32b-q4_K_M # Fits in 24 GB VRAM
ollama pull qwen3:14b-q8_0 # Fits in 16 GB VRAM

Apple Silicon

Apple Silicon Macs can run models using unified memory:

MacUnified MemoryRecommended Model
M1/M2 (8 GB)8 GBQwen3 8B (Q4)
M1/M2 Pro (16 GB)16 GBQwen3 14B
M1/M2 Max (32 GB)32 GBQwen3 32B (Q4)
M2/M3/M4 Ultra (64-192 GB)64-192 GBLlama 3.3 70B (FP16)
tip

For Apple Silicon, MLX provides optimized inference. Ollama also uses Metal acceleration automatically on macOS.


Tool-Use Capability

OpenClaw requires models that support function calling. Not all local models handle this well:

ModelTool UseCodingReasoningNotes
Qwen3 32BGoodExcellentGoodBest overall local model
Qwen 2.5 CoderGoodExcellentGoodOptimized for coding tasks
DeepSeek-R1-Distill-32BModerateGoodExcellentBest for reasoning-heavy tasks
Llama 3.3 70BGoodGoodGoodNear cloud-quality, needs big GPU
Mistral Small 3.2 24BModerateGoodGoodGood balance
Gemma 4 12BModerateGoodModerateLightweight option
caution

Models smaller than ~14B parameters often struggle with complex multi-step tool calling. For reliable agent behavior with multiple tool calls, use 30B+ models or fall back to cloud providers.


Hybrid Mode

Use local models for cheap tasks and cloud models for complex ones:

~/.openclaw/openclaw.json
{
"brain": {
"provider": "anthropic",
"model": "claude-sonnet-4-6"
},
"heartbeat": {
"model": "ollama/qwen3:14b"
},
"agents": {
"list": [
{
"id": "researcher",
"model": "claude-opus-4-6"
},
{
"id": "worker",
"model": "ollama/qwen3:32b"
}
]
}
}

Three Hybrid Strategies

StrategyDescriptionConfig
Primary with fallbacksCloud primary, local fallback when cloud is downbrain.fallback points to local
Local-firstLocal primary, cloud safety net for complex tasksbrain.provider: "local" with cloud fallback
Merge modeBoth available, route by taskmodels.mode: "merge"

Performance Tips

  1. Keep the model loaded — Set keep_alive: "-1" in Ollama to avoid cold-start delays
  2. Use GPU offloading — Even partial GPU offload dramatically speeds inference
  3. Match model to task — 8B for heartbeat, 32B+ for complex reasoning
  4. Monitor VRAM — Watch for OOM errors with nvidia-smi or ollama ps
  5. Set timeouts — Local inference can be slow; set timeoutSeconds: 120 or higher
  6. Use quantization — Q4_K_M fits most models in consumer VRAM with minimal quality loss
  7. Enable concurrent loading — Set OLLAMA_NUM_PARALLEL for multiple simultaneous requests
# Environment variables for Ollama tuning
export OLLAMA_KEEP_ALIVE=-1 # Never unload models
export OLLAMA_NUM_PARALLEL=4 # Handle 4 concurrent requests
export OLLAMA_MAX_LOADED_MODELS=2 # Keep 2 models in memory

Troubleshooting

ProblemCauseFix
Tool calls silently failUsing Ollama's /v1 endpointEnsure OpenClaw uses native /api/chat (default since v2026.3.2)
No API provider registered for api: undefinedMissing api field in configAdd "api": "openai-completions" to provider config
VRAM OOMModel too large for GPUUse a smaller quantization (Q4_K_M) or smaller model
Slow inferenceCPU-only, no GPU offloadEnable GPU with ollama run --gpu or check CUDA drivers
Model not foundModel not pulled or wrong nameRun ollama list to check, ollama pull model-name
Empty responsesContext window mismatchSet both contextWindow and params.num_ctx to match
Connection refusedOllama not runningStart with ollama serve or check port
Timeout errorsSlow inference exceeding defaultIncrease timeoutSeconds to 120-300
Fallback overwrites primaryKnown bug (#47705)Update to latest version, report if persists

Limitations

  • Local models are generally less capable than Claude Opus or GPT-5 for complex tasks
  • Complex multi-step reasoning may fail more often, especially with smaller models
  • Browser automation skills may need cloud models for reliable tool calling
  • Speed depends heavily on your hardware — expect 10-50 tokens/sec on consumer GPUs vs. 100+ from cloud APIs
  • Some OpenAI-specific features (prompt caching, reasoning-compat payload shaping) are not available with custom providers

See Also