
Performance Tuning

A 24/7 agent on a VPS can easily cost $300+/month if left untuned. This guide covers every optimization lever — from a single config change that cuts costs 50% to advanced model routing that achieves a 97% reduction.

Quick wins

If you only do three things: switch heartbeat to a cheap model, set quiet hours, and increase heartbeat interval. These alone can cut costs 70%+ with zero impact on agent quality.


Optimization Impact Summary

| Technique | Cost Impact | Effort | Section |
| --- | --- | --- | --- |
| Cheap heartbeat model | -50% to -100% | Low | Heartbeat |
| Quiet hours | -33% (8 quiet hours/day) | Low | Heartbeat |
| Increase heartbeat interval | -50% to -75% | Low | Heartbeat |
| Prompt caching (Anthropic) | -30% to -50% | Zero | Caching |
| Context reduction | Up to -80% | Medium | Memory |
| Model routing by task | -40% to -60% | Medium | Model Routing |
| Disable thinking mode | -50% | Low | Thinking Mode |
| Session resets | -20% to -50% | Low | Token Optimization |
| Local models | -100% | High | Local Models |

Token Optimization

Context Management

Every message you send includes context — memory, system prompts, conversation history. Context is the biggest hidden cost driver.

# Check current context token setting
openclaw config get memory.max_context_tokens

# Reduce context loaded per request
openclaw config set memory.max_context_tokens 2000

Real-world impact: One user reduced per-session cost from $0.40 to $0.05 by lowering context from 50K to 8K tokens.

Session Resets

Conversations accumulate tokens over time. Long sessions send increasingly large payloads with every message.

# Start a fresh session (clears conversation history, keeps memory)
/new

# Reset everything including temporary context
/reset

# Compact old messages into a summary
/compact

OpenClaw automatically creates a fresh session daily at 4:00 AM local time. For chatty agents, consider more frequent resets.
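
To see why long sessions get expensive, here is a minimal Python sketch. It assumes each turn adds a fixed number of tokens and that the full history is resent with every message; the function name and the 500-token figure are illustrative, not OpenClaw internals.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed over a session when every request
    resends the entire conversation history so far."""
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn  # the new message joins the history
        total += history            # the whole history is sent as input
    return total


# One 40-turn session versus two 20-turn sessions (i.e. one reset halfway).
one_long = cumulative_input_tokens(40, 500)
two_short = 2 * cumulative_input_tokens(20, 500)
print(one_long, two_short)
```

Under these assumptions input tokens grow roughly with the square of the session length, so a single reset halfway through nearly halves the total.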

Response Size

Cap response length to prevent verbose outputs:

~/.openclaw/openclaw.json
{
  "brain": {
    "max_tokens": 4096
  }
}

Smaller max_tokens means faster responses and lower cost per message.


Model Routing

The most impactful optimization: use the right model for each task. In the table below, the most expensive paid model (~$15 per M tokens) costs 300x more than the cheapest (~$0.05 per M tokens).

Model Cost Tiers

| Tier | Model | Cost (per M tokens) | Best For |
| --- | --- | --- | --- |
| Free | Qwen3 14B (local) | $0 | Heartbeat, simple checks |
| Ultra-cheap | Gemini 2.5 Flash-Lite | ~$0.05 | Monitoring, status checks |
| Cheap | Gemini 2.5 Flash | ~$0.20 | Heartbeat, simple tasks |
| Budget | DeepSeek V3.2 | ~$0.40 | General conversation |
| Mid | Claude Haiku 4.5 | ~$3 | Quick tasks, chat |
| Standard | Claude Sonnet 4.6 | ~$9 | General use, coding |
| Premium | Claude Opus 4.8 | ~$15 | Complex reasoning |
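
As a rough illustration of routing by task, the Python sketch below maps task categories to the tiers above and estimates cost from the approximate per-million-token prices. The task names and the `TASK_TIER` mapping are hypothetical; OpenClaw's own routing is configured in openclaw.json, not through code like this.

```python
# Approximate prices per million tokens, taken from the tier table.
PRICE_PER_M = {
    "Free": 0.0,
    "Ultra-cheap": 0.05,
    "Cheap": 0.20,
    "Budget": 0.40,
    "Mid": 3.0,
    "Standard": 9.0,
    "Premium": 15.0,
}

# Hypothetical task categories; adjust to your own workload.
TASK_TIER = {
    "heartbeat": "Free",
    "status_check": "Ultra-cheap",
    "chat": "Budget",
    "coding": "Standard",
    "complex_reasoning": "Premium",
}


def estimate_cost(task: str, tokens: int, default_tier: str = "Standard") -> float:
    """Dollar cost of `tokens` tokens for a task, using its routed tier."""
    tier = TASK_TIER.get(task, default_tier)
    return tokens / 1_000_000 * PRICE_PER_M[tier]


print(estimate_cost("chat", 1_000_000))               # routed to Budget
print(estimate_cost("complex_reasoning", 1_000_000))  # routed to Premium
```

Unknown task types fall back to the Standard tier here, which keeps quality safe at the price of occasionally overpaying.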

Heartbeat Override

Route the heartbeat to a cheap model while keeping the main brain on a capable one:

~/.openclaw/openclaw.json
{
  "brain": {
    "provider": "anthropic",
    "model": "claude-sonnet-4-6",

    "heartbeat_override": {
      "provider": "local",
      "local": {
        "endpoint": "http://localhost:11434",
        "model": "qwen3:14b"
      }
    }
  }
}

The heartbeat runs every 30 minutes and is the single largest cost driver (~35% of total spend). Routing it to a free local model eliminates that cost entirely.

Fallback Chains

Configure automatic fallback when the primary provider is rate-limited or down:

{
  "brain": {
    "provider": "anthropic",
    "model": "claude-sonnet-4-6",
    "fallback": {
      "provider": "openrouter",
      "model": "deepseek/deepseek-chat-v3-0324"
    }
  }
}
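
Conceptually, a fallback chain tries each provider in order and moves on when one fails. The Python sketch below shows that control flow only; `ProviderError`, `primary`, and `fallback` are stand-ins invented for the example and do not reflect how OpenClaw implements it.

```python
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Stand-in for a rate-limit or outage error from a provider."""


def complete_with_fallback(prompt: str,
                           providers: Sequence[Callable[[str], str]]) -> str:
    """Return the first successful completion, trying providers in order."""
    last_error: Optional[Exception] = None
    for call in providers:
        try:
            return call(prompt)
        except ProviderError as err:
            last_error = err  # remember the failure and try the next one
    raise RuntimeError("all providers failed") from last_error


def primary(prompt: str) -> str:
    # Simulates a rate-limited primary provider.
    raise ProviderError("429 rate limited")


def fallback(prompt: str) -> str:
    return f"fallback answered: {prompt}"


print(complete_with_fallback("ping", [primary, fallback]))
```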

OpenRouter Auto-Routing

Let OpenRouter pick the most cost-effective model per prompt:

{
  "brain": {
    "provider": "openrouter",
    "model": "openrouter/auto"
  }
}

This routes each request to the cheapest provider that can handle it — useful for mixed workloads.


Heartbeat Optimization

The heartbeat is the #1 cost driver for always-on agents. Default settings (30-minute interval, main model) can easily spend $500+/month.

Interval Tuning

~/.openclaw/openclaw.json
{
  "heartbeat": {
    "interval": 3600
  }
}

| Interval | Cycles/Day | Cost Impact |
| --- | --- | --- |
| 30 min (default) | 48 | Baseline |
| 1 hour | 24 | -50% |
| 2 hours | 12 | -75% |
| 4 hours | 6 | -87.5% |

Quiet Hours

Stop the heartbeat while you're asleep:

{
  "heartbeat": {
    "interval": 3600,
    "quiet_hours": {
      "start": "23:00",
      "end": "07:00"
    }
  }
}

Eight quiet hours out of 24 removes a third of the cycles, so heartbeat cost drops by about 33% on top of any interval savings.
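
The combined effect of interval and quiet hours is simple arithmetic. This Python sketch assumes the interval is given in seconds (as in the config above) and that no cycles run during quiet hours; the function name is illustrative.

```python
def cycles_per_day(interval_s: int, quiet_hours: float = 0.0) -> float:
    """Heartbeat cycles per day for a given interval and quiet window."""
    active_seconds = (24 - quiet_hours) * 3600
    return active_seconds / interval_s


baseline = cycles_per_day(1800)              # 30-minute default, no quiet hours
tuned = cycles_per_day(3600, quiet_hours=8)  # 1-hour interval, 23:00-07:00 quiet
reduction = 1 - tuned / baseline

print(baseline, tuned, round(reduction * 100))
```

With a 1-hour interval and eight quiet hours, the agent runs 16 cycles a day instead of 48, which is about two thirds fewer heartbeat calls.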

Heartbeat Model Cost

| Model | Cost per Cycle (~120K tokens) | Monthly (24 cycles/day) |
| --- | --- | --- |
| Opus 4.8 | ~$0.75 | ~$540 |
| Sonnet 4.6 | ~$0.45 | ~$324 |
| Haiku 4.5 | ~$0.15 | ~$108 |
| Gemini 2.5 Flash | ~$0.05 | ~$36 |
| Flash-Lite | ~$0.01 | ~$7 |
| Local (Ollama) | $0 | $0 |
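
The monthly figures follow from cost per cycle × cycles per day × 30 days. A small Python sketch, using the per-cycle costs from the table and assuming a 30-day month, lets you plug in your own interval:

```python
# Approximate cost per heartbeat cycle, from the table above.
COST_PER_CYCLE = {
    "Opus 4.8": 0.75,
    "Sonnet 4.6": 0.45,
    "Haiku 4.5": 0.15,
    "Gemini 2.5 Flash": 0.05,
    "Flash-Lite": 0.01,
    "Local (Ollama)": 0.0,
}


def monthly_cost(cost_per_cycle: float, cycles_per_day: int = 24,
                 days: int = 30) -> float:
    """Monthly heartbeat spend in dollars."""
    return cost_per_cycle * cycles_per_day * days


for model, cost in COST_PER_CYCLE.items():
    print(f"{model}: ${monthly_cost(cost):.0f}/month at 24 cycles/day")
```

Passing `cycles_per_day=48` shows the default 30-minute interval doubling every figure.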

Simplify HEARTBEAT.md

Every instruction in HEARTBEAT.md costs tokens. A shorter heartbeat task list = fewer tokens per cycle:

Before (expensive)
## System Health
- Check CPU, memory, disk, network, SSL certs, DNS, NTP
- Scan all 50 channels for disconnects
- Review last 1000 log lines for patterns
- Analyze token spending trends
- Generate a 500-word health report

After (efficient)
## System Health
- Check channel connectivity and MCP health
- Alert on Telegram only if something is wrong

Monitor Heartbeat Cost

# See heartbeat execution history and cost
openclaw stats heartbeat

# Watch heartbeat in real-time
openclaw logs --filter heartbeat --follow

Caching

Prompt Caching (Anthropic)

Anthropic automatically caches system prompts and repeated context. Cache hits cost 10% of normal input token price.

~/.openclaw/openclaw.json
{
  "brain": {
    "provider": "anthropic",
    "model": "claude-sonnet-4-6",
    "cache_retention": "short"
  }
}

| Setting | Duration | Savings |
| --- | --- | --- |
| "short" (default) | 5 minutes | 30-50% for frequent messages |
| "long" | Extended | Higher savings for sustained conversations |

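How it works: Your system prompt (SOUL.md + skills + memory context) is 5K-10K tokens. With caching, that prefix is billed at the full input rate only when the cache is written; messages that arrive within the 5-minute window read it at the cache-hit rate (10% of the normal input price). For a chatty channel, this saves 30-50%.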
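
A back-of-the-envelope Python sketch of the prefix cost with and without caching. It assumes every message lands inside the cache window, uses the 10% hit rate quoted above, and takes the 1.25x cache-write multiplier from Anthropic's published pricing rather than from this guide. The $3 per M tokens price and 8,000-token prefix are placeholder values.

```python
def prefix_cost(messages: int, prefix_tokens: int, price_per_m: float,
                cached: bool, hit_multiplier: float = 0.10,
                write_multiplier: float = 1.25) -> float:
    """Dollar cost of sending the same prompt prefix `messages` times."""
    full = prefix_tokens / 1_000_000 * price_per_m
    if not cached or messages == 0:
        return messages * full
    # First message writes the cache; the rest read it at the hit rate.
    return full * write_multiplier + (messages - 1) * full * hit_multiplier


uncached = prefix_cost(10, 8_000, 3.0, cached=False)
cached = prefix_cost(10, 8_000, 3.0, cached=True)
savings = 1 - cached / uncached

print(f"uncached ${uncached:.4f}, cached ${cached:.4f}, saved {savings:.1%}")
```

This covers only the cached prefix. Conversation history and output tokens are billed as usual, which is why overall savings are lower than the prefix-only figure.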

Memory Retrieval Cache

Built-in memory uses hybrid search (70% vector + 30% BM25 keyword) with local SQLite — no external API calls for retrieval. This is already optimized; no configuration needed.
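
To illustrate what a 70/30 hybrid score means, here is a minimal Python sketch. It assumes both scores are already normalized to the 0-1 range and combined as a weighted sum; the actual normalization and fusion used by the built-in memory are not described here, and the candidate names and scores are made up.

```python
def hybrid_score(vector_score: float, bm25_score: float,
                 vector_weight: float = 0.7) -> float:
    """Weighted sum of a vector-similarity score and a BM25 keyword score."""
    return vector_weight * vector_score + (1 - vector_weight) * bm25_score


# (vector score, BM25 score) per candidate memory, both assumed in [0, 1].
candidates = {
    "note-a": (0.9, 0.1),  # semantically close, few keyword matches
    "note-b": (0.6, 0.9),  # moderately close, strong keyword match
    "note-c": (0.3, 0.1),
}

ranked = sorted(candidates,
                key=lambda name: hybrid_score(*candidates[name]),
                reverse=True)
print(ranked)
```

In this example the strong keyword match lifts note-b above note-a even though note-a has the higher vector score.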


Memory Optimization

Context Token Budget

The max_context_tokens setting controls how much memory is loaded into each prompt:

{
  "memory": {
    "max_context_tokens": 2000
  }
}

| Setting | Recall Quality | Cost Impact |
| --- | --- | --- |
| 1000 | Minimal (only most relevant) | Lowest cost |
| 2000 (default) | Good balance | Standard |
| 4000 | Better recall | +50% context cost |
| 8000+ | Near-complete recall | Expensive |

Start at 2000. Only increase if the agent frequently "forgets" important context.

Dreaming (Memory Consolidation)

Enable Dreaming to automatically consolidate short-term memory into durable long-term memory, reducing redundancy:

{
  "plugins": {
    "entries": {
      "memory-core": {
        "enabled": true,
        "config": {
          "dreaming": {
            "enabled": true,
            "cadence": "0 3 * * *"
          }
        }
      }
    }
  }
}

Dreaming runs three phases (Light → REM → Deep) that deduplicate, identify patterns, and promote the strongest memories. This keeps memory lean and relevant.

Alternative Memory Systems

For heavy agents, community memory systems can dramatically reduce token usage:

| System | Token Savings | Best For |
| --- | --- | --- |
| memU | ~90% | 24/7 autonomous agents |
| OpenViking | ~95% | Production at scale (3-tier hierarchical loading) |
| Mem0 | ~91% | Managed/team environments |

See Memory Systems Compared for a full comparison.


Thinking Mode

Extended thinking (reasoning mode) can increase token usage by 10-50x per message. It's powerful for complex problems but devastating for costs if left on globally.

Disable globally
{
  "brain": {
    "thinking": {
      "enabled": false
    }
  }
}

Enable with budget cap
{
  "brain": {
    "thinking": {
      "enabled": true,
      "budget_tokens": 10000
    }
  }
}

Rule of thumb: Disable thinking mode for heartbeat, simple chat, and routine tasks. Enable it only for complex reasoning, multi-step planning, or code generation.


Latency Reduction

Faster Models

| Priority | Model | Latency |
| --- | --- | --- |
| Fastest | Groq (LPU hardware) | ~18x faster than GPU |
| Very fast | Cerebras (2,957 tok/s) | Near-instant for short responses |
| Fast | Gemini 2.5 Flash | Low latency, high throughput |
| Standard | Claude Sonnet 4.6 | Balanced |
| Slow | Claude Opus 4.8 | Highest quality, highest latency |

Reduce Response Size

A smaller max_tokens caps how long a response can run, which shortens total generation time (time-to-first-token is largely unaffected):

{
  "brain": {
    "max_tokens": 2048
  }
}

MCP Server Latency

  • stdio servers have zero network overhead — prefer these for local tools
  • Remote MCP servers add network round-trips — run on the same machine when possible
  • Use allowed_tools to reduce the number of tools the agent considers (fewer options = faster decision)

{
  "mcp": {
    "servers": {
      "filesystem": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/user"],
        "allowed_tools": ["read_file", "list_directory"]
      }
    }
  }
}

Channel Optimization

Rate Limiting

Prevent runaway costs from chatty channels:

{
  "channels": {
    "discord": {
      "max_messages_per_hour": 30,
      "cooldown_seconds": 5,
      "require_mention": true
    }
  }
}

Mention-Gating

In group channels, require_mention: true prevents the agent from processing every message — only responding when directly addressed. This alone can reduce token usage 80%+ in active groups.
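
The three channel settings can be read as successive gates on each incoming message. The Python sketch below is one plausible interpretation (mention check first, then a sliding one-hour cap, then a cooldown since the last accepted message); the `ChannelGate` class is hypothetical, and OpenClaw's actual enforcement order may differ.

```python
from collections import deque


class ChannelGate:
    """Decides whether an incoming channel message should reach the model."""

    def __init__(self, max_messages_per_hour: int = 30,
                 cooldown_seconds: float = 5, require_mention: bool = True):
        self.max_messages_per_hour = max_messages_per_hour
        self.cooldown_seconds = cooldown_seconds
        self.require_mention = require_mention
        self._accepted = deque()  # timestamps (seconds) of accepted messages

    def should_process(self, now: float, mentioned: bool) -> bool:
        if self.require_mention and not mentioned:
            return False  # mention-gating: ignore unaddressed chatter
        # Forget accepted messages older than one hour.
        while self._accepted and now - self._accepted[0] >= 3600:
            self._accepted.popleft()
        if len(self._accepted) >= self.max_messages_per_hour:
            return False  # hourly cap reached
        if self._accepted and now - self._accepted[-1] < self.cooldown_seconds:
            return False  # still cooling down
        self._accepted.append(now)
        return True


demo = ChannelGate()
print(demo.should_process(now=0.0, mentioned=False))  # not addressed
print(demo.should_process(now=0.0, mentioned=True))   # addressed, accepted
print(demo.should_process(now=2.0, mentioned=True))   # inside the cooldown
```

Timestamps are passed in explicitly so the behaviour is easy to test; a real integration would use a clock such as `time.monotonic()`.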

Channel-Specific Models

For channels that don't need the best model (e.g., a status-check bot):

{
  "agents": {
    "status-bot": {
      "channels": ["slack-status"],
      "brain": {
        "model": "claude-haiku-4-5-20251001"
      }
    }
  }
}

Local Model Tuning

Running models locally eliminates API costs entirely. Here's how to maximize performance.

Ollama Optimization

# Keep model loaded in memory (prevents cold-start latency)
export OLLAMA_KEEP_ALIVE=-1

# Set concurrent requests
export OLLAMA_NUM_PARALLEL=4

Quantization

Smaller quantizations use less RAM with minimal quality loss:

| Quantization | RAM Savings | Quality Impact |
| --- | --- | --- |
| Q8_0 | Baseline | None |
| Q6_K | ~25% less | Negligible |
| Q5_K_M | ~40% less | Minimal |
| Q4_K_M | ~50% less | Slight |

# Pull a quantized model
ollama pull qwen3:14b-q4_K_M

GPU Offloading

GPU inference is 5-20x faster than CPU. Ensure your model fits in GPU VRAM:

| Model Size | VRAM Needed (Q4) | Recommended GPU |
| --- | --- | --- |
| 7-8B | ~4-5 GB | Any 6GB+ GPU |
| 14B | ~8-10 GB | RTX 3060 12GB |
| 32B | ~18-20 GB | RTX 3090 24GB |
| 70B | ~35-40 GB | A100 40GB |
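
The VRAM figures can be approximated from parameter count and bits per weight. This Python sketch assumes roughly 4.5 bits per weight for a Q4-class quantization, a rule of thumb rather than an exact figure for any specific format, and counts weights only. KV cache and runtime overhead come on top.

```python
def q4_weight_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of quantized weights in GB (weights only)."""
    return params_billion * bits_per_weight / 8


for size in (7, 14, 32, 70):
    print(f"{size}B: ~{q4_weight_gb(size):.1f} GB of weights")
```

Leave a few GB of headroom beyond the weight size for context: longer `max-model-len` settings need a larger KV cache.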

vLLM for Production

For highest throughput with local models:

# Model ID shown is the Hugging Face repo name; substitute your own model or local path
vllm serve Qwen/Qwen3-32B \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 2 # Split across 2 GPUs

See Local Models and Cloud GPU Models for detailed hardware guides.


Resource Management

Memory (RAM)

# Check current usage
ps aux | grep openclaw

| RSS | Status | Action |
| --- | --- | --- |
| 150-250 MB | Healthy idle | Normal |
| 250-500 MB | Busy/active | Normal under load |
| 500+ MB | Warning | Restart soon |
| 1+ GB | Critical | Restart now |

Reduce RAM usage:

  • Disconnect unused channels
  • Disable unused plugins
  • Lower max_context_tokens
  • Set Docker memory limit: deploy.resources.limits.memory: 2G

Disk

# Check disk usage
du -sh ~/.openclaw/
du -sh ~/.openclaw/logs/
du -sh ~/.openclaw/memory/

Manage disk:

  • Log rotation: max_size: "10m", max_files: 5 (~50 MB cap)
  • Prune old memory files periodically
  • Monitor with heartbeat alerts

CPU

CPU usage is normally low except during active token processing. Sustained high CPU indicates:

  • Infinite loop in a skill or tool
  • Stuck MCP server
  • Excessive browser automation

# Check CPU
top -p $(cat ~/.openclaw/gateway.pid)

# Check what's running
openclaw logs --follow

Skill Optimization

Efficient Skill Design

  • Declare only needed tools: use tools: [shell] instead of tools: [shell, browser, filesystem, http]
  • Clear instructions — vague skills cause retries and extra tool calls
  • Short prompts — every token in the skill is loaded per invocation

Avoid Always-On Skills

Skills with trigger "*" (always active) are loaded into every message. Use specific triggers:

Efficient: only fires on keyword
trigger: "deploy"

Expensive: loaded every message
trigger: "*"
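
The cost difference comes from which skills get loaded for a given message. A minimal Python sketch, assuming a trigger is either "*" or a keyword matched as a case-insensitive substring (the real matching rules may differ), with made-up skill names:

```python
def skills_to_load(message: str, skills: dict) -> list:
    """Names of skills whose trigger matches the message."""
    text = message.lower()
    return [name for name, trigger in skills.items()
            if trigger == "*" or trigger.lower() in text]


skills = {"deployer": "deploy", "logger": "*"}

print(skills_to_load("What's for lunch?", skills))
print(skills_to_load("Deploy the staging build", skills))
```

The wildcard skill is loaded for both messages, so its prompt tokens are paid on every request, while the keyword skill costs nothing until someone mentions a deploy.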

Provider-Specific Tips

Anthropic

  • Prompt caching is automatic — high-frequency conversations save 30-50%
  • Batch API for non-urgent tasks (50% cheaper, results within 24 hours)
  • Use Haiku for simple tasks, Sonnet for general work, Opus only for complex reasoning

OpenRouter

  • Auto-routing (openrouter/auto) picks cheapest capable model per prompt
  • Built-in cost tracking across all providers
  • Single API key for 100+ models

Google Gemini

  • 2.5 Flash: fastest and cheapest cloud model for most tasks
  • 1M token context window: useful for large codebases without chunking
  • Free tier available for low-volume usage

DeepSeek

  • V3.2: excellent quality at ~$0.40/M tokens — great default for cost-conscious setups
  • Works well as primary model or fallback

Real-World Optimization Example

Before: $1,200/month

  • Opus 4.5 for everything (heartbeat, chat, skills)
  • 30-minute heartbeat, 24/7
  • 50K context tokens
  • No session resets
  • Extended thinking enabled

After: $36/month (97% reduction)

  1. Local Qwen3 14B for heartbeat ($0)
  2. DeepSeek V3.2 for chat ($0.40/M tokens)
  3. Sonnet 4.6 for complex tasks only
  4. Context reduced to 8K tokens
  5. 1-hour heartbeat with quiet hours
  6. Thinking mode disabled by default
  7. Session auto-resets enabled

Tuning Checklist

Quick Wins (do first)

  • Switch heartbeat to cheap/local model
  • Set quiet hours for heartbeat
  • Increase heartbeat interval to 1+ hours
  • Set max_context_tokens to 2000-4000
  • Enable mention-gating in group channels
  • Set max_messages_per_hour rate limits

Medium Effort

  • Enable Dreaming for memory consolidation
  • Configure fallback provider chain
  • Simplify HEARTBEAT.md instructions
  • Audit skills for unnecessary "*" triggers
  • Filter MCP server tools with allowed_tools
  • Disable extended thinking for routine tasks

Advanced

  • Set up local models for heartbeat/simple tasks
  • Configure per-channel model routing
  • Deploy on GPU hardware for local inference
  • Use vLLM for high-throughput local serving
  • Implement budget alerts via heartbeat monitoring

Monitor Results

# Before and after comparison
openclaw stats tokens --period 7
openclaw gateway usage-cost
openclaw stats heartbeat

See Also