
Monitoring & Observability

A 24/7 agent needs 24/7 visibility. This guide covers how to monitor OpenClaw in production — from built-in CLI tools to OpenTelemetry tracing to cost dashboards and alerting.


Built-In Monitoring

Status & Health

# Quick status — gateway, channels, tasks, memory usage
openclaw status

# Machine-readable for scripting
openclaw status --json

# Full health check — config, dependencies, environment
openclaw doctor

# Export comprehensive diagnostic bundle
openclaw diagnostics > ~/openclaw-debug.txt

openclaw doctor validates:

  • Gateway process state
  • Config file syntax
  • Node.js version compatibility
  • Filesystem permissions
  • MCP server connectivity
  • Plugin health
  • Environment variables

Statistics

# Token usage breakdown
openclaw stats tokens

# Token usage over the last 7 days (--period is in days)
openclaw stats tokens --period 7

# Per-channel message statistics
openclaw stats channels

# Heartbeat execution history
openclaw stats heartbeat

# Cost breakdown by model, channel, and task type
openclaw gateway usage-cost

Component Diagnostics

# MCP servers
openclaw mcp doctor # Diagnose all connections
openclaw mcp status # Connection overview
openclaw mcp probe <name> # Test specific server

# Plugins
openclaw plugins doctor # Diagnose plugin issues
openclaw plugins inspect <name> --runtime # Runtime state

# Channels
openclaw channel list # All channels and status
openclaw channel status <name> # Specific channel details

# Security
openclaw security audit # Config and permissions
openclaw security audit --deep # WebSocket probe, browser exposure, plugins

Logging

Log Levels

Control verbosity with the OPENCLAW_LOG_LEVEL environment variable:

| Level | What You See | Use When |
|---|---|---|
| debug | Everything: API calls, messages, tool invocations, memory writes | Debugging specific issues |
| info | Standard operations (default) | Normal production |
| warn | Warnings and errors only | Quiet production, high-traffic agents |
| error | Errors only | Minimal logging |

# Run with debug logging
OPENCLAW_LOG_LEVEL=debug openclaw gateway

# Set permanently in config
openclaw config set logging.level "info"

Filtering by Component

# Real-time log stream, all components
openclaw logs --follow

# Filter to specific component
openclaw logs --filter heartbeat --follow
openclaw logs --filter channel --follow
openclaw logs --filter brain --follow
openclaw logs --filter hands --follow
openclaw logs --filter plugin --follow
openclaw logs --filter mcp --follow

# Last N lines (useful for post-mortem)
openclaw logs --lines 200
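For a post-mortem it can help to count severities before reading every line. A minimal sketch, assuming each log line carries its level as a standalone word such as `ERROR` or `WARN` (the OpenClaw log format is not documented here, so adjust the patterns to match yours); `summarize_levels` is a hypothetical helper:

```sh
#!/bin/sh
# Sketch: count ERROR and WARN/WARNING lines in a log stream read from stdin.
# ASSUMPTION: the level appears as a standalone word (any case); the real
# OpenClaw log format may differ.
summarize_levels() {
  awk '
    {
      # Pad with spaces so the word-boundary test also works at line edges.
      line = " " toupper($0) " "
      if (line ~ /[^A-Z]ERROR[^A-Z]/) errors++
      if (line ~ /[^A-Z]WARN(ING)?[^A-Z]/) warnings++
    }
    END { printf "errors=%d warnings=%d\n", errors, warnings }
  '
}

# Usage (post-mortem on the last 200 lines):
#   openclaw logs --lines 200 | summarize_levels
```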

Log Rotation

Configure rotation to prevent disk exhaustion:

~/.openclaw/openclaw.json
{
  "logging": {
    "level": "info",
    "path": "~/.openclaw/logs",
    "max_size": "10m",
    "max_files": 5
  }
}

With these settings, logs rotate at 10 MB and keep 5 files, for a maximum of about 50 MB of disk usage.
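The cap is simply `max_size` times `max_files`. A small sketch that does the same arithmetic for other values, assuming `k`/`m`/`g` suffixes are 1024-based and a bare number means bytes (how OpenClaw actually parses `max_size` is an assumption here); `size_to_kib` and `max_log_usage_mib` are hypothetical helpers:

```sh
#!/bin/sh
# Sketch: worst-case log disk usage = max_size x max_files.
# ASSUMPTION: k/m/g suffixes are 1024-based; a bare number means bytes.

# size_to_kib "10m" prints 10240
size_to_kib() {
  local size num unit
  size=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  num=${size%[kmg]}       # drop one trailing unit letter, if any
  unit=${size#"$num"}     # the letter that was dropped ("" for bytes)
  case "$num" in ''|*[!0-9]*) return 1 ;; esac   # reject "10x", "" and similar
  case "$unit" in
    k) echo "$num" ;;
    m) echo $(( num * 1024 )) ;;
    g) echo $(( num * 1024 * 1024 )) ;;
    *) echo $(( num / 1024 )) ;;
  esac
}

# max_log_usage_mib "10m" 5 prints 50
max_log_usage_mib() {
  local kib
  kib=$(size_to_kib "$1") || return 1
  echo $(( kib * $2 / 1024 ))
}
```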

Audit Logging

Record a tamper-evident trail of all agent actions:

~/.openclaw/openclaw.json
{
  "logging": {
    "audit": {
      "enabled": true,
      "path": "~/.openclaw/logs/audit.log",
      "log_tool_calls": true,
      "log_memory_writes": true,
      "log_channel_messages": true
    }
  }
}

Each audit entry records:

  • Timestamp
  • User/sender ID
  • Action type (tool call, message, memory write)
  • Result (success/failure)
  • IP address (for webhook/API calls)

Audit logs are retained for 90 days by default (configurable up to 365 days) and can be exported as CSV or JSON for compliance reviews.


Metrics

What OpenClaw Tracks

| Category | Metrics | Command |
|---|---|---|
| Tokens | Usage by model, input/output split, cost per request | openclaw stats tokens |
| Cost | Total spend, cost by model/channel/task, daily/weekly trends | openclaw gateway usage-cost |
| Channels | Messages sent/received, errors, connection uptime | openclaw stats channels |
| Heartbeat | Execution count, tokens per cycle, cost per cycle, timing | openclaw stats heartbeat |
| Memory | Context tokens loaded, memory file count, disk usage | openclaw status |
| MCP | Server connectivity, tool count, response times | openclaw mcp status |
| Plugins | Load status, hook execution, errors | openclaw plugins list |

Token Cost Breakdown

Typical cost distribution for a 24/7 agent:

| Component | Share | Optimization |
|---|---|---|
| Heartbeat | ~35% | Increase interval, use a cheaper model, set quiet hours |
| Chat | ~25% | Rate-limit channels, reset sessions |
| Skills | ~20% | Simplify skill prompts, reduce tool calls |
| Context | ~15% | Lower max_context_tokens |
| Sub-agents | ~5% | Limit multi-agent workflows |

# Identify your biggest cost driver
openclaw gateway usage-cost
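To see what the typical shares above mean in dollars, a small sketch that splits a total spend by those percentages. They are the typical figures from the table, not your measured ones; `split_cost` is a hypothetical helper. For a $50 total it attributes $17.50 to heartbeat and $2.50 to sub-agents.

```sh
#!/bin/sh
# Sketch: split a total spend using the typical shares from the table above.
# The percentages are illustrative defaults; replace them with your own data.
split_cost() {
  awk -v total="$1" 'BEGIN {
    n = split("Heartbeat:35 Chat:25 Skills:20 Context:15 Sub-agents:5", parts, " ")
    for (i = 1; i <= n; i++) {
      split(parts[i], kv, ":")          # kv[1] = component, kv[2] = percent
      printf "%-10s %5.2f\n", kv[1], total * kv[2] / 100
    }
  }'
}

# Usage:
#   split_cost 50
```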

OpenTelemetry

Community Observability Plugin

The OpenClaw Observability Plugin exports traces and metrics to OpenTelemetry-compatible backends.

Supported backends:

  • Grafana Cloud
  • Dynatrace
  • Jaeger
  • Any OTLP-compatible collector

What it traces:

  • Complete agent workflow spans (start to finish)
  • Individual tool execution spans
  • LLM API call spans with token breakdowns
  • Memory read/write operations
  • Session context propagation across distributed traces

Span attributes:

  • Session ID, user ID, model name
  • Channel source, message priority
  • Token count (input/output), cost
  • Tool name, execution duration, result status

Setup

# Install the plugin
openclaw plugins install openclaw-observability

# Configure OTLP endpoint
~/.openclaw/openclaw.json
{
  "plugins": {
    "entries": {
      "openclaw-observability": {
        "enabled": true,
        "config": {
          "otlp_endpoint": "https://otlp.grafana.net/otlp",
          "otlp_headers": {
            "Authorization": "Basic ${GRAFANA_OTLP_TOKEN}"
          },
          "service_name": "openclaw-agent",
          "trace_all_tool_calls": true,
          "trace_llm_calls": true,
          "include_token_metrics": true
        }
      }
    }
  }
}
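The config above expects `GRAFANA_OTLP_TOKEN` to already hold the encoded Basic credential. Grafana Cloud's OTLP gateway uses HTTP Basic auth, which is base64 of `<instance ID>:<API token>`; confirm both values in your stack's OTLP connection details. A sketch that builds the value; `make_basic_token` and the input variable names in the usage comment are hypothetical:

```sh
#!/bin/sh
# Sketch: build the Basic-auth value for GRAFANA_OTLP_TOKEN.
# ASSUMPTION: the backend expects base64("<instance-id>:<api-token>").
make_basic_token() {
  # printf avoids encoding a trailing newline into the token;
  # tr strips the line wrapping some base64 implementations add.
  printf '%s:%s' "$1" "$2" | base64 | tr -d '\n'
}

# Usage (hypothetical variable names):
#   export GRAFANA_OTLP_TOKEN="$(make_basic_token "$GRAFANA_INSTANCE_ID" "$GRAFANA_API_TOKEN")"
```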

Local Development with Jaeger

For local tracing without a cloud backend:

docker-compose.otel.yml
services:
  openclaw:
    image: openclaw/openclaw:latest
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # Jaeger UI
      - "4318:4318" # OTLP HTTP receiver

Access the Jaeger UI at http://localhost:16686 to visualize traces.

Knostic Telemetry (Privacy-First)

For environments where data must stay local:

  • Local-only processing — nothing leaves your machine
  • Cryptographic hash chains for tamper evidence
  • Opt-in granularity (choose what to track)
  • No external dependencies
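Hash chains are what make a log tamper-evident: each entry's hash covers the previous entry's hash, so editing or deleting one line breaks every link after it. A minimal sketch of the technique over a plain text file, assuming GNU coreutils `sha256sum`. It is a generic illustration, not the format Knostic or Clawprint actually use; `chain_log` and `verify_chain` are hypothetical names.

```sh
#!/bin/sh
# Sketch: SHA-256 hash chain over log lines (generic, not a vendor format).
# Requires sha256sum from GNU coreutils.

GENESIS=$(printf '%064d' 0)   # starting "previous hash": 64 zeros

# link_hash PREV LINE prints sha256(PREV followed by LINE) as hex
link_hash() {
  printf '%s%s' "$1" "$2" | sha256sum | cut -d' ' -f1
}

# chain_log IN OUT: write "<hash> <line>", each hash covering the previous one
chain_log() {
  local prev line
  prev=$GENESIS
  while IFS= read -r line || [ -n "$line" ]; do
    prev=$(link_hash "$prev" "$line")
    printf '%s %s\n' "$prev" "$line"
  done < "$1" > "$2"
}

# verify_chain FILE: recompute every link and report the first mismatch
verify_chain() {
  local prev entry hash line n
  prev=$GENESIS
  n=0
  while IFS= read -r entry || [ -n "$entry" ]; do
    n=$((n + 1))
    hash=${entry%% *}   # stored hash: text before the first space
    line=${entry#* }    # original line: text after the first space
    if [ "$(link_hash "$prev" "$line")" != "$hash" ]; then
      echo "chain broken at line $n"
      return 1
    fi
    prev=$hash
  done < "$1"
  echo "chain intact ($n entries)"
}
```

Someone who can rewrite the whole file can also recompute the whole chain, so the usual practice is to copy the latest hash somewhere the agent cannot write.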

Health Checks

CLI Health Check

# Quick check — exit code 0 = healthy, non-zero = unhealthy
openclaw doctor
echo $? # 0 = healthy
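In a script, bound the check with a timeout so a hung `openclaw doctor` counts as a failure instead of stalling the monitor. A sketch using GNU coreutils `timeout`, which exits with status 124 when the limit is hit; `health_state` is a hypothetical helper, and the checked command must be an executable rather than a shell function:

```sh
#!/bin/sh
# Sketch: map a health command's exit code to a label, with a hard time limit.
# Requires GNU coreutils `timeout` (exit status 124 on timeout).
health_state() {
  local limit status
  limit=$1; shift
  timeout "$limit" "$@" > /dev/null 2>&1
  status=$?
  case $status in
    0)   echo healthy ;;
    124) echo timeout; return 1 ;;
    *)   echo unhealthy; return 1 ;;
  esac
}

# Usage:
#   health_state 15 openclaw doctor || echo "needs attention"
```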

Docker

docker-compose.yml
services:
  openclaw:
    image: openclaw/openclaw:latest
    healthcheck:
      test: ["CMD", "openclaw", "doctor"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

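# Check health status (Compose prefixes container names, so look up the ID)
docker inspect --format='{{.State.Health.Status}}' "$(docker compose ps -q openclaw)"

# View health check history
docker inspect --format='{{range .State.Health.Log}}{{.Output}}{{end}}' "$(docker compose ps -q openclaw)"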

Kubernetes

deployment.yaml
livenessProbe:
  exec:
    command: ["openclaw", "doctor"]
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 15

readinessProbe:
  exec:
    command: ["openclaw", "status"]
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 10

External Uptime Monitoring

Point any HTTP uptime monitor at the gateway health endpoint, or use SSH-based checks:

# SSH-based health check from external monitor
ssh deploy@your-server "openclaw doctor" && echo "UP" || echo "DOWN"
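An SSH hop adds its own failure modes (network blips, slow key exchange), so one failed attempt is weak evidence. A sketch that retries before reporting DOWN, similar in spirit to Docker's `retries` setting; `check_with_retries` is a hypothetical helper:

```sh
#!/bin/sh
# Sketch: run a check up to N times, sleeping between attempts,
# and report DOWN only if every attempt fails.
check_with_retries() {
  local attempts delay i
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" > /dev/null 2>&1; then
      echo UP
      return 0
    fi
    [ "$i" -lt "$attempts" ] && sleep "$delay"   # no sleep after the last try
    i=$((i + 1))
  done
  echo DOWN
  return 1
}

# Usage:
#   check_with_retries 3 10 ssh deploy@your-server "openclaw doctor"
```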

Alerting

Heartbeat-Based Alerts

The simplest alerting — embed monitoring instructions in your heartbeat:

~/.openclaw/HEARTBEAT.md
## System Health (every heartbeat cycle)

- Check that all channels are connected via `openclaw channel list`
- Check MCP server health via `openclaw mcp status`
- Monitor memory usage — alert if RSS exceeds 500 MB
- Check disk usage — alert if ~/.openclaw exceeds 1 GB
- Review error logs since last heartbeat
- If any issues found, send a summary to Telegram with severity level

The agent runs these checks every heartbeat cycle (default: 30 minutes) and sends alerts through whichever channel you configure.
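The disk rule can also be checked outside the agent, which helps confirm what it reports. A sketch using `du -sk` (sizes in KiB), treating 1 GB as 1024 MiB; `dir_over_limit` is a hypothetical helper:

```sh
#!/bin/sh
# Sketch: alert when a directory grows past a limit given in MiB.
# Returns 0 = OK, 1 = over limit, 2 = directory unreadable.
dir_over_limit() {
  local dir limit_mib used_kib
  dir=$1; limit_mib=$2
  used_kib=$(du -sk "$dir" 2>/dev/null | cut -f1)
  [ -n "$used_kib" ] || { echo "cannot read $dir"; return 2; }
  if [ "$used_kib" -gt $(( limit_mib * 1024 )) ]; then
    echo "ALERT: $dir uses $(( used_kib / 1024 )) MiB (limit $limit_mib MiB)"
    return 1
  fi
  echo "OK: $dir uses $(( used_kib / 1024 )) MiB"
}

# Usage:
#   dir_over_limit ~/.openclaw 1024
```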

Outgoing Webhooks

Fire HTTP webhooks on specific events:

~/.openclaw/openclaw.json
{
  "webhooks": {
    "outgoing": [
      {
        "event": "agent.error",
        "url": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
      },
      {
        "event": "agent.task.completed",
        "url": "https://your-app.com/task-done"
      },
      {
        "event": "security.alert",
        "url": "https://events.pagerduty.com/v2/enqueue"
      },
      {
        "event": "channel.disconnected",
        "url": "https://hooks.slack.com/services/YOUR/ALERT/URL"
      }
    ]
  }
}

Cron-Based Alerts

Schedule periodic checks that alert on failure:

# Hourly health check during business hours (09:00 to 17:00, Monday to Friday)
openclaw cron add "health-check" \
--schedule "0 9-17 * * 1-5" \
--message "Run openclaw doctor. If any checks fail, send an alert to Slack with the failure details."

# Nightly cost check
openclaw cron add "cost-check" \
--schedule "0 23 * * *" \
--message "Check today's token spend via openclaw gateway usage-cost. If over \$5, alert on Telegram with a breakdown."

# Weekly security audit
openclaw cron add "security-audit" \
--schedule "0 3 * * 0" \
--message "Run openclaw security audit --deep. Report any new findings to Telegram."

GitHub Actions Nightly Check

.github/workflows/nightly-health.yml
name: Nightly Health Check
on:
  schedule:
    - cron: '0 3 * * *'

jobs:
  health:
    runs-on: ubuntu-latest
    steps:
      - name: Check agent health
        env:
          AGENT_HOST: ${{ secrets.AGENT_HOST }}
          SSH_KEY: ${{ secrets.SSH_KEY }}
          # Output of `ssh-keyscan your-server`, stored as a secret
          SSH_KNOWN_HOSTS: ${{ secrets.SSH_KNOWN_HOSTS }}
        run: |
          mkdir -p ~/.ssh
          echo "$SSH_KEY" > ~/.ssh/key && chmod 600 ~/.ssh/key
          # Pin the host key instead of disabling StrictHostKeyChecking
          echo "$SSH_KNOWN_HOSTS" >> ~/.ssh/known_hosts
          ssh -i ~/.ssh/key \
            deploy@$AGENT_HOST \
            "openclaw doctor && openclaw mcp doctor && openclaw plugins doctor"

      - name: Alert on failure
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          payload: '{"text": "OpenClaw health check FAILED. Check logs."}'
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

Cost Monitoring

Built-In Cost Tracking

# Token usage summary
openclaw stats tokens

# Full cost breakdown
openclaw gateway usage-cost

# Heartbeat-specific costs
openclaw stats heartbeat

Budget Controls

~/.openclaw/openclaw.json
{
  "budget": {
    "daily_limit_usd": 5.00,
    "monthly_limit_usd": 50.00,
    "alert_threshold": 0.8 // Alert at 80% of limit
  }
}
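The limit and threshold combine multiplicatively: with a $5.00 daily limit and a 0.8 threshold, alerts start at $4.00 spent. A sketch of that logic for use in your own scripts, using awk because shell arithmetic has no decimals; `budget_state` is a hypothetical helper and treating "at 80%" as "at or above 80%" is an assumption:

```sh
#!/bin/sh
# Sketch: classify spend against a limit and an alert threshold (0..1).
# Prints ok, alert (at/above threshold), or over (at/above the limit).
budget_state() {
  awk -v spent="$1" -v limit="$2" -v threshold="$3" 'BEGIN {
    s = spent + 0; l = limit + 0     # force numeric comparison
    if (s >= l)                  print "over"
    else if (s >= l * threshold) print "alert"
    else                         print "ok"
  }'
}

# Usage:
#   budget_state 4.20 5.00 0.8
```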

Cost Dashboards

| Tool | What It Shows |
|---|---|
| Clawalytics (clawalytics.com) | Real-time spend, per-agent breakdown, daily charts, suspicious activity alerts |
| ClawWatcher | Real-time token usage, cost per model, skills/actions tracking |
| claw-dash | Sessions, 24h tokens, costs, model info, cron jobs, system health |
| openclaw-dashboard | Browser notifications for usage limits, cost analysis by model |
| OpenClaw Cost Calculator (calculator.vlvt.sh) | Pre-deployment cost estimation |

Cost Anomaly Detection

Watch for these patterns:

| Symptom | Likely Cause | Fix |
|---|---|---|
| Cost doubles overnight | Heartbeat running an expensive model | Use a cheap model for heartbeat |
| Sudden cost spike | Infinite loop or stuck tool | Check openclaw logs --follow |
| Gradual cost increase | Context snowball (growing memory) | Reset sessions, lower max_context_tokens |
| High cost per message | Thinking/reasoning mode enabled | Disable extended thinking |
| Cost from unexpected model | Fallback provider triggered | Check primary provider health |
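The "cost doubles overnight" pattern can be caught mechanically by comparing today's spend with a recent baseline. A sketch that flags today when it is at least twice the mean of the previous days you pass in; the 2x factor mirrors the table and is a tunable assumption, and `cost_anomaly` is a hypothetical helper (exit status 1 means anomaly):

```sh
#!/bin/sh
# Sketch: cost_anomaly TODAY DAY1 DAY2 ... compares TODAY with the mean of
# the baseline days. Exit 0 = normal, 1 = anomaly, 2 = no baseline given.
cost_anomaly() {
  local today
  today=$1; shift
  [ "$#" -gt 0 ] || { echo "no baseline days given"; return 2; }
  printf '%s\n' "$@" | awk -v today="$today" -v factor=2 '
    { sum += $1; n++ }
    END {
      mean = sum / n
      anomaly = (today + 0 >= factor * mean)
      printf "%s: today %.2f vs %d-day mean %.2f\n", (anomaly ? "ANOMALY" : "normal"), today, n, mean
      exit anomaly
    }'
}

# Usage (daily totals taken from openclaw gateway usage-cost):
#   cost_anomaly 6.40 2.90 3.10 3.00
```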

Dashboards

Control UI (Built-In)

OpenClaw includes a web dashboard at http://localhost:18789 (requires gateway auth token):

  • Connected channels and their status
  • Active tasks and Workboard state
  • Memory usage and recent writes
  • Recent logs
  • Heartbeat status and history

Remote access via SSH tunnel:

ssh -N -L 18789:127.0.0.1:18789 user@your-server
# Then open http://localhost:18789 in your browser

Grafana + OpenTelemetry

With the observability plugin exporting to Grafana Cloud:

Recommended dashboard panels:

  • Agent response latency (p50, p95, p99)
  • Token usage over time (by model)
  • Error rate by component
  • Channel message throughput
  • Heartbeat execution timeline
  • Cost accumulation curve
  • MCP server health matrix
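To sanity-check a latency panel against raw samples, a sketch computing nearest-rank percentiles: sort the values and take the one at rank ceil(p/100 × n). Your backend may use an interpolating method, so small differences are expected; `latency_percentiles` is a hypothetical helper that reads one number per line on stdin.

```sh
#!/bin/sh
# Sketch: nearest-rank p50/p95/p99 of numeric samples on stdin.
latency_percentiles() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      split("50 95 99", ps, " ")
      for (i = 1; i <= 3; i++) {
        rank = int((ps[i] * NR + 99) / 100)   # integer ceil(p * n / 100)
        printf "p%s=%s\n", ps[i], v[rank]
      }
    }'
}

# Usage:
#   printf '120\n80\n300\n95\n' | latency_percentiles
```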

Custom Monitoring Script

Build a simple terminal dashboard by polling the CLI in a loop:

monitor.sh
#!/bin/bash
while true; do
  clear
  echo "=== OpenClaw Monitor ==="
  echo ""

  # Gateway status
  echo "--- Status ---"
  openclaw status 2>/dev/null || echo "GATEWAY DOWN"
  echo ""

  # Channel health
  echo "--- Channels ---"
  openclaw channel list 2>/dev/null
  echo ""

  # Recent errors (grep -E gives portable "|" alternation)
  echo "--- Recent Errors ---"
  openclaw logs --lines 10 2>/dev/null | grep -iE "error|warn" || echo "None"
  echo ""

  # Token usage
  echo "--- Token Usage (24h) ---"
  openclaw stats tokens --period 1 2>/dev/null
  echo ""

  sleep 60
done

Security Monitoring

Audit Trail

Enable audit logging to track every agent action:

{
  "logging": {
    "audit": {
      "enabled": true,
      "log_tool_calls": true,
      "log_memory_writes": true,
      "log_channel_messages": true
    }
  }
}

Review periodically:

# Recent audit entries
tail -50 ~/.openclaw/logs/audit.log

# Search for specific actions
grep "tool_call" ~/.openclaw/logs/audit.log | tail -20
grep "memory_write" ~/.openclaw/logs/audit.log | tail -20

Drift Detection

Monitor for unauthorized changes to critical files:

| File | Risk | Tool |
|---|---|---|
| SOUL.md | Personality tampering | ClawSec soul-guardian |
| openclaw.json | Config manipulation | File hash monitoring |
| skills/ | Malicious skill injection | openclaw security scan --all |
| memory/ | Memory poisoning | Audit logging |
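"File hash monitoring" can be as simple as recording SHA-256 baselines and re-checking them. A sketch assuming GNU coreutils `sha256sum` (on macOS, `shasum -a 256` is the equivalent); `snapshot_hashes` and `check_drift` are hypothetical helpers. Store the baseline outside `~/.openclaw`, since anything able to edit the config could otherwise edit the baseline too, and re-snapshot after intentional changes.

```sh
#!/bin/sh
# Sketch: record SHA-256 baselines for critical files, then re-check them.
# Requires GNU coreutils sha256sum.

# snapshot_hashes BASELINE FILE...: write "<hash>  <path>" lines
snapshot_hashes() {
  local baseline
  baseline=$1; shift
  sha256sum -- "$@" > "$baseline"
}

# check_drift BASELINE: prints only changed or missing files;
# exit status is non-zero if anything drifted.
check_drift() {
  sha256sum --check --quiet -- "$1"
}

# Usage:
#   snapshot_hashes /var/lib/openclaw-hashes.sha256 ~/.openclaw/openclaw.json ~/.openclaw/SOUL.md
#   check_drift /var/lib/openclaw-hashes.sha256 || echo "drift detected"
```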

Community Security Tools

| Tool | What It Does |
|---|---|
| ClawSec | Complete security suite: SOUL.md drift detection, audit watchdog, CVE monitoring |
| Clawprint | SHA-256 hash chain audit trail, web dashboard, 24/7 daemon mode, secret redaction |
| ClawBands | Human-in-the-loop approval before dangerous tool calls, JSON audit logging |
| SkillGuard | Skill vulnerability scanner: prompt injection, credential leaks, malicious patterns |
| Security Monitor | 32-script suite for proactive threat monitoring with daily automated scans |

Automated Security Audits

# Schedule weekly deep audit
openclaw cron add "deep-audit" \
--schedule "0 3 * * 0" \
--message "Run openclaw security audit --deep and openclaw security scan --all. Report findings to Telegram."

Process Management

Systemd

# Install as systemd service
openclaw onboard --install-daemon

# Check service status
systemctl status openclaw

# View service logs
journalctl -u openclaw --follow

# Restart
systemctl restart openclaw

Docker Restart Policy

docker-compose.yml
services:
  openclaw:
    restart: unless-stopped # Restart on crash, not on manual stop
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: "2.0"

Process Monitoring

# Check if running
openclaw status

# Check resource usage ([o] keeps grep from matching its own process)
ps aux | grep '[o]penclaw'

# Memory usage guidelines (RSS)
# Healthy idle: 150–250 MB
# Warning: >500 MB
# Critical: >1 GB, restart recommended
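The thresholds above can be applied in a script. A sketch that reads a process's RSS with `ps -o rss=` (reported in KiB on Linux and macOS) and classifies it, treating 1 GB as 1024 MiB; `classify_rss_mib` and `rss_status` are hypothetical helpers, and the usage line assumes `gateway.pid` contains just the gateway's PID:

```sh
#!/bin/sh
# Sketch: classify resident memory against the warning/critical thresholds.

classify_rss_mib() {
  if [ "$1" -gt 1024 ]; then echo critical
  elif [ "$1" -gt 500 ]; then echo warning
  else echo ok
  fi
}

# rss_status PID: prints "<class> (<n> MiB)", or "not running" (status 2)
rss_status() {
  local rss_kib mib
  rss_kib=$(ps -o rss= -p "$1" 2>/dev/null | tr -d ' ')
  [ -n "$rss_kib" ] || { echo "not running"; return 2; }
  mib=$(( rss_kib / 1024 ))
  echo "$(classify_rss_mib "$mib") ($mib MiB)"
}

# Usage:
#   rss_status "$(cat ~/.openclaw/gateway.pid)"
```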

Self-Monitoring Patterns

Comprehensive Heartbeat Monitor

~/.openclaw/HEARTBEAT.md
## System Health (every heartbeat)

Check the following and send a Telegram alert if any fail:

1. **Gateway**: Run `openclaw status` — confirm gateway is running
2. **Channels**: Run `openclaw channel list` — all channels should show connected
3. **MCP**: Run `openclaw mcp status` — all servers should respond
4. **Memory**: Check RSS with `ps aux | grep openclaw` — alert if >500 MB
5. **Disk**: Check `df -h ~/.openclaw` — alert if usage >85%
6. **Errors**: Check `openclaw logs --lines 20` for ERROR or WARN entries
7. **Cost**: Check today's spend — alert if >$3 (daily budget: $5)

Format the alert as:
- OK items as a count
- Failed items with details
- Only send if something is wrong
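Check 5 relies on the agent reading `df` correctly. The same rule as a script, using `df -P` so each filesystem is reported on one line with Use% in the fifth column; `disk_use_pct` and `disk_check` are hypothetical helpers:

```sh
#!/bin/sh
# Sketch: alert when the filesystem holding a path is more than N% full.

# disk_use_pct PATH: prints the Use% number (without the % sign)
disk_use_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# disk_check PATH LIMIT_PCT: 0 = OK, 1 = over limit, 2 = df failed
disk_check() {
  local pct
  pct=$(disk_use_pct "$1")
  [ -n "$pct" ] || { echo "cannot stat $1"; return 2; }
  if [ "$pct" -gt "$2" ]; then
    echo "ALERT: $1 at ${pct}% (limit $2%)"
    return 1
  fi
  echo "OK: $1 at ${pct}%"
}

# Usage:
#   disk_check ~/.openclaw 85
```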

Watchdog Pattern

Use a second agent or cron job to monitor the primary:

/etc/cron.d/openclaw-watchdog
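# One job per line: cron does not support backslash line continuations.
# Use the full path to openclaw if it is not on cron's minimal default PATH.
*/5 * * * * deploy openclaw doctor > /dev/null 2>&1 || curl -s -X POST -H 'Content-type: application/json' -d '{"text":"OpenClaw health check failed!"}' https://hooks.slack.com/services/YOUR/WEBHOOK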
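As written, the watchdog posts every 5 minutes for as long as the agent stays down. Remembering the last state in a file makes it alert once on failure and once on recovery. A sketch; `watchdog_tick` is a hypothetical helper and the state-file path in the usage comment is an arbitrary choice:

```sh
#!/bin/sh
# Sketch: run a health command and print ALERT only on an up->down change
# and RECOVERED only on down->up; print nothing while the state is unchanged.
watchdog_tick() {
  local state_file prev now
  state_file=$1; shift
  prev=$(cat "$state_file" 2>/dev/null || echo up)   # assume "up" on first run
  if "$@" > /dev/null 2>&1; then now=up; else now=down; fi
  echo "$now" > "$state_file"
  if [ "$prev" = up ] && [ "$now" = down ]; then
    echo ALERT
  elif [ "$prev" = down ] && [ "$now" = up ]; then
    echo RECOVERED
  fi
}

# Usage from a cron-run script:
#   msg=$(watchdog_tick /var/tmp/openclaw-watchdog.state openclaw doctor)
#   [ -n "$msg" ] && curl -s -X POST -H 'Content-type: application/json' \
#     -d "{\"text\":\"OpenClaw watchdog: $msg\"}" https://hooks.slack.com/services/YOUR/WEBHOOK
```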

Key Files to Monitor

| Path | What to Watch For |
|---|---|
| ~/.openclaw/logs/gateway.log | Errors, crashes, slow responses |
| ~/.openclaw/logs/audit.log | Unusual actions, unauthorized access |
| ~/.openclaw/gateway.pid | Stale PID (process died but PID file remains) |
| ~/.openclaw/openclaw.json | Unauthorized config changes |
| ~/.openclaw/SOUL.md | Tampering (personality/behavior changes) |
| ~/.openclaw/memory/ | Unexpected growth, corruption |
| ~/.openclaw/skills/ | New files (unauthorized skill install) |
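A stale PID file can be told apart from a live process with `kill -0`, which sends no signal and only tests whether the PID exists. It also fails for a live process owned by another user, so run the check as the user that runs the gateway. A sketch assuming `gateway.pid` holds just the PID; `pid_file_state` is a hypothetical helper:

```sh
#!/bin/sh
# Sketch: classify a PID file as running, stale, or missing.
# Exit 0 = running, 1 = stale, 2 = missing/unreadable.
pid_file_state() {
  local pid
  [ -r "$1" ] || { echo missing; return 2; }
  pid=$(tr -dc '0-9' < "$1")   # keep digits only (drops newline/whitespace)
  if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    echo "running ($pid)"
  else
    echo stale
    return 1
  fi
}

# Usage:
#   pid_file_state ~/.openclaw/gateway.pid
```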

See Also