
Voice & Multimodal

OpenClaw can see images, hear voice messages, speak back in natural voice, analyze screenshots, generate images, and stream real-time audio, all through its standard channel and skill architecture. This guide covers every multimodal capability and how to wire it up.

Start here

The fastest path to multimodal: enable Whisper for voice-to-text on Telegram, configure a vision-capable model, and install an image generation skill from ClawHub. You'll have a voice-controlled, vision-enabled, image-generating agent in under 30 minutes.


Architecture Overview

┌──────────────────────────────────────────────────┐
│                  Channel Layer                   │
│  Telegram · WhatsApp · Discord · Signal · Slack  │
│         (receives voice, images, files)          │
└──────────────┬──────────────────┬────────────────┘
               │                  │
        ┌──────▼──────┐    ┌──────▼──────┐
        │ STT Engine  │    │ Attachment  │
        │  (Whisper)  │    │   Parser    │
        └──────┬──────┘    └──────┬──────┘
               │                  │
        ┌──────▼──────────────────▼──────┐
        │          Brain (LLM)           │
        │    Claude · GPT-4o · Gemini    │
        │  (text + vision + reasoning)   │
        └──────┬──────────────────┬──────┘
               │                  │
        ┌──────▼──────┐    ┌──────▼──────┐
        │ TTS Engine  │    │   Skills    │
        │ (Edge/11L)  │    │  (DALL-E,   │
        │             │    │  SD, etc.)  │
        └──────┬──────┘    └──────┬──────┘
               │                  │
        ┌──────▼──────────────────▼──────┐
        │         Channel Reply          │
        │    (voice + images + text)     │
        └────────────────────────────────┘

Every multimodal feature flows through the same message pipeline: channels handle media I/O, the brain processes it, and skills/TTS engines generate media output. No special architecture changes are needed.


Media Attachments

All 50+ channel adapters support a standard attachment interface:

interface ChannelMessage {
  text: string;
  sender: string;
  channel: string;
  attachments?: Array<{
    type: 'image' | 'file' | 'audio' | 'video' | 'link';
    url: string;
    name?: string;
    mimeType?: string;
  }>;
}

When a user sends a photo, voice note, or file through any channel, it arrives as an attachment with type metadata. The brain receives both the text and attachment URLs in its context.
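As a minimal sketch of how a skill might pick out the media it cares about, the snippet below filters a message's attachments by type. The interface mirrors the one above; `attachmentsOfType` and the sample URLs are hypothetical and not part of OpenClaw.

```typescript
type AttachmentType = 'image' | 'file' | 'audio' | 'video' | 'link';

interface Attachment {
  type: AttachmentType;
  url: string;
  name?: string;
  mimeType?: string;
}

interface ChannelMessage {
  text: string;
  sender: string;
  channel: string;
  attachments?: Attachment[];
}

// Hypothetical helper (not an OpenClaw API): return only the attachments
// of one type, treating a missing attachments array as empty.
function attachmentsOfType(msg: ChannelMessage, type: AttachmentType): Attachment[] {
  return (msg.attachments ?? []).filter((a) => a.type === type);
}

// Illustrative message: a photo plus a voice note in one Telegram message.
const sample: ChannelMessage = {
  text: 'What does this say?',
  sender: 'alice',
  channel: 'telegram',
  attachments: [
    { type: 'image', url: 'https://example.com/photo.jpg', mimeType: 'image/jpeg' },
    { type: 'audio', url: 'https://example.com/note.ogg', mimeType: 'audio/ogg' },
  ],
};

console.log(attachmentsOfType(sample, 'audio').map((a) => a.url));
```

Because `attachments` is optional, the helper returns an empty array for text-only messages, so callers do not need a separate undefined check.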

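Supported Media by Channel

| Channel | Images | Voice/Audio | Video | Files |
|---|---|---|---|---|
| Telegram | Yes | Yes (OGG) | Yes | Yes |
| WhatsApp | Yes | Yes (OGG) | Yes | Yes |
| Discord | Yes | Yes | Yes | Yes |
| Slack | Yes | Yes | Yes | Yes |
| Signal | Yes | Yes | Yes | Yes |
| iMessage | Yes | Yes | Yes | Yes |
| Matrix | Yes | Yes | Yes | Yes |
| Teams | Yes | No | No | Yes |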

Channel Permissions

Control media sending per channel:

~/.openclaw/openclaw.json
{
  "channels": {
    "telegram": {
      "permissions": {
        "send_media": true,
        "send_voice": true,
        "receive_media": true
      }
    }
  }
}
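A skill can check these flags before it tries to send media. The sketch below reads the config shape shown above; `isAllowed` is a hypothetical helper, and treating a missing channel or flag as "not allowed" is an assumption rather than documented OpenClaw behavior.

```typescript
interface ChannelPermissions {
  send_media?: boolean;
  send_voice?: boolean;
  receive_media?: boolean;
}

interface OpenClawConfig {
  channels?: Record<string, { permissions?: ChannelPermissions }>;
}

// Hypothetical helper: a permission counts as granted only when it is
// explicitly set to true (assumed default-deny).
function isAllowed(
  config: OpenClawConfig,
  channel: string,
  permission: keyof ChannelPermissions,
): boolean {
  return config.channels?.[channel]?.permissions?.[permission] === true;
}

const config: OpenClawConfig = {
  channels: {
    telegram: {
      permissions: { send_media: true, send_voice: true, receive_media: true },
    },
  },
};

console.log(isAllowed(config, 'telegram', 'send_voice'));
console.log(isAllowed(config, 'discord', 'send_voice'));
```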

Voice Control

Speech-to-Text (STT)

OpenAI Whisper transcribes incoming voice messages into text for the brain to process.

Install Whisper

# Install via pip
pip install openai-whisper

# Or with conda
conda install -c conda-forge openai-whisper

Create a Voice Transcription Skill

skills/voice-transcribe/skill.yaml
name: voice-transcribe
description: Transcribe voice messages using Whisper
trigger: attachment:audio
tools:
- shell

skills/voice-transcribe/SKILL.md
## Voice Transcription

When a voice message arrives:
1. Download the audio attachment to /tmp/voice_msg.ogg
2. Run: whisper --model base --output_format txt --output_dir /tmp /tmp/voice_msg.ogg
3. Read the transcription from /tmp/voice_msg.txt
4. Process the transcribed text as a normal user message
5. Clean up temporary files
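The transcription steps can be sketched as a small planning function that builds the Whisper command line and predicts where the transcript will land. By default the Whisper CLI writes output into the current working directory, so the sketch passes `--output_dir` explicitly. `planWhisperJob` is a hypothetical helper; it only assembles arguments and does not run anything.

```typescript
interface WhisperJob {
  command: string;
  args: string[];
  transcriptPath: string;
}

// Hypothetical helper: build the argument list for the Whisper CLI and
// work out the path of the .txt file it will produce.
function planWhisperJob(audioPath: string, model = 'base', outputDir = '/tmp'): WhisperJob {
  const fileName = audioPath.split('/').pop() ?? audioPath;
  const dot = fileName.lastIndexOf('.');
  // Whisper names the transcript after the audio file minus its extension.
  const stem = dot > 0 ? fileName.slice(0, dot) : fileName;
  return {
    command: 'whisper',
    args: ['--model', model, '--output_format', 'txt', '--output_dir', outputDir, audioPath],
    transcriptPath: `${outputDir}/${stem}.txt`,
  };
}

const plan = planWhisperJob('/tmp/voice_msg.ogg');
console.log([plan.command, ...plan.args].join(' '));
console.log(plan.transcriptPath);
```

Passing the arguments as an array (for example to a process-spawning API) avoids shell quoting problems when a file name contains spaces.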

Whisper Model Selection

| Model | Parameters | Speed | Accuracy | Best For |
|---|---|---|---|---|
| tiny | 39 M | Fastest | Good | Quick responses, low-resource VPS |
| base | 74 M | Fast | Better | Recommended default |
| small | 244 M | Medium | Better still | Better accuracy when needed |
| medium | 769 M | Slow | Very good | Non-English or accented speech |
| large-v3 | 1550 M | Slowest | Best | Maximum accuracy, GPU recommended |

# Test transcription quality (language is auto-detected)
whisper --model base voice_message.ogg

# Force English for faster processing (skips language detection)
whisper --model base --language en --task transcribe voice_message.ogg

Text-to-Speech (TTS)

Two options: free Edge TTS or premium ElevenLabs.

Edge TTS (Free)

The edge-tts Python package calls the online text-to-speech service behind Microsoft Edge. It offers high-quality neural voices at zero cost, with no API key:

# Install
pip install edge-tts

# Generate speech
edge-tts --text "Hello, I'm your OpenClaw agent" \
--voice en-US-AriaNeural \
--write-media reply.mp3

Popular voices:

| Voice | Style |
|---|---|
| en-US-AriaNeural | Friendly, natural |
| en-US-GuyNeural | Casual male |
| en-GB-SoniaNeural | British female |
| en-US-JennyNeural | Professional female |

# List all available voices
edge-tts --list-voices | grep en-

ElevenLabs (Premium)

Custom voice cloning, highest quality:

Environment
export ELEVENLABS_API_KEY=your_key_here

skills/voice-reply/skill.yaml
name: voice-reply
description: Reply with voice using ElevenLabs
trigger: "speak|say|voice"
tools:
- shell
- http

skills/voice-reply/SKILL.md
## Voice Reply

When asked to speak or reply with voice:
1. Generate the text response normally
2. Call ElevenLabs API:
```
curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/YOUR_VOICE_ID" \
-H "xi-api-key: $ELEVENLABS_API_KEY" \
-H "Content-Type: application/json" \
-d '{"text": "YOUR_RESPONSE", "model_id": "eleven_turbo_v2_5"}' \
--output /tmp/reply.mp3
```
3. Send the audio file as a reply attachment
4. Clean up temporary files
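One practical wrinkle with the curl call above is that response text containing quotes or newlines will break a hand-written `-d` string. A minimal sketch of building the same request with proper JSON escaping follows; `buildElevenLabsRequest` is a hypothetical helper that only assembles the request and does not send it, and the voice ID and key are placeholders.

```typescript
interface HttpRequest {
  url: string;
  method: 'POST';
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: assemble the text-to-speech request described above.
function buildElevenLabsRequest(
  text: string,
  voiceId: string,
  apiKey: string,
  modelId = 'eleven_turbo_v2_5',
): HttpRequest {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`,
    method: 'POST',
    headers: { 'xi-api-key': apiKey, 'Content-Type': 'application/json' },
    // JSON.stringify escapes quotes and newlines in the response text.
    body: JSON.stringify({ text, model_id: modelId }),
  };
}

const request = buildElevenLabsRequest('She said "hello"', 'YOUR_VOICE_ID', 'YOUR_API_KEY');
console.log(request.url);
console.log(request.body);
```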

Full Voice Pipeline

Combine STT + brain + TTS for a complete voice-controlled agent:

skills/voice-agent/skill.yaml
name: voice-agent
description: Full voice-in, voice-out pipeline
trigger: attachment:audio
tools:
- shell
- http

skills/voice-agent/SKILL.md
## Voice Agent

Complete voice pipeline:
1. **Transcribe** the incoming voice message with Whisper:
`whisper --model base --language en --output_format txt --output_dir /tmp /tmp/incoming.ogg`
2. **Process** the transcription as a text message: reason, plan, act
3. **Synthesize** the response as speech:
`edge-tts --text "YOUR_RESPONSE" --voice en-US-AriaNeural --write-media /tmp/reply.mp3`
4. **Send** the audio file back through the channel
5. Also include the text response for accessibility
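The pipeline steps can be sketched as a function that chains three stages. The stages are passed in as plain async functions so the flow can be exercised without Whisper or edge-tts installed. All names here (`VoiceStages`, `runVoicePipeline`) are hypothetical, and the stubs stand in for the real CLI calls.

```typescript
interface VoiceStages {
  transcribe(audioPath: string): Promise<string>;
  think(prompt: string): Promise<string>;
  synthesize(text: string): Promise<string>; // resolves to an audio file path
}

interface VoiceReply {
  text: string;
  audioPath: string;
}

// Hypothetical orchestration: STT, then the brain, then TTS. The reply keeps
// the text alongside the audio so both can be sent back.
async function runVoicePipeline(audioPath: string, stages: VoiceStages): Promise<VoiceReply> {
  const transcript = (await stages.transcribe(audioPath)).trim();
  if (transcript.length === 0) {
    throw new Error('Empty transcription; nothing to process');
  }
  const text = await stages.think(transcript);
  const replyAudio = await stages.synthesize(text);
  return { text, audioPath: replyAudio };
}

// Stub stages for illustration only.
const stubStages: VoiceStages = {
  transcribe: async () => 'what time is it',
  think: async (prompt) => `You asked: ${prompt}`,
  synthesize: async () => '/tmp/reply.mp3',
};

runVoicePipeline('/tmp/incoming.ogg', stubStages).then((reply) => {
  console.log(reply.text, reply.audioPath);
});
```

Rejecting an empty transcript early avoids spending a model call and a TTS call on silence or an unreadable recording.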

Voice Cost Comparison

| Engine | Cost | Quality | Latency |
|---|---|---|---|
| Whisper (local) | Free | Excellent | 2-5s (CPU), under 1s (GPU) |
| Edge TTS | Free | Very good | 1-3s |
| ElevenLabs | ~$0.30/1K chars | Excellent | 1-2s |
| Google Cloud TTS | ~$0.016/1K chars | Very good | 1-2s |

For most setups, Whisper + Edge TTS gives production-quality voice at zero cost.
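To make the per-character rates concrete, here is a rough monthly estimate. The rates are the approximate figures from the table above; the workload (100 replies a day at 300 characters each) and the helper name are illustrative assumptions.

```typescript
// Approximate USD rates per 1,000 characters, copied from the table above.
const RATE_PER_1K_CHARS: Record<string, number> = {
  'edge-tts': 0,
  'elevenlabs': 0.3,
  'google-cloud-tts': 0.016,
};

// Hypothetical helper: estimated TTS spend over a period.
function estimateTtsCost(
  engine: string,
  repliesPerDay: number,
  avgCharsPerReply: number,
  days = 30,
): number {
  const rate = RATE_PER_1K_CHARS[engine];
  if (rate === undefined) throw new Error(`Unknown engine: ${engine}`);
  const totalChars = repliesPerDay * avgCharsPerReply * days;
  return (totalChars / 1000) * rate;
}

// 100 replies/day x 300 chars x 30 days = 900,000 characters per month.
for (const engine of Object.keys(RATE_PER_1K_CHARS)) {
  console.log(engine, estimateTtsCost(engine, 100, 300).toFixed(2));
}
```

At that volume the estimate comes to about $270 per month for ElevenLabs and about $14.40 for Google Cloud TTS.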


Vision & Image Analysis

Vision-Capable Models

These models can analyze images, screenshots, documents, and visual content directly:

| Model | Vision Support | Best For |
|---|---|---|
| Claude Sonnet 4.6 | Yes | Detailed image analysis, document understanding |
| Claude Opus 4.8 | Yes | Complex visual reasoning |
| GPT-4o | Yes | General multimodal, fast |
| Gemini 2.5 Pro | Yes | Large images, long documents |
| Gemini 2.5 Flash | Yes | Fast visual analysis, low cost |
| LLaVA (local) | Yes | Free, on-device vision |

Configure a Vision Model

~/.openclaw/openclaw.json
{
  "brain": {
    "provider": "anthropic",
    "model": "claude-sonnet-4-6"
  }
}

When a user sends an image through any channel, the brain automatically receives it if the model supports vision. No additional configuration is needed: the channel adapter passes image attachments to the LLM.
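For intuition, the sketch below shows one plausible way image attachments could be turned into model input. The block shape follows the image content block with a URL source from Anthropic's Messages API; how OpenClaw's adapters actually do this is not specified here, so `toVisionContent` is purely illustrative.

```typescript
interface Attachment {
  type: 'image' | 'file' | 'audio' | 'video' | 'link';
  url: string;
  name?: string;
  mimeType?: string;
}

// Content blocks in the style of Anthropic's Messages API.
type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'image'; source: { type: 'url'; url: string } };

// Hypothetical helper: place image blocks first, followed by the user's text.
function toVisionContent(text: string, attachments: Attachment[] = []): ContentBlock[] {
  const images = attachments
    .filter((a) => a.type === 'image')
    .map((a): ContentBlock => ({ type: 'image', source: { type: 'url', url: a.url } }));
  const textBlock: ContentBlock = { type: 'text', text };
  return [...images, textBlock];
}

const content = toVisionContent('What is in this photo?', [
  { type: 'image', url: 'https://example.com/photo.jpg' },
  { type: 'audio', url: 'https://example.com/note.ogg' },
]);
console.log(JSON.stringify(content, null, 2));
```

Non-image attachments are dropped here because a vision model cannot read them directly; audio would go through the STT path instead.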

Image Analysis Skill

For explicit image analysis commands:

skills/analyze-image/skill.yaml
name: analyze-image
description: Analyze images sent by the user
trigger: "analyze|describe|what is this|what do you see"
tools:
- shell

skills/analyze-image/SKILL.md
## Image Analysis

When the user sends an image and asks for analysis:
1. The image is already in your context via the channel attachment
2. Describe what you see in detail
3. Answer any specific questions about the image
4. If asked to extract text, perform OCR on the image content
5. If asked about a screenshot, identify the application and UI elements

Browser Screenshots

The built-in browser Hand captures screenshots for visual analysis:

~/.openclaw/openclaw.json
{
  "hands": {
    "browser": {
      "enabled": true,
      "headless": true,
      "timeout": 60000,
      "allowed_domains": ["*"]
    }
  }
}

Use cases:

  • Web monitoring: Screenshot a page, analyze for changes
  • Visual QA: Capture a UI, check layout and content
  • Data extraction: Screenshot a dashboard, read values from charts

MCP Vision Servers

Puppeteer MCP for screenshots
{
  "mcp": {
    "servers": {
      "puppeteer": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-puppeteer"]
      }
    }
  }
}

The Puppeteer MCP server provides puppeteer_screenshot and puppeteer_navigate tools for programmatic screenshot capture.
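Under the hood, an MCP client invokes these tools with JSON-RPC `tools/call` requests. The sketch below builds the two requests for a navigate-then-screenshot sequence. The argument names (`url` for navigation, `name` for the screenshot) reflect my understanding of the reference server's schema and should be treated as assumptions; `toolCall` and `screenshotRequests` are hypothetical helpers.

```typescript
interface JsonRpcRequest {
  jsonrpc: '2.0';
  id: number;
  method: string;
  params: { name: string; arguments: Record<string, unknown> };
}

let nextRequestId = 1;

// Build an MCP tools/call request for the named tool.
function toolCall(name: string, args: Record<string, unknown>): JsonRpcRequest {
  return {
    jsonrpc: '2.0',
    id: nextRequestId++,
    method: 'tools/call',
    params: { name, arguments: args },
  };
}

// Hypothetical helper: navigate to a page, then capture it.
// Argument names are assumed, not confirmed by this guide.
function screenshotRequests(url: string, shotName: string): JsonRpcRequest[] {
  return [
    toolCall('puppeteer_navigate', { url }),
    toolCall('puppeteer_screenshot', { name: shotName }),
  ];
}

console.log(JSON.stringify(screenshotRequests('https://example.com', 'homepage'), null, 2));
```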


Image Generation

ClawHub Skills

60+ image and video generation skills available:

# Browse image generation skills
openclaw clawhub search "image generation"
openclaw clawhub search "DALL-E"
openclaw clawhub search "stable diffusion"

# Install one
openclaw clawhub install dalle-generator

DALL-E Skill

skills/dalle/skill.yaml
name: dalle
description: Generate images with DALL-E
trigger: "generate|draw|create image|make a picture"
tools:
- http

skills/dalle/SKILL.md
## Image Generation

When the user asks to generate an image:
1. Craft a detailed prompt from their request
2. Call the OpenAI Images API:
```
curl -X POST "https://api.openai.com/v1/images/generations" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "dall-e-3", "prompt": "DETAILED_PROMPT", "size": "1024x1024", "quality": "standard"}'
```
3. Download the generated image
4. Send it as a channel attachment
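Steps 2 and 3 can be sketched as two small functions: one builds the request body, the other pulls the image URL out of the response. The response shape (`data[0].url`) follows the OpenAI Images API; the helper names and the sample response are illustrative.

```typescript
interface ImagesResponse {
  data: Array<{ url?: string; b64_json?: string; revised_prompt?: string }>;
}

// Build the JSON body from the curl example, with the prompt safely escaped.
function buildImageRequestBody(prompt: string): string {
  return JSON.stringify({
    model: 'dall-e-3',
    prompt,
    size: '1024x1024',
    quality: 'standard',
  });
}

// Extract the first image URL, failing loudly if the response has none.
function firstImageUrl(response: ImagesResponse): string {
  const url = response.data[0]?.url;
  if (!url) throw new Error('Response contained no image URL');
  return url;
}

// Illustrative response, not real API output.
const sampleResponse: ImagesResponse = {
  data: [{ url: 'https://example.com/generated.png', revised_prompt: 'A red fox in snow' }],
};

console.log(buildImageRequestBody('A red fox in snow'));
console.log(firstImageUrl(sampleResponse));
```

Returned image URLs are temporary, so the download in step 3 should happen promptly rather than being deferred.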

Stable Diffusion (Local)

Run image generation locally with no API costs:

# Run Stable Diffusion locally with ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI && pip install -r requirements.txt
# --listen 0.0.0.0 exposes the server on every network interface;
# use --listen 127.0.0.1 if only local processes need access
python main.py --listen 0.0.0.0 --port 8188

Connect via MCP or HTTP skill for zero-cost image generation on GPU hardware.


Real-Time Audio Streaming

Gemini Live API

For real-time bidirectional audio (used by VisionClaw smart glasses):

// Assumes GOOGLE_API_KEY is set (see Environment Variables); the Live API
// takes the key as a `key` query parameter on the WebSocket URL.
const apiKey = process.env.GOOGLE_API_KEY;
const ws = new WebSocket(
  'wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent' +
    `?key=${apiKey}`
);

ws.onopen = () => {
  // The first message on the socket must be the session setup
  ws.send(JSON.stringify({
    setup: {
      model: 'models/gemini-2.0-flash-exp',
      generationConfig: {
        responseModalities: ['AUDIO'],
        speechConfig: {
          voiceConfig: {
            prebuiltVoiceConfig: { voiceName: 'Aoede' }
          }
        }
      }
    }
  }));
};

// Stream base64-encoded 16 kHz PCM audio chunks
function sendAudioChunk(base64Audio: string) {
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [{
        mimeType: 'audio/pcm;rate=16000',
        data: base64Audio
      }]
    }
  }));
}

This enables sub-second voice interaction: the user speaks and the model responds in audio, with no STT/TTS pipeline latency.
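Microphone APIs usually deliver Float32 samples, while the `audio/pcm;rate=16000` chunks above carry 16-bit PCM as base64. A minimal sketch of that conversion follows. It assumes the audio is already mono at 16 kHz and that little-endian byte order is expected; resampling is out of scope, and both function names are hypothetical.

```typescript
// Convert Float32 samples in [-1, 1] to 16-bit little-endian PCM bytes.
function floatTo16BitPcm(samples: Float32Array): Uint8Array {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot overflow the 16-bit range.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, Math.round(scaled), true);
  }
  return bytes;
}

// Base64-encode raw bytes using the global btoa.
function toBase64(bytes: Uint8Array): string {
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// 100 ms of silence at 16 kHz, ready to pass to sendAudioChunk.
const silence = new Float32Array(1600);
const chunk = toBase64(floatTo16BitPcm(silence));
console.log(chunk.length);
```

Sending short chunks (tens to hundreds of milliseconds) keeps latency low, since the model can begin processing before the user finishes speaking.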

When to Use Real-Time vs Pipeline

| Approach | Latency | Cost | Quality | Best For |
|---|---|---|---|---|
| Whisper + Edge TTS | 3-8s | Free | Very good | Messaging apps, async |
| Gemini Live | under 1s | API cost | Good | Real-time voice, wearables |
| ElevenLabs Streaming | 1-2s | Premium | Excellent | Premium voice quality |

Wearable & Mobile Multimodal

VisionClaw (Smart Glasses)

The VisionClaw project (681 stars) connects Meta Ray-Ban smart glasses to OpenClaw:

  • Video capture at ~1 FPS from glasses camera
  • Bidirectional audio via Gemini Live API WebSocket
  • 56+ connected skills from the OpenClaw ecosystem
  • Fallback mode using iPhone camera for testing

Smart Glasses → iOS App → Gemini Live API → OpenClaw Gateway
      ↑                                             │
      └────────────── Audio Response ◄──────────────┘

Requirements: iOS 17.0+, Xcode 15.0+, Meta Ray-Ban smart glasses (or iPhone camera fallback)

See the Ecosystem page for setup details and repository links.

ClawPhone (Android)

Run a multimodal agent on a $25 Android smartphone:

  • Camera access via Termux:API
  • Screen overlay daemon for always-on interaction
  • Hardware control (sensors, GPS, flashlight)
  • Remote control via Discord channel

macOS Companion

Official menu bar app with:

  • Voice control: speak commands directly
  • Gateway health monitoring
  • Debug tools for development
  • Available as Universal Binary on macOS 15+

Multimodal Skill Patterns

Pattern 1: Describe and Act

Receive an image, analyze it, take action:

skills/receipt-scanner/skill.yaml
name: receipt-scanner
description: Scan receipts and log expenses
trigger: attachment:image
tools:
- shell
- filesystem

skills/receipt-scanner/SKILL.md
## Receipt Scanner

When an image attachment arrives:
1. Analyze the image and determine whether it is a receipt
2. Extract: merchant name, date, total amount, line items
3. Append to ~/expenses.csv: date, merchant, amount, category
4. Reply with a summary: "Logged $42.50 at Whole Foods (Groceries)"
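Step 3 is where malformed output tends to creep in, because merchant names can contain commas or quotes. A minimal sketch of producing a safely escaped CSV row is shown below; the `Expense` shape and helper names are hypothetical, and writing the row to `~/expenses.csv` is left to the filesystem tool.

```typescript
interface Expense {
  date: string; // ISO date, e.g. 2025-01-15
  merchant: string;
  amount: number;
  category: string;
}

// Quote a field only when it contains a comma, quote, or newline,
// doubling any embedded quotes (standard CSV escaping).
function csvField(value: string): string {
  return /[",\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Column order matches the skill: date, merchant, amount, category.
function toCsvRow(e: Expense): string {
  return [e.date, e.merchant, e.amount.toFixed(2), e.category]
    .map((field) => csvField(field))
    .join(',');
}

console.log(toCsvRow({ date: '2025-01-15', merchant: 'Whole Foods', amount: 42.5, category: 'Groceries' }));
// prints: 2025-01-15,Whole Foods,42.50,Groceries
```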

Pattern 2: Visual Monitoring

Periodically capture and analyze visual state:

skills/site-monitor/skill.yaml
name: site-monitor
description: Monitor website appearance for changes
trigger: "monitor site|check site"
tools:
- browser
- shell

skills/site-monitor/SKILL.md
## Site Monitor

When asked to monitor a site:
1. Navigate to the URL using the browser
2. Take a screenshot
3. Compare against the previous screenshot (stored in ~/.openclaw/screenshots/)
4. If significant visual changes detected, alert the user
5. Save the current screenshot for next comparison
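"Significant visual changes" needs a concrete measure. One simple stand-in is the fraction of bytes that differ between two decoded pixel buffers, as sketched below. This assumes both screenshots have already been decoded to raw pixels (for example RGBA) at the same viewport size; comparing compressed PNG bytes directly would not be meaningful. The 5% threshold and the function names are illustrative assumptions.

```typescript
// Fraction of byte positions that differ between two raw pixel buffers.
function changedFraction(prev: Uint8Array, curr: Uint8Array): number {
  // Different sizes mean the layout or viewport changed: treat as fully changed.
  if (prev.length !== curr.length) return 1;
  if (prev.length === 0) return 0;
  let differing = 0;
  for (let i = 0; i < prev.length; i++) {
    if (prev[i] !== curr[i]) differing++;
  }
  return differing / prev.length;
}

// Alert when more than `threshold` of the buffer changed (assumed default: 5%).
function shouldAlert(prev: Uint8Array, curr: Uint8Array, threshold = 0.05): boolean {
  return changedFraction(prev, curr) > threshold;
}

// Two tiny 2x2 RGBA "screenshots" where one of four pixels changed.
const before = new Uint8Array(16).fill(255);
const after = new Uint8Array(16).fill(255);
after.set([0, 0, 0, 255], 0); // first pixel becomes black (alpha unchanged)

console.log(changedFraction(before, after));
console.log(shouldAlert(before, after));
```

A byte-level diff is sensitive to anti-aliasing and rotating ads, so in practice the threshold usually needs tuning per site, or the comparison can be delegated to the vision model.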

Pattern 3: Multimodal Chain

Combine vision input with media output:

Example: Photo → Description → Voice
User sends a photo of a sunset via Telegram

1. Vision model analyzes: "A vibrant sunset over the ocean with
orange and purple hues reflecting on calm water"
2. Agent crafts a poetic description
3. Edge TTS converts description to speech
4. Reply includes both text and voice note

Pattern 4: Document Processing

Extract and act on document content:

skills/doc-processor/skill.yaml
name: doc-processor
description: Process documents and PDFs
trigger: attachment:file
tools:
- shell
- filesystem

skills/doc-processor/SKILL.md
## Document Processor

When a document is attached:
1. Identify file type (PDF, DOCX, image of text)
2. For PDFs: extract text, summarize key points
3. For images of documents: use vision to OCR and extract content
4. Answer questions about the document
5. Offer to create action items, summaries, or translations
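Step 1 can be sketched as a classifier that looks at the attachment's MIME type first and falls back to the file extension. The `DocKind` categories mirror the three cases in the skill; the function name and the list of image extensions are illustrative assumptions.

```typescript
type DocKind = 'pdf' | 'docx' | 'image' | 'unknown';

const DOCX_MIME = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document';
const IMAGE_EXTENSIONS = ['png', 'jpg', 'jpeg', 'webp', 'gif'];

// Hypothetical helper: classify an attachment from its optional name and MIME type.
function classifyDocument(name?: string, mimeType?: string): DocKind {
  const mime = (mimeType ?? '').toLowerCase();
  const ext = (name ?? '').toLowerCase().split('.').pop() ?? '';
  if (mime === 'application/pdf' || ext === 'pdf') return 'pdf';
  if (mime === DOCX_MIME || ext === 'docx') return 'docx';
  if (mime.indexOf('image/') === 0 || IMAGE_EXTENSIONS.indexOf(ext) !== -1) return 'image';
  return 'unknown';
}

console.log(classifyDocument('invoice.PDF'));
console.log(classifyDocument('scan', 'image/jpeg'));
console.log(classifyDocument('notes.txt', 'text/plain'));
```

Returning `unknown` rather than guessing lets the skill ask the user what the file is instead of running the wrong extraction path.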

Configuration Reference

Full Multimodal Config

~/.openclaw/openclaw.json
{
  "brain": {
    "provider": "anthropic",
    "model": "claude-sonnet-4-6"
  },

  "hands": {
    "browser": {
      "enabled": true,
      "headless": true
    }
  },

  "channels": {
    "telegram": {
      "permissions": {
        "send_media": true,
        "send_voice": true,
        "receive_media": true
      }
    }
  },

  "mcp": {
    "servers": {
      "puppeteer": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-puppeteer"]
      }
    }
  }
}

Environment Variables

~/.openclaw/.env
# Voice
ELEVENLABS_API_KEY=your_key # Optional: premium TTS

# OpenAI (one key covers the Whisper API, if not running Whisper locally, and DALL-E)
OPENAI_API_KEY=your_key

# Image Generation
STABILITY_API_KEY=your_key # Stable Diffusion API

# Real-Time Audio
GOOGLE_API_KEY=your_key # Gemini Live API

Performance Tips

Voice Latency

  • Use Whisper tiny or base for fastest transcription
  • Run Whisper on GPU for sub-second transcription
  • Use Edge TTS (no API key or per-character billing, though it still makes a network call to Microsoft's service)
  • Pre-download Whisper models to avoid cold-start delay

Vision Cost

  • Gemini 2.5 Flash is typically the cheapest hosted option among the models above; local LLaVA has no API cost
  • Resize large images before sending to the LLM (saves tokens)
  • Claude bills images by pixel area (roughly width × height / 750 tokens), so smaller images cost less
  • Cache common visual analysis patterns in skills
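To see how much resizing saves, the sketch below applies the approximation from Anthropic's documentation, tokens ≈ (width × height) / 750, before and after scaling an image down. The 1000-pixel target edge and both helper names are illustrative assumptions; providers may also downscale large images themselves, so actual billing can differ.

```typescript
// Approximate Claude image cost: (width * height) / 750 tokens.
function estimateClaudeImageTokens(width: number, height: number): number {
  return Math.ceil((width * height) / 750);
}

// Scale dimensions down so the longest edge is at most maxEdge,
// preserving aspect ratio. Images already small enough are unchanged.
function fitWithin(width: number, height: number, maxEdge: number): { width: number; height: number } {
  const scale = Math.min(1, maxEdge / Math.max(width, height));
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}

const original = { width: 2000, height: 1500 };
const resized = fitWithin(original.width, original.height, 1000);

console.log(estimateClaudeImageTokens(original.width, original.height)); // 4000
console.log(resized); // { width: 1000, height: 750 }
console.log(estimateClaudeImageTokens(resized.width, resized.height)); // 1000
```

Halving each edge cuts the pixel area, and therefore the estimated token count, to a quarter.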

Storage

  • Voice messages are temporary: clean up /tmp files after processing
  • Screenshots accumulate: set a retention policy
  • Generated images can be large: compress before sending

# Clean up old voice/image temp files (add to cron)
find /tmp -maxdepth 1 -type f \( -name "voice_*.ogg" -o -name "incoming*.ogg" \) -mtime +1 -delete
find /tmp -maxdepth 1 -type f -name "reply*.mp3" -mtime +1 -delete
find ~/.openclaw/screenshots -type f -mtime +7 -delete

Resource Usage

| Feature | RAM Impact | CPU Impact | GPU Impact |
|---|---|---|---|
| Whisper tiny | +200 MB | High during transcription | Preferred |
| Whisper base | +500 MB | High during transcription | Preferred |
| Whisper large | +3 GB | Very high | Strongly recommended |
| Edge TTS | Minimal | Low | None |
| ElevenLabs | Minimal | None (API) | None |
| Local Stable Diffusion | +4 GB | Medium | Required (4+ GB VRAM) |
| Browser screenshots | +200 MB | Low | None |

Troubleshooting

Voice messages not transcribed

# Check Whisper is installed
whisper --help

# Test with a sample file
whisper --model base test.ogg

# Check ffmpeg (required by Whisper)
ffmpeg -version

If ffmpeg is missing, install it: sudo apt install ffmpeg (Debian/Ubuntu) or brew install ffmpeg (macOS)

Images not analyzed

  1. Confirm your model supports vision (Claude Sonnet/Opus, GPT-4o, Gemini)
  2. Check channel permissions include receive_media: true
  3. Verify the image URL is accessible to the brain
  4. Check logs: openclaw logs --filter "attachment" --follow

TTS audio not playing

# Test Edge TTS directly
edge-tts --text "test" --voice en-US-AriaNeural --write-media /tmp/test.mp3

# Verify the file was created
ls -la /tmp/test.mp3

# Check the channel can send audio
openclaw config get channels.telegram.permissions.send_voice

High latency on voice pipeline

  • Switch Whisper to the tiny model
  • Confirm the GPU is in use: nvidia-smi should list the Python process running Whisper
  • Run Whisper with --language en to skip language detection
  • Consider Gemini Live for real-time voice instead of STT+TTS pipeline

See Also