Voice & Multimodal
OpenClaw can see images, hear voice messages, speak back in natural voice, analyze screenshots, generate images, and stream real-time audio, all through its standard channel and skill architecture. This guide covers every multimodal capability and how to wire it up.
The fastest path to multimodal: enable Whisper for voice-to-text on Telegram, configure a vision-capable model, and install an image generation skill from ClawHub. You'll have a voice-controlled, vision-enabled, image-generating agent in under 30 minutes.
Architecture Overview
┌─────────────────────────────────────────────────┐
│                  Channel Layer                  │
│ Telegram · WhatsApp · Discord · Signal · Slack  │
│         (receives voice, images, files)         │
└──────────────┬──────────────────┬───────────────┘
               │                  │
        ┌──────▼──────┐    ┌──────▼──────┐
        │ STT Engine  │    │ Attachment  │
        │  (Whisper)  │    │   Parser    │
        └──────┬──────┘    └──────┬──────┘
               │                  │
        ┌──────▼──────────────────▼──────┐
        │          Brain (LLM)           │
        │    Claude · GPT-4o · Gemini    │
        │  (text + vision + reasoning)   │
        └──────┬──────────────────┬──────┘
               │                  │
        ┌──────▼──────┐    ┌──────▼──────┐
        │ TTS Engine  │    │   Skills    │
        │ (Edge/11L)  │    │  (DALL-E,   │
        │             │    │  SD, etc.)  │
        └──────┬──────┘    └──────┬──────┘
               │                  │
        ┌──────▼──────────────────▼──────┐
        │         Channel Reply          │
        │    (voice + images + text)     │
        └────────────────────────────────┘
Every multimodal feature flows through the same message pipeline: channels handle media I/O, the brain processes it, and skills and TTS engines generate media output. No special architecture changes are needed.
Media Attachments
All 50+ channel adapters support a standard attachment interface:
interface ChannelMessage {
  text: string;
  sender: string;
  channel: string;
  attachments?: Array<{
    type: 'image' | 'file' | 'audio' | 'video' | 'link';
    url: string;
    name?: string;
    mimeType?: string;
  }>;
}
When a user sends a photo, voice note, or file through any channel, it arrives as an attachment with type metadata. The brain receives both the text and attachment URLs in its context.
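As a rough illustration of how a skill or adapter might consume this shape, the sketch below groups a message's attachment URLs by `type`. The `groupAttachments` helper and the example URLs are hypothetical and not part of OpenClaw; only the field names come from the interface above.

```typescript
// Hypothetical helper, not an OpenClaw API. Field names mirror ChannelMessage.
type AttachmentType = 'image' | 'file' | 'audio' | 'video' | 'link';

interface Attachment {
  type: AttachmentType;
  url: string;
  name?: string;
  mimeType?: string;
}

interface ChannelMessage {
  text: string;
  sender: string;
  channel: string;
  attachments?: Attachment[];
}

// Group attachment URLs by type; a message without attachments yields an empty map.
function groupAttachments(msg: ChannelMessage): Map<AttachmentType, string[]> {
  const groups = new Map<AttachmentType, string[]>();
  for (const att of msg.attachments ?? []) {
    const urls = groups.get(att.type) ?? [];
    urls.push(att.url);
    groups.set(att.type, urls);
  }
  return groups;
}

// Example message: a photo plus a voice note sent in one Telegram message.
const example: ChannelMessage = {
  text: 'What is in this photo?',
  sender: 'alice',
  channel: 'telegram',
  attachments: [
    { type: 'image', url: 'https://example.com/photo.jpg', mimeType: 'image/jpeg' },
    { type: 'audio', url: 'https://example.com/note.ogg', mimeType: 'audio/ogg' },
  ],
};

const grouped = groupAttachments(example);
```

Grouping up front lets a skill route audio to transcription and images to the vision model without re-scanning the attachment list.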
Supported Media by Channel
| Channel | Images | Voice/Audio | Video | Files |
|---|---|---|---|---|
| Telegram | Yes | Yes (OGG) | Yes | Yes |
| WhatsApp | Yes | Yes (OGG) | Yes | Yes |
| Discord | Yes | Yes | Yes | Yes |
| Slack | Yes | Yes | Yes | Yes |
| Signal | Yes | Yes | Yes | Yes |
| iMessage | Yes | Yes | Yes | Yes |
| Matrix | Yes | Yes | Yes | Yes |
| Teams | Yes | No | No | Yes |
Channel Permissions
Control media sending per channel:
{
  "channels": {
    "telegram": {
      "permissions": {
        "send_media": true,
        "send_voice": true,
        "receive_media": true
      }
    }
  }
}
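For illustration, a skill could check these flags before attempting to send media. The sketch below is a guess at how such a check might look; `isAllowed` is a hypothetical helper, and treating a missing channel or flag as denied is an assumption, since OpenClaw's real defaults may differ.

```typescript
// Hypothetical sketch: the config shape follows the JSON above; the helper is illustrative.
interface ChannelPermissions {
  send_media?: boolean;
  send_voice?: boolean;
  receive_media?: boolean;
}

interface OpenClawConfig {
  channels: Record<string, { permissions?: ChannelPermissions } | undefined>;
}

// Returns true only when the flag is explicitly enabled.
// Assumption: a missing channel or flag counts as "denied".
function isAllowed(
  config: OpenClawConfig,
  channel: string,
  permission: keyof ChannelPermissions,
): boolean {
  return config.channels[channel]?.permissions?.[permission] === true;
}

const config: OpenClawConfig = {
  channels: {
    telegram: {
      permissions: { send_media: true, send_voice: false, receive_media: true },
    },
  },
};

const canSendVoice = isAllowed(config, 'telegram', 'send_voice');
```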
Voice Control
Speech-to-Text (STT)
OpenAI Whisper transcribes incoming voice messages into text for the brain to process.
Install Whisper
# Install via pip
pip install openai-whisper
# Or with conda
conda install -c conda-forge openai-whisper
Create a Voice Transcription Skill
name: voice-transcribe
description: Transcribe voice messages using Whisper
trigger: attachment:audio
tools:
- shell
## Voice Transcription
When a voice message arrives:
1. Download the audio attachment to /tmp/voice_msg.ogg
2. Run: whisper --model base --output_format txt --output_dir /tmp /tmp/voice_msg.ogg
3. Read the transcription from /tmp/voice_msg.txt
4. Process the transcribed text as a normal user message
5. Clean up temporary files
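The steps above can be sketched as a small helper that assembles the Whisper command and predicts where the transcript will land. `buildWhisperJob` is a hypothetical name. The sketch assumes a recent Whisper release that names the transcript after the input file's base name, and it passes `--output_dir` explicitly because Whisper otherwise writes to the current directory.

```typescript
// Hypothetical helper: builds arguments for the `whisper` CLI, it does not run it.
type WhisperModel = 'tiny' | 'base' | 'small' | 'medium' | 'large-v3';

interface WhisperJob {
  args: string[];         // arguments to pass to the `whisper` executable
  transcriptPath: string; // where the .txt transcript is expected to appear
}

function buildWhisperJob(
  audioPath: string,
  outputDir: string,
  model: WhisperModel = 'base',
): WhisperJob {
  // Assumption: the transcript is named <input base name>.txt inside outputDir.
  const fileName = audioPath.split('/').pop() ?? audioPath;
  const stem = fileName.replace(/\.[^.]+$/, '');
  const dir = outputDir.replace(/\/$/, '');
  return {
    args: ['--model', model, '--output_format', 'txt', '--output_dir', outputDir, audioPath],
    transcriptPath: `${dir}/${stem}.txt`,
  };
}

const job = buildWhisperJob('/tmp/voice_msg.ogg', '/tmp');
```

Passing `job.args` as an argument array (for example to a process-spawning API) avoids shell quoting problems with unusual file names.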
Whisper Model Selection
| Model | Parameters | Speed | Accuracy | Best For |
|---|---|---|---|---|
| tiny | 39 M | Fastest | Basic | Quick responses, low-resource VPS |
| base | 74 M | Fast | Good | Recommended default |
| small | 244 M | Medium | Better | Better accuracy when needed |
| medium | 769 M | Slow | Very good | Non-English or accented speech |
| large-v3 | 1550 M | Slowest | Best | Maximum accuracy, GPU recommended |
# Test transcription quality (language is auto-detected)
whisper --model base voice_message.ogg
# Force English to skip language detection and speed up processing
whisper --model base --language en --task transcribe voice_message.ogg
Text-to-Speech (TTS)
Two options: free Edge TTS or premium ElevenLabs.
Edge TTS (Free)
Microsoft's Edge TTS engine offers high quality at zero cost, with no API key required:
# Install
pip install edge-tts
# Generate speech
edge-tts --text "Hello, I'm your OpenClaw agent" \
--voice en-US-AriaNeural \
--write-media reply.mp3
Popular voices:
| Voice | Style |
|---|---|
| en-US-AriaNeural | Friendly, natural |
| en-US-GuyNeural | Casual male |
| en-GB-SoniaNeural | British female |
| en-US-JennyNeural | Professional female |
# List all available voices
edge-tts --list-voices | grep en-
ElevenLabs (Premium)
Custom voice cloning, highest quality:
export ELEVENLABS_API_KEY=your_key_here
name: voice-reply
description: Reply with voice using ElevenLabs
trigger: "speak|say|voice"
tools:
- shell
- http
## Voice Reply
When asked to speak or reply with voice:
1. Generate the text response normally
2. Call ElevenLabs API:
```
curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/YOUR_VOICE_ID" \
-H "xi-api-key: $ELEVENLABS_API_KEY" \
-H "Content-Type: application/json" \
-d '{"text": "YOUR_RESPONSE", "model_id": "eleven_turbo_v2_5"}' \
--output /tmp/reply.mp3
```
3. Send the audio file as a reply attachment
4. Clean up temporary files
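When the response text contains quotes or newlines, hand-assembling the JSON body in a shell string is fragile. The sketch below shows one way to build the same request with proper escaping. `buildElevenLabsRequest` is a hypothetical helper; the URL, header names, and `model_id` are the ones used in the curl command above.

```typescript
// Hypothetical helper: prepares the request but does not send it.
interface TtsRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildElevenLabsRequest(
  text: string,
  voiceId: string,
  apiKey: string,
  modelId: string = 'eleven_turbo_v2_5',
): TtsRequest {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`,
    headers: {
      'xi-api-key': apiKey,
      'Content-Type': 'application/json',
    },
    // JSON.stringify escapes quotes and newlines in the response text.
    body: JSON.stringify({ text, model_id: modelId }),
  };
}

// Placeholder voice ID and key, as in the skill above.
const ttsRequest = buildElevenLabsRequest('She said "hi"\nBye', 'YOUR_VOICE_ID', 'your_key_here');
```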
Full Voice Pipeline
Combine STT + brain + TTS for a complete voice-controlled agent:
name: voice-agent
description: Full voice-in, voice-out pipeline
trigger: attachment:audio
tools:
- shell
- http
## Voice Agent
Complete voice pipeline:
1. **Transcribe** the incoming voice message with Whisper:
`whisper --model base --language en --output_format txt --output_dir /tmp /tmp/incoming.ogg`
2. **Process** the transcription as a text message: reason, plan, act
3. **Synthesize** the response as speech:
`edge-tts --text "YOUR_RESPONSE" --voice en-US-AriaNeural --write-media /tmp/reply.mp3`
4. **Send** the audio file back through the channel
5. Also include the text response for accessibility
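The control flow of this pipeline can be sketched as follows. This is only an outline under stated assumptions: `runVoicePipeline` and the `VoiceStages` interface are hypothetical names, and the three stages are injected so that real Whisper, LLM, and Edge TTS calls can be plugged in. The sketch always returns the text reply and drops to text-only if synthesis fails.

```typescript
// Hypothetical sketch: stage functions are supplied by the caller.
interface VoiceStages {
  transcribe: (audioPath: string) => Promise<string>;
  respond: (text: string) => Promise<string>;
  synthesize: (text: string) => Promise<string>; // resolves to the audio file path
}

interface VoiceReply {
  transcript: string;
  text: string;       // always included for accessibility
  audioPath?: string; // omitted when synthesis fails
}

async function runVoicePipeline(audioPath: string, stages: VoiceStages): Promise<VoiceReply> {
  const transcript = (await stages.transcribe(audioPath)).trim();
  if (transcript.length === 0) {
    // Nothing intelligible was transcribed, so skip the LLM and TTS stages.
    return { transcript, text: "Sorry, I couldn't make out that voice message." };
  }
  const text = await stages.respond(transcript);
  try {
    const replyAudio = await stages.synthesize(text);
    return { transcript, text, audioPath: replyAudio };
  } catch {
    // Fall back to a text-only reply rather than failing the whole message.
    return { transcript, text };
  }
}
```

Keeping the stages injectable also makes it easy to swap Edge TTS for ElevenLabs without touching the control flow.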
Voice Cost Comparison
| Engine | Cost | Quality | Latency |
|---|---|---|---|
| Whisper (local) | Free | Excellent | 2-5s (CPU), under 1s (GPU) |
| Edge TTS | Free | Very good | 1-3s |
| ElevenLabs | ~$0.30/1K chars | Excellent | 1-2s |
| Google Cloud TTS | ~$0.016/1K chars | Very good | 1-2s |
For most setups, Whisper + Edge TTS gives production-quality voice at zero cost.
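To put the per-character rates in perspective, the sketch below estimates a monthly TTS bill from reply volume. The rates are copied from the table above and are approximate; `estimateMonthlyTtsCost` and the example volume of 100 replies per day at 200 characters each are illustrative assumptions.

```typescript
// Approximate USD rates per 1,000 characters, taken from the comparison table.
const TTS_RATE_PER_1K_CHARS = {
  'edge-tts': 0,
  elevenlabs: 0.3,
  'google-cloud-tts': 0.016,
} as const;

type TtsEngine = keyof typeof TTS_RATE_PER_1K_CHARS;

// Estimated monthly cost in USD for a given reply volume.
function estimateMonthlyTtsCost(
  engine: TtsEngine,
  repliesPerDay: number,
  avgCharsPerReply: number,
  days: number = 30,
): number {
  const totalChars = repliesPerDay * avgCharsPerReply * days;
  return (totalChars / 1000) * TTS_RATE_PER_1K_CHARS[engine];
}

// 100 replies/day at 200 characters each is 600,000 characters per month.
const elevenLabsCost = estimateMonthlyTtsCost('elevenlabs', 100, 200); // roughly $180
const googleCost = estimateMonthlyTtsCost('google-cloud-tts', 100, 200); // roughly $9.60
const edgeCost = estimateMonthlyTtsCost('edge-tts', 100, 200); // $0
```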
Vision & Image Analysis
Vision-Capable Models
These models can analyze images, screenshots, documents, and visual content directly:
| Model | Vision Support | Best For |
|---|---|---|
| Claude Sonnet 4.6 | Yes | Detailed image analysis, document understanding |
| Claude Opus 4.8 | Yes | Complex visual reasoning |
| GPT-4o | Yes | General multimodal, fast |
| Gemini 2.5 Pro | Yes | Large images, long documents |
| Gemini 2.5 Flash | Yes | Fast visual analysis, low cost |
| LLaVA (local) | Yes | Free, on-device vision |
Configure a Vision Model
{
  "brain": {
    "provider": "anthropic",
    "model": "claude-sonnet-4-6"
  }
}
When a user sends an image through any channel, the brain automatically receives it if the model supports vision. No additional configuration is needed: the channel adapter passes image attachments to the LLM.
Image Analysis Skill
For explicit image analysis commands:
name: analyze-image
description: Analyze images sent by the user
trigger: "analyze|describe|what is this|what do you see"
tools:
- shell
## Image Analysis
When the user sends an image and asks for analysis:
1. The image is already in your context via the channel attachment
2. Describe what you see in detail
3. Answer any specific questions about the image
4. If asked to extract text, perform OCR on the image content
5. If asked about a screenshot, identify the application and UI elements
Browser Screenshots
The built-in browser Hand captures screenshots for visual analysis:
{
  "hands": {
    "browser": {
      "enabled": true,
      "headless": true,
      "timeout": 60000,
      "allowed_domains": ["*"]
    }
  }
}
Use cases:
- Web monitoring: Screenshot a page, analyze for changes
- Visual QA: Capture a UI, check layout and content
- Data extraction: Screenshot a dashboard, read values from charts
MCP Vision Servers
{
  "mcp": {
    "servers": {
      "puppeteer": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-puppeteer"]
      }
    }
  }
}
The Puppeteer MCP server provides puppeteer_screenshot and puppeteer_navigate tools for programmatic screenshot capture.
Image Generation
ClawHub Skills
More than 60 image and video generation skills are available:
# Browse image generation skills
openclaw clawhub search "image generation"
openclaw clawhub search "DALL-E"
openclaw clawhub search "stable diffusion"
# Install one
openclaw clawhub install dalle-generator
DALL-E Skill
name: dalle
description: Generate images with DALL-E
trigger: "generate|draw|create image|make a picture"
tools:
- http
## Image Generation
When the user asks to generate an image:
1. Craft a detailed prompt from their request
2. Call the OpenAI Images API:
```
curl -X POST "https://api.openai.com/v1/images/generations" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "dall-e-3", "prompt": "DETAILED_PROMPT", "size": "1024x1024", "quality": "standard"}'
```
3. Download the generated image
4. Send it as a channel attachment
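Step 2 can be made less error-prone by building the request body in code instead of pasting the prompt into a shell string. The sketch below is one possible shape; `buildImageRequest` is a hypothetical helper, and the size options, `hd` quality value, and 4,000-character prompt limit reflect my understanding of the dall-e-3 API, so verify them against the current OpenAI reference.

```typescript
// Hypothetical helper: builds the JSON body for POST /v1/images/generations.
type ImageSize = '1024x1024' | '1792x1024' | '1024x1792';
type ImageQuality = 'standard' | 'hd';

interface ImageRequest {
  model: 'dall-e-3';
  prompt: string;
  size: ImageSize;
  quality: ImageQuality;
  n: 1;
}

// Assumption: dall-e-3 accepts prompts of up to 4,000 characters.
const MAX_PROMPT_CHARS = 4000;

function buildImageRequest(
  prompt: string,
  size: ImageSize = '1024x1024',
  quality: ImageQuality = 'standard',
): ImageRequest {
  // Collapse whitespace so multi-line chat messages become a single-line prompt.
  const cleaned = prompt.trim().replace(/\s+/g, ' ');
  if (cleaned.length === 0) {
    throw new Error('Prompt must not be empty');
  }
  return {
    model: 'dall-e-3',
    prompt: cleaned.slice(0, MAX_PROMPT_CHARS),
    size,
    quality,
    n: 1,
  };
}

const imageRequest = buildImageRequest('  a red   fox in\nsnow ');
const imageRequestBody = JSON.stringify(imageRequest);
```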
Stable Diffusion (Local)
Run image generation locally with no API costs:
# Install and run ComfyUI for local Stable Diffusion
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI && pip install -r requirements.txt
python main.py --listen 0.0.0.0 --port 8188
Connect via MCP or HTTP skill for zero-cost image generation on GPU hardware.
Real-Time Audio Streaming
Gemini Live API
For real-time bidirectional audio (used by VisionClaw smart glasses):
// The endpoint expects the API key as a query parameter (GOOGLE_API_KEY from the environment)
const ws = new WebSocket(
  'wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent' +
    `?key=${process.env.GOOGLE_API_KEY}`
);

ws.onopen = () => {
  // Send setup message: choose the model and request audio responses
  ws.send(JSON.stringify({
    setup: {
      model: 'models/gemini-2.0-flash-exp',
      generationConfig: {
        responseModalities: ['AUDIO'],
        speechConfig: {
          voiceConfig: {
            prebuiltVoiceConfig: { voiceName: 'Aoede' }
          }
        }
      }
    }
  }));
};

// Stream audio chunks (base64-encoded 16 kHz PCM); call only after the socket is open
function sendAudioChunk(base64Audio: string) {
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [{
        mimeType: 'audio/pcm;rate=16000',
        data: base64Audio
      }]
    }
  }));
}
This enables sub-second voice interaction: the user speaks and the model responds in audio, with no STT/TTS pipeline latency.
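Microphone audio usually arrives as one growing buffer, while a streaming endpoint is fed in small chunks. The sketch below splits 16 kHz, 16-bit mono PCM into fixed-duration pieces. `chunkPcm` and the 100 ms default are illustrative assumptions, not values prescribed by the Gemini Live API; each chunk would still need to be base64-encoded before sending.

```typescript
// Hypothetical helper: splits mono 16-bit PCM samples into fixed-duration chunks.
function chunkPcm(
  samples: Int16Array,
  chunkMs: number = 100,
  sampleRate: number = 16000,
): Int16Array[] {
  const samplesPerChunk = Math.floor((sampleRate * chunkMs) / 1000);
  if (samplesPerChunk <= 0) {
    throw new Error('chunkMs is too small for the given sample rate');
  }
  const chunks: Int16Array[] = [];
  for (let start = 0; start < samples.length; start += samplesPerChunk) {
    // subarray creates a view without copying; the last chunk may be shorter.
    chunks.push(samples.subarray(start, start + samplesPerChunk));
  }
  return chunks;
}

// 4,000 samples at 16 kHz is a quarter second of audio.
const quarterSecond = new Int16Array(4000);
const pcmChunks = chunkPcm(quarterSecond);
```

Smaller chunks lower latency at the cost of more WebSocket messages; the right size depends on the network and the device.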
When to Use Real-Time vs Pipeline
| Approach | Latency | Cost | Quality | Best For |
|---|---|---|---|---|
| Whisper + Edge TTS | 3-8s | Free | Very good | Messaging apps, async |
| Gemini Live | under 1s | API cost | Good | Real-time voice, wearables |
| ElevenLabs Streaming | 1-2s | Premium | Excellent | Premium voice quality |
Wearable & Mobile Multimodal
VisionClaw (Smart Glasses)
The VisionClaw project (681 stars) connects Meta Ray-Ban smart glasses to OpenClaw:
- Video capture at ~1 FPS from glasses camera
- Bidirectional audio via Gemini Live API WebSocket
- 56+ connected skills from the OpenClaw ecosystem
- Fallback mode using iPhone camera for testing
Smart Glasses → iOS App → Gemini Live API → OpenClaw Gateway
      ↑                                            │
      └────────────── Audio Response ──────────────┘
Requirements: iOS 17.0+, Xcode 15.0+, Meta Ray-Ban smart glasses (or iPhone camera fallback)
See the Ecosystem page for setup details and repository links.
ClawPhone (Android)
Run a multimodal agent on a $25 Android smartphone:
- Camera access via Termux:API
- Screen overlay daemon for always-on interaction
- Hardware control (sensors, GPS, flashlight)
- Remote control via Discord channel
macOS Companion
Official menu bar app with:
- Voice control: speak commands directly
- Gateway health monitoring
- Debug tools for development
- Available as Universal Binary on macOS 15+
Multimodal Skill Patterns
Pattern 1: Describe and Act
Receive an image, analyze it, take action:
name: receipt-scanner
description: Scan receipts and log expenses
trigger: attachment:image
tools:
- shell
- filesystem
## Receipt Scanner
When an image attachment arrives:
1. Analyze the image and identify whether it's a receipt
2. Extract: merchant name, date, total amount, line items
3. Append to ~/expenses.csv: date, merchant, amount, category
4. Reply with a summary: "Logged $42.50 at Whole Foods (Groceries)"
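Steps 3 and 4 can be sketched as two small formatters. The `Expense` shape and helper names are hypothetical. The CSV quoting matters because merchant names often contain commas or quotes, which would otherwise corrupt ~/expenses.csv.

```typescript
// Hypothetical shape for the fields extracted from a receipt.
interface Expense {
  date: string; // e.g. an ISO date such as 2025-01-15
  merchant: string;
  amount: number;
  category: string;
}

// Quote a CSV field only when it contains a comma, quote, or line break.
function csvField(value: string): string {
  return /[",\n\r]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Column order follows the skill: date, merchant, amount, category.
function toCsvRow(e: Expense): string {
  return [e.date, csvField(e.merchant), e.amount.toFixed(2), csvField(e.category)].join(',');
}

// Builds the confirmation message sent back to the user.
function toSummary(e: Expense): string {
  return `Logged $${e.amount.toFixed(2)} at ${e.merchant} (${e.category})`;
}

const expense: Expense = {
  date: '2025-01-15',
  merchant: 'Whole Foods',
  amount: 42.5,
  category: 'Groceries',
};

const csvRow = toCsvRow(expense);
const reply = toSummary(expense);
```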
Pattern 2: Visual Monitoring
Periodically capture and analyze visual state:
name: site-monitor
description: Monitor website appearance for changes
trigger: "monitor site|check site"
tools:
- browser
- shell
## Site Monitor
When asked to monitor a site:
1. Navigate to the URL using the browser
2. Take a screenshot
3. Compare against the previous screenshot (stored in ~/.openclaw/screenshots/)
4. If significant visual changes are detected, alert the user
5. Save the current screenshot for next comparison
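What counts as a "significant" change in step 4 is left to the skill. One simple approach, sketched below, compares two decoded screenshots byte by byte and reports the fraction that differ. Both function names, the per-byte tolerance, and the 2% threshold are assumptions for illustration; decoding PNGs into raw pixel buffers is left to an image library of your choice.

```typescript
// Hypothetical sketch: inputs are raw pixel buffers (for example RGBA bytes) of two screenshots.
function changedFraction(
  previous: Uint8Array,
  current: Uint8Array,
  tolerance: number = 8,
): number {
  if (previous.length !== current.length) {
    return 1; // different dimensions: treat as fully changed
  }
  if (previous.length === 0) {
    return 0;
  }
  let changed = 0;
  for (let i = 0; i < previous.length; i++) {
    // Ignore tiny differences such as anti-aliasing noise.
    if (Math.abs(previous[i] - current[i]) > tolerance) {
      changed++;
    }
  }
  return changed / previous.length;
}

// Assumption: more than 2% of bytes differing counts as significant; tune per site.
function hasSignificantChange(
  previous: Uint8Array,
  current: Uint8Array,
  threshold: number = 0.02,
): boolean {
  return changedFraction(previous, current) > threshold;
}
```

A byte-level diff is cheap but sensitive to layout shifts; asking the vision model to describe the difference is a heavier alternative when the diff fires.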
Pattern 3: Multimodal Chain
Combine vision input with media output:
User sends a photo of a sunset via Telegram
1. Vision model analyzes: "A vibrant sunset over the ocean with
orange and purple hues reflecting on calm water"
2. Agent crafts a poetic description
3. Edge TTS converts description to speech
4. Reply includes both text and voice note
Pattern 4: Document Processing
Extract and act on document content:
name: doc-processor
description: Process documents and PDFs
trigger: attachment:file
tools:
- shell
- filesystem
## Document Processor
When a document is attached:
1. Identify file type (PDF, DOCX, image of text)
2. For PDFs: extract text, summarize key points
3. For images of documents: use vision to OCR and extract content
4. Answer questions about the document
5. Offer to create action items, summaries, or translations
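Step 1 can be sketched as a small classifier over the attachment's `name` and `mimeType` fields. `classifyDocument` and the list of image extensions are illustrative assumptions; the MIME types are the standard ones for PDF and DOCX.

```typescript
// Hypothetical helper: decides which branch of the skill an attachment should take.
type DocKind = 'pdf' | 'docx' | 'image' | 'unknown';

const DOCX_MIME = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document';
const IMAGE_EXTENSIONS = ['png', 'jpg', 'jpeg', 'webp', 'gif'];

function classifyDocument(name?: string, mimeType?: string): DocKind {
  const mime = (mimeType ?? '').toLowerCase();
  const lowerName = (name ?? '').toLowerCase();
  const dot = lowerName.lastIndexOf('.');
  const ext = dot >= 0 ? lowerName.slice(dot + 1) : '';

  // Prefer the MIME type, then fall back to the file extension.
  if (mime === 'application/pdf' || ext === 'pdf') return 'pdf';
  if (mime === DOCX_MIME || ext === 'docx') return 'docx';
  if (mime.indexOf('image/') === 0 || IMAGE_EXTENSIONS.indexOf(ext) !== -1) return 'image';
  return 'unknown';
}

const kind = classifyDocument('report.PDF');
```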
Configuration Reference
Full Multimodal Config
{
  "brain": {
    "provider": "anthropic",
    "model": "claude-sonnet-4-6"
  },
  "hands": {
    "browser": {
      "enabled": true,
      "headless": true
    }
  },
  "channels": {
    "telegram": {
      "permissions": {
        "send_media": true,
        "send_voice": true,
        "receive_media": true
      }
    }
  },
  "mcp": {
    "servers": {
      "puppeteer": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-puppeteer"]
      }
    }
  }
}
Environment Variables
# Voice
ELEVENLABS_API_KEY=your_key # Optional: premium TTS
OPENAI_API_KEY=your_key # Whisper API (if not running locally) and DALL-E
# Image Generation
# DALL-E reuses OPENAI_API_KEY from above
STABILITY_API_KEY=your_key # Stable Diffusion API
# Real-Time Audio
GOOGLE_API_KEY=your_key # Gemini Live API
Performance Tips
Voice Latency
- Use Whisper tiny or base for fastest transcription
- Run Whisper on GPU for sub-second transcription
- Use Edge TTS to avoid paid-API overhead (it still calls Microsoft's online service, so it needs network access)
- Pre-download Whisper models to avoid cold-start delay
Vision Cost
- Gemini 2.5 Flash is the cheapest hosted vision model listed above
- Resize large images before sending to the LLM (saves tokens)
- Claude's image token cost scales with pixel dimensions, so smaller images cost less
- Cache common visual analysis patterns in skills
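The resizing tip can be sketched as a dimension calculation that caps the longest edge while preserving aspect ratio. `fitWithin` is a hypothetical helper, and the 1568-pixel cap in the example is an assumed target; check your model provider's guidance for the actual recommended size. The result would be passed to whatever image library performs the resize.

```typescript
// Hypothetical helper: computes target dimensions, it does not resize pixels itself.
interface Dimensions {
  width: number;
  height: number;
}

function fitWithin(width: number, height: number, maxEdge: number): Dimensions {
  const longest = Math.max(width, height);
  if (longest <= maxEdge) {
    return { width, height }; // already small enough, never upscale
  }
  const scale = maxEdge / longest;
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}

// A 4000x3000 phone photo capped at an assumed 1568-pixel longest edge.
const target = fitWithin(4000, 3000, 1568);
```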
Storage
- Voice messages are temporary: clean up /tmp files after processing
- Screenshots accumulate: set a retention policy
- Generated images can be large: compress before sending
# Clean up old voice/image temp files (add to cron)
find /tmp -type f -name "voice_*.ogg" -mtime +1 -delete
find /tmp -type f -name "reply*.mp3" -mtime +1 -delete
find ~/.openclaw/screenshots -type f -mtime +7 -delete
Resource Usage
| Feature | RAM Impact | CPU Impact | GPU Impact |
|---|---|---|---|
| Whisper tiny | +200 MB | High during transcription | Preferred |
| Whisper base | +500 MB | High during transcription | Preferred |
| Whisper large | +3 GB | Very high | Required |
| Edge TTS | Minimal | Low | None |
| ElevenLabs | Minimal | None (API) | None |
| Local Stable Diffusion | +4 GB | Medium | Required (4+ GB VRAM) |
| Browser screenshots | +200 MB | Low | None |
Troubleshooting
Voice messages not transcribed
# Check Whisper is installed
whisper --help
# Test with a sample file
whisper --model base test.ogg
# Check ffmpeg (required by Whisper)
ffmpeg -version
If ffmpeg is missing: sudo apt install ffmpeg
Images not analyzed
- Confirm your model supports vision (Claude Sonnet/Opus, GPT-4o, Gemini)
- Check channel permissions include receive_media: true
- Verify the image URL is accessible to the brain
- Check logs: openclaw logs --filter "attachment" --follow
TTS audio not playing
# Test Edge TTS directly
edge-tts --text "test" --voice en-US-AriaNeural --write-media /tmp/test.mp3
# Verify the file was created
ls -la /tmp/test.mp3
# Check the channel can send audio
openclaw config get channels.telegram.permissions.send_voice
High latency on voice pipeline
- Switch Whisper to the tiny model
- Ensure GPU is being used: nvidia-smi (should show whisper process)
- Run Whisper with --language en to skip language detection
- Consider Gemini Live for real-time voice instead of the STT+TTS pipeline
See Also
- Channels: Channel setup and media permissions
- Custom Channels: Building channels with attachment support
- MCP Servers: Puppeteer and media-handling servers
- Skill Development: Building custom skills
- Advanced Recipes: Voice-controlled agent recipe
- Model Selection: Choosing vision-capable models
- Ecosystem: VisionClaw, ClawPhone, and companion apps
- Local Models: Running Whisper and LLaVA locally
- Performance Tuning: Optimizing multimodal costs