How to Set Up OpenClaw Voice Mode with ElevenLabs (Complete 2026 Guide)
18 min read · Updated 2026-03-05
By DoneClaw Team · We run managed OpenClaw deployments and write from hands-on production experience.
OpenClaw isn't just a text-based AI agent. With voice capabilities, you can turn your autonomous assistant into a conversational partner that speaks back to you—literally. Whether you want voice notes in Telegram, hands-free "Talk Mode" on macOS, or audio responses on mobile, OpenClaw's Text-to-Speech (TTS) system delivers. This guide covers everything you need to configure OpenClaw's voice features: ElevenLabs for premium neural voices, OpenAI TTS as a fallback, Edge TTS for zero-cost setup, and the interactive Talk Mode for continuous voice conversations.
Why Voice Matters for AI Agents
Text-only AI assistants miss a crucial dimension of communication. Voice adds:

- **Hands-free interaction** – Perfect for cooking, driving, or when your hands are busy
- **Accessibility** – Voice output helps users who prefer audio or have visual impairments
- **Emotional connection** – Hearing a voice feels more personal than reading text
- **Speed** – Sometimes listening is faster than reading

OpenClaw supports multiple TTS providers, each with distinct advantages:

| Provider | Cost | Quality | Setup Complexity | Best For |
|---|---|---|---|---|
| ElevenLabs | Pay-per-use | Premium neural voices | Medium | Natural, expressive speech |
| OpenAI TTS | Pay-per-use | High | Easy | GPT ecosystem users |
| Edge TTS | Free | Good | Easy | Budget-conscious users |
Prerequisites
Before configuring voice, ensure you have:
- **OpenClaw installed** – Follow our OpenClaw Setup Guide for Beginners
- **API keys** (optional, depends on provider):
  - ElevenLabs: Get your API key at elevenlabs.io
  - OpenAI: Get your API key at platform.openai.com
- **A configured messaging channel** – Telegram, WhatsApp, or Discord (voice works best with Telegram)
Configuring Text-to-Speech in OpenClaw
**Method 1: ElevenLabs (Recommended)**
ElevenLabs delivers the most natural-sounding AI voices available. Here's how to set it up:
**Step 1: Get Your ElevenLabs API Key**

- Go to elevenlabs.io and create an account
- Navigate to **Profile** → **API Keys**
- Copy your API key

**Step 2: Configure OpenClaw**

Add the following to your `~/.openclaw/openclaw.json`:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "elevenlabs",
      elevenlabs: {
        apiKey: "YOUR_ELEVENLABS_API_KEY",
        voiceId: "YOUR_VOICE_ID",
        modelId: "eleven_multilingual_v2",
        voiceSettings: {
          stability: 0.5,
          similarityBoost: 0.75,
          style: 0.0,
          useSpeakerBoost: true,
          speed: 1.0
        }
      }
    }
  }
}
```

**Step 3: Choose Your Voice**

To find a voice ID:

- Go to **ElevenLabs** → **Voice Library**
- Browse or create a custom voice
- Copy the Voice ID from the voice details

Popular voice IDs to try:

- **Rachel** (21m00Tcm4TlvDq8ikWAM) – Warm, conversational
- **Domi** (9DT18BkRxPQq0VHpQAxz) – Confident, clear
- **Arnold** (8W3TzLqL6pY2wNJdVm3R) – Deep, authoritative

**Step 4: Restart OpenClaw**

```
openclaw gateway restart
```

**Step 5: Enable TTS**

Enable voice responses with:

```
/tts always
```

Or enable only after receiving a voice note:

```
/tts inbound
```

**Method 2: OpenAI TTS**

If you're already using OpenAI for LLM inference, OpenAI TTS offers seamless integration:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "openai",
      openai: {
        apiKey: "YOUR_OPENAI_API_KEY",
        model: "gpt-4o-mini-tts",
        voice: "alloy"
      }
    }
  }
}
```

Available OpenAI voices:

- **alloy** – Neutral, versatile
- **echo** – Warm, friendly
- **fable** – Expressive, storytelling
- **onyx** – Deep, authoritative
- **nova** – Bright, energetic
- **shimmer** – Soft, gentle

**Method 3: Edge TTS (Free)**

No API key required. OpenClaw defaults to Edge TTS when no keys are present:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "edge",
      edge: {
        enabled: true,
        voice: "en-US-MichelleNeural",
        lang: "en-US",
        outputFormat: "audio-24khz-48kbitrate-mono-mp3",
        rate: "+10%",
        pitch: "-5%"
      }
    }
  }
}
```

Popular Edge Neural Voices:

- `en-US-MichelleNeural` – Professional, clear
- `en-US-GuyNeural` – Conversational
- `en-GB-SoniaNeural` – British, friendly

**Understanding TTS Auto Modes**

The `auto` setting controls when OpenClaw sends audio:

| Mode | Behavior |
|---|---|
| `off` | No automatic TTS (default) |
| `always` | Every response becomes audio |
| `inbound` | Only reply with audio after receiving a voice note |
| `tagged` | Only when the reply contains a `[[tts]]` directive |
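The four auto modes boil down to a single decision rule. Here's an illustrative Python sketch of that logic (not OpenClaw's actual code; the function and parameter names are hypothetical):

```python
def should_send_audio(mode: str, inbound_was_voice: bool, reply_text: str) -> bool:
    """Decide whether a reply should be rendered as audio, per the `auto` mode."""
    if mode == "off":
        return False                      # never auto-TTS
    if mode == "always":
        return True                       # every reply becomes audio
    if mode == "inbound":
        return inbound_was_voice          # only answer voice with voice
    if mode == "tagged":
        return "[[tts" in reply_text      # only when the reply carries a directive
    raise ValueError(f"unknown tts auto mode: {mode}")
```

For example, in `inbound` mode a text message gets a text reply, while a voice note gets a voice reply.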
Setting Up Talk Mode (Hands-Free Voice Conversation)
Talk Mode transforms OpenClaw into a real-time voice assistant. It works on macOS, iOS, and Android.
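At its core, Talk Mode is a continuous capture, transcribe, respond, speak loop. The sketch below shows that control flow with stubbed audio I/O; every function here is a hypothetical stand-in for illustration, not OpenClaw's real API:

```python
from typing import Callable, Optional

def talk_loop(
    capture_speech: Callable[[], Optional[str]],  # stub: blocks until VAD detects an utterance
    transcribe: Callable[[str], str],             # stub: e.g. Whisper speech-to-text
    ask_agent: Callable[[str], str],              # stub: send transcript to the AI agent
    speak: Callable[[str], None],                 # stub: stream TTS audio to the speaker
    max_turns: int = 3,
) -> list:
    """Run a bounded talk loop; returns the transcripts handled."""
    handled = []
    for _ in range(max_turns):
        audio = capture_speech()
        if audio is None:          # user closed Talk Mode
            break
        text = transcribe(audio)
        reply = ask_agent(text)
        speak(reply)               # in real Talk Mode, playback is interruptible
        handled.append(text)
    return handled
```

With trivial stubs, one spoken turn flows through all four stages before the loop waits for the next utterance.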
**How Talk Mode Works**

Talk Mode cycles through four stages:

- **Listen** – Voice activity detection picks up speech
- **Transcribe** – Speech converted to text via Whisper
- **Process** – Transcript sent to your AI agent
- **Speak** – Response converted to audio via ElevenLabs (streaming playback)

**Enabling Talk Mode**

Add to your `~/.openclaw/openclaw.json`:

```json5
{
  talk: {
    voiceId: "YOUR_ELEVENLABS_VOICE_ID",
    modelId: "eleven_v3",
    outputFormat: "mp3_44100_128",
    apiKey: "YOUR_ELEVENLABS_API_KEY",
    interruptOnSpeech: true
  }
}
```

**Talk Mode Controls (macOS)**

- **Menu bar toggle**: Click **Talk** to enable/disable
- **Overlay states**:
  - ☁️ **Listening** – Mic active, showing audio levels
  - 💭 **Thinking** – Processing your request
  - 🔊 **Speaking** – Playing audio response
- **Interrupt**: Click the cloud to stop speaking
- **Exit**: Click X to close Talk Mode

**Voice Directives in Replies**

Your AI agent can control voice parameters per reply using inline `[[tts:...]]` directives. Available directive keys:

- `voice` / `voiceId` – Override voice
- `model` – Change TTS model
- `speed` / `rate` – Adjust speaking speed (WPM)
- `stability` – Voice consistency (ElevenLabs)
- `similarityBoost` – Voice similarity
- `style` – Expressiveness level
- `once` – Apply only to the current reply

For example:

Here you go.
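To make the directive format concrete, here's a small parser sketch for the `[[tts:key=value ...]]` form (illustrative only; OpenClaw's actual parsing may differ):

```python
import re

def parse_tts_directive(reply: str):
    """Split a reply into clean text and the key=value pairs of its [[tts:...]] tag."""
    match = re.search(r"\[\[tts:([^\]]+)\]\]", reply)
    if not match:
        return reply, {}
    # "voiceId=abc speed=1.1" -> {"voiceId": "abc", "speed": "1.1"}
    params = dict(
        pair.split("=", 1) for pair in match.group(1).split() if "=" in pair
    )
    clean = (reply[: match.start()] + reply[match.end():]).strip()
    return clean, params
```

Parsing `"Here you go.\n[[tts:voiceId=abc speed=1.1]]"` yields the spoken text plus the override parameters.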
[[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]

Deep Dive: Understanding Each TTS Provider
**ElevenLabs: The Premium Choice**

ElevenLabs has quickly become the gold standard for AI text-to-speech. Their multilingual v2 model produces remarkably natural speech with proper intonation, emotional range, and minimal artifacts. The platform offers:

**Key Features:**

- **Voice Library** – 70+ pre-made voices across languages
- **Voice Design** – Create custom voices from text descriptions
- **Voice Cloning** – Clone your own voice (with consent)
- **Emotional Range** – Control stability vs. expressiveness
- **Multilingual Support** – 32 languages in one model

**Cost Breakdown:** ElevenLabs uses a credit-based system. For TTS:

- **Mini models**: ~$0.01 per 1,000 characters
- **Standard voices**: ~$0.10 per 1,000 characters
- **Premium voices**: ~$0.30 per 1,000 characters

New users receive free credits to start. For typical OpenClaw usage (100–200 voice responses daily), expect roughly $15–45/month on the mini tier; standard and premium voices cost proportionally more.

**Why ElevenLabs Works Best for OpenClaw:** The streaming API delivers audio incrementally, reducing perceived latency. Combined with emotional control (style, similarity boost), your agent sounds genuinely conversational rather than robotic.

**OpenAI TTS: Ecosystem Integration**

If you're already paying for OpenAI API access (GPT-4, etc.), OpenAI TTS provides a convenient add-on:

**Key Features:**

- **Simple API** – Integrated with existing OpenAI keys
- **6 Voices** – Alloy, Echo, Fable, Onyx, Nova, Shimmer
- **gpt-4o-mini-tts** – Efficient model for fast generation
- **Reliable** – Backed by OpenAI infrastructure

**Cost:**

- $0.015 per 1,000 characters (gpt-4o-mini-tts)
- Slightly higher for other models

**Best For:** Users deeply invested in the OpenAI ecosystem who want one bill for all AI services.

**Edge TTS: The Free Alternative**

Microsoft's Edge TTS provides surprisingly good neural voices at no cost:

**Key Features:**

- **No API Key** – Works out of the box
- **100+ Voices** – Multiple languages and voice types
- **Neural Voices** – Modern ML-based synthesis
- **Relatively Natural** – Better than older TTS systems

**Limitations:**

- Requires an internet connection (not local)
- No SLA or guaranteed uptime
- Slightly robotic compared to ElevenLabs
- Character limits (~10 minutes of audio per request)

**Best For:**

- Testing TTS functionality without spending money
- Low-volume use cases
- Projects with tight budgets
Skip 60 minutes of setup — deploy in 60 seconds
DoneClaw handles Docker, servers, security, and updates. Your OpenClaw agent is ready to chat in under a minute.
Deploy Now

Detailed Troubleshooting Guide
**Issue 1: Voice Not Playing**

**Symptoms**: Text appears but no audio, or you see `[[audio_as_voice]]` but no voice note in Telegram

**Root Causes & Solutions**:

**A. Missing or Invalid API Key**

```bash
# Verify your key is set
echo $ELEVENLABS_API_KEY

# If not set, add to your shell profile
echo 'export ELEVENLABS_API_KEY="your-key"' >> ~/.bashrc
source ~/.bashrc
```

**B. Channel Not Supporting Voice** – Different platforms handle audio differently:

| Platform | Support Level | Notes |
|---|---|---|
| Telegram | Full | Voice notes, playable inline |
| WhatsApp | Partial | Audio as file attachment |
| Discord | Partial | Audio as attachment, not voice channel |
| iMessage | Full | Full support |
| Signal | Partial | Audio as file |

**C. Audio Format Mismatch** – Some channels require specific formats. Add an output format to your config:

```json5
{
  elevenlabs: {
    outputFormat: "mp3_44100_128"
  }
}
```

**D. Run OpenClaw Doctor**

```
openclaw doctor --fix
```

This command checks:

- API key configuration
- Channel permissions
- Audio format compatibility
- TTS service availability

**Issue 2: MEDIA Path Showing in Reply**

**Symptoms**: Response includes text like `MEDIA:/Users/you/.openclaw/cache/tts/abc123.ogg` instead of sending audio

**Root Causes**:

**A. Channel Integration Bug** – This was a known issue in earlier 2026 versions. Update:

```
openclaw gateway update
openclaw gateway restart
```

**B. Channel Doesn't Support TTS** – If using Discord webhooks or certain API integrations, voice may not work:

- Use Telegram for guaranteed voice support
- Check channel documentation for audio support

**C. File Permissions** – Ensure OpenClaw can read/write the cache directory:

```
chmod -R 755 ~/.openclaw/cache/tts
```

**Issue 3: Audio Cuts Off or Skips**

**Symptoms**: Voice message plays partially, then stops

**Root Causes & Solutions**:

**A. Text Too Long** – Default TTS has character limits. Reduce:

```json5
{
  messages: {
    tts: {
      maxTextLength: 1500
    }
  }
}
```

**B. Network Timeout** – For longer texts, increase the timeout:

```json5
{
  messages: {
    tts: {
      timeoutMs: 60000
    }
  }
}
```

**C. Enable Auto-Summary** – Let the AI summarize long responses before TTS:

```json5
{
  messages: {
    tts: {
      summaryModel: "openai/gpt-4o-mini",
      // Summary triggers when text exceeds:
      maxTextLength: 2000
    }
  }
}
```

**Issue 4: Wrong Voice or Accent**

**Symptoms**: Voice doesn't match expectations, or speaks the wrong language

**Solutions**:

**A. Explicitly Set Language**

```json5
{
  elevenlabs: {
    languageCode: "en", // 2-letter ISO code
    // Or use the multilingual model:
    modelId: "eleven_multilingual_v2"
  }
}
```

**B. Choose a Language-Specific Voice** – Each voice has a primary language. Rachel is English; other voices suit other languages better.

**C. Check Voice Settings**

```json5
{
  elevenlabs: {
    voiceSettings: {
      stability: 0.5,
      similarityBoost: 0.75,
      style: 0.0 // Higher = more expressive, may sound different
    }
  }
}
```

**Issue 5: High API Costs**

Voice can get expensive if not controlled. Here's how to manage:

**Strategy 1: Hard Limits**

```json5
{
  messages: {
    tts: {
      maxTextLength: 1000
      // If exceeded, TTS fails and the reply is sent as text instead
    }
  }
}
```

**Strategy 2: Inbound-Only Mode** – Only respond with voice when you initiate with voice:

```
/tts inbound
```

Now voice only plays after you send a voice note.

**Strategy 3: Use Free TTS**

```json5
{
  messages: {
    tts: {
      provider: "edge"
    }
  }
}
```

**Strategy 4: Monitor Usage** – Check the ElevenLabs dashboard regularly and set up billing alerts.

**Issue 6: Talk Mode Not Activating**

**Symptoms**: Menu bar shows no Talk option, or clicking does nothing

**Solutions**:

**A. Grant Permissions** – Talk Mode needs:

- Microphone access
- Speech recognition permission (macOS)

**B. Install Dependencies** – Some systems need additional packages:

```
# Install Whisper for speech-to-text
openclaw install whisper
```

**C. Check Platform Compatibility** – Talk Mode works best on:

- macOS (full support)
- iOS (via app)
- Android (via app)
- Linux (limited, needs manual setup)

**Issue 7: Voice Sounds Robotic**

**Symptoms**: Unnatural intonation, flat delivery

**Solutions**:

**A. Use ElevenLabs v3**

```json5
{
  elevenlabs: {
    modelId: "eleven_v3" // Newer than v2
  }
}
```

**B. Adjust the Style Parameter**

```json5
{
  elevenlabs: {
    voiceSettings: {
      style: 0.3, // Higher = more expressive
      stability: 0.4 // Lower = more variable
    }
  }
}
```

**C. Use Speaker Boost**

```json5
{
  elevenlabs: {
    voiceSettings: {
      useSpeakerBoost: true
    }
  }
}
```
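A related technique for the truncation problem from Issue 3 is to split long replies into sentence-aligned chunks under the character limit before synthesis. This is an illustrative sketch, not OpenClaw's built-in behavior (OpenClaw's own answer to long text is `maxTextLength` plus `summaryModel`):

```python
import re

def chunk_for_tts(text: str, max_len: int = 1500) -> list:
    """Split text into chunks no longer than max_len, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single oversized sentence still gets hard-split.
            while len(sentence) > max_len:
                chunks.append(sentence[:max_len])
                sentence = sentence[max_len:]
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized and sent as a separate voice note instead of being cut off mid-sentence.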
Use Cases: What Can You Do With Voice?
**1. Hands-Free Cooking Assistant**

While cooking, ask questions verbally:

- "What's the next step?"
- "How long should it simmer?"
- "What can I substitute for a missing ingredient?"

**2. Voice-Enabled Smart Home Controller**

Control lights, thermostat, and appliances through conversation.

**3. Accessibility Companion**

For users with visual impairments or reading difficulties, voice makes AI assistance accessible.

**4. Language Learning Partner**

Practice conversation in different languages with instant AI responses.

**5. Hands-Free Email Triage**

Listen to email summaries and dictate responses while commuting.

**6. Bedtime Storyteller**

Request stories told aloud to children—or yourself.

**7. Meditation Guide**

Combine voice with timed prompts for guided meditation sessions.
Technical Deep Dive: How TTS Works in OpenClaw

**Supported Audio Formats**

| Provider | Output Formats |
|---|---|
| ElevenLabs | mp3_44100_128, pcm_44100, pcm_24000, ogg_vorbis |
| OpenAI | mp3, opus, aac, flac |
| Edge | audio-24khz-48kbitrate-mono-mp3, webm-24khz-16bit-mono-opus |

**Latency Considerations**

Expected latencies:

- **ElevenLabs Streaming**: 300-800ms to first audio, then real-time
- **OpenAI**: 1-3 seconds for the full response
- **Edge**: 500ms-2 seconds

For the best experience, use ElevenLabs with streaming enabled (`useStream: true`).

**Architecture Overview**

User Message → OpenClaw Gateway → AI Model (LLM)
        ↓
  Response Text
        ↓
  TTS Engine (configured provider)
        ↓
   Audio File
        ↓
Channel Dispatch → Telegram/WhatsApp/etc.

Security & Privacy Considerations
**What Data is Sent?**

When using TTS:

- **Text content** – Your AI responses go to the TTS provider
- **Voice preferences** – Stored locally in your config
- **Audio cache** – Stored in `~/.openclaw/cache/tts`

**Minimizing Privacy Exposure**

- **Use Local TTS When Possible**
  - Ollama with local Whisper for STT
  - Self-hosted TTS (advanced)
- **Don't Share Voice-Cloned Voices**
  - ElevenLabs voice cloning requires consent
  - Be careful cloning others' voices
- **Review Provider Policies**
  - ElevenLabs: Privacy Policy
  - OpenAI: Privacy Policy
  - Microsoft Edge: Privacy
- **Clear Cache Regularly**:
```
rm -rf ~/.openclaw/cache/tts/*
```

Advanced Configuration
**Per-Session Voice Settings**

Override voice settings per conversation:

```
/tts voice YOUR_VOICE_ID
/tts speed 1.2
/tts always
```

**Multiple Voice Providers (Fallback)**

Configure primary and fallback providers:

```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "elevenlabs",
      summaryModel: "openai/gpt-4o-mini",
      elevenlabs: { /* config */ },
      openai: { /* config */ },
      edge: { enabled: true }
    }
  }
}
```

If ElevenLabs fails, OpenClaw automatically falls back to OpenAI, then Edge TTS.

**Model-Driven Voice Control**

Allow your AI to control its own voice:

```json5
{
  messages: {
    tts: {
      modelOverrides: {
        enabled: true,
        allowProvider: true
      }
    }
  }
}
```

The model can then use directives like:
[[tts:voiceId=different_voice speed=1.2]]

Cost Comparison
Here's what to expect in terms of costs:

| Provider | Cost per 1,000 characters | Notes |
|---|---|---|
| ElevenLabs | ~$0.01-0.30 | Depends on voice tier |
| OpenAI TTS | ~$0.015 | gpt-4o-mini-tts |
| Edge TTS | Free | No API key required |

**Example monthly costs** (100 responses/day, ~500 characters each, roughly 1.5M characters/month):

- ElevenLabs: ~$15-450/month depending on voice tier
- OpenAI TTS: ~$22.50/month
- Edge TTS: Free
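The arithmetic behind these estimates is straightforward; here's a small helper for plugging in your own volumes (prices are the approximate per-1,000-character figures from the table above):

```python
def monthly_tts_cost(
    responses_per_day: int,
    chars_per_response: int,
    price_per_1k_chars: float,
    days: int = 30,
) -> float:
    """Estimated monthly TTS spend in dollars."""
    total_chars = responses_per_day * chars_per_response * days
    return total_chars / 1000 * price_per_1k_chars

# 100 responses/day at ~500 characters each:
# ElevenLabs mini tier (~$0.01/1k) is about $15/month
# OpenAI gpt-4o-mini-tts (~$0.015/1k) is about $22.50/month
```

Doubling either the response count or the response length doubles the bill, which is why `maxTextLength` and inbound-only mode are the most effective cost levers.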
Conclusion
Adding voice to OpenClaw transforms it from a text-based assistant into a true conversational partner. Whether you choose premium ElevenLabs voices, cost-effective OpenAI TTS, or free Edge TTS, the setup takes minutes. Start with the free Edge TTS to test, then upgrade to ElevenLabs when you want premium voice quality. Your AI agent will thank you—and you'll wonder how you ever managed without voice.
Skip the setup? DoneClaw deploys OpenClaw for you — $29/mo with 7-day free trial, zero configuration.
Frequently asked questions
Can I use my own custom voice?
Yes! ElevenLabs lets you create custom voices or clone an existing voice. Get your Voice ID from the voice settings and use it in your `voiceId` configuration.

**To create a custom voice:**

- Go to ElevenLabs → Voice Library → Add New Voice
- Choose "Voice Design" or "Voice Clone"
- Follow the prompts to create/record the voice
- Copy the Voice ID and use it in your config
Does voice work on mobile?
Yes. On iOS and Android, voice works through the respective apps. You'll need to configure TTS as above, and Talk Mode supports voice activity detection on mobile.

**Mobile-specific notes:**

- iOS: Use the OpenClaw iOS app with microphone permissions
- Android: Use the OpenClaw Android app
- Both support background audio playback
Can I use voice without an API key?
Yes. OpenClaw defaults to Edge TTS (Microsoft's free neural TTS) when no API keys are present. Quality is decent and it's completely free.

**Auto-detection:** if no keys are found, OpenClaw automatically uses Edge TTS. No config change needed.
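Conceptually, this auto-detection is a first-match-wins check over available keys. An illustrative sketch (not OpenClaw's actual code; the environment variable names are assumptions):

```python
import os

def pick_tts_provider(env=None) -> str:
    """Pick a TTS provider based on which API keys are present; Edge needs none."""
    env = env if env is not None else dict(os.environ)
    if env.get("ELEVENLABS_API_KEY"):
        return "elevenlabs"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return "edge"  # free, no key required
```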
What's the difference between TTS and Talk Mode?
**TTS (Text-to-Speech):**

- Converts written responses to audio
- Triggered automatically after an AI response
- Works on any messaging platform

**Talk Mode:**

- Continuous voice conversation loop
- Listens via microphone in real-time
- Requires more permissions and setup
- Available on macOS/iOS/Android
Can I use voice in multiple languages?
Yes, especially with the ElevenLabs multilingual model:

```json5
{
  elevenlabs: {
    modelId: "eleven_multilingual_v2",
    languageCode: "auto" // Detect from text
  }
}
```

Your agent can respond in the language you use!