Giving My AI Agent a Voice (Without the Cloud)
Most AI voice solutions route your audio through a cloud speech API. Google, AWS, Azure: they all want your voice data. I wanted my AI agent to hear and speak without sending my audio to a third-party speech API. Here's how I made it work with faster-whisper and edge-tts on a home server.
Why Local Voice?
Privacy is the obvious reason, but there are practical ones too:
- Latency: Cloud STT adds 200-500ms round-trip. Local inference on a decent CPU is 100-200ms for short utterances
- Cost: Cloud STT services charge per minute. Local is free after hardware
- Reliability: No internet dependency. Your voice agent works when your ISP doesn't
The Architecture
The voice pipeline has three stages:
- Capture: Discord captures audio from a voice channel via the bot's voice client
- STT: Audio is transcribed locally using faster-whisper (the CTranslate2-optimized fork of OpenAI's Whisper)
- TTS: The agent's text response is converted to speech using edge-tts (a free client for Microsoft Edge's online TTS service)
All of this runs inside a custom Docker image on Unraid. No GPU required — faster-whisper's "base" model runs fine on a mini6900hx CPU.
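As a rough illustration of how the three stages fit together, here is a minimal sketch. The `VoicePipeline` class and its three callables are hypothetical names, not part of Hermes Agent or any library; they stand in for the real capture, STT, and TTS code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoicePipeline:
    """Wires the stages together; every callable is a hypothetical stand-in."""
    transcribe: Callable[[bytes], str]   # STT: PCM audio bytes -> text
    respond: Callable[[str], str]        # agent: user text -> reply text
    synthesize: Callable[[str], bytes]   # TTS: reply text -> encoded audio

    def handle_utterance(self, pcm: bytes) -> Optional[bytes]:
        text = self.transcribe(pcm).strip()
        if not text:
            # Silence or noise produced no transcript, so there is nothing to answer.
            return None
        return self.synthesize(self.respond(text))
```

Keeping each stage behind a plain callable makes it easy to swap the STT model or TTS voice without touching the Discord code.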
Building the Custom Docker Image
The stock Hermes Agent image doesn't include voice dependencies. I built a custom image based on it:
FROM local/hermes-agent:latest
RUN pip install faster-whisper edge-tts opuslib
# Download the whisper base model at build time
RUN python -c "from faster_whisper import WhisperModel; \
WhisperModel('base', device='cpu', compute_type='int8')"
Downloading the model at build time means the container starts fast — no model download on first run. The base model is about 150MB and provides good accuracy for English speech.
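At runtime, loading the model with the same arguments should pick up the copy cached during the build, assuming the container runs as the same user with the same cache directory. A minimal sketch follows; `transcribe_file` and `join_segments` are illustrative helper names, not part of faster-whisper.

```python
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Concatenate the .text of each segment into one transcript string."""
    return " ".join(seg.text.strip() for seg in segments).strip()


def transcribe_file(path: str) -> str:
    # Imported lazily so join_segments stays usable without the library installed.
    from faster_whisper import WhisperModel

    # Same arguments as the Dockerfile line above.
    model = WhisperModel("base", device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus an info object;
    # decoding only happens while the generator is consumed.
    segments, _info = model.transcribe(path, language="en")
    return join_segments(segments)
```

In a long-running bot you would likely create the `WhisperModel` once at startup rather than on every call.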
The Discord Voice Connection
Joining a voice channel only needs the bot scope, but registering the slash commands that trigger voice mode also requires the applications.commands OAuth scope. This tripped me up initially: the bot could join voice channels, but the slash commands never registered.
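For reference, an invite URL carrying both scopes can be assembled as in the sketch below. `build_invite_url` is a hypothetical helper, and the permission set (Connect and Speak only) is an assumption; a real bot will probably need more.

```python
from urllib.parse import urlencode

# Discord permission bits for joining and speaking in voice channels.
CONNECT = 1 << 20
SPEAK = 1 << 21


def build_invite_url(client_id: str) -> str:
    params = {
        "client_id": client_id,
        # Both scopes, space-separated; urlencode turns the space into "+".
        "scope": "bot applications.commands",
        "permissions": str(CONNECT | SPEAK),
    }
    return "https://discord.com/oauth2/authorize?" + urlencode(params)
```

If the bot was invited earlier with only the bot scope, re-inviting it with the new URL adds the missing scope.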
The voice client uses Opus encoding (hence opuslib in the pip install). Discord sends audio as 48kHz Opus frames. These get decoded to raw PCM, resampled to 16kHz (what Whisper expects), and fed into faster-whisper.
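The resampling step can be sketched with the standard library alone. This assumes the decoded PCM is 16-bit little-endian interleaved stereo at 48kHz, and it uses a crude average-of-three decimation instead of a proper low-pass filter, so treat it as an illustration rather than production audio code. The function name is hypothetical.

```python
import sys
from array import array


def pcm48k_stereo_to_16k_mono(pcm: bytes) -> list[float]:
    """Convert 16-bit 48 kHz stereo PCM to 16 kHz mono floats in [-1, 1)."""
    samples = array("h")
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])  # drop a trailing odd byte
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed little-endian

    # Downmix: average the left and right sample of each stereo frame.
    mono = [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples) - 1, 2)]

    # Decimate by 3 (48 kHz -> 16 kHz), averaging each group of three samples.
    out = []
    for i in range(0, len(mono) - 2, 3):
        out.append((mono[i] + mono[i + 1] + mono[i + 2]) / 3 / 32768.0)
    return out
```

faster-whisper's `transcribe()` also accepts a 16kHz float32 NumPy array, so the result could be passed along as `np.asarray(samples, dtype=np.float32)`.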
STT Performance
On the mini6900hx with the base model and int8 quantization:
- Short commands (under 5 seconds): ~150ms transcription time
- Sentences (5-15 seconds): ~300-500ms
- Long utterances (30+ seconds): ~1-2 seconds
For a conversational voice agent, these latencies are acceptable. Not real-time enough for a call center, but fine for home automation commands and casual chat.
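If you want to reproduce numbers like these on your own hardware, a small timing helper is enough. The sketch below uses hypothetical helper names and expresses the result as a real-time factor (processing time divided by audio length).

```python
import time
from typing import Callable, Tuple


def timed(fn: Callable[[], str]) -> Tuple[str, float]:
    """Run fn once and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0


def real_time_factor(elapsed_ms: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return (elapsed_ms / 1000.0) / audio_seconds
```

One caveat: faster-whisper returns its segments lazily, so the function you time has to consume the whole generator, otherwise you only measure setup. By this measure, 150ms for a 5-second clip is a real-time factor of 0.03.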
TTS with edge-tts
For text-to-speech, edge-tts is the surprise winner. It's a Python package that interfaces with Microsoft Edge's TTS service. The voices are natural-sounding, there are dozens of options, and it's completely free with no API key required.
import edge_tts

async def speak(text: str, output_path: str):
    # Synthesize the text with the chosen voice and write the MP3 to output_path
    communicate = edge_tts.Communicate(text, voice="en-US-GuyNeural")
    await communicate.save(output_path)
The generated MP3 is sent as an audio attachment in Discord. The whole round-trip — user speaks, agent transcribes, generates response, synthesizes speech, sends audio — is under 3 seconds for typical interactions.
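To try other voices, edge-tts can return its voice list. A minimal sketch, assuming each entry is a dict with `ShortName` and `Locale` keys; `filter_voices` and `list_us_english_voices` are illustrative names.

```python
def filter_voices(voices: list[dict], locale: str) -> list[str]:
    """Return the ShortName of every voice whose Locale matches, sorted."""
    return sorted(v["ShortName"] for v in voices if v.get("Locale") == locale)


async def list_us_english_voices() -> list[str]:
    # Lazy import; fetching the voice list needs network access.
    import edge_tts

    return filter_voices(await edge_tts.list_voices(), "en-US")
```

Running `asyncio.run(list_us_english_voices())` prints the candidates, and any returned name can be passed as the `voice` argument to `edge_tts.Communicate`.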
Lessons
- faster-whisper is fast enough without a GPU. The base model on CPU handles conversational speech well. Don't over-invest in hardware until you hit actual bottlenecks.
- edge-tts is the best free TTS option. Natural voices and no API key. It is not offline, though: synthesis runs on Microsoft's servers, so the reply text is sent out over the internet. Hard to beat for $0.
- Discord OAuth scopes matter. If your bot can't register slash commands, check that you have applications.commands in the invite URL.
- Build the model into the Docker image. Runtime model downloads add 30+ seconds to cold starts. Do it once at build time.
I build self-hosted voice AI systems — local STT/TTS, Discord integration, all running on your hardware, zero cloud dependency.