Realtime
Speech-to-speech via a realtime model (OpenAI / Azure OpenAI). Lowest latency, natural barge-in,
most expressive. Needs a realtime provider key.
Streaming
STT → your agent → TTS. Uses your host’s existing speech stack and full agent toolchain.
Needs ffmpeg on PATH. No realtime key required.
Choosing
| If you want… | Use |
|---|---|
| The lowest latency and most natural turn-taking | Realtime |
| Speech-to-speech expressivity (tone, emotion) | Realtime |
| To reuse your agent’s full tool/RAG pipeline per turn | Streaming |
| To avoid a realtime provider subscription | Streaming |
| Tight control over the STT and TTS vendors | Streaming |
Switching
- OpenClaw
- Hermes
Set mode in the plugin config.
Streaming shells out to ffmpeg for resampling, so make sure it’s installed and on PATH.
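The per-mode requirements described on this page can be checked before switching. The sketch below is only an illustration, not part of the plugin: the helper names and the mode strings `"realtime"` and `"streaming"` are assumptions, and whether a realtime provider key is present is passed in by the caller rather than read from any real config. The PATH lookup uses Python's standard `shutil.which`.

```python
from __future__ import annotations

import shutil


def ffmpeg_available(binary: str = "ffmpeg") -> bool:
    """Return True if the given binary can be found on PATH."""
    return shutil.which(binary) is not None


def check_mode_requirements(
    mode: str, has_realtime_key: bool, binary: str = "ffmpeg"
) -> list[str]:
    """Return the unmet requirements for a mode (empty list if none).

    The mode names are hypothetical placeholders for whatever values
    the plugin config actually accepts.
    """
    problems: list[str] = []
    if mode == "realtime":
        # Realtime needs a provider key (OpenAI / Azure OpenAI).
        if not has_realtime_key:
            problems.append("realtime mode needs a realtime provider key")
    elif mode == "streaming":
        # Streaming shells out to ffmpeg, so it must be on PATH.
        if not ffmpeg_available(binary):
            problems.append(f"streaming mode needs {binary} on PATH")
    else:
        problems.append(f"unknown mode: {mode!r}")
    return problems


if __name__ == "__main__":
    for issue in check_mode_requirements("streaming", has_realtime_key=False):
        print(issue)
```

Running the script prints nothing when ffmpeg is found. Otherwise it reports the missing dependency before the first voice session fails.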