The plugin runs in one of two dialogue modes. They use the same media bridge and the same features; they differ only in how speech becomes a reply.

Realtime

Speech-to-speech via a realtime model (OpenAI / Azure OpenAI). Lowest latency, natural barge-in, most expressive. Needs a realtime provider key.

Streaming

STT → your agent → TTS. Uses your host’s existing speech stack and full agent toolchain. Needs ffmpeg on PATH. No realtime key required.
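
To make the Streaming path concrete, here is a minimal sketch of one dialogue turn as three chained stages. The function name `streaming_turn` and the `stt`, `agent`, and `tts` callables are hypothetical stand-ins for whatever your host provides; they are not part of the plugin's API.

```python
from typing import Callable


def streaming_turn(
    audio: bytes,
    stt: Callable[[bytes], str],    # hypothetical: speech-to-text stage
    agent: Callable[[str], str],    # hypothetical: your agent, tools included
    tts: Callable[[str], bytes],    # hypothetical: text-to-speech stage
) -> bytes:
    """Run one Streaming-mode turn: caller audio in, reply audio out."""
    transcript = stt(audio)         # 1. transcribe what the caller said
    reply_text = agent(transcript)  # 2. let the agent produce a text reply
    return tts(reply_text)          # 3. synthesize the reply as audio


if __name__ == "__main__":
    # Toy stages so the sketch runs without any speech stack installed.
    out = streaming_turn(
        b"hello",
        stt=lambda a: a.decode(),
        agent=lambda t: t.upper(),
        tts=lambda r: r.encode(),
    )
    print(out)
```

In Realtime mode these three stages collapse into a single speech-to-speech model call, which is where the latency and expressivity difference comes from.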

Choosing

| If you want… | Use |
| --- | --- |
| The lowest latency and most natural turn-taking | Realtime |
| Speech-to-speech expressivity (tone, emotion) | Realtime |
| To reuse your agent’s full tool/RAG pipeline per turn | Streaming |
| To avoid a realtime provider subscription | Streaming |
| Tight control over the STT and TTS vendors | Streaming |

Both modes support barge-in (the caller can interrupt), the vision ring, group-call gating, DTMF, bilingual EN/AR, and the avatar driver cues.

Switching

Set mode in the plugin config:
{ "mode": "realtime" }   // or "streaming"
Streaming shells out to ffmpeg for resampling, so make sure ffmpeg is installed and on PATH before you switch.
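
If you want to catch a bad setup before a call starts, a preflight check along these lines may help. It is only a sketch: `preflight` and `VALID_MODES` are hypothetical names, not part of the plugin, and the check assumes the config has already been parsed into a dict. It does not verify the realtime provider key, because the name of that setting is not given here.

```python
import shutil
from typing import Callable, List, Optional

# The two mode values named above.
VALID_MODES = ("realtime", "streaming")


def preflight(
    config: dict,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a list of problems; an empty list means the mode looks usable."""
    problems: List[str] = []
    mode = config.get("mode")
    if mode not in VALID_MODES:
        problems.append(f"mode must be one of {VALID_MODES}, got {mode!r}")
    elif mode == "streaming" and which("ffmpeg") is None:
        # Streaming shells out to ffmpeg, so it has to be resolvable on PATH.
        problems.append("streaming mode needs ffmpeg on PATH")
    return problems


if __name__ == "__main__":
    for problem in preflight({"mode": "streaming"}):
        print("config problem:", problem)
```

The `which` parameter defaults to `shutil.which` and exists only so the lookup can be swapped out in tests.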