Realtime vs. streaming - Teams Voice Plugin (Stand in)

The plugin runs in one of two dialogue modes. Both share the same media bridge and feature set; they differ only in how speech becomes a reply.

Realtime

Speech-to-speech via a realtime model (OpenAI / Azure OpenAI). Lowest latency, natural barge-in, most expressive. Needs a realtime provider key.

Streaming

STT → your agent → TTS. Uses your host’s existing speech stack and full agent toolchain. Needs ffmpeg on PATH. No realtime key required.
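As a rough illustration of the streaming flow, one turn can be thought of as three stages chained in order. This is a minimal sketch, not the plugin's actual code: `run_streaming_turn` and the three stage callables are hypothetical names, and the stand-in stages only echo text so the example runs without any speech vendor.

```python
from typing import Callable


def run_streaming_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # hypothetical STT stage
    agent_reply: Callable[[str], str],    # hypothetical agent stage (tools, RAG, ...)
    synthesize: Callable[[str], bytes],   # hypothetical TTS stage
) -> bytes:
    """Run one caller turn through STT -> agent -> TTS and return reply audio."""
    text = transcribe(audio)
    reply = agent_reply(text)
    return synthesize(reply)


# Stand-in stages so the sketch is runnable; real stages would call your
# host's speech stack and agent.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = run_streaming_turn(b"hello", fake_stt, fake_agent, fake_tts)
```

Because each stage is a separate call, the STT and TTS vendors can be swapped independently. In realtime mode a single speech-to-speech model replaces all three stages.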

Choosing

If you want…	Use
The lowest latency and most natural turn-taking	Realtime
Speech-to-speech expressivity (tone, emotion)	Realtime
To reuse your agent’s full tool/RAG pipeline per turn	Streaming
To avoid a realtime provider subscription	Streaming
Tight control over the STT and TTS vendors	Streaming

Both modes support barge-in (the caller can interrupt), the vision ring, group-call gating, DTMF, bilingual EN/AR, and the avatar driver cues.

Switching

On OpenClaw, set mode in the plugin config:

{ "mode": "realtime" }   // or "streaming"

On Hermes, pick the handler at launch:

hermes teams-voice serve --handler realtime   # or: streaming

Streaming shells out to ffmpeg for resampling, so make sure it’s installed and on PATH.
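A quick preflight check before switching modes might look like the following. It is a sketch under assumptions: the `preflight` helper is hypothetical and not part of the plugin, and it checks only the two things stated above (the mode name, and ffmpeg being on PATH for streaming). It uses the standard library's `shutil.which` to look up the executable.

```python
import shutil
from typing import Callable, List, Optional

VALID_MODES = ("realtime", "streaming")


def preflight(
    mode: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a list of problems for the chosen mode; empty means OK.

    Hypothetical helper: `which` is injectable so the check can be
    exercised without depending on the local machine.
    """
    problems: List[str] = []
    if mode not in VALID_MODES:
        problems.append(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    elif mode == "streaming" and which("ffmpeg") is None:
        # Streaming shells out to ffmpeg for resampling.
        problems.append("ffmpeg not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in preflight("streaming"):
        print(problem)
```

The check deliberately skips the realtime provider key, since how that key is supplied depends on your host's configuration.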
