# Building EdgeVox: Chaining STT → Local LLM → TTS Without Touching the Cloud

**Published:** May 30, 2026
**Tags:** Edge AI, On-device AI, Voice AI, Local-First AI, Speech Recognition, Text-to-Speech, ROS2, Privacy

**Summary:** A first-hand build narrative of EdgeVox — a fully offline voice agent that chains speech-to-text, a local LLM, and text-to-speech on one device. The architecture in plain language, ROS2 integration, the latency budget, and the failure modes nobody warns you about.


---

A cloud voice assistant is three API calls in a trench coat: stream your audio to a speech-to-text endpoint, send the transcript to an LLM endpoint, stream the reply to a text-to-speech endpoint. Every one of those calls is a network round-trip and a data disclosure. [EdgeVox](https://github.com/nrl-ai/edgevox) is my attempt to collapse all three onto a single device — a laptop, a Jetson, a robot's onboard computer — so the microphone audio never leaves the machine.

This is the part of the build I wish someone had written down before I started: not the demo, but the trade-offs.

<figure className="not-prose my-8">
  <img
    className="w-full rounded-xl border border-warm-200 shadow-sm"
    src="/posts-data/2026-05-30-edgevox-offline-voice-agent/thumbnail.png"
    alt="EdgeVox — an offline voice agent that chains speech-to-text, a local LLM, and text-to-speech on one device."
  />
  <figcaption className="mt-3 text-center text-sm text-warm-600">
    EdgeVox — a streaming voice pipeline that runs speech in and a spoken reply out, entirely
    on-device.
  </figcaption>
</figure>

## Why a fully offline voice agent is harder than it sounds

The naive view is that "offline" just means swapping three SaaS endpoints for three local libraries. The catch is that a conversation is a _streaming, interruptible, real-time_ system, and the moment you own the whole stack you also own every constraint the cloud was hiding from you.

Three constraints dominate:

- **It is a pipeline, not a request.** Speech-to-text (STT), the LLM, and text-to-speech (TTS) all run on the same hardware, competing for the same CPU, GPU, and memory. A cloud setup hides this behind three independently scaled services. On one device, the LLM decoding a long answer and the STT transcribing your next sentence are fighting over the same silicon.
- **Latency is felt, not logged.** In a chat app, 800 ms of extra latency is a faint annoyance. In a _spoken_ exchange, silence longer than roughly a second reads as "the thing is broken" and the user starts talking over it. The metric that matters is **time-to-first-audio** — how long after you stop speaking before the agent starts speaking back — and it is a sum across every stage.
- **Barge-in is mandatory, not a feature.** Real conversations interrupt. If you can't talk over the agent to cut it off, it feels like a hold-music phone menu. Supporting interruption means the LLM has to be cancellable mid-generation and the TTS has to be stoppable mid-utterance — neither of which a request/response cloud API ever forced you to think about.

Own the stack and these stop being someone else's problem. That's the cost. The benefit is that the entire conversation — audio, transcript, and reply — stays on the device, with no API key, no per-token bill, and no privacy policy to trust. For anything touching sensitive speech, that is the only acceptable trade. (I made the broader case for keeping AI on-device in [Vietnam's Sovereign AI Conversation](/blog/2026-05-03-sovereign-ai-vietnam).)

## The architecture, in plain language

Picture an assembly line for sound. A spoken sentence goes in one end; a spoken answer comes out the other. Each station on the line does exactly one job and hands its output to the next:

```
Mic → VAD → STT → Agent (LLM) → SentenceSplit → TTS → Speaker
```

Read left to right, that is the whole system:

1. **Mic** captures raw audio from your microphone.
2. **VAD** (voice activity detection) is the doorman. It listens for _when_ you start and stop talking, so the heavy machinery downstream only wakes up for actual speech — not for the fan, the keyboard, or silence.
3. **STT** (speech-to-text) is the transcriber. It turns the chunk of audio into a string of words.
4. **Agent (LLM)** is the brain. It reads the transcript, decides what to say or do, and — if you've given it tools — can call a function like "turn on the kitchen light" instead of just chatting.
5. **SentenceSplit** is the impatient editor. Instead of waiting for the brain to finish the entire answer, it grabs the _first complete sentence_ the moment it's ready and rushes it to the next station.
6. **TTS** (text-to-speech) is the voice. It turns text back into audio.
7. **Speaker** plays it. You hear the reply.
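
One way to picture the hand-offs is as a chain of Python generators, where each station pulls from the one before it. This is a minimal sketch with stand-in stages: the function names and the fake "transcription" are hypothetical and are not EdgeVox's actual API.

```python
from typing import Iterable, Iterator

# Hypothetical stand-in stages. Each one consumes the previous stage's
# output lazily, so downstream work can start before upstream has finished.

def vad(frames: Iterable[bytes]) -> Iterator[bytes]:
    """Pass through only non-empty frames (stand-in for speech detection)."""
    for frame in frames:
        if frame:
            yield frame

def stt(speech: Iterable[bytes]) -> Iterator[str]:
    """Pretend to transcribe by decoding each frame as UTF-8 text."""
    for chunk in speech:
        yield chunk.decode("utf-8")

def agent(transcripts: Iterable[str]) -> Iterator[str]:
    """Pretend LLM: produce one reply sentence per transcript."""
    for text in transcripts:
        yield f"You said: {text}."

def tts(sentences: Iterable[str]) -> Iterator[bytes]:
    """Pretend synthesis: encode each sentence back into bytes."""
    for sentence in sentences:
        yield sentence.encode("utf-8")

def run_pipeline(frames: Iterable[bytes]) -> list[bytes]:
    """Wire the stations together in the same order as the diagram."""
    return list(tts(agent(stt(vad(frames)))))
```

Calling `run_pipeline([b"", b"hello", b""])` drops the two empty frames at the VAD stand-in and produces a single audio chunk for the reply. The point of the shape is that no stage needs to know anything about the others beyond the type it receives.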

The single most important design rule sits underneath this line and is worth stating plainly: **the brain is allowed to be slow, but the reflexes are not.** EdgeVox borrows the classic three-layer pattern from robotics:

- **Deliberative layer (~1 Hz — "thinking"):** the LLM. Smart but slow. It plans and converses.
- **Executive layer (10–50 Hz — "doing"):** the skills the agent runs, each of which can report progress and be cancelled partway through.
- **Reactive layer (100+ Hz — "reflexes"):** motor control and safety. This layer **never** waits on the LLM.

Why does that matter? Imagine telling a robot "stop." If the word "stop" had to travel up to a language model, get tokenized, generate a response, and come back down before the wheels actually stopped, the robot would roll right off the table while the model was still composing a sentence. So in EdgeVox a stop-word preempts a running action _before the LLM is ever consulted_. The reflexes are wired directly to the brakes. Everything else in the design follows from that one rule: keep the slow, probabilistic brain out of the fast, safety-critical path.
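
The "reflexes before brain" rule can be sketched as a check that runs on the transcript before anything is queued for the LLM. The class name, the stop-word list, and the return values below are assumptions made for illustration; the real preemption path in EdgeVox may look quite different.

```python
import threading

# Hypothetical stop-word list, chosen for illustration only.
STOP_WORDS = {"stop", "halt", "freeze"}

class Reflex:
    """Routes a transcript either to the brakes or to the LLM queue."""

    def __init__(self) -> None:
        self.cancel = threading.Event()   # a running skill would poll this
        self.llm_queue: list[str] = []    # stand-in for the deliberative layer

    def on_transcript(self, text: str) -> str:
        words = {w.strip(".,!?").lower() for w in text.split()}
        if words & STOP_WORDS:
            # Reactive path: signal cancellation without consulting the LLM.
            self.cancel.set()
            return "preempted"
        # Deliberative path: only non-urgent speech reaches the LLM.
        self.llm_queue.append(text)
        return "sent_to_llm"
```

Note that this sketch also preempts on a sentence like "don't stop". For a safety path, erring toward stopping is usually the cheaper mistake, but that is a design choice rather than a given.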

## Picking components that fit on one device

Each station on the assembly line has to earn its place in the memory budget of a consumer machine. The components EdgeVox ships with:

| Stage                    | Component                                            | Why                                                                                                                          |
| ------------------------ | ---------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------- |
| Voice activity detection | Silero VAD v6 (~2 MB)                                | Decides when you've started and stopped talking, in 32 ms frames. Cheap and accurate enough to gate everything downstream.   |
| Speech-to-text           | faster-whisper (`whisper-small` or `large-v3-turbo`) | CTranslate2 backend, runs on CPU or CUDA. `small` for 8 GB devices, `large-v3-turbo` when there is a GPU.                    |
| LLM                      | Gemma 4 E2B IT, Q4 quant via llama.cpp               | About 1.8 GB on disk. Small enough to co-reside with STT and TTS, capable enough to drive tool calls.                        |
| Text-to-speech           | Kokoro 82M / Piper / Supertonic                      | Kokoro for the major languages; Piper and the ~99M-param Supertonic ONNX model cover the long tail and run real-time on CPU. |
| Wake word                | pymicro-wakeword                                     | "Hey Jarvis", "Alexa", "Hey Mycroft", "Okay Nabu" — so the LLM isn't woken by every stray noise.                             |

The two configurations I actually run tell the whole memory story:

- **MacBook Air M1 (8 GB):** `whisper-small` plus the Q4 LLM fit in about **3.4 GB** of model weights.
- **PC with a GPU:** `whisper-large-v3-turbo` + the same LLM is about **5.8 GB**.

The selection rule is unglamorous: pick the largest model in each stage whose weights _and runtime working set_ still leave headroom for the other two stages on your worst target device. A pipeline that fits beautifully on the PC and runs out of memory on the Air is a pipeline that doesn't ship. Every component is swappable behind a small interface, so the STT or TTS backend is a config choice, not a rewrite — which matters because the right model for English on a desktop is rarely the right model for Vietnamese on a Jetson.
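
The selection rule is simple enough to write down as a function. The sketch below is hypothetical: the `Candidate` type, the model names, and every size in it are placeholder values for illustration, not measurements from EdgeVox.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class Candidate:
    name: str
    weights_gb: float       # size of the model weights
    working_set_gb: float   # extra runtime memory (buffers, caches, ...)

    @property
    def footprint_gb(self) -> float:
        return self.weights_gb + self.working_set_gb

def pick_largest_fitting(
    candidates: Sequence[Candidate],
    budget_gb: float,
    reserved_gb: float,
) -> Optional[Candidate]:
    """Largest candidate that fits once the other stages' memory is reserved.

    Returns None when nothing fits, which is the "doesn't ship" case.
    """
    available = budget_gb - reserved_gb
    fitting = [c for c in candidates if c.footprint_gb <= available]
    return max(fitting, key=lambda c: c.footprint_gb, default=None)

# Placeholder numbers for illustration only.
stt_options = [
    Candidate("stt-small", weights_gb=0.5, working_set_gb=0.3),
    Candidate("stt-large", weights_gb=1.6, working_set_gb=0.8),
]
```

Run it against the worst target device first: `pick_largest_fitting(stt_options, budget_gb=4.0, reserved_gb=3.0)` picks the small model, while the same call with `budget_gb=8.0` picks the large one.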

## ROS2 integration — what it buys you, and what it costs

EdgeVox grew up as a desktop voice pipeline, but it has a robotics shape because a voice agent on a robot is a genuinely different animal from one on a laptop. ROS2 is the standard nervous system for that world, so the pipeline ships a ROS2 bridge you opt into with a single `--ros2` flag.

**What it buys you:** the voice loop stops being a closed box. Transcriptions, the agent's response, audio levels, and a JSON stream of every tool call and skill goal get **published** as ROS2 topics; text input, interrupts, language switches, and navigation commands arrive on topics the bridge **subscribes** to. Any skill the agent exposes becomes callable by a stock ROS2 action client through a generic `execute_skill` action. In practice this means the same agent code drives a simulated robot (IR-SIM in 2D, MuJoCo in 3D) or a real one over the standard odometry-and-velocity contract, unchanged.

**What it costs you:** ROS2 is a heavy dependency with its own build system, its own message-compilation step, and its own runtime. It is absolutely the wrong choice for a desktop chat app. So it stays **opt-in** — the pipeline runs perfectly with zero ROS2 installed, and the bridge attaches only when you ask for it. That split is the actual design lesson: the integration that makes the project valuable on a robot is dead weight on a laptop, so it has to be a layer you add, never a dependency you inherit.

## Latency budget: how to measure each stage honestly

Here is where I'm going to disappoint anyone who came for a leaderboard. **EdgeVox does not publish a measured latency number yet** — and that is a deliberate choice, not an omission.

The temptation in this space is enormous. It is trivially easy to write "~800 ms time-to-first-audio on a Jetson Orin Nano" in a README and let it ride. I know, because an early version of this project shipped almost exactly that number — and it had never been benchmarked on a Jetson at all. It got stripped. A made-up number with a tilde in front of it is still a made-up number; the tilde just launders it.

So instead of numbers, here is the **budget shape** and the **measurement protocol**, which are the parts that actually transfer.

The budget shape: time-to-first-audio is a sum, and the LLM's first token is almost always the dominant term.

```
time-to-first-audio
  = VAD endpoint detection        (you stopped talking)
  + STT transcription             (audio → text)
  + LLM time-to-first-token       (the long pole)
  + first-sentence TTS synthesis  (text → first audio chunk)
```
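
To see where the time goes, each term in that sum can be timed separately. The `StageTimer` below is a minimal, hypothetical helper (not the EdgeVox benchmark harness) that records one duration per stage and reports the total and the dominant term. The clock is injectable so the helper can be tested without real models.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator

class StageTimer:
    """Records wall-clock duration per named stage."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.stages: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.stages[name] = self._clock() - start

    def time_to_first_audio(self) -> float:
        """Sum of all recorded stages, matching the budget shape above."""
        return sum(self.stages.values())

    def dominant_stage(self) -> str:
        """Name of the stage that contributed the most time."""
        return max(self.stages, key=self.stages.get)
```

In use, each stage is wrapped in `with timer.stage("stt"): ...`, and the stage named by `dominant_stage()` is the one worth optimizing first.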

Two structural tricks keep that sum under the ~1-second "feels responsive" threshold without needing a bigger machine:

- **Stream at sentence granularity.** Don't wait for the LLM to finish the whole reply before synthesizing. Split the token stream into sentences and hand the _first_ sentence to TTS while the LLM is still generating the second. The user hears audio while the model is still thinking. (This is the "impatient editor" station from the diagram above, and it is the single biggest perceived-latency win in the whole system.)
- **Make the LLM's first token cheap.** First-token latency is governed by prompt length, so a tight system prompt and aggressive history compaction pay off directly in perceived responsiveness.
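
The sentence-granularity trick is small enough to sketch in full. This is a deliberately naive splitter, assuming the LLM yields text fragments as strings; the regex and the function name are my illustration, not the splitter EdgeVox ships.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace. Deliberately naive:
# it will also split after abbreviations such as "e.g. ".
_SENTENCE_END = re.compile(r"[.!?]\s+")

def split_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as the token stream contains it."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    # Flush whatever is left once the model stops generating.
    if buffer.strip():
        yield buffer.strip()
```

Because this is a generator, the first sentence is available to TTS while later tokens have not even been requested yet. Requiring whitespace after the punctuation delays a sentence by one token, but it avoids cutting "3.14" in half.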

And the protocol — the rules I hold any number to before it is allowed near a README:

1. **Warm up before timing.** The first call loads weights and compiles kernels; it is not representative. Discard at least three warm-up runs.
2. **Report best-of-N, N ≥ 3.** A single run is noise. Cold-start artifacts have produced "135× faster" claims that were really ~21× once the comparison target's lazy model-load was excluded.
3. **Fingerprint the hardware.** A latency number without the CPU, GPU, OS, and model revision attached is meaningless — it is the single most context-dependent metric in the whole system.
4. **Pin and date the comparison.** If you benchmark against another tool, record its exact version and the date you ran it, and re-run when either changes.
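
The first three rules can be enforced by the harness itself. The following is a minimal sketch using only the standard library; it is not the harness in the repo, and the `benchmark` and `fingerprint` names are hypothetical. The standard library cannot see the GPU or the model revision, so those would have to be added to the fingerprint by hand.

```python
import platform
import time
from typing import Callable, Dict

def fingerprint() -> Dict[str, str]:
    """Basic host description. GPU and model revision are not captured here."""
    return {
        "machine": platform.machine(),
        "processor": platform.processor(),
        "system": platform.system(),
        "release": platform.release(),
        "python": platform.python_version(),
    }

def benchmark(
    fn: Callable[[], object],
    warmup: int = 3,
    runs: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time fn after warm-up and report best-of-N with the host attached."""
    if runs < 3:
        raise ValueError("need at least 3 timed runs")
    for _ in range(warmup):
        fn()  # discarded: weight loading and kernel compilation land here
    samples = []
    for _ in range(runs):
        start = clock()
        fn()
        samples.append(clock() - start)
    return {
        "best_s": min(samples),
        "samples_s": samples,
        "warmup": warmup,
        "hardware": fingerprint(),
    }
```

Keeping the raw samples next to the best value makes it possible to spot a noisy run later instead of trusting a single headline number.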

The benchmark harness lives in the repo. The measured cells in the docs stay empty until a real run fills them, with the hardware fingerprint attached. Empty is honest; fake is not — and on a privacy product, credibility is the entire pitch.

## The failure modes nobody warns you about

The demo works on the first try. The _product_ breaks on all the things the demo didn't exercise. These are the ones that cost me the most time.

**VAD is where conversations actually go to die.** Voice activity detection sounds like a solved sub-problem and is in fact the source of half the bad UX. Set the endpoint threshold too eager and the agent cuts you off mid-sentence; too lazy and there is a dead pause after every utterance while it waits to be _sure_ you're done. Worse, the agent's own TTS output is sound — so without protection, the bot hears itself talking and "interrupts" its own reply. EdgeVox runs acoustic echo cancellation plus an energy-ratio gate by default, specifically so the bot doesn't transcribe its own voice. Skipping that doesn't fail in the demo (you're not talking while it talks); it fails the first time a real user interrupts.
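
The eager-versus-lazy trade-off comes down to one number: how much trailing silence is required before the utterance is declared over. Here is a minimal sketch of that endpoint logic, assuming the VAD emits one speech probability per 32 ms frame. The threshold and hangover defaults are illustrative guesses, not EdgeVox's settings.

```python
from typing import Optional, Sequence

FRAME_MS = 32  # frame length quoted for the VAD in the component table

def find_endpoint(
    speech_probs: Sequence[float],
    threshold: float = 0.5,
    hangover_ms: int = 500,
) -> Optional[int]:
    """Index of the frame where the utterance is declared over, or None.

    The endpoint fires after `hangover_ms` of continuous non-speech that
    follows at least one speech frame.
    """
    needed = max(1, hangover_ms // FRAME_MS)
    silent_run = 0
    started = False
    for i, prob in enumerate(speech_probs):
        if prob >= threshold:
            started = True
            silent_run = 0      # speech resets the silence counter
        elif started:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None
```

With a short hangover, a brief mid-sentence pause is enough to trigger the endpoint and cut the speaker off. With a long one, the same pause is tolerated, but every turn pays the full hangover as dead air before the agent reacts.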

**Barge-in has to reach all the way down to the decoder.** Cutting the bot off can't just stop the audio — it has to abort the LLM mid-generation, or the model keeps burning compute on a reply nobody will hear and the _next_ turn is laggy because the GPU is still busy. EdgeVox threads a cancel signal into llama.cpp's stopping criteria so generation actually halts at the next decode step, and the barge-in path re-arms cleanly so you can interrupt twice in a row. "Stop the speaker" is the easy 80%; "stop the model" is the 20% that makes it feel real.
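
The shape of a cancel signal that reaches the decode loop can be sketched with a `threading.Event`. This is not the EdgeVox implementation and it does not call llama.cpp: `token_source` stands in for the decoder, and `BargeIn` is a hypothetical name. In a real runtime, `should_stop` would be consulted wherever the runtime allows a per-step stop check.

```python
import threading
from typing import Iterable, Iterator

class BargeIn:
    """A re-armable cancel flag shared by the audio and LLM threads."""

    def __init__(self) -> None:
        self._cancel = threading.Event()

    def interrupt(self) -> None:
        """Called by the audio side when the user talks over the agent."""
        self._cancel.set()

    def rearm(self) -> None:
        """Called before the next turn so a second interrupt also works."""
        self._cancel.clear()

    def should_stop(self) -> bool:
        return self._cancel.is_set()

def generate(token_source: Iterable[str], barge_in: BargeIn) -> Iterator[str]:
    """Yield tokens until the source ends or an interrupt is observed."""
    for token in token_source:
        if barge_in.should_stop():
            break  # stop decoding instead of merely muting the speaker
        yield token
```

The `rearm` call is the part that is easy to forget: without it, the first interruption silently disables every later reply.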

**Context bleed degrades small local models fast.** Cloud-scale models tolerate long, messy histories. A quantized local model does not — its tool-calling reliability decays noticeably after a handful of multi-step hops as the context fills with prior tool output. Two fixes mattered: strict per-conversation history isolation so one session's state can't leak into the next, and — for multi-step agent tasks — preferring an explicit plan-then-execute approach over a free-form reasoning loop, which holds up far better than letting a small model improvise its way through six tool calls.
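
Per-conversation isolation plus a hard cap on what old tool output may occupy can be sketched as a small store keyed by session. The class, its limits, and the truncation marker are assumptions made for illustration; they are not EdgeVox's data structures.

```python
from collections import defaultdict
from typing import Dict, List

class HistoryStore:
    """Keeps a bounded message history per session, never shared."""

    def __init__(self, max_turns: int = 6, max_tool_chars: int = 200) -> None:
        if max_turns < 1:
            raise ValueError("max_turns must be at least 1")
        self.max_turns = max_turns
        self.max_tool_chars = max_tool_chars
        self._sessions: Dict[str, List[dict]] = defaultdict(list)

    def append(self, session_id: str, role: str, content: str) -> None:
        # Tool output is the main source of context bloat, so cap it.
        if role == "tool" and len(content) > self.max_tool_chars:
            content = content[: self.max_tool_chars] + " [truncated]"
        turns = self._sessions[session_id]
        turns.append({"role": role, "content": content})
        # Keep only the most recent max_turns messages.
        del turns[: -self.max_turns]

    def messages(self, session_id: str) -> List[dict]:
        """Return a copy so callers cannot mutate another turn's state."""
        return list(self._sessions.get(session_id, []))
```

Dropping the oldest turns is the bluntest possible compaction. Summarizing them would keep more signal, at the price of an extra model call on a device that is already short on compute.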

**A model that is great in isolation can make the pipeline worse.** This is the one I'd most want to save you from. I tried adding a dedicated post-processing model to clean up the STT output for one language — it scored well on its own benchmark, so it looked like free quality. Chained into the live pipeline, it made end-to-end transcription _worse_, not better: the two models disagreed on conventions at the boundary, and the correction step introduced more errors than it fixed. The lesson is to **measure the whole chain, not the component.** A stage that improves a sub-metric in isolation can still be a net negative once it is wired into everything else, and the only way you find out is by benchmarking the pipeline you actually ship.
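
Measuring the whole chain mostly means scoring both variants of the pipeline against the same references. Below is a minimal sketch using word error rate (word-level edit distance divided by reference length). The `compare_chains` helper and the idea of passing each chain as a callable are my assumptions; the metric EdgeVox uses is not stated here.

```python
from typing import Callable, Dict, Sequence, Tuple

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def compare_chains(
    samples: Sequence[Tuple[object, str]],
    baseline: Callable[[object], str],
    candidate: Callable[[object], str],
) -> Dict[str, float]:
    """Mean WER of two full pipelines over the same (audio, reference) pairs."""
    def mean_wer(chain: Callable[[object], str]) -> float:
        total = sum(word_error_rate(ref, chain(audio)) for audio, ref in samples)
        return total / len(samples)
    return {"baseline": mean_wer(baseline), "candidate": mean_wer(candidate)}
```

Here `baseline` would be the pipeline without the extra stage and `candidate` the pipeline with it. The new stage earns its place only if the candidate's end-to-end score is better, whatever it scored on its own benchmark.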

## Closing

A fully offline voice agent is not "the cloud version, but local." It is a real-time streaming system where you own every constraint — the shared hardware budget, the felt latency, the interrupt path, the small-model failure modes — that a stack of cloud APIs quietly absorbed for you. In exchange, the conversation never leaves the device.

EdgeVox is Apache-2.0-licensed and on PyPI (`pip install edgevox`). The source is at [github.com/nrl-ai/edgevox](https://github.com/nrl-ai/edgevox) and the docs are at [edgevox.nrl.ai](https://edgevox.nrl.ai). If you build something with it — or benchmark it on hardware I haven't — I'd genuinely like to see the numbers.

> **Citation.** Nguyen, Viet-Anh and Neural Research Lab. _EdgeVox: on-device voice agents for robotics._ 2026. https://github.com/nrl-ai/edgevox (Apache-2.0 License).

