# Plan Once, Then Act: When the ReAct Loop Is the Wrong Harness for Small Local Models

**Published:** June 15, 2026
**Tags:** Edge AI, AI Agents, Agent Harness, On-device AI, Tool Calling, Local-First AI, Robotics

**Summary:** On small local models, the standard ReAct loop has a failure mode nobody warns you about: the model calls one tool, declares victory, and stops. This post covers what we measured across 12 GGUF models in EdgeVox, why we added a plan-once dispatcher, and how to decide which loop your task actually needs.


---

Every agent framework tutorial teaches the same loop: think, call a tool, observe the result, repeat. This is [ReAct](https://arxiv.org/abs/2210.03629) (Yao et al., 2022). It is the default harness in almost every framework, and on frontier cloud models it works so well you never think about it.

Then you run it on a 1.7B-parameter model quantized to 4 bits on a laptop GPU, give it a five-step task, and watch it call the first tool, write "I've completed the task for you!", and stop.

This post is about that failure mode — what it looks like in the data from [EdgeVox](https://github.com/nrl-ai/edgevox), our offline voice-agent framework for robots, why we ship a second harness (`PlannedToolDispatcher`) alongside `ReActAgent`, and the decision rule we now use to pick between them. The short version: **on small models, the harness is not a detail. It can matter more than the model.**

## First, the single-turn floor: can the model call a tool at all?

Before arguing about loops, you need to know whether your model can execute even one correct tool call. We benchmarked this across 12 GGUF presets on 40 robot-control scenarios (four simulated robots: a grid-world scout, a 2D LiDAR apartment robot, a MuJoCo Franka arm, a Unitree G1 humanoid). Full methodology, per-scenario raw data, and the runner script ship in the EdgeVox repo — [the complete report is here](https://edgevox.nrl.ai/documentation/reports/robot-tool-calling-benchmark).

The headline rows (RTX 3080 Laptop GPU, `llama-cpp-python` 0.3.20, April 2026):

<div className="not-prose my-8 overflow-x-auto rounded-xl border border-warm-200 bg-warm-50/40">
  <table className="min-w-full border-collapse text-sm">
    <thead>
      <tr className="border-b border-warm-300 text-left">
        <th className="px-4 py-3 font-semibold sm:px-8">Model</th>
        <th className="px-4 py-3 font-semibold sm:px-8">Avg score /100</th>
        <th className="px-4 py-3 font-semibold sm:px-8">Per-reply latency</th>
        <th className="px-4 py-3 font-semibold sm:px-8">Verdict</th>
      </tr>
    </thead>
    <tbody>
      <tr className="border-b border-warm-200 bg-signal-50">
        <td className="px-4 py-3 font-mono sm:px-8">qwen2.5-3b</td>
        <td className="px-4 py-3 sm:px-8">96.6</td>
        <td className="px-4 py-3 sm:px-8">2.12 s</td>
        <td className="px-4 py-3 sm:px-8">ship</td>
      </tr>
      <tr className="border-b border-warm-200 bg-signal-50">
        <td className="px-4 py-3 font-mono sm:px-8">gemma-4-e2b</td>
        <td className="px-4 py-3 sm:px-8">96.0</td>
        <td className="px-4 py-3 sm:px-8">3.26 s</td>
        <td className="px-4 py-3 sm:px-8">ship</td>
      </tr>
      <tr className="border-b border-warm-200 bg-signal-50">
        <td className="px-4 py-3 font-mono sm:px-8">hammer-2.1-0.5b</td>
        <td className="px-4 py-3 sm:px-8">91.2</td>
        <td className="px-4 py-3 sm:px-8">0.85 s</td>
        <td className="px-4 py-3 sm:px-8">ship (speed champion)</td>
      </tr>
      <tr className="border-b border-warm-200">
        <td className="px-4 py-3 font-mono sm:px-8">llama-3.2-3b</td>
        <td className="px-4 py-3 sm:px-8">90.5</td>
        <td className="px-4 py-3 sm:px-8">6.69 s</td>
        <td className="px-4 py-3 sm:px-8">accurate, too slow for live voice</td>
      </tr>
      <tr className="border-b border-warm-200">
        <td className="px-4 py-3 font-mono sm:px-8">qwen2.5-1.5b</td>
        <td className="px-4 py-3 sm:px-8">83.6</td>
        <td className="px-4 py-3 sm:px-8">1.05 s</td>
        <td className="px-4 py-3 sm:px-8">usable with a tuned persona</td>
      </tr>
      <tr className="border-b border-warm-200 bg-warm-100 text-warm-600">
        <td className="px-4 py-3 font-mono sm:px-8">llama-3.2-1b</td>
        <td className="px-4 py-3 sm:px-8">66.8</td>
        <td className="px-4 py-3 sm:px-8">1.42 s</td>
        <td className="px-4 py-3 sm:px-8">edge cases only</td>
      </tr>
      <tr className="border-b border-warm-200 bg-warm-100 text-warm-600">
        <td className="px-4 py-3 font-mono sm:px-8">phi-4-mini</td>
        <td className="px-4 py-3 sm:px-8">36.8</td>
        <td className="px-4 py-3 sm:px-8">1.35 s</td>
        <td className="px-4 py-3 sm:px-8">unreliable at stock persona</td>
      </tr>
      <tr className="bg-warm-100 text-warm-600">
        <td className="px-4 py-3 font-mono sm:px-8">hermes-3-3b</td>
        <td className="px-4 py-3 sm:px-8">19.5</td>
        <td className="px-4 py-3 sm:px-8">1.51 s</td>
        <td className="px-4 py-3 sm:px-8">narrates instead of calling</td>
      </tr>
    </tbody>
  </table>
  <p className="border-t border-warm-200 px-4 py-2 text-xs text-warm-600 sm:px-8">
    {' '}
    Orange rows clear the live-voice bar (score ≥ 90, per-reply ≤ 5 s). Gray rows sit below the
    reliability cut line.{' '}
  </p>
</div>

Two things in that table shape everything downstream.

**The latency column is the multiplier.** A ReAct loop pays one LLM reply per step. On this hardware, one reply costs between 0.85 and 6.69 seconds across the models above. A ten-step task on `llama-3.2-3b` is over a minute of pure LLM time — before the robot moves at all.

**The failure modes below the cut line are not random.** `hermes-3-3b` and `phi-4-mini` fail by _narration_: they reply "I'll turn on the light for you!" with no tool call attached. In a chat UI that reads as success. On a robot, nothing happens — and the user doesn't know nothing happened. Quiet failure is the worst failure class a voice agent has.
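Narration is cheap to catch mechanically on turns where a tool call is expected. The sketch below is not the EdgeVox scorer; it assumes an OpenAI-style assistant message dict (a `content` string plus an optional `tool_calls` list), and the `set_light` tool name is made up for illustration.

```python
from typing import Any


def is_narration_failure(message: dict[str, Any]) -> bool:
    """Return True when the reply contains text but no tool call.

    Assumption: `message` is an OpenAI-style assistant message with
    "content" (str or None) and "tool_calls" (list or None).
    Only meaningful on turns where a tool call was expected.
    """
    has_text = bool((message.get("content") or "").strip())
    has_call = bool(message.get("tool_calls"))
    return has_text and not has_call


# A narrated reply: sounds like success, does nothing.
narrated = {"content": "I'll turn on the light for you!", "tool_calls": None}

# A real tool call (hypothetical `set_light` tool).
called = {
    "content": None,
    "tool_calls": [
        {"type": "function",
         "function": {"name": "set_light", "arguments": '{"on": true}'}}
    ],
}

print(is_narration_failure(narrated))  # True
print(is_narration_failure(called))    # False
```

A check like this lets the agent say "I could not do that" out loud instead of letting the quiet failure through.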

## The multi-step problem: sycophancy on chains

Single-turn accuracy is necessary but nowhere near sufficient. The failure that actually forced a second harness into EdgeVox shows up on _chains_ — "pick up the red cube and put it on the blue one" — and it looks like this:

1. Model calls `locate_object("red cube")`. Correct.
2. Tool returns coordinates.
3. Model replies: "I found the red cube and completed the stacking task for you!"

No grasp. No move. No release. The model saw one successful tool result and pattern-matched its way to a victory lap. We started calling this **sycophancy on chains**: small models are strongly biased toward telling you the task went well, and each extra hop in the loop is another opportunity to take the exit.

The standard mitigations are all prompt-side, and we use them — the shipped ReAct persona in EdgeVox is a wall of anti-patterns ("describing isn't doing", "one call then summarising is usually wrong", a mandatory `TASK COMPLETE` termination marker). They help. They do not fix it. Below a certain capability level, the model's per-step judgment is simply not reliable enough to also be the _scheduler_ for the whole task, and every additional hop compounds the risk: the context grows, the tool-call formatting drifts, and the probability that at least one hop goes sideways rises with depth. In our engineering runs on sub-7B models at Q4, chains reliably degraded somewhere past roughly the half-dozen-hop mark — we are currently turning that anecdote into a controlled, multi-seed measurement, and I'll publish those numbers when they exist rather than quote hallway estimates here.
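To make the compounding concrete, here is a toy model, not a measurement. It assumes every hop succeeds independently with the same probability, which real chains violate (hops are correlated through the growing context), and the 0.95 figure is purely illustrative.

```python
def chain_success_probability(per_hop_success: float, hops: int) -> float:
    """Probability that every hop in a chain succeeds.

    Simplifying assumption: hops are independent and equally reliable.
    """
    if not 0.0 <= per_hop_success <= 1.0:
        raise ValueError("per_hop_success must be in [0, 1]")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return per_hop_success ** hops


# Illustrative per-hop reliability, NOT a number from our benchmark.
for hops in (1, 3, 6, 10):
    print(hops, round(chain_success_probability(0.95, hops), 3))
```

Even under these generous assumptions, a model that gets 95% of individual hops right finishes a ten-hop chain only about 60% of the time.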

And this is not a small-model quirk that disappears at scale. The LLMCompiler authors, benchmarking ReAct on GPT-3.5/4 and LLaMA-2 70B, identified "premature early stopping based on the incomplete intermediate results" as one of ReAct's two dominant failure modes — together with repetitive re-invocation of earlier calls, it cost ReAct up to 7–8% accuracy on their benchmarks (§5.1 and Appendix A of [the paper](https://arxiv.org/abs/2312.04511)). Bigger models take the same exits; they just take them less often.

## What we ship instead: plan once, dispatch deterministically

`PlannedToolDispatcher` splits the job into three roles, only two of which involve the LLM at all:

```mermaid
flowchart TB
    subgraph react["ReAct loop: N+1 LLM calls for an N-step task"]
        direction TB
        U1["User request"] --> T1["LLM: think, emit ONE tool call"]
        T1 --> E1["Execute tool"]
        E1 --> O1["LLM observes result"]
        O1 -->|"not done"| T1
        O1 -->|"TASK COMPLETE"| R1["Reply to user"]
    end
    subgraph planned["PlannedToolDispatcher: 2 LLM calls, any N"]
        direction TB
        U2["User request"] --> P2["Planner LLM emits ordered JSON plan"]
        P2 --> X2["Python executor dispatches every step directly - no LLM in the loop"]
        X2 --> S2["Synthesiser LLM writes one-sentence reply"]
    end
    classDef focal fill:#f3bd92,stroke:#8c4000,stroke-width:2px,color:#4a2200
    class X2 focal
```

The arithmetic is the whole argument. A ReAct run on an N-step task costs **N+1 LLM calls minimum** (one per step, plus the final reply) — more with re-prompts. The planned dispatcher costs **exactly 2**, regardless of N. At the 0.85–6.69 s per-reply latencies measured above, that is the difference between a robot that responds and a robot you walk away from.
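Spelled out as code, using three per-reply latencies from the table above. This is a back-of-the-envelope sketch: it assumes every LLM call costs the measured per-reply average, which ignores that a planner reply is longer than a single tool call and that ReAct context grows with each hop. The five-step task is an arbitrary example.

```python
PLANNED_LLM_CALLS = 2  # planner + synthesiser, regardless of task length


def react_llm_calls(steps: int) -> int:
    """Minimum LLM calls for a ReAct run: one per step plus the final reply."""
    return steps + 1


def llm_seconds(calls: int, per_reply_s: float) -> float:
    """Pure LLM time, assuming a flat per-reply latency."""
    return calls * per_reply_s


# Per-reply latencies taken from the benchmark table above.
LATENCIES = {"hammer-2.1-0.5b": 0.85, "qwen2.5-3b": 2.12, "llama-3.2-3b": 6.69}

STEPS = 5  # arbitrary example task length
for model, latency in LATENCIES.items():
    react = llm_seconds(react_llm_calls(STEPS), latency)
    planned = llm_seconds(PLANNED_LLM_CALLS, latency)
    print(f"{model}: ReAct {react:.2f} s, planned {planned:.2f} s")
```

For the five-step example on `llama-3.2-3b`, that is roughly 40 s of LLM time under ReAct against roughly 13 s planned, and the gap widens with every extra step.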

But the deeper win is not latency — it is _removing the exit ramps_. The executor is a Python `for` loop. It cannot get discouraged, cannot declare early victory, cannot mangle JSON on hop five. Every failure mode that scales with hop count is gone, because there are no LLM hops in the middle.
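A minimal sketch of what "a Python `for` loop" means here. This is not the `PlannedToolDispatcher` source: the plan format (a JSON list of objects with `tool` and `args` keys), the three tool functions, and the stop-on-first-failure policy are all assumptions made for illustration. The plan is hand-written where a planner LLM would normally emit it.

```python
import json
from typing import Any, Callable

Tool = Callable[..., Any]


def execute_plan(plan_json: str, tools: dict[str, Tool]) -> list[dict[str, Any]]:
    """Run every planned step in order. No LLM call happens in here."""
    plan = json.loads(plan_json)
    log: list[dict[str, Any]] = []
    for step in plan:
        name = step["tool"]
        args = step.get("args", {})
        if name not in tools:
            log.append({"tool": name, "ok": False, "error": "unknown tool"})
            break  # assumed policy: stop rather than skip a step
        try:
            result = tools[name](**args)
        except Exception as exc:  # tool raised: record it and stop
            log.append({"tool": name, "ok": False, "error": str(exc)})
            break
        log.append({"tool": name, "ok": True, "result": result})
    return log


# Hypothetical tools standing in for real robot skills.
def locate_object(name: str) -> dict[str, Any]:
    return {"name": name, "found": True}


def grasp(name: str) -> str:
    return f"grasped {name}"


def place_on(target: str) -> str:
    return f"placed on {target}"


TOOLS: dict[str, Tool] = {
    "locate_object": locate_object, "grasp": grasp, "place_on": place_on,
}

# What a planner might emit for "put the red cube on the blue one".
PLAN = json.dumps([
    {"tool": "locate_object", "args": {"name": "red_cube"}},
    {"tool": "grasp", "args": {"name": "red_cube"}},
    {"tool": "place_on", "args": {"target": "blue_cube"}},
])

for entry in execute_plan(PLAN, TOOLS):
    print(entry)
```

The returned log is what the synthesiser would see, so a failed or skipped step is visible in the data rather than papered over by an optimistic reply.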

Three implementation notes that cost us real debugging time:

- **The planner needs the full persona, worked examples included.** Handing it a bare tool catalog produces empty or trivial plans on multi-step tasks. And if your persona text contains literal braces, escape them (`{{...}}`) before it goes through the template formatter — a silently mangled prompt looks exactly like a dumb model.
- **Normalize arguments between plan and dispatch.** Small models emit "red object" when the registry key is `red_cube`. A chain of `arg_normalizers` callables that fuzzy-match planned args against live world state self-heals most of this without another LLM call.
- **When you do run ReAct with a verifier, make the re-prompt state-aware.** A static "the task is not done, continue" string loops forever once the task has moved past the phase that string assumed. Pass a callable that reads the action log and current environment state, and generates a fresh instruction each time.
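The last two notes can be sketched together. Neither function is the EdgeVox implementation: the token-overlap matcher is one simple stand-in for fuzzy matching, and `build_reprompt` with its `action_log` and `holding` parameters is a hypothetical shape for a state-aware re-prompt callable.

```python
from typing import Callable, Optional


def make_token_normalizer(known_keys: list[str]) -> Callable[[str], str]:
    """Build a normalizer that maps a planned arg onto a live registry key.

    Matching is by shared word tokens. If nothing matches, or several
    keys tie, the value is returned unchanged rather than guessed.
    """
    def normalize(value: str) -> str:
        tokens = set(value.lower().replace("_", " ").split())
        scored = [(len(tokens & set(key.lower().split("_"))), key)
                  for key in known_keys]
        best = max((score for score, _ in scored), default=0)
        winners = [key for score, key in scored if score == best]
        if best == 0 or len(winners) != 1:
            return value
        return winners[0]
    return normalize


def build_reprompt(action_log: list[str], holding: Optional[str]) -> str:
    """Generate a fresh instruction from what has actually happened so far."""
    if not action_log:
        return "No tool has been called yet. Call the first tool now."
    if holding is not None:
        return f"You are holding {holding}. Call the tool that places it."
    return (f"Last action was {action_log[-1]}. "
            "The task is not finished; call the next tool.")


normalize = make_token_normalizer(["red_cube", "blue_cube"])
print(normalize("red object"))   # red_cube
print(normalize("cube"))         # cube (ambiguous, left unchanged)
print(build_reprompt(["locate_object", "grasp"], holding="red_cube"))
```

Returning the value unchanged on a tie is deliberate: a wrong silent correction is worse than a dispatch error the executor can log.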

## When ReAct is still the right answer

Here is where I have to argue against my own headline: plan-once is **not** universally better, and pretending otherwise would be trading one dogma for another.

The planner's plan is built blind, from the user request alone. That is only coherent when the request fully determines the steps. The moment the _next action depends on what a tool returned_ — search ("find the red object somewhere on the table"), recovery ("if the grasp fails, try a different height"), anything with hidden state you must probe — an upfront plan is structurally incapable of solving the task, no matter how strong the model is. ReAct's per-step observation loop, the very thing that makes it slow and fragile on plannable chains, is the only mechanism that works there.

Even the plan-once camp concedes this. ReWOO's own limitations section (§4) walks through an AlfWorld task — "put some vase in safe," in a room the planner has never seen — and admits that a planner with no knowledge of the environment "has to enumerate all possible plans," degenerating to the worst case of observation-dependent reasoning. Fittingly, AlfWorld is where the original ReAct paper scored its most dramatic win.

So the decision rule we actually use in EdgeVox:

<div className="not-prose my-8 overflow-x-auto rounded-xl border border-warm-200 bg-warm-50/40">
  <table className="min-w-full border-collapse text-sm">
    <thead>
      <tr className="border-b border-warm-300 text-left">
        <th className="px-4 py-3 sm:px-8"></th>
        <th className="px-4 py-3 font-semibold sm:px-8">
          Plannable task (steps determinable from the request)
        </th>
        <th className="px-4 py-3 font-semibold sm:px-8">
          Feedback-required task (state discovered as tools return)
        </th>
      </tr>
    </thead>
    <tbody>
      <tr className="border-b border-warm-200">
        <td className="px-4 py-3 font-semibold sm:px-8">Small / heavily quantized model</td>
        <td className="bg-signal-50 px-4 py-3 sm:px-8">Plan once, dispatch deterministically</td>
        <td className="px-4 py-3 sm:px-8">
          ReAct — with verifier, loop detection, tight hop budget
        </td>
      </tr>
      <tr>
        <td className="px-4 py-3 font-semibold sm:px-8">Capable model (4B+ at Q4 and up)</td>
        <td className="px-4 py-3 sm:px-8">Either works; plan-once is still far cheaper</td>
        <td className="px-4 py-3 sm:px-8">ReAct</td>
      </tr>
    </tbody>
  </table>
</div>

Note what is _not_ on the axes: the framework's default. The task's information structure and the model's capability decide the harness; everything else is convenience.

Treat the table as a default, not a law. One thing we keep re-learning: model families differ in how robustly they emit structured output under pressure, and a format-fragile model can scramble the ranking in either direction — another reason to measure on _your_ model before shipping, rather than trusting anyone's table, including this one.

As a flowchart, the same rule:

```mermaid
flowchart TD
    Q1{"Does the next action depend on what a tool returned?"}
    Q1 -->|"yes: search, recovery, hidden state"| REACT["ReAct with verifier, loop detection, tight hop budget"]
    Q1 -->|"no: steps determinable from the request"| Q2{"How capable is the model?"}
    Q2 -->|"small or heavily quantized"| PLAN["Plan once, dispatch deterministically"]
    Q2 -->|"capable (4B+ at Q4 and up)"| EITHER["Either works - plan-once is still far cheaper"]
    classDef focal fill:#f3bd92,stroke:#8c4000,stroke-width:2px,color:#4a2200
    class PLAN focal
```
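The same rule fits in a few lines of code. This is a sketch of the decision rule only; the function and enum names are invented here, and deciding whether a model counts as "small" is left to your own measurements.

```python
from enum import Enum


class Harness(Enum):
    REACT = "react"          # for small models: add verifier, loop detection, tight hop budget
    PLAN_ONCE = "plan_once"  # plan once, dispatch deterministically
    EITHER = "either"        # both work; plan-once is still far cheaper


def choose_harness(next_action_depends_on_tool_output: bool,
                   model_is_small: bool) -> Harness:
    """Pick a harness from the task's information structure, then the model."""
    if next_action_depends_on_tool_output:
        # Search, recovery, hidden state: an upfront plan cannot work.
        return Harness.REACT
    if model_is_small:
        return Harness.PLAN_ONCE
    return Harness.EITHER


print(choose_harness(next_action_depends_on_tool_output=False, model_is_small=True))
print(choose_harness(next_action_depends_on_tool_output=True, model_is_small=True))
```

Note the order of the checks: the task question comes first because no model size rescues a blind plan on a feedback-required task.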

## Others keep arriving at the same place

This isn't a lone finding from one robot lab. Once you know what to look for, the "stop paying an LLM call per step" conclusion shows up independently across papers and production guidance — each from a different starting point:

- **ReWOO** (Xu et al., 2023, [arXiv:2305.18323](https://arxiv.org/abs/2305.18323)) decouples reasoning from observation for exactly our reason: the interleaved think-act-observe pattern burns tokens. Their abstract reports **5× token efficiency and a 4% accuracy improvement on HotpotQA** from planning without observations — and, notably for this post, they show the decoupling makes it possible to distill the planning role into a 7B LLaMA. Plan-once isn't just cheaper; it's the shape that _fits_ small models.
- **LLMCompiler** (Kim et al., ICML 2024, [arXiv:2312.04511](https://arxiv.org/abs/2312.04511)) reaches the same architecture from the systems side: a planner that emits a task graph, then an executor that dispatches functions without per-step LLM involvement. Reported gains vs ReAct: **up to a 3.7× latency speedup, 6.7× cost savings, and a ~9% accuracy improvement**. Their motivation sentence could be this post's thesis: sequential per-function reasoning causes "high latency, cost, and sometimes inaccurate behavior."
- **[Anthropic's "Building effective agents"](https://www.anthropic.com/engineering/building-effective-agents)** frames it as workflows vs agents and lands on the same task-gating rule we derived from robot tasks: _"workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale"_ — with the explicit warning that agentic loops trade latency and cost for better task performance and carry "the potential for compounding errors." That is sycophancy-on-chains, described from the cloud side.
- **Harness-Bench** ([arXiv:2605.27922](https://arxiv.org/abs/2605.27922)) makes the general case with 106 sandboxed tasks, six harnesses, and eight model backends (5,194 trajectories): the gap between the best and worst harness is 23.8 points on the same tasks and models, and the authors argue _"agent capability should be reported at the model-harness configuration level rather than attributed to the base model alone."_ Buried in §4.3 is this post's thesis in their data: stronger backends show low variance across harnesses, while weaker backends swing hard with the execution layer. Their taxonomy of "execution-alignment failures — where plausible reasoning becomes decoupled from tool feedback" is a formal name for the failure our robots exhibit.
- **Agentic Robot** ([arXiv:2505.23450](https://arxiv.org/abs/2505.23450)) proposes a planner/executor/verifier triad (its "Standardized Action Procedure") for embodied tasks, verifying progress on a fixed cadence rather than every step — the pattern EdgeVox's periodic-verifier hook explicitly mirrors; the hook's docstring cites it.

What none of these isolate is the regime this post lives in: **sub-8B models, aggressive quantization, on-device latency budgets**. None of them treat bit-width as an experimental variable, and none of them operate under an edge latency budget. Whether the crossover point between the two harnesses moves with bit-width and task depth is exactly what our in-progress controlled study measures.

## The bigger claim

The 2026 conversation has finally accepted that the harness around the model is a first-class engineering surface — not glue code. What I think is still underappreciated is that this is _most_ true at the small end. A frontier model shrugs off a mediocre harness. A 1.7B model at Q4 lives or dies by it: the same weights, wrapped in the wrong loop, go from "solves the task" to "confidently reports solving the task it did not attempt."

We are in the middle of a controlled study on exactly this — harness choice as a function of model capability, quantization level, and task depth, run on the shipped EdgeVox harnesses rather than reimplementations. The design of that study, and the numbers, are a future post. What's in this one is everything we already ship and measure today, and it is enough to change how you build: **pick the loop after you understand the task, and if your model is small, count your LLM calls like they cost money — because on the edge, they cost seconds, and seconds are what your user feels.**

---

_The benchmark scripts, raw per-scenario data, and both harnesses (`ReActAgent`, `PlannedToolDispatcher`) are in the [EdgeVox repository](https://github.com/nrl-ai/edgevox) under MIT. If you reproduce the numbers on different hardware, I'd genuinely like to hear the results._

