# Nous Hermes and Mixture-of-Agents: when models confer before they answer

> Satyajit Ghana — Head of Engineering @ Inkers Technology
> canonical: https://ai.thesatyajit.com/articles/nous-hermes-moa
> date: 2026-06-27
> tags: llm, multi-agent, mixture-of-agents, nous-research, open-weights, explainer
"Hermes MoA" isn't one thing, so let me separate the threads before building anything,
because the difference between them is the difference between a real result and a
marketing claim.

- **Mixture-of-Agents (MoA)** is a real, well-cited technique from Together AI / Duke /
  Stanford — [arXiv:2406.04692](https://arxiv.org/abs/2406.04692). It's the foundation
  everyone means by "MoA".
- **Nous Hermes** is a real open-weight model family. The current
  [Hermes 4](https://arxiv.org/abs/2508.18255) (14B / 70B / 405B) is a single
  hybrid-reasoning model — it does *not* itself use MoA.
- The genuine bridge between them is Nous's **Forge Reasoning API** (Nov 2024), which
  really did run MoA — plus Monte Carlo Tree Search and Chain of Code — on top of
  Hermes 70B.
- There is also a body of **2026 "Hermes Agent MoA" claims** that I'll get to at the
  end, and explicitly flag as unverified.

So the honest construction is: here's the MoA mechanism, here's the open-weight Hermes
line and why it's a natural host for it, and here's how Nous actually shipped the two
together. Let's build it.

## The core idea: proposers and aggregators

A single LLM gets one shot at your prompt. MoA's bet is that several models, allowed to
*read each other's drafts and synthesize*, beat any one of them — even when the
individual drafts are mediocre.

The structure is a stack of **L layers**, each with **n agents**. Agents play two
roles:

- **Proposers** generate candidate responses. Diversity matters more here than any
  single proposer's quality — you want different mistakes, not the same answer four
  times.
- **Aggregators** take the candidates and synthesize a single, better response.

The data flow is the load-bearing part: every agent in layer $i$ receives **all
outputs from layer $i-1$**, concatenated into an *Aggregate-and-Synthesize* prompt that
tells the model to critically evaluate the candidates and fuse them. The final layer's
aggregator emits the answer. Watch one round play out — four diverse proposers, each
right about a different piece, fused into an answer that beats all of them:

<MoARoundtable />

Stack that into layers and the synthesized answer sharpens further. Add depth and watch
the quality climb:

<MoANetwork />

<Figure
  src="/articles/nous-hermes-moa/moa-fig2.png"
  alt="The Mixture-of-Agents architecture: four layers of agents, each layer's proposers feeding all of their outputs into every agent of the next layer, culminating in a final aggregated answer."
  caption="The MoA architecture (paper, Figure 2): L layers × n agents. Each agent reads all outputs of the previous layer through an Aggregate-and-Synthesize prompt; the final aggregator returns the answer. The paper's default is 3 layers of 6 proposers, with Qwen1.5-110B as the final aggregator."
/>

## Why conferring helps: collaborativeness

The empirical observation that motivates the whole thing: an LLM produces a *better*
answer when shown other models' responses — **even when those responses are
individually weaker than what it would have written alone.** The paper calls this
collaborativeness, and it's the reason MoA isn't just "best-of-n with extra steps".

<Figure
  src="/articles/nous-hermes-moa/moa-fig1.png"
  alt="Bar chart showing AlpacaEval 2.0 LC win rates increasing for several models when they are provided other models' responses as context, versus answering alone."
  caption="Collaborativeness (paper, Figure 1): AlpacaEval 2.0 win rate rises across models when each is shown peers' answers as auxiliary context — the effect MoA is built to exploit."
/>

The mechanism is concrete. The aggregator isn't voting; it's reading. A proposer that
nailed the units, another that caught an edge case, a third that structured the
explanation — the aggregate-and-synthesize prompt lets one model keep the part each
proposer got right and drop the rest.

<Diagram caption="The Aggregate-and-Synthesize prompt: the layer-i aggregator receives the original query plus every layer-(i−1) proposer's full response as auxiliary context, and is instructed to critically fuse them into one improved answer — not to pick a winner.">
  <svg viewBox="0 0 640 230" role="img" aria-label="Multiple proposer responses plus the original query feed an aggregate-and-synthesize prompt, which the aggregator turns into one fused answer." style={{ width: "100%", height: "auto" }}>
    {/* query */}
    <rect x="14" y="100" width="96" height="34" rx="8" fill="oklch(0.8 0.1 250)" opacity="0.4" stroke="var(--border)" />
    <text x="62" y="121" textAnchor="middle" fontFamily="monospace" fontSize="10" fill="var(--foreground)">query</text>
    {/* proposers */}
    {["proposer 1","proposer 2","proposer 3"].map((p,i) => (
      <g key={p}>
        <rect x="14" y={14 + i*62} width="120" height="34" rx="8" fill="var(--background)" stroke="var(--border)" />
        <text x="74" y={35 + i*62} textAnchor="middle" fontFamily="monospace" fontSize="10" fill="var(--muted-foreground)">{p}</text>
        <line x1="134" y1={31 + i*62} x2="250" y2="115" stroke="var(--muted-foreground)" strokeWidth="1" strokeDasharray="3 3" />
      </g>
    ))}
    <line x1="110" y1="117" x2="250" y2="117" stroke="var(--muted-foreground)" strokeWidth="1" strokeDasharray="3 3" />
    {/* aggregate prompt */}
    <rect x="250" y="78" width="170" height="78" rx="10" fill="oklch(0.72 0.13 150)" opacity="0.25" stroke="var(--border)" />
    <text x="335" y="108" textAnchor="middle" fontFamily="monospace" fontSize="10" fill="var(--foreground)">aggregate &amp;</text>
    <text x="335" y="123" textAnchor="middle" fontFamily="monospace" fontSize="10" fill="var(--foreground)">synthesize</text>
    <text x="335" y="142" textAnchor="middle" fontFamily="monospace" fontSize="8" fill="var(--muted-foreground)">critically fuse, don't vote</text>
    <line x1="420" y1="117" x2="468" y2="117" stroke="var(--muted-foreground)" strokeWidth="1.3" markerEnd="url(#aa)" />
    <defs><marker id="aa" markerWidth="7" markerHeight="7" refX="6" refY="3.5" orient="auto"><path d="M0,0 L7,3.5 L0,7 Z" fill="var(--muted-foreground)" /></marker></defs>
    {/* aggregator */}
    <rect x="468" y="92" width="158" height="50" rx="10" fill="oklch(0.72 0.14 150)" opacity="0.85" />
    <text x="547" y="113" textAnchor="middle" fontFamily="monospace" fontSize="10" fill="oklch(0.2 0 0)">aggregator</text>
    <text x="547" y="129" textAnchor="middle" fontFamily="monospace" fontSize="9" fill="oklch(0.2 0 0)">→ one answer</text>
  </svg>
</Diagram>

## What MoA scores

The headline result is that a stack of **open-source** models, conferring, beats a
single frontier model. On AlpacaEval 2.0 (length-controlled win rate):

<BenchBars
  title="AlpacaEval 2.0 — LC win rate (%)"
  unit="%"
  bars={[
    { label: "MoA w/ GPT-4o", value: 65.7, highlight: true },
    { label: "MoA (open only)", value: 65.1, highlight: true },
    { label: "GPT-4 Omni", value: 57.5 },
  ]}
/>

Open-only MoA hits **65.1%** against GPT-4 Omni's **57.5%** — a +7.6-point margin with
no closed model in the loop. On MT-Bench it scores **9.25** (9.40 with GPT-4o added)
versus GPT-4 Omni's 9.19, and on FLASK it leads on robustness, correctness, factuality,
and completeness against the strongest single proposer.

<Figure
  src="/articles/nous-hermes-moa/moa-fig6.png"
  alt="Performance-versus-cost Pareto frontier showing MoA configurations achieving higher win rate per dollar than single-model baselines."
  caption="Cost-performance (paper, Figure 5a): MoA configurations sit on a better win-rate-per-dollar frontier — the gain isn't free (you pay for n proposers × L layers of calls), but it's competitive on cost, not just quality."
/>

The cost is exactly what you'd expect: $n$ proposers across $L$ layers means many model
calls per answer, and latency stacks with depth. MoA buys quality with compute. Whether
that trade is worth it depends entirely on how much you value the marginal correctness.

<MoACost />

## Design choices that actually move the needle

MoA has three knobs, and they don't all behave the way intuition says:

- **Width (proposers) over depth (layers).** The bulk of the gain comes from having
  several *diverse* proposers in the first layer; stacking more layers helps less and
  costs latency linearly. The toy above saturates fast in $L$ for exactly this reason.
- **Diversity beats raw strength.** Proposers that fail differently give the aggregator
  more to work with than several copies of the strongest model. The paper's pool is
  deliberately heterogeneous — Qwen, WizardLM, Llama-3, Mixtral, dbrx — not six clones.
- **The aggregator is a real choice.** Not every strong model is a good *synthesizer*;
  the aggregator has to read several candidate answers and fuse them faithfully rather
  than just re-emit its own. The paper uses Qwen1.5-110B-Chat as the final aggregator,
  and the role-suitability of a model as aggregator vs proposer is measured separately.

You can see the effect dimension-by-dimension on FLASK, which scores along twelve skill
axes rather than one number:

<Figure
  src="/articles/nous-hermes-moa/moa-fig3.png"
  alt="FLASK evaluation across twelve skill dimensions, showing Mixture-of-Agents improving over the strongest single proposer on robustness, correctness, factuality, and completeness."
  caption="FLASK (paper, Figure 3): MoA's gains aren't uniform — it pulls ahead most on robustness, correctness, factuality, and completeness, the dimensions where cross-checking multiple drafts helps most."
/>

The shape of that result is the tell: MoA helps exactly where having several independent
attempts to cross-check is valuable, and barely moves dimensions that a single competent
model already nails.

## Where Hermes comes in

Nous Research builds the **Hermes** line — open-weight models post-trained with a
deliberate *neutral alignment* philosophy: minimal gratuitous refusals, maximal user
steerability. Hermes 4 (70B and 405B on Llama-3.1 bases, 14B on a Qwen3 base) adds
**hybrid reasoning** — a single checkpoint with a toggleable `<think>…</think>` block,
so you get reasoning and instruct behavior from one model — plus strong function
calling and JSON-schema structured output. It was trained on ~60B tokens (~5M samples)
built with Nous's DataForge and the Atropos RL environment, rejection-sampled against
roughly a thousand task-specific verifiers, on 192× B200 GPUs.

On capability it's competitive with the open frontier (Hermes 4 405B, reasoning mode):

| Benchmark | Hermes 4 405B (reasoning) | non-reasoning |
|---|---|---|
| MATH-500 | 96.3 | 73.8 |
| AIME'24 | 81.9 | 11.4 |
| AIME'25 | 78.1 | 10.6 |
| GPQA Diamond | 70.5 | 39.4 |
| LiveCodeBench v6 | 61.3 | 28.1 |
| MMLU | 87.2 | 73.6 |

But the number that captures the *philosophy* is RefusalBench — Nous's own measure of
how often a model refuses across 32 categories of typically-refused requests (higher =
fewer refusals, except for a few inverted safety categories scored the other way):

<BenchBars
  title="RefusalBench — higher means fewer refusals (avg of 5 runs)"
  unit=""
  bars={[
    { label: "Hermes 4 (reasoning)", value: 57.1, highlight: true },
    { label: "Grok 4", value: 51.3 },
    { label: "Hermes 4 (non-reasoning)", value: 43.2 },
    { label: "DeepSeek V3", value: 28.1 },
    { label: "Gemini 2.5 Pro", value: 24.2 },
    { label: "GPT-4o", value: 17.7 },
    { label: "Opus 4.1", value: 15.4 },
    { label: "GPT-5", value: 11.3 },
  ]}
/>

That steerability is what makes Hermes a natural MoA citizen. Open weights mean you can
run a whole proposer pool yourself; neutral alignment means the aggregator won't refuse
to synthesize half its inputs. Hermes is built to be *driven*, which is exactly what a
multi-agent harness does to it.

## The real bridge: Forge

The genuine "Nous ran MoA on Hermes" artifact is the **Forge Reasoning API** (beta,
Nov 2024). Forge combined three inference-time techniques on top of Hermes 70B:
Mixture-of-Agents, Monte Carlo Tree Search, and Chain of Code. The MoA piece is exactly
the mechanism above — "models respond, confer, and synthesize new answers" — applied to
a Hermes-centric pool. If you want a concrete instance of MoA on the Hermes line that
actually shipped, Forge is it.

Forge stacked three inference-time techniques that compose cleanly because they attack
different failure modes:

- **Mixture-of-Agents** — breadth. Several models propose and an aggregator synthesizes,
  the mechanism above.
- **Monte Carlo Tree Search** — depth. Instead of one greedy chain, explore a tree of
  reasoning continuations and back up value estimates, spending more search on promising
  branches. This is the "think longer on hard problems" axis.
- **Chain of Code** — grounding. Offload the steps that are better *executed* than
  *reasoned about* (arithmetic, string manipulation, logic) into code that actually
  runs, so the model isn't bluffing its way through a calculation.

Breadth, depth, and grounding are orthogonal, which is why bolting all three onto a
fixed Hermes backbone bought more than any one alone.

A practical MoA instantiation Nous-style also collapses the textbook diagram into
something cheap: a small **reference** model runs first *without* tool schemas (avoiding
refusals and saving tokens), its output is appended as private context, and the
**aggregator** — the real Hermes agent — does the actual tool-calling loop with the
reference draft in hand. One layer, two roles, most of the benefit. It's a reminder that
"MoA" in production rarely looks like the 3×6 textbook diagram; it's whatever
proposer/aggregator split pays for itself.

<Callout type="warning">
There is a wave of **June 2026 "Hermes Agent MoA 2.0"** content claiming MoA presets
that beat "Claude Opus 4.8" and "GPT-5.5" on an unpublished "HermesBench" (e.g. a
quoted 0.8202 vs 0.7607/0.7412 for the individual models). I could not verify any of
it: the cited models aren't confirmably released, the benchmark has no published
leaderboard, and the supporting sources are a crypto-news post (which hedges with
"claiming") and social posts. Treat the *mechanism* as faithful MoA, but treat the
*numbers* as marketing-stage and unverified — not established fact.
</Callout>

## What I make of it

- **The result is real and a little counterintuitive.** Open models that read each
  other's drafts beat a single frontier model on AlpacaEval, and the lift comes from
  collaborativeness — synthesis from diverse, even weaker, drafts. That's a genuine,
  reproducible finding with public code.
- **Hermes is the right host, not the inventor.** MoA is Together AI's; Hermes is
  Nous's open-weight, neutral-alignment line; Forge is where Nous actually combined
  them. Keep the attribution straight and the story is clean.
- **The cost is the catch, as always.** $n \times L$ model calls per answer and latency
  that grows with depth. MoA is for when correctness is worth real compute — agentic
  pipelines, hard reasoning — not for chat you need back in 200ms.
- **Be skeptical of the 2026 leaderboard claims.** The mechanism is sound; the
  benchmark numbers floating around are not yet something I'd cite.

---

*Built on Together AI's [Mixture-of-Agents Enhances Large Language Model
Capabilities](https://arxiv.org/abs/2406.04692) (Wang et al., 2024;
[code](https://github.com/togethercomputer/moa)), the [Hermes 4 Technical
Report](https://arxiv.org/abs/2508.18255) (Nous Research, 2025), and Nous's [Forge
Reasoning API](https://nousresearch.com/introducing-the-forge-reasoning-api-beta-and-nous-chat-an-evolution-in-llm-inference).*
