~/satyajit

Nous Hermes and Mixture-of-Agents: when models confer before they answer

mdjsonmcp

2026-06-27 · 11 min · llm · multi-agent · mixture-of-agents · nous-research · open-weights · explainer

"Hermes MoA" isn't one thing, so let me separate the threads before building anything, because the difference between them is the difference between a real result and a marketing claim.

So the honest construction is: here's the MoA mechanism, here's the open-weight Hermes line and why it's a natural host for it, and here's how Nous actually shipped the two together. Let's build it.

The core idea: proposers and aggregators

A single LLM gets one shot at your prompt. MoA's bet is that several models, allowed to read each other's drafts and synthesize, beat any one of them — even when the individual drafts are mediocre.

The structure is a stack of L layers, each with n agents. Agents play two roles:

The data flow is the load-bearing part: every agent in layer ii receives all outputs from layer i1i-1, concatenated into an Aggregate-and-Synthesize prompt that tells the model to critically evaluate the candidates and fuse them. The final layer's aggregator emits the answer. Watch one round play out — four diverse proposers, each right about a different piece, fused into an answer that beats all of them:

mixture-of-agents · one round, live
prompt: Why is the sky blue?
Qwen-110Bmechanism, but vague

WizardLMnames the physics

Llama-3-70Bthe quantitative law

Mixtralweak, hand-wavy

↓ all four drafts → aggregator (aggregate & synthesize)
aggregator · Hermes 4 — synthesize, don’t vote

Qwen-110B
0%
WizardLM
0%
Llama-3-70B
0%
Mixtral
0%
MoA
0%

The synthesized answer (71%) clears the best single proposer (64%) because each draft contributed a different correct piece — the mechanism, the law, the number, the perception caveat. That lift from fusing diverse, individually-incomplete drafts is collaborativeness.

Stack that into layers and the synthesized answer sharpens further. Add depth and watch the quality climb:

mixture-of-agents · depth
single model
Hermes 4 · one pass
output · “What is the Kuiper Belt?”

The Kuiper Belt is past Neptune. It has icy bodies. Pluto is one of them.

answer quality (stand-in win rate)57%

One model, one pass — fluent but thin, and it drops the scattered disc entirely.

The Mixture-of-Agents architecture: four layers of agents, each layer's proposers feeding all of their outputs into every agent of the next layer, culminating in a final aggregated answer.
The MoA architecture (paper, Figure 2): L layers × n agents. Each agent reads all outputs of the previous layer through an Aggregate-and-Synthesize prompt; the final aggregator returns the answer. The paper's default is 3 layers of 6 proposers, with Qwen1.5-110B as the final aggregator.

Why conferring helps: collaborativeness

The empirical observation that motivates the whole thing: an LLM produces a better answer when shown other models' responses — even when those responses are individually weaker than what it would have written alone. The paper calls this collaborativeness, and it's the reason MoA isn't just "best-of-n with extra steps".

Bar chart showing AlpacaEval 2.0 LC win rates increasing for several models when they are provided other models' responses as context, versus answering alone.
Collaborativeness (paper, Figure 1): AlpacaEval 2.0 win rate rises across models when each is shown peers' answers as auxiliary context — the effect MoA is built to exploit.

The mechanism is concrete. The aggregator isn't voting; it's reading. A proposer that nailed the units, another that caught an edge case, a third that structured the explanation — the aggregate-and-synthesize prompt lets one model keep the part each proposer got right and drop the rest.

queryproposer 1proposer 2proposer 3aggregate &synthesizecritically fuse, don't voteaggregator→ one answer
The Aggregate-and-Synthesize prompt: the layer-i aggregator receives the original query plus every layer-(i−1) proposer's full response as auxiliary context, and is instructed to critically fuse them into one improved answer — not to pick a winner.

What MoA scores

The headline result is that a stack of open-source models, conferring, beats a single frontier model. On AlpacaEval 2.0 (length-controlled win rate):

AlpacaEval 2.0 — LC win rate (%)
MoA w/ GPT-4o
65.7%
MoA (open only)
65.1%
GPT-4 Omni
57.5%
020406080

Open-only MoA hits 65.1% against GPT-4 Omni's 57.5% — a +7.6-point margin with no closed model in the loop. On MT-Bench it scores 9.25 (9.40 with GPT-4o added) versus GPT-4 Omni's 9.19, and on FLASK it leads on robustness, correctness, factuality, and completeness against the strongest single proposer.

Performance-versus-cost Pareto frontier showing MoA configurations achieving higher win rate per dollar than single-model baselines.
Cost-performance (paper, Figure 5a): MoA configurations sit on a better win-rate-per-dollar frontier — the gain isn't free (you pay for n proposers × L layers of calls), but it's competitive on cost, not just quality.

The cost is exactly what you'd expect: nn proposers across LL layers means many model calls per answer, and latency stacks with depth. MoA buys quality with compute. Whether that trade is worth it depends entirely on how much you value the marginal correctness.

cost vs quality · depth L × width n
L1
L2
agg
layers L2
proposers n4
model calls
9
latency (depth)
quality (stand-in)
70.2%

A shallow, wide stack captures most of the gain cheaply: a couple of layers and a handful of diverse proposers. Quality saturates fast, so more depth mostly costs latency.

Design choices that actually move the needle

MoA has three knobs, and they don't all behave the way intuition says:

You can see the effect dimension-by-dimension on FLASK, which scores along twelve skill axes rather than one number:

FLASK evaluation across twelve skill dimensions, showing Mixture-of-Agents improving over the strongest single proposer on robustness, correctness, factuality, and completeness.
FLASK (paper, Figure 3): MoA's gains aren't uniform — it pulls ahead most on robustness, correctness, factuality, and completeness, the dimensions where cross-checking multiple drafts helps most.

The shape of that result is the tell: MoA helps exactly where having several independent attempts to cross-check is valuable, and barely moves dimensions that a single competent model already nails.

Where Hermes comes in

Nous Research builds the Hermes line — open-weight models post-trained with a deliberate neutral alignment philosophy: minimal gratuitous refusals, maximal user steerability. Hermes 4 (70B and 405B on Llama-3.1 bases, 14B on a Qwen3 base) adds hybrid reasoning — a single checkpoint with a toggleable <think>…</think> block, so you get reasoning and instruct behavior from one model — plus strong function calling and JSON-schema structured output. It was trained on ~60B tokens (~5M samples) built with Nous's DataForge and the Atropos RL environment, rejection-sampled against roughly a thousand task-specific verifiers, on 192× B200 GPUs.

On capability it's competitive with the open frontier (Hermes 4 405B, reasoning mode):

BenchmarkHermes 4 405B (reasoning)non-reasoning
MATH-50096.373.8
AIME'2481.911.4
AIME'2578.110.6
GPQA Diamond70.539.4
LiveCodeBench v661.328.1
MMLU87.273.6

But the number that captures the philosophy is RefusalBench — Nous's own measure of how often a model refuses across 32 categories of typically-refused requests (higher = fewer refusals, except for a few inverted safety categories scored the other way):

RefusalBench — higher means fewer refusals (avg of 5 runs)
Hermes 4 (reasoning)
57.1
Grok 4
51.3
Hermes 4 (non-reasoning)
43.2
DeepSeek V3
28.1
Gemini 2.5 Pro
24.2
GPT-4o
17.7
Opus 4.1
15.4
GPT-5
11.3
0204060

That steerability is what makes Hermes a natural MoA citizen. Open weights mean you can run a whole proposer pool yourself; neutral alignment means the aggregator won't refuse to synthesize half its inputs. Hermes is built to be driven, which is exactly what a multi-agent harness does to it.

The real bridge: Forge

The genuine "Nous ran MoA on Hermes" artifact is the Forge Reasoning API (beta, Nov 2024). Forge combined three inference-time techniques on top of Hermes 70B: Mixture-of-Agents, Monte Carlo Tree Search, and Chain of Code. The MoA piece is exactly the mechanism above — "models respond, confer, and synthesize new answers" — applied to a Hermes-centric pool. If you want a concrete instance of MoA on the Hermes line that actually shipped, Forge is it.

Forge stacked three inference-time techniques that compose cleanly because they attack different failure modes:

Breadth, depth, and grounding are orthogonal, which is why bolting all three onto a fixed Hermes backbone bought more than any one alone.

A practical MoA instantiation Nous-style also collapses the textbook diagram into something cheap: a small reference model runs first without tool schemas (avoiding refusals and saving tokens), its output is appended as private context, and the aggregator — the real Hermes agent — does the actual tool-calling loop with the reference draft in hand. One layer, two roles, most of the benefit. It's a reminder that "MoA" in production rarely looks like the 3×6 textbook diagram; it's whatever proposer/aggregator split pays for itself.

What I make of it


Built on Together AI's Mixture-of-Agents Enhances Large Language Model Capabilities (Wang et al., 2024; code), the Hermes 4 Technical Report (Nous Research, 2025), and Nous's Forge Reasoning API.

share