# Sakana Fugu: a multi-agent system as a model

> Satyajit Ghana — Head of Engineering @ Inkers Technology
> canonical: https://ai.thesatyajit.com/articles/sakana-fugu
> date: 2026-06-23
> tags: llm, multi-agent, orchestration, reinforcement-learning, explainer
No single LLM wins everywhere. One model leads on competition math, another on
agentic coding, a third on multilingual work, and open models win on cost. The
usual response is to pick one and absorb its weak spots. Sakana AI's bet is the
other one: don't pick a model — *orchestrate a pool of them*, and make the
orchestration itself the model.

That product is **Sakana Fugu** (and a heavier tier, Fugu Ultra), shipped behind a
single API. Underneath are two ICLR 2026 papers that attack the same problem from
opposite ends: [TRINITY](https://arxiv.org/abs/2512.04695) *evolves* a tiny
coordinator over frozen models, and the
[Conductor](https://arxiv.org/abs/2512.04388) *reinforcement-learns* a 7B model to
write orchestration plans in natural language. This is a walk through both, and what
they add up to.

## A multi-agent system as a model

Fugu's framing is the whole pitch: one OpenAI-compatible endpoint. You send a
request to `model: fugu`; behind it a learned coordinator assembles a team from a
pool of frontier and open models, runs them over several turns, and returns one
answer. You never see the routing.

<FuguPool />

<Figure
  src="/articles/sakana-fugu/fugu-architecture.png"
  alt="Sakana Fugu over a pool of closed and open models, with Fugu itself as one of the workers."
  caption="Sakana's own framing of the idea: one Fugu endpoint coordinating a pool of closed and open models — and Fugu can even call itself as a worker (the recursive node on the right)."
/>

The pool is swappable — you can opt a model out for compliance and the coordinator
routes around it — and billing is a single top-tier rate rather than stacked
per-model fees. There's even an export-controls angle: because Fugu can hit
frontier-level quality by coordinating open and semi-open models, you get the
capability without hard dependence on any one restricted vendor.

But the API is the boring part. The interesting part is that the coordinator is
*learned*, not hand-written. There are two ways to learn it.

## TRINITY: evolve a tiny coordinator

TRINITY's constraint shapes everything: you cannot fine-tune GPT-5's weights, and
merging models with incompatible architectures doesn't work. So freeze every model
in the pool, and learn only a tiny thing on top that decides who does what.

<Figure
  src="/articles/sakana-fugu/trinity-architecture.png"
  alt="TRINITY's coordination architecture: a coordinator selects an agent and a role each turn, looping Thinker, Worker, Verifier, with a worked example."
  caption="TRINITY's coordination loop, from the paper: the coordinator picks an agent and a role each turn, with a worked Thinker → Worker → Verifier example on the right."
/>

### The coordinator is under 20,000 parameters

A small model — Qwen3-0.6B — reads the current problem state and produces a hidden
vector; a linear head turns that into a choice of *agent* and *role*. Given the
penultimate-token hidden state $h(s)\in\mathbb{R}^{d}$ from the small model, a head
$f_\theta$ of roughly 10K parameters emits logits over $L$ agents plus 3 roles, and
the coordinator samples its action $a$ from

$$
\pi_\theta(a \mid s) \;\propto\; \exp\!\big(f_\theta(h(s))_a\big),
\qquad a \in \{1,\dots,L\}\cup\{\mathrm{T},\mathrm{W},\mathrm{V}\}
$$

where $s$ is the running transcript, $\mathrm{T},\mathrm{W},\mathrm{V}$ are the three
roles below, and $\theta$ is everything that gets trained. On top of the head, TRINITY
adds *singular-value fine-tuning*: take an SVD of one or two of the small model's
weight matrices and learn only the singular-value scales, keeping the orthogonal
factors fixed. That's a few thousand more numbers. Total trainable: **under 20K
parameters.** The 0.6B backbone and all seven frontier and open models stay frozen.

<Diagram caption="The entire trainable surface of TRINITY: a hidden state, a ~10K linear head, and a categorical choice over agents and roles. Everything below the head is frozen.">
  <svg viewBox="0 0 640 200" role="img" aria-label="The TRINITY coordinator: the small model maps the problem state to a hidden vector; a tiny linear head turns it into logits over agents and roles." style={{ width: "100%", height: "auto" }}>
    <rect x="16" y="74" width="104" height="44" rx="8" fill="var(--background)" stroke="var(--border)" />
    <text x="68" y="92" textAnchor="middle" fontFamily="monospace" fontSize="11" fill="var(--foreground)">problem</text>
    <text x="68" y="108" textAnchor="middle" fontFamily="monospace" fontSize="11" fill="var(--foreground)">state s</text>
    <line x1="120" y1="96" x2="156" y2="96" stroke="var(--muted-foreground)" strokeWidth="1.3" />
    <rect x="156" y="66" width="120" height="60" rx="8" fill="oklch(0.72 0.05 260)" opacity="0.25" stroke="var(--border)" />
    <text x="216" y="90" textAnchor="middle" fontFamily="monospace" fontSize="11" fill="var(--foreground)">Qwen3-0.6B</text>
    <text x="216" y="106" textAnchor="middle" fontFamily="monospace" fontSize="9" fill="var(--muted-foreground)">frozen · SLM</text>
    <line x1="276" y1="96" x2="312" y2="96" stroke="var(--muted-foreground)" strokeWidth="1.3" />
    <text x="294" y="88" textAnchor="middle" fontFamily="monospace" fontSize="9" fill="var(--muted-foreground)">h(s)</text>
    <rect x="312" y="72" width="96" height="48" rx="8" fill="oklch(0.72 0.15 150)" opacity="0.85" />
    <text x="360" y="92" textAnchor="middle" fontFamily="monospace" fontSize="11" fill="oklch(0.2 0 0)">head fθ</text>
    <text x="360" y="107" textAnchor="middle" fontFamily="monospace" fontSize="9" fill="oklch(0.2 0 0)">~10K params</text>
    <line x1="408" y1="96" x2="444" y2="96" stroke="var(--muted-foreground)" strokeWidth="1.3" />
    <rect x="444" y="40" width="180" height="50" rx="8" fill="var(--background)" stroke="var(--border)" />
    <text x="534" y="60" textAnchor="middle" fontFamily="monospace" fontSize="10" fill="var(--foreground)">L agent logits</text>
    <text x="534" y="76" textAnchor="middle" fontFamily="monospace" fontSize="9" fill="var(--muted-foreground)">GPT-5 · Claude · Gemini · …</text>
    <rect x="444" y="102" width="180" height="50" rx="8" fill="var(--background)" stroke="var(--border)" />
    <text x="534" y="122" textAnchor="middle" fontFamily="monospace" fontSize="10" fill="var(--foreground)">3 role logits</text>
    <text x="534" y="138" textAnchor="middle" fontFamily="monospace" fontSize="9" fill="var(--muted-foreground)">Thinker · Worker · Verifier</text>
  </svg>
</Diagram>

<Figure
  src="/articles/sakana-fugu/trinity-hidden-state-separability.png"
  alt="The small model's hidden states are linearly separable by task type (SVM) and form clear task clusters in a t-SNE plot."
  caption="Why a ~10K linear head is enough: the small model's hidden states already separate by task type — a linear SVM classifies them almost perfectly (left), and t-SNE shows clean task clusters (right)."
/>

### Three roles, looped until accepted

Each turn, the coordinator gives the chosen agent one of three roles:

- **Thinker** — plan, decompose, or critique; no direct work.
- **Worker** — do the work: derive, compute, write code.
- **Verifier** — check the current answer and return `ACCEPT` or `REVISE`.

It loops, accumulating a transcript, and halts the moment a Verifier accepts (or a
fixed turn budget $K$ is exhausted):

$$
\tau \;=\; \min\{\, k \le K \;:\; R_k = \mathrm{V} \ \text{and}\ u_k = \mathrm{ACCEPT} \,\}
$$

where $R_k$ is the role at turn $k$ and $u_k$ is the verifier's verdict. Step through
one problem — watch a wrong answer get caught and revised before it's accepted:

<TrinityLoop />

### Trained by evolution, not gradients

Why not just RL the head? Because the reward is binary — the final answer is right or
wrong — and the head is tiny, so the per-parameter gradient signal is buried in
noise. TRINITY instead optimizes the coordinator with a *derivative-free* evolution
strategy, maximizing expected terminal reward:

$$
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\big[\, R(\tau) \,\big],
\qquad R(\tau) \in \{0, 1\}
$$

The optimizer is separable CMA-ES: it keeps a diagonal Gaussian over the ~10K
parameters, samples a small population each generation —
$\lambda = \lceil 4 + 3\ln n \rceil \approx 32$ for $n \approx 10{,}000$ — evaluates
each candidate's fitness by actually running rollouts, and shifts the distribution
toward the winners. The paper shows the coordination objective is nearly
block-separable, which is exactly the regime where a diagonal evolution strategy
beats both random search and gradient RL under a tight evaluation budget. The honest
cost: no gradients means you pay in *environment evaluations*, and each one is a full
multi-turn rollout against real model APIs.

### It beats every model in its pool

This is the result that matters. Transferred zero-shot to four held-out tasks, the
evolved coordinator outscored every individual model in its pool — including GPT-5,
Gemini-2.5-Pro, and Claude-4-Sonnet. On LiveCodeBench it set a record at the time of
submission:

<BenchBars
  title="LiveCodeBench v6 — pass@1 (%)"
  unit=""
  bars={[
    { label: "TRINITY", value: 86.2, highlight: true },
    { label: "GPT-5", value: 83.8 },
    { label: "Gemini-2.5-Pro", value: 67.2 },
    { label: "Claude-4-Sonnet", value: 46.5 },
  ]}
/>

And the multi-turn loop earns its keep: accuracy climbs from 0.823 at two turns to
0.863 at six. One cheap evolved head, a frozen pool, and the ensemble beats its best
member.

<Figure
  src="/articles/sakana-fugu/trinity-livecodebench.png"
  alt="TRINITY's LiveCodeBench result and its accuracy rising with the turn budget."
  caption="TRINITY's own result: on LiveCodeBench it reaches 0.862 pass@1, above GPT-5 (0.838), Gemini-2.5-Pro (0.672), and Claude-4-Sonnet (0.465) — and accuracy keeps climbing with the turn budget (bottom)."
/>

## Conductor: orchestration written in natural language

The Conductor attacks the same problem with a bigger hammer: a 7B model (Qwen2.5-7B)
trained with RL to *write the entire workflow itself*, in natural language.

### Three lists are a workflow

For each problem the Conductor emits three synchronized lists:

- `model_id` — which agent runs each step.
- `subtasks` — a natural-language instruction for each step.
- `access_list` — which earlier outputs each step is allowed to read.

Those three lists *are* a directed graph. The `access_list` is the load-bearing
idea: `[]` means the step sees only the original question, `["all"]` means it sees
everything produced so far, and `[0, 2]` means it sees steps 0 and 2. By choosing
access lists, the Conductor designs the communication topology — a chain, parallel
branches, a verify-and-merge — *per problem*, not from a fixed template. Flip between
the topologies it learns to produce:

<ConductorWorkflow />

### Trained with GRPO

The Conductor is trained end-to-end with GRPO. For each question it samples a group
of $G = 64$ candidate workflows, scores each, and pushes the policy toward the
above-average ones using the group-normalized advantage

$$
A_i \;=\; \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
$$

The reward $r_i$ is blunt on purpose: $0$ if the three lists don't parse, $1$ if the
final workflow output is correct, and $0.5$ otherwise — with no KL penalty
($\beta = 0$). The whole thing trains on just 960 problems for 200 iterations on two
H100s. To make one Conductor work over *any* pool, they then fine-tune it with
randomly sampled $k$-model subsets per question, so it adapts to whatever agents you
hand it.

<Figure
  src="/articles/sakana-fugu/conductor-training-emergence.png"
  alt="Conductor accuracy climbing over 200 GRPO iterations for out-of-distribution, in-distribution, and mixed agent pools."
  caption="Coordination strategy emerging during training: accuracy climbs over 200 GRPO iterations as the Conductor learns to design better workflows — fastest when its few-shot examples are held out-of-distribution."
/>

### It can call itself

The Conductor may name *itself* as a worker. That spawns a fresh sub-workflow on its
own draft — a recursive topology that turns inference depth into a tunable compute
axis, what Sakana calls dynamic test-time scaling. Recursion buys a point or two on
the hardest benchmarks for under 2× the agent calls.

### Results

A 7B model orchestrating frontier workers beats the frontier workers. In a
controlled run over the same pool:

<BenchBars
  title="LiveCodeBench — controlled, shared worker pool (%)"
  unit=""
  bars={[
    { label: "Conductor (7B)", value: 64.3, highlight: true },
    { label: "GPT-5", value: 57.5 },
    { label: "Gemini-2.5-Pro", value: 40.1 },
    { label: "Claude-4", value: 38.0 },
    { label: "MoA", value: 38.6 },
  ]}
/>

Unconstrained, the headline numbers were each a new high at publication and each
above the best single worker: **83.9% on LiveCodeBench, 87.5% on GPQA-Diamond, 93.3%
on AIME25** — reached with about 3 agent calls per question, versus 5–8 for prior
multi-agent methods.

<Figure
  src="/articles/sakana-fugu/conductor-leaderboard.png"
  alt="Conductor leading both GPQA-Diamond and LiveCodeBench against every individual worker model."
  caption="The Conductor (highlighted) tops both GPQA-Diamond and LiveCodeBench against every individual worker in its pool — GPT-5, Gemini-2.5-Pro, DeepSeek-R1, and Claude Opus 4."
/>

<Figure
  src="/articles/sakana-fugu/conductor-efficiency.png"
  alt="Scatter of average performance versus average number of agent calls: the Conductor is high-performance at about 3 calls, versus MoA at 8 calls."
  caption="Performance versus cost: the Conductor sits top-left — higher accuracy than every multi-agent baseline at roughly 3 agent calls, where MoA needs 8."
/>

## Two routes to the same place

TRINITY and the Conductor are the same idea — a learned layer that coordinates a
pool — built at opposite scales:

| | TRINITY | Conductor |
|---|---|---|
| Learnable size | < 20K params (evolved head) | 7B params (RL-trained model) |
| Training | derivative-free sep-CMA-ES | GRPO (reinforcement learning) |
| Output per step | (agent, role) | a full natural-language workflow |
| Coordination | fixed Thinker/Worker/Verifier loop | a topology it designs per problem |
| Reads the task via | the small model's hidden state | reasoning in language |
| Adapts to new pools | re-evolve (cheap) | randomized-pool fine-tune |

TRINITY is the minimal, almost-free coordinator; the Conductor is the expressive one
that designs bespoke pipelines. Fugu uses both as its engine.

## What ships: Fugu and Fugu Ultra

Two tiers. Base **Fugu** balances quality and latency over a lean pool. **Fugu
Ultra** coordinates a deeper pool over more turns for hard, high-stakes problems, and
takes longer for it. On Sakana's reported numbers, both match or beat the frontier:

<BenchBars
  title="SWE-Bench Pro (%)"
  unit=""
  bars={[
    { label: "Fugu Ultra", value: 73.7, highlight: true },
    { label: "Claude Opus 4.8", value: 69.2 },
  ]}
/>

Fugu Ultra also posts **50.0 on Humanity's Last Exam**, against baselines in the
41–50 range. It's an OpenAI-compatible endpoint — change the base URL and key, no SDK
migration — and it bills at a single top-tier rate. (Not available in the EU yet,
pending GDPR; the exact routing decisions are kept proprietary.)

<Figure
  src="/articles/sakana-fugu/fugu-benchmarks.png"
  alt="Fugu and Fugu Ultra versus Fable 5, Gemini 3.1 Pro, GPT-5.5, and Claude Opus 4.8 across eight benchmarks."
  caption="Fugu and Fugu Ultra (red) against Fable 5, Gemini 3.1 Pro, GPT-5.5, and Claude Opus 4.8 across eight benchmarks (Sakana). Fugu Ultra leads on SWE-Bench Pro (73.7 vs 69.2), GPQA-D, LiveCodeBench, and Humanity's Last Exam."
/>

## What I make of it

The honest read:

- **The win is real.** An orchestration layer that beats every model it coordinates —
  and generalizes zero-shot to unseen tasks — is a genuine result. "Coordination" is
  now a trainable layer that sits *above* frontier models rather than inside one.
- **The costs are real too.** Every model in the pool has to be available at
  inference; you trade single-model simplicity for a fleet, and latency rises with
  the extra turns. The biggest gains concentrate on long-tail reasoning and coding
  benchmarks — on easy tasks the lift is small — and leaning on GPT-5/Claude/Gemini as
  workers inherits their cost.
- **The framing is the interesting part.** TRINITY argues the coordinator can be
  almost free: 20K evolved parameters over frozen models. The Conductor argues
  coordination is itself a reasoning skill worth a 7B model and a full RL run. Both
  point the same way — as individual models plateau, the next axis is how you make
  several of them work together, and that orchestration is learnable.

---

*Built on Sakana AI's [TRINITY: An Evolved LLM Coordinator](https://arxiv.org/abs/2512.04695)
and [Learning to Orchestrate Agents in Natural Language with the Conductor](https://arxiv.org/abs/2512.04388),
both ICLR 2026. Product: [Sakana Fugu](https://sakana.ai/fugu/).*
