# Sakana Fugu: a multi-agent system as a model > Satyajit Ghana — Head of Engineering @ Inkers Technology > canonical: https://ai.thesatyajit.com/articles/sakana-fugu > date: 2026-06-23 > tags: llm, multi-agent, orchestration, reinforcement-learning, explainer No single LLM wins everywhere. One model leads on competition math, another on agentic coding, a third on multilingual work, and open models win on cost. The usual response is to pick one and absorb its weak spots. Sakana AI's bet is the other one: don't pick a model — *orchestrate a pool of them*, and make the orchestration itself the model. That product is **Sakana Fugu** (and a heavier tier, Fugu Ultra), shipped behind a single API. Underneath are two ICLR 2026 papers that attack the same problem from opposite ends: [TRINITY](https://arxiv.org/abs/2512.04695) *evolves* a tiny coordinator over frozen models, and the [Conductor](https://arxiv.org/abs/2512.04388) *reinforcement-learns* a 7B model to write orchestration plans in natural language. This is a walk through both, and what they add up to. ## A multi-agent system as a model Fugu's framing is the whole pitch: one OpenAI-compatible endpoint. You send a request to `model: fugu`; behind it a learned coordinator assembles a team from a pool of frontier and open models, runs them over several turns, and returns one answer. You never see the routing.

The pool is swappable — you can opt a model out for compliance and the coordinator routes around it — and billing is a single top-tier rate rather than stacked per-model fees. There's even an export-controls angle: because Fugu can hit frontier-level quality by coordinating open and semi-open models, you get the capability without hard dependence on any one restricted vendor. But the API is the boring part. The interesting part is that the coordinator is *learned*, not hand-written. There are two ways to learn it. ## TRINITY: evolve a tiny coordinator TRINITY's constraint shapes everything: you cannot fine-tune GPT-5's weights, and merging models with incompatible architectures doesn't work. So freeze every model in the pool, and learn only a tiny thing on top that decides who does what.

### The coordinator is under 20,000 parameters A small model — Qwen3-0.6B — reads the current problem state and produces a hidden vector; a linear head turns that into a choice of *agent* and *role*. Given the penultimate-token hidden state $h(s)\in\mathbb{R}^{d}$ from the small model, a head $f_\theta$ of roughly 10K parameters emits logits over $L$ agents plus 3 roles, and the coordinator samples its action $a$ from $$ \pi_\theta(a \mid s) \;\propto\; \exp\!\big(f_\theta(h(s))_a\big), \qquad a \in \{1,\dots,L\}\cup\{\mathrm{T},\mathrm{W},\mathrm{V}\} $$ where $s$ is the running transcript, $\mathrm{T},\mathrm{W},\mathrm{V}$ are the three roles below, and $\theta$ is everything that gets trained. On top of the head, TRINITY adds *singular-value fine-tuning*: take an SVD of one or two of the small model's weight matrices and learn only the singular-value scales, keeping the orthogonal factors fixed. That's a few thousand more numbers. Total trainable: **under 20K parameters.** The 0.6B backbone and all seven frontier and open models stay frozen.

### Three roles, looped until accepted Each turn, the coordinator gives the chosen agent one of three roles: - **Thinker** — plan, decompose, or critique; no direct work. - **Worker** — do the work: derive, compute, write code. - **Verifier** — check the current answer and return `ACCEPT` or `REVISE`. It loops, accumulating a transcript, and halts the moment a Verifier accepts (or a fixed turn budget $K$ is exhausted): $$ \tau \;=\; \min\{\, k \le K \;:\; R_k = \mathrm{V} \ \text{and}\ u_k = \mathrm{ACCEPT} \,\} $$ where $R_k$ is the role at turn $k$ and $u_k$ is the verifier's verdict. Step through one problem — watch a wrong answer get caught and revised before it's accepted: ### Trained by evolution, not gradients Why not just RL the head? Because the reward is binary — the final answer is right or wrong — and the head is tiny, so the per-parameter gradient signal is buried in noise. TRINITY instead optimizes the coordinator with a *derivative-free* evolution strategy, maximizing expected terminal reward: $$ J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\big[\, R(\tau) \,\big], \qquad R(\tau) \in \{0, 1\} $$ The optimizer is separable CMA-ES: it keeps a diagonal Gaussian over the ~10K parameters, samples a small population each generation — $\lambda = \lceil 4 + 3\ln n \rceil \approx 32$ for $n \approx 10{,}000$ — evaluates each candidate's fitness by actually running rollouts, and shifts the distribution toward the winners. The paper shows the coordination objective is nearly block-separable, which is exactly the regime where a diagonal evolution strategy beats both random search and gradient RL under a tight evaluation budget. The honest cost: no gradients means you pay in *environment evaluations*, and each one is a full multi-turn rollout against real model APIs. ### It beats every model in its pool This is the result that matters. Transferred zero-shot to four held-out tasks, the evolved coordinator outscored every individual model in its pool — including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet. On LiveCodeBench it set a record at the time of submission: And the multi-turn loop earns its keep: accuracy climbs from 0.823 at two turns to 0.863 at six. One cheap evolved head, a frozen pool, and the ensemble beats its best member.

## Conductor: orchestration written in natural language The Conductor attacks the same problem with a bigger hammer: a 7B model (Qwen2.5-7B) trained with RL to *write the entire workflow itself*, in natural language. ### Three lists are a workflow For each problem the Conductor emits three synchronized lists: - `model_id` — which agent runs each step. - `subtasks` — a natural-language instruction for each step. - `access_list` — which earlier outputs each step is allowed to read. Those three lists *are* a directed graph. The `access_list` is the load-bearing idea: `[]` means the step sees only the original question, `["all"]` means it sees everything produced so far, and `[0, 2]` means it sees steps 0 and 2. By choosing access lists, the Conductor designs the communication topology — a chain, parallel branches, a verify-and-merge — *per problem*, not from a fixed template. Flip between the topologies it learns to produce: ### Trained with GRPO The Conductor is trained end-to-end with GRPO. For each question it samples a group of $G = 64$ candidate workflows, scores each, and pushes the policy toward the above-average ones using the group-normalized advantage $$ A_i \;=\; \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)} $$ The reward $r_i$ is blunt on purpose: $0$ if the three lists don't parse, $1$ if the final workflow output is correct, and $0.5$ otherwise — with no KL penalty ($\beta = 0$). The whole thing trains on just 960 problems for 200 iterations on two H100s. To make one Conductor work over *any* pool, they then fine-tune it with randomly sampled $k$-model subsets per question, so it adapts to whatever agents you hand it.

### It can call itself The Conductor may name *itself* as a worker. That spawns a fresh sub-workflow on its own draft — a recursive topology that turns inference depth into a tunable compute axis, what Sakana calls dynamic test-time scaling. Recursion buys a point or two on the hardest benchmarks for under 2× the agent calls. ### Results A 7B model orchestrating frontier workers beats the frontier workers. In a controlled run over the same pool: Unconstrained, the headline numbers were each a new high at publication and each above the best single worker: **83.9% on LiveCodeBench, 87.5% on GPQA-Diamond, 93.3% on AIME25** — reached with about 3 agent calls per question, versus 5–8 for prior multi-agent methods.

## Two routes to the same place TRINITY and the Conductor are the same idea — a learned layer that coordinates a pool — built at opposite scales: | | TRINITY | Conductor | |---|---|---| | Learnable size | < 20K params (evolved head) | 7B params (RL-trained model) | | Training | derivative-free sep-CMA-ES | GRPO (reinforcement learning) | | Output per step | (agent, role) | a full natural-language workflow | | Coordination | fixed Thinker/Worker/Verifier loop | a topology it designs per problem | | Reads the task via | the small model's hidden state | reasoning in language | | Adapts to new pools | re-evolve (cheap) | randomized-pool fine-tune | TRINITY is the minimal, almost-free coordinator; the Conductor is the expressive one that designs bespoke pipelines. Fugu uses both as its engine. ## What ships: Fugu and Fugu Ultra Two tiers. Base **Fugu** balances quality and latency over a lean pool. **Fugu Ultra** coordinates a deeper pool over more turns for hard, high-stakes problems, and takes longer for it. On Sakana's reported numbers, both match or beat the frontier: Fugu Ultra also posts **50.0 on Humanity's Last Exam**, against baselines in the 41–50 range. It's an OpenAI-compatible endpoint — change the base URL and key, no SDK migration — and it bills at a single top-tier rate. (Not available in the EU yet, pending GDPR; the exact routing decisions are kept proprietary.)

## What I make of it The honest read: - **The win is real.** An orchestration layer that beats every model it coordinates — and generalizes zero-shot to unseen tasks — is a genuine result. "Coordination" is now a trainable layer that sits *above* frontier models rather than inside one. - **The costs are real too.** Every model in the pool has to be available at inference; you trade single-model simplicity for a fleet, and latency rises with the extra turns. The biggest gains concentrate on long-tail reasoning and coding benchmarks — on easy tasks the lift is small — and leaning on GPT-5/Claude/Gemini as workers inherits their cost. - **The framing is the interesting part.** TRINITY argues the coordinator can be almost free: 20K evolved parameters over frozen models. The Conductor argues coordination is itself a reasoning skill worth a 7B model and a full RL run. Both point the same way — as individual models plateau, the next axis is how you make several of them work together, and that orchestration is learnable. --- *Built on Sakana AI's [TRINITY: An Evolved LLM Coordinator](https://arxiv.org/abs/2512.04695) and [Learning to Orchestrate Agents in Natural Language with the Conductor](https://arxiv.org/abs/2512.04388), both ICLR 2026. Product: [Sakana Fugu](https://sakana.ai/fugu/).*